Similar Articles
20 similar articles found (search time: 31 ms)
1.
The logical analysis of data (LAD) is one of the most promising data mining methods developed to date for extracting knowledge from data. The key feature of LAD is its capability to detect hidden patterns in the data. Because patterns are essentially combinations of certain attributes, they can be used to build a decision boundary for classification in LAD, providing the information needed to distinguish observations of one class from those of the other. The use of patterns may yield more stable performance in classifying both positive and negative classes, owing to their robustness to measurement errors. The LAD technique, however, tends to choose too many patterns when solving a set covering problem to build a classifier; this is especially the case when outliers exist in the data set. In the set covering problem of LAD, each observation must be covered by at least one pattern, even if the observation is an outlier. Existing approaches therefore tend to select too many patterns to cover these outliers, resulting in overfitting. Here, we propose new pattern selection approaches for LAD that take both outliers and the coverage of a pattern into account. The proposed approaches avoid overfitting by building a sparse classifier. The performance of the proposed pattern selection approaches is compared with existing LAD approaches on several public data sets. The computational results show that the sparse classifiers built on the patterns selected by the proposed approaches yield improved classification performance over the existing approaches, especially when outliers exist in the data set.
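The coverage-versus-outliers trade-off described in this abstract can be sketched as a greedy set cover with an outlier budget. This is a hypothetical simplification, not the paper's actual optimization model: `outlier_budget` is an assumed knob that lets the classifier leave a few observations uncovered instead of adding extra patterns for them.

```python
# Greedy set-cover sketch of outlier-aware pattern selection (assumed
# simplification of the paper's approach, not its exact formulation).
def select_patterns(coverage, n_obs, outlier_budget=0):
    """coverage: dict pattern_id -> set of observation indices it covers."""
    selected = []
    uncovered = set(range(n_obs))
    # Stop once the remaining uncovered points fit in the outlier budget.
    while len(uncovered) > outlier_budget:
        # Pick the pattern covering the most still-uncovered observations.
        best = max(coverage, key=lambda p: len(coverage[p] & uncovered))
        if not coverage[best] & uncovered:
            break  # no pattern gains anything; leave the rest uncovered
        selected.append(best)
        uncovered -= coverage[best]
    return selected, uncovered

cov = {"p1": {0, 1, 2, 3}, "p2": {4}, "p3": {5}}
# Covering every observation needs three patterns; treating two isolated
# observations as outliers yields a much sparser classifier.
strict, _ = select_patterns(cov, 6, outlier_budget=0)
sparse, left = select_patterns(cov, 6, outlier_budget=2)
```

With a budget of two, the two singleton observations are written off as outliers and only one pattern is kept.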

2.
Frequent pattern mining is one of the most fundamental data mining problems; because of its inherent complexity, improving the performance of mining algorithms has long been difficult. HP is a new algorithm that mines the complete set of frequent patterns via hybrid database projection. The idea behind HP's hybrid projection is that no data set can simply be assigned to a single characteristic category; the mining process should therefore dynamically adjust the frequent-pattern-tree construction strategy, the representation of transaction subsets, and the projection method according to the changing characteristics of local data subsets. HP introduces tree-based virtual projection and array-based unfiltered projection, which largely resolve the tension between improving time efficiency and saving memory. Experiments show that HP is one to three orders of magnitude faster than Apriori, FP-Growth, and H-Mine, and that its space scalability is also far better than that of these algorithms.
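HP itself is not spelled out in this abstract, so as background, here is a minimal brute-force miner showing what a "complete set of frequent patterns" means: every itemset whose support (number of containing transactions) meets a threshold. Real algorithms such as FP-Growth avoid this exhaustive enumeration.

```python
from itertools import combinations

# Minimal illustration of complete frequent-pattern mining (brute force,
# only viable for tiny data; shown for the problem definition, not as HP).
def frequent_patterns(transactions, min_support):
    patterns = {}
    items = sorted({i for t in transactions for i in t})
    k = 1
    while True:
        found = False
        for cand in combinations(items, k):
            # Support = number of transactions containing the candidate.
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                patterns[cand] = support
                found = True
        if not found:
            break  # no frequent k-itemset, so no larger one exists either
        k += 1
    return patterns

txns = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
fp = frequent_patterns(txns, min_support=2)
```

Here every singleton and every pair is frequent, but the triple `("a","b","c")` appears only once and is pruned.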

3.
Various methods of reducing correlation between classifiers in a multiple classifier framework have been attempted. Here we propose a recursive partitioning technique for analysing the feature space of multiple classifier decisions. Spectral summation of individual pattern components in an intermediate feature space enables each training pattern to be rated according to its contribution to separability, measured as k-monotonic constraints. A constructive algorithm sequentially extracts maximally separable subsets of patterns, from which an inconsistently classified set (ICS) is derived. Leaving out random subsets of ICS patterns from individual (base) classifier training sets is shown to improve the performance of the combined classifiers. For the experiments reported here on artificial and real data, the constituent classifiers are identical single-hidden-layer MLPs with fixed parameters.

4.
The task of discovering natural groupings of input patterns, or clustering, is an important aspect of machine learning and pattern analysis. In this paper, we study the widely used spectral clustering algorithm, which clusters data using eigenvectors of a similarity/affinity matrix derived from a data set. In particular, we aim to solve two critical issues in spectral clustering: (1) how to automatically determine the number of clusters, and (2) how to perform effective clustering given noisy and sparse data. An analysis of the characteristics of eigenspace is carried out which shows that (a) not every eigenvector of a data affinity matrix is informative and relevant for clustering; (b) eigenvector selection is critical, because using uninformative/irrelevant eigenvectors could lead to poor clustering results; and (c) the corresponding eigenvalues cannot be used to select relevant eigenvectors given a realistic data set. Motivated by this analysis, a novel spectral clustering algorithm is proposed which differs from previous approaches in that only informative/relevant eigenvectors are employed for determining the number of clusters and performing clustering. The key element of the proposed algorithm is a simple but effective relevance learning method which measures the relevance of an eigenvector according to how well it can separate the data set into different clusters. Our algorithm was evaluated on synthetic data sets as well as real-world data sets generated from two challenging visual learning problems. The results demonstrate that our algorithm is able to estimate the cluster number correctly and reveal natural groupings of the input data/patterns even given sparse and noisy data.
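As background for the abstract's point that some eigenvectors are informative and others are not: in the simplest spectral setup, the eigenvector of the graph Laplacian with the second-smallest eigenvalue (the Fiedler vector) separates two weakly connected groups by sign, while the smallest (constant) eigenvector carries no cluster information. This is a textbook sketch, not the paper's relevance-learning algorithm.

```python
import numpy as np

# Minimal spectral bipartition via the Fiedler vector (standard technique,
# shown as background; the paper's contribution is selecting which
# eigenvectors are relevant, which this sketch does not do).
def fiedler_partition(A):
    L = np.diag(A.sum(axis=1)) - A      # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    fiedler = vecs[:, 1]                # eigenvector of 2nd-smallest value
    return (fiedler > 0).astype(int)    # cluster label from the sign

# Two obvious groups {0,1} and {2,3}, joined by weak cross-links.
A = np.array([[0.00, 1.00, 0.05, 0.00],
              [1.00, 0.00, 0.00, 0.05],
              [0.05, 0.00, 0.00, 1.00],
              [0.00, 0.05, 1.00, 0.00]])
labels = fiedler_partition(A)
```

The sign of an eigenvector is arbitrary, so only the grouping (not which group is labelled 1) is meaningful.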

5.
In this paper a new, abstract method for analysis and visualization of multidimensional data sets in pattern recognition problems is introduced. It can be used to determine the properties of an unknown, complex data set and to assist in finding the most appropriate recognition algorithm. Additionally, it can be employed to design layers of a feedforward artificial neural network or to visualize the higher-dimensional problems in 2-D and 3-D without losing relevant data set information. The method is derived from the convex set theory and works by considering convex subsets within the data and analyzing their respective positions in the original dimension. Its ability to describe certain set features that cannot be explicitly projected into lower dimensions sets it apart from many other visualization techniques. Two classical multidimensional problems are analyzed and the results show the usefulness of the presented method and underline its strengths and weaknesses.

6.
The problem of feature definition in the design of a pattern recognition system, where the number of available training samples is small but the number of potential features is excessively large, has not received adequate attention. Most existing feature extraction and feature selection procedures are not feasible, for computational reasons, when the number of features exceeds, say, 100, and are not even applicable when the number of features exceeds the number of patterns. The feature definition procedure we propose partitions a large set of highly correlated features into subsets, or clusters, through hierarchical clustering. Almost any feature selection or extraction procedure, including the constrained maximum variance approach introduced here, can then be applied to each subset to obtain a single representative feature. The original set of correlated features is thus reduced to a small set of nearly uncorrelated features. The utility of this procedure has been demonstrated on a speaker-identification database consisting of 20 subjects, 156 features, and 180 samples.
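The cluster-then-represent idea can be sketched as follows. This is an assumed simplification: it groups features whose absolute correlation exceeds a threshold (a crude stand-in for hierarchical clustering) and keeps the maximum-variance feature of each group rather than the paper's constrained maximum variance representative.

```python
import numpy as np

# Sketch: group highly correlated features, keep one representative each.
# (Threshold-based grouping and a plain max-variance representative are
# assumed simplifications of the paper's procedure.)
def reduce_features(X, corr_threshold=0.9):
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n = X.shape[1]
    parent = list(range(n))          # union-find style grouping

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if corr[i, j] >= corr_threshold:
                parent[find(j)] = find(i)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    # One representative per cluster: the feature with the largest variance.
    return sorted(max(m, key=lambda f: X[:, f].var()) for m in clusters.values())

rng = np.random.default_rng(0)
base = rng.normal(size=100)
# Features 0 and 1 are nearly collinear; feature 2 is independent.
X = np.column_stack([base,
                     2 * base + 0.01 * rng.normal(size=100),
                     rng.normal(size=100)])
reps = reduce_features(X)
```

Features 0 and 1 collapse into one cluster, represented by the higher-variance feature 1, so three correlated features reduce to two nearly uncorrelated ones.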

7.
K-hyperline clustering is an iterative algorithm based on singular value decomposition that has been used successfully in sparse component analysis. In this paper, we prove that the algorithm converges to a locally optimal solution for a given set of training data, based on Lloyd's optimality conditions. Furthermore, the local optimality is shown by developing an Expectation-Maximization procedure for learning dictionaries to be used in sparse representations and by deriving the clustering algorithm as its special case. The cluster centroids obtained from the algorithm are proved to tessellate the space into convex Voronoi regions. The stability of clustering is shown by posing the problem as an empirical risk minimization procedure over a function class. It is proved that, under certain conditions, the cluster centroids learned from two sets of i.i.d. training samples drawn from the same probability space become arbitrarily close to each other as the number of training samples increases asymptotically.
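A minimal sketch of K-hyperline clustering as described: each centroid is a 1-D subspace (unit vector), points are assigned to the line with the largest absolute projection, and each line is refit as the leading singular vector of its points. The deterministic initialization from two chosen data points is an assumption made so the toy example converges cleanly.

```python
import numpy as np

# K-hyperline clustering sketch: SVD-based line fitting per cluster.
def k_hyperline(X, init, iters=20):
    D = X[list(init)].astype(float)
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    for _ in range(iters):
        # Assign each point to the line with the largest |projection|.
        labels = np.argmax(np.abs(X @ D.T), axis=1)
        for j in range(len(D)):
            pts = X[labels == j]
            if len(pts):
                # Leading right singular vector = best-fit line through origin.
                D[j] = np.linalg.svd(pts, full_matrices=False)[2][0]
    return D, np.argmax(np.abs(X @ D.T), axis=1)

# Points scattered near two lines through the origin: x-axis and y-axis.
X = np.array([[1.0, 0.05], [2.0, -0.1], [-3.0, 0.02],
              [0.04, 1.0], [-0.03, 2.0], [0.1, -2.5]])
D, labels = k_hyperline(X, init=(0, 3))
```

Note that a point and its negation get the same label, since a hyperline is a subspace rather than a direction; that is why the assignment uses the absolute projection.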

8.
One key problem with algorithms that mine the full set of high-utility patterns is that they produce redundant high-utility itemsets, which makes it hard for users to find useful information among the large number of itemsets and severely degrades the performance of high-utility pattern mining. Concise high-utility pattern mining algorithms were developed to address this problem; they mainly comprise maximal high-utility patterns, closed high-utility patterns, top-k high-utility patterns, and combinations of the three. This survey first describes the problem of concise high-utility pattern mining. It then classifies and summarizes concise mining methods from several angles: whether candidate itemsets are generated, one-phase versus two-phase mining, the type of data structure used, and the pruning strategies employed. Finally, directions for further research are given, including concise high-utility patterns with negative items, time-based concise high-utility patterns, and handling dynamic, complex data.

9.
A sequential rule expresses a relationship between two series of events happening one after another. Sequential rules are potentially useful for analyzing data in sequential format, ranging from purchase histories to network logs and program execution traces. In this work, we investigate and propose a syntactic characterization of a non-redundant set of sequential rules, built upon past work on compact sets of representative patterns. A rule is redundant if it can be inferred from another rule having the same support and confidence. When using the set of mined rules as a composite filter, replacing a full set of rules with a non-redundant subset of the rules does not impact the accuracy of the filter. We consider several rule sets based on the composition of various types of pattern sets: generators, projected-database generators, closed patterns, and projected-database closed patterns. We investigate the completeness and tightness of these rule sets, and characterize a tight and complete set of non-redundant rules by defining it on the composition of two pattern sets. Furthermore, we propose a compressed set of non-redundant rules, in a spirit similar to how closed patterns serve as a compressed representation of a full set of patterns. Lastly, we propose an algorithm to mine this compressed set of non-redundant rules. A performance study shows that the proposed algorithm significantly improves both the runtime and the compactness of the mined rules over mining a full set of sequential rules.
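The redundancy notion is easiest to see in the (simpler) itemset setting, shown here as an assumed analogue of the sequential case: a rule is redundant when a more general rule (smaller antecedent, at-least-as-large consequent) has exactly the same support and confidence.

```python
# Itemset-rule redundancy check (assumed simplification of the sequential
# setting described in the abstract).
def support(transactions, items):
    return sum(1 for t in transactions if items <= t)

def rule_stats(transactions, lhs, rhs):
    sup = support(transactions, lhs | rhs)
    return sup, sup / support(transactions, lhs)

def is_redundant(transactions, rule, rules):
    lhs, rhs = rule
    stats = rule_stats(transactions, lhs, rhs)
    for olhs, orhs in rules:
        if (olhs, orhs) == (lhs, rhs):
            continue
        # A more general rule with identical support and confidence
        # makes this rule inferable, hence redundant.
        if olhs <= lhs and rhs <= orhs and rule_stats(transactions, olhs, orhs) == stats:
            return True
    return False

txns = [{"a", "b", "c"}, {"a", "b", "c"}, {"d"}]
rules = [({"a"}, {"c"}), ({"a", "b"}, {"c"})]
```

Here both rules have support 2 and confidence 1.0, so the more specific rule {a,b}→{c} is inferable from {a}→{c} and can be dropped without changing the filter.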

10.
曹娜, 王永利, 孙建红, 赵宁, 宫小泽. Acta Automatica Sinica (《自动化学报》), 2020, 46(12): 2638-2646
This paper proposes an automatic target recognition (ATR) method for synthetic aperture radar (SAR) images based on dictionary learning and an extended joint dynamic sparse representation. First, during image preprocessing, the target region and the shadow region cast on the ground by the target are segmented; combining information from both regions represents the image better. Second, the dictionary learning method LC-KSVD (label-consistent K-singular value decomposition) is introduced into the training stage to learn separate feature dictionaries for the target and shadow regions, rather than using all training samples directly as a fixed dictionary. Finally, an extended joint dynamic sparse representation algorithm is proposed for the test stage, which allows the two features of the image data to share similar but not identical sparsity patterns and also handles corruption of the image by noise and occlusion. Experimental results on a standard data set show that the method makes different classes more discriminable and effectively improves target recognition accuracy on SAR images.

11.
Of all the challenges facing the effective application of computational intelligence technologies to pattern recognition, dataset dimensionality is undoubtedly one of the primary impediments. For pattern classifiers to be efficient, a dimensionality reduction stage is usually performed prior to classification. Much use has been made of rough set theory for this purpose, as it is completely data-driven and requires no other information; most other methods require some additional knowledge. However, traditional rough set-based methods in the literature are restricted by the requirement that all data be discrete, so real-valued or noisy data cannot be considered directly. This is usually addressed by employing a discretisation method, which can result in information loss. This paper proposes a new approach based on the tolerance rough set model, which can deal with real-valued data whilst simultaneously retaining dataset semantics. More significantly, this paper describes the underlying mechanism by which this new approach utilises the information contained within the boundary region, or region of uncertainty. The use of this information can result in the discovery of more compact feature subsets and improved classification accuracy. These results are supported by an experimental evaluation which compares the proposed approach with a number of existing feature selection techniques.
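The core tolerance rough set idea can be sketched directly: two real-valued objects are "indiscernible" under a feature subset when every feature differs by at most a tolerance `tau`, and a feature subset is scored by how many objects' tolerance classes agree on the decision label (a plain lower-approximation count; the threshold and scoring here are assumed simplifications of the paper's model).

```python
# Tolerance rough set sketch on real-valued data (assumed simplification:
# a hard per-feature tolerance tau and a simple consistency score).
def tolerance_class(data, i, features, tau):
    """Objects indistinguishable from object i under the feature subset."""
    return {j for j in range(len(data))
            if all(abs(data[i][f] - data[j][f]) <= tau for f in features)}

def dependency(data, labels, features, tau=0.1):
    consistent = 0
    for i in range(len(data)):
        cls = tolerance_class(data, i, features, tau)
        # Object i lies in the lower approximation when its whole
        # tolerance class shares a single decision label.
        if len({labels[j] for j in cls}) == 1:
            consistent += 1
    return consistent / len(data)

data = [(0.00, 0.90), (0.05, 0.10), (1.00, 0.95), (0.95, 0.15)]
labels = ["yes", "no", "yes", "no"]
score_f0 = dependency(data, labels, [0])   # feature 0 mixes the classes
score_f1 = dependency(data, labels, [1])   # feature 1 separates them
```

No discretisation step is needed: the tolerance relation acts on the raw real values, which is the abstract's point about retaining dataset semantics.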

12.
Clustering is an old research topic in data mining and machine learning. Most of the traditional clustering methods can be categorized as local or global ones. In this paper, a novel clustering method that can explore both the local and global information in the data set is proposed. The method, Clustering with Local and Global Regularization (CLGR), aims to minimize a cost function that properly trades off the local and global costs. We show that such an optimization problem can be solved by the eigenvalue decomposition of a sparse symmetric matrix, which can be done efficiently using iterative methods. Finally, the experimental results on several data sets are presented to show the effectiveness of our method.

13.
Reliability analysis is one of the main means of assessing the service level of a logistics transportation network. This paper presents an efficient decomposition algorithm for evaluating the connectivity reliability of such networks. The algorithm makes full use of the information obtained during decomposition: by introducing reliability-preserving simplification rules for series edges, parallel edges, and node merging, combined with a vector-set decomposition method, it rapidly decomposes the network's state-vector space and thereby improves the efficiency of network reliability evaluation. A case study and comparisons with existing methods verify the algorithm's performance and decomposition efficiency.
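The series and parallel reductions mentioned in the abstract are the standard two-terminal reliability identities, sketched below (this is textbook background, not the paper's full vector-set decomposition): two edges in series with reliabilities p and q behave like one edge of reliability p·q, and two parallel edges like one edge of reliability 1 − (1−p)(1−q).

```python
# Standard reliability-preserving edge reductions.
def series(p, q):
    """Both edges must work: s - e1 - m - e2 - t collapses to one edge."""
    return p * q

def parallel(p, q):
    """Either edge suffices: fails only if both fail."""
    return 1 - (1 - p) * (1 - q)

# Tiny network: a two-edge chain s-m-t (0.9 each) in parallel with a
# direct s-t edge of reliability 0.5.
p_chain = series(0.9, 0.9)       # 0.81
p_st = parallel(p_chain, 0.5)    # 1 - 0.19 * 0.5 = 0.905
```

Applying such reductions repeatedly shrinks series-parallel subnetworks to single edges without changing the source-terminal reliability, which is why they speed up the state-space decomposition.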

14.
15.
This paper describes a technique to transform a two-dimensional shape into a generalized fuzzy binary relation whose clusters represent the meaningful simple parts of the shape. The fuzzy binary relation is defined on the set of convex and concave boundary points, implying a piecewise linear approximation of the boundary, and describes the dissemblance of two vertices to a common cluster. Next, some fuzzy subsets are defined over the points that determine the connections between the clusters. The decomposition method first determines nearly convex regions, which are subgraphs of the total graph, and then selects the greatest nearly convex region that best satisfies the defined fuzzy subsets and relations. When this procedure is applied to touching chromosomes, with the simple parts defined to be the separated chromosomes, the decomposition often corresponds well to the one a human might make.

16.
This paper addresses the issue of building a case-based preliminary design system using Hopfield networks. One limitation of Hopfield networks is that they cannot be trained; that is, the weights between neurons must be set in advance. A pattern stored in a Hopfield network cannot be recalled if the pattern is not a local minimum. Two concepts are proposed to deal with this problem: the multiple training encoding method and the puppet encoding method. The multiple training encoding method, which guarantees recall of a single stored pattern under appropriate initial conditions of the data, is theoretically analyzed, and the minimal number of times a pattern must be used in training to guarantee its recall among a set of patterns is derived. The puppet encoding method is proved to guarantee recall of all stored patterns, provided puppet data can be attached to the stored patterns. An integrated software system, PDS (Preliminary Design System), developed from two aspects, is described. One aspect is a case-based expert system, CPDS (Case-based Preliminary Design System), which is based on the Hopfield algorithm and developed for uncertain problems in PDS; the other is RPDS (Rule-based Preliminary Design System), which attacks logical or deductive problems in PDS. Based on the results of CPDS, RPDS can search for feasible solutions in the design model. CPDS is demonstrated to be useful in the domain of preliminary design of cable-stayed bridges.
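As background for the recall problem the abstract describes, here is the plain Hebbian (outer-product) storage rule and thresholded recall, not the paper's multiple-training or puppet encodings: a stored pattern is an attractor, so a corrupted cue converges back to it.

```python
import numpy as np

# Classic Hopfield network: Hebbian storage and sign-threshold recall.
# (Background sketch only; the paper's encodings modify this scheme.)
def train(patterns):
    n = len(patterns[0])
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)      # outer-product (Hebbian) rule
    np.fill_diagonal(W, 0)       # no self-connections
    return W

def recall(W, x, steps=10):
    for _ in range(steps):
        x = np.where(W @ x >= 0, 1, -1)   # synchronous sign update
    return x

stored = np.array([1, 1, 1, -1, -1, -1])
W = train([stored])
noisy = stored.copy()
noisy[0] = -1                    # corrupt one bit of the cue
out = recall(W, noisy)
```

With a single stored pattern the corrupted bit is repaired in one update; the paper's concern is the harder case where several stored patterns compete and a pattern may fail to be a local minimum.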

17.
18.
Data seldom create value by themselves. They need to be linked and combined from multiple sources, which can often come with variable data quality. The task of improving data quality is a recurring challenge. In this paper, we use a case study of a large telecom company to develop a generic process pattern model for improving data quality. The process pattern model is defined as a proven series of activities, aimed at improving the data quality given a certain context, a particular objective, and a specific set of initial conditions. Four different patterns are derived to deal with the variations in data quality of datasets. Instead of having to find the way to improve the quality of big data for each situation, the process model provides data users with generic patterns, which can be used as a reference model to improve big data quality.

19.
Due to the huge number of patterns to be searched, multiple pattern searching remains a challenge for several newly arising applications such as network intrusion detection. In this paper, we present an attempt to design efficient multiple pattern searching algorithms on multi-core architectures. We observe an important feature indicating that the multiple pattern matching time depends mainly on the number and minimal length of the patterns. The multi-core algorithm proposed in this paper leverages this feature to decompose the pattern set so that the parallel execution time is minimized. We formulate the problem as an optimal decomposition and scheduling of a pattern set, then propose a heuristic algorithm, which takes advantage of dynamic programming and greedy algorithmic techniques, to solve the optimization problem. Experimental results suggest that our decomposition approach can increase the searching speed by more than 200% on a 4-core AMD Barcelona system.
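The decompose-and-schedule step can be sketched as a greedy load balancer. The cost model here is an assumption standing in for the paper's: `cost(pattern) = 1/len(pattern)`, reflecting the observation that shorter minimal pattern lengths make matching more expensive; the paper's actual heuristic combines dynamic programming with greedy techniques.

```python
# Greedy (LPT-style) scheduling of a pattern set across cores, with an
# assumed cost proxy of 1/len(pattern) per pattern.
def schedule(patterns, cores):
    bins = [[] for _ in range(cores)]
    loads = [0.0] * cores
    # Assign the most expensive (shortest) patterns first, each to the
    # currently least-loaded core.
    for p in sorted(patterns, key=len):
        i = loads.index(min(loads))
        bins[i].append(p)
        loads[i] += 1.0 / len(p)
    return bins, loads

pats = ["ab", "abc", "abcd", "xy", "longpattern", "mid"]
bins, loads = schedule(pats, cores=2)
```

Balancing the per-core cost rather than the raw pattern count is what minimizes the parallel execution time, since the slowest core determines the overall matching time.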

20.
王庆东, 陈建. Computer Engineering (《计算机工程》), 2007, 33(24): 41-43
Completion-based methods for handling incomplete information systems distort the system's knowledge to varying degrees. To overcome this drawback, a decomposition method for incomplete information systems is proposed. Rather than completing the system in advance, the method selects templates using a rough set template evaluation function and uses these templates to extract complete subsets from the incomplete system layer by layer. Intermediate variables are constructed with rough set theory, and the incomplete information system is decomposed according to these variables so as to simplify the rule set; reasoning and decision analysis are then carried out on the resulting rule sets layer by layer. The implementation of the method is illustrated on vibration fault diagnosis data from a turbo-generator set, verifying its effectiveness in handling incomplete information systems.

