首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 593 毫秒
1.
《Information Systems》2001,26(1):35-58
Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.  相似文献   

2.
In practical cluster analysis tasks, an efficient clustering algorithm should be less sensitive to parameter configurations and tolerate the existence of outliers. Based on the neural gas (NG) network framework, we propose an efficient prototype-based clustering (PBC) algorithm called enhanced neural gas (ENG) network. Several problems associated with the traditional PBC algorithms and original NG algorithm such as sensitivity to initialization, sensitivity to input sequence ordering and the adverse influence from outliers can be effectively tackled in our new scheme. In addition, our new algorithm can establish the topology relationships among the prototypes and all topology-wise badly located prototypes can be relocated to represent more meaningful regions. Experimental results1on synthetic and UCI datasets show that our algorithm possesses superior performance in comparison to several PBC algorithms and their improved variants, such as hard c-means, fuzzy c-means, NG, fuzzy possibilistic c-means, credibilistic fuzzy c-means, hard/fuzzy robust clustering and alternative hard/fuzzy c-means, in static data clustering tasks with a fixed number of prototypes.  相似文献   

3.
基于数学形态学的模糊异常点检测   总被引:1,自引:0,他引:1  
异常点检测作为数据挖掘的一项重要任务,可能会导致意想不到的知识发现.但传统的异常点检测技术都忽略了数据的自然结构,即异常点与簇的联系.然而,把异常点得分和聚类方法结合起来有利于对异常点与簇的联系的研究.提出基于数学形态学的模糊异常点检测与分析,把数学形态学技术和基于连接的异常点检测方法集成到一个模糊模型中,从异常隶属度和模糊隶属度这两个方面来分析对象与簇集的模糊关系.通过充分的实验证明,该算法能够对复杂面状和变密度的数据集,正确、高效地找出异常点,同时发现与异常点相关联的簇信息,探索异常点与簇核的关联深度,对异常点本身的意义具有启发作用.  相似文献   

4.
In this paper, a new classification method (SDCC) for high dimensional text data with multiple classes is proposed. In this method, a subspace decision cluster classification (SDCC) model consists of a set of disjoint subspace decision clusters, each labeled with a dominant class to determine the class of new objects falling in the cluster. A cluster tree is first generated from a training data set by recursively calling a subspace clustering algorithm Entropy Weighting k-Means algorithm. Then, the SDCC model is extracted from the subspace decision cluster tree. Various tests including Anderson–Darling test are used to determine the stopping condition of the tree growing. A series of experiments on real text data sets have been conducted. Their results show that the new classification method (SDCC) outperforms the existing methods like decision tree and SVM. SDCC is particularly suitable for large, high dimensional sparse text data with many classes.  相似文献   

5.
Cluster validity indexes can be used to evaluate the fitness of data partitions produced by a clustering algorithm. Validity indexes are usually independent of clustering algorithms. However, the values of validity indexes may be heavily influenced by noise and outliers. These noise and outliers may not influence the results from clustering algorithms, but they may affect the values of validity indexes. In the literature, there is little discussion about the robustness of cluster validity indexes. In this paper, we analyze the robustness of a validity index using the ? function of M-estimate and then propose several robust-type validity indexes. Firstly, we discuss the validity measure on a single data point and focus on those validity indexes that can be categorized as the mean type of validity indexes. We then propose median-type validity indexes that are robust to noise and outliers. Comparative examples with numerical and real data sets show that the proposed median-type validity indexes work better than the mean-type validity indexes.  相似文献   

6.
针对谱匹配方法对噪声和出格点的鲁棒性较差的问题,提出了一种基于拟Laplacian谱和点对拓扑特征的点模式匹配算法。首先,用赋权图的最小生成树构造无符号Laplacian矩阵,通过对矩阵谱分解得到的特征值和特征向量表示点的特征,进而计算点的初始匹配概率;其次,利用点对拓扑特征的相似性测度来定义点对间的局部相容性,然后借助概率松弛的方法更新由拟Laplacian谱得到的匹配概率,得出匹配结果。对比实验结果表明,该方法在处理存在噪声和出格点的点集匹配上具有较高的鲁棒性。  相似文献   

7.
This paper develops theory and algorithms concerning a new metric for clustering data. The metric minimizes the total volume of clusters, where the volume of a cluster is defined as the volume of the minimum volume ellipsoid (MVE) enclosing all data points in the cluster. This metric is scale-invariant, that is, the optimal clusters are invariant under an affine transformation of the data space. We introduce the concept of outliers in the new metric and show that the proposed method of treating outliers asymptotically recovers the data distribution when the data comes from a single multivariate Gaussian distribution. Two heuristic algorithms are presented that attempt to optimize the new metric. On a series of empirical studies with Gaussian distributed simulated data, we show that volume-based clustering outperforms well-known clustering methods such as k-means, Ward's method, SOM, and model-based clustering.  相似文献   

8.
结合簇类结构和链式结构的优点,按照“向基站方向成链,链中尽量成簇”的思想,提出一种基于半径递增有向成簇的WSN路由算法。网络中节点首先在较小半径R0范围内进行组簇后,未加入网络的节点增大组网半径,并引入前向传输的概念,在半径2R0的前向传输区域内根据权值比较选择下一跳节点,而后继续增大组网半径,直至所有节点进入网络。算法生成一种综合了簇、链、树三种结构的混合路由路径。仿真数据表明,新算法能有效优化网络路由结构,提高能量效率和均衡性能,能够延长网络生存周期10%左右。  相似文献   

9.
Data clustering has attracted a lot of research attention in the field of computational statistics and data mining. In most related studies, the dissimilarity between two clusters is defined as the distance between their centroids or the distance between two closest (or farthest) data points However, all of these measures are vulnerable to outliers and removing the outliers precisely is yet another difficult task. In view of this, we propose a new similarity measure, referred to as cohesion, to measure the intercluster distances. By using this new measure of cohesion, we have designed a two-phase clustering algorithm, called cohesion-based self-merging (abbreviated as CSM), which runs in time linear to the size of input data set. Combining the features of partitional and hierarchical clustering methods, algorithm CSM partitions the input data set into several small subclusters in the first phase and then continuously merges the subclusters based on cohesion in a hierarchical manner in the second phase. The time and the space complexities of algorithm CSM are analyzed. As shown by our performance studies, the cohesion-based clustering is very robust and possesses excellent tolerance to outliers in various workloads. More importantly, algorithm CSM is shown to be able to cluster the data sets of arbitrary shapes very efficiently and provide better clustering results than those by prior methods.  相似文献   

10.
多代表点特征树与空间聚类算法   总被引:1,自引:0,他引:1  
空间数据具有海量、复杂、连续、空间自相关、存在缺损与误差等的特点,要求空间聚类算法具有高效率,能处理各种复杂形状的簇,聚类结果与数据空间分布顺序无关,并且对离群点是健壮的等性能,已有的算法难以同时满足要求。本文提出了一个适合处理海量复杂空间数据的数据结构一多代表点特征树。基于多代表点特征树提出了适合挖掘海量复杂空间数据聚类算法CAMFT,该算法利用多代表点特征树对海量的数据进行压缩,结合随机采样的方法进一步增强算法处理海量数据的能力;同时,多代表点特征树能够保存复杂形状的聚类特征,适合处理复杂空间数据。实验表明了算法CAMFT能够快速处理带有离群点的复杂形状聚类的空间数据,结果与对象空间分布顺序无关,并且效率优于已有的同类聚类算法BLRCH与CURE。  相似文献   

11.
In this work, a novel multi-phase modified shuffled frog leaping algorithm (MPMSFLA) framework is presented to solve the multi-depot vehicle routing problem (MDVRP) more quickly. The presented algorithm adopts the K-means algorithm to execute the clustering analyses for all customers, generates a frog population according to the result of the clustering analyses, and then proceeds to the three-phase process. In the first phase, a cluster MSFLA local search is carried out for each cluster. In the second phase, the algorithm selects good individuals through a binary tournament to construct a new population and then performs a global optimization for all customers and depots using the global MSFLA. In the third phase, a cluster adjustment is implemented for the population to generate new clusters. These procedures continue until the convergence criterion is satisfied. The experimental results show that our algorithm can achieve a high quality solution within a short runtime for the MDVRP, the MDVRP with time windows (MDVRPTW) and the capacitated vehicle routing problem (CVRP). The proposed algorithm is suitable for solving large-scale problems.  相似文献   

12.
针对传统K-means算法对初始聚类中心敏感的问题,提出了基于数据样本分布情况的动态选取初始聚类中心的改进K-means算法。该算法根据数据点的距离构造最小生成树,并对最小生成树进行剪枝得到K个初始数据集合,得到初始的聚类中心。由此得到的初始聚类中心非常地接近迭代聚类算法收敛的聚类中心。理论分析与实验表明,改进的K-means算法能改善算法的聚类性能,减少聚类的迭代次数,提高效率,并能得到稳定的聚类结果,取得较高的分类准确率。  相似文献   

13.
Very large scale networks have become common in distributed systems. To efficiently manage these networks, various techniques are being developed in the distributed and networking research community. In this paper, we focus on one of those techniques, network clustering, i.e., the partitioning of a system into connected subsystems. The clustering we compute is size-oriented: given a parameter K of the algorithm, we compute, as far as possible, clusters of size K. We present an algorithm to compute a binary hierarchy of nested disjoint clusters. A token browses the network and recruits nodes to its cluster. When a cluster reaches a maximal size defined by a parameter of the algorithm, it is divided when possible, and tokens are created in both of the new clusters. The new clusters are then built and divided in the same fashion. The token browsing scheme chosen is a random walk, in order to ensure local load balancing. To allow the division of clusters, a spanning tree is built for each cluster. At each division, information on how to route messages between the clusters is stored. The naming process used for the clusters, along with the information stored during each division, allows routing between any two clusters.  相似文献   

14.
由于数据流数据的动态性、时序性和数据量大等特点使得数据流上的数据挖掘变得更加困难和富有挑战。通过对Squeezer聚类算法的研究分析,并基于此算法提出了一种新的基于聚类的数据流离群数据检测算法O-Squeezer。把数据流看成一个随时间变化的过程,并将其分成许多数据分区,在每个数据块内用改进的O-Squeezer算法挖掘离群数据。理论分析和实验表明,算法可以有效发现数据流中的局部离群数据,算法是可行的。  相似文献   

15.
基于数据间内在关联性的自适应模糊聚类模型   总被引:2,自引:0,他引:2  
唐成龙  王石刚 《自动化学报》2010,36(11):1544-1556
提出了一种新的模糊聚类模型(Fuzzy C-means clustering model, FCM), 称为自适应模糊聚类(Adaptive FCM, AFCM). 和现有的大多数模糊聚类方法不同的是, AFCM考虑了数据集中全体数据的内在关联性, 模型中引入了自适应度向量W和自适应指数p. 其中, W在迭代过程中是自适应的, p是一个给定参数. W和p共同作用调控聚类过程. AFCM同时输出三组参数: 模糊隶属度集U, 自适应度向量W, 以及聚类原型集V. 本文给出了两组数据实验验证AFCM的性能. 第1组实验验证AFCM的聚类性能, 以FCM为比较对象. 实验表明 AFCM可以得到更好的聚类质量, 而且通过合理选择自适应指数p, AFCM和FCM在时间复杂性上保持同一水平. 第2组实验检验了AFCM的离群点挖掘性能, 以目前常用的基于密度的LOF为比较对象. 实验表明AFCM算法具有极大的计算效率优势, 且AFCM得到的离群点是全局的, 反映的是离群点和整个数据集的关系, 离群点涵盖的信息也更丰富. 文章指出, AFCM在挖掘大数据集和实时数据中的离群点应用方面, 以及获得高质量的聚类结果的应用方面, 特别在聚类的同时需要挖掘离群点的应用方面具有独特的优势.  相似文献   

16.
Since a large number of clustering algorithms exist, aggregating different clustered partitions into a single consolidated one to obtain better results has become an important problem. In Fred and Jain's evidence accumulation algorithm, they construct a co-association matrix on original partition labels, and then apply minimum spanning tree to this matrix for the combined clustering. In this paper, we will propose a novel clustering aggregation scheme, probability accumulation. In this algorithm, the construction of correlation matrices takes the cluster sizes of original clusterings into consideration. An alternate improved algorithm with additional pre- and post-processing is also proposed. Experimental results on both synthetic and real data-sets show that the proposed algorithms perform better than evidence accumulation, as well as some other methods.  相似文献   

17.
张悦  刘杰  李航 《计算机工程》2013,39(3):46-50,55
现有孤立点检测方法大多数都需要预先设定孤立点个数,若设定不准确将降低孤立点检测的准确性。针对该问题,提出一种基于概率的孤立点检测方法。结合基于密度的DBSCAN算法与中位数求方差的方法,对待检测数据集进行聚类,提取出不包含在任何聚类中的可疑孤立点并进行分析,从而确定最终孤立点。该方法所检测的数据与时间因素线性无关,不必预先设定孤立点个数及聚类数,并且对噪声数据具有较强的抗干扰能力。IRIS测试数据集上的实验结果表明,该方法能够有效地识别孤立点。  相似文献   

18.
Identifying the optimal cluster number and generating reliable clustering results are necessary but challenging tasks in cluster analysis. The effectiveness of clustering analysis relies not only on the assumption of cluster number but also on the clustering algorithm employed. This paper proposes a new clustering analysis method that identifies the desired cluster number and produces, at the same time, reliable clustering solutions. It first obtains many clustering results from a specific algorithm, such as Fuzzy C-Means (FCM), and then integrates these different results as a judgement matrix. An iterative graph-partitioning process is implemented to identify the desired cluster number and the final result. The proposed method is a robust approach as it is demonstrated its effectiveness in clustering 2D data sets and multi-dimensional real-world data sets of different shapes. The method is compared with cluster validity analysis and other methods such as spectral clustering and cluster ensemble methods. The method is also shown efficient in mesh segmentation applications. The proposed method is also adaptive because it not only works with the FCM algorithm but also other clustering methods like the k-means algorithm.  相似文献   

19.
This study proposes two optimization mathematical models for the clustering and selection of suppliers. Model 1 performs an analysis of supplier clusters, according to customer demand attributes, including production cost, product quality and production time. Model 2 uses the supplier cluster obtained in Model 1 to determine the appropriate supplier combinations. The study additionally proposes a two-phase method to solve the two mathematical models. Phase 1 integrates k-means and a simulated annealing algorithm with the Taguchi method (TKSA) to solve for Model 1. Phase 2 uses an analytic hierarchy process (AHP) for Model 2 to weight every factor and then uses a simulated annealing algorithm with the Taguchi method (ATSA) to solve for Model 2. Finally, a case study is performed, using parts supplier segmentation and an evaluation process, which compares different heuristic methods. The results show that TKSA+ATSA provides a quality solution for this problem.  相似文献   

20.
《Applied Soft Computing》2007,7(2):577-584
In the paper, as an improvement of fuzzy clustering neural network FCNN proposed by Zhang et al., a novel robust fuzzy clustering neural network RFCNN is presented to cope with the sensitive issue of clustering when outliers exist. This new algorithm is based on Vapnik's ɛ-insensitive loss function and quadratic programming optimization. Our experimental results demonstrate that RFCNN has much better robustness for outliers than FCNN.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号