首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
Clustering has been widely used in different fields of science, technology, social science, etc. Naturally, clusters are in arbitrary (non-convex) shapes in a dataset. One important class of clustering is distance based method. However, distance based clustering methods usually find clusters of convex shapes. Classical single-link is a distance based clustering method, which can find arbitrary shaped clusters. It scans dataset multiple times and has time requirement of O(n2), where n is the size of the dataset. This is potentially a severe problem for a large dataset. In this paper, we propose a distance based clustering method, l-SL to find arbitrary shaped clusters in a large dataset. In this method, first leaders clustering method is applied to a dataset to derive a set of leaders; subsequently single-link method (with distance stopping criteria) is applied to the leaders set to obtain final clustering. The l-SL method produces a flat clustering. It is considerably faster than the single-link method applied to dataset directly. Clustering result of the l-SL may deviate nominally from final clustering of the single-link method (distance stopping criteria) applied to dataset directly. To compensate deviation of the l-SL, an improvement method is also proposed. Experiments are conducted with standard real world and synthetic datasets. Experimental results show the effectiveness of the proposed clustering methods for large datasets.  相似文献   

2.
康大伟  陈天滋 《计算机应用》2007,27(11):2760-2762
分析了密度聚类算法(DBSCAN)的局限性,在此基础上提出了一种基于密度的面向线段的聚类方法,将DBSCAN中聚类的对象由点转变为线段。在对点聚类的基础上,研究了线段聚类的特点。该算法可以有效处理分布不均匀的线段对象集,发现分布密度不同的各种簇。通过试验证明了该方法的可行性与有效性。  相似文献   

3.
Several fast algorithms for clustering very large data sets have been proposed in the literature, including CLARA, CLARANS, GAC-R3, and GAC-RARw. CLARA is a combination of a sampling procedure and the classical PAM algorithm, while CLARANS adopts a serial randomized search strategy to find the optimal set of medoids. GAC-R3 and GAC-RARw exploit genetic search heuristics for solving clustering problems. In this research, we conducted an empirical comparison of these four clustering algorithms over a wide range of data characteristics described by data size, number of clusters, cluster distinctness, cluster asymmetry, and data randomness. According to the experimental results, CLARANS outperforms its counterparts both in clustering quality and execution time when the number of clusters increases, clusters are more closely related, more asymmetric clusters are present, or more random objects exist in the data set. With a specific number of clusters, CLARA can efficiently achieve satisfactory clustering quality when the data size is larger, whereas GAC-R3 and GAC-RARw can achieve satisfactory clustering quality and efficiency when the data size is small, the number of clusters is small, and clusters are more distinct and symmetric.  相似文献   

4.
As one of the most important techniques in data mining, cluster analysis has attracted more and more attentions in this big data era. Most clustering algorithms have encountered with challenges including cluster centers determination difficulty, low clustering accuracy, uneven clustering efficiency of different data sets and sensible parameter dependence. Aiming at clustering center determination difficulty and parameter dependence, a novel cluster center fast determination clustering algorithm was proposed in this paper. It is supposed that clustering centers are those data points with higher density and larger distance from other data points of higher density. Normal distribution curves are designed to fit the density distribution curve of density distance product. And the singular points outside the confidence interval by setting the confidence interval are proved to be clustering centers by theory analysis and simulations. Finally, according to these clustering centers, a time scan clustering is designed for the rest of the points by density to complete the clustering. Density radius is a sensible parameter in calculating density for each data point, mountain climbing algorithm is thus used to realize self-adaptive density radius. Abundant typical benchmark data sets are testified to evaluate the performance of the brought up algorithms compared with other clustering algorithms in both aspects of clustering quality and time complexity.  相似文献   

5.
基于密度的DBSCAN聚类算法的研究及应用   总被引:3,自引:0,他引:3       下载免费PDF全文
首先对DBSCAN(Density Based Spatial Clustering of Applications with Noise)聚类算法进行了深入研究,分析了它的特点、存在的问题及改进思想,提出了基于DBSCAN方法的交通事故多发点段的排查方法及其改进思路,并且给出了实例以说明处理过程及可行性。实验结果表明本文提出的方法可以大大提高交通事故黑点排查效率。  相似文献   

6.
7.
基于分布式的大数据集聚类分析   总被引:1,自引:0,他引:1  
为了提高聚类效率提出了一种基于分布式的大数据集聚类算法。该方法并不是一次性对所有的数据进行聚类,而是将大数据集随机分成若干个子集,对每个子集同时进行聚类,最后进行类的合并。实验结果表明大多数情况下该方法和传统的一次性聚类的结果一致,而且极大地提高了聚类的速度。  相似文献   

8.
提出一种集成粗糙集理论的RBF网络设计方法.由布尔逻辑推理方法进行属性离散化,得到初始决策模式集,通过差异度对初始决策模式的相似度进行衡量并实现聚类,以聚类决策模式构造RBF网络.为加快训练速度,分别对隐层参数和输出权值采用BP算法和线性最小二乘滤波法进行训练.实验结果表明,该方法设计的RBF网络结构简洁,泛化性能良好,混合学习算法的收敛速度优于单纯的BP算法.  相似文献   

9.
In this paper we propose a new density based clustering algorithm via using the Mahalanobis metric. This is motivated by the current state-of-the-art density clustering algorithm DBSCAN and some fuzzy clustering algorithms. There are two novelties for the proposed algorithm: One is to adopt the Mahalanobis metric as distance measurement instead of the Euclidean distance in DBSCAN and the other is its effective merging approach for leaders and followers defined in this paper. This Mahalanobis metric is closely associated with dataset distribution. In order to overcome the unique density issue in DBSCAN, we propose an approach to merge the sub-clusters by using the local sub-cluster density information. Eventually we show how to automatically and efficiently extract not only ‘traditional’ clustering information, such as representative points, but also the intrinsic clustering structure. Extensive experiments on some synthetic datasets show the validity of the proposed algorithm. Further the segmentation results on some typical images by using the proposed algorithm and DBSCAN are presented in this paper and they are shown that the proposed algorithm can produce much better visual results in image segmentation.  相似文献   

10.
Vague集是Fuzzy集的扩展,在给出几种构造Vague集相似矩阵方法的基础上,将Fuzzy集上的编网法和最大树法引入到Vague集上,定义了Vague关系图,并给出了基于Vague集的直接聚类法:编网法和最大树法。最后使用文献[1]中的例子,分别采用Vague传递闭包法和Vague直接聚类法进行计算。实验结果表明,Vague直接聚类法计算简单,不会造成原始信息的失真,比Vague传递闭包法更加有效。  相似文献   

11.
Clustering has been widely adopted in numerous applications, including pattern recognition, data analysis, image processing, and market research. When performing data mining, traditional clustering algorithms which use distance-based measurements to calculate the difference between data are unsuitable for non-numeric attributes such as nominal, Boolean, and categorical data. Applying an unsuitable similarity measurement in clustering may cause some valuable information embedded in the data attributes to be lost, and hence low quality clusters will be created. This paper proposes a novel hierarchical clustering algorithm, referred to as MPM, for the clustering of non-numeric data. The goals of MPM are to retain the data features of interest while effectively grouping data objects into clusters with high intra-similarity and low inter-similarity. MPM achieves these goals through two principal methods: (1) the adoption of a novel similarity measurement which has the ability to capture the “characterized properties” of information, and (2) the application of matrix permutation and matrix participation partitioning to the results of the similarity measurement (constructed in the form of a similarity matrix) in order to assign data to appropriate clusters. This study also proposes a heuristic-based algorithm, the Heuristic_MPM, to reduce the processing times required for matrix permutation and matrix partitioning, which together constitute the bulk of the total MPM execution time. An erratum to this article is available at .  相似文献   

12.
一种基于网格索引的数据聚类算法   总被引:1,自引:0,他引:1  
为了提高基于密度聚类算法的效率,避免算法在执行过程中的多余搜索,提出了一种基于DBSCAN算法的改进的空间数据聚类算法。该算法采用对象邻域空间进行划分的方法,将网格索引结构应用于该算法。在核心对象的邻域内选择八个方向上未标记且距离核心对象最边缘的对象来扩展种子对象,减少查询次数,降低聚类的时间复杂度。在实验中,利用海量数据集对算法进行测试,测试结果证明新算法在保证聚类精度的情况下时间效率显著高于DBSCAN算法。  相似文献   

13.
均值漂移谱聚类(MSSC)算法为模式识别聚类任务提供了一种较新的方案.然而由于其内嵌均值漂移过程的时问复杂度与样本容量呈平方关系,其在大数据集环境的实用性受到大大削弱.利用快速压缩集密度估计器(FRSDE)替代Parren窗密度估计式(PW)并融合基于图的松弛聚类(GRC)方法,提出了快速均值漂移谱聚类(FMSSC)算法.相比原MSSC,该算法的总体渐进时间复杂度与样本容量呈线性关系,并具有自适应性和便捷性.  相似文献   

14.
为了从移动终端位置数据中精准识别居民职住地,提出了一种基于时空约束密度聚类的职住地识别方法。首先,利用基于K-means的DBSCAN(density-based spatial clustering of applications with noise)时空驻点聚类过程将居民多天的原始轨迹点分成不同的时空驻点簇;然后,利用基于速度阈值的停留点簇和移动点簇识别过程将居民的每一个时空驻点簇区分为停留点簇或移动点簇;接着,利用基于K近距离的DBSCAN重要停留点聚类过程将居民的停留点分成不同的重要停留点簇;最后,利用基于KD-tree优化的KNN(K-nearest neighbor)职住地识别过程将居民的每个重要停留点识别为工作地、居住地、职住同一区域或兴趣地点区域。实验结果表明,该方法的每个过程都是合理有效的,并且最终的职住地识别效果要优于时间阈值法、累加时间法和信息熵法。  相似文献   

15.
A novel method based on rough sets (RS) and the affinity propagation (AP) clustering algorithm is developed to optimize a radial basis function neural network (RBFNN). First, attribute reduction (AR) based on RS theory, as a preprocessor of RBFNN, is presented to eliminate noise and redundant attributes of datasets while determining the number of neurons in the input layer of RBFNN. Second, an AP clustering algorithm is proposed to search for the centers and their widths without a priori knowledge about the number of clusters. These parameters are transferred to the RBF units of RBFNN as the centers and widths of the RBF function. Then the weights connecting the hidden layer and output layer are evaluated and adjusted using the least square method (LSM) according to the output of the RBF units and desired output. Experimental results show that the proposed method has a more powerful generalization capability than conventional methods for an RBFNN.  相似文献   

16.
针对k-means算法过度依赖初始聚类中心、收敛速度慢等局限性及其在处理海量数据时存在的内存不足问题,提出一种新的针对大数据集的混合聚类算法super-k-means,将改进的基于超网络的高维数据聚类算法与k-means相结合,并经过MapReduce并行化后部署在Hadoop集群上运行。实验表明,该算法不仅在收敛性以及聚类精度两方面得到优化,其加速比和扩展性也有了大幅度的改善。  相似文献   

17.
针对SVM在对大规模数据分类时求解规模过大的问题,提出了一种缩减数据集以提高训练速度的方法。该算法的第一步利用基于密度的方法大致定位能代表某个局域的质点,然后用SVM训练缩减后的数据得到一组支持向量,第二步的训练数据由支持向量以及其所代表的样本点构成。仿真实验证明该算法在保证分类准确率的情况下能有效地提高分类速度。  相似文献   

18.
为了避免PET/CT对病人造成大剂量的X辐射伤害和更好地对PET/MRI混合成像系统进行信号衰减校正。在组织分割方法的指导下,利用迁移模糊聚类算法将对人体无伤害的磁共振成像(MRI)划分成诸如空气、液体、软组织、骨头等不同组织成分,然后赋予不同组织不同的线性衰减系数,以此来实现配准的PET成像的衰减校正工作。本方法具有三大好处:(1)迁移模糊聚类算法可以利用历史高级知识来辅助当前病人MRI组织分割任务,从而保证了临床有效性和鲁棒性,降低了环境噪声、数据缺失及个体解剖结构差异等因素对算法的不良影响;(2)本算法内嵌的基于迁移学习的简单抽样策略,在保证算法鲁棒性的同时,极大地缩短了聚类划分的整体时间,适用于医学MRI大数据快速聚类分割的场合,因而有效地增强了算法的实用性;(3)本算法涉及的历史MRI知识,都是通过历史MRI源数据高度总结得到,非历史MRI源数据,这有效地保护了病人隐私,符合医学诊断的基本要求。通过在真实数据集上的实验表明了上述优点。  相似文献   

19.
密度峰值聚类(DPC)方法能够快速地对数据进行聚类,而不管它们的形状和包含它们的空间的维数,近年来得到广泛研究和应用。然而,当各个聚类中心的密度的差异较大,或者同一个类中包含多个密度中心时,DPC计算效果受到影响。针对于此,提出了基于密度二分法的密度峰值聚类方法。首先,求出全部数据平均密度,将数据分为高密度点和低密度点,然后,根据高密度的点的决策图识别出聚类中心后,根据是否存在可达距离的数据点对同类的聚类中心实现合并。最后,根据提出的分配策略,使高密度点和低密度点都分配到合适的聚类中心,从而实现聚类。在多个合成及实际数据集上的实验表明,该方法的聚类效果明显优于已有的DPC方法。  相似文献   

20.
Motivated by the poor performance (linear complexity) of the EM algorithm in clustering large data sets, and inspired by the successful accelerated versions of related algorithms like k-means, we derive an accelerated variant of the EM algorithm for Gaussian mixtures that: (1) offers speedups that are at least linear in the number of data points, (2) ensures convergence by strictly increasing a lower bound on the data log-likelihood in each learning step, and (3) allows ample freedom in the design of other accelerated variants. We also derive a similar accelerated algorithm for greedy mixture learning, where very satisfactory results are obtained. The core idea is to define a lower bound on the data log-likelihood based on a grouping of data points. The bound is maximized by computing in turn (i) optimal assignments of groups of data points to the mixture components, and (ii) optimal re-estimation of the model parameters based on average sufficient statistics computed over groups of data points. The proposed method naturally generalizes to mixtures of other members of the exponential family. Experimental results show the potential of the proposed method over other state-of-the-art acceleration techniques.
Nikos VlassisEmail:
  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号