Similar Literature
 18 similar articles found.
1.
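Research and Implementation of a Density-Based Clustering Algorithm for Complex Clusters   Total citations: 3 (self-citations: 2, citations by others: 1)
In applications of clustering to pattern recognition, data analysis, image processing, and market research, the key technical problem is how to cluster complex data objects effectively. Based on an analysis of existing clustering algorithms, this paper proposes an improved algorithm based on density and adaptive density-reachability. Experiments show that the algorithm can effectively cluster data whose clusters have arbitrary shapes, different densities, and different scales; its computational complexity is also markedly lower than that of traditional density-based clustering algorithms.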

2.
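Traditional partition-based clustering algorithms require the number of clusters to be specified manually, and because they use hard partitioning they may split large or elongated clusters, producing incorrect results. Density peak clustering is a recently proposed density-based clustering algorithm that does not need the number of clusters in advance and can discover non-spherical clusters. This paper introduces the density-peak idea into partition-based clustering and proposes a fast clustering algorithm based on density peaks and partitioning (DDBSCAN). The algorithm first obtains a set of core objects (density peaks) that describe the "skeleton" of the clusters, then assigns the surrounding points to the nearest core object, and finally merges clusters by examining the density at the partition boundaries. Experiments show that the algorithm adapts effectively to datasets of arbitrary shape and varying size, and converges faster than traditional density-based clustering algorithms.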
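
The assignment step in this abstract (each point joins its nearest core object) can be illustrated with a minimal sketch. The function name `assign_to_nearest_core`, the use of Euclidean distance, and the sample coordinates are assumptions made for illustration; the abstract does not say how the core objects are obtained or how boundary density drives the merge, so neither step is shown.

```python
import math


def assign_to_nearest_core(points, cores):
    """Return, for each point, the index of the closest core object."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in cores]
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical data: two well-separated groups and one core object per group.
points = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (4.6, 5.3)]
cores = [(0.2, 0.1), (4.8, 5.1)]
labels = assign_to_nearest_core(points, cores)
```

Here the first two points are labelled with core 0 and the last two with core 1.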

3.
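Based on an analysis of the characteristics and applicability of common clustering algorithms, this paper proposes a clustering algorithm that combines density and partitioning methods. The algorithm automatically determines the density attraction centers of the clusters and their initial partition from the density distribution of the data objects. It then uses a partitioning approach to find density-reachable clusters of data objects according to the definition of density-reachability, which completes the final clustering. Experiments show that the algorithm handles clusters of arbitrary shape and size well, effectively shields the influence of noise and outliers, and detects isolated points; it also reduces the dependence of the input parameters on domain knowledge.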

4.
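Density-based clustering is one of the most commonly used analysis methods in data mining and machine learning. It can discover non-spherical clusters without the number of clusters being specified in advance, but it has shortcomings such as being unable to distinguish adjacent clusters of different densities. Drawing on the ideas of reverse nearest neighbors and influence space, this paper proposes a density-based clustering algorithm. Euclidean distance is used to compute the K nearest neighbors and reverse nearest neighbors of each data object; core objects are identified from the reverse nearest neighbors, and the influence space of each core object is determined. Using reverse nearest neighbors and influence space, the cluster expansion condition is redefined, and density clusters are formed by a breadth-first traversal of the influence spaces of the core objects. This effectively resolves the inability to separate adjacent clusters of different densities and improves both the quality and the efficiency of density-based clustering. Experiments on UCI and synthetic datasets verify the effectiveness of the algorithm.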
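
To make the neighbor terminology concrete, the following minimal sketch computes K nearest neighbors and reverse nearest neighbors with Euclidean distance. The function names and sample points are hypothetical, and the core-object rule used at the end (at least `k` reverse neighbors) is an assumption: the abstract only says that core objects are identified from the reverse nearest neighbors, without stating the criterion.

```python
import math


def k_nearest_neighbors(points, k):
    """For each point index, return the indices of its k nearest other points."""
    knn = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        knn.append(others[:k])
    return knn


def reverse_nearest_neighbors(knn):
    """rnn[j] lists every point i that has j among its k nearest neighbors."""
    rnn = [[] for _ in knn]
    for i, neighbors in enumerate(knn):
        for j in neighbors:
            rnn[j].append(i)
    return rnn


# Hypothetical data: four points in a tight square plus one distant outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
k = 2
rnn = reverse_nearest_neighbors(k_nearest_neighbors(points, k))

# Assumed rule for illustration only: a core object has at least k reverse neighbors.
core = [i for i, r in enumerate(rnn) if len(r) >= k]
```

The outlier has no reverse neighbors because no other point counts it among its two nearest neighbors, so under the assumed rule it is not a core object.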

5.
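To reduce the computational complexity of clustering by gridding the dataset, this paper proposes a density- and grid-based clustering algorithm with determinable cluster centers. First, the data space is partitioned into a grid; the number of data points falling in a grid cell is taken as that cell's density value, and the shortest distance from the cell to any cell of higher density is taken as its distance value. Then, cluster-center cells are identified by their combination of high density and large distance value, and the clustering is completed with a density-based assignment scheme. Finally, comparative experiments against several existing clustering algorithms on multiple datasets, measuring clustering accuracy and execution time, verify that the proposed algorithm is both accurate and fast.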
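
The two per-cell quantities defined in this abstract can be sketched as follows. The function name, the square cells of side `cell_size`, measuring cell distance between integer cell indices, and the fallback for the densest cell (largest distance to any cell) are all assumptions for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def grid_density_and_distance(points, cell_size):
    """Map each occupied grid cell to (density, distance to nearest denser cell)."""
    # Density of a cell = number of points whose coordinates fall inside it.
    density = Counter(
        tuple(math.floor(x / cell_size) for x in p) for p in points
    )
    cells = list(density)
    result = {}
    for c in cells:
        to_denser = [math.dist(c, o) for o in cells if density[o] > density[c]]
        if to_denser:
            distance = min(to_denser)
        else:
            # Assumption: a cell with no denser cell gets the largest cell distance.
            distance = max(math.dist(c, o) for o in cells)
        result[c] = (density[c], distance)
    return result


# Hypothetical data: three points in cell (0, 0), one in (1, 0), two in (5, 5).
points = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.1), (1.5, 0.5), (5.2, 5.1), (5.8, 5.9)]
result = grid_density_and_distance(points, cell_size=1.0)
```

In this toy input, cells (0, 0) and (5, 5) combine comparatively high density with a large distance value, which is the signature the abstract uses for cluster-center cells, while cell (1, 0) sits one cell away from a denser neighbor.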

6.
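Rough K-means clustering and its derivative algorithms require the number of clusters to be given manually in advance, and their random selection of initial cluster centers leads to low accuracy when assigning data in regions where clusters overlap. To address these problems, this paper proposes a rough fuzzy K-means clustering algorithm based on a hybrid metric and adaptive adjustment of clusters. When computing the degree of membership of boundary-region data objects in different clusters, the algorithm uses a hybrid metric that combines local density and distance, and it adopts a strategy of adaptively adjusting the number of clusters to obtain the optimal cluster count. The midpoint of the two closest samples in a dense region of the data is selected as an initial cluster center; nearby objects whose local density exceeds the average density are assigned to that cluster before the remaining initial centers are selected, which makes the choice of initial centers more reasonable. Experiments on synthetic datasets and standard UCI datasets show that the algorithm is adaptive and achieves good clustering accuracy on spherical-cluster datasets with heavily overlapping clusters.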

7.
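An Improved K-Means Text Clustering Algorithm   Total citations: 1 (self-citations: 0, citations by others: 1)
An improved K-means text clustering algorithm is proposed. The improvement rests on two points: 1) initial cluster centers are selected according to cluster density and inter-text distance, with a confidence radius introduced to obtain the cluster density, so that the points that are farthest apart and have the highest cluster density become the initial centers; 2) text similarity is computed with a weighted Hamming distance, and the silhouette coefficient is used to measure the clustering quality of the different algorithms. Experimental results show that the algorithm achieves better clustering quality than the original K-means text clustering algorithm and the algorithm of reference [1].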
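
As an illustration of the two measures named in this abstract, the sketch below pairs a simple weighted Hamming distance with the standard silhouette coefficient. The weighting scheme (one weight per term position, summed where two binary term vectors differ), the function names, and the sample documents are assumptions; the paper's exact definition of the weighted distance is not given here.

```python
def weighted_hamming(a, b, weights):
    """Sum the weights of the positions where two equal-length vectors differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def mean_silhouette(items, labels, dist):
    """Mean silhouette coefficient over all items (needs at least two clusters)."""
    def mean_dist(i, cluster):
        members = [j for j, lab in enumerate(labels) if lab == cluster and j != i]
        if not members:
            return None
        return sum(dist(items[i], items[j]) for j in members) / len(members)

    scores = []
    for i in range(len(items)):
        a = mean_dist(i, labels[i])  # mean distance to the item's own cluster
        b = min(mean_dist(i, c) for c in set(labels) if c != labels[i])
        if a is None or max(a, b) == 0:
            scores.append(0.0)  # singleton cluster or all-zero distances
        else:
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical binary term vectors for four documents in two clusters.
docs = [(1, 1, 0, 0), (1, 1, 1, 0), (0, 0, 1, 1), (0, 0, 0, 1)]
weights = (2.0, 1.0, 1.0, 2.0)
labels = [0, 0, 1, 1]
score = mean_silhouette(docs, labels, lambda a, b: weighted_hamming(a, b, weights))
```

A score close to 1 indicates compact, well-separated clusters; values near 0 or below suggest overlapping or misassigned items.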

8.
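When handling datasets of non-uniform density, the density peak clustering algorithm tends to assign low-density clusters to high-density clusters or to split a high-density cluster into several sub-clusters, and errors propagate during sample assignment. This paper proposes a density peak clustering algorithm based on relative density. Information about the sample points in the natural nearest neighborhood is introduced, a new method of computing local density is given, and the relative density is computed. After the cluster centers are determined from the decision graph, a density factor that accounts for density differences between clusters is used to compute a clustering distance for each cluster, and the remaining sample points are assigned according to this clustering distance, which enables clustering of datasets with different shapes and densities. Experiments on synthetic and real datasets show that the algorithm's FMI, ARI, and NMI scores exceed those of the classical density peak clustering algorithm and three other clustering algorithms by about 14, 26, and 21 percentage points on average, respectively, and that it accurately identifies cluster centers and assigns the remaining sample points on datasets with large density differences between clusters.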

9.
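The density peak clustering (DPC) algorithm is a novel density-based clustering algorithm that is simple in principle and efficient to run. However, the local density in DPC considers only the distances between samples and ignores the environment in which each sample lies, so the algorithm performs poorly on data with non-uniform density; in addition, the sample assignment process is prone to a chain effect of assignment errors. To address these problems, a density peak clustering algorithm based on relative density estimation and multi-cluster merging (DPC-RD-MCM) is proposed. Combining the ideas of K nearest neighbors and relative density, DPC-RD-MCM defines a relative K-nearest-neighbor local density, which reduces the influence of cluster sparsity on the cluster centers and avoids leaving sparse regions without a cluster center. It also redefines the similarity measure between micro-clusters and obtains the final clustering through a multi-cluster merging strategy, which avoids the chain effect of assignment errors. DPC-RD-MCM is compared with DPC and its improved variants on datasets with non-uniform density, datasets with complex shapes, and UCI datasets. The experimental results show that DPC-RD-MCM achieves excellent clustering results on non-uniform-density data and outperforms the compared algorithms on the complex-shape and UCI datasets.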
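
For context, the two quantities of the original DPC algorithm that this abstract criticizes can be sketched as below: a cutoff-kernel local density and the distance to the nearest point of higher density. This shows baseline DPC only, not DPC-RD-MCM; the function name, the cutoff value `dc`, and the sample points are assumptions for illustration.

```python
import math


def density_peaks(points, dc):
    """Return (rho, delta) for each point as in the original DPC formulation."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # rho: number of other points closer than the cutoff distance dc.
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < dc) for i in range(n)]
    delta = []
    for i in range(n):
        to_higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        # By convention, a point with no denser point takes its largest distance.
        delta.append(min(to_higher) if to_higher else max(d[i]))
    return rho, delta


# Hypothetical data: a cross-shaped group around (0, 0) and a group near (5, 5).
points = [(0, 0), (0, 1), (1, 0), (0, -1), (5, 5), (5, 6), (6, 5)]
rho, delta = density_peaks(points, dc=1.2)
```

Points 0 and 4 stand out with both high `rho` and large `delta`, which is how the decision graph picks cluster centers. Because `rho` depends only on a fixed distance cutoff, the sparser second group gets a lower density than the first, which is the kind of effect the relative-density variants above aim to correct.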

10.
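周世波, 徐维祥. 《控制与决策》 (Control and Decision), 2018, 33(11): 1921-1930
Clustering is an important research direction in data mining. To address non-uniform density between clusters, diverse cluster shapes, and the identification of cluster centers in complex datasets, this paper uses k-nearest-neighbor information to compute the relative density of each sample point and, borrowing the cluster-center identification method of the clustering by fast search and find of density peaks (CFSFDP) algorithm, proposes a clustering algorithm based on relative density and the decision graph. The algorithm identifies cluster centers quickly and accurately and clusters datasets of arbitrary distribution effectively. Experimental results on seven types of typical test datasets show that the proposed algorithm is broadly applicable: compared with the classical DBSCAN and CFSFDP algorithms, it clusters better and adapts to a wider range of dataset types without a significant increase in time complexity.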

11.
Large graphs are scale-free and ubiquitous, with irregular relationships. Clustering is used to find similar patterns in graphs and thus helps in gaining useful insights. In the real world, nodes may belong to more than one cluster, so it is essential to analyze the fuzzy cluster membership of nodes. Traditional centralized fuzzy clustering algorithms incur high communication cost and produce poor-quality clusters when used on large graphs. Scalable solutions are therefore needed to handle huge amounts of data in less computational time with minimal disk access. In this paper, we propose a parallel fuzzy clustering algorithm named 'PGFC' for handling large-scale graph data. From the viewpoint of expert systems, it is advantageous to develop a clustering algorithm that ensures scalability along with better cluster quality for large graphs. The algorithm is parallelized using the bulk synchronous parallel (BSP) based Pregel model. The cluster centers are initialized using the degree centrality measure, resulting in fewer iterations. The performance of PGFC is compared with other state-of-the-art clustering algorithms on synthetic graphs and real-world networks. The experimental results reveal that the proposed PGFC scales linearly to handle large graphs and produces better-quality clusters than other graph clustering counterparts.
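
Degree-based initialization can be sketched in a few lines. This is a single-machine illustration, not the Pregel implementation; the function name, the adjacency-dictionary representation, the tie-breaking by node name, and the simple "top k by degree" rule are assumptions, since the abstract states only that the centers are initialized using degree centrality.

```python
def init_centers_by_degree(adjacency, k):
    """Pick the k highest-degree nodes as initial cluster centers."""
    # Sort by descending degree; break ties by node name for determinism.
    ranked = sorted(adjacency, key=lambda node: (-len(adjacency[node]), node))
    return ranked[:k]


# Hypothetical undirected graph stored as node -> set of neighbors.
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": {"a", "c", "e"},
    "e": {"d"},
}
centers = init_centers_by_degree(adjacency, k=2)
```

Normalized degree centrality divides each degree by the number of other nodes, which does not change the ranking, so raw degrees are enough for selecting the centers.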

12.
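To address problems that arise when self-organizing feature map (SOFM) neural networks are used for fuzzy clustering, a neural network with an improved structure is proposed. Using adaptive initial clustering values, it can cluster high-dimensional data and clusters of arbitrary shape, and it has lower time complexity than other algorithms that achieve the same clustering quality. Simulation results show that the clustering algorithm is more effective than a single neural network clustering algorithm and other algorithms of the same kind.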

13.
As data mining has attracted a significant amount of research attention, many clustering algorithms have been proposed in the past decades. However, most existing clustering methods have high computational cost or are not suitable for discovering clusters with non-convex shapes. In this paper, an efficient clustering algorithm, CHSMST, is proposed, which is based on clustering based on hyper surface (CHS) and the minimum spanning tree. In the first step, CHSMST applies CHS to obtain initial clusters immediately. Thereafter, a minimum spanning tree is introduced to handle locally dense data, which is hard for CHS to deal with. The experiments show that CHSMST can discover clusters of arbitrary shape. Moreover, CHSMST is insensitive to the order of the input samples, and its run time increases moderately as the dataset grows.

14.
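An Improved Density-Based Clustering Algorithm with Sampling   Total citations: 1 (self-citations: 0, citations by others: 1)
The density-based clustering algorithm DBSCAN is an effective spatial clustering algorithm that can discover clusters of arbitrary shape and handle noise effectively. However, DBSCAN also has shortcomings: (1) it considers only spatial attributes during clustering and ignores non-spatial attributes; (2) it requires substantial memory and I/O when clustering large spatial databases. Based on an analysis of these shortcomings, an improved density-based spatial clustering algorithm with sampling (IDBSCAS) is proposed. It can process large spatial databases effectively and considers both spatial and non-spatial attributes. Tests on two-dimensional spatial data show that the algorithm is feasible and effective.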

15.
The success rate of expert or intelligent systems depends on the selection of the correct data clusters. The k-means algorithm is a well-known method for solving data clustering problems. It suffers not only from a high dependency on the algorithm's initial solution but also from the distance function used. A number of algorithms have been proposed to address the centroid initialization problem, but the solutions they produce do not yield optimum clusters. This paper proposes three algorithms: (i) the search algorithm C-LCA, an improved League Championship Algorithm (LCA); (ii) a search clustering algorithm using C-LCA (SC-LCA); and (iii) a hybrid clustering algorithm, the hybrid of k-means and Chaotic League Championship Algorithm (KSC-LCA), which has two computation stages. C-LCA employs chaotic adaptation for the retreat and approach parameters, rather than constants, which can enhance the search capability. Furthermore, to overcome the limitation of the original k-means algorithm, whose Euclidean distance cannot handle categorical attributes properly, we adopt the Gower distance and a mechanism for handling the discrete-value requirement of categorical attributes. The proposed algorithms can handle not only pure numeric data but also mixed-type data, and can find the best centroids containing categorical values. Experiments were conducted on 14 datasets from the UCI repository. SC-LCA and KSC-LCA competed with 16 established algorithms, including the k-means, k-means++, and global k-means algorithms, four search clustering algorithms, and nine hybrids of the k-means algorithm with several state-of-the-art evolutionary algorithms. The experimental results show that SC-LCA produces the clusters with the highest F-Measure on the pure categorical dataset and KSC-LCA produces the clusters with the highest F-Measure on the pure numeric and mixed-type datasets tested.
On 13 of the 14 datasets, the centroids produced by SC-LCA had better F-Measures than those of the k-means algorithm. On the Tic-Tac-Toe dataset, which contains only categorical attributes, SC-LCA achieves an F-Measure of 66.61, which is 21.74 points above that of the k-means algorithm (44.87). KSC-LCA produced better centroids than the k-means algorithm on all 14 datasets; the maximum F-Measure improvement was 11.59 points. In terms of computational time, SC-LCA and KSC-LCA took more NFEs than k-means and its variants, but KSC-LCA ranks first and SC-LCA ranks fourth among the hybrid clustering and search clustering algorithms that we tested. Therefore, SC-LCA and KSC-LCA are general and effective clustering algorithms that could be used when an expert or intelligent system requires accurate, high-speed cluster selection.
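
The Gower distance mentioned in this abstract handles mixed attributes by scoring each attribute on a 0 to 1 scale and averaging. The sketch below shows the common equal-weight form without missing values; the function name, the `ranges` argument, and the sample records are assumptions for illustration and are not taken from the paper.

```python
def gower_distance(a, b, ranges):
    """Gower distance between two records.

    ranges[i] is the value range (max - min) of numeric attribute i,
    or None when attribute i is categorical.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0  # categorical: simple mismatch
        elif r > 0:
            total += abs(x - y) / r          # numeric: range-normalized difference
        # A numeric attribute with zero range contributes nothing.
    return total / len(ranges)


# Hypothetical records: (age, income, favourite colour).
a = (30, 50000.0, "red")
b = (40, 60000.0, "blue")
ranges = (50, 100000.0, None)
distance = gower_distance(a, b, ranges)
```

Each attribute contributes at most 1, so the distance stays in [0, 1] regardless of the original units, which is what lets a centroid-based method compare records that mix numeric and categorical values.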

16.
There are two popular types of forecasting algorithms for fuzzy time series (FTS). One is based on intervals of the universal sets of the independent variables and the other is based on fuzzy clustering algorithms. Clustering-based FTS algorithms are preferred because the role and optimal length of the intervals are not clearly understood. The data of each variable are therefore clustered individually, which requires more computational time. Fuzzy Logical Relationships (FLRs) are used in existing FTS algorithms to relate input and output data. A high number of clusters and FLRs is required to establish precise input/output relations, which incurs high computational time. This article presents a forecasting algorithm based on fuzzy clustering (CFTS), which clusters vectors of input data instead of clustering the data of each variable separately and uses linear combinations of the input variables instead of the FLRs. The cluster centers handle the fuzziness and ambiguity of the data, and the linear parts allow the algorithm to learn more from the available information. It is shown that CFTS outperforms existing FTS algorithms, with considerably lower testing error and running time.

17.
A variety of clustering algorithms exist to group objects having similar characteristics, but many of them are challenging to apply to categorical data. Some of the algorithms cannot handle categorical data at all, while others cannot handle the uncertainty inherent in categorical data. Handling this uncertainty is a prerequisite for clustering categorical data. An algorithm termed minimum-minimum roughness (MMR) was proposed, which uses rough set theory to deal with these problems in clustering categorical data. Later, many algorithms were developed to improve the handling of hybrid data. This research proposes information-theoretic dependency roughness (ITDR), another technique for categorical data clustering that takes into account the information-theoretic dependency degree of the attributes of categorical-valued information systems. In addition, it is second to none among its predecessors: MMR, MMeR, SDR, and standard-deviation of standard-deviation roughness (SSDR). Experimental results on two benchmark UCI datasets show that the ITDR technique performs better than the baseline categorical data clustering techniques with respect to computational complexity and cluster purity.

18.
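Unsupervised clustering algorithms in machine learning have been widely applied to object recognition tasks. Clustering by fast search of density peaks (DPC) can determine cluster centers and the number of clusters quickly and effectively, but when handling data with complex distributions and high-dimensional image data it still has problems: cluster centers are hard to determine and too few clusters are found. To improve its robustness on complex high-dimensional data, this paper proposes a density peak fast-search clustering algorithm based on learned feature representations (AE-MDPC). The algorithm uses an unsupervised autoencoder (AutoEncoder) to learn an optimal feature representation of the data and combines it with a manifold similarity that captures the global consistency of the data. This increases the compactness of data within a class and the separation between classes, so that the density of potential class centers becomes a local maximum. AE-MDPC is compared with the classical K-means, DBSCAN, and DPC algorithms and with DPC combined with PCA on four synthetic datasets and four real image datasets. The experimental results show that AE-MDPC outperforms the compared algorithms on the external evaluation metric clustering accuracy and the internal evaluation metrics adjusted mutual information and adjusted Rand index, and that it provides better visualization. In summary, AE-MDPC, which is based on feature representation learning combined with manifold distance, handles complex manifold data and high-dimensional image data effectively.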
