首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 453 毫秒
1.
基于密度的增量式网格聚类算法   总被引:29,自引:0,他引:29  
提出基于密度的网格聚类算法GDcA,发现大规模空间数据库中任意形状的聚类.该算法首先将数据空间划分成若干体积相同的单元,然后对单元进行聚类只有密度不小于给定阈值的单元才得到扩展,从而大大降低了时间复杂性在GDcA的基础上,给出增量式聚类算法IGDcA,适用于数据的批量更新.  相似文献   

2.
为满足大规模空间数据库的聚类需求,面向计算机集群,提出一种基于密度的并行聚类算法。该算法根据数据库分布特征进行数据分区,在每一个节点上对数据块并行聚类,在主节点上合并聚类结果。实验结果表明,该算法的计算速度随着节点数的增多呈线性增加,具有较好的延展性。  相似文献   

3.
提出一种能够有效处理大规模分布的数据聚类问题且简化计算复杂度的分阶段非线性聚类方法,该算法包含两个阶段:首先将数据划分为若干个球形分布的子类,采用K近邻图理论对原始数据计算顶点能量并提取顶点攻能量样本;再采用K近邻算法对该高能量样本做一个划分,从而得到一个考虑高能量样本的粗划分同时估计出聚类的个数,最后,综合两次聚类结果整理得到最终聚类结果。该方法的主要优点是可以用来处理复杂聚类问题,算法较为稳定,并且在保持聚类正确率的同时,降低了大规模分布数据为相似性度量的计算代价。  相似文献   

4.
孙伟鹏 《计算机应用研究》2020,37(1):163-166,171
针对FSDP聚类算法在计算数据对象的局部密度与最小距离时,由于需要遍历整个数据集而导致算法的整体时间复杂度较高的问题,提出了一种基于Spark的并行FSDP聚类算法SFSDP。首先,通过空间网格划分将待聚类数据集划分成多个数据量相对均衡的数据分区;然后,利用改进的FSDP聚类算法并行地对各个分区内的数据执行聚类分析;最后,通过将分区间的局部簇集合并,生成全局簇集。实验结果表明,SFSDP与FSDP算法相比能够有效地进行大规模数据集的聚类分析,并且算法在准确性和扩展性方面都有很好的表现。  相似文献   

5.
黄学雨  向驰  陶涛 《计算机应用研究》2021,38(10):2988-2993,3024
对于基于划分的聚类算法随机选取初始聚类中心导致初始中心敏感,聚类结果不稳定、集群效率低等问题,提出一种基于MapReduce框架和改进的密度峰值的划分聚类算法(based on MapReduce framework and im-proved density peak partition clustering algorithm,MR-IDPACA).首先,通过自然最近邻定义新的局部密度计算方式,将搜索样本密度峰值点作为划分聚类算法的初始聚类中心;其次针对算法在大规模数据下运行时间复杂,提出基于E2LSH(exact Euclidean locality sensitive hashing)的一种分区方法,即KLSH(K of locality sensitive hashing).通过该方法对数据分区后结合MapReduce框架并行搜寻初始聚类中心,有效减少了算法在搜索初始聚类中心时的运行时间;对于MapReduce框架中的数据倾斜问题,提出ME(multistage equilibrium)策略对中间数据进行多段均衡分区,以提升算法运行效率;在MapReduce框架下并行聚类,得到最终聚类结果.实验得出MR-IDPACA算法在单机环境下有着较高的准确率和较强的稳定性,集群性能上也有着较好的加速比和运行时间,聚类效果有所提升.  相似文献   

6.
刘建伟  李卫民 《计算机科学》2009,36(11):148-151
传统的数据库管理系统和数据查询算法不能很好地支持对流数据的查询已经被广泛认识,因而需要研究新的流数据模式查询算法.提出了一种基于摘要技术的在线快速混合模型流数据聚类算法,该算法为分阶段混合模型聚类过程.算法首先时最初到达的流数据用多维网格结构进行划分,对划分形成的每一个单元进行数据摘要,提取足够的统计信息.对该摘要运行基于模型的贪心聚类算法,聚类形成的混合模型的摘要信息存储在永久摘要数据库中,从而形成初始聚类混合模型;在聚类模型的维持过程中,当不断有流数据到达时,对到达的数据块用多维网格结构进行划分,对划分形成的每一个单元提取足够的摘要信息.对该摘要运行基于模型的贪心聚类算法形成聚类混合模型.在判断是否可以把新到达的模型合并到现有的混合模型中去时,提出了三种合并标准.实验表明,该算法减少了分类误差,其速度也比传统的基于模型的贪心聚类算法大大加快.  相似文献   

7.
一种基于划分的不同参数值的DBSCAN算法   总被引:6,自引:0,他引:6  
聚类是数据挖掘领域中一个重要的研究方向,DBSCAN是一种基于密度的聚类算法.该算法将具有足够高密度的区域划分成簇,并可以在带有“噪声”的空间数据库中发现任意形状的簇.分析DBSCAN算法发现存在如下问题:当数据分布不均匀时,由于使用统一的全局变量,使得聚类的效果差.针对这一缺陷,提出了一种基于数据划分的思想,并对各个局部数据集采取不同的参数值分别进行聚类,最后合并各局部聚类结果.实验结果表明,改进后的算法有效并可行.  相似文献   

8.
Hilbert曲线能够线性填充数据空间,将数据空间分割成大小相等的网格,从而将位于网格中的点映射到线性空间中.本文利用Hilbert曲线的数据聚类性质,提出一种基于Hilbert曲线网格划分聚类算法,详细叙述算法的执行过程,并给出每一步的理论依据.算法首先以网格为单位合并出面积较小的聚集,然后将小聚集经过若干次合并形成较大聚集,最终使得聚集最优.实验结果表明该算法的执行时间少于经典聚类算法k-means和基于网格聚类算法CLIQUE.  相似文献   

9.
针对传统的人脸识别算法在处理单样本人脸识别时由于训练样本不足而影响识别率的问题,提出了一种基于分块聚类的多流形判别分析(MMDA)算法.将每个单训练样本划分成若干大小相等且互不重叠的局部小块,利用聚类算法将局部小块聚类到各个类所属的流形上,并使用特征变换最大化类与类之间的分离性;最后,计算出测试人脸的流形与所有训练样本流形之间的距离,采用最近邻分类器完成人脸的识别.在ORL及FERET两大人脸数据库上的实验验证了算法的有效性及可靠性,识别率可分别高达77.22%、57.59%,实验结果表明,相比几种较为先进的人脸识别算法,该算法在处理单训练样本人脸识别问题时取得了更好的识别效果.  相似文献   

10.
大型数据库中基于邻域连接的层次聚类算法   总被引:3,自引:2,他引:3  
董一鸿 《计算机工程与应用》2003,39(32):194-197,225
提出了一种基于邻域连接的层次聚类算法HANL,该算法首先采用分割的方法将数据集划分为若干个子簇,通过对子簇间的连接的分析,建立子簇间的连接构成图,图中带权的边代表了子簇间的连接紧密度。合并连接紧密度高的结点,得到最后的聚类结果。该算法适用于高维数据集,能够对任意形状的簇进行聚类,无论对于数值属性的数据库还是分类属性的数据库都是一个有效的聚类方法。同时这种方法聚类速度快,效率高,具有良好的伸缩性。  相似文献   

11.
The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources such as memory and computation time. In this paper, two scalable clustering algorithms, bEMADS and gEMADS, are presented based on the Gaussian mixture model. Both summarize data into subclusters and then generate Gaussian mixtures from their data summaries. Their core algorithm, EMADS, is defined on data summaries and approximates the aggregate behavior of each subcluster of data under the Gaussian mixture model. EMADS is provably convergent. Experimental results substantiate that both algorithms can run several orders of magnitude faster than expectation-maximization with little loss of accuracy.  相似文献   

12.
Fuzzy C-means (FCM) clustering has been widely used successfully in many real-world applications. However, the FCM algorithm is sensitive to the initial prototypes, and it cannot handle non-traditional curved clusters. In this paper, a multi-center fuzzy C-means algorithm based on transitive closure and spectral clustering (MFCM-TCSC) is provided. In this algorithm, the initial guesses of the locations of the cluster centers or the membership values are not necessary. Multi-centers are adopted to represent the non-spherical shape of clusters. Thus, the clustering algorithm with multi-center clusters can handle non-traditional curved clusters. The novel algorithm contains three phases. First, the dataset is partitioned into some subclusters by FCM algorithm with multi-centers. Then, the subclusters are merged by spectral clustering. Finally, based on these two clustering results, the final results are obtained. When merging subclusters, we adopt the lattice similarity method as the distance between two subclusters, which has explicit form when we use the fuzzy membership values of subclusters as the features. Experimental results on two artificial datasets, UCI dataset and real image segmentation show that the proposed method outperforms traditional FCM algorithm and spectral clustering obviously in efficiency and robustness.  相似文献   

13.
Chien-Yu  Shien-Ching  Yen-Jen   《Pattern recognition》2005,38(12):2256-2269
As the sizes of many contemporary databases continue to grow rapidly, incremental clustering has emerged as an essential issue for conducting data analysis on contemporary databases. An incremental clustering algorithm refers to an abstraction of the distribution of the data instances generated by the previous run of the algorithm and therefore is able to cope well with the ever-growing contemporary databases. There are two main challenges in the design of incremental clustering algorithms. The first challenge is how to reduce information loss due to the data abstraction (or summarization) operations. The second challenge is that the clustering result should not be sensitive to the order of input data. This paper presents the GRIN algorithm, an incremental hierarchical clustering algorithm for numerical datasets based on the gravity theory in physics. In the design of GRIN, a statistical test aimed at reducing information loss and distortion is employed to control formation of subclusters as well as to monitor the evolution of the dataset. Due to the statistical test-based summarization approach, GRIN is able to achieve near linear scalability and is not sensitive to input ordering.  相似文献   

14.
Mining useful information and helpful knowledge from large databases has evolved into an important research area in recent years. Among the classes of knowledge derived, finding sequential patterns in temporal transaction databases is very important since it can help model customer behavior. In the past, researchers usually assumed databases were static to simplify data-mining problems. In real-world applications, new transactions may be added into databases frequently. Designing an efficient and effective mining algorithm that can maintain sequential patterns as a database grows is thus important. In this paper, we propose a novel incremental mining algorithm for maintaining sequential patterns based on the concept of pre-large sequences to reduce the need for rescanning original databases. Pre-large sequences are defined by a lower support threshold and an upper support threshold that act as gaps to avoid the movements of sequences directly from large to small and vice versa. The proposed algorithm does not require rescanning original databases until the accumulative amount of newly added customer sequences exceeds a safety bound, which depends on database size. Thus, as databases grow larger, the numbers of new transactions allowed before database rescanning is required also grow. The proposed approach thus becomes increasingly efficient as databases grow.  相似文献   

15.
Incrementally fast updated frequent pattern trees   总被引:3,自引:0,他引:3  
The frequent-pattern-tree (FP-tree) is an efficient data structure for association-rule mining without generation of candidate itemsets. It was used to compress a database into a tree structure which stored only large items. It, however, needed to process all transactions in a batch way. In real-world applications, new transactions are usually inserted into databases. In this paper, we thus attempt to modify the FP-tree construction algorithm for efficiently handling new transactions. A fast updated FP-tree (FUFP-tree) structure is proposed, which makes the tree update process become easier. An incremental FUFP-tree maintenance algorithm is also proposed for reducing the execution time in reconstructing the tree when new transactions are inserted. Experimental results also show that the proposed FUFP-tree maintenance algorithm runs faster than the batch FP-tree construction algorithm for handling new transactions and generates nearly the same tree structure as the FP-tree algorithm. The proposed approach can thus achieve a good trade-off between execution time and tree complexity.  相似文献   

16.
孤立数据的存在使数据挖掘结果不准确,甚至错误。现有的孤立点检测算法在通用性、有效性、用户友好性及处理高维大数据集的性能还不完善,为此,提出一种有效的全局孤立点检测方法,该方法进行凝聚层次聚类,根据聚类树和距离矩阵来可视化判断数据孤立程度,确定孤立点数目。从聚类树自顶向下,无监督地去除离群数据点。在多个数据集上的仿真实验结果表明,该方法能有效识别孤立程度最大的前n个全局孤立点,适用于不同形状的数据集,算法效率高,用户友好,且适用于大型高维数据集的孤立点检测。  相似文献   

17.
现有的孤立点检测算法在通用性、有效性、用户友好性及处理高维大数据集的性能还不完善,为此提出一种快速有效的基于层次聚类的全局孤立点检测方法。该方法基于层次聚类的结果,根据聚类树和距离矩阵可视化判断数据孤立程度,并确定孤立点数目。从聚类树自顶向下,无监督地去除孤立点。仿真实验验证了方法能快速有效识别全局孤立点,具有用户友好性,适用于不同形状的数据集,可用于大型高维数据集的孤立点检测。  相似文献   

18.
基于划分和层次的混合动态聚类算法*   总被引:1,自引:0,他引:1  
针对划分聚类对初始值较为敏感以及层次聚类时间复杂度高等缺陷,提出了一种基于划分和层次的混合动态聚类算法HDC-PH。该算法首先使用划分聚类快速生成一定数量的子簇,然后以整体相似度的聚类质量评价标准来动态改变聚类数目,同时给出了聚类过程中孤立点的剔除方法。实验结果表明,HDC-PH算法的性能明显优于划分和层次算法,提高了聚类质量,并获得了更自然的聚类结果。  相似文献   

19.
针对简单的分布式封锁方法和完全分布式加锁算法在加锁时所需通信开销大、封锁时间长、锁管理复杂的缺点,结合集中式数据库加锁管理算法的优点,指出了在分布式数据库中保持事务可串行化方面存在的难点,利用全局目录和事务调度器,提出了基于全局目录的分布式数据库加锁管理算法。该算法使用两阶段封锁协议和多粒度封锁协议,在全局目录服务器中使用全局锁管理器管理和维护全局目录中的锁结点信息并对分布式封锁请求进行集中控制和灵活管理,能有效地保证事务的可串行化调度,降低封锁时的通信开销。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号