首页 | 本学科首页   官方微博 | 高级检索  
     

一种基于近似EMD的DBSCAN改进算法
引用本文:张宏兵,陆建峰,汤九斌.一种基于近似EMD的DBSCAN改进算法[J].山东大学学报(工学版),2012,42(4):35-40.
作者姓名:张宏兵  陆建峰  汤九斌
作者单位:1.南京理工大学计算机科学技术学院, 江苏 南京 210094; 2.中国电信江苏公司, 江苏 南京 210037
基金项目:江苏省自然基金资助项目(BK2009489);江苏省青蓝工程资助项目
摘    要:DBSCAN(density based spatial clustering of applications with noise)算法是基于密度的经典聚类算法,但是该算法应用于高维数据时,常用距离函数不能很好地反映出数据点之间的关系, 从而可能导致聚类簇不够精确。如果能在高维空间中采用合适的距离度量,将会改善聚类结果。针对上述问题,提出利用近似EMD(earth mover’s distance,堆土机距离)作为距离测度,通过迭代搜索的方法找出所有直接密度可达对象实现聚类。实验结果表明:在高维文本数据的聚类中,和原来算法相比,改进算法的正确率提高了6%,两者在时间上相差不大;而对低维的Iris数据,改进算法通过EMD改善了实体间的相似性度量,减少了划分为噪声点的数据点个数,平均正确率提高了10%。实验结果表明了改进算法对高维数据的有效性,并可以改善聚类性能。

关 键 词:聚类  DBSCAN算法  近似EMD  高维数据  
收稿时间:2012-05-06

An improved DBSCAN algorithm based on the approximate EMD
ZHANG Hong-bing,LU Jian-feng,TANG Jiu-bin.An improved DBSCAN algorithm based on the approximate EMD[J].Journal of Shandong University of Technology,2012,42(4):35-40.
Authors:ZHANG Hong-bing  LU Jian-feng  TANG Jiu-bin
Affiliation:1. School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, China; 2. Jiangsu Corporation of China Telecom, Nanjing 210037, China
Abstract:The DBSCAN algorithm is one of the classic clustering algorithms based on the density.When this algorithm was applied to high-dimensional data,the distance measures in common use could not reflect the relationships between instances well,which would lead to the inaccurate clustering.If appropriate distance measures were adopted in high-dimensional space,the clustering result would be improved.To solve the above problem,the approximate EMD(earth mover′s distance) instead of the common distance was used as the distance measure,and the clustering was achieved by finding all density-reachable objects with the method of iterative search.The experimental results showed that the performance of improved algorithm was 6% higher than that of the original algorithm for the high-dimensional text clustering,while there is no obvious difference in time cost.For low-dimensional Iris data,the proposed algorithm could improve the similarity measure between the instances,reduce the number of data points classified as noise points,and boot the performance with 10%.The experimental results also indicated that the proposed algorithm could reveal its effectiveness for high-dimensional data,and could improve the clustering performance.
Keywords:clustering  DBSCAN algorithm  approximate EMD  high-dimensional data
本文献已被 CNKI 万方数据 等数据库收录!
点击此处可从《山东大学学报(工学版)》浏览原始摘要信息
点击此处可从《山东大学学报(工学版)》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号