CURE: An efficient clustering algorithm for large databases |
| |
Affiliation: | 1. School of Computer, Guangdong University of Technology, Guangzhou 510006, China;2. School of Automation, Guangdong University of Technology, Guangzhou 510006, China;3. School of Mathematics & Big Data, Foshan University, Foshan, Guangdong 528000, China;1. School of Information Science & Technology, Southwest Jiaotong University, Chengdu 610031, China;2. Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing, 400714, China;3. Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China;4. School of Information Engineering, Guizhou University of Engineering Science, Bijie 551700, China;1. Department of Computer Engineering, University of Kurdistan, Sanandaj, Iran;2. Department of Computer Science, University of Human Development, Sulaymanyah, Iraq |
| |
Abstract: | Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a fixed number of points, which are generated by selecting well-scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes, and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned, and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than that of clusters found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE not only to outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality. |
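The representative-point idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the parameter names `c` (number of representatives) and `alpha` (shrink fraction), the farthest-point selection heuristic, and the use of Euclidean distance are all assumptions made for illustration.

```python
import math


def centroid(points):
    """Mean of a non-empty list of equal-length coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def scattered_points(points, c):
    """Pick up to c well-scattered points (assumed farthest-point heuristic)."""
    center = centroid(points)
    # First pick: the point farthest from the centroid.
    chosen = [max(points, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(c, len(points)):
        remaining = [p for p in points if p not in chosen]
        if not remaining:  # only duplicates of already chosen points are left
            break
        # Next pick: the point whose nearest chosen point is farthest away.
        chosen.append(
            max(remaining, key=lambda p: min(math.dist(p, q) for q in chosen))
        )
    return chosen


def representatives(points, c=4, alpha=0.5):
    """Shrink each scattered point toward the centroid by the fraction alpha."""
    center = centroid(points)
    return [
        tuple(x + alpha * (m - x) for x, m in zip(p, center))
        for p in scattered_points(points, c)
    ]


def cluster_distance(reps_a, reps_b):
    """Distance between two clusters: closest pair of representatives."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)


# Hypothetical elongated cluster lying along the x-axis.
elongated = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
reps = representatives(elongated, c=2, alpha=0.5)
print(reps)
print(cluster_distance(reps, [(5.0, 0.0)]))
```

With several representatives spread along the cluster, the distance to a neighbouring cluster is measured from the nearest part of its shape rather than from a single center, which is what lets elongated clusters be followed. In this sketch, a larger `alpha` pulls the representatives closer to the centroid, so a stray point selected as a representative has less influence on which clusters look close to each other.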
| |
Keywords: | |
This article is indexed in ScienceDirect and other databases. |
|