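A Distributed Density-Based Clustering Algorithm Based on Vertical Data Partitioning
Cite this article: Ni Weiwei, Chen Geng, Sun Zhihui. A Distributed Density-Based Clustering Algorithm Based on Vertical Data Partitioning [J]. Journal of Computer Research and Development (计算机研究与发展), 2007, 44(9): 1612-1617.
Authors: Ni Weiwei (倪巍伟)  Chen Geng (陈耿)  Sun Zhihui (孙志挥)
Affiliations: School of Computer Science and Engineering, Southeast University, Nanjing 210096 (Ni Weiwei, Sun Zhihui); Laboratory of Audit Information Engineering, Nanjing Audit University, Nanjing 210029 (Chen Geng)
Funding: Natural Science Foundation of Jiangsu Province; Specialized Research Fund for the Doctoral Program of Higher Education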
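Abstract: Cluster analysis is an important research topic in data mining, and clustering large datasets is especially difficult because of the data volume and the large amount of noise. For vertically partitioned data, the concepts of connected point set and local noise point set are introduced. Based on an analysis of the relationships between local and global noise point sets and between local and global connected point sets, global noise points are filtered effectively. A closed triangle list structure is then designed to store the intermediate clustering results of each node, and a density-based distributed clustering algorithm, DDBSCAN, is proposed. Theoretical analysis and experimental results show that the algorithm effectively solves the clustering problem for large vertically partitioned datasets and is feasible in practice.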

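Keywords: distributed data mining  vertically partitioned data  connected point set  local noise point set  closed triangle list  noise data  partitioning  distributed  density clustering algorithm  Distributed  Partitioned  Clustering  Algorithm  clustering problem  effective solution  intermediate results  experiment  theory  density-based  node  structured storage  linked list  design  filtering  relation  analysis
Revised: 2007-01-04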

An Efficient Density-Based Clustering Algorithm for Vertically Partitioned Distributed Datasets
Ni Weiwei, Chen Geng, Sun Zhihui. An Efficient Density-Based Clustering Algorithm for Vertically Partitioned Distributed Datasets [J]. Journal of Computer Research and Development, 2007, 44(9): 1612-1617.
Authors: Ni Weiwei  Chen Geng  Sun Zhihui
Affiliation: 1. School of Computer Science and Engineering, Southeast University, Nanjing 210096; 2. Laboratory of Audit Information Engineering, Nanjing Audit University, Nanjing 210029
Abstract: Clustering is an important research topic in data mining. Clustering massive datasets is especially challenging because of their scale and the large number of noise points. Distributed clustering is an effective way to address these problems, but most existing distributed clustering research targets horizontally partitioned datasets. This paper considers vertically partitioned distributed datasets. Based on an analysis of the relations between the local noise sets and the corresponding global noise set, the global noise is filtered, which removes the negative effect of noise data and reduces the amount of data to be processed on the center node. Furthermore, a storage structure, the CTL (closed triangle list), is designed to store the intermediate clustering results of each node. It reduces communication costs among the distributed nodes during clustering and makes it convenient to generate the global clustering model, with high space utilization and complete clustering information. On this basis, a distributed density-based clustering algorithm, DDBSCAN, is proposed. Theoretical analysis and experimental results show that DDBSCAN can effectively and efficiently cluster massive vertically partitioned datasets.
Keywords: distributed data mining  vertically partitioned data  connected set  local noise set  closed triangle list
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.