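Detection of approximately duplicated records based on entropy feature selection grouping clustering (基于熵特征优选分组聚类的相似重复记录检测)
Cite this article: ZHANG Ping, DANG Xuan-ju, CHEN Hao, YANG Wen-lei. Detection of approximately duplicated records based on entropy feature selection grouping clustering[J]. Transducer and Microsystem Technology (传感器与微系统), 2011, 30(11): 135-137,141
Authors: ZHANG Ping (张平)  DANG Xuan-ju (党选举)  CHEN Hao (陈皓)  YANG Wen-lei (杨文雷)
Affiliations: 1. School of Electronic Engineering and Automation, Guilin University of Electronic Technology, Guilin, Guangxi 541004
2. School of Computer Science and Engineering, Guilin University of Electronic Technology, Guilin, Guangxi 541004
Funding: National Natural Science Foundation of China (60964001); Guangxi Natural Science Foundation (09910192); Director's Fund of the Guangxi Information and Communication Laboratory (01902)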
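Abstract: Existing methods for detecting approximately duplicated records cannot handle large data volumes effectively. To address this, an algorithm based on entropy-driven feature selection with grouped clustering is proposed. The method constructs an entropy measure based on the similarity between objects, uses it to assess the importance of each attribute in the original data set, and selects a key attribute set. The data are then partitioned into disjoint small data sets according to the key attributes, and the DBSCAN clustering algorithm is applied within each small data set to detect approximately duplicated records. Theoretical analysis and experimental results show that...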
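
The pipeline outlined in the abstract (entropy-based attribute ranking, partitioning by a key attribute, DBSCAN inside each partition) could be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. The abstract does not give the entropy formula, so the pairwise form -Σ(s·log s + (1-s)·log(1-s)) and the "remove one attribute and recompute" ranking are assumptions. The string similarity from `difflib`, the exact-match grouping rule, the use of a single key attribute, the `eps` and `min_pts` defaults, and all function names are also hypothetical choices made for the example.

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher


def field_sim(a, b):
    """Case-insensitive string similarity in [0, 1] for one attribute (assumed measure)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_sim(r, s, attrs):
    """Mean per-attribute similarity over the given attribute indices."""
    return sum(field_sim(r[k], s[k]) for k in attrs) / len(attrs)


def entropy(records, attrs):
    """Assumed similarity-based entropy; pairs with s = 0 or s = 1 contribute zero."""
    total = 0.0
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            s = record_sim(records[i], records[j], attrs)
            if 0.0 < s < 1.0:
                total -= s * math.log(s) + (1 - s) * math.log(1 - s)
    return total


def rank_attributes(records):
    """Rank attributes by the entropy that remains after removing each one.

    Assumed criterion: higher remaining entropy means the attribute mattered more.
    Needs at least two attributes per record.
    """
    n_attrs = len(records[0])
    scores = {}
    for k in range(n_attrs):
        rest = [a for a in range(n_attrs) if a != k]
        scores[k] = entropy(records, rest)
    return sorted(scores, key=scores.get, reverse=True)


def group_by_key(records, key):
    """Split record indices into disjoint groups sharing the normalized key value."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec[key].strip().lower()].append(idx)
    return list(groups.values())


def dbscan(ids, dist, eps, min_pts):
    """Plain DBSCAN over item ids; returns {id: cluster label}, with -1 for noise."""
    labels = {}
    cluster = 0
    for p in ids:
        if p in labels:
            continue
        neighbors = [q for q in ids if dist(p, q) <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1
            continue
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) == -1:
                labels[q] = cluster  # noise point reached from a core point becomes a border point
            if q in labels:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in ids if dist(q, r) <= eps]
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors)
        cluster += 1
    return labels


def detect_duplicates(records, eps=0.2, min_pts=2):
    """Return lists of record indices that DBSCAN clusters together within each group."""
    key = rank_attributes(records)[0]  # top-ranked attribute used as the grouping key
    all_attrs = list(range(len(records[0])))

    def dist(i, j):
        return 1.0 - record_sim(records[i], records[j], all_attrs)

    result = []
    for group in group_by_key(records, key):
        clusters = defaultdict(list)
        for idx, label in dbscan(group, dist, eps, min_pts).items():
            if label != -1:
                clusters[label].append(idx)
        result.extend(sorted(c) for c in clusters.values())
    return result
```

Calling `detect_duplicates` on a list of equal-length string tuples returns one index list per suspected duplicate cluster. Because pairwise distances are only computed inside each group, the quadratic cost applies to the small partitions rather than to the whole data set, which is the motivation the abstract gives for grouping. Exact-match grouping will separate records whose key values differ even slightly, so a real system would likely need a more tolerant partitioning rule.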

Keywords: approximately duplicated records    feature selection grouping clustering

Detection of approximately duplicated records based on entropy feature selection grouping clustering
ZHANG Ping, DANG Xuan-ju, CHEN Hao, YANG Wen-lei. Detection of approximately duplicated records based on entropy feature selection grouping clustering[J]. Transducer and Microsystem Technology, 2011, 30(11): 135-137,141
Authors: ZHANG Ping (1)  DANG Xuan-ju (1)  CHEN Hao (2)  YANG Wen-lei (2)
Affiliation: 1. School of Electronic Engineering and Automation, Guilin University of Electronic Technology, Guilin 541004, China; 2. School of Computer Science and Engineering, Guilin University of Electronic Technology, Guilin 541004, China
Abstract: Current methods cannot effectively detect approximately duplicated records in massive data. To address this, an algorithm based on entropy feature selection grouping clustering (FSGC) is proposed. The basic idea is to construct an entropy metric based on the similarity between objects, use it to evaluate the importance of each attribute, and obtain a key attribute subset. The data set is then split into small data sets according to the key attributes, and the approximately duplicated records are identifi...
Keywords: approximately duplicated records  entropy  feature selection grouping clustering (FSGC)
This article is indexed in CNKI, Wanfang Data (万方数据), and other databases.