Classification Detection of Approximately Duplicate Records Based on Feature Selection Using Ant Colony Algorithm
Cite this article: CAO Jian-jun, DIAO Xing-chun, DU Yi, WANG Fang-xiao, ZHANG Xiao-yi. Classification Detection of Approximately Duplicate Records Based on Feature Selection Using Ant Colony Algorithm[J]. Acta Armamentarii, 2010, 31(9): 1222-1227.
Authors: CAO Jian-jun  DIAO Xing-chun  DU Yi  WANG Fang-xiao  ZHANG Xiao-yi
Affiliation: 1. The 63rd Research Institute of the PLA General Staff Headquarters, Nanjing 210007, Jiangsu, China; 2. Network Management Center, China Electronic System Engineering Company, Beijing 100036, China
Funding: China Postdoctoral Science Foundation; Jiangsu Province Postdoctoral Research Funding Program
Abstract: To detect approximately duplicate records, a classification method based on feature selection with the ant colony algorithm is proposed. Detection of approximately duplicate records is treated as a two-class classification problem. Similarity features and normalization algorithms are defined for three typical attribute types: character string, enumeration, and date. The similarity feature vector of a pair of records serves as the classifier input for detection. A multi-objective feature selection model is established that jointly optimizes recall, precision, and feature set size. Based on the characteristics of the problem, the multi-objective model is converted into a single-objective model, and a solving algorithm is designed using the ant colony algorithm. Finally, the effectiveness of the method is verified with two classifiers: a Euclidean distance classifier and a support vector machine.

Keywords: information processing technique  data cleaning  approximately duplicate record  ant colony algorithm  feature selection  support vector machine
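
The abstract names the ingredients of the method but gives no formulas, so the following Python sketch is only an illustration under stated assumptions, not the authors' definitions. It assumes edit-distance normalization for string attributes, exact match for enumeration attributes, a capped day gap for date attributes, and a weighted sum as the single-objective form of the recall, precision, and feature-set-size criteria. The field names `name`, `gender`, and `birth`, the `max_days` cap, and the weights are hypothetical. The ant colony search over feature subsets is not shown.

```python
from datetime import date


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - edit distance / longer length, in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def enum_similarity(a, b) -> float:
    """Assumed rule for enumeration attributes: 1 on exact match, else 0."""
    return 1.0 if a == b else 0.0


def date_similarity(a: date, b: date, max_days: int = 365) -> float:
    """Assumed rule: day gap scaled by a hypothetical cap, clipped to [0, 1]."""
    gap = abs((a - b).days)
    return max(0.0, 1.0 - gap / max_days)


def similarity_vector(rec_a: dict, rec_b: dict) -> list:
    """Similarity feature vector of a record pair (hypothetical field names)."""
    return [
        string_similarity(rec_a["name"], rec_b["name"]),
        enum_similarity(rec_a["gender"], rec_b["gender"]),
        date_similarity(rec_a["birth"], rec_b["birth"]),
    ]


def fitness(recall, precision, n_selected, n_total, weights=(0.45, 0.45, 0.10)):
    """Assumed scalarization: weighted sum that rewards smaller feature subsets."""
    w_r, w_p, w_s = weights
    return w_r * recall + w_p * precision + w_s * (1.0 - n_selected / n_total)
```

Under these assumptions, a classifier such as a Euclidean distance rule or a support vector machine would take the output of `similarity_vector` restricted to the selected features, and `fitness` would score each candidate feature subset during the search.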
This article is indexed in CNKI, Wanfang Data, and other databases.