
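A VSM-Based Method for Detecting Approximately Duplicate Records
Cite this article: ZHANG Chang-nian. A VSM-Based Method for Detecting Approximately Duplicate Records [J]. Microelectronics & Computer, 2008, 25(8).
Author: ZHANG Chang-nian
Affiliation: School of Information Engineering, University of Science and Technology Beijing, Beijing 100083; Guilin College of Aerospace Technology, Guilin, Guangxi 541004
Funding: Beijing Natural Science Foundation project
Abstract: Approximately duplicate records are one of the key problems affecting data quality in data integration systems. To improve detection precision and efficiency, this paper combines and refines several existing methods: (1) When comparing fields, strings are compared character by character according to the case at hand, so the algorithm adapts to different language environments and is broadly applicable. (2) When comparing records, different fields are assigned different weights, and a vector distance algorithm based on the vector space model (VSM) is used, which improves the precision of approximately duplicate record detection. (3) During clustering, a priority queue strategy is used, which reduces the number of record comparisons and improves detection efficiency. Theoretical analysis and experiments show that the proposed detection method is effective.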
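Points (1) and (2) of the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: character-level edit distance stands in for the field comparison, and a weighted Euclidean distance over per-field dissimilarities stands in for the VSM-based vector distance. The names `edit_distance`, `field_similarity`, and `record_similarity` are hypothetical.

```python
from math import sqrt
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed character by character.

    Python strings are sequences of Unicode code points, so the same loop
    handles Latin and Chinese text alike.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Map edit distance to a similarity in [0, 1] (assumed normalization)."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def record_similarity(r1: Sequence[str], r2: Sequence[str],
                      weights: Sequence[float]) -> float:
    """Weighted vector distance between two records, returned as a similarity.

    Each field contributes a dissimilarity (1 - field_similarity); the
    dissimilarities form a vector whose weighted Euclidean norm, scaled by
    the total weight, is subtracted from 1. This is an assumed stand-in,
    not the paper's formula.
    """
    if not (len(r1) == len(r2) == len(weights)):
        raise ValueError("records and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    squared = sum(w * (1.0 - field_similarity(x, y)) ** 2
                  for x, y, w in zip(r1, r2, weights))
    return 1.0 - sqrt(squared / total)
```

Giving a higher weight to discriminating fields such as a name, and a lower weight to noisy fields such as a free-text address, lets a mismatch in the important field dominate the record score.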

Keywords: vector space model; clustering; approximately duplicate records; weight; priority queue

Approach for Detecting Approximately Duplicate Records Based on VSM
ZHANG Chang-nian. Approach for Detecting Approximately Duplicate Records Based on VSM [J]. Microelectronics & Computer, 2008, 25(8).
Authors:ZHANG Chang-nian
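Affiliation:1. School of Information Engineering, University of Science and Technology Beijing, Beijing 100083; 2. Guilin College of Aerospace Technology, Guilin, Guangxi 541004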
Abstract: Approximately duplicate records are one of the key problems affecting data quality in data integration. This article presents a combined approach for detecting approximately duplicate records. It has three distinctive features: (1) To compare the similarity of two fields, a general-purpose string comparison algorithm is proposed that works in multi-language environments. (2) To improve detection precision, each field of a record is assigned an appropriate weight, and a VSM-based algorithm is adopted. (3) An algorithm based on a priority queue is proposed. It scans all sorted records sequentially and clusters approximately duplicate records together by comparing the current record with the records in the priority queue, which improves detection efficiency. The effectiveness of the proposed approach is verified through analysis and experiment.
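The priority-queue scan in feature (3) can be sketched as follows. The abstract does not specify the queue policy, so this is only one plausible reading: a bounded queue of recent clusters, where a matched cluster moves to the front and the cluster at the back is dropped from the queue when it overflows. The function name `cluster_sorted_records`, the parameters `threshold` and `queue_size`, and the use of `difflib.SequenceMatcher` as the similarity in the usage example are all assumptions for illustration.

```python
from collections import deque
from difflib import SequenceMatcher
from typing import Callable, Deque, Iterable, List


def cluster_sorted_records(records: Iterable[str],
                           similarity: Callable[[str, str], float],
                           threshold: float = 0.8,
                           queue_size: int = 4) -> List[List[str]]:
    """Cluster pre-sorted records in a single sequential scan.

    Each record is compared only with the representative (first member) of
    the clusters currently held in the queue, not with every earlier record.
    """
    queue: Deque[List[str]] = deque()  # front = highest priority
    clusters: List[List[str]] = []
    for rec in records:
        # Find the first queued cluster whose representative is similar enough.
        hit = next((i for i, c in enumerate(queue)
                    if similarity(rec, c[0]) >= threshold), None)
        if hit is not None:
            cluster = queue[hit]
            del queue[hit]
            cluster.append(rec)
            queue.appendleft(cluster)  # a matched cluster regains top priority
        else:
            cluster = [rec]
            clusters.append(cluster)
            queue.appendleft(cluster)
            if len(queue) > queue_size:
                queue.pop()  # the lowest-priority cluster leaves the queue
    return clusters


if __name__ == "__main__":
    def ratio(a: str, b: str) -> float:
        # Stand-in similarity from the standard library.
        return SequenceMatcher(None, a, b).ratio()

    names = sorted(["John Smith", "Jon Smith", "Mary Jones",
                    "Mary Jonas", "Zed Quinn"])
    for group in cluster_sorted_records(names, ratio):
        print(group)
```

With n records and a queue of size k, the scan performs at most n·k similarity computations instead of the roughly n²/2 needed for all pairs. The cost is that a record can no longer join a cluster once that cluster has been dropped from the queue, which is why the records are sorted first.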
Keywords:VSM  clustering  approximately duplicate records  weight  priority queue
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.