
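A study of a large-scale document deduplication algorithm based on language cadence
Cite this article: CHEN Fan, FENG Zhiyong, LI Xiaohong, ZHAO Geng. A study of a large-scale document deduplication algorithm based on language cadence[J]. Computer Engineering and Applications, 2011, 47(11): 15-18.
Authors: CHEN Fan  FENG Zhiyong  LI Xiaohong  ZHAO Geng
Affiliations: 1. School of Computer Science and Software, Tianjin University, Tianjin 300072; Department of Information Science and Technology, College of Science and Technology, Tianjin University of Finance and Economics, Tianjin 300200
2. School of Computer Science and Software, Tianjin University, Tianjin 300072
3. Hebei University of Technology, Tianjin 300130
Funding: General Program under the Major Research Plan of the National Natural Science Foundation of China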
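Abstract: A study of large-scale document collections on the Web shows that the natural paragraphs in a document have a distinctive language cadence. This paper proposes a document duplicate detection method based on language cadence: it builds a language cadence code for each natural paragraph of a document and analyzes those codes for duplication, which yields duplicate detection at paragraph granularity. Experiments show that the method has good recall and precision. It detects documents whose content is fully duplicated, documents in which only some paragraphs are duplicated, and documents reassembled with their paragraphs in a different order, with high detection accuracy and low system resource usage.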

Keywords: document duplicate detection  language cadence  punctuation

Study on large scale duplicated text deletion algorithm based on language cadence
CHEN Fan,FENG Zhiyong,LI Xiaohong,ZHAO Geng.Study on large scale duplicated text deletion algorithm based on language cadence[J].Computer Engineering and Applications,2011,47(11):15-18.
Authors:CHEN Fan  FENG Zhiyong  LI Xiaohong  ZHAO Geng
Affiliation:1.School of Computer Science and Technology,Tianjin University,Tianjin 300072,China 2.Dept. of Info. Science & Technology,College of Science,Tianjin University of Finance and Economics,Tianjin 300200,China 3.Hebei University of Technology,Tianjin 300130,China
Abstract: A study of large-scale text on the Web finds that language cadence can mark a text uniquely. A large-scale duplicated text detection algorithm based on language cadence is proposed here. It has higher precision and efficiency than algorithms based on semantics and text structure. Punctuation marks the basic language cadence of each text. This cadence can be captured to create a language cadence code for every paragraph in a text, so that duplicates can be detected quickly and easily. The experimental results show that the algorithm has good recall and precision in duplicated paragraph detection. It can find duplicated content at the level of both the page and the paragraph, so it can also detect duplicated content whose paragraphs appear in a different order.
Keywords:duplicated text detection  language cadence  punctuation
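
The abstract does not give the exact form of the language cadence code, so the following Python sketch is only one plausible reading of the idea, not the authors' algorithm. It assumes that a paragraph's cadence can be approximated by the length of each run of text between punctuation marks together with the mark that ends it, and that two paragraphs count as duplicates when their codes match exactly. The punctuation set `PUNCT` and the names `cadence_code` and `duplicate_ratio` are hypothetical.

```python
import re

# Assumption: a small set of Chinese and ASCII punctuation marks defines the cadence.
PUNCT = ",。;:!?、,.;:!?"
_SPLIT = re.compile("([" + re.escape(PUNCT) + "])")


def cadence_code(paragraph: str) -> str:
    """Encode a paragraph as 'segment length + closing punctuation' tokens.

    This encoding is an illustrative assumption; the paper's abstract only
    says that punctuation marks the basic cadence of a paragraph.
    """
    # With a capturing group, re.split returns text, mark, text, mark, ..., text.
    parts = _SPLIT.split(paragraph.strip())
    tokens = []
    for i in range(0, len(parts) - 1, 2):
        text, mark = parts[i], parts[i + 1]
        tokens.append(f"{len(text.strip())}{mark}")
    tail = parts[-1].strip()
    if tail:
        # Trailing text with no closing punctuation mark.
        tokens.append(str(len(tail)))
    return "|".join(tokens)


def duplicate_ratio(doc_a: str, doc_b: str) -> float:
    """Fraction of doc_a's paragraphs whose cadence code also occurs in doc_b.

    Paragraphs are assumed to be separated by newlines. Matching on a set of
    codes makes the result independent of paragraph order.
    """
    paras_a = [p for p in doc_a.split("\n") if p.strip()]
    codes_b = {cadence_code(p) for p in doc_b.split("\n") if p.strip()}
    if not paras_a:
        return 0.0
    hits = sum(1 for p in paras_a if cadence_code(p) in codes_b)
    return hits / len(paras_a)


if __name__ == "__main__":
    a = "First one, then two. Done!\nShort: yes?"
    b = "Short: yes?\nFirst one, then two. Done!\nAn extra paragraph, not shared."
    print(cadence_code("First one, then two. Done!"))
    print(duplicate_ratio(a, b), duplicate_ratio(b, a))
```

Because the comparison works on a set of per-paragraph codes, a document whose paragraphs have been reordered still scores as fully duplicated, which matches the behavior the abstract describes. A code this coarse can collide for short paragraphs with little punctuation, so a practical system would likely need a minimum code length or a secondary check.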
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.