首页 | 本学科首页   官方微博 | 高级检索  
     


Beyond topical similarity: a structural similarity measure for retrieving highly similar documents
Authors:Xiaojun Wan
Affiliation:(1) Institute of Computer Science and Technology, Peking University, Beijing, 100871, China
Abstract:Accurately measuring document similarity is important for many text applications, e.g. document similarity search, document recommendation, etc. Most traditional similarity measures are based only on “bag of words” of documents and can well evaluate document topical similarity. In this paper, we propose the notion of document structural similarity, which is expected to further evaluate document similarity by comparing document subtopic structures. Three related factors (i.e. the optimal matching factor, the text order factor and the disturbing factor) are proposed and combined to evaluate document structural similarity, among which the optimal matching factor plays the key role and the other two factors rely on its results. The experimental results demonstrate the high performance of the optimal matching factor for evaluating document topical similarity, which is as well as or better than most popular measures. The user study shows the good ability of the proposed overall measure with all three factors to further find highly similar documents from those topically similar documents, which is much better than that of the popular measures and other baseline structural similarity measures. Xiaojun Wan received a B.Sc. degree in information science, a M.Sc. degree in computer science and a Ph.D. degree in computer science from Peking University, Beijing, China, in 2000, 2003 and 2006, respectively. He is currently a lecturer at Institute of Computer Science and Technology of Peking University. His research interests include information retrieval and natural language processing.
Keywords:Document structural similarity  Similarity measure  Subtopic structure  TextTiling  Optimal matching  Text order
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号