Beyond topical similarity: a structural similarity measure for retrieving highly similar documents |
| |
Authors: | Xiaojun Wan |
| |
Affiliation: | (1) Institute of Computer Science and Technology, Peking University, Beijing, 100871, China |
| |
Abstract: | Accurately measuring document similarity is important for many text applications, e.g. document similarity search and document recommendation. Most traditional similarity measures rely only on the “bag of words” representation of documents and evaluate topical similarity well. In this paper, we propose the notion of document structural similarity, which further evaluates document similarity by comparing document subtopic structures. Three related factors (the optimal matching factor, the text order factor and the disturbing factor) are proposed and combined to evaluate document structural similarity; the optimal matching factor plays the key role, and the other two factors rely on its results. Experimental results show that the optimal matching factor alone evaluates document topical similarity as well as or better than most popular measures. A user study shows that the proposed overall measure, with all three factors, is good at picking out highly similar documents from among topically similar ones, and does so much better than the popular measures and other baseline structural similarity measures. Xiaojun Wan received a B.Sc. degree in information science, an M.Sc. degree in computer science and a Ph.D. degree in computer science from Peking University, Beijing, China, in 2000, 2003 and 2006, respectively. He is currently a lecturer at the Institute of Computer Science and Technology of Peking University. His research interests include information retrieval and natural language processing. |
| |
Keywords: | Document structural similarity; Similarity measure; Subtopic structure; TextTiling; Optimal matching; Text order |
This article is indexed in SpringerLink and other databases. |
|