A New Retrieval Model Based on TextTiling for Document Similarity Search期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

A New Retrieval Model Based on TextTiling for Document Similarity Search

Authors:	Xiao-Jun?Wan Email author" target="_blank">Yu-Xin?Peng Email author

Affiliation:	(1) National Key Laboratory of Text Processing Technology, Institute of Computer Science and Technology, Peking University, Beijing, 100871, P.R. China

Abstract:	Document similarity search is to find documents similar to a given query document and return a ranked list of similar documents to users, which is widely used in many text and web systems, such as digital library, search engine, etc. Traditional retrieval models, including the Okapi's BM25 model and the Smart's vector space model with length normalization, could handle this problem to some extent by taking the query document as a long query. In practice, the Cosine measure is considered as the best model for document similarity search because of its good ability to measure similarity between two documents. In this paper, the quantitative performances of the above models are compared using experiments. Because the Cosine measure is not able to reflect the structural similarity between documents, a new retrieval model based on TextTiling is proposed in the paper. The proposed model takes into account the subtopic structures of documents. It first splits the documents into text segments with TextTiling and calculates the similarities for different pairs of text segments in the documents. Lastly the overall similarity between the documents is returned by combining the similarities of different pairs of text segments with optimal matching method. Experiments are performed and results show: 1) the popular retrieval models (the Okapi's BM25 model and the Smart's vector space model with length normalization) do not perform well for document similarity search; 2) the proposed model based on TextTiling is effective and outperforms other models, including the Cosine measure; 3) the methods for the three components in the proposed model are validated to be appropriately employed.

Keywords:	document similarity search retrieval model similarity measure TextTiling optimal matching
本文献已被 CNKI 维普万方数据 SpringerLink 等数据库收录！
	点击此处可从《计算机科学技术学报》浏览原始摘要信息
	点击此处可从《计算机科学技术学报》下载全文

设为首页 | 免责声明 | 关于勤云 | 加入收藏