

A topic-based term frequency normalization framework to enhance probabilistic information retrieval
Authors: Fanghong Jian, Jimmy X Huang, Jiashu Zhao, Zhiwei Ying, Yuqi Wang
Affiliation: 1. National Engineering Research Center for E-Learning, Central China Normal University, Wuhan, China; School of Science, Jiujiang University, Jiujiang, China; 2. School of Information Technology, York University, Toronto, Canada; 3. Department of Physics and Computer Science, Wilfrid Laurier University, Waterloo, Canada

Abstract: Many well-known probabilistic information retrieval models, especially BM25, have shown promise for document ranking. Nevertheless, the control parameters in BM25 usually need to be adjusted to achieve improved performance on different data sets; additionally, BM25's bag-of-words assumption prevents it from directly using the rich information that lies at the sentence or document level. Motivated by these two challenges, we first propose a new normalization method for the term frequency in BM25 (called BM25QL in this paper); in addition, the method is incorporated into CRTER2, a recent BM25-based model, to construct CRTER2QL. Then, we incorporate topic modeling and word embedding into BM25 to relax the bag-of-words assumption. In this direction, we propose a topic-based retrieval model, TopTF, for BM25, which is then further incorporated into the language model (LM) and the multiple aspect term frequency (MATF) model. Furthermore, an enhanced topic-based term frequency normalization framework, ETopTF, based on embedding is presented. Experimental studies demonstrate the effectiveness of these methods. Specifically, on all tested data sets and in terms of the mean average precision (MAP), our proposed models, BM25QL and CRTER2QL, are comparable to BM25 and CRTER2 with the best b parameter value; the TopTF models significantly outperform the baselines, and the ETopTF models further improve on the TopTF models in terms of the MAP.
Keywords: Dirichlet language model, embedding, LDA, probabilistic model, term frequency normalization, topic modeling
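
For readers unfamiliar with the b parameter mentioned in the abstract, the following is a minimal sketch of the term frequency component of standard (textbook) BM25, where b controls document length normalization and k1 controls saturation. It is not the BM25QL, TopTF, or ETopTF normalization proposed in the paper, whose formulas are not given in this record. The function name `bm25_tf` and the default values k1 = 1.2 and b = 0.75 are illustrative assumptions.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term frequency component of standard BM25 (illustrative sketch).

    tf          -- raw count of the term in the document
    doc_len     -- length of the document (in tokens)
    avg_doc_len -- average document length in the collection
    k1          -- saturation parameter (assumed default, often tuned)
    b           -- length normalization parameter in [0, 1] (assumed default)
    """
    # b = 0 disables length normalization; b = 1 applies it fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturating function of tf: grows with tf but is bounded by k1 + 1.
    return (k1 + 1.0) * tf / (k1 * length_norm + tf)


# The same raw count contributes less in a longer-than-average document.
short_doc = bm25_tf(tf=3, doc_len=50, avg_doc_len=100)
long_doc = bm25_tf(tf=3, doc_len=400, avg_doc_len=100)
```

Because the best value of b varies across collections, it is usually tuned per data set, which is the sensitivity the abstract refers to when it compares BM25QL and CRTER2QL against BM25 and CRTER2 "with the best b parameter value".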