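Research on Text Snippet Mechanism in Document Retrieval
Cite this article:LI Yu, LIU Bo. Research on Text Snippet Mechanism in Document Retrieval[J]. Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2020, 14(4): 578-589.
Authors:LI Yu  LIU Bo
Affiliation:Department of Computer Science, College of Information Science and Technology, Jinan University, Guangzhou 510632, China (both authors)
Funding:Science and Technology Planning Project of Guangzhou, Grant No. 201604010037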
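Abstract:Document retrieval is an active research topic in natural language processing. Compared with short texts, documents are information-rich and lengthy. In long-text retrieval, the query is often not relevant to every sentence of the document, and a few highly similar snippets may interfere strongly, so the relevance score between a query and a document cannot be computed simply from the similarity between words or strings. A text snippet mechanism (TSM) is proposed for document retrieval. Each candidate document is first divided into snippets, and the relevance between the query and each snippet is then computed with a matching scheme that considers factors such as semantics and word frequency. The key text snippets are selected and the relevant snippet ratio is obtained; this snippet information is combined into a relevance score between the query and the document, from which the Top-K document set is obtained. Experimental results on the Glasgow information retrieval test collection show that text matching with the text snippet mechanism can improve information retrieval performance.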

Keywords:text snippet mechanism  document retrieval  relevance scoring  relevant snippet ratio  snippet integration calculation

Research on Text Snippet Mechanism in Document Retrieval
LI Yu, LIU Bo. Research on Text Snippet Mechanism in Document Retrieval[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(4): 578-589.
Authors:LI Yu  LIU Bo
Affiliation:College of Information Science and Technology, Jinan University, Guangzhou 510632, China
Abstract:Document retrieval is an active research topic in natural language processing. Compared with short texts, documents are information-rich and lengthy. In long-text retrieval, a query is often not relevant to every sentence in the document, and a few highly similar snippets can interfere strongly with matching. The relevance score between a query and a document therefore cannot be computed simply from the similarity between words or strings. A text snippet mechanism (TSM) is proposed for document retrieval. TSM first divides each candidate document into snippets and then computes the relevance between the query and each snippet, using a matching scheme that takes semantics and word frequency into account. TSM selects the key text snippets, obtains the relevant snippet ratio, and combines this snippet information into a relevance score between the query and the document, from which the Top-K document set is obtained. Experimental results on the Glasgow IR test collection show that TSM can improve information retrieval performance.
Keywords:text snippet mechanism  document retrieval  correlation calculation  relevant snippet ratio  snippet integration score
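
The pipeline outlined in the abstract (split into snippets, score each snippet against the query, keep the key snippets, compute the relevant snippet ratio, combine into a document score, take the Top-K) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the fixed-size token windows, the term-count cosine used in place of the paper's semantic and word-frequency matching, the `threshold` and `alpha` parameters, and the way the best snippet score and the ratio are combined.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens; a stand-in for real preprocessing."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_snippets(text, size=20):
    """Cut a document into consecutive windows of `size` tokens (assumed scheme)."""
    tokens = tokenize(text)
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def snippet_relevance(query_tokens, snippet_tokens):
    """Cosine similarity of term-count vectors.

    Placeholder only: the abstract states that the real measure also
    considers semantics, which this function does not model.
    """
    q, s = Counter(query_tokens), Counter(snippet_tokens)
    dot = sum(q[t] * s[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in s.values())))
    return dot / norm if norm else 0.0


def score_document(query, text, size=20, threshold=0.1, alpha=0.5):
    """Combine key-snippet relevance with the relevant snippet ratio.

    The weighted sum below is a hypothetical combination rule.
    """
    q = tokenize(query)
    snippets = split_into_snippets(text, size)
    if not snippets:
        return 0.0
    scores = [snippet_relevance(q, s) for s in snippets]
    relevant = [x for x in scores if x >= threshold]  # key snippets
    if not relevant:
        return 0.0
    ratio = len(relevant) / len(snippets)  # relevant snippet ratio
    return alpha * max(relevant) + (1 - alpha) * ratio


def top_k(query, docs, k=3):
    """Return the ids of the k highest-scoring documents."""
    ranked = sorted(docs, key=lambda d: score_document(query, docs[d]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    docs = {
        "a": "text snippet retrieval ranks documents by snippet relevance",
        "b": "cooking pasta requires boiling water and salt",
    }
    print(top_k("snippet retrieval", docs, k=1))
```

Scoring per snippet rather than per document means a long document with one strongly matching passage is not drowned out by its unrelated text, while the ratio term rewards documents in which many snippets are relevant.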
This article is indexed in VIP, Wanfang Data, and other databases.