Similar Documents
20 similar documents found.
1.
Many well-known probabilistic information retrieval models have shown promise for document ranking, especially BM25. Nevertheless, the control parameters in BM25 usually need to be adjusted to achieve good performance on different data sets, and its bag-of-words assumption prevents it from directly exploiting rich information at the sentence or document level. Motivated by these challenges, we first propose a new normalization method for the term frequency in BM25 (called BM25QL in this paper); the method is also incorporated into CRTER2, a recent BM25-based model, to construct CRTER2QL. We then incorporate topic modeling and word embedding into BM25 to relax the bag-of-words assumption. In this direction, we propose a topic-based retrieval model, TopTF, for BM25, which is further incorporated into the language model (LM) and the multiple aspect term frequency (MATF) model. Furthermore, we present ETopTF, an enhanced embedding-based topic term frequency normalization framework. Experimental studies demonstrate the effectiveness of these methods. Specifically, on all tested data sets, our proposed BM25QL and CRTER2QL are comparable in mean average precision (MAP) to BM25 and CRTER2 with the best b parameter value; the TopTF models significantly outperform the baselines, and the ETopTF models further improve on TopTF in terms of MAP.
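For reference, a minimal sketch of standard Okapi BM25 scoring, showing the role of the b length-normalization parameter discussed above; the toy corpus and parameter values are illustrative assumptions, and the paper's BM25QL normalization itself is not reproduced here:

```python
import math
from collections import Counter

# Toy corpus; in practice these statistics come from an inverted index.
docs = [
    "information retrieval with probabilistic models".split(),
    "bm25 is a probabilistic retrieval model".split(),
    "topic models and word embeddings for retrieval".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))

def bm25(query, doc, k1=1.2, b=0.75):
    """Standard Okapi BM25; b controls document-length normalization,
    the parameter the abstract's BM25QL variant aims to make robust."""
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if t not in tf:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

for i, d in enumerate(docs):
    print(i, round(bm25("probabilistic retrieval".split(), d), 3))
```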

2.
In traditional retrieval models, the matching between documents and queries is computed mainly from statistical features of terms, such as term frequency, inverse document frequency, and document length. Recent studies have shown that exploiting the positions at which query terms match within a document can improve retrieval accuracy, so how to better characterize and model the positional information of query terms in documents is one of the open problems in improving retrieval effectiveness. Building on the semantics-enhanced positional language model (SPLM), this paper further incorporates term proximity information, gives a smoothing strategy that computes proximity with a Dirichlet prior distribution, and proposes a proximity-enhanced positional language retrieval model. Experimental results on standard data sets show that the proposed retrieval model outperforms the semantics-enhanced positional language model.
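For orientation, a minimal sketch of query-likelihood scoring with Dirichlet-prior smoothing, the standard building block behind the smoothing strategy described above; the toy collection and the prior mu are illustrative assumptions, and SPLM's positional and proximity components are not reproduced:

```python
import math
from collections import Counter

docs = [
    "position language model retrieval".split(),
    "dirichlet smoothing for language models".split(),
]
coll = Counter(t for d in docs for t in d)  # background collection model
C = sum(coll.values())

def dirichlet_loglik(query, doc, mu=100):
    """Query likelihood with Dirichlet-prior smoothing:
    p(t|d) = (tf(t,d) + mu * p(t|C)) / (|d| + mu)."""
    tf = Counter(doc)
    score = 0.0
    for t in query:
        p_c = coll[t] / C
        if p_c == 0:
            continue  # out-of-collection terms skipped in this toy sketch
        score += math.log((tf[t] + mu * p_c) / (len(doc) + mu))
    return score

for i, d in enumerate(docs):
    print(i, round(dirichlet_loglik(["language", "model"], d), 3))
```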

3.
Traditional proximity retrieval models treat all query terms equally and consider the proximity of all query terms indiscriminately, producing a "parallel concept effect" that hurts the performance of proximity-based retrieval methods. This paper proposes a proximity retrieval method weighted by query-term similarity: proximity statistics between query terms are weighted by their semantic similarity, which allows the user's actual information need to be inferred more accurately and deeper information implicit in the query to be mined. Experimental results show that, in settings dominated by short queries, the method significantly improves the performance of traditional proximity retrieval models and effectively avoids the parallel concept effect of query-term proximity.
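A hedged sketch of the general idea: a pairwise proximity statistic (the minimum positional gap) is weighted by the semantic similarity of the query-term pair, so unrelated "parallel" pairs contribute little. The toy vectors and the 1/(1+distance) form are assumptions, not the paper's formulation:

```python
import math

# Toy word vectors; in practice these come from trained embeddings.
vec = {"neural": [0.9, 0.1], "network": [0.8, 0.2], "recipe": [0.1, 0.9]}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def min_pair_distance(doc, t1, t2):
    """Smallest positional gap between occurrences of t1 and t2."""
    p1 = [i for i, t in enumerate(doc) if t == t1]
    p2 = [i for i, t in enumerate(doc) if t == t2]
    if not p1 or not p2:
        return None
    return min(abs(i - j) for i in p1 for j in p2)

def weighted_proximity(doc, query):
    """Sum proximity scores over query-term pairs, weighted by semantic
    similarity so unrelated ('parallel') pairs contribute little."""
    s = 0.0
    for a in range(len(query)):
        for b in range(a + 1, len(query)):
            d = min_pair_distance(doc, query[a], query[b])
            if d is None:
                continue
            w = max(cos(vec[query[a]], vec[query[b]]), 0.0)
            s += w / (1.0 + d)
    return s

doc = "a neural network not a recipe for a network".split()
print(round(weighted_proximity(doc, ["neural", "network", "recipe"]), 3))
```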

4.
Expert finding is an active research topic in entity retrieval. To address two shortcomings of classical expert-finding models, namely the index-term independence assumption and low retrieval performance, this paper proposes an expert-finding method based on a Bayesian network model. The model uses a four-layer network structure that supports graphical probabilistic inference, and applies word-vector techniques to semantically expand query terms. Experimental results show that the model outperforms classical expert-finding models on several evaluation metrics, effectively expands query terms semantically, and improves expert retrieval performance.
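As an illustration of word-vector query expansion (one component of the method above), a minimal sketch that adds vocabulary terms whose embedding cosine similarity to a query term exceeds a threshold; the toy vectors, threshold, and cutoff are assumptions:

```python
import math

# Toy embedding table; in practice these come from trained word vectors.
vec = {
    "database": [0.9, 0.1, 0.2], "sql": [0.8, 0.2, 0.1],
    "index": [0.7, 0.3, 0.2],    "poetry": [0.1, 0.9, 0.4],
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def expand(query, k=1, threshold=0.8):
    """Semantic query expansion: for each query term, add up to k
    vocabulary terms whose cosine similarity exceeds the threshold."""
    expanded = list(query)
    for q in query:
        neighbors = sorted(
            (t for t in vec if t not in expanded
             and cos(vec[q], vec[t]) >= threshold),
            key=lambda t: -cos(vec[q], vec[t]))[:k]
        expanded.extend(neighbors)
    return expanded

print(expand(["database"]))  # e.g. ['database', 'sql']
```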

5.
Document similarity search finds documents similar to a given query document and returns a ranked list of them to users; it is widely used in many text and web systems, such as digital libraries and search engines. Traditional retrieval models, including Okapi's BM25 and SMART's vector space model with length normalization, can handle this problem to some extent by treating the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because of its good ability to measure the similarity between two documents. In this paper, the quantitative performance of the above models is compared experimentally. Because the Cosine measure cannot reflect structural similarity between documents, a new retrieval model based on TextTiling is proposed that takes the subtopic structure of documents into account. It first splits the documents into text segments with TextTiling and calculates similarities between pairs of text segments; the overall similarity between two documents is then obtained by combining the segment-pair similarities with the optimal matching method. Experimental results show that: 1) the popular retrieval models (Okapi's BM25 and SMART's vector space model with length normalization) do not perform well for document similarity search; 2) the proposed TextTiling-based model is effective and outperforms the other models, including the Cosine measure; and 3) the methods chosen for the three components of the proposed model are validated as appropriate.
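For reference, a minimal sketch of the Cosine baseline discussed above, computed between two bag-of-words vectors under raw term-frequency weighting; TextTiling segmentation and optimal matching are beyond this sketch:

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity between two token lists (bag-of-words, raw tf)."""
    v1, v2 = Counter(d1), Counter(d2)
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

a = "document similarity search with cosine".split()
b = "cosine similarity between two document vectors".split()
print(round(cosine(a, b), 3))
```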

6.
Most existing retrieval models ignore the fact that the proximity of matched query terms within a document, and the passage-level evidence used at scoring time, can also be exploited to improve document scoring. Motivated by this, a Chinese information retrieval system based on positional language models is proposed. It first defines the notion of a position propagation count and builds a separate language model for each position; it then introduces the KL-divergence retrieval model, combined with the positional language model, to score each position individually; finally, a multi-parameter scoring strategy produces the document's final score. The experiments also compare the retrieval effectiveness of two Chinese indexing methods, word-based and bigram-based, under the positional language model. Results on the standard NTCIR-5 and NTCIR-6 test collections show that the method significantly improves Chinese retrieval performance under both indexing schemes and outperforms the vector space model, the BM25 probabilistic model, and the statistical language model.
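A hedged sketch of the positional-language-model idea the abstract builds on: term counts are propagated to every position through a Gaussian kernel, each position is scored individually, and the best position scores the document. Query likelihood is used here in place of the full KL-divergence form (the two are rank-equivalent for an MLE query model), and the kernel, the prior mu, and the use of the document itself as the background model are simplifying assumptions:

```python
import math
from collections import Counter

def plm_score(query, doc, sigma=2.0, mu=10):
    """Positional LM sketch: each occurrence of a term propagates a
    Gaussian-kernel pseudo-count to position i, giving a per-position
    model p(t|d,i); the document score is the best position's score."""
    coll = Counter(doc)  # stand-in for a real background collection
    C = len(doc)
    best = -math.inf
    for i in range(len(doc)):
        c = Counter()
        for j, t in enumerate(doc):
            c[t] += math.exp(-((i - j) ** 2) / (2 * sigma ** 2))
        Z = sum(c.values())
        s = 0.0
        for t in query:
            p_bg = coll[t] / C
            if p_bg == 0:
                continue
            s += math.log((c[t] + mu * p_bg) / (Z + mu))
        best = max(best, s)
    return best

doc = "the positional model scores each position in the document".split()
print(round(plm_score(["positional", "model"], doc), 3))
```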

7.
This paper presents a new probabilistic model of information retrieval. Its most important modeling assumption is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well-known existing models of information retrieval, but is essential in statistical natural language processing. Advances already made in statistical natural language processing are used to formulate a probabilistic justification for tf×idf term weighting. The paper shows that the new probabilistic interpretation of tf×idf term weighting might lead to a better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination-level ranking. A pilot experiment on the TREC collection shows that the linguistically motivated weighting algorithm outperforms the popular BM25 weighting algorithm.
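For concreteness, one common tf×idf variant in a short sketch; the paper derives a probabilistic justification for tf×idf rather than prescribing this exact form, and the toy corpus is an assumption:

```python
import math
from collections import Counter

docs = [
    "term weighting for retrieval".split(),
    "statistical language processing".split(),
    "probabilistic retrieval models".split(),
]
N = len(docs)
df = Counter(t for d in docs for t in set(d))  # document frequencies

def tf_idf(term, doc):
    """Classic tf*idf weight: raw term frequency times log(N / df)."""
    if df[term] == 0:
        return 0.0
    return Counter(doc)[term] * math.log(N / df[term])

print(round(tf_idf("retrieval", docs[0]), 3))
```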

8.
于莉 《软件》2011,(3):32-34
Information retrieval models address the problem of computing, for retrieval and ranking, how well information matches a user's query. Existing retrieval models include the Boolean model, the vector model, the probabilistic model, and variants of these three classic models. By analyzing and comparing the classic models, one can choose an appropriate retrieval model according to the characteristics of the material being searched when designing a concrete retrieval system, and thereby improve retrieval efficiency.

9.
Keyword search enables web users to easily access XML data without understanding complex data schemas. However, the inherent ambiguity of keyword search makes it difficult to select qualified relevant results matching the keywords. To solve this problem, researchers have put much effort into ranking models that distinguish relevant from irrelevant passages, such as the widely cited TF*IDF and BM25. However, these statistics-based ranking methods mostly consider term frequency, inverse document frequency, and length as ranking factors, ignoring the distribution of, and connections between, different keywords. Hence, these widely used ranking methods cannot recognize irrelevant results that happen to have high term frequency, which limits their performance. In this paper, a new search system, XDist, is proposed to address these problems. XDist first uses the semantic query model maximal lowest common ancestor (MAXLCA) to identify the returned results of a given query, and these candidate results are then ranked by BM25. In particular, XDist re-ranks the top results by a combined distribution measurement (CDM) that considers four criteria: term proximity, intersection of keyword classes, degree of integration among keywords, and quantity variance of keywords. The weights of the four measures in CDM are trained by a listwise learning-to-optimize method. Experimental results on the INEX evaluation platform show that the re-ranking method CDM improves the performance of the BM25 baseline by 22% under iP[0.01] and 18% under MAiP, and that the semantic model MAXLCA and the search engine XDist perform best in their respective fields.

10.
The radial basis function (RBF) centers play different roles in determining the classification capability of a Gaussian radial basis function neural network (GRBFNN) and should therefore hold different width values. However, it is very hard and time-consuming to optimize the centers and widths at the same time. In this paper, we introduce a new insight into this problem. We explore the impact of the definition of the widths on the selection of the centers, propose an algorithm that optimizes the RBF widths in order to select proper centers from the center candidate pool, and improve the classification performance of the GRBFNN. The objective function of the optimization algorithm is designed around the local mapping capability of each Gaussian RBF. The design also handles the imbalance problem that may occur even when different local regions contain the same number of examples. Finally, the recursive orthogonal least squares (ROLS) method and a genetic algorithm (GA), which are usually adopted to optimize the RBF centers, are separately used to select the centers from the candidates with the initialized widths, in order to test the validity of the proposed width initialization strategy for center selection. Our experimental results show that, compared with the heuristic width-setting method, the width optimization strategy makes the selected centers more appropriate and improves the classification performance of the GRBFNN. Moreover, the GRBFNN constructed by our method attains better classification performance than the RBF LS-SVM, a state-of-the-art classifier.
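A minimal sketch of a Gaussian RBF hidden unit, showing how the width controls the size of the local region each center maps, which is why center selection depends on the widths; the centers, widths, and input are toy assumptions:

```python
import math

def gaussian_rbf(x, center, width):
    """Gaussian radial basis function: exp(-||x - c||^2 / (2 * sigma^2)).
    A larger width widens the local region the unit responds to."""
    sq = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq / (2.0 * width ** 2))

# A GRBFNN hidden layer yields one such activation per (center, width):
centers = [([0.0, 0.0], 1.0), ([1.0, 1.0], 0.5)]
x = [0.5, 0.2]
hidden = [round(gaussian_rbf(x, c, s), 4) for c, s in centers]
print(hidden)
```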

11.
张纯青  陈超  邵正荣  俞能海 《计算机仿真》2008,25(1):134-137,239
In information retrieval, similarity scoring models are an important research topic. The basic models are the Boolean model, the vector space model, and the probabilistic model. The latter two are used in many retrieval systems, but neither considers the role that the position of query terms within a document plays in similarity measurement. Some studies have considered information such as HTML tags, but their schemes for determining the weighting coefficients are not ideal. To address these problems, this paper proposes a similarity scoring model based on weighted term frequency (the Weighted Term Frequency Model, WTFM), whose weighting coefficients can be learned with a simulated annealing algorithm. Experimental results show that introducing the weighting coefficients improves the system's relevance scoring quality.
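A hedged sketch of simulated annealing as a generic way to learn weighting coefficients of the kind WTFM introduces; the toy loss, step size, and cooling schedule are illustrative assumptions:

```python
import math
import random

def anneal(loss, w0, steps=5000, t0=1.0, cooling=0.999, step=0.1):
    """Generic simulated annealing over a real-valued weight vector."""
    w = list(w0)
    cur = loss(w)
    best, best_loss = list(w), cur
    t = t0
    for _ in range(steps):
        cand = [x + random.gauss(0.0, step) for x in w]
        c_loss = loss(cand)
        # Accept improvements always; accept worse moves with
        # probability exp(-delta / temperature).
        if c_loss < cur or random.random() < math.exp(-(c_loss - cur) / t):
            w, cur = cand, c_loss
            if cur < best_loss:
                best, best_loss = list(w), cur
        t *= cooling
    return best

# Toy objective: recover the hidden weights [0.7, 0.3].
target = [0.7, 0.3]
sq_err = lambda w: sum((a - b) ** 2 for a, b in zip(w, target))
print([round(x, 2) for x in anneal(sq_err, [0.0, 0.0])])
```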

12.
Most existing information retrieval models, including probabilistic and vector space models, are based on the term independence hypothesis. To go beyond this assumption and thereby capture the semantics of documents and queries more accurately, several works have incorporated phrases or other syntactic information into IR; such attempts have shown slight benefit at best. In language modeling approaches in particular, this extension is achieved through bigram or n-gram models. However, in these models all bigrams/n-grams are considered and weighted uniformly. In this paper we introduce a new approach to select and weight the relevant n-grams associated with a document. Experimental results on three TREC test collections show an improvement over three strong state-of-the-art baselines: the original unigram language model, the Markov random field model, and the positional language model.
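For orientation, a minimal sketch of an interpolated bigram language model in which all bigrams are weighted uniformly, which is precisely the limitation the selective n-gram weighting above targets; the toy corpus, interpolation weight, and add-one smoothing are assumptions:

```python
import math
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)
total = len(corpus)

def bigram_logprob(seq, lam=0.7):
    """Interpolated bigram LM:
    p(w2|w1) = lam * p_ML(w2|w1) + (1 - lam) * p(w2)."""
    lp = 0.0
    for w1, w2 in zip(seq, seq[1:]):
        p_bi = bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
        p_uni = (unigrams[w2] + 1) / (total + V)  # add-one smoothing
        lp += math.log(lam * p_bi + (1 - lam) * p_uni)
    return lp

print(round(bigram_logprob("the cat sat".split()), 3))
```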

13.
Existing Chinese-Vietnamese cross-lingual news event retrieval methods make little use of event-entity knowledge from the news domain. When a candidate document contains multiple events, events unrelated to the query sentence interfere with the matching precision between the query and the document, degrading retrieval performance. This paper proposes a Chinese-Vietnamese cross-lingual news event retrieval model that incorporates event-entity knowledge. Chinese event query sentences are translated into Vietnamese with a query-translation method, turning the cross-lingual retrieval problem into a monolingual one. Since a query sentence contains a single event while a candidate document may contain several, which hurts precise query-document matching, event trigger words are used to delimit event scopes within candidate documents, reducing the interference of query-irrelevant events. On this basis, knowledge graphs and event trigger words are used to obtain entity-rich knowledge representations of events, and interactions between the query sentence and the document event scopes yield ranking features between the knowledge representations and words, and among the knowledge representations themselves. Experimental results on a Chinese-Vietnamese bilingual news data set show that, compared with baselines such as BM25, Conv-KNRM, and ATER, the model achieves better cross-lingual news event retrieval, improving NDCG and MAP by up to 0.7122 and 0.5872, respectively.

14.
To address the vocabulary mismatch between documents and queries, query expansion techniques are studied and a Wikipedia-based query expansion method is proposed. The method expands a query using Wikipedia pages related to it, introduces a query expansion method based on a local document set, and uses the BM25 algorithm to refine the retrieval ranking. Comparative evaluation verifies that the retrieval results obtained with this method improve substantially over the original ones.

15.
Determining the importance of index terms within a document is a central task in information retrieval modeling. Most approaches that build retrieval models on a bag-of-words document representation rest on the term-independence assumption, computing term importance as a function of TF and IDF without considering relationships between terms. This paper adopts a graph-of-word document representation to capture dependencies between terms and proposes TI-IDF, a new graph-based retrieval model built on term importance. From the term graph it derives the co-occurrence matrix of the terms in a document and the probability transition matrix between terms, determines each term's importance (Term Importance, TI) in the document by a Markov chain computation, and uses TI in place of the traditional term frequency TF during indexing. The model is more robust; we compared it with traditional retrieval models on public international data sets. Experimental results show that the proposed model consistently outperforms BM25 and, in most cases, also outperforms extensions of BM25 and models such as TW-IDF.
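As a hedged illustration of the Markov-chain idea behind TI, a sketch that scores terms by the stationary distribution of a random walk on a sliding-window co-occurrence graph; the window size, damping factor, and normalization are assumptions, not the paper's exact construction:

```python
from collections import defaultdict

def term_importance(doc, window=3, iters=50, damping=0.85):
    """Score terms by the stationary distribution of a Markov chain on
    the term co-occurrence graph (one way to realize the abstract's TI)."""
    terms = sorted(set(doc))
    n = len(terms)
    co = defaultdict(float)
    for i, t in enumerate(doc):
        for u in doc[max(0, i - window + 1):i]:  # co-occurrence window
            if u != t:
                co[(t, u)] += 1.0
                co[(u, t)] += 1.0
    out = defaultdict(float)  # out-degree weight per term
    for (t, _u), w in co.items():
        out[t] += w
    ti = {t: 1.0 / n for t in terms}
    for _ in range(iters):  # power iteration with damping
        ti = {t: (1 - damping) / n + damping * sum(
                  ti[u] * co.get((u, t), 0.0) / out[u]
                  for u in terms if out[u] > 0)
              for t in terms}
    return ti

ti = term_importance("graph of word graph model word graph".split())
print({t: round(v, 3) for t, v in ti.items()})
```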

16.
Social Book Search is an information retrieval (IR) approach that studies the impact of the Social Web on book retrieval. To understand this impact, it is necessary to develop a strong classical baseline run by considering the contributions of query formulation, document representation, and the retrieval model. Such a baseline run can then be re-ranked using metadata features from the Social Web to see whether this improves the relevance of book search results over classical IR approaches. However, existing studies have neither considered the joint contribution of these three factors in baseline retrieval nor devised a re-ranking formula that exploits the collective impact of the metadata features. To fill these gaps, this work first performs baseline retrieval considering all three factors. For query formulation, it uses topic sets obtained from the discussion threads of LibraryThing. For book representation in indexing, it uses metadata from social websites including Amazon and LibraryThing. For the retrieval model, it experiments with traditional, probabilistic, and fielded models. Second, it devises a re-ranking solution that exploits ratings, tags, reviews, and votes to reorder the baseline search results. Our best-performing retrieval methods outperform existing approaches on several topic sets and relevance judgments. The findings suggest that using all topic fields yields the best search queries, that user-generated content gives a better book representation when made part of the search index, and that re-ranking the classical baseline results improves relevance. The findings have implications for information science, IR, and interactive IR.
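A minimal sketch of re-ranking baseline results with social metadata signals; the linear blend, the normalizations, and the field names are illustrative assumptions, not the paper's formula:

```python
def rerank(results, alpha=0.7):
    """Blend the baseline retrieval score with social evidence
    (rating, votes); alpha trades off baseline vs. social signals."""
    max_s = max(r["score"] for r in results) or 1.0
    max_v = max(r["votes"] for r in results) or 1
    def blended(r):
        social = 0.5 * (r["rating"] / 5.0) + 0.5 * (r["votes"] / max_v)
        return alpha * (r["score"] / max_s) + (1 - alpha) * social
    return sorted(results, key=blended, reverse=True)

books = [
    {"title": "A", "score": 12.3, "rating": 4.5, "votes": 120},
    {"title": "B", "score": 14.1, "rating": 3.2, "votes": 15},
]
print([b["title"] for b in rerank(books)])
```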

17.
18.
State-of-the-art methods for disambiguation in automatic text processing are considered, with special attention paid to smoothed probabilistic N-gram models.

19.
Query language modeling based on relevance feedback has been widely applied to improve the effectiveness of information retrieval. However, intra-query term dependencies (i.e., the dependencies between different query terms and term combinations) have not yet been sufficiently addressed by existing approaches. This article investigates this issue within a comprehensive framework, the Aspect Query Language Model (AM). We propose to extend the AM with a hidden Markov model (HMM) structure to incorporate intra-query term dependencies and to learn the structure of a novel aspect HMM (AHMM) for query language modeling. In the proposed AHMM, combinations of query terms are viewed as latent variables representing query aspects; these form an ergodic HMM in which the dependencies between latent variables (nodes) are modeled as transition probabilities. Segmented chunks from the feedback documents are treated as the observables of the HMM. The AHMM structure is then optimized via the HMM, which can estimate the prior of the latent variables and the probability distribution of the observed chunks. Extensive experiments on three large-scale Text REtrieval Conference (TREC) collections show that our method not only significantly outperforms a number of strong baselines in both effectiveness and robustness, but also achieves better results than the AM and another state-of-the-art approach, the latent concept expansion model.
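To ground the HMM machinery referenced above, a generic sketch of the forward algorithm for computing the likelihood of an observation sequence; the two-state/three-symbol numbers are toy assumptions, and the AHMM's latent aspect structure and learning procedure are beyond this sketch:

```python
def forward(obs, pi, A, B):
    """Forward algorithm: P(observation sequence) under an HMM with
    initial distribution pi, transition matrix A, emission matrix B."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [B[t][o] * sum(alpha[s] * A[s][t] for s in range(n))
                 for t in range(n)]
    return sum(alpha)

# Two latent aspects, three observable chunk types (toy numbers).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]          # transitions between aspects
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]  # chunk emission probabilities
print(round(forward([0, 2, 1], pi, A, B), 4))
```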

20.
The Matrix Framework is a recent proposal by information retrieval (IR) researchers to flexibly represent retrieval models and concepts in a single multi-dimensional array framework. We provide computational support for exactly this framework with the array database system SRAM (Sparse Relational Array Mapping), which works on top of a DBMS. Retrieval models can be specified in its comprehension-based array query language in a way that directly corresponds to the underlying mathematical formulas. SRAM efficiently stores sparse arrays in (compressed) relational tables and translates and optimizes array queries into relational queries. In this work, we describe a number of array query optimization rules. To demonstrate their effect on text retrieval, we apply them in the TREC TeraByte track (TREC-TB) efficiency task, using the Okapi BM25 model as our example. These optimization rules enable SRAM to automatically translate the BM25 array queries into the relational equivalent of inverted-list processing, including compression, score materialization, and quantization, as employed by custom-built IR systems. The use of the high-performance MonetDB/X100 relational backend, which provides transparent database compression, allows the system to achieve very fast response times with good precision and low resource usage.
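A hedged dense-NumPy sketch of what "BM25 as a single array expression" over a term-document matrix can look like, in the spirit of the Matrix Framework; SRAM's sparse relational storage, query language, and optimizations are not reproduced, and the matrix values are toy assumptions:

```python
import numpy as np

# Toy term-document count matrix F (terms x docs); in SRAM this would
# be a sparse array stored in relational tables.
F = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1]], dtype=float)
k1, b = 1.2, 0.75
dl = F.sum(axis=0)                        # document lengths
avgdl = dl.mean()
df = (F > 0).sum(axis=1)                  # document frequencies
N = F.shape[1]
idf = np.log((N - df + 0.5) / (df + 0.5) + 1.0)

# BM25 as one array expression over the whole matrix:
denom = F + k1 * (1 - b + b * dl / avgdl)
W = idf[:, None] * F * (k1 + 1) / denom   # term-doc BM25 weights

q = np.array([1.0, 0.0, 1.0])             # query term-incidence vector
scores = q @ W                             # one score per document
print(scores)
```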
