1.

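Ma Hongyuan, Wang Bin 《中文信息学报》2012,26(6):19-27

To address result caching and prefetching in search engines, this paper proposes a caching and prefetching method based on user characteristics, in contrast to traditional methods based on query characteristics. The method aims to improve search engine system performance, with particularly pronounced gains for a subset of users. An analysis of the query contributions of users of a well-known Chinese commercial search engine shows that user contributions follow a long-tail distribution; building on this property, a query result prediction model is designed to drive prefetching and partitioned caching. Experiments on two months of large-scale real user query logs from that search engine show that, compared with typical traditional methods based on query characteristics, the method improves the hit rate by 3.03% to 4.17%, and by 20.52% to 28.2% for the 0.25% of users who contribute the most queries.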

2.

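A Query Result Caching Method Based on a Prefetching-Aware Admission Policy

Ma Hongyuan, Wang Bin 《计算机研究与发展》2012,(Z1):148-152

To address query result caching in search engines, this paper proposes a caching method based on a prefetching-aware admission policy to improve the performance of search engine retrieval systems. Prefetching query results causes cache miss rates to differ markedly across result page numbers; building on this property, a prefetching-aware admission policy is designed, comprising a query evaluation model and a feature selection method for that model. A query result caching method is then built on top of this policy. Experiments on two months of large-scale real user query logs from the search engine studied show that, compared with typical traditional methods, the method improves the cache hit rate by 6.38% to 11.99%.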

3.

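A Log-Analysis-Based Study of Query Result Caching in Search Engines

Ma Hongyuan, Wang Bin 《计算机研究与发展》2012,(Z1):224-228

Caching is a key technique for reducing response time and system load, and is an important area of research on search engine architecture. Based on an analysis of roughly 15 million user query log entries collected from the Sogou search engine over nearly one month, this paper examines query result caching in terms of query locality, caching policy, cache capacity, and the periodicity of the workload. The analysis shows that combining a hybrid caching policy with a larger cache capacity can effectively improve search engine system performance.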
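The relationship between cache capacity and hit rate that this kind of log analysis examines can be illustrated with a small replay. The following is only a minimal sketch, assuming a plain LRU policy and a toy query stream (the function name `lru_hit_rate` and the sample log are hypothetical); it does not reproduce the hybrid policy studied in the paper.

```python
from collections import OrderedDict

def lru_hit_rate(queries, capacity):
    """Replay a query stream through an LRU result cache and return the hit rate."""
    cache = OrderedDict()  # query -> placeholder for its cached results
    hits = 0
    for q in queries:
        if q in cache:
            hits += 1
            cache.move_to_end(q)           # mark as most recently used
        else:
            cache[q] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(queries) if queries else 0.0

# Hypothetical toy log; a real study would replay millions of entries.
log = ["a", "b", "a", "c", "a", "b", "d", "a"]
for capacity in (1, 2, 4):
    print(capacity, lru_hit_rate(log, capacity))
```

On this toy stream the hit rate rises from 0.0 to 0.25 to 0.5 as the capacity grows from 1 to 2 to 4, which is the qualitative effect of query locality that such an analysis measures.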

4.

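Design of a Two-Level Cache Structure Based on User Query Logs

Liang Min, Xie Ping, Hao Xiangning 《信息网络安全》2012,(6):44-46,50

To address the low hit rates and long query processing times of current distributed cache systems, this article analyzes the user query characteristics and the distribution of popular content in the September 2009 query log of a Chinese search engine, and on that basis designs and implements a two-level cache structure comprising a static cache and a dynamic cache. Finally, both theoretical analysis and experimental data are used to show that the query-log-based two-level cache structure performs better.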
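One common way to combine a static and a dynamic cache can be sketched as follows. This is an illustration under stated assumptions, not the article's design: the static level is assumed to hold the most frequent queries of a training log and is never evicted, the dynamic level is assumed to be LRU, and the class name `TwoLevelCache` is hypothetical.

```python
from collections import Counter, OrderedDict

class TwoLevelCache:
    def __init__(self, training_log, static_size, dynamic_size):
        # Static level: the most frequent queries in the training log (assumption).
        top = Counter(training_log).most_common(static_size)
        self.static = {q for q, _ in top}
        # Dynamic level: LRU over everything else (assumption).
        self.dynamic = OrderedDict()
        self.dynamic_size = dynamic_size

    def lookup(self, query):
        """Return 'static', 'dynamic', or 'miss'; misses are admitted to the dynamic level."""
        if query in self.static:
            return "static"
        if query in self.dynamic:
            self.dynamic.move_to_end(query)
            return "dynamic"
        self.dynamic[query] = True
        if len(self.dynamic) > self.dynamic_size:
            self.dynamic.popitem(last=False)
        return "miss"

# Hypothetical toy data.
training_log = ["news", "news", "weather", "news", "map"]
cache = TwoLevelCache(training_log, static_size=1, dynamic_size=2)
outcomes = [cache.lookup(q) for q in ["news", "map", "map", "x", "y", "map"]]
print(outcomes)
```

The static level protects persistently popular queries from being evicted by bursts of one-off queries, while the dynamic level adapts to recent traffic.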

5.

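A Hierarchical Query Result Caching Mechanism Based on Temporal and Spatial Locality

Zhu Yadong, Guo Jiafeng, Lan Yanyan, Cheng Xueqi 《中文信息学报》2016,30(1):63-71

Query result caching can cache either the set of document identifiers for a query's results or the actual result pages returned, in order to speed up responses to user queries; these two forms are called identifier caching and page caching, respectively. For a fixed amount of memory, an identifier cache achieves a higher hit rate, whereas a page cache achieves a faster response. Drawing on the temporal and spatial locality of user query accesses, this paper proposes a novel hierarchical result caching mechanism. First, the mechanism divides a fixed-size result cache into two layers: a page cache and an identifier cache. A submitted query is first answered from the first-layer page cache; on a miss, the second-layer identifier cache is tried. Experiments show that, compared with traditional mechanisms relying on a single form of cache, this hierarchical mechanism achieves a considerable improvement in average query response time: for example, relative to a pure page cache, the improvement averages 9% and reaches 11% in the best case. Second, on top of the identifier cache, the mechanism adds a heuristic prefetching strategy that exploits the spatial locality of user queries. Experiments show that incorporating this prefetching strategy further improves retrieval system performance, yielding an effective result caching mechanism that covers both temporal and spatial locality.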
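The two-layer lookup order described in this abstract can be sketched roughly as below. This is a simplified illustration, not the paper's implementation: both layers are assumed to be LRU, capacities are counted in entries rather than bytes, the prefetching strategy is left out, and `search_index` and `render_page` are hypothetical stand-ins for the real retrieval and snippet-generation steps.

```python
from collections import OrderedDict

class LRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)

def search_index(query):
    # Hypothetical stand-in for full query processing against the index.
    return [len(query), len(query) + 1]

def render_page(doc_ids):
    # Hypothetical stand-in for building the result page from document IDs.
    return "results: " + ", ".join(map(str, doc_ids))

class HierarchicalCache:
    def __init__(self, page_slots, id_slots):
        self.pages = LRU(page_slots)  # layer 1: fully rendered result pages
        self.ids = LRU(id_slots)      # layer 2: document identifier lists

    def answer(self, query):
        page = self.pages.get(query)
        if page is not None:
            return page, "page-hit"
        doc_ids = self.ids.get(query)
        source = "id-hit"
        if doc_ids is None:
            doc_ids = search_index(query)  # both layers missed
            self.ids.put(query, doc_ids)
            source = "miss"
        page = render_page(doc_ids)
        self.pages.put(query, page)
        return page, source

cache = HierarchicalCache(page_slots=1, id_slots=2)
trace = [cache.answer(q)[1] for q in ["a", "a", "b", "a"]]
print(trace)
```

In the trace, the last request for "a" misses the small page layer but is still served from the identifier layer, so only the page has to be rebuilt and the index is not queried again.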

6.

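Automatic Identification of Query Intent in Query Logs

《计算机应用与软件》2015,(11)

To address users' low satisfaction with search engine results, this paper proposes a query intent identification method based on user behavior analysis to improve search engine query quality. Treating query intent identification as a classification problem, an analysis of Sogou query logs finds that: informational and transactional queries lead to clicks on a larger number of distinct pages, with a multi-peaked click distribution; navigational queries lead to clicks on fewer distinct pages, with a single-peaked distribution; and, in the results of navigational queries, noise from sub-pages seriously interferes with query classification. Based on these observations, three features are proposed: "number of distinct pages clicked", "click distribution value", and "number of clicked pages from different sources". Combined with features from earlier work, they are used to train a classifier with the C4.5 algorithm for query intent identification. In the experiments, overall classification accuracy reaches 90%, an improvement of 8.5% over the baseline. The results indicate that the method is effective at identifying user query intent.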
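The abstract does not give exact definitions for its three features, so the following is only a rough sketch of how click-based features of this kind could be computed from a log of (query, clicked URL) pairs. The share of clicks on the most-clicked page and the count of distinct hosts are assumed stand-ins for the "click distribution value" and the "different sources" feature, and the function name `click_features` is hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def click_features(log):
    """log: iterable of (query, clicked_url) pairs."""
    clicks = defaultdict(Counter)
    for query, url in log:
        clicks[query][url] += 1
    features = {}
    for query, counts in clicks.items():
        total = sum(counts.values())
        features[query] = {
            # Number of distinct pages clicked for this query.
            "distinct_pages": len(counts),
            # Assumed proxy for the click distribution: share of the top page.
            "top_click_share": max(counts.values()) / total,
            # Assumed proxy for "different sources": distinct hosts clicked.
            "distinct_hosts": len({urlparse(u).netloc for u in counts}),
        }
    return features

# Hypothetical toy log.
toy_log = [
    ("python", "https://www.python.org/"),
    ("python", "https://www.python.org/"),
    ("python", "https://www.python.org/"),
    ("python", "https://www.python.org/downloads/"),
    ("cheap flights", "https://a.example/x"),
    ("cheap flights", "https://b.example/y"),
    ("cheap flights", "https://c.example/z"),
]
feats = click_features(toy_log)
print(feats["python"])
print(feats["cheap flights"])
```

In this toy data the navigational-looking query concentrates its clicks on one host, even though a sub-page adds a second distinct page, while the other query spreads clicks across three hosts. Feature vectors like these could then be passed to a decision-tree learner.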

7.

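Query Recommendation Based on Query Popularity and Entity Recognition

Ren Yuwei, Lü Xueqiang, Li Zhuo, Xu Liping 《计算机应用研究》2016,33(3)

Query recommendation has become an important means of improving the user search experience and the quality of search engine services, and improving the quality of recommended queries and user satisfaction with them is a pressing need. Existing methods neglect, in their similarity computation, both the importance of named entities and a measure of the information in the search log as a whole. The proposed approach clusters query strings and evaluates their popularity, and extracts the named entities they contain. It then incorporates query popularity and named-entity features into the similarity formula, yielding a new query recommendation method. The mean satisfaction scores of its results are higher than those of three recent methods, which indicates that the method is effective. The method uses the recognized named entities in similarity computation and also takes into account the popularity of recommended queries in the global log, improving the overall quality of the recommendations; however, it is limited by the accuracy of feature extraction and depends on further enriching and optimizing the features.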

8.

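User-Behavior-Based Analysis of User Satisfaction with Long Queries

Zhu Tong, Liu Yiqun, Ru Liyun, Ma Shaoping 《模式识别与人工智能》2012,25(3):469-474

Evaluating search engine performance is an important topic in the information retrieval community. Long queries carry relatively rich information and describe users' information needs more precisely. On this basis, the paper proposes an overall framework for analyzing user satisfaction with long queries, defines the notion of user satisfaction, extracts relevant user behavior features from user logs, and applies two classification algorithms, decision trees and SVM, to assess user satisfaction. Experiments on large-scale commercial search engine logs demonstrate the effectiveness of this evaluation framework. The results show classification accuracies of 86% and 70% for queries with which users were satisfied and dissatisfied, respectively.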

9.

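A Query Expansion Method Based on Hidden Markov Models

Jiao Jian, Zhang Yangsen 《计算机科学》2014,41(12):168-171,188

The purpose of query expansion is to uncover the latent semantics of a query and determine the user's intent, and then to construct a query better suited to search engine retrieval, thereby improving retrieval precision. This paper proposes using a hidden Markov model to predict the latent semantics of a query; the model is trained on large-scale user query logs. Expanded queries predicted by the model achieve markedly higher precision than schemes such as term co-occurrence expansion and synonym expansion.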

10.

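Keyword-Based Querying of Deep Web Databases

Ding Chuanyu, Chen Junhua, Xia Haifeng 《计算机与数字工程》2013,41(4)

The deep Web holds a vast amount of information, yet existing search engines have difficulty reaching its content. How to fully obtain valuable information from the deep Web has become a hard problem. This paper proposes a keyword-based method for querying deep Web databases: it classifies keywords with the naive Bayes algorithm, gathers statistics on sampled databases through log mining, and finally generates the SQL statement for the query. The method not only handles querying of deep Web databases across multiple domains but can also be integrated with existing search engines to help users query quickly and effectively.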

11.

Three-Level Caching for Efficient Query Processing in Large Web Search Engines  Total citations: 1, self-citations: 0, citations by others: 1

Xiaohui Long, Torsten Suel 《World Wide Web》2006,9(4):369-395

Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level. We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance. Work supported by NSF CAREER Award CCR-0093400 and the New York State Center for Advanced Technology in Telecommunications (CATT) at Polytechnic University.

12.

Query expansion by mining user logs  Total citations: 9, self-citations: 0, citations by others: 9

Hang Cui, Ji-Rong Wen, Jian-Yun Nie, Wei-Ying Ma 《IEEE Transactions on Knowledge and Data Engineering》2003,15(4):829-839

Queries to search engines on the Web are usually short. They do not provide sufficient information for an effective selection of relevant documents. Previous research has proposed the utilization of query expansion to deal with this problem. However, expansion terms are usually determined on term co-occurrences within documents. In this study, we propose a new method for query expansion based on user interactions recorded in user logs. The central idea is to extract correlations between query terms and document terms by analyzing user logs. These correlations are then used to select high-quality expansion terms for new queries. Compared to previous query expansion methods, ours takes advantage of the user judgments implied in user logs. The experimental results show that the log-based query expansion method can produce much better results than both the classical search method and the other query expansion methods.
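The central idea of correlating query terms with terms of clicked documents can be sketched with simple co-occurrence counts. This is a simplified illustration, not the probabilistic model used in the paper: a correlation is estimated here as the fraction of sessions containing a query term whose clicked document contains a given document term, scores are summed across query terms, and the names `term_correlations` and `expand` are hypothetical.

```python
from collections import Counter, defaultdict

def term_correlations(sessions):
    """sessions: iterable of (query_terms, clicked_doc_terms) pairs.

    Returns corr[q_term][d_term]: the fraction of sessions containing q_term
    whose clicked document contains d_term.
    """
    pair_counts = defaultdict(Counter)
    q_counts = Counter()
    for q_terms, d_terms in sessions:
        for q in set(q_terms):
            q_counts[q] += 1
            for d in set(d_terms):
                pair_counts[q][d] += 1
    return {q: {d: c / q_counts[q] for d, c in ds.items()}
            for q, ds in pair_counts.items()}

def expand(query_terms, corr, k=2):
    """Pick the k document terms most correlated with the query terms."""
    scores = Counter()
    for q in query_terms:
        for d, p in corr.get(q, {}).items():
            if d not in query_terms:
                scores[d] += p  # summing is an assumption made for this sketch
    return [t for t, _ in scores.most_common(k)]

# Hypothetical toy sessions.
sessions = [
    (["jaguar"], ["car", "speed", "engine"]),
    (["jaguar"], ["car", "dealer"]),
    (["jaguar", "cat"], ["animal", "cat"]),
]
corr = term_correlations(sessions)
print(expand(["jaguar"], corr, k=1))
```

Because the counts come from documents users actually clicked, the selected expansion terms reflect implicit user judgments rather than co-occurrence within documents alone.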

13.

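Web Query Classification Based on "VASE" Feature Words

Wang Yulin, Sun Le, Li Wenbo 《中文信息学报》2009,23(3):39-45

Web query classification is important for improving search engine quality. By analyzing and annotating real user query logs, this paper finds that four kinds of feature words (called "VASE" feature words) play a decisive role in query classification. The feature words are extracted and used to build an inverted index of feature words for classifying queries by topic. On this basis, methods based on web expansion and weighted feature words are proposed to improve classification. Experimental results show that the classification method achieves precision and recall of 78.2% and 77.3%, respectively.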

14.

Reducing Query Latencies in Web Search Using Fine-Grained Parallelism

Eitan Frachtenberg 《World Wide Web》2009,12(4):441-460

Semantic Web search is a new application of recent advances in information retrieval (IR), natural language processing, artificial intelligence, and other fields. The Powerset group in Microsoft develops a semantic search engine that aims to answer queries not only by matching keywords, but by actually matching meaning in queries to meaning in Web documents. Compared to typical keyword search, semantic search can pose additional engineering challenges for the back-end and infrastructure designs. Of these, the main challenge addressed in this paper is how to lower query latencies to acceptable, interactive levels. Index-based semantic search requires more data processing, such as numerous synonyms, hypernyms, multiple linguistic readings, and other semantic information, both on queries and in the index. In addition, some of the algorithms can be super-linear, such as matching co-references across a document. Consequently, many semantic queries can run significantly slower than the same keyword query. Users, however, have grown to expect Web search engines to provide near-instantaneous results, and a slow search engine could be deemed unusable even if it provides highly relevant results. It is therefore imperative for any search engine to meet its users’ interactivity expectations, or risk losing them. Our approach to tackle this challenge is to exploit data parallelism in slow search queries to reduce their latency in multi-core systems. Although all search engines are designed to exploit parallelism, at the single-node level this usually translates to throughput-oriented task parallelism. This paper focuses on the engineering of two latency-oriented approaches (coarse- and fine-grained) and compares them to the task-parallel approach. We use Powerset’s deployed search engine to evaluate the various factors that affect parallel performance: workload, overhead, load balancing, and resource contention. 
We also discuss heuristics to selectively control the degree of parallelism and consequent overhead on a query-by-query level. Our experimental results show that using fine-grained parallelism with these dynamic heuristics can significantly reduce query latencies compared to fixed, coarse-granularity parallelization schemes. Although these results were obtained on, and optimized for, Powerset’s semantic search, they can be readily generalized to a wide class of inverted-index search engines.

15.

Integrating Web Prefetching and Caching Using Prediction Models  Total citations: 2, self-citations: 0, citations by others: 2

Qiang Yang, Henry Hanning Zhang 《World Wide Web》2001,4(4):299-321

Web caching and prefetching have been studied in the past separately. In this paper, we present an integrated architecture for Web object caching and prefetching. Our goal is to design a prefetching system that can work with an existing Web caching system in a seamless manner. In this integrated architecture, a certain amount of caching space is reserved for prefetching. To empower the prefetching engine, a Web-object prediction model is built by mining the frequent paths from past Web log data. We show that the integrated architecture improves the performance over Web caching alone, and present our analysis on the tradeoff between the reduced latency and the potential increase in network load.
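A prediction model mined from past access paths can be illustrated with a first-order transition table. This is a deliberately simplified sketch, assuming that only the current object is used to predict the next one and that prefetching is triggered above a confidence threshold; the paper's own model and its parameters are not reproduced, and the names `build_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(sessions):
    """Count how often each object is immediately followed by each other object."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            transitions[current][nxt] += 1
    return transitions

def predict_next(model, current, min_confidence=0.5):
    """Return the most likely next object, or None if confidence is too low."""
    followers = model.get(current)
    if not followers:
        return None
    nxt, count = followers.most_common(1)[0]
    confidence = count / sum(followers.values())
    return nxt if confidence >= min_confidence else None

# Hypothetical toy sessions extracted from a Web log.
sessions = [
    ["/", "/news", "/news/1"],
    ["/", "/news", "/sports"],
    ["/", "/news", "/news/1"],
    ["/about"],
]
model = build_model(sessions)
print(predict_next(model, "/news"))
print(predict_next(model, "/news", min_confidence=0.9))
```

Raising `min_confidence` reduces wasted prefetches, and with them the extra network load, at the cost of fewer latency savings, which mirrors the tradeoff the abstract mentions.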

16.

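A New Query Routing System for Search Engines

Zhao Zhongmeng, Zhang Lulin, Qi Xiaoguang, Tian Xinyan 《计算机工程》2002,28(8):133-134,145

There are many specialized search engines on the Web, and ordinary users are often at a loss when choosing among them. This article proposes an automatic query routing system that directs a user's query to a suitable specialized search engine, resolving this difficulty.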

17.

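Query Expansion Based on a Weight-Normalized SimRank Method  Total citations: 1, self-citations: 0, citations by others: 1

Ma Yunlong, Lin Yuan, Lin Hongfei 《中文信息学报》2011,25(1):28-35

Query expansion is an important technique in information retrieval. Traditional local-analysis query expansion methods use pseudo-relevant documents as the set of candidate terms, but some pseudo-relevant documents are not highly relevant. Using real search engine query logs, this paper builds a query-click graph and, through several transformations of the graph structure, derives a term relation graph that reflects the degree of association between terms. Building on SimRank, a graph-based similarity algorithm, it proposes an improved weight-normalized SimRank method that exploits the global and indirect relations among terms in the term relation graph and can effectively mine expansion terms related to the original query. To reduce the computational complexity of SimRank, the paper also applies optimizations such as pruning, which greatly improve computational efficiency. Experiments on standard TREC datasets show that the method selects relevant expansion terms effectively: MAP improves by 1.81% over the local-analysis query expansion method, and P@10 and P@20 improve by 5.44% and 3.73%, respectively.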
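For orientation, the basic (unweighted) SimRank iteration that this work builds on can be sketched as follows. The weight normalization and pruning proposed in the paper are not shown; the decay factor, the iteration count, and the toy graph are assumptions chosen for illustration.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Basic SimRank. in_neighbors maps every node to a list of its in-neighbors."""
    nodes = list(in_neighbors)
    # Start from the identity: each node is fully similar only to itself.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new[a][b] = 0.0  # no in-neighbors: similarity is zero
                    continue
                # Average similarity over all pairs of in-neighbors, then decay.
                total = sum(sim[x][y] for x in ia for y in ib)
                new[a][b] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim

# Hypothetical toy graph: terms t1 and t2 are both linked from query q.
graph = {"q": [], "t1": ["q"], "t2": ["q"], "t3": []}
scores = simrank(graph)
print(scores["t1"]["t2"], scores["t1"]["t3"])
```

Two terms reached from the same query end up similar even without a direct edge between them, which is the kind of indirect relation that makes SimRank attractive for selecting expansion terms.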