1.

Semi-supervised ranking for document retrieval

Kevin Duh Katrin Kirchhoff 《Computer Speech and Language》2011,25(2):261-281

Ranking functions are an important component of information retrieval systems. Recently there has been a surge of research in the field of “learning to rank”, which aims at using labeled training data and machine learning algorithms to construct reliable ranking functions. Machine learning methods such as neural networks, support vector machines, and least squares have been successfully applied to ranking problems, and some are already being deployed in commercial search engines. Despite these successes, most algorithms to date construct ranking functions in a supervised learning setting, which assumes that relevance labels are provided by human annotators prior to training the ranking function. Such methods may perform poorly when human relevance judgments are not available for a wide range of queries. In this paper, we examine whether additional unlabeled data, which is easy to obtain, can be used to improve supervised algorithms. In particular, we investigate the transductive setting, where the unlabeled data is equivalent to the test data. We propose a simple yet flexible transductive meta-algorithm: the key idea is to adapt the training procedure to each test list after observing the documents that need to be ranked. We investigate two instantiations of this general framework. The feature generation approach is based on discovering more salient features from the unlabeled test data and training a ranker on this test-dependent feature set. The importance weighting approach is based on ideas in the domain adaptation literature, and works by re-weighting the training data to match the statistics of each test list. We demonstrate that both approaches improve over supervised algorithms on the TREC and OHSUMED tasks from the LETOR dataset.

2.

Multi-domain learning by confidence-weighted parameter combination

Mark Dredze Alex Kulesza Koby Crammer 《Machine Learning》2010,79(1-2):123-149

State-of-the-art statistical NLP systems for a variety of tasks learn from labeled training data that is often domain specific. However, there may be multiple domains or sources of interest on which the system must perform. For example, a spam filtering system must give high quality predictions for many users, each of whom receives emails from different sources and may make slightly different decisions about what is or is not spam. Rather than learning separate models for each domain, we explore systems that learn across multiple domains. We develop a new multi-domain online learning framework based on parameter combination from multiple classifiers. Our algorithms draw from multi-task learning and domain adaptation to adapt multiple source domain classifiers to a new target domain, learn across multiple similar domains, and learn across a large number of disparate domains. We evaluate our algorithms on two popular NLP domain adaptation tasks: sentiment classification and spam filtering.

3.

Active evaluation of ranking functions based on graded relevance

Christoph Sawade Steffen Bickel Timo von Oertzen Tobias Scheffer Niels Landwehr 《Machine Learning》2013,92(1):41-64

Evaluating the quality of ranking functions is a core task in web search and other information retrieval domains. Because query distributions and item relevance change over time, ranking models often cannot be evaluated accurately on held-out training data. Instead, considerable effort is spent on manually labeling the relevance of query results for test queries in order to track ranking performance. We address the problem of estimating ranking performance as accurately as possible on a fixed labeling budget. Estimates are based on a set of most informative test queries selected by an active sampling distribution. Query labeling costs depend on the number of result items as well as item-specific attributes such as document length. We derive cost-optimal sampling distributions for the commonly used performance measures Discounted Cumulative Gain and Expected Reciprocal Rank. Experiments on web search engine data illustrate significant reductions in labeling costs.
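The two performance measures named in this abstract can be sketched as follows. This is a minimal illustration using the commonly cited definitions (gain 2^g - 1, a log2 rank discount for DCG, and a cascade stopping model for ERR); the function names and the default `max_grade=4` are assumptions for the example and are not taken from the paper, and the sketch does not implement the paper's active sampling scheme.

```python
import math


def dcg(grades, k=None):
    """Discounted Cumulative Gain of a ranked list of relevance grades.

    Uses the common gain 2^g - 1 and discount log2(rank + 1).
    """
    if k is not None:
        grades = grades[:k]
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades, start=1))


def err(grades, max_grade=4):
    """Expected Reciprocal Rank under the cascade user model.

    A user scans down the list and stops at rank r with probability
    (2^g - 1) / 2^max_grade, given that they did not stop earlier.
    """
    p_reach = 1.0  # probability the user reaches the current rank
    total = 0.0
    for rank, g in enumerate(grades, start=1):
        p_stop = (2 ** g - 1) / 2 ** max_grade
        total += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return total
```

Both functions take the grades in ranked order, so placing highly graded items first increases the score, which is the behavior a labeling budget is spent on estimating.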

4.

Boosted ranking models: a unifying framework for ranking predictions (Cited by: 2; self-citations: 2; citations by others: 0)

Kevin Dela Rosa Vangelis Metsis Vassilis Athitsos 《Knowledge and Information Systems》2012,30(3):543-568

Ranking is an important functionality in a diverse array of applications, including web search, similarity-based multimedia retrieval, nearest neighbor classification, and recommendation systems. In this paper, we propose a new method, called Boosted Ranking Model (BRM), for learning how to rank from training data. An important feature of the proposed method is that it is domain-independent and can thus be applied to a wide range of ranking domains. The main contribution of the new method is that it reduces the problem of learning how to rank to the much simpler and well-studied problem of constructing an optimized binary classifier from simple, weak classifiers. Using that reduction, our method constructs an optimized ranking model using multiple simple, easy-to-define ranking models as building blocks. The new method is a unifying framework that includes, as special cases, specific methods that we have proposed in earlier publications for specific ranking applications, such as nearest neighbor retrieval and classification. In this paper, we reformulate those earlier methods as special cases of the proposed BRM method, and we also illustrate a novel application of BRM to the problem of making movie recommendations to individual users.

5.

Learning to rank code examples for code search engines

Haoran Niu Iman Keivanloo Ying Zou 《Empirical Software Engineering》2017,22(1):259-291

Source code examples are used by developers to implement unfamiliar tasks by learning from existing solutions. To better support developers in finding existing solutions, code search engines are designed to locate and rank code examples relevant to users' queries. Essentially, a code search engine provides a ranking schema, which combines a set of ranking features to calculate the relevance between a query and candidate code examples. Consequently, the ranking schema places relevant code examples at the top of the result list. However, it is difficult to determine the configurations of the ranking schemas subjectively. In this paper, we propose a code example search approach that applies a machine learning technique to automatically train a ranking schema. We use the trained ranking schema to rank candidate code examples for new queries at run-time. We evaluate the ranking performance of our approach using a corpus of over 360,000 code snippets crawled from 586 open-source Android projects. The performance evaluation study shows that the learning-to-rank approach can effectively rank code examples, and outperform the existing ranking schemas by about 35.65% and 48.42% in terms of normalized discounted cumulative gain (NDCG) and expected reciprocal rank (ERR) measures, respectively.

6.

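Accelerated evaluation algorithm: a new method for improving the quality of Web structure mining (Cited by: 13; self-citations: 1; citations by others: 13)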

Zhang Ling (张岭) Ma Fanyuan (马范援) 《计算机研究与发展 (Journal of Computer Research and Development)》2004,41(1):98-103

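Web structure mining can find high-quality pages on the Web, and it greatly improves the retrieval precision of search engines. Current Web structure mining algorithms evaluate a page by counting the hyperlinks that point to it and by the quality of the source nodes. Algorithms based on link counts have a serious flaw: page evaluations become polarized. Long-established high-quality pages frequently appear at the top of Web search results, whereas high-quality pages newly added to the Web are hard for users to find. An accelerated evaluation algorithm is proposed to overcome this shortcoming of existing Web hyperlink analysis, and the algorithm is tested and validated on a search engine platform.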
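The link-based evaluation that this abstract criticizes (scoring a page by the number of hyperlinks pointing to it and the quality of their source pages) can be illustrated with a PageRank-style power iteration. This is a sketch of that general family only, not the accelerated evaluation algorithm proposed in the paper; the function name `link_scores`, the damping factor 0.85, and the toy graph in the usage note are assumptions made for illustration.

```python
def link_scores(out_links, damping=0.85, iterations=50):
    """Score pages by incoming links, weighted by the score of each source.

    out_links maps a page to the list of pages it links to.
    Pages that only appear as link targets are included automatically.
    """
    pages = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = out_links.get(p, [])
            if targets:
                # A page passes its score on equally to the pages it links to.
                share = damping * score[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page without outgoing links spreads its score evenly.
                share = damping * score[p] / n
                for q in pages:
                    new[q] += share
        score = new
    return score
```

For a toy graph such as `{"old": ["hub"], "hub": ["old"], "new": ["hub"]}`, the page `new` has no incoming links and therefore receives only the baseline share, which mirrors the polarization the abstract describes for recently added pages.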

7.

Disclosing incoherent sparse and low-rank patterns inside homologous GPCR tasks for better modelling of ligand bioactivities

Jiansheng WU Chuangchuang LAN Xuelin YE Jiale DENG Wanqing HUANG Xueni YANG Yanxiang ZHU Haifeng HU 《Frontiers of Computer Science》2022,16(4):164322

There are many new and potential drug targets in G protein-coupled receptors (GPCRs) without sufficient ligand associations, and accurately predicting and interpreting ligand bioactivities is vital for screening and optimizing hit compounds targeting these GPCRs. To efficiently address the lack of labeled training samples, we proposed a multi-task regression learning method with incoherent sparse and low-rank patterns (MTR-ISLR) to model ligand bioactivities and identify the key substructures associated with these GPCR targets. That is, MTR-ISLR intends to enhance the performance and interpretability of models trained on a small amount of available data by introducing homologous GPCR tasks. Meanwhile, the low-rank constraint term encourages the model to capture the underlying relationship among homologous GPCR tasks for greater generalization, and the entry-wise sparse regularization term ensures that essential discriminative substructures are recognized in each task for explanatory modeling. We examined MTR-ISLR on a set of 31 important human GPCR datasets from 9 subfamilies, each with fewer than 400 ligand associations. The results show that MTR-ISLR performs better than traditional single-task learning, deep multi-task learning, and multi-task learning with joint feature learning-based models in most cases, where MTR-ISLR obtains an average improvement of 7% in correlation coefficient (r²) and 12% in root mean square error (RMSE) against the runner-up predictors. The MTR-ISLR web server freely provides all source code and data for academic use.

8.

An online learning method for binary classifiers to improve Google image search results

Wan Yuchai (万玉钗) Liu Xiabi (刘峡壁) Han Feifei (韩菲霏) Tong Kunqi (童坤琦) Liu Yu (刘宇) 《自动化学报 (Acta Automatica Sinica)》2014,40(8):1699-1708

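For keyword-based image retrieval, learning a binary classifier from the visual similarity of the retrieved results is expected to become one of the most effective ways to improve those results. To improve search engine results, this paper proposes an algorithmic framework and, within it, focuses on the key problem of training data selection. Training data selection consists of two stages: 1) initialization of the training data to start classifier learning; 2) dynamic data selection during iterative classifier learning. For initial training data selection, we examine a clustering-based method and a ranking-based method, and compare automatic training data selection with manual annotation. For dynamic data selection, we compare the classification performance of support vector machines and a Bayesian classifier based on max-min posterior pseudo-probabilities. Combining the different methods for the two stages yields 8 different algorithms, which we apply to keyword-based image retrieval with the Google search engine. The experimental results show that how training data is selected from noisy search results is the key issue in improving those results. The experiments also show that our methods effectively improve Google search results, especially the top-ranked ones. Providing more relevant results to users earlier greatly reduces the effort of paging through results one by one. In addition, how to make automatic training data selection match manual annotation remains a problem for further study.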

9.

Web search enhancement by mining user actions

M.M. Sufyan Beg Nesar Ahmad 《Information Sciences》2007,177(23):5203-5218

Search engines are among the most popular and useful services on the web. There is a need, however, to cater to the preferences of users when supplying search results to them. We propose to maintain a search profile for each user, on the basis of which the search results would be determined. This requires the integration of techniques for measuring search quality, learning from user feedback, and biased rank aggregation. For the purpose of measuring web search quality, “user satisfaction” is gauged by the sequence in which the user picks the results, the time the user spends on those documents, and whether or not the user prints, saves, bookmarks, e-mails to someone, or copies and pastes a portion of a document. For rank aggregation, we adopt and evaluate the classical fuzzy rank ordering techniques for web applications, and also propose a few novel techniques that outperform the existing techniques. A “user satisfaction” guided web search procedure is also put forward. Learning from user feedback proceeds in such a way that there is an improvement in the ranking of the documents that are consistently preferred by the users. As an integration of our work, we propose a personalized web search system.
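To make the idea of rank aggregation concrete, the sketch below merges several ranked lists with the Borda count, a classical positional method. It is a stand-in chosen for brevity: it is not one of the fuzzy rank ordering techniques or the novel techniques the abstract refers to, and the function name `borda_aggregate` and the tie-breaking rule are assumptions for the example.

```python
def borda_aggregate(rankings):
    """Merge ranked lists of document ids into one consensus ranking.

    Each list awards (n - 1) points to its top item, (n - 2) to the next,
    and so on, where n is the number of distinct items overall.
    Items missing from a list receive no points from that list.
    Ties are broken by item id so the output is deterministic.
    """
    items = sorted({doc for ranking in rankings for doc in ranking})
    n = len(items)
    points = {doc: 0 for doc in items}
    for ranking in rankings:
        for position, doc in enumerate(ranking):
            points[doc] += n - 1 - position
    return sorted(items, key=lambda doc: (-points[doc], doc))
```

A call such as `borda_aggregate([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]])` combines three orderings of the same documents into a single list ordered by total points.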

10.

Learning to rank on graphs

Shivani Agarwal 《Machine Learning》2010,81(3):333-357

Graph representations of data are increasingly common. Such representations arise in a variety of applications, including computational biology, social network analysis, web applications, and many others. There has been much work in recent years on developing learning algorithms for such graph data; in particular, graph learning algorithms have been developed for both classification and regression on graphs. Here we consider graph learning problems in which the goal is not to predict labels of objects in a graph, but rather to rank the objects relative to one another; for example, one may want to rank genes in a biological network by relevance to a disease, or customers in a social network by their likelihood of being interested in a certain product. We develop algorithms for such problems of learning to rank on graphs. Our algorithms build on the graph regularization ideas developed in the context of other graph learning problems, and learn a ranking function in a reproducing kernel Hilbert space (RKHS) derived from the graph. This allows us to show attractive stability and generalization properties. Experiments on several graph ranking tasks in computational biology and in cheminformatics demonstrate the benefits of our framework.