Similar Documents
 19 similar documents found (search time: 93 ms)
1.
Users can obtain a large amount of information through retrieval platforms, but search results often suffer from topic drift and a bias toward older pages, failing to meet users' actual needs. To address this, an improved PageRank algorithm is proposed. The algorithm uses the BM25 similarity measure to compute topic similarity and assigns different influence weights according to the similarity score, raising the ranking of highly similar pages; it uses the number of times a page is retrieved within a search-engine cycle to estimate how long the page has existed, compensating the weights of new pages. Incorporating these factors into PageRank makes the computation of page PR values more reasonable. Experiments show that the improved PageRank algorithm effectively reduces topic drift in search applications and yields more comprehensive and accurate results.
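The abstract does not reproduce the BM25 formula; as an illustration, here is a minimal sketch of the standard Okapi BM25 scoring that such a topic-similarity step builds on (the parameter defaults k1 = 1.2 and b = 0.75 are conventional choices, not taken from the paper):

```python
import math

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of terms) against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            tf = d.count(t)
            if tf == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

A page's BM25 score against the query topic could then be mapped to the influence weight the abstract mentions.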

2.
To overcome the problem that PageRank repeatedly places currently popular pages at the top of search results while unpopular pages are ignored by most users, an improved evaluation function and an effective user model are adopted to obtain a new PageRank optimization algorithm. Experimental results show that the algorithm achieves better fairness.

3.
To address the large number of duplicate pages commonly found in search-engine results, a search-result optimization algorithm based on the DBSCAN clustering algorithm is proposed. The algorithm selects the top-ranked pages from the original results and clusters them with DBSCAN according to page similarity, removing redundant pages as far as possible to optimize the results. Experimental results show that the algorithm improves the comprehensiveness and accuracy of search results and increases user satisfaction with the search engine.
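For illustration, a minimal pure-Python sketch of DBSCAN over a page-similarity matrix, with each cluster of near-duplicates reduced to one representative; the eps and min_pts defaults are illustrative, and "keep the highest-ranked page of each cluster" is an assumed tie-breaking rule, not a detail from the paper:

```python
def dbscan_dedup(sim, eps=0.2, min_pts=2):
    """Cluster pages whose pairwise distance (1 - similarity) <= eps,
    then keep one representative per cluster. sim is a symmetric matrix."""
    n = len(sim)
    labels = [None] * n          # None = unvisited, -1 = noise
    cluster = 0
    def neighbors(i):
        return [j for j in range(n) if 1 - sim[i][j] <= eps]
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1       # too isolated to be a core point
            continue
        labels[i] = cluster
        seeds = [j for j in nb if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:  # noise absorbed as a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:
                seeds.extend(nb_j)
        cluster += 1
    # keep the first (highest-ranked) page of each cluster, plus all noise pages
    kept, seen = [], set()
    for i in range(n):
        if labels[i] == -1:
            kept.append(i)
        elif labels[i] not in seen:
            seen.add(labels[i])
            kept.append(i)
    return kept
```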

4.
To address the "equal distribution principle" in Google's PageRank algorithm and the "old page" and "topic drift" problems caused by the Web's link structure, an improved page-ranking algorithm, N-PageRank, is proposed. By mining search logs, the algorithm captures the interaction between users and the search engine and uncovers the user interests and search patterns hidden behind user search behavior. Using a user-behavior feedback model, it analyzes the behavioral characteristics of users recorded in Web logs, improving ranking accuracy and ensuring that the search engine returns exactly the pages users hope to see. Experiments show that the algorithm effectively reduces the influence of extraneous factors on page ranking, fully accounts for users' assessment of page quality, and produces rankings that better satisfy user needs.

5.
The TS-PageRank Algorithm Based on a Topic Similarity Model
PageRank is the core algorithm of the well-known search engine Google, but it suffers from topic drift, causing search results to contain too many pages irrelevant to the query topic. Based on an analysis of PageRank and its existing improvements, a topic similarity model based on virtual documents and a TS-PageRank algorithm framework based on that model are proposed. By choosing different similarity computation models, different TS-PageRank algorithms can be obtained, forming a family of page-ranking algorithms. Theoretical analysis and numerical simulations show that the algorithm greatly reduces topic drift without requiring additional text information or increasing time or space complexity, thereby improving query efficiency and quality.
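The paper's exact TS-PageRank formulation is not given in the abstract; one common way to fold a topic-similarity model into PageRank is to bias the teleport vector by each page's similarity score, as in this hedged sketch (the teleport-based handling of dangling pages is an assumption):

```python
def ts_pagerank(links, sim, d=0.85, iters=100):
    """PageRank with a topic-similarity-biased teleport vector.
    links[i] = list of pages i links to; sim[i] = topic similarity of page i."""
    n = len(links)
    total_sim = sum(sim)
    teleport = [s / total_sim for s in sim]  # bias random jumps toward on-topic pages
    pr = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - d) * teleport[i] for i in range(n)]
        for i, outs in enumerate(links):
            if outs:
                share = d * pr[i] / len(outs)
                for j in outs:
                    new[j] += share
            else:  # dangling page: spread its mass by the teleport distribution
                for j in range(n):
                    new[j] += d * pr[i] * teleport[j]
        pr = new
    return pr
```

Plugging a different similarity model into `sim` yields a different member of the algorithm family the abstract describes.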

6.
The rapid development of the Internet has led to exponential, explosive growth in the number of Web pages, and search engines have become an essential tool for finding information in this vast collection. An improved duplicate-elimination algorithm based on purified Web pages is proposed and compared with traditional algorithms. The algorithm combines the respective strengths of keyword search and signature (fingerprint) search to remove duplicates from search results. Experimental results show that the method performs well, improving both the recall and the precision of duplicate elimination.
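As an illustration of the signature (fingerprint) half of such a scheme, a minimal sketch that hashes normalized page text and drops exact duplicates; MD5 and simple whitespace/case normalization are illustrative choices, not the paper's method:

```python
import hashlib

def fingerprint(text):
    """Compute a signature for a page: hash its normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def dedup(pages):
    """Keep the first page seen for each distinct fingerprint."""
    seen, kept = set(), []
    for page in pages:
        fp = fingerprint(page)
        if fp not in seen:
            seen.add(fp)
            kept.append(page)
    return kept
```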

7.
王非  吴庆波  杨沙洲 《计算机工程》2009,35(21):247-249
Page ranking is one of the core technologies of search engines. This paper describes the need to build semantic search for Web 2.0 communities, analyzes the factors affecting page ranking, and adapts search-engine ranking algorithms to a search module for Web 2.0 communities. Based on improved TF/IDF and PageRank algorithms, a semantic-ranking search module is implemented on an open-source Web 2.0 community development platform. Test results show that the ranking algorithm locates content precisely and places effective results near the top.

8.
Research on Result Generation Techniques for Meta Search Engines
A meta Web search engine forwards a search request to the search engines it references, merges the results returned by those engines according to a result-merging algorithm, and returns the merged results to the user. The quality of the merging algorithm directly affects the meta search engine's precision, recall, and response speed. Based on an analysis of several common result-merging methods, this paper proposes several improved algorithms to improve the consistency of search results.
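The abstract does not specify the improved merging algorithms; a common baseline for result merging in meta search is the Borda count, sketched here for reference:

```python
def borda_merge(result_lists):
    """Merge ranked lists from multiple engines by Borda count:
    a URL at position p in a list of length m earns m - p points."""
    scores = {}
    for lst in result_lists:
        m = len(lst)
        for p, url in enumerate(lst):
            scores[url] = scores.get(url, 0) + (m - p)
    return sorted(scores, key=lambda u: -scores[u])
```

Positional weighting like this is one of the standard starting points that consistency-improving variants refine.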

9.
Content-Based Detection of Search Engine Spam Pages
Some Web pages deceive search engines to boost their ranking in search results and thereby increase their traffic; such pages are called "search engine spam pages" or simply "spam pages". We treat the detection of spam pages as a classification problem and use the C4.5 algorithm to build a decision-tree classifier that divides pages into two classes, normal and spam. Experiments show that our classification model can effectively detect search engine spam pages.
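The core of C4.5 is choosing splits by gain ratio rather than raw information gain; a minimal sketch of that criterion, with feature values assumed discrete as in a simple content-feature setting:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain_ratio(feature, labels):
    """C4.5's split criterion: information gain divided by split information."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond
    split_info = entropy(feature)
    return gain / split_info if split_info > 0 else 0.0
```

A tree builder would pick, at each node, the content feature with the highest gain ratio.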

10.
Search engines are currently an important part of the Internet, but the page summaries they present are static, so users cannot quickly grasp the main topic of a page's content from its summary. This paper applies automatic summarization techniques in the open-source search engine Nutch to generate page summaries, helping users verify the accuracy of search results more quickly.

11.
Ranking Keyword Search Results over XML Document Collections
This paper studies the ranking of content-only keyword search results over XML document collections and, based on the characteristics of XML documents, proposes a new ranking model that differs from the ranking of search results for Web-oriented XML pages. An inverted-list index structure and a search-engine architecture that support this ranking model are designed.

12.
With the development of Web technology and the ever-growing volume of information on the Web, providing high-quality, relevant query results has become a major challenge for Web search engines. PageRank and HITS are the two most important link-based ranking algorithms and are used in commercial search engines. However, in PageRank each page's PR value is distributed evenly among all the pages it links to, completely ignoring quality differences among pages, which makes the algorithm easy to attack with Web spam. Based on this observation, an improvement to PageRank called Page Quality Based PageRank (QPR) is proposed. QPR dynamically evaluates the quality of each page and distributes each page's PR value fairly according to page quality. Comprehensive experiments on several datasets with different characteristics show that QPR substantially improves the ranking of query results and effectively mitigates the influence of spam pages.
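The abstract describes QPR only at a high level; the following is a hedged sketch of the idea, distributing each page's PR mass among its out-links in proportion to an externally supplied per-page quality score (how that quality score is computed is assumed given, since the paper's estimator is not reproduced here):

```python
def quality_pagerank(links, quality, d=0.85, iters=100):
    """PageRank variant: page i's PR mass is split among its out-links
    in proportion to each target's quality score, not evenly."""
    n = len(links)
    pr = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - d) / n] * n
        for i, outs in enumerate(links):
            if not outs:  # dangling page: spread mass uniformly
                for j in range(n):
                    new[j] += d * pr[i] / n
                continue
            qsum = sum(quality[j] for j in outs)
            for j in outs:
                new[j] += d * pr[i] * quality[j] / qsum
        pr = new
    return pr
```

A spam page with a low quality score receives a correspondingly small share of its in-neighbors' PR mass, which is the mitigation effect the abstract reports.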

13.
In this article we first explain the knowledge extraction (KE) process from the World Wide Web (WWW) using search engines. Then we explore the PageRank algorithm of the Google search engine (a well-known link-based search engine) along with its hidden Markov analysis. We also explore one of the problems of link-based ranking algorithms, called hanging or dangling pages: pages without any forward links. The presence of these pages affects the ranking of Web pages, and some hanging pages may contain important information that cannot be neglected by the search engine during ranking. We propose and compare methodologies to handle hanging pages. We also introduce the TrustRank algorithm (an algorithm to handle spamming problems in link-based search engines) and incorporate it into our proposed methods so that they can combat Web spam. We implemented the PageRank and TrustRank algorithms and modified them to realize our proposed methodologies.
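As a reference point for the TrustRank component, here is a minimal sketch of trust propagation from a hand-picked seed set; returning a hanging page's mass to the seeds is one possible policy for the dangling-page problem discussed above, not necessarily the authors' method:

```python
def trustrank(links, trusted, d=0.85, iters=100):
    """TrustRank sketch: like PageRank, but random jumps land only on a
    hand-picked trusted seed set, so trust propagates from good pages."""
    n = len(links)
    seed = [1.0 / len(trusted) if i in trusted else 0.0 for i in range(n)]
    t = seed[:]
    for _ in range(iters):
        new = [(1 - d) * seed[i] for i in range(n)]
        for i, outs in enumerate(links):
            if outs:
                for j in outs:
                    new[j] += d * t[i] / len(outs)
            else:  # hanging page: return its mass to the trusted seeds
                for j in range(n):
                    new[j] += d * t[i] * seed[j]
        t = new
    return t
```

Pages unreachable from the seed set end up with zero trust, which is exactly how the scheme penalizes likely spam.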

14.
Time plays an important role in Web search, because most Web pages contain temporal information and many Web queries are time-related. How to integrate temporal information into Web search engines has been a research focus in recent years, yet traditional search engines offer little support for processing temporal-textual Web queries. Aiming to solve this problem, in this paper we concentrate on extracting the focused time of Web pages, i.e., the most appropriate time associated with a page, and then use the focused time to improve search efficiency for time-sensitive queries. Three critical issues are studied in depth. The first is extracting implicit temporal expressions from Web pages; the second is determining the focused time among all the extracted temporal information; and the last is integrating the focused time into a search engine. For the first issue, we propose a new dynamic approach to resolve implicit temporal expressions in Web pages. For the second, we present a score model to determine the focused time of Web pages; our model takes into account both the frequency of temporal information in a page and the containment relationships among temporal expressions. For the third, we combine the textual similarity and the temporal similarity between queries and documents in the ranking process. To evaluate the effectiveness and efficiency of the proposed approaches, we build a prototype system called the Time-Aware Search Engine (TASE). TASE extracts both explicit and implicit temporal expressions from Web pages, calculates a relevance score between each page and each temporal expression, and re-ranks search results based on the temporal-textual relevance between pages and queries. Finally, we conduct experiments on real data sets. The results show that our approach has high accuracy in resolving implicit temporal expressions and extracting focused time, and achieves better ranking effectiveness for time-sensitive Web queries than competing algorithms.
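The paper's score model is not reproduced in the abstract; as a toy sketch of the final ranking step, one can blend textual similarity with a temporal similarity that decays with the gap between the query time and a page's focused time (the linear blend, the exponential decay, and the parameters lam and decay are all assumptions for illustration):

```python
def temporal_textual_score(text_sim, query_year, focused_year,
                           lam=0.6, decay=0.5):
    """Blend textual similarity with a temporal similarity that decays
    exponentially with the gap between query time and focused time."""
    temporal_sim = decay ** abs(query_year - focused_year)
    return lam * text_sim + (1 - lam) * temporal_sim
```

Under this blend, two pages with equal textual relevance are ordered by how close their focused time is to the time the query asks about.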

15.
Most Web pages contain location information, which is usually neglected by traditional search engines. Queries combining location and textual terms are called spatial textual Web queries. Given that traditional search engines pay little attention to the location information in Web pages, in this paper we study a framework that utilizes location information for Web search. The proposed framework consists of an offline stage that extracts focused locations for crawled Web pages and an online stage that performs location-aware ranking of search results. The focused locations of a Web page are the most appropriate locations associated with that page. In the offline stage, we extract the focused locations and keywords from Web pages and map each keyword to specific focused locations, forming a set of <keyword, location> pairs. In the online query-processing stage, we extract keywords from the query and compute ranking scores based on location relevance and the location-constrained scores for each query keyword. Experiments on real datasets crawled from nj.gov, the BBC, and The New York Times show that our focused-location extraction outperforms previous methods and that the proposed ranking algorithm performs best across different spatial textual queries.

16.
This article presents an algorithm to determine the relevancy of hanging pages in link-structure-based ranking algorithms. Hanging pages have recently become major obstacles in Web information retrieval (IR). As an increasing number of meaningful hanging pages appear on the Web, their relevancy has to be determined with respect to the query term in order to make search engine result pages fair and relevant. Excluding these pages from the ranking calculation can give biased or inconsistent results, but including them reduces speed significantly. Most IR ranking algorithms exclude hanging pages, yet there are relevant and important hanging pages on the Web that cannot be ignored. In our proposed method, we use anchor text to determine the relevancy of hanging pages, and we use stability analysis to show that the rank results are consistent before and after altering the link structure.

17.
Keyword queries have long been popular with search engines and the information retrieval community, and have recently gained momentum in the expert systems community. The conventional semantics for processing a user query is to find a set of top-k Web pages such that each page contains all the user's keywords. Recently, this semantics has been extended to find a set of cohesively interconnected pages, each of which contains one of the query keywords scattered across these pages. A keyword query with this extended semantics (i.e., more than a list of keywords hyperlinked with each other) is referred to as a graph query. In a graph query, all the query keywords may not be present on a single Web page, so a set of Web pages with the corresponding hyperlinks needs to be presented as the search result. Existing search systems reveal serious performance problems due to their failure to integrate information from multiple connected resources, so an efficient algorithm for keyword queries over graph-structured data is proposed. It integrates information from multiple connected nodes of the graph and generates result trees containing all the query keywords. We also investigate a ranking measure called the graph ranking score (GRS) to evaluate relevant graph results, so that the score yields a scalar value for the keywords as well as for the topology.

18.
Ranking Web pages to present the most relevant pages for users' queries is one of the main issues in any search engine. In this paper, two new ranking algorithms are offered, using Reinforcement Learning (RL) concepts. RL is a powerful technique of modern artificial intelligence that tunes an agent's parameters interactively. In the first step, formulating ranking as an RL problem, a new connectivity-based ranking algorithm, called RL_Rank, is proposed. In RL_Rank, the agent is considered a surfer who travels between Web pages by clicking randomly on a link in the current page. Each Web page is considered a state, and the value function of a state is used to determine the score of that state (page). The reward corresponds to the number of out-links from the current page. Rank scores in RL_Rank are computed recursively, and the convergence of these scores is proved. In the next step, we introduce a new hybrid approach combining BM25, a content-based algorithm, with RL_Rank. Both proposed algorithms are evaluated on well-known benchmark datasets and analyzed according to the relevant criteria. Experimental results show that using RL concepts leads to significant improvements in ranking algorithms.
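Following the abstract's description (pages as states, reward equal to the out-link count, a surfer clicking a random link), here is a minimal value-iteration sketch of an RL_Rank-style score; the discount factor gamma and the treatment of dangling pages are assumed parameters, not details from the paper:

```python
def rl_rank(links, gamma=0.9, iters=200):
    """Each page is a state; its reward is its out-link count; the surfer
    clicks a uniformly random link. The converged state value is the score."""
    n = len(links)
    v = [0.0] * n
    for _ in range(iters):
        new = []
        for i, outs in enumerate(links):
            reward = len(outs)
            if outs:
                # Bellman update: reward plus discounted expected next value
                new.append(reward + gamma * sum(v[j] for j in outs) / len(outs))
            else:
                new.append(reward)  # dangling page: no future value
        v = new
    return v
```

Because gamma < 1 the update is a contraction, so the scores converge, matching the convergence claim in the abstract.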

19.
In Web search, with the aid of related-query recommendation, users can revise their initial queries over several rounds in pursuit of the Web pages they need. In this paper, we address the problem of aggregating the search results of related queries to improve retrieval quality. Given an initial query and the suggested related queries, our search system concurrently processes their result lists from an existing search engine and then forms a single list aggregated from all the retrieved lists. We propose a generic rank aggregation framework consisting of three steps. First, we build a so-called Win/Loss graph of Web pages according to a competition rule; we then apply a random-walk mechanism on the Win/Loss graph; finally, we sort the Web pages by their ranks using a PageRank-like mechanism. In calculating the ranking of Web page items, the proposed framework considers not only the number of wins an item earned in competitions but also the quality of its competitors. Experimental results show that our search system can clearly improve retrieval quality in a parallel manner over the traditional strategy of serially returning result lists. Moreover, we provide empirical evidence demonstrating how different rank aggregation methods affect retrieval quality.
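A hedged sketch of the three-step framework as described: build a Win/Loss graph from pairwise ranking competitions, add loser-to-winner edges, and run a PageRank-like random walk so that wins against strong competitors count more (the majority-wins competition rule is an assumption, since the paper's exact rule is not given in the abstract):

```python
def winloss_aggregate(result_lists, d=0.85, iters=100):
    """Aggregate ranked lists via a Win/Loss graph plus a random walk."""
    pages = sorted({p for lst in result_lists for p in lst})
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    wins = [[0] * n for _ in range(n)]   # wins[a][b]: a ranked above b
    for lst in result_lists:
        for i, a in enumerate(lst):
            for b in lst[i + 1:]:
                wins[idx[a]][idx[b]] += 1
    # loser -> winner edges, so walk mass flows toward winners
    out = [[] for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if wins[a][b] > wins[b][a]:
                out[b].append(a)
    pr = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - d) / n] * n
        for i, outs in enumerate(out):
            if outs:
                for j in outs:
                    new[j] += d * pr[i] / len(outs)
            else:  # undefeated page: spread its mass uniformly
                for j in range(n):
                    new[j] += d * pr[i] / n
        pr = new
    return sorted(pages, key=lambda p: -pr[idx[p]])
```

Because a page's score depends on the scores of the pages it beat, beating a strong competitor contributes more than beating a weak one, as the abstract requires.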


Copyright©北京勤云科技发展有限公司  京ICP备09084417号