Similar References
1.
With the explosive growth of the World Wide Web, it is becoming increasingly difficult for users to discover Web pages that are relevant to a topic. To address this problem we are developing a system that allows the collection and analysis of Web pages related to a particular topic. In this paper we present the system's overall architecture and introduce the focused crawler used by the system. We also discuss the various techniques we use to allow the user to analyze and gain useful insights about a collection. Finally, we present some statistics on the collections.

2.
Given a user keyword query, current Web search engines return a list of individual Web pages ranked by their "goodness" with respect to the query. Thus, the basic unit for search and retrieval is an individual page, even though information on a topic is often spread across multiple pages. This degrades the quality of search results, especially for long or uncorrelated (multitopic) queries (in which individual keywords rarely occur together in the same document), where a single page is unlikely to satisfy the user's information need. We propose a technique that, given a keyword query, on the fly generates new pages, called composed pages, which contain all query keywords. The composed pages are generated by extracting and stitching together relevant pieces from hyperlinked Web pages and retaining links to the original Web pages. To rank the composed pages, we consider both the hyperlink structure of the original pages and the associations between the keywords within each page. Furthermore, we present and experimentally evaluate heuristic algorithms to efficiently generate the top composed pages. The quality of our method is compared to current approaches by using user surveys. Finally, we also show how our techniques can be used to perform query-specific summarization of Web pages.
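To make the composed-page idea concrete, here is a small Python sketch that stitches together keyword-bearing fragments from several pages while keeping a link back to each source. The sentence-level fragmentation, the `compose_page` helper and the sample pages are illustrative assumptions, not the authors' actual extraction or ranking method.

```python
# Sketch only: stitch keyword-bearing sentences from several pages into one
# "composed page", keeping a back-link to each source. Not the paper's method.
import re

def compose_page(query_keywords, pages):
    """pages: dict url -> page text. Returns stitched fragments covering the keywords."""
    fragments = []
    for url, text in pages.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if any(kw.lower() in sentence.lower() for kw in query_keywords):
                fragments.append(f"{sentence.strip()} [source: {url}]")
    return "\n".join(fragments)

# Hypothetical pages used only to exercise the sketch.
pages = {
    "http://example.org/a": "The conference is held in Lisbon. Registration opens in May.",
    "http://example.org/b": "Keynote speakers discuss information retrieval. Lisbon hosts the venue.",
}
print(compose_page(["Lisbon", "retrieval"], pages))
```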

3.
The Web is a hypertext body of approximately 300 million pages that continues to grow at roughly a million pages per day. Page variation is more prodigious than the data's raw scale: taken as a whole, the set of Web pages lacks a unifying structure and shows far more authoring style and content variation than that seen in traditional text document collections. This level of complexity makes an “off-the-shelf” database management and information retrieval solution impossible. To date, index-based search engines for the Web have been the primary tool by which users search for information. Such engines can build giant indices that let you quickly retrieve the set of all Web pages containing a given word or string. Experienced users can make effective use of such engines for tasks that can be solved by searching for tightly constrained keywords and phrases. These search engines are, however, unsuited for a wide range of equally important tasks. In particular, a topic of any breadth will typically contain several thousand or million relevant Web pages. How then, from this sea of pages, should a search engine select the correct ones, those of most value to the user? Clever is a search engine that analyzes hyperlinks to uncover two types of pages: authorities, which provide the best source of information on a given topic; and hubs, which provide collections of links to authorities. We outline the thinking that went into Clever's design, report briefly on a study that compared Clever's performance to that of Yahoo and AltaVista, and examine how our system is being extended and updated.
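As a rough illustration of the hub/authority idea behind Clever, below is a minimal Python sketch of the iterative HITS-style computation. The toy link graph and the fixed iteration count are assumptions for illustration; this is not Clever's actual implementation, which also weights links using text analysis.

```python
# Minimal sketch of the hub/authority iteration underlying HITS-style analysis.
from math import sqrt

def hits(links, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q in links if p in links.get(q, [])) for p in pages}
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise so the scores do not grow without bound.
        a_norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

# Hypothetical toy graph: three hub pages pointing at two candidate authorities.
toy_graph = {"hub1": ["siteA", "siteB"], "hub2": ["siteA"], "hub3": ["siteA", "siteB"]}
authorities, hubs = hits(toy_graph)
print(sorted(authorities.items(), key=lambda kv: -kv[1]))
```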

4.
Seed URL selection for a focused Web crawler aims to guide the crawler toward related and valuable information that meets a user's personal information requirements and to provide more effective information retrieval. In this paper, we propose a seed URL selection approach based on a user-interest ontology. In order to enrich semantic queries, we first apply Formal Concept Analysis to construct a user-interest concept lattice from the user's log profile. By merging concept lattices, we construct the user-interest ontology, which can more appropriately describe the implicit concepts and the relationships between them for semantic representation and query matching. We then make full use of the user-interest ontology to extract the user's topic area of interest and to expand user queries, so as to retrieve the most related pages as seed URLs, which serve as the entry point of the focused crawler. In particular, we focus on how to refine the user topic area using a bipartite directed graph. Experiments show that the user-interest ontology can be built effectively by merging concept lattices, and that our proposed approach selects high-quality seed URL collections and improves the average precision of the focused Web crawler.

5.
A Topic-Sensitive Crawling Method Based on HITS (total citations: 2; self-citations: 0; by others: 2)
Topic-based information gathering is an emerging and practical approach in information retrieval: by restricting the downloaded pages to a specific topic area, it improves the efficiency of search engines and the quality of the information they provide. The idea is to selectively collect pages relevant to a predefined topic during crawling and to avoid downloading irrelevant pages, so as to find information useful to the user more accurately. This paper discusses several key issues in topic crawling and improves the topic relevance and quality of the downloaded pages by refining the topic model, the learning method of the link classification model, and the link analysis method. On this basis, a topic crawler system is designed and implemented, which uses topic-sensitive HITS to compute page priorities. Experiments show good results.

6.
Thousands of users issue keyword queries to Web search engines to find information on a number of topics. Since the users may have diverse backgrounds and may have different expectations for a given query, some search engines try to personalize their results to better match the overall interests of an individual user. This task involves two great challenges. First, the search engine needs to be able to effectively identify the user's interests and build a profile for every individual user. Second, once such a profile is available, the search engine needs to rank the results in a way that matches the interests of the given user. In this article, we present our work towards a personalized Web search engine and we discuss how we addressed each of these challenges. Since users are typically not willing to provide information on their personal preferences, for the first challenge, we attempt to determine such preferences by examining the click history of each user. In particular, we leverage a topical ontology for estimating a user's topic preferences based on her past searches, i.e. previously issued queries and pages visited for those queries. We then explore the semantic similarity between the user's current query and the query-matching pages, in order to identify the user's current topic preference. For the second challenge, we have developed a ranking function that uses the learned past and current topic preferences in order to rank the search results to better match the preferences of a given user. Our experimental evaluation on the Google query-stream of human subjects over a period of one month shows that user preferences can be learned accurately through the use of our topical ontology and that our ranking function, which takes into account the learned user preferences, yields significant improvements in the quality of the search results.
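To illustrate the second challenge, here is a hedged Python sketch of a personalised re-ranking function that blends a result's original query relevance with how well its topics match the user's learned topic preferences. The linear weighting, the cosine-based profile match, the alpha parameter and the sample profile are all assumptions for illustration, not the article's actual ranking function.

```python
# Sketch only: combine base query relevance with a match against the user's
# learned topic-preference vector. The weighting scheme is an assumption.
from math import sqrt

def cosine(u, v):
    common = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in common)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def personalised_score(base_relevance, page_topics, user_profile, alpha=0.5):
    # alpha trades off query relevance against profile match (assumed parameter).
    return alpha * base_relevance + (1 - alpha) * cosine(page_topics, user_profile)

# Hypothetical profile learned from click history (topic -> weight).
profile = {"databases": 0.7, "web search": 0.9, "sports": 0.1}
results = [
    ("pageA", 0.80, {"sports": 1.0}),
    ("pageB", 0.75, {"web search": 0.8, "databases": 0.2}),
]
reranked = sorted(results, key=lambda r: -personalised_score(r[1], r[2], profile))
print([name for name, _, _ in reranked])
```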

8.
The perception of the visual complexity of World Wide Web (Web) pages is a topic of significant interest. Previous work has examined the relationship between complexity and various aspects of presentation, including font styles, colours and images, but automatically quantifying this dimension of a web page at the level of the document remains a challenge. In this paper we demonstrate that areas of high complexity can be identified by detecting areas, or ‘chunks’, of a web page high in block-level elements. We report a computational algorithm that captures this metric and places web pages in a sequence that shows an 86% correlation with the sequences generated through user judgements of complexity. The work shows that structural aspects of a web page influence how complex a user perceives it to be, and presents a straightforward means of determining complexity through examining the DOM.
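As a rough illustration of measuring complexity from block-level structure, the Python sketch below computes the fraction of start tags in a page that are block-level elements, using only the standard library. The tag list, the density measure and the sample HTML are assumptions for illustration; the article's actual algorithm works on chunks of the page rather than a single whole-document ratio.

```python
# Sketch only: a crude proxy for structural complexity based on how densely a
# page uses block-level elements. Tag list and sample HTML are assumptions.
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "p", "table", "ul", "ol", "li", "section", "article",
              "header", "footer", "form", "blockquote", "h1", "h2", "h3"}

class BlockCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.total_tags = 0
        self.block_tags = 0

    def handle_starttag(self, tag, attrs):
        self.total_tags += 1
        if tag in BLOCK_TAGS:
            self.block_tags += 1

def block_density(html: str) -> float:
    """Fraction of start tags that are block-level elements."""
    counter = BlockCounter()
    counter.feed(html)
    return counter.block_tags / counter.total_tags if counter.total_tags else 0.0

simple = "<html><body><p>hello</p></body></html>"
busy = "<html><body>" + "<div><ul><li>x</li></ul><table><tr><td>y</td></tr></table></div>" * 5 + "</body></html>"
print(block_density(simple), block_density(busy))
```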

9.
A Deep Web Crawler Crawling Strategy Based on Keyword Relevance (total citations: 1; self-citations: 0; by others: 1)
田野  丁岳伟 《计算机工程》2008,34(15):220-222
The Deep Web contains rich, high-quality information resources. To obtain pages from a Deep Web site, users have to type in a series of keyword sets, and because there are no static links pointing directly to Deep Web pages, most current search engines cannot discover them. The Deep Web crawling strategy proposed in this paper can download Deep Web pages effectively. Since such a site only exposes a query interface, the main challenge in designing a Deep Web crawler is how to select the best query keywords so as to generate meaningful queries. Experiments show that the proposed selection method, which is based on the relevance weights of different keywords, is effective.
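Below is a minimal Python sketch of the general idea of ranking candidate query keywords by a relevance weight and issuing the highest-weighted ones first. The weighting used here (frequency in pages already harvested for the topic) and the sample data are assumptions for illustration; the paper's actual relevance weighting may differ.

```python
# Sketch only: weight candidate query keywords by how often they occur in pages
# already harvested for the topic, then issue the top-weighted ones first.
from collections import Counter

def keyword_weights(harvested_docs, candidates):
    counts = Counter()
    for doc in harvested_docs:
        tokens = doc.lower().split()
        for kw in candidates:
            counts[kw] += tokens.count(kw)
    total = sum(counts.values()) or 1
    return {kw: counts[kw] / total for kw in candidates}

def next_queries(harvested_docs, candidates, k=3):
    weights = keyword_weights(harvested_docs, candidates)
    return sorted(candidates, key=lambda kw: -weights[kw])[:k]

# Hypothetical already-harvested snippets used only to exercise the sketch.
docs = ["deep web databases hidden behind query forms",
        "query interfaces expose structured databases on the deep web"]
print(next_queries(docs, ["deep", "query", "databases", "sports"]))
```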

10.
Inaccessible Web pages and 404 “Page Not Found” responses are a common Web phenomenon and a detriment to the user’s browsing experience. The rediscovery of missing Web pages is, therefore, a relevant research topic in the digital preservation as well as in the Information Retrieval realm. In this article, we bring these two areas together by analyzing four content- and link-based methods to rediscover missing Web pages. We investigate the retrieval performance of the methods individually as well as their combinations and give an insight into how effective these methods are over time. As the main result of this work, we are able to recommend not only the best performing methods but also the sequence in which they should be applied, based on their performance, complexity required to generate them, and evolution over time. Our least complex single method results in a rediscovery rate of almost 70% of Web pages of our sample dataset based on URIs sampled from the Open Directory Project (DMOZ). By increasing the complexity level and combining three different methods, our results show an increase of the success rate of up to 77%. The results, based on our sample dataset, indicate that Web pages are often not completely lost but have moved to a different location and “just” need to be rediscovered.

11.
Understanding the content of a Web page and navigating within and between pages are crucial tasks for any Web user. To those who are accessing pages through non-visual means, such as screen readers, the challenges offered by these tasks are not easily overcome, even when pages are unchanging documents. The advent of ‘Web 2.0’ and Web applications, however, means that documents often are not static, but update, either automatically or due to user interaction. This development poses a difficult question for screen reader designers: how should users be notified of page changes? In this article we introduce rules for presenting such updates, derived from studies of how sighted users interact with them. An implementation of the rules has been evaluated, showing that blind and visually impaired users found updates presented in this way easier to deal with than the relatively quiet manner in which current screen readers often present them.

12.
A Topic-Oriented WWW Information Mining System (total citations: 3; self-citations: 0; by others: 3)
余晨  顾毓清 《计算机科学》2003,30(2):158-160
1 Overview. The WWW is growing at an incredible speed and is gradually becoming the main platform on which people publish and obtain information. Although a large amount of information can be obtained from the WWW, the information on it is unstructured, dynamic and distributed, so efficiently extracting useful information from the WWW remains a challenging problem. The wide use of search engines (such as Excite, Google and Alta Vista) has greatly improved the efficiency with which people retrieve information. A search engine works as follows: a crawler collects as many …

13.
Web users can use browser bookmarks (favorites) to save Web pages and quickly revisit their content. Research on bookmark-based user behavior can guide work on user personalization, Web page quality assessment, and the construction of large-scale Web directories. Using bookmark data from nearly 270,000 users, this paper studies user bookmarking behavior from three aspects: organizational structure, bookmarked content, and user interests. First, we propose a bookmark browsing-and-click model and analyze the structural characteristics and usage efficiency of bookmark folders. Second, by comparing against PageRank values, we find that users tend to bookmark high-quality Web resources. Finally, we analyze the interest distribution of bookmark users with reference to ODP.

14.
Focused crawlers have as their main goal to crawl Web pages that are relevant to a specific topic or user interest, playing an important role for a great variety of applications. In general, they work by trying to find and crawl all kinds of pages deemed as related to an implicitly declared topic. However, users are often not simply interested in any document about a topic; instead, they may want only documents of a given type or genre on that topic to be retrieved. In this article, we describe an approach to focused crawling that exploits not only content-related information but also genre information present in Web pages to guide the crawling process. This approach has been designed to address situations in which the specific topic of interest can be expressed by specifying two sets of terms, the first describing genre aspects of the desired pages and the second related to the subject or content of these pages, thus requiring no training or any kind of preprocessing. The effectiveness, efficiency and scalability of the proposed approach are demonstrated by a set of experiments involving the crawling of pages related to syllabi of computer science courses, job offers in the computer science field and sale offers of computer equipment. These experiments show that focused crawlers constructed according to our genre-aware approach achieve F1 levels above 88%, requiring the analysis of no more than 65% of the visited pages in order to find 90% of the relevant pages. In addition, we experimentally analyze the impact of term selection on our approach and evaluate a proposed strategy for semi-automatic generation of such terms. This analysis shows that a small set of terms selected by an expert, or a set of terms specified by a typical user familiar with the topic, is usually enough to produce good results, and that such a semi-automatic strategy is very effective in supporting the task of selecting the sets of terms required to guide a crawling process.
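The two-term-set idea can be illustrated with a short Python sketch that scores a fetched page against a genre term set and a content term set and keeps it only when both scores clear a threshold. The scoring, the thresholds and the example terms are assumptions for illustration, not the values or formulas used in the article.

```python
# Sketch only: score a page against genre terms and content terms separately,
# and treat it as relevant only when both scores pass a threshold.
import re

def term_score(text, terms):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sum(tokens.count(t) for t in terms) / (len(tokens) or 1)

def is_relevant(page_text, genre_terms, content_terms,
                genre_threshold=0.01, content_threshold=0.01):
    return (term_score(page_text, genre_terms) >= genre_threshold and
            term_score(page_text, content_terms) >= content_threshold)

# Example: crawling for syllabi (genre) of computer science courses (content).
genre_terms = {"syllabus", "schedule", "grading", "lecture"}
content_terms = {"algorithms", "programming", "database", "operating"}
page = "CS101 syllabus: lecture schedule, grading policy, introduction to programming and algorithms"
print(is_relevant(page, genre_terms, content_terms))
```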

15.
In this study, we experiment with several multiobjective evolutionary algorithms to determine a suitable approach for clustering Web user sessions, which consist of sequences of Web pages visited by the users. Our experimental results show that the multiobjective evolutionary algorithm-based approaches are successful for sequence clustering. We look at a commonly used cluster validity index to verify our findings. The results for this index indicate that the clustering solutions are of high quality. As a case study, the obtained clusters are then used in a Web recommender system for representing usage patterns. As a result of the experiments, we see that these approaches can successfully be applied for generating clustering solutions that lead to a high recommendation accuracy in the recommender model we used in this paper.

16.
Web Spam is one of the main difficulties that crawlers have to overcome and therefore one of the main problems of the WWW. There are several studies about characterising and detecting Web Spam pages. However, none of them deals with all the possible kinds of Web Spam. This paper presents an analysis of different kinds of Web Spam pages and identifies new elements that characterise them, in order to define heuristics which are able to partially detect them. We also discuss and explain several heuristics from the point of view of their effectiveness and computational efficiency. Taking them into account, we study several sets of heuristics and demonstrate how they improve the current results. Finally, we propose a new Web Spam detection system called SAAD (Spam Analyzer And Detector), which is based on the set of proposed heuristics and their use in a C4.5 classifier improved by means of Bagging and Boosting techniques. We have also tested our system on some well known Web Spam datasets and we have found it to be very effective.
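As a rough sketch of this kind of classifier, the Python example below trains a bagged decision tree on per-page heuristic features using scikit-learn. Note that scikit-learn's tree is CART rather than C4.5, and the feature names and toy data are made up for illustration; SAAD's actual features and datasets differ.

```python
# Sketch only: a decision tree (CART, standing in for C4.5) wrapped in bagging,
# trained on hypothetical per-page heuristic features.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Hypothetical heuristic features per page:
# [keyword-stuffing ratio, fraction of hidden text, outlink count / 100, URL length / 100]
X = [
    [0.40, 0.30, 0.90, 0.80],  # spam-like
    [0.35, 0.25, 0.70, 0.75],  # spam-like
    [0.05, 0.00, 0.10, 0.20],  # normal
    [0.02, 0.01, 0.15, 0.25],  # normal
]
y = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

model = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=10, random_state=0)
model.fit(X, y)
print(model.predict([[0.38, 0.28, 0.85, 0.70], [0.03, 0.00, 0.12, 0.22]]))
```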

17.
周红芳  冯博琴 《计算机工程》2007,33(18):40-41,4
Analyzing the Hyperlink-Induced Topic Search (HITS) algorithm from the perspective of semantic relevance, we find that topic drift arises because pages are projected onto the wrong semantic basis. We therefore propose a fuzzy-set-based topic extraction and hierarchy discovery algorithm (FSTH), which expands query terms using user logs and constructs a personalized root set and base set that match the user's needs, thereby preventing topic drift. FSTH uses fuzzy-set partitioning to hierarchically discover sets of topic pages relevant to the user's query, applies the HITS algorithm to compute authority values for the pages in each topic set, and returns authoritative pages on other query-related topics. Experimental results on 14 queries show that, compared with HITS, FSTH not only reduces the topic drift rate by 7% to 53% but can also discover multiple topics relevant to the query.

18.
The Web is a huge network composed of Web pages and hyperlinks. It is often reported that related Web pages are densely linked with each other. Finding groups of such related pages, which are called Web communities, is important for information retrieval from the Web. Several attempts have been made at the discovery of Web communities, such as Kumar's trawling and Flake's method. In addition to communities of related Web pages, there are communities of users sharing common interests. Finding the latter communities, which we call user communities in this paper, is also important for clarifying the behaviors of Web users. It is expected that the characteristics of user communities in the Web correspond to those in real human communities. A method for discovering user communities is described in this paper. Client-level log data (Web audience measurement data) is used as the record of users' Web browsing behavior. Maximal complete bipartite graphs are searched for in the term-user graph obtained from the log data, without analyzing the contents of Web pages. Experimental results show that our method succeeds in discovering many interesting user communities with labels that characterize the communities.
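To give a flavour of the bipartite-graph idea, here is a simplified Python sketch: in a term-user graph, any set of terms together with the users who used all of them forms a complete bipartite subgraph, and sufficiently large ones suggest a user community. The brute-force enumeration and the toy data below are assumptions for illustration; practical maximal-biclique discovery (as in trawling) uses far more efficient algorithms.

```python
# Sketch only: brute-force search for complete bipartite subgraphs in a tiny
# term-user graph. Data is made up; real biclique mining is more sophisticated.
from itertools import combinations

term_users = {
    "football": {"u1", "u2", "u3"},
    "league":   {"u1", "u2", "u3"},
    "tickets":  {"u2", "u3"},
    "opera":    {"u4"},
}

def candidate_communities(term_users, min_terms=2, min_users=2):
    found = []
    terms = list(term_users)
    for size in range(min_terms, len(terms) + 1):
        for combo in combinations(terms, size):
            shared = set.intersection(*(term_users[t] for t in combo))
            if len(shared) >= min_users:
                found.append((combo, shared))
    return found

for terms, users in candidate_communities(term_users):
    print(terms, sorted(users))
```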

19.
Cellary  W. Wiza  W. Walczak  K. 《Computer》2004,37(5):87-89
The exponential growth in Web sites is making it increasingly difficult to extract useful information on the Internet using existing search engines. Despite a wide range of sophisticated indexing and data retrieval features, search engines often deliver satisfactory results only when users know precisely what they are looking for. Traditional textual interfaces present results as a list of links to Web pages. Because most users are unwilling to explore an extensive list, search engines arbitrarily reduce the number of links returned, aiming also to provide quick response times. Moreover, their proprietary ranking algorithms often do not reflect individual user preferences. Those who need comprehensive general information about a topic or have vague initial requirements instead want a holistic presentation of data related to their queries. To address this need, we have developed Periscope, a 3D search result visualization system that displays all the Web pages found in a synthetic, yet comprehensible format.

20.
Aggregating and retrieving topic-specific resource pages on the Internet is a core problem for vertical search engines. HITS is a classic early algorithm for this problem, and although many papers have improved it, both the topic relevance rate of the index and the precision of the engine still leave room for improvement. This paper proposes a HITS-based topical retrieval strategy that filters on anchor text and title information and combines this with judgments of page-content relevance; topic relevance is judged using a topical training set, which overcomes the drawback of relying on the query string alone. Experiments show that this strategy markedly improves the precision of topical information aggregation and the accuracy of retrieval, while reducing the number of irrelevant URLs downloaded.
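A minimal Python sketch of the filtering step described above: discard candidate links whose anchor text and title contain none of the topic terms before any more expensive content-based relevance check. The topic terms, the `passes_filter` helper and the candidate list are illustrative assumptions, not the paper's actual strategy.

```python
# Sketch only: cheap anchor-text/title filter applied before content analysis.
def passes_filter(anchor_text, title, topic_terms):
    text = f"{anchor_text} {title}".lower()
    return any(term in text for term in topic_terms)

# Hypothetical topic terms and candidate (anchor text, title) pairs.
topic_terms = {"crawler", "search engine", "information retrieval"}
candidates = [
    ("Focused crawler tutorial", "Building a topical crawler"),
    ("Holiday photos", "My trip to the beach"),
]
to_download = [c for c in candidates if passes_filter(c[0], c[1], topic_terms)]
print(to_download)
```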
