20 similar documents found; search took 46 ms
1.
With the explosive growth of the World Wide Web, it is becoming increasingly difficult for users to discover Web pages that are relevant to a topic. To address this problem we are developing a system that allows the collection and analysis of Web pages related to a particular topic. In this paper we present the system's overall architecture and introduce the focused crawler used by the system. We also discuss the various techniques we use to allow the user to analyze and gain useful insights about a collection. Finally, we present some statistics on the collections.
2.
Varadarajan R. Hristidis V. Tao Li 《Knowledge and Data Engineering, IEEE Transactions on》2008,20(3):411-424
Given a user keyword query, current Web search engines return a list of individual Web pages ranked by their "goodness" with respect to the query. Thus, the basic unit for search and retrieval is an individual page, even though information on a topic is often spread across multiple pages. This degrades the quality of search results, especially for long or uncorrelated (multitopic) queries (in which individual keywords rarely occur together in the same document), where a single page is unlikely to satisfy the user's information need. We propose a technique that, given a keyword query, generates new pages on the fly, called composed pages, which contain all query keywords. The composed pages are generated by extracting and stitching together relevant pieces from hyperlinked Web pages and retaining links to the original Web pages. To rank the composed pages, we consider both the hyperlink structure of the original pages and the associations between the keywords within each page. Furthermore, we present and experimentally evaluate heuristic algorithms to efficiently generate the top composed pages. The quality of our method is compared to current approaches by using user surveys. Finally, we also show how our techniques can be used to perform query-specific summarization of Web pages.
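The composed-page idea can be illustrated with a small sketch: greedily pick text fragments from hyperlinked pages until every query keyword is covered, then stitch them together while keeping links back to the source pages. The `Fragment` structure, the word-overlap scoring and the greedy selection are illustrative assumptions, not the ranking algorithm evaluated in the paper.

```python
# Hypothetical sketch of "composed page" assembly: greedily pick text
# fragments from hyperlinked pages until every query keyword is covered.
from dataclasses import dataclass

@dataclass
class Fragment:
    url: str      # page the fragment was extracted from
    text: str     # fragment text

def compose_page(query_keywords, fragments):
    remaining = {k.lower() for k in query_keywords}
    chosen = []
    while remaining:
        # pick the fragment that covers the most still-missing keywords
        best = max(fragments,
                   key=lambda f: len(remaining & set(f.text.lower().split())),
                   default=None)
        if best is None or not (remaining & set(best.text.lower().split())):
            break  # no fragment covers any remaining keyword
        chosen.append(best)
        remaining -= set(best.text.lower().split())
    # stitch fragments together, keeping links back to the source pages
    return "\n".join(f"{f.text} [source: {f.url}]" for f in chosen)

frags = [Fragment("http://a.example", "database indexing techniques"),
         Fragment("http://b.example", "distributed crawling of the web")]
print(compose_page(["indexing", "crawling"], frags))
```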
3.
Chakrabarti S. Dom B.E. Kumar S.R. Raghavan P. Rajagopalan S. Tomkins A. Gibson D. Kleinberg J. 《Computer》1999,32(8):60-67
The Web is a hypertext body of approximately 300 million pages that continues to grow at roughly a million pages per day. Page variation is more prodigious than the data's raw scale: taken as a whole, the set of Web pages lacks a unifying structure and shows far more authoring style and content variation than that seen in traditional text document collections. This level of complexity makes an “off-the-shelf” database management and information retrieval solution impossible. To date, index-based search engines for the Web have been the primary tool by which users search for information. Such engines can build giant indices that let you quickly retrieve the set of all Web pages containing a given word or string. Experienced users can make effective use of such engines for tasks that can be solved by searching for tightly constrained keywords and phrases. These search engines are, however, unsuited for a wide range of equally important tasks. In particular, a topic of any breadth will typically contain several thousand or million relevant Web pages. How then, from this sea of pages, should a search engine select the correct ones, those of most value to the user? Clever is a search engine that analyzes hyperlinks to uncover two types of pages: authorities, which provide the best source of information on a given topic; and hubs, which provide collections of links to authorities. We outline the thinking that went into Clever's design, report briefly on a study that compared Clever's performance to that of Yahoo and AltaVista, and examine how our system is being extended and updated.
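The hub/authority intuition behind Clever can be sketched with the basic HITS iteration on a toy link graph; the production system described in the article adds considerably more machinery, which is omitted here.

```python
# Minimal illustration of the hub/authority idea behind HITS-style link
# analysis: iterate mutual reinforcement between hub and authority scores.
def hits(links, iterations=50):
    """links: dict mapping page -> list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a good authority is pointed to by good hubs
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # a good hub points to good authorities
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # normalise to keep the scores bounded
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

links = {"portal": ["paperA", "paperB"], "blog": ["paperA"], "paperA": [], "paperB": []}
hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True))  # paperA ranks as the top authority
```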
4.
Seed URL selection for a focused Web crawler aims to guide the crawler toward related and valuable information that meets a user's personal information needs and to provide more effective information retrieval. In this paper, we propose a seed URL selection approach based on a user-interest ontology. In order to enrich the semantic query, we first apply Formal Concept Analysis to construct a user-interest concept lattice from the user's log profile. By merging concept lattices, we construct the user-interest ontology, which can describe the implicit concepts and the relationships between them more appropriately for semantic representation and query matching. We then make full use of the user-interest ontology to extract the user's interest topic area and to expand user queries so as to retrieve the most related pages as seed URLs, which form the entrance of the focused crawler. In particular, we focus on how to refine the user topic area using a bipartite directed graph. The experiments show that the user-interest ontology can be built effectively by merging concept lattices and that our proposed approach selects a high-quality seed URL collection and improves the average precision of the focused Web crawler.
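A hedged sketch of just the seed-selection step: expand the user's query with related terms taken from a toy user-interest ontology and treat the top results of an ordinary keyword search as seed URLs. The ontology construction via Formal Concept Analysis and concept-lattice merging is not shown, and `search()` is a hypothetical stand-in for a real search-engine API.

```python
# Illustrative sketch of seed URL selection: expand the query with terms
# related to the user's interests, then take the top-k search results as
# crawl seeds. The ontology here is a toy dict, not a real concept lattice.
def expand_query(query_terms, ontology):
    """ontology: dict mapping a concept to semantically related terms."""
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(ontology.get(term, []))
    return expanded

def select_seed_urls(query_terms, ontology, search, k=10):
    expanded = expand_query(query_terms, ontology)
    results = search(" ".join(expanded))   # ranked list of URLs (assumed)
    return results[:k]                     # top-k pages become crawl seeds

ontology = {"crawler": ["spider", "robots.txt"], "focused": ["topic-specific"]}
fake_search = lambda q: ["http://example.org/focused-crawling", "http://example.org/spiders"]
print(select_seed_urls(["focused", "crawler"], ontology, fake_search, k=2))
```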
5.
A Topic-Sensitive Crawling Method Based on HITS (cited 2 times: 0 self-citations, 2 by others)
Topic-based information gathering is an emerging and practical technique in the field of information retrieval: by restricting downloaded pages to a specific topic domain, it improves the efficiency of search engines and the quality of the information they provide. The idea is to selectively collect pages relevant to a predefined topic during crawling and to avoid downloading topic-irrelevant pages, with the goal of finding information useful to the user more accurately. This paper discusses several key issues of topic crawlers and improves the topic relevance and quality of the downloaded pages by improving the topic model, the learning method of the link classification model, and the link analysis method. On this basis, a topic crawler system was designed and implemented; the system uses topic-sensitive HITS to compute page priorities. Experiments show good results.
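A simplified sketch of the frontier management such a topic crawler needs: candidate URLs are crawled in order of a priority score. The cosine similarity between anchor/context text and a topic term vector used here is an illustrative stand-in; the system described above combines topic relevance with topic-sensitive HITS scores, which are omitted.

```python
# Simplified topic-focused crawl frontier: URLs with context text closer to
# the topic vector are dequeued first (max-heap via negated priority).
import heapq
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

class Frontier:
    def __init__(self, topic_terms):
        self.topic = Counter(topic_terms)
        self.heap = []

    def add(self, url, context_text):
        score = cosine(Counter(context_text.lower().split()), self.topic)
        heapq.heappush(self.heap, (-score, url))

    def next_url(self):
        return heapq.heappop(self.heap)[1] if self.heap else None

f = Frontier(["focused", "crawler", "topic"])
f.add("http://example.org/crawler", "a focused topic crawler tutorial")
f.add("http://example.org/cats", "pictures of cats")
print(f.next_url())  # the topically relevant URL is crawled first
```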
6.
Thousands of users issue keyword queries to Web search engines to find information on a number of topics. Since the users may have diverse backgrounds and may have different expectations for a given query, some search engines try to personalize their results to better match the overall interests of an individual user. This task involves two major challenges. First, the search engines need to be able to effectively identify the user interests and build a profile for every individual user. Second, once such a profile is available, the search engines need to rank the results in a way that matches the interests of a given user. In this article, we present our work towards a personalized Web search engine and we discuss how we addressed each of these challenges. Since users are typically not willing to provide information on their personal preferences, for the first challenge, we attempt to determine such preferences by examining the click history of each user. In particular, we leverage a topical ontology for estimating a user’s topic preferences based on her past searches, i.e. previously issued queries and pages visited for those queries. We then explore the semantic similarity between the user’s current query and the query-matching pages, in order to identify the user’s current topic preference. For the second challenge, we have developed a ranking function that uses the learned past and current topic preferences in order to rank the search results to better match the preferences of a given user. Our experimental evaluation on the Google query-stream of human subjects over a period of 1 month shows that user preferences can be learned accurately through the use of our topical ontology and that our ranking function which takes into account the learned user preferences yields significant improvements in the quality of the search results.
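A minimal, hedged sketch of the re-ranking step: blend a result's query relevance with how well its topics match the user's learned topic preferences. The blending weight, the topic labels and the additive formula are illustrative assumptions rather than the article's exact ranking function.

```python
# Toy personalized re-ranking: final score mixes base relevance with the
# user's learned topic preferences (weights are illustrative assumptions).
def personalized_score(base_relevance, result_topics, user_prefs, alpha=0.6):
    """user_prefs: dict topic -> preference weight learned from click history."""
    pref_match = sum(user_prefs.get(t, 0.0) for t in result_topics)
    return alpha * base_relevance + (1 - alpha) * pref_match

user_prefs = {"programming": 0.7, "travel": 0.1}   # learned from past queries/clicks
results = [
    {"url": "http://example.org/python-jaguar", "rel": 0.8, "topics": ["programming"]},
    {"url": "http://example.org/jaguar-cars",   "rel": 0.9, "topics": ["automotive"]},
]
ranked = sorted(results,
                key=lambda r: personalized_score(r["rel"], r["topics"], user_prefs),
                reverse=True)
print([r["url"] for r in ranked])  # the programming page wins for this user
```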
7.
8.
《Behaviour & Information Technology》2012,31(5):491-502
The perception of the visual complexity of World Wide Web (Web) pages is a topic of significant interest. Previous work has examined the relationship between complexity and various aspects of presentation, including font styles, colours and images, but automatically quantifying this dimension of a web page at the level of the document remains a challenge. In this paper we demonstrate that areas of high complexity can be identified by detecting areas, or ‘chunks’, of a web page high in block-level elements. We report a computational algorithm that captures this metric and places web pages in a sequence that shows an 86% correlation with the sequences generated through user judgements of complexity. The work shows that structural aspects of a web page influence how complex a user perceives it to be, and presents a straightforward means of determining complexity through examining the DOM.
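A rough sketch of the block-level signal, assuming a plain tag count stands in for the paper's chunk-based analysis of the rendered page: parse the document with the standard-library HTML parser and count block-level elements as a crude, document-level complexity estimate.

```python
# Count block-level elements as a simple structural complexity signal.
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "p", "table", "ul", "ol", "li", "section",
              "article", "header", "footer", "form", "h1", "h2", "h3"}

class BlockCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.blocks += 1

def block_complexity(html: str) -> int:
    counter = BlockCounter()
    counter.feed(html)
    return counter.blocks

print(block_complexity("<div><p>hello</p><ul><li>a</li><li>b</li></ul></div>"))  # 5
```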
9.
The Deep Web contains rich, high-quality information resources. To obtain pages from a Deep Web site, users have to enter a series of keyword sets. Since there are no static links pointing directly to Deep Web pages, most current search engines cannot discover these pages. The Deep Web crawling strategy proposed in this paper can download Deep Web pages effectively. Because such a site exposes only a query interface, the main challenge in designing a Deep Web crawler is how to select the best query keywords to generate meaningful queries. Experiments show that the proposed selection method, which is based on the relevance weights of different keywords, is effective.
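A hedged sketch of the keyword-selection loop such a crawler runs against a query interface: keywords are issued in order of their relevance weights and the returned records are accumulated. The weighting scheme and the one-query-per-keyword policy are illustrative assumptions; `submit_query` is a hypothetical stand-in for posting the site's search form.

```python
# Issue queries to a Deep Web form, highest-weight keyword first, and
# accumulate the records the site returns.
def crawl_deep_web(keyword_weights, submit_query, max_queries=20):
    """keyword_weights: dict keyword -> relevance weight.
    submit_query(kw): returns the set of record ids the site returns for kw."""
    pending = dict(keyword_weights)
    seen_records = set()
    for _ in range(min(max_queries, len(pending))):
        kw = max(pending, key=pending.get)     # most promising keyword next
        pending.pop(kw)
        seen_records |= submit_query(kw)
    return seen_records

fake_form = {"database": {1, 2, 3}, "crawler": {3, 4}, "cat": {9}}
records = crawl_deep_web({"database": 0.9, "crawler": 0.7, "cat": 0.1},
                         lambda kw: fake_form.get(kw, set()), max_queries=2)
print(records)  # {1, 2, 3, 4}
```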
10.
Inaccessible Web pages and 404 “Page Not Found” responses are a common Web phenomenon and a detriment to the user’s browsing experience. The rediscovery of missing Web pages is, therefore, a relevant research topic in the digital preservation as well as in the Information Retrieval realm. In this article, we bring these two areas together by analyzing four content- and link-based methods to rediscover missing Web pages. We investigate the retrieval performance of the methods individually as well as their combinations and give an insight into how effective these methods are over time. As the main result of this work, we are able to recommend not only the best performing methods but also the sequence in which they should be applied, based on their performance, complexity required to generate them, and evolution over time. Our least complex single method results in a rediscovery rate of almost 70% of Web pages of our sample dataset based on URIs sampled from the Open Directory Project (DMOZ). By increasing the complexity level and combining three different methods, our results show an increase of the success rate of up to 77%. The results, based on our sample dataset, indicate that Web pages are often not completely lost but have moved to a different location and “just” need to be rediscovered.
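One content-based rediscovery method can be sketched as follows, under the assumption that a "lexical signature" of the most characteristic terms of an archived copy is used as a search query. The article evaluates several such methods and their combinations, which are not reproduced here, and `web_search` is a hypothetical stand-in for a real search API.

```python
# Build a short lexical signature from an archived copy of a missing page
# and use it as a query to look for the page's new location.
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "a", "to", "in", "is", "for", "on", "that"}

def lexical_signature(cached_text: str, n_terms: int = 5) -> str:
    words = re.findall(r"[a-z]+", cached_text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return " ".join(term for term, _ in freq.most_common(n_terms))

def rediscover(cached_text, web_search):
    query = lexical_signature(cached_text)
    return web_search(query)       # candidate URLs where the page may live now

sig = lexical_signature("Digital preservation of web pages and archived web content")
print(sig)  # e.g. "web digital preservation pages archived"
```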
11.
Andy Brown Caroline Jay Simon Harper 《International journal of human-computer studies》2012,70(3):179-196
Understanding the content of a Web page and navigating within and between pages are crucial tasks for any Web user. To those who are accessing pages through non-visual means, such as screen readers, the challenges offered by these tasks are not easily overcome, even when pages are unchanging documents. The advent of ‘Web 2.0’ and Web applications, however, means that documents often are not static, but update, either automatically or due to user interaction. This development poses a difficult question for screen reader designers: how should users be notified of page changes? In this article we introduce rules for presenting such updates, derived from studies of how sighted users interact with them. An implementation of the rules has been evaluated, showing that users who were blind or visually impaired found updates easier to deal with than the relatively quiet way in which current screen readers often present them.
12.
A Topic-Oriented WWW Information Mining System (cited 3 times: 0 self-citations, 3 by others)
1 Overview. The WWW is growing at an incredible speed and is gradually becoming the main platform on which people publish and obtain information. Although a great deal of information can be obtained from the WWW, the information on it is unstructured, dynamic, and dispersed, so how to extract useful information from the WWW efficiently remains a challenging problem. The wide use of search engines (such as Excite, Google, and Alta Vista) has greatly improved the efficiency of information retrieval. A search engine works as follows: a crawler collects as many …
13.
14.
Guilherme T. de Assis Alberto H. F. Laender Marcos André Gonçalves Altigran S. da Silva 《World Wide Web》2009,12(3):285-319
Focused crawlers have as their main goal to crawl Web pages that are relevant to a specific topic or user interest, playing an important role for a great variety of applications. In general, they work by trying to find and crawl all kinds of pages deemed as related to an implicitly declared topic. However, users are often not simply interested in any document about a topic; instead, they may want only documents of a given type or genre on that topic to be retrieved. In this article, we describe an approach to focused crawling that exploits not only content-related information but also genre information present in Web pages to guide the crawling process. This approach has been designed to address situations in which the specific topic of interest can be expressed by specifying two sets of terms, the first describing genre aspects of the desired pages and the second related to the subject or content of these pages, thus requiring no training or any kind of preprocessing. The effectiveness, efficiency and scalability of the proposed approach are demonstrated by a set of experiments involving the crawling of pages related to syllabi of computer science courses, job offers in the computer science field and sale offers of computer equipment. These experiments show that focused crawlers constructed according to our genre-aware approach achieve F1 levels above 88%, requiring the analysis of no more than 65% of the visited pages in order to find 90% of the relevant pages. In addition, we experimentally analyze the impact of term selection on our approach and evaluate a proposed strategy for semi-automatic generation of such terms. This analysis shows that a small set of terms selected by an expert, or a set of terms specified by a typical user familiar with the topic, is usually enough to produce good results, and that such a semi-automatic strategy is very effective in supporting the task of selecting the sets of terms required to guide a crawling process.
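A minimal sketch of the genre-aware relevance test described above: a page is kept only if it matches both the genre terms and the content terms. The fraction-of-terms score and the 0.3 threshold are illustrative assumptions, not the exact criteria used in the experiments.

```python
# A page passes only if it matches both the genre term set (e.g. syllabus
# vocabulary) and the content term set (e.g. the subject area).
import re

def term_score(page_text, terms):
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return sum(1 for t in terms if t in words) / len(terms)

def is_relevant(page_text, genre_terms, content_terms, threshold=0.3):
    return (term_score(page_text, genre_terms) >= threshold and
            term_score(page_text, content_terms) >= threshold)

genre_terms = ["syllabus", "lecture", "grading", "schedule"]
content_terms = ["information", "retrieval", "indexing", "ranking"]
page = "course syllabus: information retrieval, grading policy and lecture schedule"
print(is_relevant(page, genre_terms, content_terms))  # True
```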
15.
Multiobjective evolutionary clustering of Web user sessions: a case study in Web page recommendation
G. Nildem Demir A. Şima Uyar Şule Gündüz-Öğüdücü 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2010,14(6):579-597
In this study, we experiment with several multiobjective evolutionary algorithms to determine a suitable approach for clustering Web user sessions, which consist of sequences of Web pages visited by the users. Our experimental results show that the multiobjective evolutionary algorithm-based approaches are successful for sequence clustering. We look at a commonly used cluster validity index to verify our findings. The results for this index indicate that the clustering solutions are of high quality. As a case study, the obtained clusters are then used in a Web recommender system for representing usage patterns. As a result of the experiments, we see that these approaches can successfully be applied for generating clustering solutions that lead to a high recommendation accuracy in the recommender model we used in this paper.
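For illustration, the two competing objectives such a multiobjective clustering EA might trade off can be evaluated for a candidate assignment of sessions to clusters: within-cluster cohesion (to maximise) and between-cluster similarity (to minimise). Comparing sessions with a Jaccard measure over visited pages ignores their order, which is a simplification of the sequence similarity used in the study.

```python
# Evaluate two clustering objectives for a candidate assignment of sessions.
from itertools import combinations

def jaccard(s1, s2):
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b) if a | b else 0.0

def objectives(sessions, assignment):
    """assignment[i] = cluster id of sessions[i]; returns (cohesion, separation)."""
    def pairs(same_cluster):
        return [(i, j) for i, j in combinations(range(len(sessions)), 2)
                if (assignment[i] == assignment[j]) == same_cluster]
    intra, inter = pairs(True), pairs(False)
    cohesion = sum(jaccard(sessions[i], sessions[j]) for i, j in intra) / (len(intra) or 1)
    separation = sum(jaccard(sessions[i], sessions[j]) for i, j in inter) / (len(inter) or 1)
    return cohesion, separation   # an EA would seek high cohesion, low separation

sessions = [["/home", "/python"], ["/python", "/docs"], ["/travel", "/hotels"]]
print(objectives(sessions, assignment=[0, 0, 1]))
```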
16.
Web Spam is one of the main difficulties that crawlers have to overcome and therefore one of the main problems of the WWW. There are several studies about characterising and detecting Web Spam pages. However, none of them deals with all the possible kinds of Web Spam. This paper presents an analysis of different kinds of Web Spam pages and identifies new elements that characterise them, in order to define heuristics which are able to partially detect them. We also discuss and explain several heuristics from the point of view of their effectiveness and computational efficiency. Taking them into account, we study several sets of heuristics and demonstrate how they improve the current results. Finally, we propose a new Web Spam detection system called SAAD (Spam Analyzer And Detector), which is based on the set of proposed heuristics and their use in a C4.5 classifier improved by means of Bagging and Boosting techniques. We have also tested our system on some well-known Web Spam datasets and we have found it to be very effective.
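A hedged sketch of the classification stage only: heuristic per-page features are fed to a decision tree strengthened with Bagging and Boosting. scikit-learn's CART tree stands in for C4.5 here, and the three example features are illustrative, not the paper's full heuristic set.

```python
# Decision tree with Bagging and Boosting over toy per-page spam features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

# columns: keyword-stuffing ratio, fraction of hidden text, outlink/word ratio
X = np.array([[0.02, 0.00, 0.05],   # normal page
              [0.40, 0.30, 0.90],   # spam-looking page
              [0.05, 0.01, 0.10],
              [0.55, 0.20, 0.70]])
y = np.array([0, 1, 0, 1])          # 0 = ham, 1 = spam

bagged = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=25).fit(X, y)
boosted = AdaBoostClassifier(n_estimators=25).fit(X, y)

candidate = np.array([[0.45, 0.25, 0.80]])
print(bagged.predict(candidate), boosted.predict(candidate))  # expected: both flag spam (1)
```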
17.
This paper analyzes the Hyperlink-Induced Topic Search (HITS) algorithm from the perspective of semantic relevance and finds that topic drift occurs because pages are projected onto the wrong semantic bases. A fuzzy-set-based topic extraction and hierarchy discovery algorithm (FSTH) is proposed: query terms are expanded using user logs to construct a personalized root set and base set that meet the user's needs, thereby preventing topic drift. FSTH uses fuzzy-set partitioning to hierarchically discover the sets of topic pages relevant to the user's query, applies the HITS algorithm to compute the authority value of the pages in each topic page set, and returns authoritative pages of other topics related to the query. Experimental results on 14 queries show that, compared with the HITS algorithm, FSTH not only reduces the topic drift rate by 7% to 53%, but also discovers multiple topics relevant to the query.
18.
Tsuyoshi Murata 《New Generation Computing》2007,25(3):293-303
The Web is a huge network composed of Web pages and hyperlinks. It is often reported that related Web pages are densely linked with each other. Finding groups of such related pages, which are called Web communities, is important for information retrieval from the Web. Several attempts have been made at the discovery of Web communities, such as Kumar’s trawling and Flake’s method. In addition to the communities of related Web pages, there are communities of users sharing common interests. Finding the latter communities, which we call user communities in this paper, is also important for clarifying the behaviors of Web users. It is expected that the characteristics of user communities in the Web correspond to those in real human communities. A method for discovering user communities is described in this paper. Client-level log data (Web audience measurement data) is used as the data of users’ Web watching behaviors. Maximal complete bipartite graphs are extracted from a term-user graph obtained from the log data, without analyzing the contents of Web pages. Experimental results show that our method succeeds in discovering many interesting user communities with labels that characterize the communities.
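The biclique idea can be sketched naively: in a term-user bipartite graph, any set of terms whose common audience contains at least two users forms a complete bipartite subgraph, and the terms serve as a label for that user community. Exhaustive subset enumeration (with no maximality filtering) only works on toy data; the method in the paper searches Web audience measurement logs at scale.

```python
# Enumerate complete bipartite subgraphs (term set + shared users) in a
# toy term-user graph; the term set acts as the community label.
from itertools import combinations

def user_communities(term_to_users, min_terms=2, min_users=2):
    communities = []
    terms = list(term_to_users)
    for size in range(min_terms, len(terms) + 1):
        for term_set in combinations(terms, size):
            shared = set.intersection(*(term_to_users[t] for t in term_set))
            if len(shared) >= min_users:
                communities.append((term_set, shared))
    return communities

graph = {
    "football": {"u1", "u2", "u3"},
    "tennis":   {"u1", "u2"},
    "knitting": {"u4"},
}
for label, users in user_communities(graph):
    print(label, users)   # ('football', 'tennis') {'u1', 'u2'}
```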
19.
The exponential growth in Web sites is making it increasingly difficult to extract useful information on the Internet using existing search engines. Despite a wide range of sophisticated indexing and data retrieval features, search engines often deliver satisfactory results only when users know precisely what they are looking for. Traditional textual interfaces present results as a list of links to Web pages. Because most users are unwilling to explore an extensive list, search engines arbitrarily reduce the number of links returned, aiming also to provide quick response times. Moreover, their proprietary ranking algorithms often do not reflect individual user preferences. Those who need comprehensive general information about a topic or have vague initial requirements instead want a holistic presentation of data related to their queries. To address this need, we have developed Periscope, a 3D search result visualization system that displays all the Web pages found in a synthetic, yet comprehensible format.