Similar articles
 20 similar articles found (search time: 31 ms)
1.
The Web is a hypertext body of approximately 300 million pages that continues to grow at roughly a million pages per day. Page variation is more prodigious than the data's raw scale: taken as a whole, the set of Web pages lacks a unifying structure and shows far more variation in authoring style and content than traditional text document collections do. This level of complexity makes an "off-the-shelf" database management and information retrieval solution impossible. To date, index-based search engines for the Web have been the primary tool by which users search for information. Such engines can build giant indices that let you quickly retrieve the set of all Web pages containing a given word or string. Experienced users can make effective use of such engines for tasks that can be solved by searching for tightly constrained keywords and phrases. These search engines are, however, unsuited for a wide range of equally important tasks. In particular, a topic of any breadth will typically contain several thousand or several million relevant Web pages. How then, from this sea of pages, should a search engine select the correct ones, those of most value to the user? Clever is a search engine that analyzes hyperlinks to uncover two types of pages: authorities, which provide the best source of information on a given topic; and hubs, which provide collections of links to authorities. We outline the thinking that went into Clever's design, report briefly on a study that compared Clever's performance to that of Yahoo and AltaVista, and examine how our system is being extended and updated.
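
The hub/authority idea described above can be illustrated with a small sketch. The abstract does not give Clever's actual algorithm or weighting, so the code below assumes the standard mutual-reinforcement iteration (a page is a good authority if good hubs link to it, and a good hub if it links to good authorities). The function name `hub_authority_scores` and the toy link graph are hypothetical.

```python
from collections import defaultdict


def hub_authority_scores(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph.

    edges: iterable of (source_page, target_page) hyperlinks.
    Assumption: plain mutual reinforcement with L2 normalisation;
    Clever's real scoring is not specified in the abstract.
    """
    edges = list(edges)
    nodes = {page for edge in edges for page in edge}
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {n: sum(hub[s] for s in in_links[n]) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of the authority scores of pages linked to.
        hub = {n: sum(auth[d] for d in out_links[n]) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Toy graph: h1, h2, h3 are link collections; a1, a2 are content pages.
toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
hub, auth = hub_authority_scores(toy_edges)
```

In this toy graph `a1` receives links from all three hubs and so ends up with the highest authority score, while `h3`, which links to only one authority, scores lower as a hub than `h1` or `h2`.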

2.
3.
We propose an approach for the word-level indexing of modern printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Nowadays, digital libraries hold collections of digitized documents that can be retrieved either by browsing the document images or by relying on appropriate metadata assembled by domain experts. Word indexing tools would therefore increase access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of self-organizing maps (SOM) to perform unsupervised character clustering, the definition of a suitable vector-based word representation whose size depends on the word aspect ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are for processing modern printed documents (17th to 19th centuries) where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.
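
To make the point about word positions and proximity queries concrete, here is a minimal sketch of a positional index. It assumes the words are already available as text strings, which is a simplification: the system described above indexes word images through SOM-based character clusters rather than transcribed text. The names `build_positional_index` and `docs_with_terms_within` are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} (positions are 0-based word offsets)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def docs_with_terms_within(index, first, second, max_gap):
    """Return the ids of documents where the two words occur at most max_gap words apart."""
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    matches = set()
    # Only documents that contain both words can satisfy the proximity query.
    for doc_id in first_postings.keys() & second_postings.keys():
        if any(abs(a - b) <= max_gap
               for a in first_postings[doc_id]
               for b in second_postings[doc_id]):
            matches.add(doc_id)
    return matches


docs = {
    "d1": "self organizing maps cluster characters",
    "d2": "maps of the old city show organizing principles",
}
index = build_positional_index(docs)
adjacent = docs_with_terms_within(index, "organizing", "maps", max_gap=1)
```

Because the index stores positions rather than only document ids, the same structure answers both "which documents contain this word" and "where in the document does it occur".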

4.
The Internet and corporate intranets have brought a lot of information. People usually resort to search engines to find required information. However, these systems tend to use only one fixed ranking strategy regardless of the context. This poses serious performance problems when characteristics of different users, queries, and text collections are taken into account. We argue that the ranking strategy should be context specific, and we propose a new systematic method that can automatically generate ranking strategies for different contexts based on genetic programming (GP). The new method was tested on TREC data and the results are very promising.

5.
Conventional keyword search engines are restricted to a given data model and cannot easily adapt to unstructured, semi-structured or structured data. In this paper, we propose an efficient and adaptive keyword search method, called EASE, for indexing and querying large collections of heterogeneous data. To achieve high efficiency in processing keyword queries, we first model unstructured, semi-structured and structured data as graphs, and then summarize the graphs and construct graph indices instead of using traditional inverted indices. We propose an extended inverted index to facilitate keyword-based search, and present a novel ranking mechanism for enhancing search effectiveness. We have conducted an extensive experimental study using real datasets, and the results show that EASE achieves both high search efficiency and high accuracy, and outperforms the existing approaches significantly.

6.
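To overcome the limit on the number of results that search engines return, this work extends meta-search technology and introduces the concepts of link communities and link propagation, comparing them with biological communities. The idea of link propagation is to take the results returned by multiple search engines as the initial information sources, optimize and merge those results using predefined propagation rules, and analyze the links in the pages they point to in order to propagate further relevant information sources. When analyzing the result sets of different search engines, the system dynamically adjusts the priority of the links to be propagated according to each engine's ability to discover information sources, both directly and through propagation, and the quality of those sources. Experiments confirm that link propagation can greatly increase the number of topical information sources discovered through search engines.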

7.
Databases deepen the Web   Cited by: 2 (self-citations: 0, citations by others: 2)
Ghanem, T.M.; Aref, W.G. Computer, 2004, 37(1): 116-117
The Web has become the preferred medium for many database applications, such as e-commerce and digital libraries. These applications store information in huge databases that users access, query, and update through the Web. Database-driven Web sites have their own interfaces and access forms for creating HTML pages on the fly. Web database technologies define the way that these forms can connect to and retrieve data from database servers. The number of database-driven Web sites is increasing exponentially, and each site creates pages dynamically, pages that are hard for traditional search engines to reach. Such search engines crawl and index static HTML pages; they do not send queries to Web databases. The information hidden inside Web databases is called the "deep Web", in contrast to the "surface Web" that traditional search engines access easily. We expect deep Web search engines and technologies to improve rapidly and to dramatically affect how the Web is used by providing easy access to many more information resources.

8.
Query recommendation helps users describe their information needs more clearly so that search engines can return appropriate answers and meet those needs. State-of-the-art research shows that using information about users' behavior helps to improve query recommendation performance. Instead of finding the terms most similar to those that previous users queried, we focus on how to detect users' actual information needs based on their search behavior. The key idea of this paper is that although clicked documents are not always relevant to users' queries, the snippets that led to the click most probably meet their information needs. Based on an analysis of large-scale practical search behavior log data, two snippet click behavior models are constructed and corresponding query recommendation algorithms are proposed. Experimental results based on click-through data from two widely used commercial search engines show that the proposed algorithms outperform the recommendation methods used in practice by these two search engines. To the best of our knowledge, this is the first time that snippet click models have been proposed for the query recommendation task.

9.
There is significant commercial and research interest in location-based web search engines. Given a number of search keywords and one or more locations (geographical points) that a user is interested in, a location-based web search retrieves and ranks the most textually and spatially relevant web pages. In this type of search, both the spatial and textual information should be indexed. Currently, no efficient index structure exists that can handle both the spatial and textual aspects of data simultaneously and accurately. Existing approaches either index space and text separately or use inefficient hybrid index structures with poor performance and inaccurate results. Moreover, most of these approaches cannot accurately rank web pages based on a combination of space and text and are not easy to integrate into existing search engines. In this paper, we propose a new index structure called Spatial-Keyword Inverted File for Points to handle point-based indexing of web documents in an integrated and efficient manner. To seamlessly find and rank relevant documents, we develop a new distance measure called spatial tf-idf. We propose four variants of spatial-keyword relevance scores and two algorithms to perform top-k searches. As verified by experiments, our proposed techniques outperform existing index structures in terms of search performance and accuracy.

10.
Audiovisual archives are investing in large-scale digitization of their analogue holdings and, in parallel, ingesting an ever-increasing number of born-digital files into their digital storage facilities. Digitization opens up new access paradigms and boosts re-use of audiovisual content. Query-log analyses show the shortcomings of manual annotation; archives are therefore complementing these annotations by developing novel search engines that automatically extract information from both the audio and the visual tracks. Over the past few years, the TRECVid benchmark has developed a novel relationship with the Netherlands Institute of Sound and Vision (NISV) which goes beyond the NISV just providing data and use cases to TRECVid. Prototype and demonstrator systems developed as part of TRECVid are set to become a key driver in improving the quality of search engines at the NISV and will ultimately help other audiovisual archives to offer more efficient and more fine-grained access to their collections. This paper reports the experiences of NISV in leveraging the activities of the TRECVid benchmark.

11.
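Design of a specialized Web information gathering system based on mobile crawlers   Cited by: 3 (self-citations: 0, citations by others: 3)
Search engines have become an important tool for navigating the Web. To provide powerful search capabilities, a search engine maintains a detailed index of the documents accessible online. Building and maintaining this index is the job of Web crawlers, which recursively traverse and download Web pages on behalf of the search engine. After download, the pages are analyzed and indexed by the search engine, which then offers retrieval services. This article describes a more efficient way to build a Web index, based on mobile crawlers (MobileCrawler). The proposed crawler is first transferred to the site where the data resides, and any unneeded data is filtered out locally before the rest is sent back to the search engine. The approach is particularly suited to implementing so-called "smart" crawling algorithms, which decide on an efficient crawl path based on the content of the Web pages already visited. Mobile crawlers combine two technology trends, mobile computing and specialized search engines, and offer a sound technical solution to the problems that today's general-purpose search engines face.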

12.
In this paper we address the following important questions for concept-based video retrieval: (1) What is the impact of detector performance on the performance of concept-based retrieval engines, and (2) will these engines be applicable to real-life search tasks if detector performance improves in the future? We use Monte Carlo simulations to answer these questions. To generate the simulation input, we propose to use a probabilistic model of two Gaussians for the confidence scores that concept detectors emit. Modifying the model's parameters affects the detector performance and the search performance. We study the relation between these two performances on two video collections. For detectors with similar discriminative power and a concept vocabulary of around 100 concepts, the simulation reveals that in order to achieve a search performance of 0.20 mean average precision (MAP), which is considered sufficient performance for real-life applications, one needs detectors with at least 0.60 MAP. We also find that, given our simulation model and low detector performance, MAP is not always a good evaluation measure for concept detectors since it is not strongly correlated with the search performance.
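
The two-Gaussian score model lends itself to a short simulation sketch. The abstract does not state the model's parameters or how search performance is derived from detector output, so the code below only illustrates the first half of the idea: drawing confidence scores for relevant and non-relevant shots from two Gaussians and measuring the resulting detector average precision. The shared standard deviation, the sample sizes, and the function names are all assumptions made for illustration.

```python
import random


def average_precision(scores, labels):
    """Average precision of a ranking induced by descending confidence scores."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    relevant_seen = 0
    precision_sum = 0.0
    for rank, (_, is_relevant) in enumerate(ranked, start=1):
        if is_relevant:
            relevant_seen += 1
            precision_sum += relevant_seen / rank
    return precision_sum / relevant_seen if relevant_seen else 0.0


def simulate_detector_ap(mean_pos, mean_neg, sigma,
                         n_pos=100, n_neg=900, runs=20, seed=0):
    """Monte Carlo estimate of detector AP under a two-Gaussian score model.

    Assumption: positives ~ N(mean_pos, sigma), negatives ~ N(mean_neg, sigma).
    """
    rng = random.Random(seed)
    labels = [True] * n_pos + [False] * n_neg
    results = []
    for _ in range(runs):
        scores = ([rng.gauss(mean_pos, sigma) for _ in range(n_pos)]
                  + [rng.gauss(mean_neg, sigma) for _ in range(n_neg)])
        results.append(average_precision(scores, labels))
    return sum(results) / len(results)


# Moving the two means apart models a more discriminative detector.
weak_detector = simulate_detector_ap(mean_pos=0.5, mean_neg=0.0, sigma=1.0)
strong_detector = simulate_detector_ap(mean_pos=3.0, mean_neg=0.0, sigma=1.0)
```

Sweeping the distance between the two means gives a family of simulated detectors of increasing quality, which is the kind of input a study relating detector performance to search performance would need.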

13.
Shang, Yi; Li, Longzhuang. World Wide Web, 2002, 5(2): 159-173
In this paper, we present a general approach for statistically evaluating precision of search engines on the Web. Search engines are evaluated in two steps based on a large number of sample queries: (a) computing relevance scores of hits from each search engine, and (b) ranking the search engines based on statistical comparison of the relevance scores. In computing relevance scores of hits, we study four relevance scoring algorithms. Three of them are variations of algorithms widely used in the traditional information retrieval field. They are cover density ranking, Okapi similarity measurement, and vector space model algorithms. In addition, we develop a new three-level scoring algorithm to mimic commonly used manual approaches. In ranking the search engines in terms of precision, we apply a statistical metric called probability of win. In our experiments, six popular search engines, AltaVista, Fast, Google, Go, iWon, and NorthernLight, were evaluated based on queries from two domains of interest: parallel and distributed processing, and knowledge and data engineering. The first query set contains 1726 queries collected from the index terms of papers published in the IEEE Transactions on Knowledge and Data Engineering. The second set contains 1383 queries collected from the index terms of papers published in the IEEE Transactions on Parallel and Distributed Systems. Search engines were queried and compared in two different search modes: the default search mode and the exact phrase search mode. Our experimental results show that these six search engines performed differently under different search modes and scoring methods. Overall, Google was the best. NorthernLight was mostly second in the default search mode, whereas iWon was mostly second in the exact phrase search mode.

14.
Although search engines are essential tools for finding information on the World Wide Web, the effective use of search engines for information retrieval (IR) is a crucial challenge for any Internet user. Based on the user-focused approach, this study investigates individual information retrieval behaviors using information processing theory. The results show that experience with search engines significantly affects users' attitudes toward search engines for information retrieval, the query-based service is more popular than the directory-based service, users are not completely satisfied with the precision of retrieved information and the response time of search engines, and users' motivation is a key factor that predicts their intention to use search engines for information retrieval. Furthermore, this study proposes a conceptual model for investigating individual attitudes toward search engines for information retrieval.

15.
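A knowledge-based Web page retrieval tool   Cited by: 3 (self-citations: 0, citations by others: 3)
With the widespread use of the Internet around the world, more and more people rely on it for research and business activities, and Web page retrieval tools have become indispensable software. However, most of the retrieval tools in common use today are based on keyword queries and often suffer from information overload or the loss of useful information. There are two main causes: the query a user submits does not express the user's intent well, and no effective indexing mechanism is built over the query results to guide people quickly to useful information. To this end we propose a knowledge-based Web page retrieval tool (KWSE), which, with respect to existing retrieval tools, …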

16.
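Design and implementation of an intelligent meta-search engine   Cited by: 13 (self-citations: 0, citations by others: 13)
Liu Li; Sun Yantang. 《计算机工程》 (Computer Engineering), 2003, 29(6): 118-120, 133
This work studies existing meta-search engine technology and proposes an intelligent meta-search engine model: data mining is applied to records of how the independent search engines perform in order to dynamically generate the meta-search engine's scheduling strategy. After comparing the available data mining methods, decision-tree induction for classification was chosen to generate the meta-search engine's invocation strategy. The paper describes in detail how the scheduling strategy is processed, how the system's evaluation metrics are established, and a concrete data mining implementation using OLE DB for DM, the general-purpose data mining interface recently released by Microsoft.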

17.
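This paper proposes a distributed information retrieval system called the Cooperative Search Engine (CSE), which consists of multiple cooperating local meta-search engines. Each local search engine has its own index database, which can be updated quickly. CSE reduces communication delay and shortens collection time through methods such as site-selection-based search and the scoring of Web documents. It thereby achieves fast collection, timely updates, and accurate location of results, overcoming the overly long update cycle of current search engines.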

18.
Understanding the search behaviour of online users is among the long-tail practices of Interactive Information Retrieval that helps identify the user information needs. The Interactive Social Book Search (SBS), under the umbrella of Interactive Information Retrieval (IIR), aims to understand the user interactions with book collections and the associated professionally-curated and socially-constructed metadata on the baseline and multistage user interfaces (UIs). This paper reports on the book search behaviour of users by reviewing research publications related to the Interactive SBS published during the last two decades. It presents a holistic view of the overall progress of Interactive SBS by summarising and visualising the experimental structure, search systems, datasets, demographics of participants, and findings to identify the research trends and possible future directions. Based on the collected evidence, it attempts to answer how the search system, user interface (UI), and the nature of tasks affect the book search behaviour of users. The article is the first of its kind that attempts to understand the book search behaviour of users in the context of Social Book Search with implications for usability experts and others working in UI design, web search engines, book search engines, digital libraries, collaborative social cataloguing websites, and e-Commerce applications.

19.
Nowadays, Internet users depend on various search engines to find requested information on the Web. Although most users feel that they are and remain anonymous when they place their search queries, reality proves otherwise. The increasing importance of search engines for locating desired information on the Internet usually leads to considerable inroads into the privacy of users. A heated debate is currently ongoing at the European level over whether search engine providers established outside the European Union are covered by the European data protection framework and the obligations it imposes on entities that process personal data. The scope of this paper is to examine the applicability of European data protection legislation to non-EU-based search engine providers and to study the main privacy issues with regard to search engines, such as the character of search logs, their anonymisation and their retention period. Ixquick, a privacy-friendly meta-search engine, will be presented as an alternative to the privacy-intrusive practices of existing search engines.

20.
Semplore: A scalable IR approach to search the Web of Data   Cited by: 1 (self-citations: 0, citations by others: 1)
The Web of Data keeps growing rapidly. However, the full exploitation of this large amount of structured data faces numerous challenges like usability, scalability, imprecise information needs and data change. We present Semplore, an IR-based system that aims at addressing these issues. Semplore supports intuitive faceted search and complex queries both on text and structured data. It combines imprecise keyword search and precise structured query in a unified ranking scheme. Scalable query processing is supported by leveraging inverted indexes traditionally used in IR systems. This is combined with a novel block-based index structure to support efficient index update when data changes. The experimental results show that Semplore is an efficient and effective system for searching the Web of Data and can be used as a basic infrastructure for Web-scale Semantic Web search engines.
