首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 50 毫秒
1.
In this paper, we introduce a fuzzy language to extract information from the web extending the web query language WebSQL [1]. These extensions are based on two observations: the inadequacy of traditional Boolean query languages for web documents, and the need to move beyond the notion of query providing just a set of answers in order to provide a better data presentation through answers' restructuring. In order to address the first issue, we consider fuzzy sets to express imprecision in data, queries and answers. In our case, data imprecision comes from the data classification provided by several search engines. Query imprecision occurs in weighting values provided at query definition time. Answer imprecision allows to filter and rank the answers. To address the second point, we provide an answer restructuring language to model the restructuring phase that follows the query phase. The restructuring language allows creation/deletion of links and page creation. Thus several answer organizations are possible as a result to the same query. The resulting language extends in a uniform framework WebSQL. Then we provide a mapping for the language constructs into an extended relational algebra called SAMEW[2] expressing similarity-based queries over imprecisely classified data, queries involving navigation among web pages and answer restructurings. Finally, we study the optimization of similarity-based queries using equivalence and containment rules holding for SAMEWand presenting several algorithms for query evaluation.  相似文献   

2.
In many decision-making scenarios, decision makers require rapid feedback to their queries, which typically involve aggregates. The traditional blocking execution model can no longer meet the demands of these users. One promising approach in the literature, called online aggregation, evaluates an aggregation query progressively as follows: as soon as certain data have been evaluated, approximate answers are produced with their respective running confidence intervals; as more data are examined, the answers and their corresponding running confidence intervals are refined. In this paper, we extend this approach to handle nested queries with aggregates (i.e., at least one inner query block is an aggregate query) by providing users with (approximate) answers progressively as the inner aggregation query blocks are evaluated. We address the new issues pose by nested queries. In particular, the answer space begins with a superset of the final answers and is refined as the aggregates from the inner query blocks are refined. For the intermediary answers to be meaningful, they have to be interpreted with the aggregates from the inner queries. We also propose a multi-threaded model in evaluating such queries: each query block is assigned to a thread, and the threads can be evaluated concurrently and independently. The time slice across the threads is nondeterministic in the sense that the user controls the relative rate at which these subqueries are being evaluated. For enumerative nested queries, we propose a priority-based evaluation strategy to present answers that are certainly in the final answer space first, before presenting those whose validity may be affected as the inner query aggregates are refined. We implemented a prototype system using Java and evaluated our system. Results for nested queries with a level and multiple levels of nesting are reported. Our results show the effectiveness of the proposed mechanisms in providing progressive feedback that reduces the initial waiting time of users significantly without sacrificing the quality of the answers. Received April 25, 2000 / Accepted June 27, 2000  相似文献   

3.
The idea of allowing query users to relax their correctness requirements in order to improve performance of a data stream management system (e.g., location-based services and sensor networks) has been recently studied. By exploiting the maximum error (or tolerance) allowed in query answers, algorithms for reducing the use of system resources have been developed. In most of these works, however, query tolerance is expressed as a numerical value, which may be difficult to specify. We observe that in many situations, users may not be concerned with the actual value of an answer, but rather which object satisfies a query (e.g., "who is my nearest neighbor?”). In particular, an entity-based query returns only the names of objects that satisfy the query. For these queries, it is possible to specify a tolerance that is "nonvalue-based.” In this paper, we study fraction-based tolerance, a type of nonvalue-based tolerance, where a user specifies the maximum fractions of a query answer that can be false positives and false negatives. We develop fraction-based tolerance for two major classes of entity-based queries: 1) nonrank-based query (e.g., range queries) and 2) rank-based query (e.g., k-nearest-neighbor queries). These definitions provide users with an alternative to specify the maximum tolerance allowed in their answers. We further investigate how these definitions can be exploited in a distributed stream environment. We design adaptive filter algorithms that allow updates be dropped conditionally at the data stream sources without affecting the overall query correctness. Extensive experimental results show that our protocols reduce the use of network and energy resources significantly.  相似文献   

4.
Hundreds of millions of users each day submit queries to the Web search engine. The user queries are typically very short which makes query understanding a challenging problem. In this paper, we propose a novel approach for query representation and classification. By submitting the query to a web search engine, the query can be represented as a set of terms found on the web pages returned by search engine. In this way, each query can be considered as a point in high-dimensional space and standard classification algorithms such as regression can be applied. However, traditional regression is too flexible in situations with large numbers of highly correlated predictor variables. It may suffer from the overfitting problem. By using search click information, the semantic relationship between queries can be incorporated into the learning system as a regularizer. Specifically, from all the functions which minimize the empirical loss on the labeled queries, we select the one which best preserves the semantic relationship between queries. We present experimental evidence suggesting that the regularized regression algorithm is able to use search click information effectively for query classification.  相似文献   

5.
Web Search is increasingly entity centric; as a large fraction of common queries target specific entities, search results get progressively augmented with semi-structured and multimedia information about those entities. However, search over personal web browsing history still revolves around keyword-search mostly. In this paper, we present a novel approach to answer queries over web browsing logs that takes into account entities appearing in the web pages, user activities, as well as temporal information. Our system, B-hist, aims at providing web users with an effective tool for searching and accessing information they previously looked up on the web by supporting multiple ways of filtering results using clustering and entity-centric search. In the following, we present our system and motivate our User Interface (UI) design choices by detailing the results of a survey on web browsing and history search. In addition, we present an empirical evaluation of our entity-based approach used to cluster web pages.  相似文献   

6.
Nowadays, spatial and temporal data play an important role in social networks. These data are distributed and dispersed in several heterogeneous data sources. These peculiarities make that geographic information retrieval being a non-trivial task, considering that the spatial data are often unstructured and built by different collaborative communities from social networks. The problem arises when user queries are performed with different levels of semantic granularity. This fact is very typical in social communities, where users have different levels of expertise. In this paper, a novelty approach based on three matching-query layers driven by ontologies on the heterogeneous data sources is presented. A technique of query contextualization is proposed for addressing to available heterogeneous data sources including social networks. It consists of contextualizing a query in which whether a data source does not contain a relevant result, other sources either provide an answer or in the best case, each one adds a relevant answer to the set of results. This approach is a collaborative learning system based on experience level of users in different domains. The retrieval process is achieved from three domains: temporal, geographical and social, which are involved in the user-content context. The work is oriented towards defining a GIScience collaborative learning for geographic information retrieval, using social networks, web and geodatabases.  相似文献   

7.
识别搜索引擎用户的查询意图在信息检索领域是备受关注的研究内容。文中提出一种融合多类特征识别Web查询意图的方法。将Web查询意图识别作为一个分类问题,并从不同类型的资源包括查询文本、搜索引擎返回内容及Web查询日志中抽取出有效的分类特征。在人工标注的真实Web查询语料上采用文中方法进行查询意图识别实验,实验结果显示文中采用的各类特征对于提高查询意图识别的效果皆有一定帮助,综合使用这些特征进行查询意图识别,88。5%的测试查询获得准确的意图识别结果。  相似文献   

8.
如何快速有效地对数据立方体上的聚集查询给出近似的回答,是数据挖掘和数据仓库研究领域中的核心问题之一。现有大多数聚集查询算法在同一个数据立方体上只能支持某种特定的而非多种类型的聚集查询。本文给出了一种新的框架AdenTS,即基于密度的自适应树结构,它可以回答同一数据立方体上的各类聚集查询,也提出了一些近似和启发式技术,改善了查询结果和精度。实验结果表明,这种方法在支持的查询种类和性能上是更好的。  相似文献   

9.
Abstract. The analysis of web usage has mostly focused on sites composed of conventional static pages. However, huge amounts of information available in the web come from databases or other data collections and are presented to the users in the form of dynamically generated pages. The query interfaces of such sites allow the specification of many search criteria. Their generated results support navigation to pages of results combining cross-linked data from many sources. For the analysis of visitor navigation behaviour in such web sites, we propose the web usage miner (WUM), which discovers navigation patterns subject to advanced statistical and structural constraints. Since our objective is the discovery of interesting navigation patterns, we do not focus on accesses to individual pages. Instead, we construct conceptual hierarchies that reflect the query capabilities used in the production of those pages. Our experiments with a real web site that integrates data from multiple databases, the German SchulWeb, demonstrate the appropriateness of WUM in discovering navigation patterns and show how those discoveries can help in assessing and improving the quality of the site. Received June 21, 1999 / Accepted December 24, 1999  相似文献   

10.
Question answering on the Web is moving beyond the stage where users simply type a query and retrieve a ranked ordering of appropriate Web pages. Users and analysts want targeted answers to their questions without extraneous information. These answers might contain information from current and authoritative sources, terms with the same meaning as those used in the query, relevant links such as justifications, follow-up questions fitting the context, and provenance information. Next-generation question-answering systems might also provide better querying support. This could include identifying whether questions are incoherent and therefore can't be answered, too general and would retrieve too many answers, or over constrained and would retrieve few if any answers. We present a spectrum of techniques for improving question answering and discuss their potential uses and impact.  相似文献   

11.
To avoid returning irrelevant web pages for search engine results, technologies that match user queries to web pages have been widely developed. In this study, web pages for search engine results are classified as low-adjacence (each web page includes all query keywords) or high-adjacence (each web page includes some of the query keywords) sets. To match user queries with web pages using formal concept analysis (FCA), a concept lattice of the low-adjacence set is defined and the non-redundancy association rules defined by Zaki for the concept lattice are extended. OR- and AND-RULEs between non-query and query keywords are proposed and an algorithm and mining method for these rules are proposed for the concept lattice. The time complexity of the algorithm is polynomial. An example illustrates the basic steps of the algorithm. Experimental and real application results demonstrate that the algorithm is effective.  相似文献   

12.
A common task of Web users is querying structured information from Web pages. For realizing this interesting scenario we propose a novel query processor for systematically discovering instances of semantic relations in Web search results and joining these relation instances into complex result tuples with conjunctive queries. Our query processor transforms a structured user query into keyword queries that are submitted to a search engine, forwards search results to a relation extractor, and then combines relations into complex result tuples. The processor automatically learns discriminative and effective keywords for different types of semantic relations. Thereby, our query processor leverages the index of a search engine to query potentially billions of pages. Unfortunately, relation extractors may fail to return a relation for a result tuple. Moreover, user defined data sources may not return at least k complete result tuples. Therefore we propose an adaptive routing model based on information theory for retrieving missing attributes of incomplete result tuples. The model determines the most promising next incomplete tuple and attribute type for returning any-k complete result tuples at any point during the query execution process. We report a thorough experimental evaluation over multiple relation extractors. Our query processor returns complete result tuples while processing only very few Web pages.  相似文献   

13.
Databases and information systems are often hard to use because they do not explicitly attempt to cooperate with their users. Direct answers to database and knowledge base queries may not always be the best answers. Instead, an answer with extra or alternative information may be more useful and less misleading to a user. This paper surveys foundational work that has been done toward endowing intelligent information systems with the ability to exhibit cooperative behavior. Grice's maxims of cooperative conversation, which provided a starting point for the field of cooperative answering, are presented along with relevant work in natural language dialogue systems, database query answering systems, and logic programming and deductive databases. The paper gives a detailed account of cooperative techniques that have been developed for considering users' beliefs and expectations, presuppositions, and misconceptions. Also, work in intensional answering and generalizing queries and answers is covered. Finally, the Cooperative Answering System at Maryland, which is intended to be a general, portable platform for supporting a wide spectrum of cooperative answering techniques, is described.  相似文献   

14.
An increasing number of emerging web database applications deal with large georeferenced data sets. However, exploring these large data sets through spatial queries can be very time and resource intensive. The need for interactive spatial queries has arisen in many applications such as Geographic Information Systems (GIS) for efficient decision-support. In this paper, we propose a new interactive spatial query processing technique for GIS. We present a family of the Incremental Refining Spatial Join (IRSJ) algorithms that can be used to report incrementally refined running estimates for aggregate queries while simultaneously displaying the actual query result tuples of the data sets sampled so far. Our goal is to minimize the time until an acceptably accurate estimate of the query result is available (to users) measured by a confidence interval. Our approach enables more interactive data exploration and analysis. While similar work has been done in relational databases, to the best of our knowledge, this is the first work using this approach in GIS. We investigate and evaluate different sampling methodologies through extensive experimental performance comparisons. Experiments on both real and synthetic data show an order of magnitude response time improvement relative to the final answer obtained when using a full R-tree join. We also show the impact of different index structures on the performance of our algorithms using three known sampling methods.  相似文献   

15.
The architectural choices underlying Linked Data have led to a compendium of data sources which contain both duplicated and fragmented information on a large number of domains. One way to enable non-experts users to access this data compendium is to provide keyword search frameworks that can capitalize on the inherent characteristics of Linked Data. Developing such systems is challenging for three main reasons. First, resources across different datasets or even within the same dataset can be homonyms. Second, different datasets employ heterogeneous schemas and each one may only contain a part of the answer for a certain user query. Finally, constructing a federated formal query from keywords across different datasets requires exploiting links between the different datasets on both the schema and instance levels. We present Sina, a scalable keyword search system that can answer user queries by transforming user-supplied keywords or natural-languages queries into conjunctive SPARQL queries over a set of interlinked data sources. Sina uses a hidden Markov model to determine the most suitable resources for a user-supplied query from different datasets. Moreover, our framework is able to construct federated queries by using the disambiguated resources and leveraging the link structure underlying the datasets to query. We evaluate Sina over three different datasets. We can answer 25 queries from the QALD-1 correctly. Moreover, we perform as well as the best question answering system from the QALD-3 competition by answering 32 questions correctly while also being able to answer queries on distributed sources. We study the runtime of SINA in its mono-core and parallel implementations and draw preliminary conclusions on the scalability of keyword search on Linked Data.  相似文献   

16.
17.
In this paper, we propose a multimodal query suggestion method for video search which can leverage multimodal processing to improve the quality of search results. When users type general or ambiguous textual queries, our system MQSS provides keyword suggestions and representative image examples in an easy-to-use dropdown manner which can help users specify their search intent more precisely and effortlessly. It is a powerful complement to initial queries. After the queries are formulated as multimodal query (i.e., text, image), the new queries are input to individual search models, such as text-based, concept-based and visual example-based search model. Then we apply multimodal fusion method to aggregate the above-mentioned several search results. The effectiveness of MQSS is demonstrated by evaluations over a web video data set.  相似文献   

18.
Extracting data from various semi-structured sources is a topic that has received a lot of attention. Wrapper induction specifically has been studied extensively, where users annotate a couple of data sources with examples of the data they want, after which a procedure (wrapper) is constructed that can optimally extract similar data as well. In this paper a novel wrapper induction approach is proposed, exploiting the premise of the general applicability of the XPath query language, studied specifically within the context of web pages. After a user annotates a limited set of web pages with the required data, a generalised XPath is constructed that is capable of extracting the examples and, optimally, similar data as well. This generalised baseline XPath is then enriched with predicates, based on context and structure of the data sources, to optimise the precision/recall balance of the data extraction capability of the wrapper. Six variations of such limiting predicates are introduced and investigated. In this paper, it is shown that the baseline approach often generalises the samples too much, leading to a decreased precision. Enriching the baseline wrapper by the addition of predicates limits the generalisation power of the queries in an intelligent manner. Experimental results show that there is a significant improvement in the overall precision of the generalised query, without an excessive loss in recall. Documented tests and real world experience with a large amount of data show that the technique is flexible, easily understood and applicable in a broad range of applications. It is not only of interest in the fields of web information retrieval, but can also be used in the contexts of, e.g., reverse engineering of databases, ontology expansion and deep web data mining, as both simple lists of data and complex structures can be extracted.  相似文献   

19.
Evaluating refined queries in top-k retrieval systems   总被引:2,自引:0,他引:2  
In many applications, users specify target values for certain attributes/features without requiring exact matches to these values in return. Instead, the result is typically a ranked list of "top k" objects that best match the specified feature values. User subjectivity is an important aspect of such queries, i.e., which objects are relevant to the user and which are not depends on the perception of the user. Due to the subjective nature of top-k queries, the answers returned by the system to an user query often do not satisfy the users need right away, either because the weights and the distance functions associated with the features do not accurately capture the users perception or because the specified target values do not fully capture her information need or both. In such cases, the user would like to refine the query and resubmit it in order to get back a better set of answers. While there has been a lot of research on query refinement models, there is no work that we are aware of on supporting refinement of top-k queries efficiently in a database system. Done naively, each "refined" query can be treated as a "starting" query and evaluated from scratch. We explore alternative approaches that significantly improve the cost of evaluating refined queries by exploiting the observation that the refined queries are not modified drastically from one iteration to another. Our experiments over a real-life multimedia data set show that the proposed techniques save more than 80 percent of the execution cost of refined queries over the naive approach and is more than an order of magnitude faster than a simple sequential scan.  相似文献   

20.
We present a new next generation domain search engine called MedicoPort. MedicoPort is a medical search engine designed for the users with no medical expertise. It is enhanced with the domain knowledge obtained from Unified Medical Language System (UMLS) to increase the effectiveness of the searches. The power of the system is based on the ability to understand the semantics of web pages and the user queries. MedicoPort transforms a keyword search into a conceptual search. Through our system we present a topical web crawling technique and indexing techniques empowered by the semantics information. MedicoPort aims to generate maximum output with semantic value using minimum input from the user. Since MedicoPort is designed to help people seeking information about health on the web, our target users are not medical specialists who can effectively use the special jargon of medicine and access medical databases. Medical experts have the advantage of shrinking the answer set by expressing several terms using medical terminology. MedicoPort provides the same advantage to its users through the automated use of the medical domain knowledge in the background. The results of our experiments indicate that, expanding the queries with domain knowledge, such as using the synonyms and partially or contextually relevant terms from UMLS, increase dramatically the relevance of an answer set produced by MedicoPort and the number of retrieved web pages that are relevant to the user request.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号