1.

Generating efficient safe query plans for probabilistic databases

Biao Yuni 《Data & Knowledge Engineering》2008,67(3):485-503

Managing uncertain information using probabilistic databases has recently drawn much attention in many fields. Generating efficient safe plans is the key to evaluating queries whose data complexity is PTIME. In this paper, we propose a new approach to generating efficient safe plans for queries. Our algorithm adopts effective preprocessing and multiway split techniques, so the generated safe plans avoid unnecessary probabilistic cartesian products and have the minimum number of probabilistic projections. Further, we extend existing transformation rules so that the safe plans generated by the Safe-Plan algorithm [N. Dalvi, D. Suciu, Efficient query evaluation on probabilistic database, The VLDB Journal 16 (4) (2007) 523–544] and those generated by the proposed algorithm can be transformed into each other. Experiments on the TPC-H benchmark queries show that the safe plans generated by our algorithm are more efficient than those generated by the Safe-Plan algorithm.
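To make the cost of probabilistic projections and cartesian products more concrete, the following is a minimal sketch of the two extensional operators that safe plans are typically composed of. It assumes tuple-independent relations stored as Python dicts that map a tuple to its probability; the function names `independent_project` and `independent_join` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from math import prod  # available in Python 3.8+


def independent_project(relation, keep):
    """Project a tuple-independent relation onto the attribute positions in `keep`.

    Duplicates produced by the projection are merged under the assumption
    that the underlying tuples are independent events: p = 1 - prod(1 - p_i).
    """
    groups = defaultdict(list)
    for row, p in relation.items():
        key = tuple(row[i] for i in keep)
        groups[key].append(p)
    return {key: 1.0 - prod(1.0 - p for p in ps) for key, ps in groups.items()}


def independent_join(left, right, left_col, right_col):
    """Equi-join two relations; each output probability is the product of the
    input probabilities, which is only valid if the joined tuples are independent."""
    out = {}
    for lrow, lp in left.items():
        for rrow, rp in right.items():
            if lrow[left_col] == rrow[right_col]:
                out[lrow + rrow] = lp * rp
    return out


# Illustrative data (made up for this sketch).
R = {("a", 1): 0.5, ("a", 2): 0.5}
S = {(1, "x"): 0.4}
projected = independent_project(R, [0])   # {("a",): 0.75}
joined = independent_join(R, S, 1, 0)     # one tuple with probability 0.5 * 0.4
```

Every projection and join touches the probabilities of all participating tuples, which is one reason a plan with fewer such operators can be cheaper to evaluate.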

2.

Probabilistic query answering over inconsistent databases

Sergio Greco Cristian Molinaro 《Annals of Mathematics and Artificial Intelligence》2012,64(2-3):185-207

This paper presents a framework for querying inconsistent databases in the presence of functional dependencies. Most work on extracting reliable information from inconsistent databases is based on the notion of repair, a minimal set of tuple insertions and deletions which leads the database to a consistent state (called a repaired database), and the notion of consistent query answer, a query answer that can be obtained from every repaired database. In this work, the notions of both repair and query answer differ from the original ones. In the presence of functional dependencies, tuple deletions are the only operations that are performed in order to restore the consistency of an inconsistent database. However, deleting a tuple to remove an integrity violation potentially eliminates useful information in that tuple. In order to cope with this problem, we adopt a notion of repair, based on tuple updates, which allows us to better preserve information in the source database. A drawback of the notion of consistent query answer is that it does not allow us to discriminate among non-consistent answers, namely answers which can be obtained from a non-empty proper subset of the repaired databases. To obtain more informative query answers, we propose the notion of probabilistic query answer, in which query answers are tuples associated with probabilities. This new semantics of query answering over inconsistent databases allows us to give a measure of uncertainty to query answers. We show that the problem of computing probabilistic query answers is FP^{#P}-complete. We also propose a technique for computing probabilistic answers to arbitrary relational algebra queries.

3.

Incremental Evaluation of Sliding-Window Queries over Data Streams

Ghanem T.M. Hammad M.A. Mokbel M.F. Aref W.G. Elmagarmid A.K. 《Knowledge and Data Engineering, IEEE Transactions on》2007,19(1):57-72

Two research efforts have been conducted to realize sliding-window queries in data stream management systems, namely, query reevaluation and incremental evaluation. In the query reevaluation method, two consecutive windows are processed independently of each other. On the other hand, in the incremental evaluation method, the query answer for a window is obtained incrementally from the answer of the preceding window. In this paper, we focus on the incremental evaluation method. Two approaches have been adopted for the incremental evaluation of sliding-window queries, namely, the input-triggered approach and the negative tuples approach. In the input-triggered approach, only the newly inserted tuples flow in the query pipeline and tuple expiration is based on the timestamps of the newly inserted tuples. On the other hand, in the negative tuples approach, tuple expiration is separated from tuple insertion where a tuple flows in the pipeline for every inserted or expired tuple. The negative tuples approach avoids the unpredictable output delays that result from the input-triggered approach. However, negative tuples double the number of tuples through the query pipeline, thus reducing the pipeline bandwidth. Based on a detailed study of the incremental evaluation pipeline, we classify the incremental query operators into two classes according to whether an operator can avoid the processing of negative tuples or not. Based on this classification, we present several optimization techniques over the negative tuples approach that aim to reduce the overhead of processing negative tuples while avoiding the output delay of the query answer. A detailed experimental study, based on a prototype system implementation, shows the performance gains over the input-triggered approach of the negative tuples approach when accompanied with the proposed optimizations.
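The negative tuples idea described above can be illustrated with a small sketch. It assumes integer timestamps starting at 0, a simulated clock that advances one tick at a time, and a SUM aggregate; the names `negative_tuple_stream` and `incremental_sum` are hypothetical and this is not the paper's implementation.

```python
from collections import deque


def negative_tuple_stream(events, window, end_time):
    """Turn (timestamp, value) events into a stream of signed tuples.

    A '+' tuple is emitted when a value arrives and a '-' tuple when it
    expires, i.e. once `window` ticks have passed since its arrival.
    Expiration is driven by the simulated clock, not by later arrivals.
    Assumes non-negative integer timestamps no larger than `end_time`.
    """
    pending = deque(sorted(events))
    live = deque()
    for now in range(end_time + 1):
        # Expire every live tuple whose window has closed at this tick.
        while live and live[0][0] + window <= now:
            yield ("-", live.popleft()[1])
        # Admit every tuple that arrives at this tick.
        while pending and pending[0][0] == now:
            ts, value = pending.popleft()
            live.append((ts, value))
            yield ("+", value)


def incremental_sum(signed_tuples):
    """Maintain SUM over the window by adding '+' values and subtracting '-' values."""
    total = 0
    for sign, value in signed_tuples:
        total += value if sign == "+" else -value
        yield total


# Illustrative input (made up): three arrivals, window of 3 ticks.
events = [(0, 5), (1, 3), (4, 2)]
answers = list(incremental_sum(negative_tuple_stream(events, 3, 7)))
# answers == [5, 8, 3, 0, 2, 0]
```

Three arrivals produce six pipeline tuples here, which mirrors the doubling of pipeline traffic that the abstract mentions as the cost of this approach.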

4.

The Threshold Algorithm: From Middleware Systems to the Relational Engine

Bruno N. Hui Wang 《Knowledge and Data Engineering, IEEE Transactions on》2007,19(4):523-537

The answer to a top-k query is an ordered set of tuples, where the ordering is based on how closely each tuple matches the query. In the context of middleware systems, new algorithms to answer top-k queries have been recently proposed. Among these, the threshold algorithm (TA) is the best-known instance due to its simplicity and memory requirements. TA is based on an early-termination condition and can evaluate top-k queries without examining all the tuples. This top-k query model is prevalent not only over middleware systems, but also over plain relational data. In this work, we analyze the challenges that must be addressed to adapt TA to a relational database system. We show that, depending on the available indices, many alternative TA strategies can be used to answer a given query. Choosing the best alternative requires a cost model that can be seamlessly integrated with that of current optimizers. In this work, we address these challenges and conduct an extensive experimental evaluation of the resulting techniques by characterizing which scenarios can take advantage of TA-like algorithms to answer top-k queries in relational database systems.
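As a rough illustration of the early-termination condition mentioned above, here is a minimal sketch of the classic middleware-style threshold algorithm with a sum aggregate. It assumes in-memory score lists (dicts from object to score) in which every object appears in every list; it does not model the relational adaptation or the cost model studied in the paper, and the function name `threshold_algorithm` is hypothetical.

```python
import heapq


def threshold_algorithm(lists, k):
    """Return (top-k list of (object, total score), rows read per list).

    lists: one dict per attribute mapping object -> score. Every object is
    assumed to appear in every list, and the aggregate is the sum of scores.
    """
    # Sorted access: each list ordered by descending score.
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen = {}
    depth = 0
    n = len(sorted_lists[0])
    while depth < n:
        last_scores = []
        for sl in sorted_lists:
            obj, score = sl[depth]
            last_scores.append(score)
            if obj not in seen:
                # Random access: look up the object's score in every list.
                seen[obj] = sum(l[obj] for l in lists)
        depth += 1
        # No unseen object can score higher than the sum of the last scores read.
        threshold = sum(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth


# Illustrative scores (made up for this sketch).
list1 = {"a": 9, "b": 8, "c": 1, "d": 0}
list2 = {"a": 7, "b": 9, "c": 2, "d": 1}
result, rows_read = threshold_algorithm([list1, list2], 1)
# result == [("b", 17)], rows_read == 2
```

In this example the algorithm stops after reading two of the four rows in each list, because the best total seen so far (17) already meets the threshold (8 + 7 = 15).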

5.

Query evaluation over probabilistic XML

Benny Kimelfeld Yuri Kosharovsky Yehoshua Sagiv 《The VLDB Journal The International Journal on Very Large Data Bases》2009,18(5):1117-1140

Query evaluation over probabilistic XML is explored. The queries are twig patterns with projection, and the data is represented in terms of three models of probabilistic XML (that extend existing ones in the literature). The first model makes an assumption of independence among the probabilistic junctions, whereas the second model can encode probabilistic dependencies. The third model combines the first two and, hence, is the most general. An efficient algorithm (under data complexity) is given for query evaluation in the first model. In addition, various optimizations are proposed, and their effectiveness is shown both analytically and experimentally. For the other two models, it is shown that every query is either intractable or trivial. Nonetheless, efficient (additive and multiplicative) approximation algorithms are given for these two models. Finally, Boolean queries are enriched by allowing disjunctions and negations of branches. The above algorithm for the first model is extended to handle these queries. For the other two models, there is an efficient additive approximation, and a multiplicative one also exists if there is no negation; in addition, it is shown that if the query is non-monotonic, then no efficient multiplicative approximation exists unless NP = RP.

6.

Reliability of answers to queries in relational databases

Sadri F. 《Knowledge and Data Engineering, IEEE Transactions on》1991,3(2):245-251

The author studies the problem of determining the reliability of answers to queries in a relational database system, where the information in the database comes from various sources with varying degrees of reliability. An extended relational model is proposed in which each tuple in a relation is associated with an information source vector which identifies the information source(s) that contributed to that tuple. The author shows how relational algebra operations can be extended, and implemented using information source vectors, to calculate the vector corresponding to each tuple in the answer to a query, and hence, to identify information source(s) contributing to each tuple in the answer. This also enables the database system to calculate the reliability of each tuple in the answer to a query as a function of the reliability of information sources.

7.

Beyond search: Retrieving complete tuples from a text-database

Alexander Löser Christoph Nagel Stephan Pieper Christoph Boden 《Information Systems Frontiers》2013,15(3):311-329

A common task of Web users is querying structured information from Web pages. To realize this scenario, we propose a novel query processor for systematically discovering instances of semantic relations in Web search results and joining these relation instances into complex result tuples with conjunctive queries. Our query processor transforms a structured user query into keyword queries that are submitted to a search engine, forwards search results to a relation extractor, and then combines relations into complex result tuples. The processor automatically learns discriminative and effective keywords for different types of semantic relations. Thereby, our query processor leverages the index of a search engine to query potentially billions of pages. Unfortunately, relation extractors may fail to return a relation for a result tuple. Moreover, user-defined data sources may not return at least k complete result tuples. Therefore, we propose an adaptive routing model based on information theory for retrieving missing attributes of incomplete result tuples. The model determines the most promising next incomplete tuple and attribute type for returning any-k complete result tuples at any point during the query execution process. We report a thorough experimental evaluation over multiple relation extractors. Our query processor returns complete result tuples while processing only very few Web pages.

8.

Research on relation-based storage methods for probabilistic XML data

Wang Jianwei (王建卫) Hao Zhongxiao (郝忠孝) 《计算机工程与应用 (Computer Engineering and Applications)》2011,47(23):130-132

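According to how probabilistic data is described, probabilistic data models fall into two classes: relation-based models and XML-based models. A relation-based probabilistic data model attaches a probability attribute to each tuple to represent uncertainty, which complicates the storage and query processing of tuples. An XML-based probabilistic data model adds nodes representing probability attributes to an ordinary XML tree and can represent probabilistic information at multiple granularities. Two PDTD-independent storage schemes, PXRel and PXParent, are designed for mapping probabilistic XML data to relations, and their effectiveness is verified experimentally.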

9.

Finding the least influenced set in uncertain databases

Xiang Lian Lei Chen Guoren Wang 《Information Systems》2011

Because uncertainty is inherent in many real-world applications, we investigate in this paper an important query in uncertain databases, namely the probabilistic least influenced set (PLIS) query, which retrieves all the uncertain objects in an uncertain database such that they are the least affected by a given query object with high probabilities. Such a PLIS query is useful in applications such as business planning. We propose and tackle both monochromatic and bichromatic versions (i.e. M-PLIS and B-PLIS, respectively) of the PLIS query. In order to efficiently answer PLIS queries, we present three pruning methods, MINMAX, Regional, and Candidate pruning, which can effectively reduce the PLIS search space. The proposed pruning methods can be seamlessly integrated into efficient query procedures. Moreover, we also study important variants of the PLIS query with an uncertain query object (i.e. UQ-PLIS). Furthermore, we formulate and tackle the PLIS problem on uncertain moving objects (i.e. UMOD-PLIS). Extensive experiments have demonstrated the efficiency and effectiveness of our proposed approaches under various settings.

10.

Data integration with uncertainty

Xin Luna Dong Alon Halevy Cong Yu 《The VLDB Journal The International Journal on Very Large Data Bases》2009,18(2):469-500

This paper reports our first set of results on managing uncertainty in data integration. We posit that data-integration systems need to handle uncertainty at three levels and do so in a principled fashion. First, the semantic mappings between the data sources and the mediated schema may be approximate because there may be too many of them to be created and maintained or because in some domains (e.g., bioinformatics) it is not clear what the mappings should be. Second, the data from the sources may be extracted using information extraction techniques and so may yield erroneous data. Third, queries to the system may be posed with keywords rather than in a structured form. As a first step to building such a system, we introduce the concept of probabilistic schema mappings and analyze their formal foundations. We show that there are two possible semantics for such mappings: by-table semantics assumes that there exists a correct mapping but we do not know what it is; by-tuple semantics assumes that the correct mapping may depend on the particular tuple in the source data. We present the query complexity and algorithms for answering queries in the presence of probabilistic schema mappings, and we describe an algorithm for efficiently computing the top-k answers to queries in such a setting. Finally, we consider using probabilistic mappings in the scenario of data exchange.

11.

Clustering-based query rewriting over inconsistent databases

Xie Dong (谢东) Yang Luming (杨路明) Pu Baoxing (蒲保兴) Liu Bo (刘波) 《小型微型计算机系统 (Journal of Chinese Computer Systems)》2007,28(12):2199-2202

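For inconsistent databases, this paper builds on the clusters produced by tuple-matching techniques and on the tuple probabilities of probabilistic databases to propose trusted cluster probabilities and a method for deciding whether a query is rewritable. Considering the most common integrity constraint (IC) cases (key-to-key and nonkey-to-key), it gives query rewriting methods for queries without joins and for queries with joins. The join rewriting method reduces the number of trusted-cluster tuples in the intermediate result sets used for joins, which effectively improves query performance. The experiments use the data and queries of the TPC-H decision support benchmark to study performance and analyze the influence of factors such as cluster cardinality and database size; the results show that the method is effective.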

12.

Representing and processing lineages over uncertain data based on the Bayesian network

《Applied Soft Computing》2015

Processing lineages (also called provenances) over uncertain data consists in tracing the origin of uncertainty based on the process of data production and evolution. In this paper, we focus on the representation and processing of lineages over uncertain data, where we adopt the Bayesian network (BN), one of the popular and important probabilistic graphical models (PGMs), as the framework of uncertainty representation and inference. Starting from the lineage expressed as Boolean formulae for SPJ (Selection–Projection–Join) queries over uncertain data, we propose a method to transform the lineage expression into an equivalent directed acyclic graph (DAG). Specifically, we discuss the corresponding probabilistic semantics and properties to guarantee theoretically that the graphical model can support effective probabilistic inference in lineage processing. Then, we propose a function-based method to compute the conditional probability table (CPT) for each node in the DAG. The BN for representing lineage expressions over uncertain data, called lineage BN and abbreviated as LBN, can be constructed and is generally suitable for both safe and unsafe query plans. Therefore, we give a variable-elimination-based algorithm for LBN's exact inference to obtain the probabilities of query results, called LBN-based query processing. Then, we focus on obtaining the probabilities of inputs or intermediate tuples conditioned on query results, called LBN-based inference query processing, and give a Gibbs-sampling-based algorithm for LBN's approximate inference. Experimental results show the efficiency and effectiveness of our methods.

13.

PrDB: managing and exploiting rich correlations in probabilistic databases

Prithviraj Sen Amol Deshpande Lise Getoor 《The VLDB Journal The International Journal on Very Large Data Bases》2009,18(5):1065-1090

Due to numerous applications producing noisy data, e.g., sensor data, experimental data, data from uncurated sources, information extraction, etc., there has been a surge of interest in the development of probabilistic databases. Most probabilistic database models proposed to date, however, fail to meet the challenges of real-world applications on two counts: (1) they often restrict the kinds of uncertainty that the user can represent; and (2) the query processing algorithms often cannot scale up to the needs of the application. In this work, we define a probabilistic database model, PrDB, that uses graphical models, a state-of-the-art probabilistic modeling technique developed within the statistics and machine learning community, to model uncertain data. We show how this results in a rich, complex yet compact probabilistic database model, which can capture the commonly occurring uncertainty models (tuple uncertainty, attribute uncertainty), more complex models (correlated tuples and attributes) and allows compact representation (shared and schema-level correlations). In addition, we show how query evaluation in PrDB translates into inference in an appropriately augmented graphical model. This allows us to easily use any of a myriad of exact and approximate inference algorithms developed within the graphical modeling community. While probabilistic inference provides a generic approach to solving queries, we show how the use of shared correlations, together with a novel inference algorithm that we developed based on bisimulation, can speed query processing significantly. We present a comprehensive experimental evaluation of the proposed techniques and show that even with a few shared correlations, significant speedups are possible.

14.

Evaluation of probabilistic queries over imprecise data in constantly-evolving environments

Reynold Cheng Dmitri V. Kalashnikov Sunil Prabhakar 《Information Systems》2007

Sensors are often employed to monitor continuously changing entities like locations of moving objects and temperature. The sensor readings are reported to a database system, and are subsequently used to answer queries. Due to continuous changes in these values and limited resources (e.g., network bandwidth and battery power), the database may not be able to keep track of the actual values of the entities. Queries that use these old values may produce incorrect answers. However, if the degree of uncertainty between the actual data value and the database value is limited, one can place more confidence in the answers to the queries. More generally, query answers can be augmented with probabilistic guarantees of the validity of the answers. In this paper, we study probabilistic query evaluation based on uncertain data. A classification of queries is made based upon the nature of the result set. For each class, we develop algorithms for computing probabilistic answers, and provide efficient indexing and numeric solutions. We address the important issue of measuring the quality of the answers to these queries, and provide algorithms for efficiently pulling data from relevant sensors or moving objects in order to improve the quality of the executing queries. Extensive experiments are performed to examine the effectiveness of several data update policies.

15.

Research on a probabilistic data model based on tuple existence

Chen Peng (陈鹏) 《计算机科学 (Computer Science)》2012,39(105):265-270

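With the rapid development of information and communication technology, data management faces a growing number of challenges, one of which is data uncertainty. A probabilistic data model based on tuple existence is proposed.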

16.

An efficient index structure for XML based on generalized suffix tree

Liang Zuopeng Hu Kongfa Ye Ning Dong Yisheng 《Information Systems》2007

A novel index structure based on the generalized suffix tree (PIGST) is proposed. Combined with post lists, PIGST can answer both structural and content queries. The distinct paths in an XML collection are mapped into strings. The construction algorithm of the PIGST for the path strings is presented based on the modification and improvement of a well-known suffix tree construction algorithm that only requires linear time and space complexity. The query process merely needs m character comparisons for direct containment queries, where m is the length of a query string. An efficient processing method for the indirect containment queries that avoids the inefficient tree traversal operation is also presented. Experiments show that PIGST outperforms earlier approaches.

17.

An OLAP aggregate query algorithm based on dimension hierarchy encoding

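Hu Kongfa (胡孔法) Dong Yisheng (董逸生) Xu Lizhen (徐立臻) Yang Kehua (杨科华) 《计算机研究与发展 (Journal of Computer Research and Development)》2004,41(4):608-614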

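Online analytical processing (OLAP) queries often perform ad hoc, complex group-by aggregation over massive data, and their SQL statements usually contain multi-table joins and group-by aggregation operations. Reducing multi-table joins, compressing keys, and grouping and aggregating the queried data efficiently therefore become key problems in ROLAP query processing. A new pre-grouping aggregation algorithm based on dimension hierarchy encoding, DHEPGA, is proposed. DHEPGA exploits the short dimension hierarchy encodings and their prefixes to quickly retrieve the encodings that match the query keywords and to obtain the query ranges of the dimension hierarchy attributes, which reduces I/O cost and improves OLAP query efficiency. Theoretical analysis and experimental results show that DHEPGA is very effective.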

18.

Optimization and evaluation of disjunctive queries

Claussen J. Kemper A. Moerkotte G. Peithner K. Steinbrunn M. 《Knowledge and Data Engineering, IEEE Transactions on》2000,12(2):238-260

It is striking that the optimization of disjunctive queries, i.e., those which contain at least one OR-connective in the query predicate, has been vastly neglected in the literature, as well as in commercial systems. In this paper, we propose a novel technique, called bypass processing, for evaluating such disjunctive queries. The bypass processing technique is based on new selection and join operators that produce two output streams: the TRUE-stream with tuples satisfying the selection (join) predicate and the FALSE-stream with tuples not satisfying the corresponding predicate. Splitting the tuple streams in this way enables us to “bypass” costly predicates whenever the “fate” of the corresponding tuple (stream) can be determined without evaluating this predicate. In the paper, we show how to systematically generate bypass evaluation plans utilizing a bottom-up building-block approach. We show that our evaluation technique allows us to incorporate the standard SQL semantics of null values. For this, we devise two different approaches: one is based on explicitly incorporating three-valued logic into the evaluation plans; the other one relies on two-valued logic by “moving” all negations to atomic conditions of the selection predicate. We describe how to extend an iterator-based query engine to support bypass evaluation with little extra overhead. This query engine was used to quantitatively evaluate the bypass evaluation plans against the traditional evaluation techniques utilizing a CNF- or DNF-based query predicate.
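The two-stream selection described above can be sketched in a few lines. This is a simplified illustration for a predicate of the form `cheap OR expensive` over in-memory lists, under the assumption that the predicates never return null; the names `bypass_select` and `evaluate_disjunction` are hypothetical, and the paper's operators, plan generation, and null handling are not modeled.

```python
def bypass_select(tuples, predicate):
    """Split the input into a TRUE-stream and a FALSE-stream for `predicate`."""
    true_stream, false_stream = [], []
    for t in tuples:
        (true_stream if predicate(t) else false_stream).append(t)
    return true_stream, false_stream


def evaluate_disjunction(tuples, cheap, expensive):
    """Evaluate `cheap(t) OR expensive(t)`.

    Tuples that satisfy the cheap predicate bypass the expensive one; only
    the FALSE-stream of the cheap selection is passed on to it.
    """
    true_cheap, false_cheap = bypass_select(tuples, cheap)
    true_expensive, _ = bypass_select(false_cheap, expensive)
    return true_cheap + true_expensive


# Illustrative predicates (made up); `calls` records which tuples reach
# the expensive predicate.
calls = []


def expensive(x):
    calls.append(x)
    return x > 4


result = evaluate_disjunction(range(1, 7), lambda x: x % 2 == 0, expensive)
# result == [2, 4, 6, 5]; calls == [1, 3, 5]
```

Only three of the six input tuples reach the expensive predicate in this example. The output order follows the streams rather than the input, which is acceptable for set-oriented query results.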

19.

Probabilistic location-dependent queries at different location granularities

《Pervasive and Mobile Computing》2017

Approaches for the processing of location-dependent queries usually assume that the location data are expressed precisely, usually using GPS locations. However, this is unrealistic because positioning methods do not have perfect accuracy (e.g., the positioning approach used in cellular networks handles only the cell where mobile users are located). Besides, users may need to express queries based on concepts of locations other than traditional GPS locations, which we call location granules. In this paper, we focus on location granule-based query processing (i.e., processing of queries with location granules) in situations where the location data available is imprecise, which we have called probabilistic location-dependent queries. For that purpose, we exploit the concept of uncertainty location granule, which represents the location uncertainty of an object. In particular, we tackle the problem of processing probabilistic inside (range) constraints. We analyze in detail how those constraints can be processed, taking into account both the existence of location uncertainty affecting the relevant objects and the location granularity specified. An extensive experimental evaluation shows the feasibility of the proposed probabilistic query processing approach and analyzes the advantages of using index structures to speed up the query processing.

20.

Queries and materialized views on probabilistic databases

Nilesh Dalvi Christopher Re Dan Suciu 《Journal of Computer and System Sciences》2011,77(3):473-490

We review in this paper some recent yet fundamental results on evaluating queries over probabilistic databases. While one can see this problem as a special instance of general purpose probabilistic inference, we describe in this paper two key database specific techniques that significantly reduce the complexity of query evaluation on probabilistic databases. The first is the separation of the query and the data: we show here that by doing so, one can identify queries whose data complexity is #P-hard, and queries whose data complexity is in PTIME. The second is the aggressive use of previously computed query results (materialized views): in particular, by rewriting a query in terms of views, one can reduce its complexity from #P-complete to PTIME. We describe a notion of a partial representation for views, and show that, once computed and stored, this partial representation can be used to answer subsequent queries on the probabilistic database.