20 related references found.
1.
Efficient processing of probabilistic reverse nearest neighbor queries over uncertain data
Xiang Lian, Lei Chen. The VLDB Journal, 2009, 18(3): 787-808
Reverse nearest neighbor (RNN) search is crucial in many real applications. In particular, given a database and a query object, an RNN query retrieves all the data objects in the database that have the query object as their nearest neighbor. Often, due to limitations of measurement devices, environmental disturbance, or characteristics of applications (for example, monitoring moving objects), data obtained from the real world are uncertain (imprecise). Therefore, previous approaches proposed for answering an RNN query over an exact (precise) database cannot be directly applied to the uncertain scenario. In this paper, we re-define the RNN query in the context of uncertain databases as the probabilistic reverse nearest neighbor (PRNN) query, which obtains data objects whose probability of being an RNN is greater than or equal to a user-specified threshold. Since answering a PRNN query requires accessing all the objects in the database, which is quite costly, we also propose an effective pruning method, called geometric pruning (GP), that significantly reduces the PRNN search space without introducing any false dismissals. Furthermore, we present an efficient PRNN query procedure that seamlessly integrates our pruning method. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed GP-based PRNN query processing approach under various experimental settings.
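For reference, here is a minimal brute-force sketch of the exact-data RNN semantics that the PRNN query generalizes; the function name and toy data are illustrative, and this is not the paper's GP-based algorithm.

```python
import math

def rnn(query, points):
    """Brute-force reverse nearest neighbors over an exact (certain) dataset:
    a point p is an RNN of `query` if the query is at least as close to p as
    p's nearest data point is."""
    result = []
    for p in points:
        d_query = math.dist(p, query)
        d_nn = min((math.dist(p, o) for o in points if o is not p),
                   default=float("inf"))
        if d_query <= d_nn:
            result.append(p)
    return result

# toy usage: (0, 0) and (1, 1) have the query as their nearest neighbor
data = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(rnn((0.5, 0.5), data))
```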
2.
Current skyline evaluation techniques mainly aim to find the outstanding tuples in a large dataset. In this paper, we generalize the concept of the skyline query and introduce a novel type of query, the combinatorial skyline query, which finds the outstanding combinations among all combinations of the given tuples. The traditional skyline query is a special case of the combinatorial skyline query. This generalized concept is semantically richer when used in decision making, market analysis, business planning, and quantitative economics research. We first introduce the concept of the combinatorial skyline query (CSQ) and explain the difficulty of resolving this type of query. Then, we propose two algorithms to solve the problem. The experiments demonstrate the effectiveness and efficiency of the proposed algorithms.
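A naive sketch of the combinatorial skyline idea follows: enumerate k-combinations, aggregate each combination's attributes, and keep the non-dominated ones. The component-wise sum used as the aggregate is an assumption (the abstract leaves it abstract), and the data are invented.

```python
from itertools import combinations

def dominates(a, b):
    """a dominates b if it is no worse in every dimension and strictly better in one (smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def combinatorial_skyline(tuples, k):
    """Naive combinatorial skyline: score every k-combination by the
    component-wise sum of its members and keep the non-dominated combinations.
    With k = 1 this degenerates to the ordinary skyline."""
    scored = [(combo, tuple(sum(dim) for dim in zip(*combo)))
              for combo in combinations(tuples, k)]
    return [combo for combo, agg in scored
            if not any(dominates(other, agg) for _, other in scored)]

# toy usage over 2-dimensional tuples
data = [(1, 5), (2, 2), (5, 1), (4, 4)]
print(combinatorial_skyline(data, 2))
```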
3.
Due to the inherent existence of uncertainty in many real-world applications, in this paper we investigate an important query over uncertain databases, namely the probabilistic least influenced set (PLIS) query, which retrieves all the uncertain objects in an uncertain database that are, with high probability, least affected by a given query object. Such a PLIS query is useful in applications such as business planning. We propose and tackle both monochromatic and bichromatic versions of the PLIS query (M-PLIS and B-PLIS, respectively). In order to answer PLIS queries efficiently, we present three pruning methods, MINMAX, Regional, and Candidate pruning, which effectively reduce the PLIS search space. The proposed pruning methods can be seamlessly integrated into efficient query procedures. Moreover, we also study an important variant of the PLIS query with an uncertain query object (UQ-PLIS). Furthermore, we formulate and tackle the PLIS problem on uncertain moving objects (UMOD-PLIS). Extensive experiments have demonstrated the efficiency and effectiveness of our proposed approaches under various settings.
4.
Recently, uncertain data have received dramatic attention along with technical advances in geographical tracking, sensor networks, RFID, and so on. Ranking queries over uncertain data have also become a research focus of uncertain data management. With the rapidly growing application of fuzzy set theory, many queries now involve fuzzy conditions, and such conditions are widely used when querying uncertain data. For instance, in a weather monitoring system, weather data are inherently uncertain due to measurement errors. Weather data depicting heavy rain are desired, where "heavy" is an ambiguous term in the fuzzy query. However, fuzzy queries alone cannot ensure returning the expected results from uncertain databases. In this paper, we study a novel kind of ranking query, Fuzzy Ranking queries (FRanking queries), which extend the traditional notion of ranking queries. FRanking queries handle fuzzy queries submitted by users and return the k results that are most likely to satisfy the fuzzy query in an uncertain database. Due to fuzzy query conditions, the ranks of tuples cannot be evaluated by existing ranking functions. We propose a Fuzzy Ranking Function to calculate tuples' ranks in uncertain databases for both attribute-level and tuple-level uncertainty models. Our ranking function takes both the uncertainty and the fuzzy semantics into account. FRanking queries are formally defined based on the Fuzzy Ranking Function. When answering FRanking queries, we use a pruning method that safely prunes unnecessary tuples to reduce the search space. To further improve efficiency, we design an efficient algorithm, the Incremental Membership Algorithm (IMA), which answers FRanking queries by evaluating the ranks of incremental tuples under each threshold of the fuzzy set. We demonstrate the effectiveness and efficiency of our methods through theoretical analysis and experiments with synthetic and real datasets.
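As a rough illustration of the idea only, one can score each attribute-level uncertain tuple by its expected fuzzy membership and take the top k. The expected-membership score, the membership function, and the data below are invented stand-ins, not the paper's Fuzzy Ranking Function.

```python
def heavy_rain_membership(mm_per_hour):
    """Illustrative fuzzy membership for 'heavy rain' (piecewise linear);
    real membership functions are application-specific."""
    if mm_per_hour <= 4.0:
        return 0.0
    if mm_per_hour >= 16.0:
        return 1.0
    return (mm_per_hour - 4.0) / 12.0

def fuzzy_topk(tuples, k):
    """Rank attribute-level uncertain tuples, each given as
    (tuple_id, [(value, probability), ...]), by expected fuzzy membership.
    This score is an assumed stand-in for a fuzzy ranking function."""
    scored = [(sum(p * heavy_rain_membership(v) for v, p in alternatives), tid)
              for tid, alternatives in tuples]
    return sorted(scored, reverse=True)[:k]

readings = [
    ("s1", [(18.0, 0.6), (3.0, 0.4)]),   # probably heavy rain
    ("s2", [(10.0, 0.9), (12.0, 0.1)]),  # moderate rain, almost certain
    ("s3", [(2.0, 1.0)]),                # light rain
]
print(fuzzy_topk(readings, 2))
```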
5.
Pervasive applications, such as natural habitat monitoring and location-based services, have attracted plenty of research interest. These applications, which deploy many sensor devices to collect data from external environments, often have limited network bandwidth and battery resources. Moreover, sensors cannot record perfectly accurate values. The uncertainty of data captured by a sensor should thus be considered during query evaluation. To this end, probabilistic queries, which account for data imprecision and provide statistical guarantees in their answers, have recently been studied.
6.
As data of an unprecedented scale become accessible, it is increasingly important to help each user identify ideal results of a manageable size. Skyline queries have recently attracted a lot of attention as such a mechanism, owing to their intuitive query formulation. This intuitiveness, however, has the side effect of retrieving too many results, especially for high-dimensional data. This paper supports personalized skyline queries that identify "truly interesting" objects based on user-specific preferences and a retrieval size k. In particular, we abstract personalized skyline ranking as a dynamic search over skyline subspaces guided by user-specific preferences. We then develop a novel algorithm that navigates a compressed structure directly, to reduce the storage overhead. Furthermore, we develop novel techniques that interleave cube construction with navigation for scenarios without an a priori structure. Finally, we extend the proposed techniques to user-specific preferences including equivalence preferences. Our extensive evaluation results validate the effectiveness and efficiency of the proposed algorithms on both real-life and synthetic data.
7.
Many recent applications involve processing and analyzing uncertain data. In this paper, we combine the feature of top-k objects with that of the skyline to model the problem of top-k skyline objects over uncertain data. Efficiently computing top-k skyline objects on large uncertain datasets is challenging in both the discrete and continuous cases. We first develop an efficient exact algorithm for computing the top-k skyline objects in the discrete case. To address applications where each object may have a massive set of instances or a continuous probability density function, we also develop an efficient randomized algorithm with an ε-approximation guarantee. Moreover, our algorithms can be immediately extended to efficiently compute the p-skyline, that is, to retrieve the uncertain objects with skyline probabilities above a given threshold. Our extensive experiments on synthetic and real data demonstrate the efficiency of both algorithms and show that the randomized algorithm is highly accurate. They also show that our techniques significantly outperform the existing techniques for computing the p-skyline.
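For intuition about the p-skyline semantics mentioned above, here is a generic Monte Carlo estimator of per-object skyline probability. It is only a sketch under an independence assumption over objects and is not the paper's exact or ε-approximate algorithm; the object names and data are invented.

```python
import random

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline_probability(objects, trials=2000, rng=random):
    """Monte Carlo estimate of each uncertain object's skyline probability.
    `objects` maps an object id to its discrete instances [(point, prob), ...]
    with probabilities summing to 1; objects are assumed independent."""
    counts = dict.fromkeys(objects, 0)
    ids = list(objects)
    for _ in range(trials):
        # sample one possible world: draw one instance per object
        world = {oid: rng.choices([pt for pt, _ in inst],
                                  weights=[p for _, p in inst])[0]
                 for oid, inst in objects.items()}
        for oid in ids:
            if not any(dominates(world[other], world[oid])
                       for other in ids if other != oid):
                counts[oid] += 1
    return {oid: c / trials for oid, c in counts.items()}

uncertain = {
    "A": [((1, 2), 0.5), ((6, 6), 0.5)],
    "B": [((2, 1), 1.0)],
}
# "B" is always a skyline point; "A" is one only when its (1, 2) instance occurs
print(skyline_probability(uncertain))
```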
8.
Sensors are often employed to monitor continuously changing entities like locations of moving objects and temperature. The sensor readings are reported to a database system, and are subsequently used to answer queries. Due to continuous changes in these values and limited resources (e.g., network bandwidth and battery power), the database may not be able to keep track of the actual values of the entities. Queries that use these old values may produce incorrect answers. However, if the degree of uncertainty between the actual data value and the database value is limited, one can place more confidence in the answers to the queries. More generally, query answers can be augmented with probabilistic guarantees of the validity of the answers. In this paper, we study probabilistic query evaluation based on uncertain data. A classification of queries is made based upon the nature of the result set. For each class, we develop algorithms for computing probabilistic answers, and provide efficient indexing and numeric solutions. We address the important issue of measuring the quality of the answers to these queries, and provide algorithms for efficiently pulling data from relevant sensors or moving objects in order to improve the quality of the executing queries. Extensive experiments are performed to examine the effectiveness of several data update policies.
9.
We revisit the problem of revising probabilistic beliefs using uncertain evidence, and report results on several major issues relating to this problem: How should one specify uncertain evidence? How should one revise a probability distribution? How should one interpret informal evidential statements? Should, and do, iterated belief revisions commute? And what guarantees can be offered on the amount of belief change induced by a particular revision? Our discussion is focused on two main methods for probabilistic revision: Jeffrey's rule of probability kinematics and Pearl's method of virtual evidence; we analyze and unify these methods from the perspective of the questions posed above.
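For concreteness, Jeffrey's rule on a finite outcome space revises the distribution cell by cell while preserving the conditional probabilities within each cell. The weather example below is invented for illustration.

```python
def jeffrey_update(prior, partition, new_marginals):
    """Jeffrey's rule of probability kinematics on a finite outcome space.

    prior:         dict outcome -> P(outcome)
    partition:     dict event_label -> set of outcomes (a partition of the space)
    new_marginals: dict event_label -> revised probability q_i of that event

    Returns P'(w) = q_i * P(w) / P(B_i) for each outcome w in cell B_i, which
    keeps P'(w | B_i) = P(w | B_i) as Jeffrey's rule requires.
    """
    posterior = {}
    for label, cell in partition.items():
        p_cell = sum(prior[w] for w in cell)
        for w in cell:
            posterior[w] = new_marginals[label] * prior[w] / p_cell
    return posterior

# uncertain evidence shifts P(rain) from 0.3 to 0.7 without saying which outcome occurred
prior = {"rain_cold": 0.1, "rain_warm": 0.2, "dry_cold": 0.3, "dry_warm": 0.4}
partition = {"rain": {"rain_cold", "rain_warm"}, "dry": {"dry_cold", "dry_warm"}}
print(jeffrey_update(prior, partition, {"rain": 0.7, "dry": 0.3}))
```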
10.
Recently, uncertain data processing has become increasingly important. Although a significant amount of previous research explores various continuous queries on data streams, continuous queries on uncertain data streams have seldom been investigated. In this paper, we formulate a novel and challenging problem of continuously monitoring top-k uncertain data streams, and propose a probabilistic threshold method. We develop four algorithms systematically: a deterministic exact algorithm, a randomized method, and their space-efficient versions using quantile summaries. An extensive empirical study using real and synthetic data sets is reported to verify the effectiveness and efficiency of our methods.
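To illustrate the top-k probability that a threshold method would compare against a user-given threshold, here is a small possible-worlds computation for tuple-level uncertainty with independent tuples. It is a generic static computation for a window of tuples, not the paper's streaming algorithms.

```python
def topk_probability(tuples, k):
    """Probability that each tuple of an uncertain relation ranks in the top-k.
    tuples = [(score, existence_prob), ...], tuples assumed independent.
    A tuple is in the top-k if it exists and fewer than k higher-scoring
    tuples exist (a Poisson-binomial dynamic program over the better tuples)."""
    ordered = sorted(tuples, key=lambda t: -t[0])
    results = []
    dp = [1.0]                               # dp[j] = P(exactly j higher-scoring tuples exist)
    for score, p in ordered:
        in_topk = p * sum(dp[:k])            # exists and fewer than k better tuples exist
        results.append((score, p, in_topk))
        new_dp = [0.0] * (len(dp) + 1)       # fold this tuple into the DP for later tuples
        for j, q in enumerate(dp):
            new_dp[j] += q * (1 - p)
            new_dp[j + 1] += q * p
        dp = new_dp
    return results

stream_window = [(0.9, 0.5), (0.8, 0.9), (0.7, 0.4)]
for score, prob, pk in topk_probability(stream_window, k=2):
    print(f"score={score}, exists with p={prob}: P(in top-2) = {pk:.3f}")
```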
11.
In recent years, great attention has been paid to skyline computation over uncertain data. In this paper, we study how to conduct advanced skyline analysis over uncertain databases where uncertainty is modeled using evidence theory (a.k.a. belief function theory). We particularly tackle an important issue, namely the skyline stars (denoted SKY2) over evidential data. This kind of skyline aims at retrieving the best evidential skyline objects (the stars). Efficient algorithms have been developed to compute SKY2. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed approaches, which considerably refine the huge skyline. In addition, the conducted experiments have shown that our algorithms significantly outperform the basic skyline algorithms in terms of CPU and memory costs.
12.
In this paper, we study a variant of reachability queries, called label-constraint reachability (LCR) queries. Specifically, given a label set S and two vertices u1 and u2 in a large directed graph G, we check the existence of a directed path from u1 to u2 whose edge labels are a subset of S. We propose the path-label transitive closure method to answer LCR queries. Specifically, we transform an edge-labeled directed graph into an augmented DAG by replacing maximal strongly connected components with bipartite graphs. We also propose a Dijkstra-like algorithm to compute the path-label transitive closure by re-defining the "distance" of a path. Compared with existing solutions, we prove that our method is optimal in terms of the search space. Furthermore, we propose a simple yet effective partition-based framework (local path-label transitive closure plus online traversal) to answer LCR queries in large graphs. We prove that finding the optimal graph partition to minimize query processing cost is an NP-hard problem; therefore, we propose a sampling-based solution to find a sub-optimal partition. Moreover, we address index maintenance issues for answering LCR queries over dynamic graphs. Extensive experiments confirm the superiority of our method.
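A naive online-traversal baseline for an LCR query is a BFS that only follows edges whose labels lie in S; the index-based path-label transitive closure in the paper exists precisely to avoid this per-query traversal. The toy graph below is invented.

```python
from collections import deque

def lcr_reachable(graph, u1, u2, allowed_labels):
    """Label-constraint reachability by plain BFS: is there a directed path
    from u1 to u2 whose edge labels are all contained in allowed_labels?
    graph: dict vertex -> list of (neighbor, label) edges."""
    allowed = set(allowed_labels)
    seen, queue = {u1}, deque([u1])
    while queue:
        v = queue.popleft()
        if v == u2:
            return True
        for nxt, label in graph.get(v, []):
            if label in allowed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

g = {"a": [("b", "x"), ("c", "y")], "b": [("d", "x")], "c": [("d", "z")]}
print(lcr_reachable(g, "a", "d", {"x"}))   # True via a -x-> b -x-> d
print(lcr_reachable(g, "a", "d", {"y"}))   # False: label y alone cannot reach d
```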
13.
Hakan Ferhatosmanoglu, Ali Şaman Tosun, Guadalupe Canahuate, Aravind Ramachandran. Distributed and Parallel Databases, 2006, 20(2): 117-147
A common technique used to minimize I/O in data-intensive applications is data declustering over parallel servers. This technique involves distributing data among several disks so as to parallelize query retrieval and thus improve performance. We focus on optimizing access to large spatial data and the most common type of query on such data, i.e., range queries. An optimal declustering scheme is one in which the processing for all range queries is balanced uniformly among the available disks. It has been shown that single-copy-based declustering schemes are non-optimal for range queries. In this paper, we integrate replication with parallel disk declustering for efficient processing of range queries. We note that replication is widely used in database applications for purposes such as load balancing, fault tolerance, and data availability. We propose theoretical foundations for replicated declustering and a class of replicated declustering schemes, periodic allocations, which are shown to be strictly optimal for a number of disks. We propose a framework for replicated declustering that uses a limited amount of replication, and provide extensions to apply it to real data, including arbitrary grids and a large number of disks. Our framework also provides an effective indexing scheme that enables fast identification of data of interest in parallel servers. In addition to optimal processing of single queries, we show that this framework is effective for parallel processing of multiple queries. We present experimental results comparing the proposed replication scheme to other techniques for both single and multiple queries, on synthetic and real data sets.
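A generic illustration of the replicated-declustering idea with two copies per grid cell follows; the particular shift-based allocation function and the greedy disk choice are assumptions made for the sketch, not the paper's provably optimal periodic allocations.

```python
def periodic_allocation(rows, cols, num_disks, shifts=(0, 1)):
    """Assign the cells of a rows x cols spatial grid to num_disks disks with a
    simple periodic (latin-square-style) allocation, one copy per shift."""
    copies = []
    for s in shifts:
        alloc = {(i, j): (i + s + j * (s + 1)) % num_disks
                 for i in range(rows) for j in range(cols)}
        copies.append(alloc)
    return copies

def disks_touched(copies, cell_range):
    """Disks read for a range query, picking for each cell the copy stored on
    the least-loaded disk so far (greedy load balancing across replicas)."""
    load = {}
    for cell in cell_range:
        best = min((alloc[cell] for alloc in copies), key=lambda d: load.get(d, 0))
        load[best] = load.get(best, 0) + 1
    return load

copies = periodic_allocation(rows=4, cols=4, num_disks=4)
query = [(i, j) for i in range(2) for j in range(4)]   # a 2 x 4 range query
print(disks_touched(copies, query))                    # per-disk cell counts
```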
14.
Two-tier streaming settings are typical dynamic environments where continuous skylines serve as an important semantic indicator over multiple attributes. To monitor skylines over the dynamic data in such settings, one needs to continuously update the skyline query results to reflect new data values. This paper tackles the problem of continuous skyline monitoring on a central query server over dynamic data from multiple data sites. Simply sending all tuple-value updates to the server is cost-prohibitive. Therefore, we propose an approach that allows the central server to collaborate with the data sites to monitor possible skyline changes. By doing so, the processing load is distributed over all the data sites instead of resting solely on the central server. Furthermore, this collaborative approach minimizes the bandwidth consumption between the server and the data sites, which is often critical in a widely distributed environment such as a wide-area sensor network. We give theoretical upper bounds on the computation and communication costs of the proposed collaborative approach. We also conduct extensive experiments on both synthetic and real data sets. The experimental results demonstrate that our collaborative approach is efficient, scalable, and well-balanced in terms of communication and computation costs.
15.
The flexibility of the XML data model allows a more natural representation of uncertain data than the relational model. Matching a twig pattern against XML data is a fundamental problem in querying information from XML documents. For a probabilistic XML document, each twig answer has a probability value because of the uncertainty of the data. Twig answers with small probability values are of little use to users, who usually only want the answers with the k largest probability values. Existing algorithms for ordinary XML documents are not directly applicable here, because of the need to handle probability-distribution nodes and to efficiently calculate the top-k probabilities of answers in probabilistic XML. In this paper, we address the problem of finding twig answers with the top-k probability values directly against probabilistic XML documents. We propose a new encoding scheme called PEDewey for probabilistic XML. Based on this encoding scheme, we design two algorithms for finding answers with top-k probabilities for twig queries: ProTJFast, which processes probabilistic XML data based on element streams in document order, and PTopKTwig, which is based on element streams ordered by path probability values. Experiments have been conducted to study the performance of these algorithms.
16.
Despite the importance of ranked queries in numerous applications involving multi-criteria decision making, they are not efficiently supported by traditional database systems. In this paper, we propose a simple yet powerful technique for processing such queries based on multi-dimensional access methods and branch-and-bound search. The advantages of the proposed methodology are: (i) it is space efficient, requiring only a single index on the given relation (storing each tuple at most once), (ii) it achieves significant (i.e., orders of magnitude) performance gains with respect to the current state-of-the-art, (iii) it can efficiently handle data updates, and (iv) it is applicable to other important variations of ranked search (including the support for non-monotone preference functions), at no extra space overhead. We confirm the superiority of the proposed methods with a detailed experimental study.
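The branch-and-bound idea over a multi-dimensional index can be sketched as best-first search on a priority queue keyed by lower bounds of a monotone scoring function. The toy index layout, node encoding, and names below are assumptions for the sketch, not the paper's implementation.

```python
import heapq

def topk_branch_and_bound(root, score, k):
    """Best-first branch-and-bound top-k (lowest score first) over a
    hierarchical index. Internal nodes carry a lower bound of `score` over all
    tuples in their subtree (for a monotone function, the score of the node's
    best corner). Entries: ("node", lower_bound, children) or ("tuple", attrs, payload)."""
    start_key = root[1] if root[0] == "node" else score(root[1])
    heap = [(start_key, 0, root)]
    counter = 1                                   # tie-breaker so the heap never compares entries
    results = []
    while heap and len(results) < k:
        key, _, entry = heapq.heappop(heap)
        if entry[0] == "tuple":
            results.append((key, entry[2]))       # no unexpanded entry can score lower
        else:
            for child in entry[2]:
                child_key = child[1] if child[0] == "node" else score(child[1])
                heapq.heappush(heap, (child_key, counter, child))
                counter += 1
    return results

# toy two-level "index": leaves grouped under nodes with precomputed lower bounds
score = lambda attrs: attrs[0] + attrs[1]         # a monotone preference function
leaf = lambda attrs, name: ("tuple", attrs, name)
index = ("node", 0, [
    ("node", 2, [leaf((1, 1), "a"), leaf((1, 4), "b")]),
    ("node", 5, [leaf((2, 3), "c"), leaf((4, 4), "d")]),
])
print(topk_branch_and_bound(index, score, k=2))   # the two lowest-scoring tuples: [(2, 'a'), (5, 'b')]
```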
17.
A new mining approach for uncertain databases using CUFP trees
Chun-Wei Lin. Expert Systems with Applications, 2012, 39(4): 4084-4093
In the past, many algorithms have been proposed to mine frequent itemsets from transactional databases in which the presence or absence of items in transactions is known with certainty. In some applications, however, items in transactions may be uncertain, with existential probabilities ranging from 0 to 1. Processing uncertain datasets is quite different from processing certain ones. The UF-tree algorithm was proposed to construct a UF-tree structure from an uncertain dataset and mine frequent itemsets from the tree. In the UF-tree construction process, however, only identical items with identical existential probabilities are merged in the tree, causing many redundant nodes. In this paper, a new tree structure called the compressed uncertain frequent-pattern tree (CUFP tree) is designed to efficiently keep the related information during the mining process. In a CUFP tree, the same items are merged into a single branch of the tree even when their existential probabilities in transactions differ. A mining algorithm called CUFP-mine is then proposed based on this tree structure to find uncertain frequent patterns. Experimental results show that the proposed approach outperforms the UF-tree algorithm in both execution time and number of tree nodes.
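A minimal sketch of the node-sharing idea the abstract describes: unlike a UF-tree, transactions containing the same item follow the same branch even when their existential probabilities differ. The field names and the stored aggregate are illustrative only, not the paper's exact node layout.

```python
class CUFPNode:
    """Compressed uncertain frequent-pattern tree node (sketch): a single node
    per item on a branch, accumulating the existential probabilities of all
    transactions that pass through it instead of splitting per probability."""
    def __init__(self, item=None):
        self.item = item
        self.probs = []          # existential probabilities of transactions through this node
        self.children = {}

    def insert(self, transaction):
        """transaction: list of (item, existential_probability), already sorted
        in the tree's global item order."""
        node = self
        for item, prob in transaction:
            node = node.children.setdefault(item, CUFPNode(item))
            node.probs.append(prob)

# two transactions share the same items with different probabilities,
# yet follow the same branch (one node per item, no redundant nodes)
root = CUFPNode()
root.insert([("a", 0.9), ("b", 0.4)])
root.insert([("a", 0.6), ("b", 0.7)])
print(root.children["a"].probs)                 # [0.9, 0.6] -> one shared node
print(root.children["a"].children["b"].probs)   # [0.4, 0.7]
```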
18.
19.
LING Tok Wang. Science in China Series F: Information Sciences, 2009, 52(10): 1830-1847
As huge volumes of data are organized or exported in tree-structured form, it is quite necessary to extract useful information from these data collections using effective and efficient query processing methods. A natural way of retrieving desired information from XML documents is to use twig patterns (TP), which are in fact the core component of existing XML query languages. A twig pattern has the inherent feature that query nodes on the same path have concrete precedence relationships. It is this featu...
20.
Experimental data are subject to uncertainty, as every measurement apparatus is inaccurate at some level. However, the design of most computer vision and pattern recognition techniques (e.g., the Hough transform) overlooks this fact and treats intensities, locations, and directions as precise values. To take imprecision into account, entries are often resampled to create input datasets in which the uncertainty of each original entry is characterized by as many exact elements as necessary. Clear disadvantages of this sampling-based approach are the processing penalty imposed by a larger dataset and the difficulty of estimating the minimum number of required samples. We present an improved voting scheme for the General Framework for Subspace Detection (and hence for its particular case, the Hough transform) that allows processing both exact and uncertain data. Our approach is based on an analytical derivation of the propagation of Gaussian uncertainty from the input data into the distribution of votes in an auxiliary parameter space. In this parameter space, the uncertainty is also described by Gaussian distributions. In turn, the votes are mapped to the actual parameter space as non-Gaussian distributions. Our results show that the resulting accumulators have smoother vote distributions and are consistent with those obtained using the conventional sampling process, thus safely replacing them with significant performance gains.
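The core propagation step can be sketched as first-order (Jacobian-based) propagation of a point's 2-D Gaussian covariance into the rho coordinate of the (theta, rho) line parameterization used by the Hough transform. This generic formula only illustrates the propagation idea; it is not the paper's full voting scheme, and the point and covariance below are invented.

```python
import numpy as np

def rho_uncertainty(point_mean, point_cov, theta):
    """First-order propagation of a 2-D Gaussian point uncertainty into the
    rho coordinate of (theta, rho) Hough space for a fixed theta:
    rho = x*cos(theta) + y*sin(theta), so var(rho) = J @ Sigma @ J.T with
    J = [cos(theta), sin(theta)]."""
    J = np.array([np.cos(theta), np.sin(theta)])
    rho = float(J @ point_mean)
    var = float(J @ point_cov @ J)
    return rho, var

# an uncertain feature point: mean location and covariance of its detection
mean = np.array([3.0, 4.0])
cov = np.array([[0.25, 0.05],
                [0.05, 0.10]])
for theta in np.linspace(0.0, np.pi, 5):
    rho, var = rho_uncertainty(mean, cov, theta)
    print(f"theta={theta:.2f}  rho={rho:.2f}  sigma_rho={np.sqrt(var):.3f}")
```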