Similar Documents
20 similar documents found
1.
Multi-dimensional top-k dominating queries
The top-k dominating query returns the k data objects that dominate the highest number of objects in a dataset. This query is an important tool for decision support, since it provides data analysts with an intuitive way of finding significant objects. In addition, it combines the advantages of top-k and skyline queries without sharing their disadvantages: (i) the output size can be controlled, (ii) no ranking functions need to be specified by users, and (iii) the result is independent of the scales at different dimensions. Despite their importance, top-k dominating queries have not received adequate attention from the research community. This paper is an extensive study on the evaluation of top-k dominating queries. First, we propose a set of algorithms that apply to indexed multi-dimensional data. Second, we investigate query evaluation on data that are not indexed. Finally, we study a relaxed variant of the query which considers dominance in dimensional subspaces. Experiments using synthetic and real datasets demonstrate that our algorithms significantly outperform a previous skyline-based approach. We also illustrate the applicability of this multi-dimensional analysis query by studying the meaningfulness of its results on real data.
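As a concrete illustration of the query itself (not the paper's index-based algorithms), the minimal Python sketch below answers a top-k dominating query by brute force, counting for each point how many other points it dominates. The smaller-is-better convention and the function names are assumptions made for illustration only.

```python
def dominates(p, q):
    # p dominates q if p is no worse than q in every dimension
    # and strictly better in at least one (assuming smaller is better).
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def topk_dominating(points, k):
    # Brute-force O(n^2) domination counting; the paper's contribution is
    # avoiding exactly this cost on indexed and unindexed data.
    scores = []
    for i, p in enumerate(points):
        count = sum(1 for j, q in enumerate(points) if j != i and dominates(p, q))
        scores.append((count, p))
    scores.sort(key=lambda t: t[0], reverse=True)
    return scores[:k]

# Example: topk_dominating([(1, 2), (2, 1), (3, 3), (2, 2)], k=2)
# returns the two points with the largest domination scores.
```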

2.
Ranking queries, also known as top-k queries, produce results that are ordered on some computed score. Typically, these queries involve joins, where users are usually interested only in the top-k join results. Top-k queries are dominant in many emerging applications, e.g., multimedia retrieval by content, Web databases, data mining, middleware, and most information retrieval applications. Current relational query processors do not handle ranking queries efficiently, especially when joins are involved. In this paper, we address supporting top-k join queries in relational query processors. We introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. The idea is to rank the join results progressively during the join operation. We introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. The operators are nonblocking and can be integrated into pipelined execution plans. We also propose an efficient heuristic designed to optimize a top-k join query by choosing the best join order. We address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. We implement the new operators inside a prototype database engine based on PREDATOR. The experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance. (Received: 23 December 2003; Accepted: 31 March 2004; Published online: 12 August 2004. Edited by S. Abiteboul. Extended version of the paper published in the Proceedings of the 29th International Conference on Very Large Databases, VLDB 2003, Berlin, Germany, pp. 754-765.)
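To make the "rank join results progressively" idea concrete, here is a hedged Python sketch of a threshold-based rank-join over two score-sorted inputs. It is a simplified illustration under assumed input formats (lists of (join_key, score) pairs sorted by descending score, plus a monotone combining function), not the paper's physical operators.

```python
import heapq

def rank_join(left, right, combine, k):
    # left, right: lists of (join_key, score), each sorted by descending score.
    # combine: monotone scoring function, e.g. lambda a, b: a + b.
    if not left or not right:
        return []
    seen_left, seen_right = [], []
    candidates = []                          # max-heap via negated scores
    top_l, top_r = left[0][1], right[0][1]   # best score of each input
    last_l, last_r = top_l, top_r            # score of the last tuple read
    i = j = turn = 0
    answers = []
    while len(answers) < k and (i < len(left) or j < len(right)):
        # Alternately read the next tuple from one of the inputs.
        if (turn == 0 and i < len(left)) or j >= len(right):
            key, score = left[i]; i += 1; last_l = score
            seen_left.append((key, score))
            for rkey, rscore in seen_right:   # join with everything seen so far
                if key == rkey:
                    heapq.heappush(candidates, (-combine(score, rscore), key))
        else:
            key, score = right[j]; j += 1; last_r = score
            seen_right.append((key, score))
            for lkey, lscore in seen_left:
                if key == lkey:
                    heapq.heappush(candidates, (-combine(lscore, score), key))
        turn ^= 1
        # No unseen join result can score above this threshold.
        threshold = max(combine(top_l, last_r), combine(last_l, top_r))
        while candidates and -candidates[0][0] >= threshold and len(answers) < k:
            neg, key = heapq.heappop(candidates)
            answers.append((key, -neg))
    while candidates and len(answers) < k:    # inputs exhausted: flush the rest
        neg, key = heapq.heappop(candidates)
        answers.append((key, -neg))
    return answers

# Example:
# rank_join([("a", 9), ("b", 7)], [("b", 8), ("a", 5)], lambda a, b: a + b, k=1)
# returns [("b", 15)] without having to materialize all join results first.
```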

3.
杨皓, 段磊, 胡斌, 邓松, 王文韬, 秦攀. 软件学报 (Journal of Software), 2015, 26(11): 2994-3009
Distinguishing (contrast) sequential patterns express the differences between collections of sequence data and are widely used in areas such as product recommendation, user behavior analysis, and power supply forecasting. Existing algorithms for mining distinguishing sequential patterns require users to set a positive support threshold and a negative support threshold. Without sufficient prior knowledge, it is difficult for users to set appropriate support thresholds, and patterns with significant contrast may therefore be missed. To address this, we propose kDSP-Miner (top-k distinguishing sequential patterns with gap constraint miner), an algorithm for mining top-k distinguishing sequential patterns with gap constraints. With kDSP-Miner, users only need to specify the number of most contrasting patterns they wish to discover, avoiding the need to set contrast support thresholds directly. Accordingly, the mining algorithm is easier to use and its results are easier to interpret. To improve execution efficiency, several pruning strategies and heuristics are designed. A multi-threaded version of kDSP-Miner is further developed to improve its ability to handle sequences with high-dimensional elements. Extensive experiments on real-world datasets verify the effectiveness and efficiency of the algorithm.

4.
Finding typical instances is an effective approach to understanding and analyzing large data sets. In this paper, we apply the idea of typicality analysis from psychology and cognitive science to database query answering, and study the novel problem of answering top-k typicality queries. We model typicality in large data sets systematically. Three types of top-k typicality queries are formulated. To answer questions like "Who are the top-k most typical NBA players?", the measure of simple typicality is developed. To answer questions like "Who are the top-k most typical guards distinguishing guards from other players?", the notion of discriminative typicality is proposed. Moreover, to answer questions like "Who are the best k typical guards representing the different types of guards as a whole?", the notion of representative typicality is used. Computing the exact answer to a top-k typicality query requires quadratic time, which is often too costly for online query answering on large databases. We develop a series of approximation methods for various situations: (1) the randomized tournament algorithm has linear complexity, though it does not provide a theoretical guarantee on the quality of the answers; (2) the direct local typicality approximation using VP-trees provides an approximation quality guarantee; (3) a local typicality tree data structure can be exploited to index a large set of objects. Then, typicality queries can be answered efficiently with quality guarantees by a tournament method based on a local typicality tree. An extensive performance study using two real data sets and a series of synthetic data sets clearly shows that top-k typicality queries are meaningful and our methods are practical.
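As a hedged illustration of the simple-typicality idea, the sketch below scores each object by a Gaussian-kernel average of its closeness to the other objects and returns the k highest-scoring ones. The kernel choice, bandwidth, and function names are assumptions for illustration, not the paper's exact estimator or its approximation algorithms.

```python
import math

def simple_typicality(points, bandwidth=1.0):
    # Score each point by its average kernel similarity to all other points:
    # points lying in dense regions of the data receive high typicality.
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    def kernel(d):
        return math.exp(-(d * d) / (2.0 * bandwidth * bandwidth))
    n = len(points)
    return [(sum(kernel(dist(p, q)) for j, q in enumerate(points) if j != i) / (n - 1), p)
            for i, p in enumerate(points)]

def topk_typical(points, k, bandwidth=1.0):
    # Exact answer in O(n^2) time, which is the cost the paper's
    # approximation methods (tournaments, VP-trees, typicality trees) avoid.
    return sorted(simple_typicality(points, bandwidth), reverse=True)[:k]

# Example: topk_typical([(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)], k=2)
# ranks the three clustered points above the outlier.
```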

5.
We study here fundamental issues involved in top-k query evaluation in probabilistic databases. We consider simple probabilistic databases in which probabilities are associated with individual tuples, and general probabilistic databases in which, additionally, exclusivity relationships between tuples can be represented. In contrast to other recent research in this area, we do not limit ourselves to injective scoring functions. We formulate three intuitive postulates for the semantics of top-k queries in probabilistic databases, and introduce a new semantics, Global-Topk, that satisfies those postulates to a large degree. We also show how to evaluate queries under the Global-Topk semantics. For simple databases we design dynamic-programming-based algorithms. For general databases we show polynomial-time reductions to the simple cases, and provide effective heuristics to speed up the computation in practice. For example, we demonstrate that for a fixed k the time complexity of top-k query evaluation is as low as linear, under the assumption that probabilistic databases are simple and scoring functions are injective. Research partially supported by NSF grant IIS-0307434. An earlier version of some of the results in this paper was presented in [1].

6.
Recently, due to intrinsic characteristics in many underlying data sets, a number of probabilistic queries on uncertain data have been investigated. Top-k dominating queries are very important in many applications including decision making in a multidimensional space. In this paper, we study the problem of efficiently computing top-k dominating queries on uncertain data. We first formally define the problem. Then, we develop an efficient, threshold-based algorithm to compute the exact solution. To overcome some inherent computational deficiency in an exact computation, we develop an efficient randomized algorithm with an accuracy guarantee. Our extensive experiments demonstrate that both algorithms are quite efficient, while the randomized algorithm is quite scalable against data set sizes, object areas, k values, etc. The randomized algorithm is also highly accurate in practice.

7.
We consider a distributed system where each node keeps a local count for items (similar to elections where nodes are ballot boxes and items are candidates). A top-k query in such a system asks which are the k items whose global count, across all nodes in the system, is the largest. In this paper, we present a Monte Carlo algorithm that outputs, with high probability, a set of k candidates which approximates the top-k items. The algorithm is motivated by sensor networks in that it focuses on reducing the individual communication complexity. In contrast to previous algorithms, the communication complexity depends only on the global scores and not on the partition of scores among nodes. If the number of nodes is large, our algorithm dramatically reduces the communication complexity when compared with deterministic algorithms. We show that the complexity of our algorithm is close to a lower bound on the cell-probe complexity of any non-interactive top-k approximation algorithm. We show that for some natural global distributions (such as the Geometric or Zipf distributions), our algorithm needs only a polylogarithmic number of communication bits per node. An extended abstract of this paper appeared in Proc. 13th Int. Colloquium on Structural Information and Communication Complexity, SIROCCO 2006, Lecture Notes in Computer Science 4056, pp. 319–333.

8.
We study an important data analysis operator, which extracts the k most important groups from data (i.e., the k groups with the highest aggregate values). In a data warehousing context, an example of the above query is "find the 10 combinations of product-type and month with the largest sum of sales". The problem is challenging as the potential number of groups can be much larger than the memory capacity. We propose on-demand methods for efficient top-k groups processing under limited memory size. In particular, we design top-k groups retrieval techniques for three representative scenarios as follows. For the scenario with data physically ordered by measure, we propose the write-optimized multi-pass sorted access algorithm (WMSA), which exploits available memory for efficient top-k groups computation. Regarding the scenario with unordered data, we develop the recursive hash algorithm (RHA), which applies hashing with early aggregation, coupled with branch-and-bound techniques and derivation heuristics for tight score bounds of hash partitions. Next, we design the clustered groups algorithm (CGA), which accelerates top-k groups processing for the case where data is clustered by a subset of group-by attributes. Extensive experiments with real and synthetic datasets demonstrate the applicability and efficiency of the proposed algorithms.
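For intuition, the following minimal Python sketch answers a top-k groups query such as the product-type/month example with a single in-memory hash aggregation; it assumes all groups fit in memory, which is precisely the assumption the paper's limited-memory algorithms (WMSA, RHA, CGA) drop. The column names and row format are illustrative assumptions.

```python
from collections import defaultdict
import heapq

def topk_groups(rows, group_by, measure, k):
    # Aggregate the measure per group, then keep the k groups with the
    # largest totals. Memory use grows with the number of distinct groups.
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[c] for c in group_by)] += row[measure]
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

# Example: "find the 2 (product_type, month) combinations with the largest sum of sales"
# rows = [{"product_type": "toy",  "month": "Jan", "sales": 120.0},
#         {"product_type": "toy",  "month": "Jan", "sales": 80.0},
#         {"product_type": "book", "month": "Feb", "sales": 150.0}]
# topk_groups(rows, ("product_type", "month"), "sales", k=2)
```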

9.
Recently, due to the imprecise nature of the data generated from a variety of streaming applications, such as sensor networks, query processing on uncertain data streams has become an important problem. However, all the existing works on uncertain data streams study unbounded streams. In this paper, we take the first step towards the important and challenging problem of answering sliding-window queries on uncertain data streams, with a focus on one of the most important types of queries, top-k queries. It is nontrivial to find an efficient solution for answering sliding-window top-k queries on uncertain data streams, because challenges not only stem from the strict space and time requirements of processing both arriving and expiring tuples in high-speed streams, but also arise from the exponential blowup in the number of possible worlds induced by the uncertain data model. In this paper, we design a unified framework for processing sliding-window top-k queries on uncertain streams. We show that all the existing top-k definitions in the literature can be plugged into our framework, resulting in several succinct synopses that use space much smaller than the window size, while they are also highly efficient in terms of processing time. We also extend our framework to answering multiple top-k queries. In addition to the theoretical space and time bounds that we prove for these synopses, we present a thorough experimental report to verify their practical efficiency on both synthetic and real data.

10.
A co-location pattern is a set of spatial features whose instances frequently appear in a spatial neighborhood. This paper efficiently mines the top-k probabilistic prevalent co-locations over spatially uncertain data sets and makes the following contributions: 1) the concept of the top-k probabilistic prevalent co-locations based on a possible world model is defined; 2) a framework for discovering the top-k probabilistic prevalent co-locations is set up; 3) a matrix method is proposed to improve the computation of the prevalence probability of a top-k candidate, and two pruning rules of the matrix block are given to accelerate the search for exact solutions; 4) a polynomial matrix is developed to further speed up the top-k candidate refinement process; 5) an approximate algorithm with a compensation factor is introduced so that relatively large quantities of data can be processed quickly. The efficiency of our proposed algorithms as well as the accuracy of the approximation algorithms is evaluated with an extensive set of experiments using both synthetic and real uncertain data sets.

11.
12.
Why-not and why questions can be posed by database users to seek clarifications on unexpected query results. Specifically, why-not questions aim to explain why certain expected tuples are absent from the query results, while why questions try to clarify why certain unexpected tuples are present in the query results. This paper systematically explores why-not and why questions on reverse top-k queries, owing to their importance in multi-criteria decision making. We first formalize why-not questions on reverse top-k queries, which try to include the missing objects in the reverse top-k query results, and then we propose a unified framework called WQRTQ to answer why-not questions on reverse top-k queries. Our framework offers three solutions to cater for different application scenarios. Furthermore, we study why questions on reverse top-k queries, which aim to exclude the undesirable objects from the reverse top-k query results, and extend the framework WQRTQ to efficiently answer why questions on reverse top-k queries, which demonstrates the flexibility of our proposed algorithms. Extensive experimental evaluation with both real and synthetic data sets verifies the effectiveness and efficiency of the presented algorithms under various experimental settings.

13.
Aiming at the problem of top-k spatial join query processing in cloud computing systems, a Spark-based top-k spatial join (STKSJ) query processing algorithm is proposed. In this algorithm, the whole data space is divided into grid cells of the same size by a grid partitioning method, and each spatial object in one data set is projected into a grid cell. The Minimum Bounding Rectangle (MBR) of all spatial objects in each grid cell is computed. The spatial objects overlapping with these MBRs in another spatial data set are replicated to the corresponding grid cells, thereby filtering out spatial objects for which there are no join results, thus reducing the cost of subsequent spatial join processing. An improved plane sweeping algorithm is also proposed that speeds up the scanning mode and applies threshold filtering, thus greatly reducing the communication and computation costs of intermediate join results in subsequent top-k aggregation operations. Experimental results on synthetic and real data sets show that the proposed algorithm has clear advantages, and better performance than existing top-k spatial join query processing algorithms.

14.
Skyline queries are often used on data sets in multi-dimensional space for many decision-making applications. Traditionally, an object p is said to dominate another object q if, for all dimensions, it is no worse than q and is better on at least one dimension. Therefore, the skyline of a data set consists of all objects not dominated by any other object. To better cater to application requirements such as controlling the size of the skyline or handling data sets that are not well-structured, various works have been proposed to extend the definition of skyline based on variants of the dominance relationship. In view of the proliferation of variants, in this paper, a generalized framework is proposed to guide the extension of skyline query from the conventional definition to different variants. Our framework explicitly and carefully examines the various properties that should be preserved in a variant of the dominance relationship so that (1) the original advantages are maintained while adaptivity to application semantics is extended, and (2) the computational complexity remains almost unaffected. We prove that traditional dominance is the only relationship satisfying all desirable properties, and present some new dominance relationships by relaxing some of the properties. These relationships are general enough for us to design new top-k skyline queries that return robust results of a controllable size. We analyze the existing skyline algorithms based on their minimum requirements on dominance properties. We also extend our analysis to data sets with missing values, and present extensive experimental results on the combinations of new dominance relationships and skyline algorithms.

15.
Recently, uncertain data processing has become more and more important. Although a significant amount of previous research explores various continuous queries on data streams, continuous queries on uncertain data streams have seldom been investigated. In this paper, we formulate a novel and challenging problem of continuously monitoring top-k uncertain data streams, and propose a probabilistic threshold method. We develop four algorithms systematically: a deterministic exact algorithm, a randomized method, and their space-efficient versions using quantile summaries. An extensive empirical study using real data sets and synthetic data sets is reported to verify the effectiveness and the efficiency of our methods. We are grateful to the anonymous reviewers and Dr. Ihab F. Ilyas, the guest editor, for their constructive comments which help to improve the quality of the paper substantially. The research of Ming Hua and Jian Pei is supported in part by an NSERC Discovery grant and an NSERC Discovery Accelerator Supplements grant. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

16.
Continuous processing of top-k queries over data streams is a promising technique for alleviating the information overload problem as it distinguishes relevant from irrelevant data stream objects with respect to a given scoring function over time. Thus it enables filtering of irrelevant data objects and delivery of top-k objects relevant to user interests in real-time. We propose a solution for distributed continuous top-k processing based on the publish/subscribe communication paradigm: top-k publish/subscribe over sliding windows (top-k/w publish/subscribe). It identifies the k best-ranked objects with respect to a given scoring function over a sliding window of size w, and extends the publish/subscribe communication paradigm by continuous top-k processing algorithms coming from the field of data stream processing. In this paper, we introduce, analyze and evaluate the essential building blocks of distributed top-k/w publish/subscribe systems: first, we present a formal top-k/w publish/subscribe model and compare it to the prevailing Boolean publish/subscribe model. Next, we outline the top-k/w processing tasks performed by publish/subscribe nodes and investigate the properties of supported scoring functions. Furthermore, we explore potential routing strategies for distributed top-k/w publish/subscribe systems. Finally, we experimentally evaluate model properties and provide a comparative study investigating traffic requirements of potential routing strategies.
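As a naive point of reference (not the paper's synopsis-based top-k/w algorithms), the sketch below maintains a count-based sliding window of size w and recomputes the top-k after each arrival; the scoring interface and function names are assumptions for illustration.

```python
from collections import deque
import heapq

def topk_over_sliding_window(stream, score, w, k):
    # Hedged, naive continuous top-k/w monitoring: keep the last w objects
    # and recompute the top-k from scratch after every arrival. Real systems
    # maintain a small synopsis instead of rescanning the whole window.
    window = deque()
    for obj in stream:
        window.append(obj)
        if len(window) > w:
            window.popleft()              # expire the oldest object
        yield heapq.nlargest(k, window, key=score)

# Example:
# for topk in topk_over_sliding_window([5, 1, 9, 3, 7, 2], score=lambda x: x, w=3, k=2):
#     print(topk)   # after the last arrival the window is [3, 7, 2] and the top-2 is [7, 3]
```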

17.
In this paper, we define a new class of queries, the top-k multiple-type integrated query (simply, top-k MULTI query). It deals with multiple data types and finds the information in the order of relevance between the query and the object. Various data types such as spatial, textual, and relational data types can be used for the top-k MULTI query. The top-k MULTI query distinguishes itself from the traditional top-k query in that the component scores used to calculate final scores are determined depending on the query. Hence, each component score is calculated only when the query is given for each data type, rather than being calculated a priori as in the top-k query. As a representative instance, the traditional top-k spatial keyword query is an instance of the top-k MULTI query. It deals with the spatial data type and text data type and finds the information based on spatial proximity and textual relevance between the query and the object, which is determined only when the query is given. In this paper, we first define the top-k MULTI query formally and define a new specific instance for the top-k MULTI query, the top-k spatial-keyword-relational (SKR) query, by integrating the relational data type into the traditional top-k spatial keyword query. Then, we investigate the processing approaches for the top-k MULTI query. We discuss the scalability of those approaches as new data types are integrated. We also devise the processing methods for the top-k SKR query. Finally, through extensive experiments on the top-k SKR query using real and synthetic data sets, we compare the efficiency of the methods in terms of the query performance and storage.

18.
A top-k dominating query reports the k items with the highest domination score. Algorithms for efficient processing of this query have been recently proposed in the literature. Those methods, either index-based or index-free, apply a series of pruning criteria toward efficient processing. However, they are characterized by several limitations, such as (1) they lack progressiveness (they report the k best items at the end of the processing), (2) they require a multi-dimensional index or they build a grid-based index on-the-fly, which suffers from performance degradation, especially in high dimensionalities, and (3) they do not support vertically decomposed data. In this paper, we design efficient algorithms that can handle any subset of the dimensions in a progressive manner. Among the studied algorithms, the Differential Algorithm shows the best overall performance.

19.
Classical recommender systems provide users with a list of recommendations where each recommendation consists of a single item, e.g., a book or DVD. However, several applications can benefit from a system capable of recommending packages of items, in the form of sets. Sample applications include travel planning with a limited budget (price or time) and Twitter users wanting to select worthwhile tweeters to follow, given that they can deal with only a bounded number of tweets. In these contexts, there is a need for a system that can recommend the top-k packages for the user to choose from. Motivated by these applications, we consider composite recommendations, where each recommendation comprises a set of items. Each item has both a value (rating) and a cost associated with it, and the user specifies a maximum total cost (budget) for any recommended set of items. Our composite recommender system has access to one or more component recommender systems focusing on different domains, as well as to information sources which can provide the cost associated with each item. Because the problem of deciding whether there is a recommendation (package) whose cost is under a given budget and whose value exceeds some threshold is NP-complete, we devise several approximation algorithms for generating the top-k packages as recommendations. We analyze the efficiency as well as approximation quality of these algorithms. Finally, using two real and two synthetic datasets, we subject our algorithms to thorough experimentation and empirical analysis. Our findings attest to the efficiency and quality of our approximation algorithms for the top-k packages compared to exact algorithms.
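To make the package model concrete, here is a hedged brute-force sketch that enumerates small packages within a budget and reports the k most valuable ones. The (name, value, cost) item format and the max_size cap are illustrative assumptions; the paper instead develops approximation algorithms, since the underlying decision problem is NP-complete and exhaustive enumeration does not scale.

```python
from itertools import combinations
import heapq

def topk_packages(items, budget, k, max_size=3):
    # items: list of (name, value, cost) triples. A package is any subset of
    # at most max_size items whose total cost stays within the budget; return
    # the k packages with the highest total value.
    packages = []
    for r in range(1, max_size + 1):
        for combo in combinations(items, r):
            cost = sum(c for _, _, c in combo)
            if cost <= budget:
                value = sum(v for _, v, _ in combo)
                packages.append((value, cost, tuple(name for name, _, _ in combo)))
    return heapq.nlargest(k, packages)

# Example:
# topk_packages([("tour", 8, 40), ("museum", 5, 15), ("show", 7, 30)], budget=50, k=2)
# returns [(12, 45, ("museum", "show")), (8, 40, ("tour",))].
```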

20.
Despite advances in machine learning technologies, a schema matching result between two database schemas (e.g., those derived from COMA++) is likely to be imprecise. In particular, numerous instances of "possible mappings" between the schemas may be derived from the matching result. In this paper, we study problems related to managing possible mappings between two heterogeneous XML schemas. First, we study how to efficiently generate possible mappings for a given schema matching task. While this problem can be solved by existing algorithms, we show how to improve the performance of the solution by using a divide-and-conquer approach. Second, storing and querying a large set of possible mappings can incur large storage and evaluation overhead. For XML schemas, we observe that their possible mappings often exhibit a high degree of overlap. We hence propose a novel data structure, called the block tree, to capture the commonalities among possible mappings. The block tree is useful for representing the possible mappings in a compact manner and can be efficiently generated. Moreover, it facilitates the evaluation of a probabilistic twig query (PTQ), which returns the non-zero probability that a fragment of an XML document matches a given query. For users who are interested only in answers with k-highest probabilities, we also propose the top-k PTQ and present an efficient solution for it. An extensive evaluation on real-world data sets shows that our approaches significantly improve the efficiency of generating, storing, and querying possible mappings.
