首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 23 毫秒
1.
The database community has devoted extensive amount of efforts to indexing and querying temporal data in the past decades. However, insufficient amount of attention has been paid to temporal ranking queries. More precisely, given any time instance t, the query asks for the top-k objects at time t with respect to some score attribute. Some generic indexing structures based on R-trees do support ranking queries on temporal data, but as they are not tailored for such queries, the performance is far from satisfactory. We present the Seb-tree, a simple indexing scheme that supports temporal ranking queries much more efficiently. The Seb-tree answers a top-k query for any time instance t in the optimal number of I/Os in expectation, namely, O(logB \fracNB+\frackB){O\left({\rm log}_B\,\frac{N}{B}+\frac{k}{B}\right)} I/Os, where N is the size of the data set and B is the disk block size. The index has near-linear size (for constant and reasonable k max values, where k max is the maximum value for the possible values of the query parameter k), can be constructed in near-linear time, and also supports insertions and deletions without affecting its query performance guarantee. Most of all, the Seb-tree is especially appealing in practice due to its simplicity as it uses the B-tree as the only building block. Extensive experiments on a number of large data sets, show that the Seb-tree is more than an order of magnitude faster than the R-tree based indexes for temporal ranking queries.  相似文献   

2.
Finding typical instances is an effective approach to understand and analyze large data sets. In this paper, we apply the idea of typicality analysis from psychology and cognitive science to database query answering, and study the novel problem of answering top-k typicality queries. We model typicality in large data sets systematically. Three types of top-k typicality queries are formulated. To answer questions like “Who are the top-k most typical NBA players?”, the measure of simple typicality is developed. To answer questions like “Who are the top-k most typical guards distinguishing guards from other players?”, the notion of discriminative typicality is proposed. Moreover, to answer questions like “Who are the best k typical guards in whole representing different types of guards?”, the notion of representative typicality is used. Computing the exact answer to a top-k typicality query requires quadratic time which is often too costly for online query answering on large databases. We develop a series of approximation methods for various situations: (1) the randomized tournament algorithm has linear complexity though it does not provide a theoretical guarantee on the quality of the answers; (2) the direct local typicality approximation using VP-trees provides an approximation quality guarantee; (3) a local typicality tree data structure can be exploited to index a large set of objects. Then, typicality queries can be answered efficiently with quality guarantees by a tournament method based on a Local Typicality Tree. An extensive performance study using two real data sets and a series of synthetic data sets clearly shows that top-k typicality queries are meaningful and our methods are practical.  相似文献   

3.
The core logic of web applications that suggest some particular service, such as online shopping, e-commerce etc., is typically captured by Business Processes (BPs). Among all the (maybe infinitely many) possible execution flows of a BP, analysts are often interested in identifying flows that are “most important”, according to some weight metric. The goal of the present paper is to provide efficient algorithms for top-k query evaluation over the possible executions of Business Processes, under some given weight function. Unique difficulties in top-k analysis in this settings stem from (1) the fact that the number of possible execution flows of a given BP is typically very large, or even infinite in presence of recursion and (2) that the weights (e.g., likelihood, monetary cost, etc.) induced by actions performed during the execution (e.g., product purchase) may be inter-dependent (due to probabilistic dependencies, combined discount deals etc.). We exemplify these difficulties, and overcome them to provide efficient algorithms for query evaluation where possible. We also describe in details an application prototype that we have developed for recommending optimal navigation in an online shopping web site that is based on our model and algorithms.  相似文献   

4.
Multi-dimensional top-k dominating queries   总被引:1,自引:0,他引:1  
The top-k dominating query returns k data objects which dominate the highest number of objects in a dataset. This query is an important tool for decision support since it provides data analysts an intuitive way for finding significant objects. In addition, it combines the advantages of top-k and skyline queries without sharing their disadvantages: (i) the output size can be controlled, (ii) no ranking functions need to be specified by users, and (iii) the result is independent of the scales at different dimensions. Despite their importance, top-k dominating queries have not received adequate attention from the research community. This paper is an extensive study on the evaluation of top-k dominating queries. First, we propose a set of algorithms that apply on indexed multi-dimensional data. Second, we investigate query evaluation on data that are not indexed. Finally, we study a relaxed variant of the query which considers dominance in dimensional subspaces. Experiments using synthetic and real datasets demonstrate that our algorithms significantly outperform a previous skyline-based approach. We also illustrate the applicability of this multi-dimensional analysis query by studying the meaningfulness of its results on real data.  相似文献   

5.
Recently, due to intrinsic characteristics in many underlying data sets, a number of probabilistic queries on uncertain data have been investigated. Top-k dominating queries are very important in many applications including decision making in a multidimensional space. In this paper, we study the problem of efficiently computing top-k dominating queries on uncertain data. We first formally define the problem. Then, we develop an efficient, threshold-based algorithm to compute the exact solution. To overcome some inherent computational deficiency in an exact computation, we develop an efficient randomized algorithm with an accuracy guarantee. Our extensive experiments demonstrate that both algorithms are quite efficient, while the randomized algorithm is quite scalable against data set sizes, object areas, k values, etc. The randomized algorithm is also highly accurate in practice.  相似文献   

6.
Searching XML data using keyword queries has attracted much attention because it enables Web users to easily access XML data without having to learn a structured query language or study possibly complex data schemas. Most of the current approaches identify the meaningful results of a given keyword query based on the semantics of lowest common ancestor (LCA) and its variants. However, given the fact that LCA candidates are usually numerous and of low relevance to the users?? information need, how to effectively and efficiently identify the most relevant results from a large number of LCA candidates is still a challenging and unresolved issue. In this article, we introduce a novel semantics of relevant results based on mutual information between the query keywords. Then, we introduce a novel approach for identifying the relevant answers of a given query by adopting skyline semantics. We also recommend three different ranking criteria for selecting the top-k relevant results of the query. Efficient algorithms are proposed which rely on some provable properties of the dominance relationship between result candidates to rapidly identify the top-k dominant results. Extensive experiments were conducted to evaluate our approach and the results show that the proposed approach has a good performance compared with other existing approaches in different data sets and evaluation metrics  相似文献   

7.
A top-k dominating query reports the k items with the highest domination score. Algorithms for efficient processing of this query have been recently proposed in the literature. Those methods, either index based or index free, apply a series of pruning criteria toward efficient processing. However, they are characterized by several limitations, such as (1) they lack progressiveness (they report the k best items at the end of the processing), (2) they require a multi-dimensional index or they build a grid-based index on-the-fly, which suffers from performance degradation, especially in high dimensionalities, and (3) they do not support vertically decomposed data. In this paper, we design efficient algorithms that can handle any subset of the dimensions in a progressive manner. Among the studied algorithms, the Differential Algorithm shows the best overall performance.  相似文献   

8.
Due to the recent massive data generation, preference queries are becoming an increasingly important for users because such queries retrieve only a small number of preferable data objects from a huge multi-dimensional dataset. A top-k dominating query, which retrieves the k data objects dominating the highest number of data objects in a given dataset, is particularly important in supporting multi-criteria decision making because this query can find interesting data objects in an intuitive way exploiting the advantages of top-k and skyline queries. Although efficient algorithms for top-k dominating queries have been studied over centralized databases, there are no studies which deal with top-k dominating queries in distributed environments. The recent data management is becoming increasingly distributed, so it is necessary to support processing of top-k dominating queries in distributed environments. In this paper, we address, for the first time, the challenging problem of processing top-k dominating queries in distributed networks and propose a method for efficient top-k dominating data retrieval, which avoids redundant communication cost and latency. Furthermore, we also propose an approximate version of our proposed method, which further reduces communication cost. Extensive experiments on both synthetic and real data have demonstrated the efficiency and effectiveness of our proposed methods.  相似文献   

9.
Supporting ranking queries on uncertain and incomplete data   总被引:1,自引:0,他引:1  
Large databases with uncertain information are becoming more common in many applications including data integration, location tracking, and Web search. In these applications, ranking records with uncertain attributes introduces new problems that are fundamentally different from conventional ranking. Specifically, uncertainty in records’ scores induces a partial order over records, as opposed to the total order that is assumed in the conventional ranking settings. In this paper, we present a new probabilistic model, based on partial orders, to encapsulate the space of possible rankings originating from score uncertainty. Under this model, we formulate several ranking query types with different semantics. We describe and analyze a set of efficient query evaluation algorithms. We show that our techniques can be used to solve the problem of rank aggregation in partial orders under two widely adopted distance metrics. In addition, we design sampling techniques based on Markov chains to compute approximate query answers. Our experimental evaluation uses both real and synthetic data. The experimental study demonstrates the efficiency and effectiveness of our techniques under various configurations.  相似文献   

10.
Due to its great benefits over many database applications, skyline queries have received formidable concern in the last decades. Skyline queries attempt to assist users by identifying the set of data items which represents the best results that meet the conditions of a given query. Most of the existing skyline techniques concentrate on identifying skylines over a single relation. However, in distributed databases, the process of skyline queries required accessing multiple relations which might be located at different sites. Consequently, data items from these multiple relations need to be joined and thus transferring these data items from one site to another is unavoidable. Moreover, the previous techniques also assume that the values of dimensions for every data item are presented (complete) which is not always true as some values may be missing. In this paper, we proposed an approach for processing skyline queries in incomplete distributed databases. The approach derives skylines from multiple relations where dominated data items are removed before joining the relations to reduce the processing time and the network cost. The experimental results illustrate that our proposed approach outperforms the previous approaches in terms of processing time and network cost.  相似文献   

11.
杨志君  叶东毅 《计算机应用》2010,30(5):1280-1283
非负矩阵分解(NMF)作为一种特征提取与数据降维的新方法,相较于一些传统算法,具有实现上的简便性,分解形式和分解结果上的可解释性等优点。但当样本矩阵不完备时,NMF无法对其进行直接分解。提出一种基于加权的不完备非负矩阵分解(NMFI)算法,该算法在处理不完备样本矩阵时,先采用随机修复的方法降低误差,再利用加权来控制各样本的权重,尽量削弱缺损数据对分解结果产生的干扰。此外,NMFI算法使用区域权重来进一步减少关键区域数据缺损对分解产生的影响。实验结果表明,NMFI算法能有效提取样本中残余数据的信息,减少缺损数据对分解结果的影响。  相似文献   

12.
Reducing network traffic in unstructured P2P systems using Top-k queries   总被引:1,自引:0,他引:1  
A major problem of unstructured P2P systems is their heavy network traffic. This is caused mainly by high numbers of query answers, many of which are irrelevant for users. One solution to this problem is to use Top-k queries whereby the user can specify a limited number (k) of the most relevant answers. In this paper, we present FD, a (Fully Distributed) framework for executing Top-k queries in unstructured P2P systems, with the objective of reducing network traffic. FD consists of a family of algorithms that are simple but effective. FD is completely distributed, does not depend on the existence of certain peers, and addresses the volatility of peers during query execution. We validated FD through implementation over a 64-node cluster and simulation using the BRITE topology generator and SimJava. Our performance evaluation shows that FD can achieve major performance gains in terms of communication and response time. Recommended by: Sunil Prabhakar Work partially funded by the ARA Massive Data of the Agence Nationale de la Recherche.  相似文献   

13.
大数据处理分析算法在优化研究过程中,速度常常受限于数据集的规模。在数据集体量不足时,算法的通信时间往往要高于真正的计算时间,无法验证真实的效果。故设计实现了一个大数据集生成器,为运行在超级计算机上的并行大数据处理分析算法提供基准测试数据集。首先,使用MPI并行编程技术构造了一个并行随机数生成器,在此基础上设计实现了可控制规模及复杂性的人工数据集,主要包括:分类和聚类数据集、回归数据集、流形学习数据集和因子分解数据集等。其次,设计了大数据集生成器的I/O系统,提供MPI-I/O并行读、写数据集的接口,并设置了数据集在不同进程间的分发、映射规则,通过点对点通信实现不同节点之间的数据交互。实验结果表明,并行大数据集生成器有效提高了数据生成效率和生成规模,为并行大数据处理分析算法提供了高质量、大体量的测试数据集。  相似文献   

14.
乔颖  王士同  杭文龙 《控制与决策》2017,32(6):1075-1083
受Kuramoto模型启发,构造一种新的万有引力同步模型,用以解决现有同步聚类算法时间复杂度高的问题,并提出大规模数据集的引力同步聚类算法(LSCGS).首先,使用快速压缩集密度估计(RSDE)算法对大规模数据集进行压缩;然后,通过万有引力同步聚类算法对压缩数据集进行聚类,使用Davies-Bouldin指标自动寻优到最佳聚类数;最后,利用提出的剩余样本聚类(RSC)算法对除压缩集以外的剩余数据进行聚类,可以有效地区分孤立类以及噪声点.通过在大规模人造数据集、UCI真实数据集和图像数据上的实验,验证LSCGS算法的有效性,与传统同步聚类算法相比,聚类的运算成本得到大幅度的降低.  相似文献   

15.
The skyline operator has been extensively explored in the literature, and most of the existing approaches assume that all dimensions are available for all data items. However, many practical applications such as sensor networks, decision making, and location-based services, may involve incomplete data items, i.e., some dimensional values are missing, due to the device failure or the privacy preservation. This paper is the first, to our knowledge, study of k-skyband (kSB) query processing on incomplete data, where multi-dimensional data items are missing some values of their dimensions. We formalize the problem, and then present two efficient algorithms for processing it. Our methods introduce some novel concepts including expired skyline, shadow skyline, and thickness warehouse, in order to boost the search performance. As a second step, we extend our techniques to tackle constrained skyline (CS) and group-by skyline (GBS) queries over incomplete data. Extensive experiments with both real and synthetic data sets demonstrate the effectiveness and efficiency of our proposed algorithms under various experimental settings.  相似文献   

16.
Optimizing large join queries using a graph-based approach   总被引:4,自引:0,他引:4  
Although many query tree optimization strategies have been proposed in the literature, there still is a lack of a formal and complete representation of all possible permutations of query operations (i.e., execution plans) in a uniform manner. A graph-theoretic approach presented in the paper provides a sound mathematical basis for representing a query and searching for an execution plan. In this graph model, a node represents an operation and a directed edge between two nodes indicates the older of executing these two operations in an execution plan. Each node is associated with a weight and so is an edge. The weight is an expression containing optimization required parameters, such as relation size, tuple size, join selectivity factors. All possible execution plans are representable in this graph and each spanning tree of the graph becomes an execution plan. It is a general model which can be used in the optimizer of a DBMS for internal query representation. On the basis of this model, we devise an algorithm that finds a near optimal execution plan using only polynomial time. The algorithm is compared with a few other popular optimization methods. Experiments show that the proposed algorithm is superior to the others under most circumstances  相似文献   

17.
18.
《Computer Communications》2007,30(14-15):2880-2891
Hierarchical routing techniques have long been known to increase network scalability by constructing a virtual backbone. Even though MANETs have no physical backbone, a virtual backbone can be constructed by finding a connected dominating set (CDS) in the network graph. Many centralized as well as distributed algorithms have been designed to find a CDS in a graph (network). Theoretically, any centralized algorithm can be implemented in a distributed fashion, with the tradeoff of higher protocol overhead. Because centralized approaches do not scale well and because distributed approaches are more practical especially in MANETs, we propose a fast distributed connected dominating set (FDDS) construction in MANETs. FDDS has message and time complexity of O(n) and O(Δ2), where n is the number of nodes in the network and Δ is the maximum node degree. According to our knowledge, FDDS achieves the best message and time complexity combinations among the previously suggested approaches. Moreover, FDDS constructs a reliable virtual backbone that takes into account (1) node’s limited energy, (2) node’s mobility, and (3) node’s traffic pattern. Our simulation study shows that FDDS achieves a very low network stretch. Also, when the network size is large, FDDS constructs a backbone with size smaller than other well known schemes found in the literature.  相似文献   

19.
In this paper, we study a variant of reachability queries, called label-constraint reachability (LCR) queries. Specifically, given a label set S and two vertices u1 and u2 in a large directed graph G, we check the existence of a directed path from u1 to u2, where edge labels along the path are a subset of S. We propose the path-label transitive closure method to answer LCR queries. Specifically, we t4ransform an edge-labeled directed graph into an augmented DAG by replacing the maximal strongly connected components as bipartite graphs. We also propose a Dijkstra-like algorithm to compute path-label transitive closure by re-defining the “distance” of a path. Comparing with the existing solutions, we prove that our method is optimal in terms of the search space. Furthermore, we propose a simple yet effective partition-based framework (local path-label transitive closure+online traversal) to answer LCR queries in large graphs. We prove that finding the optimal graph partition to minimize query processing cost is a NP-hard problem. Therefore, we propose a sampling-based solution to find the sub-optimal partition. Moreover, we address the index maintenance issues to answer LCR queries over the dynamic graphs. Extensive experiments confirm the superiority of our method.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号