Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
Approximate query processing using wavelets   Total citations: 7 (self-citations: 0; citations by others: 7)
Approximate query processing has emerged as a cost-effective approach for dealing with the huge data volumes and stringent response-time requirements of today's decision support systems (DSS). Most work in this area, however, has so far been limited in its query processing scope, typically focusing on specific forms of aggregate queries. Furthermore, conventional approaches based on sampling or histograms appear to be inherently limited when it comes to approximating the results of complex queries over high-dimensional DSS data sets. In this paper, we propose the use of multi-dimensional wavelets as an effective tool for general-purpose approximate query processing in modern, high-dimensional applications. Our approach is based on building wavelet-coefficient synopses of the data and using these synopses to provide approximate answers to queries. We develop novel query processing algorithms that operate directly on the wavelet-coefficient synopses of relational tables, allowing us to process arbitrarily complex queries entirely in the wavelet-coefficient domain. This guarantees extremely fast response times since our approximate query execution engine can do the bulk of its processing over compact sets of wavelet coefficients, essentially postponing the expansion into relational tuples until the end-result of the query. We also propose a novel wavelet decomposition algorithm that can build these synopses in an I/O-efficient manner. Finally, we conduct an extensive experimental study with synthetic as well as real-life data sets to determine the effectiveness of our wavelet-based approach compared to sampling and histograms. Our results demonstrate that our techniques: (1) provide approximate answers of better quality than either sampling or histograms; (2) offer query execution-time speedups of more than two orders of magnitude; and (3) guarantee extremely fast synopsis construction times that scale linearly with the size of the data. 
Received: 7 August 2000 / Accepted: 1 April 2001 / Published online: 7 June 2001
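As a rough illustration of the synopsis idea in entry 1, the sketch below uses a one-dimensional, unnormalized Haar decomposition and keeps only the largest coefficients. The paper itself works with multi-dimensional wavelets over relational tables; the function names and the keep-largest-magnitude rule here are assumptions made for illustration, not the authors' algorithm.

```python
def haar_decompose(values):
    """Unnormalized Haar transform: [overall average, detail coefficients...].

    Assumes len(values) is a power of two.
    """
    data = list(values)
    details = []
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        diffs = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details = diffs + details  # coarser levels end up in front
        data = averages
    return data + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose, expanding one resolution level at a time."""
    data = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(data)]
        data = [x for avg, d in zip(data, level) for x in (avg + d, avg - d)]
        pos += len(level)
    return data


def synopsis(coeffs, keep):
    """Zero all but the `keep` largest-magnitude coefficients (illustrative rule)."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(ranked[:keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]
```

For the input [2, 2, 0, 2, 3, 5, 4, 4], a four-coefficient synopsis reconstructs to [1.5, 1.5, 0.5, 2.5, 3, 5, 4, 4]: individual values are approximate, but the total (22) is preserved because the overall average is among the retained coefficients.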

2.
Our aim is to develop new database technologies for the approximate matching of unstructured string data using indexes. We explore the potential of the suffix tree data structure in this context. We present a new method of building suffix trees, allowing us to build trees in excess of RAM size, which has hitherto not been possible. We show that this method performs in practice as well as the O(n) method of Ukkonen [70]. Using this method we build indexes for 200 Mb of protein and 300 Mbp of DNA, whose disk-image exceeds the available RAM. We show experimentally that suffix trees can be effectively used in approximate string matching with biological data. For a range of query lengths and error bounds the suffix tree reduces the size of the unoptimised O(mn) dynamic programming calculation required in the evaluation of string similarity, and the gain from indexing increases with index size. In the indexes we built this reduction is significant, and less than 0.3% of the expected matrix is evaluated. We detail the requirements for further database and algorithmic research to support efficient use of large suffix indexes in biological applications. Received: November 1, 2001 / Accepted: March 2, 2002 / Published online: September 25, 2002
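The unoptimised O(mn) dynamic programming calculation mentioned in entry 2 can be sketched as follows. This is the textbook column-by-column formulation of approximate matching under unit-cost edit distance, shown only as the baseline that an index would prune; the function name and parameters are illustrative and do not come from the paper.

```python
def approximate_matches(pattern, text, max_errors):
    """Return 1-based end positions j such that some substring of `text`
    ending at j is within `max_errors` edits of `pattern`.

    Evaluates the full (m+1) x (n+1) matrix one column at a time:
    O(m*n) time, O(m) space.
    """
    m = len(pattern)
    prev = list(range(m + 1))  # column for the empty text prefix
    hits = []
    for j, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)  # top row is 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra character in text
                         cur[i - 1] + 1)      # pattern character missing
        if cur[m] <= max_errors:
            hits.append(j)
        prev = cur
    return hits
```

For example, `approximate_matches("ACGT", "TTACGTTT", 1)` reports end positions 5, 6 and 7: the exact occurrence ends at position 6, and its neighbours match with one edit.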

3.
In this paper we propose algorithms for generation of frequent item sets by successive construction of the nodes of a lexicographic tree of item sets. We discuss different strategies in generation and traversal of the lexicographic tree such as breadth-first search, depth-first search, or a combination of the two. These techniques provide different trade-offs in terms of the I/O, memory, and computational time requirements. We use the hierarchical structure of the lexicographic tree to successively project transactions at each node of the lexicographic tree and use matrix counting on this reduced set of transactions for finding frequent item sets. We tested our algorithm on both real and synthetic data. We provide an implementation of the tree projection method which is up to one order of magnitude faster than other recent techniques in the literature. The algorithm has a well-structured data access pattern which provides data locality and reuse of data for multiple levels of the cache. We also discuss methods for parallelization of the TreeProjection algorithm.
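The idea of walking a lexicographic tree while projecting transactions, as described in entry 3, can be illustrated with a small depth-first sketch. It is a simplification under stated assumptions: it counts support by scanning the projected transactions rather than using the matrix counting the abstract mentions, and all names are hypothetical.

```python
def frequent_itemsets(transactions, min_support):
    """Enumerate frequent item sets depth-first in lexicographic order.

    Each tree node passes down only the transactions that contain its
    item set (a simple form of projection), so deeper nodes scan less data.
    """
    result = {}

    def extend(prefix, projected, candidates):
        for idx, item in enumerate(candidates):
            covered = [t for t in projected if item in t]
            if len(covered) >= min_support:
                itemset = prefix + (item,)
                result[itemset] = len(covered)
                # Only items sorting after `item` may extend this node.
                extend(itemset, covered, candidates[idx + 1:])

    items = sorted({item for t in transactions for item in t})
    extend((), [frozenset(t) for t in transactions], items)
    return result
```

With the transactions {a, b, c}, {a, b}, {a, c}, {b, c} and a minimum support of 2, the sketch returns the three single items (support 3) and the three pairs (support 2); the triple {a, b, c} occurs only once and is pruned.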

4.
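This paper proposes DFC_DCMT, a distributed delay-constrained multicast routing algorithm with fair cost allocation. While optimizing the tree cost, the algorithm computes a point-to-multipoint multicast routing tree that satisfies the delay constraint and in which the destination nodes share the network cost fairly. The paper also gives an approximate algorithm that reduces the amount of information exchanged between nodes while, in the general case, still keeping the network cost fairly shared among the destination nodes.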

5.
We are interested in defining and querying views in a huge and highly heterogeneous XML repository (Web scale). In this context, view definitions are very large, involving lots of sources, and there is no apparent limitation to their size. This raises interesting problems that we address in the paper: (i) how to distribute views over several machines without having a negative impact on the query translation process; (ii) how to quickly select the relevant part of a view given a query; (iii) how to minimize the cost of communicating potentially large queries to the machines where they will be evaluated. The solution that we propose is based on a simple view definition language that allows for automatic generation of views. The language maps paths in the view abstract DTD to paths in the concrete source DTDs. It enables a distributed implementation of the view system that is scalable both in terms of data and load. In particular, the query translation algorithm is shown to have a good (linear) complexity. Received: November 1, 2001 / Accepted: March 2, 2002 / Published online: September 25, 2002

6.
Geometric groundtruth at the character, word, and line levels is crucial for designing and evaluating optical character recognition (OCR) algorithms. Kanungo and Haralick proposed a closed-loop methodology for generating geometric groundtruth for rescanned document images. The procedure assumed that the original image and the corresponding groundtruth were available. It automatically registered the original image to the rescanned one using four corner points and then transformed the original groundtruth using the estimated registration transformation. In this paper, we present an attributed branch-and-bound algorithm for establishing the point correspondence that uses all the data points. We group the original feature points into blobs and use corners of blobs for matching. The Euclidean distance between character centroids is used as the error metric. We conducted experiments on synthetic point sets with varying layout complexity to characterize the performance of two matching algorithms. We also report results on experiments conducted using the University of Washington dataset. Finally, we show examples of application of this methodology for generating groundtruth for microfilmed and FAXed versions of the University of Washington dataset documents. Received: July 24, 2001 / Accepted: May 20, 2002

7.
An efficient distributed algorithm for constructing small dominating sets   Total citations: 1 (self-citations: 0; citations by others: 1)
The dominating set problem asks for a small subset D of nodes in a graph such that every node is either in D or adjacent to a node in D. This problem arises in a number of distributed network applications, where it is important to locate a small number of centers in the network such that every node is near at least one center. Finding a dominating set of minimum size is NP-complete, and the best known approximation is logarithmic in the maximum degree of the graph and is provided by the same simple greedy approach that gives the well-known logarithmic approximation result for the closely related set cover problem. We describe and analyze new randomized distributed algorithms for the dominating set problem that run in polylogarithmic time, independent of the diameter of the network, and that return a dominating set of size within a logarithmic factor from optimal, with high probability. In particular, our best algorithm runs in […] rounds with high probability, where n is the number of nodes, […] is one plus the maximum degree of any node, and each round involves a constant number of message exchanges among any two neighbors; the size of the dominating set obtained is within […] of the optimal in expectation and within […] of the optimal with high probability. We also describe generalizations to the weighted case and the case of multiple covering requirements. Received: January 2002 / Accepted: August 2002. Supported by NSF CAREER award NSF CCR-9983901.
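The "simple greedy approach" that entry 7 cites as the sequential baseline can be sketched as below. This is the centralized greedy heuristic, not the randomized distributed algorithm the paper contributes; the adjacency-dictionary input format and the tie-breaking by sorted node order are assumptions made for the example.

```python
def greedy_dominating_set(adjacency):
    """Centralized greedy heuristic for a small dominating set.

    `adjacency` maps each node to an iterable of its neighbours
    (undirected graph; nodes must be sortable for tie-breaking).
    """
    # Closed neighbourhood: the node itself plus its neighbours.
    closed = {v: set(nbrs) | {v} for v, nbrs in adjacency.items()}
    uncovered = set(closed)
    chosen = []
    while uncovered:
        # Pick the node that dominates the most still-uncovered nodes.
        best = max(sorted(closed), key=lambda v: len(closed[v] & uncovered))
        chosen.append(best)
        uncovered -= closed[best]
    return chosen
```

On a star graph the sketch picks only the center; on the six-node path 0-1-2-3-4-5 it picks nodes 1 and 4, which is optimal for that graph.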

8.
Speeding up construction of PMR quadtree-based spatial indexes   Total citations: 5 (self-citations: 0; citations by others: 5)
Spatial indexes, such as those based on the quadtree, are important in spatial databases for efficient execution of queries involving spatial constraints, especially when the queries involve spatial joins. In this paper we present a number of techniques for speeding up the construction of quadtree-based spatial indexes, specifically the PMR quadtree, which can index arbitrary spatial data. We assume a quadtree implementation using the “linear quadtree”, a disk-resident representation that stores objects contained in the leaf nodes of the quadtree in a linear index (e.g., a B-tree) ordered based on a space-filling curve. We present two complementary techniques: an improved insertion algorithm and a bulk-loading method. The bulk-loading method can be extended to handle bulk-insertions into an existing PMR quadtree. We make some analytical observations about the I/O cost and CPU cost of our PMR quadtree bulk-loading algorithm, and conduct an extensive empirical study of the techniques presented in the paper. Our techniques are found to yield significant speedup compared to traditional quadtree building methods, even when the size of a main memory buffer is very small compared to the size of the resulting quadtrees. Edited by R. Sacks-Davis. Received: July 10, 2001 / Accepted: March 25, 2002 / Published online: September 25, 2002
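To make the "linear quadtree" ordering in entry 8 concrete, the sketch below computes a Z-order (Morton) key by interleaving coordinate bits. The abstract only says "a space-filling curve", so the choice of Z-order is an assumption, and both function names are hypothetical.

```python
def morton_key(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return key


def block_key_range(x0, y0, size):
    """Key range covered by a quadtree block.

    Assumes `size` is a power of two and (x0, y0) is aligned to it, in
    which case the block's cells occupy one contiguous run of keys.
    """
    first = morton_key(x0, y0)
    return first, first + size * size - 1
```

Because an aligned block maps to a contiguous key range (for instance, the 2 x 2 block at (2, 0) covers keys 4 to 7), all objects in one quadtree leaf can be fetched with a single range scan of an ordered index such as a B-tree.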

9.
Approximate similarity retrieval with M-trees   Total citations: 4 (self-citations: 0; citations by others: 4)
Motivated by the urgent need to improve the efficiency of similarity queries, approximate similarity retrieval is investigated in the environment of a metric tree index called the M-tree. Three different approximation techniques are proposed, which show how to forsake query precision for improved performance. Measures are defined that can quantify the improvements in performance efficiency and the quality of approximations. The proposed approximation techniques are then tested on various synthetic and real-life files. The evidence obtained from the experiments confirms our hypothesis that a high-quality approximated similarity search can be performed at a much lower cost than that needed to obtain the exact results. The proposed approximation techniques are scalable and appear to be independent of the metric used. Extensions of these techniques to the environments of other similarity search indexes are also discussed. Received July 7, 1998 / Accepted October 13, 1998

10.
We consider the problem of distributed deterministic broadcasting in radio networks of unknown topology and size. The network is synchronous. If a node u can be reached from two nodes which send messages in the same round, none of the messages is received by u. Such messages block each other and node u either hears the noise of interference of messages, enabling it to detect a collision, or does not hear anything at all, depending on the model. We assume that nodes know neither the topology nor the size of the network, nor even their immediate neighborhood. The initial knowledge of every node is limited to its own label. Such networks are called ad hoc multi-hop networks. We study the time of deterministic broadcasting under this scenario. For the model without collision detection, we develop a linear-time broadcasting algorithm for symmetric graphs, which is optimal, and an algorithm for arbitrary n-node graphs, working in time […]. Next we show that broadcasting with acknowledgement is not possible in this model at all. For the model with collision detection, we develop efficient algorithms for broadcasting and for acknowledged broadcasting in strongly connected graphs. Received: January 2000 / Accepted: June 2001

11.
The performance of several methods for estimating local surface geometry (the principal frame plus the principal quadric) is examined by applying them to a suite of synthetic and real test data which have been corrupted by various amounts of additive Gaussian noise. Methods considered include finite differences, a facet-based approach, and quadric surface fitting. The nonlinear quadric fitting method considered was found to perform the best but has the greatest computational cost. The facet-based approach works as well as the other quadric fitting methods and has a much smaller computational cost. Hence it is the recommended method to use in practice.

12.
Association Rule Mining algorithms operate on a data matrix (e.g., customers × products) to derive association rules [AIS93b, SA96]. We propose a new paradigm, namely, Ratio Rules, which are quantifiable in that we can measure the “goodness” of a set of discovered rules. We also propose the “guessing error” as a measure of the “goodness”, that is, the root-mean-square error of the reconstructed values of the cells of the given matrix, when we pretend that they are unknown. Another contribution is a novel method to guess missing/hidden values from the Ratio Rules that our method derives. For example, if somebody bought $10 of milk and $3 of bread, our rules can “guess” the amount spent on butter. Thus, unlike association rules, Ratio Rules can perform a variety of important tasks such as forecasting, answering “what-if” scenarios, detecting outliers, and visualizing the data. Moreover, we show that we can compute Ratio Rules in a single pass over the data set with small memory requirements (a few small matrices), in contrast to association rule mining methods which require multiple passes and/or large memory. Experiments on several real data sets (e.g., basketball and baseball statistics, biological data) demonstrate that the proposed method: (a) leads to rules that make sense; (b) can find large itemsets in binary matrices, even in the presence of noise; and (c) consistently achieves a “guessing error” of up to 5 times less than using straightforward column averages. Received: March 15, 1999 / Accepted: November 1, 1999

13.
We consider the problem of scheduling a set of pages on a single broadcast channel using time-multiplexing. In a perfectly periodic schedule, time is divided into equal size slots, and each page is transmitted in a time slot precisely every fixed interval of time (the period of the page). We study the case in which each page i has a given demand probability […], and the goal is to design a perfectly periodic schedule that minimizes the average time a random client waits until its page is transmitted. We seek approximate polynomial solutions. Approximation bounds are obtained by comparing the costs of a solution provided by an algorithm and a solution to a relaxed (non-integral) version of the problem. A key quantity in our methodology is a fraction we denote by […], that depends on the maximum demand probability: […]. The best known polynomial algorithm to date guarantees an approximation of […]. In this paper, we develop a tree-based methodology for perfectly periodic scheduling, and using new techniques, we derive algorithms with better bounds. For small values, our best algorithm guarantees approximation of […]. On the other hand, we show that the integrality gap between the cost of any perfectly periodic schedule and the cost of the fractional problem is at least […]. We also provide algorithms with good performance guarantees for large values of […]. Received: December 2001 / Accepted: September 2002
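The objective in entry 13, the average time a random client waits, can be made concrete with a small calculation. The sketch assumes that a client arrives at a uniformly random time and therefore waits half a period on average for its page; this modelling assumption, the function names, and the dictionary-based inputs are illustrative and are not taken from the paper.

```python
from fractions import Fraction


def average_wait(demand, period):
    """Expected wait in slots when page i is sent exactly every period[i] slots.

    Assumes uniformly random arrival, so the mean wait for page i is period[i] / 2.
    """
    return sum(demand[page] * Fraction(period[page], 2) for page in demand)


def bandwidth_ok(period):
    """Necessary (not sufficient) condition for fitting all pages on one channel:
    the fractions of slots the pages occupy must not exceed 1."""
    return sum(Fraction(1, t) for t in period.values()) <= 1
```

For demand probabilities 2/3, 1/6, 1/6, giving the popular page period 2 and the others period 4 yields an average wait of 4/3 slots, compared with 3/2 slots for a round-robin schedule in which every page has period 3.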

14.
Many phenomena in nature and engineering happen simultaneously on rather diverse spatial and temporal scales. In other words, they exhibit a multi-scale character. A special numerical multilevel technique associated with a particular hierarchical data structure is adaptive mesh refinement (AMR). This scheme achieves locally very high spatial and temporal resolutions. Due to its popularity, many scientists are in need of interactive visualization tools for AMR data. In this article, we present a 3D texture-based volume-rendering algorithm for AMR data that directly utilizes the hierarchical structure. Thereby fast rendering performance is achieved even for high-resolution data sets. To avoid multiple rendering of regions that are covered by grids of different levels of resolution, we propose a space partitioning scheme to decompose the volume into axis-aligned regions of equal-sized cells. Furthermore the problems of interpolation artifacts, opacity corrections, and texture memory limitations are addressed. Published online: November 6, 2002. Correspondence to: R. Kähler

15.
Summary. The acronym CaRuD represents an interface specification and an algorithm for the management of memory shared by concurrent processes. The memory cells form a directed acyclic graph. This graph is only modified by adding a new node with a list of reachable children, and by removing unreachable nodes. If memory is not full, the algorithm ensures wait-free redistribution of free nodes. It uses atomic counters for reference counting and consensus variables to ensure exclusive access. Performance is enhanced by using nondeterminacy guided by insecure knowledge. Experiments indicate that the algorithm is very suitable for multiprocessing. Received: July 1998 / Accepted: July 2000

16.
Given a source node and a set of destination nodes in a network, the multicast routing problem is usually treated as a Steiner tree problem. Unlike this well-known tree-based routing model, multicast routing under the multi-path model is to find a set of paths rooted at the source node such that in each path at most a fixed number of destination nodes can be designated to receive the data and every destination node must be designated in a path to receive the data. The routing cost is the total cost of the paths found. In this paper we study how to construct a multicast routing of minimal cost under the multi-path model. We propose two approximation algorithms for this NP-complete problem with guaranteed performance ratios.

17.
Performance Evaluation, 1999, 35(1-2): 49-74
Multicast network traffic is information with one source node, but many destination nodes. Rather than setting up individual connections between the source node and each destination node, or broadcasting the information to the entire network, multicasting efficiently exploits link capacity by allowing the source node to transmit a small number of copies of the information to mutually-exclusive groups of destination nodes. Multicasting is an important topic in the fields of networking (video and audio conferencing, video on demand, local-area network interconnection) and computer architecture (cache coherency, multiprocessor message passing). In this paper, we derive approximate expressions for the minimum cost (in terms of link utilization) of shortest-path multicast traffic in arbitrary tree networks. Our results provide a theoretical best-case scenario for link utilization of multicast distribution in tree topologies overlaid onto arbitrary graphs. In real networks such as the Internet MBONE, multicast distribution paths are often tree-like, but contain some cycles for purposes of fault tolerance. We find that even for richly-connected graphs such as the shufflenet and the hypercube, our expression provides a good prediction of the cost (in terms of link utilization) of multicast communication. Thus, this theoretical result has two applications: (1) a lower bound on the link capacity required for multicasting in random tree topologies, and (2) an approximation of the cost of multicasting in regular LAN and MAN topologies.

18.
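To address the low efficiency of spatial queries on R-tree indexes, this paper proposes the SR-tree index, an R-tree index structure based on optimized node splitting. During node splitting, the SR-tree index reduces the area covered by the minimum bounding rectangles of leaf nodes by increasing the spatial clustering of the data in each leaf node. To reduce disk read and write costs, SR-tree nodes are written to the index as follows: the index tree is first built in memory, the tree information is then written to the file, and finally the nodes are written recursively. Experimental results show that, compared with the R-tree index, the SR-tree index reduces the overlap area of the minimum bounding rectangles while effectively lowering query response time, thereby improving query efficiency.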

19.
Searching in metric spaces by spatial approximation   Total citations: 5 (self-citations: 0; citations by others: 5)
We propose a new data structure to search in metric spaces. A metric space is formed by a collection of objects and a distance function defined among them which satisfies the triangle inequality. The goal is, given a set of objects and a query, retrieve those objects close enough to the query. The complexity measure is the number of distances computed to achieve this goal. Our data structure, called sa-tree (“spatial approximation tree”), is based on approaching the searched objects spatially, that is, getting closer and closer to them, rather than the classic divide-and-conquer approach of other data structures. We analyze our method and show that the number of distance evaluations to search among n objects is sublinear. We show experimentally that the sa-tree is the best existing technique when the metric space is hard to search or the query has low selectivity. These are the most important unsolved cases in real applications. As a practical advantage, our data structure is one of the few that does not need to tune parameters, which makes it appealing for use by non-experts. Edited by R. Sacks-Davis. Received: 17 April 2001 / Accepted: 24 January 2002 / Published online: 14 May 2002

20.
In many decision-making scenarios, decision makers require rapid feedback to their queries, which typically involve aggregates. The traditional blocking execution model can no longer meet the demands of these users. One promising approach in the literature, called online aggregation, evaluates an aggregation query progressively as follows: as soon as certain data have been evaluated, approximate answers are produced with their respective running confidence intervals; as more data are examined, the answers and their corresponding running confidence intervals are refined. In this paper, we extend this approach to handle nested queries with aggregates (i.e., at least one inner query block is an aggregate query) by providing users with (approximate) answers progressively as the inner aggregation query blocks are evaluated. We address the new issues posed by nested queries. In particular, the answer space begins with a superset of the final answers and is refined as the aggregates from the inner query blocks are refined. For the intermediary answers to be meaningful, they have to be interpreted with the aggregates from the inner queries. We also propose a multi-threaded model in evaluating such queries: each query block is assigned to a thread, and the threads can be evaluated concurrently and independently. The time slice across the threads is nondeterministic in the sense that the user controls the relative rate at which these subqueries are being evaluated. For enumerative nested queries, we propose a priority-based evaluation strategy to present answers that are certainly in the final answer space first, before presenting those whose validity may be affected as the inner query aggregates are refined. We implemented a prototype system using Java and evaluated our system. Results for nested queries with a single level and with multiple levels of nesting are reported. Our results show the effectiveness of the proposed mechanisms in providing progressive feedback that reduces the initial waiting time of users significantly without sacrificing the quality of the answers. Received April 25, 2000 / Accepted June 27, 2000
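The basic online aggregation loop that entry 20 builds on, a running estimate with a confidence interval that tightens as more data are examined, can be sketched for a single non-nested AVG query. The sketch assumes rows arrive in random order and uses a normal approximation for the interval; the function name, the reporting interval, and these statistical choices are assumptions for illustration, not the paper's nested-query mechanism.

```python
import math
from statistics import NormalDist


def online_average(values, confidence=0.95, report_every=1000):
    """Yield (rows_seen, running_mean, half_width) every `report_every` rows.

    Uses Welford's update for mean and variance, and a normal-approximation
    confidence interval (assumes rows are consumed in random order).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n, mean, m2 = 0, 0.0, 0.0
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        if n > 1 and n % report_every == 0:
            variance = m2 / (n - 1)
            yield n, mean, z * math.sqrt(variance / n)
```

Each yielded triple is an approximate answer the user could see immediately; the half-width shrinks roughly in proportion to one over the square root of the number of rows seen.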
