20 matching records found (search time: 31 ms)
1.
We propose a new algorithm, called Stripe-join, for performing a join given a join index. Stripe-join is inspired by an algorithm called ‘Jive-join’ developed by Li and Ross. Stripe-join makes a single sequential pass through each input relation, in addition to one pass through the join index and two passes through a set of temporary files that contain tuple identifiers but no input tuples. Stripe-join performs this efficiently even when the input relations are much larger than main memory, as long as the number of blocks in main memory is of the order of the square root of the number of blocks in the participating relations. Stripe-join is particularly efficient for self-joins. To our knowledge, Stripe-join is the first algorithm that, given a join index and a relation significantly larger than main memory, can perform a self-join with just a single pass over the input relation and without storing input tuples in intermediate files. Almost all the I/O is sequential, thus minimizing the impact of seek and rotational latency. The algorithm is resistant to data skew. It can also join multiple relations while still making only a single pass over each input relation. Using a detailed cost model, Stripe-join is analyzed and compared with competing algorithms. For large input relations, Stripe-join performs significantly better than Valduriez's algorithm and hash join algorithms. We demonstrate circumstances under which Stripe-join performs significantly better than Jive-join. Unlike Jive-join, Stripe-join makes no assumptions about the order of the join index.
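As a rough illustration of the idea (a simplified in-memory sketch, not the published Stripe-join or Jive-join): given a join index of (r-TID, s-TID) pairs, the output can be partitioned by s-TID ranges so that the temporary buffers hold only tuple identifiers, and each relation is then read once, partition by partition. All names below (`join_via_index`, `n_partitions`) are illustrative.

```python
# Illustrative sketch only: join two relations via a join index, buffering
# only TIDs in the temporary partitions. Relations are lists indexed by TID;
# the join index is a list of (r_tid, s_tid) pairs. On disk, the real
# algorithms arrange these accesses to be sequential; in memory we just
# mimic the partition-by-s-TID structure.

def join_via_index(r, s, join_index, n_partitions=2):
    # Partition by s-TID ranges so each partition's s-tuples could fit in
    # a limited buffer.
    part_size = (len(s) + n_partitions - 1) // n_partitions
    temp = [[] for _ in range(n_partitions)]        # holds TIDs only

    # Pass over the join index: route each entry to its s-TID partition.
    for r_tid, s_tid in join_index:
        temp[s_tid // part_size].append((r_tid, s_tid))

    # One pass over r: attach r-tuples to the temp entries (in r-TID order).
    for p in range(n_partitions):
        temp[p] = [(r[r_tid], s_tid) for r_tid, s_tid in sorted(temp[p])]

    # One pass over s, reading each partition's block exactly once.
    result = []
    for p in range(n_partitions):
        lo = p * part_size
        block = s[lo:lo + part_size]                # contiguous read of s
        for r_tuple, s_tid in temp[p]:
            result.append((r_tuple, block[s_tid - lo]))
    return result
```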
2.
Jive is a new algorithm that exploits an existing data structure, the join index. By producing intermediate temporary files and partitioning the output join result, it completes the join with only a single scan of each input relation. It generally outperforms the join algorithm proposed by Valduriez (1987), and its advantage is especially pronounced when the input relations are very large.
3.
Tae-Hyung Kwon, Ki Yong Lee, Myoung Ho Kim 《Journal of Intelligent Information Systems》2011,37(2):245-265
We address the problem of load shedding for continuous multi-way join queries over multiple data streams. When the arrival rates of tuples from data streams exceed the system capacity, a load shedding algorithm drops some subset of input tuples to avoid system overloads. To decide which tuples to drop among the input tuples, most existing load shedding algorithms determine the priority of each input tuple based on the frequency or some historical statistics of its join attribute value, and then drop tuples with the lowest priority. However, those value-based algorithms cannot determine the priorities of tuples properly in environments where join attribute values are unique and each join attribute value occurs at most once in each data stream. In this paper, we propose a load shedding algorithm specifically designed for such environments. The proposed load shedding algorithm determines the priority of each tuple based on the order of streams in which its join attribute value appears, rather than its join attribute value itself. Consequently, the priorities of tuples can be determined effectively in environments where join attribute values are unique and do not repeat. The experimental results show that the proposed algorithm outperforms the existing algorithms in such environments in terms of effectiveness and efficiency.
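A minimal sketch of the order-based intuition the abstract describes (an assumed simplification, not the authors' exact algorithm): a tuple's priority is the number of *other* streams in which its join attribute value has already appeared, so values that are closer to completing a multi-way match are kept in preference to fresh ones.

```python
# Hedged illustration: order-based load shedding for a multi-way stream join.
# `seen_in` maps each join value to the set of stream ids it has appeared in.

from collections import defaultdict

def shed(buffer, capacity, seen_in):
    """Keep the `capacity` highest-priority tuples from `buffer`.

    buffer   : list of (stream_id, join_value) tuples
    seen_in  : join_value -> set of stream_ids observed so far
    """
    def priority(t):
        stream_id, value = t
        # Appearances of this value in streams other than the tuple's own.
        return len(seen_in[value] - {stream_id})
    # sorted() is stable, so ties keep arrival order.
    return sorted(buffer, key=priority, reverse=True)[:capacity]

seen = defaultdict(set)
arrivals = [(0, "a"), (1, "a"), (0, "b"), (2, "a"), (1, "c")]
for stream_id, value in arrivals:
    seen[value].add(stream_id)
kept = shed(arrivals, 3, seen)   # value "a" spans three streams, so it wins
```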
4.
This paper proposes a category- and selection-enabled nearest neighbor join (NNJ) between relation r and relation s, with similarity on T and support for category attributes C and selection predicate θ. Our solution does not suffer from redundant fetches and index false hits, which are the main performance bottlenecks of current nearest neighbor join techniques. A category-enabled NNJ leverages the category attributes C for query evaluation. For example, the categories of relation r can be used to ensure that relation s is accessed at most once. Solutions that are not category-enabled must process each category independently and end up fetching, either from disk or memory, the blocks of the input relations multiple times. A selection-enabled NNJ performs well independent of whether the DBMS optimizer pushes the selection down or evaluates it on the fly. In contrast, index-based solutions suffer from many index false hits or end up in an expensive nested loop. Our solution does not constrain the physical design, and is efficient for row- as well as column-stores. Current solutions for column-stores use late materialization, which is only efficient if the data is clustered on the category attributes C. Our evaluation algorithm finds, for each outer tuple r, the inner tuples that satisfy the equality on the category and have the smallest distance to r with only one scan of both inputs. We experimentally evaluate our solution using a data warehouse that manages analyses of animal feeds.
5.
Evan P. Harris, Kotagiri Ramamohanarao 《The VLDB Journal: The International Journal on Very Large Data Bases》1996,5(1):64-84
A method of analysing join algorithms based upon the time required to access, transfer and perform the relevant CPU-based operations on a disk page is proposed. The costs of variations of several of the standard join algorithms, including nested block, sort-merge, GRACE hash and hybrid hash, are presented. For a given total buffer size, the cost of these join algorithms depends on the parts of the buffer allocated for each purpose. For example, when joining two relations using the nested block join algorithm, the amount of buffer space allocated for the outer and inner relations can significantly affect the cost of the join. Analysis of expected and experimental results of various join algorithms shows that a combination of the optimal nested block and optimal GRACE hash join algorithms usually provides the greatest cost benefit, unless the relation size is a small multiple of the memory size. Algorithms to quickly determine a buffer allocation producing the minimal cost for each of these algorithms are presented. When the relation size is a small multiple of the amount of main memory available (typically up to three to six times), the hybrid hash join algorithm is preferable.
Edited by Masaru Kitsuregawa.
Received April 26, 1993 / Revised March 3, 1994 / Accepted October 13, 1994
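The buffer-allocation trade-off described above can be sketched with the textbook nested-block I/O cost model (an assumed simplification of the paper's more detailed page-time model): read R once, and scan S once per chunk of R that fits in the outer buffer.

```python
import math

# Hedged sketch: for a total buffer of b_total pages, try every split between
# the outer buffer and the (remaining) inner buffer and keep the cheapest.
# Under this simple model the cost depends only on the outer allocation, so
# the optimum gives the outer relation as many pages as possible.

def nested_block_cost(pages_r, pages_s, b_outer, b_total):
    b_inner = b_total - b_outer          # pages left for the inner relation
    if b_outer < 1 or b_inner < 1:
        return math.inf                  # infeasible allocation
    # Read R once; scan S once per outer chunk.
    return pages_r + math.ceil(pages_r / b_outer) * pages_s

def best_allocation(pages_r, pages_s, b_total):
    return min(range(1, b_total),
               key=lambda b: nested_block_cost(pages_r, pages_s, b, b_total))
```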
6.
An important aspect of database processing in parallel computer systems is the use of data parallel algorithms. Several parallel algorithms for the relational database join operation in a hypercube multicomputer system are given. The join algorithms are classified as cycling or global partitioning based on the tuple distribution method employed. The various algorithms are compared under a common framework, using time complexity analysis as well as an implementation on a 64-node NCUBE hypercube system. In general, the global partitioning algorithms demonstrate better speedup. However, the cycling algorithm can perform better than the global algorithms in specific situations, viz., when the difference in input relation cardinalities is large and the hypercube dimension is small. The usefulness of the data redistribution operation in improving the performance of the join algorithms, in the presence of uneven data partitions, is examined. The results indicate that redistribution significantly decreases the join algorithm execution times for unbalanced partitions.
7.
8.
Based on the assumption that selections are zero-cost operations, "selection pushdown" rules, which apply selections in arbitrary order before as many joins as possible in order to reduce subsequent join costs, have been widely applied in traditional query optimization. However, in multimedia information systems, selections generally contain expensive multimedia operations, so "pushdown" rules no longer produce optimal query execution plans. In this paper we therefore develop a theory for optimizing queries with expensive multimedia operations, which establishes the optimal placement of each multimedia operation in a query plan by jointly considering the selectivity and unit execution cost of each operation. We then present an algorithm based on this theory and implement it in a prototype system. Experimental results show that, compared with traditional optimization algorithms, our algorithm not only has modest time complexity, polynomial in the number of multimedia operations in a query plan, but also reduces the execution cost of a query plan by orders of magnitude.
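The selectivity-versus-cost trade-off can be illustrated with the classic rank-ordering heuristic for expensive predicates (a standard rule from the predicate-placement literature, not necessarily this paper's exact theory): apply predicates in increasing order of (selectivity − 1) / per-tuple cost.

```python
# Hedged sketch: ordering expensive predicates by the classic "rank" metric
# and costing a pipeline where each predicate is paid only on tuples that
# survived the earlier ones.

def order_predicates(preds):
    """preds: list of (name, selectivity, cost_per_tuple)."""
    return sorted(preds, key=lambda p: (p[1] - 1.0) / p[2])

def expected_cost(preds, n_tuples):
    total, survivors = 0.0, float(n_tuples)
    for _, sel, cost in preds:
        total += survivors * cost        # pay cost on surviving tuples
        survivors *= sel                 # selectivity thins the stream
    return total

# Hypothetical predicates: a cheap color filter and an expensive face match.
preds = [("face_match", 0.1, 50.0), ("color_filter", 0.5, 1.0)]
ordered = order_predicates(preds)        # the cheap filter goes first
```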
9.
Sudipto Guha 《The VLDB Journal: The International Journal on Very Large Data Bases》2008,17(6):1509-1535
Synopses construction algorithms have been found to be of interest in query optimization, approximate query answering and mining, and over the last few years several good synopsis construction algorithms have been proposed. These algorithms have mostly focused on the running time of the synopsis construction vis-à-vis the synopsis quality. However, the space complexity of synopsis construction algorithms has not been investigated as thoroughly. Many of the optimum synopsis construction algorithms are expensive in space. For some of these algorithms the space required to construct the synopsis is significantly larger than the space required to store the input. These algorithms rely on the fact that they require a smaller "working space" and most of the data can be resident on disc. The large space complexity of synopsis construction algorithms is a handicap in several scenarios. In the case of streaming algorithms, space is a fundamental constraint. In the case of offline optimal or approximate algorithms, a better space complexity often makes these algorithms much more attractive by allowing them to run in main memory and not use disc, or alternately allows us to scale to significantly larger problems without running out of space. In this paper, we propose a simple and general technique that reduces the space complexity of synopsis construction algorithms. As a consequence we show that the notion of "working space" proposed in these contexts is redundant. This technique can be easily applied to many existing algorithms for synopsis construction problems. We demonstrate the performance benefits of our proposal through experiments on real-life and synthetic data. We believe that our algorithm also generalizes to a broader range of dynamic programs beyond synopsis construction.
Sudipto Guha's research was supported in part by an Alfred P. Sloan Research Fellowship and by NSF Awards CCF-0430376 and CCF-0644119.
A preliminary version of this paper appeared as "Space efficiency in synopsis construction algorithms", VLDB Conference 2005, Trondheim, [19].
10.
Karl Schnaitter, Joshua Spiegel, Neoklis Polyzotis 《The VLDB Journal: The International Journal on Very Large Data Bases》2009,18(2):521-542
A relational ranking query uses a scoring function to limit the results of a conventional query to a small number of the most relevant answers. The increasing popularity of this query paradigm has led to the introduction of specialized rank join operators that integrate the selection of top tuples with join processing. These operators access just "enough" of the input in order to generate just "enough" output and can offer significant speed-ups for query evaluation. The number of input tuples that an operator accesses is called the input depth of the operator, and this is the driving cost factor in rank join processing. This introduces the important problem of depth estimation, which is crucial for the costing of rank join operators during query compilation and thus for their integration in optimized physical plans. We introduce an estimation methodology, termed deep, for approximating the input depths of rank join operators in a physical execution plan. At the core of deep lies a general, principled framework that formalizes depth computation in terms of the joint distribution of scores in the base tables. This framework results in a systematic estimation methodology that takes the characteristics of the data directly into account and thus enables more accurate estimates. We develop novel estimation algorithms that provide an efficient realization of the formal deep framework, and describe their integration on top of the statistics module of an existing query optimizer. We validate the performance of deep with an extensive experimental study on data sets of varying characteristics. The results verify the effectiveness of deep as an estimation method and demonstrate its advantages over previously proposed techniques.
11.
Contemporary long-term storage devices feature powerful embedded processors and sizeable memory buffers. Active Storage Devices (ASD) is the hard disk technology that makes use of these significant resources not only to manage the disk operation but also to execute custom application code on large amounts of data. While prior research has shown that ASDs perform exceedingly well with filter-type algorithms, the evaluation of binary-relational operators has been limited. In this paper, we analyze and evaluate inter-operator parallelism of GRACE-based join algorithms that function atop ASDs. We derive accurate cost expressions for existing algorithms and expose performance bottlenecks; upon these findings we propose Active Hash Join, a new algorithm that exploits all system resources. Through experimentation, we confirm that existing algorithms are best suited for systems with either small or large numbers of ASDs. However, we find that the "adaptive" nature of Active Hash Join yields enhanced parallelism in all cases, especially when the aggregate ASD resources are comparable to the main CPU and main memory.
Recommended by: Ahmed Elmagarmid
Work partially supported by the University of Athens Research Foundation.
12.
The saturation algorithm for symbolic state-space exploration
Gianfranco Ciardo, Robert Marmorstein, Radu Siminiceanu 《International Journal on Software Tools for Technology Transfer (STTT)》2006,8(1):4-25
We present various algorithms for generating the state space of an asynchronous system based on the use of multiway decision diagrams to encode sets and Kronecker operators on boolean matrices to encode the next-state function. The Kronecker encoding allows us to recognize and exploit the "locality of effect" that events might have on state variables. In turn, locality information suggests better iteration strategies aimed at minimizing peak memory consumption. In particular, we focus on the saturation strategy, which is completely different from traditional breadth-first symbolic approaches, and extend its applicability to models where the possible values of the state variables are not known a priori. The resulting algorithm merges "on-the-fly" explicit state-space generation of each submodel with symbolic state-space generation of the overall model.
Each algorithm we present is implemented in our tool SmArT. This allows us to run fair and detailed comparisons between them on a suite of representative models. Saturation, in particular, is shown to be many orders of magnitude more efficient in terms of memory and time with respect to traditional methods.
13.
Flip Korn, Alexandros Labrinidis, Yannis Kotidis, Christos Faloutsos 《The VLDB Journal: The International Journal on Very Large Data Bases》2000,8(3-4):254-266
Association Rule Mining algorithms operate on a data matrix (e.g., customers × products) to derive association rules [AIS93b, SA96]. We propose a new paradigm, namely, Ratio Rules, which are quantifiable in that we can measure the "goodness" of a set of discovered rules. We also propose the "guessing error" as a measure of the "goodness", that is, the root-mean-square error of the reconstructed values of the cells of the given matrix, when we pretend that they are unknown. Another contribution is a novel method to guess missing/hidden values from the Ratio Rules that our method derives. For example, if somebody bought $10 of milk and $3 of bread, our rules can "guess" the amount spent on butter. Thus, unlike association rules, Ratio Rules can perform a variety of important tasks such as forecasting, answering "what-if" scenarios, detecting outliers, and visualizing the data. Moreover, we show that we can compute Ratio Rules in a single pass over the data set with small memory requirements (a few small matrices), in contrast to association rule mining methods which require multiple passes and/or large memory. Experiments on several real data sets (e.g., basketball and baseball statistics, biological data) demonstrate that the proposed method: (a) leads to rules that make sense; (b) can find large itemsets in binary matrices, even in the presence of noise; and (c) consistently achieves a "guessing error" of up to 5 times less than using straightforward column averages.
Received: March 15, 1999 / Accepted: November 1, 1999
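A toy sketch of the underlying idea (an assumed simplification: here a "ratio rule" is just the dominant eigenvector of XᵀX, scaled to match the known cells of a partial row to fill in the missing ones; all function names are illustrative):

```python
# Hedged illustration of eigenvector-based "guessing" of a missing cell.
# Pure-Python power iteration stands in for the SVD machinery.

def dominant_eigenvector(m, iters=200):
    v = [1.0] * len(m)
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def guess_missing(rows, partial):
    """rows: complete training rows; partial: row with None for missing cells."""
    d = len(rows[0])
    # Gram matrix X^T X, accumulable in a single pass over the rows.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    v = dominant_eigenvector(xtx)
    known = [i for i, x in enumerate(partial) if x is not None]
    # Scale the "ratio rule" direction to best match the known cells.
    scale = (sum(partial[i] * v[i] for i in known)
             / sum(v[i] * v[i] for i in known))
    return [x if x is not None else scale * v[i] for i, x in enumerate(partial)]
```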
14.
This paper presents a memory-constrained hash join algorithm (PaMeCo Join) designed to operate with main-memory column-store database systems. Whilst RAM has become more affordable and the popularity of main-memory database systems continues to grow, we recognize that RAM is a finite resource and that database systems rarely have an excess of memory available to them. Therefore, we design PaMeCo to operate within an arbitrary memory limitation by processing the input relations by parts, and by using a compact hash table that represents the contained tuples in a compact format. Coupled with a radix-clustering system that lowers memory latencies, we find that PaMeCo can offer competitive performance levels to other contemporary hash join algorithms in an unconstrained environment, while being up to three times faster than a high-performing hash join when memory constraints are applied.
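The process-by-parts strategy can be sketched as follows (an illustrative simplification, not PaMeCo itself; `max_build_rows` stands in for the memory budget, and there is no compact hash-table representation or radix clustering here):

```python
# Hedged sketch: when the build side exceeds a memory budget, build a bounded
# hash table per part and probe it with the full other input.

def partitioned_hash_join(build, probe, key_build, key_probe, max_build_rows):
    out = []
    for start in range(0, len(build), max_build_rows):
        table = {}                                   # bounded build-side table
        for row in build[start:start + max_build_rows]:
            table.setdefault(key_build(row), []).append(row)
        for row in probe:                            # one probe pass per part
            for match in table.get(key_probe(row), []):
                out.append((match, row))
    return out
```

One probe pass per build part trades extra probe-side scans for a hard cap on hash-table memory, which is the essence of the memory-constrained design the abstract describes.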
15.
Michael Bauland, Elmar Böhler, Nadia Creignou, Steffen Reith, Henning Schnoor, Heribert Vollmer 《Theory of Computing Systems》2010,47(2):454-490
In this paper we are interested in quantified propositional formulas in conjunctive normal form with "clauses" of arbitrary shape, i.e., consisting of applying arbitrary relations to variables. We study the complexity of the evaluation problem, the model checking problem, the equivalence problem, and the counting problem for such formulas, both with and without a bound on the number of quantifier alternations. For each of these computational goals we obtain full complexity classifications: we determine the complexity of each of these problems depending on the set of relations allowed in the input formulas. Thus, on the one hand we exhibit syntactic restrictions of the original problems that are still computationally hard, and on the other hand we identify non-trivial subcases that admit efficient algorithms.
16.
Performance problems severely constrain the application and development of spatial databases. Since the spatial join is the most complex and time-consuming basic operation in a spatial database, its processing efficiency largely determines overall system performance. Although many spatial join algorithms already exist, cost estimation and query optimization for spatial joins still require further study. Most spatial join algorithms are implemented on top of R-tree indexes; if the relations participating in a join have no index, or only partial indexes, special algorithms are needed to handle them. In addition, the cost models of the various algorithms need a relatively uniform basis of comparison; practice shows that, given the characteristics of spatial databases, estimating algorithm complexity by I/O cost is reasonable. On this basis, since a complex spatial query may involve several relations in spatial joins, a dynamic programming algorithm is applied to find the join order with minimal cost, yielding a general algorithmic framework. Complexity analysis of this framework shows that a spatial query optimization system built on it achieves good time and space efficiency and can handle very complex spatial queries.
17.
Shao Dong Chen, Hong Shen, R. Topor 《IEEE Transactions on Parallel and Distributed Systems》2002,13(4):413-431
We present four efficient parallel algorithms for computing a nonequijoin, called range-join, of two relations on N-dimensional mesh-connected computers. Range-joins of relations R and S are an important generalization of conventional equijoins and band-joins and are solved by permutation-based approaches in all proposed algorithms. In general, after sorting all subsets of both relations, the proposed algorithms permute every sorted subset of relation S to each processor in turn, where it is joined with the local subset of relation R. To permute the subsets of S efficiently, we propose two data permutation approaches, namely, the shifting approach which permutes the data recursively from lower dimensions to higher dimensions and the Hamiltonian-cycle approach which first constructs a Hamiltonian cycle on the mesh and then permutes the data along this cycle by repeatedly transferring data from each processor to its successor. We apply the shifting approach to meshes with different storage capacities which results in two different join algorithms. The basic shifting join (BASHJ) algorithm can minimize the number of subsets stored temporarily at a processor, but requires a large number of data transmissions, while the buffering shifting join (BUSHJ) algorithm can achieve a high parallelism and minimize the number of data transmissions, but requires a large number of subsets stored at each processor.
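For reference, the range-join operation itself (leaving aside the paper's parallel permutation strategies) can be sketched sequentially: each R-value joins with the S-values that fall within a distance epsilon of it.

```python
# Hedged sequential sketch of a range-join on a single attribute: sort one
# side, then answer each probe with a binary-searched window.

import bisect

def range_join(r, s, eps):
    s_sorted = sorted(s)
    out = []
    for x in r:
        lo = bisect.bisect_left(s_sorted, x - eps)    # first value >= x - eps
        hi = bisect.bisect_right(s_sorted, x + eps)   # past last value <= x + eps
        out.extend((x, y) for y in s_sorted[lo:hi])
    return out
```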
18.
The performance of online analytical processing (OLAP) is critical for meeting the increasing requirements of massive-volume analytical applications. Typical techniques, such as in-memory processing, column storage, and join indexes, focus on high-performance storage media, efficient storage models, and reduced query processing. While they perform OLAP applications effectively, there is a vital limitation: main-memory database based OLAP (MMOLAP) cannot provide high performance for a large data set. In this paper, we propose a novel memory dimension table model, in which the primary keys of the dimension table can be directly mapped to dimensional tuple addresses. To achieve higher performance of dimensional tuple access, we optimize our storage model for dimension tables based on OLAP query workload features. We present directly dimensional tuple accessing (DDTA) based join (DDTA-JOIN), a technique to optimize query processing on the memory dimension table by direct dimensional tuple access. We also contribute an optimization of the predicate tree that shortens predicate operation length by pruning useless predicate processing. Our experimental results show that the DDTA-JOIN algorithm is superior to both simulated row-store main-memory query processing and the open-source column-store main-memory database MonetDB, thanks to the reduced join cost and simple yet efficient query processing.
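The key-to-address mapping can be sketched as follows (illustrative only; it assumes dense surrogate keys so that a fact table's foreign key doubles as the array offset of the dimension tuple, which is the favorable case the abstract describes):

```python
# Hedged sketch of direct dimensional tuple access: the "join" degenerates
# into a single array read per fact row, with no hash probe or index lookup.

def ddta_join(fact_rows, dim_array, fk_pos):
    """Resolve each fact row's foreign key by direct offset."""
    return [(row, dim_array[row[fk_pos]]) for row in fact_rows]

dim = ["Asia", "Europe", "America"]      # address == primary key (dense)
facts = [(0, 9.5), (2, 3.1), (1, 7.7)]   # (region_fk, measure)
joined = ddta_join(facts, dim, 0)
```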
19.
Robust Algorithms For Generalized Pham Systems
20.
A Distributed Algorithm for Content Based Indexing of Images by Projections on Ritz Primary Images
Haim Schweitzer 《Data mining and knowledge discovery》1997,1(4):375-390
Large collections of images can be indexed by their projections on a few "primary" images. The optimal primary images are the eigenvectors of a large covariance matrix. We address the problem of computing primary images when access to the images is expensive. This is the case when the images cannot be kept locally, but must be accessed through slow communication such as the Internet, or stored in a compressed form. A distributed algorithm that computes optimal approximations to the eigenvectors (known as Ritz vectors) in one pass through the image set is proposed. When iterated, the algorithm can recover the exact eigenvectors. The widely used SVD technique for computing the primary images of a small image set is a special case of the proposed algorithm. In applications to image libraries and learning, it is necessary to compute different primary images for several sub-categories of the image set. The proposed algorithm can compute these additional primary images "offline", without the image data. Similar computation by other algorithms is impractical even when access to the images is inexpensive.
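A minimal single-pass illustration of the underlying linear algebra (not the paper's distributed Ritz procedure): accumulate the Gram matrix in one pass over the images, then compute the dominant eigenvector afterwards without revisiting the data.

```python
# Hedged sketch: primary images are eigenvectors of sum(x x^T), which can be
# accumulated in a single pass over the image set and diagonalized offline.

def accumulate_outer(images, d):
    a = [[0.0] * d for _ in range(d)]
    for x in images:                     # the single pass over the data
        for i in range(d):
            for j in range(d):
                a[i][j] += x[i] * x[j]
    return a

def power_iteration(a, iters=500):
    v = [1.0] * len(a)
    for _ in range(iters):
        w = [sum(a[i][j] * v[j] for j in range(len(v))) for i in range(len(a))]
        n = sum(x * x for x in w) ** 0.5
        v = [x / n for x in w]
    return v
```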