Distributed stream join query processing with semijoins |
| |
Authors: | Tri Minh Tran Byung Suk Lee |
| |
Affiliation: | (1) Command and Control Directorate, Systems Division, ROK Air Force, P.O. Box, 309, Sinjang-dong, Pyeongtaek-si, Gyeonggi-do, Republic of Korea;(2) Department of Computer Science, Sookmyung Women’s University, 52 Hyochangwon-gil, Yongsan-gu, Seoul, 140-742, Republic of Korea;(3) Department of Computer Science, Korea Advanced Institute of Science and Technology (KAIST), 373-1 Guseong-Dong, Yuseong-Gu, Daejeon, 305-701, Republic of Korea |
| |
Abstract: | This paper addresses the distributed stream processing of window-based multi-way join queries considering the semijoin as
a key join operator. In distributed stream processing, data streams arriving at remote sites need to be shipped to the processing
site for query execution. This typically introduces high communication overhead. Our observation is that semijoin, effective
in reducing communication overhead in distributed database query processing, can be also effective in distributed stream query
processing. The challenge, however, lies in the streaming nature of the tuples, as it requires continuous and incremental
processing of an unbounded sequence of tuples instead of one-time processing of a set of stored tuples. This paper describes
our comprehensive work done to address the challenge. Specifically, we first propose a distributed stream join processing
model that handles the issue of network delays introduced from the shipment of data streams, and allows for efficient batch
processing. Then, based on the model, we propose join algorithms in a multi-way join case: first, one-way join algorithms
for different combinations of join placement and join method and, then, multi-way join algorithms assuming linear join ordering.
Regarding the join method, two distributed join methods are introduced: (1) simple join, in which full tuples are forwarded to the query processing site and (2) semijoin-based join, in which partial tuples are forwarded. A semijoin-based join can be executed with different possible semijoin strategies which incur different
communication overheads. We present a complete set of join algorithms considering all possible semijoin strategies, and propose
an optimization algorithm. The join algorithms are executed continuously in an incremental manner as tuples arrive, and never
ship tuples redundantly. The optimization algorithm constructs an efficient multi-way join plan by using a greedy heuristic
which adds to the plan one stream with the minimum join execution cost in each step. Through extensive experiments, we conduct
comparative studies of the performance among the proposed one-way join algorithms and the efficiency of the generated plan
between the optimization algorithm based on the greedy heuristic and the exhaustive search, respectively. |
| |
Keywords: | |
本文献已被 SpringerLink 等数据库收录! |
|