Distributed and Parallel Databases

, Volume 27, Issue 3, pp 211–254

Distributed stream join query processing with semijoins

Authors

    • Department of Computer ScienceUniversity of Vermont
  • Byung Suk Lee
    • Department of Computer ScienceUniversity of Vermont
Article

DOI: 10.1007/s10619-010-7062-7

Cite this article as:
Tran, T.M. & Lee, B.S. Distrib Parallel Databases (2010) 27: 211. doi:10.1007/s10619-010-7062-7

Abstract

This paper addresses the distributed stream processing of window-based multi-way join queries considering the semijoin as a key join operator. In distributed stream processing, data streams arriving at remote sites need to be shipped to the processing site for query execution. This typically introduces high communication overhead. Our observation is that semijoin, effective in reducing communication overhead in distributed database query processing, can be also effective in distributed stream query processing. The challenge, however, lies in the streaming nature of the tuples, as it requires continuous and incremental processing of an unbounded sequence of tuples instead of one-time processing of a set of stored tuples. This paper describes our comprehensive work done to address the challenge. Specifically, we first propose a distributed stream join processing model that handles the issue of network delays introduced from the shipment of data streams, and allows for efficient batch processing. Then, based on the model, we propose join algorithms in a multi-way join case: first, one-way join algorithms for different combinations of join placement and join method and, then, multi-way join algorithms assuming linear join ordering. Regarding the join method, two distributed join methods are introduced: (1) simple join, in which full tuples are forwarded to the query processing site and (2) semijoin-based join, in which partial tuples are forwarded. A semijoin-based join can be executed with different possible semijoin strategies which incur different communication overheads. We present a complete set of join algorithms considering all possible semijoin strategies, and propose an optimization algorithm. The join algorithms are executed continuously in an incremental manner as tuples arrive, and never ship tuples redundantly. The optimization algorithm constructs an efficient multi-way join plan by using a greedy heuristic which adds to the plan one stream with the minimum join execution cost in each step. Through extensive experiments, we conduct comparative studies of the performance among the proposed one-way join algorithms and the efficiency of the generated plan between the optimization algorithm based on the greedy heuristic and the exhaustive search, respectively.

Keywords

Distributed data streamsJoin queriesSemijoins

Copyright information

© Springer Science+Business Media, LLC 2010