Distributed and Parallel Databases, Volume 27, Issue 3, pp 211–254

Distributed stream join query processing with semijoins

Abstract

This paper addresses distributed stream processing of window-based multi-way join queries, treating the semijoin as a key join operator. In distributed stream processing, data streams arriving at remote sites must be shipped to the processing site for query execution, which typically introduces high communication overhead. Our observation is that the semijoin, which is effective in reducing communication overhead in distributed database query processing, can also be effective in distributed stream query processing. The challenge, however, lies in the streaming nature of the tuples: it requires continuous and incremental processing of an unbounded sequence of tuples rather than one-time processing of a set of stored tuples. This paper describes our comprehensive work to address this challenge. Specifically, we first propose a distributed stream join processing model that handles the network delays introduced by shipping data streams and allows for efficient batch processing. Then, based on the model, we propose join algorithms for the multi-way join case: first, one-way join algorithms for different combinations of join placement and join method and, then, multi-way join algorithms assuming linear join ordering. Regarding the join method, two distributed join methods are introduced: (1) simple join, in which full tuples are forwarded to the query processing site, and (2) semijoin-based join, in which partial tuples are forwarded. A semijoin-based join can be executed with different semijoin strategies, which incur different communication overheads. We present a complete set of join algorithms considering all possible semijoin strategies, and propose an optimization algorithm. The join algorithms are executed continuously and incrementally as tuples arrive, and never ship tuples redundantly.
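The contrast between the two join methods can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a single batch of window contents at one remote site and one processing site, a hypothetical `(key, payload)` tuple layout, and one classic semijoin strategy (ship join keys first, then only the matching full tuples). The function names `simple_join` and `semijoin_join` are invented for illustration.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]            # hypothetical layout: (join key, payload)
Joined = Tuple[int, str, str]    # (join key, remote payload, local payload)


def index_by_key(rows: List[Row]) -> Dict[int, List[str]]:
    """Group payloads by join key so lookups during the join are cheap."""
    index: Dict[int, List[str]] = {}
    for key, payload in rows:
        index.setdefault(key, []).append(payload)
    return index


def simple_join(remote: List[Row], local: List[Row]) -> Tuple[List[Joined], int]:
    """Ship every remote tuple in full, then join at the processing site."""
    local_index = index_by_key(local)
    result = [(k, p, q) for k, p in remote for q in local_index.get(k, [])]
    return result, len(remote)   # second value: full tuples shipped


def semijoin_join(remote: List[Row], local: List[Row]) -> Tuple[List[Joined], int, int]:
    """Ship join keys first, then only the full tuples whose key matched."""
    local_index = index_by_key(local)
    keys_shipped = [k for k, _ in remote]                    # partial tuples
    matching = {k for k in keys_shipped if k in local_index}
    full_shipped = [(k, p) for k, p in remote if k in matching]
    result = [(k, p, q) for k, p in full_shipped for q in local_index[k]]
    return result, len(keys_shipped), len(full_shipped)


# Toy window contents (assumed data, not from the paper).
remote_window = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
local_window = [(2, "x"), (4, "y"), (4, "z"), (9, "w")]

simple_result, full_count = simple_join(remote_window, local_window)
semi_result, key_count, matched_count = semijoin_join(remote_window, local_window)
```

Both functions return the same join result; they differ only in what crosses the network. The semijoin variant pays off when join keys are much smaller than full tuples and few remote tuples find a match.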
The optimization algorithm constructs an efficient multi-way join plan using a greedy heuristic that, in each step, adds to the plan the stream with the minimum join execution cost. Through extensive experiments, we compare the performance of the proposed one-way join algorithms, and we compare the efficiency of the plans generated by the greedy optimization algorithm against those found by exhaustive search.
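The greedy step described here can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the cost model, so `step_cost` is a caller-supplied hypothetical function, and the toy cost used below (stream rate multiplied by the position in the plan) is invented purely to make the example concrete.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical signature: cost of joining `stream` onto the plan built so far.
StepCost = Callable[[Sequence[str], str], float]


def greedy_linear_plan(streams: Sequence[str], step_cost: StepCost) -> Tuple[List[str], float]:
    """Build a linear join order by repeatedly adding the cheapest next stream."""
    remaining = list(streams)
    plan: List[str] = []
    total = 0.0
    while remaining:
        # Pick the stream whose addition costs least given the current plan.
        best = min(remaining, key=lambda s: step_cost(plan, s))
        total += step_cost(plan, best)
        plan.append(best)
        remaining.remove(best)
    return plan, total


# Toy cost model (an assumption, not the paper's): rate times plan position.
rates: Dict[str, int] = {"A": 100, "B": 10, "C": 50}


def toy_cost(plan: Sequence[str], stream: str) -> float:
    return rates[stream] * (len(plan) + 1)


plan, total_cost = greedy_linear_plan(["A", "B", "C"], toy_cost)
```

A greedy choice of this kind evaluates only a quadratic number of candidate steps, whereas exhaustive search over linear orders considers every permutation; the trade-off is that the greedy plan is not guaranteed to be optimal.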

Keywords

Distributed data streams, Join queries, Semijoins

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Department of Computer Science, University of Vermont, Burlington, USA
