Flexible and Adaptive Stream Join Algorithm

  • Junhua Fang
  • Xiaotong Wang
  • Rong Zhang (corresponding author)
  • Aoying Zhou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9932)

Abstract

Flexibility and self-adaptivity are important for real-time join processing in a parallel shared-nothing environment. Join-Matrix is a high-performance model for distributed stream joins that supports arbitrary join predicates. It is resilient to data skew because it routes tuples randomly to cells, with each stream corresponding to one side of the matrix. The design of the matrix partitioning scheme is a determining factor in maximizing system throughput while economizing computing resources. In this paper, we propose a novel flexible and adaptive scheme-partitioning algorithm for the stream join operator, which ensures high throughput with economical resource usage by allocating resources on demand. Specifically, a lightweight scheme generator, which requires a sample of each stream's volume and the processing resource quota of each physical machine, generates a join scheme; a migration plan generator then decides how to migrate data among machines so as to minimize migration cost while ensuring correctness. Extensive experiments on different kinds of join workloads show that the approach is highly competitive with baseline systems on a benchmark.
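
The join-matrix model mentioned in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the class name `JoinMatrix`, its methods, and the in-memory cell stores are hypothetical and chosen only for illustration. The sketch assumes the usual formulation of the model, in which a tuple from one stream is assigned to a random row and replicated to every cell of that row, while a tuple from the other stream is assigned to a random column and replicated to every cell of that column, so that each pair of tuples meets in exactly one cell.

```python
import random
from collections import defaultdict


class JoinMatrix:
    """Toy join-matrix (hypothetical names): stream R maps to rows, stream S to columns."""

    def __init__(self, rows, cols, predicate, seed=None):
        self.rows = rows
        self.cols = cols
        self.predicate = predicate          # arbitrary (theta) join condition
        self.rng = random.Random(seed)
        # Per-cell state: cell (i, j) holds copies of the R tuples routed to
        # row i and of the S tuples routed to column j.
        self.r_store = defaultdict(list)
        self.s_store = defaultdict(list)

    def insert_r(self, r):
        """Route an R tuple to a random row and probe the stored S tuples there."""
        i = self.rng.randrange(self.rows)   # random choice, independent of the key
        results = []
        for j in range(self.cols):
            cell = (i, j)
            results.extend((r, s) for s in self.s_store[cell] if self.predicate(r, s))
            self.r_store[cell].append(r)
        return results

    def insert_s(self, s):
        """Route an S tuple to a random column and probe the stored R tuples there."""
        j = self.rng.randrange(self.cols)
        results = []
        for i in range(self.rows):
            cell = (i, j)
            results.extend((r, s) for r in self.r_store[cell] if self.predicate(r, s))
            self.s_store[cell].append(s)
        return results
```

Because the row and column are picked at random rather than by join key, a heavily repeated key does not concentrate load on one cell. The price is replication: in this sketch each R tuple is stored once per column and each S tuple once per row, which is why the shape of the matrix affects both memory use and throughput.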

Keywords

Matrix Model; Migration Cost; Task Number; Migration Volume; Matrix Scheme
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgments

This work is partially supported by the National High Technology Research and Development Program of China (863 Project) under grant No. 2015AA015307, the National Science Foundation of China under grants No. 61232002 and No. 61332006, and the National Science Foundation of Shanghai under grant No. 14ZR1412600. The corresponding author is Rong Zhang.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Junhua Fang (1)
  • Xiaotong Wang (1)
  • Rong Zhang (1), corresponding author
  • Aoying Zhou (1)
  1. Institute for Data Science and Engineering, Software Engineering Institute, East China Normal University, Shanghai, China
