D-JB: An Online Join Method for Skewed and Varied Data Streams

  • Conference paper
Intelligence Science II (ICIS 2018)

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 539)

Abstract

Scalable distributed join processing in a parallel environment requires a partitioning policy that determines how data is transferred between nodes. Online theta-joins over data streams are more computationally expensive and impose higher memory requirements in distributed data stream management systems (DDSMS) than in database management systems (DBMS). The complete bipartite graph-based model supports distributed stream joins and is memory-efficient, elastic, and scalable. However, because stream rates are unstable and attribute values are unevenly distributed, online theta-joins over skewed and varied streams cause load imbalance across the cluster. In this paper, we present D-JB (Dynamic Join Biclique), a framework for handling skewed and varied streams that makes the join model more adaptable and minimizes system cost under varying workloads. Our proposal includes a mixed key-based and tuple-based partitioning scheme to handle skewed data on each side of the bipartite graph-based model, a strategy for redistributing query nodes between the two sides of the model, and a migration algorithm that preserves state consistency to support full-history joins. Experiments show that our method effectively handles skewed and varied data streams and improves the throughput of DDSMS.
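
To make the idea of mixing key-based and tuple-based partitioning concrete, the following is a minimal sketch and not the D-JB algorithm itself, which the abstract does not specify. It assumes an equi-join, a fixed frequency cutoff for deciding that a key is "hot", and round-robin spreading of hot keys. The class `MixedPartitioner` and the parameters `num_nodes` and `hot_threshold` are hypothetical names introduced only for illustration.

```python
import zlib
from collections import Counter
from itertools import count


class MixedPartitioner:
    """Routes tuples for one side of a join biclique (illustrative sketch only)."""

    def __init__(self, num_nodes, hot_threshold):
        self.num_nodes = num_nodes
        self.hot_threshold = hot_threshold  # assumed fixed cutoff on key frequency
        self.freq = Counter()               # tuples seen so far per join key
        self._rr = count()                  # round-robin counter for hot keys

    def _hash_node(self, key):
        # crc32 is deterministic across runs, unlike built-in hash() on str.
        return zlib.crc32(str(key).encode()) % self.num_nodes

    def is_hot(self, key):
        return self.freq[key] >= self.hot_threshold

    def store_target(self, key):
        """Pick the node that stores a tuple arriving on this side's stream."""
        self.freq[key] += 1
        if self.is_hot(key):
            # Tuple-based routing: spread a skewed key over all nodes.
            return next(self._rr) % self.num_nodes
        # Key-based routing: every tuple of a cold key lands on one node.
        return self._hash_node(key)

    def probe_targets(self, key):
        """Pick the nodes a tuple from the opposite stream must probe."""
        if self.is_hot(key):
            # Hot-key tuples may be stored anywhere, so probe every node.
            return list(range(self.num_nodes))
        return [self._hash_node(key)]
```

Under these assumptions, cold keys cost one probe message each, while hot keys trade extra probe messages for an even storage load. Because the per-key count only grows, a key never moves from hot back to cold, so a cold-key probe to the single hash node cannot miss tuples that were stored elsewhere.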

Author information

Corresponding author

Correspondence to Chunkai Wang.

Copyright information

© 2018 IFIP International Federation for Information Processing

Cite this paper

Wang, C., Feng, J., Shi, Z. (2018). D-JB: An Online Join Method for Skewed and Varied Data Streams. In: Shi, Z., Pennartz, C., Huang, T. (eds) Intelligence Science II. ICIS 2018. IFIP Advances in Information and Communication Technology, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-030-01313-4_12

  • DOI: https://doi.org/10.1007/978-3-030-01313-4_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01312-7

  • Online ISBN: 978-3-030-01313-4

  • eBook Packages: Computer Science, Computer Science (R0)
