Abstract
Scalable distributed join processing in a parallel environment requires a partitioning policy for transferring data. Online theta-joins over data streams are computationally more expensive and impose higher memory requirements in distributed data stream management systems (DDSMS) than in database management systems (DBMS). The complete bipartite graph-based model can support distributed stream joins and is memory-efficient, elastic, and scalable. However, because stream rates are unstable and attribute values are unevenly distributed, online theta-joins over skewed and varied streams lead to load imbalance across the cluster. In this paper, we present D-JB (Dynamic Join Biclique), a framework for handling skewed and varied streams that enhances the adaptability of the join model and minimizes system cost under varying workloads. Our proposal includes a mixed key-based and tuple-based partitioning scheme to handle skewed data on each side of the bipartite graph-based model, a strategy for redistributing query nodes between the two sides of this model, and a migration algorithm that preserves state consistency to support full-history joins. Experiments show that our method can effectively handle skewed and varied data streams and improve the throughput of DDSMS.
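The idea of mixing key-based and tuple-based partitioning on one side of a bipartite join topology can be illustrated with a minimal sketch. It is not the D-JB algorithm itself. The class name `MixedPartitioner`, the `hot_threshold` cutoff, and the frequency-count test for skew are all hypothetical, and the sketch assumes an equi-join so that hash routing is valid. Keys seen rarely are hashed to one node. Keys that exceed the threshold are spread round-robin, and probes for them are broadcast to every node on that side.

```python
from collections import Counter
from itertools import count
from zlib import crc32


class MixedPartitioner:
    """Routes tuples for one side of a bipartite join topology (illustrative only)."""

    def __init__(self, num_nodes, hot_threshold):
        self.num_nodes = num_nodes
        self.hot_threshold = hot_threshold  # hypothetical skew cutoff, not from the paper
        self.freq = Counter()               # how often each key has been stored
        self.hot_keys = set()               # keys currently treated as skewed
        self._rr = count()                  # round-robin cursor for hot keys

    def _hash_node(self, key):
        # Deterministic hash so the same key always maps to the same node.
        return crc32(repr(key).encode()) % self.num_nodes

    def store_target(self, key):
        """Pick the node that stores a newly arrived tuple with this key."""
        self.freq[key] += 1
        if self.freq[key] > self.hot_threshold:
            self.hot_keys.add(key)
        if key in self.hot_keys:
            # Tuple-based routing: spread a hot key over all nodes.
            return next(self._rr) % self.num_nodes
        # Key-based routing: a cold key lives on a single node.
        return self._hash_node(key)

    def probe_targets(self, key):
        """Pick the nodes that a tuple from the opposite side must probe."""
        if key in self.hot_keys:
            # Hot-key tuples may sit anywhere, so probe every node.
            return list(range(self.num_nodes))
        return [self._hash_node(key)]
```

Because the broadcast set for a hot key includes the node the key was hashed to before it became hot, tuples stored earlier remain reachable without moving them. Handling keys that cool down again, or general theta predicates, would need state migration, which this sketch does not attempt.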
Copyright information
© 2018 IFIP International Federation for Information Processing
About this paper
Cite this paper
Wang, C., Feng, J., Shi, Z. (2018). D-JB: An Online Join Method for Skewed and Varied Data Streams. In: Shi, Z., Pennartz, C., Huang, T. (eds) Intelligence Science II. ICIS 2018. IFIP Advances in Information and Communication Technology, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-030-01313-4_12
DOI: https://doi.org/10.1007/978-3-030-01313-4_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01312-7
Online ISBN: 978-3-030-01313-4
eBook Packages: Computer Science, Computer Science (R0)