Partitioning functions for stateful data parallelism in stream processing

  • Regular Paper
The VLDB Journal

Abstract

In this paper, we study partitioning functions for stream processing systems that employ stateful data parallelism to improve application throughput. In particular, we develop partitioning functions that are effective under workloads where the domain of the partitioning key is large and its value distribution is skewed. We define several desirable properties for partitioning functions: balance properties such as memory, processing, and communication balance; structural properties such as compactness and fast lookup; and adaptation properties such as fast computation and minimal migration. We introduce a partitioning function structure that is compact and develop several associated heuristic construction techniques that exhibit good balance and low migration cost under skewed workloads. We provide experimental results that compare our partitioning functions to more traditional approaches, such as uniform and consistent hashing, under different workload and application characteristics, and show that our functions perform better.

Notes

  1. Consistent hashing migrates items only from existing nodes to the newly added node; no migrations happen between existing nodes.

  2. The lower bound does not hold during system initialization, as there is not enough history to use.

    Fig. 2: Using three lossy counters over tumbling windows to emulate a sliding window
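
The migration property stated in Note 1 can be checked with a small experiment. The following is a minimal sketch of a generic consistent hash ring, not the implementation evaluated in the paper: the class and function names, the use of MD5 for ring positions, and the choice of 64 virtual points per node are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a deterministic position on the hash ring (assumed: MD5)."""
    digest = hashlib.md5(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    """Hypothetical ring: each node owns several virtual points on the ring."""

    def __init__(self, nodes, replicas=64):
        self.replicas = replicas
        self.points = []  # sorted ring positions
        self.owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.replicas):
            pos = ring_position(f"{node}#{i}")
            if pos in self.owner:
                continue  # never take over a point already owned by another node
            self.owner[pos] = node
            bisect.insort(self.points, pos)

    def lookup(self, key):
        # A key belongs to the first virtual point clockwise from its position.
        pos = ring_position(str(key))
        idx = bisect.bisect_right(self.points, pos) % len(self.points)
        return self.owner[self.points[idx]]


def migrations_after_adding(nodes, new_node, keys):
    """Return {key: (old_owner, new_owner)} for keys that move when new_node joins."""
    keys = list(keys)
    ring = ConsistentHashRing(nodes)
    before = {k: ring.lookup(k) for k in keys}
    ring.add_node(new_node)
    moved = {}
    for k in keys:
        after = ring.lookup(k)
        if after != before[k]:
            moved[k] = (before[k], after)
    return moved


if __name__ == "__main__":
    moved = migrations_after_adding(["n0", "n1", "n2"], "n3", range(10000))
    # Every migrated key should now live on the newly added node.
    print(all(dst == "n3" for _, dst in moved.values()))
```

Adding a node only inserts new points on the ring, so a key's clockwise successor either stays the same or becomes one of the new node's points. That is why no key moves between two pre-existing nodes in this sketch.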

Acknowledgments

We thank IBM Thomas J. Watson Research Center for providing access to compute and software resources that made this research possible.

Author information

Corresponding author

Correspondence to Buğra Gedik.

About this article

Cite this article

Gedik, B. Partitioning functions for stateful data parallelism in stream processing. The VLDB Journal 23, 517–539 (2014). https://doi.org/10.1007/s00778-013-0335-9

  • DOI: https://doi.org/10.1007/s00778-013-0335-9
