Partitioning functions for stateful data parallelism in stream processing

Gedik, Buğra

doi:10.1007/s00778-013-0335-9

Partitioning functions for stateful data parallelism in stream processing

Regular Paper
Published: 31 August 2013

Volume 23, pages 517–539, (2014)
Cite this article

The VLDB Journal Aims and scope Submit manuscript

Buğra Gedik¹

2393 Accesses
86 Citations
Explore all metrics

Abstract

In this paper, we study partitioning functions for stream processing systems that employ stateful data parallelism to improve application throughput. In particular, we develop partitioning functions that are effective under workloads where the domain of the partitioning key is large and its value distribution is skewed. We define various desirable properties for partitioning functions, ranging from balance properties such as memory, processing, and communication balance, structural properties such as compactness and fast lookup, and adaptation properties such as fast computation and minimal migration. We introduce a partitioning function structure that is compact and develop several associated heuristic construction techniques that exhibit good balance and low migration cost under skewed workloads. We provide experimental results that compare our partitioning functions to more traditional approaches such as uniform and consistent hashing, under different workload and application characteristics, and show superior performance.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Cost-Effective Data Partition for Distributed Stream Processing System

HKS: Efficient Data Partitioning for Stateful Streaming

Adaptive key partitioning in distributed stream processing

Article Open access 12 January 2024

Notes

Consistent hash only migrates items from the existing nodes to the newly added node. No migrations happen between existing nodes.
The lower bound does not hold during system initialization, as there is not enough history to use.
Fig. 2
Using three lossy counters over tumbling windows to emulate a sliding window
Full size image

References

Abadi, D., Ahmad, Y., Balazinska, M., Çetintemel, U., Cherniack, M., Hwang, J.H., Lindner, W., Maskey, A., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., Zdonik, S.: The design of the Borealis stream processing engine. In: Proceedings of the Innovative Data Systems Research Conference (CIDR), pp. 277–289 (2005)
Arasu, A., Manku, G.S.: Approximate counts and quantiles over sliding windows. In: Proceedings of the Symposium on Principles of Database Systems (ACM PODS) (2004)
Arasu, A., Babcock, B., Babu, S., Datar, M., Ito, K., Motwani, R., Nishizawa, I., Srivastava, U., Thomas, D., Varma, R., Widom, J.: STREAM: the stanford stream data manager. IEEE Data Eng. Bull. 26(1), 665 (2003)
Google Scholar
Balkesen, C., Tatbul, N.: Scalable data partitioning techniques for parallel sliding window processing over data streams. In: International Workshop on Data Management for Sensor Networks (DMSN) (2011)
Cormode, G., Garofalakis, M., Haas, P., Jermaine, C.: Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Now Publishing, Foundations and Trends in Databases Series (2011)
Deshpande, A., Ives, Z.G., Raman, V.: Adaptive query processing. Found. Trends Databases 1(1) (2007)
DeWitt, D., Naughton, J., Schneider, D., Seshadri, S.S.: Practical skew handling in parallel joins. In: Proceedings of the Very Large Data Bases Conference (VLDB) (1992)
Gates, A.F., Natkovich, O., Chopra, S., Kamath, P., Narayanamurthy, S.M., Olston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a high-level data flow system on top of map-reduce: The PIG experience. In: Proceedings of the Very Large Data Bases Conference (VLDB) (2009)
Gedik, B., Schneider, S., Hirzel, M., Wu, K.L.: Elastic scaling for data stream processing. IBM Research Technical Report, RC25401 (2013)
Gedik, B., Andrade, H.: A model-based framework for building extensible, high performance stream processing middleware and programming language for IBM InfoSphere streams. Softw. Pract. Exp. 42(11), 1363–1391 (2012)
Article Google Scholar
Gufler, B., Augsten, N., Reiser, A., Kemper, A.: Handling data skew in mapreduce. In: Proceedings of the International Conference of Cloud Computing and Services Science (2011)
Gufler, B., Augsten, N., Reiser, A., Kemper, A.: Load balancing in mapreduce based on scalable cardinality estimates. In: Proceedings of the International Conference on Data Engineering (IEEE ICDE) (2012)
Hirzel, M., Andrade, H., Gedik, B., Kumar, V., Losa, G., Mendell, M., Nasgaard, H., Soulé, R., Wu, K.L.: SPL language spec. Tech. Rep. RC24897, IBM (2009)
Jain, N., Amini, L., Andrade, H., King, R., Park, Y., Selo, P., Venkatramani, C.: Design, implementation, and evaluation of the linear road benchmark on the stream processing core. In: Proceedings of the International Conference on Management of Data (ACM SIGMOD) (2006)
Karger, D.R., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., Lewin, D.: Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the International Symposium on Theory of Computing (ACM STOC), pp. 654–663 (1997)
Karger, D.R., Sherman, A., Berkheimer, A., Bogstad, B., Dhanidina, R., Iwamoto, K., Kim, B., Matkins, L., Yerushalmi, Y.: Web caching with consistent hashing. Comput. Netw. 31(11–16), 1203–1213 (1999)
Article Google Scholar
Kwon, Y., Balazinska, M., Howe, B., Rolia, J.A.: SkewTune: mitigating skew in mapreduce applications. In: Proceedings of the International Conference on Management of Data (ACM SIGMOD) (2012)
Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of the International Conference on Very Large Databases (VLDB) (2002)
MurMurHash3. http://code.google.com/p/smhasher/wiki/MurmurHash3 (2013). Retrieved May 2013
Paton, N.W., Chavez, J.B., Chen, M., Raman, V., Swart, G., Narang, I., Yellin, D.M., Fernandes, A.A.A.: Autonomic query parallelization using non-dedicated computers: An evaluation of adaptivity options. In: Proceedings of the Very Large Data Bases Conference (VLDB) (2009)
Poosala, V., Ioannidis, Y.E.: Estimation of query-result distribution and its application in parallel-join load balancing. In: Proceedings of the Very Large Data Bases Conference (VLDB) (1996)
S4 distributed stream computing platform. http://www.s4.io/ (2012). Retrieved May 2012
Schneider, S., Andrade, H., Gedik, B., Biem, A., Wu, K.L.: Elastic scaling of data parallel operators in stream processing. In: Proceedings of the International Parallel and Distributed Processing Symposium (IEEE IPDPS) (2009)
Schneider, S., Hirzel, M., Gedik, B., Wu, K.L.: Auto-parallelizing stateful distributed streaming application. In: Proceedigns of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 53–64 (2012)
Shah, M.A., Hellerstein, J.M., Chandrasekaran, S., Franklin, M.J.: Flux: An adaptive partitioning operator for continuous query systems. In: Proceedings of the International Conference on Data Engineering (IEEE ICDE) (2003)
Shatdal, A., Naughton, J.: Adaptive parallel aggregation algorithms. In: Proceedings of the International Conference on Management of Data (ACM SIGMOD) (1995)
Storm project. http://storm-project.net/ (2012). Retrieved May 2012
StreamBase Systems. http://www.streambase.com (2012). Retr- ieved May 2012
Walton, C., Dale, A., Jenevein, R.: A taxonomy and performance model of data skew effects in parallel joins. In: Proceedings of the Very Large Data Bases Conference (VLDB) (1991)
Xu, Y., Kostamaa, P.: Efficient outer join data skew handling in parallel dbms. In: Proceedings of the Very Large Data Bases Conference (VLDB) (2009)

Download references

Acknowledgments

We thank IBM Thomas J. Watson Research Center for providing access to compute and software resources that made this research possible.

Author information

Authors and Affiliations

Computer Science Department, Bilkent University, Bilkent, 06800 , Ankara, Turkey
Buğra Gedik

Authors

Buğra Gedik
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Buğra Gedik.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Gedik, B. Partitioning functions for stateful data parallelism in stream processing. The VLDB Journal 23, 517–539 (2014). https://doi.org/10.1007/s00778-013-0335-9

Download citation

Received: 01 January 2013
Revised: 01 June 2013
Accepted: 13 August 2013
Published: 31 August 2013
Issue Date: August 2014
DOI: https://doi.org/10.1007/s00778-013-0335-9

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Partitioning functions for stateful data parallelism in stream processing

Abstract

Access this article

Similar content being viewed by others

Cost-Effective Data Partition for Distributed Stream Processing System

HKS: Efficient Data Partitioning for Stateful Streaming

Adaptive key partitioning in distributed stream processing

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Partitioning functions for stateful data parallelism in stream processing

Abstract

Access this article

Similar content being viewed by others

Cost-Effective Data Partition for Distributed Stream Processing System

HKS: Efficient Data Partitioning for Stateful Streaming

Adaptive key partitioning in distributed stream processing

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation