Handling Data Skew for Aggregation in Spark SQL Using Task Stealing

He, Zeyu; Huang, Qiuli; Li, Zhifang; Weng, Chuliang

doi:10.1007/s10766-020-00657-z

Published: 25 March 2020

Volume 48, pages 941–956, (2020)

International Journal of Parallel Programming

Zeyu He (ORCID: orcid.org/0000-0001-8017-2344)¹, Qiuli Huang¹, Zhifang Li¹ & Chuliang Weng¹

Abstract

In distributed in-memory computing systems, data distribution has a large impact on performance. Designing a good partitioning algorithm is difficult and requires users to have adequate prior knowledge of the data, which makes data skew common in practice. Traditional approaches that handle data skew by sampling and repartitioning often incur additional overhead. In this paper, we propose a dynamic execution optimization for the aggregation operator, one of the most general and expensive operators in Spark SQL. The optimization aims to avoid that additional overhead and improve performance when data skew occurs. The core idea is task stealing. Based on the relative size of data partitions, we add two types of tasks: segment tasks for larger partitions and stealing tasks for smaller partitions. Within a stage, stealing tasks can actively steal and process data from segment tasks after processing their own. The optimization achieves performance improvements ranging from 16% to 67% across different data sizes and distributions. Experiments show that the overhead involved is minimal and can be considered negligible.
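
The task-stealing idea in the abstract can be illustrated with a small single-process sketch. This is not the authors' Spark SQL implementation: it uses Python threads in place of Spark tasks, a per-key sum as the aggregation, and two assumed parameters that do not come from the paper (a partition counts as "larger" when it exceeds the mean partition size, and `segment_size` rows form one stealable segment). All class and function names are hypothetical.

```python
import threading
from collections import Counter, deque


class SegmentQueue:
    """Segments of one larger partition, shared by its owner and by stealers."""

    def __init__(self, rows, segment_size):
        self._segments = deque(
            rows[i:i + segment_size] for i in range(0, len(rows), segment_size)
        )
        self._lock = threading.Lock()

    def take(self, from_back=False):
        # The owning segment task takes from the front, stealers from the back.
        with self._lock:
            if not self._segments:
                return None
            return self._segments.pop() if from_back else self._segments.popleft()


def aggregate(rows, acc):
    """Partial aggregation: add each row's value to the running sum of its key."""
    for key, value in rows:
        acc[key] += value


def segment_task(queue, partials):
    """Process the segments of a larger partition until none are left."""
    acc = Counter()
    while (segment := queue.take()) is not None:
        aggregate(segment, acc)
    partials.append(acc)


def stealing_task(own_rows, queues, partials):
    """Process a smaller partition, then steal remaining segments."""
    acc = Counter()
    aggregate(own_rows, acc)
    for queue in queues:
        while (segment := queue.take(from_back=True)) is not None:
            aggregate(segment, acc)
    partials.append(acc)


def run_stage(partitions, segment_size=4):
    """Run one aggregation stage over a non-empty list of partitions."""
    # Assumed rule: partitions above the mean size become segment tasks.
    mean_size = sum(len(p) for p in partitions) / len(partitions)
    large = [p for p in partitions if len(p) > mean_size]
    small = [p for p in partitions if len(p) <= mean_size]

    queues = [SegmentQueue(p, segment_size) for p in large]
    partials = []  # list.append is atomic in CPython
    threads = [
        threading.Thread(target=segment_task, args=(q, partials)) for q in queues
    ]
    threads += [
        threading.Thread(target=stealing_task, args=(p, queues, partials))
        for p in small
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Merge the partial results, as a final aggregation step would.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)
```

Calling `run_stage` on a list of partitions, each a list of `(key, value)` rows, returns the per-key sums. Which thread processes which segment varies between runs, but every segment is handed out exactly once under the lock, so the merged result matches a plain sequential aggregation.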

Acknowledgements

This research was supported by the National Key Research & Development Program of China (No. 2018YFB1003400).

Author information

Authors and Affiliations

School of Data Science and Engineering, East China Normal University, Shanghai, China
Zeyu He, Qiuli Huang, Zhifang Li & Chuliang Weng

Corresponding author

Correspondence to Zeyu He.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

He, Z., Huang, Q., Li, Z. et al. Handling Data Skew for Aggregation in Spark SQL Using Task Stealing. Int J Parallel Prog 48, 941–956 (2020). https://doi.org/10.1007/s10766-020-00657-z

Received: 06 August 2019
Accepted: 11 November 2019
Published: 25 March 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s10766-020-00657-z
