PowerHash: a hybrid grouping scheme by leveraging power-law properties of data

Wei, Xun; Kong, Xiaowang; Zhang, Yanfeng; Yu, Ge

doi:10.1007/s41060-019-00192-2

Regular Paper
Published: 25 June 2019

Volume 9, pages 273–284 (2020)

International Journal of Data Science and Analytics

Xun Wei¹, Xiaowang Kong¹, Yanfeng Zhang¹ & Ge Yu¹

Abstract

We study implementation schemes for GroupBy, an operation widely used in distributed systems and databases. The GroupBy operation partitions a set of out-of-order records into groups. Because the data are often massive, many I/O-efficient grouping schemes that exploit external memory have been proposed. In this paper, we observe that the group sizes of many real datasets follow a power-law distribution and that the performance of a grouping scheme varies considerably with group size. The indexing–filling approach performs better on data with large groups, while the partitioned hash approach performs better on data with small groups. Based on this observation, we propose a hybrid approach, PowerHash, which invokes different grouping schemes for different data. Group sizes are estimated approximately with a count-min sketch so that big groups and small groups can be distinguished from each other. Under a given memory budget, our results show that PowerHash improves performance by up to six times over existing GroupBy implementations.
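
To make the role of the count-min sketch concrete, the following is a minimal illustrative sketch of how estimated group sizes could separate big groups from small ones. It is not the authors' implementation: the class `CountMinSketch`, the function `split_by_group_size`, the `threshold` parameter, and the default `width` and `depth` values are all assumptions introduced here for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates may overcount but never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One hash function per row, obtained by salting the key with the row id.
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, key):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += 1

    def estimate(self, key):
        # The smallest counter across rows is the least-inflated estimate.
        return min(
            self.table[row][self._index(key, row)] for row in range(self.depth)
        )


def split_by_group_size(keys, threshold, width=1024, depth=4):
    """Split distinct keys into (big, small) sets by estimated group size.

    `threshold` is a hypothetical cutoff; how the paper chooses it is not
    stated in the abstract.
    """
    sketch = CountMinSketch(width, depth)
    for key in keys:
        sketch.add(key)
    big, small = set(), set()
    for key in set(keys):
        (big if sketch.estimate(key) >= threshold else small).add(key)
    return big, small


# Skewed toy input: two heavy keys and many keys that appear once.
keys = ["a"] * 100 + ["b"] * 50 + [f"k{i}" for i in range(200)]
big, small = split_by_group_size(keys, threshold=40)
```

In a hybrid scheme along the lines the abstract describes, records whose keys land in `big` would be handed to the approach that favors large groups and the rest to the approach that favors small groups. Because a count-min sketch can only overestimate, a small group may occasionally be classified as big, but a big group is never classified as small.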

Acknowledgements

This work was partially supported by the National Key R&D Program of China (2018YFB1003404), the National Natural Science Foundation of China (61672141), and the Fundamental Research Funds for the Central Universities (N181605017).

Author information

Authors and Affiliations

Northeastern University, Shenyang, China
Xun Wei, Xiaowang Kong, Yanfeng Zhang & Ge Yu

Corresponding author

Correspondence to Yanfeng Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wei, X., Kong, X., Zhang, Y. et al. PowerHash: a hybrid grouping scheme by leveraging power-law properties of data. Int J Data Sci Anal 9, 273–284 (2020). https://doi.org/10.1007/s41060-019-00192-2

Received: 20 January 2019
Accepted: 12 June 2019
Published: 25 June 2019
Issue Date: April 2020
DOI: https://doi.org/10.1007/s41060-019-00192-2
