Abstract
We study implementation schemes for GroupBy, an operation widely used in distributed systems and databases. GroupBy partitions a set of out-of-order records into groups. Because the data are massive, many I/O-efficient grouping schemes that exploit external memory have been proposed. In this paper, we observe that the group sizes of many real datasets exhibit a power-law property, and that the performance of a grouping scheme varies considerably with group size. The indexing–filling approach performs better on data with large groups, while the partitioned hash approach performs better on data with small groups. Based on this observation, we propose a hybrid approach, PowerHash, which invokes different grouping schemes for different data. Group sizes are estimated approximately with a count-min sketch so that large groups and small groups can be distinguished from each other. Under a given memory budget, our results show that PowerHash can improve performance by up to six times over existing GroupBy implementations.
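To make the size-estimation step concrete, the following is a minimal sketch, not the authors' implementation. It builds a count-min sketch over the grouping keys in a first pass, then routes each record to a "large group" or "small group" bucket in a second pass. The names `CountMinSketch` and `split_by_estimated_size`, the `threshold`, `width`, and `depth` parameters, and the use of in-memory dictionaries in place of the indexing–filling and partitioned hash schemes are all assumptions made for illustration; the abstract does not state how PowerHash chooses its threshold or sketch dimensions.

```python
import hashlib
from collections import defaultdict


class CountMinSketch:
    """Approximate frequency counter. Estimates can overcount but never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One hash function per row, obtained by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            str(key).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        # The minimum over rows is the estimate least inflated by collisions.
        return min(
            self.table[row][self._index(key, row)] for row in range(self.depth)
        )


def split_by_estimated_size(records, key_of, threshold, width=1024, depth=4):
    """Pass 1 sketches group sizes; pass 2 routes records by estimated size.

    Hypothetical helper: the two dictionaries stand in for the two grouping
    schemes a hybrid approach would invoke.
    """
    records = list(records)
    sketch = CountMinSketch(width, depth)
    for rec in records:
        sketch.add(key_of(rec))

    big, small = defaultdict(list), defaultdict(list)
    for rec in records:
        key = key_of(rec)
        target = big if sketch.estimate(key) >= threshold else small
        target[key].append(rec)
    return dict(big), dict(small)


# Usage: one heavy key and two light keys, split at an assumed threshold of 10.
records = [("a", i) for i in range(100)] + [("b", 0), ("c", 1)]
big, small = split_by_estimated_size(records, key_of=lambda r: r[0], threshold=10)
```

Because a count-min sketch only overestimates, a truly large group is never misrouted to the small-group path under this rule; the possible error is a small group being treated as large when hash collisions inflate its count, which a wider sketch makes less likely.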
Acknowledgements
This work was partially supported by the National Key R&D Program of China (2018YFB1003404), the National Natural Science Foundation of China (61672141) and the Fundamental Research Funds for the Central Universities (N181605017).
Cite this article
Wei, X., Kong, X., Zhang, Y. et al. PowerHash: a hybrid grouping scheme by leveraging power-law properties of data. Int J Data Sci Anal 9, 273–284 (2020). https://doi.org/10.1007/s41060-019-00192-2
DOI: https://doi.org/10.1007/s41060-019-00192-2