Efficient Batch Grouping in Relational Datasets

  • Jizhou Sun
  • Jianzhong Li
  • Hong Gao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10177)

Abstract

Data grouping is an expensive and frequently used operator in data processing. Because the data is often too large to fit in memory, disk-sorting-based methods are commonly employed. Disk sorting reads and writes the entire dataset many times, which is very time-consuming, so reducing I/O costs is of great significance. In many applications, a set of records must be grouped multiple times on different keys. This paper studies grouping in a batch manner, together with techniques for sharing intermediate results, to improve efficiency. In the batch grouping setting, different grouping orders may result in different I/O costs. To minimize I/O costs, we formalize group-order scheduling as an optimization problem, which can be proven to be NP-complete, and then propose a heuristic algorithm. Experimental results on TPC-H as well as synthetic datasets show the efficiency and robustness of our techniques.

Keywords

Batch grouping · I/O efficiency · Sharing · Scheduling

Notes

Acknowledgments

This work was supported in part by the Key Research and Development Plan of the National Ministry of Science and Technology under Grant No. 2016YFB1000703, and by the Key Program of the National Natural Science Foundation of China under Grant Nos. 61190115, 61632010, and U1509216.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
