DBMS Data Loading: An Analysis on Modern Hardware

  • Adam Dziedzic
  • Manos Karpathiotakis
  • Ioannis Alagiannis
  • Raja Appuswamy
  • Anastasia Ailamaki
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10195)

Abstract

Data loading has traditionally been considered a “one-time deal”: an offline process outside the critical path of query execution. The architecture of DBMS is aligned with this assumption. Nevertheless, the rate at which data is produced and gathered nowadays has nullified the “one-off” assumption and has turned data loading into a major bottleneck of the data analysis pipeline.

This paper analyzes the behavior of modern DBMS in order to quantify their ability to fully exploit multicore processors and modern storage hardware during data loading. We examine multiple state-of-the-art DBMS, a variety of hardware configurations, and a combination of synthetic and real-world datasets to identify bottlenecks in the data loading process and to provide guidelines on how to accelerate data loading. Our findings show that modern DBMS are unable to saturate the available hardware resources. We therefore identify opportunities to accelerate data loading.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Adam Dziedzic (1)
  • Manos Karpathiotakis (2)
  • Ioannis Alagiannis (2)
  • Raja Appuswamy (2)
  • Anastasia Ailamaki (2, 3)

  1. University of Chicago, Chicago, USA
  2. Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  3. RAW Labs SA, Lausanne, Switzerland
