Abstract
Data loading has traditionally been considered a “one-time deal”: an offline process outside the critical path of query execution. The architecture of DBMS is aligned with this assumption. Nevertheless, the rate at which data is produced and gathered nowadays has nullified the “one-off” assumption and has turned data loading into a major bottleneck of the data analysis pipeline.
This paper analyzes the behavior of modern DBMS in order to quantify their ability to fully exploit multicore processors and modern storage hardware during data loading. We examine multiple state-of-the-art DBMS, a variety of hardware configurations, and a combination of synthetic and real-world datasets to identify bottlenecks in the data loading process and to provide guidelines on how to accelerate data loading. Our findings show that modern DBMS are unable to saturate the available hardware resources. We therefore identify opportunities to accelerate data loading.
A. Dziedzic—Work done while the author was at EPFL.
Notes
- 1. We reported this behavior to the MonetDB developers, and it is fixed in the current release.
- 2.
Acknowledgments
This work is partially funded by the EU FP7 Programme (ERC-2013-CoG) under grant agreement number 617508 (ViDa), and the EU FP7 Programme (FP7 Collaborative project) under grant agreement number 317858 (BigFoot).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Dziedzic, A., Karpathiotakis, M., Alagiannis, I., Appuswamy, R., Ailamaki, A. (2017). DBMS Data Loading: An Analysis on Modern Hardware. In: Blanas, S., Bordawekar, R., Lahiri, T., Levandoski, J., Pavlo, A. (eds) Data Management on New Hardware. ADMS 2016, IMDM 2016. Lecture Notes in Computer Science, vol 10195. Springer, Cham. https://doi.org/10.1007/978-3-319-56111-0_6
DOI: https://doi.org/10.1007/978-3-319-56111-0_6
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56110-3
Online ISBN: 978-3-319-56111-0
eBook Packages: Computer Science, Computer Science (R0)