Abstract
With growing and pervasive interest in Big Data, SQL relational databases need to compete with data management by Hadoop, NoSQL and NoDB. Database research has mainly focused on result generation by query processing, but SQL databases require data to be in place before queries can be processed. DB loading has been a bottleneck, leading to external ETL/ELT techniques for loading large data sets. This paper focuses on DB-engine-level techniques for optimizing both data loads and extracts in an MPP, shared-nothing SQL database, dbX, available on in-house commodity hardware and cloud systems. The agile data loading of dbX exploits parallelism at multiple levels to achieve TBs of data load per hour, making it suitable for cloud and continuous actionable knowledge applications. Implementation techniques at the DB engine level, extensions to load/extract syntax, and performance results are presented. The load optimization techniques also help to speed up data extracts to flat files and CTAS-type SQL queries. We show linear scale-up with cluster scale-out for load/extract in public cloud and commodity hardware systems, without recourse to database tuning or the use of expensive database appliances.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Sridhar, K.T., Sakkeer, M.A. (2014). Optimizing Database Load and Extract for Big Data Era. In: Bhowmick, S.S., Dyreson, C.E., Jensen, C.S., Lee, M.L., Muliantara, A., Thalheim, B. (eds) Database Systems for Advanced Applications. DASFAA 2014. Lecture Notes in Computer Science, vol 8422. Springer, Cham. https://doi.org/10.1007/978-3-319-05813-9_34
DOI: https://doi.org/10.1007/978-3-319-05813-9_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05812-2
Online ISBN: 978-3-319-05813-9
eBook Packages: Computer Science, Computer Science (R0)