Enhancing Parallel Data Loading for Large Scale Scientific Database

Li, Hui; Li, Hongyuan; Chen, Mei; Dai, Zhenyu; Zhu, Ming; Huang, Menglin

doi:10.1007/978-3-319-27122-4_11


Conference paper
First Online: 16 December 2015

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9529)

Abstract

The rapid growth in data size means that large-scale scientific databases often incur a long delay between loading data into the system and being ready to receive query requests. To solve this problem, we propose an efficient parallel data loading approach named FASTLoad. It is designed to maximize utilization of the given resources (e.g., network bandwidth, main memory) to optimize data loading in large-scale, array-model-based scientific database systems. To verify the efficiency of FASTLoad, we implemented it in our Adaptable Data Loading System and evaluated its performance over large scientific data sets of various sizes. Our experimental results show that FASTLoad can be 4 to 6 times faster than the built-in loading techniques of state-of-the-art array-model-based scientific database systems.

Acknowledgments

This work was supported by the China Ministry of Science and Technology under the State Key Development Program for Basic Research (2012CB821800), Fund of National Natural Science Foundation of China (No. 61462012, 61562010, U1531246), Scientific Research Fund for talents recruiting of Guizhou University (No. 700246003301), Science and Technology Fund of Guizhou Province (No. J [2013]2099), High Tech. Project Fund of Guizhou Development and Reform Commission (No. [2013]2069), Industrial Research Projects of the Science and Technology Plan of Guizhou Province (No. GY[2014]3018) and The Major Applied Basic Research Program of Guizhou Province (No. JZ20142001, No. JZ20142001-05).

Author information

Authors and Affiliations

Department of Computer Science, Guizhou University, Guiyang, 550025, China
Hui Li, Hongyuan Li, Mei Chen & Zhenyu Dai
Guizhou Engineering Laboratory of ACMIS, Guiyang, 550025, China
Hui Li, Hongyuan Li, Mei Chen & Zhenyu Dai
National Astronomical Observatories, Chinese Academy of Sciences, Beijing, 100016, China
Ming Zhu & Menglin Huang

Corresponding author

Correspondence to Hui Li.

Editor information

Editors and Affiliations

Central South University, Changsha, China
Guojun Wang
The University of Sydney, Sydney, New South Wales, Australia
Albert Zomaya
University of Murcia, Murcia, Murcia, Spain
Gregorio Martinez
Hunan University, Changsha, China
Kenli Li

About this paper

Cite this paper

Li, H., Li, H., Chen, M., Dai, Z., Zhu, M., Huang, M. (2015). Enhancing Parallel Data Loading for Large Scale Scientific Database. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9529. Springer, Cham. https://doi.org/10.1007/978-3-319-27122-4_11

DOI: https://doi.org/10.1007/978-3-319-27122-4_11
Published: 16 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27121-7
Online ISBN: 978-3-319-27122-4
eBook Packages: Computer Science, Computer Science (R0)
