Enhancing Parallel Data Loading for Large Scale Scientific Database

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9529)

Abstract

Rapidly growing data sizes mean that large-scale scientific databases often suffer a long delay between loading data into the system and being ready to receive queries. To address this problem, we propose an efficient parallel data loading approach named FASTLoad. It is designed to maximize utilization of the given resources (e.g., network bandwidth, main memory) in order to optimize data loading in large-scale, array-model-based scientific database systems. To verify the efficiency of FASTLoad, we implemented it in our Adaptable Data Loading System and evaluated its performance on large scientific data sets of various sizes. Our experimental results show that FASTLoad can be 4 to 6 times faster than the built-in loading techniques of a state-of-the-art array-model-based scientific database system.

Acknowledgments

This work was supported by the China Ministry of Science and Technology under the State Key Development Program for Basic Research (2012CB821800), Fund of National Natural Science Foundation of China (No. 61462012, 61562010, U1531246), Scientific Research Fund for talents recruiting of Guizhou University (No. 700246003301), Science and Technology Fund of Guizhou Province (No. J [2013]2099), High Tech. Project Fund of Guizhou Development and Reform Commission (No. [2013]2069), Industrial Research Projects of the Science and Technology Plan of Guizhou Province (No. GY[2014]3018) and The Major Applied Basic Research Program of Guizhou Province (No. JZ20142001, No. JZ20142001-05).

Author information

Corresponding author

Correspondence to Hui Li.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Li, H., Li, H., Chen, M., Dai, Z., Zhu, M., Huang, M. (2015). Enhancing Parallel Data Loading for Large Scale Scientific Database. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9529. Springer, Cham. https://doi.org/10.1007/978-3-319-27122-4_11

  • DOI: https://doi.org/10.1007/978-3-319-27122-4_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27121-7

  • Online ISBN: 978-3-319-27122-4

  • eBook Packages: Computer Science, Computer Science (R0)
