Scientific Data Management in the Cloud: A Survey of Technologies, Approaches and Challenges

  • Sangmi Lee Pallickara
  • Shrideep Pallickara
  • Marlon Pierce


Experimental sciences create vast amounts of data. In astronomy, the Pan-STARRS project (Pan-STARRS project, 2010; Jedicke, Magnier, Kaiser, & Chambers, 2006) is expected to produce more than a petabyte of images every year. In high-energy physics, the Large Hadron Collider will generate 50–100 petabytes of data each year, with about 20 PB of that data being stored and processed on a worldwide federation of national grids linking 100,000 CPUs (Large Hadron Collider project, 2010; Lamanna, 2004).

Cloud computing is immensely appealing to the scientific community, which increasingly sees it as part of the solution to coping with burgeoning data volumes. Cloud computing enables economies of scale in facility design and hardware construction, and it allows groups of users to host, process, and analyze large volumes of data from diverse sources. Several vendors offer cloud computing platforms, including Amazon Web Services (2010), Google’s App Engine (2010), AT&T’s Synaptic Hosting (2010), Rackspace (2010), GoGrid (2010), and AppNexus (2010). These vendors promise seemingly infinite amounts of computing power and storage that can be provisioned on demand, in a pay-only-for-what-you-use pricing model.
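As a concrete illustration of this pay-as-you-go storage model, the following minimal sketch uses the AWS SDK for Python (boto3) to place a data file into Amazon S3 object storage and retrieve it again. The bucket and object names are hypothetical placeholders, and boto3 is simply one convenient client library, not something prescribed by the services surveyed here.

    # Minimal sketch: storing and retrieving a scientific data file in Amazon S3.
    # Assumes boto3 is installed and AWS credentials are configured locally.
    # The bucket and key names below are hypothetical.
    import boto3

    s3 = boto3.client("s3")

    # Upload a locally produced data file to cloud object storage.
    s3.upload_file("observations.fits", "example-science-bucket",
                   "pan-starrs/observations.fits")

    # Any collaborator with access to the bucket can later pull the same object down.
    s3.download_file("example-science-bucket",
                     "pan-starrs/observations.fits",
                     "observations_copy.fits")

Under this model a group is billed for the storage and transfer it actually uses rather than for pre-provisioned capacity.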


Keywords: Large Hadron Collider, Cloud Computing, File System, Data Cloud, Hadoop Distributed File System


  1. Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Engineering Bulletin, 32(1), 4–12.
  2. Agrawal, R., Kiernan, J., Srikant, R., & Xu, Y. (2004). Order preserving encryption for numeric data. Proceedings of SIGMOD, 563–574.
  3. Amazon Elastic MapReduce (2010). Accessed on February 20, 2010.
  4. Amazon EBS (2010). Accessed on February 20, 2010.
  5. Amazon Public Datasets (2010). Accessed on February 20, 2010.
  6. Amazon RDS (2010). Accessed on February 20, 2010.
  7. Amazon SimpleDB (2010), Accessed on February 20, 2010.
  8. Amazon Web Services (2010). Accessed on February 20, 2010.
  9. Antonioletti, M., Krause, A., Paton, N. W., Eisenberg, A., Laws, S., Malaika, S., et al. (2006). The WS-DAI family of specifications for web service data access and integration. ACM SIGMOD Record, 35(1), 48–55.
  10. AppNexus (2010). Accessed on February 20, 2010.
  11. Baru, C. K., Fecteau, G., Goyal, A., Hsiao, H., Jhingran, A., Padmanabhan, S., et al. (1995). DB2 Parallel Edition. IBM Systems Journal, 34(2), 292–322.
  12. Budavari, T., Malik, T., Szalay, A. S., Thakar, A., & Gray, J. (2003). SkyQuery – A prototype distributed query web service for the virtual observatory. In H. Payne, R. I. Jedrzejewski, & R. N. Hook (Eds.), Proceedings of ADASS XII, Astronomical Society of the Pacific, ASP Conference Series (Vol. 295, p. 31).
  13. Cary, A., Sun, Z., Hristidis, V., & Rishe, N. (2009). Experiences on processing spatial data with MapReduce. Proceedings of the 21st SSDBM Conference. Lecture Notes in Computer Science, Vol. 5566, 302–319.
  14. Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., et al. (November 2006). Bigtable: A distributed storage system for structured data. OSDI ’06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, 205–218.
  15. Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., & Tuecke, S. (2001). The data grid: Towards an architecture for the distributed management and analysis of large scientific datasets. Journal of Network and Computer Applications, 23, 187–200.
  16. Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
  17. Dean, J., & Ghemawat, S. (December 2004). MapReduce: Simplified data processing on large clusters. Proceedings of OSDI, San Francisco, CA, 137–150.
  18. Gannon, D., & Reed, D. (2009). Parallelism and the cloud. In T. Hey, S. Tansley, & K. Tolle (Eds.), The fourth paradigm: Data-intensive scientific discovery (pp. 131–136). Microsoft Research. ISBN-10: 0982544200.
  19. FITS (2010). Accessed on February 20, 2010.
  20. Gardner, J. (2007). Enabling knowledge discovery in a virtual universe. Proceedings of TeraGrid ’07: Broadening Participation in the TeraGrid, ACM Press.
  21. Gardner, J. P., Connolly, A., & McBride, C. (2007). Enabling rapid development of parallel tree search applications. Proceedings of the 2007 Symposium on Challenges of Large Applications in Distributed Environments (CLADE 2007), ACM Press, 1–10.
  22. Ge, T., & Zdonik, S. (2007). Answering aggregation queries in a secure system model. Proceedings of VLDB, 519–530.
  23. GenBank (2010). Accessed on February 20, 2010.
  24. Ghemawat, S., Gobioff, H., & Leung, S.-T. (October 2003). The Google file system. Proceedings of the 19th ACM Symposium on Operating Systems Principles, Lake George, NY, 29–43.
  25. GoGrid (2010). Accessed on February 20, 2010.
  26. Google App Engine (2010). Accessed on February 20, 2010.
  27. Gray, J. (2009). Jim Gray on eScience: A transformed scientific method. In T. Hey, S. Tansley, & K. Tolle (Eds.), The fourth paradigm: Data-intensive scientific discovery (pp. xvii–xxxi). Microsoft Research. ISBN-10: 0982544200.
  28. Gray, J., Liu, D. T., Nieto-Santisteban, M. A., Szalay, A. S., Heber, G., & DeWitt, D. (December 2005). Scientific data management in the coming decade. SIGMOD Record, 34(4), 34–41.
  29. Hacigumus, H., Iyer, B., Li, C., & Mehrotra, S. (2002). Executing SQL over encrypted data in the database-service-provider model. Proceedings of SIGMOD, 216–227.
  30. HDF (2010). Accessed on February 20, 2010.
  31. Isard, M., & Yu, Y. (July 2009). Distributed data-parallel computing using a high-level programming language. Proceedings of the International Conference on Management of Data (SIGMOD), 987–994.
  32. Isard, M., Budiu, M., Yu, Y., Birrell, A., & Fetterly, D. (March 2007). Dryad: Distributed data-parallel programs from sequential building blocks. Proceedings of the European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21–23, 59–72.
  33. IDL (2010). Interactive Data Language. Accessed on February 20, 2010.
  34. Jaeger-Frank, E., Crosby, C. J., Memon, A., Nandigam, V., Conner, J., Arrowsmith, J. R., et al. (December 2006). A domain independent three tier architecture applied to Lidar processing and monitoring. Special Issue of the Scientific Programming Journal devoted to WORKS06 and WSES06, 185–194.
  35. Jedicke, R., Magnier, E. A., Kaiser, N., & Chambers, K. C. (2006). The next decade of solar system discovery with Pan-STARRS. Proceedings of IAU Symposium 236, 341–352.
  36. Kantarcioglu, M., & Clifton, C. (2004). Security issues in querying encrypted data. 19th Annual IFIP WG 11.3 Working Conference on Data and Applications Security, 325–337.
  37. Lakshman, A., Malik, P., & Ranganathan, K. (2008). Cassandra: Structured storage system over a P2P network. Keynote Presentation, SIGMOD, Calgary, Canada, 5–5.
  38. Lamanna, M. (November 2004). Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment. Proceedings of the 9th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (Vol. 534, No. 1–2, pp. 1–6).
  39. Large Hadron Collider project (2010). Accessed on February 20, 2010.
  40. Li, Y., Perlman, E., Wan, M., Yang, Y., Meneveau, C., Burns, R., et al. (2008). A public turbulence database and applications to study Lagrangian evolution of velocity increments in turbulence. Journal of Turbulence, 9(31).
  41. Loebman, S., Nunley, D., Kwon, Y. C., Howe, B., Balazinska, M., & Gardner, J. P. (2009). Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? Proceedings of the Workshop on Interfaces and Architecture for Scientific Data Storage (IASDS), 1–10.
  42. LSST Science Collaborations and LSST Project (2009). LSST Science Book, Version 2.0. arXiv:0912.0201.
  43. MacCormick, J., Murphy, N., Najork, M., Thekkath, C. A., & Zhou, L. (December 2004). Boxwood: Abstractions as the foundation for storage infrastructure. Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI 2004), San Francisco, CA, USA, 105–120.
  44. Microsoft, SQL Azure (2010). Accessed on February 20, 2010.
  45. Microsoft, Windows Azure (2010). Accessed on February 20, 2010.
  46. Moore, R. W., Jagatheesan, A., Rajasekar, A., et al. (April 2004). Data grid management systems. Proceedings of the 21st IEEE/NASA Conference on Mass Storage Systems and Technologies (MSST), College Park, Maryland, USA, April 13–16, 2004.
  47. Mykletun, E., & Tsudik, G. (2006). Aggregation queries in the database-as-a-service model. IFIP WG 11.3 on Data and Application Security, 89–103.
  48. NCBI (2010). Accessed on February 20, 2010.
  49. NetCDF (2010). Accessed on July 16, 2010.
  50. Olston, C., Reed, B., Srivastava, U., Kumar, R., & Tomkins, A. (June 2008). Pig latin: A not-so-foreign language for data processing. ACM SIGMOD 2008 International Conference on Management of Data, Vancouver, Canada, 1099–1110.
  51. OpenMPI (2010). Accessed on February 20, 2010.
  52. OpenPBS (2010). Accessed on February 20, 2010.
  53. Oracle Database 11g (2010). Accessed on February 20, 2010.
  54. Oracle Real Application Cluster (2010). Accessed on February 20, 2010.
  55. Ozone (2010). Accessed on February 20, 2010.
  56. Palankar, M. R., Iamnitchi, A., Ripeanu, M., & Garfinkel, S. (2008). Amazon S3 for science grids: A viable solution? DADC ’08: Proceedings of the 2008 International Workshop on Data-Aware Distributed Computing, 55–64.
  57. Pan-STARRS project (2010). Accessed on February 20, 2010.
  58. Peng, J., & Law, K. H. Reference NEESgrid data model (Tech. Rep. NEESgrid-2004-40).
  59. Pike, R., Dorward, S., Griesemer, R., & Quinlan, S. (2005). Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, 13(4), 277–298.
  60. Plale, B., Gannon, D., Alameda, J., Wilhelmson, B., Hampton, S., Rossi, A., et al. (2005). Active management of scientific data. IEEE Internet Computing Special Issue on Internet Access to Scientific Data, 9(1), 27–34.
  61. PubChem (2010). Accessed on February 20, 2010.
  62. PubMed (2010). Accessed on February 20, 2010.
  63. Rackspace (2010). Accessed on February 20, 2010.
  64. Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (August 2001). A scalable content-addressable network. Proceedings of SIGCOMM, 161–172.
  65. Rowstron, A., & Druschel, P. (November 2001). Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. Proceedings of Middleware 2001, 329–350.
  66. San Diego Supercomputer Center (2010). Accessed on February 20, 2010.
  67. SciDB (2010). Accessed on February 20, 2010.
  68. Simmhan, Y., Barga, R., van Ingen, C., Nieto-Santisteban, M., Dobos, L., Li, N., et al. (2009). GrayWulf: Scalable software architecture for data intensive computing. Proceedings of the 42nd Hawaii International Conference on System Science, 1–10.
  69. Singh, G., Bharathi, S., Chervenak, A., Deelman, E., Kesselman, C., Manohar, M., et al. (2003). A metadata catalog service for data intensive applications. Proceedings of the ACM/IEEE Conference on Supercomputing (SC ’03), 33–50.
  70. Stadel, J. G. (2001). Cosmological N-body simulations and their analysis (Doctoral dissertation, University of Washington).
  71. Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (August 2001). Chord: A scalable peer-to-peer lookup service for internet applications. Proceedings of SIGCOMM, 149–160.
  72. Stonebraker, M. (1986). The case for shared nothing architecture. Database Engineering, 9(1), 4–9.
  73. Szalay, A., Bell, G., Vandenberg, J., Wonders, A., Burns, R., Fay, D., et al. (2009). GrayWulf: Scalable clustered architecture for data intensive computing. Proceedings of the 42nd Hawaii International Conference on System Science, 1–10.
  74. TeraGrid (2010). Accessed on February 20, 2010.
  75. Thain, D., Tannenbaum, T., & Livny, M. (February–April 2005). Distributed computing in practice: The Condor experience. Concurrency and Computation: Practice and Experience, 17(2–4), 323–356.
  76. TIPSY (2010). Accessed on February 20, 2010.
  77. The Academic Cluster Computing Initiative (ACCI) (2007). Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges. Google Official Press Center.
  78. The Globus Toolkit (2010). Data replication service. Accessed on February 20, 2010.
  79. Unidata (2010). Accessed on February 20, 2010.
  80. Yu, Y., Gunda, P. K., & Isard, M. (October 2009). Distributed aggregation for data-parallel computing: Interfaces and implementations. Proceedings of the Symposium on Operating Systems Principles (SOSP).
  81. Zhao, B. Y., Kubiatowicz, J., & Joseph, A. D. (April 2001). Tapestry: An infrastructure for fault-tolerant wide-area location and routing (Tech. Rep. UCB/CSD-01-1141, CS Division, UC Berkeley).
  82. Zverina, J. (2010). San Diego Supercomputer Center begins cloud computing research using the Google-IBM CluE cluster. Accessed on February 20, 2010.
  83. ZODB (2010). Accessed on February 20, 2010.
  84. Zookeeper (2010). Accessed on February 20, 2010.

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Sangmi Lee Pallickara (1)
  • Shrideep Pallickara (1)
  • Marlon Pierce (2)
  1. Department of Computer Science, Colorado State University, Fort Collins, USA
  2. Community Grids Lab, Indiana University, Bloomington, USA
