On the Processing of Extreme Scale Datasets in the Geosciences

  • Sangmi Lee Pallickara
  • Matthew Malensek
  • Shrideep Pallickara


Observational measurements and model output data acquired or generated by the various research areas within the Geosciences (also known as Earth Science) span spatial scales of tens of thousands of kilometers and temporal scales ranging from seconds to millions of years. Here, geosciences refers to the study of the atmosphere, hydrosphere, oceans, and biosphere, as well as the Earth's core. Rapid advances in sensor deployments, computational capacity, and data storage density have resulted in dramatic increases in the volume and complexity of geoscience data. Geoscientists now regard data-intensive computing as part of their knowledge discovery process, alongside the traditional theoretical, experimental, and computational paradigms [1]. Data-intensive computing poses unique challenges to the geoscience community, challenges that are exacerbated by the sheer size of the datasets involved.


Keywords (added by machine, not by the authors): Geospatial Data, Hadoop Distributed File System, Open Geospatial Consortium, Global Telecommunication System, Parallel File System


References

  1. T. Hey, et al., The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, Washington: Microsoft Corporation, 2009.
  2. F. M. Hoffman, et al., "Multivariate Spatio-Temporal Clustering (MSTC) as a data mining tool for environmental applications," in the iEMSs Fourth Biennial Meeting: International Congress on Environmental Modelling and Software (iEMSs 2008), 2008, pp. 1774–1781.
  3. F. M. Hoffman, et al., "Data Mining in Earth System Science," in the International Conference on Computational Science (ICCS), 2011, pp. 1450–1455.
  4. O. J. Reichman, et al., "Challenges and opportunities of open data in ecology," Science, pp. 703–705, 2011.
  5. M. Keller, et al., "A continental strategy for the National Ecological Observatory Network," Front. Ecol. Environ., Special Issue on Continental-Scale Ecology, vol. 5, pp. 282–284, 2008.
  6. D. Schimel, et al., "NEON: A hierarchically designed national ecological network," Front. Ecol. Environ., vol. 2, 2007.
  7. The Open Geospatial Consortium (OGC). (June 17, 2011). Available: http://www.opengeospatial.org/
  8. G. Percivall and C. Reed, "OGC Sensor Web Enablement Standards," Sensors and Transducers Journal, vol. 71, pp. 698–706, 2006.
  9. MTPE EOS Reference Handbook, EOS Project Science Office, Code 900, NASA Goddard Space Flight Center, 1995.
  10. The Global Telecommunication System. Available: http://www.wmo.int/pages/prog/www/TEM/GTS/index_en.html
  11. National Centers for Environmental Prediction (NCEP). Available: http://www.ncep.noaa.gov/
  12. Panasas: Parallel File System for HPC Storage. Available: http://www.panasas.com/
  13. M. Kuhn, et al., "Dynamic file system semantics to enable metadata optimizations in PVFS," Concurrency and Computation: Practice and Experience, vol. 21, 2009.
  14. P. J. Braam, "Lustre: A scalable, high-performance file system," 2002.
  15. F. B. Schmuck and R. L. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters," in the Conference on File and Storage Technologies, 2002, pp. 231–244.
  16. J. Lofstead, et al., "Managing Variability in the IO Performance of Petascale Storage Systems," presented at the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010.
  17. Message Passing Interface Forum, "MPI-2: Extensions to the Message-Passing Interface," 1997.
  18. S. Ghemawat, et al., "The Google File System," ACM SIGOPS Operating Systems Review, vol. 37, 2003.
  19.
  20.
  21. J. Li, et al., "Parallel netCDF: A high-performance scientific I/O interface," in ACM Supercomputing (SC03), 2003.
  22. H. Abbasi, et al., "DataStager: Scalable data staging services for petascale applications," in the ACM International Symposium on High Performance Distributed Computing, 2009.
  23. C. Upson, et al., "The Application Visualization System: A computational environment for scientific visualization," IEEE Computer Graphics and Applications, pp. 30–42, 1989.
  24. VisIt Visualization Tool. Available: https://wci.llnl.gov/codes/visit/home.html
  25. R. Daley, Atmospheric Data Analysis (Cambridge Atmospheric and Space Science Series), 1993.
  26. O. Wildi, Data Analysis in Vegetation Ecology. Wiley, 2010.
  27. P. Rigaux, et al., Spatial Databases with Application to GIS. Morgan Kaufmann, 2002.
  28. S. Shekhar and S. Chawla, Spatial Databases: A Tour. Prentice Hall, 2002.
  29. P. Longley, et al., Geographic Information Systems and Science, 3rd ed. John Wiley & Sons, 2011.
  30. R. Rew and G. Davis, "NetCDF: An interface for scientific data access," IEEE Computer Graphics and Applications, vol. 10, pp. 76–82, 1990.
  31.
  32. P. Cudre-Mauroux, et al., "A Demonstration of SciDB: A Science-Oriented DBMS," in Proceedings of the VLDB Endowment, 2009.
  33. J. Buck, et al., "SciHadoop: Array-based Query Processing in Hadoop," UCSC, 2011.
  34. The HDF Group. (2010). Hierarchical Data Format Version 5. Available: http://www.hdfgroup.org/HDF5
  35. FITS Support Office. (2011). Available: http://fits.gsfc.nasa.gov/
  36. D. C. Wells, et al., "FITS: A Flexible Image Transport System," Astronomy & Astrophysics, vol. 44, pp. 363–370, 1981.
  37. P. Cornillon, et al., "OPeNDAP: Accessing data in a distributed, heterogeneous environment," Data Science Journal, vol. 2, pp. 164–174, 2003.
  38. D. M. Karl, et al., "Building the long-term picture: U.S. JGOFS Time-series Programs," Oceanography, pp. 6–17, 2001.
  39. P. Ramsey, "PostGIS Manual," Refractions Research.
  40. A. Guttman, "R-trees: A dynamic index structure for spatial searching," in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts: ACM, 1984, pp. 47–57.
  41. S. Tilak, et al., "The Ring Buffer Network Bus (RBNB) DataTurbine Streaming Data Middleware for Environmental Observing Systems," in IEEE e-Science, 2007, pp. 125–133.
  42. D. N. Williams, et al., "The Earth System Grid: Enabling Access to Multi-Model Climate Simulation Data," Bulletin of the American Meteorological Society, vol. 90, pp. 195–205, 2009.
  43. B. Domenico, et al., "Thematic Real-time Environmental Distributed Data Services (THREDDS): Incorporating Interactive Analysis Tools into NSDL," Journal of Interactivity in Digital Libraries, vol. 2, 2002.
  44. A. Shoshani, et al., "Storage Resource Managers (SRM) in the Earth System Grid," Earth System Grid, 2009.
  45. G. Khanna, et al., "A Dynamic Scheduling Approach for Coordinated Wide-Area Data Transfers using GridFTP," in the 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), 2008.
  46. Globus Online | Reliable File Transfer. No IT Required. Available: https://www.globusonline.org/
  47. P. G. Brown, "Overview of SciDB: Large scale array storage, processing and analysis," in Proceedings of the 2010 International Conference on Management of Data, Indianapolis, Indiana, USA: ACM, 2010, pp. 963–968.
  48. M. Stonebraker, et al., "Requirements for Science Data Bases and SciDB," 2009.
  49. J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107–113, 2008.
  50. A. Akdogan, et al., "Voronoi-Based Geospatial Query Processing with MapReduce," in the 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), 2010, pp. 9–16.
  51. Y. Wang and S. Wang, "Research and implementation on spatial data storage and operation based on Hadoop platform," in the 2010 Second IITA International Conference on Geoscience and Remote Sensing (IITA-GRS), vol. 2, 2010, pp. 275–278.
  52. Apache Hadoop. Available: http://hadoop.apache.org/
  53. Hadoop Distributed File System. Available: http://hadoop.apache.org/hdfs/
  54. J. Wang, et al., "Kepler + Hadoop: A general architecture facilitating data-intensive applications in scientific workflow systems," in Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, Portland, Oregon: ACM, 2009, pp. 12:1–12:8.

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Sangmi Lee Pallickara (1)
  • Matthew Malensek (1)
  • Shrideep Pallickara (1)
  1. Department of Computer Science, Colorado State University, Fort Collins, USA