Abstract
Transferring large volumes of information from one location to potentially many others, geographically distributed and connected by varying networks, remains prevalent in modern scientific data systems. This is despite the movement to push computation to the data and to reduce the data movement needed to compute answers to challenging scientific problems, to disseminate information to the scientific community, and to acquire data for curation and enrichment. It is therefore imperative that decisions about data movement systems and architectures be backed both by analytical rigor and by empirical evidence and measurement. The purpose of this study is to expand on the work performed by our research team over the last decade and to take a fresh look at multiple topical data transfer technologies in use cases derived from data-intensive scientific systems and applications in Earth science. We report on the evaluation of a set of data movement technologies against a set of empirically derived comparison dimensions. Based on this evaluation, we make recommendations for selecting appropriate data movement technologies in scientific applications and scenarios.
Notes
Data movement and data transfer are used interchangeably throughout the paper.
Acknowledgments
Support was provided by the NASA Earth Sciences Division, NASA NCA (ID: 11-NCA11-0028), NASA’s Advanced Information Systems Technology (AIST) program (ID: AIST-QRS-12-0002), and the NASA Computational Modeling and Cyberinfrastructure (CMAC) program (11-CMAC11-0011). In addition, funding is provided by the National Science Foundation ExArch program (ID: 1125798), a component of the G8 initiative. Valuable contributions to the RCMES activity by way of collaboration come from the World Climate Research Program (WCRP) Coordinated Regional Climate Downscaling Experiment (CORDEX), the North American Regional Climate Change Assessment Program (NARCCAP), the Climate & Development Knowledge Network (CDKN) and the University of Cape Town, and PCMDI through support of the obs4MIPs activity.
Additional information
Communicated by: H. A. Babaie
About this article
Cite this article
Mattmann, C.A., Cinquini, L., Zimdars, P. et al. A topical evaluation and discussion of data movement technologies for data-intensive scientific applications. Earth Sci Inform 9, 247–262 (2016). https://doi.org/10.1007/s12145-015-0243-1
DOI: https://doi.org/10.1007/s12145-015-0243-1