
ROARS: a robust object archival system for data intensive scientific computing


Abstract

As scientific research becomes more data-intensive, there is an increasing need for scalable, reliable, and high-performance storage systems. Such data repositories must provide both data archival services and rich metadata, and cleanly integrate with large-scale computing resources. ROARS is a hybrid approach to distributed storage that provides both large, robust, scalable storage and efficient rich metadata queries for scientific applications. In this paper, we present the design and implementation of ROARS, focusing primarily on the challenge of maintaining data integrity across long time scales. We evaluate the performance of ROARS on a storage cluster, comparing it with the Hadoop Distributed File System and a centralized file server. We observe that ROARS has read and write performance that scales with the number of storage nodes, and integrity checking that scales with the size of the largest node. We demonstrate the ability of ROARS to function correctly through multiple system failures and reconfigurations. ROARS has been in production use for over three years as the primary data repository for a biometrics research lab at the University of Notre Dame.
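The full paper details the design; purely to illustrate the hybrid pattern the abstract describes (file data replicated across storage nodes, with rich metadata and checksums kept in a database that later drives integrity audits), the following is a minimal Python sketch. The directory layout, the SQLite schema, and the function names are hypothetical stand-ins, not the actual ROARS interface.

    import hashlib
    import os
    import shutil
    import sqlite3

    # Hypothetical layout: local directories stand in for remote storage
    # nodes, and SQLite stands in for the repository's metadata database.
    BASE = "/tmp/roars_demo"
    NODES = [os.path.join(BASE, f"node{i}") for i in range(3)]
    REPLICAS = 2

    os.makedirs(BASE, exist_ok=True)
    db = sqlite3.connect(os.path.join(BASE, "metadata.db"))
    db.execute("""CREATE TABLE IF NOT EXISTS objects (
                      id INTEGER PRIMARY KEY,
                      name TEXT,
                      subject TEXT,   -- an example rich-metadata field
                      checksum TEXT,  -- recorded at ingest, reused by audits
                      replicas TEXT   -- comma-separated replica paths
                  )""")

    def ingest(path, subject):
        """Archive a file: checksum it, copy it to several nodes, record metadata."""
        with open(path, "rb") as f:
            checksum = hashlib.sha1(f.read()).hexdigest()
        replicas = []
        for node in NODES[:REPLICAS]:
            os.makedirs(node, exist_ok=True)
            dest = os.path.join(node, checksum)
            shutil.copyfile(path, dest)
            replicas.append(dest)
        db.execute("INSERT INTO objects (name, subject, checksum, replicas) "
                   "VALUES (?, ?, ?, ?)",
                   (os.path.basename(path), subject, checksum, ",".join(replicas)))
        db.commit()

    def audit():
        """Re-hash every replica; report paths that are missing or corrupted."""
        bad = []
        for checksum, replicas in db.execute("SELECT checksum, replicas FROM objects"):
            for dest in replicas.split(","):
                try:
                    with open(dest, "rb") as f:
                        ok = hashlib.sha1(f.read()).hexdigest() == checksum
                except FileNotFoundError:
                    ok = False
                if not ok:
                    bad.append(dest)
        return bad

Under this sketch, a metadata query such as SELECT replicas FROM objects WHERE subject = 'iris' resolves directly to replica paths, suggesting how rich metadata queries and bulk file storage can coexist in one repository.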





Acknowledgements

This work was supported by National Science Foundation grants CCF-06-21434, CNS-06-43229, and CNS-01-30839. It was also supported by the Federal Bureau of Investigation, the Central Intelligence Agency, the Intelligence Advanced Research Projects Activity, the Biometrics Task Force, and the Technical Support Working Group through US Army contract W91CRB-08-C-0093.

Author information


Corresponding author

Correspondence to Hoang Bui.

Additional information

Communicated by Judy Qiu and Dennis Gannon.


About this article

Cite this article

Bui, H., Bui, P., Flynn, P. et al. ROARS: a robust object archival system for data intensive scientific computing. Distrib Parallel Databases 30, 325–350 (2012). https://doi.org/10.1007/s10619-012-7103-5
