
ROARS: a robust object archival system for data intensive scientific computing



As scientific research becomes more data-intensive, there is an increasing need for scalable, reliable, high-performance storage systems. Such data repositories must provide both data archival services and rich metadata, and must integrate cleanly with large-scale computing resources. ROARS is a hybrid approach to distributed storage that provides large, robust, scalable storage together with efficient rich-metadata queries for scientific applications. In this paper, we present the design and implementation of ROARS, focusing primarily on the challenge of maintaining data integrity across long time scales. We evaluate the performance of ROARS on a storage cluster, comparing it to the Hadoop distributed file system and a centralized file server. We observe that ROARS has read and write performance that scales with the number of storage nodes, and integrity checking that scales with the size of the largest node. We demonstrate the ability of ROARS to function correctly through multiple system failures and reconfigurations. ROARS has been in production use for over three years as the primary data repository for a biometrics research lab at the University of Notre Dame.
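The abstract's claim that integrity checking scales with the size of the largest node, rather than with total data volume, follows from auditing each storage node's replicas independently and in parallel. The sketch below is a generic illustration of that idea, not ROARS's actual implementation: each simulated "node" (here, a temporary directory with a hypothetical per-node checksum manifest) is audited concurrently, so wall-clock time is bounded by the slowest (largest) node.

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def file_checksum(path, algo="sha1"):
    """Stream a file through a hash so large objects never load fully into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def audit_node(node_dir, manifest):
    """Recompute checksums for one node's files; return names that no longer match."""
    bad = []
    for name, expected in manifest.items():
        if file_checksum(os.path.join(node_dir, name)) != expected:
            bad.append(name)
    return bad

def audit_cluster(nodes):
    """Audit all nodes concurrently: elapsed time tracks the largest node,
    not the sum of all nodes."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = {n: pool.submit(audit_node, d, m) for n, (d, m) in nodes.items()}
        return {n: f.result() for n, f in futures.items()}

# Demo: two "nodes" as temp dirs; corrupt one replica after recording checksums.
nodes = {}
for node in ("node0", "node1"):
    d = tempfile.mkdtemp(prefix=node)
    manifest = {}
    for i in range(3):
        p = os.path.join(d, f"obj{i}")
        with open(p, "wb") as f:
            f.write(os.urandom(1024))
        manifest[f"obj{i}"] = file_checksum(p)
    nodes[node] = (d, manifest)

with open(os.path.join(nodes["node1"][0], "obj2"), "wb") as f:
    f.write(b"bit rot")  # silent corruption on node1

report = audit_cluster(nodes)
print(report)  # node0 is clean; node1 flags obj2
```

Keeping the manifest per node is what makes the audit embarrassingly parallel: no node needs to consult another node's data to verify its own replicas.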






Acknowledgements

This work was supported by National Science Foundation grants CCF-06-21434, CNS-06-43229, and CNS-01-30839. This work was also supported by the Federal Bureau of Investigation, the Central Intelligence Agency, the Intelligence Advanced Research Projects Activity, the Biometrics Task Force, and the Technical Support Working Group through US Army contract W91CRB-08-C-0093.

Author information

Correspondence to Hoang Bui.

Additional information

Communicated by Judy Qiu and Dennis Gannon.



Cite this article

Bui, H., Bui, P., Flynn, P. et al. ROARS: a robust object archival system for data intensive scientific computing. Distrib Parallel Databases 30, 325–350 (2012). https://doi.org/10.1007/s10619-012-7103-5



Keywords

  • Distributed storage
  • Distributed system
  • Archive system