Distributed and Parallel Databases, Volume 30, Issue 5–6, pp 325–350

ROARS: a robust object archival system for data intensive scientific computing

  • Hoang Bui
  • Peter Bui
  • Patrick Flynn
  • Douglas Thain

Abstract

As scientific research becomes more data intensive, there is an increasing need for scalable, reliable, and high-performance storage systems. Such data repositories must provide both data archival services and rich metadata, and cleanly integrate with large-scale computing resources. ROARS is a hybrid approach to distributed storage that provides both large, robust, scalable storage and efficient rich metadata queries for scientific applications. In this paper, we present the design and implementation of ROARS, focusing primarily on the challenge of maintaining data integrity across long time scales. We evaluate the performance of ROARS on a storage cluster, comparing it to the Hadoop distributed file system and a centralized file server. We observe that ROARS has read and write performance that scales with the number of storage nodes, and integrity checking that scales with the size of the largest node. We demonstrate the ability of ROARS to function correctly through multiple system failures and reconfigurations. ROARS has been in production use for over three years as the primary data repository for a biometrics research lab at the University of Notre Dame.

Keywords

Distributed storage · Distributed system · Archive system


Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Hoang Bui (1)
  • Peter Bui (1)
  • Patrick Flynn (1)
  • Douglas Thain (1)

  1. University of Notre Dame, Notre Dame, USA
