Fault tolerant file models for parallel file systems: introducing distribution patterns for every file

The Journal of Supercomputing

Abstract

Parallel file systems obtain parallelism by using several independent server nodes, each supporting one or more secondary storage devices. This approach increases the performance and scalability of the system, but a fault in a single node can stop the whole system. To avoid this problem, data must be stored redundantly, so that any data held on a faulty element can be recovered. Fault tolerance can be provided in I/O systems by using replication or RAID-based schemes. However, most current systems apply the same technique to all files in the system.
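
As a rough illustration of the RAID-style redundancy mentioned above, and not a description of how Expand itself is implemented, the following Python sketch stripes a buffer over a few hypothetical server nodes, stores one XOR parity block, and rebuilds the block held by a failed node. The function names (`xor_blocks`, `stripe_with_parity`, `rebuild_block`) and the zero-padded block layout are assumptions made only for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data, n_data_nodes):
    """Split data into n_data_nodes equal blocks (zero padded) plus one parity block."""
    block_size = -(-len(data) // n_data_nodes)  # ceiling division
    padded = data.ljust(block_size * n_data_nodes, b"\0")
    blocks = [padded[i * block_size:(i + 1) * block_size]
              for i in range(n_data_nodes)]
    return blocks, xor_blocks(blocks)


def rebuild_block(blocks, parity, failed_index):
    """Recover the block of one failed node from the surviving blocks and the parity."""
    survivors = [b for i, b in enumerate(blocks) if i != failed_index]
    return xor_blocks(survivors + [parity])


if __name__ == "__main__":
    blocks, parity = stripe_with_parity(b"parallel file system", 4)
    # Pretend node 2 is lost and rebuild its block from the other nodes.
    assert rebuild_block(blocks, parity, 2) == blocks[2]
```

A single parity block like this tolerates the loss of only one node. With k data blocks it adds 1/k extra storage, whereas a full replica adds 100%, which is the kind of trade-off that motivates choosing a scheme per file.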

This paper describes the fault tolerance support provided by Expand, a parallel file system based on standard servers. This support can be applied to other parallel file systems and offers several benefits: fault tolerance at the file level, flexible definition of the fault tolerance scheme to be used, the possibility of changing the fault tolerance support used for a file, etc.


Author information

Corresponding author

Correspondence to A. Calderón.

About this article

Cite this article

Calderón, A., García-Carballeira, F., Sánchez, L.M. et al. Fault tolerant file models for parallel file systems: introducing distribution patterns for every file. J Supercomput 47, 312–334 (2009). https://doi.org/10.1007/s11227-008-0199-8

  • DOI: https://doi.org/10.1007/s11227-008-0199-8
