Algorithmica, Volume 12, Issue 2–3, pp 182–208

Coding techniques for handling failures in large disk arrays

  • L. Hellerstein
  • G. A. Gibson
  • R. M. Karp
  • R. H. Katz
  • D. A. Patterson

Abstract

A crucial issue in the design of very large disk arrays is the protection of data against catastrophic disk failures. Although today single disks are highly reliable, when a disk array consists of 100 or 1000 disks, the probability that at least one disk will fail within a day or a week is high. In this paper we address the problem of designing erasure-correcting binary linear codes that protect against the loss of data caused by disk failures in large disk arrays. We describe how such codes can be used to encode data in disk arrays, and give a simple method for data reconstruction. We discuss important reliability and performance constraints of these codes, and show how these constraints relate to properties of the parity check matrices of the codes. In so doing, we transform code design problems into combinatorial problems. Using this combinatorial framework, we present codes and prove they are optimal with respect to various reliability and performance constraints.
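To make the idea of parity-based encoding and reconstruction concrete, here is a minimal sketch. It is not one of the paper's codes. It assumes a small two-dimensional parity layout (four data disks in a 2×2 grid, with one check disk per row and per column), and the names `xor_blocks`, `encode`, `reconstruct`, and `groups` are hypothetical, chosen only for illustration. Each list in `groups` plays the role of one row of a parity check matrix: the data disks it names, together with the corresponding check disk, XOR to zero. Reconstruction repeatedly looks for a parity equation with exactly one failed disk and recovers that disk from the survivors.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of a non-empty list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data, groups):
    """Return one check block per parity group (XOR of that group's data blocks)."""
    return [xor_blocks([data[i] for i in group]) for group in groups]


def reconstruct(disks, groups, n_data):
    """Recover failed disks (marked None) in place; return True if all were recovered.

    `disks` holds the n_data data blocks followed by one check block per group.
    Parity equation j covers the data disks in groups[j] plus check disk n_data + j.
    """
    equations = [list(group) + [n_data + j] for j, group in enumerate(groups)]
    progress = True
    while progress:
        progress = False
        for equation in equations:
            missing = [d for d in equation if disks[d] is None]
            if len(missing) == 1:
                # The single unknown equals the XOR of all other disks in the equation.
                survivors = [disks[d] for d in equation if d != missing[0]]
                disks[missing[0]] = xor_blocks(survivors)
                progress = True
    return all(block is not None for block in disks)


# Assumed layout: data disks 0..3 in a 2x2 grid; two row groups, two column groups.
groups = [[0, 1], [2, 3], [0, 2], [1, 3]]
data = [b"ab", b"cd", b"ef", b"gh"]
disks = data + encode(data, groups)

# Simulate two simultaneous failures in the same row.
disks[0] = None
disks[1] = None
recovered = reconstruct(disks, groups, n_data=len(data))
```

In this run the row equation cannot help at first, because it has two unknowns, but each column equation has only one, so both disks are recovered from the column checks. This simple one-unknown-at-a-time procedure can stall on some failure patterns that a full linear solve over GF(2) would still recover, so it should be read as an illustration rather than a complete decoder.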

Key words

Input/output architecture, Redundant disk arrays, RAID, Error-correcting codes, Reliability, Availability



Copyright information

© Springer-Verlag New York Inc. 1994

Authors and Affiliations

  • L. Hellerstein (1)
  • G. A. Gibson (2)
  • R. M. Karp (3)
  • R. H. Katz (3)
  • D. A. Patterson (3)

  1. Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, USA
  2. School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
  3. Computer Science Division, Electrical Engineering and Computer Sciences, University of California, Berkeley, USA
