Abstract
This paper discusses fault tolerance in the Local Area Multiprocessor (LAMP) storage subsystem and presents its architecture, its error detection and recovery algorithms, and its logical volume reconstruction procedure. LAMP is a network of workstations with shared physical memory; its basic communication protocol is load and store. The LAMP storage subsystem is developed for a class of distributed computing systems that 1) have distributed shared memory, 2) use a low-latency, high-bandwidth interconnect, and 3) provide remote DMA support. The subsystem stripes data across multiple nodes for higher I/O performance and availability. It organizes logical volumes (virtual disks) that store files according to file size, data access pattern, and other criteria such as performance, availability, and security requirements. The subsystem implements RAID technology, supporting RAID-0, RAID-1, and RAID-5 for each logical volume. Write-ahead logging records the data, metadata, and parity updates of a recovery unit, which allows the subsystem to perform fast error recovery. A logical volume reconstruction algorithm is implemented to rebuild a failed logical volume rapidly. The paper discusses three main fault tolerance issues of the LAMP storage subsystem: system configurability for fault tolerance and performance, fast error detection and recovery, and fast logical volume reconstruction.
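As a generic illustration of the RAID-5 style redundancy mentioned in the abstract, the sketch below shows how a block lost from one node can be rebuilt from the surviving blocks of a stripe using XOR parity. It is not the LAMP reconstruction algorithm, which the abstract does not describe; the function names `xor_blocks`, `make_stripe`, and `reconstruct` are hypothetical, and a stripe is assumed to be a simple list of equal-length blocks with the parity block last.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block (assumed layout)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, lost_index):
    """Rebuild the block at lost_index by XOR-ing all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Three data blocks, imagined as living on three different nodes.
stripe = make_stripe([b"node", b"data", b"lamp"])
rebuilt = reconstruct(stripe, 1)  # pretend the node holding block 1 failed
```

Because XOR is its own inverse, the same routine rebuilds either a data block or the parity block; a single stripe tolerates the loss of only one block.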
This work is sponsored in part by a grant from the National Science Foundation (CCR-941006).
© 1998 Springer Science+Business Media New York
Cite this chapter
Li, Q., Hong, E., Tsukerman, A. (1998). Fault-Tolerance Issues of Local Area Multiprocessor (LAMP) Storage Subsystem. In: Fault-Tolerant Parallel and Distributed Systems. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-5449-3_8
DOI: https://doi.org/10.1007/978-1-4615-5449-3_8
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-7488-6
Online ISBN: 978-1-4615-5449-3
eBook Packages: Springer Book Archive