Abstract
Distributed hash tables (DHTs) have been adopted as a building block for large-scale distributed systems. As mission-critical applications begin to be layered on them, their robust operation becomes even more important. Although DHTs can detect and heal around unresponsive hosts and disconnected links, several hidden faults and performance bottlenecks go undetected, resulting in unanswered queries and delayed responses. In this paper, we propose dFault, a system that helps large-scale DHTs localize such faults. Given a log of failed queries, called symptoms, and some available information about the hosts in the DHT, dFault identifies the potential root causes (hosts and overlay links) that, with high likelihood, contributed to those symptoms. Its design is based on the recently proposed dependency graph modeling and inference approach for fault localization. We describe the design of dFault and show, using a real prototype deployed over PlanetLab, that it can accurately localize the root causes of faults with a modest amount of information collected from individual nodes.
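The abstract does not spell out dFault's inference procedure, so the following Python sketch only illustrates the general idea of dependency-graph fault localization and should not be read as the authors' algorithm. It assumes that each query is logged as the list of components (hosts and overlay links) it traversed, that any component seen on a successful query is healthy, and that a greedy set-cover heuristic is an acceptable way to pick root causes. The function name `localize` and the component labels are hypothetical.

```python
from collections import defaultdict


def localize(failed_queries, successful_queries):
    """Return a small set of components that explains the failed queries.

    Each query is a list of component names (hosts such as "A", overlay
    links such as "A->B"). This is an illustrative greedy heuristic, not
    the inference procedure used by dFault.
    """
    # Assumption: a component that appears on any successful query is healthy.
    healthy = set()
    for path in successful_queries:
        healthy.update(path)

    # Dependency graph: suspect component -> failed queries (symptoms) it touches.
    explains = defaultdict(set)
    for qid, path in enumerate(failed_queries):
        for comp in path:
            if comp not in healthy:
                explains[comp].add(qid)

    # Only symptoms that touch at least one suspect can be explained.
    unexplained = set()
    for qids in explains.values():
        unexplained |= qids

    hypothesis = []
    while unexplained:
        # Pick the suspect that explains the most remaining symptoms;
        # sorting makes tie-breaking deterministic.
        best = max(sorted(explains), key=lambda c: len(explains[c] & unexplained))
        hypothesis.append(best)
        unexplained -= explains[best]
    return hypothesis


if __name__ == "__main__":
    failed = [["A", "A->B", "B"], ["C", "C->B", "B"], ["D", "D->E", "E"]]
    ok = [["A", "A->C", "C"], ["D", "D->A", "A"]]
    print(localize(failed, ok))
```

In this toy log, host B lies on two failed queries and on no successful one, so it is chosen first. The third failure could be blamed equally on link D->E or host E, and the sketch breaks the tie alphabetically. A real system would need additional per-host information to tell such candidates apart.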
© 2010 IFIP International Federation for Information Processing
About this paper
Cite this paper
Prakash, P., Kompella, R.R., Ramasubramanian, V., Chandra, R. (2010). dFault: Fault Localization in Large-Scale Peer-to-Peer Systems. In: Gupta, I., Mascolo, C. (eds) Middleware 2010. Middleware 2010. Lecture Notes in Computer Science, vol 6452. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16955-7_13
DOI: https://doi.org/10.1007/978-3-642-16955-7_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16954-0
Online ISBN: 978-3-642-16955-7
eBook Packages: Computer Science (R0)