When Software Defined Networks Meet Fault Tolerance: A Survey

  • Jue Chen
  • Jinbang Chen
  • Fei Xu
  • Min Yin
  • Wei Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9530)


Software Defined Network (SDN) is emerging as a novel network architecture that decouples the control plane from the data plane. However, SDN on its own cannot survive failures, particularly in large-scale data-center networks. Because SDN is programmable, mechanisms can be designed to achieve fault tolerance. In this survey, we broadly discuss the fault tolerance issue and systematically review the methods proposed so far for SDN. Our presentation starts from the significant components that OpenFlow and SDN bring, which are useful for failure recovery, and then expands to fault tolerance in the data plane and the control plane, both of which require two phases: detection and recovery. In particular, as the central part of this paper, we highlight the comparison between the two main methods for failure recovery: restoration and protection. Finally, future research issues are discussed.


Keywords: Software defined network, Fault tolerance, OpenFlow, Failure detection, Recovery, Restoration, Protection



Corresponding authors: Jinbang Chen and Fei Xu. They are with Shanghai Key Laboratory of Multidimensional Information Processing & Department of Computer Science and Technology, East China Normal University, China. This work was supported by the Science and Technology Commission of Shanghai Municipality under research grant no. 14DZ2260800, and China Postdoctoral Science Foundation under grant no. 2014M561438.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. School of Computer Science and Software Engineering, East China Normal University, Shanghai, China
