Reliability and Survivability Analysis of Data Center Network Topologies

Journal of Network and Systems Management

Abstract

Several data center architectures have been proposed as alternatives to the conventional three-layer one. Most of them employ commodity equipment for cost reduction. Thus, robustness to failures becomes even more important, because commodity equipment is more failure-prone. Each architecture has a different network topology design with a specific level of redundancy. In this work, we analyze the benefits of different data center topologies taking reliability and survivability requirements into account. We consider the topologies of three alternative data center architectures: Fat-tree, BCube, and DCell. We also compare these topologies with a conventional three-layer data center topology. For the sake of generality, our analysis is independent of specific equipment, traffic patterns, or network protocols. We derive closed-form formulas for the Mean Time To Failure of each topology. The results allow us to indicate the best topology for each failure scenario. In particular, we conclude that BCube is more robust to link failures than the other topologies, whereas DCell has the most robust topology when considering switch failures. Additionally, we show that all considered alternative topologies outperform a three-layer topology for both types of failures. We also determine to what extent the robustness of BCube and DCell is influenced by the number of network interfaces per server.

Notes

  1. Topology generation, failure simulation, and metric evaluation are performed using the graph manipulation tool NetworkX [15].

  2. In this work, a gateway is a switch that is responsible for network access outside the DC. In practice, the gateway function is performed by a router connected to this switch.

  3. Hereafter, we split the result remarks into three items. The first one comments on the performance of switch-centric topologies (Three-layer and Fat-tree), while the second one highlights the results of server-centric topologies (i.e., BCube and DCell). The last item, when available, indicates a general remark considering the three topologies.

  4. The number of switch ports n as a function of \(|{\mathcal {S}}|\) is evaluated by solving the equation \(|{\mathcal {S}}|=n(n+1)\).

  5. For the homogeneous case, the Balanced and Unbalanced results correspond to the same scenario, since all servers are equal.
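Note 4 amounts to solving a quadratic equation in n. The following is a minimal sketch of that step, assuming the positive root is the one of interest; the function name `ports_for_servers` is hypothetical, and the note does not say how a non-integer result would be rounded, so the sketch returns the real-valued root as is.

```python
import math


def ports_for_servers(num_servers: int) -> float:
    """Return the positive root n of n * (n + 1) = num_servers.

    The result may be non-integer; rounding is left to the caller
    because the source does not specify a rounding rule.
    """
    if num_servers <= 0:
        raise ValueError("num_servers must be positive")
    # n^2 + n - |S| = 0  =>  n = (-1 + sqrt(1 + 4|S|)) / 2
    return (-1.0 + math.sqrt(1.0 + 4.0 * num_servers)) / 2.0


# Example: 20 servers correspond to n = 4, since 4 * 5 = 20.
n = ports_for_servers(20)
```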

References

  1. Jennings, B., Stadler, R.: Resource management in clouds: survey and research challenges. J. Netw. Syst. Manag. 23(3), 567–619 (2015)

  2. Al-Fares, M., Loukissas, A., Vahdat, A.: A scalable, commodity data center network architecture. In: ACM SIGCOMM, pp. 63–74 (2008)

  3. Guo, C., Lu, G., Li, D., Wu, H., Zhang, X., Shi, Y., Tian, C., Zhang, Y., Lu, S.: BCube: a high performance, server-centric network architecture for modular data centers. In: ACM SIGCOMM, pp. 63–74 (2009)

  4. Guo, C., Wu, H., Tan, K., Shi, L., Zhang, Y., Lu, S.: DCell: a scalable and fault-tolerant network structure for data centers. In: ACM SIGCOMM, pp. 75–86 (2008)

  5. Greenberg, A., Hamilton, J., Maltz, D.A., Patel, P.: The cost of a cloud: research problems in data center networks. SIGCOMM Comput. Commun. Rev. 39(1), 68–73 (2009)

  6. Popa, L., Ratnasamy, S., Iannaccone, G., Krishnamurthy, A., Stoica, I.: A cost comparison of datacenter network architectures. In: ACM CoNEXT, pp. 16:1–16:12 (2010)

  7. Li, D., Guo, C., Wu, H., Tan, K., Zhang, Y., Lu, S., Wu, J.: Scalable and cost-effective interconnection of data-center servers using dual server ports. IEEE/ACM Trans. Netw. 19(1), 102–114 (2011)

  8. Kachris, C., Tomkos, I.: A survey on optical interconnects for data centers. IEEE Commun. Surv. Tutor. 14(4), 1021–1036 (2012)

  9. Singla, A., Hong, C., Popa, L., Godfrey, P.: Jellyfish: networking data centers, randomly. In: USENIX NSDI, p. 14 (2012)

  10. Curtis, A., Carpenter, T., Elsheikh, M., López-Ortiz, A., Keshav, S.: REWIRE: an optimization-based framework for unstructured data center network design. In: IEEE INFOCOM, pp. 1116–1124 (2012)

  11. Raiciu, C., Barre, S., Pluntke, C., Greenhalgh, A., Wischik, D., Handley, M.: Improving datacenter performance and robustness with multipath TCP. In: ACM SIGCOMM, pp. 350–361 (2011)

  12. Meng, X., Pappas, V., Zhang, L.: Improving the scalability of data center networks with traffic-aware virtual machine placement. In: IEEE INFOCOM, pp. 1–9. IEEE (2010)

  13. Cisco Data Center Infrastructure 2.5 Design Guide. www.cisco.com/application/pdf/en/us/guest/netsol/ns107/c649/ccmigration_09186a008073377d.pdf (2007). Accessed Oct 2014

  14. Greenberg, A., Hamilton, J.R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D.A., Patel, P., Sengupta, S.: VL2: a scalable and flexible data center network. In: ACM SIGCOMM, pp. 51–62 (2009)

  15. Hagberg, A., Swart, P., Schult, D.: Exploring network structure, dynamics, and function using NetworkX. Tech. rep., (LANL) (2008)

  16. Egeland, G., Engelstad, P.: The availability and reliability of wireless multi-hop networks with stochastic link failures. IEEE J. Sel. Areas Commun. 27(7), 1132–1146 (2009)

  17. Rahman, M.R., Aib, I., Boutaba, R.: Survivable virtual network embedding. In: NETWORKING 2010, Lecture Notes in Computer Science, vol. 6091, pp. 40–52. Springer, Berlin (2010)

  18. Barlow, R., Proschan, F.: Statistical Theory of Reliability and Life Testing: Probability Models, 1st edn. Holt, Rinehart and Winston, New York (1975)

  19. Gertzbakh, I., Shpungin, Y.: Models of Network Reliability: Analysis, Combinatorics, and Monte Carlo, 1st edn. CRC Press, Boca Raton (2009)

  20. Gill, P., Jain, N., Nagappan, N.: Understanding network failures in data centers: measurement, analysis, and implications. In: ACM SIGCOMM, pp. 350–361 (2011)

  21. Abramowitz, M., Stegun, I.A.: Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables, 9th edn. Dover Books on Mathematics, New York (1970)

  22. Liew, S.C., Lu, K.W.: A framework for characterizing disaster-based network survivability. IEEE J. Sel. Areas Commun. 12(1), 52–58 (1994)

  23. Albert, R., Jeong, H., Barabási, A.: Error and attack tolerance of complex networks. Nature 406(6794), 378–382 (2000)

  24. Coleman, T.F., Moré, J.J.: Estimation of sparse Jacobian matrices and graph coloring problems. SIAM J. Numer. Anal. 20(1), 187–209 (1983)

  25. Neumayer, S., Modiano, E.: Network reliability with geographically correlated failures. In: IEEE INFOCOM, pp. 1–9 (2010)

  26. Touch, J., Perlman, R.: Transparent interconnection of lots of links TRILL: problem and applicability statement. RFC 5556 (2009)

  27. Allan, D., Ashwood-Smith, P., Bragg, N., Farkas, J., Fedyk, D., Ouellete, M., Seaman, M., Unbehagen, P.: Shortest path bridging: efficient control of larger ethernet networks. IEEE Commun. Mag. 48(10), 128–135 (2010)

  28. Mudigonda, J., Yalagandula, P., Al-Fares, M., Mogul, J.: SPAIN: COTS data-center ethernet for multipathing over arbitrary topologies. In: USENIX NSDI, p. 16 (2010)

  29. Bilal, K., Manzano, M., Khan, S., Calle, E., Li, K., Zomaya, A.: On the characterization of the structural robustness of data center networks. IEEE Trans. Cloud Comput. 1(1), 64–77 (2013)

  30. Zhang, Q., Zhani, M., Boutaba, R., Hellerstein, J.: Dynamic heterogeneity-aware resource provisioning in the cloud. IEEE Trans. Cloud Comput. 2(1), 14–28 (2014)

  31. Googleclusterdata—traces of Google Workloads. http://code.google.com/p/googleclusterdata/ (2011). Accessed May 2015

  32. Couto, R.S., Campista, M.E.M., Costa, L.H.M.K.: A reliability analysis of datacenter topologies. In: IEEE GLOBECOM, pp. 1890–1895 (2012)

  33. Ni, W., Huang, C., Wu, J.: Provisioning high-availability datacenter networks for full bandwidth communication. Comput. Netw. 68, 71–94 (2014)

  34. Vishwanath, K.V., Nagappan, N.: Characterizing cloud computing hardware reliability. In: ACM SOCC, pp. 193–204 (2010)

Acknowledgments

The authors would like to thank FAPERJ, CNPq, CAPES, CTIC research agencies and the Systematic FUI 15 RAVIR (http://www.ravir.io) project for their financial support to this research.

Author information

Corresponding author

Correspondence to Rodrigo de Souza Couto.

Appendices

Appendix 1: MTTF Approximation

In this appendix, we derive Eq. 6 by combining Eqs. 4 and 5. First, we replace the reliability R(t) in Eq. 4 by the reliability approximation given by Eq. 5, resulting in

$$\begin{aligned} MTTF = \int _{0}^{\infty } R(t) \, dt \approx \int _{0}^{\infty } e^{-\frac{t^rc}{{E[\tau ]}^r}} \, dt. \end{aligned}$$
(19)

Hence, we find the MTTF by evaluating the integral in the rightmost term of Eq. 19. The evaluation starts by performing the following variable substitution:

$$\begin{aligned} t=x^{\frac{1}{r}} \Leftrightarrow dt = \frac{1}{r}x^{\left( \frac{1}{r}-1\right) } \, dx. \end{aligned}$$
(20)

Note that the interval of integration in Eq. 19 does not change after the variable substitution, since \(t=0\) results in \(x=0\) and \(t \rightarrow \infty\) results in \(x \rightarrow \infty\). Hence, after the variable substitution, we can write Eq. 19 as:

$$\begin{aligned} MTTF \approx \frac{1}{r}\int _{0}^{\infty } x^{\left( \frac{1}{r} - 1\right) } e^{\frac{-xc}{{E[\tau ]}^r}} \, dx. \end{aligned}$$
(21)

The integral of Eq. 21 is evaluated using the gamma function defined as [21]:

$$\begin{aligned} \varGamma (z) = k^z \int _{0}^{\infty } x^{z-1}e^{-kx} \, dx, ({\mathfrak {R}}z > 0 , {\mathfrak {R}}k > 0). \end{aligned}$$
(22)

For clarity, we isolate the integral in Eq. 22:

$$\begin{aligned} \int _{0}^{\infty } x^{z-1}e^{-kx} \, dx = \frac{\varGamma (z)}{k^z}. \end{aligned}$$
(23)

We set \(z = \frac{1}{r}\) and \(k=\frac{c}{{E[\tau ]}^r}\) in Eq. 23 and multiply both sides by \(\frac{1}{r}\), obtaining

$$\begin{aligned} \frac{1}{r} \int _{0}^{\infty } x^{\left( \frac{1}{r}-1\right) }e^{-\frac{xc}{{E[\tau ]}^r}} \, dx = \frac{1}{r}\frac{\varGamma \left( \frac{1}{r} \right) }{{\left( \frac{c}{{E[\tau ]}^r}\right) }^{\frac{1}{r}}} = \frac{E[\tau ]}{r}\root r \of {\frac{1}{c}} \varGamma \left( \frac{1}{r} \right) . \end{aligned}$$
(24)

Note that the leftmost term in Eq. 24 is the MTTF approximation given by Eq. 21. Hence, we can write the MTTF as:

$$\begin{aligned} MTTF \approx \frac{E[\tau ]}{r}\root r \of {\frac{1}{c}} \varGamma \left( \frac{1}{r} \right) . \end{aligned}$$
(25)
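The closed form in Eq. 25 can be checked numerically against the integral in Eq. 19. The sketch below is our own illustration, not part of the original appendix: the function names, the trapezoidal rule, the truncation point of the integral, and the sample parameter values are all assumptions chosen for the check.

```python
import math


def mttf_closed_form(mean_lifetime: float, r: float, c: float) -> float:
    """Eq. 25: (E[tau] / r) * (1 / c)**(1 / r) * Gamma(1 / r)."""
    return (mean_lifetime / r) * (1.0 / c) ** (1.0 / r) * math.gamma(1.0 / r)


def mttf_numeric(mean_lifetime: float, r: float, c: float,
                 steps: int = 200_000) -> float:
    """Trapezoidal estimate of the integral on the right of Eq. 19."""

    def reliability(t: float) -> float:
        # Integrand of Eq. 19: exp(-c * t**r / E[tau]**r)
        return math.exp(-c * (t / mean_lifetime) ** r)

    # Truncate where the exponent reaches -40, i.e. the integrand is ~4e-18.
    t_max = mean_lifetime * (40.0 / c) ** (1.0 / r)
    h = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h


# Hypothetical sample values: E[tau] = 1000, c = 100, and r in {1, 2, 3}.
for r in (1, 2, 3):
    exact = mttf_closed_form(1000.0, r, 100.0)
    approx = mttf_numeric(1000.0, r, 100.0)
    print(r, exact, approx)
```

For these sample values, the two columns should agree to several significant digits, which is what the substitution in Eqs. 20 to 24 predicts.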

Appendix 2: Comparison of MTTF Equations for Link Failures

In BCube we have \(MTTF_{bcube} \approx \frac{E[\tau ]}{l+1}\root l+1 \of {\frac{1}{|{\mathcal {S}}|}} \varGamma \left( \frac{1}{l+1} \right)\). We start by showing that a new configuration with \(l'=l+1\) (i.e., one more server interface) increases the MTTF. For simplicity, we consider that \(|{\mathcal {S}}|\) is the same for the configurations using l and \(l'\). Although this is not strictly true, because the number of servers depends on the combination of l and n, we can adjust n to obtain a similar number of servers for l and \(l'\), as done in the configurations of Table 1. We thus need to show that

$$\begin{aligned} \frac{E[\tau ]}{l'+1}\root l'+1 \of {\frac{1}{|{\mathcal {S}}|}} \varGamma \left( \frac{1}{l'+1} \right) > \frac{E[\tau ]}{l+1}\root l+1 \of {\frac{1}{|{\mathcal {S}}|}} \varGamma \left( \frac{1}{l+1} \right) . \end{aligned}$$
(26)

Setting \(l'=l+1\) and rearranging the terms, we obtain the following condition for Eq. 26 to hold:

$$\begin{aligned} |{\mathcal {S}}| > {\left( \frac{l+2}{l+1} \frac{\varGamma \left( \frac{1}{l+1} \right) }{\varGamma \left( \frac{1}{l+2} \right) } \right) }^{(l+1)(l+2)}. \end{aligned}$$
(27)

The right-hand side of Eq. 27 is a decreasing function of l over the considered region (\(l \ge 1\)). Hence, it is sufficient to show that Eq. 27 holds for \(l=1\). Setting \(l=1\) in Eq. 27, we obtain \(|{\mathcal {S}}| > 0.955\), which is true for any feasible DC.
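The bound in Eq. 27 is easy to evaluate numerically. The sketch below is our own illustration: the function name `servers_threshold` and the tested range of l are assumptions, and a finite scan of this kind supports the monotonicity claim without proving it.

```python
import math


def servers_threshold(l: int) -> float:
    """Right-hand side of Eq. 27 for a given l >= 1."""
    base = ((l + 2) / (l + 1)
            * math.gamma(1.0 / (l + 1)) / math.gamma(1.0 / (l + 2)))
    return base ** ((l + 1) * (l + 2))


# Evaluate the bound for l = 1, ..., 8 (range chosen arbitrarily).
thresholds = [servers_threshold(l) for l in range(1, 9)]
for l, value in enumerate(thresholds, start=1):
    print(l, round(value, 4))
```

The first value corresponds to the \(l=1\) bound of approximately 0.955 quoted in the text, and the later values can be inspected to see the bound shrink as l grows.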

As DCell with \(l>1\) has the same MTTF as a BCube with the same l, the above reasoning is also valid for this topology. For DCell2 (\(l=1\)), the MTTF equation is the same as that of BCube2 (\(l=1\)), except that DCell2 has the factor \(\sqrt{\frac{1}{1.5|{\mathcal {S}}|}}\) instead of \(\sqrt{\frac{1}{|{\mathcal {S}}|}}\). Consequently, the MTTF of BCube2 is greater than that of DCell2. We can thus conclude that DCell2 has the lowest MTTF among server-centric topologies. Hence, to show that the MTTF of Fat-tree is smaller than the MTTF of all server-centric topologies, we compare it to DCell2. We thus need to prove that

$$\begin{aligned} \frac{E[\tau ]}{|{\mathcal {S}}|} < \frac{E[\tau ]}{2}\sqrt{\frac{1}{1.5|{\mathcal {S}}|}} \varGamma \left( \frac{1}{2} \right) . \end{aligned}$$
(28)

This inequality holds for \(|{\mathcal {S}}| > 1.909\), which is always true for a real DC.
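For reference, the threshold follows from Eq. 28 by elementary algebra, using \(\varGamma \left( \frac{1}{2} \right) = \sqrt{\pi }\). This worked step is our own expansion and is not part of the original appendix. Dividing both sides of Eq. 28 by \(E[\tau ]\) gives

$$\begin{aligned} \frac{1}{|{\mathcal {S}}|} < \frac{\sqrt{\pi }}{2\sqrt{1.5|{\mathcal {S}}|}} \Leftrightarrow \sqrt{|{\mathcal {S}}|} > \frac{2\sqrt{1.5}}{\sqrt{\pi }} \Leftrightarrow |{\mathcal {S}}| > \frac{6}{\pi } \approx 1.9099. \end{aligned}$$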

Cite this article

Couto, R.S., Secci, S., Campista, M.E.M. et al. Reliability and Survivability Analysis of Data Center Network Topologies. J Netw Syst Manage 24, 346–392 (2016). https://doi.org/10.1007/s10922-015-9354-8

  • DOI: https://doi.org/10.1007/s10922-015-9354-8
