1 Introduction

Since the 1988 landmark paper by Bendsøe and Kikuchi (1988) topology optimization has emerged as the most active research area within the field of engineering optimization. There have been thousands of research papers published on this subject in the literature. A comprehensive treatment of topology optimization can be found in a book by Bendsøe and Sigmund (2004). In-depth reviews and insights can be found in papers by Rozvany (2001, 2009), Sigmund and Maute (2013), Deaton and Grandhi (2014). In general the subject of this paper falls into the category of design involving uncertainties, known generally as reliability based design optimization (RBDO). Since mid-2000 there has been increasing research on reliability based topology optimization (RBTO) considering uncertainties in loads, material, boundary geometry or fabrication. Since this work doesn’t directly build on earlier RBTO contributions, we simply refer to Deaton and Grandhi (2014) for literature survey, and to select papers for further reading: Maute and Frangopol (2003), Kharmanda et al (2004), Jung and Cho (2004), Guest and Igusa (2008), Silva et al (2010), Chen et al (2010), Chen and Chen (2011), Lazarov et al (2012), Nguyen et al (2011), Rozvany and Maute (2011), Tootkaboni et al (2012), Wang et al (2006). However, as also pointed out by Jansen et al. (2014), due to the binary nature of failure definition it is difficult to treat fail-safe requirement within statistics based RBTO framework.

Fail-safe design philosophy is probably the single most important reason why flying is so incredibly safe today. The list of catastrophic accidents due to aircraft structural failures is rather short (Wikipedia 2016a), compared to the long list of accidents and incidents involving commercial aircraft (Wikipedia 2016b). From the latter list we can find many cases involving midair structural damages. Fortunately in the vast majority of such cases an aircraft survived such structural damages, some of which were of critical level of severity. For example, a widely publicized incident involved A380 in November 2010, only about 3 years after the world’s largest passenger aircraft entered service in 2007. Qantas Flight 32 suffered severe structural and control system damages from an uncontained engine failure shortly after taking off from Singapore Changi Airport (Wikipedia 2016c). It managed to land safely. Fail-safe requirement for aircraft design was clearly defined in two books by Niu that are widely used handbooks by aerospace structural engineers—‘Aircraft structural design’ and ‘Airframe structural analysis and sizing’ (Niu 1988, 1997). He defined fail-safe as ‘1. Fail-safe structure must support 80–100 % limit loads without catastrophic failure. 2. A single member failed in redundant structure or partial failure in monolithic structure’ (p 538, Niu 1988). Note that limit load for aircraft design corresponds to maximum service loads. It differs from ‘ultimate loads’ that carries a safety factor of 1.5 (Niu 1988). To be clear ‘a single member failure’ is meant to be tested on all discrete members, one instance at a time, for the condition of complete loss of load carrying capacity. Taking the example of a stiffened panel as a common aircraft structural component, ‘a single member’ refers to a stiffener while ‘monolithic structure’ refers to the panel itself. A partial failure is meant as complete devoid of a chunk of material of a given size. The size of failure considered is related to possible maximum damage between inspection cycles due to crack propagation and other reasons, outlined by Niu as ‘For fail-safe structure inspection techniques and frequency must be specified to minimize risk of catastrophic failures’ (p 538, Niu 1988). In this paper we prefer to use the term damage as we feel that it represents the state of the structure more accurately, especially for the case of partial ‘failure’ of a large structural part. To be clear damage throughout the paper is defined as the equivalent of devoid of material. It has nothing to do with damage models describing the process of failure of materials.

Topology optimization has seen fast growing adoption throughout all major industries since the turn of the millennium. This includes successful aerospace applications during development of the new generation airliners such as A380, 350 and B787 (see, e.g., Krog et al. 2004). However, the inability of taking fail-safe requirement into consideration in a computationally viable way has been seen as a significant limitation. In fact as optimization process pushes material utilization to maximum efficiency, design tends to be less redundant in general. For example, results of topology optimization are often benchmarked against Michell trusses (Michell 1904). While Michell trusses are highly efficient, they are all statically determinate with zero structural redundancy. In the context of topology optimization, the first challenge for fail-safe consideration lies in defining member failure test before structural members emerge from the optimization process. The second challenge lies in creating a computational scheme that is efficient enough to solve real world problems. Topology optimization considering fail-safe requirement was first introduced by Jansen et al. (2014). For 2D structures they introduced damage as a square void zone of required size, placed one instance at a time at every finite element grid, with the exception that square zones falling partially outside the structural domain are excluded. Jansen et al. introduced the concept of fail-safe as a general notion without reference to specific practice for aircraft design. As such they did not explicitly discuss modeling the requirement of a complete failure of a discrete structural member. A directly relevant statement can be found in the conclusion of Jansen et al. (2014): ‘As opposed to truss topology optimization where a clear definition of a structural member exists, there is no real notion of a structural member in a continuum and cracks and holes of varying shape and size can occur. A simplified model of local failure is therefore used to simulate local material failure where a number of patches with predefined shape can be removed from the design.’ This indicates that they avoided dealing explicitly with the first challenge stated above - the notion of truss member failure test within the context of topology optimization. The design problem is defined as minimizing the maximum compliance of all the FEA models, with each containing a square hole at a unique location. While the approach captures the essence of fail-safe requirement, it involves analysis of a very large number of FEA models at the scale equal to NE representing the number of elements. As finite element models commonly contain NE exceeding several hundred thousand, the fail-safe problem becomes computationally prohibitive even with largest HPC cluster available today.

This paper studies fail-safe topology optimization for general 3D structures. First we establish the original, mathematically rigorous, design formulation as the structure containing either a sphere or a cube damage of required size at a random location within the structural domain. For geometric simplicity we use cube damage throughout the paper although sphere form is more generic. The original problem involves an infinite damage population, which strictly enforces zero material survival rate in a representative cube with size equal to damage cube. In order to reduce the original problem into one with finite damage population, we introduce the concept of damage population series, with Level 1 starting as gapless fill of damage cubes within the structural domain. With each increasing level in the series the density of damage placement doubles. We then study quantitative relationship between damage population size and maximum material survival rate of a representative cube as a measure of damage modeling accuracy. We demonstrate that for practical purpose we can obtain desired structural redundancy with Level 1 damage population represented by gapless fill of damage cubes. We also show that a partial Level 2 damage population only doubling Level 1 population offers good ROI (return on investment) property for increased damage modeling accuracy. Further we examine the implication of maximum length scale on damage modeling accuracy and discover that zero material survival rate can be achieved with Level 2 damage population when member cross-section size is limited to half of the damage size. As also pointed out by Jansen et al. (2014), all involved FEA models can be analyzed in parallel with perfect scalability. We implemented the solution in Altair OptiStruct (2016) in a MPI process for best solution efficiency. An overview of OptiStruct implementation can be found in Zhou et al. (2004). Several 2D and 3D examples, including the example studied by Jansen et al., are shown to demonstrate the impact of fail-safe requirement to design concept generated by topology optimization. The FEA model of the 3D example contains 327,493 tetrahedron elements.

This paper is structured in the following sections. In Section 2 the concept of fail-safe design is illustrated with a simple three-bar truss. In Section 3 mathematical representation of failure or damage for a 3D structural continuum is introduced so that the resulting structure is subjected to the failure test on both discrete members and monolithic parts. Section 4 introduces the concept of damage population series and studies quantitative relationship between damage population size and accuracy in damage representation. Computational scheme and implementation are discussed in Section 5. Two 2D examples and one 3D example are shown in Section 6. Section 7 offers concluding remarks and suggests topics for future study.

2 Fail-safe design

In this section we illustrate the concept of fail-safe design using a simple three-bar truss shown in Fig. 1. The X and Y dimensions of the truss is 100 × 50; the material properties are: E = 2.1 × 105 and υ = 0.3; a horizontal load of 1.0 × 103 is applied at the junction node of the three bars.

Fig. 1
figure 1

Three-bar truss

Design constraints for fail-safe should primarily be stress and may include displacements. However for all examples in this paper we use a simplified optimization formulation—minimizing the compliance with a volume constraint. This makes it easy for researchers to study and compare results as this is the most used formulation for papers on topology optimization. It also allows us to study fundamental behaviors of a new type of problem on a well behaved simple optimization problem. Compliance provides a clear global performance measure for comparison between various structural configurations involving failure scenarios. For the three-bar truss we consider minimizing compliance under a volume constraint:

$$ \begin{array}{l}\mathrm{Minimize}\kern0.75em C\left(\boldsymbol{x}\right)\\ {}\mathrm{Subject}\ \mathrm{t}\mathrm{o}\kern0.5em V\left(\boldsymbol{x}\right)-{V}^U\le 0\ \\ {}\kern4.5em {x}_i^L\le {x}_i\le\ {x}_i^U,\kern4.25em i=1,2,3\end{array} $$
(1)

in which the upper bound of the volume constraint V U = 1000; the lower bounds of bar cross-sectional areas are set to zero, allowing truss members to vanish. We ran the optimization problem with Altair OptiStruct (2014). To obtain a fully converged solution we used a very low convergence tolerance on the objective at 0.0001, and the run converged after 12 iterations. The optimal design, referred to as Standard case, is given in Table 1. The middle bar vanishes and the structure reduces to a two-bar truss. We know this design corresponds to the simplest case of the famous Michell (1904) cantilever truss.

Table 1 Conventional and fail-safe designs of a three-bar truss

For a structure with predefined load carrying members, fail-safe is defined as sustained structural integrity under the condition that an arbitrary structural member fails. For the given three-bar truss it means that the structure should survive under the three structural configurations shown in Fig. 2, representing failure of the right, middle and left bars, respectively. This design problem now involves three structural cases of distinct structural configurations under the same applied loads. This fail-safe optimization problem with compliance as the performance measure can be formulated as follows:

$$ \begin{array}{l}\mathrm{Minimize}\kern0.75em \mathrm{Max}\left({C}_j\left(\boldsymbol{x}\right)\right),\kern4.5em j=1,2,3\\ {}\mathrm{Subject}\ \mathrm{t}\mathrm{o}\kern0.5em V\left(\boldsymbol{x}\right)-{V}^U\le 0\kern3.5em \\ {}\kern4.5em {x}_i^L\le {x}_i\le\ {x}_i^U,\kern4.25em i=1,2,3\end{array} $$
(2)

which implies that optimization should target improving compliance of the worst failure cases. Note that Min(Max) compliance wouldn’t be meaningful for different load cases since relative quantities of compliances do not have clear engineering implication. However, compliances under the same load conditions for different failure modes are meaningful engineering measures on how a failure impacts the total performance of the structure. For a structure under multiple load cases, we can regard the sum of compliances, termed total compliance herein, as the global performance measure of the structure. In other words the compliance in (2) for each structural case involving one failed structural member becomes the total compliance under all load cases.

Fig. 2
figure 2

Three-Bar Truss: Structural cases under member failures

We used Altair OptiStruct (2014) to solve this multiple model optimization (MMO) problem. The same low convergence tolerance is used and the run converged in 21 iterations. The optimal design of the fail-safe problem is included in Table 1, which also lists compliances for the three structural cases involving failures. Both standard and fail-safe designs are illustrated in Fig. 3(a) and (b), respectively. Obviously the structure needs redundant stable load transmission paths, hence all three bars are necessary for a fail-safe design. It is interesting to observe that the compliance of undamaged state is the same as that of the second failure case with vanishing middle bar. This can be easily understood as, given symmetry, there is no vertical displacement at the loading point, hence the vertical bar has zero strain under linear FEA assumption. The performance under the first and the third failure cases are significantly worse than the 45° two-bar configuration that corresponds to the optimal Michell truss. Clearly the optimization is driven by these two structural cases. We refer to these as the active structural cases for fail-safe as they correspond to active constraints in the bound formulation for the Min(Max) problem (Olhoff 1989).

Fig. 3
figure 3

Optimal designs: (a) Standard and (b) fail-safe

3 Fail-safe formulation for general 3D structural continuum

While the concept and practice of fail-safe is easy to follow when the layout of load carrying structural members is already established, the definition of member failure for design concept generation process, i.e., topology optimization, has yet to be established. To understand the concept, let’s start with the first fail-safe scenario defined by Niu—failure of discrete members. This scenario is highly relevant to topology optimization that often results in a truss-like structure if there are no additional imposed manufacturing requirements such as casting. The presumption of fail-safe requires that failure test of a single member needs to rotate through all structural members without exception. However, the dilemma lies in defining failure test before discrete members emerge during the iterative process of topology optimization. By definition the failure test of ‘a member’ needs to be valid for any member emerged at any arbitrary location. To meet this requirement the failure test of all structural members within a 3D structural domain Ω can be established as a spherical damage of diameter d randomly located in the given domain as illustrated in Fig. 4(a). As stated in the introduction section, damage of a given shape and volume throughout the paper is defined as complete devoid of material. We cannot over emphasize the keyword random location that implies that the spherical damage be tested at any possible location, one instance at a time. Under the above definition no discrete structural member would survive a failure test as long as its cross-section doesn’t exceed length scale d. If the cross-section of a structural part shaped by topology optimization is bigger than length scale d, the damage due to the presence of the given spherical damage falls into the second category of failure defined by Niu as ‘partial failure in monolithic structure’. To conclude, the definition of damage outlined above covers both failure scenarios defined by Niu as ‘A single member failed in redundant structure or partial failure in monolithic structure’. To help visualizing the effect of the damage volume, we can picture the sphere as a magic eraser that only erases material where it currently occupies but doesn’t leave trace behind. In other words, the material would recover from void to solid when the eraser moved away. For a more dramatic visualization we can picture the effect of an unconventional bomb that would evaporate material in the contained volume while leaving adjacent space unharmed. Note that the fundamental concept outlined above is clearly present in Jansen et al. (2014), although (1) they did not explicitly establish the mathematically rigorous formulation with the given damage at a random location; (2) they did not explicitly differentiate between the failure scenario of a single discrete member and that of a monolithic part. These aspects by no means diminish the value of their work as the first research publication on fail-safe design within the context of topology optimization.

Fig. 4
figure 4

Structure containing a random damage: (a) Sphere; (b) Cube

Alternatively let’s also consider a cube shaped damage of edge length d, shown in Fig. 4(b). Though spherical damage is directionless and more generic, damage of a cube can also represent a practical use case without losing generality. As interactions of cubes are more easily illustrated and analyzed, we will primarily use cube form damage for establishing a mathematical and engineering foundation. From engineering perspective cube form damage with edge length equal to the diameter d of the spherical damage can be used as a more conservative representation as the devoid material volume contains the subset of the sphere at the same location. Therefore we can assume that general qualitative observations drawn based on cube damage are valid for sphere damage as well. Obviously the orientation of damage cubes has quantitative implications. The diagonal section of a cube can cause a larger sectional cut of a structural member in its most vulnerable cross-section. As we generally use cube damage as a more conservative test of spherical damage, we can regard the orientation factor as merely varying degrees of reserves relative to sphere damage. From application perspective the practicing engineer should place cube orientation according to insights into the structure’s directional vulnerability. When directional neutrality of damage is important we can always resort to sphere damage in actual modeling.

The topology optimization problem for fail-safe design can be defined as follows:

$$ \begin{array}{l}\mathrm{Minimize}\kern0.75em f\left(\boldsymbol{x}\right)\\ {}\mathrm{Subject}\ \mathrm{t}\mathrm{o}\kern0.5em \left({g}_j\left(\boldsymbol{x}\right)-{g}_j^U\le 0,\kern1em j=1,\dots, M\right)\kern0.5em for\kern0.5em \left(S\in \varOmega \wedge S\notin {D}_{random}\right)\\ {}0.0\le {x}_i\le 1.0,i=1,\dots, N\end{array} $$
(3)

where \( f\left(\boldsymbol{\mathsf{x}}\right) \) represents the objective function, \( {g}_j\left(\boldsymbol{\mathsf{x}}\right) \) and g U j the j-th constraint response and its upper bound, respectively. For fail-safe design stress constraints should be primary focus, while displacement and other constraints could also be included if crucial for the survivability of the structure. There are generally multiple load cases involved. For simplicity of notation let’s consider that the total number of constraints M includes constraints from all load cases considered. All M constraints should hold for the residual structure S excluding a randomly located damage D random . In essence this represents an infinite number of structural cases. \( {\mathit{\mathsf{x}}}_i \)is the normalized material density of the i-th element. In this paper we use the SIMP topology optimization approach (Bendsøe 1989; Zhou and Rozvany 1991) where a power law penalty is applied to the stiffness density relationship:

$$ {\underline {\mathbf{K}}}_i\left({x}_i\right)={x}_i^p\ {\mathbf{K}}_i $$
(4)

where K i and K i represent the penalized and the real stiffness matrix at full density of the i-th element, respectively, and p is the penalization power that is larger than 1.0. Typically p takes value between 2 and 4. A small lower bound, say 0.01, is typically applied on the density variables to prevent singularity in the stiffness matrix. Note that the effect of elements at density lower bound is further significantly weakened by the power law. It is well established today that minimum length scale should be included in the topology variable formulation. There are many approaches available today that can be find in reviews by Sigmund and Maute (2013) and Deaton and Grandhi (2014).

For a structure with discrete members (i.e., a truss or frame-like structure), if damage D is larger than or equal to the largest structural member size in the design, it can completely wipe out an arbitrary member as it moves through the structure to ‘erase’ a target. For a spread-out structural component, e.g., a plate with area larger than D, the damage produces a hole in the structure of the given size at the given location.

4 Solution strategy for fail-safe topology optimization

Obviously random placement of damage of a given size within a structural domain is challenging even for analysis. Fortunately, we can take cues from many similar design scenarios we encounter in engineering practice. For example, loads in an elevator for N persons are randomly placed, which can, however, be modeled by sufficient load case samples. The same strategy can be applied to fail-safe design formulation, although admittedly the case of failure is more complex than the random load location analogy. For the latter it is relatively easy to identify several most severe load cases.

For convenience let’s focus on cube damage in this section. In Section 4.1 we introduce a serial process of damage population increase to explore the relationship between the population size and the maximum material survival rate within a representative cube of the same size as the damage. In Section 4.2 we study the effect of cross-section length scale on material survival rate. In Section 4.3 considerations for practical application are discussed.

4.1 Effect of damage population size

From engineering perspective it makes sense to start with a base damage population with damage cubes occupying the structural domain Ω evenly without gap and overlap. This implies that not a single finite element in the structural domain survives removal under the base damage population. The five grey color cubes in Fig. 5 represent a part of the base population N D . First we establish a series of damage population size levels, termed Damage Series A, with each level increase doubling the density of damages.

Fig. 5
figure 5

Base damage cubes in grey and best hideout in red (R) (lines within lower left cube are for size reference)

  • Damage Series A:

    1. (a)

      Level 1: A total number of N D damage cubes of size d are distributed evenly to cover the entire structural domain Ω. Grey cubes in Fig. 5 represent the base Level 1 damage population. The centers of damages are evenly spread in Ω, with diagonal distance \( \sqrt{3}d \) between neighboring cubes. We denote the population size of Level 1 at PA1= N D .

    2. (b)

      Level 2: Next we aim to double the density of damage zones to create evenly spread damage zones with distance between neighbors reduced to \( \sqrt{3}d/2 \). To achieve that we only need to double the grid points of cube centers along X, Y and Z, resulting in a damage population PA 2 = 23 × N D  = 8 × N D .

    3. (c)

      Level L: We aim to double the evenly spread damage population density from Level (L−1), producing a damage population PA L  = 23(L − 1) × N D . It can be easily established that for a 3D domain the total population of level L is always eight times the population of the previous level, i.e., PA L  = 8 × PA (L − 1). Thus the increase of population from one level to the next is always ΔPA L  = 7 × PA (L − 1).

      Now we construct a slight variation of the above damage series, termed PB, as a partial set of PA at all levels except Level 1. We will show later that PB has some interesting characteristics.

  • Damage Series B:

    1. (a)

      Level 1: The damage population starts exactly the same as PA, i.e., PB1 = PA1.

    2. (b)

      Level 2: We keep only a subset of Level 2 population in PA. PA2 can be constructed by moving seven copies of PA1 into bisection combinations along XYZ. For PB we only keep the copy moving diagonally in space, shown in red color in Fig. 5. The resulting damage population is PB 2 = 2 × PA 1 = PA 2/4.

    3. (c)

      Level L: Following the logic in (b) we construct PBL as PA(L−1) enriched with its copy shifted diagonally into bisection location in space, i.e. in all three dimensions XYZ. The total damage population is PB L  = 2 × PA (L − 1) = PA L /4. Therefore, the population size of the partial set for PB is only a quarter of the complete set of PA at any level in the series.

The population size is summarized in Table 2 for varying damage levels. For a given finite damage population size PL (i.e., PAL or PBL) established at Level L, the design problem given in (3) can be formulated as follows:

Table 2 Maximum material survival rate of a representative cube matching the size of damage cube
$$ \begin{array}{l}\mathrm{Minimize}\kern0.75em f\left(\boldsymbol{x}\right)\\ {}\mathrm{Subject}\ \mathrm{t}\mathrm{o}\left({g}_j\left(\boldsymbol{x}\right)-{g}_j^U\le 0,j=1,\dots, M\right)for\left(S\in \varOmega \wedge S\notin {D}_l\right),l=1,\dots, {P}_L\\ {}\kern4.5em 0.0\le {x}_i\le\ 1.0,\kern1.5em i=1,\dots, N\end{array} $$
(5)

As discussed in Section 2, we will use a simplified problem formulation for numerical examples so we can focus on the key phenomenon related to fail-safe. The Min(Max) compliance formulation for the fail-safe problem becomes the following:

$$ \begin{array}{l}\mathrm{Minimize}\kern0.5em \mathrm{Max}\left({C}_l\left(\boldsymbol{x}\right)\right),\kern1em for\kern0.5em \left(S\in \varOmega \wedge S\notin {D}_l\right),\kern0.75em l=1,\dots, {P}_L\\ {}\mathrm{Subject}\ \mathrm{t}\mathrm{o}\kern0.5em V-{V}^U\le 0\\ {}\kern4.5em 0.0\le {x}_i\le\ 1.0,\kern1.5em i=1,\dots, N\end{array} $$
(6)

where C l represents the sum of compliances, termed total compliance herein, of all static load cases for the structure S ∉ D l . Note that analysis and sensitivity calculation of a structural case involving a damage zone is a standard process except that the concerning FEA model has elements contained in the damage zone D l removed. In other words, load carrying capacity and sensitivity contribution from elements within D l are zero for the structural case S ∉ D l . Therefore, the optimization problem shown in (6) essentially aims at minimizing the adverse impact of damages to the structural performance. This is the same problem formulation proposed by Jansen et al. (2014). They further reduced the min(max) objective into a single aggregate based on K-S function formulation to deal with a very large damage population. As will be shown subsequently, our solution involves a much smaller damage population size. Therefore it is sufficient to solve the Min(Max) problem accurately by introducing an upper bound on all objectives (Olhoff 1989).

Compared to (3) the problem in (5) or (6) becomes numerically feasible, though computationally expensive. We will discuss computational aspects in Sections 4.3 and 5. For now we will focus on establishing important characteristics in relationship between the damage population size and the maximum material survival rate within a representative cube of damage size. This is important for measuring the confidence level we have in the model as the very definition of random failure corresponds to zero material survival of a representative cube randomly located in the entire structural domain. Therefore the lower the maximum material survival rate of a representative cube is, the closer the problem defined by a constructed finite damage population reflects the original problem. We introduce the following theorem first:

Theorem I

The problem defined by Damage Series A or B shown in (5) is equivalent to the original problem in (3) as the damage population P L approaches infinity.

The proof of the theorem is straightforward. Since damages defined by PA are always evenly spread in the entire structural domain Ω with distance from one damage zone to another not exceeding \( \sqrt{3}d/{2}^{\left(L-1\right)} \), the distance between damage zones approaches zero when L approaches infinity. The same should hold for PB as at Level L the total damage population includes the damage population Level (L−1) of PA. In a visual display, when L approaches infinity every possible point in the space Ω would have a bomb planted that can remove material within a cube of the given size d.

In the following analysis we normalized the cube to unit size d = 1. For a given damage population, it is important to understand the reliability of the model for capturing a random failure. Assuming discrete members have a maximum cross-section length scale d = 1, failure test of a member can be carried out on a unite length that forms a representative cube of size d. Clearly when damage population is infinity the material survival rate within the representative cube is zero. As already pointed out when introducing cube form damage, it causes more sectional damage for structural members not aligned with the cube orientation. Therefore we only need to focus on the most survivable case where the representative cube is aligned with damage cubes. For a given damage population, the maximum material survival rate within the representative cube provides a quantitative measure on the accuracy of the failure test. For PA1 = PB1, it can be observed that the red cube shown in Fig. 5 represents the best hideout location for the representative cube in red, with a volumetric survival rate of 87.5 %. Note that from the perspective of structural mechanics the sectional residual is a more important measure as forces are transferred directionally. It is easy to calculate that the sectional survival rate for a structural member represented by the unit length cube is 75 % for the base damage population PA1.

Visualization becomes much more crowded when damage population increases eight times from PA1 to PA2 . For visual clarity we only leave the most damaging cubes in the image in Fig. 6. Analyzing geometric interactions we have determined that the green cube in Fig. 6(a) represents the best hideout for PA2, with a sofa-corner shaped surviving volume at (37/64) = 57.81 % and sectional survival rate at 43.75 %. For PB2 we can find a slightly better hideout shown in blue in Fig. 6(b). The survival volume of the blue cube has a L-shape, or sofa-section shape, with a surviving volume and cross-section of 62.50 and 50 %, respectively. It should be noted that while sectional survival rate of the green cube in Fig. 6 is equal in all X, Y and Z directions for PA2 and PB2, the sectional survival rates of the blue cube in Fig. 6 are uneven for PB2 with 25, 50 and 25 % in X, Y, Z directions, respectively. We observe that PB is superior to PA in terms of computational efficiency since at any level above Level 2 we achieve just a slightly worse material survival rate with a quarter of the damage population.

Fig. 6
figure 6

a Best hideout in green (G) for PA2; b Best hideout in blue (B) for PB2

For the damage population at the L-th level, we have determined that the above observation regarding the best hideout positions and surviving volume shapes holds true, albeit with thicknesses of the surviving green sofa-corner and blue sofa-section halving to the next damage population level. The maximum sectional and volumetric survival rates for Level 1 to Level 5 are shown in Table 2, including exact formulae for Level L. These results are obtained by analyzing geometric interaction of intersecting cubes. The results are confirmed through numerical simulation with a billion random cube locations. We are confident about the results represented in Table 2, although we would forego a formal mathematical proof.

4.2 Effect of member cross-section length scale

In Sections 3 and 4.1 we already made the connection between damage size and the cross-sectional size of discrete members subject to failure test. Let’s restate the conclusion for the case with infinite damage population: (1) failure test holds for any discrete member with maximum cross-sectional dimension smaller than damage size d; (2) for members with larger dimensions the presence of a size d damage establishes what Niu (1988) defined as ‘partial failure in monolithic structure’. This conclusion serves as a clear guideline for practical application—the design engineer should define damage size d according to the target size for structural members classified as individual members. Obviously the fail-safe design problem has to be solved under finite damage population, with as low a population size as possible. As damage population size increases exponentially with respect to population level, Level 2 should be considered the practical limit for real applications. Table 2 shows that the maximum sectional survival rate of a member with cross-sectional size d is 43.8 % for full Level 2 damage population PA2, and slightly higher at 50 % for partial Level 2 damage population PB2. Now let’s assume that we reduce the maximum cross-sectional length scale to d/2. As illustrated in Fig. 7(a), the material survival rate for a representative cube of size d/2 does not change for PA1. Figure 7(b) shows the best hideout location for the half edge length cube under PB2 that carries a volumetric and sectional survival rate of 50 %. The sectional survival rates are directional, with 50 % along two axes and 0 % along the third axis. At full Level 2 damage population PA2 we can observe a surprisingly strong property—there is no survivable hideout place for the small size cube. That means that the damage population series converge, as shown in Table 3, at Level 2 when maximum cross-sectional length scale is half of the damage cube size. This is an extremely attractive property both theoretically and practically as, at only eight times base damage population size, failure test of discrete members is strictly satisfied if damage size d is defined as twice the maximum length scale for topology optimization. Various methods were proposed for imposing maximum length scale for topology optimization (see, Guest 2009; Guo et al. 2014). Maximum length scale control is also available in OptiStruct 7.0 since 2004. For a 3D structure maximum length scale constraint allows formation of a plate-like part with thickness below the given length scale, forming a structural part referred to as monolithic structural component in Niu’s terminology (1988). The failure test from damage population PA2 for such parts guarantees a through hole in the plate-like components with maximum thickness not exceeding d/2.

Fig. 7
figure 7

Best hideout in red (R) for d/2 maximum length scale: a PA1 in grey cubes; b PB2 in grey and blue (B) cubes

Table 3 Maximum material survival rate of a representative cube half the edge length of damage cube

4.3 Practical considerations on damage population size

In Sections 4.1 and 4.2 we studied material survival rate of a cube representing an isoperimetric section of a structural member under a given damage population. The properties are derived from a pure geometric perspective. We emphasized that sectional residual of a structural member is the most important measure for residual load carrying capacity. Assuming a damage population PA1, it is to be expected that topology optimization will exploit the loopholes in the damage model, placing material around the intersection between damage cubes to reach maximum potential sectional residual of 75 %. However, we shouldn’t forget that material placement is also driven by structural performance. In general, best locations for structural members to survive damage cubes do not correspond to best locations for structural performance. With the combined effects of performance and survival the optimization process is most likely to split a larger member into two or more smaller members so that only a subset of them is affected by each damage instance. This hypothesis will be confirmed by examples in Section 6. When more stringent failure test on discrete members is desired, we recommend imposing a maximum length scale constraint d_max and define the damage cube size at 2xd_max. As shown in Section 4.2, strict member failure test can be satisfied if we applied full Level 2 damage population PA2. However, practically it may not be necessary to increase the computing cost eight time (4 times for 2D structures) compared to PA1. Limiting maximum structural member size will further force the design to split larger members into multiple smaller members. This effect, combined with presence of damage population PA1 or PB2, will drive robust load path redundancy. So far our discussion has been limited to cube form damage. If spherical damage is applied, PA1 will leave elements between the sphere gaps unaffected. However the second layer of damage spheres in PB2 provides complete coverage of the gaps in PA1, and hence guarantees elimination of all elements by at least one damage instance. Therefore PB2 should be the recommended choice when spherical damage is applied.

Jansen et al. (2014) introduced a damage population density essentially equivalent to FEA mesh density. It is well-established today that minimum length scale is generally required for avoiding checkerboard and other artifacts due to too course FEA mesh relative to feature size. Generally we can assume that a structural member at minimum size should have three elements across. Let’s assume that desired maximum member size has at least 4 elements across, a minimum damage cube size should be at least 4 times element size. Under these assumptions the damage population applied by Jansen et al. corresponds to PA3 shown in Table 2, which carries 64 times damage population size of PA1 or 32 times that of PB2. In practice the damage cube size is likely to be much larger than 4 time element size, so the discrepancy between element density based damage population and that of PA1 or PB2 can be much larger. This large gap can be seen from Example 2 shown in Section 6.2.

5 Computational scheme

In the context of finite element analysis, the optimization problem in (3) involves P L structural cases with distinct FEA models. The solution obviously is computationally expensive. However, since each FEA model is completely independent from another, the analysis and sensitivity analysis can be solved entirely in parallel. Therefore, given a large enough HPC cluster with P L computing nodes the fail-safe optimization problem can be solved at the same turnaround time as the base design problem without failure modes. The fail-safe topology design framework is implemented in Altair OptiStruct 14.0.220 (2016), based on the multiple model optimization (MMO) framework available in OptiStruct 13.0 (2014). The MMO capability is a general feature aimed at optimizing structures of varying configurations, yet sharing some common design components. An example of MMO scenario is a car chassis platform on which three variations—sedan, van and SUV—are built. The MMO optimization problem involves three FEA models with a set of independent variables for each model, while all models share common design variables on the chassis. In OptiStruct MMO is implemented as a MPI parallel algorithm, with the master process orchestrating the optimization solution by assembling the analysis and sensitivity results from all models involved.

The iterative scheme of fail-safe solution is illustrated in Fig. 8. It is implemented as a MPI application with (P L  + 1) processes, with the master process also carrying out analysis of the undamaged model for performance reference. Several practical measures are implemented for damage zone generation:

Fig. 8
figure 8

Fail-safe topology iterative scheme

  • Damage zones containing any point load are eliminated to preserve load conditions. Partial elimination of distributed loads by a damage zone is considered acceptable.

  • If a damage zone increases the compliance by a significant margin compared to that of the undamaged structure at the start, the process terminates. Such case indicates that the structure’s function depends on a narrow pathway that doesn’t allow redundancy to be built. The margin threshold can be defined by the user and 10 times compliance increase is set as default.

  • For reducing computation cost, a threshold on the material volume inside a damage cube can be applied to reduce the total damage population. Ten percent threshold is used for numerical examples in this paper. In addition for second layer of damage cubes added by PA2 or PB2 we only include those that are fully inside the structural domain.

For preserving load conditions one could also freeze out a sufficiently large non-design domain around the loading points. However the above general treatment adds robustness to the software implementation. The general implementation can accommodate any level of PA and PB damage population generation. However, for practical applications we recommend not to exceed PB2. Note that a MPI application can utilize computing resources flexibly. When the number of computing nodes N C is smaller than N MPI , several MPI processes are distributed onto each computing node. The operating system on each node manages multiple processes in the same manner as a computer handles multiple tasks. On a homogeneous HPC cluster it is recommended to choose N MPI as a multiple of N C for best computing resource utilization. Optimally, we should use large enough HPC cluster with (P L  + 1) computing nodes for shortest run time.

6 Numerical examples

Two 2D examples and one 3D example are provided to show the effect of fail-safe on topology results. We use OptiStruct default objective convergence tolerance 0.005 except Example 2 that used a lower tolerance 0.001. For penalty value in (4) OptiStruct has a built-in gradual increase, with the default final penalty values at 3.0 for 2D and 4.0 for 3D structures. The minimum member size (minimum length scale) takes default value equal to three times the average mesh size.

6.1 Example 1: rectangular plate under shear force

We aim to reproduce the fail-safe scenario discussed in the three-bar truss example in Section 2. The 2D domain has dimensions 100 × 50 with a thickness of 1.0, modeled with 200 × 100 = 20,000 quadratic elements with material properties: E = 2.1 × 105 and υ = 0.3. The load is the same as for the three-bar truss example: P = 1000 is applied at the center of the bottom edge while the upper edge is fixed. The finite element model is shown in Fig. 9. The same volume constraint of 1000 is used, which represents a 20 % volume fraction of the design domain. A relatively large square damage size of 25 × 25 is considered. The models corresponding to damage population PA1 are illustrated in Fig. 10. For reference purpose damage populations PA1 and PB2 are shown in Fig. 11 with labels for the damage zones. Figure 11(b) also shows that the grey zone of PB2 is eliminated because it cuts off the point load. The optimum for this standard problem is, as expected, a two-bar truss-like structure shown in Fig. 12(a). The fail-safe designs with PA1 and PB2 damage populations are shown in Fig. 12(b) and (c), respectively. The runs took 26, 39 and 39 iterations for the results in Fig. 12 in the order (a), (b) and (c). To provide a clear view on how the structure preserves certain level of structural integrity under each damage instance, models for damage population PA1 are shown in Fig. 13 with final fail-safe topology result. The compliances of the standard and the fail-safe design for PA1 are listed in Table 4, including the compliances for all eight failure modes. In Table 5 the compliances for damage population PB2 are listed. The compliance of the final standard design is 58.72, which is 23 % above the compliance 47.60 of the two-bar truss discussed in section 2. This difference is due to several factors: (a) 1D vs. 2D modeling; (b) the penalty effect on the remaining semi-dense elements. The compliances, 84.28 and 82.96, for undamaged state under PA1 and PB2 are about 44 % higher than the standard solution. At the final designs there are four active failure zones in the center (2,3,6,7) for PA1, with an additional zone (10) becoming active for PB2. We classified active damage zones as those yielding final compliances within 2 % of the maximum compliance. From results in Tables 4 and 5 we notice a maximum compliance increase of about 120 % compared to the undamaged state. This is less than the 170 % increase observed for the three-bar truss design shown in Table 1. This seems to indicate that the increased redundancy compared to a three-bar truss helped to improve risk mitigation.

Fig. 9
figure 9

FEA model with 200 × 100 quadratic elements

Fig. 10
figure 10

Eight damaged models for damage population PA1

Fig. 11
figure 11

Damage populations: (a) PA1; (b) Addition in PB2

Fig. 12
figure 12

Topology results: (a) Standard: (b) Fail-safe PA1; (c) Fail-safe PB2

Fig. 13
figure 13

Damaged models with final fail-safe topology for PA1

Table 4 Compliances for optimal design under PA1
Table 5 Compliances for optimal design under PB2

In order to study damage location dependency we ran the same example for PA1 and PB2 with locations shifted by 1/4 of the square size along both X and Y axes. We obtain the designs shown in Fig. 14(a) and (c) for PA1 and PB2, respectively. Asymmetry of the design is induced by asymmetric damage zone placement. If we enforce symmetry for topology optimization, we arrive at the designs shown in Fig. 14(b) and (d) for PA1 and PB2, respectively. The iteration numbers and compliances of the four solutions are given in Table 6. Note that the damage location dependency is amplified in this example by a very large damage size relative to the structural dimension. But even for this rather extreme case we can observe that globally the results in Figs. 14 and 12 share similar design features in terms of load path redundancy.

Fig. 14
figure 14

Shifted damage zones: Unrestricted (a) PA1 and (c) PB2; Enforced symmetry (b) PA1 and (d) PB2

Table 6 Detailed results with shifted damage zone locations

6.2 Example 2: rectangular plate under bending force

We consider the example from Jansen et al. (2014). The plate has dimensions 180 × 60 with a thickness of 1.0 and E = 1.0, modeled with 180 × 60 = 10,800 quadratic elements. The left edge of the plate is supported, and a unit load P = 1.0 is applied at the center of the right edge. The FEA model is shown in Fig. 15. The material volume constraint is 40 %. For this example we ran OptiStruct with an objective convergence tolerance 0.001. Result of the standard optimization problem without fail-safe requirement is shown in Fig. 16(b), which is practically identical to the result from Jansen et al. in Fig. 16(a). The compliance is 222, compared to 203 reported by Jansen et al. This shows about 9 % result difference, which is likely due to presence of a small amount of semi-dense elements in our result. At final penalty p = 3.0 semi-dense material offers significantly weakened stiffness. Jansen et al. used a Heaviside projection formulation that achieved very discrete result. They did not report number of iterations in their results. Average damage population size of first 250 iterations was mentioned when they studied damage population reduction approaches. Therefore we can assume that results in Jansen et al. took more than 250 iterations to achieve. As the result of a less steep projection formulation OptiStruct typically results in a semi-dense layer on the boundary of structural members in the final design. But practically applicable topology results are typically obtained well below 100 iterations. Note that for practical applications final topology is usually extracted as smoothed iso-surface at a density threshold between 0.3 and 0.5. Therefore presence of some amount of semi-dense elements and distorted final compliance value have limited practical implication as long as clear topological representation can be achieved. We will see in the following that as more members form under fail-safe requirements, percentage of semi-dense elements increases, and hence the gap between compliance values compared to Jansen et al. widen even further. Therefore we should focus mostly on the topological layout of results and the relative performance difference between undamaged state and damaged state of the same design. Results for the standard design and fail-safe designs are all summarized in Table 7.

Fig. 15
figure 15

FEA model with 10,800 elements

Fig. 16
figure 16

Topology result of the standard problem: (a) Jansen et al (2014); (b) This paper

Table 7 Results for designs in Figs. 16, 18 and 20

Jansen et al. considered three damage square edge lengths d equal to 5, 10 and 22. They excluded 1/9th of the model on the right end to preserve the load. As summarized in Section 5 we implemented an automatic procedure to exclude damage zones that eliminate point loads. Since d = 5 is too small to make any practical impact, we only study two cases with larger damage square sizes. For damage square d = 10 PA1 and PB2 damage populations are shown in Fig. 17, which contain 108 and 193 squares, respectively. In Jansen’s model 7701 damage squares are involved. The topology results of the reference, PA1 and PB2 are shown in Fig. 18(a), (b) and (c), respectively, with Fig. 18(d) showing PB2 result on one critical damage case. We can see that all results show very similar redundant design features. Therefore damage population PA1 is practically sufficient for achieving an applicable solution. The compliance values are summarized in Table 7, including maximum compliances under damage state. Table 7 also include number of active damage zones that yield compliances within 2 % of the maximum value. The final design is a truss-like structure without large monolithic parts. Without performing analyses modeling each member failure we can observe qualitatively that robust redundant load paths exist for failure test on every member. This include members supporting the loading point for results from this paper. For damage population PA1 the computational cost difference between the reference and this paper is 7701/108 = 71 times. Note that Jansen et al. proposed approaches to screen active damage set that could further reduce their base damage population up to 20 %. This doesn’t change the above comparison from a qualitative perspective. Note also that the actual damage population reduction would be less as they excluded coverage of 1/9th of the model.

Fig. 17
figure 17

Damage populations for damage d = 10: (a) PA1; (b) Addition in PB2

Fig. 18
figure 18

Damage d = 10: (a) Jansen et. al.; (b) PA1; (c) PB2; (d) PB2 with a critical damage

For damage square d = 22 PA1 and PB2 damage populations are shown in Fig. 19, which contain 26 and 42 squares, respectively. In Jansen’s model 5421 damage squares are involved. The topology results of the reference, PA1 and PB2 are shown in Fig. 20(a), (b) and (c), respectively, with Fig. 20(d) showing PB2 result on one critical damage case. Again we observe that all results show very similar redundant design features. Therefore damage population PA1 is practically sufficient for achieving an applicable solution. The compliance values are summarized in Table 7, including maximum compliances under damaged state. Table 7 also include number of active damage zones that yield compliances within 2 % of the maximum value. Screening failure test of every discrete member we notice that the PA1 solution doesn’t provide member redundancy at the loading point. This implies that for design with larger damage size yielding relatively low damage population it is both desirable and affordable to use PB2 damage population. The computational cost for d = 22 under PB2 is 42/5421 = 0.8 % of that of Jansen et al. We also ran this case with a refined mesh with 360 × 120 = 43,200 elements. The result shown in Fig. 21 is very similar to the one in Fig. 20(c), which shows that the solution is largely mesh independent. Final compliance and other values for this design are also listed in Table 7. The refined FEA model does provide cleaner member definition. For this model the damage population according to the reference approach would quadruple to above 20,000 while PB2 remains 42 in our approach.

Fig. 19
figure 19

Damage populations for damage d = 22: (a) PA1; (b) Addition in PB2

Fig. 20
figure 20

Damage d = 22: (a) Jansen et al.; (b) PA1; (c) PB2; (d) PB2 with a critical damage

Fig. 21
figure 21

Damage d = 22: PB2 with refined mesh (360 × 120)

6.3 Example 3: 3D control arm

In this example taken from OptiStruct tutorials, the dimensions of the model are approximately 450 × 550 × 110, and the model contains 327,493 tetrahedron elements with material properties E = 2.1 × 105 and υ = 0.3. We apply a 30 % volume fraction constraint, as well as single draw direction constraint for casting manufacturing. Two load cases are considered, shown in Fig. 22, representing different combinations of bending and torque. The total compliance of the two load cases is the response for the Min(Max) problem in (6). A damage cube size of 50 × 50 × 50 is imposed for fail-safe. The base layer damage population PA1 contains 45 cubes shown in Fig. 23(a), and the enrichment layer for PB2 adds 28 more cubes, are shown in Fig. 23(b). The optimal designs for standard and fail-safe are shown in Fig. 24. The compliances of standard and fail-safe designs are 162.3 and 193.7, and took 46 and 41 iterations, respectively. The maximum compliance of damaged structure is 756.8, corresponding to four active damage zones all close to the two vertical bearings in the back in Fig. 22. The compliance loss is quite large at almost four times that of the undamaged state. This can be explained by the fact that all active failure zones cause significant weakening of the already narrow pathways to the bearings. This clearly suggests to the designer that if fail-safe of the structure is required additional bearings should help to widen the pathways for load transfer. From Fig. 24 we can see that major fail-safe features include an additional large rib in the middle and more redundancy close to the two vertical bearings. With 327,473 elements in this FEA model damage population as proposed by Jansen et al. (2014) would likely exceed 200,000 after eliminating cubes falling partially outside the structural domain. This would make it computationally rather infeasible.

Fig. 22
figure 22

Control arm: (a) Load case 1; (b) Load case 2

Fig. 23
figure 23

Damage populations: (a) PA1; (b) Addition for PB2

Fig. 24
figure 24

Final design with casting constraint: (a) Standard; (b) Fail-safe

7 Conclusion

The basic concept of populating damage squares within a 2D structure for fail-safe topology optimization was first introduced by Jansen et al. (2014). However two major shortcomings existed: (1) From theoretical perspective they avoided directly dealing with the dilemma of failure test of discrete members that yet to emerge from topology optimization; (2) From practical perspective the method they proposed is computationally prohibitive for real applications as it involves a damage population size at the scale of the number of elements in the model.

In this paper we established the concept and formulation for fail-safe design in the context of topology optimization of a 3D structural continuum by following well established fail-safe requirements for aircraft design. We first established the rigorous mathematical foundation of the fail-safe design problem as designing a structure with presence of a given size damage randomly located within the structural domain. We then introduced the concept of damage population series to study the relationship between accuracy in failure test modeling and the damage population size. We defined base damage population PA1 as containing gapless fill of damage cubes occupying the entire structural domain, with each population level increase doubling the placement density of damage cubes. We discovered simple exact formulae for maximum material survival rate within a representative isoperimetric section of a structural member under a given damage population level. Further we investigated the relationship of structural member size and damage size and discovered that rigorous member failure test is guaranteed with level 2 population PA2 if maximum length scale of half the damage size is imposed. This is a profound property considered that rigorous failure test guarantee is reached only when damage population approaches infinity if maximum length scale is equal to damage cube size.

From engineering perspective the needs for redundant load paths are sufficiently represented by PA1 damage population. We constructed a partial set of PA2 damage population PB2 that only doubles the PA1 population. The added damage layer occupies exactly the best hideout locations under PA1 damages, centered at the junctures of PA1 damage cubes. We identified that the partial population of a given level is superior as it yields just a slightly higher material survival rate with a quarter of the full level damage population. For practical applications we recommend using PA1 damage population. PB2 can be a good option if damage cubes appear to be rather coarse in the structural domain as a result of large damage size relative the structural dimensions. Although the paper used a Min(Max) compliance formulation to study the fail-safe design concept, the engineering case clearly defined by Niu (1988) is a stress constrained design problem. Our implementation in OptiStruct (2016) includes stress constraints. We chose to limit the content of this paper to the compliance based formulation so that we can focus the study on fail-safe aspects. Also compliance-based design can be easily reproduce by readers, while handling of stress constraints involve complicated aggregates and related parameter tuning (see, e.g., Le et al. 2010).

As all models containing damage instances can be solved completely in parallel, we implemented the solution as an MPI parallel process in OptiStruct (2016). The user only needs to specify damage cube size in the model. The software automatically populates damage cubes within the structural domain, and then initiates the MPI solution process. HPC computing resources are widely available today, making it possible to solve fail-safe topology optimization containing hundreds of damage cubes with very efficient turnaround time. We demonstrated the solution with two 2D examples and one 3D example, including a cantilever plate example from Jansen et al. (2014). For the cantilever plate example we showed that qualitatively identical results are obtained with roughly 1/100th damage population compared to Jansen et al.

Since fail-safe design has important practical relevance, we hope that this paper can open up a new research direction for further exploration. We can suggest several future research areas:

  • The fail-safe topology optimization concept and formulation is developed with reference to well established fail-safe design approach vital to aircraft design. However, the general approach should be applicable to any industry where structural failure can result in catastrophic accident. We only established the basic concept with generic sphere and cube shaped damages. In practice damage can occur in different forms. For example a ballistic impact could cause a penetration of a given shape and size. It should be an interesting research topic to explore more broad damage scenarios found in engineering practice of various industries.

  • From the perspective of confidence quantification of the failure representation, we only provided quantitative study about maximum material survival rate of a representative member section for a given damage population. Statistics based study on reliability of the entire structural system under a given damage population can provide further insights.

  • Some basic problems studied herein can be interesting for mathematicians as well. We used an engineering approach to establish exact survival properties under a given damage population through analyzing geometric interactions of a representative member section with intersecting damage cubes. It can be of theoretical value to provide formal mathematical proofs of the findings presented. Also it can be an interesting problem to derive geometric properties for sphere and other damage shapes. Another very challenging theoretical problem is to study the uniqueness and global optimality of fail-safe topology. Though the solution is not unique under finite damage population, the limiting case of infinite damage population seems unique by definition. Mathematical insights into this complex problem can be highly valuable from both theoretical and practical perspectives.