1 Introduction

The main target of companies whose production depends on the effectiveness of their repairable systems is to maximize the availability of those systems. When a repairable system is not available, it enters an unproductive phase (Boliang et al 2019) in which resources are not only not generated but are actually consumed until the system's available state is recovered. To enhance system availability, techniques such as adding redundant devices or scheduling preventive maintenance tasks have been widely analyzed. However, the simultaneous consideration of both techniques has not received sufficient attention.

In previous studies (Cacereño et al 2021a, b), the authors of the present paper considered the simultaneous optimization of the design (by including redundant devices) and the maintenance (by including preventive maintenance tasks) of a system taken as a case study. This approach finds maintenance strategies particularized to specific structural design alternatives, avoiding the need to design the maintenance strategy only after the structural design has been fixed. The result is a compact maintenance strategy fully customized to the structural design. The authors coupled the Non-dominated Sorting Genetic Algorithm II (NSGA-II) (Deb et al 2002) and Discrete Event Simulation (DES) to solve this problem. Each solution proposed by the Multi-objective Evolutionary Algorithm is evaluated through Discrete Event Simulation, a technique used to describe the behavior of the system by building the system Functionability Profile, a concept presented by Knezevic (1996). Both availability and operational cost were considered as objective functions from a multi-objective point of view. As an Evolutionary Optimization Algorithm, NSGA-II is considered a standard state-of-the-art solver for finding feasible solutions to real-world multi-objective optimization problems (Emmerich and Deutz 2018).

To deepen the understanding of the methodology and improve its efficiency, it is thoroughly explored, exploited and analyzed in the present paper. The contributions are as follows:

  • Since the resolution and optimization of real-world problems with evolutionary algorithms/metaheuristics is a current hot topic (Osaba et al 2021), the analysis of state-of-the-art multi-objective algorithms for solving real-world engineering problems, such as the one addressed here, remains a research focus. Therefore, as a case study, the methodology is applied to a containment spray injection system for which five state-of-the-art Multi-objective Evolutionary Algorithms are used (covering the reference algorithms of the three selection paradigms: dominance-based selection, indicator-based selection and aggregation-based selection). The target is to analyze the efficacy of these state-of-the-art Multi-objective Evolutionary Algorithms under two crossover strategies, the Simulated Binary Crossover (SBX) (Deb and Agrawal 1995) and Differential Evolution (DE) (Storn and Price 1997). Their performances are compared in detail using the Hypervolume and statistical significance tests (including post-hoc analysis when necessary), demonstrating the success of achieving a set of non-dominated solutions and determining the features of the best-performing methods. Next, the performance results are compared with some recently developed algorithms. The benefit of using the methodology with the indicator-based selection algorithm SMS-EMOA is demonstrated.

  • The methodology carries out a single Discrete Event Simulation to emulate the behavior of the system over its whole mission time. When Evolutionary Algorithms have been coupled with Discrete Event Simulation, some authors use several discrete simulations to emulate the behavior of the system, whereas others employ a single Discrete Event Simulation, so a gap exists in determining the effect of this choice. In the present paper, the power of the Multi-objective Evolutionary Algorithms is used to minimize unavailability and operational cost simultaneously while considering automatic design of the system. The single discrete simulation procedure is further compared in the case study with a Monte Carlo Simulation (analyzed sampling sizes of 10, 100 and 1000) where the minimal extreme value is taken as representative of the achieved distribution (either minimal unavailability, minimal operational cost, or minimal weighted unavailability-operational cost, equivalent to minimal Manhattan distance). A hypothesis test is presented in which the proposed methodology ranked first and statistically significant differences were found when an equivalent total number of fitness evaluations was run. Insights into the competitiveness and computational efficiency of the methodology are given.

  • The methodology is applied to two more complex industrial systems (with up to 36 devices), supporting its scalability and generalization. Evidence of the benefit of automatic selection of devices is given in comparison with systems in which all devices are mandatory, and two types of chromosome codification related to the automatic selection of devices are compared in the largest industrial application case, showing performance benefits.

Summarizing the main contributions of this manuscript, the authors have developed a thorough study of the multi-objective problem of simultaneously optimizing system availability and system cost by defining the maintenance strategy (through the maintenance times) and alternative designs (through automatic selection of system devices), coupling discrete event simulation with a single sample size and multi-objective evolutionary algorithms: (a) a study of the performance of several state-of-the-art multi-objective evolutionary algorithms is presented (8 algorithms); (b) a study of the influence of sampling size is presented, demonstrating the benefits of the single sample size proposed here; (c) a set of three reliability problems of increasing complexity is solved with the above proposal, showing its capability and scalability.

The paper is organized as follows: Sect. 2 explores the related literature. Section 3 summarizes the methodology and Sect. 4 introduces multi-objective optimization using Evolutionary Algorithms. Section 5 presents the case study. Its results are presented and discussed in Sect. 6. In Sect. 7, the methodology is applied to more complex systems, and finally, Sect. 8 presents the conclusions.

2 Literature review

Optimization is useful in practically all areas of life. Engineers must find optimal solutions to achieve the best performance when solving engineering problems. Optimization has been widely used in many fields of engineering (aeronautical, civil engineering, electrical networks, transport, logistics, etc.), and its limits lie only in the engineer's imagination. Hence, when complex problems must be solved, the employment of optimization methods is a suitable course of action. Optimization is particularly useful when the number of potential solutions is high and achieving the best solution is very difficult; instead of the best solution, some sufficiently good solutions can be obtained (Simon 2013). Systems reliability optimization has been widely studied; however, because of technological advances, increases in system complexity and consumer demand (among other aspects), it is an ever-changing and developing problem (Coit and Zio 2019).

One strategy commonly used to improve the availability of repairable systems consists of adding redundant devices. Including a redundant device in a system increases the number of alternative paths, so the probability of the system remaining in an available state is improved. Including redundant devices in a system requires the modification of its design. Several approaches have been used to achieve optimal system designs, such as Dynamic Programming (Fyffe et al 1968), Integer Programming (Misra and Sharma 1991) or Nonlinear Programming (Tillman et al 1977). However, the use of Evolutionary Algorithms has been gaining importance due to their power when multiple objectives must be handled. In this sense, several authors considered using Genetic Algorithms (Zoulfaghari et al 2014; Ghorabaee et al 2015), the Non-dominated Sorting Genetic Algorithm II (NSGA-II) (Greiner et al 2003; Kayedpour et al 2017; Sharifi et al 2019; Chambari et al 2021), the Ant Colony Algorithm (Zhao et al 2007), the Artificial Bee Colony Algorithm (Jiansheng et al 2014) or Particle Swarm Optimization (Samanta and Basu 2018).

Another strategy to improve the availability of repairable systems consists of applying preventive maintenance tasks. The main reasons why a continuously operating system stops are a failure (after which a recovery time associated with corrective maintenance tasks is required) or a scheduled shutdown to perform preventive maintenance tasks. When a preventive maintenance task is performed, the unproductive phase is better controlled than when repairs have to be performed because of a failure. Therefore, it is important to identify the optimum moment at which to stop the system and perform a preventive maintenance task. This should be done before the occurrence of the failure but as close as possible to it, in order to maximize the total system available time. Various approaches have been employed to schedule preventive maintenance tasks, such as Integer Programming (Kralj and Petrovic 1995), Mixed Integer Linear Programming (Charest and Ferland 1993; Fathollahi-Fard et al 2021) or Evolutionary Algorithms, where several authors employed Genetic Algorithms (An et al 2020; Bressi et al 2021; Wang et al 2021), the Non-dominated Sorting Genetic Algorithm II (NSGA-II) (Piasson et al 2016; Zang and Yang 2021), the Ant Colony Algorithm (Berrichi et al 2010) or the Bee Colony Algorithm (Li et al 2014). In all of the above scenarios it is possible to improve the availability of repairable systems either through design modification or through the maintenance strategy. However, such a modification or strategy will likely have consequences in terms of operational costs.

Although both the design and the preventive maintenance strategy of a system influence its performance and affect each other, their joint management has not yet received significant attention. There are relatively few works in which the design and preventive maintenance strategy of technical systems have been jointly optimized. Levitin and Lisnianski (1999) presented the first formulation of the joint redundancy and maintenance optimization problem for multi-state systems, using a Genetic Algorithm as the optimization technique. Nourelfath et al (2012) formulated a joint redundancy and imperfect preventive maintenance planning optimization model based on Markov processes and the universal moment generating function, in order to evaluate availability and cost for multi-state systems using Genetic Algorithms and Tabu search. Bei et al (2017) presented an approach to designing the configuration of a multiple-component system and determining a maintenance plan under uncertain future stress exposure, using a two-stage stochastic programming model. However, when a new system is designed, Discrete Event Simulation arises as a powerful modeling technique that allows complex systems to be analyzed much more accurately thanks to a more realistic representation of their behavior in practice.

Using Evolutionary Algorithms coupled with Discrete Event Simulation has produced good results in the reliability field. Regarding design optimization, Cantoni et al (2000) presented an approach coupling a Single-objective Evolutionary Algorithm and Monte Carlo simulation for optimal plant design. Marzaguerra et al (2005) proposed a similar approach from a multi-objective perspective in which the presence of uncertainty is considered. Regarding optimal maintenance through the preventive maintenance strategy, Tan and Kramer (1997) proposed a general framework for preventive maintenance optimization that combines Monte Carlo simulation with a Genetic Algorithm, and Oyarbide-Zubillaga et al (2008) determined the optimal preventive maintenance frequencies for multi-equipment systems using Discrete Event Simulation and the Non-dominated Sorting Genetic Algorithm II (NSGA-II). Recently, Azevedo et al (2020) proposed a multi-objective approach to model a replacement policy problem applicable to equipment that undergoes failures with several levels of severity, using a Multi-objective Genetic Algorithm coupled with Discrete Event Simulation. However, although coupling Multi-objective Evolutionary Algorithms and Discrete Event Simulation performed well when the design optimization considered corrective maintenance tasks (Lins and Droguet 2009; Lins and López 2011), only a few studies considering preventive maintenance tasks have been developed. A study to mention was developed by Galván et al (2007), in which a methodology for Integrated Safety System Design and Maintenance Optimization based on a bi-level evolutionary process was proposed; they coupled the Non-dominated Sorting Genetic Algorithm II and Monte Carlo Simulation in order to achieve the optimum design and surveillance test intervals. Therefore, no studies covering both corrective and preventive maintenance, in the sense of finding the optimum period of time at which to carry out preventive maintenance tasks, had been developed before the one proposed by the authors of the present paper (Cacereño et al 2021a, b). A deeper study of the methodology is presented here. In particular, the performance of different types of Multi-objective Evolutionary Algorithms is thoroughly studied. Furthermore, the influence of the sampling size, i.e., the number of simulations conducted to describe the behavior of the system, is analyzed. Finally, the methodology is applied to more complex systems to demonstrate its flexibility.

3 Methodology and description of the proposed model

3.1 Extracting availability and cost from the functionability profile

Reliability is an intrinsic characteristic of systems and is related to the way in which they have been designed and built. Maintainability can be intrinsic, when it is related to design conditions (a part that is difficult to access will be more complex to maintain), or extrinsic, e.g., when it is related to available spares or the human teams who must perform maintenance operations. Whereas Reliability is a concept related to the Time To Failure, Maintainability is a concept related to the Time To Repair. Availability relates these two concepts to define the way in which the system is able to achieve the function for which it was designed over a period of time. Availability can be computed through the unconditional failure w(t) and repair v(t) intensities, as explained by Andrews and Moss (2002). A device continuously subjected to the failure and repair process has a failure probability in the time interval \([t,t+dt)\), given that it was working at \(t=0\), represented by w(t)dt. Two situations lead to failure in \([t,t+dt)\): the device works continuously from 0 to t until the first failure occurs in \([t,t+dt)\) (the probability of this is f(t)dt, where f(t) is the failure density function), or the device fails in \([t,t+dt)\) but this is not the first failure. In the second situation, the device has experienced one or more repairs prior to the failure and the last one was carried out in the interval \([u,u+du)\) (the probability of this is \(v(u)du \times f(t-u)dt\)). The repair time u can occur at any point between 0 and t, so adding all possibilities gives Eq. 1.

$$\begin{aligned} w(t)dt = f(t)dt + \int _{0}^{t}f(t-u)v(u)du\,dt \end{aligned}$$
(1)

A repair can only occur in \([t,t+dt)\) if a failure has occurred in some interval \([u,u+du)\) prior to t. The probability of this is \(g(t-u)dt \times w(u)du\), where w(u)du is the probability of failing in \([u,u+du)\) given that the device was working at \(t=0\), and \(g(t-u)dt\) is the probability of repair in \([t,t+dt)\) given that the device has been in the failed state since its last failure in \([u,u+du)\) and was working at \(t=0\), g(t) being the repair density function. Since u can vary between 0 and t, Eq. 2 is obtained.

$$\begin{aligned} v(t)dt = \int _{0}^{t}g(t-u)w(u)du\,dt \end{aligned}$$
(2)

Canceling dt from Eqs. 1 and 2 yields the simultaneous integral equations defining the unconditional failure and repair intensities, shown in Eqs. 3.

$$\begin{aligned} \begin{aligned}&w(t) = f(t) + \int _{0}^{t}f(t-u)v(u)du \\&v(t) = \int _{0}^{t}g(t-u)w(u)du \end{aligned} \end{aligned}$$
(3)

Unavailability (Q(t)) is the complement of availability (A(t)). Eq. 4 shows how to compute Q(t) from Eqs. 3.

$$\begin{aligned} Q(t) = \int _{0}^{t}[w(u)-v(u)]du \end{aligned}$$
(4)
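As an illustrative numerical sketch (not part of the original formulation), Eqs. 3 and 4 can be approximated on a discrete time grid with a left-rectangle rule; the exponential failure and repair densities and the rates used below are assumptions chosen only for the example.

    import numpy as np

    # Minimal sketch: discretize Eqs. 3 (coupled renewal-type integrals) and 4.
    # The exponential densities and their rates are illustrative assumptions.
    lam, mu = 1.0e-3, 1.0e-1          # assumed failure and repair rates (1/h)
    dt, horizon = 0.5, 2000.0         # time step and analysis horizon (h)
    t = np.arange(0.0, horizon, dt)
    f = lam * np.exp(-lam * t)        # failure density f(t)
    g = mu * np.exp(-mu * t)          # repair density g(t)

    w = np.zeros_like(t)              # unconditional failure intensity w(t)
    v = np.zeros_like(t)              # unconditional repair intensity v(t)
    for i in range(len(t)):
        w[i] = f[i] + dt * np.dot(f[i:0:-1], v[:i])   # first line of Eqs. 3
        v[i] = dt * np.dot(g[i:0:-1], w[:i])          # second line of Eqs. 3

    Q = dt * np.cumsum(w - v)         # Eq. 4: unavailability Q(t)
    print(Q[-1], lam / (lam + mu))    # tends to the steady-state unavailability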

When a device has exponential failure and repair intensities (constant failure and repair rates), its availability can be found by solving Eqs. 3 for w(t) and v(t), which can be done using Laplace transforms. In this case, Eq. 5 is obtained, so the steady-state availability can be computed using the Mean Time To Failure (MTTF) and the Mean Time To Repair (MTTR).

$$\begin{aligned} A = \frac{{{\textrm{MTTF}}}}{{{\textrm{MTTF}}} + {\textrm{MTTR}}} \end{aligned}$$
(5)
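For illustration, with hypothetical values of MTTF = 1000 h and MTTR = 10 h, Eq. 5 gives \(A = 1000/(1000+10) \approx 0.990\).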

When a device does not have exponential failure and/or repair intensities, finding the device availability by solving Eqs. 3 for w(t) and v(t) is very hard, so a simulation approach is suitable. In this paper, the system availability is characterized through simulation by using the system Functionability Profile. Technical systems are developed to fulfill specific functions, so “functionality” is an important feature related to the system's capacity to achieve its mission. The system must not only achieve its mission but also satisfy a set of requirements called “satisfactory features” (e.g., volume or flow). Moreover, it is necessary to specify the “operation conditions” under which the system must be able to operate (e.g., temperature or humidity). These three aspects come together under the umbrella concept of “Functionability Profile” introduced by Knezevic (1996), defined as the inherent capacity of the system to achieve the required function under specific features when used as specified. In general, all systems achieve their function at the beginning of their lives. However, irreversible changes take place over time and variations in the system behavior occur. The deviation of these variations from the satisfactory features reveals the occurrence of a system failure, which causes a transition from the operation state to the failure state. After failing, if the system is repairable, its capacity to fulfill the required function can be recovered through recovery activities or corrective maintenance.

Aside from corrective maintenance activities, systems can require additional tasks to keep them in operation. These are generally less complex and are called preventive maintenance activities, or maintenance prior to failure. From the Functionability Profile's point of view, the states of a repairable system fluctuate between operation and failure over the course of the mission time. The shape of these fluctuations is called the Functionability Profile, as it tracks the states over the whole mission time. An example of a Functionability Profile is shown in Fig. 1. Functionability Profiles depend on the operation times (either Time To Failure or Time To Start following a scheduled Preventive Maintenance activity) \((t_{f1}, t_{f2},..., t_{fn})\) and recovery times (either Time To Repair after a failure or Time To Perform a Preventive Maintenance activity) \((t_{r1},t_{r2},..., t_{rn})\). When the Functionability Profile is set to logical 1, the device is considered to be operating. Conversely, when the Functionability Profile is set to logical 0, the device is considered to be stopped (it may be under repair after a failure or under maintenance). It can be deduced from Fig. 1 that, after an operation time, a recovery time is needed.

Fig. 1: Functionability profile of a device (or system)

Users need to be sure that the system Functionability Profile satisfies the desired function. Hence, users are interested in the shape of the system Functionability Profile, with special emphasis on the periods in which the system is available. As previously mentioned, availability is tightly related to Functionability Profiles: availability is characterized by the relationship between the system operation times and the total mission time. In the present paper, the model used to compute availability relies on the following assumptions:

  • At any point in time, each device is in either the “operation” state or the “failed” state.

  • The devices are independent of each other.

  • A repair starts just after the failure of the device.

  • A repair returns the device to the as-good-as-new state.

The system is able to fulfill its purpose during the operation times \(t_{fi}\), so its availability over the mission time can be evaluated using Eq. 6.

$$\begin{aligned} A = \frac{\displaystyle \sum _{i=1}^n{t_{fi}}}{\displaystyle \sum _{i=1}^n{t_{fi}+\sum _{j=1}^mt_{rj}}} \end{aligned}$$
(6)

where n is the total number of operation times, \(t_{fi}\) is the i-th operation time in hours (Time To Failure or Time To Start following a scheduled Preventive Maintenance activity), m is the total number of recovery times and \(t_{rj}\) is the j-th recovery time in hours (due to repair or preventive maintenance activity). Therefore, availability is a variable with a value between 0 (minimum availability) and 1 (maximum availability), and availability and unavailability sum to 1.
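As a minimal illustration of Eq. 6 (the function name and the example times are assumptions, not values from the case study), availability can be computed directly from the lists of operation and recovery times:

    def availability(operation_times, recovery_times):
        """Eq. 6: ratio of total operation time to total mission time (hours)."""
        up = sum(operation_times)       # sum of the t_fi
        down = sum(recovery_times)      # sum of the t_rj
        return up / (up + down)

    # Hypothetical example: three operation periods and three recovery periods.
    print(availability([4000, 3500, 4200], [12, 30, 8]))   # ~0.9957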

Operation and recovery times behave as random variables, so they are not known in advance. If a historical record of both types of times is compiled and a statistical analysis is performed, operation and recovery times can be defined as probability density functions through their respective parameters. Those functions can follow a specific typology (e.g., Exponential, Weibull, Normal). Several databases on the market (OREDA 2009; CCPS 1998) supply characteristic parameters for such functions, so operation and recovery times can be characterized for the different failure modes of system devices.

When systems are operating, earnings are generated in relation to their availability. Conversely, when systems have to be recovered, an economic cost is incurred to return them to the operation state. In this paper, the economic cost is a variable directly associated with recovery times. Such recovery times are related to corrective and preventive maintenance activities, and their costs can be computed by Eq. 7.

$$\begin{aligned} C = \sum _{i=1}^{q}{cc_{i}}+\sum _{j=1}^{p}{cp_{j}} \end{aligned}$$
(7)

where C is the system operational cost quantified in economic units, q is the total number of corrective maintenance activities, \(cc_i\) is the cost of the i-th corrective maintenance activity, p is the total number of preventive maintenance activities and \(cp_j\) is the cost of the j-th preventive maintenance activity. Maintenance activity costs depend on the respective fixed quantities per hour (corrective and preventive), so the global cost is directly related to recovery times. Preventive maintenance activities are scheduled shutdowns, so their recovery times will be shorter and cheaper than recovery times due to corrective maintenance activities (for reasons such as willing and trained personnel or available spare parts). If long recovery times are to be avoided, preventive maintenance activities must be carried out ideally just before the failure, or otherwise as close as possible to it. Therefore, the system's nominal Functionability Profile (which represents the continuous cycle of failure and recovery through corrective maintenance over the mission time) must be modified by including preventive maintenance activities for the system devices, with the aim of maximizing system availability and minimizing the cost due to recovery times.
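Similarly, a minimal sketch of Eq. 7 is given below; the per-hour cost rates and the example durations are illustrative assumptions rather than the values of the case study.

    def operational_cost(corrective_hours, preventive_hours,
                         cc_rate=1.0, cp_rate=0.25):
        """Eq. 7: total cost of corrective plus preventive recovery times.

        cc_rate and cp_rate are assumed fixed costs per hour (economic units/h).
        """
        cc = sum(h * cc_rate for h in corrective_hours)   # corrective costs, cc_i
        cp = sum(h * cp_rate for h in preventive_hours)   # preventive costs, cp_j
        return cc + cp

    print(operational_cost([12, 30], [8, 8, 8]))  # hypothetical recovery times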

3.2 Building functionability profiles to compute the objective functions

In order to optimize the system design and its preventive maintenance strategy, it is necessary to characterize both the system availability and the cost from the system Functionability Profile. The system Functionability Profile is built from the Functionability Profiles of the system devices, which are built using Discrete Event Simulation. To this end, information about how to characterize the operation \((t_{fi})\) and recovery \((t_{rj})\) times is needed. Operation times are composed of Times To Failure (TF) and Times To Start following a Preventive Maintenance activity (TP), while recovery times are composed of Times To Repair after failure (TR) and Times To Perform a Preventive Maintenance Activity (TRP). In the present paper, the failure modes of each device are grouped into a single failure mode. The Functionability Profiles of the system devices are built by generating random times obtained from the respective probability density functions for times to failure (TF) and times to repair (TR). These Functionability Profiles are then modified by including Times To Start following a Preventive Maintenance task (TP) and Times To Perform a Preventive Maintenance task (TRP), which are characterized based on limits supplied by expert judgment. In order to build the Functionability Profile of a device, each Time To Failure (TF) generated for the device must be compared with the Time To Start following a scheduled Preventive Maintenance activity (TP) related to that device. This information is identified through each solution provided by the Multi-objective Evolutionary Algorithm (each individual of the population). Depending on what happens first, the failure or the Preventive Maintenance activity, a Time To Repair (TR) or a Time To Perform a Preventive Maintenance activity (TRP), respectively, is used to complete that section of the Functionability Profile. The process (shown both in Fig. 2 and in pseudo-code in Algorithm 1) is explained below:

  1. The system mission time (life cycle) is fixed; the process then continues for each device.

  2. The device Functionability Profile (PF) is initialized.

  3. The Time To Start following a scheduled Preventive Maintenance activity (TP) proposed by the Multi-objective Evolutionary Algorithm (previously limited by the minimum \((TP_{min})\) and maximum \((TP_{max})\) set values) is extracted from the respective decision variable of the individual being evaluated, and a Time To Perform a Preventive Maintenance activity (TRP) is randomly generated within the previously fixed minimum \((TRP_{min})\) and maximum \((TRP_{max})\) limits.

  4. An operation Time To Failure (TF) is randomly generated within the previously fixed minimum \((TF_{min})\) and maximum \((TF_{max})\) limits, with reference to the failure probability density function of the device.

  5. If \(TP < TF\), the preventive maintenance activity is performed before the failure. In this case, as many logical “ones” as TP units, followed by as many logical “zeros” as TRP units, are added to the device Functionability Profile. Each time unit represented in this way (whether a logical “one” or a logical “zero”) is equivalent to one hour of real time. The cost is computed by multiplying TRP by the preventive maintenance cost per hour and adding the result to the global preventive maintenance cost \((cp_{j})\).

  6. If \(TP > TF\), the failure occurs before the preventive maintenance activity can be carried out. In this case, according to the repair probability density function of the device, the Time To Repair (TR) is randomly generated within the previously fixed minimum \((TR_{min})\) and maximum \((TR_{max})\) limits. Then, as many logical “ones” as TF units, followed by as many logical “zeros” as TR units, are added to the device Functionability Profile. Each time unit represented in this way (whether a logical “one” or a logical “zero”) is equivalent to one hour of real time. The cost is computed by multiplying TR by the corrective maintenance cost per hour and adding the result to the global corrective maintenance cost \((cc_{i})\).

  7. Steps 4 to 6 are repeated until the end of the device mission time.

  8. Steps 2 to 7 are repeated until a Functionability Profile has been built for every device.

  9. After building all the device Functionability Profiles, the system Functionability Profile is obtained according to the logical structure of the system. Several techniques can be used for this, depending on the complexity of the system, such as logical operators or fault trees.

Once the system Functionability Profile is built, the values of the objective functions can be computed using Eq. 6 (which gives the system availability in relation to the time in which the system is operating and being recovered) and Eq. 7 (which gives the system operational cost depending on the cost of the time units devoted to corrective or preventive maintenance).
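A minimal Python sketch of the procedure above is given next for a single device with an exponential Time To Failure and uniformly sampled recovery times; the distributions, limits and cost rates are illustrative assumptions, and the full methodology builds one such profile per device before combining them through the system logic (steps 8 and 9).

    import random

    def device_profile(mission, TP, tf_limits, tr_limits, trp_limits, lam,
                       cc_rate=1.0, cp_rate=0.25):
        """Build one device Functionability Profile (steps 2-7) as a 0/1 list.

        TP is the time to start preventive maintenance proposed by the optimizer.
        Sampling choices (exponential TF, uniform TR/TRP) are assumptions."""
        profile, cost = [], 0.0
        while len(profile) < mission:
            # Step 4: sample a Time To Failure within the fixed limits.
            TF = min(max(random.expovariate(lam), tf_limits[0]), tf_limits[1])
            if TP < TF:                                   # Step 5: maintain first
                TRP = random.uniform(*trp_limits)
                profile += [1] * int(TP) + [0] * int(TRP)
                cost += int(TRP) * cp_rate
            else:                                         # Step 6: failure first
                TR = random.uniform(*tr_limits)
                profile += [1] * int(TF) + [0] * int(TR)
                cost += int(TR) * cc_rate
        return profile[:int(mission)], cost               # Step 7: cut at mission

    # Hypothetical device: failure rate 1e-4/h, PM every 5000 h, 10-year mission.
    pf, c = device_profile(87600, 5000, (1, 87600), (8, 24), (4, 8), 1e-4)
    A = sum(pf) / len(pf)                                 # availability, Eq. 6
    print(A, c)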

Fig. 2: Building Functionability Profiles

Algorithm 1: Build the system Functionability Profile (PF)

4 Multi-objective evolutionary optimization approach

The optimization methods used in this paper belong to the Evolutionary Algorithms (EA) paradigm. This kind of method uses a population of individuals with a specific size. Each individual is a multidimensional vector called a chromosome, which represents a possible candidate solution to the problem. The vector components are called genes or decision variables. Extended information on Evolutionary Optimization Algorithms was supplied by Simon (2013). In the present paper, a real-world engineering multi-objective problem is addressed, in which a set of non-dominated solutions constitutes the set of equally optimum final designs. Evolutionary algorithms are population-based search methods that have been established as state-of-the-art methods to solve this kind of multi-objective design optimization problem; see, e.g., reference books such as Coello et al (2007) and Deb (2001), or more recent works such as Coello (2015) and Emmerich and Deutz (2018). Among their advantages are that they are stochastic global optimizers, that no special requirements are imposed on the fitness function, and that they are able to attain the whole non-dominated set of solutions in a single run; among their disadvantages, the most critical is the need to execute a high number of fitness function evaluations (at least in the order of thousands). In this work, each individual in the population consists of a string of real numbers in which the system design alternatives and the periodic times to start a preventive maintenance activity for each device included in the system design are codified, as detailed later in the case study scenario. Optimization problems can minimize or maximize one or more objectives. In most cases, real-world problems require optimizing several objectives at the same time, and these objectives are frequently in conflict with each other. Such problems are called “multi-objective” problems and their solutions form a set representing the best compromises among the objectives (the Pareto-optimal set) (Coello 2015; Emmerich and Deutz 2018). These kinds of problems are described by Eq. 8 (considering a minimization problem) (Simon 2013).

$$\begin{aligned} \displaystyle \min _xf(x)= \displaystyle \min _x[f_1(x), f_2(x),..., f_k(x)] \end{aligned}$$
(8)

In optimization, when problems are defined in this way, the k functions must be simultaneously minimized. In the present paper, the objective functions are, on the one hand, the system availability (an objective to maximize, mathematically expressed by Eq. 6, which is equivalent to minimizing unavailability) and, on the other hand, the operational cost (an objective to minimize, mathematically expressed by Eq. 7). Optimization problems are usually subject to constraints. In this case, the problem is subject to constraints on the maximum and minimum values of the Times To Failure, Times To Repair, Times To Start following a scheduled Preventive Maintenance activity and Times To Perform a Preventive Maintenance activity. Classical optimization methods suggest converting the multi-objective optimization problem into a single-objective optimization problem by emphasizing one particular Pareto-optimal solution at a time. Due to their ability to find multiple Pareto-optimal solutions in a single run, a number of Multi-objective Evolutionary Algorithms (MOEAs) were subsequently proposed. Nowadays, Multi-objective Evolutionary Optimizers (EMO) can be classified into three groups (Emmerich and Deutz 2018; Greiner et al 2017):

  • Indicator-based selection EMO: methods based on a unary indicator to guide the search.

  • Decomposition/Aggregation-based selection EMO: methods based on decomposition of the search space, optimizing a set of scalarizing functions in parallel.

  • Dominance-based selection EMO: methods that use the concept of Pareto dominance as the basis of their selection (a minimal sketch of this dominance relation is given below).
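As a brief, illustrative sketch of the Pareto dominance relation mentioned in the last group (minimization of both objectives is assumed and the example values are hypothetical):

    def dominates(a, b):
        """True if solution a Pareto-dominates b (all objectives minimized)."""
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))

    # Hypothetical (unavailability, cost) pairs:
    print(dominates((0.001, 900.0), (0.002, 950.0)))  # True: better in both
    print(dominates((0.001, 990.0), (0.002, 950.0)))  # False: non-dominated pair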

In this paper, a thorough study of the use of Multi-objective Evolutionary Algorithms applied to the field of Reliability is conducted. Firstly, several representative EMOs of each of the selection criteria defined above are used to optimize a case study:

  • The S-Metric Selection Evolutionary Multi-objective Optimization Algorithm (SMS-EMOA) (Beume et al 2007), which uses the multi-objective selection based on Dominated Hypervolume, as representative of the indicator-based selection EMO.

  • The Multi-objective Evolutionary Algorithm Based on Decomposition (MOEA/D) (Zhang and Li 2007) and its extension (MOEA/D-DE) (Li and Zhang 2009), which uses differential evolution (DE) as an operator, as representatives of the Decomposition/Aggregation-based selection EMO.

  • The Non-dominated Sorting Genetic Algorithm II (NSGA-II) (Deb et al 2002) and the Generalized Differential Evolution (GDE3) (Kukkonen and Lampinen 2005), which use the Pareto dominance criterion to perform multi-objective optimization, as representatives of the dominance-based selection EMO.

Methods such as SMS-EMOA, MOEA/D and NSGA-II are state-of-the-art standard solvers for real-world multi-objective optimization problems (Emmerich and Deutz 2018). These methods use Simulated Binary Crossover (Deb and Agrawal 1995) to create new individuals. The study is then extended to methods that use Differential Evolution (Storn and Price 1997) to create new individuals, such as MOEA/D-DE. In addition, Kukkonen and Lampinen (2005) showed that GDE3 outperformed NSGA-II on a set of different types of test problems. The intention is to compare the performance of these two methods, which share the use of the Pareto dominance criterion, on a real-world problem. Therefore, these five Multi-objective Evolutionary Algorithms are used, seeking the joint optimization of the system design and its preventive maintenance strategy. Moreover, once the previous study is concluded, some recently developed algorithms are used to compare the performance results. These methods are:

  • The Adaptive Non-dominated Sorting Genetic Algorithm III (ANSGA-III) (Himanshu and Kalyanmoy 2014), which uses the Pareto dominance criterion to perform multi-objective optimization.

  • An approach that adapts weights during decomposition-based evolutionary multi-objective optimization (AdaW) (Li and Yao 2020).

  • The Efficient Large-Scale Multi-objective Optimization Based on a Competitive Swarm Optimizer (LMOCSO) (Tian et al 2020).

5 The case study

The case study consists of optimizing the design and preventive maintenance strategy of an industrial system based on two conflicting objectives: availability and operational cost. Maximum availability and minimum operational cost are desirable. The more that is invested in preventive maintenance, the greater the system availability; conversely, this policy increases unwanted cost, constituting a conflict between objectives. The proposed methodology is applied to a containment spray injection system (CSIS) of a nuclear power plant, whose simplified model is shown in Fig. 3; it is based on a case presented by Greiner et al (2003) and previously studied by the authors of the present paper (Cacereño et al 2021a, b). The model is formed by cut valves \((V_i)\) and impulsion pumps \((P_i)\). The CSIS mission is to inject borated water into the containment in order to remove radioactive contamination that may be released after a loss of coolant accident. In this case, the number of redundant devices is limited, as shown in Fig. 3.

Fig. 3: Case study: containment spray injection system (CSIS)

Table 1 Data set for system devices

Although the basic shape of the system was presented by Greiner et al (2003), the data used in the present paper were updated and previously used in Cacereño et al (2021a, 2021b). They are shown in Table 1, where:

  • Life Cycle = The system mission time (expressed in hours).

  • Corrective Maintenance Cost = The cost entailed in developing a repair activity to recover the system following a failure (expressed in economic units per hour).

  • Preventive Maintenance Cost = The cost entailed in developing a preventive maintenance activity (expressed in relation to the Corrective Maintenance Cost).

  • Pump \(\mathrm {TF_{min}}\) = The minimum operation Time To Failure for a pump without preventive maintenance (expressed in hours). It is considered that a failure of a pump cannot occur before this time.

  • Pump \(\mathrm {TF_{max}}\) = The maximum operation Time To Failure for a pump without preventive maintenance (expressed in hours). It is mandatory that the failure of a pump occurs before this time.

  • Pump \(\textrm{TF}\) \(\lambda \) = The failure rate for a pump, which follows an exponential failure distribution (expressed in 10\(^{-6}\) failures per hour).

  • Pump \(\mathrm {TR_{min}}\) = The minimum Time To Repair or duration of a corrective maintenance activity for a pump (expressed in hours).

  • Pump \(\mathrm {TR_{max}}\) = The maximum Time To Repair or duration of a corrective maintenance activity for a pump (expressed in hours).

  • Pump \(\textrm{TR}\) \(\mu \) = The mean for the normal distribution followed for the Time To Repair assumed for a pump (expressed in hours).

  • Pump \(\textrm{TR}\) \(\sigma \) = The standard deviation for the normal distribution followed for the Time To Repair assumed for a pump (expressed in hours).

  • Pump \(\mathrm {TP_{min}}\) = The minimum operation Time To Start following a scheduled Preventive Maintenance activity for a pump (expressed in hours). It is considered that a Preventive Maintenance activity for a pump is not necessary before this time.

  • Pump \(\mathrm {TP_{max}}\) = The maximum operation Time To Start following a scheduled Preventive Maintenance activity for a pump (expressed in hours). Delaying a Preventive Maintenance activity for a pump beyond this time is considered reckless; it should be done before this time.

  • Pump \(\mathrm {TRP_{min}}\) = The minimum Time To Perform a preventive maintenance activity for a pump (expressed in hours). It is the minimum time needed to develop the Preventive Maintenance activity for a pump.

  • Pump \(\mathrm {TRP_{max}}\) = The maximum Time To Perform a preventive maintenance activity for a pump (expressed in hours). It is the maximum time needed to develop the Preventive Maintenance activity for a pump.

  • Valve \(\mathrm {TF_{min}}\) = The minimum operation Time To Failure for a valve without preventive maintenance (expressed in hours). It is considered that a failure of a valve cannot occur before this time.

  • Valve \(\mathrm {TF_{max}}\) = The maximum operation Time To Failure for a valve without preventive maintenance (expressed in hours). It is mandatory that the failure of a valve occurs before this time.

  • Valve \(\textrm{TF}\) \(\lambda \) = The failure rate for a valve, which follows an exponential failure distribution (expressed in 10\(^{-6}\) failures per hour).

  • Valve \(\mathrm {TR_{min}}\) = The minimum Time To Repair or duration of a corrective maintenance activity for a valve (expressed in hours).

  • Valve \(\mathrm {TR_{max}}\) = The maximum Time To Repair or duration of a corrective maintenance activity for a valve (expressed in hours).

  • Valve \(\textrm{TR}\) \(\mu \) = The mean for the normal distribution followed for the Time To Repair assumed for a valve (expressed in hours).

  • Valve \(\textrm{TR}\) \(\sigma \) = The standard deviation for the normal distribution followed for the Time To Repair assumed for a valve (expressed in hours).

  • Valve \(\mathrm {TP_{min}}\) = The minimum operation Time To Start following a scheduled Preventive Maintenance activity for a valve (expressed in hours). It is considered that a Preventive Maintenance activity for a valve is not necessary before this time.

  • Valve \(\mathrm {TP_{max}}\) = The maximum operation Time To Start following a scheduled Preventive Maintenance activity for a valve (expressed in hours). Delaying a Preventive Maintenance activity for a valve beyond this time is considered reckless; it should be done before this time.

  • Valve \(\mathrm {TRP_{min}}\) = The minimum Time To Perform a preventive maintenance activity for a valve (expressed in hours). It is the minimum time needed to develop the Preventive Maintenance activity for a valve.

  • Valve \(\mathrm {TRP_{max}}\) = The maximum Time To Perform a preventive maintenance activity for a valve (expressed in hours). It is the maximum time needed to develop the Preventive Maintenance activity for a valve.

The data were obtained from specific literature (OREDA 2009), expert judgment (based on the professional experience of the Machinery & Reliability Institute (MRI), Alabama, USA) or mathematical relations. In this sense, the TR \(\sigma \) for valves and pumps were set in relation to the \(\mu \) of their respective normal distribution functions and their previously established \(\mathrm {TR_{min}}\). Regarding \(\mathrm {TR_{max}}\), it is known that 99.7% of the values of a normally distributed variable are included in the interval \(\mu \pm 3\sigma \); the interval was extended to \(\mu \pm 4\sigma \) to account for occasional more extreme values. As explained above, the optimization objectives consist of maximizing the system availability and minimizing the operational cost due to unproductive system phases (both when the system is being repaired and when it is being maintained). To do that:

  • It is necessary to establish the optimum period to perform a preventive maintenance activity for the system devices and

  • It is necessary to decide whether to include redundant devices such as P2 and/or V4 by evaluating design alternatives. Including redundant devices will improve the system availability but it will also increase the system operational cost.

From the optimization point of view, as explained before, Evolutionary Algorithms (EAs) use a population of individuals called chromosomes, which represent possible solutions to the problem. In this case, the chromosomes are strings of real numbers with 0 as the minimum value and 1 as the maximum value. Each string is codified as [\(B_1\) \( B_2\) \( T_1\) \(T_2\) \( T_3\) \( T_4\) \(T_5\) \(T_6\) \(T_7\) ], where the presence of the redundant devices P2 and V4 is defined by the decision variables \(B_1\) and \(B_2\), respectively, and the optimum time to start a preventive maintenance activity for each device is represented by the decision variables \(T_1\) to \(T_7\). However, these values have to be transformed to evaluate the objective functions:

  • Variables \(B_1\) and \(B_2\) are rounded to the nearest integer, so a value of 0 implies that the device is not included in the design and a value of 1 implies the opposite.

  • Variables \(T_1\) to \(T_7\) are scaled using Eq. 9, where \(TP_i\) is the true value of the parameter Time To Start following a scheduled Preventive Maintenance activity for the i-th system device, \( T_{i}\) is the value of the corresponding decision variable for the i-th system device and, finally, \(TP_{max_i}\) and \(TP_{min_i}\) are the limit values of the parameter TP for the i-th system device, with \(1 \le i \le 7\) (a minimal decoding sketch is given after Eq. 9).

$$\begin{aligned} TP_i = round(TP_{min_i}+T_{i}\cdot (TP_{max_i}-TP_{min_i})) \end{aligned}$$
(9)
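The following is a minimal decoding sketch of the transformation described above; the device limits used in the example are illustrative assumptions, not the values of Table 1.

    def decode(chromosome, tp_min, tp_max):
        """Decode [B1, B2, T1..T7] in [0, 1] into design bits and TP times (Eq. 9)."""
        b = [round(x) for x in chromosome[:2]]            # redundancy decisions
        tp = [round(lo + t * (hi - lo))                   # Eq. 9 scaling per device
              for t, lo, hi in zip(chromosome[2:], tp_min, tp_max)]
        return b, tp

    # Hypothetical TP limits (hours) for the seven devices:
    tp_min = [1000] * 7
    tp_max = [20000] * 7
    print(decode([0.8, 0.1, 0.5, 0.2, 0.9, 0.4, 0.6, 0.3, 0.7], tp_min, tp_max))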

The chromosomes are iteratively modified generation after generation during the evolutionary process (inspired by the Darwinian principle of survival of the fittest). The parameters used to configure the evolutionary process are shown in Table 2. Due to computational budget limitations (each independent run of the case study took an average of almost 4 days on a High Performance Computer, as shown in Sect. 6), the parameter combinations were restricted to the factors considered most relevant (shown in Table 2) for a proper balance between exploration and exploitation in our problem. The tested parameter values were chosen following general recommendations and common practice in the evolutionary algorithms field; a detailed explanation is given below. Five optimization methods were used as described above. Depending on the method, specific parameters must be set. The parameters are:

  • Mutation Probability (PrM): The expected number of genes mutating per chromosome. The central value is equivalent to 1/(number of decision variables). Two more probabilities, one above and one below the central value (1.5/(number of decision variables) and 0.5/(number of decision variables), respectively), have been set for the methods that use Simulated Binary Crossover (SMS-EMOA, MOEA/D, NSGA-II, ANSGA-III and AdaW).

  • Mutation Distribution (disM): The distribution index of polynomial mutation. This is set to the typical value of 20 for the present case study.

  • Crossover Probability (PrC): The probability of doing crossover when the Simulated Binary Crossover is used. The crossover operator has an impact on the creation of new individuals. It is set to 0.9 for the present case study.

  • Crossover Distribution (disC): This is the crossover distribution index when the Simulated Binary Crossover is used. It is 20 for the standard solvers and 30 for the recently developed algorithms (as in the original references).

  • Crossover Rate (CR): The crossover operator has the function of mixing the genetic information among chromosomes to create new individuals. In Differential Evolution, each gene is crossed (or not) depending on a probability variable referred to as the Crossover Rate. The typical value for the Crossover Rate is between 0.1 and 1.0 (Simon 2013). For the case study, the Crossover Rate parameter is set to 0.9 given that a large CR often speeds convergence (Storn and Price 1997).

  • Scale Factor (F): In Differential Evolution, the mutation operator alters the genes of a chromosome by adding a scaled difference vector, formed from two chosen chromosomes, to a third chromosome. The difference vector is scaled by the Scale Factor. The typical value for the Scale Factor is between 0.4 and 0.9 (Simon 2013). For the case study, values of 0.4, 0.5 and 0.6 are tested, as shown in Table 2 (a minimal sketch of the DE variation scheme using F and CR is given after this list).

  • Scalarizing function (Approach): The MOEA/D method decomposes a multi-objective optimization problem into different single-objective sub-problems by using a set of weight vectors and a scalarizing function. Typical scalarizing functions for MOEA/D include the weighted sum, Tchebycheff and Penalty-based Boundary Intersection (PBI). Following the results of Tanabe and Ishibuchi (2018), in which the Tchebycheff approach performed well on some two-objective problems, this approach is used for the present case study.

  • Probability of choosing parents locally \((\delta )\): A typical value that is commonly used (Li and Zhang 2009; Liu et al 2010; Zhang et al 2009) is 0.9.

  • Replacement mechanism (\(n_r\)): The replacement mechanism improves the quality of the population in terms of dominance and also maintains diversity. A high-quality offspring solution could replace most of the current solutions in its neighborhood (Liu et al 2010), which implies a decrease in diversity. The parameter \(n_r\) establishes the maximum number of solutions that can be replaced by a high-quality offspring. A proposed empirical rule (Zhang et al 2009) consists of taking \(n_r\) = 0.01\(\cdot \)N, where N is the population size (note that \(n_r\) must be an integer value).
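To illustrate how F and CR act in Differential Evolution, a minimal sketch of the common DE/rand/1/bin variation scheme is given below (it is not the exact implementation used in PlatEMO, and the population, gene bounds and parameter values are assumptions).

    import random

    def de_rand_1_bin(population, i, F=0.5, CR=0.9):
        """Create a trial vector for individual i (genes assumed bounded in [0, 1])."""
        r1, r2, r3 = random.sample(
            [k for k in range(len(population)) if k != i], 3)
        x, a, b, c = population[i], population[r1], population[r2], population[r3]
        j_rand = random.randrange(len(x))                 # at least one gene crosses
        trial = []
        for j in range(len(x)):
            if random.random() < CR or j == j_rand:
                v = a[j] + F * (b[j] - c[j])              # scaled difference vector
                trial.append(min(max(v, 0.0), 1.0))       # clip to decision bounds
            else:
                trial.append(x[j])
        return trial

    # Hypothetical population of 10 individuals with 9 genes, as in Sect. 5:
    pop = [[random.random() for _ in range(9)] for _ in range(10)]
    print(de_rand_1_bin(pop, 0))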

Table 2 Parameters for the optimization process (population sizes of 50, 100 and 150 were tested in every case)

Table 2 includes the parameters for the optimization process. Each method was executed using population sizes (N) of 50, 100 and 150 individuals. The population size plays a crucial role in maintaining the equilibrium between exploration and exploitation: excessively large populations can lead to slow convergence, whereas populations with too few individuals can lead to premature stagnation, converging to local optima (Michalewicz 1996; Goldberg 1989; González et al 2019). Nine different configurations of each of the five standard state-of-the-art methods were simulated and each configuration was executed 21 times (for statistical purposes), with a total of 10,000,000 evaluations used as the stopping criterion. Regarding the recently developed methods, nine different configurations were executed for ANSGA-III and AdaW, while three different configurations were executed for LMOCSO, since its population size is considered its main parameter.

Scale factors relative to the values of the objective functions were used in order to obtain a dispersed non-dominated front with one as the maximum value. The values were obtained through a practical approach in which the scale factors are extracted from the values of the objective functions when the optimization process starts. This approach is based on the assumption that the values of the objective functions will improve over the course of the evolutionary process. The scale factors were used as follows:

  • The scale factor used to compute the Cost was 1,700 economic units.

  • The scale factor used to compute the system unavailability was 0.003.

Finally, a two-dimensional reference point is needed to compute the Hypervolume. This point must cover the values bounded by the scale factors, which restrict the values of the objective functions to a maximum of one. The reference point was set to (2, 2). The software platform PlatEMO (Tian et al 2017) was used to optimize the case study. The open-source platform PlatEMO includes more than 160 Multi-objective Evolutionary Algorithms, more than 300 multi-objective test problems and several widely used performance indicators. In this case, the Design and Maintenance Strategy analysis software was developed and implemented into the platform to solve the problem described above.
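The following sketch illustrates how the scaled objectives and the (2, 2) reference point combine into the Hypervolume of a two-objective minimization front; the front points in the example are hypothetical, not results of the study.

    def hypervolume_2d(points, ref=(2.0, 2.0)):
        """Hypervolume of a 2-D non-dominated set (minimization), given a reference point."""
        pts = sorted(p for p in points if p[0] <= ref[0] and p[1] <= ref[1])
        hv, prev_y = 0.0, ref[1]
        for x, y in pts:                      # sweep along the first objective
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
        return hv

    # Hypothetical scaled points: (unavailability / 0.003, cost / 1700)
    front = [(0.4, 0.9), (0.6, 0.7), (0.8, 0.5)]
    print(hypervolume_2d(front))              # 2.28 for this example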

6 Results and discussion

Due to the hardness of the problem, a general-purpose computing cluster was used in the optimization process. The cluster is composed of 28 computing nodes and one access (front-end) node. Each computing node consists of two Intel Xeon E5645 Westmere-EP processors with 6 cores each and 48 GB of RAM, allowing 336 executions to be run simultaneously.

Once the results were obtained, valuable information emerged:

  1. 1.

    Information related to the computational process is given in order to show the problem hardness and computational cost. It consists of the time taken for 21 executions of the 9 configurations related to each state-of-the-art method.

  2. 2.

    Box plots are given for the distribution of Hypervolume (HV) values (Zitzler et al 2003) achieved (over twenty-one executions) after the stopping criterion is met.

  3. 3.

    The values of the main measures obtained for the final evaluation are shown. These measures are the Average, Median, Minimum, Maximum and Standard Deviation values of the Hypervolume.

  4. 4.

    In order to establish the existence of significant differences among the performance of the configurations and the optimization methods, a rigorous statistical analysis is carried out. Friedman's test allows significant differences among the obtained results to be detected and, in that case, the null hypothesis \((H_0)\) to be rejected. Generally, once differences have been detected, a post-hoc test is carried out in order to find the specific pairwise comparisons that produce such differences. The p-value represents the smallest significance level that can result in the rejection of \(H_0\); it provides information about whether a statistical hypothesis test is significant (or not) and also indicates how significant the result is: the smaller the p-value (< 0.05), the stronger the evidence against the null hypothesis. The procedure for conducting multiple comparisons followed in this paper was described by García and Herrera (2008); however, some exceptions will be applied, as will be seen below (a minimal illustrative sketch of such a test is given after this list).

  5. 5.

    The Hypervolume is computed for the accumulated best non-dominated solutions obtained (the non-dominated front) from the state-of-the-art methods. These represent the best equilibrium solutions among the objectives, and the computational procedure is described in Fonseca et al (2006).
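As a minimal sketch of the statistical test mentioned in item 4, Friedman's test can be applied to the final Hypervolume values of each configuration; the arrays below are placeholder data (in the study each sequence would contain the 21 values of the independent runs), and scipy's friedmanchisquare is one common implementation.

    from scipy.stats import friedmanchisquare

    # One sequence of final Hypervolume values per configuration (placeholders).
    hv_per_config = [
        [2.39, 2.40, 2.38, 2.41],   # e.g. configuration ID1
        [2.35, 2.37, 2.36, 2.38],   # e.g. configuration ID3
        [2.38, 2.39, 2.37, 2.40],   # e.g. configuration ID8
    ]
    stat, p_value = friedmanchisquare(*hv_per_config)
    if p_value < 0.05:              # reject H0: at least one configuration differs
        print("significant differences, proceed to post-hoc pairwise tests", p_value)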

Table 3 Configurations, identifiers relationship, Hypervolume statistics and hypothesis test (Global)

Furthermore, for the global comparison of the state-of-the-art methods, the Hypervolume (HV) average value evolution (in twenty-one executions) and the non-dominated solutions are shown.

In order to make the paper easier to read, results of the individual comparison of the tested configurations of each algorithm are supplied in a supplementary material file. The comparative analysis of the best-performing cases among the state-of-the-art algorithms is included next in Subsection 6.1, and their overall optimum design results in Subsection 6.2. Subsection 6.3 shows the comparative analysis of the best-performing cases among the state-of-the-art and the more recently developed algorithms. Subsection 6.4 presents a discussion, where Subsection 6.4.1 analyses the effect of sampling size when coupling discrete event simulation and multi-objective evolutionary algorithms, and Subsection 6.4.2 quantifies the operational cost saved when using the optimum designs.

6.1 Comparing the standard state-of-the-art methods

As supplementary material, a detailed study of the convergence of the state-of-the-art Multi-objective Evolutionary Algorithms is supplied. From that study, the configurations with the best Average Ranks according to Friedman's test for each method were selected for the global comparison. These configurations are shown in Table 3 (columns 2 and 3).

The evolution of the Hypervolume Average values with the number of evaluations is shown in Fig. 4. It can be seen that the configuration with identifier ID1 (SMS-EMOA, population of 150 individuals and mutation probability of 0.5 genes per chromosome) presents the highest Hypervolume Average value at the end of the process. Moreover, box plots of the distribution of Hypervolume values at the end of the process are shown in Fig. 5. They represent the statistical information supplied in Table 3, where it can be seen that the configuration with identifier ID1 (SMS-EMOA, population of 150 individuals and mutation probability of 0.5 genes per chromosome) presents the highest Hypervolume Average value; the configuration with identifier ID8 (NSGA-II, population of 150 individuals and mutation probability of 0.5 genes per chromosome) presents the highest Hypervolume Median value; the configuration with identifier ID9 (GDE3, population of 50 individuals and F parameter of 0.5) presents the highest Hypervolume Maximum value; and the configuration with identifier ID2 (SMS-EMOA, population of 150 individuals and mutation probability of 1.5 genes per chromosome) presents both the highest Hypervolume Minimum value and the lowest Hypervolume Standard Deviation.

Fig. 4 Hypervolume Average (21 independent executions) vs. evaluations evolution

Fig. 5 Box plots of the final Hypervolume, identifiers (ID’s) as in Table 3

In order to establish whether any one of the ten configurations performs better than any other, a statistical significance hypothesis test was conducted. The Average Ranks computed through the Friedman’s test are shown in Table 3 (column 9). It can be seen that the configuration ID1 (SMS-EMOA as optimization method, population of 150 individuals and mutation probability of 0.5 gene per chromosome) produces the lowest Average Rank. Moreover, the computed p-value (8.2594\(\cdot \)10\(^{-11}\)) implies that the null hypothesis (\(H_0\)) can be rejected (p-value < 0.05), so it can be concluded that, under the studied conditions, some configurations perform better than others. In order to find the concrete pairwise comparisons that produce the differences, a post-hoc test was run. The Shaffer’s test was used to compare the configuration ID1, which produced the lowest Average Rank in the Friedman’s test, with the rest of the configurations. The results of these comparisons are shown in Table 5. It can be seen that, under the conditions of the experiment, the configuration ID1 performs better than the configurations ID3–ID4 (MOEA/D as optimization method) and ID6 (MOEA/D-DE as optimization method, population of 150 individuals and F parameter of 0.6), but it is not possible to establish that it performs better than any of the remaining ones.

The best accumulated non-dominated solutions obtained through the last generation of the evolutionary process for all executions and all configurations of each method were used to compute their respective accumulated Hypervolume values (as described in Fonseca et al (2006)). Such values are shown in Table 4. Furthermore, the best accumulated non-dominated solutions obtained from these fronts were used to compute the whole accumulated Hypervolume, whose value was 2.4179 and is shown in Table 4. As expected, the value is higher than 2.4087, the maximum accumulated value achieved after the evolutionary process for the SMS-EMOA method.

Table 4 Hypervolume Accumulated Value
Table 5 p-values from Shaffer’s test (Global)
Fig. 6 Accumulated non-dominated front (Global)

Table 6 Optimum solutions obtained from the evolutionary process

The final results of the analysis bring to light the better performance of the methods based on Indicators (SMS-EMOA) and Non-dominance (NSGA-II or GDE3) in comparison to the methods based on Decomposition (MOEA/D or MOEA/D-DE). However, the operator which creates new individuals does not appear to have a significant effect, since methods that use Simulated Binary Crossover (SMS-EMOA or NSGA-II) presented similar performance to a method that uses Differential Evolution (GDE3).

6.2 Overall optimum design results

The non-dominated solutions to the problem provided at the end of the evolutionary process for all executions, all configurations and all methods are shown in Fig. 6. All optimum solutions belonging to the obtained non-dominated front are shown in Table 6. Unavailability (Q) is shown as a fraction, Cost is shown in economic units, and the rest of the variables represent, for the respective devices, the optimum times (in hours) at which to schedule the preventive maintenance activities.

The solution with the lowest Cost (ID1) (861.38 economic units) presents the highest unavailability (0.002898). These values are followed by periodic optimum times (hours) measured from the moment at which the system mission time starts (the time to perform the preventive maintenance activity (TR) is not included). For solution ID1, no periodic optimum times to preventive maintenance are supplied for devices P2 and V4, because this design alternative does not include such devices. The opposite case shows the highest Cost (ID19) (1,698.62 economic units) and the lowest unavailability (0.000749). For solution ID19, periodic optimum times to perform preventive maintenance activities are supplied for all devices, because this design alternative includes devices P2 and V4. The remaining optimum solutions lie between these two extremes and can be seen in Table 6. The decision makers will need to decide which design is preferable taking into account their individual requirements. Depending on the application case, the company may have a cost threshold (e.g., due to financial budget constraints); observing the attained non-dominated solutions, the design with the best unavailability within that cost could be chosen. Alternatively, the company may have an unavailability threshold (e.g., due to a legal norm); observing the attained non-dominated solutions, the design with the best cost satisfying that unavailability could be chosen.
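As an illustration of these decision rules, the following minimal sketch selects a design from a set of (Cost, unavailability) pairs under either type of threshold. The intermediate point is hypothetical, while the two extremes correspond to solutions ID1 and ID19; the function names are assumptions.

```python
# Minimal sketch of the decision rules described above, applied to a
# hypothetical non-dominated set of (cost, unavailability) pairs.
front = [(861.38, 0.002898), (1250.0, 0.0012), (1698.62, 0.000749)]  # middle point is invented

def best_under_cost(front, cost_threshold):
    """Lowest-unavailability design whose cost stays within the budget."""
    feasible = [s for s in front if s[0] <= cost_threshold]
    return min(feasible, key=lambda s: s[1]) if feasible else None

def best_under_unavailability(front, q_threshold):
    """Cheapest design whose unavailability satisfies the norm."""
    feasible = [s for s in front if s[1] <= q_threshold]
    return min(feasible, key=lambda s: s[0]) if feasible else None

print(best_under_cost(front, 1300.0))            # -> (1250.0, 0.0012)
print(best_under_unavailability(front, 0.001))   # -> (1698.62, 0.000749)
```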

Moreover, the solutions have been clustered in Fig. 7 according to their final design. Solutions are shown in ascending order of Cost, from ID1 to ID19 and from left to right respectively. Solutions contained in Cluster 1 (solutions 1 to 2, see also Table 6) are those in which no redundant devices have been included in the design; in this case, the system exclusively contains devices placed in series. These solutions present the lowest Cost and the highest unavailability. Solutions contained in Cluster 2 (solutions 3 to 6, see also Table 6) are those in which a redundant valve has been included in the design as a parallel device. These solutions present higher Cost and lower unavailability than the solutions contained in Cluster 1. Solutions contained in Cluster 3 (solutions 7 to 14, see also Table 6) are those in which a redundant pump has been included in the design as a parallel device. These solutions present higher Cost and lower unavailability than the solutions contained in Clusters 1 and 2. Finally, solutions contained in Cluster 4 (solutions 15 to 19, see also Table 6) are those in which both a redundant valve and a redundant pump have been included in the design as parallel devices. These solutions present the highest Cost and the lowest unavailability.

Fig. 7 Clustered accumulated non-dominated front and design options

6.3 Comparing the best ordered standard state-of-the-art method and recently developed methods

As supplementary material, a detailed study about the convergence of some recently developed Multi-objective Evolutionary Algorithms is supplied. From such a study, the configurations with the best Average Ranks according to the Friedman’s test for each method were selected to be globally compared. Next, the best ordered configuration obtained from the study of the state-of-the-art methods is compared to the best ordered configurations achieved by the recently developed algorithms. The relationship between the configurations of the methods (where N represents the population size and PrM the mutation probability) and the identifiers of the configurations is shown in Table 7 (columns 2 and 3). Furthermore, statistical information is supplied in that Table (the best values in bold). Such information is represented in Fig. 8.

Fig. 8 Box plots of the final Hypervolume, the best ordered state-of-the-art method and recently developed methods, identifiers (ID’s) as in Table 7

Table 7 Configurations, identifiers relationship, Hypervolume statistics and hypothesis test

In order to establish whether any one of the configurations performs better than any other, a statistical significance hypothesis test was carried out. The Average Ranks computed through the Friedman’s test are shown in Table 7 (column 9). It can be seen that the configuration ID1 (SMS-EMOA as optimization method, population of 150 individuals and mutation probability of 0.5 gene per chromosome) produces the lowest Average Rank. Moreover, the computed p-value (8.240\(\cdot \)10\(^{-11}\)) implies that the null hypothesis (\(H_0\)) can be rejected (p-value < 0.05). Therefore, it can be concluded that, under the studied conditions, some configurations perform better than others. In order to find the concrete pairwise comparisons that produce such differences, a post-hoc test was conducted. The Shaffer’s test was used to compare the configuration ID1, which produced the lowest Average Rank in the Friedman’s test, with the rest of the configurations. The results of these comparisons are shown in Table 8. It can be seen that, under the conditions of the experiment, the configuration ID1 performs better than the configurations ID6–ID7 (LMOCSO as optimization method), but it is not possible to establish that it performs better than any of the remaining ones.

Table 8 p-values from Shaffer’s test (Global)

The final results of the analysis bring to light that the state-of-the-art SMS-EMOA method is highly competitive. It shows a better performance than a method such as LMOCSO. Furthermore, it is ranked best among all the compared methods according to the Friedman’s test.

6.4 Discussion

From Sect. 6.3, the configuration with the best rank according to the Friedman’s test was found (SMS-EMOA, population size of 150 individuals, and mutation rate of 0.5 gene per chromosome). It is taken as the reference for the proposed methodology. Therefore, such a configuration is selected to extend the analysis to more demanding applications in Sect. 7. Next, a discussion is opened regarding two interesting aspects: firstly, the effect of the sampling size when discrete event simulation and multi-objective evolutionary algorithms are combined, and secondly, the quantification of the economic cost savings when the methodology is used.

6.4.1 Discrete event simulation combined with multi-objective evolutionary algorithms: the effect of sampling size

The proposed methodology executes a single discrete event simulation per individual of the population to characterize the system behavior, and the objective functions are then evaluated. Here, the effect of varying the sampling size is analyzed with an equivalent number of fitness evaluations. Summarizing the procedure: the Functionability Profile of the system was built as many times as the sample size for each individual of the population, and the objective functions (availability and operational cost) were computed the same number of times. The configuration of the case study with the best average rank from the Friedman’s test (SMS-EMOA, population size of 150 individuals, and mutation rate of 0.5 gene per chromosome, see Tables 3 and 7, column 9) was taken as the reference (referred to as ‘direct SMS-EMOA’). In addition to this case (sample size equal to 1), sample sizes of 10, 100 and 1000 for each solution evaluated by the multi-objective evolutionary algorithm were tested. Since the purpose is equivalent (to attain the best non-dominated solutions), this procedure amounts to executing multiple simulations (as many as the chosen sample size) and taking a minimal extreme value as the representative of the distribution obtained. However, in multi-objective optimization the non-dominated direction (the non-dominated lower extreme value) of each solution is not known a priori and depends on its relative position with respect to the other non-dominated solutions. Therefore, three choices of the minimal extreme value were tested: 1) minimal unavailability, 2) minimal cost, or 3) minimal equally weighted unavailability–cost (which is equivalent to the Manhattan distance of both objectives). The proposed methodology (single sample size, direct SMS-EMOA) is compared to these nine combinations (three minimal extreme values with three sample sizes each) and to a standard random search as an optimization baseline; all cases share an equivalent total stopping criterion of 10,000,000 evaluations of the fitness functions and are executed in 21 independent runs each. The set of configurations is shown in Table 9.
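A minimal sketch of the compared evaluation strategies is given below. The `simulate` placeholder stands in for the discrete event simulation of the Functionability Profile and is assumed to return objectives already scaled to comparable magnitudes; all names and values are assumptions, not the implementation of the paper.

```python
# Minimal sketch of the evaluation strategies compared in this subsection.
import random

def simulate(solution):
    # placeholder for the DES evaluation; returns scaled (unavailability, cost)
    return random.uniform(0.0, 1.0), random.uniform(0.0, 1.0)

def evaluate(solution, sample_size=1, extreme="direct"):
    samples = [simulate(solution) for _ in range(sample_size)]
    if extreme == "direct":          # proposed methodology: a single simulation
        return samples[0]
    if extreme == "unavailability":  # minimal-unavailability representative
        return min(samples, key=lambda s: s[0])
    if extreme == "cost":            # minimal-cost representative
        return min(samples, key=lambda s: s[1])
    if extreme == "weighted":        # equally weighted (Manhattan) representative
        return min(samples, key=lambda s: s[0] + s[1])
    raise ValueError(extreme)

# The direct SMS-EMOA corresponds to evaluate(sol, 1, "direct"); the alternative
# configurations use sample sizes of 10, 100 or 1000 with one of the three extremes.
```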

Table 9 Configurations, Hypervolume statistics and Friedman’s test average ranks (multiple simulations)

Box plots of the distribution of the Hypervolume values at the end of the process are shown in Fig. 9. It can be seen that, as expected, the method based on random search (configuration ID10) presents the worst performance. Moreover, it can be seen that the configuration ID11 (which uses the direct SMS-EMOA) shows the highest Hypervolume median and average values. The configuration ID7 (which looks for minimum unavailability using 1000 evaluations of the objective functions per individual) presents the highest Hypervolume maximum value, and the configuration ID1 (which looks for minimum unavailability using 10 evaluations of the objective functions per individual) supplies the highest Hypervolume minimum value. These and other measures obtained are shown in Table 9.

Fig. 9 Box plots of the final Hypervolume (multiple simulations, identifiers as in Table 9)

In order to establish whether any one of the configurations performed better than any other, a statistical significance hypothesis test was conducted. The average ranks computed through the Friedman’s test are shown in Table 9. It can be seen that the configuration ID11, which uses the direct SMS-EMOA, produced the lowest average rank. Hence, after a similar number of evaluations, the direct SMS-EMOA achieved the first position in the hypothesis test. Moreover, the computed p-value (6.6712\(\cdot \)10\(^{-11}\)) implies that the null hypothesis (\(H_0\)) can be rejected (p-value < 0.05), so it can be concluded that, under the studied conditions, some configurations perform better than others. In order to find the concrete pairwise comparisons that produce such differences, a post-hoc test was run. The Shaffer’s test was used to compare the configuration ID11 with the rest of the configurations. The results of these comparisons are shown in Table 10.

Table 10 p-values from Shaffer’s test (multiple simulations)

The hypothesis test shows that the direct SMS-EMOA achieved the best average rank according to the Friedman’s test. Moreover, it shows significant differences with respect to some of the configurations (ID7, ID8 and ID10), so a better behavior is expected when the direct SMS-EMOA is used. Next, the configurations with the best average rank for each extreme studied (minimum unavailability, minimum cost and minimum weighted unavailability–cost) were selected to compare their results against the direct SMS-EMOA. Their non-dominated solutions are shown in Fig. 10. It can be seen that the maximum Hypervolume is covered when the direct SMS-EMOA is used (with a value of 2.3832). Finally, the accumulated non-dominated front from Fig. 10 is shown in Fig. 11. As expected, there are no solutions obtained by minimizing the unavailability (marked as \(\square \)) on the left side of Fig. 11, since the solutions on that side present the best cost, which is contrary to the search direction of that strategy. Conversely, there are no solutions obtained by minimizing the cost (marked as \(\triangle \)) on the right side of Fig. 11, since the solutions on that side present the best unavailability, which is contrary to obtaining more economical solutions. It can be seen how the solutions supplied by the direct SMS-EMOA (marked as \(\times \)) are spread along the whole non-dominated front.

In summary, the results of the sampling size analysis highlight the benefits of the proposed methodology, showing the positive synergy between discrete event simulation and multi-objective evolutionary algorithms, where a single simulation per individual in the fitness function evaluation is enough to attain very competitive results.

Fig. 10 Accumulated non-dominated solutions from the best configurations (multiple simulations)

Fig. 11 Accumulated non-dominated front from the best configurations (multiple simulations)

A further analysis of the non-dominated solutions of the previous results, based on their representative averages instead of the best values attained, was conducted: for each non-dominated solution at the final generation (after 10,000,000 evaluations) of each of the 21 independent executions of the previous experiments, 10,000 discrete simulations were executed and their objective functions were computed. Then, the unavailability and cost averages were used as representative values of the distribution of each solution. In this way, for each extreme non-dominated solution achieved previously, the center of its distribution was located. These central solutions are the solutions that would be achieved by executing a Monte Carlo simulation when considering the average as the representative value of the distribution. Box plots of the distribution of the Hypervolume values for each configuration are shown in Fig. 12. Statistical information regarding this experiment is shown in Table 11. It can be seen that the configuration ID3 (which looks for minimum weighted unavailability–cost using 100 evaluations of the objective functions per individual) presents both the best median and maximum Hypervolume values.

In order to establish whether any one of the configurations performed better than any other, a statistical significance hypothesis test was conducted. The average ranks computed through the Friedman’s test are shown in Table 11. It can be seen that the configuration ID3 (which looks for minimum weighted unavailability–cost using 100 evaluations of the objective functions per individual) produced the lowest average rank, while our methodology, the configuration ID11 (direct SMS-EMOA), was ranked third out of eleven configurations. Moreover, the computed p-value (9.2629\(\cdot \)10\(^{-11}\)) implies that the null hypothesis (\(H_0\)) can be rejected (p-value < 0.05), so it can be concluded that, under the studied conditions, some configurations perform better than others. In order to find the concrete pairwise comparisons that produce such differences, a post-hoc test was run. The Shaffer’s test was used to compare our methodology, ID11 (direct SMS-EMOA), with the rest of the configurations. The results of these comparisons are shown in Table 12. It can be seen that the configuration ID11 performs better than ID7, ID8, ID9 and ID10, while no other configuration outperforms our methodology. Moreover, the hypothesis test shows no significant differences between the direct SMS-EMOA and the configuration ID3, which obtained the best results according to the Friedman’s test. However, there is no way to know a priori which values of the sampling size and of the minimal extreme direction would be the best, nor what their influence on the optimization outcome would be. For example, if we focus on each value of the sampling size, we observe that for sampling size 10 (ID1 to ID3) the best ranked case was the minimum equally weighted unavailability–cost extreme (ID3), for sampling size 100 (ID4 to ID6) the best ranked case was the cost extreme (ID4), and for sampling size 1000 (ID7 to ID9) the best ranked case was the unavailability extreme (ID8). Therefore, depending on the sampling size, any of the minimal extreme directions could have been the best option according to the results of our experiment. Hence, there are many parameters to explore in very computationally expensive processes. By contrast, our methodology is parameter-free in this respect, since it uses a single sampling size and relies on the implicit management of the non-dominated solutions by the selection operator of the multi-objective evolutionary algorithm.

Fig. 12 Box plots of the Hypervolume from simulated centers (identifiers as in Table 11)

Table 11 Configurations, Hypervolume statistics and Friedman’s test average ranks from simulated centers
Table 12 p-values from Shaffer’s test from simulated centers

In summary, as shown by the results and discussion of the case study, the proposed methodology is a computationally efficient and robust approach (it does not depend on parameters such as the number of samples or the minimal search direction) compared with Monte Carlo simulation-based approaches when facing the multi-objective reliability optimization problem handled here.

6.4.2 Quantification of the operational cost saved

In order to evaluate the cost savings attained by the proposed methodology in the case study using the direct SMS-EMOA, a comparison with a standard random search was carried out. Each individual of the population (the design, i.e., the devices involved, and the maintenance strategy) was randomly generated and the objective functions were evaluated afterwards. The total number of solutions generated was equivalent to the stopping criterion of the direct SMS-EMOA (10,000,000) in each of the 21 independent executions. The configuration of the case study with the best average rank from the Friedman’s test (SMS-EMOA, population size of 150 individuals, and mutation rate of 0.5 gene per chromosome, as seen in Table 3, column 9) was taken as the reference. In both compared cases (direct SMS-EMOA and random search), the 21 independent executions were ordered by their Hypervolume values and the median case (11th in the ordering) was taken as the reference, as shown in Fig. 13.
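A minimal sketch of such a random-search baseline is given below. The decision-variable bounds, the chromosome dimensions and the placeholder evaluation are assumptions made only for illustration; the part that reflects the procedure described above is the generation of random individuals and the update of a non-dominated archive.

```python
# Minimal sketch of the random-search baseline, under assumed bounds and dimensions.
import random

def random_individual(n_design=2, n_periods=6, t_min=500.0, t_max=8000.0):
    design = [random.randint(0, 1) for _ in range(n_design)]            # include optional devices?
    periods = [random.uniform(t_min, t_max) for _ in range(n_periods)]  # maintenance periods (h)
    return design + periods

def dominates(a, b):
    # Pareto dominance for minimization of both objectives
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def random_search(evaluate, n_evaluations=10_000):
    archive = []                                        # non-dominated (objectives, individual)
    for _ in range(n_evaluations):
        ind = random_individual()
        obj = evaluate(ind)                             # (unavailability, cost)
        if any(dominates(o, obj) for o, _ in archive):
            continue                                    # candidate is dominated, discard
        archive = [(o, i) for o, i in archive if not dominates(obj, o)]
        archive.append((obj, ind))
    return archive

# usage with a placeholder evaluation (stands in for the single-simulation DES fitness)
if __name__ == "__main__":
    archive = random_search(lambda ind: (random.uniform(0, 1), random.uniform(0, 1)), 5_000)
    print(len(archive), "non-dominated solutions kept")
```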

Fig. 13 Non-dominated fronts from the direct SMS-EMOA and random search cases (median case out of 21 independent executions)

Fig. 14 Global non-dominated front from the median direct SMS-EMOA and median random search cases (taken from Fig. 13)

Table 13 Extreme solutions taken from Fig. 13

The Hypervolume covered by the methodology (2.2977) is higher than the Hypervolume covered by the random search (2.2385); hence, the non-dominated front obtained by our methodology is better than the non-dominated front obtained by the random search. From those sets, their joint non-dominated front is shown in Fig. 14, where all solutions except a single one (obtained by chance) were attained by our methodology.

In order to quantify the benefit of using the direct SMS-EMOA, characteristic solutions identified as ID1 and ID3 (taken from the direct SMS-EMOA) and ID2 and ID4 (taken from the random search) were chosen for comparison. These solutions are shown both in Table 13 and in Fig. 13. Comparing the solutions with the best cost from Table 13 (ID1 and ID2), it can be seen that the solution ID1 (achieved by the direct SMS-EMOA) is not only more economical but also more reliable than the solution ID2 (achieved by the random search); in terms of economic cost, the solution ID1 presents an improvement of 4%. The better the unavailability, the bigger the impact of the methodology in terms of cost benefits: comparing the solutions with the best unavailability from Table 13 (ID3 and ID4), it can be seen that the solution ID3 (achieved by the direct SMS-EMOA) is again not only more economical but also more reliable than the solution ID4 (achieved by the random search), with an economic cost 10% lower. Therefore, under the conditions of the experiment, using the direct SMS-EMOA produces a positive impact not only from the economic point of view but also from the availability point of view.
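For clarity, the percentage figures above correspond to the relative cost saving, computed here (the notation is ours) as
\[
\Delta C\,(\%) = \frac{C_{\mathrm{RS}} - C_{\mathrm{SMS}}}{C_{\mathrm{RS}}}\cdot 100,
\]
where \(C_{\mathrm{SMS}}\) and \(C_{\mathrm{RS}}\) denote the costs of the compared direct SMS-EMOA and random search solutions; the pairs (ID1, ID2) and (ID3, ID4) of Table 13 yield approximately 4% and 10%, respectively.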

7 Applications

Two applications are addressed in order to demonstrate the viability of applying the methodology. Application Case A consists of an extension of the main case study obtained by adding a second branch. Application Case B consists of a system with a more complex structure and a larger number and variety of devices.

7.1 Application case A: the case study with double branch

The first application consists of an extension of the case study, which is a basic model of a containment spray injection system. In this case, a second branch is included, as shown in Fig. 15. A similar structure was analyzed in Greiner et al (2003). As in the case study, it is necessary to establish the optimum period to perform a preventive maintenance activity for the system devices and to decide whether to include redundant devices such as the pumps P2 and P9 and the valves V4 and V11 by evaluating design alternatives. Hence, the number of components of Application Case A can vary from 10 to 14 automatically as an outcome of the evolutionary algorithm search. The chromosomes are codified as [\(B_1\) \( B_2\) \( B_3\) \( B_4\) \( T_1\) \(T_2\) \( T_3\) \( T_4\) \(T_5\) \(T_6\) \(T_7\) \( T_8\) \(T_9\) \( T_{10}\) \( T_{11}\) \(T_{12}\) \(T_{13}\) \(T_{14}\)], where the presence of the redundant devices P2, V4, P9 and V11 is defined by the decision variables \(B_1\), \(B_2\), \(B_3\) and \(B_4\) respectively, and the optimum Time To Start a Preventive Maintenance activity for each device is represented by the decision variables \(T_1\) to \(T_{14}\). These have to be transformed in order to evaluate the objective functions, as described in Sect. 3. This application was executed using the configuration that presented the best behavior in the case study, as shown in Table 14.
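As an illustration, the following is a minimal sketch of how such a chromosome can be decoded into a design and a maintenance schedule. The 0.5 threshold used to interpret the design variables \(B_1\)–\(B_4\), the labels of the non-optional devices and the function names are assumptions, not the exact implementation of the paper.

```python
# Minimal sketch of decoding an Application Case A chromosome [B1 B2 B3 B4 T1 ... T14].
OPTIONAL = {"P2": 0, "V4": 1, "P9": 2, "V11": 3}    # maps optional device -> index of B1..B4
DEVICES = ["V1", "P2", "V3", "V4", "V5", "V6", "V7", "P9", "V10", "V11",
           "V12", "V13", "V14", "V15"]               # illustrative labels for T1..T14

def decode(chromosome):
    b, t = chromosome[:4], chromosome[4:18]
    included = [d for d in DEVICES
                if d not in OPTIONAL or b[OPTIONAL[d]] >= 0.5]   # assumed 0.5 threshold
    maintenance = {d: t[i] for i, d in enumerate(DEVICES) if d in included}
    return included, maintenance     # design (10-14 devices) + maintenance periods (hours)

# usage with an illustrative chromosome (4 design bits + 14 maintenance periods)
chrom = [1, 0, 1, 0] + [4000.0] * 14
design, schedule = decode(chrom)
print(len(design), "devices included")   # -> 12 (P2 and P9 included, V4 and V11 excluded)
```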

Fig. 15 Application Case A: Double branch containment spray injection system (CSIS)

Table 14 Parameters configuration for the Application Cases A and B

The scale factors applied to the values of the objective functions, used with the purpose of achieving an equally dispersed non-dominated front with one as the maximum value of each objective, were as follows (a minimal scaling sketch is given after the list):

  • The scale factor used to compute the Cost was 4,500 economic units.

  • The scale factor used to compute the system unavailability was 0.00004.
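The following minimal sketch shows how these factors normalize the objectives before the Hypervolume is computed; the constant and function names are assumptions.

```python
# Minimal sketch of the objective scaling for Application Case A (values from the list above).
COST_SCALE = 4_500.0        # economic units
Q_SCALE = 0.00004           # system unavailability

def scaled_objectives(cost, unavailability):
    return cost / COST_SCALE, unavailability / Q_SCALE   # both roughly within [0, 1]
```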

Firstly, the experiment was conducted based on the proposed methodology. Additionally, to assess the effect of the automatic selection of devices, a second problem was run in which the structural design was based on the mandatory selection of all devices. Box plots of the distribution of the Hypervolume values at the end of the process are shown in Fig. 16. It can be seen that the configuration ID1, which uses the proposed methodology, presents the best median Hypervolume value. Statistical information regarding the Hypervolume reached over the 21 independent executions is shown in Table 15. It can be seen that the configuration ID1 presents the best statistics. Moreover, the configuration ID1 presented the best average rank according to the Friedman’s test, and the p-value achieved (4.592\(\cdot \)10\(^{-6}\)) establishes that the configuration ID1 performs better than the configuration ID2. The non-dominated solutions achieved by our methodology (marked as \(\times \)) are shown in Fig. 17. The accumulated Hypervolume computed in this case over the 21 independent executions reached a value of 3.1146. Moreover, the non-dominated solutions are detailed in Table 16. It can be seen that the devices P2, V4, P9 and V11 are not included in the design. The non-dominated solutions achieved with the mandatory selection (marked as \(\circ \)) are also shown in Fig. 17. The accumulated Hypervolume computed in this case over the 21 independent executions reached a value of 3.1004. It can be seen that the Hypervolume covered by the non-dominated solutions achieved by the proposed methodology is larger than the Hypervolume covered by the non-dominated solutions achieved with the mandatory selection of all devices; in addition, the former non-dominated solutions, identified as ID1, ID2 and ID3, dominate the latter. The methodology was thus able to find optimum non-dominated solutions for the system.

Fig. 16 Box plots of Hypervolume achieved from the configurations for the Application Case A (identifiers as in Table 15)

Table 15 Configurations, Hypervolume statistics and Friedman’s test average ranks for the Application Case A
Fig. 17 Accumulated non-dominated solutions and designs for the Application Case A

Table 16 Non-dominated solutions for the Application Case A (identifiers as in Fig. 17)

Finally, Fig. 18 shows the non-dominated solutions achieved both for the case study (marked as \(\times \)) and for the Application Case A (marked as \(\triangle \)). It can be seen that both fronts are complementary. The set of solutions of the non-dominated front achieved for the case study presents a lower cost and a higher unavailability than the set of solutions of the non-dominated front achieved for Application Case A. Application Case A is more reliable (lower unavailability) and more expensive (higher cost) because its solutions are composed of two parallel branches with more components than the case study; this is also why a larger economic investment is needed to maintain the system. The decision maker should determine whether the benefit of adopting designs with better unavailability justifies the increase in economic investment.

7.2 Application case B: an extended model for the containment spray system of a nuclear power plant

Application Case B is based on an industrial case presented in Galván et al (2007). It consists of a Pressurized Water Reactor (PWR) Containment Spray System, which is designed to provide particular and different functions inside the containment of a PWR, such as the borated water injection function. In the case study, a simplified model of this system was studied. In this case, an extended model is studied, as shown in Fig. 19.

The system consists of two separated trains, each one formed by centrifugal pumps and valves to control the flow of borated water from the Refueling Water Storage Tank. The main devices are the Single Valves (V1, V2 and V9), the Motor-Driven Pumps (P4 and P5), the Motor Operated Valves (M3, M6, M7, M8 and M10) and One Way Valves (NR11 and NR12). The aim is the simultaneous optimization of the system structural design (with automatic selection of devices) and its maintenance strategy with some considerations:

  • Each position may host a maximum of three redundant devices in parallel, so the maximum number of devices is thirty-six, as shown in Fig. 20,

  • The devices V1, P4, V9, M10, NR11 and NR12 are mandatory as is shown in Fig. 21,

  • When the device M8 is not included in the design, the line is considered a tube,

  • When the device P5 is not included in the design, the device M7 cannot be included,

  • The device M3 may be included in the design when the device V2 and/or the device P5 is/are included (these design rules are sketched in code after this list),

  • As in Application Case A, to assess the effect of the automatic selection of devices, a second problem was run in which the structural design was based on the mandatory selection of a minimum of one device per position.
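A minimal sketch of how these design rules can be checked over a candidate set of included devices is given below. The function name is an assumption, the rule limiting each position to three redundant devices is assumed to be enforced by the chromosome codification itself, and a missing M8 needs no check since the line is then simply treated as a tube.

```python
# Minimal sketch of the Application Case B design rules listed above,
# expressed as a validity check over the set of included devices.
MANDATORY = {"V1", "P4", "V9", "M10", "NR11", "NR12"}

def is_valid_design(included):
    included = set(included) | MANDATORY
    if "M7" in included and "P5" not in included:
        return False                      # M7 cannot be included without P5
    if "M3" in included and not ({"V2", "P5"} & included):
        return False                      # M3 requires V2 and/or P5
    return True

print(is_valid_design({"V2", "M3", "P5", "M7"}))   # True
print(is_valid_design({"M7"}))                     # False: M7 without P5
```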

As in the case study, the chromosome codification includes the period of time to perform a preventive maintenance activity for the system devices. Moreover, it is necessary to decide whether to include redundant devices by evaluating design alternatives; the number of devices of Application Case B may vary from 6 to 36 automatically as an outcome of the evolutionary algorithm search. In this application, two types of chromosome codification are taken into account and compared:

  • Long chromosome codification: formed by 66 decision variables, 30 for the design and 36 for the maintenance strategy. Six components are mandatory, so 30 design decision variables are necessary to decide whether the remaining devices are included in the design or not. The system may contain a maximum of 36 devices, so this is the number of decision variables for the preventive maintenance strategy,

  • Short chromosome codification: formed by 48 decision variables, 12 for the design and 36 for the maintenance strategy. In this case, the 12 design decision variables are scaled to integer values, so that each may take a maximum value of three. As in the previous case, the system may contain a maximum of 36 devices, so this is the number of decision variables for the preventive maintenance strategy.

In summary, in this section we solve four problems for Application Case B: long chromosome with the proposed methodology, short chromosome with the proposed methodology, long chromosome with mandatory selection of a minimum of one device per position, and short chromosome with mandatory selection of a minimum of one device per position. A decoding sketch of the two codifications is given below.
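The following minimal sketch contrasts the two codifications for the design part of the chromosome. The assignment of mandatory devices to positions, the variable ranges and the rounding rules are assumptions made only for illustration, not the exact codification of the paper.

```python
# Minimal sketch of the two design codifications for Application Case B
# (design part only; the 36 maintenance variables are handled as before).
N_POSITIONS = 12                                   # device positions, up to 3 devices each

def decode_long(design_vars):
    # 30 binary-like variables: positions holding a mandatory device contribute
    # 2 optional slots, the remaining positions contribute 3 (illustrative layout).
    counts, i = [], 0
    mandatory_flags = [1] * 6 + [0] * 6
    slots_per_position = [2] * 6 + [3] * 6
    for mandatory, slots in zip(mandatory_flags, slots_per_position):
        extra = sum(1 for v in design_vars[i:i + slots] if v >= 0.5)
        counts.append(mandatory + extra)
        i += slots
    return counts                                  # devices per position (0..3)

def decode_short(design_vars):
    # 12 variables, each assumed real-coded in [0, 3] and rounded to a count;
    # positions holding a mandatory device are clamped to at least one device.
    mandatory_flags = [1] * 6 + [0] * 6            # illustrative layout
    return [max(m, min(3, round(v))) for v, m in zip(design_vars, mandatory_flags)]

# Either decoder returns the number of devices per position, which together with
# the 36 maintenance periods defines a candidate design for evaluation.
```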

Fig. 18 Accumulated non-dominated solutions achieved for both the case study and the Application Case A

The data used are shown in Table 17. As in Application Case A, Application Case B was executed using the evolutionary multi-objective optimization configuration that presented the best performance in the case study, as shown in Table 14. The scale factors applied to the values of the objective functions, used with the purpose of achieving an equally dispersed non-dominated front with one as the maximum value of each objective, were as follows:

  • The scale factor used to compute the Cost was 6,000 economic units,

  • The scale factor used to compute the system unavailability was 0.00083.

Fig. 19 Application Case B: Base line

Fig. 20 Application Case B: The most complex possible design

Fig. 21 Application Case B: The most simple possible design

Box plots of the distribution of the Hypervolume values at the end of the process are shown in Fig. 22. It can be seen that the configuration ID3, which uses the short chromosome and the proposed methodology, presents the best median Hypervolume value. Statistical information regarding the Hypervolume reached over the 21 independent executions is shown in Table 18. It can be seen that the configuration ID3 presents the best average, median, minimum and standard deviation Hypervolume values, whereas the configuration ID1, which uses the long chromosome and the proposed methodology, presents the best maximum Hypervolume value. Moreover, the configuration ID3 presents the best average rank according to the Friedman’s test, and the p-value achieved (7.666\(\cdot \)10\(^{-11}\)) establishes that the configuration ID3 performs better than at least one other configuration. Finally, the Shaffer’s test was used to compare the configuration ID3 with the rest of the configurations. Statistically significant differences were found with respect to the configurations ID2 and ID4 (both with mandatory selection of devices), as shown in Table 19. Therefore, under the studied conditions, it can be seen that the best performance is achieved when the proposed methodology is used. Furthermore, the best rank according to the Friedman’s test is achieved when only one decision variable per position is used for the system design (the short chromosome).

Table 17 Data set for the Application Case B

The non-dominated solutions achieved when the long chromosome and our methodology are used (marked as \(\times \)) are shown in Fig. 23. The accumulated Hypervolume computed over the 21 independent executions is 3.7809. The non-dominated solutions achieved when the long chromosome and the mandatory selection of a minimum of one device per position are used (marked as \(\circ \)) cover an accumulated Hypervolume of 3.5196. The non-dominated solutions achieved when the short chromosome and the proposed methodology are used (marked as \(\square \)) cover an accumulated Hypervolume of 3.7702. Finally, the non-dominated solutions achieved when the short chromosome and the mandatory selection of a minimum of one device per position are used (marked as \(\triangle \)) cover an accumulated Hypervolume of 3.5230. In Fig. 24, the non-dominated solutions belonging to the joint non-dominated front of all configurations are extracted. The details of these solutions are shown in Table 20. Solutions L1, L2 and L3 were supplied when the long chromosome and the proposed methodology were used. Solutions S1, S2 and S3 were supplied when the short chromosome and the proposed methodology were used. Finally, the solution SM1 was achieved when using the short chromosome and the mandatory selection of devices. Each solution presents its cost, unavailability and periodic times to start a preventive maintenance activity for each device included in the design, as shown in Table 20. The design alternatives are shown in Figs. 25, 26, 27, 28, 29 and 30. It can be seen that solution L1 (the least expensive design) corresponds to the simplest design, as shown in Fig. 21. On the contrary, solution SM1, whose design is shown in Fig. 30, is the most reliable solution, with an unavailability equal to 0.0. This is due to the fact that the system was kept in the operating state throughout the mission time in the discrete simulation that describes its behavior. All four tested methods were able to achieve best solutions with the same unavailability value, as shown in the bottom right part of Fig. 23, although with slight differences in the cost attained.

Fig. 22 Box plots of Hypervolume achieved from the configurations for the Application Case B (identifiers as in Table 18)

Table 18 Configurations, Hypervolume statistics and Friedman’s test average rank for the Application Case B
Table 19 p-values from Shaffer’s test from Application Case B
Fig. 23 Accumulated non-dominated solutions achieved for the Application Case B

7.3 Discussion

The application of the proposed methodology to both application cases demonstrates that its generalization and scalability are viable. It has been possible to extend the methodology to more complex systems, and balanced unavailability–cost solutions could be found with automatic selection of the system devices. It was also interesting to compare the proposed methodology with the cases with mandatory selection of devices.

Fig. 24 Global non-dominated front achieved for the Application Case B (taken from Fig. 23)

In Application Case A, the proposed methodology avoids choosing designs with single devices located in parallel. Once the non-dominated solutions from the mandatory selection of devices were obtained, it could be seen that the non-dominated solutions achieved by the proposed methodology dominated them. Hence, solutions with single devices located in parallel are not optimal solutions; they were rejected by the Multi-objective Evolutionary Algorithm during the evolutionary process.

Table 20 Non-dominated front solutions for the Application Case B (identifiers as in Fig. 24)

Regarding Application Case B, two chromosome codifications (short and long) were explored. No statistically significant differences were found regarding the length of the chromosome; however, the short chromosome was ranked first by the Friedman’s test. It can be seen that solutions from both codifications belong to the accumulated non-dominated front. Nevertheless, generally speaking, solutions with better cost and worse unavailability were achieved with the long chromosome, whereas solutions with worse cost and better unavailability were achieved with the short chromosome. Therefore, depending on the features of the solutions to be achieved, one or the other codification could be preferable for solving complex problems with the proposed methodology. In any case, the proposed methodology with either of the two tested chromosome codifications was able to attain a set of non-dominated solutions distributed along the whole front, as shown in Figs. 22 (Hypervolume distributions) and 23.

The proposed methodology is a powerful tool for decision makers (e.g., a chief engineer or a company manager) in order to plan systems with simultaneously optimum maintenance cost and unavailability and automatic selection of system components. It makes it possible to attain an optimum set of non-dominated solutions with minimum cost and minimum unavailability: it can be seen as the set of solutions where, for each value of cost, the best unavailability is shown, or alternatively, the set of solutions where, for each value of unavailability, the best value of cost is shown.

8 Conclusions

In this paper, the coupling of Multi-objective Evolutionary Algorithms and Discrete Event Simulation is used to tackle simultaneously the optimization of systems design (based on a process of automatic selection of the structure of components and redundant devices) and of their maintenance strategy (based on the implementation of periodic preventive maintenance activities), while addressing the conflict between availability and operational cost. Coupling these techniques had previously been used to explore the problems separately but not simultaneously, when both the corrective and the preventive maintenance, the latter consisting in achieving the optimum period of time at which to carry out a preventive maintenance activity, are taken into account. The Multi-objective Evolutionary Algorithm gave rise to a population of individuals, each encoding one design alternative and one preventive maintenance strategy. Each individual represented a possible solution to the problem, which was then used to build and evaluate the system Functionability Profile through Discrete Event Simulation. The individuals evolved generation after generation until reaching the stopping criterion. This process was applied to a technical system in a case study in which the performance of five state-of-the-art Multi-objective Evolutionary Algorithms (SMS-EMOA, MOEA/D, MOEA/D-DE, NSGA-II and GDE3) was compared and a set of optimum non-dominated solutions was obtained. In conclusion, the use of Multi-objective Evolutionary Algorithms and Discrete Event Simulation to address the joint optimization of systems design and their maintenance strategy provides availability–cost balanced solutions to real-world problems where data based on field experience are used. Moreover, for the solved test problem, among the state-of-the-art Multi-objective Evolutionary Algorithms, the best performance lies in the use of methods based on the Hypervolume indicator (SMS-EMOA) and on the Pareto dominance relation (NSGA-II and GDE3) rather than methods based on Decomposition (MOEA/D and MOEA/D-DE). However, the operator used to create new individuals does not appear to have a relevant effect, since methods that use Simulated Binary Crossover (SMS-EMOA and NSGA-II) presented similar performance to a method that uses Differential Evolution (GDE3); the same holds for MOEA/D versus MOEA/D-DE. Furthermore, the SMS-EMOA method was ranked best according to the Friedman’s test. Next, the case study was solved by using some recently developed Multi-objective Evolutionary Algorithms, namely ANSGA-III, AdaW and LMOCSO, and their performances were compared with that of the SMS-EMOA method. Again, SMS-EMOA was ranked better than the rest and achieved a better performance than the LMOCSO method.

Fig. 25 L1 optimum design (Cost = 628.00 - unavailability = 1.5068\(\cdot \)10\(^{-4}\))

Fig. 26 L2 optimum design (Cost = 676.00 - unavailability = 1.4844\(\cdot \)10\(^{-4}\))

Fig. 27 S1 optimum design (Cost = 755.00 - unavailability = 1.4611\(\cdot \)10\(^{-4}\))

Fig. 28 L3 and S3 optimum designs (Cost = 798.00 - unavailability = 4.5662\(\cdot \)10\(^{-5}\), Cost = 1111.00 - unavailability = 2.7397\(\cdot \)10\(^{-5}\), respectively)

Fig. 29 S2 optimum design (Cost = 978.00 - unavailability = 3.1963\(\cdot \)10\(^{-5}\))

Fig. 30 SM1 optimum design (Cost = 1431.00 - unavailability = 0.0000)

Once the case study was solved, the best ranked configuration according to the Friedman’s test was identified and a discussion was opened. On the one hand, the effect of the sampling size and of its minimal extreme direction was analyzed; the results highlight the benefits of the proposed methodology, showing the positive synergy between discrete event simulation and multi-objective evolutionary algorithms, where a single simulation per individual in the fitness function evaluation is enough to attain very competitive results. These results are confirmed by an analysis based on the average values of both objective functions for each non-dominated solution of each compared configuration. The proposed methodology is a computationally efficient and robust approach (it does not depend on parameters such as the number of samples or the minimal search direction) compared with Monte Carlo simulation-based approaches when facing the multi-objective reliability optimization problem handled here. On the other hand, the economic benefits of using the methodology to determine the optimum structural design of the system and its maintenance strategy were quantified for the case study, lying in the estimated interval of 4-10%.

Finally, the scalability and generalization of the proposed methodology were demonstrated by applying it to two more complex applications, where the problems were satisfactorily solved and insights about a proper chromosome codification of the design components were obtained from the executed experiments.

In the future, it is proposed to extend the analysis to more complex problems in the reliability field. Other scenarios could be studied, such as the deterioration state space, imperfect repairs or dependencies between devices. Moreover, deeper research related to the type of decision variables in the chromosome (binary, integer, real) is proposed, in order to study their influence on the convergence of the search.