1 Introduction

Downtime is the accumulated amount of time during which an asset is out of action or unavailable for use. Keeping asset downtime under 10% is required to achieve world-class standards (Ong et al. 2020). Unplanned downtime due to asset failures can be very expensive: the average cost of unplanned downtime is estimated at $260,000 per hour across all industries (Arsenault 2016). Moreover, the downtime cost in terms of revenue lost per hour varies by industry; for example, it is $2.1 million for pharmaceuticals and reaches $4.6 million for the telecommunications industry (Hicks 2019).

Developing proper maintenance strategies is essential to reduce both the unplanned downtime and the operational costs of assets. Maintenance strategies are broadly classified into two categories, namely corrective maintenance and preventive maintenance (Rahmati et al. 2018). Corrective maintenance, also called breakdown maintenance, is performed after a failure and aims to restore the failed asset to its operational status. Because corrective maintenance is a reactive strategy, it can lead to high asset downtime and maintenance costs if implemented improperly (Salari and Makis 2020). In this paper, we seek a cost-effective implementation of corrective maintenance by simultaneously optimizing different (but interconnected) planning decisions.

Among these planning decisions, spare part inventory decisions are of paramount importance since the unavailability of spare parts is the major reason for asset downtime (Turan et al. 2020a; Kosanoglu et al. 2018). Holding a sufficient number of spare parts in inventory can reduce asset downtime. However, many spare parts are expensive, and it is economically more feasible to repair the failed parts/components of assets rather than replace them with spares (Samouei et al. 2015; Levner et al. 2011). A failed part that can be repaired to an as-good-as-new condition is called a “repairable” part. In this paper, we optimize the number of repairable spare parts to keep in inventory.

When a repairable part installed in an asset fails, the defective part is replaced with a new one from the spare parts inventory (if one is available) and the failed part is shipped to the maintenance/repair facility. In the maintenance facility, the failed part is repaired by the skilled maintenance workforce and then placed in the spare parts inventory to be used in case of another asset failure (Samouei et al. 2015; Turan et al. 2020b). Clearly, the maintenance workforce capacity (i.e., the number of technicians and their skills) in the repair facility directly affects the number of spare parts to keep in inventory. In other words, a larger skilled workforce leads to faster repairs of failed parts and shorter throughput times in the maintenance facility, so the same asset availability level can be achieved with fewer spare parts. Thus, we include maintenance workforce capacity optimization in our problem as the second planning decision.

It is generally assumed that all workers in the maintenance facility have the same skill set and can repair all types of failed parts in the system (i.e., full cross-training) (Sleptchenko et al. 2019). Nevertheless, limited workforce flexibility with appropriate cross-training can offer most of the benefits of full cross-training (Jordan and Graves 1995). Moreover, if the repair tasks are complex and varied, utilizing fully cross-trained maintenance workers may not be economical due to the limited learning capacity of workers and the associated high training costs (Wang and Tang 2020). As our third maintenance planning decision, we aim at optimizing the skill assignment of the maintenance workforce. That is, we determine the skill sets, in which each worker can repair only a subset of all part types, so as to minimize downtime at minimum cost.

We contribute to the literature by addressing all the above-listed planning decisions in a single joint optimization model. Solving the modeled joint problem and finding (near) optimal maintenance plans are nontrivial tasks due to the challenges associated with (i) the size of the decision space to search (i.e., the number of feasible plans, including the amount of spare stock to keep in inventory for each part type, the number of maintenance workers to utilize in the maintenance facility, and the number of possible assignments of skills to workers), and (ii) the stochastic nature of the system (e.g., random failures of parts installed in assets and random repair times).

To handle challenge (i), we enhance a well-known meta-heuristic algorithm to search the solution space more effectively. In this direction, we couple a Double Deep Q-Network based Deep Reinforcement Learning (DRL) algorithm with a Simulated Annealing (SA) algorithm. In this coupling, unlike traditional SA algorithms where neighborhood structures are selected only randomly, DRL learns to choose the best neighborhood structure based on experience gained from previous episodes and delivers the selected neighborhood structure to SA for use in subsequent iterations. To overcome challenge (ii), we model the maintenance facility as a queuing network. However, as the model's size grows (i.e., the number of spare part types and the number of workers), finding closed-form solutions for the modeled queuing network becomes computationally demanding. We alleviate this difficulty by dividing the maintenance facility into smaller, independent repair cells, where each repair cell is responsible for repairing a subset of all part types with its own cross-trained workforce.

The remainder of this paper is organized as follows. In Sect. 2, we review the relevant literature on deep reinforcement learning in maintenance problems. We also briefly discuss how deep reinforcement learning has been integrated with different optimization algorithms to solve hard optimization problems. In Sect. 3, the details of the problem are provided and the mathematical model is introduced. Next, we present the details of the solution algorithm in Sect. 4. We list and briefly discuss the benchmark algorithms and methods in Sect. 5. In Sect. 6, we outline the computational experiment design and present both computational and managerial results. Lastly, we discuss conclusions and future research directions in Sect. 7.

2 Literature Review

Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) have gained popularity in solving hard combinatorial optimization problems. In this section, we review the RL-DRL in combinatorial optimization and present a summarized literature review of the works in maintenance that employ RL-DRL as a solution approach in Table 1.

RL has shown promising results in solving complex combinatorial optimization problems. The integration of RL into optimization has followed two different patterns (Mazyavkina et al. 2020). In the first stream, the decision-maker exploits RL methods to directly construct a partial or complete solution without any off-the-shelf solver (end-to-end learning) (Bengio et al. 2020). In this setting, RL has been employed to tackle many combinatorial optimization problems, such as the Travelling Salesman Problem (TSP) (Bello et al. 2017; Kool et al. 2019; Emami and Ranka 2018; Ma et al. 2019; Deudon et al. 2018), the Vehicle Routing Problem (VRP) (Nazari et al. 2018; Kool et al. 2019), and Bin Packing Problems (BPP) (Hu et al. 2017; Duan et al. 2019). Further examples of RL applications to optimization include traffic flow optimization (Walraven et al. 2016), pricing strategy optimization (Krasheninnikova and García 2019), bioprocess optimization (Petsagkourakis et al. 2020), and the orienteering problem (Gama and Fernandes 2020).

Table 1 A summarized review of the literature on reinforcement-deep reinforcement learning in maintenance

The second pattern utilizes RL methods to improve the solution abilities of existing solvers. In this approach, the optimization algorithm can call RL once to initialize some parameter values, or call it repeatedly as the optimization algorithm iterates (Bengio et al. 2020). In this framework, RL methods leverage the power of solvers or problem-specific heuristics by initializing the values of some hyper-parameters. For example, RL can be utilized to select the branching variable in MIP solvers (Etheve et al. 2020; Hottung et al. 2020; Tang et al. 2020). Recent studies by Ma et al. (2019), Deudon et al. (2018), and Chen and Tian (2019) show that optimization heuristics powered with RL methods outperform previous methods.

Deep reinforcement learning (DRL) is one of the most intriguing areas of machine learning, combining reinforcement learning and deep learning. In particular, Deep Neural Networks (DNNs) allow reinforcement learning to be applied to complicated problems thanks to their ability to learn different levels of representation from data (François-Lavet et al. 2018). The benefits DRL has brought to fields such as robotics and computer games have also been harnessed to solve complex decision-making problems. Recent works use DRL to solve some of the most prominent complex optimization problems, such as resource management (Mao et al. 2016), job scheduling (Chen et al. 2017; Liu et al. 2020; Liang et al. 2020), the VRP (Nazari et al. 2018; Lin et al. 2020; Zhao et al. 2020; Yu et al. 2019), and production scheduling (Waschneck et al. 2018b, a; Hubbs et al. 2020).

Maintenance planning (e.g., frequencies of corrective and preventive maintenance and condition-based maintenance policies) combined with spare parts inventory management is well studied in the literature with various optimization models. Most of those studies employ off-the-shelf optimization software, heuristics, and well-known meta-heuristics (e.g., SA, Genetic Algorithms, Variable Neighborhood Search). Besides these methods, DRL is a promising new approach to maintenance planning. Some recent studies employ RL to solve maintenance planning problems in various sectors. Huang et al. (2020) propose a preventive maintenance (PM) model in the Markov Decision Process (MDP) framework and solve the resulting problem with a DRL algorithm. Andriotis and Papakonstantinou (2019) study a PM and inspection model for large multi-component engineering systems. They model a sequential decision-making problem in the MDP framework and propose a DRL algorithm capable of solving problems with massive state spaces.

Zhang et al. (2019) study the prediction of equipment's remaining useful life, an essential indicator in maintenance planning. They model this problem as Health Indicator Learning (HIL), which learns a health curve describing the equipment's health condition over time. Instead of conventional methods (manual inspection or physical modeling) for addressing HIL, they propose a data-driven model that employs DRL. Hoong Ong et al. (2020) utilize sensor data to evaluate the equipment health condition and obtain an optimal maintenance policy with a DRL approach. Similarly, Skordilis and Moghaddass (2020) employ sensor data in a real-time, DRL-based decision-making framework for system maintenance. Other recent works use DRL to solve PM planning problems in different domains such as energy (Rocchetta et al. 2019), infrastructure (Wei et al. 2020; Yao et al. 2020), aero-engines (Li et al. 2019), and cybersecurity (Allen et al. 2018).

As IoT-based condition monitoring becomes routine, data-intensive condition-based maintenance (CBM) planning has gained increasing attention in recent years. Recent studies show that exploiting the benefits of RL techniques (such as learning from historical and online data) in CBM planning can yield promising results. Zhang and Si (2020) propose a CBM planning model for multi-component systems that can use equipment conditions at each inspection directly in decision-making, without maintenance thresholds. Instead of computationally inefficient and intractable conventional threshold methods, they employ a DRL-based method. Mahmoodzadeh et al. (2020) study an RL-based CBM planning model for corrosion-related maintenance management of dry gas pipelines.

Our review of RL-DRL in maintenance optimization reveals a gap in research on DRL-based solution approaches to corrective maintenance planning. This study contributes to maintenance planning optimization by developing a hybrid solution algorithm that combines Deep Reinforcement Learning (DRL) and a Simulated Annealing (SA) algorithm.

3 Problem description and formulation

The studied maintenance planning problem consists of (i) a set of assets containing components/parts that are subject to failure, (ii) a maintenance facility divided into repair cells, (iii) repair cells, each with a different number of cross-trained maintenance workers holding different skill sets, and (iv) a spare parts stock point where multiple types of repairable spare parts, i.e., stock keeping units (SKUs), are kept in inventory. Fig. 1 provides a high-level visualization of the studied problem.

Fig. 1 The modeled maintenance planning problem

Tables 2 and 3 list the problem parameters and decision variables used to formulate the mathematical model, respectively. We introduce the remaining parameters when they are needed.

Table 2 The list of problem parameters
Table 3 The list of decision variables

We assume that assets contain N distinct types of parts (SKUs), and each SKU is subject to random failure throughout its lifetime with a constant rate \(\lambda _n \ (n=1,\ldots ,N)\). We assume that the failure rate of the part is independent of its age (i.e., how long it has been in use). Further, we model the failure behavior of SKUs with exponential probability distributions with parameter \(\lambda _n \ (n=1,\ldots ,N)\), which is a well-known assumption in repairable spare part inventory models (Sherbrooke 1968; Muckstadt 2005). We also assume the failure of an SKU is independent of other SKUs installed on the same asset.

When an in-use part fails, two things happen simultaneously: (i) an order is immediately placed for a new (i.e., ready-for-use) part of the same type to be supplied from the spare stocks at the maintenance facility, and (ii) the failed part is sent to the maintenance facility, as shown in Fig. 1.

The adopted spare inventory replenishment method mentioned in (i) corresponds to \((I_n-1, I_n)\) policy. In this policy, the base stock inventory amount is equal to \(I_n\) for SKU type \(n \ (n=1,\ldots ,N)\) and after each part failure, an order for a replacement is initiated for the same type of SKU. This is a well-known assumption used in repairable spare parts inventory systems (Sherbrooke 1986; Muckstadt 1973).

The failed part is shipped to the maintenance facility and assigned to a repair cell \(k \ (k=1,\ldots ,K)\). We assume that each part type \(n \ (n=1,\ldots ,N)\) can be repaired at only one repair cell and that each repair cell k can repair at least one type of SKU. Further, we assume that each repair cell k contains \(z_k\) cross-trained workers (where \(z_k \ge 1\)) who can repair any type of SKU assigned to their repair cell.

The failed parts that can be repaired in cell k form a single queue to be repaired by one of the maintenance workers in the cell (since all workers in cell k can repair all types of SKUs assigned to that cell). We use the first come first served (FCFS) queuing discipline (i.e., there is no priority between failed parts), and the first available worker in the cell repairs the failed part. The repair time of failed part type \(n \ (n=1,\ldots ,N)\) follows an exponential distribution with parameter \(\mu _n \ (n=1,\ldots ,N)\). The exponential distribution is often applied to maintenance tasks where repair completion times are independent of previous maintenance operations and durations of repair operations have high variability (Turan et al. 2020a). Moreover, simulation studies in the literature show that maintenance systems are often insensitive to the repair time distributions (Sleptchenko and van der Heijden 2016; Sleptchenko et al. 2018). We also assume that the repair time is independent of the maintenance worker performing the repair, given that the worker has the skill to repair part type n.

The objective of the model is to minimize the total cost \({{\mathcal {T}}}{{\mathcal {C}}}\) by finding (near) optimal settings of the cross-training policy \({\mathbf {X}}\), the number of maintenance workers to allocate to each repair cell \({\mathbf {Z}}\), and the number of spare parts to keep in inventory for each part type \({\mathbf {I}}\). The objective of the studied maintenance planning problem is given in Eq. (1).

$$\begin{aligned} {{\mathcal {T}}}{{\mathcal {C}}} \left[ {\mathbf {X}}, {\mathbf {Z}}, {\mathbf {I}} \right] = \min _{{\mathbf {X}},\, {\mathbf {Z}},\, {\mathbf {I}}} \ \sum _{n=1}^N h_n I_n + \sum _{k=1}^K \alpha z_k + \sum _{k=1}^K z_k \left( \sum _{n=1}^N \beta _n x_{n,k} \right) + b \sum _{n=1}^N \mathbb {EBO}_n\left[ {\mathbf {X}}, {\mathbf {Z}}, I_n \right] \end{aligned}$$
(1)

The first summation term in Eq. (1) corresponds to the total inventory holding cost. This cost is calculated based on the initial stock level \(I_n\) for each part, which is equal to the inventory position because of the implemented inventory replenishment policy \((I_n-1, I_n)\) (Turan et al. 2018; Sleptchenko et al. 2019). The second cost term reflects the base-level salary paid to the workforce. In addition to the base-level salary, a cross-trained worker (who is able to repair one or more types of failed parts) receives a compensation payment depending on the number of skills the worker has. The compensation payment also includes the one-time training cost for workers. The third summation term in the objective function captures these costs.

When an order to replace a failed part cannot be immediately met from the spare stocks at the maintenance facility, a backorder/backlog for the failed part occurs, and the asset goes down until a new part is supplied (either from the maintenance facility after the repair is completed or from an external part supplier). To reflect this, we calculate the expected number of backorders for SKU type n, denoted by \(\mathbb {EBO}_n\left[ \cdot \right] \), as a function of the decision variables \({\mathbf {X}},\ {\mathbf {Z}}\), and \({\mathbf {I}}\). The total number of backorders (the last summation term in the objective function) represents the number of assets that are down, and we multiply this value by the downtime cost per unit time per part on backorder b to obtain the total backorder cost (Turan et al. 2020c).
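For concreteness, the sketch below evaluates Eq. (1) for a candidate solution. It assumes that the expected backorders \(\mathbb {EBO}_n\) have already been obtained from the queuing evaluation described later in Sect. 4.5; all names are illustrative.

```python
import numpy as np

def total_cost(X, Z, I, h, alpha, beta, b, ebo):
    """Total cost of Eq. (1) for a candidate solution (X, Z, I).

    h, beta, ebo : per-SKU holding cost, cross-training cost, and expected backorders
    alpha, b     : base salary per worker and downtime cost per backordered part
    `ebo` is assumed to come from the queuing evaluation of the repair cells.
    """
    X, Z, I = np.asarray(X), np.asarray(Z), np.asarray(I)
    h, beta, ebo = np.asarray(h), np.asarray(beta), np.asarray(ebo)

    holding   = np.dot(h, I)              # sum_n h_n * I_n
    salaries  = alpha * np.sum(Z)         # sum_k alpha * z_k
    training  = np.sum(Z * (X.T @ beta))  # sum_k z_k * sum_n beta_n * x_{n,k}
    backorder = b * np.sum(ebo)           # b * sum_n EBO_n
    return holding + salaries + training + backorder
```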

The decision variables have to satisfy several constraints to form a feasible solution to the modeled problem. First, the number of repair cells K in the maintenance facility cannot be less than one or more than N (in addition to being a positive integer \(K \in {\mathbb {Z}}^+\)). When the number of repair cells is equal to one, the maintenance facility is said to be fully flexible/fully cross-trained, and in this type of facility each worker can fix all types of failures. In contrast, when the number of repair cells is equal to the number of distinct part types N, the facility is called fully dedicated; that is, each worker can repair only one type of part. Second, a feasible cross-training policy \({\mathbf {X}}\) has to be a member of the set defined in Eq. (2). More specifically, each SKU type has to be repaired at exactly one repair cell and each repair cell should be able to repair at least one SKU type.

$$\begin{aligned} {\mathbf {X}} \in \left\{ \begin{array}{ll} \sum _{k=1}^K x_{n,k}=1, & n=1,\ldots ,N \\ \sum _{n=1}^N x_{n,k} \ge 1, & k=1,\ldots ,K \\ x_{n,k} \in \{0,1\}, & n=1,\ldots ,N, \ k=1,\ldots ,K \\ 1 \le K \le N & \text {and}\ K\in {\mathbb {Z}}^+ \end{array} \right\} \end{aligned}$$
(2)

Each repair cell k has to contain at least one worker \((z_k \ge 1, \ k=1,\ldots ,K)\) with \(z_k \in {\mathbb {Z}}^+\). Further, the utilization of each repair cell \(k \ (k=1,\ldots ,K)\) has to be less than one in order to prevent the number of failed parts waiting to be served in each cell from growing to infinity as time progresses. To achieve this, a sufficient number of maintenance workers has to be allocated to repair cell k such that Eq. (3) holds. Lastly, the amount of spare inventory \(I_n\) kept in stock for SKU type n has to be a non-negative integer \((I_n \in \{0\} \cup {\mathbb {Z}}^+, \ n=1,\ldots ,N)\).

$$\begin{aligned} z_k \ge \begin{cases} \sum _{n=1}^N x_{n,k} \dfrac{\lambda _n}{\mu _n}+1, & \text {if } \sum _{n=1}^N x_{n,k} \dfrac{\lambda _n}{\mu _n} \text { is integer}\\ \left\lceil \sum _{n=1}^N x_{n,k} \dfrac{\lambda _n}{\mu _n} \right\rceil , & \text {otherwise} \end{cases} \quad k=1,\ldots ,K \end{aligned}$$
(3)

The number of feasible cross-training policies \({\mathbf {X}}\) satisfying the set constraints in Eq. (2) for a given number of SKUs N and number of repair cells K equals the number of ways to partition a set of N distinct objects into K non-empty subsets. The latter is a well-known combinatorial quantity given by the Stirling number of the second kind (or Stirling partition number), denoted S(N, K) and calculated as follows:

$$\begin{aligned} S(N,K)= \dfrac{1}{K!} \sum _{i=0}^{K} (-1)^i \left( {\begin{array}{c}K\\ i\end{array}}\right) (K-i)^N, \ K=1,\ldots ,N \end{aligned}$$

Since K can take integer values between 1 and N, the exact total number of feasible cross-training policies NCP(N) can be calculated as a function of N as follows:

$$\begin{aligned} NCP(N)=\sum _{K=1}^N S(N,K) \end{aligned}$$
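As an illustration, the following sketch computes S(N, K) and NCP(N) directly from the two formulas above (function names are ours):

```python
from math import comb, factorial

def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind S(n, k): the number of ways to
    partition n distinct SKUs into k non-empty repair cells."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** n for i in range(k + 1)) // factorial(k)

def ncp(n: int) -> int:
    """Total number of feasible cross-training policies: NCP(n) = sum_{K=1}^{n} S(n, K)."""
    return sum(stirling2(n, k) for k in range(1, n + 1))

# The decision space explodes quickly with the number of SKUs:
for n in (5, 10, 20):
    print(n, ncp(n))   # NCP(10) = 115,975 and NCP(20) is roughly 5.17e13
```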

Figure 2 shows how NCP(N) grows on a logarithmic scale, illustrating the exponential increase in the size of the decision space and the complexity of the problem. The decision space gets even larger when the decisions associated with the number of maintenance workers in each repair cell \(z_k\) and the spare inventory levels for each SKU type \(I_n\) are considered. Therefore, finding the (near) optimal cross-training policy \({\mathbf {X}}\) with traditional optimization methods or brute-force enumeration is impossible or at least cumbersome. To mitigate this, we enhance an SA algorithm with DRL to systematically examine promising policies and efficiently search the described decision space.

Fig. 2 The growth in the number of feasible cross-training policies (\({\mathbf {X}}\)) as a function of N and K

4 Solution approach

In this section, we discuss the details of the solution approach. The approach contains three subroutines: a Simulated Annealing (SA) meta-heuristic, a Double Deep Q-Network-based Deep Reinforcement Learning (DRL) algorithm, and a simple greedy heuristic that uses a queuing approximation. These subroutines work collaboratively and iteratively to seek (near) optimal solutions for the cross-training policy \({\mathbf {X}}\), the number of maintenance workers to allocate to each repair cell \({\mathbf {Z}}\), and the number of spare parts to keep in inventory for each part type \({\mathbf {I}}\), as shown in Fig. 3.

Fig. 3 Flow chart of the \(\mathcal {DRLSA}\) Algorithm

We use a one-dimensional array encoding scheme to represent solutions, as shown in Fig. 4. The first part of the array shows the cross-training policy \({\mathbf {X}}\). The length of this part (i.e., the number of cells) is equal to the number of part types (SKUs) N in the model; the illustrative representation in Fig. 4 has five SKUs. In this part of the array, the number in each cell indicates the repair cell to which that SKU is assigned. In the example, SKUs 1, 2, and 5 are assigned to repair cell 1.

Fig. 4 An illustrative solution representation for a maintenance facility with 5 SKUs and 2 repair cells

The second part of the array corresponds to the number of maintenance workers allocated to each repair cell \({\mathbf {Z}}\). The length of this part is dynamic and depends on the number of distinct values used in the first part of the array. In the example, the number of repair cells K is two; the first repair cell contains two workers and the second contains three. The last part of the array represents the number of spare parts to keep in inventory for each part type \({\mathbf {I}}\), and its length is again equal to the number of part types (SKUs) N. The number in each cell denotes how many spares should be kept in stock for that SKU type. For example, for SKUs 1 and 5 no spare stock is required; i.e., \(I_1=I_5=0\). The solution corresponding to Fig. 4 is shown in Eq. (4).

$$\begin{aligned} {\mathbf {X}}= \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad {\mathbf {Z}}=\left[ 2,3 \right] , \quad {\mathbf {I}}=\left[ 0,1,1,5,0 \right] \end{aligned}$$
(4)
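The following sketch (with hypothetical names) shows how such a one-dimensional array can be decoded back into \({\mathbf {X}}\), \({\mathbf {Z}}\), and \({\mathbf {I}}\), assuming the cell labels in the first part run from 1 to K:

```python
import numpy as np

def decode_solution(sol, n_skus):
    """Decode the one-dimensional solution array into (X, Z, I)."""
    cells = sol[:n_skus]                 # repair cell assigned to each SKU
    k = len(set(cells))                  # number of distinct repair cells K
    z = sol[n_skus:n_skus + k]           # workers allocated to each cell
    inv = sol[n_skus + k:]               # base stock level for each SKU

    # Binary cross-training matrix X (N x K): x[n, c-1] = 1 if SKU n is repaired in cell c
    x = np.zeros((n_skus, k), dtype=int)
    for n, c in enumerate(cells):
        x[n, c - 1] = 1
    return x, np.array(z), np.array(inv)

# The illustrative solution of Fig. 4 / Eq. (4):
sol = [1, 1, 2, 2, 1,   2, 3,   0, 1, 1, 5, 0]
X, Z, I = decode_solution(sol, n_skus=5)
print(X)   # rows for SKUs 1..5, columns for repair cells 1..2
print(Z)   # [2 3]
print(I)   # [0 1 1 5 0]
```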

In the remainder of this section, we first discuss the key components of the reinforcement learning algorithm in Sect. 4.1. Section 4.2 explains why deep reinforcement learning is used. In Sect. 4.3, we give the details of the SA algorithm used, and in Sect. 4.4 we describe how we combine the ideas of DRL and SA. In Sect. 4.5, we discuss the details of the greedy heuristic that optimizes \({\mathbf {Z}}\) and \({\mathbf {I}}\). Lastly, in Sect. 4.6, we list the values of all parameters used throughout the solution approach.

4.1 Reinforcement Learning (RL)

Reinforcement learning (RL) is a type of machine learning technique concerning how intelligent agents should take actions in an environment to maximize the cumulative reward (Hu et al. 2020). In a basic reinforcement algorithm, a decision-maker (also called an agent) is located in environment E. The agent sequentially interacts with the environment over time. At time step t, the agent gets the representation of the environment, which is called a state and denoted as \(s_t \in S\), where S is the set of all possible states. Observing the state \(s_t\), the agent chooses an action denoted as \(a_t \in A\) to take under a policy \(\pi \), where A denotes the set of all possible actions. Once the agent takes the action \(a_t\), it gets a reward \(r_{t+1}\) and the environment transitions to a new state \(s_{t+1}\). The goal of the agent under the policy \(\pi \) is to maximize its expected return G, while trying to reach its goal state as formulated in Eq. (5).

$$\begin{aligned} G_t^{\pi } = \sum _{k=0}^{{\mathcal {T}}} \gamma ^{k}r_{t+k+1} \end{aligned}$$
(5)

where \({\mathcal {T}}\) is the number of steps required to reach the goal state from \(s_t\) and \(\gamma \) is the discount factor, whose value lies in [0, 1] and indicates how much to discount future rewards. Lower values of \(\gamma \) discount future rewards heavily, while a value of 1 gives equal importance to all future rewards.
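As a minimal illustration of Eq. (5), the following sketch computes the discounted return of a finite reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_{k=0}^{T} gamma^k * r_{t+k+1} for a finite episode (Eq. 5)."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# With gamma close to 1, late rewards keep almost full weight;
# with a small gamma they are heavily discounted.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
```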

4.2 Deep Reinforcement Learning (DRL)

Deep reinforcement learning is the combination of reinforcement learning (RL) and deep learning. The term deep learning implies that an artificial neural network with more than one hidden layer is used while performing learning. Unfortunately, RL exhibits several limitations in practice when deployed in high-dimensional and complex stochastic domains, mainly manifesting algorithmic instabilities with solutions that significantly diverge from optimal regions, or exhibiting slow value updates at infrequently visited states (Andriotis and Papakonstantinou 2018). In such scenarios, deep learning methods are incorporated to address these issues. Deep neural networks are strong function approximators. To benefit from their approximation capabilities in reinforcement learning, especially when the state-action space is very complex, they are used as policy function approximators, as in Huang et al. (2020).

4.3 Simulated Annealing (SA)

The simulated annealing algorithm, initially introduced by Kirkpatrick (1984), is one of the most popular and robust meta-heuristics for solving hard combinatorial optimization problems (Suman and Kumar 2006), such as the Traveling Salesman Problem (TSP) (Kirkpatrick et al. 1983) and the Quadratic Assignment Problem (QAP) (Connolly 1990). SA uses a stochastic approach to avoid local optimum traps. In particular, SA explores neighborhood solutions and allows a move to a neighboring solution even if that neighbor is worse than the current one. In the analogy between SA and an optimization procedure, the physical material states correspond to problem solutions, the energy of a state to the cost of a solution, and the temperature to a control parameter (Du and Swamy 2016). The flowchart of the algorithm can be found in the SA part of Fig. 3, highlighted in red. The algorithm starts with an initial feasible solution. At each iteration, SA generates a neighborhood solution and evaluates its objective function value. If the resulting solution is better than the current solution, the new solution is accepted unconditionally. Otherwise, the neighborhood solution is accepted with probability \(e^{(-\Delta E/T)}\), where T denotes the current temperature and \(\Delta E\) denotes the difference between the objective function values of the current and neighborhood solutions. Thus, the probability of accepting a worse solution is larger at high temperatures. The temperature parameter is of crucial importance in the SA procedure since it controls the acceptance probability. This parameter gradually decreases at each iteration; therefore, the probability of accepting uphill moves (accepting a worse solution) is large at high T and low at low T.
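A minimal sketch of the Metropolis acceptance rule described above is given below; the cooling schedule and objective function are omitted, and the numbers are purely illustrative:

```python
import math
import random

def sa_accept(delta_e: float, temperature: float) -> bool:
    """Metropolis acceptance rule: always accept improvements,
    accept worse neighbors with probability exp(-delta_E / T)."""
    if delta_e <= 0:          # neighbor is at least as good as the current solution
        return True
    return random.random() < math.exp(-delta_e / temperature)

# Cooling: the same uphill move is accepted often at high T and rarely at low T.
for t in (1000.0, 10.0, 0.1):
    accepted = sum(sa_accept(5.0, t) for _ in range(10_000)) / 10_000
    print(f"T={t:>7}: empirical acceptance of a +5 move is about {accepted:.3f}")
```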

4.4 \(\mathcal {DRLSA}\)

Our algorithm, denoted as \(\mathcal {DRLSA}\), is a combination of Deep Reinforcement Learning (DRL) and Simulated Annealing (SA) algorithms.

4.4.1 State Space

In \(\mathcal {DRLSA}\), a state \(s_t\) represents the cross-training policy \({\mathbf {X}}\) introduced in Sect. 3, and the finite state space S is the set of all values \({\mathbf {X}}\) can take.

4.4.2 Action Space

The finite action space A is composed of the following three actions. The first action, \(a^1\), selects two random positions in the solution vector and swaps the elements at these positions. The second action, \(a^2\), selects one element of the solution vector at random and changes its value. Similarly, the third action, \(a^3\), selects two elements of the solution vector at random and changes their values.
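A minimal sketch of these three neighborhood moves is given below; it assumes the solution vector stores the repair cell index of each SKU and that new values are drawn uniformly from an illustrative admissible range:

```python
import random

def swap_two(sol):
    """Action a1: swap the elements at two random positions."""
    s = sol.copy()
    i, j = random.sample(range(len(s)), 2)
    s[i], s[j] = s[j], s[i]
    return s

def change_one(sol, max_cell):
    """Action a2: re-draw the value of one random element."""
    s = sol.copy()
    i = random.randrange(len(s))
    s[i] = random.randint(1, max_cell)
    return s

def change_two(sol, max_cell):
    """Action a3: re-draw the values of two random elements."""
    s = sol.copy()
    for i in random.sample(range(len(s)), 2):
        s[i] = random.randint(1, max_cell)
    return s

# Example: apply each neighborhood move to the SKU-to-cell part of a solution.
sol = [1, 1, 2, 2, 1]
print(swap_two(sol), change_one(sol, max_cell=5), change_two(sol, max_cell=5))
```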

4.4.3 Flow of the approach

In our implementation of the algorithm, both DRL and SA benefit from each other's feedback, as shown in Fig. 3. DRL starts with a random initial state \(s_0\), runs for a number of time steps, tries to learn which actions to take depending on the current state by updating its internal parameters, and passes the best state \(s_{best}\) it encounters during its training to SA. SA then uses \(s_{best}\) as its initial solution, runs for several iterations, and passes its best solution \(s_{best}\) back to DRL, completing one episode. The combined algorithm runs for a number of episodes, and the final \(s_{best}\) value and its corresponding cost are reported as the final solution. The pseudo-code of the algorithm can be found in Algorithm 1, which in turn calls Algorithm 2.

Algorithm 1, Algorithm 2 (pseudo-code figures)
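The following high-level sketch (ours, not the authors' exact pseudo-code) illustrates how the two stages alternate; drl_agent.run, sa_search, and cost are placeholders for the DRL stage, the SA stage, and the objective evaluation, respectively:

```python
def drlsa(initial_state, n_episodes, drl_steps, sa_iters, cost, drl_agent, sa_search):
    """Alternate between the DRL and SA stages, exchanging the best-found state (cf. Fig. 3)."""
    s_best = initial_state
    best_cost = cost(s_best)
    for _ in range(n_episodes):
        # DRL stage: choose neighborhood actions for `drl_steps` steps, learn from the
        # shared replay buffer, and return the best state visited together with its cost.
        s_drl, c_drl = drl_agent.run(s_best, drl_steps)
        if c_drl < best_cost:
            s_best, best_cost = s_drl, c_drl
        # SA stage: start from the current best state, run `sa_iters` iterations, and
        # push every (s, a, r, s') experience into the DRL agent's replay buffer.
        s_sa, c_sa = sa_search(s_best, sa_iters, replay_buffer=drl_agent.replay_buffer)
        if c_sa < best_cost:
            s_best, best_cost = s_sa, c_sa
    return s_best, best_cost
```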

4.4.4 Double Deep Q-Network

The type of RL algorithm we use is Q-Learning (Watkins and Dayan 1992). Q-learning aims at learning the optimal action-value functions (also known as Q-value functions or Q-functions) to derive the optimal maintenance management policy (Mahmoodzadeh et al. 2020). A Q-value, \(Q(s_t,a_t)\), denotes how good it is to take action \(a_t\) from state \(s_t\). Traditional Q-learning uses a tabular structure (Q-table) to map state-action pairs to Q-values. Because of the complexity of the state-action space in our problem, a Q-table-based solution is not feasible. Therefore, we adopt the Double Deep Q-Network described in Huang et al. (2020), which is derived from the basic deep Q-Network of Mnih et al. (2015), to approximate the Q-function. As depicted in Fig. 5, the first deep Q-network, denoted as DQN and used as a policy network, takes a state \(s_t\) as input and produces Q-values for each possible action.

Fig. 5 Deep Q-network with L hidden layers, D input units, and C output units, where C denotes the number of possible actions. The \(l^{\text {th}}\) hidden layer contains \(h^{(l)}\) hidden units. State \(s_t\) is fed as input to the network, and Q-values for each action \(a^i\), \(Q(s_t,a^i)\), are predicted as output

The optimal Q-function \(Q^*\) is formulated as in Eq. (6) which uses the Bellman optimality equation as an iterative update process to estimate the optimal Q-value for a given state-action pair \((s_t,a_t)\).

$$\begin{aligned} Q^{*}(s_t,a_t) = {\mathbb {E}}\Big [r_{t+1} + \gamma \max _{a_{t+1} \in A} Q^*(s_{t+1}, a_{t+1}) \,\Big |\, s_t,a_t \Big ] \end{aligned}$$
(6)

The equation states that, for a given state-action pair \((s_t,a_t)\) at time t, the expected return of starting from state \(s_t\), selecting action \(a_t\), and following the optimal policy thereafter is the expected reward \(r_{t+1}\) obtained from taking action \(a_t\) in state \(s_t\), plus the maximum expected discounted return that can be achieved from any possible next state-action pair \((s_{t+1}, a_{t+1})\).

The reward value \(r_{t+1}\) for \(\mathcal {DRLSA}\) is defined as the difference between the total cost of the current state \(obj(s_t)\), and the total cost of the next state \(obj(s_{t+1})\). If the following state yields a lower cost value than the current state, the reward will be positive.

While selecting an action (selectAction() in Algorithm 1), we use an epsilon (\(\varepsilon \))-greedy strategy. The exploration rate \(\varepsilon \) is initially set to 1, which means the agent explores the environment rather than exploiting it. As the agent learns more about the environment, \(\varepsilon \) decays exponentially by some factor; the agent explores the environment with probability \(\varepsilon \) and exploits it with probability \(1-\varepsilon \). Exploring the environment corresponds to choosing a random action, while exploiting it means selecting the action with the maximum Q-value obtained from the DQN. We use a second Q-network, called the target network and denoted as \(DQN_{target}\), to calculate the \(Q^*(s_{t+1}, a_{t+1})\) part of Eq. (6). This type of architecture is called a Double Deep Q-Network (DDQN) and leads to less overestimation of Q-values, as well as improved stability and hence improved performance (Hu et al. 2020).
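A minimal sketch of the \(\varepsilon \)-greedy selection with exponential decay is shown below, assuming a PyTorch DQN whose input is a 1-D state tensor; the decay constants are illustrative hyper-parameters, not the values used in the paper:

```python
import math
import random
import torch

def select_action(dqn, state, step, n_actions,
                  eps_start=1.0, eps_end=0.01, eps_decay=1000):
    """Epsilon-greedy action selection with an exponentially decaying exploration rate."""
    eps = eps_end + (eps_start - eps_end) * math.exp(-step / eps_decay)
    if random.random() < eps:
        return random.randrange(n_actions)          # explore: random neighborhood move
    with torch.no_grad():
        q_values = dqn(state.unsqueeze(0))          # exploit: greedy w.r.t. predicted Q
        return int(q_values.argmax(dim=1).item())
```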

The structure of \(DQN_{target}\) is the same as that of DQN, and they initially share the same parameters. The only difference is that, during training, the parameters of \(DQN_{target}\) are not updated for a predefined number of episodes, \(\eta \), while the parameters of DQN are updated at each iteration. After \(\eta \) episodes, the parameters of \(DQN_{target}\) are set to the parameter values of DQN. Equation (7) reformulates the optimal Q-value in terms of the two networks.

$$\begin{aligned} Q^{*}(s_t,a_t) = r_{t+1} + \gamma \max _{a_{t+1} \in A} DQN_{target}(s_{t+1}, a_{t+1}) \end{aligned}$$
(7)

Since the Q-values produced by the network are real-valued numbers, the Mean Squared Error (MSE) loss function is used to train the network. It is based on the error between the approximated optimal Q-value, \(Q^{*}(s_t,a_t)\), and the Q-value predicted for the current state-action pair, as formulated in Eq. (8).

$$\begin{aligned} L(s_t,a_t) = Q^{*}(s_t,a_t) - DQN(s_t,a_t) \end{aligned}$$
(8)

To break the sequential correlation in the learning algorithm, we incorporate the Experience Replay technique, which has been shown to improve the performance of Q-networks. After each action is taken, the state, action, reward, and next state quadruple \((s_t,a_t,r_{t+1},s_{t+1})\), called an experience and denoted as \(\xi \), is stored in a queue named the \(Replay\ Buffer\ (RB)\). During each episode, both DRL and SA populate the RB with their experiences. At each iteration of the network training, if there are enough samples in the buffer, a number of experiences are sampled from the RB and fed as input to the network to optimize the parameters of DQN.
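The sketch below illustrates one such training iteration under the assumptions that experiences store states as PyTorch tensors and that the target is computed as in Eq. (7); it is a generic DDQN-style update, not the authors' exact implementation:

```python
import random
from collections import deque

import torch
import torch.nn as nn

def ddqn_training_step(dqn, dqn_target, replay_buffer: deque, optimizer,
                       batch_size=32, gamma=0.99):
    """One optimization step over a minibatch sampled from the replay buffer (Eqs. 6-8)."""
    if len(replay_buffer) < batch_size:
        return None                              # not enough experiences yet

    batch = random.sample(replay_buffer, batch_size)
    states = torch.stack([e[0] for e in batch])
    actions = torch.tensor([e[1] for e in batch])
    rewards = torch.tensor([e[2] for e in batch], dtype=torch.float32)
    next_states = torch.stack([e[3] for e in batch])

    # Current Q-values for the actions actually taken: DQN(s_t, a_t)
    q_current = dqn(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target Q-values from the frozen target network, following Eq. (7)
    with torch.no_grad():
        q_next = dqn_target(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next

    loss = nn.functional.mse_loss(q_current, q_target)   # squared error over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```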

The two important pieces of feedback that DRL receives from SA are the next state values and the experiences. Every time SA performs an iteration, it stores its experience \(\xi \) in DRL's RB so that DRL can use those experiences as a guide to update its parameters and learn to predict which actions to take in upcoming episodes.

4.5 Optimizing workforce and spare part inventories for repair cells

We use a simple greedy heuristic to find the optimal number of workers to allocate to each repair cell \(z_k \ (k=1,\ldots ,K)\) and the base stock inventory levels \(I_n\) for each SKU type \(n \ (n=1,\ldots ,N)\), for a given cross-training policy \({\mathbf {X}}\) produced by \(\mathcal {DRLSA}\), as shown in Fig. 3.

The greedy heuristic consists of two collaboratively acting subroutines. At the initial step, the workforce optimizer subroutine allocates the minimum number of workers required to each repair cell k by using:

$$\begin{aligned} z_k = \begin{cases} \sum _{n=1}^N x_{n,k} \dfrac{\lambda _n}{\mu _n}+1, & \text {if } \sum _{n=1}^N x_{n,k} \dfrac{\lambda _n}{\mu _n} \text { is integer}\\ \left\lceil \sum _{n=1}^N x_{n,k} \dfrac{\lambda _n}{\mu _n} \right\rceil , & \text {otherwise} \end{cases} \quad k=1,\ldots ,K \end{aligned}$$
(9)

which ensures the feasibility of Eq. (3).

Since repair cells are independent of each other, based on the discussion provided by Turan et al. (2020b) and Sleptchenko et al. (2019), we treat each repair cell k as a Markovian (due to the exponential failure and service time distributions) multi-class multi-server queuing system \(M/M/z_k\), where servers correspond to maintenance workers in repair cell k and classes correspond to the set of SKU types assigned to cell k. The analysis of the resulting queuing systems provides the expected number of backorders \(\mathbb {EBO}_n\left[ \cdot \right] \) for each SKU type \(n \ (n=1,\ldots ,N)\), which can be used to optimize \(I_n\) (see Turan et al. (2020c) for details). However, evaluating the expected number of backorders \(\mathbb {EBO}_n\left[ \cdot \right] \) for larger repair cells (with a high number of workers \(z_k\) and part types N) is not computationally feasible, so we use the aggregation-based approximation proposed by Van Harten and Sleptchenko (2003).

In each iteration of the greedy heuristic, the number of workers \(z_k\) in repair cell k is increased by one to capture the trade-off between adding an additional worker to a cell and decreasing the inventories of spares allocated to that cell. Iterations continue until employing an additional worker is no longer economical; i.e., the increased workforce does not lead to any further spare inventory reduction.
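A sketch of this greedy procedure for a single repair cell is given below; spare_cost is a placeholder for the queuing-based evaluation of the cell's spare holding and backorder costs, and the stopping rule compares total cell costs:

```python
import math

def min_workers(cell_skus, lam, mu):
    """Minimum number of workers for a repair cell (Eq. 9): utilization strictly below one."""
    load = sum(lam[n] / mu[n] for n in cell_skus)      # offered load of the cell
    return int(load) + 1 if float(load).is_integer() else math.ceil(load)

def optimize_cell(cell_skus, lam, mu, alpha, spare_cost):
    """Greedy workforce/inventory trade-off for one repair cell.

    `spare_cost(cell_skus, z)` stands in for the queuing evaluation that returns the
    optimal spare holding plus backorder cost of the cell given z workers. Workers are
    added one at a time as long as the total cell cost keeps decreasing."""
    z = min_workers(cell_skus, lam, mu)
    total = alpha * z + spare_cost(cell_skus, z)
    while True:
        candidate = alpha * (z + 1) + spare_cost(cell_skus, z + 1)
        if candidate >= total:      # an extra worker is no longer economical
            return z, total
        z, total = z + 1, candidate
```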

4.6 Algorithm parameters and their values

The DRL parameters, the number of episodes (nEps) and the number of time steps (\(\tau \)) in Table 4, are chosen by a grid search over the values 100, 200, 300, 500, 1000, and 2500 for nEps and 32, 64, 128, 256, 512, and 1024 for \(\tau \). The values that yield reasonable running times and close-to-optimal solutions are reported. The \(\gamma \) is chosen to be high so as not to discount future rewards severely. Table 5 lists the parameters used for the deep Q-network. To control the stochastic behavior of the network and keep the training (which relies heavily on matrix calculations) efficient, we use a \(batch\_size\) of 32. The choice of the \(\eta \) value is also critical to ensure that the target network catches up with the policy network. The common approach is to choose a value between 0.1 and 0.00001 (Wu et al. 2019); we adopt the setting recommended by Kandel and Castelli (2020). The number of input units D for DQN depends on the number of SKUs (N). Since the number of input units is at most 20 (see Sect. 6.1), 2 hidden layers were enough to produce accurate results. The number of hidden units \(h^{(l)}\) at the \(l^{th}\) layer depends on D and is chosen using a grid search over possible values. The number of output units C for the network is equal to the number of available actions. Table 6 provides the important parameter values used for the SA algorithm. To update the parameters of the networks, we use the Adam optimizer; we refer the reader to Kingma and Ba (2017) for its details.

Table 4 DRL parameters
Table 5 DQN parameters
Table 6 SA parameters

5 Benchmark methods

We use two different sets of benchmark algorithms to compare against \(\mathcal {DRLSA}\). The first set includes three well-known meta-heuristic algorithms, namely a Simulated Annealing (SA), a Genetic Algorithm (GA), and a Variable Neighborhood Search (VNS). The benchmark SA algorithm is that of Kosanoglu et al. (2018), which is an improved version of the basic SA algorithm used within this study. The details of the benchmark GA and VNS algorithms, together with their parameter settings, can be found in Turan et al. (2020b).

In the second set of benchmarks, we consider a machine learning-based (ML-based) clustering algorithm with four variants (K-Median1, K-Median2, K-Median3, K-Median4). We use the K-Median clustering procedure given in Algorithm 3. The variants differ in the clustering features they employ; we use different properties of parts (e.g., failure rates \(\lambda _n\), holding costs \(h_n\), and repair rates \(\mu _n\)) as clustering features. Table 7 summarizes the variants, and pseudo-codes for each variant are provided in Algorithms 4, 5, 6, and 7.

Algorithm 3 (pseudo-code figure)
Table 7 ML-based clustering algorithms variants

6 Computational Study

To evaluate the performance of the proposed \(\mathcal {DRLSA}\) approach, we conduct a detailed numerical analysis. Particularly, in Sect. 6.1, we define the experiment testbed employed in our analysis. Section 6.2 explores the solution quality of the approach by comparing it with total enumeration results. Then, Sect. 6.3 presents a detailed run time and convergence analysis. Next, in Sect. 6.4, we compare the total system cost achieved by the proposed \(\mathcal {DRLSA}\) algorithm with the costs achieved by benchmark algorithms. Finally, in Sect. 6.5, we investigate capacity usage and cross-training policies.

6.1 Testbed

We utilize the same testbed of instances as in Turan et al. (2020c) to investigate the effectiveness of the proposed \(\mathcal {DRLSA}\) method. In this testbed, a full factorial design of experiment (DoE) with seven factors and two levels per factor is used as shown in Table 8. We test the proposed algorithm and all benchmarks with a total of 128 instances. In this testbed, initial workers M denotes the minimum number of workers that have to be utilized in the maintenance facility. Further, we use two different patterns, namely completely random and independent (IND) and hyperbolically related (HPB) to generate the holding costs \(h_n\). The HPB pattern reflects the cases in which expensive repairables are repaired less frequently. For the explicit definitions and derivation of test instances and factors, we refer the reader to the work of Turan et al. (2020c).

Table 8 Problem parameter variants for testbed

6.2 Optimality gaps and optimal solution behavior

In this section, we assess the solution quality of the proposed \(\mathcal {DRLSA}\) algorithm. We employ the testbed given in Sect. 6.1 to investigate the gap between \(\mathcal {DRLSA}\) solutions and the optimal solutions. In order to create relatively small test problems that can be solved by a brute-force (total enumeration) approach, we randomly choose four, five, six, and seven SKUs from each case generated in the previous section. First, we investigate the gap between the minimum cost achieved by total enumeration (i.e., the optimal cost) and by the \(\mathcal {DRLSA}\) algorithm. The \(\mathcal {DRLSA}\) algorithm achieves the optimal cost in all 512 \((4\times 128)\) tested cases. We further investigate the behavior of the optimal solutions. In particular, we inspect how the optimal number of repair cells (clusters) K differs as the number of SKUs N changes. We present the number of repair cells for each solved case in Fig. 6. We observe that the optimal number of repair cells never reaches the number of SKUs N (i.e., a fully dedicated design). Our results also indicate that the maintenance facility is mostly divided into two repair cells regardless of the number of SKUs N.

Fig. 6 The number of repair cells K for each optimally solved case

6.3 Runtime and convergence analysis

In this subsection, we comment on the convergence and runtime of \(\mathcal {DRLSA}\) and of its SA and RL stages. We implemented all algorithms discussed in this paper in the Python programming language and ran them on a desktop computer with a 4-core 3.60 GHz CPU and 8 GB RAM.

Table 9 shows the effect of the problem factors on runtime. Under all problem factors, we observe that most of the runtime (nearly 85%) is spent in the SA stages rather than the RL stages. It should also be noted that, on average, the runtime of \(\mathcal {DRLSA}\) nearly doubles when the cross-training cost (\(\beta _n\)) is reduced from 0.10 to 0.01. Another noticeable runtime fluctuation occurs when the minimum holding cost (\(h_{min}\)) is increased from 1 to 100.

Table 9 Runtime analysis for each stage of \(\mathcal {DRLSA}\)

For illustrative purposes, the convergence of the total cost \({{\mathcal {T}}}{{\mathcal {C}}}\) is shown for a single case in Fig. 7. In this particular case, \(\mathcal {DRLSA}\) starts with a sharp decrease in total cost in the early episodes (until around episode 10) and continues to converge to the best solution until around episode 260. Figure 7 also presents how the cost converges within a single episode across the RL and SA stages. For the case presented, the SA stage clearly does not improve the solution after iteration 600, which may indicate that the initial temperature \(T_o\) could be reduced without affecting solution quality.

Fig. 7 The convergence behaviour of \(\mathcal {DRLSA}\) for a single case

Fig. 8 Pair-wise performance comparisons of \(\mathcal {DRLSA}\) to benchmark algorithms

6.4 Performance comparisons

In this section, the performance of the proposed algorithm is evaluated by comparing it with a set of benchmark algorithms described in Sect. 5. We present a detailed analysis of cost reductions achieved by the proposed \(\mathcal {DRLSA}\) algorithm and factors affecting the performance.

We first assess how \(\mathcal {DRLSA}\) performs in cost compared to the benchmark algorithms. Figure 8 presents pair-wise performance comparisons of \(\mathcal {DRLSA}\) to the benchmark algorithms. In particular, we present the number of cases in which \(\mathcal {DRLSA}\) outperforms, underperforms, and performs the same as the benchmark algorithms in achieving the minimum cost. In most of the cases, the \(\mathcal {DRLSA}\) algorithm is superior to the benchmark algorithms. The second best performing algorithm is SA, which outperforms \(\mathcal {DRLSA}\) in 35 out of 128 cases, while \(\mathcal {DRLSA}\) is superior in 54 out of 128 cases. Table 10 presents the number of cases in which each algorithm achieves the lowest cost under each of the problem factors; for a given case, multiple algorithms may achieve the minimum cost. \(\mathcal {DRLSA}\) is the best performing algorithm, achieving the minimum cost in 55 out of 128 cases, followed by SA with 39 out of 128 cases. The GA achieves the lowest total cost in 29 of 128 cases, the VNS in 20 of 128 cases, and the K-Median1, K-Median2, K-Median3, and K-Median4 algorithms in 15, 11, 16, and 13 of 128 cases, respectively. We also observe that the fully flexible design achieves the lowest total cost in 11 of 128 cases, while dedicated designs never achieve the minimum cost. We further observe that the GA and VNS algorithms are extremely sensitive to the number of SKUs N, while SA and \(\mathcal {DRLSA}\) are relatively less sensitive. Another notable result is that an increase in the workforce cost/base-level salary (\(\alpha \)) improves the performance of \(\mathcal {DRLSA}\) and VNS, worsens that of GA, and does not affect SA. In general, SA is the least sensitive algorithm to the problem factors.

Table 10 The number of cases in which each algorithm achieves the lowest cost under each of the problem factors

We next investigate how the total cost achieved by \(\mathcal {DRLSA}\) compares to the benchmark algorithms. Figure 9 presents the total cost of \(\mathcal {DRLSA}\) minus the total cost of the benchmark algorithm. The minimum costs achieved by SA and VNS are relatively close to \(\mathcal {DRLSA}\).

In Fig. 10a and b, we present the cost reductions achieved by each algorithm in comparison to the dedicated and fully flexible designs, respectively. The cost reduction amounts are sensitive to the problem factors for both the dedicated and the fully flexible design. In particular, all algorithms attain the highest cost reduction compared to the dedicated design when N is large (20), M is small (5), the workforce cost is large (\(100h_{max}\)), and the cross-training cost is small (\(0.01\alpha \)). Presumably, when N is large, the dedicated design results in lower utilization of workers. Similarly, when the workforce cost is higher, the dedicated design results in a larger cost due to its extensive number of workers. When M is small, the additional workforce demanded by the dedicated design increases the cost. Finally, when the cross-training cost is small, cross-training workers is more cost-effective than having many maintenance workers dedicated to a single SKU type. On the other hand, the cost reduction relative to the fully flexible design is more sensitive to cross-training costs due to the increase in training cost (compensation payment) of workers for each SKU. We can also remark that the cost reductions of SA, VNS, GA, and \(\mathcal {DRLSA}\) are substantially larger than those of the ML-based algorithms.

We also evaluate the performance of \(\mathcal {DRLSA}\) for larger problem instances with 50 and 100 SKUs. We compare the objective function values achieved by \(\mathcal {DRLSA}\) and the ML-based algorithms. We observe that \(\mathcal {DRLSA}\) achieves the best cost reduction in comparison to the fully flexible design, followed by K-Median3, K-Median2, K-Median1, and K-Median4, in the 50-SKU case. Nevertheless, in the 100-SKU case, K-Median3 achieves the best cost reduction, followed by \(\mathcal {DRLSA}\), K-Median2, K-Median1, and K-Median4. The detailed results are presented in Table 11. The performance of the \(\mathcal {DRLSA}\) algorithm on larger SKU numbers (larger search spaces) may be improved by increasing the iteration and episode numbers. Furthermore, the proposed \(\mathcal {DRLSA}\) algorithm provides solutions for the larger maintenance planning problems within an acceptable time.

Fig. 9 Pair-wise total cost comparison of \(\mathcal {DRLSA}\) and benchmark algorithms

Fig. 10 Average cost reductions in comparison with dedicated and fully flexible designs

Table 11 The achieved percentage cost reductions in larger instances in comparison to fully flexible design

6.5 Capacity usage and cross-training analysis

This section investigates how repair cell formation and cross-training depend on the problem parameters and solution algorithms. In Table 12, we present the average number of workers utilized per case, the average number of repair cells in the optimized solution, and the average number of skills per worker. Interestingly, the number of repair cells formed by the ML-based algorithms is extremely sensitive to the cross-training cost factor levels, whereas SA, VNS, GA, and \(\mathcal {DRLSA}\) are relatively less sensitive. Specifically, when the cross-training cost is increased from 0.01\(\alpha \) to 0.1\(\alpha \), the average number of repair cells nearly doubles. We also observe that the number of repair cells formed by SA, VNS, GA, and \(\mathcal {DRLSA}\) is more sensitive to the number of SKUs N (on average a 75\(\%\) increase in the average number of repair cells when the number of SKUs is increased from 10 to 20). Surprisingly, the number of repair cells formed by K-Median1 and K-Median4 decreases as the number of SKUs increases, while a slight increase is observed for K-Median2 and K-Median3.

Table 12 Average # of repair cells, average # of workers per cell, and average # of skills per worker under each of the problem factors

Another question we explore is how much cross-training is enough under various problem factors. Figure 11 shows the average cross-training percentages achieved by each algorithm for each problem factor. Our results demonstrate that the average cross-training percentages of the ML-based algorithms are larger than those of the other algorithms. Although the differences are not large, the cross-training percentages of the other algorithms are ordered \(\mathcal {DRLSA}\), SA, VNS, and GA from largest to smallest for almost all problem factors. The average cross-training percentage is highly sensitive to the cross-training costs \(\beta _n\) for all algorithms. More specifically, the average cross-training percentages of the ML-based algorithms triple as the cross-training cost increases from 0.01\(\alpha \) to 0.1\(\alpha \), whereas the cross-training percentages of the other algorithms nearly double. The number of SKUs N is another essential problem factor affecting the average cross-training percentage. In particular, when the number of SKUs is increased from 10 to 20, the average cross-training percentage is halved for \(\mathcal {DRLSA}\) and the other meta-heuristics. However, an increase is observed for K-Median1, K-Median2, and K-Median4, and a decrease for K-Median3, as the number of SKUs increases from 10 to 20. This result arguably arises from the clustering features of the algorithms: K-Median3 uses the failure rate of the in-use part (\(\lambda _n\)) as a clustering feature, while the other variants use the repair rate of the failed part (\(\mu _n\)), the inventory holding cost (\(h_n\)), or both.

Fig. 11 The average cross-training percentages obtained by each algorithm for each problem factor

7 Conclusions and Future Research

The design of effective maintenance plans is crucial to ensure high asset availability at minimum cost. In this paper, we develop a heuristic algorithm inspired by DRL to solve a joint maintenance planning problem by taking into account several decisions simultaneously, including workforce planning, workforce training, and spare parts inventory management. This work is an initial attempt to improve solution algorithms for corrective maintenance problems using DRL.

The computational results show that the proposed solution algorithm is promising for the defined problem. In particular, the proposed \(\mathcal {DRLSA}\) algorithm obtains a better objective function value (total cost) than well-known meta-heuristic algorithms (SA, GA, VNS) and machine learning-based clustering algorithms (K-Median1, K-Median2, K-Median3, and K-Median4). Our work reveals a great potential in the application of machine learning to optimization; in particular, machine learning-enhanced optimization algorithms may provide relatively better solutions. We note that the performance of the algorithms is sensitive to the problem parameters, and the parameters that affect performance may differ for each algorithm. For example, an increase in the workforce cost/base-level salary (\(\alpha \)) improves the performance of \(\mathcal {DRLSA}\) and VNS, worsens that of GA, but does not affect SA.

The studied joint maintenance planning problem has a few major limitations. First, it is assumed that the failure rates of parts are constant throughout their lifetime; a more realistic model would consider the aging of in-use parts and adjust the failure rates as a function of time. Second, the model is analyzed under static policies (e.g., routing of failed parts in the maintenance facility) to ensure the computational tractability of the queuing approximations.

The computational challenges arising from the analysis of queuing models could be alleviated by modeling repairable supply chains (including the repair shop) via a simulation model and coupling this simulation model with DRL. This is a niche area in the current literature that enables decision-makers to analyze and optimize dynamic problems under uncertainty. Another possible extension is to improve the \(\mathcal {DRLSA}\) algorithm by exploiting other reinforcement learning methods, such as SARSA (State-Action-Reward-State-Action), to increase solution quality while reducing runtime. Additionally, it might be worthwhile to integrate the DRL algorithm with other meta-heuristics such as GA and VNS.