1 Introduction

Inventory is usually kept to respond to fluctuations in demand and supply. With more inventory, a higher service level can be achieved, but inventory costs also increase. To find the right balance between service level and inventory costs, inventory management is needed. Inventory usually accounts for 20–60% of the total assets of manufacturing firms (Unyimadu and Anyibuofu 2014). Therefore, inventory management policies prove critical in determining the profit of such firms (Arnold et al. 2008). Most companies already use inventory policies to manage their inventories. With these policies, they determine how much to order at a certain point in time, as well as how to maintain appropriate stock levels to avoid shortages. These policies often focus on a single location and only use local information, which results in individually optimized local inventories that do not benefit the supply chain as a whole. The main reasons for not expanding the scope of these policies are the lack of sufficient data and the growing complexity of the resulting policies. Due to recent IT developments, it has become easier to exchange information between stocking points, resulting in more data availability. This has contributed to an increasing interest in inventory optimization over multiple locations, also called multi-echelon inventory optimization (MEIO). Studies show that coordination among inventory policies, i.e., a centralized decision-making policy, can reduce the bullwhip effect (Giannoccaro and Pontrandolfo 2002).

Despite the promising results, many companies still optimize individual locations (Jiang and Sheng 2009). Mathematical models for inventory management can quickly become too complex and too time-consuming to solve. To prevent this, the models usually rely heavily on assumptions and simplifications (Jiang and Sheng 2009; Dogru et al. 2013; Jain and Raghavan 2009), for example regarding lead times, demand patterns, and network structures, which makes it harder to relate the models to real-world problems and to put them into practice. Tunc et al. (2011) demonstrate that assumptions that do not hold in real life (in their case, stationary demand patterns) can be very costly when the corresponding models are used in real-life applications. Another way of reducing the complexity and computation time is the use of (meta)heuristics. Unfortunately, these heuristic policies are typically problem-dependent and still rely on assumptions, which may limit their use in different settings (Gijsbrechts et al. 2022). Although mathematical models and heuristics can deliver good results for their specific setting, we are especially interested in a method that can be used for various supply chains without requiring major modifications. For such a generic approach, reinforcement learning (RL) is a promising method that may cope with this complexity (Topan et al. 2020). RL was developed to solve sequential decision-making problems in dynamic environments (Sutton and Barto 2018). In sequential decision making, a series of decisions is made in interaction with a dynamic environment to maximize overall rewards (Shin and Lee 2019). This makes RL an interesting method for inventory management. Due to recent advances in deep learning (Goodfellow et al. 2016), reinforcement learning is often combined with neural networks. This combination, called deep reinforcement learning (DRL) (Mnih et al. 2013), shows even more potential to solve complex inventory problems. Inspired by a real case at a manufacturing company, which we refer to anonymously as the CardBoard Company, we recognize the need for an approach that solves MEIO for general inventory systems with stochastic demand, which is more complex than what has previously been considered in the literature.

This paper applies DRL to three multi-echelon inventory systems. For each of these systems, we take into account the holding and backorder costs. We model the problem as a Markov decision process (MDP) and solve it using the DRL approach proximal policy optimization (PPO). The contribution of this paper is threefold. First, we solve these inventory problems by proposing order quantities that depend on various features of the inventory system. To determine the order quantities, we use a continuous action space. Most other research on applications of DRL in supply chain management uses a discrete action space. The discrete action space has limitations in terms of problem size, whereas the continuous action space is more scalable and can therefore be used to cope with real-life problems. Second, one of the cases we consider is the inventory problem of a general inventory system (besides the simpler linear and divergent systems) with backorders. While the general inventory system is relevant for practice, it has received limited attention in research, and to the best of our knowledge, DRL has not been applied to a general inventory system with this many stock points before. Third, we showcase a method that can be used without major modifications for various supply chains by solving three different cases. Two of the cases are from the literature, while one case is based on our case study at the CardBoard Company. This flexibility allows the application of this general learning algorithm without having to rely on considerable domain knowledge or limiting assumptions.

The remainder of this paper is structured as follows. In Sect. 2, we perform a literature study on the application of DRL in multi-echelon inventory systems. Section 3 introduces the three different cases on which we will apply our DRL method. The PPO algorithm and the setup of the experiments are introduced in Sect. 4. In Sect. 5, the results of our DRL method are presented. We end with conclusions and discussion in Sect. 6.

2 Literature review

In this section, we first elaborate on studies concerning multi-echelon inventory management (Sect. 2.1). Second, we zoom in on the advances in RL research in Sect. 2.2. Finally, we look at the literature that uses RL as an approach for solving inventory management problems in Sect. 2.3.

2.1 Multi-echelon inventory management

The term multi-echelon is used for supply chains where an item moves through more than one stage (i.e., location) before reaching the final customer (Ganeshan 1999; Rau et al. 2003). Because of the multiple locations, managing inventory in a multi-echelon supply chain is considerably more difficult than managing it in a single-echelon one (Gumus et al. 2010). As a result, this research area has attracted the attention of many scientists. During the last decade, the research focus has shifted more to integrated control of the supply chain, mainly due to advanced information technology (Kalchschmidt et al. 2003; Rau et al. 2003; Gumus and Guneri 2009). According to Gumus and Guneri (2007), mathematical models are used most often, frequently in combination with simulation, followed by heuristics and Markov decision processes.

The MEIO problem is characterized by the network structure of the supply chain. The simplest structure is a linear network, where each stock point is supplied by one upstream location, and is supplying one downstream location. A more complicated structure is the divergent network, where one stock point can supply multiple downstream locations. The most complicated structure is the general structure, which contains at least one stock point that can be supplied by more than one upstream location, while also containing at least one stock point that can supply multiple downstream locations. Figure 1 provides a visualization of these network structures, where we consider warehouses w (stock points) and retailers r. Retailers are stock points where demand is observed and can be supplied by one or multiple warehouses.

Fig. 1 Three types of network structures for MEIO problems

The aim of exact mathematical models is to solve the inventory management problem to proven optimality. Lambrecht et al. (1984) show that the multi-echelon stochastic problem can be modeled as a Markov decision process (MDP). An MDP consists of stages in which we transition from one state to the next, satisfying the Markov property (the next state depends only on the current state and the chosen action). Several researchers model the MEIO problem as an MDP, as multiple convenient methods for solving MDPs exist. However, these authors typically focus on relatively simple structures of the supply chain, for example, the linear supply chain (Lambrecht, Luyten and Vander Eecken 1985; Chen and Song 2001; Vercraene and Gayon 2013; Iida 2001) and the divergent supply chain (Chen et al. 2002; Minner et al. 2003). Yet, determining optimal policies is intractable, even for simple network structures, due to the well-known curses of dimensionality (De Kok et al. 2018; Bellman 1957). As a result, most mathematical models use extensive restrictions and assumptions, such as deterministic demand and constant lead times. Also, mathematical models are usually complex and case-specific.

There are solution approaches that aim to overcome the aforementioned curses of dimensionality. In Sect. 2.2, we discuss approximate methods for solving the MDP, which are the focus of this paper. Another approach, Robust Optimization, is applied by Bertsimas and Thiele (2006) and Ben-Tal et al. (2009) to the MEIO problem, where the demand uncertainty is captured in uncertainty sets. Because Robust Optimization methods typically yield conservative solutions, measures must be taken to avoid focusing on possible worst-case realizations of the uncertainty. Bertsimas and Thiele (2006) and Ben-Tal et al. (2009) show that this method yields promising results and is less affected by the curses of dimensionality.

Due to the intractability of mathematical models, heuristics are also often used as a solution approach in the optimization of multi-echelon inventory management. Heuristics are used to find a near-optimal solution in a short time span and often consist of a set of general rules, which can therefore be applied to various problems. Frequently, a heuristic is used to optimize the parameters of the order-up-to policy (i.e., base-stock policy). This policy has been the most commonly adopted approach in inventory optimization (Nahmias and Smith 1993, 1994). For multi-echelon inventory systems, the base-stock policy, though not necessarily optimal, has the advantage of being simple to implement while performing well (Rao et al. 2000; Rong et al. 2017).
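
To illustrate the type of policy such heuristics parameterize, the sketch below implements a plain order-up-to rule. The function name, its interface, and the numerical example are our own illustration, not a policy taken from the cited works.

```python
def base_stock_order(on_hand, in_transit, backorders, base_stock_level):
    """Order-up-to rule: raise the inventory position back to the base-stock level.

    The inventory position is on-hand stock plus outstanding (in-transit)
    replenishments minus backorders; the order equals the gap to the target.
    """
    inventory_position = on_hand + in_transit - backorders
    return max(0, base_stock_level - inventory_position)


# Example: 12 units on hand, 5 in transit, 2 backordered, target level 30 -> order 15.
print(base_stock_order(12, 5, 2, 30))
```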

2.2 Reinforcement learning

MDPs are typically solved using dynamic programming (DP) (Bellman 1957). Because of the high computational cost of DP, an approximate method is often required to solve MDPs. We consider two approximate approaches: (1) approximate dynamic programming (ADP) and (2) RL. ADP is a modeling framework based on an MDP model (Mes and Rivera 2017), whereas RL does not necessarily assume a perfect model (Sutton and Barto 2018). Like RL, ADP is an umbrella term for a broad spectrum of methods that approximate the optimal solution of an MDP, and it typically combines optimization with simulation. ADP and RL are thus two closely related paradigms for solving sequential decision-making problems under uncertainty; in the remainder of this paper, we will use the umbrella term RL.

RL is about learning a policy to maximize a reward. In RL, unlike most forms of machine learning, the learner is not explicitly told which action to perform in each state (Sutton and Barto 2018). At each time step, the agent observes the current state \(s_t\) and chooses an action \(a_t\) to take. The agent selects the action by trial-and-error (exploration) and based on its knowledge of the environment (exploitation) (Giannoccaro and Pontrandolfo 2002). After taking the action, the agent receives a reward \(r_{t+1}\) and moves to the new state \(s_{t+1}\). Using this observed information about the state transition and reward, the agent updates its knowledge of the environment and selects the next action. To maximize the reward, the trade-off between exploration and exploitation is important. The agent has to explore new state-action pairs in order to find the value of the corresponding reward. Full exploration ensures that all actions are taken in all states, to avoid ending up in a local optimum. However, this is often too time-consuming, so most methods start with full exploration and gradually increase the probability of exploitation.
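
This interaction can be summarized in a generic agent-environment loop. The sketch below uses an epsilon-greedy exploration scheme; the `env` and `agent` interfaces are hypothetical placeholders used purely for illustration.

```python
import random

def run_episode(env, agent, epsilon=0.1):
    """Generic RL interaction loop with epsilon-greedy exploration.

    env and agent are placeholder objects: env.reset()/env.step() follow the
    usual simulation interface, agent.best_action() exploits current knowledge,
    and agent.update() learns from the observed transition and reward.
    """
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        if random.random() < epsilon:        # explore: try a random action
            action = env.sample_random_action()
        else:                                # exploit: use current knowledge
            action = agent.best_action(state)
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state)
        state = next_state
        total_reward += reward
    return total_reward
```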

In RL, a table could be used that maps all action values to the states. Nevertheless, such a table grows prohibitively large when solving large, realistic problems. Furthermore, with this representation, what is learned in one state cannot easily be generalized to similar states (i.e., states whose variable values are close to each other). To address this problem, approximations of the policy or value function are often used. With approximations, only the parameters of the function need to be stored, rather than the whole mapping table. There are several ways to use value or policy function approximation in RL, but a method that is gaining popularity is the use of deep learning (DL). Deep learning means that the learning method uses one or multiple neural networks to estimate the value or policy function. In a recent paper, Mnih et al. (2015) applied deep learning as a value function approximator and achieved impressive results. Deep learning was combined with RL because they share the goal of creating general-purpose artificial intelligence systems that can interact with and learn from their environment (Arulkumaran et al. 2017). While RL provides a general-purpose framework for decision-making, deep learning provides a general-purpose framework for representation learning. With minimal domain knowledge and a given objective, it can learn the best representation directly from raw inputs (Sutton and Barto 2018). The neural networks are updated with gradient descent, where the objective is to minimize the loss function.

We can roughly distinguish between value-based and policy-based RL. In value-based RL methods, the values of states and actions are approximated with a function, which can still result in a tremendous computational load. In this case, the loss indicates how well the estimated values match the observed values. Examples of value-based methods are Q-learning (where the action with the maximum value estimate is used in the updates) and temporal difference learning (where the action that is actually taken is used in the updates). In essence, in solving the MDP, the optimal policy is the desired end result. With policy-based methods, the policy is directly parameterized and approximated, rather than learned 'indirectly' through the value functions. In this case, the loss indicates how well the policy is performing. Policy-gradient methods usually use some type of baseline to reduce the variance in learning. Value- and policy-based methods can also be combined, for example in the A3C algorithm, where the value estimates (i.e., the critic) are used as a baseline for the policy (i.e., the actor).

Schulman et al. (2017) developed the proximal policy optimization (PPO) algorithm, which is more scalable and sample-efficient, and requires less parameter tuning than other DRL algorithms. It is a policy-gradient algorithm that can be used in combination with a critic. The key idea of the algorithm is that large updates to the policy are avoided by clipping the loss function that is used to perform stochastic gradient ascent updates. This ensures that the weights of the neural network are updated in a subtle way, resulting in more stable training behavior. It has been shown empirically that the performance of the PPO algorithm is less sensitive to the chosen hyper-parameters than that of other DRL algorithms. Furthermore, PPO is applicable to a continuous action space. If there is a large number of potential actions in a range of continuous values, it is possible to learn the parameters of the probability distribution of the action (continuous action space), rather than the probabilities of every individual action (discrete action space). This enhances the scalability of the algorithm to larger problem instances.
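
In the notation of Schulman et al. (2017), the clipped surrogate objective that realizes this idea can be written as follows, where \(r_t(\theta )\) denotes the probability ratio between the new and the old policy and \({\hat{A}}_t\) the estimated advantage:

$$\begin{aligned} L^{CLIP}(\theta ) = {\mathbb {E}}_t\left[ \min \left( r_t(\theta ) {\hat{A}}_t, \; \text {clip}\left( r_t(\theta ), 1-\epsilon , 1+\epsilon \right) {\hat{A}}_t\right) \right] , \qquad r_t(\theta ) = \frac{\pi _{\theta }(a_t \mid s_t)}{\pi _{\theta _{old}}(a_t \mid s_t)} \end{aligned}$$

Maximizing this clipped objective keeps the updated policy close to the old one, which is exactly the stable update behavior described above.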

2.3 Reinforcement learning in multi-echelon inventory management

A considerable amount of research on RL and ADP in MEIO has already been done. An overview is given in Table 1. For each paper, we list the considered network structure and how unsatisfied demand is treated. Additionally, the objective used in solving the problem is of relevance. Some papers focus on cost (and/or profit), while others focus on minimizing the difference between the realized and target service levels. The problem horizon is also a consideration. In a finite horizon, RL is trained by simulating a given time period, e.g., a number of days or weeks. The outcome of such a finite-horizon approach is a time-dependent policy, which suggests an action for each possible state in each time period of the horizon. In an infinite horizon, the simulation runs for an arbitrary time period, and the outcome is a stationary policy. In the last column of Table 1, we show the approach used by the authors. Note that some entries are incomplete when the research did not clearly specify these characteristics.

Table 1 Usage of reinforcement learning in multi-echelon inventory management

As we can see in Table 1, a large number of papers considering RL in MEIO is already available. The papers differ substantially in their inventory systems and approaches. When we compare the network structures of the papers, we see that a divergent network is most commonly assumed, followed by a linear supply chain. Only one paper considers a convergent structure, and four papers describe a general network structure, which is the network structure of our case study at the CardBoard Company. In two of the papers where this structure is considered, the authors recommend addressing the scalability challenge to solve larger problems (Çimen and Kirkbride 2013, 2017).

In the case of unsatisfied demand, lost sales and backorders are considered almost equally often in the literature, whereas two papers consider a hybrid approach. Reaching the target service level is the goal in eight papers, while most of the papers focus on costs by either minimizing the costs or maximizing the profit. The algorithms that are used vary, but Q-learning is the most frequently used algorithm, followed by the actor-critic (AC) algorithm and variations of it, such as A3C. While RL and ADP are both often used in multi-echelon inventory management, DRL is considered in nine papers. After reviewing these papers, we conclude that, while reinforcement learning is used because of its promise to solve large problems, many papers still use extensive assumptions and simplifications. As we see in the table, the problems solved in the literature are still quite different from the general network setting of the CardBoard Company. Harsha et al. (2021) consider a general network structure and a dual-sourcing structure, which are both quite different from our general network structure. In their general network structure, each retailer stock point can only be supplied by one warehouse, thus resembling a divergent network. In their dual-sourcing structure, the network is fully connected, i.e., every retailer stock point (3 in total) is connected to each of the two warehouses. In our general network, the number of retailers (5) and warehouses (4) is larger, and the network is not fully connected, i.e., not all retailers can order at all warehouses. Pirhooshyaran and Snyder (2020) focus on a finite-horizon policy and specify the order-up-to levels per node, whereas we focus on finding a steady-state policy and on replenishment quantities rather than order-up-to levels. Furthermore, both studies use a value-based DRL method to solve the problems, whereas we focus on the policy-based method PPO. This method has also been applied by Vanvuchelen et al. (2020), but in that case, a divergent joint-replenishment network is considered using a discrete action space. A discrete action space scales poorly to the size of the general network considered in this paper, so we use a continuous action space within the PPO algorithm.

We contribute to the existing literature by considering multiple network structures, including the general structure, and applying a novel method within this field. By applying PPO with a continuous action space on multiple network structures, we study the flexibility of this method in solving such problems.

3 Problem statement

In this section, we introduce the multi-echelon inventory optimization problem with a general network structure. In this problem, we consider two types of stock points: warehouses \(w \in W\) and retailers \(r \in R\). A summary of the problem notation is given in Table 2.

Table 2 Notation

We model the problem as a Markov decision process (MDP), where all decisions are made centrally by a single decision maker. In each time period t, the process is in state \(s_t\), and the decision maker may choose an action \(a_t\) that is available in that state. The state (1) should contain enough information such that the problem has the Markov property, i.e., the future (e.g., the decision \(a_t\) and the expected future rewards) depends only on the present state and does not depend on past history. For each of the three inventory cases, we define the state as follows:

$$\begin{aligned} s_t&= \left[ ti, tb, I_{w,t}^{end}, I_{r,t}^{end}, BO_{r,w,t}, BO_{r,t}, IT_{w, t, t'}, IT_{w,r,t,\tilde{t}} \right] \nonumber \\&\qquad \forall w \in W, r \in R, t' \in \{1,\ldots , \lambda _{w}\}, \tilde{t} \in \{1,\ldots , \lambda _{w,r}\} \end{aligned}$$
(1)

In (1), ti represents the total inventory in the system, and tb defines the total number of backorders. Note that we do not consider the time period as a state variable, meaning that we are looking for a stationary (or infinite horizon) policy that is valid at any time period.

In our problem, the state consists of all the necessary information for the decision maker to take an action. In all problem structures, the action (2) is the order quantity to place at the upstream location(s). Therefore, we define the action space as an order for every edge of the network:

$$\begin{aligned} a_t = \left[ O_{w,t}, O_{r, w, t} \right] \qquad \forall w \in W, r \in R \end{aligned}$$
(2)

After taking the action, we have a transition to a new state \(s_{t+1}\). The probability of moving to the new state \(s_{t+1}\) is influenced by the chosen action and given by the state transition function \(P(s_{t+1} \mid s_t, a_t)\). The transition function is described by the sequence of events (Table 3), where the randomness lies in the uncertainty of customer demand \(d_{r,t}\) and lead time \(\lambda _{w,r}\) between stock points.

In every time step t, five events happen. At the beginning of each period, the shipments from the upstream stock points (\(IT_{w,t',t}, IT_{w,r,t',t}\)) are received. Then, the supplier of each warehouse sends outgoing shipments to the warehouse (\(IT_{w,t,t+\lambda _w}\)) with (possibly stochastic) lead time \(\lambda _w\). Thereafter, the warehouses fulfill the orders \(O_{r,w,t-1}\) of the retailers with on-hand inventory if possible (\(IT_{w,r,t,t+\lambda _{w,r}}\)). Transport between two stock points w and r requires a (possibly stochastic) lead time \(\lambda _{w,r}\). If there is not enough inventory for all downstream orders, the retailers are ranked in ascending order of their inventory minus their backorders, and the retailer with the lowest value is satisfied first, until no inventory is left. Any (partial) order that cannot be fulfilled is put in backorder (\(BO_{r,w,t}\)). After sending the outgoing shipments, upstream orders are placed (\(O_{w,t}, O_{r,w,t}\)). Finally, the customer demand is fulfilled at the retailers, and demand that is not satisfied is backordered (\(BO_{r,t}\)). This means that the demand of the customers \(d_{r,t}\) is not known before the replenishment orders are placed. Note that in the linear case, as shown in Fig. 1, all stock points except the last one are considered warehouses. With respect to the sequence of events in the linear case, event 2 concerns only the most upstream warehouse. In event 3, there also exist warehouse-to-warehouse shipments and backorders (i.e., \(IT_{w, w',t, t+\lambda _{w, w'}}, BO_{w,w',t}\) where \(w, w' \in W\)). In event 4, there are also warehouse-to-warehouse orders (i.e., \(O_{w, w', t}\) where \(w, w' \in W\)).

Table 3 Sequence of events describing the transition dynamics of the MDP
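
To make this sequence of events concrete, the sketch below runs one period of a simplified single-warehouse, multi-retailer (divergent) system. The variable names, data structures, and the assumption of an always-available external supplier are our own simplification of the dynamics in Table 3, not the exact simulation used in the experiments.

```python
def step(state, prev_orders, new_orders, demand):
    """One period of a simplified divergent system (1 warehouse, R retailers).

    prev_orders were placed in event 4 of the previous period and are shipped
    now (events 2 and 3); new_orders are the current replenishment decisions
    (event 4); demand lists the customer demand per retailer for this period.
    """
    R = len(state["rt_inv"])

    # 1. Receive the shipments arriving this period (front of each pipeline).
    state["wh_inv"] += state["wh_pipeline"].pop(0)
    for r in range(R):
        state["rt_inv"][r] += state["rt_pipeline"][r].pop(0)

    # 2. The external supplier ships the warehouse's previous order; it joins
    #    the end of the warehouse pipeline (one slot per lead-time period).
    state["wh_pipeline"].append(prev_orders["warehouse"])

    # 3. The warehouse fulfils the retailers' previous orders plus open
    #    backorders, serving the lowest (inventory - backorders) first;
    #    unmet quantities become warehouse-level backorders.
    requested = [prev_orders["retailers"][r] + state["bo_rw"][r] for r in range(R)]
    shipments = [0] * R
    for r in sorted(range(R), key=lambda i: state["rt_inv"][i] - state["bo_cust"][i]):
        shipments[r] = min(requested[r], state["wh_inv"])
        state["wh_inv"] -= shipments[r]
        state["bo_rw"][r] = requested[r] - shipments[r]
    for r in range(R):
        state["rt_pipeline"][r].append(shipments[r])

    # 4. Record the new upstream orders; they are shipped in the next period.
    state["open_orders"] = new_orders

    # 5. Customer demand (plus old backorders) is filled from retailer stock;
    #    whatever cannot be filled is backordered.
    for r in range(R):
        want = demand[r] + state["bo_cust"][r]
        served = min(want, state["rt_inv"][r])
        state["rt_inv"][r] -= served
        state["bo_cust"][r] = want - served
    return state
```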

The transition from \(s_t\) to \(s_{t+1}\) due to action \(a_t\) results in a reward \(R(s_t, a_t, s_{t+1})\). In this general network, we minimize the costs, consisting of holding costs \(h_w, h_r\) and backorder costs \(b_w,b_r\), for the warehouses w and retailers r, respectively. The action \(a_t\) directly impacts the inventory levels and backorders in the system. This results in the following cost function that we aim to minimize:

$$\begin{aligned} c_t(s_t,a_t) = \sum _{w \in W}\left[ h_w \cdot I_{w,t} + b_w \cdot \sum _{r\in R} BO_{r,w,t} \right] + \sum _{r \in R}\left[ h_r \cdot I_{r,t} + b_r \cdot BO_{r,t} \right] \end{aligned}$$
(3)

The objective of the MDP is to find a policy \(\pi ^*\) that minimizes the discounted infinite sum of costs, where a policy is a mapping of states to actions. Thus, the objective can be expressed as shown in (4), where \(\gamma\) is the discount factor balancing the impact of future and present rewards and \(a_t^{\pi }\) represents the action at time t prescribed by policy \(\pi\). Note that we aim to minimize the long-term average costs. This means that, in practice, we simulate a finite horizon of T periods, with T chosen high enough to reflect the infinite-horizon costs.

$$\begin{aligned} \pi ^* = \arg \min _{\pi \in \Pi } {\mathbb {E}} \left[ \sum _{t=0}^{\infty }{\gamma ^t \cdot c_t\left( s_t,a_{t}^{\pi }\right) } \right] \end{aligned}$$
(4)

In the next section, we elaborate on how the PPO algorithm can be used to solve this problem.

4 Proximal policy optimization implementation

The PPO algorithm (Schulman et al. 2017) can be used to solve MDPs in an approximate manner. The policy \(\pi\) is prescribed by a neural network (actor network), and by interacting with the environment (i.e., a simulation of the problem dynamics, sampling demand, etc.), the parameters of the neural network are updated to improve the policy. In Sect. 4.1 we elaborate on the standard steps taken in the PPO algorithm to learn a good policy, and in Sect. 4.2 we describe the hyperparameters that are used in the algorithm. In the sections that follow, we specify how the state space, action space, and reward function were transformed to make them applicable for use in the PPO algorithm. Although we use a standard implementation of PPO, the hyperparameters and the state, action, and reward transformations are design choices.

4.1 Algorithm steps

We implement the actor-critic structure of the PPO algorithm as described by Schulman et al. (2017). Within this algorithm, two neural network approximators are used: one for the policy function and one for the value function. We call them the actor network (\(\pi (s_t;\theta )\)) and the critic network (\(v^\pi (s_t;\phi )\)), which do not share weights. The actor network gives the current policy, meaning that, given the input state, it returns the action to take in that state. The critic network outputs the estimated value of being in that state. The network weights are initialized randomly, and then the training procedure starts. The random initialization encourages exploration at the beginning of training. The actor network results in a stochastic policy, as its output consists of the mean and standard deviation of a Gaussian distribution. The action is sampled from this distribution. Starting with a large standard deviation encourages exploration of actions. With these random samples, we do not have the guarantee that all state-action pairs are visited. However, the benefit of using neural networks as approximators is that they can generalize what is learned across similar states and actions.
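
A minimal PyTorch sketch of these two networks is given below. The two hidden layers of 64 tanh units and the softplus transform for the standard deviation are assumptions for illustration, not necessarily the exact architecture used in our experiments; only the output structure (mean and standard deviation per action, separate critic) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianActor(nn.Module):
    """Maps a normalized state to a Gaussian distribution per action dimension:
    one output node for the mean and one for the standard deviation of each."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)
        self.std_head = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.body(state)
        mean = self.mean_head(h)                       # unbounded mean output
        std = F.softplus(self.std_head(h)) + 1e-5      # strictly positive std
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()                         # stochastic policy
        return action, dist.log_prob(action).sum(-1)


class Critic(nn.Module):
    """Maps a normalized state to a scalar estimate of its value."""

    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)
```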

In every training iteration, a buffer is filled with states, actions, and (scaled) rewards obtained by following the current (initially random) policy \(\pi (s_t;\theta )\). With these observations, the advantage is calculated, which is an essential indication of whether the action that was taken is better or worse than the policy's average action value (i.e., the output of the critic network). This advantage calculation is done using the Generalized Advantage Estimation method as described by Schulman et al. (2015). It gives an indication of the direction in which the weights of the actor network need to be updated.
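
A sketch of this advantage calculation is shown below, assuming a buffer without episode boundaries; gamma and lam correspond to the discount and smoothing parameters discussed in Sect. 4.2, and their default values here are illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al. 2015).

    rewards[t] is the (scaled) reward of step t, values[t] the critic's value
    estimate of state t, and last_value the estimate for the state following
    the final step of the buffer.
    """
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD error
        gae = delta + gamma * lam * gae                          # discounted sum of TD errors
        advantages[t] = gae
    returns = advantages + values[:-1]                           # targets for the critic
    return advantages, returns
```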

Based on the advantage information, the parameters of both neural networks are updated by computing a loss function over a trajectory obtained by following the current policy. The updates aim to improve the policy and the value estimates. This means that for the actor network, we need to find the direction of choosing better actions. The loss function of Schulman et al. (2017) is used for the actor network, and it differs from other DRL algorithms in the sense that it clips the loss, so that policy updates do not take excessive steps. For the critic network, we need to close the gap between the observed and predicted values (mean squared error loss). Updating the neural network weights by filling the buffer and computing the loss functions constitutes one iteration. Typically, learning continues until some convergence is reached, meaning that we no longer observe significant improvements in the rewards achieved by consecutive policies.
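
The per-minibatch losses can be sketched as follows. The function name and interface are our own; the body mirrors the clipped surrogate objective for the actor and the mean squared error for the critic, with eps as the clip parameter and all arguments assumed to be PyTorch tensors collected from the buffer.

```python
import torch
import torch.nn.functional as F

def ppo_losses(new_log_probs, old_log_probs, advantages, value_preds, returns, eps=0.2):
    """Clipped surrogate loss (actor) and MSE loss (critic) for one minibatch."""
    ratio = torch.exp(new_log_probs - old_log_probs)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    actor_loss = -torch.min(unclipped, clipped).mean()        # negated: descent step = ascent on surrogate
    critic_loss = F.mse_loss(value_preds, returns)            # fit the observed returns
    return actor_loss, critic_loss
```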

4.2 Algorithm parameters

For many steps in the PPO algorithm, hyperparameter values need to be selected. First of all, the architecture of the neural networks needs to be defined, which consists of the number of hidden layers, the number of neurons in these hidden layers, and the activation function. The number of neurons of the input layer is given by the number of elements of the state vector in (1), whereas the number of neurons of the output layer is equal to the number of elements of the action vector in (2), multiplied by two, as we want to output the mean and standard deviation of the Gaussian distribution for each action. Additionally, we need to determine the method for initializing the weights of the network parameters.

After initializing the neural networks, the length of the buffer and the parameters (\(\gamma , \lambda\)) used for calculating the advantages need to be chosen. The \(\gamma\) parameter corresponds to the discount rate mentioned in Sect. 3, while \(\lambda\) controls the trade-off between bias and variance in the estimation of the advantage function. For updating the networks, we need to set the relevant parameter of the loss function, namely the clip parameter \(\epsilon\), which indicates how far away from the 'old' policy the network is allowed to be after an update. We use the popular Adam optimization procedure, instead of classical stochastic gradient descent, to update the network weights. The key hyperparameter of this procedure is the learning rate. Finally, the networks are updated using minibatches of a certain size and a number of update epochs that are performed with the same training data.

We implemented the 'standard' PPO algorithm as described by Schulman et al. (2017), using the same hyperparameters presented in their paper, with which they concluded that the PPO algorithm performs well in a number of different problem environments. Although hyperparameter tuning might improve the results, we decided to use these given parameter settings for each of the three inventory cases, to investigate how the PPO algorithm performs without the time-consuming hyperparameter tuning steps. However, to make the algorithm applicable to the three different inventory cases under consideration, we had to choose how to represent and scale the states, actions, and rewards. We elaborate on this in the following subsections.

A summary of the hyperparameters used for the PPO algorithm is given in Table 4. Experimentation with these parameters confirmed the suitability of the values found by Schulman et al. (2017).

Table 4 PPO hyper-parameters

4.3 State space transformation

As we use neural networks as approximators for the policy and value function, we perform a common preprocessing step to normalize the input for the neural network. This normalization prevents vanishing or exploding gradients, improves convergence and generalization, and helps with regularization. The values of the state variables (1) are normalized to a range of \([-1,1]\), to keep the values in a relevant domain for the hyperbolic tangent activation function of the neural network. For the general case, the values are normalized to a range of [0, 1] as opposed to \([-1, 1]\). This enables us to get an output of the untrained network that lies in the interval of \([-1, 1]\), such that the actions do not have to be clipped in the beginning.
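
As a concrete example, this min-max scaling can be sketched as follows; the upper bounds per state variable are the case-specific values of Table 6, and the choice low=-1.0 (or 0.0 for the general case) follows the text above. The function name and interface are our own illustration.

```python
import numpy as np

def normalize_state(raw_state, upper_bounds, low=-1.0, high=1.0):
    """Scale raw state variables (lower bound 0) linearly to [low, high].

    raw_state and upper_bounds are arrays of equal length; the general case
    would use low=0.0 so the untrained network output stays within [-1, 1].
    """
    raw_state = np.asarray(raw_state, dtype=float)
    upper_bounds = np.asarray(upper_bounds, dtype=float)
    return low + (high - low) * raw_state / upper_bounds
```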

A neural network is initialized with small values and updated with small gradient steps. If the input values are not normalized, it takes a long time for the weights of the actor network to converge to well-performing action values and for the weights of the critic network to converge to accurate value functions. We want the interval of the state variables before normalizing to have a wide range, to ensure we do not limit our method in its observations. The lower bound is 0 for every variable, because orders, backorders, and shipments cannot be negative. With a very large scaling upper bound, however, the normalized state values will lie very close together, which makes it hard for the neural network to distinguish different states from each other. Therefore, we test several different scaling upper bounds and assess which one performs best in learning well-performing policies.

For the simple network structures (i.e., the linear and divergent cases), we expect that transforming the input variables for the neural network, without limiting the transition dynamics with the upper bounds, will work reasonably well to find well-performing policies. However, as the problem grows in complexity (i.e., an increasing state and action space), exploration may get trapped in unlikely states, from which no recovery is possible. Consider, for example, a situation with an extreme amount of inventory. Even if the algorithm takes actions to get out of this situation by not ordering replenishments, the positive reinforcement signal could be long delayed, because of the slow depletion of inventory. One option to avoid this is to use the bounds on the state variables both for normalizing the input to the neural network and for limiting the variables in the transition function. We evaluate this 'transition-limit' option for the general case, as it is the most complex case and could benefit from this adjustment. It is important to note that the upper bounds should only be used in training the algorithm, not in evaluating the performance of the trained policy.

4.4 Action space transformation

For all inventory cases, the action to take is the order quantity at the upstream location. Reinforcement learning problems can have either a discrete or a continuous action space. For every output node, the neural network of the actor outputs either the probability that the corresponding action will be executed (discrete actions) or a parameter of the action distribution (continuous actions). With a discrete action space, the agent decides which distinct action to perform from a finite action set. For example, when we have four edges (linear case), and we want the order size to be in the interval [0, 30], we have \(31^4 = 923,521\) different actions. For the case of the CardBoard Company, using the same interval, this would even result in an astonishing number of \(31^{18} = 6.99 \cdot 10^{26}\) actions. With this number of actions, the neural network cannot even be initialized on most computers due to memory limits, as for every distinct action in this action set, the neural network has a separate output node. The probability of taking one of the actions is computed by a softmax function on the output layer. Because this results in a large network that is not scalable to larger action spaces, we choose to implement a continuous action space.

With a continuous action space, the number of output nodes of the neural network is equal to twice the number of elements in the action vector. In (2), we defined an order quantity per edge of the network, so the output of the neural network is the mean and standard deviation of the order quantity for every edge (5). The action for each edge is then sampled from a Gaussian distribution using these parameters. This action space results in a more scalable neural network, as the number of output nodes grows only linearly with the number of edges in the network, rather than exponentially as in the case of a discrete action space.

$$\begin{aligned} a_t = \left[ \mu _{w,t}, \sigma _{w, t}, \mu _{r,w,t}, \sigma _{r,w,t} \right] \qquad \forall w \in W, r \in R \end{aligned}$$
(5)

In the linear and divergent cases, the number of edges is equal to the number of stock points. However, in the general case, the number of edges (18) is much larger than the number of stock points (9). This may result in issues with scalability. To explore the impact of reducing the size of the action space, we consider two different action space implementations for the general case: (1) order per edge and (2) order per stock point. In the case where we define an order per stock point, we use a probability to determine the predecessor stock point to which the order is sent. For example, with 60% probability, the order of retailer \(r_1\) is sent to warehouse \(w_1\), and with 40% probability the order is sent to warehouse \(w_2\), meaning the order quantity is not split over multiple suppliers. These percentages are determined based on historical data of the CardBoard Company. While this action space implementation might work well in improving the scalability of the algorithm, the performance could be jeopardized, as we add an additional source of uncertainty to the problem (i.e., the random choice of supplier), which is a common challenge for DRL algorithms (Dulac-Arnold et al. 2021).

The default output of an untrained network is close to 0, with a standard deviation of 1. It could, therefore, take a long time to reach the desired mean output. For example, if the optimal order size is 20, the neural network has to adjust its output to 20, but might do this in very small steps. Hence, it is recommended to scale the action space to the interval \([-1,1]\) (Raffin et al. 2019), which should speed up the learning of the network. The interval of the action space can be determined based on experimentation or by reasoning about the maximum allowed action in the environment. Because the action output of the neural network is unbounded, we clip the outputs that lie outside the given interval. This way, we encourage the network to stay within the limits of the interval. We also experimented with giving the network a penalty whenever the outputs were outside the given interval, but this resulted in worse performance.
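
A sketch of how a raw Gaussian sample can be turned into order quantities is given below; the clipping of out-of-range outputs and the linear rescaling from [-1, 1] to the order interval follow the description above, while the function name and interface are our own illustration.

```python
import numpy as np

def to_order_quantities(raw_action, max_order):
    """Map unbounded network outputs to order quantities per edge.

    raw_action is the Gaussian sample produced by the actor; values outside
    [-1, 1] are clipped, and the result is rescaled to [0, max_order], where
    max_order is the case-specific action upper bound (cf. Table 6).
    """
    clipped = np.clip(np.asarray(raw_action, dtype=float), -1.0, 1.0)
    return (clipped + 1.0) / 2.0 * np.asarray(max_order, dtype=float)
```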

4.5 Reward function transformation

PPO tries to maximize the rewards. Because we want to minimize our costs, we multiply the costs (3) by \(-1\) to represent the rewards. As mentioned in Sect. 4, the PPO algorithm uses an actor-critic architecture. While the actor outputs the actions, the critic estimates the value function. The critic's network weights are optimized to achieve minimal loss (MSE) between the network outputs and the observed returns. If the scale of these target outputs differs significantly from that of the input features, the neural network is forced to learn unbalanced distributions of weight and bias values, which can result in a network that is not able to learn properly. To combat this, it is recommended to scale the output values (Muccino 2019). In our experiments, we noticed that the scaling of these rewards is not trivial and that no best practices have been defined yet. In our case, dividing the costs by 1000 yielded good results, and we saw that the value predicted by the critic corresponded to the average reward obtained in the experiments.

5 Experiments and results

Experiments were executed to assess the performance of the PPO algorithm from Sect. 4 in the multi-echelon inventory problem as described in Sect. 3. In this section, we first introduce the settings of the three inventory cases we experiment with (Sect. 5.1). Thereafter, we report the performance of the PPO algorithm in terms of minimizing the costs compared to the benchmarks in the linear (Sect. 5.2), divergent (Sect. 5.3), and general inventory network (Sect. 5.4). We present the cost and computational performance of the algorithm, and we provide an analysis of how to interpret the resulting policy in practice.

5.1 Experimental settings

The linear supply chain case is based on the synthetic problem settings as described by Chaharsooghi et al. (2008). The lead time is drawn from a random distribution each period and is identical for all stock points. Furthermore, the demand at the retailer is drawn from a discrete uniform distribution, and there are four stock points. We have replicated their simulation and applied our reinforcement learning method to this simulation. As a benchmark, we used their original method, the Reinforcement Learning Ordering Mechanism (RLOM).

The divergent supply chain case is based on the synthetic problem settings described by Kunnumkal and Topaloglu (2011). The lead time here is deterministic, while the demand at the retailers follows a Poisson distribution, whose mean is drawn from a uniform distribution. There are again four stock points, and in this case, unlike the others, there are no backorders for the warehouses. To benchmark this system, we use the decomposition-aggregation heuristic of Rong et al. (2017). This heuristic is also considered as a benchmark by Harsha et al. (2021).

The general supply chain case is inspired by a real-world case at the CardBoard Company, with nine stock points. It is structured as shown in Fig. 1, where the four warehouses are paper mills that process raw materials into paper and cardboard products. This intermediate product is shipped to corrugated plants (which can be considered the five retailers), which process it into the final products sold to customers. As the general network structure is already complex to study, we assume deterministic lead times in this case, and we assume that the demand is drawn from a Poisson distribution. The warehouses do observe backorders; however, there are no backorder costs associated with them, since only the demand of the end customers is 'penalized' for not being readily available. In the current inventory policy (the benchmark), the aim is to reach a 98% fill rate at all stock points, meaning that also here the aim is to minimize backorders between retailers and warehouses. However, the general intuition at the CardBoard Company is that improvements are possible in terms of decreasing the total amount of inventory while keeping customers satisfied.

Table 5 Parameters per case

All case-specific parameters, which originate from the cases studied by Chaharsooghi et al. (2008) and Kunnumkal and Topaloglu (2011), can be found in Table 5. Please note that these benchmarks are all non-iterative and therefore each provides only one output value. Hence, the benchmarks are depicted as straight lines in the figures with results. Unfortunately, it was not possible to use one of the other value-based DRL methods we found in the literature as a benchmark. As noted by Lynnerup et al. (2019), DRL studies are notoriously difficult to replicate due to the algorithms' intrinsic variance, the stochasticity of the environment, and the various hyper-parameters that can be chosen (and that are not always reported). Since the problems considered in other studies differ from our problem settings, the corresponding DRL methods cannot easily be applied to our case. Instead, we focus on explaining the approach and implementation details of our PPO algorithm, and on demonstrating its applicability to different problems.

In the linear and divergent cases, we run the PPO algorithm for 10,000 learning iterations. For the general case, we use 50,000 iterations, as the state and action spaces are larger. To study the convergence of the algorithm in terms of the performance of the learned policy, we simulate the learned policy every 100 learning iterations (every 500 iterations for the general case). At each of these evaluation moments, 100 simulation runs are performed. In the linear case, 35 time periods are simulated to compare with the benchmark. Note that the linear case benchmark focuses on a finite-horizon setting, whereas we aim to minimize the long-term average costs in the other network structures. In the divergent case, 75 time periods are simulated, and the first 25 are discarded as warm-up periods. For the general case, 100 time periods are simulated, and the first 50 are discarded as warm-up. The differences in time periods are a result of the different benchmarks that are considered. Since there is randomness in running the PPO algorithm, ten different random seeds (referred to as 'replications') are used in each case to demonstrate the consistency in performance of the algorithm. The average performance of these random-seed replications is reported. The experiments were conducted on an Intel(R) Xeon(R) Processor CPU @ 2.20 GHz with 4 GB of RAM.

As indicated in Sect. 4, the state and action space need to be scaled to an appropriate range for the specific problem environment. Since our problem environments differ significantly in terms of the number of nodes and thus in the reasonable maximum amounts of total inventory and replenishment quantity, we use different scales for the state and action variables in each case. The upper bounds for the state and action space were found after experimenting with several values, and the resulting bounds are shown in Table 6.

Table 6 Upper bound of the state and action variables per case

5.2 Linear network

We first show the results of the linear network, providing insight into the applicability of the PPO algorithm to the simplest network structure. One replication of the PPO algorithm takes on average 1 h to complete. Note that the computation time of the algorithm can be reduced with early stopping procedures once the solution has converged.

The results of the experiments can be found in Fig. 2. As a benchmark, we have included the average results of the RLOM approach of Chaharsooghi et al. (2008).

Fig. 2 Performance of the PPO algorithm in the linear inventory system. The blue shade depicts the 95% confidence interval (colour figure online)

We see that the PPO algorithm improves quickly and needs only a few iterations to improve on the benchmark (note that the benchmark also relies on RL). After completing training, the PPO algorithm results in 16.4% lower costs compared to the benchmark. From this, we conclude that the PPO algorithm succeeded in yielding good rewards while using a large state and action space, and is able to learn well-performing actions. These results indicate that deep reinforcement learning might be a promising method for the other, more complicated cases.

5.3 Divergent network

In this section, we present the results of applying the PPO algorithm to the divergent network case. One full replication of the deep reinforcement learning method takes on average 1 h. The result of the deep reinforcement learning method can be seen in Fig. 3.

To benchmark our method, we use the decomposition-aggregation (DA) heuristic of Rong et al. (2017), which is proven to be asymptotically optimal. Besides the DA heuristic, Rong et al. (2017) also propose the recursive-optimization heuristic; they conclude that both heuristics perform well, but that DA is significantly less computationally intensive. With this heuristic, we find the base-stock levels [124, 30, 30, 30]. Hence, the base-stock level for the warehouse should be 124, while the base-stock level for each retailer is 30. The calculations for the heuristic can be found in Appendix A. The average costs of the benchmark are 4059.

Fig. 3 Performance of the PPO algorithm in the divergent inventory system. The blue shade depicts the 95% confidence interval (colour figure online)

As can be seen in Fig. 3, PPO is able to learn the required actions quickly, as the costs decrease rapidly over the first iterations. After only 3000 iterations, which take about 30 min, the method is able to beat the benchmark. At iteration 6300, the algorithm achieves its lowest costs, with an average cost performance of 3600, which means that the method beats the benchmark by 11.3%. Note that it is not recommended to continue learning after convergence has been reached, as the performance could deteriorate after more learning iterations due to overfitting. From these results, we conclude that the deep reinforcement learning method can easily be applied to other cases, without any alterations to the algorithm. There might still be some improvement possible by tuning the parameters of the algorithm. However, this is a non-trivial and time-consuming task (Gijsbrechts et al. 2022). Therefore, we decided to keep the current parameter values. With the current settings, the method is already able to beat a base-stock level heuristic.

To gain more insight into the decisions that the deep reinforcement learning method makes, and to interpret the resulting policy when it is used in practice, we look at how the actions of the method depend on the state. To do this, we feed different states as input to the trained network. The output of the network is the action it will take, i.e., the quantity of the order it will place upstream. When we visualize these states and actions in a heatmap, we see how the actions relate to the states. Because we use a two-dimensional heatmap, we can only vary two variables of the state vector at a time; the other values in the state vector remain fixed at the average value of the particular state variable, which we determined using simulation.
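
The sketch below illustrates how such a heatmap can be generated from the trained actor; `policy_mean_action` stands for a hypothetical helper that returns the actor's mean order quantities for a (normalized) state, and the grid values and indices are chosen for illustration.

```python
import numpy as np

def action_heatmap(policy_mean_action, base_state, idx_x, idx_y,
                   values_x, values_y, action_idx):
    """Evaluate the trained policy on a grid of two state variables.

    base_state holds the simulated average value of every state variable;
    idx_x and idx_y index the two variables being varied, and action_idx
    selects which order quantity (network edge) to visualize.
    """
    grid = np.zeros((len(values_y), len(values_x)))
    for i, y in enumerate(values_y):
        for j, x in enumerate(values_x):
            state = np.array(base_state, dtype=float)
            state[idx_x], state[idx_y] = x, y
            grid[i, j] = policy_mean_action(state)[action_idx]  # deterministic mean action
    return grid
```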

Fig. 4 Actions of the warehouse and retailers

The heatmap of Fig. 4a shows how the order quantity of the warehouse depends on the inventory in the warehouse and the number of items in transit to the warehouse. We see that the order quantity gets lower as the inventory of the warehouse gets higher. The order quantity only slightly depends on the number of items in transit. When the number of items in transit is low, the order quantity is high. This order quantity slightly decreases when the number of items in transit increases. However, when the number of items in transit passes 50, the order quantity tends to increase again slightly. Although this is counter-intuitive, it might be the result of the neural network receiving only a limited number of observations of these states, as they do not occur often in the simulation and thus seem unlikely to occur in practice. All in all, the heatmap shows that PPO is able to make explainable connections between states and actions, but it also results in some actions that make less sense.

The state-action connections of the retailers can be found in Fig. 4b. For these heatmaps, we investigated the different retailers and the relation between their inventory and both the inventory of the warehouse and the number of items in transit. We see that there is no clear dependency between the actions and the inventory of the warehouse. This makes sense, as the number of items in the warehouse does not directly affect the inventory position of the retailer. The connection between the number of items in transit and the inventory of the retailer is clearly visible. The resulting policy does not order when the number of items in transit is large. Also, the current inventory of the retailers impacts the order quantity. Especially for retailers 1 and 2, the order quantity quickly increases when the retailers have backorders (with a slower increase for retailer 3).

With this analysis, we demonstrate that the PPO algorithm finds sensible policies, and we provide guidance on how to interpret the resulting policy in practice. Although it is not possible to quickly see the relations between all the different states in these two-dimensional heatmaps, we do get insights into some important connections.

5.4 General network

In this section, we present the results of applying the PPO algorithm to the general network case. We conducted three different sets of experiments. In Experiment set 1, we defined the action space with orders per stock point. Experiment set 2 had a larger action space, as an order per edge was considered (see Sect. 4.4). To ensure that the PPO algorithm does not become trapped in unlikely states, we defined Experiment set 3, where the transition function was restricted (see Sect. 4.3) and where we again use the action space with orders per edge.

As a benchmark, we used the actual policy of the CardBoard Company, which aims to achieve a fill rate of 98% for its customers. The company does so by also enforcing a 98% fill rate for the connections between the paper mills and corrugated plants, which is similar to a decentralized inventory policy. We set up a benchmark by manually tuning the base-stock parameters such that the fill rate of 98% is met for every connection. We measure this fill rate by running a simulation model of the network for 100 time periods, removing the first 50 periods to account for the warm-up period, and replicating this simulation 500 times. Using this method, we obtained the following base-stock parameters for Experiment set 1: [82, 100, 64, 83, 35, 35, 35, 35, 35], resulting in an average total cost of 10,467. For Experiment sets 2 and 3, the base-stock parameters were [37, 47, 33, 63, 30, 30, 30, 30, 30], with an average total cost of 4,797. The difference between these base-stock parameters is due to the higher uncertainty in Experiment set 1, which results from the randomness in where the upstream order will be placed. In Experiment sets 2 and 3, the orders are split across all the upstream connections, resulting in lower demand uncertainty for the warehouse stock points, higher delivery reliability between warehouse and retail stock points, and thus lower base-stock parameters.
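
One way to automate this tuning of the benchmark parameters is sketched below; `simulate_fill_rates` is a hypothetical wrapper around the simulation described above (100 periods, 50 warm-up, 500 replications) that returns one fill rate per connection, and the unit step size and the monotonicity assumption are ours.

```python
def tune_base_stock(simulate_fill_rates, initial_levels, target=0.98, step=1):
    """Raise base-stock levels until every connection meets the target fill rate.

    Assumes (as we do here) that the fill rate of a connection does not decrease
    when its base-stock level is increased.
    """
    levels = list(initial_levels)
    rates = simulate_fill_rates(levels)
    while any(rate < target for rate in rates):
        for i, rate in enumerate(rates):
            if rate < target:
                levels[i] += step                 # nudge under-performing connections
        rates = simulate_fill_rates(levels)
    return levels
```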

We run the deep reinforcement learning method for 50,000 iterations for the three experiment sets, which takes approximately 5.8 h per replication. Table 7 shows the most important results of every experiment set. Looking at the results, we observe that the PPO algorithm is not able to achieve consistent results for Experiment sets 1 and 2. Although the lowest costs are lower than the benchmark, the average costs are high, due to some runs that do not converge. Experiment set 3, which implements the transition limit, converges in every run and therefore yields average costs that are lower than the benchmark. For the remainder of this section, we further analyze the results of Experiment set 3.

Table 7 Results of the benchmark and the PPO algorithm in the general inventory system
Fig. 5 Results of the PPO algorithm on the case of the CardBoard Company. The blue shade depicts the 95% confidence interval (colour figure online)

Figure 5 shows the result of Experiment set 3, compared with the benchmark. We see that the PPO algorithm once again improves quickly in this case, but needs some time to converge and yield steady results. After 22,000 iterations, which take about 2.5 h, the average costs drop below the benchmark for the first time. On average, the PPO algorithm improves on the benchmark by 6.6%. To gain more insight into the decisions that the PPO algorithm makes, we look into the fill rates of the inventory system and compare them with the benchmark. Note that for the case of the CardBoard Company, we are not able to create heatmaps, as we have too many state variables that depend on each other.

For every replication of our deep reinforcement learning method, we use the fully trained network to run a simulation for 100 periods, from which we remove the first 50 periods to account for the warm-up. Figure 6 shows the average fill rate of replication 1 for both the benchmark and our deep reinforcement learning method. While we did not impose the 98% fill rate as a constraint in our model, the method still meets this requirement for almost every retailer. We see that the method learned that it is not necessary to achieve a 98% fill rate for every stock point, but only for the retailers, as they deliver directly to the customers. With respect to the warehouses, our PPO algorithm decides to store inventory only at warehouse 4. According to Fig. 1, this warehouse has a connection to every retailer. The PPO algorithm learned that storing inventory at only one warehouse reduces variability and achieves higher downstream fill rates. With this approach, the PPO algorithm has learned to simplify the general network structure of our case to a divergent network structure. While this is a valid solution for our case, it might not be realistic in a real setting, for example, due to the distances between warehouses and retailers. Therefore, future research could focus on altering the case such that it becomes less beneficial to store inventory only at warehouse 4. This could be done by assigning an extra long lead time to this warehouse, making use of preferred suppliers, or changing the replenishment costs for different suppliers.

Fig. 6 Average fill rates (in percentage) for the stock points of the CardBoard Company

Based on these experiments, we conclude that the PPO algorithm is not able to converge in complex environments without using a transition limit. The extra stochasticity in Experiment set 1 results in a highly unstable method. Experiment set 2 shows that, while the stochasticity is reduced, the method remains unstable. The transition limit in Experiment set 3 ensures that the PPO algorithm does not get stuck in unlikely states and pushes the method in the right direction.

6 Conclusion

We investigated the applicability of deep reinforcement learning (DRL) for finding replenishment order quantities in multi-echelon inventory management. More specifically, we proposed a PPO algorithm, a method that performs well with robust hyper-parameter values found in the literature. We applied our method to three different network structures to show its general applicability. When applying the PPO algorithm to the different cases, the main modifications we made were related to the definition of the state and action spaces and to setting the bounds. The bounds on the state and action variables have a large impact on the performance, but with minor experimentation, suitable bounds can be identified. Future research could look into automating and optimizing the process of determining these bounds. The only other modification concerns the transition limits in the larger problem, needed to ensure a stable performance of the algorithm.

The PPO algorithm outperforms the benchmarks in all cases. In the linear case, the algorithm achieved 16.4% lower costs compared to the benchmark of Chaharsooghi et al. (2008). In the divergent network structure, PPO resulted in 11.3% lower costs compared to the benchmark of Rong et al. (2017). In the general network structure, we considered a real-world case of the CardBoard Company, where we found that a transition limit was necessary to let the PPO algorithm converge. In this case, the PPO algorithm was able to reduce the costs by 6.6%. The PPO algorithm learned to reduce variability and improve downstream fill rates by storing inventory at only one warehouse. However, this may not be a feasible solution in real-world settings. Future research could explore ways to make it less beneficial to store inventory at one warehouse by introducing more complexity and uncertainty, such as preferred suppliers or different lead times per connection. Furthermore, the PPO algorithm is executed within reasonable computation times of around 1 h, with larger problems taking more time to train. Once trained, the neural networks can instantly provide the action to take. We showed that insights into the resulting policy of the algorithm can be retrieved by creating heatmaps and investigating the prescribed actions in different states.

Our contributions are threefold. First, we applied a DRL method with a continuous action space, which is not yet widely used, to a multi-echelon inventory system, providing better scalability to larger problems. Had we used discrete actions, we would not have been able to initialize the neural network for the case of the CardBoard Company. Second, we investigated a general inventory system, which has received less attention in the literature due to its complexity and intractability. Nevertheless, the relevance of a general network structure is underlined by our real-world case of the CardBoard Company, and we showed that the PPO algorithm can improve inventory policies. Third, we demonstrated the flexibility of the PPO algorithm by applying it to a variety of cases without major modifications and without any parameter tuning.

While demonstrating the applicability of the PPO algorithm in a general network structure, we investigated the performance in a simulated environment, using stationary demand distributions. Investigating the performance and suitability in a real-life environment is essential before implementing the PPO algorithm in practice. While our case reflects a real-life environment, further research could focus on implementing more real-life aspects. Additionally, more research should be done on the explainability of our method and deep reinforcement learning in general. As the explainability aspect is a key component of implementing such methods in practice, it deserves more attention. van Hezewijk et al. (2022) provide an example of how the neural network ‘black box’ can be opened to interpret the resulting policy of the PPO algorithm. They do so by using multiple linear regression to find relationships between the state and action variables of the trained actor, thereby providing insight into the logic behind selecting certain actions when observing certain states.