Hybrid algorithm based on reinforcement learning for smart inventory management

This article proposes a hybrid algorithm based on reinforcement learning and the inventory management methodology called DDMRP (Demand Driven Material Requirement Planning) to determine the optimal time to buy a certain product and the quantity that should be requested. To this end, the inventory management problem is formulated as a Markov Decision Process in which the environment the system interacts with is designed from the concepts of the DDMRP methodology, and a reinforcement learning algorithm (specifically, Q-Learning) is used to determine the optimal policy for deciding when and how much to buy. To determine the optimal policy, three approaches are proposed for the reward function: the first is based on inventory levels; the second is an optimization function based on the distance of the inventory to its optimal level; and the third is a shaping function based on levels and distances to the optimal inventory. The results show that the proposed algorithm is promising in scenarios with different characteristics, performing adequately in difficult case studies with a diversity of situations, such as discontinuous or continuous demand, seasonal and non-seasonal behavior, and high demand peaks, among others.


Introduction
Efficient inventory management is of special interest to companies dedicated to commercialization or production. Thus, "inventory represents one of the most important investments of companies compared to the rest of their assets, being essential for sales and optimizing profits" (Durán, 2011). Hence, efficient inventory management and production planning are critical elements that represent a competitive advantage and constitute a determining factor for the long-term survival of the organization (Silver et al., 2017). Inventory management has traditionally relied on Material Requirements Planning (MRP), introduced by Orlicky (1975), which aims to plan material requirements (Huq & Huq, 1994). However, despite its popularity, this methodology has an important limitation: its precision is not suitable for dynamic environments. Small variations in the system therefore lead to the bullwhip effect in the supply chain, which consists of distortions between the number of units demanded and those purchased (Constantino et al., 2013). This effect, which has been widely studied in the literature (Steele, 1975; Mather, 1977; Wemmerlov, 1979; Lee & Rim, 2019), generates cost increases and loss of customers, among other consequences.
Given the above, the present work is based on an alternative methodology: DDMRP, developed by Ptak and Smith (2011), which adapts better to environments with high variability and therefore allows more efficient inventory management. The "Demand-Driven" approach introduces decoupling points to absorb variability, reduce lead times, and reduce overall capital investment.
Thus, in this article, a hybrid algorithm is developed based on reinforcement learning and the DDMRP inventory management methodology, to determine the optimal time to buy a product, and the quantity requested on the purchase order. It is important to highlight with respect to this last aspect (quantity of units) that it should not be very high since the demand for more resources increases the costs; nor very low because it can cause unsatisfied demand, production delays, among other problems.
The main contribution is the definition of a hybrid algorithm based on reinforcement learning and on DDMRP to determine when and how much to buy a certain product. The theoretical contribution of this work is an extension of Q-Learning that introduces different types of information from an inventory management problem into the reward mechanisms. The hybrid algorithm is defined with three different reward functions, based respectively on the DDMRP theory, an optimization function, and a shaping function. They are evaluated in multiple case studies, which differ from each other in the following characteristics: discontinuous or continuous demand, seasonal and non-seasonal behavior, high or low demand peaks, and different lead times, among others. Thus, this work implements a hybrid reinforcement learning algorithm that allows a more efficient inventory management process than the one proposed in the DDMRP theory. Additionally, an alternative to the formula defined in the DDMRP theory for calculating the optimal inventory level is proposed, with the aim of calculating it more efficiently.
The article is organized as follows: in Sect. 2, a literature review is presented; Sect. 3 describes the theoretical framework. In Sect. 4, the experimentations are carried out; and in Sect. 5, an analysis and discussion of the results are presented. Finally, in Sect. 6, the conclusions of the study are described.

Literature review
Research on inventory management has generally relied on MRP, as stated by Rossi et al. (2017), who remark that around 75% of manufacturing companies use MRP as the main method for planning production. Since the introduction of MRP, a wide variety of investigations have been developed, such as the proposal of Pooya et al. (2021), in which dynamic systems are used to reduce the impact of the bullwhip effect produced by demand and thus reduce production costs.
DDMRP has been developed as an alternative to MRP; it addresses the bullwhip effect through the positioning of decoupling points or buffers in the supply chain (Ptak & Smith, 2016). The main function of these buffers is to store a certain number of products to absorb the variability of demand or of the supply chain. Works on DDMRP have mainly focused on exploring the advantages of this methodology in organizations, such as the one proposed by Velasco et al. (2020), where the authors recreate a simulation environment of the system in Arena software and demonstrate the efficiency of the system in manufacturing environments, obtaining results such as a 41% reduction in lead time and an 18% decrease in inventory levels.
On the other hand, Kortabarria et al. (2018) present a case study of a manufacturing company of home appliance components, in which they compare an inventory management methodology based on MRP with one based on DDMRP. Their results show that DDMRP reduced the bullwhip effect and rush orders. Also, Shofa and Widyarto (2018) developed a case study for a company in the Indonesian automotive sector; their simulation results show that with the DDMRP method the delivery times were reduced from 52 to 3 days and, additionally, the inventory levels were lower than with the MRP approach.
But DDMRP and MRP are not the only models that have been studied in the literature. Mathematically, inventory management has been posed as an optimization problem (Silver, 1981; Aguilar, 2001) whose objective is to maximize profit and minimize costs. These models have been applied in various organizational areas; for example, Hubbs et al. (2020) and Karimi et al. (2017) developed inventory management systems aimed at human resource scheduling in production. In processes such as maintenance, Paraschos et al. (2020) develop a model to optimize the tradeoff between machinery maintenance, equipment failures, and quality control. Abdelhalim et al. (2021) analyze the buffer positioning problem in the DDMRP model and propose an optimization approach to buffer positioning with a linearization of the model. Rosario et al. (2022) study the impact of production control policies in a two-product, two-echelon dynamic supply chain problem. The factory is subject to two different disruptive occurrences (i.e., failure events and changeovers), which must be controlled by adopting different production control policies. They compare the DDMRP policy with other well-known policies, such as the Hedging Corridor Policy. Thürer et al. (2022) use simulation to assess the performance of four Production Planning and Control (PPC) systems under different levels of bottleneck severity and due date tightness. The systems evaluated were Kanban, MRP, Optimized Production Technology (OPT), and DDMRP. Results show that MRP performs the worst, whereas Kanban and DDMRP perform the best if there is no bottleneck. If there is a bottleneck, DDMRP and OPT perform the best. On the other hand, Oluyisola et al. (2022) define a methodology for the design and development of smart PPC systems. This methodology is illustrated in a case study in a sweets and snacks manufacturing company.
Likewise, in terms of resource allocation, Zhang et al. (2019) develop a genetic algorithm to set the optimal resource allocation by minimizing the cost of the inventory policy. This type of algorithm has also been used by other authors, such as Saputro et al. (2021), who use it for the selection of suppliers of strategic items. They propose a model considering variables such as inventory cost, quality costs, and delivery times to determine the optimal number of units of inventory. Other approaches to inventory management have also been defined. For example, Dhahri and Chabchoub (2007), Ran (2018), Punia et al. (2020), and Aguilar et al. (2022) propose optimal inventory levels through time series forecasting, so that the supply process is better adjusted to the likely reality of the company.
There are several recent works that present reviews of the literature on the different works carried out around the DDMRP model. For example, Butturi et al. (2021) review the existing scientific literature to analyze its application. They determine three main research lines: comparison with other methodologies, extensions of the DDMRP basic principles, and case studies. They conclude that the main criticality of the method is the high subjectivity affecting the positioning of the buffers. On the other hand, Azzamouri et al. (2021), carry out a systematic literature review about research dealing with the DDMRP published between 2011 and 2020 in different languages. They analyze the extensions of the method in the literature, and present a taxonomy. Also, they identify gaps that require further research.
In summary, although several studies propose inventory management systems from methodologies such as MRP, DDMRP, as an optimization problem, or through demand forecasting, no research using reinforcement learning techniques and DDMRP was found in the literature for inventory management.

Inventory management
Inventories are all those items or stock used in production or commercialization in an organization (Durán, 2012, p. 56). Some important aspects of how to obtain and maintain an adequate inventory are: absorbing fluctuations in demand, having protection against a product that is difficult to ensure a constant supply, obtaining discounts when ordering with larger quantities, and reducing order costs if they are carried out less frequently (Muller, 2011). Regarding this last aspect, Peterson et al. (1998) point out that there are basically five categories of costs associated with inventory management: the unit cost of the value of the product, costs of maintaining the products, ordering costs, stockout costs, and those associated with control systems.
On the other hand, DDMRP combines relevant features of MRP, distribution resource planning (DRP), and Six Sigma. It is a system that adapts to dynamic demand environments and avoids the amplification of the bullwhip effect in the supply chain through buffers. In general, these buffers act as decoupling points for fluctuations, not only in demand but also in those inherent to or associated with the supply chain. Thus, DDMRP implements buffers (also called decoupling points) whose function is to create independence between the supply chain, the use of materials, and demand. This is achieved by establishing optimal inventory levels at the decoupling points, in such a way that any variation generated in the system is not transmitted through the entire supply chain. The following subsections present some concepts related to DDMRP.

Buffer
The buffers are made up of three zones (see Fig. 1): red, yellow, and green, which will be described below.
Red zone It is the lower zone of the buffer and is associated with low inventory levels (see Fig. 1). The base of the red zone (BZR) is calculated as:
$$\mathrm{BZR} = \mathrm{ADU} \times \mathrm{DLT} \times \mathrm{LTF} \tag{1}$$
where ADU is the average daily usage; DLT (decoupled lead time) is the lead time between buffers or decoupling points; and LTF (lead time factor) is a variability factor that gives a greater threshold in delivery times. The top of the red zone (TOR) is given by:
$$\mathrm{TOR} = \mathrm{BZR} + \mathrm{BZR} \times \mathrm{FV} \tag{2}$$
where FV is a variability factor that gives greater slack to the zone when the demand for the product is highly variable.
Yellow zone It corresponds to the intermediate level of the buffer (see Fig. 1). The lower limit of the yellow zone is TOR, and the upper limit (TOY) is calculated as:
$$\mathrm{TOY} = \mathrm{TOR} + \mathrm{ADU} \times \mathrm{DLT} \tag{3}$$
Green zone It corresponds to the upper zone of the buffer (see Fig. 1) and is associated with high inventory levels. The lower limit of the zone is given by TOY. To determine the upper limit of this zone (TOG), it is necessary to calculate the following three factors: (i) Desired order cycle (DOC): this factor represents the number of days between orders. It sets the imposed or desired number of days of inventory until a new replenishment order is made. It is calculated as:
$$\mathrm{DOC} = \mathrm{ADU} \times \text{Days between orders} \tag{4}$$
(ii) Base of the red zone (BZR), calculated according to Eq. (1). (iii) Minimum order quantity that can be made (MOQ). Once the three factors have been calculated, TOG is calculated as follows:
$$\mathrm{TOG} = \mathrm{TOY} + \max(\mathrm{DOC}, \mathrm{BZR}, \mathrm{MOQ}) \tag{5}$$
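The zone limits described above can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the standard DDMRP zone formulas (BZR = ADU × DLT × LTF, TOR = BZR + BZR × FV, TOY = TOR + ADU × DLT, TOG = TOY + max(DOC, BZR, MOQ)), and the function name, the `BufferZones` container, and all parameter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferZones:
    bzr: float  # base of the red zone
    tor: float  # top of the red zone
    toy: float  # top of the yellow zone
    tog: float  # top of the green zone


def buffer_zones(adu, dlt, ltf, fv, days_between_orders, moq):
    """Compute the buffer limits under the assumed standard DDMRP formulas."""
    bzr = adu * dlt * ltf                 # red base
    tor = bzr + bzr * fv                  # red base plus red safety
    toy = tor + adu * dlt                 # yellow zone covers usage over the DLT
    doc = adu * days_between_orders       # desired order cycle, in units
    tog = toy + max(doc, bzr, moq)        # green zone is the largest of the three factors
    return BufferZones(bzr, tor, toy, tog)


# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
zones = buffer_zones(adu=10, dlt=7, ltf=0.5, fv=0.5, days_between_orders=3, moq=20)
print(zones)
```

With these illustrative values the green zone is driven by BZR, since it is larger than both DOC and MOQ.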

Qualified demand
Qualified demand is made up of the sum of demand orders existing to date, and the sales orders that exceed the OST level (Order Spike Threshold) in a certain time horizon (OSH). This time horizon is equivalent to the DLT value. Note that the OST level represents the maximum demand threshold for it to be considered as a demand peak. This ensures that high levels of demand are identified, as well as the supply of materials necessary to satisfy them. This level is defined as the value of the ADU.
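As a rough illustration of this definition, the sketch below adds today's demand to every known future order within the OSH horizon that exceeds the OST. The function name, the list layout of `future_orders`, and the numbers are assumptions made for the example, not part of the original model.

```python
def qualified_demand(demand_today, future_orders, ost, osh):
    """Sum today's demand and the order spikes inside the horizon.

    future_orders[i] is the known order quantity due i + 1 days from today
    (a hypothetical layout chosen for this sketch).
    """
    spikes = [qty for qty in future_orders[:osh] if qty > ost]
    return demand_today + sum(spikes)


# Hypothetical data: two spikes (40 and 35) fall inside a 7-day horizon;
# the order of 50 units on day 8 lies outside it and is ignored.
qd = qualified_demand(
    demand_today=12,
    future_orders=[5, 40, 8, 0, 35, 9, 3, 50],
    ost=30,
    osh=7,
)
print(qd)
```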

Net flow inventory
Net flow inventory position (NFE) is a concept defined in the DDMRP methodology associated with the amount of inventory available. This generates the signal to request a supply order; in other words, it defines the need to make a purchase.
To calculate it, Ptak and Smith (2016) define the following equation:
$$\mathrm{NFE} = \mathrm{OH} + \mathrm{OP} - \mathrm{QD} \tag{6}$$
where OH (on hand) is the inventory available, i.e., the quantity of stock available to be used; OP (pending orders) is the quantity of ordered stock not yet received; and QD is the qualified demand.

Optimal level of inventory
Ptak and Smith (2016) define the optimal level of inventory from the following equation:
$$\mathrm{OH}^{*} = \mathrm{TOR} + \frac{\mathrm{TOG} - \mathrm{TOY}}{2} \tag{7}$$

Purchase order
The buy signal is generated when the NFE is less than or equal to the TOY level. The number of recommended units to request in the purchase order (SR) is then calculated as:
$$\mathrm{SR} = \mathrm{TOG} - \mathrm{NFE} \tag{8}$$
Otherwise, no purchase order is generated.
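The DDMRP purchase rule can be sketched as follows. The sketch assumes the standard DDMRP expressions NFE = OH + OP − QD and SR = TOG − NFE; the function names and the numeric values are hypothetical.

```python
def net_flow_position(on_hand, on_order, qualified_demand):
    """Net flow position (NFE), assuming NFE = OH + OP - QD."""
    return on_hand + on_order - qualified_demand


def recommended_order(nfe, toy, tog):
    """Units to request (SR): TOG - NFE when NFE <= TOY, otherwise no order."""
    if nfe <= toy:
        return tog - nfe
    return 0


# Hypothetical buffer limits and stock figures.
nfe = net_flow_position(on_hand=60, on_order=40, qualified_demand=20)
order = recommended_order(nfe, toy=122.5, tog=157.5)
print(nfe, order)
```

This baseline rule is what a learned purchase policy would be compared against; it always refills the buffer up to TOG whenever the signal fires.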

Reinforcement learning
Reinforcement Learning (RL) is a type of learning in which the actions to take are not specified in advance; instead, they are discovered from experience (Sutton & Barto, 2018). In other words, learning takes place through trial and error and the rewards obtained in each trial. These interactions are generally modeled as a Markov Decision Process (MDP), which is made up of the following elements: the agent, in charge of the learning and decision-making process; and the environment, which comprises all the objects with which the agent interacts (Watkins, 1989). An MDP is a formalization of a sequential decision-making process in which actions influence not only immediate rewards but also subsequent situations and states, and through them future rewards (Sutton & Barto, 2018). At each step, the agent selects an action, and the environment responds with a new situation and a reward for the chosen action.
In general, the structure of an MDP consists of four parts: the possible states (s), the possible actions (a), a transition function, and a reward function (R). If the actions are deterministic, the transition function assigns to each pair (s, a) a new state (s') resulting from the interaction between both. If the actions are stochastic, the transition function is defined as a probability function, where P(s'|s, a) represents the probability of reaching state s' given the pair s and a. The final objective of the MDP is to find a policy π: s → a that maximizes the expected value of the rewards associated with the states. Thus, we seek to maximize the expected return given by the function (Sutton & Barto, 2018):
$$G_t = R_{t+1} + R_{t+2} + \cdots + R_T \tag{9}$$
where $R_t$ is the reward obtained in episode t. Defined in a recursive and generalized way, this gives (Sutton & Barto, 2018):
$$G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{s,\,t+k+1} \tag{10}$$
where $G_t$ is the accumulated reward obtained from episode t; k is the time interval; γ is the discount factor; and $R_{s,\,t+k+1}$ is the reward for the action taken at time t + k + 1 in state s. The agent's behavior, that is, the probability of selecting a certain action, is defined by the policy, which determines how desirable it is to take an action in a specific state. Under a certain policy, action-value functions are defined and calculated as follows (Sutton & Barto, 2018):
$$q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right] \tag{11}$$
where $q_\pi(s, a)$ is the action-value function of state s and action a under the policy π, and $\mathbb{E}_\pi$ is the expected value under policy π.
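To make the discounted return concrete, the following minimal sketch computes it for a finite list of rewards using the recursion G_t = R_{t+1} + γ G_{t+1}, which is equivalent to the discounted sum of Sutton and Barto (2018). The function name and the reward values are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three illustrative rewards with gamma = 0.5: 1 + 0.5 * 0 + 0.25 * 1
g = discounted_return([1.0, 0.0, 1.0], gamma=0.5)
print(g)
```

With γ = 1 the function reduces to the plain sum of rewards, which corresponds to the undiscounted return.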

Q-Learning
Q-Learning is an RL algorithm introduced by Watkins (1989). It is characterized by being off-policy: the optimal policy is learned independently of the actions the agent takes. This, as stated by Sutton and Barto (2018), allows the algorithm to converge faster. The Q values with which the action-value function is constructed are calculated as follows (Watkins, 1989):
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \right] \tag{12}$$
where Q(S, A) is the expected reward value for action A taken in state S; α is the learning rate; R is the reward; γ is the discount factor; and S' is the state reached after taking action A.
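A single tabular Q-Learning update can be sketched as below, using the standard rule Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]. The dictionary-based table, the state labels, and the action set (order quantities of 0, 20, or 40 units) are assumptions made for illustration and are not taken from the article.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one Q-Learning update in place and return the new Q(state, action)."""
    best_next = max(q[(next_state, a)] for a in actions)   # greedy value of S'
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)      # unseen (state, action) pairs start at 0.0
actions = [0, 20, 40]       # hypothetical order quantities

# First update: every entry is still 0, so the target is just the reward.
first = q_update(q, "s0", 20, reward=1.0, next_state="s1",
                 actions=actions, alpha=0.5, gamma=0.9)

# Second update: the value learned for ("s0", 20) is bootstrapped backwards.
second = q_update(q, "s_prev", 0, reward=0.0, next_state="s0",
                  actions=actions, alpha=0.5, gamma=0.9)
print(first, second)
```

The `max` over next-state actions is what makes the update off-policy: the target uses the greedy value of S' regardless of which action the agent actually takes next.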

Shaping function
The concept of shaping was initially introduced by Skinner (1958), based on the effectiveness of training an animal by rewarding it during the learning process whenever it performed behaviors similar to those desired. Similarly, the shaping function has been implemented in RL algorithms by authors such as Ng et al. (1999), proving to be a very efficient technique, and sometimes indispensable for a quick convergence of the learning algorithm (Sutton & Barto, 2018). Formally, the reward function is defined as:
$$R'(s, a, s') = R(s, a, s') + F(s, a, s') \tag{13}$$
where R(s, a, s') is the reward function and F(s, a, s') is the shaping function.
For its part, F is defined as:
$$F(s, a, s') = \gamma \varphi(s') - \varphi(s) \tag{14}$$
where γ is the discount factor and φ is a function that defines how close or far the agent is from the target.
The inventory environment is described in Fig. 1. It consists mainly of three components: the first is associated with the purchase orders (left part of Fig. 1: Supply Side); the second with the buffer (in the center of Fig. 1); and the third is the demand side, associated with the demand in a given time horizon (OSH) and the OST.
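The shaping term discussed above can be illustrated with a potential-based shaping function in the sense of Ng et al. (1999), F(s, a, s') = γφ(s') − φ(s). The potential used here, φ(s) = −|OH − OH*|, is only an assumed example of a "closeness to target" function; it is not necessarily the φ used in this work, and the state layout and names are hypothetical.

```python
def potential(state):
    """Hypothetical potential: the closer OH is to OH*, the higher the value."""
    on_hand, optimal = state
    return -abs(on_hand - optimal)


def shaped_reward(base_reward, state, next_state, gamma, phi=potential):
    """Base reward plus the shaping term gamma * phi(s') - phi(s)."""
    return base_reward + gamma * phi(next_state) - phi(state)


# Moving from 100 to 115 units with a target of 120 earns a positive bonus;
# moving from 100 to 90 is penalized.
closer = shaped_reward(0.0, (100, 120), (115, 120), gamma=1.0)
farther = shaped_reward(0.0, (100, 120), (90, 120), gamma=1.0)
print(closer, farther)
```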

Proposed model
The model proposed in this study is defined as an optimization problem (W) that aims to minimize the distance between the real inventory (OH) and the optimal inventory level (OH*) defined in Eq. (7):
$$W = \min \left| \mathrm{OH} - \mathrm{OH}^{*} \right| \tag{15}$$
To solve this optimization problem, it is structured as an MDP where the environment is defined from the theoretical concepts of DDMRP (Fig. 1), and the policy to optimize and learn is the request for product orders. This last process is carried out through Q-Learning.

Purchase order
The time horizon of this component is given by the DLT, and represents the orders of products that are pending to be received. Once a purchase order is placed, it is represented as an order in the period $P_t - \mathrm{DLT}$, where $P_t$ is the period in which the order was placed.

Our optimal inventory level
Although in Eq. (7) DDMRP theory provides a function to calculate the OH*, we propose an alternative function defined as: Note that both OH* will be used in each of the case studies to compare the performance between them. By way of clarification, TOR refers to the upper level of the red zone of the buffer, TOY to the yellow zone and TOG to the green zone; as defined in Sect. 3.1.1.

Markov decision process
Next, each of the MDP components of the algorithm is defined.

Actions
The action of the agent, $A_t$, is the number of units to buy at a certain time. Based on the OH inventory, the agent must determine the optimal number of units to request in the order (if necessary).

Rewards
Three reward approaches were developed.
R 1 : Rewards based on DDMRP levels. The first approach is based on the state (S t ) of the inventory. Since the most desirable level for the DDMRP theory is yellow, a rewards function R 1 will then be defined such that: This reward function seeks to optimize the desirable level of inventory, in this case, yellow.
R 2 : Rewards based on optimization.
Given that the goal is to minimize the distance between the OH and the optimal OH*, the following reward function is defined based on Eq. (15): It should be noted that by maximizing the reward function in Eq. (18), the optimization defined in Eq. (15) is minimized. With this reward, it is sought to optimize the distance of the inventory to the optimal inventory value.
R 3 : Rewards based on shaping Finally, a shaping approach based on Eq. (17) will be used, such that a new reward function will be defined: where φ(s) is calculated as follows: This reward allows optimizing both the inventory level and distance to the optimal value.

States
The model states are given by three components: OH, OH*, and the lead time (LT). This information is stored at time t in a tuple with the following structure:
$$S_t = (\mathrm{OH}, \mathrm{OH}^{*}, \mathrm{LT})$$
For example, the state S(100, 120, 3) represents an inventory level of 100 units, an optimal inventory of 120, and a lead time of 3.
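One possible way to represent this state in code is an immutable named tuple, which is hashable and can therefore be used directly as a key in a tabular Q function. The class name, field names, and the dictionary-based Q table are assumptions of this sketch.

```python
from typing import NamedTuple


class InventoryState(NamedTuple):
    on_hand: int          # OH
    optimal_on_hand: int  # OH*
    lead_time: int        # LT


state = InventoryState(on_hand=100, optimal_on_hand=120, lead_time=3)

# A Q table keyed by (state, action); the action is a hypothetical order of 40 units.
q_table = {(state, 40): 0.0}
print(state, q_table[(InventoryState(100, 120, 3), 40)])
```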

Variables and assumptions
Since the algorithm uses the inventory environment defined by the DDMRP theory as the environment, each of the variables explained in Sect. 3.1 is used. The assumptions used to develop the model associated with the real scenario are defined below:
• Demand: given that the historical data of the demand was very limited in the proposed scenarios, it was decided to generate pseudo-random data for learning the model by means of the Mersenne Twister algorithm, from the maximum and minimum demand identified in the historical data. The Mersenne Twister algorithm was selected for two reasons: first, because it is one of the best generators of pseudo-random numbers (Matsumoto & Nishimura, 1998), and second, because its characteristics can significantly favor the convergence time of an algorithm (Bonato et al., 2013).
• ADU-OSH: For the real scenario, these variables were calculated from the demand of the previous number, based on a 60-day moving average.
• DLT-Lead Time-OSH: taken from the median of the Lead Time of the last year of the historical data.
• Initial OH: the initial OH was determined from the final inventory of the period prior to the testing of the historical data.
• MOQ: calculated as the minimum purchase order present in the historical data.
In relation to the theoretical scenario, the assumption used for the construction of the model was the use of the Mersenne Twister random number generator to simulate the demand in the learning process. Also, for the rest of the variables, this scenario has established the parameters for each of them.
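The demand simulation and the moving-average ADU described in these assumptions could be sketched as follows. Python's `random.Random` uses the Mersenne Twister as its core generator; drawing integer demand uniformly between the minimum and maximum is an assumption of this sketch (the text only states that the data are generated from those two bounds), and the function names, the seed, and the bounds are illustrative.

```python
import random


def simulate_demand(n_days, min_demand, max_demand, seed):
    """Generate daily demand with a seeded Mersenne Twister generator."""
    rng = random.Random(seed)  # random.Random is backed by the Mersenne Twister
    # Assumption: uniform integer demand between the historical bounds.
    return [rng.randint(min_demand, max_demand) for _ in range(n_days)]


def moving_average_adu(demand, window=60):
    """Average daily usage over the most recent `window` days."""
    recent = demand[-window:]
    return sum(recent) / len(recent)


# An 800-day learning window; the bounds 0 and 72 are used here purely as an example.
demand = simulate_demand(n_days=800, min_demand=0, max_demand=72, seed=42)
adu = moving_average_adu(demand)
print(len(demand), round(adu, 2))
```

Seeding the generator makes the simulated learning environment reproducible across runs.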

Case studies
The algorithm will be implemented in two test scenarios: the first one will be a theoretical scenario made from the data presented in chapter 9 of the book Demand Driven Material Requirements Planning, by Ptak and Smith (2016), used to simulate the behavior of DDMRP for a given product. This first scenario will be used to compare, from a theoretical point of view, the behavior of our proposed algorithm with the behavior of DDMRP in the simulation environment proposed by the authors of the book.
On the other hand, in the second case, a real scenario will be carried out in an organization in the logistics sector, in which the behavior of 3 products with different distribution centers and demand behavior will be evaluated. Below we define each of the case studies. Table 1 shows the products that will be analyzed in the real and theoretical cases.

P1: theoretical case study
This first case study is developed from the simulation data in chapter 9 of the book Demand Driven Material Requirements Planning, by Ptak and Smith (2016). This case study is tested over 21 days with a product whose demand is continuous (see Fig. 2). Figure 2 shows that there is only one demand peak, on day 7; the maximum demand requested is 30 units; the median demand is 9 units (with a mean of 10.53); the lead time is 7 days; and the MOQ is 20 units. Notice that the period of time is so short that there is no evidence to conclude that the demand has any kind of stationary or trend behavior.

P2: case study of product 39,933
This case study was taken from the historical behavior of product 39,933 from operation center 11 of the logistics company. This case has a discontinuous demand, and it is characterized by having a stationary time series in mean and variance (see Fig. 3). Also, there is neither a noticeable trend nor a significant change in variance over time. The median demand for this product is 0 (and mean 2.52), the maximum demand is 72 units, has a MOQ of 40 units, and an LT of 9 days (see Table 1).

P3: case study of product 28,440
This case study was taken from the historical behavior of product 28,440 of operation center 12 of the company in the logistics sector under study. The demand for this product is discontinuous, and it is characterized by a trending time series that is non-stationary in mean or variance (see Fig. 4). Notice how the variance, in terms of dispersion of the data, is lower in the first half of the time series and higher at the end of it. Additionally, it can be seen that the quantity demanded increases over time. Also, notice that this case study has the highest standard deviation and demand quantities among the case studies (see Table 1). The maximum demand requested is 256 units, the median is 0 units (mean of 3.61), the MOQ is 16 units, and the LT is 16 days, making it the case with the longest delivery time to the operations center.

P4: case study of product 43,387
This case study was taken from the historical behavior of product 43,387 from operation center 14 of the company in the logistics sector under study. The demand for this product is discontinuous and characterized by a seasonal time series that is non-stationary in mean or variance (see Fig. 5). Notice how there is seasonality around the middle of both years and how the variance increases over time. Among the case studies, this product has the lowest standard deviation; it has a maximum demand of 26 units and a median of 0 (mean of 1.05). This product has an MOQ of 4 units and an LT of 7 days (see Table 1).

Evaluation metrics
The evaluation metrics are classified into two categories, RL metrics and logistics metrics, which are described below:

Logistic metrics
Below, the logistics metrics are presented.
Bullwhip effect ratio (REL): this metric is used to evaluate the ability to avoid spreading distortions between the purchase orders and the demand for the product (Romero et al., 2016). In our algorithm, the purchase orders generated by the optimal policy learned by our RL model are compared with the demand of the test period. It is calculated as:
$$\mathrm{REL} = \frac{\sigma^{2}_{\text{purchase orders}}}{\sigma^{2}_{\text{demand}}}$$
where σ² is the variance.
The closer the ratio is to 1, the smaller the distortion of the bullwhip effect. A result equal to 1 means that there is no distortion and, thus, no bullwhip effect.
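A minimal sketch of this ratio is shown below, assuming REL is the variance of the purchase orders divided by the variance of the demand over the same period. The population variance from Python's `statistics` module is used; when both series have the same length, the sample variance gives the same ratio. The data are invented for the example.

```python
from statistics import pvariance


def bullwhip_ratio(orders, demand):
    """Variance of purchase orders divided by variance of demand."""
    # Raises ZeroDivisionError if the demand is constant (zero variance).
    return pvariance(orders) / pvariance(demand)


# Hypothetical four-day window: smooth demand, lumpy orders.
demand = [8, 12, 10, 10]
orders = [0, 20, 0, 20]
rel = bullwhip_ratio(orders, demand)
print(rel)
```

In this toy example the orders are far more variable than the demand, so the ratio is well above 1; ordering exactly what is demanded would give a ratio of 1.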
Number of stockouts (BS): This metric evaluates the number of times the stock runs out. This is a very relevant event because it increases the risk of lost sales and leads to reduced customer satisfaction and lower loyalty levels (Merrad et al., 2020). Note that the lower the number of stockouts, the better the inventory policy that was learned.
Average OH* distance (AOHD): This metric evaluates how close the inventory managed by our RL algorithm stays to its optimal level. Therefore, the closer to zero, the better. Mathematically, it represents the Euclidean distance between OH and OH*. Its formulation is:
$$\mathrm{AOHD} = \sqrt{\left(\mathrm{OH} - \mathrm{OH}^{*}\right)^{2}}$$
To evaluate the general behavior throughout the episodes, the median of these distances is calculated.
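The metric can be sketched as the median of the daily distances between OH and OH*. In one dimension the Euclidean distance reduces to the absolute difference, which is what the code uses; the function names and the data are illustrative assumptions.

```python
from statistics import median


def oh_distances(on_hand, optimal):
    """Daily distance between on-hand inventory and its optimal level."""
    return [abs(oh - opt) for oh, opt in zip(on_hand, optimal)]


def aohd(on_hand, optimal):
    """Median of the daily distances, as described for the AOHD metric."""
    return median(oh_distances(on_hand, optimal))


# Hypothetical four-day series with a constant optimal level of 120 units.
on_hand = [100, 130, 120, 90]
optimal = [120, 120, 120, 120]
print(oh_distances(on_hand, optimal), aohd(on_hand, optimal))
```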

RL metrics
Below, the reinforcement learning metrics are presented.
Average Accumulated Reward (AAR): The average accumulated reward metric is used to evaluate the performance of the policy learning process (Sutton & Barto, 2018), which in our proposed model is the purchase order policy. A higher AAR value is better since the algorithm has obtained a higher reward on average. It is calculated as follows:
$$\mathrm{AAR} = \frac{1}{N} \sum_{t=1}^{N} r_t$$
where $r_t$ represents the rewards in episode t and N is the number of episodes.
Percentage of Best-Accumulated Reward (PBAR): The best-accumulated reward (BAR) is defined as the global maximum sum of rewards obtained in an episode in the whole run of the learning process. Thus, PBAR represents the proportion of BAR achieved. PBAR at time t can be calculated as:
$$\mathrm{PBAR}_t = \frac{\max(r_1 + r_2 + \cdots + r_t)}{\mathrm{BAR}}$$
where BAR is calculated as:
$$\mathrm{BAR} = \max(r_1 + r_2 + \cdots + r_N)$$
This metric is important because it shows how many episodes it takes to achieve the best-accumulated reward.
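The two metrics above can be sketched from a list of per-episode accumulated rewards. The sketch follows one possible reading of the definitions: BAR is the best episode return over the whole run, and PBAR at episode t is the best return seen up to t divided by BAR. It assumes BAR is positive; the function names and the reward values are hypothetical.

```python
def aar(episode_returns):
    """Average accumulated reward over all episodes."""
    return sum(episode_returns) / len(episode_returns)


def pbar_curve(episode_returns):
    """Best return seen up to each episode, as a fraction of the overall best (BAR)."""
    bar = max(episode_returns)  # assumed to be positive
    best_so_far = float("-inf")
    curve = []
    for g in episode_returns:
        best_so_far = max(best_so_far, g)
        curve.append(best_so_far / bar)
    return curve


# Hypothetical accumulated rewards for five learning episodes.
returns = [2.0, 5.0, 4.0, 10.0, 8.0]
print(aar(returns), pbar_curve(returns))
```

The first index at which the curve reaches 1.0 indicates how many episodes were needed to attain the best-accumulated reward.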
Rate of Convergence of the Algorithm (AC): This metric is widely used in the RL context by authors like Sutton (1988), and Watkins and Dayan (1992), to prove the ability of an algorithm to find an optimal value. Although the capabilities of Q learning algorithm to converge are proved mathematically by Watkins and Dayan (1992), convergence is used in this paper to prove that our algorithm is working as it should in terms of finding an optimal policy, and to visualize the speed of convergence. To show the convergence of our algorithm, the average accumulated rewards (AAR) and the number of episodes is compared in the learning process. It is a way to view the convergence of an algorithm in practice (Sutton & Barto, 2018).

Learning and evaluation periods
In all the proposed scenarios, a learning process was carried out in a simulation environment composed of a time window of 800 days. Once the learning process was carried out, on the non-theoretical case studies, the evaluation process was carried out from April 1, 2019, to April 1, 2020. In the theoretical case study, the evaluation process was carried out in the 21 days of simulation proposed in the book.

Results analysis
In each of the following case studies, we compare the results of our models (R1-R2-R3) with the results obtained by DDMRP's theory (named "DDMRP" in the results tables). Note that DDMRP is not a reinforcement learning model; it is obtained by calculating the units to be requested in the purchase order (SR) as defined in Sect. 3.5.1. Additionally, for each case study, we show the results obtained by our RL approach based on the DDMRP optimal inventory level (named "DDMRP OH*" in the results) and the results obtained based on the optimal inventory level that we propose (see Sect. 4.2; named "our proposed RL OH*" in the results).

P1: Theoretical case study
In Table 2a, the model with the fastest learning process was the one with the reward function R1 (P1R1): its PBAR reaches 93% of the BAR within 100 episodes. The slowest learning model was P1R2, which reaches only 80% of the BAR in 100 episodes and takes 20,000 episodes to reach 100%. Additionally, the P1R1 and P1R3 models obtain the best performance: in 30,000 episodes their AAR values were 1 and 0.99, respectively. According to Table 2b, in terms of learning, our proposed RL OH* performs very similarly across the different models, and similarly to DDMRP OH*.
With respect to convergence, the models based on DDMRP's OH* (P1R1-P1R2-P1R3) and based on our proposed RL OH* (P1R1P-P1R2P-P1R3P) converge properly, evidencing a successful learning process by all the algorithms (see Fig. 6).
Test results can be observed in Table 3a. For each of the products (P1-P2-P3-P4), four models are compared against each of the metrics. R1-R2-R3 are models related to each of the reward functions defined in Sect. 4.3.1, and the fourth model, DDMRP, is based on the purchase order policy defined in DDMRP´s theory.
The results of the AAR metric show that the model with the best performance is R3. For both DDMRP's OH* (Table 3a) and our proposed RL OH* (Table 3b), the highest result is obtained by this model. This means that model R3 accumulated more rewards on average; that is, it behaves better in terms of staying at the desirable level (yellow). In particular, in Table 3a the AAR of model R3 is 0.88, meaning that it accumulated 0.88 rewards per day on average (1 being the maximum possible value). Regarding the AOHD metric, the best results are obtained by R2 in the case of DDMRP's OH*, and by R1 and R3 in the case of our proposed RL OH*. Note that in this case study all of our proposed models outperformed the purchase order policy of the DDMRP theory. In Table 3a, R2 has a median distance of 16 units from the OH* level, the closest among the compared models (the visual behavior of AOHD can be seen in Figs. 7 and 8). Regarding the BS metric, none of the models incurs any stockouts; all the models behave well, since at no time do they run out of inventory. Finally, in terms of REL, Table 3a shows that all of our models outperformed the purchase order policy of DDMRP, which has a ratio of 10.05, meaning it was the most affected by the bullwhip effect.
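The three inventory metrics discussed above can be summarized for one evaluation run with a short Python sketch. It assumes that AAR is the mean daily reward, that AOHD is reported as the median absolute distance between on-hand inventory and OH*, and that BS counts the days on which on-hand inventory is zero or below; the function name and the toy data are hypothetical.

```python
from statistics import mean, median


def evaluate_inventory(on_hand, rewards, oh_star):
    """Summarize one evaluation run (assumed metric definitions).

    on_hand: end-of-day on-hand inventory, one value per day.
    rewards: reward received each day (at most 1 per day, as in the text).
    oh_star: optimal on-hand level for each day, same length as on_hand.
    """
    distances = [abs(oh - target) for oh, target in zip(on_hand, oh_star)]
    return {
        "AAR": mean(rewards),                       # average reward per day
        "AOHD": median(distances),                  # median distance to OH*
        "BS": sum(1 for oh in on_hand if oh <= 0),  # days with a stockout
    }


# Five hypothetical evaluation days with a constant OH* of 30 units.
result = evaluate_inventory(
    on_hand=[40, 25, 0, 30, 35],
    rewards=[1, 1, 0, 1, 1],
    oh_star=[30, 30, 30, 30, 30],
)
print(result)
```

Using the median rather than the mean for the distance keeps a single extreme day, such as a stockout, from dominating the summary.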

P2: Product 39,933
According to Table 4a, the model with the fastest learning process uses the reward function R1. It reaches a PBAR of 32% in 100 episodes. The learning time increases considerably in this case study with respect to P1 because the complexity of the variables in the P2 case study is much higher. While the best model of P1 (P1R1) obtains 93% of the PBAR in 100 episodes, the best model of P2 (P2R3) requires around 2000 episodes to reach 92%; in other words, it takes around 20 times longer to reach nearly the same level of PBAR. Table 4b shows the results of our approach, which are similar.
Fig. 9 shows that the models based on DDMRP's OH* (P2R1-P2R2-P2R3) and the models based on our proposed RL OH* (P2R1P-P2R2P-P2R3P) converge properly, evidencing a successful learning process by all the algorithms.
Table 5 shows the results for product P2. The results of the AAR metric show that the model with the best performance is R1. Note that for both DDMRP's OH* (Table 5a) and our proposed RL OH* (Table 5b), the highest result is obtained by this model. This means that R1 accumulated more rewards on average; that is, it behaves better in terms of staying at the desirable level (yellow). In particular, in Table 5a the AAR of R3 is 0.71, meaning that it accumulated 0.71 rewards per day on average (1 being the maximum possible value). Regarding the AOHD metric, the best results are obtained by R2, both with DDMRP's OH* and with our proposed RL OH*. This model has a median distance of 17 units from the OH* level, the closest among the compared models (the visual behavior of AOHD can be seen in Figs. 10 and 11). Regarding the BS metric, all the models based on DDMRP's OH* have one day of stockout, except R2, which has 14 days. This changes significantly with our proposed RL OH*, in which the worst model, R1, has only 4 days of stockouts.
Finally, in terms of REL, Table 5 shows that the worst performances were obtained by R3 with our proposed RL OH* (P2R3P), and R1 based on DDMRP's OH* (P2R1P), meaning they were affected the most in terms of the bullwhip effect ratio.

P3: Product 28,440
With respect to Table 6, to analyze the performance of the DDMRP's OH*, the model with the fastest learning process was R3 (P3R3 and P3R3P). It reaches a PBAR of 39% in 100 episodes and 97% in 10,000 episodes. Fig. 12 shows that the models based on DDMRP's OH* (P3R1-P3R2-P3R3) and those based on our proposed RL OH* (P3R1P-P3R2P-P3R3P) converge properly, evidencing a successful learning process by all the algorithms.
Fig. 10 Model results behavior based on DDMRP's OH* (P2 case)
The test results on the AAR metric for P3 are shown in Table 7. The model with the best performance is R1, for both DDMRP's OH* (Table 7a) and our proposed RL OH* (Table 7b). This means that R1 accumulated more rewards on average; that is, it behaves better in terms of staying at the desirable level (yellow). In particular, in Table 7a the AAR of R3 is 0.82, meaning that it accumulated 0.82 rewards per day on average.
With respect to the AOHD metric, the best results are obtained by models R1 and R2 of our proposed RL OH*. These models had a distance to the OH* of 23 and 17 units, respectively (the visual behavior of AOHD can be seen in Figs. 13 and 14). Regarding the BS metric, none of the models incurs any stockouts; all the models behave well, since at no time do they run out of inventory.
Finally, in terms of REL, in Table 7 the worst performance was obtained by DDMRP´s purchase order policy, with a ratio of 2.13, meaning it was affected the most in terms of the bullwhip effect ratio.

P4: Product 43,387
In Table 8, for the training performance of the DDMRP's OH*, the model with the fastest learning process was R3 (P4R3 and P4R3P). Note that the PBAR reaches 88% in 100 episodes. The performance of our proposed RL OH* is very similar. Fig. 15 shows that the models based on DDMRP's OH* (P4R1-P4R2-P4R3) and those based on our proposed RL OH* (P4R1P-P4R2P-P4R3P) converge properly, evidencing a successful learning process by all the algorithms. Table 9 shows the general results for P4. The test results on the AAR metric show that the model with the best performance is R1 for DDMRP's OH* (Table 9a) and R3 for our proposed RL OH* (Table 9b). The highest results obtained by these models mean that they accumulated more rewards on average. In particular, in Table 9 the AAR of model R1 is 0.85, meaning that it accumulated 0.85 rewards per day on average.
Fig. 11 Model results behavior based on our proposed RL OH* (P2 case)
Regarding the AOHD metric, the best results are obtained by model R2 in the case of DDMRP's OH*. In general, in this case study, all the models outperformed DDMRP's ordering policy; this model (DDMRP) has the furthest distance to OH* (6 units in median) (the visual behavior of AOHD can be seen in Figs. 16 and 17). Regarding the BS metric, there is a significant improvement of our proposed RL OH* over DDMRP's OH*, due to the decrease in the number of stockouts in the test period. Note in Table 9 that, with our proposed RL OH*, only model R3 has one stockout.
Finally, in terms of REL, Table 9 shows that the worst performance was obtained by DDMRP's purchase order policy, with a ratio of 1.86, meaning it was affected the most in terms of the bullwhip effect ratio. Note that the closer the ratio is to one, the better.
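The interpretation of REL as a ratio that is best when close to one can be illustrated with a small Python sketch. The exact REL formula is defined earlier in the article and is not reproduced here; the sketch assumes the variance-ratio form often used to quantify the bullwhip effect (variance of units ordered divided by variance of units demanded), and the function name and data are hypothetical.

```python
from statistics import pvariance


def bullwhip_ratio(orders, demand):
    """Variance of units ordered over variance of units demanded (assumed form).

    A value near 1 means orders are about as variable as demand;
    larger values indicate amplified order variability.
    """
    demand_var = pvariance(demand)
    if demand_var == 0:
        raise ValueError("demand is constant; the ratio is undefined")
    return pvariance(orders) / demand_var


demand = [10, 12, 8, 10]         # hypothetical daily demand
smooth_orders = [10, 12, 8, 10]  # orders that track demand exactly
lumpy_orders = [0, 40, 0, 0]     # same total, placed in one large order

print(bullwhip_ratio(smooth_orders, demand))
print(bullwhip_ratio(lumpy_orders, demand))
```

In this toy case both ordering patterns purchase the same total quantity, yet the lumpy pattern yields a much larger ratio, which is the kind of distortion the metric is meant to expose.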
In Table 10, we compare the learning process performance, on average, between each of the reward function models that we proposed. In this table, we can see that the reward function that has the best learning performance is R3. In other words, R3 is the function that learns the fastest, on average, across all the case studies.
Table 11 summarizes the results of the BS metric and shows how the performance of the algorithms improves with our OH*. Note that in the P2 scenario the best performance is given by R2, with no stockouts. The same happens in the case of P4: again, our OH* outperforms DDMRP's OH* on the BS metric. Note that the worst result obtained with our OH* was one stockout, while the worst DDMRP's OH* scenario was 14 days of stockouts. In scenarios P1 and P3, the performance is the same in both cases; in general, however, our proposed RL OH* performs better than the one proposed by the DDMRP theory. This result is highly significant, given that the risks of lost sales and customer dissatisfaction are significantly reduced. Finally, regarding our proposed models, the best performance was obtained with R2 based on our proposed RL OH*. Note that in each of the case studies it was better than or equal to the best result.
Table 12 summarizes the results of the REL metric. Although the performance of the DDMRP's OH* is significantly higher in the theoretical scenario, in the real case studies (P2-P3-P4) this difference is not so significant. In particular, in the P2 case the best result was a ratio of 1.38, obtained by R2 and R3 of DDMRP's OH*, and also by R1 and R2 of our proposed RL OH*. In the P3 scenario, the results of our OH* also improve. Note that although the best result for both is given by R1 with a ratio of 0.96, R2 improves significantly, going from 0.86 to 1.05. Finally, in the P4 scenario, the performance of the DDMRP's OH* is higher than that of our OH* for the R1 and R3 models; however, R2 improves. Note that the closer this value is to one, the less the model is affected by the bullwhip effect.
Regarding the performance of the proposed models on this particular metric, the results were very diverse: R2 and R3 were the best in P1, R2 was the best in P2, R1 in P3, and R3 in P4. Although the results were diverse, in general all the proposed models outperformed the purchase order model defined in the theory (DDMRP).
In general, our proposed OH* performs better in most cases than the OH* defined by the DDMRP theory [see Eq. (7)], and significantly better in contexts of demand with high variability. This is because it avoids stockouts and is more robust against the bullwhip effect. Additionally, according to Tables 11 and 12, our proposed models in general also outperformed, in terms of efficiency, the purchase order policy defined in the DDMRP theory. As for which of the models is better logistically, by our criteria it is the R2 model, given its superiority in terms of BS and its good performance in REL. We recommend this model (R2) even though it was not the most efficient in terms of the time required in the learning process (see Table 10): it learns a policy with better results, but it is not the fastest to learn.
However, if the case study has a high level of complexity or computational limitations, we recommend using R3 since it obtains good results in terms of learning (see Table 10), and in terms of results (see Tables 11 and 12). The selection of the best model must be a tradeoff between whether what is sought is efficiency in terms of learning or performance of results.

Comparison with other works
The comparison of our proposal with other studies was carried out in relation to the following 3 comparison criteria:
• Technique: the techniques used.
• Bullwhip effect: it evaluates if the proposed model has a strategy to avoid distortions associated with the bullwhip effect.
• Adaptability: it evaluates if the proposed method can be applied in demanding scenarios with different seasonal and trend behaviors.
Paraschos et al. (2020) and Kara and Dogan (2018) propose an inventory management system that allows to optimally evaluate the tradeoff between cost (associated with equipment failures) and benefit. Wang et al. (2020) develop an order generation system based on price discount strategies. Giannoccaro and Pontrandolfo (2002) develop an inventory management system that allows making decisions in relation to supply, production, and distribution. Wang et al. (2020) develop an optimal replenishment and stocking strat-
On the other hand, there are still important works that present the inventory management problem as an optimization problem, seeking to minimize different aspects, such as storage costs, among others (Abdelhalim et al., 2021; Thürer et al., 2022). Other works have mixed them with machine learning techniques to, for example, predict the behavior of certain variables (Ran, 2021; Aguilar et al., 2022).
Based on Table 13, our proposal differs from the rest of the articles because it is the only one that proposes a model that avoids the distortions caused by the bullwhip effect in the supply chain. In particular, Giannoccaro and Pontrandolfo (2002) conclude that their proposed model can adapt to "slight changes of demand"; similarly, Kara and Dogan (2018) and Thürer et al. (2022) showed evidence that their models can adapt to uncertain demand, but none evaluated the bullwhip effect. Finally, Wang et al. (2020) assumed constant demand for their proposed model, which is a very strong assumption, far from reality.
In relation to adaptability, the articles propose solutions for a specific process or business sector. In particular, Wang et al. (2020) propose a model for a business whose suppliers have specific pricing policies. Karimi et al. (2017) develop a model for a human resource planning area with specific variables that could not be replicated in other businesses. Similarly, Paraschos et al. (2020) develop a quality control model for detecting failures, and Kara and Dogan (2018) a model for perishable products. Finally, although the model of Giannoccaro and Pontrandolfo (2002) can be replicated in multiple business sectors, the work does not make clear how it can be used in other contexts. Next, we make a quantitative comparison using the metrics used in the works most similar to ours, and their reported average quality (see Table 14).
In general, our approach obtains very good results in the metrics on the quality of the reinforcement learning algorithm (AAR and PBAR), which is also the case for the works of Kara and Dogan (2018) and Paraschos et al. (2020). The works that define an optimization problem (Saputro et al., 2021; Abdelhalim et al., 2021) do not manage to optimize the cost in all cases. Regarding the inventory management metrics related to the bullwhip effect (REL and BS), our approach has acceptable results (this effect has been little studied in the literature). Other works try to evaluate the quality of the models in following the ideal behavior of the inventories (Ran, 2021; Aguilar et al., 2022), with interesting results, but they do not consider the bullwhip effect and its impact on the number of stockouts.

Conclusions
This article implements a hybrid reinforcement learning algorithm based on the DDMRP theory and RL algorithms for inventory management that allows a more efficient ordering process. Additionally, we develop an alternative optimal inventory level function that outperforms the function defined by DDMRP. This was concluded by comparing the performance of the algorithm in scenarios with different characteristics, performing adequately in difficult case studies with a diversity of situations, such as scenarios with discontinuous or continuous demand, seasonal and non-seasonal behavior, high demand peaks, and multiple lead times, among others.
According to the results obtained, the model with the best performance was R2. It provides a balanced purchasing policy that optimizes the distance to the optimal inventory and the REL ratio, and minimizes stockouts. Note that although this was the best model, the other models proposed in our case studies were also promising, as they were in general more efficient in terms of purchase orders than the model proposed by DDMRP.
In terms of inventory level, we show that in cases like P4 and P2, where the level is too close to zero, stockouts can occur multiple times as the variability of the demanded units changes. The results provide evidence that our proposed inventory level significantly reduces the number of such occurrences, which can avoid the associated risks and costs. Regarding the REL ratio, the results show that in the case studies our models outperformed the model of the DDMRP theory. Thus, our models are more robust and less affected by the bullwhip effect.
In terms of learning performance, it was shown that, in general, the most efficient model is R1 and the least efficient is R2. Depending on the computational resources available, one model may be more suitable than another. In our case studies, R2 adapted well to the resources, and it was possible to take advantage of its good results in the evaluation metrics.
For future work, it is proposed to build inventory management systems based on the SARSA and Deep Q Network reinforcement learning algorithms (Huang et al., 2020). The SARSA model is proposed with the objective of comparing the effect of an on-policy algorithm (as SARSA is) with that of the off-policy algorithm used in this article. The on-policy approach could lead to better performance in the learning process. On the other hand, the model based on Deep Q Network is proposed since the neural networks replace the Q table, which in practice can translate into an increase in the performance of the learning process since it is not based on a predefined discrete space (Q table). Finally, future work will explore an alternative exploitation-exploration policy that reduces the exploration rate over time, with the objective of increasing the efficiency of the learning times of the model. The current exploitation-exploration policy continues exploring at the same rate from the start to the end of the episode.
Fig. 17 Model results behavior based on our proposed RL OH* (P4 case)
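The two directions proposed for future work can be sketched in Python as follows. The sketch contrasts the off-policy Q-learning target with the on-policy SARSA target and shows one possible decaying exploration schedule. All names, the geometric decay form, and its default parameters are assumptions made for illustration; they are not taken from the implementation evaluated in this article.

```python
import random


def q_learning_target(q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the greedy action in next_state."""
    return reward + gamma * max(q[next_state].values())


def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + gamma * q[next_state][next_action]


def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.999):
    """Exploration rate that shrinks geometrically with the episode index."""
    return max(eps_min, eps_start * decay ** episode)


def epsilon_greedy(q, state, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    actions = list(q[state])
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[state][a])


# Hypothetical Q table with a single state and two actions.
q = {"s1": {"buy": 0.4, "wait": 0.9}}
print(q_learning_target(q, reward=1.0, next_state="s1", gamma=0.5))
print(sarsa_target(q, reward=1.0, next_state="s1", next_action="buy", gamma=0.5))
print(decayed_epsilon(0), decayed_epsilon(5000))
```

The two targets differ only when the next action is exploratory, so the gap between them narrows as the exploration rate decays.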
Funding Open Access funding provided by Colombia Consortium.

Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.