Coordinated Replenishment Game and Learning Under Time Dependency and Uncertainty of the Parameters

This research proposes a periodic review multi-item two-layer inventory model. The main contribution is a novel approach to determine the can-order threshold in a two-layer model under time-dependent and uncertain demand and setup costs. The first layer consists of a learning mechanism to forecast demand and forecast setup costs. The second layer involves the coordinated replenishment of items, which is analysed as a Bayesian game with uncertain prior probability distribution. The research builds on the concept of the (S, c, s) policy, which is extended to the case of uncertain and time-dependent parameters.


Introduction
This paper studies a periodic review multi-item two-layer inventory model to determine the can-order threshold of a (S, c, s) policy in a coordinated replenishment problem. The (S, c, s) policy was first introduced by Balintfy [3] and is useful in situations where the demand is stochastic and where the setup costs are high.
In the (S, c, s) policy, an order is placed to take the inventory level to the order-up-to threshold S. A replenishment occurs either when the inventory level is lower than the mustorder threshold s or when the inventory level is lower than the can-order threshold c and Various algorithms [4,8,18,23,29,32,33] and heuristics [6,22,34] to determine the thresholds of the (S, c, s) policy are established and compared in the literature [12,19,21,26,27]. Federgruen et al. [8] proposed an algorithm to determine the optimal (S, c, s) policy in a multi-item inventory system with compound Poisson demands. In the work of Zheng [35], the author provides a method to determine the thresholds in a single-item continuous review inventory system with Poisson demand. Similarly, in this work we assume that the demand and discount opportunity of each item evolves according to two independent Poisson processes. However, we consider that the rate parameters of these processes are unknown and change over time.
In practice, demand is often assumed to be uncertain [2, 10, 20, 23-26, 28, 31] and demand forecasting models are widely developed in order to cope with uncertainty [9]. In this research, as the demand follows a Poisson distribution, we apply a Poisson regression model to learn the future demand [5]. This technique is commonly used to forecast a count variable that follows a Poisson distribution and it ensures that the estimated parameter takes values in the set of nonnegative integers Z + . On the other hand, the setup costs can also be uncertain [1]. In this case, we apply an exponential smoothing method to forecast the values of the setup cost at each time t [11,17]. The uncertainty in the parameters highlights the versatility of the proposed method and its link to data-driven approaches. Indeed, the method is effective even when we do not know a priori the values of the parameters. Furthermore, considering setup costs as unknown parameters allows to accommodate real-life scenarios where it is common that market conditions generate changes in transportation cost.
In this paper, we focus on determining how the optimal can-order level in a periodic review system with multiple items and discount opportunities can be obtained, while learning uncertain setup cost and uncertain demand. We assume that each item is managed by an independent decision maker. In addition, assuming that the decision-maker of each item does not have information about the demand of any other item, we translate this coordinated replenishment problem into a Bayesian game, where each item is considered as a player. The theory about Bayesian games was introduced by Harsanyi [14][15][16]. However, in a Bayesian game all the players need to know the prior probability distribution of the demand. In this work, we assume that the players do not know the exact prior probability distribution, but each player knows the set of possible probability distributions. To be protected against the uncertainty of not knowing the real probability distribution, we apply distributionally robust optimization [30] to obtain the optimal can-order level that minimizes the expected cost.
The main contributions of this paper are: • We present a new model to describe a coordinated replenishment game which involves a learning mechanism to forecast uncertain and time-dependent parameters. This model is based on Bayesian games and distributionally robust optimization to accommodate the uncertainty in the probability distributions. • We prove that the cost function is convex with respect to the can-order threshold and that it is concave with respect to the set of probability distributions used for the robust optimization problem. Furthermore we prove that, in terms of Bayesian games, finding an optimal can-order level that minimizes the cost function corresponds to finding a strategy at a Bayesian Nash equilibrium. • We provide a new method to determine the optimal-can order level that minimizes the cost function. • We conclude with a numerical analysis to corroborate our theoretical results. This paper is organized as follows. In Sect. 2, we introduce the model for a specific item and we extend the model to multi-item joint replenishment. In Sect. 3, we analyse the coordinated replenishment model presented in Sect. 2 as a Bayesian game. In Sect. 4, we present our main results. In Sect. 5, we provide a numerical analysis. Finally, in Sect. 6 we draw conclusions and discuss future works.

Problem Statement and Model
Consider consecutive time intervals [t, t + 1) for all t = 0, 1, . . ., and let the set of items N = {1, 2, ...n} be given. Let us assume that each item i ∈ N is managed by an independent decision maker. For each specific item i ∈ N , we also assume that the customers arrive one at a time. The arrival of a customer indicates the occurrence of a demand event. The cumulative demand in a time interval [t, τ ], where τ ∈ [t, t + 1), is given by the number of customers that arrive during that period of time. Let d i (τ ) ∈ Z + be the cumulative demand of item i from time t to τ , where Z + represents the set of nonnegative integers. The value of the cumulative demand d i (τ ) is reset at the beginning of each time interval [t, t + 1). It means that when τ = t + , d i (τ ) = 0, where t + denotes the first time instant in the interval [t, t + 1). We are interested in determining the cumulative demand that is observed at the end of each discrete-time interval. Let us denote by D i (t + 1) ∈ Z + the observed cumulative demand of item i from time t to t + 1. In addition, let us denote by (t + 1) − ∈ [t, t + 1) the last time instant in the interval [t, t + 1). The cumulative demand at time t + 1 is given by It is assumed that d i (τ ) evolves according to a continuous time Poisson process with rate λ i (t) ∈ R + , where λ i (t) represents the average number of demand events that occur during the time interval [t, t + 1), and R + denotes the set of positive real numbers. However, at the beginning of each time interval the true value of the rate parameter λ i (t) is unknown and has to be estimated. Let us denote byλ i (t) the estimated parameter of the Poisson process. This parameter is learned from past data by applying a Poisson regression model. In this Poisson regression model, the estimated parameter depends on the historical observed cumulative demand D i (t), as well as on the historical data of the minor and major setup costs, denoted by a i (t) and A(t), respectively. The minor setup cost a i (t) ∈ R + is the cost that has to be paid when item i is ordered jointly with at least one other item j ∈ N , j = i. On the other hand, the major setup cost A(t) ∈ R + corresponds to the cost that has to be paid when an order is placed and no other items are ordered at the same time.
Before explaining the Poisson regression model, let us introduce the forecast cumulative demand of item i at time t +1, denoted byD i (t +1) ∈ Z + . We assume that the forecast cumulative demand follows a Poisson distribution with rateλ i (t). Let D i (t) be the discrete-time random variable of the forecast cumulative demand. The corresponding Poisson distribution has the following probability mass function: It is important to remember that the expected value and variance of a Poisson distribution correspond to the value of the rate parameter. Furthermore, we consider that the forecast cumulative demand is given by the value of the estimated parameterλ i (t), namely For a specific item i, let us assume that we have historical data of the observed demand and the setup costs. The data consist of m independent observations of the cumulative demand D i (t), as well as of the major and minor setup costs A(t) and a i (t), respectively, for all t = 0, 1 . . . , m. In a Poisson regression model, D i (t) is the dependent variable and (A(t), a i (t)) are the linear independent regressors that determine D i (t) at each time t. The Poisson regression model in this case is defined by the following equations: where Pr(D i (t)|(A(t), a i (t))) is the conditional probability distribution of the observed cumulative demand given the independent variables A(t) and a i (t), and the unknown parameters β 1 , β 2 ∈ R called regression coefficients. Observe that this conditional probability distribution is also a Poisson distribution, where the parameter λ i (t) is a continuous function that depends on the variables (A(t), a i (t)) and also on the unknown parameters β 1 and β 2 . Let us denote by β = (β 1 , β 2 ) the vector of unknown parameters and byβ = (β 1 ,β 2 ) its corresponding vector of estimated parameters. In a Poisson regression model, these parameters are estimated by maximum likelihood estimation (MLE). Given m independent observations, the log-likelihood function is given by To maximize the log-likelihood function, we differentiate (3) with respect to β = (β 1 , β 2 ) and we set the derivative equal to zero. The estimated parametersβ 1 andβ 2 are obtained from solving the set of nonlinear equations: Note that finding an analytical solution forβ 1 andβ 2 is prohibitive. However, it is common to apply iterative algorithms to obtain the values of the parameters that maximize the log-likelihood function. The most common method that statistical programs use to find the solution for the parameters is the Newton-Raphson iterative method. By applying this method convergence is guaranteed, as the log-likelihood function (3) is globally concave.
On the other hand, at time t + 1 the values of the major and minor setup cost are also unknown. The learning mechanism applied to forecast the setup costs is the exponential smoothing. Let us denote byÂ(t + 1) andâ i (t + 1) the forecast values of the major and minor setup costs at time t + 1, respectively. These values are obtained from the following exponential smoothing model: where α and γ are the smoothing coefficients. Finally, to forecast the value of the demand rate at time t + 1 we substitute in (2) the estimated parameters (β 1 ,β 2 ), and the forecast setup costs (Â(t + 1),â i (t + 1)) as follows: Figure 2(left) displays the continuous cumulative demand d i (τ ) which evolves according to a Poisson process and the cumulative demand D i (t) in the interval [t −1, t). The displayed example repeats over three consecutive time periods. Once a Poisson regression model has been fit for the historical information up to time t, we can use this model to forecast the demand rate at time t + 1 as in Fig. 2(right).
In the (S, c, s) policy, item i is ordered as a result of a discount opportunity when its inventory level is lower than its can-order threshold c i (t) ∈ R + and there is, at least, an order placed by the decision maker of another item j ∈ N , j = i. We assume that the discount opportunity events for item i are also modelled by a Poisson process with rate μ i (t) ∈ R + . However, similarly as in the rate parameter λ i (t), the discount opportunity rate is unknown at the beginning of each time interval [t, t + 1) and has to be estimated. Let us denote byμ i (t) the estimated rate parameter of the Poisson process that describes the discount opportunity events for item i. The discount opportunity rate can be calculated as follows: For the computation of the discount opportunity rate, in (7) we consider the possible events that can generate a discount opportunity. For a specific item i, the first term in (7) corresponds to the possibility that any other item j = i is reordered separately. On the other hand, the second term in (7) corresponds to the possibility that any item j ∈ N , j = i, is reordered jointly with a different item k ∈ N . Note that the discount opportunity rate of item i depends on the demand rates of the other items in N . In this work, we assume that the demand Poisson processes for the different items are independent. In Sect. 3, we explain how each player i estimates the values ofλ j for all j = i, j ∈ N . It is important to observe that the demand and the discount opportunity are two independent Poisson processes; therefore, any possible event, which involves either the materializing of a demand or the occurrence of a discount opportunity, is determined by a Poisson process with rate λ i (t) + μ i (t).
At each time t, an optimization problem over an infinite time horizon is formulated to obtain the optimal thresholds of the can-order policy (S i , c i , s i ). In doing this the values of the demand parameters are regarded as constant and equal to the last forecast. This is illustrated in Fig. 3, whereD i (t) is the forecast demand at time t, which is regarded as a constant as indicated by the horizontal line. At each time t, a newD i (t) is calculated and used in the new optimization problem.
The must-order s i (t) ∈ N level can be obtained as The term k is a safety factor that has to be set to ensure that in a specified percentage of the possible cases shortage does not occur [13]. The order-up-to level S i (t) ∈ N consists of the safety stock SS and the economic order quantity E O Q i , namely S i := SS + E O Q i . For the economic order quantity, we have where h i is the holding cost per unit time. Then, for the order-up-to level we obtain The dynamics of the inventory level for a specific item i is characterized by a Markov decision-making process. Let us assume that the inventory level of item i at time t is y and when the next event occurs, the inventory goes to level x, the transition probabilities p yx (t) between two inventory levels are given by The transition probabilities in Eq. (10) describe a data-driven time varying Markov chain. The Markov decision process is illustrated in Fig. 4. The plot depicts transitions between inventory levels. When at time t the inventory is s i (t) + 1 the transition can be to s i (t) (unit demand) with probabilityλ i (t) . When the inventory is c i + 1 the transition can be to c i (unit demand) with ) or it remains at the same level in case of discount opportunity with . The expected time between consecutive events is given by 1/(λ i (t) + μ i (t)).
Given the transition probabilities, the purpose is to determine the optimal policy that minimizes the long-run average cost for each item i. The long-run average cost is defined similarly as in the work of Zheng [35] and is given by In the above, assuming that the inventory level is S i (we are at the beginning of a new replenishment period), we denote the expected time until the next order is placed by T i (S i ), the expected holding and penalty costs incurred until the next order is placed by I i (S i ), and the probability that the next order is triggered by a demand by Q i (S i ).
Let us denote by y the inventory level of item i at time t and by G i (y) its corresponding one-step inventory cost. In addition, let us denote by the probability that the next event is the occurrence of a unitary demand. In the following result, we explain how to obtain an explicit expression for each term in (11). (11), the following equations hold:

Lemma 1 For the long-run average cost
Proof The proof is in the Appendix.
Since the parameters involved in the computation of the long-run average cost depend on the thresholds S i (t), c i (t), and s i (t), the minimization of the long-run average cost with fixed S i (t) and s i (t) consists in finding the optimal value of c i (t). The problem we wish to solve can be formulated as follows: where the expressions for In the same spirit as in Zheng [35], an algorithm to calculate the optimal can-order level for a single-item inventory system with Poisson demand and Markovian discount opportunities on the setup cost is presented in the rest of this paper. In this research, the long-run average cost formula is extended to a multi-item periodic review system, where the demand rateλ i (t) is forecast by a Poisson regression model.
It is important to observe that the long-run average cost of a specific item i depends on the decisions made for the other items. When item i is reordered as a result of a discount opportunity the minor setup cost a i (t) is paid. Otherwise, whenever item i is reordered without a discount opportunity, the major setup cost A(t) is paid. Nevertheless, the decision-maker of item i does not know the demand rate λ j (t) of any other item, for all j ∈ N , j = i. In the next section, we explain how the decision-maker of each item i can estimate the demand rateλ j (t), for all j ∈ N , j = i and how it affects the computation of the optimal can-order level that minimizes the long-run average cost (12). This estimation is based on the theory of Bayesian games as introduced by Harsanyi [14][15][16].

Bayesian Game
In this section, we analyse the coordinated replenishment model presented in Sect. 2 as an incomplete game, where each item i ∈ N represents a player. A game is called incomplete when the players do not have complete information about the important parameters of the game. As it is explained in the previous section, at each time t a new observation of the demand is obtained based on which we estimate the future demand and setup costs. These estimated parameters are used in the Bayesian game and in the computation of the optimal can-order level c * i (t) that minimizes the expected long-run average cost. For the sake of simplicity in the rest of this paper, we drop the dependence on time t.
First, let us introduce the control variable u i for a specific item i. This variable indicates the possible amount to be reordered, which can be triggered either by the occurrence of a demand event or by a discount opportunity, and it is determined by the corresponding (S i , c i , s i ) policy. For any inventory level y, this strategy can be defined formally as follows: Let us denote by G the incomplete game related to the initial coordinated replenishment problem. We consider this as an incomplete game since the only information available to player i is its own estimated parameterλ i . However, player i does not know the estimated parameterλ j of any other player j ∈ N , j = i. Letλ = (λ 1 , . . . ,λ n ) be the random vector that contains the demand rates of all the players. Also, letλ −i = (λ 1 , . . . ,λ i−1 ,λ i+1 , . . . ,λ n ) be the random vector that contains the demand rates of all the players, except player i. The incomplete game G can be formally defined in normal form as follows: where N = {1, . . . , n} is the set of players, U i is set of strategies of player i, L i is the set of possible values that the rate parameterλ i can take, is the conditional probability distribution that player i uses to describe the chance of having certain vectorλ −i given that player i knows the value of its estimated parameterλ i .
To determine the best strategy for player i, denoted by u * i , we need to define a complete game that is Bayes-equivalent to the incomplete game G . This complete game is also called Bayesian game, and it is denoted byḠ . To obtain the Bayesian game, we have to assume that each player i knows not only its own estimated parameterλ i , but also the common prior probability distribution ofλ, represented byP(λ) =P(λ 1 , . . . ,λ n ). Given the prior probability distributionP(λ), the complete gameḠ can be formally defined bȳ Note thatḠ differs from G from the fact that the set of probability distributions {P i } n i=1 is replaced byP, which is supposed to be known by all players. The games G andḠ are called Bayes-equivalent because P i (λ −i |λ i ) =P(λ −i |λ i ), whereP(λ −i |λ i ) can be obtained from P(λ) using the Bayes' rule:P It is important to mention that in the Bayesian game instead of minimizing the longrun average cost g i (S i , c i , s i ) as in (12), we aim to minimize the conditional expected long-run average cost. Let us denote the conditional expected long-run average cost by Let u * −i be the set of optimal strategies of all players except player i. In the rest of this paper, we are going to make use of the following definitions.

Definition 1
A best-response set in a Bayesian game is the set of all strategies that minimize the expected long-run average cost: The can-order level c i takes values between the upper threshold S i and the lower threshold s i . Therefore, we have a finite set of options for the values of the can-order level. As a result, there is always an optimal strategy u * in the best-response set B i (u −i ).
For compactness, let us denote the optimal strategy u * This strategy corresponds to the optimal can-order level c * i that minimizes the expected long-run average cost.
Note that at a Nash equilibrium all players play a best-response, such that u * i ∈ B i (u * −i ) for all i ∈ N , and no player benefits from changing its reorder strategy.
The following result was borrowed from Harsanyi [15], and it states that if u * = (u * 1 , . . . , u * n ) is a Nash equilibrium strategy of the complete gameḠ , then it is also a Bayesian equilibrium strategy for the incomplete game G . (14) and the Bayesian game (15) be given, such that (15) is Bayes-equivalent to (14). In order to any given n-tuple of strategies u = (u 1 , ..., u n ) be a Bayesian equilibrium point in the game (14), it is both sufficient and necessary that in the normal form of the game (15) this n-tuple u be an equilibrium point in Nash's sense.

Theorem 1 Let the incomplete game
However, defining a prior probability distributionP that is known by all players can be very complicated. In this work we assume that the players do not know the exact prior probability distributionP, but each player knows the set of possible probability distributionsP(λ −i |λ i ), denoted by P. In order to be protected against the uncertainty of not knowing the real probability distribution we reframe our approach within the context of distributionally robust optimization. Distributionally robust optimization emerges from the idea of being protected against the ambiguity in the probability distribution by taking the worst-case scenario, which involves finding the probability distribution that maximizes the long-run average cost. As a result, we can formulate the following robust optimization problem: Each player i ∈ N can define its best strategy u * i by finding the optimal solution of (20) denoted by the point (c * i ,P * (λ −i |λ i )). Observe that for fixed values of the upper and lower thresholds S i and s i , respectively, determining the optimal strategy that characterizes the control variable u i consists first in finding the optimal probability distributionP(λ −i |λ i ) that maximizes the long-run average cost g i (S i , c i , s i ) for each possible can-order threshold c i . Then, we can find an optimal can-order threshold c i that minimizes the expected long-run To find the optimal solution, it is important to define the correct set of possible probability distributions P. We assume that P is a set of multi-variate Poisson probability distributions.
Let us denote by M Poisson(η,λ −i ) the probability distribution of a multi-variate Poisson random variable, whereη ∈ R 1×(n−1) + represents the vector of rate parameters of each random variableλ j ∈λ −i , j = i. Now we can formally define the set P as follows: We assume that the probability distributions in the set P satisfy the Bayesian-equivalence P i (λ −i |λ i ) =P(λ −i |λ i ). Under this assumption, we ensure that Theorem 1 is still satisfied. In addition, we assume that from the perspective of player i the rest of the players are homogeneous, namely the values of the parameters η j ∈ R + are the same for all player j = i. Let us denote this general parameter by η ∈ R + . For the sake of simplicity and without loss of generality, the random variablesλ j , j ∈ N , are independent and identically distributed with parameter η. We have to solve the robust optimization problem to obtain a single optimal value η * ∈ R + for the multi-variate Poisson distribution. Based on these assumptions, the expected long-run average cost is explicitly computed as follows: Note that in (22) we take the thresholds S i , c i , and s i as fixed and for each term in the sum we compute the long-run average cost as a function ofλ −i .

Main Results
In this section, we show that the expected long-run average cost E[g i (S i , c i , s i )|λ i ] is convex on the can-order threshold c i . Furthermore, we prove that finding an optimal threshold c i corresponds to finding a Bayesian Nash equilibrium strategy.
To prove the convexity of the expected long-run average cost with respect to the canorder threshold c i , first we are going to analyse the behaviour of the long-run average cost g i (S i , c i , s i ) for any possible vectorλ −i .
Let us define It is important to observe that ω i is a positive fraction, namely ω i ∈ (0, 1). Note that ω i increases when the can-order level decreases. Also let the following function be given The above is related to the cost of replenishing always with discount opportunity when the inventory reaches the can-order level.

Lemma 2
For ω i as in (23), the following equation holds: Proof Let θ =λ i /(λ i +μ i ). Since the inventory level is at S i and c i must be less than S i , for T i (y) we have I i (y) at an inventory level of S i is given by where S i > c i . Q i at inventory S i can be written as: where S i > c i . Let N g i and D g i be the numerator and denominator of the cost function in (12). From (26) we have thatλ

From (27), we also havê
The variables, T i , I i and Q i can be substituted into the objective function in (12), and this yields Let us now substitute c i + 1 in (29), which yields: Note that for the last part of the numerator it holds Then, (30) can be rewritten as We also have Likewise, from (24) we have: From (31)-(33), we can conclude that (25) is proved.
A direct consequence of Lemma 2 is that g i (c i + 1) is in the convex hull between g i (c i ) and f i (c i + 1).
In the next result, we show that the optimal value of c i is located at a local minimum of g i for any possible vectorλ −i , under continuous relaxation for c i , namely c i ∈ R + . Let the discrete difference operator be given ∂ ∂c i , which is defined as

Lemma 3 Let the can-order threshold be c i ∈ R and in the interval [s i , S i ]. Then, at a local minimum it must hold
Proof First, we obtain the derivative of g i (·) as At a local minimum it must hold As ω i is only zero when there is no demand, the above implies (34) is proven.
In the next result, we establish the convexity of function g i (S i , c i , s i ) in the decision variable c i for any possible values ofλ −i .

Lemma 4 The long-run average cost function is convex with respect to c i , and the following inequality holds:
where ν ∈ [0, 1] and g i (·) is defined on positive real numbers.
Proof First note that g i (·) is defined on positive real numbers as the costs are nonnegative. Now, take a value for c i which is a local minimum for g i (·) and denote itc i . Then the value should be greater than or equal to g i (S i ,c i , s i ). This is the case, if and only if the second derivative of g i (·) is greater than or equal to zero. From the first derivative in (35), we obtain where ω i is as in (23).
To simplify the expression of (37), the result of (35) is substituted in (37), which yields Furthermore, the above can be rewritten as The previous condition yields which can be simplified as where h i (S i − c i − 2) ≥ 0 and ( c i 2 + 1 2 − S i 2 ) ≤ 0, if S i ≥ c i + 1 which holds true. Since E(a i ) > 0 andλ i > 0 hold true, then (41) is satisfied and we can conclude that A direct implication of the above is that (∂ f i (c i + 1, S i ))/(∂c i ) is positive. This means that the first term in (38) for ∂ 2 g i (S i ,c i ,s i ) ∂c 2 i is greater than or equal to zero, namely: In order to prove convexity, the rest of ∂ 2 g i (S i ,c i ,s i ) ∂c 2 i must be positive or smaller than (42).
Since from (35), we know that ∂ becomes zero in a local minimum. Then in the neighbourhood of a stationary point is relatively small. This makes the second term in (38) approximately equal to zero and therefore negligible. We can conclude that for any stationary point the second derivative is nonnegative and therefore function g i (·) is convex on c i . Lemma 4 proves that g i (S i , c i , s i ) is convex on c i ; therefore, it is known that the optimal can-order level is at the same time a local and global minimum.
In the following result, we prove the convexity of the function E[g i (S i , c i , s i )|λ i ] with respect to the decision variable c i , for a given parameter η that characterizes the probabilitȳ P(λ −i |λ i ).

Theorem 2 The expected long-run average cost function is convex with respect to c i , and the following inequality holds:
where ν ∈ (0, 1) and g i (·) is defined on positive real numbers.
Proof Similarly as in Lemma 4 we have to prove that for any is greater than or equal to zero.
The first derivative of Consequently, the second derivative is given by Based on Lemma 4, we know that ∂ 2 ∂c 2 i g i (S i , c i , s i ) ≥ 0, and then each term in (44) is greater than or equal to zero. We can conclude that for any stationary point and for any parameter η, the second derivative is nonnegative and the function The set P defined by (21) has to be determined in such a way that the expected long-run average cost function is concave with respect to the parameter η. In the following lemma, we prove that such a set P exists.
Proof To prove the concavity of the expected long-run average cost with respect to η, we use the second derivative criteria. The first derivative is computed as follows: By computing the second derivative and evaluating it in η = η * , we obtain: Observe that as η * is the optimal value that maximizes the expected long-run average cost, the first two terms in the previous equation are equal to zero. Then, we have From the previous equation, we have that Hence η * is a local maximum. We can conclude that there exists an interval [η * −ε, η * +ε] such that the expected long-run average cost is concave.

Remark 1
Note that as c i ∈ [S i , s i ] and η ∈ [η * − ε, η * + ε], both variables are contained in convex and compact sets. In addition, based on Theorem 2 and Lemma 5 we know that the expected long-run average cost is convex with respect to c i and concave with respect to η, on their respective intervals. Therefore, according to John von Neumann's minimax theorem: In the following result, we prove that finding an optimal can-order level c i that minimizes (17) corresponds to finding a strategy at a Bayesian Nash equilibrium.

Theorem 3
Let for all players i ∈ N the upper threshold S i and the lower threshold s i be given. In addition for a given parameter η that maximizes the conditional expected long-run average cost, let the optimal can-order threshold c * i be obtained by: Then, the n-tuple (c * 1 , c * 2 , . . . , c * n ) is at a Bayesian Nash equilibrium of the Bayesian gamē G .
Proof Based on Theorem 2, we know that the expected long-run average cost is convex on the can-order level c i . Moreover, the optimal value c * i is a global minimum. Therefore, the can-order level c * i that minimizes the long-run average cost is unique. By definition of best-response set in a Bayesian game B i (u * −i ), if we find the can-order level that minimizes E[g i (S i , c i , s i )|λ i ] for a player i ∈ N , given the optimal strategy of the other players, then the optimal strategy of player i, u * i is in the set B(u * −i ) defined by (18). Hence, the strategy is at a Bayesian Nash equilibrium. Table 1 Setup cost and demand  parameters  t  1  2  3  4  5  6  7  8  9   a i  12  14  13  15  18  16  14  17  18   A  78  80  84  82  79  82  84  82  83  λ i  1 0  1 0  5  4  7  8  5  5  1 0   10  11  12  13  14  15  16  17  18  19   20  17  18  21  23  20  25  23  24  24   79  77  78  80  78  82  80  84  87  85   8  9  8  11  12  13  16  16  15

Numerical Analysis
Let us consider a time interval of 39 months, where each month consists of 4 weeks. The simulated results are obtained by optimizing the can-order threshold for the minimum expected long-run average cost at every time interval. Consider a system of N = 4 items which are ordered at a single distribution centre. At every time period, we forecast the demand and setup costs of a single item i to obtain the optimal can-order threshold of the (S, c, s) policy. The parameters are selected as follows. The demand is measured over 39 consecutive months. Each month consists of four weeks. The total number of weeks included in the simulations is 156. The lead time is 1 week. The holding costs h i are e2 per month. We consider 4 items involved in the coordinated replenishment. The major setup costÂ(1) is initially forecast to be e80. The minor setup costâ i (1) is initially forecast to be e15. The safety factor k is 3. The forecast coefficients α, γ are both 0.2. The initial demand is forecast considering observed data for the first 12 months applying a Poisson regression model and every month we add four more new observations (one for each week) to forecast the demand for the next month. Finally, it is assumed that all items in the system face the same demand uncertainty for the sake of simplicity and without the loss of generality. Note that S i (t) and s i (t) can therefore be represented by S(t) and s(t), respectively, as the levels are assumed to be the same for all items. The demand is assumed to follow a Poisson distribution with mean equal to λ i and the setup cost are assumed to follow a Gaussian distribution with mean equal to a i and A, see Table 1.
Since it is a forecasting model of the can-order policy, the demands and setup costs for all weeks are forecast. Figure 5 displays the forecastD i (t) and actual demand D i (t) over a 39-month period. Similar to the demand, the setup costs influence the control variables of the can-order policy and must therefore be redetermined at every time period. In Algorithm 1, we present the pseudo-code used to obtain the optimal can-order level at each month. Each month, the events are forecast again and the control variables of the can-order policy are redetermined. The optimal can-order level is determined by increasing the value of c i by one and calculating the corresponding expected long-run average cost for each probability distribution in the set P as in equation (21). If the expected long-run average cost calculated at (c i + 1) is higher than it was at the previous can-order level, the iterations stop and the value of the previous can-order level is the optimal value of c i . As it is stated in Theorem 2, the expected long-run average cost is convex with respect to c i . The optimal can-order level is at the same time a local and a global minimum. As a result of the behaviour of the long-run average cost and the iterative process of the algorithm, it is possible to say that it converges to the optimal can-order level c * i that minimizes the expected cost in the long run.
It is important to mention that for the Poisson regression model, we use the built-in MATLAB function glmfit. This function uses a method called weighted least squares to estimate the regression parametersβ 1 andβ 2 . Weighted least squares is a generalization of generalized least squares. It takes into account the weights that determine how much each value influences the final parameter estimates. For the convergence of the weighted least squares MATLAB uses a Newton-Raphson iterative method capturing the first-order optimality conditions at the solution, resulting in strong local convergence rates.
The simulations of the expected long-run average cost are depicted in Fig. 6. The graph corroborates the results in Theorem 2 and Lemma 5, as it shows that the expected long-run average cost is a convex function on c i and concave on η. For all consecutive time periods, the optimal can-order levels are listed in Table 2.
The optimal can-order levels which are determined in the previous step of the algorithm are used to simulate the inventory reordering policy and the related inventory evolution over time. Figure 7 shows that reordering also takes place at the can-order level. In addition, to analyse the stability of the algorithm, we run 100 simulations to compute the 95% confidence interval of the can-order level. The evolution over time of the can-order level and its 95% confidence interval is depicted in Fig. 8. In this plot, we can observe that the optimal values for the can-order level do not have a great variability as its confidence interval is small.

Algorithm 1 Algorithm to obtain the can-order level
Input: Simulation parameters as in Table 1, holding cost h i , safety factor k, set of probability distribution P defined by the parameter η as in (21) Output: Optimal can-order level c * i for each time period 1: Initialize forecast setup-costsÂ(1) andâ i (1) 2: for every time instance do 3: Generate new observed demand from a Poisson distribution with parameter λ i from Table 1  4: Generate new observed minor setup cost from a Gaussian distribution with mean a i from Table 1  5: Generate new observed major setup cost from a Gaussian distribution with mean A from Table 1  6: Estimate the setup cost by applying exponential smoothing as in equation (5)  7: Fit a Poisson regression model to obtain the regression coefficientsβ 1 andβ 2 8: Forecast the expected demand according to equation (6)  9: Compute the must-order level s i and order-up-to level S i as in equations (8) and (9), respectively 10: for every probability distribution in set P do 11: for every inventory level c i between s i and S i do 12: for every vectorλ ∈ R n do 13: Compute the discount opportunity rate as in equation (7) 14: Compute T i (S i ), I i (S i ), and Q i (S i ) as explained in Lemma 1 15: Compute the expected long-run average cost E[g i (S i , c i , s i )|λ i ] as in equation (17)  16: Save in a vector c opt the value of c i for every probability distribution in P 19: Save in a vector the value of the expected long-run average cost for the optimal c i 20:

Conclusions
In this research, we have developed a periodic review system with multiple items, with uncertain demand and uncertain setup costs. The problem consists in learning the uncertain demand and setup costs, while determining the values for the threshold of the can-order policy by minimizing the expected long-run average cost.
Motivations arise from two-layer time-dependent logistic systems. The first layer consists of a learning mechanism to obtain the uncertain input parameters. The setup cost and demand are uncertain, and their forecasts are run using exponential smoothing and Poisson regression, respectively. The second layer consists of a time dependent (S, c, s) policy where the must-order, can-order and order-up-to levels are determined every time new input data are obtained. The proposed methodology is original in that the long-run average cost is optimized under the assumption that the parameters are time-dependent. We translate the initial coordinated replenishment problem into a Bayesian game, where the probabilities involved in this computation are unknown. In order to deal with the uncertainty in the probability distributions, we consider the worst-case scenario by making use of distributionally robust optimization.
The main result is the proof of convexity of the cost function with respect to the canorder level, according to which the can-order value can be optimized by solving a convex problem. In addition, we show that there exists a set of probability distributions where the cost function is concave with respect to the parameter of the corresponding distribution. Finally we show that finding an optimal can-order level that minimizes the expected cost function in the Bayesian game corresponds to finding a strategy at a Bayesian Nash equilibrium. The developed model is validated on a numerical example.
An interesting approach for future works is to analyse the model from a different game theory perspective, by focusing on minimizing the long-run average cost g i for all the items at the same time applying, for example, the theory of strategic complements.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. level is one item above the must-order level. Namely, we have In the above, the time until the next order is placed T i (s i ) = 0 as the inventory level has reached the must-order point. If the inventory level is two items above the must order level, namely y = s i + 2, then the expected time until the next order is placed changes as follows: Analogously for the case of y = s i + 3, we obtain For the case where y > c i , T i (y) has a different structure. This can be seen by setting the inventory level to y = c i + 1 and considering as next event only a demand occurring as discount opportunities will be ignored. This yields From the above, we have which can be rewritten as Using (53) we obtain where δ(1) is the indicator function which takes value 1, if the inventory level is above c i and 0 if the inventory level is less than or equal to c i . From this, it becomes clear that the solution to the renewal equation has the following general form: Equation (55) has the following interpretation. Since the replenishment decision follows a Markov renewal process, a decision can be made in correspondence to two discrete events. One event corresponds to reducing the inventory of one unit, namely a unitary demand materializes, which occurs with probabilityλ i (t) . A second event is referred to as discount opportunity, which means that replenishment is triggered by another item which has reached the must-order point, and this occurs with probabilityμ i (t) λ i (t)+μ i (t) . Combining both events, which we assume to occur with a Poissonian clock and independently then we obtain (55).
From the above, we then obtain the following compact formulation: where θ(t) =λ i (t)/(λ i (t) +μ i (t)) and δ(x) = 1 if x ≥ 1, and δ(x) = 0 if u ≤ 1. An explicit expression for T i (S i ) in (11) is then To obtain the expected holding and penalty costs incurred until the next order is placed similarly to the calculation of T i (y), we first consider the case where the inventory level is between s i < y ≤ c i . Then, I i (y) is given by: where G i (x) is the one-step inventory cost and is assumed to be quasi-convex with lim |u|→∞ G i (x) = ∞. All cases of linear holding and penalty costs, and fixed cost for each backorder item are included in the same spirit as in the work of Federgruen and Zheng [7]. To see how to obtain the above equation, first the inventory level is set to y = s i + 1, which yields To determine the above for y = s i + 2, we can write The above expression contains a combination of the result of (59) and a new part. The previous inventory level, s i + 1 is reduced by one with probabilityλ i (t) , as depicted in the Markov decision process in Fig. 4.
For y = s i + 3, we then have 2 G i (s i + 1)).
In the case where y > c i , I i (y) presents a similar structure as T i (y), namely: where G i (c i + 1) divided byλ i (t) is the expected average holding and penalty cost until the next demand occurs. The above can be rewritten aŝ From the above, we obtain that the solution to the renewal equation is given by: Also in compact form we can write where θ =λ i (t)/(λ i (t) +μ i (t)) and again δ(x) = 1 if x ≥ 1, and δ(x) = 0 if x ≤ 1.
An explicit expression for I i (S i ) in (11) is then Finally, denote the probability that an order is triggered by a demand by Q i (y). When the inventory level is y > c i , no replenishment decision is made and therefore Q i (y) is given by Note that in the region of s i < y ≤ c i , Q i (y) is given by: Observe that the latter is the probability that y − s i unitary demand events occur consecutively. When this happens, the inventory level reaches the must-reorder point and a reorder must occur.
To see how to obtain (66), let y = s i + 1, which results in the following: In the above, we use the fact that Q i (s i ) = 1, as an order is placed directly when the inventory level is equal to the must-order level s i . By repeating the same reasoning for y = s i + 2, we obtain For y = s i + 3, we obtain: From the above, we derive that the solution to the renewal equation can be written as In compact form, we can write An explicit expression for Q i (S i ) in (11) is then