Robustness analysis of generalized Jackson network

Queuing networks are a well-established approach to modeling and analysis of complex systems. This paper develops an approach to risk-analysis of queuing network models, where “risk” is understood as the possible impact of ignoring parameter insecurity. Our approach allows to compute the value at risk of performance characteristics of queuing networks under parameter insecurity.


Introduction
Jackson networks (henceforth JN), to be formally introduced later on, are a well established class of models in, e.g., production, telecommunication, computer systems; for surveys see Kelly (1979) and Chen and Yao (2001).JN's have the desirable property that the distribution of the stationary queue length vector is of product-form, which allows for quick numerical evaluation of performance measures, such as the the mean Like for classical JN's, one can obtain for these extended JN's the steady-state distribution of the queue-length vector at stable nodes in product-from for different type of failure-regimes (a precise definition of "stable node" will be provided later in the text), see Sommer et al. (2017).In addition, closed-form solutions for the long-run throughput of subnetworks and of the complete network are provided.
Design and analysis of stochastic networks are often challenged by the fact that the exact specifications of the network are either not known.This is even more so true for models including breakdowns as there is typically only limited information on breakdowns available.Indeed, during usual operation breakdowns are to be avoided and typically only censored observations are available, which is in contrast to, for example, repairs (indeed, repair times are observable and can often be influenced by a decision maker).
Elaborating on the product-form results in Sommer et al. (2017), we will in this paper investigate the impact of the distribution of the time between breakdowns of the individual servers on the throughput of the network.More specifically, we will model the breakdown behavior through a parameterized distribution, and provide a robustness analysis of the system throughput with respect to the uncertainty parameters.Our analysis shows how for different breakdown and repair regimes, the corresponding risk profiles for system-oriented and customer-oriented performance metrics can be evaluated.It is worth noting that this efficient risk analysis step is only possible due to the simple closed-form solutions obtained for the performance measures.The framework provided in this paper allows to combine robustness analysis and performance modeling in an efficient way.
The research for robustness analysis of stochastic models is a predominant research line in Georg Pflug's work, see, for example the monographs (Ermoliev et al. 2006;Pflug 2000).Next to his impressive work on stochastic optimization the study of risk and the investigation on how to deal with uncertainty in stochastic models.
The paper is organized as follows.Section 2 gives a brief introduction to the class of generalized JN's.Robustness analysis is introduced in Sect.3. The general approach to robustness analysis is presented in Sect. 4. We conclude the paper with discussion of possible future research directions.

Jackson networks with breakdowns and repairs
We present a brief review of the theory of JN's with breakdowns and repairs.For details we refer to Sommer et al. (2017).The network consists of J exponential single server nodes with service discipline "First-Come-First-Served" (FCFS), the node set is denoted by J = {1, . . ., J }.At node i a Poisson stream with rate λ i ≥ 0 arrives from the exterior node 0, and service times at node i are exponential with rate μ i .All service times constitute an independent family of variables which are independent of the arrival streams.Standard customers are indistinguishable and follow the same rules.Routing is Markovian Nodes in V ⊆ J have an infinite supply from which customers are put into an idling server.We denote W := J \V and require V = ∅ (unless otherwise specified).Customers from the infinite supply have low priority, and (standard) customers arriving from the outside or from another server have high priority with preemptive-resume regime: Service of a low priority customer is interrupted as soon as a high priority customer arrives.Service of low priority customers is resumed only when the server idles again.When a low priority customer is served and fed into the network, he becomes a high priority customer and follows the rules for standard customers.Service times of the low priority customers are independent from the external arrival streams and the service times of high priority customers.
Let D denote the set of nodes that can breakdown.The breakdown-repair process With these functions we set for all subsets of down nodes Remark 1 With suitable functions A and B we can model, e.g., that nodes may break down isolated or in groups, and repair may happen similarly.It is not required that nodes which are broken down are repaired simultaneously.A statistical procedure to check whether this form is justified, is to determine in a first step all possible values A(I ) = α(∅, I } and B(I ) = β(I , ∅), ∀I ⊆ D, and then to check (1) stepwise.
The availability process Y is an ergodic Markov process with stationary distribution From this the stationary (time) point availability (PA) of a Jackson network with infinite supply and unreliable nodes (or subnetworks thereof) may be computed similar to Sauer and Daduna (2003), p.185) as PA(H ) := K ⊆D\H π(K ), for H ⊆ D and t ≥ 0, where π(I ) is the probability that exactly the nodes in I ⊆ D are under repair as given by (2).
The following regime is set in force whenever a node breaks down: -service at this node is interrupted, customers (of high as well as of low priority) are frozen there to wait for restart of the service, which is resumed at the point where it was paused, -no new customers are admitted to enter that node, -customers who select a broken down node to visit are rerouted according one of the classical rules: stalling, skipping or blocking rs-rd, which will be defined below, -all these rules, if applicable, are valid for high and low priority customers.
Rerouting is a functional of Y and applies only to high priority customers, because on departure from a node with infinite supply low priority customers are transformed immediately to high priority, and only thereafter are rerouted.We distinguish the following rerouting schemes: -Stalling: Whenever a node breaks down the service system is frozen: All arrival processes are interrupted and service everywhere in the network is stopped until all broken down nodes are repaired again.Stalling is applied, e.g., in the automotive industry to decrease variability of the flow of materials.Indeed, stalling prevents servers to send parts to a server that is broken down and thereby prevents piling up inventory.-Skipping: If as next destination of a customer a down node is selected, the customer jumps to this node, spends no time there, and immediately performs the next jump according to routing regime R until he arrives at a node in up status or leaves the network.Skipping is applied, e.g., in production networks where skipping a production step yields a product of lower but sufficient quality.-Blocking rs-rd: Broken down stations are blocked.A customer whose next destination is down stays at his present node to obtain immediately another service there.After the repeated service (rs) the customer chooses his next destination anew according to R (random destination (rd)).Blocking rs-rd is applied, e.g., in communication networks where packages are rerouted in case a link is not available.
Throughout this article, it is assumed that all nodes in W are stable, i.e., the traffic rate η i , following from the general traffic equations for Jackson networks with infinite supply but no breakdowns and repairs, is smaller than its service rate μ i for every node i in W , i.e., η i < μ i .Without breakdowns, i.e., D = ∅, the traffic equations of Jackson networks with infinite supply is By assumption, under blocking rs-rd the following reversibility constraints hold: and in case of skipping the following rate stability constraints hold: These constraints ensure that solution η i , i ∈ J , from ( 3) is also the solution of the traffic equations for unreliable Jackson networks under any breakdown scenario.
For more details see Sommer et al. (2017).Under these assumptions, we provide an overview on the studied performance characteristics in the following.For an overview of other performance characteristics see Sommer et al. (2017).
Under stalling the stationary throughput at nodes i ∈ W (no infinite supply) is Under blocking rs-rd and skipping the stationary throughput at a node i ∈ W is To simplify the presentation, we will in the following only consider individual breakdowns and repairs, where for a server i ∈ D the breakdown rate will be denoted by τ i and the repair rate by ρ i .We conclude this section with a short discussion on the robustness of JN's as modeling class for stochastic networks.Remark 2 Suppose that the physical layout of the network, i.e., the number of nodes, and the topology are known, as well as mean service times, mean inter-arrival times of customers at the network and routing decisions.Provided that this rough information constitute the only available data in advance, arguments exploiting entropy properties lead to use so-called product-form models as conservative first order models.Indeed, if mean service times and mean inter-arrival times are given, the exponential distribution is known to maximize the entropy over all distributions with support R + := [0, ∞) and these means,1 see, for example, Park and Bera (2009), Lisman and van Zuylen (1972).Furthermore, for given arrival and service rates at the stations, a product-form solution maximizes the entropy of the stationary queue length distribution (Ferdinand 1970;Walstra 1985).Therefore, product-form solutions are conservative and robust models with respect to model insecurity in the service time and inter-arrival time distributions.Hence, working with a model with exponentially distributed service times and inter-arrival times that does have a product-form solution for the stationary queue length distribution, provides a robust model for performance analysis.

Robustness analysis
As pointed out in the introduction, breakdown rates are hard to estimate via historical data.Therefore, one is typically confronted with uncertainty about the true value of the parameters defining the distributions of the time between breakdowns, in our case the failure rate.This is known as parameter uncertainty in the literature, see, for example, Haverkort and Meeuwissen (1992), for a discussion on integration of parameter uncertainty into queueing models and Henderson (2003) for a discussion on parameter insecurity from a broader perspective.In the following, the focus is on robustness analysis of our queuing model with respect to uncertainty about the breakdown rates.
In modeling parameter uncertainty, the choice of the distribution is of importance and one typically chooses a particular distribution based on (possible incomplete) knowledge that is available.For example, if the mean and the variance are known, and if, in addition, we know that the parameter may take values in R, the most general distribution is the normal distribution, where "most conservative" refers to the fact that this distribution maximizes the entropy.On the other hand, when, due to expert knowledge, it is known that the parameter falls into an interval, say, [a, b], then the uniform distribution on [a, b] is the entropy maximizing distribution; see, for example, Kullback (1959).Alternatively, there may be statistical knowledge available on θ based on measurements.Then, the distribution of the statistic used for estimating θ is a natural candidate for the distribution of θ .
Formally, we assume that the breakdown rate τ is a random variable defined on some underlying probability field, and that the probability density function for τ , denoted by f τ , is known.We let h(τ ) denote some reward function.Think, for example, of h as the stationary throughput at node i. Provided that h is invertible and the inverse is differentiable with respect to the throughput yields the density of the stationary throughput.Based on the distributional assumptions or statistical information comprised in f τ , one may, as is common practice in applied probability, take the expected value of τ , denoted by μ τ = y f τ (y)dy, as a noise-free approximation of τ and subsequently h(μ τ ) as output for the throughput.Since h(μ τ ) is typically not close to E[h(τ )], simply using μ τ instead of τ falls short of bringing the risk incurred by the insecurity on τ to light.To analyze the impact τ has on h(τ ), we consider the value at risk of h(τ ), denoted in short by VaR(α), where VaR(α) = q if and only if where G(•) denotes the cumulative distribution function of h(τ ), which is, for ease of presentation, assumed to be continuous and invertible.The potential misspecification at an α probability level is thus h(μ τ ) − VaR(α).Note that for the throughput we want to hedge against the risk of low values, so we use the α-quantile, whereas for cost functions one would measure the risk through the (1 − α)-quantile.
In the following, we make the reasonable assumption that the "true" breakdown rate at node i, that is, τ i , is not revealed to us and we therefore assume that τ i follows a given distribution F i .Instances of the throughput can be easily obtained by sampling the τ i 's according to their assumed distribution and evaluating the realization of the stationary throughput.Creating a sufficient number of samples, the density and the cumulative distribution function of the throughput can be estimated and evaluated for further robustness analysis.We would like to point out that a similar robustness analysis can be performed for other performance measures of the queuing network as given (Sommer et al. 2017).
We illustrate the application of the above results to robustness analysis with the help of the following examples.

A tandem system
Consider the network with J = {1, 2, 3} nodes given in Fig. 1.The network is a twoway tandem of three nodes.The infinite supply is depicted by a dashed arrow pointing to server 2, and the node that is prone to failure is depicted as a grey circle.Note that by incorporating node 0, the linear topology is transformed into a ring.To summarize, V = {2}, W = {1, 3}, and D = V , i.e., the infinite supply node is prone to failure.Routing is given by and For ease of analysis, we parameterize the model and set λ 1 = (1 − a)t, for t > 0, and λ 3 = at, and b = 1 − c.Regarding the service rates it holds that In this model, we assume for a breakdown scenario that the rate with which a breakdown of server i = 2 occurs is given by τ 2 , and we denote the corresponding repair rate by ρ 2 .Then, and the throughput at node 3 under stalling as a function of a certain τ 2 is given by and by computation Let h denote the stationary throughput at node i = 3 and model τ 2 as being random.In the following, two distributions for τ 2 are elaborated, the uniform and exponential distribution, respectively: -Assume that τ 2 is uniformly distributed on [a, b], with 0 < a < b < ∞.Then, and zero otherwise.It holds that μ τ 2 = (a + b)/2.The value at risk, i.e., the α-quantile of the stationary throughput, is also easily computable to be In words, for α • 100% of the possible breakdown rates the actual throughput of the system will fall below VaR(α).Observe, that for α = 1 we have VaR(α) = η 3 ρ 2 /(ρ 2 + a), which is the right bound of the support of τ 2 , and for α = 0 we have Var(α) = η 3 ρ 2 /(ρ 2 + b), which is the left bound of the support of h(τ 2 ).Suppose b = k • a with k > 1, then τ 2 is uniformly distributed on [a, ka], and τ 2 becomes rather uncertain for large values of k.Then the above analysis uncovers the exposed risk by expecting the throughput to be of order h(μ τ 2 ) without taking the stochasticity into account.In particular, with chance α the realized throughput h(τ 2 ) is at least smaller than h(μ τ 2 ).For example, let α = 0.1, then with probability 0.1 the actual throughput h(τ 2 ) is at least approximately 44.4% smaller than h(μ τ 2 ) for relatively large k.
-Let τ 2 be exponentially-λ-distributed so that μ τ 2 = 1/λ.Then the density for the throughput via (7) equals for y ∈ (0, η 3 ].It can be shown that the cumulative distribution function of g(y) is given by which leads to for α ∈ (0, 1).Note that the "naive" throughput is given by and the difference between h(1/λ) and VaR(α) expresses the model risk at probability α.
Remark 3 Consider the uniform model for τ 2 and consider the reasonable case that the breakdown rate is small, i.e., assume that τ 2 is close to zero.More specifically, let τ 2 ∼ U (0, 2 • ), and assume that ρ 2 = c • , for c > 1.It then holds for the relative error that ) for α ≥ 1/2.This reasoning allows for a quick assessment of the impact of the postulated model on the breakdown rate.

A star-like system
Consider a network with J = {1, 2, . . ., 6}, V = {2, 3, 4}, W = {1, 5, 6}, and D = V , i.e., all infinite supply nodes are prone to failure, depicted by grey circles in Fig. 2. Jobs arrive from outside with rate λ at the central node 1.From node 1 they go with probability r /5 to any of the nodes 2 to 6, for r ∈ (0, 1).After finishing service at node i = 2, . . ., 6, jobs are sent back to the central node 1. Being served there, they either leave the system with probability 1 − r , or are sent back to one of the servers Fig. 2 The star-like network as analyzed in Sect.3.2 in the set {2, . . ., 6} according to the routing scheme described above.Let h denote the stationary throughput at node i = 2 and model τ 2 as being random.Then, the throughput at node 2 as a function of certain τ 2 under skipping is with π(∅) given as and we may write for a constant τ 2 Hence, In the following we show that for uniformly and exponentially distributed τ 2 , closed form expressions for the value at risk can be obtained: and zero otherwise.The value at risk, i.e., the α quantile of the stationary throughput, is also easily computable to be -For τ 2 exponentially distributed with parameter λ it holds that for y ∈ (0, a 3 a 1 ), so that and thus for α ∈ (0, 1).
As we have shown in this section, for uniform and exponential distributions, VaR(α) can be explicitly solved, which is due to the simplicity of both distributions.In the following we study the more challenging problem when the distribution of τ 2 is of general form.

The general approach
In this section we provide a general approach to approximately computing VaR(α).Revisit the two-way network from example in Sect.3.1.Let τ 2 be normally distributed with mean μ and standard deviation σ but conditioned on interval [γ l , γ r ] where 0 ≤ γ l < γ r .The rationale behind the conditioning is that negative values as well as non-realistically large values for τ 2 are avoided.Then, following (7), the throughput at node 3 under stalling has density g for given by where and Φ(•) is the standard normal cumulative distribution function.To obtain the VaR we need to find the inverse of the cumulative distribution of the throughput given by Computing the inverse of a general function can usually only be performed numerically.However, in case the function of interest is analytical and can thus be written as a power series, a power series representation of the inverse can be obtained.This result is well-known in analysis, see, e.g., Dettman (2012).However, computing the actual elements of the power series is a challenging task.A first result can be found in Whittaker's pioneering paper (Whittaker 1951).In particular, Whittaker provided an explicit expression for the elements of the power series of the inverse in terms of the elements of the power series of the original function.Unfortunately, the computation of the elements is rather demanding.An alternative approach, that suffers from the same computational burden is Lagrange's inversion formula (Abramowitz and Stegun 1992).Dominici (2003) introduced a method for numerical inversion that is very well suited to computing the VaR of a transformation of an exponentially distributed random variable.In the following we will present this approach.
For an infinitely often differentiable mapping f define the nested derivative with f (a) = 0, ∞.Then according to Theorem 4.1 in Dominici ( 2003) the inverse of h(x) is given by where |y| < for some > 0. The elements of the series can be easily evaluated by means of standard computer algebra tools.We refer to Dominici (2003) for details.
Example 1 Consider the exponential mapping e x and apply the method of nested derivatives.Note that h(x) := e x − 1 = x 0 e t dt.
Let f (x) = e −x , then Since f is analytical we obtain the inverse of h(x) as which is easily recognizable as the series expansion of ln(y + 1) around 0.
As illustrated in the above example, the method of nested derivatives allows for a direct analysis of the function under the integral.This is particularly useful in VaR computations as the analysis can directly be applied to the density and computation of the cumulative distribution function can thus be avoided.
In the following we present the main result on nested derivatives, where we write ḡ(t) = 1/g(t).Proof For x ≥ a, write and let dt.
We now apply the nested derivatives method to ḡ(t).From ( 16) [see also Dominici (2003)] it then follows that Note that the advantage of Theorem 1 lies in the fact that the elements of the series expansion have to be computed once for a, yielding a polynomial approximation of VaR on an entire interval.For ease of reference define for N ∈ N Example 2 Reconsider the tandem network from Sect.3.1.Let τ 2 be normally distributed with mean μ and standard deviation σ but truncated on interval [γ l , γ r ] where 0 ≤ γ l < γ r .See ( 14) for the density g(y) of the throughput.In order to approximate the VaR by Theorem 1, where a is chosen to be η 3 ρ 2 ρ 2 +γ r so that c a = 0 (note that this is allowed since g( η 3 ρ 2 ρ 2 +γ r ) = 0), we will compute the series using the computer algebra algorithm provided in Dominici (2003).Using notation ḡ(y) = 1/g(y) it holds that For the VaR approximation of order 2 we have to add the term In general, for the n-th term it holds where P(n) is a homogeneous polynomial of degree 2(n − 1) in variables γ r , σ , ρ 2 and μ.In particular for P(n) with n = 1, 2, 3, 4 it holds Figure 3 provides a numerical example which illustrates that the VaR series with a few terms already yields an accurate approximation for VaR(α) for α between 0 and 0.1.For the tandem network parameters we take a = 3/5, b = 1/2, c = 1/2, t = 15/8, μ 1 = μ 3 = 2, so that λ 1 = 9/8, λ 3 = 3/4 and μ 2 = 3/2.From the traffic equations it follows that η 3 = 1.5 for this this parameter setting.Regarding the parameter uncertainty parameters, we let ρ 2 = 1.1, μ = 1, σ = 1.5 and [γ l , γ r ] = [0.1,2.75] so that μ τ 2 = E[τ 2 ] = 1.3259 and VaR(τ 2 ) = 0.52134.Furthermore, the example illustrates that significant risk is ignored when taking h(μ τ 2 ) as measure for the throughput.Specifically, h(μ τ 2 ) ≈ 0.68 whereas with probability 0.2 the actual throughput is approximately smaller than 0.52 (a difference of at least 23.5%) and with probability 0.1 the actual throughput is approximately smaller than 0.48 (a difference of at least 29.4%).
In case one is interested in VaR(α) for α around 0.3, Fig. 3 shows that poor approximations are obtained via the series, even when using 10 terms for the series.The approximation for VaR(α) with α around 0.3 can be improved by choosing a in condition (iii) of Theorem 1 greater than b l = η 3 ρ 2 ρ 2 +γ r such that c a lies near 0.3.The downside is that this approach requires numerical evaluation of c a and the search for an appropri- ate a.But after this numerical burden, the series for VaR(α) from Theorem 1 provides an accurate and efficient approximation for VaR(α) with α in a relative large interval around c a .As example, Fig. 4 shows for the same instance as in Fig. 3 that choosing a such that c a ≈ 0.3 leads to accurate approximations for VaR(α) with α ∈ (0.15, 0.45) even for a small number of series terms.

Conclusion
In this paper we have argued the importance of robustness analysis in case of parameter uncertainty in queuing models.For generalized Jackson networks we have provided a framework for evaluating numerically the value at risk incurred by parameter uncertainty.Future research includes uncertainty analysis of multiple parameters and further development of our risk analysis framework.In additional topic of further research is to provide a numerically efficient bound for the remainder of the series approximation of the value at risk.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Markov on state space P(D), where P(D) denotes the power set of D. Y (t) = I , for ∅ ⊆ I ⊆ D, indicates that (exactly) the nodes in I are broken down.The transition rates of Y out of I ⊆ D are given as 1. if I ⊂ H ⊆ D, the nodes in H \I break down with rate α(I , H ) ≥ 0, 2. if ∅ ⊆ K ⊂ I , the nodes in I \K are repaired with rate β(I , K ) ≥ 0. Rates α(•, •) and β(•, •) are constructed from any pair of functions A, B : P

Theorem 1
Let G(y) be a cumulative distribution function onB = [b l , b r ],where B = R is not excluded.Suppose that there exits g(t), for t ∈ B, such that (i) for y ∈ B it holds that GG is analytical on the interior of B as a mapping in y, (iii) there is an a ∈ B such that g(a) = 0. dt = G(a), thenVaR(α) = a + ḡ(a) n≥1 D n−1 [ ḡ](a) (α − c a ) n n! ,for α sufficiently close to c a .