# Effective Risk Assessment in Resilient Communication Networks


## Abstract

The paper discusses business impact analysis in the context of resilient communication networks. It is based on the total (aggregated) penalty that may be paid by an operator when the services (identified with transport demands) provided are interrupted due to network failures. The level of penalty is expressed as a commonly accepted business risk measure, Value-at-Risk (\(VaR\)). First, the main concern over \(VaR\), namely its theoretical lack of subadditivity, is discussed. The study shows that, in practice, these disadvantages do not appear in resilient network design, and \(VaR\) can be used without the need to apply more complex and less informative measures. Second, a method for calculating the upper bound of the total penalty is presented. The assessment is performed for unprotected and protected services with a broad variety of compensation policies used to translate technical loss into a monetarily expressed penalty. The proposed bounds are experimentally shown to be effective in comparison with alternative calculation methods, and also in the case when some of the assumptions taken during the modelling stage are not met.

## Keywords

Availability · Compensation policy · Continuity · Design · Downtime · Management · Reliability · Value-at-Risk

## 1 Introduction

Random failures, such as link cuts or hardware faults, are destructive to networks, both from a technical and business viewpoint. Penalties can be imposed on operators due to breach of Service Level Agreements (SLAs). To counteract these problems, risk management is defined as “coordinated activities to direct and control an organization in response to risk” [1].

With the recent ‘beyond connectivity’ trend [2]—where the main concern is to focus on the entire communication service—interest in network risk management has increased. In order to deal with failures from the risk management viewpoint, it is necessary to address parameters such as the *probability* of adverse events and the *loss* (*impact*) incurred by them. The latter is here related to penalties paid to clients affected by failures.

If risk is dealt with using the methodological approach, it is exercised within the cyclic risk management framework [3]. The simplified structure of this kind consists of the following steps: (1) risk assessment, (2) planning the risk response, (3) response deployment, and (4) risk monitoring. *Risk assessment* consists of: (a) risk analysis, identifying failure scenarios, and (b) risk evaluation determining their probability and impact on business goals. Here we use probabilistic risk assessment, during which both parameters are expressed mathematically [4].

Although in this paper we focus on risk evaluation alone, we also briefly outline the other phases of the cycle. Designers of resilient networks are typically most familiar with the *risk response* stage, since the task of the technician is to prepare response strategies. The manager of the service provider then decides which one to choose. In resilient networks, the basic approach is risk mitigation, which involves decreasing the impact. In practice, mitigation uses a combination of resilience procedures [5]. In this case, *response deployment* embraces the configuration of resources and testing. The next step, *risk monitoring*, includes: (a) continuous risk monitoring, where risks are observed in order to identify new ones, and (b) response monitoring, performed to check if the implemented response meets the intended goals.

A current practice of dealing with failures in the design and management of resilient networks is to measure failure risk with purely technical methods. These methods apply steady-state availability or mean downtime as risk measures. However, they are not relevant in a business context. Firstly, it is more important to express the consequences of failures in monetary terms. Secondly, businesses may be more interested in the variability of loss than in its mean value, which does not usually capture changes in network behaviour. Here, we build on our previous works on this shift of interest in resilient network design [5, 6, 7]. There, we discuss using business-related risk measures, such as Value-at-Risk (\(VaR\)). This measure is already being used in communication network failure descriptions [8] and security management [9]. Nevertheless, recurrent reservations stem from the fact that in the financial sector \(VaR\) is known to be misleading due to its lack of the property known as *subadditivity*; consequently, investigating more complex measures is often postulated. The aim of this paper is first to show that these disadvantages do not appear if \(VaR\) is used to assess risk related to random network failures. We show that only extremely atypical network characteristics, not met in practice, can disrupt the usefulness of \(VaR\). Further, we provide a computationally effective method for estimating the level of penalties with this metric.

From the viewpoint of pay-offs, our results ensure: (a) effective quantification of the total penalties imposed on a network operator due to failures, and thus (b) the opening of a broad range of possible risk response methods based on \(VaR\) and elaborated in the financial field. The former enables an operator to save money in comparison with a simple addition of risk measures calculated for single services. Such an approach takes advantage of the diversification typical for investing [10]. The proposed estimation method is based on offline calculations, making them easier, more robust, and less resource-consuming than simulations. The model calculates the penalties for both unprotected and protected connections. Our approach deals with a broad range of mappings between technical loss and business impact.

The remainder of the paper is organized as follows. First, we discuss related work in Sect. 2. In Sect. 3, we focus on various methods of expressing the monetary impact of failures (compensation policies) and methods for meaningful quantification of the predicted financial losses (risk measures). The section also discusses the known reservations concerning the main financial risk measure. In Sect. 4, we elaborate on an effective method for quantifying the upper bounds on this measure in the context of transport network services. This is the main contribution of our paper, and we show how to find the bounds for various compensation policies effectively. Then, we present numerical results confirming validity of our statements and models. In Sect. 5.1, we show that from a practical viewpoint the mentioned concerns on lack of subadditivity are not relevant in resilient network design, and thus it is possible to use \(VaR\) with all its advantages. In Sect. 5.2, we present the results of a very broad numerical study proving that the provided models for bounds of \(VaR\) are indeed exact. The final section concludes our work and shows avenues for future research.

## 2 Related Work

Franke [11] notes that the discussion of the relationship between the technical aspects and the business context of network management is poorly developed. Our paper aims to change this and to boost work towards efficiently interfacing the technology and business worlds. While the methods and protocols for network resilience, described for instance in [12, 13], are not a new topic, many problems remain in a business-oriented approach to resilient network design and management. Historically, in the communications sector, risk has been dealt with, for example, in: (a) the selection of new investments [14], (b) security against faults generated by malicious behaviour [15, 16, 17, 18], or (c) the quantification of deviations from desired quality levels [19]. Risk assessment is the most frequently researched concern in these contexts. While Value-at-Risk has been postulated for risk quantification in network security [20, 21] and resilience [8, 9], there are no efficient theoretical models to predict \(VaR\) when it is applied to network resilience. At the modelling level, we use elements of a methodology similar to more recent works on network reliability, such as [22, 23, 24, 25], yet we also deal with the modelling of whole distributions, and we add penalty [26], SLA [27, 28], and compensation policy [8] concerns.

Our numerical study is based on the distributions of failure and recovery times reported in the literature. The relevant bibliography is reviewed and commented on in the context of risk engineering in [29]. Extensive studies of failure and recovery times in operational networks have been performed and reported for the Sprint network [30], the Finnish research network [31], and the Norwegian university network [32]. Generally, a typical approach in numerical studies is to assume that failures arise due to a homogeneous Poisson process, which means that times between failures are exponentially distributed. This classical approach seems to be statistically valid for many cases in communications [32, 33]. Other distributions for failure times, or their approximations by times between consecutive failures, are also occasionally reported (e.g., the Weibull distribution [30]), although they cannot be responsible for generating heavy tails in the loss distributions. Modelling of downtimes (repair or recovery times) is more controversial. While the simplest approach also uses exponential times, recovery times in real networks appear to be log-normal [34] or Pareto-like, but always with a finite mean [31, 33].

This contribution can be treated as an extension of two previously published papers [6, 7]. The new contribution can be summarized as follows. (1) The first paper [6] elaborates on the lack of subadditivity of \(VaR\) and shows that this risk measure can be successfully used in communications, since the lack of subadditivity is not a concern in practice. Here, we extend the set of numerical studies confirming this statement, especially by broadening the set of investigated distributions and taking node failures into account. Therefore, while the final statement is the same, we now increase the confidence in its validity. (2) As the main contribution of this paper, we conceive an extension of [7], where a model for finding the upper bound on risk measures related to random failures in communication networks is presented. Our extension is considerable: (a) we have extended the set of compensation policies to ones that are more realistic than those shown in [7]; (b) while [7] uses a simple model based on results elaborated in a seminal work [35], here we present a more general model based on results derived in [36]; (c) the presented numerical studies are much broader than those shown before.

## 3 Business-Related Risk Assessment

An SLA defines the desired values of parameters related to the services provided. These parameters include non-functional properties, such as reliability in the presence of network failures, the maximum acceptable downtime, or interval availability for a period of time [27, 37]. Penalties for not meeting these requirements may also be agreed and form the basis for calculating monetary impact to quantify business risk [26].

### 3.1 Compensation Policies

The translation of technical outages into monetary penalties is defined by a *compensation policy* [38]. Basically, if an outage appears and lasts for a period of time \(\tau\), we can model *p*, the penalty (outage cost) for this single outage, as a general function of \(\tau\): \(p = f(\tau )\). To find *p*, we follow the basic compensation options considered in [11, 38]. All of them are represented by convex functions. They are illustrated in Fig. 1.

The first reference policy assumes that any outage breaks service *continuity*. This concerns services such as a very short communication connection or sensitive data transfer for real-time traffic control. Such services are rendered useless, no matter how brief the outage is or how fast the resilience procedure works. We call such a policy Cont. For it, we may assume a fixed penalty independent of the outage time \(\tau\):

The second reference policy is based on *interval unavailability*, the fraction of time when the service is not operating [40]. As this measure is the probabilistic complement of the better-known availability, we call this policy Avail. This type of policy is most suitable for long-lasting services carrying elastic traffic, such as data transmission, web browsing, or e-mail, and is typically agreed with individual customers. This reference policy, based on the downtime \(\tau\), can then be expressed as:

To make things more complex, but close to reality, it can be assumed that the penalty is based on the number of outages exceeding a selected downtime *threshold* [41]. A practical example of this policy can be seen in SLAs related to Amazon S3, an online storage service. Such a policy is valid when the agreement predicts the value of the Recovery Time Objective (RTO) beyond which the client can be sure that the failure will inflict some harm. This is a typical approach in business continuity planning [42].
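The policy shapes discussed above can be sketched as simple penalty functions of the outage time \(\tau\); a minimal illustration, where the weight `w` and the threshold `t_thr` are hypothetical values, not parameters taken from the paper:

```python
# Sketch of the compensation policies: Cont, Avail, and the threshold-based
# policy. The weight w and threshold t_thr are hypothetical illustration
# values, not parameters from the paper.
def penalty_cont(tau, w=100.0):
    """Cont: fixed penalty for any outage, regardless of its duration tau."""
    return w if tau > 0 else 0.0

def penalty_avail(tau, w=1.0):
    """Avail: penalty proportional to the downtime tau."""
    return w * tau

def penalty_threshold(tau, w=100.0, t_thr=4.0):
    """Threshold: penalty charged only when tau exceeds the RTO t_thr."""
    return w if tau > t_thr else 0.0

outages = [0.5, 2.0, 6.0]                                # outage durations (h)
total_avail = sum(penalty_avail(t) for t in outages)     # 8.5
total_thr = sum(penalty_threshold(t) for t in outages)   # 100.0
```

Under the threshold policy, only the single outage exceeding the RTO contributes to the total; under Avail, every minute of downtime counts.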

### 3.2 Probabilistic Risk Measures

To quantify risk, *probabilistic risk measures* are used. From the risk evaluation viewpoint, in the best case scenario the full probability distribution function (PDF) of the impact expressed in monetary units can be found. For a given PDF of the penalty, point estimates are applied [44]. The popular mean value of the penalty distribution [45, 46] has been found to be insufficient, since it cannot capture the extreme values that are characteristic of the risk context. Instead, the main measure used by finance departments to quantify the risk related to investments is Value-at-Risk, \(VaR\). The definition is as follows [47, 48]: \(VaR _{\eta }\) is a quantile measure, and it provides, for a selected level of probability \(\eta\), the value of penalty that can appear. Let \(\xi\) be the level of penalties to be paid in an interval. If \(P_{\xi }(x) = \Pr \{\xi \le x\}\) is the cumulative distribution function of \(\xi\), the Value-at-Risk is defined as the maximum penalty with a given confidence level \(\eta\):
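In sample terms, \(VaR_{\eta}\) is simply the \(\eta\)-quantile of the penalty distribution; a minimal empirical sketch (the sample distribution and confidence level are illustrative):

```python
import random

# Empirical VaR_eta: the eta-quantile of a sample of per-interval penalties
# (sample distribution and confidence level are illustrative).
def empirical_var(penalties, eta=0.95):
    s = sorted(penalties)
    return s[min(len(s) - 1, int(eta * len(s)))]

random.seed(1)
sample = [random.expovariate(1.0) for _ in range(10000)]
v95 = empirical_var(sample, 0.95)     # close to ln(20) ~ 3.0 for Exp(1)
```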

### 3.3 Concerns about Subadditivity of *VaR*

In finance, a set of axioms defines *coherent* (desirable) risk measures [53]. One property is especially problematic when dealing with \(VaR\): the property known as *subadditivity*. It can be defined as follows:

Subadditivity has the following positive consequences. (1) Quantification of risk measures is easier during the risk evaluation phase, where *risk aggregation* (calculation of the overall risk from individual risks) is conducted. In the case of subadditive measures, it is possible to easily assess the upper bound of the risk. (2) Portfolio diversification, which justifies the good practice of providing service differentiation, is advantageous in comparison to separate investments [54]. (3) Avoiding or mitigating the risks related to the greatest levels of impact is the best option for dealing with risk response [9]. (4) Subadditivity is a necessary condition for the convexity of the risk measure; then, efficient linear programming-based methods inspired by portfolio optimization approaches [44, 55] become feasible during network design.

A lack of subadditivity does not only mean that the above advantages are absent; there is also a very real danger in using such a risk measure. It is believed that one of the roots of the 2008 banking crisis was improper assessment of credit risks [56], i.e., assessment based on \(VaR\). The problem is that this method is very sensitive to heavy tails in the PDFs of the impact, a sensitivity that results from the lack of subadditivity. While it is common practice in the investment sector to base \(VaR\)-related calculations on normal distributions [57], this is not always justified. These arguments against \(VaR\) are repeated in risk studies. Hence, we decided to verify whether the lack of subadditivity is a real danger in resilient networks. This is done by numerical simulations in Sect. 5.1, and the results are very optimistic. Therefore, we assume that this risk measure can be used without such dangers.
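A back-of-envelope check of this heavy-tail sensitivity can be made with the analytic Pareto quantile and the standard regular-variation approximation for the tail of a sum: superadditivity of \(VaR\) emerges only when the tail index drops below one (no mean), which is consistent with the claim that only extremely atypical parameters break subadditivity. This is a sketch, not the paper's simulation procedure:

```python
# Analytic eta-quantile of a Pareto(alpha, m) variate, plus the standard
# regular-variation approximation P(X1+...+Xk > s) ~ k * P(X > s) for the
# tail of a sum of k iid heavy-tailed variates.
def pareto_var(eta, alpha, m=1.0):
    return m * (1.0 - eta) ** (-1.0 / alpha)

def sum_var_approx(eta, alpha, m=1.0, k=2):
    return m * ((1.0 - eta) / k) ** (-1.0 / alpha)

eta = 0.99
# alpha < 1 (no mean): VaR of the sum exceeds the sum of VaRs (superadditive)
heavy = sum_var_approx(eta, 0.8) > 2 * pareto_var(eta, 0.8)
# alpha = 2.5 (finite variance): the naive sum remains a valid upper bound
light = sum_var_approx(eta, 2.5) > 2 * pareto_var(eta, 2.5)
```

The comparison reduces to \(2^{1/\alpha}\) versus \(2\): superadditivity appears exactly when \(\alpha < 1\).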

The next part of the paper, then, focuses on the main contribution, which is an efficient modelling of \(VaR\) in resilient networks.

## 4 Probabilistic Assessment of Aggregated Risk (Total Penalty)

Here, we assess the value of a risk measure for the total penalty paid by the network operator during a given time interval. One could think of a simple method: calculate risk measure values for single services and then add the results to obtain the aggregated risk measure related to the whole portfolio of services. This is not the best method, since it may provide an over-optimistic bound in the case of \(VaR\), which is not subadditive. Additionally, even if subadditivity is observed, as shown in Sect. 5.1, the bound obtained in this manner may be too pessimistic. This phenomenon is also noted later in our numerical results presented in Sect. 5.2. Therefore, we need an effective method of providing an upper bound for the aggregated risk.

The network is modelled as a graph \((V, E)\), where *V* is a set of nodes and *E* is a set of links connecting the nodes. \(V \cup E\) is the set of network components. All of them are unreliable, which means they may fail and be repaired. Hence, we associate two probability distribution functions with each unreliable element: (a) the first describes time between failures and (b) the second concerns downtimes. While we use various types of time distributions (exponential, Weibull, Pareto, and log-normal), at some stage of the modelling we need to determine the failure (\(\lambda _c\)) and repair (\(\mu _c\)) rates for each unreliable network component \(c \in (V \cup E)\). All the failure and repair processes are assumed to be independent of each other. Each service *d* is modelled at the physical level as a transfer service between two different nodes using a connection made of an *n*-tuple of links and nodes. The algorithm follows the steps given below:

- 1.
Before we start preparing an exact model of risk, we need to assign compensation policies to determine penalties for each service. We also assume that each service is given a pre-determined primary path, and a backup path if the dedicated protection case is modelled. That is, we do not deal with routing, which is treated as an input to our algorithm.

- 2.
Then, the continuous-time Markov chain (CTMC) for each service is constructed. Means and variances of compensation policy-related penalty values for all the services are found using these Markov chains. This makes it possible to find the *mean* and *variance* of the total aggregated penalty (\(p_{\text {Total}}\)) over the interval.

- 3.
Finally, the whole distribution of the aggregated penalty parameterized by these two values is found. We use one of the elliptical distributions. We found that, typically, the log-normal distribution gives the best fit. When the whole risk distribution is parameterized, it becomes possible to find its quantiles, including \(VaR\).
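The three steps can be sketched end to end. The per-service moments below are hypothetical, the \((\sum \sigma)^2\) term stands in for the worst-case covariance treatment described later, and the log-normal is parameterized by moment matching:

```python
import math
from statistics import NormalDist

# Step 3 sketch: match a log-normal to the mean and variance of the total
# penalty, then read off the VaR quantile. Per-service moments below are
# hypothetical; (sum of std devs)^2 is the worst-case variance bound.
def lognormal_quantile(mean, var, eta):
    sigma2 = math.log(1.0 + var / mean ** 2)     # moment matching
    mu = math.log(mean) - 0.5 * sigma2
    return math.exp(mu + math.sqrt(sigma2) * NormalDist().inv_cdf(eta))

services = [(10.0, 4.0), (5.0, 9.0), (8.0, 1.0)]  # (mean, variance) per service
total_mean = sum(m for m, _ in services)
total_var_ub = sum(math.sqrt(v) for _, v in services) ** 2  # perfectly correlated
var_995 = lognormal_quantile(total_mean, total_var_ub, 0.995)
```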

### 4.1 General Case

To quantify a quantile risk measure (such as \(VaR\)), we need to estimate a full probability distribution function for the penalty over a given time interval (typically *per annum*). We need to evaluate this value on the basis of the penalties calculated for separate services instead of modelling the level of penalties for the whole network, which would be a very complex task. The individual penalties related to various services are correlated because a failure of one component can affect many services. To estimate the level of the total penalty \(p_{\text {Total}}\), we want to use the worst case approximation (the upper bound) for finding covariances between penalties calculated for various services.
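The worst-case treatment rests on the Cauchy–Schwarz inequality: \(\mathrm{Cov}(P_i,P_j)\le \sigma_i\sigma_j\), hence \(\mathrm{Var}(\sum_i P_i)\le (\sum_i \sigma_i)^2\). A small simulation with two hypothetical services sharing one unreliable link illustrates both the induced correlation and the bound:

```python
import random

# Two hypothetical services sharing one unreliable link: an outage of the
# shared component hits both services, so their penalties are positively
# correlated, and independent aggregation would underestimate the variance.
random.seed(3)
n = 50000
shared = [random.expovariate(1.0) for _ in range(n)]   # downtime on the shared link
own_a = [random.expovariate(2.0) for _ in range(n)]    # downtime on service a's own links
own_b = [random.expovariate(2.0) for _ in range(n)]
pa = [s + a for s, a in zip(shared, own_a)]            # Avail-like penalty, service a
pb = [s + b for s, b in zip(shared, own_b)]            # Avail-like penalty, service b

def variance(x):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

tot = [a + b for a, b in zip(pa, pb)]
# Cauchy-Schwarz worst case: Var(Pa + Pb) <= (sd(Pa) + sd(Pb))^2,
# which holds for any correlation structure (equality when rho = 1)
bound = (variance(pa) ** 0.5 + variance(pb) ** 0.5) ** 2
holds = variance(tot) <= bound
```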

Consider a single service *d*. The total penalty, used to measure the aggregated risk, is calculated for an interval as:

Let \(N_d(t)\) be the number of outages of service *d* during the observation interval *t*, and let \(p_d\) denote the single penalty (for a single outage) related to this service. The modelling of various types of penalties related to different compensation policies, and methods of finding penalty values, are discussed in Sect. 3.1. The penalty related to each service can be found as a random sum of individual penalties related to this service (i.e., for various outages). If we assume that the means and variances of \(N_d(t)\) and \(p_d\) are known, we can use basic probabilistic rules to find the average value of the total penalty over an interval for a service as:
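The basic probabilistic rules in question are the random-sum (Wald-type) identities \(E[P]=E[N]\,E[p]\) and \(\mathrm{Var}[P]=E[N]\,\mathrm{Var}[p]+\mathrm{Var}[N]\,E[p]^2\); a Monte Carlo check under illustrative parameters:

```python
import random

# Random-sum identities: with N outages in the interval and iid per-outage
# penalties p independent of N,
#   E[P] = E[N] E[p],   Var[P] = E[N] Var[p] + Var[N] E[p]^2.
# Monte Carlo check: Poisson(lam) outages, exponential penalties of mean 2.
random.seed(11)
lam, mean_p = 3.0, 2.0
n_runs = 100000
totals = []
for _ in range(n_runs):
    count, t = 0, random.expovariate(lam)   # Poisson count via arrival times
    while t < 1.0:
        count += 1
        t += random.expovariate(lam)
    totals.append(sum(random.expovariate(1.0 / mean_p) for _ in range(count)))

emp_mean = sum(totals) / n_runs
var_p = mean_p ** 2                          # exponential: Var[p] = E[p]^2
ana_mean = lam * mean_p                      # E[N] E[p] = 6.0
ana_var = lam * var_p + lam * mean_p ** 2    # Poisson: E[N] = Var[N] = lam
```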

Finally, these values are used to find the parameters of an elliptical distribution being the distribution of the total sum of penalties, or a distribution close to it; for an explanation, see [35]. In our case, this is the log-normal distribution. On the basis of this distribution, we are finally able to find the quantile risk measures. However, the problem is finding the values for parameterization. As exact calculation is too complex in practice, we present an upper bound of the distribution that allows these values to be obtained effectively.

### 4.2 Unprotected Case

We can treat a macrocomponent as just an ensemble of independent ON-OFF components. As such, it can be modelled as a single ON-OFF system itself. However, the general modelling of such an ensemble on the basis of the behaviour of a single element is not possible unless the uptimes and downtimes are assumed to be exponentially distributed. Hence, each component and the entire ensemble can be modelled as a CTMC. Then, it is possible to find the analytical bound for the risk measures. In practice, network failures arrive according to a Poisson process [31]. This is a common assumption taken while the mathematical modelling of failures is performed. In our numerical studies shown in Sect. 5.2, we challenge this assumption and show that our model also provides useful results when the failure process is not memoryless.

The joint process of *n* independent components is a CTMC on the state space \(\{0,1\}^n\) with an infinitesimal generator matrix:

As a macrocomponent is a series reliability structure, the ensemble CTMC has a single up-state: the state in which all the components operate. Therefore, the time the system spends in the up-state (*U*) is exponentially distributed with a rate equal to the sum of the failure rates of all the components. On the other hand, the distribution of downtimes is more difficult to compute. We use the embedded Markov chain and the Laplace transform to find a good approximation for the mean and variance of the time the system spends in the down-state (*D*). Let us begin with a fully operational macrocomponent (all up-states). Next, a component fails after an exponentially distributed time. In the next Markov chain jump, the failure may be repaired or another failure may occur. The time elapsed before the next event is again exponentially distributed, with the rate parameter dependent on the current state. Applying the total probability formula to the number of failures, the distribution of *D* can be expressed as an infinite sum of convolutions. In the Laplace transform domain, convolutions become multiplications, and the first and second raw moments of the distribution can be derived from the derivatives of the transform [58]. Finally, since the probability of simultaneous multiple failures is extremely low, the infinite series can be approximated by its first few terms; in practice, the first two to three terms are sufficient. This truncation simply omits the possibility of triple, quadruple, or more simultaneous failures.
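A first-order numeric sketch of the series macrocomponent (rates are hypothetical): the mean up-time follows from the sum of failure rates, and the mean down-time can be recovered from the product-form steady-state availability, ignoring the higher-order terms of the embedded-chain derivation:

```python
# Series macrocomponent of independent ON-OFF components (hypothetical
# per-hour rates). The system is up only when every component is up.
lams = [0.001, 0.002, 0.0005]        # component failure rates
mus = [0.5, 0.25, 1.0]               # component repair rates

mut = 1.0 / sum(lams)                # mean up-time: exponential, rate sum(lams)
avail = 1.0
for lam, mu in zip(lams, mus):
    avail *= mu / (lam + mu)         # steady-state availability, one component
mdt = mut * (1.0 - avail) / avail    # mean down-time from A = MUT/(MUT+MDT)
```

With these rates, the macrocomponent fails roughly every 286 hours and stays down for about 3 hours, close to the mean repair times of its components, as expected when simultaneous failures are rare.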

Below, we generalize the case of an unprotected connection outlined above to the case when the service can be supported by a more complex connection, especially one applying alternative connections for dedicated protection. To this end, we expand the concept of a macrocomponent and relate it to the states of the probability process. The generalization is necessary since the model given in Eq. (14) is prone to state space explosion. Despite the current computational power offered by efficient sparse matrix implementations (as available in MATLAB, for instance), it is necessary to reduce the complexity of the model by redefining the state space.

### 4.3 Generalized Markov-Based Modelling of Penalties

The generalized state space gathers all *k*-element combinations (\(k \le M\)) out of all *n* unreliable network components related to a given service (i.e., we deal with the components forming the working path for an unprotected connection, or a pair of paths in the protected case), where *M* is an arbitrarily selected number of probable simultaneous failures in the network. In practice, we can assume that \(M \le 3\) is sufficient, and the dimension of the state space is reduced from \(2^n\) to a number \(\sim n^M\). Note that for \(M=n\), both spaces (i.e., the one defined here as a generalization and the one related to Eq. (14) shown for the unprotected case in Sect. 4.2) are isomorphic. The new representation allows us to cut off configurations with extremely low probabilities by reordering the original states. Here, we rely on the simple fact that having three, four or more simultaneous failures in a network is extremely unlikely. The infinitesimal generator of the process is a sparse matrix containing the entries of the following form (see the example for \(n=3\) and \(M=2\) in Fig. 3):

Now, for each service \(d_i\) we define only a pair of macrocomponents \((U_i,D_i)\) constituting a partitioning of the whole set of states: \(S = U_i \cup D_i\), \(U_i \cap D_i = \varnothing\). In the states \(U_i \subseteq S\), the service \(d_i\) works, while in the others, summarized as \(D_i = S \backslash U_i\), this service is down. Note that the selection of \((U_i,D_i)\) is unique for each service \(d_i\). For instance, in the case of an unprotected connection service, \(U_i = \{S_0\}\), where \(S_0\) is the only state in which all the components \(c \in d_i\) work properly. However, we keep this derivation general, so it remains useful for the dedicated protection case. In the latter case, \(U_i\) is the set of all states in which all the components of the primary path or all the components of the backup path are operational.

The calculation of the downtime of a macrocomponent simply involves solving the *first passage time* problem in a CTMC, that is, finding the amount of time it takes for the Markov process to reach the absorbing state from the initial fully faultless state [36] (that is, the time it takes the system to jump out of \(U_i\) into \(D_i\)). Although we use a CTMC, meaning the times between the various state changes are exponential, the overall passage time needed to reach the macrocomponent \(D_i\) from \(U_i\) is not exponential. Therefore, if we wish to describe the state of the service on the state space \(\{0,1\}\) (0: the whole service is operational, 1: the service is faulty), we need to use a semi-Markov process with phase-type distributed sojourn times, i.e., the times the underlying CTMC spends in the groups of states \(U_i\) and \(D_i\).
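For a toy CTMC, the mean first passage time out of \(U_i\) can be obtained by solving a small linear system: restricted to the up-set, the sub-generator \(\mathbf{T}_U\) satisfies \(\mathbf{T}_U m = -\mathbf{1}\), where \(m\) is the vector of mean exit times (all rates below are hypothetical):

```python
from fractions import Fraction as F

# Mean first passage time out of a two-state up-macrocomponent U = {A, B}
# of a toy CTMC (all rates hypothetical). Restricted to U, the sub-generator
# T_U solves T_U m = -1 for the vector m of mean times to leave U.
# Here: from A, rate 1 to B and rate 2 to the down set (row sum -3);
# from B, rate 2 back to A and rate 2 to the down set (row sum -4).
T_U = [[F(-3), F(1)], [F(2), F(-4)]]
det = T_U[0][0] * T_U[1][1] - T_U[0][1] * T_U[1][0]   # = 12 - 2 = 10
b = [F(-1), F(-1)]
# solve the 2x2 system T_U m = b by Cramer's rule
m_A = (b[0] * T_U[1][1] - T_U[0][1] * b[1]) / det     # mean exit time from A
m_B = (T_U[0][0] * b[1] - b[0] * T_U[1][0]) / det     # mean exit time from B
```

Although every individual jump is exponential, the exit time itself is phase-type (a mixture over paths through \(U\)), which is exactly why the semi-Markov description is needed.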

The phase-type distribution of the \(U_i\) sojourn time is defined over *r* states. The values in the square matrix \(\mathbf {r}_{|U_i|\times |U_i|}\) are related to the transitions between all the states in \(U_i\). Bearing this in mind, the matrices \(\mathbf {T}_{|D_i|\times |D_i|}\) (gathering the transition rates inside the \(D_i\) macrocomponent) and \(\mathbf {q}_{|D_i|\times |U_i|}\) (gathering the transition rates between states of different macrocomponents) are found in a unique way.

Here, *z* is the complex variable of the Laplace transform, and \(\mathbf {I}\) is the identity matrix of the proper dimension. Additionally, \(\mathbf {P}_{\text {in}}\) is the distribution of the initial state of service \(d_i\): if \(U_i\ne \{S_0\}\), there are different starting points in \(U_i\) as well as in \(D_i\), distributed according to \(\mathbf {P}_{\text {in}}\). Since we are interested in the distributions of the states upon a state change, we approximate \(\mathbf {P}_{\text {in}}\) by the stationary distribution of the embedded Markov chain of \(\mathbf {Q}\) conditioned on being in the selected state subset \(U_i\). This way, we avoid solving differential equations to find the exact form of \(\mathbf {P}_{\text {in}}\).
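The moment extraction can be illustrated numerically: differentiating the Laplace transform at \(z=0\) recovers the raw moments. Here this is checked on a scalar exponential downtime, whose transform is known in closed form (a sketch, not the matrix-valued transform above):

```python
# Raw moments from the Laplace transform of a downtime distribution:
# m_i = (-1)^i d^i L/dz^i at z = 0. Checked numerically on an exponential
# downtime with rate mu, whose transform L(z) = mu / (mu + z) is closed-form.
mu = 2.0
L = lambda z: mu / (mu + z)
h = 1e-4
m1 = -(L(h) - L(-h)) / (2 * h)               # first raw moment -> 1/mu = 0.5
m2 = (L(h) - 2.0 * L(0.0) + L(-h)) / h ** 2  # second raw moment -> 2/mu^2 = 0.5
```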

Approximating the compensation function *f* with its Taylor series, the moments of the penalty can be expressed in terms of the raw moments of the distribution of the outage time \(\tau\). The *i*th raw moment of the \(\tau\) distribution, denoted as \(m_i\), can be found as follows [36]:

### 4.4 Dedicated Protection Case

In the dedicated protection case, the outages of service *d* do not form a Poisson process. Then, the moments have the form [35]:

where the *i*th moment of the up-time of the connection supporting service \(d_i\) appears as a parameter.

## 5 Numerical Studies

The numerical examples are constructed as follows. The network topologies used are retrieved from the SNDlib library (http://sndlib.zib.de) [60] and model two large networks: the compact and dense German Research Network (nobel-germany.xml), and the very broad yet sparse US Network (nobel-us.xml). For each node and link in a network, the alternating failure and recovery process is modelled. In the basic case, according to the most commonly assumed conditions, both distributions for link/node failure times and downtimes are exponential. Their rates were taken from [31]. We use the following function to find the failure/repair rates for links:

where *l* represents the link length expressed in kilometres (\(l_{\max }\) is the largest link length in the network), and \(\lambda_R\) and \(m\) are the basic distribution parameters retrieved from [31]. Each service has its own parameters necessary to find the exact value of the penalty. The scaling weight \(w_i\) is equal to the volume transferred by a service; this volume is provided with the network models. The time scale \(T_{\text {thr}}\) is equal to the mean downtime of the most reliable component of the network. We have checked that scaling this value by 0.5 or 2 does not change the qualitative character of the results. Each connection supporting a service is routed with the shortest-path routing found by the Dijkstra algorithm (our networks are modelled by weighted digraphs, where the weights representing the lengths of the links are non-negative). For each scenario, we ran 100,000 simulations developed in C++. Each simulation covered 1 year of network operation; this is the interval for which penalties due to the assumed compensation policies are estimated. We need this number of simulations since the events are rare, and only with 100,000 simulations do the observed correlations for two runs start to differ at the third decimal place. The mathematical modelling is performed with the help of MATLAB.
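A minimal sketch of one such simulation run for a single component (the rates are hypothetical, not the per-link values derived from [31]):

```python
import random

# One simulation run: a year (8760 h) of a single component's alternating
# up/down process, accumulating downtime and counting outages. Rates are
# hypothetical, not the per-link values used in the paper's study.
def simulate_year(lam=0.01, mu=0.5, hours=8760.0, rng=random):
    t, downtime, outages = 0.0, 0.0, 0
    while True:
        t += rng.expovariate(lam)        # up-time until the next failure
        if t >= hours:
            break
        repair = rng.expovariate(mu)     # downtime of this outage
        downtime += min(repair, hours - t)
        outages += 1
        t += repair
    return downtime, outages

random.seed(5)
runs = [simulate_year() for _ in range(2000)]
mean_outages = sum(c for _, c in runs) / len(runs)   # ~ 8760 / (1/lam + 1/mu)
```

A full run of the study repeats this for every component, maps the resulting per-service outages through the compensation policies, and aggregates the penalties over the year.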

A typical assumption made during various risk assessment calculations is that all the failures and repairs are independent and that the downtimes are exponentially distributed. We retain the former assumption; the latter, however, is only a rough approximation of reality. Moreover, with exponential downtimes, we cannot observe the non-subadditive character of \(VaR\), since heavy tails are not present, while it has been reported that PDFs of recovery times in networks can be heavy-tailed [30, 31, 33]. We use such distributions (the Pareto distribution), and by changing their parameters, we show that the lack of subadditivity of \(VaR\) appears for extremely atypical values only. The investigated scenarios combine the following elements:

- 1. Networks:
  - (a) German (\({\texttt{N}}_{\text {Ger}}\)),
  - (b) US (\({\texttt{N}}_{\text {US}}\)).
- 2. Resilience methods:
  - (a) unprotected service (\({\texttt{R}}_{\text {UP}}\)),
  - (b) dedicated path protection (\({\texttt{R}}_{\text {DP}}\)).
- 3. Compensation policies:
  - (a) Cont,
  - (b) Avail,
  - (c) FixedRestart,
  - (d) Snowball.
- 4. Distributions of failure times and downtimes:
  - (a) E: negative exponential \({\texttt{Exp}}_{\lambda }\) (\(f_{\text {exp}}(t) = \lambda e^{-\lambda t}\));
  - (b) P: Pareto \({\texttt{Par}}_{\alpha ,m }\) (\(f_{\text {Pareto}}(t) = \frac{\alpha m^\alpha }{t^{\alpha +1}}\)); note that for this distribution there is no mean when \(\alpha \le 1\) and no variance when \(\alpha \le 2\); the \(m\) parameter is 60 s in all cases (this value stems from the granularity of router queries sent by the Simple Network Management Protocol to check the connectivity state in the broad numerical study shown in [31]), except for the extreme case that attains non-subadditive \(VaR\);
  - (c) W: Weibull;
  - (d) L: log-normal.
- 5.Risk measures:
- (a)
the value of the aggregated risk obtained in the simulation: \(VaR \left( \varSigma \right)\),

- (b)
naïve upper bound for aggregated risk: \(\sum VaR\),

- (c)
efficient upper bound introduced by the theoretical model presented in this paper: \(VaR _{\text {Th}}\).

- (a)

### 5.1 Study I: Subadditivity of Value-at-Risk is Not an Issue

The general character of the obtained results is shown in Fig. 4, which groups the three representative cases. It relates to the US network, but the character of the results is the same for the German network. In this figure, only the results for the Avail compensation policy are presented, as this policy is the most sensitive to the recovery time distribution. The figure shows two curves. One relates to the risk measure calculated separately for each connection and then summed (the naïve upper bound, which is exceeded if the measure lacks subadditivity). The other shows the value of the measure calculated for the distribution of the total penalty in the network. Figure 4a presents the situation when both failure and recovery times are exponential: the PDF of the penalties converges to a Gaussian-, Gamma-, or log-normal-like shape, since the cumulative downtime distribution is the convolution of exponential times. Therefore, subadditivity holds, and we do not analyze results for exponential recovery times any further.
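The convergence observed in Fig. 4a is expected. Assuming, for illustration, that a connection experiences \(k\) independent repairs within the period, each with an i.i.d. \({\texttt{Exp}}_{\lambda}\) downtime, the cumulative downtime \(D_k\) is their \(k\)-fold convolution, i.e. Erlang (Gamma) distributed:

```latex
f_{D_k}(t) = \frac{\lambda^k \, t^{k-1} \, e^{-\lambda t}}{(k-1)!}, \qquad t \ge 0.
```

Its tail decays exponentially, so no heavy tails can arise and the subadditivity of \(VaR\) is not endangered in this setting.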

The extent of subadditivity violation is quantified with the *Relative Subadditivity Measure*, defined as:

### 5.2 Study II: Effective Upper Bounding of Risk Measures

We know that, while calculating \(VaR\), we should not simply sum the values of \(VaR\) obtained for individual services, as this measure is not subadditive in general. Even when it happens to behave subadditively, we would like to obtain tighter estimates. Therefore, we show that the theoretical model derived in Sect. 4 provides a very good upper bound for the aggregated risk, much better than the naïve bound obtained by adding the risk measure values across all services separately.

where *n* is the number of samples [61]. Moreover, for large values of \(\eta\), if the tails of the distributions do not decay quickly, the number of simulations has to be increased in order to gather enough samples for calculating the quantiles. Our approach avoids such technical problems. Note, however, that for the Snowball compensation policy combined with Pareto downtimes we are not able to bound the results effectively.

### 5.3 Summary of the Results

We promote the use of business-relevant risk measures in the context of resilient networks. That is, we propose to apply the commonly accepted quantile measure \(VaR\), which is widely used in the investment sector to assess the obligatory level of savings. In network design and management, this approach can be used to predict penalties, estimate the level of protection provided against failures, or suggest necessary changes to the network. The simpler risk measures currently used in network design are based on mean values and thus lose information about impact variability; they describe the character of the impact distribution only roughly. \(VaR\), on the other hand, preserves the information on variability and makes it possible to apply the complex portfolio optimization methods elaborated in the financial sector. Nevertheless, we paid attention not only to the advantages of this measure, but also highlighted its potential drawbacks. We showed that: (1) for a broad spectrum of distributions encountered in real networks, \(VaR\) is subadditive in practice and can be used reliably; (2) even when we were able to find highly unrealistic parameter values for which the lack of subadditivity manifests itself, it generally occurred for low quantile values that are of little interest in practical risk assessment; (3) as the quantification of \(VaR\) requires calculation of the whole penalty distribution, we proposed a computationally effective method for upper bounding the total penalty to be paid by the operator under the various compensation policies encountered in practice; (4) while our newly introduced model assumes the memoryless property of the involved stochastic processes, it also performs well when challenged with various non-exponential distributions.

## 6 Conclusions

With the results confirmed experimentally and summarized in the previous section, we provide operators with a tool to assess the business consequences of technical losses, improving SLA preparation as well as network design and management processes. The tool may also serve resilience purposes, e.g. the selection of network parts to be specially protected. The calculated values can also be used in optimization problems, opening the possibility of applying methods elaborated in modern portfolio theory.

We regard making the most of this potential as future work, where we would like to: (1) focus on mathematical programming-based optimization approaches that treat connections or service classes in resilient networks as investments, where risk assesses either the return from selling them or the loss incurred when the services fail; (2) extend the presented model by relaxing the constraint of a unified type of compensation policy across all services; (3) add relevant modelling for shared protection, where the backup resources are no longer dedicated to selected services; (4) deal with services that are not based on an end-to-end (unicast) connection, but rather on a connection to a pool of resources, where the availability of a single item suffices to provide the service. From a data transfer viewpoint, this last scheme can be perceived as anycast, and its practical applications are relevant to cloud or grid environments. While this problem is of utmost practical importance, the related modelling is more complex than the one used in this paper: the level of service dynamics involved is much higher, and the IT infrastructure depends on external resources, such as power provisioning, which leads to hard problems known under the name of system-of-systems modelling.

## Footnotes

1. According to the most recent *Amazon S3 Service Level Agreement* (version: September 16, 2015; source: https://aws.amazon.com/s3/sla/), an outage starts to be counted when requests are not responded to for at least 5 min.
2. Due to limited space, we are not able to present all the obtained results in this paper. They are therefore presented in the form of plots on the companion webpage: http://home.agh.edu.pl/~cholda/research/effective-risk-assessment-with-value-at-risk/.

## Notes

### Acknowledgments

This scientific work was financed by the Polish Ministry of Science and Higher Education from the research budget for 2013-2015, Project No. IP2012 022972. This research was supported in part by PL-Grid Infrastructure.

## References

1. Trček, D.: Computationally supported quantitative risk management for information systems. In: Gülpinar, N., Harrison, P., Rüstem, B. (eds.) Performance Models and Risk Management in Communications Systems, Springer Optimization and Its Applications, pp. 55–77. Springer, New York (2011)
2. Banerjee, S., Shirazipourazad, S., Ghosh, P., Sen, A.: Beyond connectivity—new metrics to evaluate robustness of networks. In: Proceedings of 12th IEEE International Conference on High Performance Switching and Routing HPSR 2011, Cartagena, Spain (2011)
3. Araujo Wickboldt, J., Bianchin, L.A., Castagna Lunardi, R., Granville, L.Z., Gaspary, L.P., Bartolini, C.: A framework for risk assessment based on analysis of historical information of workflow execution in IT systems. Comput. Netw. **55**(13), 2954–2975 (2011)
4. Todinov, M.: Risk-Based Reliability Analysis and Generic Principles for Risk Reduction. Elsevier, Amsterdam (2006)
5. Chołda, P., Jaglarz, P.: Optimization/simulation-based risk mitigation in resilient green communication networks. J. Netw. Comput. Appl. **59**, 134–157 (2016)
6. Chołda, P., Guzik, P., Rusek, K.: Risk-awareness in resilient networks design: Value-at-Risk is enough. In: Proceedings of 16th International Telecommunications Network Strategy and Planning Symposium NETWORKS 2014, Funchal, Madeira, Portugal (2014)
7. Chołda, P., Rusek, K., Guzik, P.: Upper bound for failure risk in networks. Electron. Not. Discrete Math. **51**, 31–38 (2016)
8. Mastroeni, L., Naldi, M.: Compensation policies and risk in service level agreements: a Value-at-Risk approach under the ON-OFF service model. In: Proceedings of 7th International ICQT Workshop on Advanced Internet Charging and QoS Technology ICQT 2011, Paris, France (2011)
9. Ackermann, T.: IT Security Risk Management. Perceived IT Security Risks in the Context of Cloud Computing. Springer Fachmedien, Wiesbaden (2013)
10. Arratia, A.: Computational Finance. An Introductory Course with R. Atlantis Studies in Computational Finance and Financial Engineering. Atlantis Press, Paris (2014)
11. Franke, U.: Optimal IT service availability: shorter outages, or fewer? IEEE Trans. Netw. Serv. Manag. **9**(1), 22–33 (2012)
12. Vasseur, J.P., Pickavet, M., Demeester, P.: Network Recovery. Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Morgan Kaufmann, San Francisco (2004)
13. Schupke, D.A.: Multilayer and multidomain resilience in optical networks. Proc. IEEE **100**(5), 1140–1148 (2012)
14. Elnegaard, N.K., Stordahl, K.: Modelling uncertainty and risk in telecommunication investment project. Telektronikk **104**(3/4), 119–135 (2008)
15. Wheeler, E.: Security Risk Management. Syngress, Waltham (2011)
16. Teixeira, A., Sou, K.C., Sandberg, H., Johansson, K.H.: Secure control systems. A quantitative risk management approach. IEEE Control Syst. Mag. **35**(1), 24–45 (2015)
17. Shin, J., Son, H., Khalil ur, R., Heo, G.: Development of a cyber security risk model using Bayesian networks. Reliab. Eng. Syst. Saf. **134**, 208–217 (2015)
18. Jing, Y., Ahn, G.J., Zhao, Z., Hu, H.: Towards automated risk assessment and mitigation of mobile applications. IEEE Trans. Dependable Sec. Comput. **12**(5), 571–584 (2015)
19. Dabbebi, O., Badonnel, R., Festor, O.: An online risk management strategy for VoIP enterprise infrastructures. J. Netw. Syst. Manag. **23**(1), 137–162 (2015)
20. Wang, J., Chaudhury, A., Rao, H.R.: A Value-at-Risk approach to information security investment. Inf. Syst. Res. **19**(1), 106–120 (2008)
21. Cao, Z., Guan, Z., Chen, Z., Hu, J.B., Tang, L.Y.: Towards risk evaluation of Denial-of-Service vulnerabilities in security protocols. J. Comput. Sci. Technol. **25**(2), 375–387 (2010)
22. Mello, D.A.A., Schupke, D.A., Waldman, H.: A matrix-based analytical approach to connection unavailability estimation in shared backup path protection. IEEE Commun. Lett. **9**(9), 844–846 (2005)
23. Heegaard, P.E., Trivedi, K.S.: Network survivability modeling. Comput. Netw. **53**(8), 1215–1234 (2009)
24. Distefano, S., Trivedi, K.S.: Non-Markovian state-space models in dependability evaluation. Qual. Reliab. Eng. Int. **26**(2), 225–239 (2013)
25. Ghosh, R., Kim, D., Trivedi, K.S.: System resiliency quantification using non-state-space and state-space analytic models. Reliab. Eng. Syst. Saf. **116**, 109–125 (2013)
26. Dikbiyik, F., Tornatore, M., Mukherjee, B.: Minimizing the risk from disaster failures in optical backbone networks. J. Lightwave Technol. **32**(18), 3175–3183 (2014)
27. Kuusela, P., Norros, I.: Dynamic approach to Service Level Agreement risk. In: Proceedings of 9th International Conference on Design of Reliable Communication Networks DRCN 2013, Budapest, Hungary (2013)
28. González, A.J., Helvik, B.E., Tiwari, P., Becker, D.M., Wittner, O.J.: GEARSHIFT: Guaranteeing availability requirements in SLAs using hybrid fault tolerance. In: Proceedings of 2015 IEEE Conference on Computer Communications INFOCOM 2015, Hong Kong, China (2015)
29. Chołda, P., Følstad, E.L., Helvik, B.E., Kuusela, P., Naldi, M., Norros, I.: Towards risk-aware communications networking. Reliab. Eng. Syst. Saf. **109**, 160–174 (2013)
30. Markopoulou, A., Iannaccone, G., Bhattacharyya, S., Chuah, C.N., Ganjali, Y., Diot, C.: Characterization of failures in an operational IP backbone network. IEEE/ACM Trans. Netw. **16**(4), 749–762 (2008)
31. Kuusela, P., Norros, I.: On/off process modeling of IP network failures. In: Proceedings of 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks DSN 2010, Chicago, IL (2010)
32. González, A.J., Helvik, B.E., Hellan, J.K., Kuusela, P.: Analysis of dependencies between failures in the UNINETT IP backbone network. In: Proceedings of 16th Pacific Rim International Symposium on Dependable Computing PRDC 2010, Tokyo, Japan (2010)
33. Uchida, M.: Statistical characteristics of serious network failures in Japan. Reliab. Eng. Syst. Saf. **131**, 126–134 (2014)
34. Garraghan, P., Moreno, I.S., Townend, P., Xu, J.: An analysis of failure-related energy waste in a large-scale cloud environment. IEEE Trans. Emerg. Top. Comput. **2**(2), 166–180 (2014)
35. Takács, L.: On certain sojourn time problems in the theory of stochastic processes. Acta Math. Acad. Sci. Hung. **8**(1–2), 43–48 (1957)
36. Kijima, M.: Markov Processes for Stochastic Modeling. Stochastic Modeling Series. Springer, New York (1997)
37. Hedwig, M., Malkowski, S., Neumann, D.: Risk-aware Service Level Agreement design for enterprise information systems. In: Proceedings of 45th Hawaii International Conference on System Sciences HICSS-45, Grand Wailea, Maui, HI (2012)
38. Mastroeni, L., Naldi, M.: Violation of service availability targets in Service Level Agreements. In: Proceedings of Federated Conference on Computer Science and Information Systems FedCSIS 2011, Szczecin, Poland (2011)
39. Xia, M., Tornatore, M., Martel, C.U., Mukherjee, B.: Risk-aware provisioning for optical WDM mesh networks. IEEE/ACM Trans. Netw. **19**(3), 921–931 (2011)
40. Trivedi, K.S.: Probability and Statistics with Reliability, Queuing, and Computer Science Applications. Wiley, New York (2001)
41. Dikbiyik, F., Reaz, A.S., De Leenheer, M., Mukherjee, B.: Minimizing the disaster risk in optical telecom networks. In: Proceedings of Optical Fiber Communication and the National Fiber Optic Engineers Conference OFC/NFOEC 2012, Los Angeles, CA (2012)
42. Snedaker, S., Rima, C.: Business Continuity and Disaster Recovery Planning for IT Professionals. Syngress, Waltham (2014)
43. Clemente, R., Bartoli, M., Bossi, M.C., D'Orazio, G., Cosmo, G.: Risk management in availability SLA. In: Proceedings of 5th International Workshop on the Design of Reliable Communication Networks DRCN 2005, Lacco Ameno, Island of Ischia, Italy (2005)
44. Olson, D.L., Wu, D.: The impact of distribution on Value-at-Risk measures. Math. Comput. Model. **58**(9–10), 1670–1676 (2013)
45. Ahmed, M.S., Al-Shaer, E., Taibah, M., Khan, L.: Objective risk evaluation for automated security management. J. Netw. Syst. Manag. **19**(3), 343–366 (2011)
46. Tsai, H.Y., Huang, Y.L.: An analytic hierarchy process-based risk assessment method for wireless networks. IEEE Trans. Reliab. **60**(4), 801–816 (2011)
47. Alexander, C., Sarabia, J.M.: Quantile uncertainty and Value-at-Risk model risk. Risk Anal. **32**(8), 1293–1308 (2012)
48. MacKenzie, C.A.: Summarizing risk using risk measures and risk indices. Risk Anal. **34**(12), 2143–2162 (2014)
49. Mastroeni, L., Naldi, M.: Options and overbooking strategy in the management of wireless spectrum. Telecommun. Syst. **48**(1–2), 31–42 (2011)
50. González, A.J., Helvik, B.E.: SLA success probability assessment in networks with correlated failures. Comput. Commun. **36**(6), 708–717 (2013)
51. Sun, L., Hong, L.J.: A general framework of importance sampling for Value-at-Risk and Conditional Value-at-Risk. In: Proceedings of 2009 Winter Simulation Conference WSC 2009, Austin, TX (2009)
52. Göb, R.: Estimating Value at Risk and Conditional Value at Risk for count variables. Qual. Reliab. Eng. Int. **27**(5), 659–672 (2011)
53. Artzner, P., Delbaen, F., Eber, J.M., Heath, D.: Coherent measures of risk. Math. Finance **9**(3), 203–228 (1999)
54. Alexander, C. (ed.): Value-at-Risk models. In: Market Risk Analysis, vol. IV. Wiley, New York (2008)
55. Mansini, R., Ogryczak, W., Speranza, M.G.: Twenty years of linear programming based portfolio optimization. Eur. J. Oper. Res. **234**(2), 518–535 (2014)
56. Alexander, G.J., Baptista, A.M., Yan, S.: Bank regulation and stability: an examination of the Basel market risk framework. In: Proceedings of Joint Fall Conference Basel III and Beyond: Regulating and Supervising Banks in the Post-Crisis Era, Eltville am Rhein, Germany (2011)
57. Ogryczak, W., Ruszczyński, A.: Dual stochastic dominance and quantile risk measures. Int. Trans. Oper. Res. **9**(5), 661–680 (2002)
58. Balakrishnan, N., Limnios, N., Papadopoulos, C.: Basic probabilistic models in reliability. In: Balakrishnan, N., Rao, C.R. (eds.) Advances in Reliability, Handbook of Statistics, vol. 20, chap. 1, pp. 1–42. Elsevier, Oxford (2001)
59. Dayar, T.: Analyzing Markov Chains Using Kronecker Products. Theory and Applications. Springer Briefs in Mathematics. Springer, New York (2013)
60. Orlowski, S., Wessäly, R., Pióro, M., Tomaszewski, A.: SNDlib 1.0—Survivable Network Design Library. Networks **55**(3), 276–286 (2010)
61. Chen, S.X., Hall, P.: Smoothed empirical likelihood confidence intervals for quantiles. Ann. Stat. **21**(3), 1166–1181 (1993)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.