Enterprise IT service downtime cost and risk transfer in a supply chain

In this paper we present an economic model for analyzing enterprise IT service downtime cost, first on a standalone basis and then in a supply chain setting. With a baseline probability model of Poisson arrival frequency with random downtime duration, we analyze optimal production of a firm’s investments in reducing frequency and duration of downtime, and corresponding premiums for insuring against downtime cost. We also present a model for the spillover effect of downtime for interconnected firms in a supply chain, and discuss how third-party insurance coverage can help enterprises to internalize the externalities of spillover effects on the supply chain.


Introduction
An enterprise may face IT service downtime costs due to a variety of causes, including antagonistic IT attacks by hackers, non-antagonistic IT service outages, or natural catastrophes such as floods or solar storms. The potential losses from downtime can be highly significant, with hourly costs of IT service outages ranging from hundreds of thousands to even millions of US dollars (Rapoza 2014; Ponemon 2016), at least for large companies in certain industries. These large downtime costs are not unique to our present time and have been around for at least 20 years (IBM Global Services 1998). However, the sheer level of interdependence in modern IT service ecosystems is indeed a recent phenomenon. The advent of the internet over the past two decades unlocked the potential of integrated supply chains for small and medium-sized enterprises (Stefansson 2002), where previously such supply chains had been the sole privilege of large companies. However, it is the ubiquity of cloud services in recent years that has increased today's complex dependencies, along with the fact that IT integration of supply chains does lead to more competitive business performance (Swafford et al. 2008). As a consequence, there is no shortage of research on the reliability, e.g., Dai et al. (2009) and Wood et al. (2010), and security, e.g., Takabi et al. (2010) and Subashini and Kavitha (2011), of cloud services.
In this paper, we analyze the cost of IT service downtime, first for a single firm, and then from the perspective of interdependencies between firms in a supply chain, where the downtime of one enterprise IT service may indirectly trigger further business interruptions at other firms. This is the case when a firm buys services such as credit-card payment, authentication, physical storage and associated inventory status information, customer relations data storage, bookkeeping, etc. from third parties. For a business process to execute, e.g. a customer successfully placing an order, the services of these other firms need to be available at the same time. From an economics perspective, such interdependency has interesting consequences, as downtime at one firm may affect multiple other firms in a supply chain. Specifically, security investments and resource allocation by one firm may impose externalities on other firms connected to it in the supply chain (Zhao et al. 2013). Whether there is a role for insurance to internalize this externality is also a question of both theoretical interest and practical implications. As a risk management tool, firms can purchase insurance coverage to manage the unforeseen consequences of downtime costs. A recent study by Lloyd's showed that a few days of downtime for a top cloud service provider would result in billions of US dollars in insured losses in the US alone (Lloyd's 2018). It should be remembered that even though the US is one of the most mature insurance markets, only a tiny fraction of the losses are insured (due to low take-up rates and low coverage limits in the policies offered). The uninsured losses, borne by the firms themselves, would be much greater in such a scenario.
It is this background of large costs, interdependencies, externalities and the role of insurance that motivates our work. Following a brief literature review in Section 2, we introduce a stochastic model of IT service downtime in Section 3. Section 4 then extends this model with the downtime costs to a firm, and Section 5 addresses the resulting resource allocation problem. Section 6 extends the model to the supply chain setting. In Section 7, we discuss how insurance products covering business disruption can help firms in a supply chain to manage the risk related to downtime costs. Finally, Section 8 concludes the paper.

Related work
As noted above, the general areas of digital supply chain and cloud computing security and reliability are well researched. However, what appears to be the largest strand of research in cloud security primarily addresses antagonistic threats and data breaches (Ali et al. 2015; Ahmed and Hossain 2014). These are certainly pertinent issues, but somewhat different from our focus, as we address downtime costs caused by both antagonistic and non-antagonistic factors. Furthermore, much of the technical literature delimits itself from economic aspects of availability, including the cost and incentive issues that we address. In the following, we briefly review some of the most relevant literature related to our contribution.

Economics of outages, security and supply chain resilience
From an economics perspective, supply chain security has been analyzed previously from several angles. Jenjarrussakul et al. (2013) consider sectoral and regional interdependence of Japanese firms under the influence of information security, in a model where sectoral security risk levels are represented by survey data. Their work is similar to ours in investigating the dependencies between firms, but as a descriptive model, it differs from our more prescriptive model used to analyze how to minimize downtime cost and design insurance products covering business disruption. Thomas et al. (2013) focus on the complexity of managing an incident in a supply chain, where many partners need to be coordinated. Their work is similar to ours in addressing the interplay between supply chain actors, but differs importantly by focusing on data breach rather than service outage. Dynes et al. (2005) offer an interesting interview-based investigation of firms' self-assessment of their supply chain resilience in the face of internet outages of up to a week. The results indicate overconfidence, and even though it is difficult to imagine that executives today maintain that their supply chains remain robust in the face of a week-long internet service outage, the relative overconfidence might still be similar. Dynes et al. (2007) investigate the costs of internet outages to the US economy. Importantly, they estimate the multipliers by which the direct costs are related to the indirect ones, e.g. through supply chain ripple effects. While this is similar to our area of interest, Dynes et al. (2007) are more descriptive of fait accompli consequences, and do not focus on decision problems and incentives for individual firms to prevent outages. Shahrad and Wentzlaff (2016) propose how the inefficiency of best-effort availability for cloud computing customers with varying demands and willingness to pay for availability can be improved upon by flexible availability offerings. They offer an OpenStack prototype and encouraging simulation results of improved profit margins. A main difference compared to our work is that Shahrad and Wentzlaff (2016) focus on the situation where a single service provider uni-directionally serves many diverse customers, rather than the more general case of bi-directional dependencies that we address. Aceto et al. (2018), in a recent survey of the literature on internet outages, note that this literature is rather scattered and difficult to get an overview of. While "internet outages" only partially overlap with our object of study, it is still worth observing that one of the open issues identified in the survey concerns the lack of proper models for risk assessment. In particular, this is identified as a problem for insurers, and a barrier to properly priced insurance (ibid, Section 7, p. 51). Our study makes a contribution to precisely this area.
In the literature there is an abundant exposition of the production of security investment (Gordon and Loeb 2002; Anderson and Moore 2006; Matsuura 2009) and the externalities of security investment (Camp and Wolfram 2004; Muermann and Kunreuther 2008; Herley 2009). However, the mainstream strand in this research field concerns externalities imposed by underinvestment in security against antagonistic threats. For example, Camp and Wolfram (2004) discuss vulnerabilities that can be exploited by attackers. This differs from our focus on downtime, which can be either non-antagonistic or antagonistic in nature.

Cyber insurance
As we discuss implications for insurance, it is also relevant to consider the cyber insurance literature. Böhme and Kataria (2006) discuss correlations of cyber risks in insurance portfolios. This is similar to our interest in interdependencies between firms (second-tier correlations, in the terminology of Böhme & Kataria), but different in focus: Böhme & Kataria model loss events using attack data obtained from honeypots, whereas we do not distinguish between downtime events caused by cyberattacks and other known or unknown, natural or human causes. Furthermore, as we analyze downtime, we are interested in durations as well as occurrences of events. Böhme & Kataria use a discrete (beta-binomial) distribution to model their loss events, while our model is based on Poisson arrival frequency and random downtime duration.
From an empirical cyber insurance point of view, it should be noted that the downtime costs of business interruption are insurable. Franke (2017) notes that some kind of coverage for business interruption caused by antagonistic factors is included in all insurance policies on offer in Sweden, whereas business interruptions caused by non-antagonistic factors are treated differently (some insurers exclude them, some include them, while some offer coverage as an option). A CCRS/RMS survey cited by the OECD (2017) found that business interruption resulting from third-party disruption was covered by only a third of the policies investigated.

Scope and contribution
This paper contributes to the literature through cross-pollination of various academic fields, including information and system security, management information systems, economics of information security, and insurance and actuarial science. It presents novel applications of popular probability models of IT service breakdown and duration of breakdown, as well as economic impacts and insurance premium calculations.

Probability model of downtime for an enterprise
For simplicity we consider a generic firm that offers a single IT service, with a target average availability (the fraction of total time that the service is working as it should) such as 99.95%. Traditionally, the average availability is defined as the ratio of Mean Time To Failure (MTTF) to the sum of Mean Time To Failure (MTTF) and Mean Time To Restore (MTTR):

A = MTTF / (MTTF + MTTR)    (1)

We note that the conventional measure of average availability A in Eq. 1 does not differentiate between the following two cases:

i) 12 outages in a year, with 2 hours of downtime each time;
ii) 1 single outage in a year, with downtime lasting 24 hours.

Assuming 365 days per year and 24 hours per day, the conventional measure of average availability A in Eq. 1 would be the same for both cases, namely A = 364/365. However, these two cases may have different economic impacts. Suppose that the firm has an insurance policy that covers business interruption due to downtime, with a deductible of 8 hours of waiting time. Then the insurance pay-out would be zero in case i) but non-zero in case ii).
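The two cases above can be checked numerically. The following sketch (hypothetical helper names, not from the paper) shows that the conventional measure A is identical for both outage patterns, while a per-occurrence 8-hour waiting-time deductible distinguishes them:

```python
# Hypothetical illustration of the two cases: identical average availability A,
# but very different insured downtime under a per-occurrence waiting-time deductible.
HOURS_PER_YEAR = 365 * 24

def availability(outages):
    """Average availability A = uptime / total time for a list of outage durations (hours)."""
    return (HOURS_PER_YEAR - sum(outages)) / HOURS_PER_YEAR

def insured_hours(outages, deductible=8):
    """Downtime hours payable after applying the waiting-time deductible per occurrence."""
    return sum(max(t - deductible, 0) for t in outages)

case_i = [2] * 12   # 12 outages of 2 hours each
case_ii = [24]      # 1 outage lasting 24 hours

assert availability(case_i) == availability(case_ii)  # both equal 364/365
assert insured_hours(case_i) == 0                     # every outage below the deductible
assert insured_hours(case_ii) == 16                   # 24 - 8 hours payable
```

This is exactly the distinction that motivates replacing the availability ratio with a frequency-duration model.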
We thus propose a probability-based model for IT service breakdowns.

Frequency of breakdown - Poisson arrival model
Consider a single generic firm. We assume that the number of breakdowns in a given time interval follows a Poisson arrival process with intensity λ. The random number N of IT service breakdowns in any given one-year time interval is

P(N = n) = e^(-λ) λ^n / n!,  n = 0, 1, 2, ...

The Poisson arrival process can also be defined by stating that the time intervals between IT service breakdowns are exponential variables with mean 1/λ (e.g. Ross 1996).
Remark 1 The Poisson process is one of the most widely-used counting processes; it is used when the occurrences of certain events happen at a certain rate, but completely at random. For example, suppose that from historical data, we know that earthquakes occur in a certain area at a rate of 1 per five years. Other than this information, the exact timings of earthquakes seem to be completely random. Thus, we conclude that the Poisson process might be a good model for earthquakes. In practice, the Poisson process or its extensions have been used to model the number of car accidents at a location, the number of requests for individual documents on a web server in a given day, etc.; see Kingman (1992) and Haenggi (2012). More specifically, Poisson arrivals of outages are a common assumption in the literature on IT service reliability and availability, e.g. Martinello et al. (2005); Jeske and Zhang (2005); Franke (2012, 2016, 2019); Yamada (2014); Taylor and Ranganathan (2014).
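The equivalence between Poisson counts and exponential inter-arrival times can be illustrated by a minimal simulation sketch (hypothetical parameter value, stdlib only): drawing exponential inter-arrival times with mean 1/λ and counting arrivals per year yields counts whose mean and variance both approach λ, as a Poisson distribution requires.

```python
# Simulate Poisson arrivals in a unit (one-year) interval via exponential
# inter-arrival times with mean 1/lam, and check the Poisson mean = variance = lam.
import random

random.seed(42)

def breakdowns_in_year(lam):
    """Count arrivals in one year given exponential inter-arrival times."""
    t, n = 0.0, 0
    while True:
        t += random.expovariate(lam)  # mean inter-arrival time 1/lam
        if t > 1.0:
            return n
        n += 1

lam = 2.0  # hypothetical intensity
counts = [breakdowns_in_year(lam) for _ in range(100_000)]
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
assert abs(mean - lam) < 0.05   # Poisson mean equals lambda
assert abs(var - lam) < 0.1     # Poisson variance also equals lambda
```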

Downtime duration model
Given that an IT service breakdown occurs, the random duration of a breakdown T follows a probability distribution with raw moments μ^(k) = E[T^k], k = 1, 2, .... Specifically, μ^(1) is the average downtime per occurrence.
To facilitate later calculation of the premium for insurance against downtime cost, we introduce the concept of a deductible (waiting time) d, defined such that the downtime can be decomposed into two parts:

T = min(T, d) + max(T − d, 0)

Note that the expected value of the downtime in excess of the deductible d is:

E[max(T − d, 0)] = μ^(1) − E[min(T, d)]

For IT services, the lognormal distribution is often used to model outage durations (Schroeder and Gibson 2010; Franke et al. 2014). A lognormal distribution is defined by the probability density function:

f(t) = 1 / (t σ √(2π)) exp(−(ln t − a)² / (2σ²)),  t > 0    (4A)

The k-th raw moment of the lognormal distribution is

μ^(k) = E[T^k] = exp(ka + k²σ²/2)    (4B)

The lognormal distribution has a limited expected value (Bahnemann 2015):

E[min(T, d)] = exp(a + σ²/2) Φ((ln d − a − σ²)/σ) + d [1 − Φ((ln d − a)/σ)]

where Φ is the standard normal cumulative distribution function.
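The limited-expected-value formula can be sanity-checked by Monte Carlo. The sketch below (hypothetical parameter values, matching those used later in Example 1) computes the closed-form expected excess downtime E[max(T − d, 0)] for a lognormal duration and compares it with a simulation estimate:

```python
# Closed-form expected excess downtime over a waiting-time deductible d for a
# lognormal duration, checked against Monte Carlo (hypothetical a, sigma, d).
import math
import random

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_excess(a, sigma, d):
    """E[max(T - d, 0)] = E[T] - E[min(T, d)] for T ~ lognormal(a, sigma)."""
    mean = math.exp(a + sigma ** 2 / 2)
    return (mean * (1 - Phi((math.log(d) - a - sigma ** 2) / sigma))
            - d * (1 - Phi((math.log(d) - a) / sigma)))

random.seed(1)
a, sigma, d = 0.7, 1.2, 8.0
mc = sum(max(math.exp(random.gauss(a, sigma)) - d, 0.0)
         for _ in range(200_000)) / 200_000
assert abs(mc - expected_excess(a, sigma, d)) < 0.1
```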
Remark 2 The downtime duration can have other distributions and can take either discrete or tabular form.

Cost of downtime as function of duration
Consider the firm's downtime costs in a given year. Assume that given an occurrence of downtime, the firm's downtime cost C(T) is a function of the duration T of downtime. Generally, C(T) should be a non-decreasing function of the downtime duration T. Here are three simple cost functions:

1. Linear: C(T) = c·T
2. Quadratic: C(T) = c·T²
3. Constant: C(T) = c

Remark 3 As empirical studies of IT service outage costs are rare, it is interesting to consider several different functional forms of the cost. Conceptually, it is worth distinguishing revenue lost from repair costs (Oppenheimer et al. 2003), and exploring cost functions capable of capturing both. The linear model is in a sense the simplest one, corresponding to models such as Patterson's (2002), where the cost of an outage hour is always the same. The quadratic model, by contrast, corresponds to cases where costs snowball (Franke 2012; Vecchio 2016), so that short outages are barely noticeable, but longer ones can have dire consequences. An example is when customers stop using an ATM or credit card payment system even after it has come back online, because a long outage gave it a bad reputation. Such reputational impacts of service outages can be considerable, including plummeting stock prices (Bharadwaj et al. 2009), and difficult to manage (Boritz and Mackler 1999; Manika et al. 2015). The constant model is a contrast to the other two, letting cost be independent of outage duration. It thus represents the case where fixed restart costs overshadow variable outage costs. Meland et al. (2017) cite data showing that response costs are on average three times as great as loss of business income. Though these categories are not exactly equivalent to fixed vs. variable costs, such data indicate that fixed, duration-independent costs cannot be ignored. Physical industrial processes are good examples of IT-dependent operations that have considerable fixed restart costs.
Note that the firm can incur a random number N of IT service breakdowns, where N has a Poisson distribution with mean λ. The n-th breakdown has a random duration T_n and incurs a downtime cost C(T_n). Thus, in a given year, the firm's aggregate downtime cost S is a random sum of N independent random variables:

S = C(T_1) + C(T_2) + ... + C(T_N)    (5A)

The aggregate downtime cost S in Eq. 5A has a compound Poisson distribution with the following mean and variance (Klugman et al. 2012):

E[S] = λ E[C(T)],  Var[S] = λ E[C(T)²]

Now we analyze each of the three simple downtime cost functions.
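The compound Poisson moment formulas can be verified by simulation. The sketch below (hypothetical parameters: linear cost with a lognormal duration, as in the later examples) estimates E[S] and Var[S] by Monte Carlo and compares them with λE[C(T)] and λE[C(T)²]:

```python
# Monte Carlo check of the compound Poisson mean and variance for a linear
# cost C(T) = c*T with lognormal durations (hypothetical parameter values).
import math
import random

random.seed(7)
lam, a, sigma, c = 0.5, 0.7, 1.2, 1.0

def poisson_sample(lam):
    """Sample N via exponential inter-arrival times in a unit interval."""
    t, n = 0.0, 0
    while True:
        t += random.expovariate(lam)
        if t > 1.0:
            return n
        n += 1

def yearly_cost():
    """One draw of S = sum over N breakdowns of c * T_n, T_n lognormal."""
    return sum(c * math.exp(random.gauss(a, sigma))
               for _ in range(poisson_sample(lam)))

samples = [yearly_cost() for _ in range(200_000)]
mean_S = sum(samples) / len(samples)
var_S = sum((s - mean_S) ** 2 for s in samples) / len(samples)

mu1 = math.exp(a + sigma ** 2 / 2)       # E[T]
mu2 = math.exp(2 * a + 2 * sigma ** 2)   # E[T^2]
assert abs(mean_S - lam * c * mu1) < 0.1                          # E[S] = lam*E[C(T)]
assert abs(var_S - lam * c ** 2 * mu2) / (lam * c ** 2 * mu2) < 0.25  # Var[S] = lam*E[C(T)^2]
```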

Linear model
The downtime cost per occurrence has the following moments: E[C(T)] = c μ^(1) and E[C(T)²] = c² μ^(2). The firm's aggregate downtime cost S in a given year has the following mean and variance: E[S] = λ c μ^(1) and Var[S] = λ c² μ^(2).

Quadratic model

As in Franke (2012), the coefficient c represents snowball effects.
The downtime cost per occurrence has the following moments: E[C(T)] = c μ^(2) and E[C(T)²] = c² μ^(4). The firm's aggregate downtime cost S in a given year has the following mean and variance: E[S] = λ c μ^(2) and Var[S] = λ c² μ^(4).

Constant model
The downtime cost per occurrence has the following moments: E[C(T)] = c and E[C(T)²] = c². The firm's aggregate downtime cost S in a given year has the following mean and variance: E[S] = λ c and Var[S] = λ c².

Resource allocations to cost reduction and risk transfer
It can be demonstrated that one effective way of reducing downtime is building redundancy. Normally, the frequency of breakdown of any one component is small (λ is near zero). To illustrate the effect of redundancy, e.g. when two independent payment systems have been procured, the corresponding expected downtime frequency of the payment service is approximately reduced to λ². Nevertheless, the sum of multiple independent sources of breakdown can increase the Poisson frequency λ. In general, more complex cases of architectural dependencies between IT services can be modeled using fault trees (Närman et al. 2014).
The concept of building redundancy to reduce breakdown frequency is not new. In the aviation industry, commercial aircraft have almost everything at least in duplicate, for instance, multiple engines, auxiliary fuel pumps, dual spark plugs, and dual electrical displays and circuitry. In addition to this redundancy, there are strict maintenance and training requirements for all commercial aircraft. This redundancy leads to an extremely small likelihood of an airliner completely "breaking down". Now we return to the discussion of enterprise IT services. We assume that a firm has already deployed baseline investments to assure some level of redundancy, which corresponds to a baseline frequency-duration model:

- The baseline breakdown frequency N_0 has a Poisson distribution with mean λ_0, and
- The baseline downtime duration T_0 has raw moments μ_0^(k).

Uptime production and resource allocation
To facilitate the firm's decision problem in resource allocation, assume that the firm can invest Capital K to reduce the frequency of downtime, and invest Labor L to reduce its duration. We assume a Cobb-Douglas (1928) type production function for IT service availability:

1. Investment K in reducing the frequency (e.g., investing in prevention and redundancy): λ(K) = λ_0 K^(−α)    (6A)
2. Investment L in reducing the duration (e.g., investing in detection and response) by a scaling factor: T(L) = T_0 L^(−β)    (6B)

Thus the k-th raw moment of T(L) satisfies μ^(k)(L) = μ_0^(k) L^(−kβ).

Remark 4 Franke (2014) considered a similar Cobb-Douglas type production model without using a probabilistic model. The intuition behind the model is that Capital K can buy better hardware (or similar hardware, to build redundancy), reducing the frequency of downtime (i.e. increasing the MTTF), while Labor L can be used to monitor the system and take swift action when it fails, which reduces the average duration (decreasing the MTTR).
Of course, Labor and Capital need not be taken literally, but can rather be seen as stylized descriptions of two different kinds of available investments.
With the Cobb-Douglas type production model in place, the firm faces a total cost (the sum of the investments K and L made to maintain availability, and the residual aggregate downtime cost S) as follows:

Total Cost = K + L + S,  with E[Total Cost] = K + L + λ(K) E[C(T(L))]    (7A)

The variance of the firm's aggregate downtime cost S is

Var[S] = λ(K) E[C(T(L))²]    (7B)

Proposition 1 Consider the Linear Model: C(T) = c·T.

a) For a given fixed budget K + L = M, the optimal allocations of resources to minimize the expected total cost (7A) are:

K* = [α/(α+β)] M,  L* = [β/(α+β)] M    (8A)

The optimal level M that minimizes the total cost (7A) has a closed-form formula:

M* = [λ_0 c μ_0^(1) (α+β)^(1+α+β) α^(−α) β^(−β)]^(1/(1+α+β))    (8B)

With the resource allocations in Eqs. 8A and 8B, the minimal expected total cost is

E[Total Cost] = M* [1 + 1/(α+β)]    (8C)

b) The resource allocations that minimize the variance of the aggregate downtime cost (7B), for a given budget K + L = M, are:

K* = [α/(α+2β)] M,  L* = [2β/(α+2β)] M

A corresponding closed-form expression exists for the level M that minimizes the variance of the aggregate downtime cost (7B).

Remark 5 From Eq. 8B, the optimal level of investment M* is an increasing function of both the baseline expected frequency λ_0 and the baseline expected downtime cost c μ_0^(1). Minimizing the expected total cost and decreasing the variance of the total cost, respectively, are the two availability management strategies discussed by Franke (2012, especially Section VI.A). When empirically investigating how IT professionals act in procuring availability Service Level Agreements (SLAs), however, most did not minimize the expected total cost, and many exhibited decision-making patterns not easily explained (Franke and Buschle 2016).
Proposition 2 Consider the Quadratic Model: C(T) = c·T².

a) For a given fixed budget K + L = M, the optimal allocations of resources to minimize the expected total cost (7A) are:

K* = [α/(α+2β)] M,  L* = [2β/(α+2β)] M    (9A)

The optimal level M that minimizes the total cost (7A) has a closed-form formula:

M* = [λ_0 c μ_0^(2) (α+2β)^(1+α+2β) α^(−α) (2β)^(−2β)]^(1/(1+α+2β))    (9B)

With the resource allocations in Eqs. 9A and 9B, the minimal expected total cost is

E[Total Cost] = M* [1 + 1/(α+2β)]    (9C)

b) The resource allocations that minimize the variance of the aggregate downtime cost (7B), for a given budget K + L = M, are:

K* = [α/(α+4β)] M,  L* = [4β/(α+4β)] M

A corresponding closed-form expression exists for the level M that minimizes the variance of the aggregate downtime cost (7B).

Remark 6 From Proposition 2, a large snowball effect in the cost of downtime indicates relatively more investment in shortening the duration of downtime.
Proposition 3 Consider the Constant Model: C(T) = c. For a given fixed budget K + L = M, the optimal allocations are K = M and L = 0. Specifically, the optimal total level of investment is M* = (α λ_0 c)^(1/(1+α)). Proof in Appendix.
Remark 7 A large fixed cost drives investment into minimizing the number of breakdowns.The result from the degenerate constant model is extreme, but a more realistic model of the allocation problem might be a weighted sum of the constant, linear, and quadratic models.

Risk transfer of downtime cost by insurance
Suppose that the firm can transfer the downtime cost to an insurer by purchasing insurance that reimburses the firm's cost of breakdown after a deductible (waiting time) d. The actuarially fair insurance premium can be calculated as

Premium = λ(K) E[C(max(T(L) − d, 0))]

Note that the level of the firm's investments K and L directly affects the insurance premium.
Example 1 Assume that the baseline frequency follows a Poisson arrival with λ_0 = 0.5, and the baseline downtime duration T_0 follows a lognormal distribution (4A) with a = 0.7 and σ = 1.2. From Eq. 4B we have μ_0^(1) = exp(a + σ²/2) = exp(1.42) ≈ 4.14. Assume that the production parameters in Eqs. 6A and 6B are α = 0.4 and β = 0.6, with λ(K) = 0.5 K^(−0.4) and μ^(1)(L) = 4.14 L^(−0.6). Let the downtime cost function be a linear model: C(T) = c·T with c = 1. According to Proposition 1, from Eq. 8B, the optimal level of resources is M* = 2.01. From Eq. 8A we have K*/L* = α/β = 0.4/0.6, with K* = 0.804 and L* = 1.206. From Eq. 8C, the minimal expected total cost is E[Total Cost] = 2M* = 4.03. Figure 1 illustrates how the resource allocation {K, L} affects the firm's total cost (the sum of resources allocated and expected aggregate downtime cost), where an optimal pair {K*, L*} exists.
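Example 1 can be reproduced by a brute-force grid search over allocations (a numeric sketch, not the paper's closed-form derivation): minimizing K + L + λ(K)·c·μ^(1)(L) directly recovers K* ≈ 0.804, L* ≈ 1.206 and the minimal expected total cost ≈ 4.03.

```python
# Grid-search verification of Example 1: minimize the expected total cost
# K + L + lambda0*K^(-alpha) * c * mu0*L^(-beta) over a 0.01-step grid.
import math

lam0, c = 0.5, 1.0
mu0 = math.exp(0.7 + 1.2 ** 2 / 2)  # E[T0] ~ 4.14 for the lognormal baseline
alpha, beta = 0.4, 0.6

def expected_total_cost(K, L):
    """Budget K + L plus the residual expected downtime cost lambda(K)*c*mu(L)."""
    return K + L + lam0 * K ** (-alpha) * c * mu0 * L ** (-beta)

best = min((expected_total_cost(k / 100, l / 100), k / 100, l / 100)
           for k in range(10, 300) for l in range(10, 300))
cost_star, K_star, L_star = best

assert abs(K_star / L_star - alpha / beta) < 0.02     # K*/L* = alpha/beta (Eq. 8A)
assert abs(K_star - 0.804) < 0.02 and abs(L_star - 1.206) < 0.02
assert abs(cost_star - 4.03) < 0.02                   # E[Total Cost] = 2 M*
```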
Figure 2 illustrates that, as expected, the premium for insuring against downtime cost (with a waiting time deductible d = 8) decreases as the allocated resources {K, L} increase.

Remark 8 The total optimal level of spending M* remains the same whether it is used for producing uptime or purchasing insurance. This follows from the fact that we assume an actuarially fair premium and enterprises without risk-aversion.
Example 2 As a slight variation of Example 1, instead of using a linear cost function, we now consider a quadratic cost function C(T) = c·T² with c = 0.05727, such that E[C(T)] remains the same as in Example 1. From Eq. 9B, the new optimal level of resources is M* = 2.24. From Eq. 9A we have K*/L* = α/(2β) = 0.4/1.2, and thus K* = 0.56 and L* = 1.68. Thus, relatively more resources are allocated to reducing the duration of downtime as compared to Example 1.

Insurance with Aggregate Limits
Recall that the firm's aggregate downtime cost S in Eq. 5A in a given year is a random sum of independent random cost variables. It is common for an insurance contract to specify not only a deductible d (waiting time) per occurrence but also impose a cap (annual aggregate limit) on the total insurance payment in a given policy year. Our frequency-duration model facilitates calculation of the probability distribution of the aggregate downtime cost for a firm, and thus enables calculating the cost of insurance (with deductible and aggregate limit).
For the firm, assume a Poisson frequency of breakdown with mean λ and a tabular per-occurrence downtime cost distribution f (after applying the waiting-time deductible). The aggregate downtime cost S in Eq. 5A, before applying the aggregate limit, has a compound Poisson distribution which can be computed using the Panjer recursion (Panjer 1981):

g(s) = (λ/s) Σ_{k=1}^{s} k f(k) g(s − k),  s = 1, 2, ...

with a starting value g(0) = exp(−(1 − f(0))λ). Once the aggregate loss distribution is computed, the aggregate limit can be applied to the aggregate loss distribution for computing the insurance cost.
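The Panjer recursion is straightforward to implement. The sketch below uses hypothetical numbers (λ = 0.5 and a made-up tabular cost distribution on {0, 1, 2, 3}) and checks two standard properties: the resulting probabilities sum to 1 (up to tail truncation) and the mean satisfies E[S] = λ E[X].

```python
# Panjer (1981) recursion for a compound Poisson aggregate loss with a
# tabular severity pmf f on {0, 1, 2, ...} (hypothetical example values).
import math

def panjer_compound_poisson(lam, f, s_max):
    """Return g[0..s_max], the pmf of S = X_1 + ... + X_N with N ~ Poisson(lam)."""
    g = [0.0] * (s_max + 1)
    g[0] = math.exp(-(1 - f[0]) * lam)  # starting value
    for s in range(1, s_max + 1):
        g[s] = (lam / s) * sum(k * f[k] * g[s - k]
                               for k in range(1, min(s, len(f) - 1) + 1))
    return g

lam = 0.5
f = [0.0, 0.5, 0.3, 0.2]          # hypothetical per-occurrence cost distribution
g = panjer_compound_poisson(lam, f, 60)

mean_X = sum(k * p for k, p in enumerate(f))   # E[X] = 1.7
mean_S = sum(s * p for s, p in enumerate(g))
assert abs(sum(g) - 1.0) < 1e-9                # pmf sums to 1 (negligible tail)
assert abs(mean_S - lam * mean_X) < 1e-6       # E[S] = lambda * E[X]
```

An annual aggregate limit is then applied by capping S at the limit and recomputing expected payments from `g`.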

Model of downtime cost for an enterprise supply chain
As in Wang (2017), we consider a supply chain ecosystem of m firms, indexed by j = 1, 2, ..., m, interconnected through business relations (vendors, contractors, suppliers, service providers, etc.). The execution of a business process at firm j depends on available IT services from a set of other firms, i.e. an outage at any one of these vendors can potentially halt the business process of firm j. We attach a tag j to the notation to refer to a quantity that is specific to firm j. For firm j, we let N_j represent the random number of IT service breakdowns in a year, and T_j represent the random duration of breakdown given one occurrence. Assume that N_j has a Poisson distribution with mean λ_j, and let T_j have raw moments μ_j^(k). In a supply chain setting, IT service downtime at one firm can propagate through the supply chain and cause business disruption to another firm. Let the parameter θ_{i,j} ≤ 1 represent the propagation coefficient, i.e., the likelihood that a downtime of firm i's IT service will affect firm j's business operation. Note the directional propagation:

1) Outbound: Firm j exports downtime to another firm i with propagation coefficient θ_{j,i}
2) Inbound: Firm j imports downtime from another firm i with propagation coefficient θ_{i,j}

We get a matrix of propagation coefficients with diagonal values θ_{j,j} = 1, but not necessarily symmetric.

Remark 9 The propagation matrix may be relatively sparse, given that flowchart-like IT service architectures are accurate. However, it may be less sparse than expected, as is sometimes discovered when the consequences of outages reveal unexpected dependencies; see for example the Swedish Civil Contingencies Agency (MSB 2014). Böhme and Schwartz (2010) note that fully-connected graphs allow for modelling of network externalities.
When firm j allocates resources {K_j, L_j} to IT services maintenance, the corresponding number of breakdowns for firm j is a random variable N_j(K_j) with a Poisson mean λ_j(K_j), and the downtime for each breakdown occurrence, T_j(L_j), has a mean μ_j^(1)(L_j). Let K = (K_1, K_2, ..., K_m) and L = (L_1, L_2, ..., L_m) be the vectors of resource allocations by the m firms. For simplicity, we consider only 1-step propagation from the origin firm, and ignore further steps of propagation.

Independent sources of breakdown
The frequency of breakdown for a firm is the sum of direct breakdowns and indirect breakdowns, both of which follow Poisson arrivals:

1) direct breakdowns of firm j follow a Poisson arrival process with intensity λ_j(K_j)
2) indirect breakdowns imported from firm i follow an independent Poisson arrival process with intensity θ_{i,j} λ_i(K_i)

The combined frequency of breakdown due to direct and indirect causes follows a Poisson distribution with mean

λ_j(K_j) + Σ_{i≠j} θ_{i,j} λ_i(K_i)

Remark 10 Though the model does not allow for statistically dependent breakdowns, it can nevertheless capture operational dependencies, such as multiple firms being dependent on a single cloud service provider n, by setting the corresponding θ_{n,j} parameters close to unity for all j.
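The superposition of direct and thinned inbound Poisson streams is a one-line computation. A minimal sketch with hypothetical intensities and propagation coefficients:

```python
# Combined direct-plus-inbound breakdown intensity for firm j:
# lambda_j + sum over i != j of theta[i][j] * lambda_i (hypothetical numbers).
m = 3
lam = [0.5, 0.2, 1.0]                   # direct intensities lambda_i(K_i)
theta = [[1.0, 0.3, 0.0],
         [0.1, 1.0, 0.0],
         [0.9, 0.9, 1.0]]               # theta[i][j]: firm i exports downtime to firm j

def combined_intensity(j):
    """Poisson mean of firm j's direct plus imported breakdowns."""
    return lam[j] + sum(theta[i][j] * lam[i] for i in range(m) if i != j)

# Firm 0 imports from firm 1 (0.1 * 0.2) and firm 2 (0.9 * 1.0).
assert abs(combined_intensity(0) - (0.5 + 0.02 + 0.9)) < 1e-12
```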
For simplicity, we assume a linear model C_j(T_j) = c_j T_j for firm j's cost of downtime. Though not perfect, it can be considered a reasonable first approximation of a more advanced model.
The expected cost to firm j from all possible breakdowns in the supply chain is

c_j Σ_i θ_{i,j} λ_i(K_i) μ_i^(1)(L_i)    (12A)

Note the indices: the cost coefficient c_j belongs to firm j, whose cost we are assessing, but the numbers and durations of outages are summed over all the firms in the ecosystem.
Remark 11 As noted by Franke (2017), different insurance companies take different approaches to which parts of Eq. 12A they cover. One insurer offers to cover outages at all external service providers, with an increase in the premium of some 20-25% while the indemnity limit is cut in half. This corresponds to breaking (12A) into two terms, one internal and one external:

c_j λ_j μ_j^(1) + c_j Σ_{i≠j} θ_{i,j} λ_i μ_i^(1)    (12B)

The increase in premium and decrease of indemnity limit reflect the insurer's assessment of the relative magnitudes of the first (internal) and the second (external) terms in Eq. 12B. Alternatively, the same insurer offers coverage of outages at a specific list of some 3-5 named providers. This corresponds to breaking (12A) into two different terms:

c_j Σ_{i ∈ Covered} θ_{i,j} λ_i μ_i^(1) + c_j Σ_{i ∉ Covered} θ_{i,j} λ_i μ_i^(1)

The index j, of course, is in the Covered set. Other insurers investigated by Franke (2017) make no distinction between internal outages and outages at external service providers, i.e. they cover all of Eq. 12A. Indeed, one of the insurers interviewed remarks that most companies are actually better off from a business continuity perspective by trusting service providers like Microsoft, Google, or Amazon rather than trying to build equally reliable in-house IT services. A firm's own IT service breakdowns cause a direct business interruption cost, as well as a spillover effect (business interruptions at other firms) in a supply chain, with a total expected downtime cost:

c_j λ_j μ_j^(1) + Σ_{i≠j} θ_{j,i} c_i λ_j μ_j^(1)    (13B)

In Eq. 13B, the second term represents, equivalently, (i) the expected externality cost imposed on the others in the supply chain by the breakdown of firm j in the absence of liability, or (ii) the expected liability that has to be paid in the presence of such liability, or (iii) the actuarially fair cost of a third-party liability insurance which covers the other firms' costs of the downtime caused.
Proposition 4 Everything else equal, to minimize the total expected downtime cost (12A), the optimal investment K j for firm j in reducing the breakdown frequency exceeds the optimal investment for the same firm on a standalone basis.
Remark 12 However, firm j does not have any incentives to purchase insurance to cover third parties, unless it is required to do so by external pressure.
It is illuminating to rewrite (12B) in yet another form:

c_j λ_j μ_j^(1) Θ_j    (14A)

where the term

Θ_j = 1 + Σ_{i≠j} θ_{i,j} λ_i μ_i^(1) / (λ_j μ_j^(1))

is a supply chain dependency multiplier (analogous to the concept of multiplier used by Dynes et al. (2007)). In the special case that all firms in the supply chain have independent and identically distributed downtime frequencies N_j and durations T_j, Eq. 14A can be simplified to

Θ_j = 1 + Σ_{i≠j} θ_{i,j}

If firm j has no expected cost from outages in the rest of the supply chain, Θ_j = 1, i.e. there is no multiplier effect.
However, Θ j is typically greater than 1, and the expected cost to firm j is directly proportional to the multiplier Θ j. Assuming independent drivers for the origins of breakdowns among firms, and due to the propagation of downtime in a supply chain, the cost to all firms in the supply chain from all possible breakdowns by all firms is given by Eq. 15. Equation 15 can be used to calculate risk aggregation for an entire supply chain. An insurer may cover many firms in the same supply chain, and may purchase reinsurance protection for its own insurance portfolio as a way of diversifying concentration risk. Sometimes capital market solutions can be used to transfer the aggregated supply chain risk to investors.
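To make the multiplier and the aggregation in Eq. 15 concrete, the following sketch computes per-firm expected costs under the simplifying assumption (ours, not the paper's exact Eqs. 14A-15) that the spillover cost to firm j from firm i scales as θ i,j times firm i's own standalone expected cost; all numbers are invented for illustration.

```python
from typing import List, Tuple

def expected_costs(own_cost: List[float],
                   theta: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Per-firm total expected downtime cost and dependency multiplier Theta_j.

    own_cost[i] : firm i's standalone expected downtime cost
    theta[i][j] : propagation coefficient from source firm i to target firm j
    Illustrative assumption: spillover to j from i = theta[i][j] * own_cost[i].
    """
    n = len(own_cost)
    totals, multipliers = [], []
    for j in range(n):
        spill = sum(theta[i][j] * own_cost[i] for i in range(n) if i != j)
        totals.append(own_cost[j] + spill)
        multipliers.append(totals[j] / own_cost[j])  # Theta_j = 1 iff no spillover
    return totals, multipliers

# Hypothetical two-firm chain: firm 0 exports 30% of its outage cost to firm 1.
totals, thetas = expected_costs([100.0, 50.0], [[1.0, 0.3], [0.0, 1.0]])
ecosystem_total = sum(totals)  # supply chain aggregate, in the spirit of Eq. 15
```

Reducing θ 0,1 (e.g. through supply chain redundancy) lowers both Θ 1 and the ecosystem total, which is the lever discussed in the conclusions.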
Example 3 Consider four firms (j = 1, 2, 3, 4) which all depend upon one IT Service Provider, firm 5 (see Fig. 3). Assume that the propagation coefficients are θ 5,j = 0.8 but θ j,5 = 0, for j = 1, 2, 3, 4. The propagation coefficient matrix thus looks as follows:

1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0.8 0.8 0.8 0.8 1

Fig. 3 A simple supply chain of four firms and their common IT Service Provider

a) Effects on the four firms: For each of the four firms (j = 1, 2, 3, 4), according to Eq. 12A, the downtime cost experienced is a sum of one term that can be controlled by investments K j and L j and one term beyond the control of firm j. While each firm j can optimize the balance between K j and L j according to Proposition 1, it cannot do anything about the second term, except try to keep c j as small as possible.

b) Effects on the IT Service Provider: According to Eq. 12B, in the absence of third-party liability, the IT Service Provider carries only its first-party cost, the first term, whereas the second term represents externalities carried by the four firms. Thus, there is little incentive to reduce downtime as much as would be socially optimal. Even if the four firms (j = 1, 2, 3, 4) take out insurance policies covering outages at their provider, incentives will still be misaligned, and the actuarially fair premiums would cost them as much as their share of the expected outage cost. If, on the other hand, the IT Service Provider (firm 5) were liable for the downtime costs it causes the firms, it would have an incentive to increase its investment K 5 + L 5 to the socially optimal level.

c) Insurance on the ecosystem: It can be shown that the total loss to the ecosystem can be optimized by minimizing (15). An insurer is generally concerned about risk accumulation in its portfolios, as it may cover multiple firms in the same supply chain, and thus may be liable for much of the total downtime cost within a supply chain, including both first-party and third-party insurance. An insurer (or a group of insurers and reinsurers) that insures the four, or even five, firms (j = 1, 2, 3, 4, 5) thus has an incentive to control the total supply chain downtime cost (15) by encouraging investments at those insured customer firms where it is most efficient to do so. Wang (2019) proposed an integrated approach to information security investment and cyber insurance, with more focus on risk advisory services and partnership. Similar mechanisms could be proposed for managing downtime costs in a supply chain.
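The incentive wedge in Example 3 can be illustrated numerically under the same simplifying spillover assumption as above; the propagation matrix follows the example (θ 5,j = 0.8, θ j,5 = 0), while the standalone expected costs are invented.

```python
# Propagation matrix from Example 3 (rows: source firm, columns: target firm).
# Indices 0-3 are the four client firms; index 4 is the IT Service Provider.
theta = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.8, 0.8, 0.8, 0.8, 1.0],
]

own = [10.0, 10.0, 10.0, 10.0, 5.0]  # invented standalone expected costs

def total_cost(j: int) -> float:
    """Own cost plus spillover imported from the other firms (illustrative)."""
    return own[j] + sum(theta[i][j] * own[i] for i in range(5) if i != j)

client_costs = [total_cost(j) for j in range(4)]  # each imports 0.8 * 5 from firm 5
provider_cost = total_cost(4)                     # imports nothing: theta[j][4] = 0
externality = sum(theta[4][j] * own[4] for j in range(4))  # cost firm 5 imposes
```

Without liability, the provider carries only provider_cost while imposing externality on its four clients, which is the misalignment the example describes.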

Characterization of insurance policies of downtime cost
We note that the models introduced in the previous sections describe existing cyber insurance offerings on the market quite well. Such offerings can be broadly described as comprising first-party and third-party insurance coverage (e.g., Biener et al. 2015; Franke 2017). First-party insurance coverage for the direct costs of a firm's IT service outages is modeled in Example 1. Simple assumptions about the production of security give insight into the pricing of these products. The use of deductibles by insurers to discourage under-investment by the insured is readily understood in this framework, and can be incorporated into the model in a way that resembles call options. Based on the supply chain model, where each firm is both an importer and an exporter of downtime, the demand for first-party coverage of costs caused by other firms can be understood. If these firms were strictly liable for these costs, there would be no externalities. However, as noted by Franke (2017, 2018), many large IT service providers instead have standard SLAs that severely limit their liability. Cyber insurance (covering first-party downtime loss) can then be used to cover the residual gap between the full loss of the firm (client) and whatever liability the service provider accepts.
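The call option resemblance mentioned above can be made explicit: with deductible d and indemnity limit ℓ, the insurer's payout on a realized loss X is min(max(X − d, 0), ℓ), i.e. a capped call payoff on the loss. A minimal sketch, with purely illustrative parameter values:

```python
def indemnity(loss: float, deductible: float, limit: float) -> float:
    """Insurer payout: a call-option-like payoff on the realized loss,
    floored at zero by the deductible and capped by the indemnity limit."""
    return min(max(loss - deductible, 0.0), limit)

# Illustrative policy: 50k deductible, 1M indemnity limit.
payouts = [indemnity(x, 50_000.0, 1_000_000.0)
           for x in (20_000.0, 300_000.0, 2_000_000.0)]
# The insured retains the deductible and any loss above deductible + limit,
# which preserves an incentive to keep investing in availability.
```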
Third-party liability coverage can also be understood from the supply chain model. Given some liability (full or partial), a firm has an interest in insuring itself against the claims of other firms affected by outages originating at the first firm. Since these other firms can be much larger in terms of turnover and profit than the origin firm, and their claims correspondingly larger, such insurance can be a prudent measure. Again, insurers have an incentive to adjust the premiums offered based on the security investments of the insured. Furthermore, the insurer has an incentive to reward investments that are in line with the loss functions of the potentially affected third parties: if most third parties have large snowball effects, the insurer would encourage investments that shorten the duration of outages. Additionally, insurers might encourage customers to engage in SLA negotiations with others in the supply chain in order to learn more about their preferences, particularly their downtime cost functions.
Risk Aggregation Across the Supply Chain: As noted in conjunction with Example 3, an insurer is generally concerned about risk accumulation in its portfolios, and could thus be instrumental in aligning incentives and securing investment where it is most needed. Wang (2019) proposed an integrated approach to information security investment and cyber insurance; based on the insights from the model, an improved cyber insurance design is proposed, with more focus on risk advisory services and partnership. Franke and Draeger (2019) proposed a similar scheme, in which insurers could act to enable collective action on behalf of their insureds, facilitating collective funding of additional IT service incident managers in the face of limited capacity.

Conclusions and outlook
This paper presents a probability model (Poisson arrival frequency with lognormal downtime duration) for an enterprise, and a propagation model for the supply chain, building on literature from multiple disciplines including IT service, supply chain, actuarial science, insurance, and economics of information security. We used the model to study interactions of a firm's security investment (resource allocation) and the externality of spillover effects, as well as implications for first-party versus third-party insurance.

Single-firm case
We show how total enterprise resources in the single-firm case can be optimally allocated depending on the relative effectiveness of reducing (i) the frequency of outages and (ii) the duration of outages. When the cost of downtime increases more than linearly with time, it is optimal to allocate more resources to reducing the duration of outages. Conversely, when the cost of downtime increases slowly, or not at all, with time, it is optimal to allocate more resources to reducing the frequency of outages. Though the results are theoretical, they still offer useful rules of thumb for management practice.
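This rule of thumb can be checked with a small numerical sketch. The functional forms below (λ(K) = λ0/(1 + K), log-duration falling linearly in the investment L, downtime cost c(T) = a + b·T^γ) are illustrative assumptions of ours, not the paper's specification; the lognormal moment formula E[T^γ] = exp(γμ + γ²σ²/2) does the work.

```python
import math

def expected_annual_cost(K: float, L: float, gamma: float,
                         lam0: float = 10.0, a: float = 1.0, b: float = 1.0,
                         mu0: float = 0.0, sigma: float = 1.0) -> float:
    """E[annual cost] = lambda(K) * E[a + b * T**gamma] with T lognormal.

    Illustrative (assumed) functional forms:
      lambda(K) = lam0 / (1 + K)           -- frequency falls with investment K
      log T ~ Normal(mu0 - L, sigma**2)    -- duration falls with investment L
    """
    lam = lam0 / (1.0 + K)
    mu = mu0 - L
    e_t_gamma = math.exp(gamma * mu + gamma ** 2 * sigma ** 2 / 2.0)
    return lam * (a + b * e_t_gamma)

budget = 2.0
# Superlinear downtime cost (gamma = 2): spending on duration reduction wins.
conv_K = expected_annual_cost(K=budget, L=0.0, gamma=2.0)
conv_L = expected_annual_cost(K=0.0, L=budget, gamma=2.0)
# Flat downtime cost (gamma = 0): spending on frequency reduction wins.
flat_K = expected_annual_cost(K=budget, L=0.0, gamma=0.0)
flat_L = expected_annual_cost(K=0.0, L=budget, gamma=0.0)
```

With these assumed forms, putting the whole budget into L beats putting it into K when γ = 2, and the ranking reverses when γ = 0, matching the rule of thumb above.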

Interconnected firms case
The supply chain model with several connected firms offers interesting conceptual insights. We highlight three.

First, the supply chain dependency multiplier Θ j in Eqs. 14A and 14B is a measure of how much the interdependence in the supply chain network increases the downtime costs of a firm over and above those caused by internal factors. Conceptually, it points to a way of reducing the expected cost to firm j by reducing the correlation parameters θ i,j (including bringing some down to 0 by cutting the dependency entirely). In a practical setting, one way of reducing dependency is to build a kind of "supply chain redundancy" across firms. This is non-trivial, and quite different from redundancy in a firm's own internal operations. For example, while everyone has redundant hard drives in data warehouses, and it might be feasible to reroute a credit card transaction to a redundant payment system, it is significantly more cumbersome to maintain dual independent but fully synchronized inventory systems that keep track of physical products in a storage area somewhere at an off-shore location. Furthermore, in some common cases, such as IT services that depend on Google accounts for authentication of users, it is virtually impossible to maintain a redundant alternative authentication system. However, this is not necessarily always the case. Fostering more redundancy at the supply chain, or ecosystem, level will require collective investment.

Second, though the correlation parameters θ i,j are easy to define, in practice it can be difficult to attribute causes of downtime as it propagates through the supply chain. This is true for several reasons. First of all, in the case of antagonistically caused downtime, all the usual difficulties in attribution apply (see e.g. Nicholson et al. (2013), Rid and Buchanan (2015) for different perspectives on this). However, there is a second complication specific to downtime: since an outage is not a discrete event, its actual duration can have multiple causes. One root cause (e.g. a cooling system breaking down) might have brought the IT service down in the first place (as servers in one place overheat), but another cause (misconfiguration in the alternative site) might prevent it from coming back online, thus prolonging the outage. Without going into more details, we note that, conceptually, the difficulty of attribution might make it easier to offer more extensive first-party insurance (which does not require attribution) than third-party liability insurance (which does require attribution). As such, this insight helps us understand why existing cyber insurance policies are a mix of first- and third-party coverage. Different combinations will be necessary in different situations and will have to be customized to suit different industries and firms.
Third, the parameters θ i,j are hard to estimate statistically before a major IT service breakdown incident. Firms may not even realize that there are dependencies until it is too late (see for example the Swedish Civil Contingencies Agency, MSB 2014). Thus, similar to penetration tests countering antagonistic cyber security threats, there is a need for more "fire drill" type testing, to document consequences and learn how to increase resilience. This, however, is difficult for a number of reasons, the most obvious being that systems in production cannot easily be used for testing purposes, while development systems designed for testing typically lack precisely the kind of third-party dependencies such exercises would seek to discover and mitigate. Nevertheless, the consequences of not conducting such drills beforehand might be even costlier.

Managerial implications
IT service downtime is a problem faced by all enterprises today. With increasing dependency on IT services comes increasing sensitivity to outages. However, not all firms are IT service dependent in the same way, so it is important for managers to find the measures most appropriate for their operations.
For any firm allocating finite resources to managing IT downtime risk, it is important to look at the overall picture, by evaluating the relative effectiveness of reducing the frequency of downtime occurrence (λ(K)) or the downtime duration (μ(1)(L)), as well as how the business cost of downtime increases with duration (c(T)). Even though it can be difficult in practice to determine exact numbers, the exercise of trying to do so can be rewarding. In particular, it is valuable to estimate whether downtime cost increases more than linearly with the duration of downtime, in which case efforts should aim to minimize duration, or whether downtime cost is mostly fixed response costs, in which case efforts should aim to minimize the frequency of occurrence.
However, in an increasingly interconnected world, a firm not only needs to look at its own IT systems, but also at the propagation of downtime through its complex supply chain. As noted above, it can be challenging to map out dependencies and estimate the correlation parameters θ i,j. Nevertheless, there might also be some low-hanging fruit. One recent case study indicates that some firms building novel 5G-enabled services would benefit from formalizing their service level agreements with each other, thus enabling more mature risk sharing in the ecosystem (Olsson and Franke 2019). Another recent study, albeit a small one, indicates that although companies are willing to share IT vulnerability information with each other, proactive sharing of vulnerability information is relatively rare, and customers do not require such information from their providers (Olsson et al. 2019). Such passivity about information sharing stands in stark contrast to the recommendations about active disclosure and sharing found in contemporary cyber security guidelines. As these examples suggest, there are relatively straightforward actions that could be taken to mitigate IT service supply-chain risk.
For insurers, the results highlight both the need for active management of portfolio accumulation risk and the opportunity to re-innovate business models. Starting with accumulation risk, insurers should work to improve current practices for mapping dependencies among their insureds. Whereas much underwriting today is based on (self-assessment) forms being filled out and interviews conducted (Woods et al. 2017), it is fully conceivable for insurers today to proactively and continuously scan their customers to produce up-to-date situational awareness of their vulnerabilities and inter-dependencies. Though this comes at a cost, the cost-benefit analysis is worth conducting. This leads to the business opportunity: insurers are in a position where they could take the lead in not only pricing cyber risk, but also more proactively working to reduce it. Possible examples already mentioned include facilitating collective action on behalf of customers, thus enabling beneficial security investments that would not otherwise be realized (Wang 2019; Franke and Draeger 2019). However, the future role of insurance in cyber security governance is by no means clear, and it is up to insurance managers to show leadership in finding novel ways to reduce systemic risks (Woods and Moore 2019).

Future work
The results also suggest some avenues for future work. First, it would be interesting to further explore the impact of the complexity and structure of the supply chain. While, in general, the sum of independent Poisson processes (from various nodes in a supply chain network) is another Poisson process, if the Poisson processes are correlated, the combined effect is more complicated.
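The superposition property for the independent case is easy to verify by simulation; the per-firm rates below are invented, and the check is simply that the combined count keeps the Poisson index of dispersion (variance/mean) close to 1.

```python
import math
import random
import statistics

def poisson_sample(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method for drawing a Poisson(lam) variate."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(42)
rates = [0.5, 1.0, 1.5]  # invented per-firm outage rates
draws = [sum(poisson_sample(lam, rng) for lam in rates) for _ in range(20_000)]

mean = statistics.fmean(draws)                   # close to sum(rates) = 3.0
dispersion = statistics.pvariance(draws) / mean  # close to 1 for a Poisson count
```

Introducing correlation between the per-firm processes (e.g. a common shock hitting several firms at once) breaks this dispersion-of-one property, which is exactly the complication noted above.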
Another avenue of future research is more empirical study of the frequency and duration of IT service breakdowns (testing the Poisson frequency model and the lognormal model for downtime duration), as well as empirical tests of the effectiveness of security spending in reducing the frequency and duration of IT service breakdowns. Such future work would require well-designed field studies and the gathering of survey data.
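As a starting point for such tests, the lognormal duration model can be fitted by maximum likelihood on the log scale, since the MLE is just the mean and standard deviation of the log-durations. A sketch on synthetic data (real studies would add goodness-of-fit tests, e.g. against heavy-tailed alternatives):

```python
import math
import random
import statistics

def fit_lognormal(durations):
    """MLE of lognormal parameters: mean and (population) std of log-durations."""
    logs = [math.log(t) for t in durations]
    return statistics.fmean(logs), statistics.pstdev(logs)

rng = random.Random(7)
true_mu, true_sigma = 1.2, 0.8  # invented "true" parameters for synthetic data
sample = [rng.lognormvariate(true_mu, true_sigma) for _ in range(5_000)]
mu_hat, sigma_hat = fit_lognormal(sample)  # should recover the true parameters
```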

Fig. 1 Optimization of Production of Resource Allocation

Fig. 2 Calculated Insurance Premium under various resource allocations