
9.1 Introduction

Globalisation and fast-paced technological innovation have driven worldwide digitalisation over the last 20 years. The impact of the internet and cyberspace on modern societies is so vast that some commentators argue that we have entered a new era, a “cyber age”, characterised by rapid change [1]. In this development, cyberspace has become an enabler of economic and social prosperity, but also of vulnerability and transnational security concerns. An especially serious concern has centred on the increasing connectivity and interconnectedness of critical infrastructures. These worries have been amplified by high-profile cyber-incidents and crises, including the NotPetya attacks in 2017, which affected organisations in over 60 countries including the global transport and logistics firm Maersk, the pharmaceutical giant Merck and the National Bank of Ukraine; the WannaCry attack on the UK National Health Service in 2017; and the ransomware attack on an American oil pipeline system in 2021.

Meanwhile, research contributions on the phenomenon of large-scale cyber-crises have been surprisingly thin and scattered within the field of crisis management research and beyond, reflecting the state of macro-level cyber-security research in general. Despite an increasing cyber-security interest within research fields dealing with national and international security affairs, attempts to theorise this essentially complex and multidisciplinary issue have been fragmented. As noted by Green [2], this fragmentation is problematic because of the “blind spots” of each perspective. “Thus lawyers have little idea of the technology that they are trying to regulate, strategists do not pay enough heed to the wider ethical and legal implications of acts of interstate cyber-aggression; and computer scientists delineate the intricacies of the technology with little focus on its political and strategic implications” [2, p. 3]. In other words, there is a shortage of comprehensive (although necessarily less detailed) scholarly perspectives on cyber-security issues in general, and cyber-crises in particular.

Although still relatively slim, the current body of literature focusing on cyber-security at a macro-level has illuminated some important aspects of the consequence dynamics of cyber-crises. One is that cyber-crises (as cyber-security issues in general) tend to blur important dichotomies, including internal/external, technical/strategic and civilian/military, making them difficult to analyse, frame and conceptualise [3, 4]. This has resulted in, for example, early national cyber-crisis strategies that differed significantly from each other, despite facing largely the same challenges [5].

Another aspect identified in the literature is that the consequences and response efforts of cyber-crises tend to be characterised by transboundary-ness, to the extent that cyber-crises can be conceptualised specifically as transboundary crises [6, 7]. This manifests, for example, in moments of rapid escalation following periods of slow development, and in a quick increase in the number of involved and affected actors [8]. These consequence dynamics tend to put substantial stress on national crisis management structures that were not built with transboundary crises in mind [9, 10]. Previous cases have highlighted that tasks that are demanding in any crisis (especially one with transboundary features), such as joint sense-making, coordination of response efforts and effective crisis communication, become even more challenging in cyber-crises, not least because of the added complexity of a crisis that is technical in nature but societal in its consequences [6, 7].

However, less attention has so far been placed on understanding the underlying conditions that allow this transboundary dynamic to unfold. This chapter begins to address this research gap, drawing upon the classic theoretical perspectives of Normal Accidents (Perrow) and High Reliability organisations [11, 12].

9.2 Normal Accidents and High Reliability Organisations

Since the publication of Charles Perrow’s book Normal Accidents in 1984, the idea that tightly coupled complex systems will inevitably cause accidents (making them “normal”) has been highly influential across many academic disciplines, especially those concerned with technology, risk, safety and crisis.

The idea of Normal Accidents (NA) is based upon the combination of two system conditions: interactive complexity and tight coupling. System in this sense is loosely defined and can refer to a computer system as well as an organisational system. Interactive complexity begins with a system made up of a large number of different components. These components can be technical parts (software or hardware, for instance), but they can also be procedures or human operators. Within this setting, failures among system components can interact in unexpected ways. Due to the complexity of the system and all its components, few, if any (the system’s designers included), can predict the many ways that failures in different components can interact, or the consequences of these interactive failures [13]. In itself, interactive complexity is not a major problem, unless it is combined with what Perrow refers to as “tight coupling”. If a system is complex but not tightly coupled, then even when failures interact in unexpected ways there is enough “slack” in the system to buy time to find alternative ways of operating (where such alternatives exist). When a system is both complex and tightly coupled, the interactive failures cannot be isolated from each other and there is no alternative way of operating. This means that disturbances within the system will spread quickly and “cascade” [14].
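Perrow’s argument is verbal rather than formal, but the interaction of the two conditions can be illustrated with a toy simulation. The sketch below (in Python, with entirely hypothetical parameters, not drawn from any of this chapter’s cases) propagates a single random failure through a randomly wired set of components and shows how far the cascade reaches depending on the degree of coupling:

```python
import random

def simulate(n_components=50, coupling=0.9, links_per_component=4, seed=1):
    """Toy model of NA dynamics: each component is randomly wired to a few
    others (interactive complexity); 'coupling' is the probability that a
    failure propagates along a link before it can be isolated (tight vs.
    loose coupling). Returns the share of the system that eventually fails
    after one random initial failure."""
    rng = random.Random(seed)
    links = {i: rng.sample(range(n_components), links_per_component)
             for i in range(n_components)}
    failed = {rng.randrange(n_components)}      # a single initial failure
    frontier = list(failed)
    while frontier:
        current = frontier.pop()
        for neighbour in links[current]:
            if neighbour not in failed and rng.random() < coupling:
                failed.add(neighbour)           # the failure cascades along the link
                frontier.append(neighbour)
    return len(failed) / n_components

# Tightly coupled: the same disturbance cascades through most of the system.
print(simulate(coupling=0.9))
# Loosely coupled: enough slack for most interactions to be absorbed.
print(simulate(coupling=0.1))
```

Only the coupling parameter differs between the two runs; the random wiring (the complexity) is held constant, mirroring the point that it is the combination of the two conditions, rather than either on its own, that produces system-wide accidents.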

When the integral system characteristics of interactive complexity and tight coupling are present, accidents will inevitably (although perhaps rarely) happen due to multiple and unexpected interactions of failures. According to Perrow, neither new technological solutions nor better organisation can totally undo this dynamic, since the added complexity (whether organisational or technological) of these “fixes” will itself become part of the possible interactive failures within the system [14]. Decentralisation is required to deal with unexpected interactions in tightly coupled and complex systems. The problem is that systems cannot be decentralised and tightly coupled at the same time, and there are strong economic incentives to keep and extend the tight coupling [14, 15].

While the perspective of NA drew attention and gained popularity both within and outside academia in the 1990s, critical reactions and perspectives also emerged. One of the most prominent stemmed from Todd La Porte and a group of Berkeley researchers. These scholars highlighted the fact that some organisations experience virtually no accidents despite the presence of interactive complexity and tight coupling (referred to as High Reliability organisations, or HROs), thus challenging the idea that organisational “fixes” cannot prevent accidents. A common finding of this research has been that HROs allow flexible and decentralised decision-making, face strong external expectations of failure-free operations and invest heavily in reliability improvement, including redundancy and training. The cost of failure is high in these organisations [15]. Bierly et al. [16] argue that HROs in general share two main characteristics, besides interactive complexity and tight coupling, that set them apart from other organisations: catastrophic potential (which increases scrutiny and expectations of accident-free operations) and accountability (linked to clear areas of responsibility, control and expectations of performance).

In more recent applications of NA, two main trends can be identified. The first is the notion that, despite some differences, the perspectives of NA and HRO can largely be viewed and used as complementary in understanding the complex dynamics of accidents and safety in high-risk systems [15, 17]. The second departs from the observation that the world has become ever more interconnected and complex, with global, multiorganisational, large-scale systems that are managed by a plethora of private and public actors. Contributions within this trend therefore aim to extend the classical theoretical arguments of both NA and HRO beyond technological systems and organisations to the macro-level, or the level of “organisation of organisations”. This chapter draws on both trends.

The main argument of this chapter is that the consequence dynamics of large-scale cyber-crises, characterised by their transboundary nature, can be explained by NA dynamics in several layers of the sociotechnical systems that comprise modern critical infrastructure operation. Through five interviews with senior experts on cyber-security and critical infrastructure and two case examples (the incident involving the Ukrainian power grid in 2015 and the Kaseya incident in 2021), this chapter explores how these dynamics can manifest at several layers of critical infrastructure operation. In doing so, it aims to contribute to bridging the gap between the understanding of the consequence dynamics of cyber-crises and the structural conditions in the sociotechnical systems of critical infrastructure that allow them to unfold and cascade as observed in previous research [3, 6, 7]. The interviewees were senior experts from Sweden, the UK and the USA, all with a background in national-level cyber-security and critical infrastructure protection. In addition to the interview data, this chapter also draws on media reports and official reports.

The analysis in the following empirical sections will be loosely structured around four analytical categories, or layers, of NA application, identified by Le Coze [17]: 1. Technology, 2. Cognition, 3. Organisation and 4. Macro.

9.3 Analysis

9.3.1 Technology

The first category of NA application, as relevant for critical infrastructure operations, is technology. Perrow recognised rather early that NA applies to the internet, which is essentially composed of technical systems (both hardware and software) with interdependent components in interactive complexity [see, for example, 18]. However, the digitalisation of society and the development of industrial control systems (ICS) in critical infrastructure have come a long way since then. Today, an analysis of the NA dynamic applied to critical infrastructure calls for a focus on the problem of legacy code, systems and hardware.

A legacy system can be defined as “An information system that may be based on outdated technologies but is critical to day-to-day operations. As enterprises upgrade or change their technologies, they must ensure compatibility with old systems and data formats that are still in use” [19].

In modern critical infrastructure, it is not uncommon for legacy systems and code more than 15 years old to underpin operations. Some legacy code is written in old and outdated programming languages, like COBOL, which relatively few programmers know today. Moreover, many legacy systems were built without security in mind, making security an “add-on” aspect, or afterthought. When the systems underpinning ICS in critical infrastructure are built on layer upon layer of legacy code and systems, written in a variety of languages and held together by numerous add-ons for compatibility, interactive complexity is continuously built into the system as a whole. Thus, components with the potential to fail, and to interact with other failing components in unexpected ways, are continuously added.
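How such compatibility “add-ons” accumulate can be illustrated with a small, purely hypothetical adapter (the record layout, field names and system names below are assumptions for illustration, not taken from any real ICS or from the interviews): it translates a fixed-width, COBOL-style record into the structure a newer monitoring component expects.

```python
from dataclasses import dataclass

@dataclass
class MeterReading:
    """Structure expected by a newer monitoring component (hypothetical)."""
    station_id: str
    kwh: float

def parse_legacy_record(line: str) -> MeterReading:
    """Adapter for a fixed-width record of the kind a decades-old COBOL
    batch job might emit. The layout (columns 0-7: station id,
    columns 8-17: kWh with an implied two-decimal point) is an assumption
    made purely for illustration."""
    station_id = line[0:8].strip()
    raw_kwh = line[8:18].strip()
    kwh = int(raw_kwh) / 100        # implied decimal: '0000012345' -> 123.45 kWh
    return MeterReading(station_id=station_id, kwh=kwh)

# Each shim like this keeps an old system usable, but its silent assumptions
# (field widths, encodings, sign conventions) are one more component that can
# fail in interaction with the layers above and below it.
print(parse_legacy_record("STN00042" + "0000012345"))
```

Every such layer solves an immediate compatibility problem while adding one more place where an unnoticed assumption can interact with failures elsewhere in the stack.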

Getting rid of legacy code, systems and hardware is often exceedingly expensive, which is one of the reasons why many of them operate well beyond their intended lifetime. As one of my interviewees explained: “There is lots of legacy code and legacy hardware out there. I’ve been to places where they can’t find a vendor to replace the hardware any more, and places that buy things from online auction sites, because that’s cheaper than to upgrade to something modern, and they just don’t have the funding” (Interviewee 1).

NA are expected when interactive complexity is paired with tight coupling: an inability to isolate subsystems and interdependent systems from each other, or to stop them. This means that failures will cascade until a major part of the system, or all of it, fails [14]. Established measures to decrease the degree of tight coupling in the ICS of critical infrastructure, such as creating an “air gap” between the system and the internet, are becoming harder to sustain as the demand for digital transformation and efficiency in industrial organisations increases. Instead, modern ICS networks may be connected both to third parties and to the wider organisation [20].

In the words of one expert commentator: “Legacy systems are often maintained only to ensure function, and their operations are often digitized with upgraded Internet of Things (IoT) functionality for the sole purpose of operability. OT maintenance may fail to consider the IT and cybersecurity perspective, seeking to make changes to improve systems without questioning if those systems remain secure. While these legacy systems may seem helpful after years of use, networked systems’ prolonged exposure to these legacy devices proves time and time again the familiar adage: What can go wrong will go wrong” [21].

Moreover, as one interviewee highlighted, legacy technologies might be maintained because the processes they underpin are too important to risk being disrupted even for a short amount of time: “Many organisations are afraid of swapping out legacy technologies for new ones because they are afraid that it will disrupt the production process or cause it to fail” (Interviewee 4).

9.3.2 Cognition

A key problem for the cyber-security of critical infrastructure operations is that the danger created by NA characteristics in large-scale systems is not always easily perceived. This difficulty arises partly because the interacting components are not only technical but also organisational and human, making the complex interactions and interdependencies between components and subsystems more difficult to understand and estimate. In the words of Grabowski and Roberts: “In general, large complex systems are difficult to comprehend as a whole. Therefore, the tendency is to decompose or factor them into smaller subsystems, which can lead to the development of a large number of subsystem interfaces” [22]. One of the interviewees of this study highlighted the difficulty of getting a comprehensive picture of all the subsystems involved in critical infrastructure operations: “many operators and roles are very specialized now. You only understand one small part of the system and worry about that. However, all the parts must be included and compared to achieve a common model and understanding. Some parts may affect or even disturb other parts in unexpected ways. You must build your operation on a comprehensive analysis including supply chain dependencies. But this is currently lacking when it comes to critical infrastructure” (Interviewee 2).

According to the findings of HRO research, commitment to reliability is a key feature of managing the danger posed by the combination of interactive complexity and tight coupling, and this commitment is connected to a common understanding of the potential for catastrophic consequences [11]. As one interviewee argued, this commitment to reliability does not appear to be widespread when it comes to critical infrastructure operations and cyber: “It seems that when it comes to cybersecurity and digitalization of critical societal functionalities, we have not learned much from the high-risk industries. In those industries, we are happy to let security cost whatever is necessary. This is not the case with cybersecurity yet, despite that healthcare services (for example) could be disrupted nationwide due to a zero-day vulnerability. We are not yet at the point where we allow digitalization to be expensive due to security concerns” (Interviewee 3).

The difficulty of grasping cyber-related vulnerability in critical infrastructure is compounded by the fact that some systems (and system components) are critical and high risk, while many are not (and thus would not require strict security and safety measures in accordance with HRO). Due to interactive complexity, however, components that appear unimportant can contribute to the failure of a system that is critical, making the distinction between critical and non-critical harder to draw.
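This point can be made concrete with a simple dependency-graph check. The sketch below uses an entirely invented dependency map (the component names are illustrative assumptions, not taken from the interviews or cases): it computes everything a critical function transitively depends on, showing how a nominally peripheral component can sit inside a critical dependency chain.

```python
from collections import deque

# Hypothetical dependency map for a utility ("A depends on B" is written
# A: [B, ...]); components and links are invented purely for illustration.
depends_on = {
    "pump_controller":   ["plc_gateway"],
    "plc_gateway":       ["scada_historian", "ntp_server"],
    "scada_historian":   ["office_file_share"],
    "office_file_share": [],
    "ntp_server":        [],
    "billing_portal":    ["office_file_share"],
}
critical = {"pump_controller"}   # the function that must not fail

def upstream_of(component: str) -> set:
    """Everything the given component transitively depends on (BFS)."""
    seen, queue = set(), deque(depends_on.get(component, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(depends_on.get(dep, []))
    return seen

for c in critical:
    print(c, "depends transitively on:", sorted(upstream_of(c)))
# The nominally non-critical office file share turns out to be part of the
# chain that keeps the pump controller running.
```

In this invented example, an office file share that would never be classified as critical on its own nevertheless belongs to the dependency chain of a critical process, which is precisely what makes the critical/non-critical boundary so hard to draw in advance.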

9.3.3 Organisation

A common finding of both classical HRO and NA research has been to point to the importance of reducing tight coupling in systems through “organisational slack”. This can be done by achieving structural flexibility and redundancy, which involves duplication or multiple and independent ways of operating, communicating and making decisions [11].

The 2015 cyber-attack on the Ukrainian power grid is an example of how centralisation can make accidents more consequential, but also of how organisational redundancy can mitigate those consequences. In December 2015, one of Ukraine’s power grid providers was taken down by a cyber-attack, leaving 230,000 customers across various areas without electricity for several hours. Through a sophisticated, long-term attack campaign earlier that year, including spear-phishing, hackers had succeeded in taking total remote control of the ICS of at least three energy distribution companies. While disrupting power by remotely switching breakers, the attackers also disabled backup power supplies to all but one distribution centre in order to hinder operators from giving or receiving information about the evolving situation. Finally, they launched a distributed denial-of-service (DDoS) attack on the customer service centre.

Despite the sophistication and novelty of the attack, the operators were able to restore service within 3–6 hours by moving operations over to manual control [23]. A later report highlighted the possibility of reverting to manual operation as an important mitigation mechanism, while noting that utilities more reliant on automation might not be able to do so [24]. An interviewee echoed this: “The reason this attack was not more disruptive was that there were parts of the grid which was not digitalized, which means it was possible to move to manual operating mode” (Interviewee 5).

Applying the NA perspective, interactive complexity allowed the hackers to gain total access to, and control over, the ICS of the energy grid. They identified, attacked and exploited many individual technical and human components of the system in unexpected ways, and once they were in, they were able to “cascade” their access due to the tight integration of the system [25]. However, this case also illustrates how redundancy and organisational slack can limit the tight coupling of a system and thus the full effects of the NA dynamics. Because the grid could be operated manually, the tight coupling of the system was reduced, and operational capacity was restored before the situation could develop into a serious, long-lasting energy crisis.

9.3.4 Macro

The interdependent linkages and interactive complexity that surround critical infrastructure services at the macro-level involve a complex ecology of supply-chain actors and other critical infrastructure sectors. These interdependencies are difficult to analyse, which affects the ability to detect NA characteristics and the possible cascade paths of disruptions, and makes it difficult to implement coherent regulation across the transnational structures of critical infrastructure. As noted by Grabowski and Roberts: “In large-scale systems, subsystems are often characterized paradoxically by both autonomy and interdependence. At one level, subsystems exist and operate independently of other systems, resources, and interference, and they are often responsible for their own survival, success, and growth. Thus, they appear to be rather autonomous entities, requiring little coordination. At the same time, subsystems are also interdependent” [22].

This dynamic can be exemplified by the tendency of critical infrastructure services to depend on supply-chain actors for upholding their digital systems (including legacy systems). As one interviewee explained: “The overall fundamental security of the system will be dependent on the particular company running the legacy technology to be updating its software to be compatible with the latest operating systems from Microsoft and other companies. For example, if the technology only runs off Windows XP and cannot be upgraded to Windows 10, because the company that created it does not support that, or no longer exists, then you have a fundamentally vulnerable system” (Interviewee 4).

The combination of centralisation, interactive complexity and tight coupling in the systems of the organisations that underpin the functionality of critical services (through, for example, the reliance on supply-chain actors), as well as in the technological systems of critical services themselves, is continuously exploited by cyber-threat actors who use this dynamic to launch ransomware and supply-chain attacks. Interactive complexity creates the possibility for a cyber-exploit to spread quickly and unexpectedly, while centralisation in combination with tight coupling enhances the impact of the attack, putting the victim under more pressure to pay the ransom (especially if the victim provides a critical societal service such as energy, water or food distribution).

An example of this can be found in the REvil (also known as Sodinokibi) ransomware attack of early July 2021, which affected hundreds of businesses worldwide, including the grocery chain COOP in Sweden. The attackers used a vulnerability in Kaseya’s VSA software to bypass security measures and distribute ransomware to Kaseya’s customers [26]. Because IT-infrastructure service delivery was centralised, the attackers could focus on targeting this one business to reach connected businesses further down the line. In the case of COOP, there was no alternative way of operating without access to its digital payment system (no organisational slack or redundancy), and it simply had to close most of its 800 grocery stores in Sweden until the problem was solved [27].

In other words, REvil used the conditions of centralisation of service providers (making it possible to spread malware to many businesses by exploiting a single supply-chain actor), interactive complexity (using different interacting components to achieve unexpected consequences) and tight coupling (leveraging the victims’ dependency on the ransomed digital systems in order to force them to pay) to achieve its goals.
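A minimal sketch of this dynamic (with invented organisation names and numbers, not figures from the Kaseya/REvil case) shows how the conditions combine: one upstream compromise reaches every dependent organisation, and those without an alternative way of operating are forced to halt.

```python
# One upstream compromise (centralisation) reaches every dependent
# organisation; names and numbers below are invented for illustration.
customers_of = {
    "it_management_vendor": ["grocery_chain", "pharmacy_chain", "dental_clinics"],
}
sites = {"grocery_chain": 500, "pharmacy_chain": 150, "dental_clinics": 30}
manual_fallback = {"grocery_chain": False, "pharmacy_chain": True, "dental_clinics": False}

victims = customers_of["it_management_vendor"]              # one exploit, many victims
halted = [v for v in victims if not manual_fallback[v]]     # tight coupling: no alternative way to operate
print(f"{len(victims)} organisations hit; "
      f"{sum(sites[v] for v in halted)} sites closed for lack of a fallback")
```

The leverage of the attacker grows with both the fan-out of the central provider and the absence of organisational slack at the victims, which is exactly the combination the ransom demand exploits.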

9.4 Conclusion

This chapter aimed to contribute to the understanding of the transboundary characteristics of large-scale cyber-crises by suggesting that they can be explained by the existence of NA dynamics (the combination of interactive complexity and tight coupling) in several layers of the sociotechnical systems that support modern critical infrastructure operations. With support from the insights of classical NA and HRO theory, it explored the application of these arguments at four layers. At the technical layer, the chapter highlighted the problem of interactive complexity and tight coupling in the sociotechnical systems underpinning critical infrastructure, especially through legacy code, systems and hardware. At the cognitive layer, it pointed to the difficulty of clearly perceiving the danger stemming from NA dynamics in large-scale systems. At the organisational layer, it highlighted the 2015 cyber-attack on the Ukrainian power grid as an example of how centralisation can make accidents more consequential, but also of the mitigating effect of operational redundancy measures that reduce tight coupling. Finally, at the macro-layer, the chapter used the case of the REvil attacks to discuss how macro-level NA dynamics, including the reliance on supply-chain actors, can be exploited by cyber-threat actors and create cascading consequences and transboundary cyber-crises.