6.1 Introduction

Over the last decade, the concept of resilience has developed substantially [1]. The literature (e.g., [2,3,4,5]) offers diverse definitions, so that there is no universal understanding of the construct [4], which in turn hampers its further operationalisation [6, p. 2713]. Consequently, work is still required to make the notion comprehensible and usable for the relevant stakeholders [7].

To this end, this chapter focuses on the operational dimension of resilience. Its scope is threefold: first, to associate resilience with a system's levels of service; second, to investigate how this could be implemented and used by relevant stakeholders in their daily operations; and, third, to investigate the relationship and contribution of humans to resilience, considering that, in order to cope with real-world complexity, individuals as well as organisations constantly adjust their performance to the current conditions.

6.2 Resilience as System Behaviour and Service Levels

Resilience is described as the operational behaviour of a system subsequent to an endogenous or exogenous shock event [8], and it is associated with four response behaviours. The first response behaviour, namely robust, illustrates a system that can fully recover after a shock event. The second and third response behaviours, i.e., ductile and collapsing respectively, refer to a system that can either recover its basic and critical functions or collapse after a shock event. Finally, the fourth response behaviour, adaptive behaviour, represents a system that could reach a performance level higher than the original level, e.g., when the system is reconfigured during its recovery and restoration.
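As a minimal sketch of the four response behaviours, the following Python function labels a response by comparing the performance level reached after recovery with the original level. The scalar performance values, the function name and the tolerance `tol` are illustrative assumptions and are not taken from [8].

```python
from enum import Enum


class Behaviour(Enum):
    ROBUST = "robust"
    DUCTILE = "ductile"
    COLLAPSING = "collapsing"
    ADAPTIVE = "adaptive"


def classify_behaviour(original: float, final: float, tol: float = 0.01) -> Behaviour:
    """Label a response from the original and the post-recovery performance.

    `original` and `final` are hypothetical scalar performance levels
    (e.g. the fraction of planned service delivered). `tol` is an assumed
    relative tolerance for treating two levels as equal.
    """
    if original <= 0:
        raise ValueError("original performance must be positive")
    if final <= tol * original:
        return Behaviour.COLLAPSING  # no service remains after the shock
    if final > original * (1 + tol):
        return Behaviour.ADAPTIVE    # settles above the original level
    if final >= original * (1 - tol):
        return Behaviour.ROBUST      # full recovery to the original level
    return Behaviour.DUCTILE         # only basic and critical functions recovered
```

For instance, under these assumptions `classify_behaviour(1.0, 0.6)` yields the ductile label, whereas `classify_behaviour(1.0, 1.2)` yields the adaptive one.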

Additionally, adopting the studies by Robert et al. [9] and UC Quake Centre [10], five levels of service for a system are identified, as follows:

  • Optimal level of service (OpLoS): The theoretical condition for which the system was planned and designed.

  • Normal level of service (NLoS): The system is performing as required and expected, achieving its mission to supply the anticipated level of service, while all the system's outputs are in their normal state.

  • Acceptable level of service (ALoS): The system's performance is partially degraded, with one or more of the system's outputs in a disturbed mode. Still, due to the action(s) taken, i.e., contingency plan(s), the system can maintain the service quality at acceptable levels and limit its degradation.

  • Unacceptable level of service (ULoS): The system's performance is severely degraded and, despite the action(s) taken, its degradation has become unacceptable. The system is no longer able to accomplish its mission.

  • Out of service level (OLoS): Discontinuation of the service.

The above classification does not provide exact thresholds to determine when the performance of a system changes from one LoS to another. Such thresholds are determined and described by the organisations according to their expectations, requirements and needs. In the case of rapid transit or metro systems, for instance, the LoS comprises: (i) in-service on-time performance (quality), (ii) the frequency of service (quantity) and the train headways, and (iii) the average load factor (quantity). Figure 6.1 shows how the LoS above can be associated with system behaviours. First, the NLoS region lies between the elastic response corresponding to robust behaviour and some ductile behaviour with longer-term degradation. This means that “normal” does not imply operating the system continuously at 100% and that some margin remains available. Second, the ALoS region lies just below this region; some post-disruption LoS degradation, below normal, typically remains tolerable. Third, there is a ULoS region where the system continues to operate but does not meet reduced post-disruption expectations. This region is bounded by the OLoS, which results from the inability to maintain any service due to a collapsing behaviour. In contrast to the LoS named above, the “new normal” LoS results from an adaptive behaviour that enables an increase in performance.

Fig. 6.1 The association of resilience behavioural responses and the LoS of a system
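Since the thresholds between service levels are set by each organisation, any concrete mapping is necessarily an assumption. The sketch below, in Python, shows one way such a mapping could be written, with performance expressed as a fraction of the optimal (design) level. The threshold values 0.90 and 0.60 and all names are hypothetical placeholders, not values from [9] or [10].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoSThresholds:
    """Lower bounds of the service bands, as fractions of the OpLoS.

    The default values are illustrative assumptions; an organisation would
    replace them with its own expectations, requirements and needs.
    """
    normal: float = 0.90      # at or above this value: NLoS
    acceptable: float = 0.60  # at or above this value: ALoS


def classify_los(performance: float, t: LoSThresholds = LoSThresholds()) -> str:
    """Map a performance value (1.0 = OpLoS) to a level-of-service label."""
    if performance < 0:
        raise ValueError("performance cannot be negative")
    if performance == 0:
        return "OLoS"  # service discontinued
    if performance >= t.normal:
        return "NLoS"
    if performance >= t.acceptable:
        return "ALoS"
    return "ULoS"      # still operating, but below acceptable expectations
```

With these placeholder thresholds, a system delivering 70% of its design performance would fall in the ALoS band, and one delivering 30% in the ULoS band.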

Resilience is applicable to safety-critical as well as other systems. For the former, whose loss or failure has direct safety implications, resilience emphasises continued and correct operation in the wake of disruptions [11]. For other systems, whose purpose is not a safety function but which are considered essential (critical) infrastructure, resilience implies continued operation as well. Naturally, if the service must be safe, resilience means continued operation at the required safety level; operation with reduced safety levels is excluded, at least in this discussion. Nevertheless, the degradation or loss of such a service may have indirect safety implications. In the case of public transport systems as discussed in this chapter, for instance, overcrowding on station platforms may have safety consequences, or the resulting congestion on other transport modes may hinder emergency services as a knock-on effect.

6.3 The Human Contribution to Resilience

Woods [12] describes resilience as a parameter of a system that captures how well that system can adapt to handle events that challenge the boundary conditions for its operation. Such events may occur due to (i) limitations in the plans and procedures, (ii) the tendency of systems to adapt given changing pressures and expectations for performance, and (iii) environmental changes. The system's capacity to respond to challenging events lies partially in the expertise, strategies, and tools that people employ to respond to certain challenges [12].

It is therefore clear that people at all levels of an organisation, e.g., frontline personnel, middle management personnel, and top policy decision makers, can create (or fail to create) resilience by adjusting their performance to current operational conditions [13]. Research [14] has already defined four fundamental cornerstones that describe a resilient organisation and are associated with the human contribution to resilience:

  • Knowing what to do, which refers to the ability to respond to regular and irregular disruptions and disturbances by adjusting normal functioning or activating ready-made responses.

  • Knowing what to look for, which refers to the ability to monitor that which is or could become a threat in the near term. The monitoring should cover both what happens in the environment and what happens in the system itself, i.e., its own performance.

  • Knowing what to expect, which refers to the ability to anticipate developments and threats further into the future.

  • Knowing what has happened, which refers to the ability to learn from experience.

Reviewing the cornerstones, a continuous loop of interactions can be observed, as shown in Fig. 6.2, where human involvement is divided into two main levels. The first level, in the upper half of the figure, refers to the contribution of the frontline personnel as well as the responses of the crisis teams, including management. This level of involvement includes the short-term actions/tasks of the personnel, and represents those individuals or teams within an organisation who respond after the occurrence of a disruption and who react and manage to recover the LoS. Depending on the type of disruption, there may also be opportunities to limit the magnitude of degraded service, with a possibly consequential positive impact on its duration. All of these actions are associated with the “what to do” and “what to look for” cornerstones.

Fig. 6.2 The resilience capabilities loop as a function of human contribution (adapted from [13] and extended by the authors)

The second level, in the lower half of the figure, refers to a longer-term organisational response across the whole spectrum of an operation, including any normal and unexpected situations. It is assumed that the organisation's knowledge of what to do and what to look for, on which the response to a disruption is built, is itself built upon the organisation's previous experience and anticipation. Experience is derived from the organisation's learning from past events, while anticipation refers to its ability to identify potential future threats. In other words, learning and anticipating, corresponding to the “what has happened” and “what to expect” cornerstones respectively, together form the basis for preparedness, which is translated concretely into the plans, policies, procedures and training that are applied in the actual response to a disruption.

6.4 Resilience Operationalisation Using the Four Cornerstones and the LoS Concept

This section demonstrates the operationalisation of resilience using the four resilience cornerstones and the concept of service levels in the transportation sector. Data were collected from publicly available reports [15, 16] that describe two major disruptions of the Singaporean metro system that occurred in December 2011 within a period of two days.

The first disruption, on December 15, lasted five hours and affected about 127,000 commuters. A second disruption occurred on December 17, spanned over seven hours and affected some 94,000 commuters. An investigation found that both disruptions were preventable and caused by a combination of factors, none of which individually would have resulted in the disruptions. The official investigation report [15] describes the events leading to the disruption as follows:

The immediate cause of the stalling of the trains was damage to their Current Collector Device (CCD) “shoes” due to sagging of the “third rail”, which supplies electrical power to the trains. During both incidents sections of the third rail sagged after multiple “claws”, which hold up the third rail above the trackbed, were dislodged. With their CCDs damaged, the trains were unable to draw electricity from the third rail to power their propulsion and other systems such as cabin lighting and air-conditioning.

The investigators found that the December 15 incident

was initiated by a defective fastener in the Third Rail Support Assembly (TRSA), which damaged the Current Collector Device (CCD) shoes of the trains that passed the incident site. In the process, these trains destabilised the third rail system elsewhere along the network, and the forces generated by the CCD shoes of multiple trains impacting the sagging third rail caused three more claws at the incident site to be dislodged, such that the third rail came to rest on the trackbed. Thereafter, this segment of the third rail became totally impassable to all trains.

The second incident

was triggered by one or more “rogue trains” which suffered not easily detectable CCD shoe damage when passing the 15 December 2011 incident site as the third rail was progressively sagging. In its haste to resume revenue service on 16 December 2011, the metro personnel did not conduct a sufficiently thorough investigation, such that the CCD shoe damage on the rogue train(s) went undetected. Had the investigation been thorough, the incident on 17 December 2011 might have been prevented.

In addition, the analysis of the events prior to the disruptions in [15] identified numerous factors that contributed to the incidents, such as:

  • Defects on train wheels that resulted in severe vibration.

  • Gauge fouling, or contact with the third rail system by passing trains due to the separation between the third rail and the running rail.

  • Design of the current third rail claw.

  • Shortcomings in the maintenance work culture within Singaporean Mass Rapid Transit (SMRT).

  • Shortcomings in the maintenance and monitoring regime, mainly in the context of ageing assets.

This example highlights a service disruption due to a combination of failures. Had the failures happened independently, they would not have produced any substantial disruption to the system. Using the four resilience cornerstones, it could be claimed that all contributing factors are primarily associated with the SMRT's inability to provide its employees with the appropriate means (e.g., policies and procedures) to execute their tasks.

Regarding the levels of service, the SMRT managed to restore its service in a timely manner, while also providing alternative travel options to its customers (e.g., replacement buses). In spite of the preparedness to manage the disruptions indicated by this response, the LoS in both disruptions was deemed unacceptable, as implied by the fine imposed by the Singaporean Land Transport Authority [16]. Thus, the elasticity threshold should not be equated with acceptability, and the LoS, if measured as a momentary or average capacity, is not sufficient per se to discuss service degradation. Instead, a service degradation measure needs to consider the duration (width) of the disruption and not only its magnitude (depth).
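One possible way to combine depth and width, offered here only as a sketch, is to accumulate the shortfall below a target service level over the time it persists. The Python function below assumes a piecewise-constant service profile given as (duration in hours, service level) pairs; the function name, the target of 1.0 and the example profiles are hypothetical and do not represent data from the SMRT incidents.

```python
from typing import Iterable, NamedTuple, Tuple


class ServiceLoss(NamedTuple):
    loss: float            # shortfall accumulated over time (level-hours)
    degraded_hours: float  # width: total time spent below the target
    min_level: float       # depth: lowest service level observed


def service_loss(segments: Iterable[Tuple[float, float]],
                 target: float = 1.0) -> ServiceLoss:
    """Summarise a piecewise-constant service profile.

    `segments` holds (duration_hours, service_level) pairs; `target` is the
    assumed reference level against which degradation is measured.
    """
    loss = 0.0
    degraded_hours = 0.0
    min_level = target
    for hours, level in segments:
        if hours < 0:
            raise ValueError("durations must be non-negative")
        shortfall = max(0.0, target - level)
        loss += hours * shortfall      # depth multiplied by width
        if shortfall > 0:
            degraded_hours += hours
        min_level = min(min_level, level)
    return ServiceLoss(loss, degraded_hours, min_level)


# Illustrative profiles only: a deep, short disruption and a shallow, long one.
deep_short = service_loss([(2.0, 0.5)])
shallow_long = service_loss([(4.0, 0.75)])
```

In this illustration both profiles accumulate the same loss of 1.0 level-hours, although their minimum levels and durations differ, which is why a momentary or average capacity alone cannot distinguish them.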

With respect to the resilience cornerstones, the SMRT seems to have learnt from the experience of the first disruption and managed the second incident better. Replacement buses and alternative travel options were deployed. Given the longer duration of the second disruption, it appears important to determine the boundaries of a system's LoS and service degradation. Indeed, the second disruption lasted longer, yet the passengers were better served and transferred to their destinations. Hence, a broader measure of the overall performance of the system (or of service degradation) would consider not only the customers for whom the metro was unavailable but also the passenger trips completed independent of mode, e.g., via replacement services.
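A mode-independent measure of this kind could, for example, be expressed as the share of demanded passenger trips that were completed by any means. The following Python sketch assumes trip counts per mode are available; the function name, the mode labels and the figures in the usage note are hypothetical and are not drawn from [15] or [16].

```python
from typing import Mapping


def trip_completion_ratio(completed_by_mode: Mapping[str, int],
                          demanded_trips: int) -> float:
    """Share of demanded trips completed, regardless of transport mode.

    `completed_by_mode` maps an assumed mode label (e.g. "metro",
    "replacement_bus") to the number of completed passenger trips.
    """
    if demanded_trips <= 0:
        raise ValueError("demanded_trips must be positive")
    if any(count < 0 for count in completed_by_mode.values()):
        raise ValueError("trip counts must be non-negative")
    return sum(completed_by_mode.values()) / demanded_trips
```

As an illustration, with 1000 demanded trips of which 600 are completed by replacement bus and 150 by other modes, `trip_completion_ratio({"metro": 0, "replacement_bus": 600, "other": 150}, 1000)` evaluates to 0.75, even though the metro itself delivered no service.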

The SMRT example shows that organisations should not only focus on preparing and planning how to handle individual shock events, but also account for potential consequential effects and their impact on the system's overall resilience. This example underscores the importance of ensuring that recovery is not only timely but also durable, i.e., placing the system into an “as good as new” state in reliability terms. The incident on December 17 might have been prevented had the SMRT not been in haste to resume its service on December 16, following the disruption on December 15. Such haste led to the deployment of insufficiently investigated trains and, in turn, to the second disruption in the system.

6.5 Conclusions

Resilience is broadly used to study and understand the response of critical sectors to disruptions. However, the operationalisation of the notion has not been sufficiently explored. In this chapter, resilience was presented in association with a system's performance and its degradation in terms of service levels as an undesired outcome distinct from those related to potential hazards to the public and environment. Resilience was described as the operational behaviour of a system subsequent to the occurrence of an endogenous or exogenous shock event. Further, five levels of service for a system were identified, i.e., the new normal, normal, acceptable, unacceptable and out of service level. The association between service levels and system resilience was also shown. Moreover, the acceptability of the system's response (service level trajectory) can be seen as largely unconnected with whether this response is elastic. Ultimately, the resilience of systems that deliver a service is defined not in terms of whether the system response exhibits an elastic behaviour but rather in terms of whether the service level trajectory is acceptable. Specifically, a ductile or inelastic response with a longer-term service level degradation may be acceptable; the acceptability criteria for the system will instead be based on response criteria such as the minimum service level maintained during the peak of the disruption, the magnitude of the longer-term degradation, and an overall service loss that combines the duration and magnitude of degraded service.

People at all levels of an organisation play a significant role in creating (or not) resilience. This chapter examined the human contribution to resilience, for which the four resilience cornerstones provide a helpful lens. Yet the functions the cornerstones describe need to be interpreted on two levels. First, on the organisational level, they concern anticipating threats, learning from disruptions, and incorporating the lessons thereof into contingency plans and training. Second, for the frontline personnel at the “sharp end”, the functions become responding to a disruption per the procedures, monitoring whether the actions taken succeed in preventing and mitigating service degradation or recovering service, and anticipating the system's evolution to enable a proactive response.

This chapter does not provide any figures of merit for a system's resilience involving the LoS or the probability/frequency and duration of the service degradation. Future research will therefore focus on evaluating different systems and their preparedness against unexpected events, and will also identify critical human tasks and scenarios that could lead to significant losses.