1 Background and introduction

This paper discusses the significant findings of a risk assessment of the key communication infrastructure used for emergency communication in the Norwegian railway. The paper consists of an introduction to and background for the risk assessment, a description of our approach to the extended risk assessment, documentation of the results, and a conclusion with suggestions for further research.

The communication infrastructure in Norway is based on the Global System for Mobile Communications in Railways (GSM-R), an international wireless communications standard for railway communication, documented in GSM-R (2010). GSM-R is a part of the European Rail Traffic Management System (ERTMS), which is described in more detail in ERTMS (2009). ERTMS is the planned single Europe-wide standard for train control and command systems. ERTMS consists of the European Train Control System (ETCS), a standard for in-cab train control, and GSM-R, the GSM mobile communications standard for railway operations.

The GSM-R system was implemented in Norway in 2007 after the train accident at Åsta in eastern Norway in 2000, where 19 people were killed; see NOU (2000). Before the accident, train control identified two trains approaching each other on the same single track but was unable to prevent the collision because the train drivers could not be contacted in time. To avoid similar accidents, the GSM-R system is used as a critical emergency communication system between train control and trains. In Norway, several different railway-signaling systems are used for interlocking, such as NSI-63, NSB-78, and NSB-84. The systems are fairly simple and based on track circuit design; this is described in more detail in Eriksson (2004). Thus, communication facilities to handle unanticipated events are important. A more advanced system, ERTMS, is going to be implemented through pilot projects in Norway, and the first project is planned to start in 2014.

The railway system and communication infrastructure in Norway are defined as a part of the critical infrastructure, as documented in a whitepaper (2009). This is in line with the categorization given by the European Commission in EC (2005) and by the US Department of Homeland Security, see TSA (2007).

We performed a risk assessment of the GSM-R system in 2008. Several key challenges were the basis for our work: there was only one central GSM-R switch without backup, so the switch could be a single point of failure, and two central BSC units managing most of the traffic were placed in the same room with a common power supply, making them vulnerable to common cause failures. In the UK, these vulnerabilities are mitigated since Network Rail has duplicated the GSM-R at a disaster recovery site. The organization of rail traffic in Norway is divided between different actors. The operator of the GSM-R system, JBV, is responsible for the railway infrastructure, such as railway tracks and signaling equipment. The train operators are organized as separate and independent units. In 2010, there were 13 operators. The Norwegian Railway Authority, SJT, is a key stakeholder related to safety and security of this infrastructure and is responsible for ensuring that rail operators meet the conditions and requirements set out in the railway legislation. In addition, the Directorate for Civil Protection and Emergency Planning (DSB) has been involved in analyzing major accidents or incidents impacting the railway. The main focus of JBV is the implementation, operation, and maintenance of the railway infrastructure such as tracks, signaling systems, and supporting infrastructure. The GSM-R communication system serves all operators. The safe and secure operation of the GSM-R system is a key factor in ensuring safety in train traffic. Human factors have also been an important element, due to the changing technological and organizational context of the railway system, as documented in the review of human factors performed by Wilson and Norris (2006).

We have based our risk assessment on the MTO concept (man–technology–organization), a broad socio-technical approach to safety that builds on many knowledge areas such as relevant technical issues, psychology, organizational knowledge, culture, human factors, and safety, as described in Rollenhagen and Evenéus (2007). Thus, a key issue in the project was to identify the major MTO risks; see question Q1 at the end of this section.

The risks of the GSM-R system are characterized by complexity and uncertainty due to the integration of technical, organizational, and human factors issues. The main steps of the risk assessment have been in accordance with the activities in a preliminary hazard analysis (PHA), as described in Ericson (2005). As suggested by Renn (2005), p. 16, we have used traditional safety analysis on the simple risk problems, using traditional decision making such as risk–benefit analysis and risk–risk trade-offs. In addition, we have extended the traditional risk analysis to address complexity and uncertainty by incorporating resilience. Exploration of resilience thus comes in addition to traditional risk analysis, where redundant and fault-tolerant systems are discussed. Renn (2005) suggests exploring precaution-based and resilience-focused management strategies for uncertainty-induced risk problems. To involve stakeholders, Renn (2005) suggests a reflective discourse in a setting of uncertainty-induced risk problems. We have explored action research as a mechanism to support a reflective discourse in the risk assessment. The focus on resilience is also in accordance with the vision to establish a “secure and resilient transportation network” from TSA (2007), supporting resilience as a strategy for the GSM-R system. The improvement of resilience was one issue in our project, as formulated in question Q2 at the end of this section.

Communication via the GSM-R system is key to assure safety and regularity in train operations. When the GSM-R system fails, the failure should not lead to accidents, but to a controlled degradation of the operation of train traffic, i.e., to resilience in operations. Resilience is defined as “the intrinsic ability of a system to adjust its functioning prior to or following changes and disturbances, so that it can sustain operations even after a major mishap or in the presence of continuous stress”, from Hollnagel et al. (2006). This definition has been seen as a high-level, strategic definition. Safety is defined as: “freedom from unacceptable risks”, from ISO (1999). Improvement in safety and resilience has been a key issue in the project and is formulated in question Q3 at the end of this section. Improvement in resilience of the communication infrastructure should be observed by improved stability of the system or improved ability to perform communication during changes or disturbances. Thus, train delays and disturbances related to GSM-R incidents should be reduced if resilience is improved.

Risk has been defined as “Combination of the probability of occurrences of harm and the severity of that harm”, from ISO (1999). Information security has been defined as “protecting information and information systems from unauthorized access, use, disclosure, disruption, modification, or destruction in order to provide: integrity, confidentiality and availability”, from United States Code Title 44 (2006). In the following, we have used the term security when discussing information security. The focus of our risk assessment and discussion of security has been in the context of safety of train operations, to avoid major train accidents, incidents, or disruption of traffic. Thus, we have based our approach on methods from safety but have included relevant security hazards that could impact safety of operations. The relationship between security and resilience has been of interest and has been formulated in question Q4 at the end of this section.

A hazard is seen as something that can cause adverse effects or harm, while a risk is the likelihood that a hazard will actually cause its adverse effects, from Ericson (2005).

Because our approach is based on the MTO concept, the safety and security culture has been of interest, especially how the “culture” influences safety and resilience and how the “culture” changes due to the actual risk assessment and implementation of mitigating actions. Safety culture has increasingly been explored in railways, as documented in the review performed by Wilson and Norris (2006). Safety culture is an area with many different perspectives and opinions, as discussed in Yule (2003); one main point has been the possible relationship between safety surveys and safety outcomes. In Itoh et al. (2004), there is a documented correlation between attitude factors such as the operators’ morale and motivation and the actual incident/accident rate of train operations. Based on this past correlation, it is suggested to use this kind of questionnaire in addition to incident/accident data to identify possible high-risk or low-risk units.

The survey of culture has been based on one of the more often used definitions of safety culture, as mentioned in the review in Yule (2003): “The safety culture of an organization is the product of individual and group values, attitudes, perceptions, competencies, and patterns of behaviour that determine commitment to, and the style and proficiency of, an organization’s safety management”. This definition comes from the nuclear industry; see ACSN (1993). The definition has also been used by the International Union of Railways (UIC) to explore safety culture in railways; see Johnsen et al. (2005). Security has been included in the above definition of culture, i.e., looking at “safety and security culture”, and thus, assessment of security issues has been included in our survey in addition to safety.

Based on the prior discussion, the key research issues prior to starting the project were as follows:

  Q1. What are the major risks from an MTO perspective? And how can risks from different areas be prioritized in collaboration?

  Q2. What are the main MTO-mitigating actions, which also may improve resilience? How can we prioritize the mitigating actions to ensure that actions are implemented?

  Q3. How can we measure improvement in safety and resilience in operations?

  Q4. Can resilience be explored to mitigate security issues or challenges?

The following section describes our approach, based on the above issues; in addition, these questions are discussed in Sect. 4.

2 Approach

In this section, our approach is described, combining a preliminary hazard analysis with action research, incorporating resilience, and exploring safety culture as a tool to measure changes in safety, security, and resilience.

The activities included steps from a preliminary hazard analysis (PHA). To ensure a complete risk picture and to avoid simplification, as discussed by Weick and Sutcliffe (2001), a broad set of different competencies and knowledge was included in the risk assessment. This ranged from technical competence of GSM-R to the understanding of railways and handling of incidents in the railway systems. To ensure understanding, ownership, and commitment to the mitigating actions, the stakeholders from the organization were involved. Key issues were prioritized in collaboration between different stakeholders and management in appropriate meeting arenas. These open collaborative meetings were called “search conferences”, as described by Greenwood and Levin (2007).

A key issue was to measure improvements in safety and resilience. In our context, safety of transportation is related to the stability of the GSM-R system, and thus, the train delays and disturbances related to GSM-R incidents were measured. Resilience is dependent on technical infrastructure, organizational abilities, and human knowledge and experience. In Hollnagel et al. (2006), p. 350, required qualities of a resilient system are described by anticipation (knowing what to expect), attention (knowing what to look for), response (knowing what to do), and learning (dynamic development and updating). Thus, we chose to measure resilience as an organizational capacity by these factors, in addition to exploration of technical incidents based on train delays and disturbances. To anticipate and be able to adjust functioning during changes or disturbances, there must be awareness of what may go wrong, clear responsibility, and ability to handle crisis. In addition, an assessment of learning in the organization has been of interest. Thus, these questions were included in a safety and security questionnaire used to explore culture. The yearly assessment of safety and security culture was included in order to identify development of knowledge and awareness that could impact safety. The main activities in the project were based on an adjusted PHA, as described in Ericson (2005), and consisted of the following activities:

  T1. Organize the project, scope, and activities with the management steering committee in combination with open collaboration meetings involving all stakeholders (i.e., search conferences).

  T2. Identify security hazards impacting safety of operations; identify major hazards and major risks related to MTO, based on literature reviews, interviews with different stakeholders, and open discussions in a search conference.

  T3. Prioritize major hazards and major risks in collaboration with the stakeholders through search conferences, based on a risk matrix and qualitative discussions. Discuss the prioritization in order to create understanding across different disciplines and create common risk perceptions.

  T4. Identify and prioritize mitigating actions based on technology, organization, and human factors, i.e., MTO, that also improve resilience. A set of resilient principles has been explored to improve resilience, by defining mitigating actions based on MTO. Involve the workforce in prioritizing mitigating actions in order to support bottom-up processes and understanding of the consequences of limited resources.

  T5. Assess the development of safety and security culture each year in order to identify areas of concern and areas of strength.
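
The prioritization step based on a risk matrix can be illustrated with a minimal sketch. It assumes a hypothetical 5 x 5 matrix with ordinal probability and severity scores, a simple product as the "combination" of the two, and illustrative thresholds; none of the scales, scores, or thresholds below are taken from the project, and the hazard names are only examples drawn from the challenges mentioned earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hazard:
    name: str
    probability: int  # assumed ordinal scale: 1 (rare) .. 5 (frequent)
    severity: int     # assumed ordinal scale: 1 (minor) .. 5 (catastrophic)

    @property
    def score(self) -> int:
        # One common way to combine probability and severity (an assumption here).
        return self.probability * self.severity


def risk_level(hazard: Hazard) -> str:
    """Map a score to a matrix zone using illustrative thresholds."""
    if hazard.score >= 15:
        return "high"
    if hazard.score >= 6:
        return "medium"
    return "low"


def prioritize(hazards):
    """Rank by score, then severity, then name for a stable order."""
    return sorted(hazards, key=lambda h: (-h.score, -h.severity, h.name))


# Scores are made up for illustration only.
hazards = [
    Hazard("Central GSM-R switch without backup", probability=2, severity=5),
    Hazard("Two BSC units sharing room and power supply", probability=3, severity=5),
    Hazard("Lack of password policies", probability=3, severity=2),
]

ranked = prioritize(hazards)
for h in ranked:
    print(f"{h.score:>2}  {risk_level(h):<6}  {h.name}")
```

In a search conference, such a ranking would only be a starting point for the qualitative discussion, not a replacement for it.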

These activities were structured as a project, managed through a steering committee. The steering committee approved scope, goals, and time schedule in the second quarter of 2008 (Q2-2008). A search conference was subsequently arranged with the employees and management to describe the project and to identify major risks and unwanted incidents in collaboration. The operating organization of the central GSM-R switch was the key stakeholder in the project. The main technical object of interest was the central GSM-R switch communicating with several distributed base station controllers (BSC), connected to multiple base transceiver stations (BTS) deployed along the railway track. Six different GSM-R groups were involved as stakeholders in the risk assessment: transmission and fiber; radio (through BSC); different GSM-R systems and components (dispatcher, central switch); user support and service desk; data network, external access, and ICT (Information and Communication Technology); and physical security and safety (backup of power and air conditioning, building safety, and security).

The hazards and risks were based on prior documented incidents and selected issues from literature. The preliminary hazard analysis was similar to the risk management framework described in TSA (2007), figure 1-2, in our case focusing on safety and on security in the context of safety. Relevant threats and vulnerabilities related to industrial control systems security, from Stoufer et al. (2008), were explored, such as poor security training and awareness; lack of password policies; lack of backup; and lack of redundancy. Interviews and discussions were performed in eight expert groups, based on a structured questionnaire. The risks and mitigating actions were explored from an MTO perspective, including organizational and human factors issues to ensure a more complete understanding of risks. Relevant security issues were explored with suppliers and local ICT security personnel. The key issues and findings from the interviews were analyzed, documented, and submitted to the interviewed persons. Based on the data gathered during interviews, it was decided to arrange a search conference to prioritize risks and mitigating actions.

The search conference was arranged to support common understanding and collaborative action. The risk matrix and mitigating actions were displayed as large posters on the walls, so that items could be changed and prioritized in an open collaborative manner. Several suggested risks were changed based on the peer reviews, disagreements, and discussions. Each employee was given five “post-it notes” to be used to prioritize mitigating actions. All employees from the different technical groups, around forty, were involved in prioritizing the different actions. The different stakeholders increased their understanding by looking at how different personnel prioritized different mitigating actions. After a break and documentation of the result of the voting, the suggested prioritized list of mitigating actions was presented, in order to get agreement on the key issues.
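
The voting procedure can be sketched as a simple tally. This is a minimal illustration, assuming that an employee may place several of the five notes on the same action; the employee identifiers, action names, and ballots are hypothetical and do not reproduce the actual voting result.

```python
from collections import Counter

VOTES_PER_EMPLOYEE = 5  # five post-it notes per employee


def tally(ballots):
    """Count votes per mitigating action.

    ballots maps an employee identifier to the list of actions voted for,
    one entry per post-it note. Returns (action, votes) pairs, most votes first.
    """
    totals = Counter()
    for employee, votes in ballots.items():
        if len(votes) > VOTES_PER_EMPLOYEE:
            raise ValueError(f"{employee} used more than {VOTES_PER_EMPLOYEE} votes")
        totals.update(votes)
    # Sort by votes (descending), then by name so that ties have a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical ballots for illustration only.
ballots = {
    "employee_a": ["backup switch", "backup switch", "training",
                   "password policy", "backup switch"],
    "employee_b": ["training", "backup switch", "separate power supply",
                   "training", "training"],
}

ranking = tally(ballots)
for action, votes in ranking:
    print(f"{votes:>2}  {action}")
```

The tally gives the suggested prioritized list; as described above, the list was then presented back to the participants to reach agreement.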

The finalized risk matrix and mitigating actions were presented to the top management—safety staff and line management. Based on discussions, the suggestions were incorporated in the formal budget process.

The project was assessed by JBV after the project was finished in Q2-2009.

The safety and security culture was assessed through a questionnaire distributed in Q1-2009 and in Q1-2010. The questionnaire consisted of 30 questions and was distributed to 37 employees. In 2009, 76% answered the questionnaire, and in 2010, 78% answered the questionnaire.
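
The response rates can be translated into approximate respondent counts. The sketch below assumes that the questionnaire went to the same 37 employees in both years and that the reported rates are rounded to whole percentages; the resulting counts are therefore estimates, not reported figures.

```python
DISTRIBUTED = 37  # employees who received the questionnaire


def estimated_respondents(distributed: int, rate_percent: float) -> int:
    """Estimate the number of respondents from a rounded response rate."""
    return round(distributed * rate_percent / 100)


n_2009 = estimated_respondents(DISTRIBUTED, 76)  # 37 * 0.76 = 28.12
n_2010 = estimated_respondents(DISTRIBUTED, 78)  # 37 * 0.78 = 28.86

print(n_2009, n_2010)

# Consistency check: the estimated counts reproduce the reported rates.
print(round(100 * n_2009 / DISTRIBUTED), round(100 * n_2010 / DISTRIBUTED))
```

With such a small population, a difference of one respondent shifts the rate by almost three percentage points, so year-to-year changes in the survey results should be read with that in mind.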

An assessment of the stability of the system was done at the end of 2010 based on reported delays in the railway system, caused by failures of the GSM-R.

2.1 Use of action research in the project

The risks of operating the GSM-R system were seen as uncertain and complex. To ensure participation-based development of safety by the key stakeholders, an action research approach was used. In addition, Renn (2005) suggests a reflective discourse in a setting of uncertainty-induced risk problems. Action research is a reflective discourse and is described as “the touchstone of most good organizational development practices”; see Van Eynde and Bledsoe (1999). The action research method has been formalized as an iterative process model with five action research principles: (i) researcher–client agreement; (ii) cyclical process model; (iii) theory; (iv) change through action; and (v) learning through reflection, as described by Davison et al. (2004). A survey of the action research literature was performed, and it indicates that an action research approach contributes to improvements in safety and security. However, findings from the survey were based on a limited data set and may be somewhat biased.

The involvement and commitment of the stakeholders are important in relation to ownership of risks and mitigating actions, process, results, learning, and reflection. Action research is an approach that is well suited to a broad-based change project such as an MTO risk assessment, since the assessment identifies risks and mitigating actions related to man, technology, and organization. Thus, a risk assessment could be seen as organizational learning, improving the safety of the organization based on reflection and the deeper double-loop organizational learning described by Argyris and Schon (1978). Double-loop organizational learning means that risks are detected and corrected by reexamining the fundamental underlying values and issues related to organization, technology, and human factors; thus, the organization’s capacity for effective coordinated action increases, as mentioned by Kim (2004). The relevant actors should be involved because risk assessment and improvement may involve many different stakeholders and perspectives, and because involvement is needed to create ownership, understanding, and the ability to change organizational practice and underlying values. Expert judgment has been used in addition but cannot replace local ownership, understanding, learning, and the ability to perform reflective change. Action research has been used to improve safety and security in complex organizations; this is more fully documented in Johnsen et al. (2009), but some examples are given in the following. Related to security, Smith et al. (2007) describe how an action research program conducted across the entire Government in New South Wales (Australia) contributed to better compliance with security standards, increased understanding of safety and security, improved policies, and effective business continuity plans.

The deeper double-loop organizational learning seems to have impacted safety performance, as described in the following. Alteren et al. (2004) document improvements in safety and productivity at an offshore oil rig: the number of incidents involving injuries dropped to one-third of the previous number, and the productivity (drill meters per day) increased. Antonsen et al. (2007) document improvements in safety (and efficiency) in service vessels in the oil and gas industry: injuries on service vessels (per million working hours) were reduced from 13.8 in 2001 to 2.6 in 2006, and service vessel collisions were reduced from twelve in 2000 to an average of one per year from 2001 through 2005. Richter (2003) documented that accident rates at two Danish enterprises dropped to about 25% of the average of the preceding 5 years. Thus, the process of action research may influence key issues in the organizations, such as safety culture and subsequent safety.

Based on the literature survey of use of action research, four activities were incorporated in the work plan, identified from Smith et al. (2007), Alteren et al. (2004), Antonsen et al. (2007), and Richter (2003):

  • Involving the different communities of practice who are working to ensure safety, security, and a high level of service. In this case, these were the groups working with transmission; radio; GSM-R systems and components; user support; data network; and physical security and safety.

  • Using workgroup meetings as a tool for fostering workforce understanding, participation, and enthusiasm. Focusing on a “bottom-up” process in addition to “top-down” support; increasing worker understanding and ownership of challenges and solutions; basing the work on practical experience from the workforce, which contributes to actions being perceived as more legitimate by workers. At the same time, involving management in order to get the “top-down” perspective and support for prioritizing resources and budgets.

  • Using the workgroup meeting to prioritize risks and mitigating actions (through voting). This may improve understanding between different areas and help to focus on the most important issues.

  • Using the workgroup meeting to support a “proactive and informed” culture focusing on dialog and reflection (i.e., learning) when discussing risks and mitigating actions. We have explored the risk matrix as a communication tool to improve risk understanding and risk communication.

From an external perspective, key stakeholders in a risk assessment are expected to be the regulatory authorities (represented by SJT), focusing on rules and regulations; the media (newspapers), reporting accidents and incidents to the general public; suppliers, focusing on deliverables of the right quality; and customers, focusing on dependability, resilience, and safety of transportation services. These external stakeholders have not participated directly in the risk assessment, due to resource limitations. However, the external stakeholders have influenced the result through their perspectives and possible actions following an incident or accident. In case of a later accident or incident, it was assumed that the external stakeholders would scrutinize the risks and mitigating actions from the project. Perceptions and risk assessments from external stakeholders have thus been used as boundary conditions, influencing our risk assessment. Key prior incident reports have been explored, such as the report on the fire at the Oslo central station; see DSB (2008). The key suggestions have also been discussed with the vendors, in order to ensure that the most important areas have been included in the risk assessment.

2.2 Exploring resilience in the preliminary hazard analysis (PHA)

Resilience is suggested as an appropriate strategy when we are faced with complexity and uncertainty-induced risk problems, as described in Renn (2005). It is also suggested as an appropriate instrument to cope with surprises, unanticipated actions, or uncertainty, and thus, resilience seems well suited to security issues. Resilience engineering has been suggested as a paradigm for safety management to cope with complexity under pressure to achieve continuity and regularity, as mentioned by Hollnagel et al. (2006), p. 6; thus, resilience is increasingly explored in risk assessments and risk management. In Becker et al. (2011), the use of resilience seems to improve awareness of dependencies and couplings and seems to be a fruitful path to follow in a time of complexity and dynamic change. In Cedergren (2011), the interplay between different organizations is discussed, and the influence of cross-organizational aspects on resilience is highlighted as a key issue. Complexity, uncertainty, pressure, continuity, and cross-organizational issues are key issues in the operation of GSM-R in Norway. Thus, resilience as a strategy was included in the risk assessment. To describe resilience in a more operational setting, a set of resilient principles has been identified. The resilient principles were identified through a “state of the art review”, see Johnsen et al. (2009), describing resilient principles such as redundancy. These principles are explored in the risk assessment to improve safety by implementing mitigating actions through technology, organization, or human factors. The relationships between goals, strategies, and principles to achieve safety and continuity through resilience are as follows:

Main goal: Safety and continuity of operations

Strategy: Resilience (due to uncertainty and complexity)

Principles: Resilient principles, example: redundancy

Abilities: Resilient abilities, i.e., the principles implemented in technology, man, and/or organization as mitigating actions related to safety critical functions. As an example, redundancy can be implemented in technology as a redundant technology component, to ensure failsafe operation in case of an error of a safety critical component. Redundancy could also be implemented in organizational abilities or in human abilities.
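
The step from the principle (redundancy) to a technical ability can be illustrated with a minimal failover sketch. The channel functions and the exception name below are hypothetical and only simulate a failing primary component and a working standby; they do not describe the actual GSM-R implementation.

```python
from typing import Callable, Sequence


class AllChannelsFailed(Exception):
    """Raised when none of the redundant channels could deliver the message."""


def send_with_redundancy(message: str,
                         channels: Sequence[Callable[[str], None]]) -> int:
    """Try each independent channel in order.

    Returns the index of the channel that delivered the message.
    """
    failures = 0
    for index, send in enumerate(channels):
        try:
            send(message)
            return index
        except ConnectionError:
            failures += 1  # fall through to the next redundant channel
    raise AllChannelsFailed(f"{failures} channel(s) failed")


# Hypothetical channels for illustration only.
def primary_switch(message: str) -> None:
    raise ConnectionError("primary switch unavailable")  # simulated failure


delivered = []


def standby_switch(message: str) -> None:
    delivered.append(message)


used = send_with_redundancy("emergency call", [primary_switch, standby_switch])
print(used, delivered)
```

The same pattern applies to organizational or human redundancy: the "channels" could equally be alternative procedures or roles that can perform the same function independently.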

These resilient principles are similar to some of the principles described in Lay et al. (2011) to achieve resilience, such as buffering capacity, flexibility, margins, and tolerance. The resilient principles are influenced by constraints, as described by Lewycky (1987). Lewycky describes a hierarchy from constraints through conditions that influence the chain of events. Constraints could be technical conditions, social dynamics, human actions, management systems, organizational culture, or governmental or socioeconomic policies and conditions. There are, however, interactions between resilience and constraints, since resilience as a strategy can and should influence constraints in a process such as risk management. The suggested resilient principles are described in the following with necessary constraints. The resilient principles are a selected key set of principles and could be extended based on exploration of and reflection on resilience used in risk governance. The resilient principles are further explored in Johnsen et al. (2009):

  • Redundancy is defined as having several alternate and independent ways of performing a function. The function (i.e., what) can be implemented (i.e., how) by different organizations, by different technical systems, or by different procedures. Redundancy may be achieved by standby spares, by buffers, or by concurrent use of multiple devices. Redundancy was important in this project since there was only one central GSM-R switch, and redundancy was not fully implemented in the distributed network. In our case, redundancy was explored as a combination of technical and organizational actions to improve resilience. Key constraints to improve redundancy are technical resources (in our case, backup of the GSM-R switch), organizational resources, procedures, and training when there is need.

  • The ability to perform controlled degradation and the ability to “rebound or recover” when system functions or barriers are failing. There must be an ability to perform a partial shutdown of functions, ensuring safety in intermediate states. The ability to recover may depend on knowledge so that human intervention may aid in the recovery. Effective recovery is based on both timely impact analysis and competent mobilization. Competence in the whole organization can be used in collaboration with technical systems as a contributor to resilience. The GSM-R system is distributed and complex; key elements may fail, and the organization and technical systems must be able to handle a degraded system without serious incidents and with acceptable quality of service. Key constraints to improve “controlled degradation” and to “rebound or recover” are technical conditions and design allowing “rebound and recovery” but also operational indicators, alarms, and resources to perform impact analysis and competent mobilization when there is need.

  • Flexibility in systems and organizations or diversity; having different ways of performing a function within a specific system. The systems must be designed to be flexible, to accept improvisation, and to tolerate errors. When there is a failure, the total MTO system should be flexible. Examples could be the ability to handle loss of key components in the communication network in a flexible manner, enabling key messages to be distributed through different special systems or through special emergency procedures. Key constraints to improve flexibility are technical systems, organizational resources, procedures, and training to implement the ability to be flexible when there is need.

  • Managing margins: ensuring that performance boundaries are not crossed. The ability to manage margins is a key function and must be explored. Scenarios could be used to explore the ability of the system to manage margins based on man, technology, or organization. Sacrificial decisions, i.e., decisions balancing productivity versus safety, must be a part of the scenarios. Risks may increase due to reduced safety focus when error rates decrease and reliability increases. It is important to measure and manage such drift toward margins. A measurement of “safety and security culture” may provide a good measure of this drift and of perceptions of risks and is included in our approach. A “sense of uneasiness”, as described by Weick and Sutcliffe (2001), should be developed related to the operation, to create constant vigilance of margins. A key issue is to identify indicators of areas of concern before an incident. Indicators could be used to assess human factors, technical issues, and organizational issues such as surveys of morale and motivation, as suggested by Itoh et al. (2004). Key constraints to manage margins are technical systems, organizational resources, management systems, and the existence of indicators, procedures, and training to be able to identify closeness to performance margins and the ability to act.

  • Establishment of common mental models: ensuring that communication, cooperation, and collaboration across organizations are supported and based on common understanding. Our focus is on safety and security. A mental model helps us to know what to expect, what to look for, and what to do, and it aids learning. Factors that help establish common mental models include extensive system insight and organizational knowledge, ensuring that the systems can be explored to their full extent. Mental models play an important role in understanding and describing the causes of accidents, in addition to creating a framework for learning from accidents. Management participation and involvement across organizational silos are important in creating common understanding and the possibility of reducing accidents. This is done in our approach by using stories of incidents and by exploring the risk matrix as a way to communicate common mental models. Key constraints on establishing and exploring common mental models are the necessary social dynamics, organizational resources, management systems to establish awareness, communication, procedures, and training.

In Hollnagel et al. (2006), Hale discusses resilience versus rules in railway systems in chapter 9. Our goal has been to build on this exploration and to focus on resilience as one strategy to achieve a high level of safety, security, and continuity.

2.3 From rule-based culture to learning and resilient culture

The railway industry is focused on rules and regulation. In addition, technical personnel tend to focus mostly on technology, often at the expense of organizational, cultural, and human factors. By focusing on more than rules, there is also an opportunity to establish a proactive and learning culture, as described in Reason (1997). A focus on a proactive and learning culture among the different stakeholders can ensure that different professions and organizations share a common understanding of the new risks, i.e., common mental models, and can co-operate to improve risk awareness and communication and to resolve incidents in a proactive manner, thus supporting learning as a quality that improves resilience. The railway industry, like other industries, has been fragmented among different operators and suppliers with a narrow focus. These organizational silos and differing risk perceptions may impact the risk of operations. However, close collaboration and common risk perceptions among key actors may mitigate these risks, as discussed in De Bruijne and Van Eeten (2007), through “networked reliability” based on collaboration of skilled operators. Thus, awareness of what may go wrong, responsibility, and perception of the ability to handle crises are issues impacting resilience.

A positive correlation between safety culture elements such as morale and motivation and safety in railways has been documented in Itoh et al. (2004). The assumption in this paper is that culture can be measured, managed, and manipulated as described in the functionalistic tradition by Schein (1992). One method to explore and discuss culture is called CheckIT and is described in Johnsen et al. (2007). This method has been based on an approach adapted to railways called Safe-Culture, used by the international railway organization (UIC), ref. Johnsen et al. (2005). Thus, CheckIT has been explored in this project.

CheckIT consists of 31 questions. Each question is presented with three alternative answers in a table next to it. The aim is to develop a rating of the organization on a numerical scale from 1 to 5, i.e., a Likert scale, where levels one, three, and five are textually described. The described alternatives correspond to a cultural taxonomy from denial through rule-based organization to a learning organization. This cultural taxonomy is described in Westrum (1993) and consists of three levels:

  • Denial culture (Level 1)

  • Rule-based culture (Level 3)

  • Learning and proactive culture (Level 5).
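The rating scale above can be illustrated with a minimal Python sketch. It is not the CheckIT implementation: the function names are hypothetical, and the use of an unweighted mean as the overall rating is an assumption, since the aggregation rule is not stated here. Only the 1 to 5 scale and the three textually described levels are taken from the description above.

```python
from statistics import mean

# Levels 1, 3, and 5 are textually described; levels 2 and 4 have no label.
ANCHORS = {
    1: "Denial culture",
    3: "Rule-based culture",
    5: "Learning and proactive culture",
}


def overall_rating(scores):
    """Aggregate per-question scores (assumption: unweighted mean)."""
    if not scores:
        raise ValueError("no scores given")
    if any(s < 1 or s > 5 for s in scores):
        raise ValueError("scores must lie on the 1-5 scale")
    return mean(scores)


def interpret(rating):
    """Name the described level a rating matches, or the two levels it lies between."""
    if rating < 1 or rating > 5:
        raise ValueError("rating must lie on the 1-5 scale")
    if rating in ANCHORS:
        return ANCHORS[rating]
    lower = max(level for level in ANCHORS if level < rating)
    upper = min(level for level in ANCHORS if level > rating)
    return f"between {ANCHORS[lower]} and {ANCHORS[upper]}"
```

For example, `interpret(3.8)` places a rating of 3.8 between the rule-based and the learning and proactive culture.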

3 Results

In this section, a description is given for identified key unwanted incidents, key risks documented in a risk matrix, and the suggested key mitigating actions. The results of the survey of culture are presented. The levels of actual incidents in the period from 2005 to 2010 are documented, and future risks and mitigating actions are discussed.

3.1 Unwanted incidents and unwanted situations

The unwanted incidents (U..) were documented, and for each incident a mitigating action (A..) was identified and is documented in the next section. The key unwanted incidents were as follows:

  • (U1) Stop of central GSM-R communication. A technical error or mishap in the GSM-R system leads to loss of GSM-R functionality and a subsequent halt in all train traffic in Norway. There is no independent backup of the GSM-R system; the GSM-R system is a single point of failure. At present, organizational routines have not been established to enable the train traffic to function satisfactorily when the GSM-R communication system is lost. (Mitigation: A1 and A1.2, see next section).

  • (U2) Stop of regional GSM-R communication, through loss of a local BSC. Common failures in the infrastructure may lead to cascading errors and a halt in communication in key areas. A fire at the Oslo S central station on 2007-11-28 removed power from much of the signaling equipment, including the GSM-R communication equipment (the local BSC) used in the Oslo area. This halted all train traffic in Oslo, the Norwegian capital, for 20 h, ref Utne et al. (2009). (Mitigation: A2).

  • (U6) Unanticipated human errors due to poor training of short-time contract employees and too few employees with high experience in permanent positions. (Mitigation: A6).

  • (U7) Poor risk understanding across the complex organization consisting of several actors such as JBV, operators (such as NSB), and SJT. In addition, poor risk communication across different technical groups working with diverse technologies such as railway and GSM-R system, and missing common mental models related to key risks. (Mitigation: A7.1).

  • (U25) Poor resilience and MTO ability to handle crisis and recover, due to poor scenario training and poor crisis management across the organization. (Mitigation: A25).

These unwanted incidents are based on the analysis of accidents and incident reports performed by internal and external experts. Some “near misses” in operations were characterized as poor organizational resilience, i.e., poor knowledge (anticipation, attention, and response) among short-time contract employees and too few employees with high experience. In addition, the fire at the Oslo S central station on 2007-11-28 revealed poor resilience in key areas. A great deal of energy was used to mitigate the consequences of the incident, but the technical and organizational constraints had not been shaped by resilience considerations, ref DSB (2008), and it seemed an accident “waiting to happen”. Thus, this incident was explored to improve resilience.

3.2 Risk matrix and mitigating actions

All unwanted incidents were given a probability and a consequence based on expert judgments from the group and external experts. The consequences were categorized in five classes: insignificant (1), minor (2), moderate (3), major (4), and extreme (5). The likelihoods were categorized in five classes: rare (1), unlikely (2), possible (3), likely (4), and almost certain (5). These classes were quantified; as an example, “rare (1)” was defined as “once in 100 years, or more seldom”. Based on probability and consequence, the unwanted incidents were placed in a risk matrix.

The reason to explore the risk matrix was to get aid in prioritizing unwanted incidents that should be mitigated, i.e., incidents with high risk as defined by high consequence and high likelihood (red area in the risk matrix). When discussing mitigating actions, there must be an exploration on how to reduce consequences and/or reduce likelihoods.
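As an illustration of how such a matrix can support prioritization, the following Python sketch places incidents in a 5 x 5 grid and sorts them. The class labels are those given above. The colouring rule (likelihood multiplied by consequence, compared with the thresholds 15 and 6) is an assumption made for this sketch and is not the rule used in the assessment, and the incident identifiers and ratings in the usage note are hypothetical.

```python
from dataclasses import dataclass

CONSEQUENCE = {1: "insignificant", 2: "minor", 3: "moderate", 4: "major", 5: "extreme"}
LIKELIHOOD = {1: "rare", 2: "unlikely", 3: "possible", 4: "likely", 5: "almost certain"}


@dataclass(frozen=True)
class Incident:
    ident: str
    likelihood: int   # class 1-5, see LIKELIHOOD
    consequence: int  # class 1-5, see CONSEQUENCE

    def __post_init__(self):
        if self.likelihood not in LIKELIHOOD or self.consequence not in CONSEQUENCE:
            raise ValueError("classes must be integers from 1 to 5")

    @property
    def score(self):
        # Assumed risk score: product of the two class numbers.
        return self.likelihood * self.consequence


def zone(incident, red_from=15, yellow_from=6):
    """Colour an incident using assumed score thresholds."""
    if incident.score >= red_from:
        return "red"
    if incident.score >= yellow_from:
        return "yellow"
    return "green"


def prioritize(incidents):
    """Return incidents sorted by descending score, so red-zone items come first."""
    return sorted(incidents, key=lambda inc: inc.score, reverse=True)
```

With hypothetical ratings, `prioritize([Incident("X2", 2, 2), Incident("X1", 3, 5)])` lists `X1` first, and `zone(Incident("X1", 3, 5))` returns `"red"` under the assumed thresholds.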

In addition, the risk matrix was used as a communication tool across organizational silos to ensure common understanding of risks based on consequences and probabilities. Based on exploration of the risk matrix, the priorities from the attendees and expert judgment, the suggested prioritized mitigating actions were as follows:

  • A1: Establish redundancy by duplicating the core GSM-R system via an independent backup system, in order to sustain communication even if there are failures in the central GSM-R complex. The cost of an independent backup solution is estimated to be around USD 30 million (Ref. U1).

  • A1.2: “Rebound and recover”; improve organizational resilience if/when GSM-R fails, to ensure some sort of “degraded” but safe train traffic and operation with reduced GSM-R functionality. It is suggested to establish competencies, collaboration, and manual procedures that can be used to manage the traffic when the GSM-R system halts or has a serious error. This poses a challenge, demanding collaboration between different organizations (Ref. U1).

  • A2: Improve technical resilience by redundancy, i.e., duplicating and separating key distributed GSM-R components in different locations, for example with different power supplies, to avoid common cause failures. This is a technical issue but demands investments and allocated resources (Ref. U2).

  • A6: Increase permanent manning in safety critical areas and prioritize training, in order to increase knowledge, experience, flexibility, and redundancy. This demands management actions in order to increase budgets and allocate increased manning (Ref. U6).

  • A7.1: Perform a search conference in collaboration between key actors in JBV and train control, in order to improve risk awareness and risk perceptions, improve collaboration, and support the deeper double-loop organizational learning described by Argyris and Schon (1978). Key results could be to improve common mental models of risks and risk communication across suppliers and operators; to explore and improve resilience or redundancy in collaboration between key stakeholders; and to improve flexibility and the ability to manage safety margins in a stressful environment. This is a “soft” issue that demands resources, collaboration, and co-ordination between several organizations, something that could be a challenge (Ref. U7).

  • A25: Increase scenario training on a set of defined crises, such as loss of communication, loss of power, or loss of critical communication equipment such as a BSC, in order to build resilience and be able to handle unwanted incidents with greater competence and skill across the organization (Ref. U25).

The cost of duplicating the GSM-R system, i.e., action A1, was evaluated in a cost/benefit analysis. The cost of a 1-day stop of the GSM-R service in key areas has been estimated to be between 3 and 10 million USD, see DSB (2008) and Hestnes (2008). If the central GSM-R solution is destroyed, a replacement may take up to 90 days to establish, and thus, the 30 million USD cost of a GSM-R backup seems to be justified by the benefit.
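The arithmetic behind this comparison can be sketched in Python using only the figures quoted above. The sketch assumes that the loss grows linearly with the number of outage days and that the daily cost estimated for key areas applies to a failure of the central solution; it does not weight the loss by the probability of such a failure. The constant and function names are hypothetical.

```python
BACKUP_COST_MUSD = 30        # estimated cost of the backup solution (action A1)
DAILY_LOSS_MUSD = (3, 10)    # estimated cost range of a 1-day GSM-R stop
MAX_REPLACEMENT_DAYS = 90    # upper bound on the time to replace a destroyed central


def outage_loss(days, daily_loss):
    """Total loss in million USD, assuming a constant daily loss."""
    return days * daily_loss


def break_even_days(backup_cost, daily_loss):
    """Outage length at which the avoided loss equals the backup cost."""
    return backup_cost / daily_loss


low, high = DAILY_LOSS_MUSD
worst_case_range = (outage_loss(MAX_REPLACEMENT_DAYS, low),
                    outage_loss(MAX_REPLACEMENT_DAYS, high))
break_even_range = (break_even_days(BACKUP_COST_MUSD, high),
                    break_even_days(BACKUP_COST_MUSD, low))
```

Under these assumptions, a 90-day outage would cost between 270 and 900 million USD, and the backup would pay for itself if it prevented a single outage of 3 to 10 days.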

The next steps in the risk management were to ensure that the mitigating actions were prioritized and executed, in addition to establishing learning and reflection periodically, or whenever incidents happened or could have happened.

3.3 Prioritizing mitigating actions and politics

The described mitigating actions were discussed with line management and other stakeholders such as safety staff. The safety staff did not recommend using the risk matrix. The risk matrix is complex, and the assessment of risks and consequences could be open to different interpretations. If a risk was positioned in the “red” area, indicating that it had to be resolved, this could reduce the freedom of management and staff to prioritize mitigating actions. This was mentioned as a key issue related to (U1), loss of GSM-R functionality. The argument was that these significant actions should be the responsibility of line management and should not be “forced” through the use of a risk matrix. The authorities could also demand that mitigating actions be performed for risks in the “red” area, without enabling management to decide what should be prioritized. By not using the risk matrix, management could retain decision-making power in the line organization. This could improve the ability to learn based on own processes and discussions and avoid piecemeal decisions based on one risk assessment from one sector.

However, due to involvement from the line management, the risk matrix was accepted, and management at JBV prioritized the suggested main activities. Thus, the following mitigating activities have been prioritized and implemented:

  • A1: Duplicating the core GSM-R functionality at an estimated cost of USD 30 million, as prioritized by the line management, increasing resilience.

  • A2: Duplicating a set of local distributed GSM-R components, in order to increase resilience.

  • A6: Increasing permanent manning in safety critical areas, expanding necessary manning in order to increase resilience.

  • A25: Increasing scenario training on defined crises, in order to increase resilience.

The following activities are “in work”:

  • A1.2—Improvement in organizational resilience during technical failures of GSM-R.

  • A7.1—Perform search conference in collaboration between key actors.

The risk assessment process and the work performed were assessed by JBV in Q2-2009, and the project got positive response on process and results.

3.4 Survey of safety and security culture

The questionnaire-based survey was performed in 2009 and 2010. The total “culture rating” in 2009 from the questionnaire was a subjective assessment of 3.7. The total “culture rating” in 2010 from the questionnaire was a subjective assessment of 3.8, i.e., an assessment between a rule-based culture (score 3) and a learning organization (score 5). This is an individual assessment based on a small sample of 30 respondents.

The three issues with the highest grade from the survey in 2010 were as follows:

  1. Knowledge of what may go wrong: 4.6

  2. Safety when surfing on the Internet: 4.4

  3. Responsibility of safety: 4.3

The three issues with the lowest grade from the survey in 2010 were as follows:

  1. Safety issues when using mobile devices: 3.1

  2. Sufficient system training: 3.2

  3. Safety when supplier is performing work: 3.4

Since the issue with the highest evaluation of the 30 questions was “Knowledge of what may go wrong”, this could be interpreted as a positive ability to know the risks of operations in JBV, and this may be one of the key results of the risk assessment that we have performed. Knowledge of what may go wrong and clarity of responsibility have been given high marks, indicating that the stakeholders in JBV are moving toward a learning and resilient organization in these areas. As seen from the issues with the lowest grades, knowledge related to mobile devices and sufficient system training were key areas to be improved, in addition to collaboration with suppliers.

The survey is difficult to discuss as an independent survey, since the assessment is subjective and thus relative. However, the development from 2009 to 2010 is interesting, i.e., what has changed in the period? One increase of 0.6 and one decrease of 0.3 are highlighted in the following. The “planning and perception of ability to handle crisis” improved from a score of 3.5 in 2009 to a score of 4.1 in 2010; this is an improvement of 0.6 and indicates that the work with scenario analysis and crisis management has impacted the perceptions of the workforce, in addition to establishing routines to be used in a crisis situation. This result indicates an increase in the ability to be resilient. This was suggested to be a result of the mitigating activity A25 from the risk assessment, scenario training on a set of defined crises.

A reduction was identified in one area, related to the ability to “point out potential problems and errors toward colleagues”; this ability was reduced from 3.9 to 3.6 in the period from 2009 to 2010. This is a reduction of 0.3 and may indicate that the open “reporting culture towards colleagues” has been reduced in the period. This is an area of concern to management, since this is a reduction related to sustaining an open learning culture, a fundamental ability to support safety.
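The year-on-year comparison can be expressed as a short Python sketch. The three scores are those reported above; the function name and the reporting threshold of 0.3 are assumptions chosen so that the two highlighted changes are flagged, and they are not part of the CheckIT method.

```python
def score_changes(before, after, threshold=0.3):
    """Return {issue: change} for issues whose absolute change meets the threshold."""
    changes = {}
    for issue in before.keys() & after.keys():
        # Round to one decimal, matching the precision of the reported scores.
        delta = round(after[issue] - before[issue], 1)
        if abs(delta) >= threshold:
            changes[issue] = delta
    return changes


ratings_2009 = {
    "total culture rating": 3.7,
    "ability to handle crisis": 3.5,
    "pointing out problems to colleagues": 3.9,
}
ratings_2010 = {
    "total culture rating": 3.8,
    "ability to handle crisis": 4.1,
    "pointing out problems to colleagues": 3.6,
}

flagged = score_changes(ratings_2009, ratings_2010)
```

Here `flagged` contains the increase of 0.6 and the decrease of 0.3, while the change of 0.1 in the total rating falls below the assumed threshold.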

If JBV wants to sustain or improve abilities of a learning organization, actual unwanted incidents should be an area of exploration and discussion in the local organization, involving management when this is relevant related to safety or resilience.

3.5 Actual incident in 2010, after the risk assessment

In the risk assessment, it had been important to point out the risk of loss of communication in the GSM-R system and to suggest that the resilience of the GSM-R system had to be improved by establishing a redundant GSM-R system. On 2010-03-29, the GSM-R system did fail, and all the trains in Norway had to stop for 3 h due to the failure, Ref NRK (2010a). The authority, SJT, regulates this decision to stop all the trains. Luckily, the top management could tell the critical media and SJT that they had decided to invest USD 30 million in a backup GSM-R central, to be used when the central GSM-R system failed, Ref NRK (2010b). This incident is referenced to show that such a major mishap may happen, and it demonstrates that one of the identified risks did materialize. If the system had been redundant, it should not have halted.

3.6 GSM-R regularity in the period from 2005 to 2010

We have assumed that GSM-R incidents influence train regularity, and train regularity due to GSM-R incidents has been seen as a precursor of safety related to GSM-R. We have gathered reported incidents related to GSM-R failures in key traffic areas (i.e., the sections Drammen-Eidsvoll, Trønderbanen, and Bergensbanen) in the period 2005 through 2010. Delays due to GSM-R have been minimal. Some incidents happened in 2005 and 2006, and the description of these incidents was a part of our risk assessment. The incidents in 2005 and 2006 were partly due to missing competence and staffing in key areas; the staffing in the organization was therefore increased, which could improve the resilience of the organization. There were no GSM-R related incidents in 2007, 2008, or 2009 impacting regularity (or delays). A few incidents happened in 2010. The most severe incident was a failure of the central GSM-R lasting 3 h. If resilience had been better, such as through technical GSM-R redundancy or through organizational redundancy allowing some railroad traffic in key areas, the system should have sustained operations. However, the system seemed partly resilient, since no accidents happened and regular operations were resumed after 3 h.

In the period 2005 to 2010, the percentages of delays due to GSM-R incidents were as follows:

  • Year 2005: 0.11% delay

  • Year 2006: 0.02% delay

  • Years 2007, 2008, and 2009: 0.00% delay

  • Year 2010: 2.82% delay.

No accidents have happened due to failures in the GSM-R system in the period 2005 to 2010.

3.7 Future risks and mitigating actions

The complete risk assessment identified several issues that were not prioritized as key issues by the expert group or project team, and they are not mentioned in this article, but the issues are documented in the complete analysis.

The operational status of the GSM-R system and possible developments or improvements should also be reviewed based on an assessment of key safety issues such as stability and quality of the communication, i.e., through key indicators such as:

  • The technical “up time” of the GSM-R switch and stability of decentralized communication equipment such as BSC and communication equipment in the train central and on board the trains.

  • The number of unwanted incidents related to communication failures.

  • The regularity of railway operations—and delays due to GSM-R incidents.

A review of all these factors mentioned above has not been a part of this assessment, due to limitation in scope and difficulties getting hold of relevant data. These issues should be assessed, in order to sustain focus on safety, security, and resilience in the transportation network.

Risk assessment and mitigating actions are continuous activities dependent on a dynamic and changing threat picture.

4 Conclusion and suggestions for further research

In the following, we have reflected on the key research issues, validity, and areas of further research.

The preliminary hazard analysis, based on a broad socio-technical approach to safety, seems to have identified relevant major risks related to organization and human factors, based on the incidents in 2010. The exploration of action research and collaboration through search conferences seems to have supported common understanding and common models of risks and mitigating actions across different stakeholders and competencies, since the key actions have been implemented.

The main MTO mitigating actions have been prioritized through collaboration between management and the workforce. The resilience of the system has been improved, due to improved scenario training, improved organizational redundancy (through increased manning), and improvement in technical redundancy.

Safety and resilience of operations have been explored based on actual train delays and disturbances, in addition to a subjective assessment of awareness of what may go wrong, responsibility, and ability to handle crisis. The development of knowledge and risk awareness seems to have been impacted and improved through the risk assessment as documented by the CheckIT questionnaire.

Resilience seems to be a useful strategy to mitigate security issues and uncertainty, since it improves the capability to cope with surprises, improves diversity, and allows for flexible responses. Resilience thus seems to be an important perspective when discussing security and uncertainty.

The paper suggests extending the risk assessment process through including resilience and action research to improve safety in complex settings and during uncertainty.

Validity of the risk assessment and the mitigating actions is difficult to ascertain over such a short period since 2008. We cannot reject that the risk assessment has improved resilience. We have explored improvement of safety and resilience through triangulation, i.e., surveys, expert discussions, and exploration of regularity data. In addition, the mitigating actions have impacted technology, organizational issues, and knowledge and awareness. These issues indicate that safety and resilience are improved. Key mitigating actions were prioritized and included in plans and budgets, and key suggested actions have been implemented. One of the identified hazards did happen in 2010. This fact enabled JBV to stand out as a safety-oriented and proactive organization to key stakeholders, the public, and the safety authority (SJT). However, validation of the risk assessment and the mitigating actions must be based on systematic exploration of relevant safety data such as the number of incidents, the number of accidents, and key risk and safety indicators, in addition to a critical review of suggested actions over an extended period of time. This activity should be an ongoing part of safety management activities. We have assumed that GSM-R incidents influence train regularity, and train regularity due to GSM-R incidents has been seen as a precursor of safety. Train regularity was excellent in 2007, 2008, and 2009. A few incidents happened in 2010.

Involvement from the different experts in the company, the workforce, and management seems to have created a risk assessment process supporting resilience. Common mental models (i.e., common knowledge and perceptions) helped to mitigate opposition and resistance through open and sound discussions. Political issues and power issues in implementing mitigating actions cannot be underestimated. However, a broad-based participatory approach can help in bridging political issues and power issues. The open internal process and the possibility of involvement from external stakeholders such as the media and safety authority (SJT) have supported the mitigating actions, and safety and resilience seem to be strengthened in such crossfire.

The technical issues were prioritized and executed at JBV with more ease than organizational and human factors issues. However, perceptions and knowledge related to organizational factors and human issues seem to have been improved and may impact the later exploration of organizational and human factors issues.

Development and improvement of safety and security culture is a long-term activity. Assessment of safety and security culture is difficult, but it raises issues of importance. The surveys document improvement in areas that have been prioritized by management, but also that new challenges have been raised. At the same time, there is a need to explore the relationship between the survey results and the actual levels of safety and security. An assessment of the actual safety level and risk level of the GSM-R system should be performed annually. Issues such as the number of unwanted incidents, stability of communication, and other key critical safety issues should be explored to aid in this process. Our conclusion is that a periodic survey of safety and security culture is a useful supplement to the risk assessment and should be sustained.

We would like to document and analyze future incidents and accidents to improve the understanding of the extended risk assessment process and how exploration of resilience may impact safety and security in operations. We are thus suggesting continuing our work, based on interviews, surveys, and workshops/discussions performed periodically to observe development of safety, security, and resilience.