1 Introduction

1.1 Formal Methods and Risk Management

Risk Management. Risk management [4, 26] is something we do every day: we lock our houses to prevent burglary; our health insurance covers the financial consequences of hospital visits; cars are checked yearly to prevent failures; we back up data to not lose valuable information; we wear seat belts when driving; we double check if we have not forgotten our phones, etc.

In industry, such decisions are made on a large scale: access policies determine which employees can enter the building; companies insure their employees against work accidents; regular maintenance keeps production plants up and running; backups are performed to not lose valuable data; helmets and safety glasses protect employees against injuries; the four eyes principle – reviewing critical tasks by at least two people – enhances accuracy. However, such measures cost time and money and can be inconvenient; insurance and COVID masks are prime examples here. Thus, a key concern is to select effective measures to lower the most prominent risks [26]: The overall goal of risk management is to support decision-making on (cost-)effective measures that keep risks below an acceptable level.

Formal Methods. Formal methods refer to mathematically rigorous techniques for the specification, development, analysis, and verification of software and hardware systems [15, 22]. In this tutorial, I adopt a broader definition, following e.g. [55]: Rather than focusing on software and hardware, I will consider any kind of system. These include physical systems, such as biological and financial systems, but also services, procedures and missions. The broader definition, also taken in [23, 55], enables a better comparison with Risk Management, which also covers many domains, such as technological, environmental, financial, and social risks. Moreover, Formal Methods have actually been applied to a wide variety of systems, including biological systems [39, 52], chemical systems [1], business processes [41], and human behavior.

What sets formal methods apart from other disciplines is not the systems considered, but rather the methodological approach: modeling languages to conveniently specify the system under study, as well as languages to express its properties; formal syntax and semantics for these languages; rigorous analysis techniques that have been proven correct; and compositional approaches to build large systems from smaller components, where especially the interaction between components matters. However, even with this narrower scope of formal methods, there are strong links with risk management.

1.2 Formal Methods Versus Risk Management

Formal methods and risk management have strong links. They strive for the same goals, namely high-quality and reliable systems without surprises. Their means are quite different, though: formal methods focus on mathematical techniques, while risk management largely relies on informal methods.

Formal Methods for Risk Management. The area of Formal Methods has made numerous relevant contributions to the area of risk management. These fall into three categories.

First, the field of Formal Methods has developed and strengthened a wide variety of risk assessment frameworks. Risk assessment is part of the risk management cycle (cf. Sect. 5) concerned with the identification, prioritization, and evaluation of risks. The area of Formal Methods has equipped a broad range of industrial risk assessment frameworks (such as fault trees [18], reliability block diagrams [43], the AADL language) with rigorous semantics, more modeling power and efficient analysis algorithms. These techniques enabled analyses of systems that were not possible before [12, 32].

Second, these techniques were possible due to more fundamental advances in the underlying stochastic analysis methods, enabling the analysis of systems with gigantic state spaces for a plethora of properties and metrics. Relevant approaches include Bayesian analysis [46], scenario optimization [13], Monte Carlo simulation [53], stochastic and statistical model checking [37], reinforcement learning [56], stochastic optimization and control [11]. Since uncertainty is a key ingredient of risk, stochastic and statistical analysis is a relevant area for risk analysis.

Finally, numerous methods have been developed to alleviate risks in software systems: rigorous specification, design and software verification, model-based testing, algorithms and architectures for fault-tolerant computing, monitoring and run-time verification, and debugging are all relevant techniques to reduce system risk. The risk management terminology classifies these as risk treatment strategies.

Many of these contributions were published in venues where formal methods intersect with (technical) subfields of risk analysis, such as dependability analysis, reliability engineering, and safety analysis, with relevant conferences, such as DSN, FMICS, FM, NFM, QEST and SAFECOMP.

Risk Management for Formal Methods. Despite these relevant contributions, risk thinking is not in the standard repertoire of researchers in formal methods. This is a pity, for three reasons. The first argument I would like to put forward is that the field of Formal Methods is in a unique position to make risk management more accountable: more systematic, transparent, and quantitative. It is well known, especially through the work of Nobel laureate Daniel Kahneman [33, 34], that people have bad intuitions for risk. His numerous experiments repeatedly show how poorly people assess chances and risks, and numerous cognitive biases have been identified that influence people's perception of risk. According to Kahneman, there are two systems at work in our brains: System I, which makes decisions quickly and automatically, and System II, which is slow, systematic, and rational. System I is handy when we need to locate objects or interpret facial expressions, but it is not suitable for assessing probabilities; that is better left to System II. Even though there is no scientific evidence, it is my firm belief that, through their rigorous approach, Formal Methods foster System II thinking.

My second argument is that viewing Formal Methods through the lens of risk management provides a better perspective on where Formal Methods matter most, namely in addressing the most prominent risks: What do people and organizations care about most? What are their concerns? Which formal methods are appropriate to address these concerns? Where in the system development cycle can formal methods contribute most? This point of view aligns well with the plea in the recent Manifesto for Applicable Formal Methods [24], which advocates integrative practices.

Finally, incorporating formal methods into risk management can amplify their importance and influence. Whereas few people know formal methods, almost everybody knows risk management. Framing formal methods in terms of risk management can increase the relevance and contributions of our field to the general public, politicians, and business managers.

To reap the benefits above, an understanding of risk management is needed, which is exactly the goal of this tutorial.

1.3 This Tutorial

Objective. The objective of this tutorial is to provide an overview of the basic concepts, principles, and terminology of risk management for researchers in Formal Methods. Explicating the strong links between the fields of Formal Methods and Risk Management allows formal methods researchers to better align their research with concerns and practices in industry and society.

With its focus on terminology and concepts, this tutorial is neither very technical nor very formal.

Didactic Set Up. This tutorial starts with the basic concepts and then zooms out to the organizational processes supporting risk management.

Thus, I first discuss the various definitions of risk and related concepts, and delve into the three main elements of risk: objectives, consequences, and uncertainty. Once the main system risks have been identified, the question of what actions are to be taken arises: which interventions (if any) are needed to decrease the system risks? Four common risk intervention strategies will be discussed: tolerate, terminate, transfer, and treat. Next, the tutorial will cover the risk management cycle (PDCA), the starting point for most organizations. It will also discuss the formal methods recommended by ISO standards.

Organization. Section 2 discusses the definitions of risk and related concepts. Section 3 reviews the elements of risk: objectives, consequences, and uncertainty, and Sect. 4 discusses the four risk treatment strategies: tolerate, terminate, transfer, and treat. The PDCA cycle and the role of standards are discussed in Sects. 5 and 6, respectively. Section 7 concludes the tutorial.

2 Risk and Risk Management

2.1 What Is Risk?

A first problem encountered in risk management, perhaps especially for the Formal Methods community, is that there is no standard definition of risk. Not only has the concept of risk evolved over time; even today, many different definitions of risk coexist across disciplines, professional societies, scientists, and international standardization bodies. Here are some of them:

  • The Institute of Risk Management [28] defines risk as the combination of the probability of an event and its consequences.

  • The Threat Analysis Group [59] defines risk as the combination of asset, threat, and vulnerability.

  • The PRINCE2 method [7] defines risk as: An uncertain event or set of events that, should it occur, will have an effect on the achievement of objectives.

  • Fenton & Neil [21] assume risks to be unfavorable events influenced by factors. Such factors and their interactions might be random or uncertain.

  • Kaplan & Garrick [35] define risk as a set of triplets \((s_i,p_i,x_i)\) of a scenario \(s_i\), a probability \(p_i\) and a consequence level \(x_i\).

  • The ISO 31000:2009 standard [30] on risk management defines risk as: the effect of uncertainty on objectives.

Sources of debate are (1) whether risk includes both negative and positive consequences, or only negative ones, and (2) whether the likelihood of events should be interpreted as a probability, or in the broader terms of uncertainty. Despite significant efforts, the Society for Risk Analysis has concluded that settling on a single definition is not realistic.
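Kaplan & Garrick's triplet view above lends itself to a direct encoding. The sketch below is illustrative only: the scenario names, numbers, and the expected-loss aggregate are my own assumptions, not part of the original definition.

```python
from dataclasses import dataclass

# Risk as a set of triplets (s_i, p_i, x_i), following Kaplan & Garrick.
# Field names are illustrative, not taken from the original paper.
@dataclass(frozen=True)
class RiskTriplet:
    scenario: str       # s_i: what can happen?
    probability: float  # p_i: how likely is it?
    consequence: float  # x_i: if it happens, what is the damage?

risk = {
    RiskTriplet("server outage", 0.05, 20_000.0),
    RiskTriplet("data breach",   0.01, 500_000.0),
}

# One possible aggregate over all scenarios (not part of the definition):
expected_loss = sum(t.probability * t.consequence for t in risk)
print(round(expected_loss, 2))  # 6000.0
```

Note that the triplet form keeps the scenarios explicit, whereas an aggregate such as the expected loss collapses them into a single number.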

A summary of various definitions of risk over time can be found in [48] and in more technical terms in [3]. In his seminal work Against the Gods: The Remarkable Story of Risk, the economist and financial historian Peter Bernstein provides an account of the history of risk from ancient Greece until today [10].

In this tutorial, I will use the ISO 31000 definition of risk:

$$\begin{aligned} \textit{the effect of uncertainty on objectives}. \end{aligned}$$

This definition is relatively widely accepted and puts the emphasis on objectives. For example, the U.S. National Institute of Standards and Technology (NIST) has also adopted this definition in the context of cybersecurity.

Remarks. Some remarks on the definitions of risk: (1) All definitions are application-independent. They apply to financial risks for a venture capitalist, high-tech risks concerning self-driving cars, and everyday risks like choosing travel insurance. (2) The definitions apply to various artifacts (products, processes, services, and missions), collectively termed systems. Whether it is a medical device, a banking procedure, or a military mission, the same risk management principles apply. (3) The definitions are relevant to all phases of a system's life cycle: requirement gathering, design, implementation, operation, and dismantling.

2.2 Risk Categories

Risks are often classified based on the organizational level they address:

  • Strategic risks concern the organization’s mission and long-term strategy, concerning e.g., market expansion, technological innovation, and mergers.

  • Tactical risks affect the implementation of strategies in operations, projects, and processes, e.g., supply chain disruptions or regulatory changes.

  • Operational risks are specific to the organization's internal processes and include human errors, technology failures, or fraud incidents.

  • Compliance risks involve adherence to law, regulations, and internal policies.

Most research in formal methods focuses on operational and compliance risks, analyzing risk in airplanes, self-driving cars, and robots. Formal methods to address tactical risks have also been developed, especially in the context of enterprise risk modeling frameworks and business process modeling.

2.3 Related Terminology

Risk Versus Risk Level. Some older definitions define risk as the statistically expected loss, i.e., the product of probability and impact [49]. This definition reduces risk to its quantitative aspects; it is better to refer to this product as the risk level of an event e:

$$\begin{aligned} \textsf{RiskLevel}(e) = \textsf{Prob}(e) \times \textsf{Impact}(e) \end{aligned}$$

While standard, the definition of risk level does have limitations: it takes into account neither the evolution of failure probabilities over time, nor the fact that the probability and impact of an event may be uncertain themselves. Nevertheless, the risk level is a popular way to get a quick overview of the relative importance of several events in terms of risk.
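The formula above can be applied directly to rank events. This is a minimal sketch; the event names, probabilities, and impact figures are invented for illustration.

```python
# RiskLevel(e) = Prob(e) x Impact(e), computed for a few hypothetical events.
events = {
    "disk failure":    (0.10, 5_000),      # (probability per year, impact in EUR)
    "office fire":     (0.001, 2_000_000),
    "phishing attack": (0.30, 10_000),
}

risk_levels = {name: p * impact for name, (p, impact) in events.items()}

# Rank events by risk level, highest first:
for name, level in sorted(risk_levels.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {level:.0f}")
```

Note how the ranking can be counter-intuitive: the rare but devastating office fire ends up below the mundane phishing attack, which is exactly the limitation of expected-loss rankings for HILP events discussed below.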

Risk Versus Hazard. A hazard is defined as a source of danger [6]. Risk includes the likelihood that the hazard will lead to actual loss, injury, or damage.

Risk Versus Resilience. A popular term in the context of risk is resilience, i.e., the ability to withstand, adapt to, and recover from disruptive events or crises. Thus, resilience can be seen as a risk treatment strategy (cf. Sect. 4) that emphasizes impact reduction. While risk focuses on negative outcomes, resilience emphasizes the ability to recover and thrive despite challenges.

3 The Ingredients of Risk

This section reviews the three main ingredients of risk from the ISO 31000 definition: objectives, effect (described as impact) and uncertainty. First, I introduce a common visualization tool: the risk matrix.

3.0 The Risk Matrix

Fig. 1. Risk matrix

A risk matrix, a.k.a. risk priority heat map, is a popular tool to visualize risks. As shown in Fig. 1, this diagram plots the likelihood of an event against the severity (or impact). Such maps yield a quick overview of the risk landscape and help prioritize risks, as critical risks require the most attention and low risks the least.

Despite their merits, risk matrices should be handled with care, since several phenomena are not taken into account:

  • The risk matrix presents a snapshot of the likelihood and impact of events at a certain point in time. Their evolution over time is not reflected in the matrix.

  • The likelihood and impact of an event are often uncertain themselves. This could be accommodated in the risk matrix by plotting events not as dots but as areas, but this is rarely done.

  • Dependencies and causal relations between risks cannot be accommodated.

Moreover, special attention should be paid to so-called HILP (High Impact, Low Probability) events. Examples are nuclear power plant explosions and the Ever Given cargo ship that blocked the Suez Canal in 2021, disrupting supply chains worldwide. HILP events sit in the low-probability, high-impact corner of the risk matrix and are therefore categorized as medium risk level. However, due to their enormous impact, HILP events require special attention.
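A risk-matrix classification can be sketched as a small function. The thresholds and the 1-5 rating scales below are assumptions for illustration; real matrices use organization-specific scales.

```python
# Toy risk-matrix classifier: likelihood and impact are each rated 1-5.
def classify(likelihood: int, impact: int) -> str:
    score = likelihood * impact
    if score >= 15:
        return "critical"
    if score >= 8:
        return "high"
    if score >= 4:
        return "medium"
    return "low"

def is_hilp(likelihood: int, impact: int) -> bool:
    # High Impact, Low Probability: flagged separately, because its
    # likelihood x impact score lands in an unremarkable matrix cell.
    return impact >= 5 and likelihood <= 2

print(classify(1, 5), is_hilp(1, 5))  # medium True
print(classify(4, 4), is_hilp(4, 4))  # critical False
```

The `is_hilp` flag makes the HILP caveat concrete: a (1, 5) event is only "medium" by score, yet warrants special attention.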

Fig. 2. Risk matrix from the World Economic Forum

Example. Figure 2 shows a risk matrix from the Global Risks Report 2022 by the World Economic Forum [61]. It visualizes the largest risks perceived by companies worldwide.

3.1 Ingredient 1: Objectives

Let us turn to the ingredients of risk. Recall the ISO 31000 definition of risk as the effect of uncertainty on objectives. In fact, one can say that, according to this definition, there is no risk, if there is no objective. Although such objectives are usually not specified in a formal way, they should be formulated as accurately as possible. Three guidelines are important:

  • Objectives should be formulated in a SMART (Specific, Measurable, Attainable, Realistic and Time-bound) way, where concreteness and realism are the most important focal points.

  • The objectives should be shared and agreed on with all stakeholders.

  • Objectives should not stimulate perverse behavior.

3.2 Ingredient 2: Impact

The second ingredient is impact. Evidently, different events can lead to a variety of outcomes. Typically, these outcomes are classified into five major impact classes:

  • Cost. Accidents are expensive, including covering the costs to recover from injuries and cyber incidents. Budget overruns in (software) projects are also common.

  • Time. Sometimes, risk causes delays. Natural disasters can stop the supply chain and delay project delivery. In software projects, delays are a recurring risk.

  • Reputation. Reputation is often underestimated, but is often considered one of the most severe consequences, having a long-term impact on operations and relationships with stakeholders.

  • Health & safety. These risks encompass harm and loss of life as a result of motor accidents, fires, explosions, errors in health care and natural disasters. Technology-related safety risks include exposure to toxic chemicals, defective airbags, faulty medical devices, and privacy breaches.

  • Quality. Compromises in the quality of a product or service are serious, as they often directly affect the company's mission. In software systems, bugs are notorious, contributing to the pervasive view that software testing should be risk-based.

Recently, a sixth factor has emerged, namely sustainability: the impact on environmental and social needs of future generations.

Severity Classes. In practice, standard severity classes are often used, rating the impact on a point scale. The following 4-point scale is used by hospitals for patient safety, showing how the medical field operationalizes severity. This scale can be seen as a discretization of severity levels combined with a description specific to the setting.

Scale            Patient safety
1. Minor         Discomfort
2. Moderate      Light injury
3. Critical      Permanent injury
4. Catastrophic  Patient death
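The hospital scale above is easily operationalized as a lookup. The descriptions are copied from the table; the dictionary encoding itself is my own sketch.

```python
# The 4-point patient-safety severity scale as a lookup table.
SEVERITY = {
    1: ("Minor", "Discomfort"),
    2: ("Moderate", "Light injury"),
    3: ("Critical", "Permanent injury"),
    4: ("Catastrophic", "Patient death"),
}

def severity_class(level: int) -> str:
    name, description = SEVERITY[level]
    return f"{level}. {name}: {description}"

print(severity_class(3))  # 3. Critical: Permanent injury
```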

Quantitative Formal Methods. Many formal methods quantify impacts as real numbers, in the form of rewards, cost, price, or utility, for example, in reinforcement learning techniques, stochastic games, Markov reward models, timed priced games etc.
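As a small illustration of one such model, a Markov reward model, consider the expected cost accumulated before a system fails. The states, transition probabilities, and per-step costs below are invented; the expected reward satisfies the fixed-point equation x(s) = r(s) + Σ P(s, s') · x(s'), with x(failed) = 0, solved here by simple value iteration.

```python
# Markov reward model sketch: expected operating cost before absorption
# in the "failed" state. All numbers are illustrative.
P = {
    "working":  {"working": 0.90, "degraded": 0.09, "failed": 0.01},
    "degraded": {"working": 0.20, "degraded": 0.70, "failed": 0.10},
}
r = {"working": 1.0, "degraded": 5.0}  # per-step cost in each transient state

x = {s: 0.0 for s in P}  # x["failed"] is implicitly 0
for _ in range(10_000):  # fixed-point (value) iteration
    x = {s: r[s] + sum(p * x.get(t, 0.0) for t, p in P[s].items()) for s in P}

print(round(x["working"], 1))  # 62.5
```

For this small instance one can verify the result by hand: solving the two linear equations gives x(working) = 62.5 and x(degraded) ≈ 58.33.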

3.3 Ingredient 3: Uncertainty

Uncertainty implies a situation in which a person does not have the necessary information to precisely describe, prescribe, or predict an event or its characteristics. Uncertainty comes in two flavors [16, 42]:

  • Aleatoric uncertainty derives its name from alea, Latin for dice, and refers to uncertainty stemming from natural, random fluctuations.

  • Epistemic uncertainty derives its name from the Greek word epistéme, meaning knowledge; it stems from our lack of knowledge.

Aleatoric Uncertainty. Traditionally, the mathematical analysis of risk has focused on aleatoric uncertainty, using the laws of probability theory. As Bernstein states [10]:

Probability theory is an instrument for organizing, interpreting and applying information. As one genius idea was piled on top of another, quantitative techniques of risk management have helped trigger the ideas of modern times. [..] Without the command of probability theory and other instruments of risk management, engineers could never have designed the great bridges that span our rivers, [..] polio would still be maiming children, no airplanes would fly, and space travel would just be a dream.

Numerous formal methods have been developed to handle aleatoric risks. On the one hand, formal methods have made enormous contributions to the analysis of stochastic models, including classic models such as Markov chains, Markov decision processes, and Bayesian networks. On the other hand, these achievements have enabled better analysis of existing risk models, such as fault trees, reliability block diagrams, and AADL. Stochastic risk models are especially popular in the areas of probabilistic safety analysis (PSA) and reliability engineering.
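Monte Carlo simulation (cf. Sect. 1.2) is one of the workhorses for such aleatoric analyses. The sketch below estimates the one-year failure probability of a hypothetical two-component redundant system; the failure rate and exponential-lifetime assumption are mine, chosen so the estimate can be checked against the closed-form answer.

```python
import random

# Monte Carlo sketch: probability that a redundant (parallel) system
# fails within one year, assuming exponential component lifetimes.
random.seed(0)  # fixed seed for reproducibility

def system_fails_within(horizon: float, rate: float = 0.5) -> bool:
    # The system fails only if BOTH components fail before the horizon.
    t1 = random.expovariate(rate)
    t2 = random.expovariate(rate)
    return max(t1, t2) <= horizon

N = 100_000
estimate = sum(system_fails_within(1.0) for _ in range(N)) / N
# Analytic value for comparison: (1 - exp(-0.5))^2, roughly 0.155
print(estimate)
```

For a model this small the analytic answer is available; Monte Carlo earns its keep on models whose state spaces or dependencies defeat closed-form analysis.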

Epistemic Uncertainty. Epistemic uncertainty refers to uncertainty arising from a lack of knowledge or understanding of the system and its risks. It stems from limitations in the available data, models, and parameters.

Epistemic uncertainty can be reduced through further research, data collection, or refinement of models and theories. However, it may never be fully eliminated because of inherent limitations in human understanding or the complexity of the system being studied. Properly addressing epistemic uncertainty is crucial to effectively mitigate risks.

The Combination. Several approaches exist that combine aleatoric and epistemic uncertainty. These especially include methods in which the probabilistic parameters are subject to uncertainty.

Prominent models include Bayesian belief methods [21], where practitioners can update their beliefs about both the parameters of a model and the variability in the data as new information becomes available. Other approaches are based on fuzzy probability theory [17, 62], interval Markov chains [8], parametric Markov models [9], and hidden and partially observable Markov models [54]. All these models are based on stochastic methods (aleatoric uncertainty), but allow for uncertainty in the parameters of the model.
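The Bayesian combination of the two kinds of uncertainty can be sketched with a standard Beta-Bernoulli update: the failure probability p itself is uncertain (epistemic) and modeled by a Beta distribution, while each observed trial is a random failure or success (aleatoric). The prior and the observation sequence below are invented.

```python
# Beta-Bernoulli sketch: epistemic uncertainty about a failure
# probability p, reduced by observing aleatoric failure/success trials.
alpha, beta = 1.0, 1.0  # Beta(1, 1): uniform prior over p

observations = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # 1 = failure, 0 = success
for failed in observations:
    alpha += failed        # conjugate update: count failures ...
    beta += 1 - failed     # ... and successes

posterior_mean = alpha / (alpha + beta)  # point estimate of p
print(round(posterior_mean, 3))  # 0.167
```

Collecting more data shrinks the epistemic spread of the Beta posterior, while the aleatoric coin-flip nature of each individual trial remains.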

Uncertainty is a broad concept, with many different angles and interpretations; see e.g., [5, 38] for an interpretation in safety analysis.

3.4 Uncertainty: Black Swans and the Rumsfeld Matrix

When it comes to epistemic uncertainty, two concepts have emerged in the field: Black swans and the uncertainty matrix by Rumsfeld.

Black Swans. Important in the context of uncertainty is the concept of so-called black swans. The term was coined by Taleb [58] as a metaphor for high-profile, hard-to-predict, and rare events with significant impacts that often catch people by surprise. Examples are the 2008 financial crisis and the COVID pandemic.

Black swans have challenged traditional risk assessment methods, including, perhaps especially, formal risk models based on aleatoric/stochastic analysis, since these were unable to foresee such events. Although black swans are difficult to anticipate, effective risk management strategies should include means to mitigate their potential impacts, e.g., via robust contingency plans, resilience-building measures, and adaptive frameworks.

Fig. 3. The Rumsfeld matrix

The Rumsfeld Matrix of Uncertainty. The Rumsfeld matrix categorizes information and knowledge based on their levels of certainty and awareness. It originates from former U.S. Secretary of Defense Donald Rumsfeld and uses four categories for comprehending and addressing uncertainties:

  1. Known Knowns are the well-understood risks, including known hazards, historical data, and established patterns. They can be effectively managed via established risk assessment and mitigation strategies.

  2. Known Unknowns represent risks that are recognized but not fully understood. The key to managing these risks is to obtain better information via research, analysis, and exchange of information, as well as thorough risk analyses and scenario planning.

  3. Unknown Knowns refer to risks that exist but are not consciously recognized. These may include hidden vulnerabilities, cultural biases, and blind spots. Effective risk management again requires better information.

  4. Unknown Unknowns are the risks beyond current awareness, i.e., the black swans. As the greatest challenge to risk management, these require an agile approach, using resilience and contingency plans.

Clearly, it is very difficult to mitigate all black swans; how, for example, to prepare for the next pandemic, which may be completely different from the last one? As phrased in a quote often attributed to Niels Bohr, one of the fathers of quantum mechanics [60]: Prediction is difficult, especially when it is about the future (Fig. 3).

4 Risk Strategies

After identifying system risks and classifying them according to likelihood and impact, the question is what to do with these risks. Four ways of dealing with risk exist, called risk strategies.

Tolerate. This strategy entails accepting risks as they are and taking no additional action: no risk, no fun. Plenty of examples exist: we all drive, fly, and walk, and products are released without being fully tested or verified.

Terminate. This is the opposite of tolerate: stop, or do not start, an activity that is perceived as too risky. The decision of the aviation authorities in 2019 to no longer allow the Boeing 737 MAX to fly passengers is an example of the risk termination strategy.

Transfer. At times, risk can be transferred to other parties. Insurance serves as a prime example, in which the financial consequences of theft, fire, or medical treatments are covered by an insurance plan. Other examples include outsourcing, where an entire task, including its intrinsic risks, is delegated to a third party. However, it is impossible to completely transfer risks: insurance policies may handle the financial implications of medical care, but do not alleviate the other effects. Furthermore, in the case of outsourcing, there is always the danger that the entity to which the task was delegated fails to perform.

Treat. Risk treatment is an important strategy, as it finds mitigation measures to reduce potential risks. Mitigation measures are also called controls. There are two types:

(a) Impact reduction reduces the effect of hazards after they occur by taking corrective measures. Examples are safety devices (helmets, seat belts, air bags), monitoring systems (smoke detectors), and fail-safe mechanisms, which return the system to a safe state after an incident happens; for example, the emergency lane offers a refuge after a car accident. Impact reduction measures from software engineering include run-time verification and exception handling.

(b) Likelihood reduction implements preventive measures that reduce the likelihood that an event will occur. Examples are training of personnel (driving licenses are a typical example), regular maintenance schedules of machinery, regular software updates, and strict security protocols to prevent unauthorized information access. Many practices in software engineering, either formal or informal, fall into this category: rigorous specification, verification, validation, testing, etc.
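The two treatment types combine naturally in a residual-risk calculation: a preventive control scales down the likelihood, while a corrective control scales down the impact. All numbers below are assumed for illustration.

```python
# Toy residual-risk calculation for a single event.
baseline_prob, baseline_impact = 0.20, 100_000  # before any controls

preventive_factor = 0.25   # e.g., training cuts the likelihood to 25%
corrective_factor = 0.50   # e.g., a fail-safe mechanism halves the impact

baseline_risk = baseline_prob * baseline_impact
residual_risk = (baseline_prob * preventive_factor) * (baseline_impact * corrective_factor)

print(round(baseline_risk, 2))  # 20000.0
print(round(residual_risk, 2))  # 2500.0
```

The gap between the two numbers is what the controls buy; comparing it against the cost of the controls is the essence of (cost-)effective risk treatment.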

4.1 The Application of Risk Strategies

It is important to re-assess risks after measures have been devised. Have all prominent risks been addressed? Have new risks been introduced? Risks that remain after measures have been taken are called residual risks.

It is good practice to implement a combination of preventive and corrective measures for important risks. Preventive measures have the advantage that the incident does not occur at all, so no damage is done. However, risks can often not be completely predicted, so corrective measures remain useful. This is especially the vision of resilience engineering: predicting risks is difficult, and black swans can always occur; therefore, adaptability and flexibility are of the utmost importance.

4.2 Risk Management Versus Dependability Engineering

Risk management is closely related to dependability engineering. Dependability [6] is defined as the ability to deliver service that can justifiably be trusted, and refined into the ability to avoid service failures that are more frequent and more severe than is acceptable. The terminologies of risk and dependability are closely related.

In their seminal paper, Avizienis et al. [6] break down the dependability landscape into attributes that reflect dependability concerns, threats that endanger dependability, and means to improve dependability, see Fig. 4.

Definitions. When considering the service as the primary objective of a system, it becomes clear that the concept of dependability is linked to (absence of) risk.

Attributes as Impact Classes. Six dependability attributes are identified: availability, meaning readiness for correct service; reliability, meaning continuity of correct service; safety, meaning absence of catastrophic consequences on the users and environment; confidentiality, meaning absence of unauthorized disclosure of information; integrity, meaning absence of improper system alterations; and maintainability, meaning the ability to undergo modifications. These can be considered refinements of the risk impact classes from Sect. 3.2: with a focus on technical correctness, availability, reliability, confidentiality, integrity, and maintainability refine the quality class. The safety attribute immediately corresponds to the safety impact class.

Means as Strategies. Finally, the dependability means can be viewed as the risk strategies discussed in Sect. 4. Fault prevention aims to prevent the occurrence or introduction of faults, and is therefore a preventive risk reduction strategy. Fault tolerance avoids service failures in the presence of faults, and is thus a corrective risk reduction strategy. Fault removal reduces the number and severity of faults: again a preventive risk reduction strategy. Fault forecasting estimates the present number, the future incidence, and the likely consequences of faults, and is a risk assessment activity.

Fig. 4. Taxonomy for dependability attributes, threats and means

Actually, many formal methods can be viewed as risk strategies/dependability means: formal requirements specification, verification, and validation can all be seen as fault prevention means, and thus as preventive measures. Debugging is a fault removal technique. Run-time verification, when combined with e.g. fail-safe mechanisms, is a means for fault tolerance. Code metrics are fault forecasting means.

5 Risk Management

Risk management refers to coordinated activities to direct and control an organization with respect to risk [26]. Virtually all organizations manage their risks through the Plan-Do-Check-Act (PDCA) cycle, and many use its concretization in the ISO 31000 standard. The latter provides concrete steps to select appropriate risk strategies, deciding how risks should be treated and which interventions are appropriate.

This section covers both the PDCA cycle and the ISO 31000 standard, as well as the role of formal methods.

5.1 The Risk Management Cycle: Plan-Do-Check-Act

The Plan-Do-Check-Act (PDCA) cycle, also known as the Deming cycle, is a systematic process for the continuous improvement of processes and products [57]. As illustrated in Fig. 5, this cycle proceeds in four steps. Specialized to risk management, these are as follows.

  1. Plan: Establish goals, identify and assess risks, develop mitigation strategies.

  2. Do: Implement risk management strategies and allocate necessary resources.

  3. Check: Monitor the effectiveness of the strategies, and report findings.

  4. Act: Improve and update plans to ensure continuous risk management.

Little information is available on the relation between the PDCA cycle and the use of formal methods. However, since the PDCA cycle is designed for any improvement process, it is also applicable to formal verification activities: Set the goals of the verification, perform the verification, check if the verification yields the desired results, and update plans to improve both the system under verification and the verification process itself.

Fig. 5. PDCA cycle

5.2 The Risk Management Process

Several frameworks exist that refine and concretize the PDCA cycle. ISO 31000 is a family of generic standards, applicable in many contexts. Other risk management frameworks, such as COSO [44], are more specialized for enterprise risks. ISO 31000 [30] provides principles, vocabulary, and a process for any organization to assess and treat risks. Formal methods are especially useful during the risk assessment phase.

Fig. 6. Steps in the ISO 31000 standard for risk management

As illustrated in Fig. 6, the process consists of several steps:

  1. Establish the context, and especially the goals.

  2. Identify risks, mapping the risks that threaten the goals.

  3. Analyze the risks, finding the root causes and the factors that contribute to them.

  4. Evaluate risks according to their likelihood and impact.

  5. Treat risks by selecting effective measures.
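The evaluation step is often operationalized with a likelihood-impact matrix. The following sketch illustrates the idea; the 1-5 scales and the acceptability threshold are illustrative assumptions, not prescribed by ISO 31000:

```python
# Toy risk evaluation via a likelihood-impact matrix.
# The 1-5 scales and the acceptability threshold are illustrative only.

def risk_score(likelihood, impact):
    """Combine likelihood and impact, each rated on a 1-5 scale."""
    assert 1 <= likelihood <= 5 and 1 <= impact <= 5
    return likelihood * impact

def evaluate(likelihood, impact, acceptable_below=10):
    """Classify a risk as acceptable or in need of treatment."""
    score = risk_score(likelihood, impact)
    return "acceptable" if score < acceptable_below else "needs treatment"

print(evaluate(2, 3))   # score 6: below the threshold
print(evaluate(4, 4))   # score 16: exceeds the threshold
```

In practice, the scales and thresholds are set during the context-establishment step, since what counts as acceptable depends on the organization's goals and risk appetite.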

5.3 Formal and Informal Methods for Risk Assessment

Performing a proper risk analysis is not easy and requires domain knowledge. For example, it is not trivial to identify all relevant risks in a self-driving car or nuclear plant.

There are several risk frameworks to support the risk assessment process. These frameworks offer a systematic procedure to identify risks in different classes, find root causes, and help determine their impact. The level of formality varies from very informal to very formal.

Text-Based Methods. Textual approaches provide systematic methods for exploring components or behaviors in complex systems and list all findings in textual form or a table. Common approaches are failure mode and effects analysis (FMEA) [50] and hazard and operability studies (HAZOP) [36].
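FMEA commonly ranks the failure modes in its table by a risk priority number (RPN), the product of severity, occurrence, and detection ratings. The sketch below illustrates this ranking; the failure modes and ratings are invented for illustration:

```python
# FMEA sketch: rank failure modes by risk priority number (RPN).
# RPN = severity * occurrence * detection, each rated on a 1-10 scale.
# The failure modes and ratings below are invented for illustration.

failure_modes = [
    {"mode": "sensor drift", "severity": 7, "occurrence": 4, "detection": 3},
    {"mode": "power loss",   "severity": 9, "occurrence": 2, "detection": 2},
    {"mode": "stuck valve",  "severity": 8, "occurrence": 3, "detection": 5},
]

for fm in failure_modes:
    fm["rpn"] = fm["severity"] * fm["occurrence"] * fm["detection"]

# Highest RPN first: these failure modes deserve treatment first.
ranked = sorted(failure_modes, key=lambda fm: fm["rpn"], reverse=True)
for fm in ranked:
    print(fm["mode"], fm["rpn"])
```

Note how the ranking differs from ordering by severity alone: a highly severe but well-detected failure mode may rank below a moderate one that is hard to detect.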

Architectural Methods. These methods take an architectural system model as a starting point, decomposing a system into interacting components and annotating these with potential risks. Such architectural methods are especially common for systems with large software components, but can be used in any domain with complex system designs. Prominent examples include the Architectural Analysis & Design Language (AADL) [20], the AltaRica framework [2, 47], and the Safety Analysis Modeling Language (SAML) [25].

Domain Specific Methods. These methods have been specifically developed for risk analysis. These include fault tree analysis [18, 51], reliability block diagrams [43], event trees [19], and bowtie diagrams [14]. All of these methods provide visual means to capture system behavior and offer different analysis possibilities, for example, (stochastic) model checking, Monte Carlo simulation, or dedicated computation methods.
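As an example of a dedicated computation method, the top-event probability of a static fault tree with independent basic events can be computed bottom-up: an AND gate multiplies the probabilities of its children, while an OR gate combines them as one minus the product of the complements. A minimal sketch, with an invented tree:

```python
# Bottom-up evaluation of a static fault tree with independent basic events.
# AND gate: product of child probabilities; OR gate: 1 - prod(1 - p).

from functools import reduce

def prob(node):
    """Compute the failure probability of a fault tree node."""
    kind = node[0]
    if kind == "basic":
        return node[1]
    child_probs = [prob(c) for c in node[1]]
    if kind == "and":
        return reduce(lambda acc, p: acc * p, child_probs, 1.0)
    if kind == "or":
        return 1.0 - reduce(lambda acc, p: acc * (1.0 - p), child_probs, 1.0)
    raise ValueError(f"unknown gate: {kind}")

# Illustrative tree: the top event occurs if the pump fails,
# or if both redundant valves fail.
tree = ("or", [
    ("basic", 0.01),                              # pump failure
    ("and", [("basic", 0.1), ("basic", 0.1)]),    # both valves fail
])

print(prob(tree))   # 1 - (1 - 0.01) * (1 - 0.01) = 0.0199
```

This bottom-up scheme is only valid when basic events are independent and no basic event is shared between subtrees; shared events require more sophisticated techniques such as binary decision diagrams.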

Various organizations dealing with safety-critical systems, including NASA, ESA, the nuclear industry, and the US Federal Aviation Administration, have recognized that a single analytical approach is usually insufficient for effective risk management. Consequently, they suggest a combination of approaches.

Finally, it is important to realize that risk models, like many other models in computer science, do not formulate an objective truth, as in Newtonian mechanics. Rather, these models serve decision making and reflect the best information currently available. Moreover, in my experience, creating risk models at design time can lead to design improvements that prevent risks from happening altogether: the journey is the destination.

6 ISO Standards, Risk Management and Formal Methods

6.1 The Role of ISO Standards in Risk Management

The International Organization for Standardization (ISO) is an independent, non-governmental organization that develops voluntary international standards for quality, safety, and efficiency in products, services, and systems.

These standards cover a wide range of industries and technologies, from manufacturing and technology to food safety and healthcare. ISO standards typically require organizations to establish systematic approaches to quality management, information security, health and safety, and more, by setting appropriate policies, procedures, and processes to monitor the outcomes. Several standards recommend the use of formal methods.

If companies and organizations meet the criteria for a certain standard, they can obtain certification for that standard. Such an accreditation is advantageous, as it bolsters an organization’s credibility and trust among customers. In addition, regulatory bodies or governments may require adherence to specific ISO standards as part of legal or contractual obligations.

Some standards are developed with other organizations such as IEEE or IEC, as reflected in their name. Names may also include the year of publication, reflecting the specific version.

6.2 ISO Standards for Software Systems

Some noteworthy standards related to software systems are the following.

ISO/IEC TR 5469: Artificial Intelligence—Functional Safety and AI Systems. This standard outlines the role of AI in safety-related systems, classes, and compliance levels. It provides guidance regarding the specification, design, and verification of functionally safe AI systems, or how to apply AI technology for functions that have safety-related effects. ISO/IEC 42001: AI management systems encompasses the Plan-Do-Check-Act cycle for AI management systems.

ISO/IEC/IEEE 90003:2018 Software Engineering. This standard is part of the ISO 9000 family on quality management. The 90003 standard provides guidelines for the acquisition, supply, development, operation, and maintenance of software and support services.

ISO/IEC/IEEE 12207:2018 Systems and Software Engineering Software Life Cycle Processes. Whereas ISO 90003 relates to software purchase, the 12207 standard relates to software development, setting requirements for the software life cycle process: agreement, organizational, technical management, and technical processes. The latter includes business or mission analysis, stakeholder needs, requirements, architecture, design, implementation, integration, verification, validation, operation, maintenance, and disposal.

6.3 ISO Standards Recommending Formal Methods

Several ISO standards, mostly related to safety-critical systems, recommend formal methods during design and verification. Here are some notable instances:

ISO 26262: Road Vehicles—Functional Safety [29]. This standard concerns the functional safety of electrical and electronic systems in road vehicles across the entire automotive safety lifecycle: management, development, production, operation, and decommissioning. Automotive Safety Integrity Levels (ASILs) set risk levels based on the probability and consequences of safety hazards.

ISO 22163:2023 Railway Applications—Railway Quality Management System [31]. It refines ISO 9001: Quality management systems with specific requirements for application in the railway sector.

IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems offers guidelines for implementing, designing, deploying, and maintaining safety-related systems, achieved through a safety life cycle and a probabilistic failure assessment.

A key concept in dependability analysis of safety-critical systems is the notion of Safety Integrity Level (SIL), referred to as Automotive SIL (ASIL) in the automotive industry [27]. SILs range from 1 (lowest) to 4 (highest); ASILs are graded from A (lowest) to D (highest). For hardware, the device must meet strict limits on failure probability for (a rigorously defined notion of) dangerous failure.
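For low-demand mode of operation, IEC 61508 bounds the average probability of failure on demand (PFDavg) per SIL. The sketch below encodes the commonly cited bands as a lookup; it is an illustration, not a substitute for the standard's full requirements:

```python
# SIL classification from average probability of failure on demand (PFDavg),
# low-demand mode, following the commonly cited IEC 61508 bands.
# Illustrative sketch; the standard imposes further requirements beyond PFDavg.

SIL_BANDS = [
    (4, 1e-5, 1e-4),   # (SIL, lower bound inclusive, upper bound exclusive)
    (3, 1e-4, 1e-3),
    (2, 1e-3, 1e-2),
    (1, 1e-2, 1e-1),
]

def sil_for_pfd(pfd_avg):
    """Return the SIL achieved for a given PFDavg, or None if out of range."""
    for sil, lo, hi in SIL_BANDS:
        if lo <= pfd_avg < hi:
            return sil
    return None

print(sil_for_pfd(5e-4))   # falls in the SIL 3 band
print(sil_for_pfd(2e-5))   # falls in the SIL 4 band
```

Note the inverse relationship: a higher SIL corresponds to a lower permitted failure probability, which is why formal methods become increasingly recommended at higher levels.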

Based on the (A)SIL level, the use of formal methods is recommended. EN 50128 recommends formal methods for SIL 1 and 2, and highly recommends them for SIL 3 and 4. Interestingly, ISO 26262 merely recommends formal methods, but highly recommends semi-formal methods, such as UML and SysML.

6.4 Research on Formal Methods for ISO Compliance

Establishing a safety analysis in the context of ISO standards can be challenging. Various formal validation and verification techniques have been applied to several ISO standards, such as ISO 26262 [40]. One significant hurdle is the requirement that tools used for developing safety-critical systems must be certified. An overview of approaches to demonstrate compliance with ISO standards, which can serve as a foundation for further application of formal methods, is provided in [45].

7 Conclusion

This tutorial provides an overview of risk management concepts, principles and techniques and their relation to formal methods.

Formal methods are well positioned to stimulate System II thinking. In this way, they set a good basis for accountable risk management: systematic, so that no risks are overlooked; transparent, since models make explicit the information that risk decisions are based on; and quantitative, based on facts rather than on feelings.