7.1 Introduction

At the beginning of the 1990s, Prof. Diekmann [7] stated the following. “New analysis tools are emerging, which have the potential to allow complex risk analyses to be performed simply. These new tools, which are underpinned by decision analysis and, lately, expert-systems technology, may lead to powerful, yet simple, approaches to the representation of risky problems”. This optimistic prediction about the future of risk analysis was accompanied by the suggestion of a possible interdisciplinary direction: “Future approaches to risk analysis will certainly rely more on the advances being made in Artificial Intelligence (AI) and the cognitive sciences. New computer tools and knowledge-representation schemes will unquestionably lead to new techniques, insights and opportunities for risk analysis”.

In the same decade (1997), the Russian chess grandmaster Garry Kimovich Kasparov (former World Chess Champion, ranked world No. 1 from 1984 until his retirement in 2005) lost a chess match against IBM’s chess-playing computer Deep Blue, an example of Good Old-Fashioned Artificial Intelligence (GOFAI) [16]. Of that match, Kasparov [17] later stated: “Deep Blue was intelligent the way your programmable alarm clock is intelligent. Not that losing to a 10-million-dollar alarm clock made me feel any better”.

Since these events, industrial risk analysis and safety management have tried to make use of AI, but progress has been uneven. On the one hand, Diekmann’s prediction has not been fulfilled, as methodological gaps are still present [24]; on the other hand, AI has moved well beyond “programmable-alarm-clock intelligence” thanks to the progressive refinement of machine learning models and the increase in available computing power [12].

This contribution aims to outline what AI can bring to risk analysis and safety management by illustrating a series of examples (with emphasis on benefits and limitations) where AI techniques are used to continuously update the evaluation of the safety level in an industrial system.

7.1.1 Artificial Intelligence and Machine Learning

AI is intelligence demonstrated by machines. The field is divided into subfields based on technical considerations, such as particular goals (e.g., robotics or machine learning) and the use of particular tools (e.g., logic or artificial neural networks), as well as on deeper philosophical differences.

This contribution focuses on the subfield of machine learning (ML). ML refers to techniques aiming to program computers to learn from experience [32]. Some of its models (e.g., deep learning) aim to simulate the learning model of the human brain [12]. Such models are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.

A computer may be trained through these learning techniques to assess risk in safety-critical industries such as Oil and Gas. This allows processing large amounts of information in the form of indicators from normal operations and past unwanted events (from mishaps to major accidents), which are used for training. Due to the subjectivity of the definition of risk [40], a risk level cannot be assigned to each event with certainty, and expert supervision is needed. Once the model has learned the risk categorisation, it uses this knowledge to evaluate real-time risk from the state of the monitored system.
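
As a minimal sketch of this supervised-learning workflow, the snippet below trains a classifier on expert-labelled indicator vectors and then classifies the current system state. All names, values and labels are illustrative assumptions, not data from the cited studies.

```python
# Minimal sketch of supervised risk classification (hypothetical data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row: indicator values describing a past system state, e.g.,
# equipment pressure ratio, failure count, maintenance backlog (assumed).
X_train = np.array([
    [0.2, 1, 3],
    [0.9, 4, 12],
    [0.5, 2, 6],
])
# Risk categories assigned under expert supervision (hypothetical labels).
y_train = ["low", "high", "medium"]

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Real-time evaluation: classify the current monitored state.
current_state = np.array([[0.7, 3, 9]])
print(model.predict(current_state))
```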

7.1.2 Monitoring of Early Deviations and Past Events

Increasing attention has been dedicated to monitoring safety barrier performance through indicators as a way to assess and control risk. Indicators may report a series of factors: physical conditions of a plant (equipment pressure and temperature), number of failures of a piece of equipment, maintenance backlog, number of emergency preparedness exercises run, amount of overtime worked, etc. A number of indicator typologies are theorised and used in the literature [23]. Øien et al. [23] affirm that we can refer to risk indicators if: they provide numerical values (such as a number or a ratio); they are updated at regular intervals; and they cover only some selected determinants of overall risk, in order to keep the set manageable. That said, the latter feature is quickly becoming outdated due to the extensive collection carried out in industry and the attempts to process large numbers of indicators [30].

Øien et al. [23], Paltrinieri et al. [25, 26] and Landucci et al. [19] have produced several reviews on risk and barrier indicators. They show that the definition and collection of risk indicators have become consolidated practices in “high-risk” sectors, such as the petroleum and chemical industries. For instance, the Norwegian Petroleum Safety Authority (PSA) has required indicators describing the technical performance of safety barriers within the Norwegian Oil and Gas industry since 1999 [31], while the European directive “Seveso III” [9] on the control of major-accident hazards involving dangerous substances suggests their use for sites handling hazardous substances [10]. Such a trend towards the definition and collection of ever higher numbers of indicators [30] illustrates the mentioned challenge of big-data processing for risk-level assessment.

7.2 Examples of AI-Based Prediction

Three examples of AI-based prediction with safety-related purposes are described in the following. The cases depict not only the application of machine learning techniques, but also the criticality of input data and, implicitly, the human effort required to prepare them.

7.2.1 Consequence Class Associated with a Hazardous Material Release

ML techniques were applied by Paltrinieri et al. [28, 29] to a database of past accidents, with the purpose of simulating their application to the national databases managed by the Seveso competent authorities. The data set used is the Major Hazard Incident Database (MHIDAS) [1], launched by the UK Health and Safety Executive in 1986 and developed by AEA Technology until the mid-1990s. The events included are based on public domain information sources, and their characteristics are registered using keywords.

MHIDAS includes 8972 hazardous events from 1916 to 1992, with the attributes listed in Table 7.1. Some attributes use a taxonomy to systematically categorise the event. While the actual quality of the data could not be fully verified across the recorded hazardous events, the database is characterised by a high-quality data model, i.e., high semantic quality: it captures clear boundaries and the relevant properties of the problem domain and the requirements of the task. Given that it takes a high amount of creativity and vision to design a solution that is robust, usable and able to stand the test of time [15], the high semantic quality of MHIDAS could only be reached through significant knowledge and experience of the field.

Table 7.1 Attributes used to record hazardous events in MHIDAS [1]

The attributes listed in the upper part of Table 7.1 were used as inputs to the ML models to predict the consequences (lower part of Table 7.1). The details of data preprocessing are explained elsewhere [34]. The study focused on the number of people killed and aimed to predict the occurrence of a hazardous event within one of the severity categories listed in Table 7.2, based on the considered inputs. Only categorical data are used.

Table 7.2 Severity categories considered by the study
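
A hedged sketch of this setup is given below: categorical attributes are one-hot encoded and a classifier predicts a severity category. The attribute names, values and category labels are illustrative placeholders, not the actual MHIDAS taxonomy or the categories of Table 7.2.

```python
# One-hot encoding of categorical, MHIDAS-style attributes and
# prediction of a severity category (all names are illustrative).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

events = pd.DataFrame({
    "material":      ["LPG", "chlorine", "LPG", "ammonia"],
    "general_cause": ["impact", "corrosion", "overpressure", "corrosion"],
    "origin":        ["transport", "storage", "process", "storage"],
})
severity = ["0 fatalities", "1-10 fatalities", "0 fatalities", "1-10 fatalities"]

clf = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                    LogisticRegression(max_iter=1000))
clf.fit(events, severity)
print(clf.predict(events.iloc[[0]]))  # predicted severity category
```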

7.2.2 Wellhead Damage Frequency in a Drilling Rig

To avoid potential damage during drilling operations for a new offshore Oil and Gas well, a semisubmersible drilling unit should maintain its position above the wellhead. This is particularly critical if the platform is in shallow waters, where small changes of position lead to larger angles of the riser (the pipe connecting the platform to the subsea drilling system). Exceeding physical inclination limits may result in damage to the wellhead, the Blowout Preventer (BOP, sealing the well) or the Lower Marine Riser Package (LMRP, connecting riser and BOP) [5].

Platform position is maintained autonomously (without a mooring system) by a set of thrusters controlled by the Dynamic Positioning (DP) system. Input for the DP system is provided by the position reference systems (Differential Global Positioning System, DGPS, and Hydroacoustic Position Reference, HPR), environmental sensors, gyrocompass, radar and inclinometer [5]. A Dynamic Positioning Operator (DPO) located in the Marine Control Room (MCR) is responsible for constantly monitoring the DP panels and screens and for carrying out emergency procedures if needed [11]. Platform position may be lost for several reasons.

In this case study, Paltrinieri et al. [29] assume that the platform thrusters exert propulsion in a wrong direction, leading to a “drive-off” scenario. If the rig moves to an offset position, specific alarms turn on, prompting the DPO to stop the drive-off by deactivating the thrusters and to initiate the manual Emergency Disconnect Sequence (EDS), which disconnects the riser from the BOP. If the manual EDS fails, the automatic EDS activates at the ultimate position limit allowing for safe disconnection [5].

A number of works [21, 24, 26] address the details of the occurrence and development of drive-off scenarios. Relevant indicators are defined to assess the performance of safety barriers and related systems. Examples of these indicators are the following:

  • thruster control failures in the last three months;

  • thruster monitoring sensors failures in the last three months;

  • simulator hours carried out by the DPO in the last three months;

  • inadequate DPO communication events in the last three months;

  • delays in DPO shifts in the last three months;

  • percentage of time in the last three months with more than one operator monitoring.

Collecting a wide variety of indicators may lead to challenges related to data integrity. A lack of accurate data may be due to several reasons, such as the time and financial constraints experienced by the database managers responsible for recording relevant indicators. As companies are expected to do more with less, developers must decide to what extent they are going to implement and evaluate quality considerations [15].

Simulations of drive-off indicator trends over a period of 30 years can be found in the literature [24]. They are inspired by the typical bathtub curve for the technical elements [41] and by relevant expert judgement for the remaining elements.
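
The snippet below sketches how such a trend could be simulated: monthly failure counts are drawn from a bathtub-shaped rate with a decreasing infant-mortality term, a constant useful-life term and an increasing wear-out term. All parameter values are assumptions, not those of [24].

```python
# Simulating a technical indicator (monthly failure counts) over 30 years
# with a bathtub-shaped failure rate (illustrative parameters).
import numpy as np

months = np.arange(360)  # 30 years of monthly observations
rate = (2.0 * np.exp(-months / 24)                      # infant mortality
        + 0.5                                           # useful life
        + 0.002 * np.maximum(months - 240, 0) ** 1.5)   # wear-out

rng = np.random.default_rng(seed=0)
failures_per_month = rng.poisson(rate)  # e.g., thruster control failures
print(failures_per_month[:12])          # first simulated year
```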

As shown by Bucelli et al. [4], indicator values may be aggregated based on relative weights and hierarchical barrier models, in order to enable dynamic update of barrier failure probabilities. This can be used to update, in turn, the occurrence frequencies of potential outcomes. Outcome frequencies are an expression of the scenario probability and, in turn, of the risk. If we assume that the other factors are constant, this represents a simplified risk model. However, Matteini [21] points out a certain complexity within the hierarchical barrier model, due to its tangled structure and an unclear approach to assigning relative weights to single model elements. For this reason, Paltrinieri et al. [29] suggest a machine learning approach that bypasses the construction of such hierarchies and aggregation rules.
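
The following sketch illustrates the aggregation idea only, not the published model of [4]: indicator scores are combined through relative weights into a modifier that scales a baseline barrier failure probability, which in turn updates the outcome frequency. Weights, scores and the mapping rule are illustrative assumptions.

```python
# Weighted aggregation of indicator scores into a barrier failure
# probability (assumed rule, for illustration only).

def barrier_failure_probability(baseline_pfd, scores, weights):
    """Scale a baseline probability of failure on demand (PFD) by a
    weighted average of indicator scores in [0, 1], where 1 means the
    worst observed performance."""
    assert len(scores) == len(weights)
    modifier = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    # Map the modifier in [0, 1] to a multiplicative factor in [0.5, 2].
    return baseline_pfd * (0.5 + 1.5 * modifier)

# Hypothetical scores for indicators such as those in the drive-off list.
scores = [0.2, 0.1, 0.6, 0.3]
weights = [3, 2, 2, 1]
pfd = barrier_failure_probability(1e-2, scores, weights)
print(f"updated barrier PFD: {pfd:.2e}")

# Updated outcome frequency: initiating-event frequency times barrier PFD
# (assumed drive-off frequency of 0.1 per year).
print(f"wellhead damage frequency: {0.1 * pfd:.2e} per year")
```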

7.2.3 Alarm Chattering in an Ammonia Plant

Alarm data from a section of an ammonia production process [39] are analysed by Tamascelli et al. [38]. Due to the large quantity of hazardous substances stored and handled during normal activity, the plant is classified as an “upper tier” Seveso III establishment. Methane, hydrogen and ammonia (anhydrous and in aqueous solution) are used extensively in the plant section. Furthermore, due to the intrinsic properties of the processes involved, severe operating conditions (i.e., high pressure and high temperature) are often associated with corrosive substances. Additional information about ammonia production and the considered site can be found in [2, 42].

The alarm database consists of alarm data collected over an observation period of more than four months. In this case, both the data and the data model are of high quality, as they are acquired from consolidated monitoring systems. Human effort resides instead in the interpretation of the data and the definition of appropriate priorities among them.

Each row of the database represents an alarm event (26,473 observations in total), and each column (36 in total) represents a piece of information (i.e., an “attribute”) about the alarm. The most meaningful attributes are presented in Table 7.3.

Table 7.3 Alarm database attributes

The Alarm Identifier (point 5 of the “Message” attribute) is a code that defines the alarm status. Examples of Alarm Identifiers are “HHH” (the measured variable has exceeded the “high level” setpoint), “HTRP” (the measured variable has exceeded the “very high level” alarm setpoint and automatic block intervention procedures might be triggered), “IOP” (indicating an instrument failure or an out-of-range measurement), and “LLL” and “LTRP” (the same as “HHH” and “HTRP” but referring to “low/very low” levels).

According to [18], an alarm event is uniquely identified by only three attributes: Time Stamp, Source, and Alarm Identifier (e.g., HHH, HTRP, LLL, LTRP). The combination of a Source and an Alarm Identifier is called a “unique alarm”.
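
As a brief illustration, the pandas snippet below groups alarm events by (Source, Alarm Identifier) pairs and counts occurrences per unique alarm. Column names and records are hypothetical, not the actual 36-attribute schema of the plant database.

```python
# Counting alarm occurrences per "unique alarm" (hypothetical records).
import pandas as pd

alarms = pd.DataFrame({
    "TimeStamp": pd.to_datetime([
        "2020-01-01 00:00:01", "2020-01-01 00:00:07",
        "2020-01-01 00:00:09", "2020-01-01 04:12:00",
    ]),
    "Source": ["PT101", "PT101", "PT101", "TT205"],
    "Identifier": ["HHH", "HHH", "HHH", "LLL"],
})

counts = (alarms.groupby(["Source", "Identifier"])
                .size()
                .sort_values(ascending=False))
print(counts)  # the top rows reveal the few sources producing most alarms
```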

More than 96% of the alarms registered in the database occurred within a single month, during which a considerable number of alarm floods and chattering alarms must have occurred. In fact, only ten alarm sources (out of 194 in total) were responsible for more than 80% of the alarms recorded.

A chattering alarm is an alarm “that repeatedly transitions between active state and inactive state in a short period of time” [3]. Chattering alarms therefore have the potential to produce a large count of alarms, and reducing their number is a key step towards improving the performance of the alarm system during alarm floods.

Kondaveeti et al. [18] proposed a method for quantifying alarm chatter based on run-length distributions. Although effective, this technique produces static results (i.e., chattering is quantified based on historical alarm data, but no conclusion can be drawn about an alarm’s future behaviour). This Chattering Index approach is modified by Tamascelli et al. [38] to predict chattering behaviour by means of standard ML models.
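
The sketch below computes a simplified run-length-based chattering indicator in the spirit of [18]; the published Chattering Index formulation differs in detail, and the values suggested by the comments are assumptions.

```python
# Simplified run-length-based chattering indicator (illustrative only;
# not the exact Chattering Index of Kondaveeti et al. [18]).
import numpy as np

def chattering_index(timestamps_s):
    """Given sorted activation times (in seconds) of one unique alarm,
    return the mean of 1/run-length, where the run length is the time
    between consecutive activations. Values near 1 suggest chatter."""
    run_lengths = np.diff(np.asarray(timestamps_s, dtype=float))
    if run_lengths.size == 0:
        return 0.0
    return float(np.mean(1.0 / np.maximum(run_lengths, 1.0)))

# A burst of activations a few seconds apart scores high...
print(chattering_index([0, 2, 4, 5, 7, 9]))   # ~0.6 -> likely chattering
# ...while sparse activations score near zero.
print(chattering_index([0, 3600, 7200]))      # ~0.0003 -> not chattering
```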

7.3 Method

ML classification models were used for the three examples in Sect. 7.2. Comparing different ML models is also beneficial: results from multiple linear regression (MLR) were compared with those from relatively more sophisticated deep neural network (DNN) models.

Both MLR and DNN aim at modelling the relationship between two or more independent variables (features) and a dependent variable (label). While the former fits a linear equation to the observed data, the structure of the latter mimics the organisation of neurons in the brain, arguably the most powerful computational engine known today [20].

An algorithm uses part of the available data to train the ML model to predict the label variable from the feature variables, and tests the result on the remaining data. Model performance needs to be evaluated before the model is employed in actual applications. The result might be far from perfect; this may be due to poor data quality, or it may indicate the need to tune the model to the actual application.
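
The sketch below illustrates this train/evaluate workflow on synthetic, class-imbalanced data, comparing a linear baseline with a small feed-forward neural network. It uses scikit-learn stand-ins (LogisticRegression, MLPClassifier) rather than the MLR and TensorFlow DNN models of the cited studies.

```python
# Train/test split and comparison of a linear baseline with a small
# neural network on synthetic, class-imbalanced data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1],  # rare positive class
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
dnn = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)

print("linear accuracy:", linear.score(X_te, y_te))
print("DNN accuracy:   ", dnn.score(X_te, y_te))
```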

7.3.1 Metrics

The performance of the classification models is assessed during the evaluation phase. As an example, consider a situation where accidents must be classified into one of two classes, A or B. A positive prediction occurs when the model predicts class A; a negative prediction occurs when the model predicts class B. Whenever the model predicts the class of an object, there are four possible outcomes:

  • TP = True Positive—i.e., predicted label = A, true label = A;

  • TN = True Negative—i.e., predicted label = B, true label = B;

  • FP = False Positive—i.e., predicted label = A, true label = B;

  • FN = False Negative—i.e., predicted label = B, true label = A.

The sum of True Positives and True Negatives represents the number of correct predictions, while the sum of False Positives and False Negatives indicates the number of wrong predictions. True Positives, True Negatives, False Positives, and False Negatives are used to obtain three performance indicators:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$
(7.1)
$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
(7.2)
$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
(7.3)

Accuracy represents the fraction of objects that have been correctly classified. Precision indicates the success rate of a positive prediction. Recall denotes the fraction of actual positives that have been correctly identified.

It is worth mentioning that these metrics depend on the probability threshold used by the classification models. For example, if the decision threshold is lowered, the model may produce more positive predictions. As a result, the Recall might increase, but the Precision might decrease [33]. In fact, actions aimed at increasing Recall often lower the Precision, and vice versa [13]. A convenient means of displaying the effect of the decision threshold is the Precision-Recall curve, i.e., a plot where each point represents a Precision-Recall pair at a specific decision threshold [22]. A convenient means of summarising the information in the Precision-Recall curve is the area under the curve (AUC P-R) [22], which takes values between 0 and 1. Being independent of the decision threshold, the AUC P-R is considered a more comprehensive indicator of model performance than Accuracy, Precision and Recall. In general, a large AUC P-R value indicates good performance [33].
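
Continuing the sketch from Sect. 7.3, the snippet below computes Accuracy, Precision and Recall at a chosen decision threshold, together with the threshold-independent AUC P-R. Any fitted binary classifier exposing predict_proba would work in place of the assumed dnn model.

```python
# Threshold-dependent metrics and the threshold-independent AUC P-R,
# reusing dnn, X_te and y_te from the previous sketch.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             precision_recall_curve, auc)

probs = dnn.predict_proba(X_te)[:, 1]   # probability of the positive class
threshold = 0.5                          # lowering it raises Recall,
pred = (probs >= threshold).astype(int)  # typically at Precision's expense

print("Accuracy: ", accuracy_score(y_te, pred))
print("Precision:", precision_score(y_te, pred))
print("Recall:   ", recall_score(y_te, pred))

precision, recall, _ = precision_recall_curve(y_te, probs)
print("AUC P-R:  ", auc(recall, precision))
```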

7.4 Results and Discussion

Table 7.4 summarises the results from the examples described in Sect. 7.2. The results from the two approaches used (MLR and DNN) are directly compared to identify the best predictive performance. MLR achieves the better value (cells highlighted in green in the table) for more metrics (9) than DNN (6).

Table 7.4 Summary of results from the representative examples of ML application for safety purposes

However, this overall result cannot convey the message that MLR performs better than DNN, as “there ain’t no such thing as a free lunch”. In fact, in these examples, DNN was applied with default parameters (e.g., the numbers of layers and nodes suggested by TensorFlow tutorials [13]). In addition, DNN is relatively more sensitive to poor data quality [24].

Table 7.4 reports all the metrics discussed in Sect. 7.3.1. If we focus exclusively on Accuracy, we notice that the highest value (0.99) is obtained for both MLR and DNN predictions of the consequence class of 10–100 fatalities associated with a hazardous substance release. However, Accuracy alone is not informative if the problem involves the identification of rare classes, i.e., when the dataset is class-imbalanced [14].

Releases of hazardous substances with 10–100 fatalities are (fortunately) rare events, as they represent about 1% of the records in the MHIDAS database. In this case, the models have learned that they will be correct 99% of the time if they predict that this kind of event never occurs. If the cost of a False Negative is higher than the cost of a False Positive (as in the case of a release of a hazardous substance with 10–100 fatalities), Recall is the most meaningful metric. In this context, a good model must produce high Recall, while low Precision might be considered acceptable and, to a certain extent, conservative.

The prediction of events with a relatively higher frequency and lower consequence (e.g., a release of a hazardous substance with no fatalities, an increase in wellhead damage frequency or a chattering alarm) may instead benefit from higher Precision at the expense of the Recall value.

For this reason, rather than considering Precision and Recall individually, one may aggregate them into the so-called F-score [6], especially if the area under the Precision-Recall curve indicates the potential for optimising the model by tuning the decision threshold. Human contribution would again come into play in setting the algorithm parameters, which inevitably represents a form of subjective calibration. For this reason, the techniques used in the depicted examples require a deep understanding of their benefits, limitations and application boundaries.
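
For reference, the balanced F-score (F1) is the harmonic mean of the two metrics defined in Eqs. (7.2) and (7.3):

$$F_{1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Weighted variants of the F-score exist that favour Recall over Precision (or vice versa) when the costs of False Negatives and False Positives differ, as discussed above.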

This contribution aims to convey the message that AI-based techniques must be considered tools supporting, not substituting, decision making. User awareness and knowledge of these tools’ properties are essential to exploit their results effectively. The role of the human as user of these tools is even more central than before. AI should not be understood as a way to replace the human, but as an improved approach assisting the human. This is compatible with the concept of trustworthy AI promoted by the European Commission [8], which advocates explainable AI (XAI), human centrality by means of interpretability, avoidance of infobesity (information overload) and transparency.

Embracing the principles of trustworthy AI and XAI will unlock the vast potential of machine learning in safety management, especially considering emerging variants of the traditional approaches described in this contribution, such as:

  • Transfer learning, which aims at developing methods to exploit the knowledge gained in one task (the source task) to address a new task (the target task).

  • Federated learning, a machine learning technique that trains an algorithm across multiple decentralised servers holding local data samples, without exchanging them.

  • Meta-learning, which focuses on the learning model itself and its optimisation towards new observations, in order to apprehend the emergence of unknown scenarios (e.g., unknown risks [35, 36]).

Machine learning has the potential to eventually support human users in the way [7] predicted. However, the author must admit another important challenge ahead that is yet to be fully overcome: ensuring an appropriate safety culture in the user, i.e., the foundations and motivations for which such advanced tools would be used. Once again, this challenge brings the discussion back to humans. Risk and safety analysts and managers would potentially have an advantage in the application of digitalised safety management due to their predisposed state of mind, but only given their willingness to learn the basics and the use of such advanced and promising techniques.

7.5 Conclusion

This contribution has illustrated examples of AI-based prediction used to continuously update the evaluation of the safety level in an industrial system. The examples refer to the prediction of the impact of a hazardous substance release in the chemical industry, of the wellhead damage frequency in offshore Oil and Gas drilling, and of chattering alarms in ammonia production. The results can and must be read on different levels, carefully considering the available metrics based on the scenario addressed. This shows that we are not (and will not be in the near future) in a “no-brainer” condition in which the responsibility for human and system safety is entirely transferred to the machine. At the same time, an understanding of digital solutions will progressively be required to guarantee their effective application. These advanced techniques have the potential to provide reliable support for critical decision making, guiding industry towards more risk-informed and safety-responsible planning.