Context

Introduction

Cyber-physical systems are computer-controlled, networked systems that interact with the physical environment, often in a control loop, some of them in an autonomous way [1,2,3,4,5,6]. Typical examples include autonomous cars, autopilot in an airplane, a heart pacemaker, or cooperating robots in a manufacturing line. Because of their impact on the real world, cyber-physical systems must be built so that they cannot harm or damage people, property, or the environment: Their behavior must be safe and secure. Engineering safe and secure cyber-physical systems has become a specific, exciting, and essential engineering discipline.

A long time ago, computers just processed data, such as keeping accounts or managing inventory. Then they slowly started interacting with the physical world, for example, in the form of embedded computers controlling a combustion engine or as supervisory control and data acquisition (SCADA) systems governing industrial plants. Today, computers controlling all sorts of cyber-physical systems are pervasive: we find them everywhere. They have taken over control in applications ranging from small devices, such as heart pacemakers, to large systems, such as autonomous container ships.

A cyber-physical system receives information about its environment from sensors (temperature, wheel rotation rate, camera, radar, gyroscope, etc.) and acts on the physical environment through actuators (motors, pumps, valves, etc.). The system comprises a number of interacting control algorithms, many of them closed-loop feedback algorithms. Some of these algorithms are based on self-learning (machine learning), for example, an autonomous vehicle’s video processing software.
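
A minimal sketch of such a closed feedback loop is shown below; the sensor and actuator functions, the setpoint, and the proportional control law are hypothetical placeholders for illustration, not taken from the article:

```python
# Minimal closed-loop feedback controller (proportional control).
# All names and values are illustrative placeholders for the real
# sensor/actuator drivers of a cyber-physical system.

import time

SETPOINT_C = 72.0  # desired temperature in degrees Celsius (assumed)
KP = 20.0          # proportional gain, % heater power per degree (assumed)

def read_temperature() -> float:
    """Placeholder for a real temperature-sensor driver."""
    return 70.0  # stub reading

def set_heater_power(power_pct: float) -> None:
    """Placeholder for a real actuator driver; clamps to 0-100 %."""
    power_pct = max(0.0, min(100.0, power_pct))
    print(f"heater power set to {power_pct:.1f} %")

def control_loop(cycles: int = 3, period_s: float = 0.1) -> None:
    """Closed feedback loop: sense, compute the error, actuate, repeat."""
    for _ in range(cycles):
        error = SETPOINT_C - read_temperature()  # deviation from setpoint
        set_heater_power(KP * error)             # proportional correction
        time.sleep(period_s)

control_loop()
```

A production controller would run at a fixed cycle time on an embedded platform; the sketch only shows the sense-compute-actuate structure of the loop.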

Software

Cyber-physical systems are controlled by software, that is, most of their functionality is implemented in software. Control by software carries some risks: A failure, fault, error, or successful cyber-attack—either in the software or in the execution platform—can have grave consequences, such as safety accidents, security incidents, crashes, or casualties. In today’s environment, malicious interactions, such as hacking, malware, or infiltration, can also inhibit correct operation and lead to dangerous consequences. Therefore, the quality properties of the cyber-physical system—especially safety and security—must be assured during all phases of system development, operation, and evolution [7,8,9,10,11,12].

Architecture

At the heart of a cyber-physical system is its architecture [13,14,15,16]: “Fundamental concepts or properties of an entity in its environment (= Context of surrounding things, conditions, or influences upon an entity) and governing principles for the realization and evolution of this entity and its related life cycle processes” [17]. A long—and sometimes painful—history of systems has proven that adequate, sound architecture is indispensable [18]. The architecture provides the foundation for the efficient development and evolution of the cyber-physical system and, to a large extent, also enables its quality properties!

Safety and security

The list of a system’s possible quality properties/attributes is extensive (e.g.: https://en.wikipedia.org/wiki/List_of_system_quality_attributes). For cyber-physical systems, the essential quality properties are safety (e.g.: [19]) and security (e.g.: [20, 21]).

Drift into failure

Fortunately, most modern systems engineering processes are strongly safety- and security-aware [22,23,24,25,26]. In the majority of cases, these processes produce dependable and trustworthy systems. The organizations that build cyber-physical systems are almost always careful and diligent. Nonetheless, the press regularly reports security incidents and safety accidents. Why the discrepancy?

There are many reasons. First, the enormous complexity of today’s (and even more: tomorrow’s!) cyber-physical systems makes it impossible to avoid all vulnerabilities. Second, the operating environment of these systems becomes more hostile every year (higher probability of failures, greater sophistication of malicious activities). Third, market pressure demands low development and production costs. Fourth, the high rate of change often entices developers to “cut corners,” that is, to reduce or skip necessary quality assurance measures, such as modeling, reviews, verification, validation, and thorough testing. The result is an accumulation of technical debt [18, 26] and architecture erosion [18, 27]. This slow, hardly noticeable effect is called drift into failure [28] and constitutes a grave risk for evolving cyber-physical systems.

Last defense

As numerous examples show beyond doubt, it is not possible to eliminate all vulnerabilities from a complex cyber-physical system during development/extension/deployment time. Unfortunately, a likelihood always exists that the system will experience a security incident or generate a safety accident during operation.

Are there mechanisms other than a very diligent development process to reduce the impact or damage of a security incident or a safety accident? Fortunately, the answer is yes: run-time monitoring [29,30,31,32,33,34]. In run-time monitoring, the system’s behavior is observed and automatically checked for compliance against the desired behavior. The desired behavior is defined in policies, specifications, rules, or models. The run-time monitor attempts to identify anomalies, that is, any deviation from the desired behavior. Preferably, the run-time monitor works in real-time: In this case, the monitor can detect, inhibit, or mitigate anomalous behavior before a safety accident or a security incident occurs. The run-time monitor, therefore, acts as a last line of defense (Fig. 1): The system’s engineering process attempts to eliminate the vulnerabilities in the system. However, a (hopefully small) number of vulnerabilities remain in the run-time system! A malicious threat or an unforeseen failure in the run-time system can thus provoke a security incident or a safety accident. If the run-time monitor works correctly and in real-time, it may prevent the security incident or the safety accident, or at least substantially reduce its negative impact. The functionality of the run-time monitor thus forms the last line of defense of the cyber-physical system!

Fig. 1: Run-time monitoring as last defense

Run-time monitoring and protective shell

Run-time monitoring principle

“Run-Time Monitoring as a Last Line of Defense” of a cyber-physical system is used increasingly in various industries (e.g.: [31, 35]). The principle of run-time monitoring is explained in Fig. 2: The real behavior is continuously compared to the desired behavior. The desired behavior can be defined by a number of techniques:

  • The functional specifications, expressed in a formal, machine-readable language (e.g.: [34, 40, 41]).

  • A set of policies, expressed in a formal, machine-readable language [42].

  • A set of rules, expressed in a formal, machine-readable language [32].

  • Structural and behavioral models, expressed in a formal, machine-readable language [43, 44].

  • In addition, the comparison makes use of information, such as operational data, log files, and the context (environment, partner systems, public information).

Fig. 2: Run-time monitoring principle

If a deviation of the real behavior from the desired behavior is detected, the run-time monitor takes corrective action, whenever possible in real-time. Many types of corrective actions are possible, all aiming to avoid or reduce the negative impact of a safety accident or security incident [12].
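
As a minimal sketch of this principle, the desired behavior can be encoded as a machine-readable rule set that the monitor evaluates on every cycle; the rule format, state fields, and corrective action below are illustrative assumptions, not a format prescribed by the cited works:

```python
# Sketch of an active run-time monitor: the observed behavior is
# checked against machine-readable rules; any deviation triggers a
# corrective action. Rules and state fields are illustrative.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    complies: Callable[[dict], bool]           # True = desired behavior
    corrective_action: Callable[[dict], None]  # run on deviation

def enter_safe_state(state: dict) -> None:
    print("deviation detected -> commanding safe state (controlled stop)")

RULES = [
    Rule("speed limit",
         lambda s: s["speed_kmh"] <= 130.0, enter_safe_state),
    Rule("brake/throttle exclusion",
         lambda s: not (s["braking"] and s["throttle_pct"] > 0),
         enter_safe_state),
]

def monitor_step(observed: dict) -> None:
    """One monitoring cycle: compare real behavior with desired behavior."""
    for rule in RULES:
        if not rule.complies(observed):
            print(f"rule violated: {rule.name}")
            rule.corrective_action(observed)

monitor_step({"speed_kmh": 142.0, "braking": True, "throttle_pct": 15.0})
```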

Using run-time monitoring (often called “active run-time monitoring” because of its real-time intervention capabilities) requires two types of system architecture:

  1. The design-time architecture

  2. The run-time architecture

Design-time architecture

The design-time architecture aims to avoid as many vulnerabilities in the system as possible. This is achieved by a diligent, security- and safety-aware systems engineering process and a subsequent vulnerability elimination process (Fig. 3; e.g.: [10]).

Fig. 3: Design-time architecture

Run-time architecture

As soon as the design-time architecture of the cyber-physical system is judged to be sufficiently safe and secure, the system is deployed, that is, transferred to its operational environment and handed over to the users (Fig. 3). Unfortunately, the run-time system may still contain vulnerabilities, which constitute a considerable risk during operation.

Therefore, an additional architectural element protects the run-time system: the (active) run-time monitor (Fig. 4). The run-time monitor enfolds the run-time system and attempts to protect it from the impact and the consequences of threats and failures—whenever possible in real-time. This additional layer of protection can be seen as a protective shell that enfolds the running system. The idea of a protective shell as a separate architectural element and engineering artifact was presumably introduced by Lance Eliot under the name of “AI Guardian Angel Bots” for systems controlled by machine learning [36]. Here, the less exotic name Protective Shell is preferred [12].
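
In software, one way to realize this enfolding is interposition: the shell wraps the run-time system’s actuator interface and vets every command before it reaches the physical world. The following sketch illustrates the idea under assumed interfaces; it is not the design from [12] or [36]:

```python
# Sketch of a protective shell as an interposition layer: the shell
# enfolds the run-time system's actuator interface and vets every
# command before it reaches the physical world. Interfaces assumed.

class Actuator:
    """Placeholder for a real actuator driver."""
    def apply(self, command: float) -> None:
        print(f"actuator executes command {command:.2f}")

class ProtectiveShell:
    """Wraps an actuator; blocks commands outside the safe envelope."""
    def __init__(self, inner: Actuator, safe_min: float, safe_max: float):
        self.inner = inner
        self.safe_min = safe_min
        self.safe_max = safe_max

    def apply(self, command: float) -> None:
        if self.safe_min <= command <= self.safe_max:
            self.inner.apply(command)  # compliant: pass through
        else:
            print(f"shell blocked unsafe command {command:.2f}")
            self.inner.apply(0.0)      # fall back to a neutral output

shelled = ProtectiveShell(Actuator(), safe_min=-1.0, safe_max=1.0)
shelled.apply(0.4)  # within the envelope -> passed through
shelled.apply(7.5)  # outside the envelope -> blocked, neutral fallback
```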

Fig. 4: Run-time architecture

Protective shell

The engineering design and the capabilities of a protective shell strongly depend on the run-time system to be protected. A generic architecture of a system with a protective shell is shown in Fig. 5. At its core is the operational cyber-physical run-time system, including its interfaces to the real world and its network connections. Enfolding the run-time system is the protective shell. The protective shell has more information at its disposal than the run-time system, drawn from additional sources, possibly even from additional hardware. Examples of such information sources include (Fig. 5; a minimal data-structure sketch follows the list):

  • Operational data, log files, functional specifications, behavior models, policies, and specific rule sets

  • Context information (from the environment, from other systems, from public sources, etc.)

  • Access to the sensors (inputs) and actuators (outputs), possibly even via additional sensors or measuring instruments

  • Network usage, monitoring, and logging
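
The following data-structure sketch aggregates these information sources in one place; all field names are assumptions for illustration, not an interface defined in the article:

```python
# Illustrative aggregation of the information sources available to a
# protective shell; field names are assumed, not a defined interface.

from dataclasses import dataclass, field

@dataclass
class ShellInformation:
    operational_data: dict = field(default_factory=dict)   # live telemetry
    log_entries: list = field(default_factory=list)        # recent log files
    behavior_models: list = field(default_factory=list)    # specs, rules, models
    context: dict = field(default_factory=dict)            # environment, partners
    sensor_readings: dict = field(default_factory=dict)    # incl. extra sensors
    actuator_commands: dict = field(default_factory=dict)  # observed outputs
    network_stats: dict = field(default_factory=dict)      # usage and logging
```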

Fig. 5: Protective shell

In addition to traditional techniques, such as range and rate checks on sensor values and discrepancy and plausibility checks on actuator values, the protective shell often uses artificial intelligence and machine learning to detect anomalies [37,38,39, 45, 46, 50]. Any detected behavioral anomaly is immediately analyzed and assessed, and corrective action is taken. Corrective action may include stopping the system, leading the system into a safe state, or switching to safe degraded operation.
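
As a minimal sketch of such checks, the following combines a range check, a rate check, and a simple statistical anomaly score as a stand-in for the ML-based detectors; all thresholds and values are illustrative assumptions:

```python
# Sketch of traditional shell checks (range and rate checks on sensor
# values) plus a simple z-score anomaly detector as a stand-in for the
# ML-based techniques named in the text. Thresholds are assumptions.

from statistics import mean, pstdev

def range_check(value: float, lo: float, hi: float) -> bool:
    """Sensor value must lie inside its physically possible range."""
    return lo <= value <= hi

def rate_check(prev: float, curr: float, dt_s: float, max_rate: float) -> bool:
    """Sensor value must not change faster than physically possible."""
    return abs(curr - prev) / dt_s <= max_rate

def z_score_anomaly(history: list, value: float, threshold: float = 3.0) -> bool:
    """Flag values more than `threshold` standard deviations from the mean."""
    sigma = pstdev(history)
    if sigma == 0.0:
        return False
    return abs(value - mean(history)) / sigma > threshold

# Example: a wheel-speed sensor reading in m/s (values assumed).
history = [21.9, 22.1, 22.0, 21.8, 22.2]
reading = 55.0
plausible = (range_check(reading, 0.0, 80.0)
             and rate_check(history[-1], reading, dt_s=0.1, max_rate=50.0)
             and not z_score_anomaly(history, reading))
print("plausible" if plausible else "anomaly -> analyze, assess, correct")
```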

Emergent behavior

Most cyber-physical systems today consist not of one single, homogeneous system but are assembled from various constituent systems—thus forming a system-of-systems (Fig. 6; [51, 52]). A number of self-contained systems with specific functionality are interconnected to realize higher-level objectives. By combining the functionality of the constituent systems, superior functionality can be achieved, which cannot be provided by any of the constituent systems alone. An example is the set of driver assistance systems in modern cars, such as lane-keeping, distance control, electronic stability control, traffic sign recognition, emergency braking capability, obstacle detection, automatic speed limiter, and airbags. Individually, they offer assistance for specific potential accident situations. However, if the functionality of these systems is combined, a much safer car results. The emergent functionality from combining obstacle recognition with automatic emergency braking capability and electronic stability control will prevent significantly more accidents than each of the individual systems possibly could. This desired, valuable emergent functionality is the reason why the system-of-systems is designed and built!

Fig. 6: Protection against emergent behavior

Unfortunately, assembling system-of-systems from their constituent systems can also generate unexpected, undesired, potentially damaging behavior. The constituent systems’ interconnection may generate unexpected failure modes, unanticipated system weaknesses, or new attack avenues—as negative, unintentional emergence [53, 54]! Arguably, a protective shell is the only run-time defense against such unexpected, dangerous emergent behavior.
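
A shell can, for example, cross-check the combined commands of the constituent systems for contradictions that no single system would detect in isolation. The sketch below illustrates this idea with assumed command fields:

```python
# Sketch of a shell-level consistency check across constituent systems:
# contradictory combined commands (a possible negative emergent
# behavior) are detected at the system-of-systems level. Fields assumed.

def consistent(commands: dict) -> bool:
    """Reject command combinations no constituent system sees in isolation."""
    if commands["emergency_braking"] and commands["cruise_throttle_pct"] > 0:
        return False  # braking while accelerating: emergent contradiction
    if commands["lane_change_active"] and commands["lane_keeping_steer"] != 0.0:
        return False  # two subsystems steering against each other
    return True

combined = {
    "emergency_braking": True,
    "cruise_throttle_pct": 30,   # cruise control still accelerating
    "lane_change_active": False,
    "lane_keeping_steer": 0.0,
}
if not consistent(combined):
    print("emergent conflict detected -> shell overrides to safe state")
```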

Autonomy and machine learning

Modern cyber-physical systems exhibit a strong tendency towards autonomous behavior (e.g.: [55]): Such systems can change their behavior by learning from experience or in response to unanticipated situations during operation. They are characterized by computers (i.e., software) making decisions that affect the physical world, as in autonomous vehicles. In many applications, these decisions are based on machine-learning algorithms [56,57,58], such as recognizing obstacles and their trajectories and speeds from video, radar, or lidar images. Often, the machine-learning algorithms are not based on deterministic calculations but, for example, on statistical methods or the evaluation of training data. This can introduce a high degree of uncertainty and unpredictability into the autonomous system [36, 56, 59], which, in turn, introduces the risk of safety accidents or security incidents. Again, anomaly detection during run-time is the last defense, because predicting, assessing, and mitigating all safety and security risks during development and deployment is hardly feasible in the context of autonomy and machine learning (Fig. 7).
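
One last-defense pattern for machine-learning outputs is run-time gating: a detection is only acted upon if it is both sufficiently confident and physically plausible. The sketch below uses assumed thresholds and fields for illustration:

```python
# Sketch of run-time gating of a machine-learning perception output:
# low-confidence or physically implausible detections trigger a safe
# fallback instead of being acted upon. All values are illustrative.

CONFIDENCE_MIN = 0.9        # assumed acceptance threshold
MAX_PLAUSIBLE_SPEED = 90.0  # m/s; faster "obstacles" are implausible

def gate_detection(obstacle: dict) -> str:
    """Accept an ML detection only if confident and physically plausible."""
    if obstacle["confidence"] < CONFIDENCE_MIN:
        return "fallback: reduce speed, request redundant sensor check"
    if abs(obstacle["speed_mps"]) > MAX_PLAUSIBLE_SPEED:
        return "fallback: treat as sensor anomaly, enter degraded mode"
    return "accept: plan around obstacle"

print(gate_detection({"confidence": 0.62, "speed_mps": 12.0}))
print(gate_detection({"confidence": 0.97, "speed_mps": 240.0}))
```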

Fig. 7: Autonomy and machine learning

Conclusions

The protective shell is a technique that can significantly enhance the safety and security of cyber-physical systems at run-time. It is a current, active research area, and some industries producing mission-critical cyber-physical systems are already implementing it.

However, implementing a protective shell poses several challenges:

  • Using a protective shell requires a very high degree of formalization for reliable anomaly detection [47].

  • Designing a protective shell to protect against damaging run-time behavior is a highly challenging engineering task.

  • The protective shell consumes additional run-time resources (power, CPU, memory).

  • Designing and implementing a protective shell needs highly educated engineers [48].

  • The protective shell’s code and data increase the system’s complexity, which may generate additional failure modes and possibly also enlarges the attack surface [49].