1 Rationale and contribution

Enhancing the resilience of complex internets of utility systems is a top priority, as it is the foundation of Critical Infrastructure Protection (Rehak et al. 2019). The term “Critical Infrastructure” (CI) encompasses procedures, systems, resources, technologies, networks, assets and services vital for the well-being and welfare of individuals, citizens, and governments. These infrastructures can be independent or interconnected and interdependent, including across national borders. Disruptions to any critical infrastructure can lead to catastrophic loss of lives, economic repercussions, and substantial damage to public confidence (Brown et al. 2006). Events such as the September 11 attacks, hurricanes in the Gulf of Mexico, and earthquakes in Haiti and other parts of the world have highlighted the vulnerabilities that exist in these infrastructures and how disruptions can ripple through a region, a country, or even a continent.

CIs encompass not only highly visible assets like nuclear power plants, but also numerous less conspicuous assets that are often indirectly interconnected with other critical infrastructures. In the United States, the Urban Areas Security Initiative grant program in 2006 generated a list of 77,000 assets, which was subsequently reduced to 2100 assets in 2007, though still a substantial number. It is likely that a similar or even greater number of CI assets require protection within the European Union. The European Programme for Critical Infrastructure Protection (EPCIP) was established to provide a framework for identifying and safeguarding a critical infrastructure that, in the event of a fault, incident, or attack, could severely impact the country where it is located and at least one other European Member State. To address CI Protection (CIP), the EPCIP developed a directive, EU COM(2006)-786, which outlines a legislative framework for CIP to promote transparent operations and facilitate cooperation across borders (EU 2006).

For the ICT industry, the CIP programme presents not only a market opportunity but also a significant research and development challenge. Traditional CI protection measures are, as argued in this proposal, no longer adequate in light of several significant technological shifts. Conventional CIs were inherently secure systems due to a combination of factors:

  1. They comprised primarily special-purpose devices designed on proprietary technologies.

  2. The separate sub-systems operated in isolation and did not interact with the external world, except for the system being controlled.

  3. They relied heavily on dedicated, rather than shared, communication links.

  4. They depended mainly on proprietary, rather than open, communication protocols.

However, with the rise of connected devices, the Internet of Things (IoT), and the increase in interconnectivity, traditional security measures have become insufficient to ensure CI protection. As a result, new security strategies and technologies must be developed and implemented to safeguard these systems from cyber-attacks and other security threats. Two current key trends are:

  • Commercial-Off-The-Shelf (COTS) components are being used for implementing Supervisory Control and Data Acquisition (SCADA) systems on a massive scale.

  • Subsystems are being connected using corporate Local Area Network (LAN), or even Wide Area Network (WAN) links as a communication infrastructure, possibly including the public Internet, as well as wireless/satellite trunks.

The conventional architecture of current CIs usually has a hierarchical structure that integrates heterogeneous devices and network trunks, often through shared network connections (Yadav and Paul 2021). Open, standard protocols are increasingly being used, which exposes SCADA systems to the same vulnerabilities that general-purpose Information Technology (IT) systems face (Kalech 2019). As a result, it is increasingly important to have effective Intrusion Detection and Reaction System (IDRS) technology available to safeguard the SCADA systems supporting any Critical Infrastructure. An efficient IDRS should be capable of detecting, analysing, and responding to potential security threats in real-time. It should also have the ability to learn and adapt to new threats to ensure continued protection. With the emergence of machine learning and artificial intelligence technologies, it is possible to develop IDRS solutions that can automatically detect and respond to complex cyber-attacks, thus reducing the response time to potential security breaches. As such, the development and implementation of advanced IDRS technology should be a top priority for organizations responsible for protecting any Critical Infrastructure.

1.1 Proposed approach

Although various technological approaches have been proposed, none of them has been designed with the security of SCADA systems as a whole in mind. The approach proposed in this paper is based on a self-organizing cooperative overlay network of complementary components that are dynamically and autonomously adapted to face distributed cyber-attacks against Industrial Control Systems. This approach differs from other proposed technologies in that it is designed to provide “deep security” that goes beyond single-point protection mechanisms such as central firewall systems. It also specifically addresses the distributed nature of new types of cyber-attacks and implements mechanisms to dynamically orchestrate available resources for effective detection and remediation. Importantly, it also includes the development of a distributed flow monitoring system, which provides input data to distributed cooperative intrusion detection agents. These agents cooperate to improve the identification of attacks originating from both inside and outside the monitored network and support distributed remediation mechanisms.

The paper discusses the design and development of a novel multi-layer distributed IDRS based on the Autonomic Communication paradigm (Fig. 1). Building on the described approach, the proposal aims at being a guideline for experts and practitioners to effectively counteract distributed cyber-attacks: a distributed flow monitoring system provides input data to distributed cooperative intrusion detection agents, which cooperate to improve the identification of attacks originating from both inside and outside the monitored network and to support distributed remediation mechanisms. Currently, intrusion detection techniques rely on hardware and software tools which are pre-configured to analyse specific traffic features according to the capabilities of the underlying monitoring platform and networking infrastructure. We believe that in-advance network planning is not effective for detecting unknown kinds of cyber-attacks, since the traffic metrics to be analysed, the required accuracy, the detection mechanisms, and the countermeasures to be applied strictly depend on the dynamic security scenario the intrusion detection system has to cope with. With this aim in view, a self-managed system can be developed, characterized by self-* properties that support an in-network decision-making process for security purposes. More precisely, the cooperative overlay network will be endowed with self-awareness mechanisms to increase reaction capabilities, as well as with self-adaptation features enabling peers to (re-)configure themselves in order to optimize detection processes and minimise management overhead.

Fig. 1 System overview

1.2 Conceptual architecture

The proposed conceptual architecture for the multi-layer distributed IDRS is illustrated in Fig. 1. It is organized as a typical three-layer architecture. The first layer is the Distributed Monitoring, Detection and Reaction layer, which is responsible for the detection of and reaction to potential cyber-attacks against SCADA systems. The second layer is the Autonomic Management System, which enables self-management and self-configuration of the overlay network of components to optimize detection processes and minimize management overhead. The third layer is the Situational Awareness Supporting Tools, which provides information to specialists about the current status of the infrastructure, particularly when ongoing attacks are detected. The main innovation of the proposed architecture is the self-organizing cooperative overlay network of heterogeneous components, which dynamically adapts to distributed cyber-attacks against SCADA systems and increases the robustness of the overall system. The architecture also includes mechanisms for dynamically orchestrating available resources for distributed detection and remediation, specifically addressing the distributed nature which characterises a large fraction of new cyber-attacks.

The proposed approach utilizes a Bio-Inspired Integration Framework (BIIF) (Del Ser et al. 2019) to implement self-adaptation features. The BIIF aims to replicate the intelligent, bio-inspired processes found in biological systems by using computational algorithms. By leveraging the power of these bio-inspired algorithms, BIIF frameworks can provide highly effective solutions to challenging problems in a variety of fields. The proposed BIIF provides developers and integrators with abstractions and functionalities aimed at facilitating the development of distributed applications having inherent self-adaptation properties. The availability of the BIIF framework also aims at easing the deployment of the system in different types of critical infrastructures with domain-specific traffic, systems and configurations.

The proposed system is realized as an assembly of distributed BIIF agents. These agents have means for sensing and acting on their CI target environment. They also offer means for the development of application-specific components, where behaviours serve as abstractions to define how sensed stimuli will result in actions on the hosting environment. Depending on the design choice of the developer, behaviours can implement complete business functions (e.g., monitoring, detection, reaction) or lower-level functions. Recent works demonstrate that this kind of architecture can be effectively applied in many domains pertaining to critical infrastructure, spanning from Industrial IoT (Hussein and Hamza 2022) to Digital Water Infrastructure (Cali et al. 2023). However, as documented in the literature, no comprehensive technological solution is flexible enough to be easily customized to every case (Sheeraz et al. 2023).

Apart from properties related to the hosting environment (e.g., SCADA traffic), agents can also sense internal parameters, known as “energy levels”, which indicate the fitness of the agent for the currently played roles. These energy levels can be either grounded on physical features (e.g., the charge level of a battery) or purely symbolic, depending on the specific need of the application. The “energy levels” represent homeostatic variables that an agent continuously tries to keep within the boundaries defined by infrastructure owners by means of activation criteria. The ability to activate and deactivate behaviours at run-time based on internal and external parameters is one of the sources of adaptation for BIIF agents. Domain experts can define update criteria of purely symbolic energy levels as a function of the result of the execution of behaviours.
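
To make the agent model concrete, below is a minimal Python sketch of a BIIF-style agent with symbolic energy levels and owner-defined activation criteria. The class and field names (Behaviour, BIIFAgent, cpu_budget) are illustrative assumptions, not the actual BIIF API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Behaviour:
    """Application-specific behaviour: maps sensed stimuli to actions."""
    name: str
    # Activation criterion: the behaviour runs only while the energy levels
    # stay within the boundaries defined by the infrastructure owner.
    is_active: Callable[[Dict[str, float]], bool]
    # Action on the hosting environment; returns energy-level updates.
    act: Callable[[dict], Dict[str, float]]

@dataclass
class BIIFAgent:
    """Sketch of a BIIF agent with homeostatic 'energy levels'."""
    energy: Dict[str, float]
    behaviours: List[Behaviour] = field(default_factory=list)

    def step(self, stimuli: dict) -> None:
        for b in self.behaviours:
            if b.is_active(self.energy):          # run-time (de)activation
                for key, delta in b.act(stimuli).items():
                    self.energy[key] = self.energy.get(key, 0.0) + delta

# Example: a monitoring behaviour that consumes symbolic energy per scanned flow.
monitor = Behaviour(
    name="flow-monitor",
    is_active=lambda e: e["cpu_budget"] > 0.2,     # owner-defined boundary
    act=lambda s: {"cpu_budget": -0.05 * len(s.get("flows", []))},
)

agent = BIIFAgent(energy={"cpu_budget": 1.0}, behaviours=[monitor])
agent.step({"flows": ["10.0.0.1->10.0.0.2:502"]})
print(agent.energy)   # cpu_budget decreases as the behaviour runs
```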

An essential aspect of the SCADA security system realized by the proposal through the BIIF agents is the capability to provide human operators with information about the status of the infrastructure. The interaction with the operators represents a fundamental instrument for decision-making support. The rest of the paper is structured as follows. Section 2 describes the overlay network, which forms the baseline of the proposed architecture. Sections 3, 4, and 5 provide details on the lower layer of the proposed architecture, with a focus on the internal features of distributed monitoring, detection, and reaction, respectively. In Sect. 6, we introduce the middle layer of the architecture, which is the Autonomic Management System, and discuss potential issues that may arise. Section 7 examines the upper layer, which is comprised of the Situational Awareness Supporting Tools. In Sect. 8, we outline our development approach and introduce the technology pillars that are used to realize the proposal. Finally, in Sect. 9 we conclude the paper by summarizing our contributions and providing closing remarks.

2 Overlay network

The baseline of the proposed architecture consists of a cooperative overlay network including heterogeneous components. This infrastructure supports and enables the three main functionalities of the IDRS: monitoring, detection, and reaction. These are realised through three types of distributed system components at the Distributed Monitoring System layer. As depicted in Fig. 1, these components cooperate to improve the effectiveness of the detection and reaction processes. The overlay thus integrates three different classes of nodes which specialise in specific tasks (Monitoring, Detection and Reaction). Each node cooperates with both nodes of the same class to improve class task performance, and with nodes of other classes to support their activities (black and coloured arrows in the figure). Information exchange among the different components can be supported by an advanced peer-to-peer communication protocol. This protocol defines the data format as well as the procedure to properly identify the receiver node to which the information has to be transmitted.
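
As an illustration of the kind of information exchange the overlay protocol has to support, the following sketch shows one possible message layout in which a Detection node addresses Reaction peers by class and scope. The JSON encoding and all field names are assumptions for illustration; the actual protocol specification may differ.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical overlay message: a Detection node notifies Reaction nodes.
message = {
    "msg_id": str(uuid.uuid4()),
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "sender": {"node_id": "det-07", "node_class": "Detection"},
    # Receiver identification: by class and (spatial) scope rather than by a
    # fixed address, so the overlay can resolve the actual peer set.
    "receivers": {"node_class": "Reaction", "scope": "substation-3"},
    "payload": {
        "alert": "suspected-ddos",
        "confidence": 0.82,
        "evidence_flows": ["192.0.2.10->10.1.1.5:502"],
    },
}

wire = json.dumps(message).encode()   # what would travel over the overlay
print(json.loads(wire)["receivers"])
```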

3 Distributed monitoring system

The ultimate goal of this component is to interact with the Autonomic Management System in order to enable, control, and monitor self-adaptation functionalities. The idea is to develop, at the Distributed Monitoring System layer, novel techniques and tools for fast online monitoring of SCADA network traffic. This layer offers both software and hardware probes, by implementing novel algorithms as well as integrating well-known methodologies. Within the proposed system architecture (Fig. 1), real-time monitoring is performed through the interaction with various probes strategically deployed across the system. These probes, including network traffic analysers, log parsers, and system monitors, are dispersed throughout a SCADA infrastructure to gather and potentially refine status data.

Effective and efficient monitoring can be used to gather source data deeply nested throughout the system to be secured, which will make it possible to gain a deep and accurate knowledge of the communication network status. Monitoring data is processed to expose and detect ongoing malicious activities within a network. By carefully deploying a number of monitoring sensors, it will be possible to assess network traffic parameters which help evaluate and classify the current security status of the monitored network. The definition of metrics and descriptive parameters will be carried out in order to allow for the effective separation of legitimate and unwanted behavioural profiles. The objective is to develop novel monitoring techniques capable of controlling traffic flows and providing a security system, i.e. an Intrusion Detection System (IDS), with useful and accurate information. In particular, a flow monitoring framework can be employed. Such a layer can be seen as the component of an IDS responsible for packet capturing and flow information exporting.

Furthermore, the layer includes the development of custom network traffic analysis hardware prototypes that monitor the integrity of sensor/actuator data to and from PLCs/RTUs and are capable of communication via the autonomic communication overlay network. The module achieves its purpose by way of a small hardware-based hashing, data checksum and/or authentication engine. This engine can be implemented in embedded firmware if the RTU/PLC is capable of handling extra data processing, or in a relatively inexpensive dedicated hardware (custom or microcontroller) unit. The solution will be capable of working on protocols such as RS232, RS422, RS485 and RF. This layer is intended for retro-fitting existing SCADA infrastructure. It allows for network flow monitoring on sub-TCP/IP protocol infrastructure levels. The layer is built on previously demonstrated hardware development and integration capabilities (Caenegem and Skordas 2007). Specifically, it incorporates a collection of elements designed for the simultaneous execution of reactive and proactive security algorithms. These algorithms work together to provide improved pattern recognition and behaviour-based detection methods. Additionally, the system includes enhanced and closely linked rule sets for more effective alarm management and root-cause analysis.
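
As a sketch of how such an integrity engine could work in software (e.g., as embedded firmware on a PLC/RTU capable of extra processing), the example below appends a truncated HMAC to each serial frame and verifies it on reception. The frame layout, key handling, and tag length are illustrative assumptions rather than the prototype's actual design.

```python
import hmac
import hashlib

KEY = b"per-link-secret-provisioned-offline"   # illustrative only

def seal_frame(payload: bytes, seq: int) -> bytes:
    """Append a sequence number and a truncated HMAC so the receiver can
    verify that the frame was not altered or replayed in transit."""
    header = seq.to_bytes(2, "big")
    tag = hmac.new(KEY, header + payload, hashlib.sha256).digest()[:8]
    return header + payload + tag

def verify_frame(frame: bytes) -> bytes:
    header, payload, tag = frame[:2], frame[2:-8], frame[-8:]
    expected = hmac.new(KEY, header + payload, hashlib.sha256).digest()[:8]
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed: frame altered in transit")
    return payload

# e.g. a Modbus-RTU-like read request travelling between PLC and RTU over RS485
sealed = seal_frame(bytes.fromhex("010300000002c40b"), seq=1)
print(verify_frame(sealed).hex())
```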

Regarding the distributed network flow monitoring software, its functionality relies on specialized network traffic analysis software, log data parsers and system monitoring software. These components can be distributed throughout a SCADA network, and communicate through the overlay network. They offer a resource-efficient implementation of reasoning software and enable distributed data access for distributed reasoning. The monitored traffic flows can be classified into two main categories:

  • Fine-grained flows: These flows are generated by individual users, and monitoring them is crucial for detecting specific local attacks.

  • Coarse-grained flows: Comprising multiple fine-grained flows, these flows include transport information that describes the entire network context. They are analysed to identify large-scale distributed attacks, such as Distributed Denial of Service (DDoS) attacks.

This categorization plays a crucial role in defining metrics for the IDS. The monitoring systems need to measure specific metrics for each class of flows, depending on the targeted attacks for identification. With the proposed flow monitoring framework, users will have the ability to customize both flow definitions and metric specifications. This customization empowers users to obtain flexible and precise information about the current network status.
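
The following minimal sketch illustrates the two flow classes: fine-grained per-source records are aggregated into a coarse-grained per-service view whose fan-in metric can hint at a DDoS. Flow keys and metrics are illustrative assumptions; the proposed framework lets users customize both.

```python
from collections import defaultdict

# Fine-grained flows: one record per (src, dst, dst_port) as seen by a probe.
fine_flows = [
    {"src": "10.0.0.5",  "dst": "10.1.1.9", "dport": 502, "pkts": 120},
    {"src": "10.0.0.6",  "dst": "10.1.1.9", "dport": 502, "pkts": 95},
    {"src": "192.0.2.7", "dst": "10.1.1.9", "dport": 502, "pkts": 4000},
]

# Coarse-grained flows: aggregate by destination service to expose
# network-wide behaviour such as a DDoS against a single endpoint.
coarse = defaultdict(lambda: {"pkts": 0, "sources": set()})
for f in fine_flows:
    key = (f["dst"], f["dport"])
    coarse[key]["pkts"] += f["pkts"]
    coarse[key]["sources"].add(f["src"])

for (dst, dport), metrics in coarse.items():
    fan_in = len(metrics["sources"])       # many sources -> possible DDoS
    print(dst, dport, metrics["pkts"], "packets from", fan_in, "sources")
```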

4 Distributed detection system

The proposed distributed detection system focuses on two of the main threats affecting interconnected systems: external and internal cyber-attacks. The system will provide solutions for both in the framework of the same distributed approach.

4.1 Detection approach for external cyber-attacks

When a SCADA system is interconnected by a public network infrastructure, external attacks represent one of the main threats to system security. Any vulnerability, indeed, can be remotely exploited to compromise the functionalities of the SCADA infrastructure. This paper addresses external cyber-attacks using a deep security approach. A key objective is the development of distributed intrusion detection systems. The collaboration of distributed entities is a fundamental element in distributed approaches for intrusion detection, as it significantly enhances the performance of the overall detection process. In the context of a SCADA security overlay system, exchanging information among distributed components can yield substantial advantages. The distributed deployment of these components enables the system to gather a more comprehensive understanding of critical events compared to centralized systems. This, in turn, facilitates the implementation of reaction policies that are better tailored and specifically targeted to address the identified security threats.

The first step for the development of a distributed detection system is to provide each agent with the capability to process the information gathered from monitoring systems, enabling the autonomous detection of a potential ongoing external attack. In particular, the proposed approach relies on machine learning techniques for defining behavioural models which are able to detect known and novel attacks. The classification result is then exploited by the other components of the distributed detection system. This approach also allows us to integrate heterogeneous classification and detection systems, so that different techniques can be combined and exploited to improve overall attack detection.
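
The paper does not prescribe a specific learning algorithm; as one possible instantiation, the sketch below trains an Isolation Forest (assuming scikit-learn and NumPy are available) on per-flow features of legitimate traffic and flags deviations from the learned behavioural model. Features and parameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Per-flow features gathered by the monitoring layer, e.g.
# [packets/s, bytes/packet, distinct destination ports in the window].
normal_traffic = np.array([
    [12, 180, 1], [10, 175, 1], [15, 190, 2], [11, 182, 1], [13, 178, 1],
])

# Learn a behavioural model of legitimate SCADA traffic.
model = IsolationForest(contamination=0.01, random_state=0)
model.fit(normal_traffic)

# New observations: the second one resembles a scan/flood pattern.
observed = np.array([[12, 181, 1], [900, 60, 45]])
labels = model.predict(observed)        # +1 = normal, -1 = anomalous
for features, label in zip(observed, labels):
    verdict = "anomalous" if label == -1 else "normal"
    print(features, "->", verdict)
```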

To fully leverage the benefits of a cooperative approach to network security, it is crucial to address the challenges surrounding the determination of necessary information exchange among nodes. This includes identifying the types of information that, when shared, can significantly enhance the overall knowledge and awareness of ongoing critical security events. Additionally, it is important to establish how the increased volume of information should be merged, managed, and interpreted effectively. In distributed detection approaches, it is essential for the exchanged data between security system components to exhibit correlation. Merely having statistically independent data does not contribute to inferring information about network events when observing distributed phenomena. Observations from remote locations do not necessarily provide useful insights into local phenomena that can assist in reducing uncertainty during the event classification process.

This raises two key problems regarding the effectiveness of a distributed system:

  • Determining the types of information that should be shared. It is necessary to identify the specific information elements that need to be exchanged among nodes to achieve improved event classification and reduce uncertainty.

  • Selecting the appropriate peers to which a node should send the shared information. The challenge lies in determining which specific nodes in the network should receive the shared information to maximize the effectiveness of the distributed system.

When an attacker exploits a network for her/his actions, it is possible to detect the traces or “footprints” left at various points within the network. These footprints, although they may vary in form from one location to another, can be observed and aggregated by an intelligent observer who has the ability to collect evidence distributed across the entire network (Nunes et al. 2015). In the case of a distributed attack, numerous footprints are likely to be present, contributing to a higher level of confidence in successfully identifying the attacks. Regardless of the type of attack, it can be viewed as a “network phenomenon”. The evidence of the attack, its location, and its variations are not random occurrences but possess two fundamental characteristics: temporal proximity and spatial proximity. If an observer detects a phenomenon at a certain time t, it is likely that they will observe the same event again at time t + Δt (temporal proximity). Additionally, when one observer detects a particular phenomenon, it is highly probable that a nearby observer will also sense the same event (spatial proximity). These characteristics reflect the fact that observations of a phenomenon are provided by multiple sources in a temporally and spatially correlated manner.

In the context of distributed cooperative detection, the process of data dissemination and propagation is a crucial aspect that needs to be addressed. It revolves around the challenge of exchanging information among peers in an effective manner. The design of an appropriate dissemination protocol should fulfil various requirements that are directly linked to the specific security application being implemented and the desired performance goals of that application. The selection and implementation of a suitable protocol involve striking a balance between conflicting requirements. Generally, four fundamental characteristics can be identified:

  • Peer selection. Determine which peers should be involved in the information exchange process. The selection criteria may vary depending on factors such as their proximity, expertise, trustworthiness, or predefined roles within the cooperative detection system.

  • Time of reporting. Determine the timing of reporting events or sharing information among peers. It is crucial to strike a balance between timely reporting to facilitate swift response and minimizing false positives or unnecessary reporting that can overload the network.

  • Interference with normal network operations. The dissemination process should not disrupt or interfere with the regular operations of the network. It is important to ensure that the exchange of information does not cause performance degradation, network congestion, or conflicts with existing network protocols.

  • Speed of dissemination. Determine the efficiency and speed at which information is disseminated throughout the network. The dissemination protocol should aim to minimize delays, bottlenecks, or any factors that could hinder the timely distribution of information to all relevant peers.

The selection and implementation of a dissemination protocol need to carefully consider these fundamental characteristics to achieve a well-balanced approach that meets the requirements of the specific security application while ensuring optimal performance. In the design of a dissemination protocol, it is crucial to appropriately select the subset of peers to whom a node sends information. Not all the entities within a network may need to receive data from a remote location, so the protocol should aim to optimize the dissemination process. One effective solution is to leverage the principle of proximity: utilizing spatial proximity, a node can determine the set of nodes that should be involved in the dissemination process. This means that a node should be capable of identifying suitable partners with whom it needs to cooperate, and this selection should be dynamic, based on the specific events or circumstances that require cooperation.

The time of reporting is another important parameter that characterizes dissemination protocols. It is closely tied to the specific security application being implemented and can have a considerable impact on the protocol design. Reporting time refers to the timing at which the dissemination protocol is employed to share information among peers, and it is influenced by factors such as the urgency of the information, the desired speed of response, and the nature of the security threats being addressed. The reporting time problem is intricately linked to the challenge of data synchronization among peers: in order to be effective, peers should have synchronized clocks or a way to accurately determine the relative timing of events, so that the reporting of information occurs in a coordinated manner.
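
A minimal sketch of the proximity-based peer selection discussed above follows, assuming each node knows an abstract position for its peers (e.g., derived from network topology); in practice the selection criteria could also weigh trust, expertise, or predefined roles.

```python
import math

# Known peers of a detection node, with an abstract "position" that may
# encode topological distance, plus the class of task each peer performs.
peers = {
    "det-02": {"pos": (1, 1), "node_class": "Detection"},
    "det-09": {"pos": (8, 7), "node_class": "Detection"},
    "rea-01": {"pos": (1, 2), "node_class": "Reaction"},
    "rea-05": {"pos": (9, 9), "node_class": "Reaction"},
}

def select_peers(event_pos, radius, node_class=None):
    """Return peers within `radius` of the observed event (spatial proximity)."""
    chosen = []
    for name, info in peers.items():
        if node_class and info["node_class"] != node_class:
            continue
        if math.dist(event_pos, info["pos"]) <= radius:
            chosen.append(name)
    return chosen

# A suspicious flow was observed near position (1, 1): report only to nearby
# detection peers, and to nearby reaction nodes for possible remediation.
print(select_peers((1, 1), radius=3.0, node_class="Detection"))
print(select_peers((1, 1), radius=3.0, node_class="Reaction"))
```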

The interference caused by the dissemination protocol with normal network operations is closely connected to the reporting time problem. The protocol can introduce significant traffic overhead, which becomes more pronounced with increased frequency of data reporting. The type and size of the information being disseminated also impact the level of interference caused by the protocol. Balancing the messaging frequency and the amount of information exchanged becomes crucial in protocol design and implementation as it directly affects the effectiveness of cooperative actions and the interference with normal network operations. In security applications where network dependability is critical, it may be preferred to tolerate higher overhead to ensure excellent performance of the cooperative system as a whole or for specific subsystems. In such security-critical scenarios, prioritizing security messaging through Quality of Service (QoS) features at the transport layer becomes desirable, and the protocol should support this.

The speed of dissemination is another parameter that characterizes the data-exchanging process. It determines the time it takes for information to reach all partners in the overlay network. While spatial proximity is a useful selection principle, it may not always be the optimal choice for the protocol being developed. Certain applications may require information to reach the widest possible number of nodes, necessitating a dissemination protocol that prioritizes broader coverage. In addition, data aggregation plays a vital role in a cooperative system. This process, often referred to as Data or Information Fusion, involves merging data from multiple sources that may differ in their conceptual or contextual representations. The goal of Information Fusion in this context is to obtain a more accurate understanding of the reality in the network, enabling improved detection and reaction to specific anomalies.
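
As one simple illustration of Information Fusion, the sketch below combines per-observer attack confidences into a single score via naive log-odds pooling. This is just one fusion rule among many and is not a method mandated by the proposal; the prior and confidences are illustrative.

```python
import math

def fuse(confidences, prior=0.01):
    """Combine per-observer attack probabilities via naive log-odds fusion,
    assuming observers are conditionally independent given the attack."""
    logit = lambda p: math.log(p / (1 - p))
    total = logit(prior) + sum(logit(c) - logit(prior) for c in confidences)
    return 1 / (1 + math.exp(-total))

# Three spatially close monitors report correlated, moderately confident
# observations of the same phenomenon; the fused evidence is much stronger.
print(round(fuse([0.6, 0.7, 0.65]), 3))
# A single isolated report remains weak evidence.
print(round(fuse([0.6]), 3))
```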

4.2 Detection approach against internal cyber-attacks

SCADA systems and controlled devices frequently rely on “security by obscurity”. It has however been demonstrated that obscurity cannot be relied upon for SCADA system security, because attackers can cause significant harm without having any knowledge of a SCADA system infrastructure or of the functionalities it provides (Nunes et al. 2015). Such attacks, as well as attacks directed against very specific types of devices (e.g., those controlled by SCADA systems), can be detected by using software “honeypot” systems that pose as the physical devices used in the operation of CIs. In the proposed architecture, we envision the development of a distributable, autonomic-communication-enabled software agent that can automatically learn usual behaviour patterns from the communication exchanged between actual SCADA systems and devices, replicate these patterns to pose as a device to malicious parties (without interfering with the actual infrastructure that is being controlled), and identify abnormal communication behaviour or instructions being sent to it.
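
A minimal honeypot sketch, under the assumption of a TCP-based SCADA protocol, is shown below: it replays previously learned request/response pairs so as to pose as a field device, and raises an alert for any instruction it has never observed on the real link. The learned pairs and the listening port are illustrative.

```python
import socketserver

# Request/response pairs previously learned by observing real PLC traffic
# (hex-encoded, Modbus-TCP-like; illustrative values only).
LEARNED = {
    bytes.fromhex("000100000006010300000002"):
        bytes.fromhex("0001000000070103040000000a"),
}

class HoneypotHandler(socketserver.BaseRequestHandler):
    def handle(self):
        request = self.request.recv(260)
        if request in LEARNED:
            self.request.sendall(LEARNED[request])      # behave like the PLC
        else:
            # Unknown instruction: likely reconnaissance or an attack probe.
            print(f"ALERT: unexpected request {request.hex()} "
                  f"from {self.client_address[0]}")

if __name__ == "__main__":
    # Binding the real Modbus/TCP port 502 needs elevated privileges,
    # so a high port is used here for testing.
    with socketserver.TCPServer(("0.0.0.0", 5020), HoneypotHandler) as srv:
        srv.serve_forever()
```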

5 Distributed reaction system

Once an intrusion is detected, the remediation subsystem should provide the framework with mechanisms and policies for fast and effective remediation, aimed at countering and repairing consequences of an ongoing attack, and preventing any further damage to the system. Any remediation strategy will utilise a combination of suitable actions to be performed to guarantee effective attack remediation. When an attack is detected and an alert has been issued by the detection system, information about the presumed source and cause of the attack triggers appropriate remediation actions.

The overlay network provides the communication infrastructure that distributes information about an alert to all “interested” components. According to the type of attack that has been detected, the most appropriate remediation strategy will be selected and put into place. Since most attacks use spoofing in order to conceal the real attack source, a strategy that allows the system to carry out remediation actions as close as possible to the real attack source is necessary. Simple traceback based on the source IP address is ineffective, so more appropriate traceback methods will be designed and implemented. Better traceback will help in discovering the point of the monitored network which is closest to the attack source, and will maximize the effect of remediation strategies. For this purpose, the monitoring system will contribute significantly, since, through constant interaction with the remediation system, it is capable of localising the presence of specific traffic patterns within the network. Packet dropping and traffic shaping may be reasonable options for remediation strategies, since they can utilise traceback information to reduce the risk of altering traffic flows which are not actually harmful. A feedback mechanism allows the security framework to evaluate the performance of a remediation strategy when it is applied, and then to fine-tune or eventually interrupt it through continuous analysis of the dynamics of network traffic at the overall network level.

Any remediation operation will be coordinated in cooperation with the other components of the framework, managed via the overlay. The remediation components cooperate and continuously communicate with other node classes in the overlay infrastructure which provides service to the remediation node population.

In the absence of anomalies, remediation components do not interfere with regular network operations and remain inactive until they are needed to perform remediation actions. Remediation components follow a policy-based configuration paradigm in order to engage countermeasures appropriate for an identified attack or threat. Policies define actions belonging to three main classes, illustrated by the sketch after the list:

  • policies for altering normal traffic forwarding schemes (e.g. shaping, dropping); this is useful to mitigate the effects of an attack by means of a fine-grained intervention on single packets; such policies configure the behaviour of traffic classifier and traffic scheduler modules embedded in the router forwarding plane;

  • policies for packets “sanitization”, meaning re-injecting packets which have been previously “purified” from dangerous content back into the network;

  • policies for logging traffic packets belonging to identified flows; this is aimed at both archiving anomaly-related data and collecting information useful for system tuning in an offline analysis phase.
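
A minimal sketch of how these three policy classes could be represented and matched against an incoming alert is shown below; the policy fields and action names are illustrative assumptions, not a defined policy language.

```python
# Illustrative policy records, one per class described above.
POLICIES = [
    {"class": "forwarding", "match": {"attack": "ddos"},
     "action": {"type": "shape", "rate_kbps": 64}},
    {"class": "sanitization", "match": {"attack": "payload-injection"},
     "action": {"type": "reinject-clean"}},
    {"class": "logging", "match": {"attack": "*"},
     "action": {"type": "log-flow", "store": "forensics"}},
]

def applicable_policies(alert):
    """Select every policy whose match clause covers the reported alert."""
    for policy in POLICIES:
        wanted = policy["match"]["attack"]
        if wanted == "*" or wanted == alert["attack"]:
            yield policy

alert = {"attack": "ddos", "flow": "192.0.2.7->10.1.1.9:502"}
for policy in applicable_policies(alert):
    print("engage", policy["class"], "->", policy["action"])
```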

6 Autonomic Management System

The Autonomic Management System (AMS) for IDRS is capable of automating the reaction to threats and alarms, as well as providing real-time decision support to a human manager. This is achieved with the development of two interacting sub-systems, namely the policy-based management decision tool described in this section and the situational awareness support tool (SAST) described in the next section. The effective management of networked critical infrastructures requires addressing diverse requirements. The AMS layer tackles this challenge by employing a novel policy-based organization model and decision tool. The main advantage of a policy-based system is its ability to introduce controlled programmability in the management system without compromising overall security and integrity. The IDRS system’s real-time adaptability can be automated and simplified through the use of the Policy-Based Management (PBM) paradigm. Simultaneously, the managed system reports contextual information and events to the Situational Awareness Support Tool (SAST), providing the necessary feedback to close the control loop. Moreover, a Policy Enforcement Point (PEP) enforces policies and reports data to the SAST, which, in turn, provides feedback for the Policy Decision Point (PDP) functionality. Our approach combines hierarchical and distributed paradigms in a hybrid model, leveraging policy-based control techniques to manage the trade-off between accurate Situational Awareness and management traffic overheads.

To achieve efficient and scalable policy-based management, the designed framework addresses collaborative policy definition, policy distribution and replication, and distributed policy enforcement. A step-by-step methodology is followed for the policy-based processes, guiding the design and implementation of policies from requirements gathering to the deployment of policy instances in a policy repository. This methodology enables the creation of both infrastructure-independent and infrastructure-specific policy specifications that can interoperate across the PBM systems of different infrastructures, bridging the gap between proprietary specifications and the need for interoperable critical infrastructure management. The proposed methodology consists of five steps:

  1. Requirements gathering and system description.

  2. Design and definition of policy types.

  3. Representation of the Policy Information Model.

  4. Mapping the Information Model to the Data Model.

  5. Implementation, Deployment, and Testing.

Through these five steps, the entire implementation process is followed, starting from requirements gathering and ending with implementation and system deployment. The designed policy-based management decision support closely interacts with the SAST to provide remediation solutions and proactive measures. The SAST supplies input for automated policy enforcement actions and essential parameters related to the current system’s status and threat levels. This contextual information is matched against policy conditions to trigger corrective actions and countermeasures against identified threats. Policy sets are defined to correspond to the current threat level, and different policy sets can be applied to infrastructure subsystems based on spatial and/or temporal proximity to reported “network phenomena”.
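
As an illustration of this matching step, the sketch below shows a minimal PDP that selects the policy set for the current threat level reported by the SAST and fires the rules whose conditions hold against the supplied context; the context fields, thresholds, and action names are illustrative assumptions.

```python
# Policy sets keyed to the current threat level reported by the SAST.
POLICY_SETS = {
    "low":  [{"condition": lambda ctx: False, "action": "monitor-only"}],
    "high": [{"condition": lambda ctx: ctx["failed_logins"] > 20,
              "action": "isolate-subsystem"},
             {"condition": lambda ctx: ctx["ddos_score"] > 0.8,
              "action": "rate-limit-upstream"}],
}

def decide(context):
    """PDP sketch: pick the policy set for the threat level, fire matching rules."""
    actions = []
    for rule in POLICY_SETS.get(context["threat_level"], []):
        if rule["condition"](context):
            actions.append(rule["action"])      # to be enforced by the PEPs
    return actions

# Context as supplied by the SAST for a subsystem near a reported "network phenomenon".
ctx = {"threat_level": "high", "failed_logins": 35, "ddos_score": 0.4}
print(decide(ctx))     # ['isolate-subsystem']
```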

7 Situational Awareness Supporting Tools

A key component is the Situational Awareness Supporting Tools (SAST) dedicated to monitoring and analysis of the telecommunication and SCADA networks used by a particular Critical Infrastructure operator. The main differences with current solutions are:

  • Real-time cooperation with IDRS.

  • Analysis of both historical and current data about events.

  • Wider scope covering analysis, simulation, support and evaluation of operators’ decisions.

  • Semantic description of the relevant multi-domain knowledge (business values, procedures flow).

  • Global/local view analysis (e.g., by analysing data from other neighbouring network domains).

  • Application of correlation and data mining algorithms.

Moreover, the SAST offers a comprehensive solution that can be applied across various domains. Unlike current tools that focus on network security or specific domain awareness, the SAST provides multi-domain analysis of network security and resiliency within any Critical Infrastructure environment. The primary objective of the SAST layer is to enhance the situational awareness of critical infrastructure operators. It achieves this by visualizing the network status and providing information on historical and current network events and security incidents. The tools leverage data from historical network performance, the underlying online IDRS system, and reported network events. The SAST analyses and presents threats, offers support and guidance to the operator, and evaluates potential actions and decisions made by the operator.

Furthermore, the SAST layer enables operators to incorporate high-level knowledge, such as business models or the business value of specific elements, through semantic descriptions. It also allows operators to visually track and annotate the “footprints” of attacks, correlating them with other network incidents to facilitate further analysis and information combination.

To achieve the key goal of improved Human and Organizational Performance (HOP) (Nunes et al. 2015), the SAST offers effective technological support in the area of Human-In-the-Loop (HIL), which is crucial for enabling judgment and knowledge-based decisions in automated systems. Integrating human oversight into decision-making processes allows for a balanced approach, where human operators can intervene and override automated decisions when necessary. This mechanism ensures that the critical infrastructure is human-centric, end-users are supported by smart-technology-driven aspects [as typically required (Jwo et al. 2021)], and have the option to retain full control of the platform at all times, should they desire. By incorporating human expertise, complex scenarios and nuanced situations can be better understood, leading to improved decision-making accuracy and adaptability. Furthermore, human intervention adds a layer of accountability and ethical considerations, as it helps prevent potential biases or errors that automated systems may exhibit. This collaborative human–machine approach fosters trust and confidence in the technology, making it more acceptable and usable in critical applications. Striking a harmonious balance between automation and human judgment ensures that the technology serves as a tool to augment human capabilities rather than replacing human agency, thereby enhancing overall system resilience and reliability. Importantly, to fully unleash the potential of the proposed solution in terms of HOP improvement, the SAST can be configured based on specific governance models, specifying best practices and Standard Operational Procedures (SOPs) of the organization, for more effective and timely management of prevention, detection, response, and mitigation of the consequences of cyber and physical attacks.

8 Development approach and technology pillars

For the development of the proposed framework, the following objectives have been set out:

  • Solutions must reach a high Technology Readiness Level (TRL), as defined by the European Commission and called for by the research community (Riglietti et al. 2018), in a relatively short time.

  • Solutions must be easy to install, configure, and maintain.

  • The Total Cost of Ownership (TCO), i.e. the actual cost of the system (including initial purchase, operation, and maintenance), must be low.

  • European Technological Sovereignty must be preserved.

To achieve the desired objectives, the approach involves leveraging a mature tool based on the authors’ prior research and further advancing it (Coppolino et al. 2018, 2019). This will be accomplished by integrating carefully selected and top-performing Open-Source solutions, ensuring the tool’s enhanced capabilities and effectiveness. Overall, the framework relies on three pillars of cyber protection technology, namely: real-time security monitoring, Big Data Analytics, and protection of data in use.

8.1 Real-time security monitoring

Real-time security monitoring encompasses various technologies, among which Security Information and Event Management (SIEM) plays a crucial role. SIEM solutions, provided by reputable vendors like IBM, DELL, Exabeam, McAfee, Securonix, Splunk, and LogRhythm (Coppolino et al. 2018), are specifically designed to acquire, analyse, correlate, and report information from various data sources. These sources typically include network devices, identity management devices, access management devices, and operating systems. While individual SIEM products offer valuable data, they may lack comprehensive visibility across a broader spectrum of security elements necessary to effectively identify the increasing number and diverse types of cyber-attacks targeting corporate and government enterprises.

Another important enabling technology for security monitoring is the Security Operations Centre (SOC). A SOC is a central location, within a building or facility, from which staff supervise the site using data processing technology. Typically, SOCs are equipped for access monitoring and for controlling lighting, alarms, and vehicle barriers, and are therefore strictly related to physical security. Recently, many examples of extremely sophisticated cyber-enabled SOCs have appeared, even if they remain mainly devoted to physical security.

Yet another important security monitoring technology is the IDS. Despite the advancements in IDS technology, certain limitations persist. These include relatively poor detection accuracy, a high rate of false positives, scalability constraints, challenges in detecting emerging attacks due to evolving evasion techniques, and limited diagnostic capabilities (Kassimi et al. 2022). These ongoing issues highlight the need for continued improvements in IDS technology to overcome these limitations and enhance their effectiveness. Our solution is founded on the qualities of autonomic agents, such as autonomy, intelligence, parallel processing, collaboration, and reasoning, on which the concept and motivation of multi-agent systems are based.

Existing SIEM and security monitoring solutions have yet to fully harness the potential of two promising components: Business Process Management (BPM) and Business Activity Monitoring (BAM). These elements have not been adequately explored in the current market offerings, despite their relevance. The BPM market has seen advancements, as evidenced by Gartner’s evaluation of leading vendors such as Pega, IBM, and Appian in their Magic Quadrant for Intelligent Business Process Management Suites (Gartner). On the other hand, BAM software plays a crucial role in monitoring business activities implemented in computer systems. It enables near real-time monitoring, tracks Key Performance Indicators (KPIs), presents data through dashboards, and offers automatic notifications for deviations. While the impact of BPM and BAM on organizational efficiency and effectiveness is debatable, it is clear that these components process a substantial amount of information closely tied to security matters.

The proposed framework brings a significant advancement in SIEM and real-time security monitoring technologies. It extends SIEM and other security monitoring technologies to multiple domains and multiple layers. The management of cross-layer security information and events is a problem that organisations are starting to face with the increased adoption of service-oriented infrastructures and architectures. Furthermore, the framework considers the impact of the disclosure of private information across domains, so as to be acceptable to society from the point of view of human behaviour as well as of human rights principles and legal and economic viability. The framework extends the applicability and expressiveness of security monitoring technologies from the infrastructure domain, where it is mostly confined today, to a multi-domain view and to high-level processes and services, in order to perform security-related event processing and monitoring at the service level. It also extends the evaluation and correlation capabilities of security monitoring systems. The main improvements are: (1) support for the definition of relations between events and the automatic processing of correlations for fine-grained decisions on possibly critical situations; (2) advanced techniques (e.g., predictive security monitoring) for the evaluation of security-related events, integrated within the proposed eco-system; (3) extended expressiveness of event processing to enable capturing, filtering, correlating, and abstracting events as well as triggering alarms and countermeasures; (4) dynamic abstraction techniques that enable the framework’s security monitoring features to cope with the ever increasing scale of the systems to be protected.

The framework also improves the integration between SIEM, BPM, and BAM technologies. As many emerging attacks exhibit discernible symptoms in terms of QoS degradation, leveraging BPM and BAM can significantly improve detection capabilities. By comprehending the Business Process Logic, these components enable the detection of new types of faults and attacks, such as orchestration flaws and misuse case exploits. Innovative computing models can be developed, combining edge-side processing (in close proximity to data sources with potentially limited computing power) and core-side processing (in the central infrastructure with virtually unlimited computing power, especially in cloud deployments). This approach facilitates effective data processing and analysis.
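
As a concrete illustration of cross-layer correlation, the sketch below raises an alarm only when an infrastructure-level anomaly and a business-level KPI degradation are observed for the same service within a sliding window; the window size, event fields, and rule are illustrative assumptions rather than the framework's actual correlation engine.

```python
from collections import deque

WINDOW = deque(maxlen=100)   # recent events from SIEM, BPM and BAM feeds

def correlate(event):
    """Raise an alarm when an infrastructure anomaly and a business-level
    QoS degradation are observed for the same service in the same window."""
    WINDOW.append(event)
    anomalous = {e["service"] for e in WINDOW
                 if e["layer"] == "infrastructure" and e["type"] == "anomaly"}
    degraded = {e["service"] for e in WINDOW
                if e["layer"] == "business" and e["type"] == "kpi-degradation"}
    return sorted(anomalous & degraded)

events = [
    {"layer": "infrastructure", "type": "anomaly",
     "service": "billing", "detail": "port-scan"},
    {"layer": "business", "type": "kpi-degradation",
     "service": "billing", "detail": "order latency above SLA"},
]
for event in events:
    hit = correlate(event)
    if hit:
        print("cross-layer alarm for services:", hit)
```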

8.2 Big Data Analytics

Big Data Analytics is a powerful tool for large-scale analysis of live data. It is a consolidated and well-researched technology, with major Open Source initiatives (notably those of the Apache Foundation) that have developed effective solutions for efficiently running various kinds of queries on live or stored data. It is ideal for applications that incorporate computation on massive amounts of data, including the monitoring of freely definable events or processes and the detection of security threats. The variety of purposes for which Data Analytics can be used for security-enhancement features has continued to grow during the past few years and will continue to grow as more and more devices generate enormous amounts of data. While the data keeps increasing, different scenarios require near real-time analysis (e.g. intrusion detection, fraud detection, process monitoring). In most cases, data security remains an essential open question for the commercial Data Analytics solutions currently available. While a higher level of security can be achieved by encrypting stored data and communication, the processing itself, i.e. when data is “in use” (as opposed to “in transit” or “at rest”), is currently restricted to decrypted data. Homomorphic encryption can be used to perform computation on encrypted data. However, these approaches usually introduce overhead and limitations in algorithmic freedom that render them impractical.

As program execution itself handles unencrypted data, it may be possible for a malicious process or, even more so, for the Operating System or privileged software (e.g. the hypervisor) running on the same machine to steal or even manipulate data. Therefore, the data owner must currently fully trust the environment and machine on which the processing is executed. The proposed framework exploits hardware-based security features that will be available in next generation commodity CPUs to improve Data Analytics towards yet unresolved aspects of security. New generations of Intel and ARM processors feature hardware-based encrypted processing at the processor level. Thereby, the data and the code that are used during the execution of a program are shielded from external inspection and manipulation, even by privileged users (e.g. the system administrator or the cloud provider) and software (e.g. the operating system or the hypervisor). The processor provides a Trusted Execution Environment (TEE), where the data of an application is protected from accesses by the operating system, the hypervisor or any other code running on the system, and even attackers with physical access cannot read or modify the data or the code of protected programs (Schuster et al. 2015). With this important improvement, the proposed framework advances Data Analytics towards satisfying rigorous security rules, thus making it a key enabling technology of advanced security-enhancing services. Higher security in Data Analytics allows data analysis applications to perform processing on data that has not been accessible before due to security concerns. The resulting extensive insight will, with regard to anomaly detection and process monitoring, allow applications to reduce false positives, increase quality, and execute advanced data analysis queries. Importantly, this will allow security and monitoring technologies to advance to more complex scenarios, as higher security allows these technologies to grasp a broader but also more detailed view of the global context. Also importantly, Data Analytics features, besides being used by framework services for processing purposes, can also be made available according to the Software as a Service (SaaS) paradigm for direct use by solution providers and end users alike. The usage of TEE technologies enables the processing of data feeds and data streams at remote locations without sacrificing data security. This will ultimately allow for participation in a larger Data Analytics eco-system while maintaining data confidentiality.

8.3 Protection of data in use

The current security solutions for protecting data-in-transit and data-at-rest are well consolidated. There are well-known hardening techniques based on encryption, such as AES, PKI, SHA, and SSL/TLS, for protecting the data persistently stored on disk and the data exchanged between two endpoints. By contrast, the protection of sensitive data-in-use is more difficult to enforce. In fact, at a certain point, data sits in DRAM memory unencrypted, ready to be processed. At that moment, an attacker could steal or tamper with the targeted protected information. The worst situation is an adversary who has somehow escalated to root privileges before launching the attack. Several run-time attacks on data-in-use pose risks to organizations handling sensitive data. These attacks include code injection, Iago attacks, and Code-Reuse Attacks (CRA) such as Return-Oriented Programming, control flow hijacking, and buffer overflow attacks. A potential scenario that poses a significant risk is when an attacker gains physical access to servers and inserts a USB pen drive into the machines to exploit vulnerabilities. For example, a vulnerability in the tower_probe function of Linux kernels (before version 4.8), as reported by the NIST National Vulnerability Database, allows local users with physical proximity to gain privileges by exploiting a race condition and a NULL pointer dereference, resulting in a write-what-where condition. Another example is the buffer underflow bug discovered in the realpath() function of glibc (prior to version 2.26), which is a POSIX library used in various Linux distributions. These vulnerabilities can be leveraged by attackers to gain unauthorized access and compromise the security of the systems.

To cope with this, a possible solution is leveraging the Trusted Execution Environment (TEE) features of COTS CPUs. In the case of server-side support, the framework can be built using Intel’s Software Guard eXtensions (SGX), which is a widely accepted TEE paradigm among developers, software vendors, OEMs, and software ecosystem partners. TEEs, including SGX, offer attractive features such as integrity and confidentiality protection for data-in-use, even against super-privileged software and users. However, it is important to note that utilizing TEEs, and specifically SGX, requires expertise in software design and cybersecurity (Intel). The development of applications using SGX is typically undertaken by security-aware developers with advanced skills in these domains. Unfortunately, such developers are limited in number and come at a high cost, making them less accessible for small and medium-sized enterprises (SMEs).

Recently, TEE technology has come under criticism as ‘not being able to deliver on its promise of superior security’ due to side-channel attacks. However, it is important to highlight that (1) these attacks (notably Spectre, Meltdown, NetSpectre, and SwapGS) require very high skills, which restricts the population of potential attackers to a very small set of individuals; (2) even for an attacker with those skills, the success of these attacks depends on a number of context conditions which can hardly be recreated artificially, since they require that a handful of intrinsically random phenomena be simultaneously brought to a specific deterministic state; and (3) the major OS and security solution vendors have already delivered patches for the recent vulnerabilities.

The framework provides a sophisticated solution against data-in-use attacks via hardware-assisted hardening mechanisms of trusted computing. With hardware assistance, the machine consistently behaves as expected, and this behaviour is enforced by both computer hardware and software. This is achieved by incorporating unique encryption keys into the hardware, which are inaccessible to the rest of the system and typically provided by technology providers such as Intel. Trusted computing relies on the establishment of a Chain-of-Trust, which involves validating each component of hardware and software starting from the end entity and progressing up to the Root-of-Trust. In hardware-assisted trusted computing, the trust anchor is usually placed within a specific piece of hardware. This approach ensures that security is maintained even against powerful attackers who have full control over the system, such as a malicious user who has escalated their privileges to the highest level in the operating system. By rooting the user’s trust in the silicon itself, it becomes more difficult for attackers to modify the hardware’s functionality, providing a higher level of security assurance. Hardware-assisted trusted computing thus enhances the overall security of the computing system by leveraging the capabilities of the underlying hardware to establish and maintain a trusted environment.
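
The following technology-agnostic sketch illustrates the Chain-of-Trust idea by extending a hash chain over the components validated from the Root-of-Trust upwards, in the spirit of TPM PCR extension; in a real deployment the measurements and keys live in hardware (a TEE or TPM), not in application code, and the component names are illustrative.

```python
import hashlib

# Components validated from the Root-of-Trust upwards; the byte strings are
# illustrative stand-ins for firmware, bootloader, OS and service images.
COMPONENTS = [b"firmware-v3", b"bootloader-v8", b"os-kernel-5.10", b"idrs-agent"]

def measure(components, root=b"root-of-trust"):
    """Extend a hash chain over each component, a software analogue of
    hardware measurement registers."""
    digest = hashlib.sha256(root).digest()
    for blob in components:
        digest = hashlib.sha256(digest + hashlib.sha256(blob).digest()).digest()
    return digest.hex()

EXPECTED = measure(COMPONENTS)          # golden value recorded at provisioning

# At attestation time the same chain is re-measured; any modified component
# changes the final digest and breaks the Chain-of-Trust.
tampered = COMPONENTS[:2] + [b"os-kernel-5.10-patched-by-attacker", COMPONENTS[3]]
print("boot chain trusted:", measure(COMPONENTS) == EXPECTED)
print("tampered chain trusted:", measure(tampered) == EXPECTED)
```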

The framework exploits the security features of hardware-assisted trusted computing that are being made available by major CPU vendors (notably, ARM and Intel) for creating execution environments that are protected from attacks launched at multiple architectural levels, even by privileged software (e.g. the Operating System or the Hypervisor) or privileged users (e.g. the System Administrator or the Cloud Provider). The framework builds a security-enhanced execution environment to provide effective protection of data-in-use against the most common attacks by privileged software and/or users. The framework will leverage both the isolation and the attestation features of available technologies to create a circle of trust among the microservices designed in the framework. In this regard, we will investigate an advanced solution for attestation in the context of distributed microservices, as this is still missing in the literature, as also noted in the position paper by Ankergård et al. (2021). Moreover, framework solutions will make the superior security of TEE technology transparently available to end users, by developing an adaptation layer that will shield the application level from the complexity of the underlying hardware support. Therefore, making TEE accessible to a wide public constitutes a real innovation on the path towards better cyber resilience. The authors have wide previous experience on these topics (e.g., Campanile et al. 2017; Coppolino et al. 2010, 2017).

9 Summary and conclusions

This paper addressed a research topic, namely the resilience improvement of complex internets of utility systems, which is still an open issue, since available solutions fail to implement an integrated approach to detection, mitigation, and reaction able to face both well-known and new, previously unseen cyber-attacks (in particular distributed ones, which constitute one of the most serious and still unresolved threat scenarios affecting networked systems). The topic is particularly important since it is the enabling factor of Critical Infrastructure Protection. This work presented the conceptual architecture of a novel multi-layer distributed IDRS based on the Autonomic Communication paradigm. The architecture relies on a self-organizing cooperative overlay network of complementary components which are dynamically and autonomously adapted in order to face distributed cyber-attacks against Industrial Control Systems. The proposed conceptual architecture aims at being a guideline for experts and practitioners interested in counteracting novel kinds of cyber-attacks. After the needed customization effort, it can be applied to a specific Critical Infrastructure from any application domain. A distributed flow monitoring system provides input data to cooperative intrusion detection agents, enabling the correlation of information from heterogeneous feeds in order to improve the identification of attacks originating from both inside and outside the monitored network, and to support customizable remediation mechanisms.

The proposed approach is being implemented in an integrated framework characterized by a high degree of usability, allowing non-IT-experts (i.e., operators with limited technical skills/background) to use the platform effectively. A large fraction of system configuration procedures is automated. Automation has become a powerful and effective component of cyber security incident response. The framework achieves trustworthy automation via the combined use of advanced correlation techniques and powerful data analytics on monitored events and external feeds. To address the perceived disadvantages of automation (e.g. loss of control, lack of trust, fear of change), the framework: (a) takes a Human In the Loop (HIL) approach, to allow for judgment and knowledge-based decisions, which enables humans to override automated system decisions and thus ensures that end-users retain, if they so wish, full control of the platform at all times; and (b) provides evidence of the advantages of automated support, such as increased efficiency, fewer errors, and better and more timely decision making.

To reduce operations costs, the techniques, solutions, tools, and services are “cloud-ready”, meaning that it is possible to seamlessly deploy the framework on the cloud, based on the specific needs and/or resources of the company (e.g., an operator with a stronger IT department might opt for a dedicated setup on a private cloud, while smaller operators might prefer a public cloud from an external provider). Cloud computing is one of the key enabling technologies for companies whose core business is not IT, as it contributes to reducing the TCO of IT infrastructures and services, lowering the entry bar for the use of IT solutions. However, there are serious concerns about the security of the cloud paradigm, which can impede its uptake. Cloud threats can stem from privacy, authorisation, and authentication issues, as well as from hardware/firmware types of attacks. Therefore, on the one hand, the framework provides a highly secure and reliable operational solution by addressing privacy and security by design principles with state-of-the-art solutions and by constantly monitoring its own services for potential vulnerabilities or misuse. On the other hand, the framework relies on the new Trusted Execution Environment (TEE) features of COTS CPUs, exploiting their capability of providing integrity and confidentiality protection of data-in-use even against super-privileged software and users. This will enable users to deploy the framework even on public clouds without penalizing security (since not even the cloud provider would be able to tamper with their data), which is a unique competitive benefit of the proposed approach.