1 Introduction

Cyber security exercises (CSEs) are increasingly becoming an integral part of the cybersecurity training landscape [20], providing a hands-on experience to personnel of both public and private organisations worldwide. A CSE, as described in the ISO Guidelines for Exercises [18], is “a process to train for, assess, practice, and improve performance in an organisation”. ENISA defines a CSE as “a planned event during which an organisation simulates cyber-attacks or information security incidents or other types of disruptions to test the organisation’s cyber capabilities, from being able to detect a security incident to the ability to respond appropriately and minimise any related impact.” [7].

1.1 Problem setting and objectives

The creation of CSE content is a painstaking process that requires a deep understanding of the current threat landscape as well as the historical threats and incidents faced by an entity and the corresponding sector. Furthermore, training employees with simulated incidents is the closest one can get to testing the preparedness and effectiveness of the measures and procedures set in place. Creating relevant and dynamic content for developing CSE scenarios requires expertise and resources often lacking in most organisations.

The main objective of our work is to automate the generation of structured CSE scenarios from a pool of unstructured information, requiring little scenario-building experience from the Exercise Planner (EP).

The standard method for preparing an exercise scenario [18] lays down three layers, namely events, incidents, and injects. After developing a scenario, an organisation must ensure that it contains only necessary information. Moreover, it must be designed to test participants’ capabilities in a stressful environment. Events, at the first level, provide the general description of an exercise scenario. Depending on previously decided objectives and aims, the number of events can differ from one exercise to another. Each event would have a specific set of consequences at the second level. These consequences are called incidents. An event can have multiple consequences, which can affect each other. On the third level, injects facilitate the communication of events and incidents to the exercise participants. An ideal inject would provide exercise information and problems to be solved. At the same time, it would indirectly force participants to act on those consequences and make decisions.

The proposed scenarios should satisfy the specifications provided by the EP. Such specifications can be the training topics and objectives, the sector to focus on or specific threats of interest that are currently or will be trending in the future. For simplicity, in what follows, when referring to sectors, we will refer to the ones of Directive (EU) 2022/2555 of the European Parliament and of the Council on measures for a high common level of cybersecurity across the Union, amending Regulation (EU) No 910/2014 and Directive (EU) 2018/1972, and repealing Directive (EU) 2016/1148 (NIS 2 Directive) [12]; however, any other such classification can be used. More specifically, the objectives can be summarised as follows:

  1. Create an ML-powered Exercise Generation Framework that would:

     (a) Generate structured exercise scenarios that reflect a sector's current or future threat landscape, including potential threat actors and the corresponding tactics, techniques, and procedures (TTPs).

     (b) Generate scripted events and incidents that could materialise in the context of a real attack against an organisation belonging to any NIS 2-defined sector.

     (c) Identify and describe artefacts that could accompany the exercise scenarios as potential injects.

  2. The generated scenarios should be expressed in a structured format, following an ontology. The generated outputs should be both machine- and human-readable.

  3. The proposed methodology and tools should provide qualitative and quantitative added value in CSE development and cyber awareness, measured through the following Key Performance Indicators (KPIs):

     (a) Improve the speed of CSE generation (quantitative).

     (b) Improve the quality of CSE generation for inexperienced EPs (qualitative).

     (c) Improve the relevance of proposed CSE scenarios to the current threat landscape (qualitative).

Case studies will help measure the KPIs set above by comparing traditional exercise generation methods and tools against the proposed ones, through an evaluation provided by an Ad-hoc Cyber Awareness Expert Group that will peer-review the outputs of the aforementioned methodology.

1.2 Main contributions

The contribution of this work is twofold. First, we identify future cyber-attack trends in a specific sector and propose customised awareness training topics by clustering them accordingly. Second, we automate the process of generating the corresponding content for cyber awareness exercises with machine learning (ML).

Our proposed methodology, accompanied by a set of tools, allows an inexperienced EP to fully structure CSE scenarios from free text following our proposed Cyber Exercise Scenario Ontology (CESO). The exercise structure follows the traditional Scenario-Events-Incidents-Injects tree structure of ISO 22398:2013 [18], as depicted in Fig. 1. Additional cyber exercise content is generated to complement the scenario, along with proposals for the best-fitting training topics from a given set, to better prepare an organisation for an imminent cyber crisis.

Fig. 1 Cyber exercise structure according to ISO 22398:2013 [18]

Through our work, we fill the gap in the expertise of the average cyber security expert acting as an Exercise Planner by providing the tools and the methodology to design CSE scenarios in an easy, automated, and structured way. To achieve this, we combine the power of machine learning (ML), and more specifically named entity recognition (NER), with a novel Cyber Exercise Scenario Ontology (CESO) and a CSE scenario generation framework dubbed AiCEF. Finally, an evaluation methodology and its results are presented, along with ideas for future work.

2 Related work

CSEs, also known as Cyber Defense Exercises (CDX), have been considered an effective way to deliver an engaging security awareness training experience [13, 42]. CSEs have been characterised as a highly effective method to provide an ultimate learning experience [3], helping individuals or teams of varying expertise improve a range of skills related to information security. Furthermore, via exercising, organisations can uncover gaps in security policies, procedures, and resources [9, 16], leading to improvements in awareness training, tools, and policies.

Previous work in the CSE domain [40] has highlighted the use of cyber defence competitions or live-attack exercises as a very effective way of teaching information security [10, 19], helping teams design, implement, manage and defend a network of computers [1, 6, 7, 30, 31]. Vigna [46] and Mink [27] further support these findings.

Further research examined cyber defence competitions [36, 49] and the most suitable architecture [41], while the tools and techniques to be used to create an active learning experience were described by Green et al. [15]. Patriciu and Furtuna [34] presented several steps and guidelines to be followed when designing a CSE. White [48] introduced a different approach to such live CSEs, presenting lessons learned and providing suggestions to help organisations run their own exercises. Other works in the literature examined how to run CSEs using a service provider model [26].

CSEs can be used as a tool to generate scientifically valuable datasets for future security research [38, 43] and help uncover hidden risks stemming from weak security policies and/or procedures [37]. CSEs can even be used to measure performance against specific standards [11] or team effectiveness based on behavioural assessment techniques [14]. Moreover, experiments have been conducted on various platforms, such as the RINSE simulator [23] or a realistic inter-domain routing experiment platform [22], for the rendering of network behaviour.

Focusing further on the human aspect, Job Performance Modelling (JPM), using vignettes for improving cybersecurity talent management through cyber defence competition design, was described by Tobey [44].

A successful CSE relies heavily on a robust scenario. Exercise scenarios must describe worst-case situations that participants can relate to and that are realistic enough to trigger seamless engagement. Intuitive scenarios can be a powerful tool for anticipating future states or situations [3, 13], incorporating issues to be resolved, interactions, and consequences [14, 15], leading to a constructive training experience.

An exercise's scenario is a sequential, narrative account of a hypothetical incident that provides the catalyst for the exercise and is intended to introduce situations that will inspire responses and thus allow demonstration of the exercise objectives [41]. In the context of CSEs, a scenario defines the training environment that will lead participants towards fulfilling the exercise objectives set [21]. The cyber security problem described in a scenario is itself portrayed in a structured representation named the Master Scenario Events List (MSEL), which serves as the script for the execution of an exercise [41]. CSE scenario formats can vary [35], but two are the most prevalent:

  • Outlined scenarios: provide a general summary of the impact of an event on assets [39].

  • Detailed scenarios: contain exhaustive information sequentially describing the event's impact on specific services or sections of an organisation, along with a timeline for restoring key functions [17].

Recent trends in attack recognition utilise AI, ML, and NLP tools and techniques to improve efficiency. However, a dedicated methodology focusing on CSE scenario generation is still missing. There is a need for a methodically built and annotated CE corpus that could train multiple algorithms to recognise cyber exercise elements. Such a corpus should focus on the syntactic and semantic characteristics of the cyber exercise components and broaden our understanding of the malicious patterns used in cyber incidents that can be reused for CSE material. We follow an approach similar to the one used for building and evaluating an annotated corpus for automated attack recognition [45], only this time to extract CSE-relevant objects.

Following cyber security-related ontology creation examples [33], ontology-based scenario modelling for CSEs has already been proposed [47]. Still, an ontology that is truly compatible with machine learning algorithms is missing and is the focus of our work.

3 Cyber exercise scenario ontology (CESO)

Our work so far has highlighted the need for a common CSE scenario ontology for translating the various parts of an exercise while keeping a close link to popular, already used ontologies for cyber incident representation. The analysis of the domain revealed many taxonomies for different areas of the cybersecurity domain (types of attacks, vulnerabilities, sectors, harm), but these need to be linked together in a model that allows an EP to represent a CSE accurately.

To build our ontology, the following questions were raised:

  1. What is the scope of the ontology?

  2. Should we consider reusing existing ontologies or taxonomies?

  3. What are the important terms in the ontology?

The scope of the ontology was determined by asking competency questions to experienced EPs that helped us identify the most important terms. A key priority was interoperability to ensure that the proposed ontology could be integrated with existing tools and frameworks. Moreover, the proposed ontology should describe a cyber security incident using popular cyber security frameworks. Finally, the ontology should be easily implementable using extraction via named entity recognition (NER) to allow the easy ingestion of online content.

We also used the domain expert’s knowledge to identify prominent existing ontologies and ways to reuse them. The steps followed were:

  1. Define the scope of our ontology,

  2. Identify other ontologies or taxonomies that can be used or reused,

  3. Define the main concepts and the relationships between them,

  4. Define the properties of the concepts,

  5. Implement the ontology.

3.1 Scope

The scope of the defined model was to target an efficient and robust way of representing cyber incidents in the context of a CSE. After all, a CSE is a collection of simulated incidents provided to players in an orchestrated way to achieve the exercise’s objectives.

The exercise ontology presented is incident-centric, following a bottom-up approach that allows us to identify and describe incidents first, group them into Events, and then cover the full generation of CSE scenarios that fit the high-level objectives set.

The first building blocks, incidents, are assigned injects and mitigation actions that match the expected scope of the scenario. Injection timing is configured at the attribute level of each object. As we build toward the higher level of the exercise, the scenario is formed. The selected format should allow for the scenario's portability to various existing tools (e.g. MISP) and support a decentralised type of CSE execution.

3.2 Ontologies/taxonomies to be (re)used

A set of existing ontologies, taxonomies, frameworks, standards, and formats relevant to cyber security have been explored, with a focus on the representation of the key elements of CSEs, starting from their very building blocks, the incidents to be simulated. Our research concluded that a combination of the following would provide the necessary means: ISO 22398 [18], MITRE ATT&CK [29] and the Cyber Kill Chain [24], MITRE CVE [28], and STIX 2.1 [32].

We chose STIX 2.1 as the basis for our ontology; it defines a taxonomy of cyber threat intelligence that we extend to cover our need to describe CSE scenarios. ISO 22398 best describes the structure of the cyber exercise components and was used to help us repurpose STIX 2.1 to cover our scope. The STIX 2.1 model describes an adversary and adversary activities in appropriate data structures by default. STIX Domain Objects cover Threat Actor, Malware, Tool, Campaign, Intrusion Set, and Attack Pattern (referencing the Common Attack Pattern Enumeration and Classification taxonomy, CAPEC), covering well what are called incidents and injects in the CSE nomenclature. STIX 2.1 supports by default the MITRE ATT&CK, MITRE CVE, and Cyber Kill Chain frameworks, helping us achieve our goal of maximum interoperability. Moreover, STIX 2.1 expresses cyber threat intelligence (CTI) in a consistent and machine-readable manner, allowing security communities to better understand which computer-based attacks they are most likely to face and to anticipate and/or respond to those attacks faster and more effectively.

This helps us build on top of these communities to reuse existing tools and share CSE scenarios represented in the very same format.

3.3 Scenario augmented model

Based on the bottom-up approach, a scenario augmented model (SAM) is proposed in two layers that cover both the informational and operational aspects with the same objects but utilise different attributes.

The informational layer covers the context and main attributes of scenarios. Figure 2a describes the key relationships in the informational layer.

Fig. 2 The informational (left) and operational (right) layers of CESO

The whole exercise is grouped using a Grouping object. This object holds information related to the exercise's name, description, and scenario. All Events, Objectives, and their matching objects (Campaign, Note, Report) are related to the Exercise Scenario, along with the matching "State of the World" (SoW). The SoW includes details such as the status of various simulated systems and networks, the simulated geopolitical landscape, and any simulated incidents or events that have already taken place.

One or more Incidents (Intrusion Sets) can be related to Events. From there, various objects with interlinked dependencies form the Inject, as a Course of Action instance that refers to all related objects of an Attack Pattern.

Table 1 CSE components to STIX 2.1 objects mapping

An Inject can contain the following objects: Attack Pattern, Tool, Vulnerability, Indicator, Malware, Threat Actor (who is attributed to an Identity and is located at a Location), and a Course of Action. Injects do not have to be related to an Event or Incident; examples are STARTEX or ENDEX, which can be represented only with a Course of Action object but are directly related to the Scenario.
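To make this mapping more concrete, the following minimal sketch uses the open-source stix2 Python library to express a small fragment of the informational layer (Event as Campaign, Incident as Intrusion Set, Inject as Course of Action, Exercise as Grouping); all names and descriptions are illustrative and are not taken from an actual exercise.

```python
from stix2 import (AttackPattern, Campaign, CourseOfAction, Grouping,
                   IntrusionSet, Relationship)

# Inject: an attack pattern wrapped in a Course of Action delivered to the players
phishing = AttackPattern(name="Spearphishing Attachment",
                         description="Initial access via a malicious attachment.")
inject = CourseOfAction(name="Inject 1 - Phishing wave",
                        description="Deliver the phishing e-mail artefacts to players.")

# Incident (Intrusion Set) and Event (Campaign) built on top of the inject
incident = IntrusionSet(name="Incident 1 - Mail server compromise")
event = Campaign(name="Event 1 - Targeted attack on the mail infrastructure")

# The exercise scenario groups every object of the informational layer
scenario = Grouping(name="Exercise Alpha",
                    context="suspicious-activity",
                    object_refs=[event.id, incident.id, inject.id, phishing.id])

relationships = [
    Relationship(incident, "related-to", event),
    Relationship(inject, "related-to", incident),
    Relationship(inject, "related-to", phishing),
]
```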

The Scenario Operational Layer describes an exercise scenario’s execution flow, mainly dealing with injects delivery to the intended recipients. There are two major interrelated parts: (1) the events/injects, which describe the detailed activities of the scenario and expected actions from the participants, and (2) the Participants.

The whole scenario, including Events, Incidents, and Injects, is stored in an Infrastructure object representing the Exercise Platform. This platform is used by EPs (Identity) to design and conduct the exercise, and by Observers and Players to interact with the Scenario. All Participants are located in the same or different Locations. The Operational Layer is illustrated in Fig. 2b.

3.4 Implementing the ontology

Keeping the structure of a CSE intact, the following STIX 2.1 objects have been repurposed to represent the main CSE components, covering SAM along with the matching relationships (Table 1).

Objects STIX 2.1-defined objects are used as per the specification.

Relationships All relationships are implemented as per STIX 2.1 relationship object specifications. The relationships in Table 2 (representing the edges of the graph) have been identified between key objects, but more can be used.

Table 2 Relationships matrix

Object Extension STIX 2.1 objects are extended with additional attributes/properties to cover the needs of CESO, as shown in Table 3.

Table 3 Objects extension matrix
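As an illustration of such an extension, the stix2 Python library allows custom properties to be attached to standard objects. The x_ceso_* property names below are hypothetical placeholders, since the exact attributes of Table 3 are not reproduced here.

```python
from stix2 import CourseOfAction

# A CESO-style extended inject: custom properties carry exercise-specific attributes
# (names such as x_ceso_inject_time are assumptions for illustration only).
inject = CourseOfAction(
    name="Inject 3 - Ransom note delivered",
    description="Players receive the ransom note artefact.",
    custom_properties={
        "x_ceso_inject_time": "STARTEX+02:30",
        "x_ceso_audience": "blue-team",
    },
)

print(inject.serialize(pretty=True))   # machine- and human-readable JSON output
```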

4 Automated generation of cybersecurity exercise scenarios

To create the envisioned ML-powered Exercise Generation Framework, we opted to use Python and develop a set of tools that perform individual tasks in the form of steps, helping an EP, regardless of their experience, to create timely and targeted CSEs. Conceptually, we split the process into six steps: data collection, data processing and mapping, trend prediction, incident generation, enhancement, and storyline generation. The proof-of-concept framework we developed is AiCEF, and its general outline is illustrated in Fig. 4.

Its main components that are relevant to the work presented in this paper are the following:

  • CESO: The Cyber Exercise Scenario Ontology used to describe the various components of a CSE

  • AiCEF: The Cyber Exercise Framework used to model CSEs based on CESO with the use of Machine Learning

  • MLCESO: The ML models trained to parse text and extract objects based on CESO

  • IncGen: The incident generation module that models a CSE incident from the MLCESO extracted objects based on CESO

  • CEGen: The cyber exercise generation module that models a CSE from the MLCESO extracted objects based on CESO

  • KDb: A knowledge pool of incidents stored in a database. Extracted objects and other characteristics, including the STIX 2.1 blob, are stored in the database

To facilitate the reader, we map these components onto a timeline diagram (see Fig. 3). This way, one can quickly grasp the role of each component in the flow and navigate the rest of the sections understanding how these pieces fit into the greater picture.

Fig. 3 Process flow and the corresponding modules of AiCEF

The modular approach of AiCEF allows for customisation and local refinements and enables more interoperability. In the following paragraphs, we detail each component and then present the main steps to generate a concrete CSE scenario using AiCEF modules, providing some examples.

4.1 Machine learning to CESO (MLCESO)

The most important step in our methodology is the creation of the ML pipeline that parses free text and extracts objects in CESO, as defined in the previous section. To do so, we need to train our ML models following a well-structured methodology consisting of three phases: corpus building, corpus annotation, and corpus evaluation using NER, which we detail below.

4.1.1 Corpus building

As shown in Table 4, four incident sources have been identified as the initial input to our corpus. All these websites cover a wide variety of cyber security incidents in article format, spanning many years. For simplicity, in this work we collected incidents from January 2020 to March 2022, which accounts for 2000 articles. All relevant articles were collected through automated web scraping.
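A minimal collection sketch is shown below, using requests and BeautifulSoup; the URL is a hypothetical placeholder, as the actual sources are those listed in Table 4.

```python
import requests
from bs4 import BeautifulSoup

def scrape_article(url: str) -> str:
    """Fetch one incident article and return its visible paragraph text."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Keep only paragraph text; navigation, scripts, and styling are ignored.
    return "\n".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

# Hypothetical article URL; real runs iterate over each source's article index.
raw_text = scrape_article("https://example.com/2021/06/sample-ransomware-incident")
```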

Fig. 4 High-level overview of AiCEF

Table 4 Corpus collection count

Then, the raw text was processed using natural language processing (NLP) techniques to form a reduced Incidents Corpus (IC). Initially, all text was converted to the UTF-8 encoding scheme. Using dictionaries and the TextBlob library, we performed spelling corrections and removed special characters. Empty lines, specific stopwords, and specific punctuation marks were removed using traditional NLP libraries such as NLTK and spaCy. Moreover, all HTML or other programming code, URLs, and paths were removed. Any illegal characters were also stripped, and all text was transformed to lowercase.

The standard Penn Treebank [25] tokenisation rules were utilised for sentence tokenisation, and finally, standardisation processes were applied to tune the incidents text to facilitate annotation. At the end of this step, a corpus composed of incidents was formed. As discussed, the corpus, from now on referred to as IC, contains 2000 cyber security articles. This accounts for 35,745 sentences containing 819,690 words, leading to a vocabulary of 24,594 terms. An example of a corpus line ready for annotation is the following:

figure a
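The cleaning pipeline described above can be sketched as follows; this is a simplified version, and the exact stop-word lists and standardisation rules used in AiCEF are not reproduced.

```python
import re
import nltk
from nltk.corpus import stopwords
from textblob import TextBlob

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def clean(text: str) -> str:
    """Reduce a raw scraped article to a normalised line of the Incidents Corpus."""
    text = re.sub(r"https?://\S+", " ", text)                 # strip URLs and paths
    text = re.sub(r"<[^>]+>", " ", text)                      # strip leftover HTML
    text = str(TextBlob(text).correct())                      # spelling correction
    text = re.sub(r"[^a-zA-Z0-9.,;:!?'\- ]", " ", text)       # drop special characters
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)
```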

4.1.2 Corpus annotation

Following the CESO ontology, a simple model comprising six steps was developed to represent the annotation task. Entities and interconnections were formally described, to align the efforts of converting words to tags, in an Annotators Reference Document (ARD). This file, along with the corpus guidelines and the CESO ontology, was given to the annotators to perform the annotation task using Prodigy. After completing the annotation, an inter-annotator agreement assessment took place using Cohen's Kappa metric, and the gold standard version of the IC was finally produced.

Our annotation methodology consists of the following steps.

Step 1: Setting the Annotation Objectives The main annotation objective was to create the appropriate semantic target to facilitate IC recognition by assigning the correct tag to in-context words in a sentence. Labelling all related words, sequences of words, or text spans in the cyber incident context was crucial to perform efficient NER or text classification later. Each word or text span was labelled with a type identifier (tag) drawn from a vocabulary created based on the CESO ontology, indicating what various terms denote in the context of a cyber incident and how they interconnect.

Table 5 Annotation tags per category
Table 6 Annotation tags per category example

Our objective is to identify keywords, syntax, and semantic characteristics to detect i) threat actors, ii) cyber security incidents, and iii) victim characteristics, to tag them accordingly.

Step 2: Specifications Definition A concrete representation of the Annotation model to be used is created based on CESO.

An abstract model that practically represented the annotation objectives was defined. A three-category classification (Attacker, Attack, Victim) was introduced as the basis of this abstract model for identifying cyber-incident related terms in the text analysed. The category other represents all remaining words out of context.

Our model M consists of a vocabulary of terms T, the relations between these terms R, and their interpretation I. Thus, our model can be represented as \(M = < T, R, I>\) where:

  • T = {CESO, Attacker, Attack, Victim, Other}

  • R = {CESO ::= Attacker | Attack | Victim | Other}

  • I = {Attacker = "list of attacker-related terms in the vocabulary", Attack = "list of cyber security incident or attack terms in the vocabulary", Victim = "list of victim-related terms in the vocabulary", Other = "other terms not related to the attacks"}

Step 3: Annotator Reference To help annotators in element identification and element association with the appropriate tags, we provided them with documentation containing the tags in Table 5, which have been identified and mapped accordingly.

Table 7 Consistency matrix
Table 8 AI models’ scores

Step 4: Annotation Task The annotation process is performed as follows.

The annotation task aimed to label the words of the IC corpus based on their semantic and syntactic characteristics. Two cybersecurity experts were assigned to label the words based on their semantic characteristics. By annotating the semantic characteristics of the words with Prodigy, the context of each sentence was translated into CESO. Table 6 presents the annotation in action through some examples.

Step 5: Gold Standard Creation The final version of the annotated incident corpus is generated.

The inter-annotator agreement (IAA) was validated using Cohen’s Kappa [8]. The formula used is defined as follows:

$$\begin{aligned} k=\frac{p_0-p_e}{1-p_e} \end{aligned}$$
(1)

where \(p_0\) expresses the relative observed agreement, and \(p_e\) is the hypothetical probability of chance agreement.

The produced IC corpus has \(N = 24594\) terms and \(m = 4\) categories; both annotators (A and B) agreed for the Attacker category 397 times, for the Attack category 1722 times, for the Victim category 932 times, and for the Irrelevant category 21416 times.

Table 7 shows the contingency matrix, where each \(x_{ij}\) represents the number of terms that annotator A classified in category i and annotator B classified in category j, with \(i,j\in \{1,2,3,4\}\). The entries on the diagonal (\(x_{ii}\)) represent the terms in each category for which the two annotators agreed on the assignment.

The observed agreement \(p_o\) is:

$$\begin{aligned} p_o=\frac{397+1722+926+21416}{24594}=0.996 \end{aligned}$$

and the expected chance agreement, i.e. the proportion of terms which would be expected to agree by chance, is:

$$\begin{aligned} p_e=\frac{\frac{436\times 435}{24594}+\frac{1744\times 1752}{24594}+\frac{950\times 953}{24594}+\frac{21464\times 21454}{24594}}{24594} = 0.768\ (76.8\%) \end{aligned}$$

so, according to Eq. 1, Cohen's Kappa is \(k=\frac{p_0-p_e}{1-p_e}=\frac{0.228}{0.232}= 0.98\). Thus, based on the Cohen's kappa value of 0.98, we can safely conclude [50] that the level of agreement for the corpus annotation task was almost perfect.
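For reference, Eq. 1 can be reproduced in a few lines of Python from the per-category agreements and marginal totals quoted above; minor rounding differences from the reported figures are to be expected.

```python
# Per-category agreements and marginal totals as quoted in the text
# (categories: Attacker, Attack, Victim, Other/Irrelevant).
N = 24594
diagonal = [397, 1722, 926, 21416]     # terms on which annotators A and B agreed
marginals_a = [436, 1744, 950, 21464]  # annotator A totals per category
marginals_b = [435, 1752, 953, 21454]  # annotator B totals per category

p_o = sum(diagonal) / N                                            # observed agreement
p_e = sum(a * b for a, b in zip(marginals_a, marginals_b)) / N**2  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)                                    # Cohen's kappa (Eq. 1)
print(f"p_o={p_o:.3f}, p_e={p_e:.3f}, kappa={kappa:.3f}")
```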

Table 9 AI models’ scores vs reviewers evaluation. H: Hit, P: Partial, M: Miss
Fig. 5 The workflow of IncGen

4.1.3 Training and evaluation using NER

The following methodology has been used to train and evaluate our Named Entity Recognition (NER) agent.

  1. Preprocessing The corpus has already been annotated, with each line of the corpus stored as a list of token-tag pairs. Each token was represented by a word embedding using the pre-trained English language model of the spaCy NLP library.

  2. Model building A model is built using spaCy.

  3. Training Training was conducted in spaCy by specifying a loss function to measure the prediction error and a batch-wise gradient descent algorithm for optimisation. One NER model was trained per object, as presented in Table 8. To improve accuracy, several iterations were conducted by expanding the annotation and retraining the model until an F1 score of approximately 80% was reached.

  4. Evaluation The performance assessment of the model was conducted by applying the model to the preprocessed validation data.

While the results seem satisfactory, one can achieve further performance improvements in some tags.
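A condensed sketch of the per-tag training loop is given below, assuming the Prodigy annotations have been exported as (text, entity-span) pairs; the tag name, sample, and number of iterations are illustrative only, and the actual pipeline starts from the pre-trained English model rather than a blank one.

```python
import random
import spacy
from spacy.training import Example

# Annotated samples exported from Prodigy: (text, {"entities": [(start, end, TAG)]})
TRAIN_DATA = [
    ("Lazarus deployed new ransomware against a European bank.",
     {"entities": [(0, 7, "ATTACKER_NAME")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for iteration in range(20):                 # repeat until the F1 score plateaus (~80%)
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
print("final loss:", losses)
```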

We performed an extra evaluation step with two experts against a set of 100 articles not previously used in the training or evaluation steps. The aim was to evaluate the models against the selected tags empirically. The two reviewers scored the NER accuracy per tag as presented in Table 9:

  • HIT The tag was correctly assigned or correctly omitted.

  • PARTIAL The tag was correctly assigned, but not for all values.

  • MISS The tag was either assigned wrongly or was not assigned at all when it should have been.

The following findings should be highlighted:

  1. The hit rate of four (4) NER models has been identified as very weak, with an abnormal difference from the F1 score obtained in the previous step.

  2. Names of attackers or malware can be a very vague topic to tackle using NER.

  3. The Attacker's Origin cannot be properly identified using the out-of-the-box spaCy LOC NER model. Locations are identified, but they can be related to the victim or be irrelevant to the attacker's origin.

  4. The vulnerability NER model misses the correct formatting of CVE identifiers. This issue can be solved using a regex that accurately detects CVEs in the text, in combination with the generated model.

4.2 Incident generation and enhancement (IncGen)

Incident creation is the most important step of the scenario generation procedure and consists of several steps to achieve maximum customisation (Fig. 5). All of the steps can be automated, generating a variety of incidents from which a Planner can choose the ones that fit best.

The EP can choose to provide specific text or articles for conversion to Incidents or rely on a dynamic generation based on filtering parameters and a search of the existing database. Incidents can be enhanced with activity simulating TTPs of known APT actors.

To generate scenarios, a set of texts was used as a baseline and parsed to map with CESO for processing. The sources in Table 10 were utilised to create the knowledge database (KDb). To ensure relevance, a maturity threshold system was introduced to evaluate the maturity of the parsed articles and the NER-extracted tags. The scoring system, ranging from 0 to 185, is shown in Algorithm 1. In our implementation, a threshold of 50 was set for a text to be considered relevant enough to represent a standalone incident in AiCEF.

figure b
Table 10 Knowledge DB content per source
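Algorithm 1 itself is not reproduced here; purely to illustrate the idea, a maturity scoring function of this kind could look as follows, with entirely assumed weights and tag names (only the 0-185 range and the threshold of 50 are taken from the text).

```python
# Illustrative only: weights and tag names are assumptions, not the values of Algorithm 1.
WEIGHTS = {
    "attacker_name": 30, "attack_type": 40, "malware_name": 30,
    "technique": 35, "vulnerability": 25, "victim_sector": 25,
}  # maximum possible score: 185

def maturity_score(extracted_tags: dict) -> int:
    """Score an article based on which CESO tags the NER models extracted."""
    return sum(weight for tag, weight in WEIGHTS.items() if extracted_tags.get(tag))

article_tags = {"attack_type": ["ransomware"], "victim_sector": ["health"]}
if maturity_score(article_tags) >= 50:   # threshold used in the AiCEF implementation
    print("Article is mature enough to represent a standalone incident.")
```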

Two types of enhancements were applied to improve the automatically exported NER tags: regular expressions (REGEX), i.e. sequences of characters that define a search pattern, and hard-coded groups of strings. Thus, the following tags have been further enhanced:

  • Attacker's Name: NER + hard-coded groups of strings from the MITRE APT list [29],

  • Attacker's Origin: no NER, hard-coded groups of strings,

  • Malware Name: NER + hard-coded groups of strings from the MITRE APT list,

  • Technique: NER + hard-coded groups of strings from the MITRE APT list,

  • Vulnerability: NER + CVE REGEX.

The above enhancements greatly improved the tag detection rates, achieving almost 99% for the Vulnerability tag. Moreover, based on the analysis of the most prominent extracted tags, the tag groups of Table 11 were assigned to the training topics meta tag to help categorise text for later use in an exercise scenario-building process. An output report and visualisation (using the stixview library) of IncGen utilising the improved MLCESO tag detection can be seen in Fig. 6.
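Both enhancement types can be illustrated with the short sketch below; the CVE pattern follows the standard CVE-YYYY-NNNN format, and the actor list is a small, hypothetical excerpt of the kind of hard-coded groups drawn from the MITRE list.

```python
import re

CVE_REGEX = re.compile(r"CVE-\d{4}-\d{4,7}", re.IGNORECASE)

# Small illustrative excerpt; the full lists are derived from MITRE ATT&CK group data.
KNOWN_ACTORS = {"APT28", "APT29", "Lazarus Group", "FIN7"}

def enhance(text: str, ner_tags: dict) -> dict:
    """Complement the NER output with regex and hard-coded string matching."""
    ner_tags.setdefault("vulnerability", []).extend(CVE_REGEX.findall(text))
    ner_tags.setdefault("attacker_name", []).extend(
        actor for actor in KNOWN_ACTORS if actor.lower() in text.lower())
    return ner_tags

print(enhance("APT28 exploited CVE-2021-34527 to gain initial access.", {}))
```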

Table 11 Training topics
Fig. 6 IncGen output report and visualisation

4.3 APT enhancer

To simulate the activity of APT groups, a STIX 2.1 structure was created for each actor using the Groups from MITRE. Attributes and TTPs were automatically extracted to populate the database, generating a STIX 2.1 graph for comparison and enhancement purposes. During incident enhancement, the extracted graph is compared to known APT actors and the most similar is proposed for enhancement. The similarity score, based on a set of weighted properties and ranging from 0 to 100, is calculated using the STIX 2.1 Python API. In AiCEF, the EP can completely or partially merge the draft incident graph with that of known APT actors.
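The similarity computation can be sketched as a weighted overlap of TTPs and other shared attributes, as shown below; the weights and fields are placeholders, and the production implementation relies on the STIX 2.1 Python API and the full CESO graphs rather than this simplified function.

```python
# Placeholder weights (summing to 100); fields and values are illustrative only.
WEIGHTS = {"techniques": 60, "tools": 25, "target_sectors": 15}

def apt_similarity(draft: dict, apt_profile: dict) -> float:
    """Weighted overlap (0-100) between a draft incident and a known APT profile."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        a, b = set(draft.get(field, [])), set(apt_profile.get(field, []))
        if a and b:
            score += weight * len(a & b) / len(a | b)   # Jaccard overlap per field
    return score

draft_incident = {"techniques": ["T1566", "T1059"], "tools": ["Mimikatz"]}
known_apt = {"techniques": ["T1566", "T1059", "T1003"], "tools": ["Mimikatz", "Cobalt Strike"]}
print(f"similarity: {apt_similarity(draft_incident, known_apt):.1f}/100")
```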

4.4 Storyline text generation

The Storyline Text Generator (STG) creates synthetic text based on predefined input. It uses a Python text generator built around Generative Pre-trained Transformer 2 (GPT-2), a large-scale unsupervised language model that can create coherent paragraphs of text from small pieces of text input.
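A minimal sketch of this step, using the Hugging Face transformers implementation of GPT-2, is shown below; the prompt is an arbitrary example, not AiCEF output.

```python
from transformers import pipeline

# Load the public GPT-2 checkpoint for text generation.
generator = pipeline("text-generation", model="gpt2")

prompt = ("A ransomware group has gained access to the hospital's "
          "patient record system and")
storyline = generator(prompt, max_length=120, num_return_sequences=1)[0]["generated_text"]
print(storyline)
```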

4.5 Trend prediction module (MLTP)

The trend prediction module provides the EP with valuable information by analysing the KDb and extracting trends based on predetermined training objectives to generate a trend report. The MLTP process consists of three steps:

  1. Receiving input, such as filter tags,

  2. Extracting incident statistics based on the specified sector and Training Objective,

  3. Performing time-series analysis to plot and calculate future trends for a specific Attack Type and/or Training Objective.

In our implementation, we chose the SARIMA (Seasonal AutoRegressive Integrated Moving Average) model to represent the trends on the existing KDb of 2970 articles, as presented in Table 10. However, in future work, we intend to investigate further methods to boost the capabilities of MLTP, including the identification of micro-trends, as the existing results are very promising [2].
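As a minimal sketch of this step, assuming monthly incident counts per training topic have already been extracted from the KDb, the SARIMAX implementation in statsmodels can be used as follows; the synthetic data and model orders are illustrative only.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical monthly counts of incidents for one training topic, covering the
# January 2020 to March 2022 collection window of the KDb (27 months).
rng = np.random.default_rng(0)
counts = pd.Series(
    np.linspace(5, 20, 27) + rng.normal(0, 1.5, 27),
    index=pd.period_range("2020-01", periods=27, freq="M"),
)

# Fit a seasonal ARIMA model; the (p,d,q)(P,D,Q,s) orders are illustrative.
fit = SARIMAX(counts, order=(1, 1, 1), seasonal_order=(1, 0, 0, 12)).fit(disp=False)
print(fit.forecast(steps=6))   # projected counts for the next six months
```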

Fig. 7 IncGen execution flow with intermediate representation steps

4.6 Putting everything together

Let us summarise the use of AiCEF and its modules with an example. An EP populates the Knowledge database (KDb) with incidents of interest, which are then converted into graphs based on the CESO ontology. When the EP wishes to create a new scenario for a cyber security exercise, they provide AiCEF with a set of keywords. To assist the planning process, AiCEF can generate a trend report that identifies trends relevant to the objectives at the time of the exercise execution. Based on the keywords, AiCEF crawls its database for the most relevant articles and returns a corresponding graph. The EP can then enhance the graph by merging it with that of known threat groups and filtering the graph according to the intended Cyber Kill Chain phases to be simulated. The resulting incident graph representation is then ready to be populated with injects. A representation of the progress of an incident graph generation can be visualised in Fig. 7.

This process is repeated multiple times to generate the desired number of incidents for a specific CSE. The EP then follows the CEGen flow to compile a full exercise and generate a scenario (Fig. 8) and an exercise graph (Fig. 9).

Fig. 8 Sample text of an AI-generated exercise

Fig. 9 Sample exercise graph visualisation

5 Evaluation methodology and results

We developed a case study to help measure the effectiveness of our proposed framework and underlying methodology. To this end, the steps below were followed.

  1. Scenario Content Generation A group of exercise planners of varying expertise was asked to generate the same exercise scenario, using either traditional exercise means or the AiCEF methodology and tools, while being monitored on timeliness, effectiveness, creativity, and the methodology used.

  2. Content Evaluation The reports were anonymised and given to a group of evaluators, who graded the complexity, technical depth, and richness of lessons learnt of the generated exercise scenarios through a questionnaire, as per the Objectives and KPIs set.

  3. Results Collection and Analysis The results of this process were evaluated against the previously set KPIs to estimate:

     (a) Improved speed in cyber exercise scenario generation (quantitative) using AiCEF.

     (b) Improved quality in cyber exercise scenario generation (qualitative) for inexperienced planners using AiCEF.

     (c) Improved relevance of proposed cyber exercise scenarios to the current threat landscape (qualitative) using AiCEF.

5.1 Scenario content generation

Four EPs were selected to individually generate a CSE scenario according to the provided high-level exercise requirements and specifications (see Fig. 10). The EPs were split into two groups based on their previous experience with the task. All EPs have deep knowledge of cyber security, and their skill sets resemble that of a CISO.

Both groups consisted of one experienced and one inexperienced planner. The first group was briefly introduced to the basics of developing CSE scenarios, while the second one was provided with a course on using AiCEF and the accompanying tools. Both groups were provided with the same Scenario Template (ST) to fill in as an output of their task.

Then, we created a third group, consisting of a Scripted Exercise Planner (SEP) that uses different parameters and flows of the AiCEF methodology and toolset.

The provided ST had the following generic structure:

  • Section 1: Storyline (SoW)

  • Section 2: Scenario and MSEL

  • Section 3: Scenario Analysis

  • Section 4: Resources Used

We provided detailed instructions on the expected content per paragraph to all involved planners to streamline the information of the generated reports and create homogeneous outputs to be evaluated in the later step.

As a result, five complete exercise scenarios were generated, as shown in Table 12.

5.2 Scenario content evaluation

To evaluate the scenarios above, we conducted an anonymous online survey from 01/09/2022 to 30/09/2022. To avoid bias, we invited a number of evaluators from different cyber awareness and cyber exercise groups, with varying expertise, ethnicity, and focus sectors, to participate in the evaluation process. More precisely, we invited the Ad-hoc Cyber Awareness Expert Group of ENISA. In total, 16 experts responded, whose demographic statistics are illustrated in Table 13. Given that we have a representation of 66% of the group, we believe that the sample is significant, as they are experts. Moreover, we highlight that their selection was made through independent criteria, not by us but by an independent international cyber security organisation, namely ENISA, which avoids possible biases.

Fig. 10 Task definition

The survey was in the form of an online questionnaire consisting of 11 questions. Eight questions were used to evaluate the generated scenarios, two served as a Turing test to determine whether the use of AI could be identified by humans, and a set of complementary questions covered demographics and future improvements. All five scenarios were provided using only the "Eval_Tag" parameter for tracking purposes, without providing additional information on the authors of the scenarios.

The eight scenario evaluation questions and their corresponding scores in parenthesis were the following:

  1. How do you evaluate the relevance of the State of the World text to the Objectives of the Exercise? (0–4)

  2. How do you evaluate the relevance of the selected Events to the Objectives of the exercise? (0–4)

  3. How do you evaluate the relevance of the selected Incidents to the Objectives of the exercise? (0–4)

  4. How do you evaluate the Complexity of the Scenario? (0–1)

  5. How do you evaluate the Technical Depth of the Scenario? (0–2)

  6. How do you evaluate the Threat Actor's description? (1–3)

  7. How do you evaluate the used resources? (0–2)

  8. Would you be willing to use this Scenario based on the task description? (0–4)

To evaluate the use of AI for exercise content generation, we asked the experts the following questions:

  1. How was the scenario generated?

  2. How skilled was the planner?

Table 12 Details of the generated scenarios
Table 13 Demographics of the experts

Other questions revolved around the overall scenario development process:

  1. How much time did you invest in the Scenario Content Development?

  2. How do you define the scope/objectives of the exercise?

  3. How do you define the scenario content?

  4. What tools did you use to create the scenario or define the objectives, if any?

Finally, evaluators were asked to rank AI-powered tools as follows:

  • Rank the following AI-powered tools that could be created to support the design and implementation of future cyber exercises:

    • Automated extraction of Exercise Objects (Incidents, Injects) from unstructured information and DB storage

    • Lead generation for trend prediction of Training Topics

    • Automated enrichment of content to match realistic patterns and relationships of known Attackers

    • Automated Cyber Exercise Script/Scenario Generation

5.3 Results analysis

The analysis of the input provided a good understanding of the strengths and potential areas for improvement of AiCEF. It also provided better insight into the exercise Scenario creation process, with good inputs for future improvement based on the experience of real EPs (Fig. 11).

Fig. 11 Overall performance of evaluated scenarios based on total score

Fig. 12 Scenario evaluation parameters

Based on the analysis of the provided input, we can safely conclude that both scenarios Sc3:ExpHum&AI and Sc4:NovHum&AI, generated with the help of AiCEF, scored higher than any other scenario (see Fig. 14). Currently, the hybrid scenario generation approach of a human exercise planner using AiCEF outperforms a seasoned exercise planner, even when the planner is a novice. Furthermore, the Scripted Exercise Planner generated a relatively good scenario (Sc5:AiCEF) that can be evaluated as equal, if not better, than that of a novice planner (Sc2:NovHum) (Fig. 12).

In what follows, we provide a breakdown of the parameters evaluated to highlight the strengths and weaknesses of using AiCEF based on the experts’ input.

The use of AiCEF by a Scripted Exercise Planner performed well (top 3, outperforming humans) in Relevant Resources, Events Relevance, and Scenario Technical Depth. On the other hand, AiCEF did not perform as well in the following aspects: Threat Actor Description, Scenario Complexity, and Incidents to Objectives Relevance. This can be justified by the fact that the raw generated content can include conflicting information or content that might not match the high-level context requested. After human curation, the content can easily be improved to compete with that of a seasoned exercise planner. In fact, AiCEF used by humans helped them excel in scenario creation, dominating all categories versus their human counterparts. The human expert using AiCEF (Sc3:ExpHum&AI) managed to create a better scenario 33.33% faster than their expert peer using regular tools (Sc1:ExpHum) (Fig. 13).

Fig. 13 Score range for Q1–8 of the 16 evaluators

Nevertheless, the most impressive finding was that novice planners using AiCEF (Sc4:NovHum&AI) outperform a seasoned exercise planner (Sc1:ExpHum), as seen in Fig. 12, providing a good indication of the capabilities of the proposed framework. Note that the performance of the scenario developed by the novice planner with the help of AiCEF matches, among others, that of a seasoned planner on the question "Would you use the scenario?". Even more, evaluators could not distinguish the pure AI-generated content (ExSC5) based on Table 14, categorising the scenario as either hybrid or human-made. Indeed, the results were similar to those of a novice human planner.

Table 14 Turing test to evaluate the performance of AI
Fig. 14 Novice planner with AiCEF (Sc4:NovHum&AI) versus senior exercise planner (Sc1:ExpHum)

Fig. 15 Experts' responses

On the question "How do you define the scope/objectives of the exercise?", most evaluators replied with two or more of the following options, with known incidents and lessons learnt, along with risk assessment, as the most prevalent replies.

On the question: "How do you define the scenario content?" most evaluators replied with two or more options, with news and articles being the most important source followed by the known incident option (Fig. 15).

The evaluators replied to the question “How much time do you invest in the Scenario Content Development?” with an average of 53 h. This means that tools which can improve the CSE scenario content development process by reducing time without compromising the quality could be of great use.

Finally, for the question “What tools did you use to create the scenario or define the objectives if any?”, the responses varied between Google Search, Cyber Security (News) websites, MS Office, and Internet/Table Top Research.

6 Conclusions and future work

The shortage of cybersecurity experts and awareness is a well-known and significant worldwide challenge. CSEs can address some aspects of this problem; however, the shortage of experts to develop new CSEs, coupled with the required timeliness and relevance of the developed CSEs, calls for novel solutions. In this work, we try to fill this gap by facilitating the work of EPs with the use of AI. To this end, we developed a novel AI-powered exercise generation framework called AiCEF, which generates structured exercise scenarios that reflect the current or future threat level that an organisation faces, including potential threat actors and TTPs. Moreover, it generates scripted events that could happen in the context of a real attack against a specific organisation belonging to one of the NIS 2 critical infrastructure sectors. AiCEF also identifies and describes artefacts that could accompany the exercise scenarios. To this end, AiCEF uses a new ontology that we built, named CESO, with which we were able to generate structured exercise scenarios that are both machine- and human-readable.

Our proposed methodology and developed tools can provide tangible qualitative and quantitative added value in CSE development and Cyber Awareness in various ways. For instance, in our experiment, the total time for the CSE scenario generation is decreased by 33.33% without impacting the quality. In fact, AiCEF improves the quality of CSE scenario generation for an inexperienced/novice EP by elevating the generated scenario quality to the same level as an experienced EP. Finally, the relevance of proposed CSE scenarios is aligned with that of the current threat landscape, as indicated by evaluating all the generated scenarios using AiCEF.

While AiCEF is rather efficient, there is room for various improvements. For instance, for operational usage, more sources must be parsed (e.g. threat reports and alerts) to generate more diverse scenarios. While Generative Pre-trained Transformers (GPT-2 and GPT-3) [4, 5] can create textual output of very good quality, it would be even better if the text synthesiser were based only on cyber security related resources, so that the generated text is even more relevant and uses, e.g., better technical terms. As indicated in the evaluation, AiCEF could benefit from further improvements to enhance the threat actor description section. Finally, we plan to enhance AiCEF to detect the Cyber Kill Chain phases automatically using NER and to create relevant CSE injects for a number of popular categories, such as phishing, while also automating the inject description and content generation using AI-powered text synthesis.