AiCEF: an AI-assisted cyber exercise content generation framework using named entity recognition

Content generation that is both relevant and up to date with the current threats of the target audience is a critical element in the success of any cyber security exercise (CSE). Through this work, we explore the results of applying machine learning techniques to unstructured information sources to generate structured CSE content. The corpus of our work is a large dataset of publicly available cyber security articles, which we use to predict future threats and to form the skeleton of new exercise scenarios. Machine learning techniques, such as named entity recognition and topic extraction, are utilised to structure the information based on a novel ontology we developed, named the Cyber Exercise Scenario Ontology (CESO). Moreover, we use clustering with outliers to classify the extracted data into objects of our ontology. Graph comparison methodologies are used to match generated scenario fragments to known threat actors' tactics and to enrich the proposed scenario accordingly with the help of synthetic text generators. CESO has also been chosen as the prominent way to express both the fragments and the final proposed scenario content of our AI-assisted Cyber Exercise Framework. Our methodology was assessed by providing a set of generated scenarios to a group of experts for evaluation as part of a real-world awareness tabletop exercise.


Introduction
Cyber Security Exercises (CSE) are increasingly becoming an integral part of the cybersecurity training landscape [16], providing a hands-on experience to personnel of both public and private organisations worldwide. A CSE, as described in the ISO Guidelines for Exercises [14], is "a process to train for, assess, practice, and improve performance in an organisation". ENISA defines a CSE as "a planned event during which an organisation simulates cyber-attacks or information security incidents or other types of disruptions to test the organisation's cyber capabilities, from being able to detect a security incident to the ability to respond appropriately and minimise any related impact." [4].

Problem setting and objectives
The creation of CSE content is a painstaking process that requires a deep understanding of the current threat landscape and of the historical threats and incidents faced by an entity and the corresponding sector. Furthermore, training employees with simulated incidents is the closest method to testing the preparedness and effectiveness of the measures and procedures set in place. Creating relevant and dynamic content for developing CSE scenarios requires expertise and resources often lacking in most organisations.
The main objective of our work is to automate the generation of structured CSE scenarios from a pool of unstructured information, requiring little scenario-building experience from the Exercise Planner (EP).
The standard method for preparing an exercise scenario [14] lays down three layers, namely events, incidents, and injects. After developing a scenario, an organisation must ensure that it contains only necessary information. Moreover, it must be designed to test participants' capabilities in a stressful environment. Events, at the first level, provide the general description of an exercise scenario. Depending on previously decided objectives and aims, the number of events can differ from one exercise to another.
Each event has a specific set of consequences at the second level. These consequences are called incidents. An event can have multiple consequences, which can affect each other. On the third level, injects facilitate the communication of events and incidents to the exercise participants. An ideal inject provides exercise information and problems to be solved. At the same time, it indirectly forces participants to act on those consequences and make decisions.
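The three-layer structure described above can be sketched as a simple tree. The following Python sketch is purely illustrative; the class and field names are our own and are not part of ISO 22398:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the Scenario-Events-Incidents-Injects tree;
# names and example content are hypothetical.

@dataclass
class Inject:
    """Information delivered to participants (e.g. an email, a news flash)."""
    title: str
    content: str

@dataclass
class Incident:
    """A concrete consequence of an event, communicated via injects."""
    description: str
    injects: list = field(default_factory=list)

@dataclass
class Event:
    """Top-level element of the scenario storyline."""
    description: str
    incidents: list = field(default_factory=list)

@dataclass
class Scenario:
    name: str
    events: list = field(default_factory=list)

scenario = Scenario(
    name="Ransomware at a logistics provider",
    events=[Event(
        description="A ransomware group targets the billing department",
        incidents=[Incident(
            description="Workstations encrypted; ransom note displayed",
            injects=[Inject("IT ticket", "Users report locked screens")],
        )],
    )],
)

# Walk the tree: one event, one incident, one inject.
assert len(scenario.events[0].incidents[0].injects) == 1
```

An exercise with several events simply adds more `Event` nodes under the scenario root, each carrying its own incidents and injects.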
The proposed scenarios should satisfy the specifications provided by the EP. Such specifications can be the training topics and objectives, the sector to focus on, or specific threats of interest that are currently trending or will be trending in the future. For simplicity, in what follows, when referring to sectors, we will refer to those of NIS2 [9]; however, any other such classification can be used. More specifically, the objectives can be summarised as follows:
1. Create an ML-powered Exercise Generation Framework that would:
(a) Generate structured exercise scenarios that reflect an organisation's current or future threat landscape, including potential threat actors and the corresponding tactics, techniques, and procedures (TTPs).
(b) Generate scripted events and incidents that could materialise in the context of a real attack against an organisation (belonging to any NIS2-defined sector).
(c) Identify and describe artefacts that could accompany the exercise scenarios as potential injects.
2. The generated scenarios should be expressed in a structured way or format, following an ontology. The generated outputs should be both machine and human-readable.
The use of case studies helps measure the results against the KPIs set, by comparing traditional exercise generation methods and tools versus the proposed ones, through an evaluation provided by an Ad-hoc Cyber Awareness Expert Group that peer reviews the outputs of the aforementioned methodology.

Main contributions
The contribution of this work is twofold. First, we predict future cyber-attack trends and the overall threat landscape against specific organisation sectors and, by clustering the predicted threats into training topics, propose customised awareness training. Then, we automate the process of generating the corresponding content for cyber awareness exercises with machine learning (ML).
Our proposed methodology, accompanied by a set of tools, allows an inexperienced EP to fully structure CSE scenarios from free text following our proposed Cyber Exercise Scenario Ontology (CESO). The exercise structure follows the traditional Scenario-Events-Incidents-Injects tree structure of ISO 22398:2013 [14], as depicted in Figure 1. Additional cyber exercise content is generated to complement the scenario, together with proposals for the fittest of a set of given training topics, to better prepare an organisation for an imminent cyber crisis.
Through our work, we fill the expertise gap of the average cyber security expert acting as an Exercise Planner and provide the tools and the methodology to design CSE scenarios in an easy, automated, and structured way. To achieve this, we combine the power of Machine Learning (ML), and more specifically Named Entity Recognition (NER), with a novel Cyber Exercise Scenario Ontology (CESO) and a CSE scenario generation framework dubbed AiCEF. Finally, an evaluation methodology and its results are presented, along with ideas for future work.

Related work
CSEs, also known as Cyber Defense Exercises (CDX), are considered an effective way to implement an engaging security awareness training experience [10,39]. CSEs have been characterised as a highly effective method to provide an ultimate learning experience [2], helping individuals or teams of varying expertise improve a range of skills related to information security. Furthermore, via exercising, organisations can uncover gaps in security policies, procedures, and resources [6,13], leading to improvements in awareness training, tools, and policies.
Previous work in the CSE domain [37] has highlighted the use of cyber defence competitions or live-attack exercises as a very effective way of teaching information security [7,15], helping teams design, implement, manage, and defend a network of computers [1,3,4,26,27]. Vigna [43] and Mink [23] further support these findings.
Further research was conducted on cyber defence competitions [33,46] and the most suitable architecture [38], while the tools and techniques to be used in order to create an active learning experience were described by Green et al. [12]. Patriciu and Furtuna [31] presented several steps and guidelines to be followed when designing a CSE. White [45] introduced a different approach to such live CSEs, presenting lessons learned and providing suggestions to help organisations run their own exercises. Other works in the literature examined how to run CSEs using a service provider model [22].
CSEs can be used as a tool to generate scientifically valuable datasets for future security research [35,40] and help uncover hidden risk from weak security policies and/or procedures [34]. CSEs can even be used to measure performance against specific standards [8] or team effectiveness based on behavioural assessment techniques [11]. Moreover, experiments have been conducted on various platforms, like the RINSE simulator [19] or a realistic inter-domain routing experiment platform [18], for the rendering of network behaviour.
Focusing further on the human aspect, Job Performance Modelling (JPM) using vignettes for improving cybersecurity talent management through cyber defence competition design, was described by Tobey [41].
A successful CSE counts heavily on the use of a robust scenario. Exercise scenarios must describe worst-case situations that participants can relate to and that are realistic enough to trigger seamless engagement. Intuitive scenarios can be a powerful tool for predicting future states or situations [2], [10], incorporating issues to be resolved, interactions, and consequences [12], [11], leading to a constructive training experience. An exercise's scenario is a sequential, narrative account of a hypothetical incident that provides the catalyst for the exercise and is intended to introduce situations that will inspire responses and thus allow demonstration of the exercise objectives [38]. In the context of CSEs, a scenario defines the training environment that will lead participants towards fulfilling the exercise objectives set [17]. The cyber security problem described in a scenario is itself portrayed in a structured representation, named the Master Scenario Events List (MSEL), which serves as the script for the execution of an exercise [38]. CSE scenario formats can vary [32], but two are the most prevalent:
• Outlined scenarios: provide a general summary of the impact of an event on assets [36].
• Detailed scenarios: contain exhaustive information sequentially describing the event's impact on specific services or sections of an organisation, along with a timeline for restoring key functions [29].
Recent trends in attack recognition utilise AI, ML, and NLP tools and techniques to improve efficiency. However, there is a lack of dedicated methodology focusing on CSE scenario generation.
There is a need for a methodologically built and annotated cyber exercise corpus that could train multiple algorithms for cyber exercise elements. Such a corpus should focus on the syntactic and semantic characteristics of the cyber exercise components and broaden our understanding of the malicious patterns used in cyber incidents that can be reused for CSE material. An approach similar to the one used in building and evaluating an annotated corpus for automated attack recognition has been utilised [42], only this time to extract CSE-relevant objects.
Following examples of cyber-security-related ontology creation [30], ontology-based scenario modelling for CSEs has already been proposed [44]. Still, an ontology that is truly compatible with machine learning algorithms is missing and is the focus of our work.
Cyber Exercise Scenario Ontology (CESO)

Our work so far highlighted the need for a common CSE scenario ontology for translating the various parts of an exercise while keeping a close link to popular, already used ontologies for cyber incident representations. The analysis of the domain revealed many taxonomies for different areas of the cybersecurity domain (types of attacks, vulnerabilities, sectors, harm), but these needed to be linked together in a model that allows an EP to represent a CSE accurately.
To build our ontology, the following questions were raised: 1. What is the scope of the ontology?
2. Should we consider reusing existing ontologies or taxonomies?
3. What are the important terms in the ontology?
The scope of the ontology was determined by asking competency questions to experienced EPs, which helped us identify the most important terms. We also used the domain experts' knowledge to identify prominent existing ontologies and ways to reuse them. The steps followed were:
1. Define the scope of our ontology
2. Identify other ontologies or taxonomies that can be used/reused
3. Define the main concepts and the relationships between them
4. Define the properties of the concepts
5. Implement the ontology

Scope
The scope of the defined model was to target an efficient and robust way of representing cyber incidents in the context of a CSE. After all, a CSE is a collection of simulated incidents provided to players in an orchestrated way to achieve the exercise's objectives.
The exercise ontology presented is incident-centric, focusing on using a bottom-up approach that allows us to identify and describe incidents first so we can group them into Events and then cover the full generation of CSE scenarios that fit the high-level objectives set.
The first building blocks, incidents, are assigned injects and mitigation actions that match the expected scope of the scenario. Injection timing is configured at the attribute level of each object. As we build toward the higher levels of the exercise, the scenario is formed. The selected format should allow for the scenario's portability to various existing tools (e.g. MISP) and support a decentralised type of CSE execution.

Ontologies/taxonomies to be (re)used
A set of existing ontologies, taxonomies, frameworks, standards, and formats relevant to cyber security has been explored, with a focus on representations of the key elements of CSEs, starting from their very building blocks: the incidents to be simulated. Our research concluded that a combination of the following would provide the necessary means: ISO 22398 [14], MITRE ATT&CK [25] and the Cyber Kill Chain [20], MITRE CVE [24], and STIX 2.1 [28].
We chose STIX 2.1 as the basis for our ontology; it defines a taxonomy of cyber threat intelligence, which we extend to cover our need to describe CSE scenarios. This lets us build on top of the STIX communities, reuse existing tools, and share CSE scenarios represented in the very same format.

Scenario Augmented Model
Based on the bottom-up approach, a Scenario Augmented Model (SAM) is proposed in two layers that cover both the informational and operational aspects with the same objects but different attributes.
The Informational Layer covers the context and main attributes of scenarios. Figure 2 describes the key relationships in the informational layer. The whole exercise is grouped using the Grouping object, which holds information related to the exercise's name, description, and scenario. All Events, Objectives, and the State of the World (SoW), with their matching objects (Campaign, Note, Report), are related to the Exercise Scenario.
One or more Incidents (Intrusion Set) can be related to Events. From there, various objects with interlinked dependencies form the Inject in a Course of Action instance that refers to all related objects of an Attack Pattern.
An Inject can contain the following objects: Attack Pattern, Tool, Vulnerability, Indicator, Malware, Threat Actor (who is attributed an Identity and is located at a Location), and a Course of Action. Injects do not have to be related to an Event or Incident. Examples are the STARTEX and ENDEX injects, which can be represented only with a Course of Action object but are directly related to the Scenario.
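To make the representation concrete, the fragment below shows how an inject's objects could be serialised as STIX 2.1-style JSON. This is a hedged sketch: the object IDs, names, and values are illustrative only, and a real implementation would generate spec-compliant UUIDs and timestamps (e.g. via the stix2 Python library).

```python
import json

# Hypothetical inject fragment in STIX 2.1-style JSON; IDs and values
# are illustrative, not spec-compliant identifiers.
inject_fragment = {
    "type": "bundle",
    "id": "bundle--0001",
    "objects": [
        {"type": "attack-pattern", "id": "attack-pattern--0001",
         "name": "Spearphishing Attachment"},
        {"type": "malware", "id": "malware--0001",
         "name": "GenericRansomware", "is_family": True},
        {"type": "course-of-action", "id": "course-of-action--0001",
         "name": "Isolate affected hosts"},
        {"type": "relationship", "id": "relationship--0001",
         "relationship_type": "uses",
         "source_ref": "malware--0001",
         "target_ref": "attack-pattern--0001"},
    ],
}

serialised = json.dumps(inject_fragment, indent=2)
types = [o["type"] for o in json.loads(serialised)["objects"]]
print(types)
```

Because the format is plain JSON, the same fragment can be loaded by any STIX-aware tool, which is precisely what enables the portability discussed above.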
The Scenario Operational Layer describes an exercise scenario's execution flow, mainly dealing with the delivery of injects to the intended recipients. There are two major interrelated parts: (1) the events/injects, which describe the detailed activities of the scenario and the expected actions from the participants, and (2) the participants.
The whole scenario, including Events, Incidents, and Injects, is stored in an Infrastructure object representing the Exercise Platform. This platform is used by EPs (Identity) to design and conduct the exercise, and by Observers and Players to interact with the Scenario. All participants are located in the same or different Locations. The Operational Layer is illustrated in Figure 3.

Implementing the Ontology
Keeping the structure of the CSE intact, the following STIX 2.1 objects have been repurposed to successfully represent the main CSE components, covering SAM along with the matching relationships.
Objects: All objects follow the STIX 2.1 definitions as per the specification. Relationships (representing the edges of the graph) have been identified between key objects, but more can be used.

CSE Object Extension: All used objects follow the STIX 2.1 specification, but some have been extended with additional attributes/properties to cover the needs of CESO, as shown in Table 3.

Automated Generation of Cybersecurity Exercise Scenarios
To create the envisioned ML-powered Exercise Generation Framework, we opted to use Python and to develop a set of tools that perform individual tasks, in the form of steps, which help an EP, regardless of her experience, create a timely and targeted cybersecurity exercise scenario. The proof-of-concept framework we developed is AiCEF, and its general outline is illustrated in Figure 4.
More concretely, to generate a concrete CSE scenario using AiCEF, the EP must perform the following steps. Initially, the EP can generate a Trend Report on specific tags (e.g. Ransomware); in AiCEF, this is done through the MLTP module. Then, the EP provides a set of relevant articles or free text to be parsed and converted to Incident Breadcrumbs, as we call them in the implementation, using the IncGen module. In the following paragraphs, we detail these steps and modules, providing some examples.

Machine Learning to CESO (MLCESO)
The most important step in our methodology is the creation of the ML pipeline that parses free text and extracts objects in CESO, as defined in the previous section. To do so, we need to train our ML models following a well-structured methodology consisting of three phases: corpus building, corpus annotation, and corpus evaluation using NER.

Corpus Building
As shown in Table 4, four incident sources have been identified as the initial input to our corpus. All these websites cover a wide variety of cyber security incidents in article format, dating back many years. The raw text was then processed using NLP techniques to form a reduced Incidents Corpus (IC). We used the text processing workflow illustrated in Figure 5 to prepare the collected incidents. Initially, all text was converted to the UTF-8 encoding scheme. Using dictionaries and the TextBlob library, we performed spelling corrections and removed special characters. Empty lines, specific stopwords, and specific punctuation marks were removed using traditional NLP libraries like NLTK and spaCy. Moreover, all HTML or other programming code, URLs, and paths were removed. Any illegal characters were also stripped, and all text was transformed to lowercase.
The standard Penn Treebank [21] tokenisation rules were utilised for sentence tokenisation, and finally, standardisation processes were applied to tune the incident text to facilitate annotation. At the end of this step, a corpus composed of incidents was formed. The corpus, from now on referred to as IC, contains 2,000 cyber security articles. This accounts for 35,745 sentences containing 819,690 words, leading to a vocabulary of 24,594 terms. An example of a corpus line ready for annotation is the following: {"text": "revil sodinokibi ransomware targets chinese users with dhl spam"}
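The cleaning steps above can be sketched as follows. The actual pipeline relies on TextBlob, NLTK, and spaCy; this stdlib-only sketch (with an illustrative stopword subset and sample sentence) only shows the ordering of the operations:

```python
import re

# Simplified stand-in for the cleaning workflow of Figure 5; the real
# implementation uses TextBlob (spelling), NLTK and spaCy (stopwords,
# punctuation). The stopword list below is an illustrative subset.
STOPWORDS = {"the", "a", "an", "of", "to", "see"}

def clean_incident_text(raw: str) -> str:
    text = raw.encode("utf-8", errors="ignore").decode("utf-8")  # force UTF-8
    text = re.sub(r"https?://\S+", " ", text)                    # drop URLs/paths
    text = re.sub(r"<[^>]+>", " ", text)                         # drop HTML tags
    text = text.lower()                                          # lowercase
    text = re.sub(r"[^a-z0-9\s.]", " ", text)                    # special chars
    tokens = [t for t in text.split() if t not in STOPWORDS]     # stopwords
    return " ".join(tokens)

sample = ("<p>REvil Sodinokibi Ransomware targets Chinese users -- "
          "see https://example.com</p>")
print(clean_incident_text(sample))
```

Running this on the sample sentence yields a lowercased, tag- and URL-free token stream, matching the shape of the corpus line shown above.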

Corpus Annotation
Following the CESO ontology, a simple model comprising six steps was developed to represent the annotation task. Entities and interconnections were formally described in an Annotators Reference Document (ARD) to align the efforts of converting words to tags. This file, along with the corpus guidelines and the CESO ontology, was given to the annotators to perform the annotation task using Prodigy. After completing the annotation, an inter-annotator agreement assessment took place using Cohen's Kappa metric, and the gold standard version of the IC was finally produced.
Our annotation methodology consists of the following steps.
Step 1 - Setting the Annotation Objectives: The main annotation objective was to create the appropriate semantic target to facilitate IC recognition by assigning the correct tag to in-context words in a sentence. Labelling all related words, sequences of words, or text spans in the cyber incident context was crucial to perform efficient NER or text classification later. Each word or text span was labelled with a type identifier (tag) drawn from a vocabulary created based on the CESO ontology. The tag indicates what the various terms denote in the context of a cyber incident and how they interconnect.
The following specific objective was set: identify keywords, syntax, and semantic characteristics to detect i) Threat Actor, ii) Cyber Security Incident, or iii) Victim characteristics. If any of them is found, label them with the corresponding tag.
Step 2 - Specifications Definition: A concrete representation of the annotation model to be used is created based on CESO. An abstract model that practically represents the annotation objectives was defined. A three-category classification (Attacker, Attack, Victim) was introduced as the basis of this abstract model for identifying cyber-incident-related terms in the analysed text. The category Other represents all remaining, out-of-context words.
Our model M consists of a vocabulary of terms T, the relations between these terms R, and their interpretation I. Thus, our model can be represented as M = <T, R, I>, where:
• T = {CESO, Attacker, Attack, Victim, Other}
• R = {CESO ::= Attacker | Attack | Victim | Other}
• I = {Attacker = "list of attacker-related terms in vocabulary", Attack = "list of cyber security incident or attack terms in vocabulary", Victim = "list of victim-related terms in vocabulary", Other = "other terms not related to the attacks"}
Step 3 - Annotator Reference Doc: The ARD is produced to help annotators with element identification and element association with the appropriate tags. The tags in Table 5 have been identified and mapped accordingly.
Step 4 - Annotation Task: The annotation task aimed to label the words of the IC corpus based on their semantic and syntactic characteristics. Two cybersecurity experts were assigned to label the words based on their semantic characteristics. By annotating the semantic characteristics of the words, the background information in each sentence was linked with CESO. The syntactic characteristics of the words were labelled using Prodigy.
Step 5 - Gold Standard Creation: The final version of the annotated incident corpus is generated.
The inter-annotator agreement (IAA) was validated using Cohen's Kappa [5,40]. The formula used is defined as follows:

k = (p0 - pe) / (1 - pe)    (1)

where p0 expresses the relative observed agreement and pe the hypothetical probability of chance agreement.
The produced IC corpus has N = 24,594 terms and m = 4 categories. The two annotators (A and B) agreed 397 times for the Attacker category, 1722 times for the Attack category, 932 times for the Victim category, and 21416 times for the Irrelevant category.
Table 7 shows the contingency matrix, where each x_ij represents the number of terms that annotator A classified in category i and annotator B in category j, with i, j = 1, 2, 3, 4. The cells on the diagonal (x_ii) represent the terms in each category for which the two annotators agreed on the assignment.
The observed agreement p0 is computed from the contingency matrix of Table 7, so, according to Equation 1, Cohen's Kappa is k = (p0 - pe) / (1 - pe) = 0.228 / 0.232 ≈ 0.98. Thus, based on the Cohen's Kappa value of 0.98, we can safely conclude [47] that the level of agreement for the corpus annotation task was almost perfect.
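The computation of Equation 1 from a contingency matrix can be sketched in a few lines. The 2x2 matrix below is a toy example, not the paper's Table 7 (whose off-diagonal cells are not reproduced here):

```python
# Cohen's kappa from a contingency matrix, as in Equation 1.
def cohens_kappa(matrix):
    n = sum(sum(row) for row in matrix)
    # Observed agreement: proportion of items on the diagonal.
    p0 = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Chance agreement: product of marginal proportions, summed per category.
    row = [sum(r) for r in matrix]
    col = [sum(matrix[i][j] for i in range(len(matrix)))
           for j in range(len(matrix))]
    pe = sum(row[i] * col[i] for i in range(len(matrix))) / n ** 2
    return (p0 - pe) / (1 - pe)

# Toy 2x2 example: p0 = 0.7, pe = 0.5, so kappa = 0.4.
toy = [[20, 5],
       [10, 15]]
print(round(cohens_kappa(toy), 3))  # 0.4
```

Feeding the full 4x4 matrix of Table 7 into the same function reproduces the reported value of 0.98.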

Training & Evaluation Using NER
The following methodology has been used to train and evaluate our Named Entity Recognition (NER) agent.
1. Preprocessing: The corpus has already been annotated, with each line of the corpus stored as a list of token-tag pairs. Each token was represented by a word embedding using the pre-trained English language model of the spaCy NLP library.
2. Model building: A model is built using spaCy.
3. Training: To train the models in spaCy, we specified a loss function for the model to measure the distance between prediction and truth, and a batch-wise gradient descent algorithm was specified for optimisation. One NER model was trained per object, as presented in Table 8.
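A minimal sketch of such a spaCy (3.x) training loop is shown below. The labels, the single training sentence, and the number of iterations are illustrative; the real pipeline trains one model per CESO object over the full annotated IC:

```python
import spacy
from spacy.training import Example

# Minimal spaCy 3.x NER training sketch; labels and data are illustrative.
nlp = spacy.blank("en")            # blank pipeline, no pre-trained model needed
ner = nlp.add_pipe("ner")
for label in ("ATTACKER", "ATTACK", "VICTIM"):
    ner.add_label(label)

# One hypothetical annotated sentence (character-offset entity spans).
train_data = [
    ("revil targets chinese users with ransomware",
     {"entities": [(0, 5, "ATTACKER"), (33, 43, "ATTACK")]}),
]

optimizer = nlp.initialize()
for _ in range(20):                # batch-wise gradient descent
    losses = {}
    for text, annotations in train_data:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)

doc = nlp("revil targets chinese users with ransomware")
print([(ent.text, ent.label_) for ent in doc.ents])
```

In practice the loop would shuffle and mini-batch the full corpus over many epochs and hold out a validation split, which is how the F1 scores discussed below are obtained.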
During this process, several decisions were made to further improve the accuracy by retraining the models. Since the aim was to create NER models that reach an F1 score of approximately 80%, we iteratively extended the annotation and retrained the models to pass this threshold.

Evaluation:
The performance assessment of the model was conducted by applying the model to the preprocessed validation data.While the results seem satisfactory, one can achieve further performance improvements in some tags.
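The per-tag scores used for this assessment follow the standard precision/recall/F1 definitions; the sketch below computes them from true-positive, false-positive, and false-negative counts (the counts themselves are hypothetical):

```python
# Per-tag precision, recall and F1 from TP/FP/FN counts.
# The example counts are hypothetical, chosen to land on the ~80% threshold.
def prf1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = prf1(tp=80, fp=20, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

A tag whose model falls short of the 80% F1 target triggers another round of annotation extension and retraining, as described above.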

We made an extra evaluation step with two experts against a set of 100 articles not used in the training or evaluation steps. The aim was to evaluate the models empirically against the selected tags. The two reviewers scored the NER accuracy per tag as presented in Table 9:
• HIT: the tag was correctly assigned (or correctly not assigned).
• PARTIAL: the tag was correctly assigned (or correctly not assigned), but not for all values.
The following findings should be highlighted:
1. The hit rate of four (4) NER models has been identified as very weak, with an abnormal difference from the F1 score identified in the previous step.
2. Names of attackers or malware can be a very vague topic to tackle using NER.
3. The attacker's origin cannot be properly identified with the use of the out-of-the-box spaCy LOC NER model. Locations are identified but can be related to the victim or be irrelevant to the attacker's origin.
4. The vulnerability NER model misses the correct formatting of CVEs. This issue can be solved using a regex that accurately detects CVEs in the text, in combination with the generated model.

Incident Generation and Enhancement (INCGEN)
Incident creation is the most important step of the scenario generation procedure and consists of several steps to achieve maximum customisation (Figure 6). All of the steps can be automated, generating a variety of incidents from which a planner can choose the most fitting. First, we need a set of texts to use as a baseline to generate our scenarios. These articles are parsed and mapped to CESO so that our modules can process them. To this end, we used the sources of Table 10 to generate our Knowledge database (KDb). Evidently, not all articles sent for parsing may contain enough, or even relevant, information to generate a CSE scenario. The tags of Table 11 were assigned to the training topics meta tag to help categorise text for later use in an exercise scenario-building process. An output report and visualisation (using the stixview library) of IncGen utilising the improved MLCESO tag detection can be seen in Figure 7.

APT Enhancer
To simulate the activity of known APT groups, a basic STIX 2.1 structure was created per actor using the Groups from MITRE, from which various attributes and TTPs were automatically extracted to populate our database. Thus, we generated a STIX 2.1 graph that can be used to compare and enhance other graphs. In this sense, during the enhancement process of an incident, the corresponding extracted graph is compared to all known APT actors, and the most similar is proposed for enhancement. For each supported STIX 2.1 object type, the object similarity function (STIX 2.1 Python API) checks whether the values for a specific set of properties match. Each matching property is weighted separately, as properties can have different levels of importance based on semantic similarity. The similarity score can range from 0 to 100, with a score of 0 representing no similarity between the two compared objects.
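The stix2 Python library ships such an object-similarity function; the stdlib-only sketch below illustrates the weighted-property idea behind it. The property weights and the two objects are illustrative, not the library's defaults:

```python
# Stdlib-only sketch of weighted property similarity between two STIX-like
# objects, in the spirit of the stix2 library's object-similarity scoring.
# Weights and example objects are illustrative.
WEIGHTS = {"name": 60, "external_references": 40}

def object_similarity(a: dict, b: dict, weights=WEIGHTS) -> float:
    """Return a 0-100 score; 0 means no weighted property matches."""
    matched = sum(w for prop, w in weights.items()
                  if prop in a and prop in b and a[prop] == b[prop])
    return 100.0 * matched / sum(weights.values())

apt = {"type": "intrusion-set", "name": "FIN7",
       "external_references": ["mitre-attack:G0046"]}
draft = {"type": "intrusion-set", "name": "FIN7",
         "external_references": ["unknown"]}
print(object_similarity(apt, draft))  # 60.0: only 'name' matches
```

Scoring a draft incident graph against every known APT graph this way, object by object, yields the ranking from which the most similar actor is proposed.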
In AiCEF, the EP may merge graphs completely or use only fragments.Thus, the EP may merge a draft incident graph with that of any known APT or proceed to merge with more than one APT actor's combined TTPs.

Storyline Text Generation
The Storyline Text Generator (STG) creates synthetic text based on predefined input. It uses a Python text generator and Generative Pre-trained Transformer 2 (GPT-2), a large-scale unsupervised AI language model that can create coherent paragraphs of text from small pieces of text input.

Trend Prediction Module (MLTP)
In our implementation, we chose the SARIMA model to represent the trends on the existing KDb of 2,970 articles, as presented in Table 10.
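The step preceding the SARIMA fit (which could be done, for instance, with statsmodels' SARIMAX) is turning tagged articles into a per-tag monthly time series. The stdlib-only sketch below shows that step, plus a seasonal-naive baseline forecast; the articles and tags are hypothetical:

```python
from collections import Counter

# Hypothetical tagged articles from the KDb; in practice these come from
# the crawled sources of Table 10.
articles = [
    {"tags": ["ransomware"], "month": "2022-01"},
    {"tags": ["ransomware"], "month": "2022-01"},
    {"tags": ["phishing"],   "month": "2022-01"},
    {"tags": ["ransomware"], "month": "2022-02"},
]

def monthly_counts(articles, tag):
    """Count how many articles carry a tag, per month."""
    return Counter(a["month"] for a in articles if tag in a["tags"])

def seasonal_naive(series, season=12):
    """Baseline forecast: repeat the observation one season earlier."""
    return series[-season] if len(series) >= season else series[-1]

counts = monthly_counts(articles, "ransomware")
print(sorted(counts.items()))  # [('2022-01', 2), ('2022-02', 1)]
```

The resulting monthly series is what the SARIMA model is fitted on to produce the Trend Report; the seasonal-naive function is only a sanity-check baseline, not the framework's predictor.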

Cyber Exercise Generation (CEGEN)
The exercise generation flow can be broken down into several steps (Figure 8).

Putting everything together
Now that we have described the main modules and their functionality, we can present AiCEF and its outcomes. At some point, the EP will have populated the KDb with incidents, which are converted to graphs based on our ontology, CESO. Once the EP wishes to create a new scenario for a CSE, depending on the intended objectives, he/she provides AiCEF with a set of keywords. To facilitate the planner's work, AiCEF can initially generate a trend report that allows the EP to identify trends relevant to the objectives at the time of the exercise execution. Based on the keywords, AiCEF crawls its database for the most relevant articles and returns a corresponding graph. From there, the EP can enhance the graph by merging it with that of known threat groups. The new graph can then be filtered according to the intended Cyber Kill Chain phases to be simulated. The above actions lead to a more specific incident graph representation that is ready to be populated with injects. The progress of an incident graph generation is visualised in Figure 9. This process is repeated multiple times to generate the desired number of incidents for a specific CSE. The EP then follows the CEGen flow to compile a full exercise and generate a scenario (Figure 10) and exercise graph (Figure 11).

Evaluation Methodology & Results
We developed a case study to help measure the effectiveness of our proposed framework and underlying methodology. To this end, the steps below were followed.
1. Scenario Content Generation: A group of exercise planners of varying expertise was asked to generate the same exercise scenario using traditional exercise means and using the AiCEF methodology and tools, while being monitored on timeliness, effectiveness, creativity, and the methodology used.

Content Evaluation:
The reports were then anonymised and given to a group of evaluators, who graded, through a questionnaire, the complexity, technical depth, and richness of lessons learnt of the generated exercise scenarios, as per the objectives and KPIs set.

Results collection and Analysis:
The results of this process were evaluated against the previously set KPIs to estimate: (a) Improved speed in Cyber Exercise Scenario generation (quantitative) using AiCEF.
(b) Improvement in quality in Cyber Exercise Scenario generation (qualitative) for inexperienced Planners using AiCEF.
(c) Improved relevance of proposed Cyber Exercise Scenarios to the current threat landscape (qualitative) using AiCEF.

Scenario Content Generation
Four EPs were selected to individually generate a CSE scenario according to the provided high-level exercise requirements and specifications; see Figure 12. The EPs were split into two groups based on their previous experience with the task. All EPs have deep knowledge of cyber security, and their skill sets resemble that of a CISO. Each group consisted of one experienced and one inexperienced planner. The first group was briefly introduced to the basics of developing CSE scenarios, while the second one was provided a course on using AiCEF and the accompanying tools. Both groups were provided the same Scenario Template (ST) to fill in as the output of their task. Then, we created a third group, consisting of a Scripted Exercise Planner (SEP), using different parameters and flows of the AiCEF methodology and toolset.
The provided ST had the following generic structure:
• Section 1: Storyline (SoW)
• Section 2: Scenario & MSEL
• Section 3: Scenario Analysis
• Section 4: Resources Used
We provided detailed instructions on the expected content per paragraph to all involved planners to streamline the information of the generated reports and create homogeneous outputs for evaluation in the later step. As a result, five complete exercise scenarios were generated, as shown in Table 12.

Scenario Content Evaluation
To evaluate the scenarios above, we conducted an anonymous survey. To avoid bias, we invited a number of evaluators from different Cyber Awareness and Cyber Exercise groups with varying expertise, ethnicity, and focus sectors to participate in the evaluation process. In total, 16 experts responded; their demographic statistics are illustrated in Table 13. The survey took the form of an online questionnaire consisting of 11 questions: eight questions were used to evaluate the generated scenarios, two served as a Turing test to determine whether the AI used could be identified by humans, and a set of complementary questions covered demographics and future improvements. All five scenarios were provided using only the "Eval_Tag" parameter for tracking purposes, without providing additional information on the authors of the scenarios.
The eight scenario evaluation questions and their corresponding scores are the following:
1. How do you evaluate the relevance of the State of the World text to the Objectives of the Exercise? Score range: 0-4.
2. How do you evaluate the relevance of the selected Events to the Objectives of the exercise? Score range: 0-4.

Results Analysis
The analysis of the input provided a good understanding of the strengths and potential areas for improvement of AiCEF. It also provided better insight into the exercise scenario creation process, with good input for future improvement based on the experience of real EPs. Based on the analysis of the provided input, we can safely conclude that the two scenarios created with the help of AiCEF, Sc3:ExpHum&AI and Sc4:NovHum&AI, scored higher than any other scenario. The hybrid approach of a human exercise planner using AiCEF currently outperforms a seasoned exercise planner working alone, even when that planner is a novice. Furthermore, the Scripted Exercise Planner generated a relatively good scenario (Sc5:AiCEF) that can be evaluated as equal, if not better, than that of a novice planner with a strong cyber security background (Sc2:NovHum).
In what follows, we provide a breakdown of parameters evaluated to highlight the strengths and weaknesses of using AiCEF based on the experts' input.
The use of AiCEF by a Scripted Exercise Planner performed well (top 3, outperforming humans) in Relevant Resources, Events Relevance, and Scenario Technical Depth. On the other hand, AiCEF did not perform as well in Threat Actor Description, Scenario Complexity, and Incidents to Objectives Relevance. This can be explained by the fact that the raw generated content can include conflicting information or content that does not match the requested high-level context. After human curation, the content can easily be improved to compete with that of a seasoned exercise planner. In fact, AiCEF helped the humans who used it excel in scenario creation, dominating all categories versus their human counterparts. The human expert using AiCEF (Sc3:ExpHum&AI) managed to create a better scenario 33.33% faster than their expert peer using regular tools (Sc1:ExpHum).
Nevertheless, the most impressive finding was that the novice planner using AiCEF (Sc4:NovHum&AI) outperformed the seasoned exercise planner (Sc1:ExpHum), as seen in Figure 14, providing a good indication of the capabilities of the proposed framework. Note that the scenario developed by the novice planner with the help of AiCEF matches, among others, that of a seasoned planner on the question "Would you use the scenario?". Moreover, evaluators could not distinguish the purely AI-generated content (Sc5:AiCEF), as shown in Table 14, categorising the scenario as either hybrid or human-made. Indeed, its results were similar to those of a novice human planner.

Table 14: Turing test to evaluate the performance of AI.

On the question "How do you define the scope/objectives of the exercise?", most evaluators replied with two or more of the offered options, with Known incidents & lessons learnt and Risk assessment as the most prevalent replies.
On the question: "How do you define the scenario content?" most evaluators replied with two or more options, with news and articles being the most important source followed by the known incident option.
To the question "How much time do you invest in the Scenario Content Development?", the evaluators replied with an average of 53 hours. This means that tools which can improve the CSE scenario content development process by reducing this time without compromising quality could be of great use.
Finally, for the question "What tools did you use to create the scenario or define the objectives, if any?", the responses varied between Google Search, Cyber Security (News) websites, MS Office, and Internet/Table Top Research.

Conclusions & Future Work
The shortage of cybersecurity experts and awareness is a well-known and significant worldwide challenge. CSEs can address some aspects of this problem; however, the shortage of experts to develop new CSEs, coupled with the timeliness and relevance requirements of the developed CSEs, calls for novel solutions. In this work, we try to fill this gap by facilitating the work of EPs with the use of AI. To this end, we developed a novel AI-powered exercise generation framework called AiCEF, which generates structured exercise scenarios that reflect the current or future threat level that an organisation faces, including potential threat actors and TTPs. Moreover, it generates scripted events that could happen in the context of a real attack against a specific organisation belonging to one of the NIS2 critical infrastructure sectors. AiCEF also identifies and describes artefacts that could accompany the exercise scenarios. To this end, AiCEF uses a new ontology that we built, named CESO, with which we were able to generate structured exercise scenarios that are both machine- and human-readable.
Our proposed methodology and developed tools can provide tangible qualitative and quantitative added value to CSE development and cyber awareness in various ways. For instance, in our experiment, the time needed for CSE scenario generation decreased by 33.33% without impacting quality. In fact, AiCEF improves the quality of CSE scenario generation for an inexperienced/novice EP, elevating the generated scenario quality to the level of an experienced EP. Finally, the relevance of the proposed CSE scenarios is aligned with the current threat landscape, as indicated by the evaluation of all the scenarios generated using AiCEF.
While AiCEF is rather efficient, there is room for various improvements. For instance, for operational usage, more sources have to be parsed (e.g. threat reports and alerts) to generate more diverse scenarios. While GPT-2 and GPT-3 can create textual output of very good quality, it would be even better if the text synthesizer were based only on cyber security related resources, so that the generated text is even more relevant and uses, e.g., better technical terms. As indicated in the evaluation, AiCEF could benefit from further improvements to the threat actor description section. Finally, we plan to enhance AiCEF to detect the Cyber Kill Chain phases automatically using NER and to create relevant CSE injects for a number of popular categories such as Phishing, while also automating inject description and content generation using AI-powered text synthesis.
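The planned kill-chain detection would rely on NER; as a rough illustration of the intended input/output shape only, the sketch below tags inject text with Cyber Kill Chain phases using a naive keyword heuristic. The phase list and keywords are our own illustrative choices, not part of AiCEF.

```python
# Naive keyword stand-in for the NER-based Cyber Kill Chain phase detection
# planned as future work; PHASE_KEYWORDS is illustrative, not from AiCEF.
PHASE_KEYWORDS = {
    "Reconnaissance": ["scan", "osint", "enumerat"],
    "Delivery": ["phishing", "attachment", "malicious link"],
    "Exploitation": ["exploit", "vulnerability", "cve-"],
    "Actions on Objectives": ["exfiltrat", "data destruction", "wipe"],
}

def tag_phases(inject_text: str) -> list[str]:
    """Return the kill-chain phases whose keywords appear in the inject."""
    text = inject_text.lower()
    return [phase for phase, kws in PHASE_KEYWORDS.items()
            if any(kw in text for kw in kws)]

inject = ("Malicious word documents were sent via phishing; "
          "data destruction followed.")
print(tag_phases(inject))  # ['Delivery', 'Actions on Objectives']
```

A production version would replace the keyword table with trained NER labels, but the output shape (inject text in, phase tags out) would be the same.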

State of the World (SoW)
Description: STG Input: 'State sponsored Threat actor EnergeticGoat has historically targeted government organizations, non-government organizations (NGOs), think tanks, military, IT service providers, health technology and research, and telecommunications providers. With this latest attack, EnergeticGoat attempted to target the Energy Sector'
STG Output: "State sponsored Threat actor EnergeticGoat has historically targeted government organizations, non-government organizations (NGOs), think tanks, military, IT service providers, health technology and research, and telecommunications providers. With this latest attack, EnergeticGoat attempted to target the Energy Sector in South Europe. More specifically, the following targets have been identified: 1) EU energy firms that have received investments from the Energysubsidiary, 2) SA government units and departments who receive funding from the Energysubsidiary, 3) the SA energy regulator, 4) the Energy Sector's main lobbying and public relations organization, 5) the Energy Sector's largest customer, the Energysubsidiary, and 6) the Energy Sector's corporate regulator, the Energy Regulator of SA (ERA). Energy Sector entities and their customers and regulators are usually among the least likely organizations to be attacked by EnergeticGoat. For example, the Energy Regulator of SA (ERA) is not a high-profile organization. It is a largely unregulated regulatory body that provides general oversight of the energy sector and serves as a key regulator and enforcement body of relevant energy laws in EU. Another example would be the Energy Corporation (Zesa). Zesa is a state-funded enterprise that provides essential infrastructure and related services to its citizens, including the electricity, gas, and broadband services that power large swaths in EU. The attackers may have targeted other Energy Sector actors, such as Eskom or the SA Energy Institute (SAEI). However, it is thought that EnergeticGoat is specifically targeting Zesa and other organizations as a result of the threat actor's perceived inability to target the SA Energy Sector as a whole. EnergeticGoat's choice to target a South African energy firm with its latest attack is indicative of this perception."

Scenario Summary
EXERCISE NAME: Energy Test
EXERCISE OBJECTIVES: Test detection of Attacks and Incident Handling Procedures

### EVENT 1: EnergyWipeOut ###
Description: STG Input: 'State sponsored Threat Actor EnergeticGoat is believed to be behind cyber-attacks against Energy Corp. The attackers used PHISHING & DATA DESTRUCTION techniques to'
STG Output: "State sponsored Threat Actor EnergeticGoat is believed to be behind cyber-attacks against Energy Corp. The attackers used PHISHING & DATA DESTRUCTION techniques to infect the system with malware. In addition to the malware used, the attackers also used other tools to get around the security countermeasures of ENERGYCORP. It is understood that it will take at least a few days for the authorities to identify the attackers as they have only just informed the company about the attack. A spokesperson for the company said the company did not have any details as to the motive behind the attack as it was still investigating. "We are working with law enforcement to bring this to the most serious level. The situation is under investigation and we cannot comment any further than that""

*** Incident 1.1: EnergyPhish ***
Description: STG Input: 'EnergeticGoat deployed Phishing attacks against Energy Corp. Malicious word documents were sent'
STG Output: "EnergeticGoat deployed Phishing attacks against Energy Corp. Malicious word documents were sent to Energy Corp. employees in an attempt to obtain personal information. On October 26, 2022, a new employee of Energy Corp. received a suspicious email containing a malicious word document. The email identified as 'malware' was in fact a phishing attack with the subject 'Vendors of Energy Corp. under Federal Investigation'"

Generate a CSE scenario for a cyber awareness exercise by filling in a provided Scenario Template. The CSE should include 2 events consisting of 1 incident each. All incidents should be accompanied by a short description of indicative injects to be sent to players. At least 3 inject descriptions per incident should be provided. The company which will use the scenario is an Energy Service Provider and all its Employees can be potential Players. The exercise should last between 2-4 hours and can include technical artefacts for analysis. The two main objectives of the exercise are:

3. The proposed methodology and tools created should provide qualitative and quantitative added value in CSE development and cyber-awareness by measuring the following Key Performance Indicators (KPIs): (a) Improve the speed in CSE generation (quantitative); (b) Improve quality in CSE generation (qualitative) for inexperienced EPs; (c) Improve the relevance of proposed CSE scenarios to the current threat landscape (qualitative).

Figure 7: IncGen output report and visualisation.

The Trend Prediction Module provides valuable information to any EP by deep diving into the KDb and extracting interesting trends based on predetermined Training Objectives to compile a Trend Report. To compile the report, MLTP performs the following steps:
1. Receives input such as filter tags (Sector, Attack Type, Training Objective).
2. Extracts incident statistics for the specified sector, such as Training Objective breakdown, top attackers, top techniques, top malware used, and top vulnerabilities used.
3. Performs time series analysis on the data for the specific Attack Type and/or Training Objective, plotting and calculating future trends.

Figure 8: CEGen workflow.

Using the CEGen module will lead to the generation of the following outputs:
• Exercise Scenario Summary along with a State of the World (SoW) skeleton
• A MSEL tree skeleton (name and description)

Figure 9: IncGen execution flow with intermediate representation steps.

Figure 10: Sample text of an AI-generated exercise.

Figure 13: Overall performance of evaluated scenarios based on total score.

Figure 14: Scenario evaluation parameters: Scenario use; (f) SoW vs Objectives; (g) Technical Depth; (h) Threat Actors Description.

Figure 18: Experts' responses to the "How do you define the scenario content?" question.
The STIX 2.1 model describes an adversary and adversary activities in appropriate data structures by default. STIX Domain Objects cover Threat Actor, Malware, Tool, Campaign, Intrusion Set, and Attack Pattern (referencing the Common Attack Pattern Enumeration and Classification taxonomy, CAPEC), perfectly covering what are called incidents & injects in the CSE nomenclature. Moreover, STIX 2.1 enables organisations to share CTI in a consistent and machine-readable manner, allowing security communities to better understand what computer-based attacks they are most likely to face and to anticipate and/or respond to those attacks faster and more effectively.
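Because STIX 2.1 objects are plain JSON, the mapping of a CSE incident fragment onto STIX Domain Objects can be sketched without any dependencies. The object names come from the paper's running example; the helper and property choices are only illustrative, not AiCEF's actual serialisation code:

```python
import json
import uuid

def sdo(obj_type, **props):
    """Minimal STIX 2.1 Domain Object as a plain dict (illustrative only)."""
    return {"type": obj_type, "spec_version": "2.1",
            "id": f"{obj_type}--{uuid.uuid4()}", **props}

# A CSE incident fragment mapped onto STIX 2.1 objects
actor = sdo("threat-actor", name="EnergeticGoat")
malware = sdo("malware", name="EnergyWipeOut", is_family=False)

# STIX Relationship Object linking the actor to its malware
uses = {"type": "relationship", "spec_version": "2.1",
        "id": f"relationship--{uuid.uuid4()}",
        "relationship_type": "uses",
        "source_ref": actor["id"], "target_ref": malware["id"]}

# A bundle makes the fragment shareable as machine-readable CTI
bundle = {"type": "bundle", "id": f"bundle--{uuid.uuid4()}",
          "objects": [actor, malware, uses]}
print(json.dumps(bundle, indent=2))
```

In practice a library such as python-stix2 would be used instead of hand-built dicts, since it validates required properties and timestamps.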

Table 1: CSE Components to STIX 2.1 objects mapping.

Relationships: All relationships are implemented as per the STIX 2.1 relationship object specifications. The relationships in Table

Table 3: Objects Extension Matrix.

the MLCESO module. In essence, MLCESO takes as input a text and maps it to our CESO ontology. Therefore, this step extracts all the relevant information for a cybersecurity exercise from the text. The breadcrumbs are then stored in a knowledge database called KDb, with a name tag for quick retrieval. Using IncGen, the EP generates several Incidents based on the provided meta tags. The EP may enhance these Incidents using the Graph Enhancer module to simulate known APT activity and, this way, fill in the possible missing information. The Incidents come with a set of Injects that can be edited on the fly. All Incidents are named and stored locally for later use. Finally, using CEGen, the EP can create a CSE scenario by defining various attributes such as the CSE name and the number of Events and Incidents. The various objects are then merged into a single STIX 2.1 graph, and the scenario is generated along with a State of the World storyline (SoW).
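The final CEGen step, merging the object fragments into a single STIX 2.1 graph, amounts to de-duplicating objects shared across incidents by their STIX id. A minimal sketch, using shortened placeholder ids rather than full STIX UUIDs for readability:

```python
def merge_incidents(*incident_graphs):
    """Merge incident object lists into one scenario graph, keeping each
    STIX object (e.g. a threat actor shared by two incidents) only once."""
    merged = {}
    for graph in incident_graphs:
        for obj in graph:
            merged[obj["id"]] = obj  # later duplicates overwrite by id
    return list(merged.values())

# Two incidents that share the same threat-actor object (placeholder ids)
actor = {"type": "threat-actor", "id": "threat-actor--1",
         "name": "EnergeticGoat"}
incident1 = [actor, {"type": "malware", "id": "malware--1",
                     "name": "EnergyWipeOut"}]
incident2 = [actor, {"type": "attack-pattern", "id": "attack-pattern--1",
                     "name": "Phishing"}]

scenario_graph = merge_incidents(incident1, incident2)
print(len(scenario_graph))  # 3: the shared actor appears only once
```

Keying on the STIX id is what makes the merge safe: two incidents that reference the same actor collapse to a single node in the final scenario graph.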

Table 4: Corpus Collection Count.

Table 5: Annotation Tags per Category.

Table 6: Annotation Tags per Category Example. Table 6 presents the annotation in action through some examples.

• MISS: The tag was either assigned wrongly or was not assigned at all when it should have been.

Table 9: AI Models Scores vs Reviewers Evaluation. H: Hit, P: Partial, M: Miss.

Table 11: Training topics.