A Systematic Mapping Study and Empirical Comparison of Data-Driven Intrusion Detection Techniques in Industrial Control Networks

Growing connectivity between modern industrial control infrastructure and the external Internet has created a critical need to secure these networks against a wide range of cyberattacks. An intrusion detection system (IDS) is a protective mechanism that detects new threats and malicious activities before they harm the critical infrastructure of an industrial process. This study reviews state-of-the-art artificial intelligence techniques for developing IDSs in industrial control networks by carrying out a systematic mapping study. We included 74 key publications from the current literature and grouped them by type of learning task, i.e., supervised, unsupervised, and semi-supervised. Other mapping categories were also covered, including publication year, publication venue, dataset considered, and IDS approach. The review helps researchers understand the present status of artificial intelligence techniques applied to IDSs in industrial control networks. This study also reports an empirical assessment of several classification algorithms such as random forest, gradient boosting machine, extreme gradient boosting machine, deep neural network, and stacked generalization ensemble. Statistical significance tests were used to assess the classifiers’ performance differences across multiple scenarios and datasets. Overall, this paper provides a contemporary systematic mapping study and empirical evaluation of IDS approaches in industrial control networks.


Introduction
An industrial control network is a collection of interconnected devices responsible for managing and monitoring physical equipment in the industrial domain [1]. With the rapid development of information and communication technology, manual labor has been replaced by more reliable automated equipment, enabling better production monitoring and quality control in industrial operations. As a result, efficient communication that connects all of this equipment is desirable, which has led communication networks to penetrate industrial segments. Industrial control networks, hereafter referred to as industrial control systems (ICSs), can be decomposed into three main components: programmable logic controllers (PLCs), supervisory control and data acquisition (SCADA) systems, and distributed control systems (DCSs) [2]. In the past, ICS networks were mostly physically isolated from outside networks due to the lack of communication protocols. By contrast, today's ICSs are extensively connected to external networks, including Internet of Things (IoT) platforms that allow low-cost productivity and improved performance [3,4]. However, this connectivity raises security concerns, since ICSs are prone to cyberattacks that may originate from both internal and external networks [5,6].
The wide variety of cyberattacks on ICSs has attracted growing attention because of a considerable rise in the number of security incidents in ICSs, which indicates severe infrastructure vulnerability [7]. Moreover, since ICSs include critical facilities, e.g., nuclear plants, power grids, and other industrial control systems, insecure infrastructure and inadequately protected industrial networks might put industries at huge financial risk [8]. A successful attack on an ICS would severely harm any industry.

Negative consequences include financial loss, operational failure, damaged equipment, piracy of industrial property, and significant safety risks. The configuration and scale of an ICS determine its exposure to faults: the larger the system, the more opportunities attackers have to exploit it. An ICS that upgrades its legacy system with advanced tools, e.g., the Industrial Internet of Things (IIoT), might face more specific threats and security risks. Hence, security protection and mitigation strategies for the relevant ICSs are a must [9].
One strategy for addressing the issues mentioned above is to develop intrusion detection systems (IDSs). An IDS is one of the protection mechanisms used to counter unauthorized activities within a system network that exploit vulnerabilities in ICS software. It aims to detect and intercept attacks automatically by analyzing network and file access logs, audit trails, and other relevant information in a computer system [10,11]. Since the earliest IDS concept was introduced by Anderson [12], there has been a considerable increase in research interest in implementing intrusion detection technology for ICSs. Artificial intelligence (AI) techniques, e.g., machine learning and deep learning algorithms, have been utilized to improve the performance of IDSs [13]. IIoT devices can produce large amounts of data from sensors, machine-to-machine (M2M) communication, and automation. This has shifted the research direction from traditional data analysis using shallow machine learning (ML) to big data analysis using deep learning (DL) techniques [14].
In addition, because of the ever-increasing complexity of ICSs, conventional intrusion detection systems from the information technology domain are not suited to industrial processes [15], which has made DL-based intrusion detection techniques attractive. This study presents a systematic review of state-of-the-art artificial intelligence techniques used for intrusion detection/prevention in ICSs. The study has been extended to include DL algorithms, such as deep neural network (DNN), convolutional neural network (CNN), and recurrent neural network (RNN), giving researchers and practitioners insight into the current status and future trends of the IDS literature in the ICS environment.
The remainder of the paper is structured as follows. Section 2 discusses the basic concepts of industrial control and intrusion detection systems. Section 3 substantiates the current research by comparing it to several similar survey studies, whereas Sect. 4 details the mapping study methodology. Section 5 summarizes and explains the results from the mapping study for each category. Section 6 examines several methods for implementing IDSs in ICSs, followed by Sect. 7, which includes the concluding observations and discusses the future research directions.

Industrial Control Systems
An ICS can be viewed as a set of interconnected devices, systems, networks, and controls used to automate industrial processes [16]. Each ICS operates differently, depending on the type of industry, to handle its tasks efficiently. The devices and protocols in an ICS serve as the backbone of almost all industrial sectors and major facilities, providing infrastructure for electricity generation and distribution, water treatment and supply, manufacturing, and transportation.
ICSs come in several variants, the most typical of which are SCADA systems, DCSs, and PLCs. Nevertheless, the distinctions and boundaries between these categories are not always clear-cut, and drawing sharp lines has become harder as the technologies used by these categories advance. SCADA systems are primarily employed to acquire and process large amounts of data and to control industrial equipment by issuing remote commands [1,18]. DCSs consist of multiple local controllers that are managed by a centralized supervisory control loop. PLCs are digital computing devices that take inputs from data sources, e.g., sensors, transmit them to the production units, and provide the outputs through human-machine interfaces.
Fig. 1 A multi-level ICSs architecture [17]
An ICS is composed of a multi-level architecture (see Fig. 1). Level 0 forms the system's front line, where the physical industrial components and their related instrumentation are organized. These devices can be actuators and sensors that are involved in performing diagnostic operations and communicating with other components. The aim of Level 1 is to control and manage the industrial process using controller devices, e.g., PLCs. Structurally, PLCs are composed of computing components, i.e., CPU, RAM, input/output modules, and communication interfaces, that allow real-time communication with sensors and actuators [19]. Level 2 involves control servers responsible for collecting information from the lower layers for monitoring and diagnostic purposes. The collected information is then presented to the operators via a human-machine interface (HMI), a graphical display that shows the state of the physical process. Lastly, Levels 3-4 cover resource allocation and optimization, maintenance planning, and quality control. These actions are planned based on the information collected at the lower levels.
Compared to prevalent information technology (IT) systems, ICSs have specific characteristics that must be taken into account, and these differences should not be overlooked when considering security measures within the industrial control ecosystem. Table 1 outlines some key distinctions between conventional IT systems and ICSs [1,16,20].

Intrusion Detection Systems
An intrusion detection system is a reactive security mechanism used to monitor the security status of a network by detecting external attacks and anomalous server operations. It aims to provide credible evidence that an information system has been intruded upon. With respect to the detection approach, IDSs fall into two distinct categories, i.e., anomaly-based and misuse-based. Anomaly-based approaches assume that an intruder can be detected by inspecting deviations from regular network traffic. Their advantage is the ability to detect previously unknown attacks; however, they still suffer from a considerable false alarm rate [22][23][24]. Misuse-based approaches [25], on the other hand, rely on known attack signatures: a possible attack is analyzed and detected by comparing it with pre-defined signatures stored in a knowledge base of attacks. A pattern-matching approach is commonly used for this detection task. In contrast to anomaly-based IDSs, misuse-based IDSs generate a lower false alarm rate, yet they lack the ability to detect unknown attacks.
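The contrast between the two detection approaches can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed studies: the signature byte patterns, the traffic-rate feature, and the threshold `k` are all hypothetical choices made for illustration only.

```python
from statistics import mean, stdev
from typing import List, Tuple

# Hypothetical knowledge base of attack signatures (illustrative byte patterns only).
SIGNATURES = {
    "forced-coil-write": b"\x05\xff\x00",
    "diagnostic-restart": b"\x08\x00\x01",
}

def misuse_detect(payload: bytes) -> List[str]:
    """Misuse-based: return the names of all known signatures found in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]

def fit_baseline(normal_rates: List[float]) -> Tuple[float, float]:
    """Anomaly-based: summarize regular traffic by its mean and sample standard deviation."""
    return mean(normal_rates), stdev(normal_rates)

def anomaly_detect(rate: float, baseline: Tuple[float, float], k: float = 3.0) -> bool:
    """Flag a sample that deviates more than k standard deviations from the baseline."""
    mu, sigma = baseline
    return abs(rate - mu) > k * sigma

# Assumed packets-per-second readings observed during normal operation.
baseline = fit_baseline([100.0, 102.0, 98.0, 101.0, 99.0])

print(misuse_detect(b"\x01\x05\xff\x00\x10"))  # payload containing a known pattern
print(anomaly_detect(250.0, baseline))         # far from the learned baseline
print(anomaly_detect(101.0, baseline))         # within normal variation
```

The signature matcher can only report patterns already present in its knowledge base, whereas the baseline detector reacts to any sufficiently large deviation, which is also why it tends to raise more false alarms.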
Additionally, IDSs can be classified into two primary deployment types, namely host-based and network-based. The primary objective of host-based intrusion detection systems (HIDSs) is to monitor a local computer system and report events that occur on it. Hashing the file system is one example of a HIDS technique: untrustworthy behavior is identified by comparing a recalculated hash value with the value previously saved in a database. On the other hand, network-based intrusion detection systems (NIDSs) are intended to monitor network traffic and detect malicious activity by examining inbound network packets. To summarize, Fig. 2 illustrates the breadth of IDSs discussed in [21].
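The hash-comparison idea used by HIDSs can be sketched as follows. This is a minimal illustration under stated assumptions: SHA-256 is our choice of hash function, an in-memory dictionary stands in for the database, and the file name `plc.conf` is hypothetical.

```python
import hashlib
import os
import tempfile

def file_digest(path):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_baseline(paths):
    """Store a trusted digest for every monitored file (the 'database')."""
    return {p: file_digest(p) for p in paths}

def changed_files(baseline):
    """Recompute digests and report files that are missing or no longer match."""
    return [p for p, saved in baseline.items()
            if not os.path.exists(p) or file_digest(p) != saved]

# Demonstration on a temporary file standing in for a monitored configuration file.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "plc.conf")
    with open(target, "w") as f:
        f.write("setpoint=42\n")
    baseline = build_baseline([target])
    unchanged = changed_files(baseline)   # nothing modified yet
    with open(target, "a") as f:
        f.write("setpoint=99\n")          # simulated tampering
    tampered = changed_files(baseline)

print(unchanged)
print(len(tampered))
```

In practice the trusted digests would need to be stored where an intruder cannot rewrite them; otherwise the comparison itself can be defeated.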

Problem Definition and Motivation
Most previous research concentrates on machine learning, deep learning, and intrusion detection in industrial control systems. Some surveys have emphasized machine learning algorithms [26][27][28][29], intrusion detection in ICSs [30], or a particular IDS approach, e.g., anomaly detection [31]. Moreover, most of the survey frameworks are not derived from a systematic review of existing research; therefore, their coverage and significance remain limited. As far as we can tell, no studies have systematically surveyed the feasibility of utilizing machine learning and deep learning techniques for intrusion detection in ICSs. Table 2 presents some of the relevant prior reviews and highlights the research gaps.
To bridge this research gap, we conduct a systematic mapping study and empirical evaluation focusing on the present literature on intrusion detection in ICSs using machine learning and deep learning techniques. The systematic mapping study was initially proposed by [32,33]. It is a research methodology whose objective is to provide a thorough overview of a field of interest, characterize the research gap, and offer some remarks on future research directions.
Utilizing this procedure, we categorize machine learning and deep learning-based IDSs techniques applied in ICSs, show frequencies of publications, combine the results to answer some detailed research questions, and present a visual summary by mapping the results.
This study contributes to the existing literature by providing state-of-the-art information about implementing machine learning and deep learning techniques for intrusion detection in industrial control networks. We argue that this systematic mapping study will allow researchers and professionals to formulate more suitable machine learning or deep learning-based IDS techniques. This study is not a cure-all for the research challenges in intrusion detection for ICSs; however, it is a significant starting point for advancing the use of machine learning and deep learning-based IDSs in industrial control environments.

Procedure of Mapping Study
This section describes the steps involved in performing a systematic mapping study. It follows the criteria for conducting secondary research proposed by [32] and [34]. Although quality evaluation is required for any systematic review [34], a quality assessment to filter primary studies is not deemed essential in our mapping study, since we structure our analysis to be as broad as feasible. Following the recommendations, we specify the research questions (RQs) being addressed, the search method, and the selection (i.e., inclusion and exclusion) procedure for primary studies in the following sections.

Research Questions
As noted by [34], RQs should reflect the objective of a secondary study. RQs also specify the issue to be investigated and guide the methodology [35]. Hence, the aim and scope of this study are formulated using the following RQs. The first three RQs are addressed in Sect. 5, while the remaining RQ is covered in Sect. 6.
(i) RQ 1: What is the research trend in machine learning and deep learning-based intrusion detection in ICSs?
(ii) RQ 2: What types of learning algorithms have been employed to deal with the problems of IDSs in industrial networks?
(iii) RQ 3: Which types of intrusion detection techniques are prevalently used in ICSs?
(iv) RQ 4: What is the relative performance of AI algorithms for ICS-based IDS?

Search Method
Although machine learning algorithms have been around for more than four decades, several issues remain underexplored, and interest in applying those algorithms to real-world problems has increased significantly. Several factors drive this growing attention to AI: (i) the price of computational resources is falling, (ii) powerful and efficient algorithms that can handle different forms of data have advanced, and (iii) a vast number of tools are available to facilitate the rapid development of AI-based applications. Accordingly, we take into account primary studies published from January 2013 to November 2020. We used an automatic search to find as many appropriate primary studies as possible in order to properly answer the RQs mentioned earlier. In particular, we searched two primary digital libraries, i.e., IEEE Digital Library and ACM Digital Library, to cover computer science-related journals and conferences. We also searched […] To get relevant results when searching such digital libraries, well-defined search terms are required. Thus, keywords were generated from our RQs and from keywords identified in previously published work. More precisely, different keyword combinations were tried using the Boolean operators AND and OR, resulting in the keyword combinations shown in Fig. 3.
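The way Boolean operators combine keywords into search strings can be sketched in a few lines of Python. The keyword groups below are hypothetical placeholders chosen for illustration; the terms actually used in this study are those reported in Fig. 3.

```python
from itertools import product

# Hypothetical keyword groups (assumptions, not the study's actual search terms).
TECHNIQUE = ["machine learning", "deep learning"]
TASK = ["intrusion detection", "anomaly detection"]
DOMAIN = ["industrial control system", "SCADA"]

def or_group(terms):
    """Join synonyms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join('"{}"'.format(t) for t in terms) + ")"

def build_query(*groups):
    """Require one term from every group by joining the OR-groups with AND."""
    return " AND ".join(or_group(g) for g in groups)

def expand(*groups):
    """Enumerate every single-term combination, one AND-query per combination."""
    return [" AND ".join('"{}"'.format(t) for t in combo) for combo in product(*groups)]

query = build_query(TECHNIQUE, TASK, DOMAIN)
combinations = expand(TECHNIQUE, TASK, DOMAIN)

print(query)
print(len(combinations))
```

A single compound query is convenient for libraries that support nested Boolean expressions, while the expanded list suits search interfaces that accept only flat AND-queries.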

Inclusion and Exclusion Criteria
In this section, we specify the inclusion and exclusion criteria used in this study. The retrieved papers were filtered according to the following criteria so that only applicable and relevant papers were included. The inclusion criteria are listed as follows.
1. INC 1: Only publications issued in scholarly outlets, i.e., journals, conferences, and workshop proceedings, are considered. These papers have usually undergone peer review.
2. INC 2: Papers that discuss machine learning and deep learning techniques for intrusion detection in industrial control systems were taken into consideration.
Besides, publications that meet at least one of the following criteria were omitted from our study.
1. EXC 1 : The study discusses the application of intrusion detection in ICSs, but machine learning and deep learning are not used. For instance, process mining [36], stateful analysis [37], active monitoring [38], hierarchical monitoring [39], and semantics-aware framework [

Mapping Study Result and Discussion
Guided by the aforementioned RQs, we outline and examine the selected studies along several dimensions.
Figure 4 shows the number of studies over the considered period, from 2013 to 2020. At least one study on the use of machine learning and deep learning algorithms for intrusion detection in ICS environments was published in every year of that period. The trend shows growing interest in applying machine learning and deep learning-based IDSs to industrial networks; in particular, since 2017 there has been a dramatic increase in interest in harnessing ML and DL algorithms for intrusion detection in ICSs.

Mapping Selected Studies w.r.t. Publication Venue
This section summarizes the selected studies (i.e., 74 publications) according to the outlets in which they appeared. The majority of the selected studies were disseminated in conference proceedings (42 papers), followed by journals (26 papers). Studies published as book sections and workshop papers account for five papers and one paper, respectively. Figure 5 shows a categorization of the selected studies w.r.t. the publication venue.
Table 3

Mapping Selected Studies w.r.t Dataset Considered
This section outlines the selected studies with respect to the datasets considered in their experiments. There is a growing need to validate a proposed detection model on multiple datasets in order to demonstrate its generalizability across different ICS environment settings. However, as indicated in Tables 6, 7, 8 and 9, in most cases researchers considered only a single dataset in their experiments. Therefore, the major weakness of the selected studies appears to be limited evidence of model generalizability. Table 4 lists the IDS datasets used in the current literature and how often each is used. It is worth mentioning that many of the datasets (used in 29 papers) are private, i.e., not publicly available; thus, the corresponding experiments are difficult to reproduce and compare. Several studies (e.g., [41][42][43][44][45][46][47]) even used datasets (e.g., NSL-KDD, KDD Cup 99, and DARPA 1998) that are not specific to ICS environments and are therefore inappropriate. Other prominent datasets for IDS in industrial control networks are the gas pipeline and power system datasets, which appear eighteen and eleven times in the literature, respectively.

Mapping Selected Studies w.r.t. Algorithms
A large number of ML algorithms exist, and they are commonly categorized into two learning approaches, i.e., supervised and unsupervised. A supervised learner learns from $n$ labeled training samples, which can be represented as follows.
$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$$
where $x_i \in X$ are m-dimensional input feature vectors ($m \in \mathbb{N}$) and $y_i \in Y$ are the corresponding output variables, e.g., target values. The labeled training data are used to fit a predictive model that assigns labels to new samples. Roughly speaking, the model learns the mapping function $X \rightarrow Y$ present in the training data [115]. In contrast, unsupervised learning deals with discovering the underlying relationships among the inputs, where the objective is to assign the inputs to different groups [116]. Clustering is an example of an unsupervised learning algorithm. However, some algorithms cannot be neatly categorized as either supervised or unsupervised. Such algorithms are regarded as semi-supervised learning, which handles learning tasks by employing both labeled and unlabeled data. According to the results of our mapping study, most intrusion detection approaches in ICSs are treated as supervised learning (see Table 5). Only eight and two studies addressed unsupervised and semi-supervised learning for intrusion detection in ICSs, respectively. In addition, there has been considerable interest in the use of deep neural network algorithms, e.g., recurrent neural network (RNN), convolutional neural network (CNN), and autoencoder.
Table 5 (excerpt)
[…] [75,77,112], LSTM and GRU [56], LSTM [111] | 6
Deep belief network [41,110] | 2
Deep neural network [42,102] | 2
Generative adversarial network [

Supervised Learning
[…] [98] to improve the effectiveness of detecting ICSs attacks. Anton et al. [57] compared SVM and RF for anomaly-based intrusion detection in an industrial network, in which RF slightly outperformed SVM in terms of accuracy. Besides, DT and Bayesian network classifiers were compared for anomaly-based intrusion detection in a SCADA network [51]. Terai et al. [87] used SVM to construct a discriminant model between normal and anomalous packets based on the ICS communication profile. Using the same ML algorithm, i.e., SVM, Li et al. [52] attempted to optimize SVM's learning parameters using a velocity adaptive shuffled frog leaping bat algorithm for ICS intrusion detection. Li and Qin [88] applied five different ML […] [89] for attack detection in cyber-physical systems (CPSs), which are usually controlled and monitored by ICSs. Francia [90] proposed test datasets generated on an ICS testbed and employed machine learning algorithms, i.e., AdaBoost, complex DT, KNN, SVM, and a linear discriminant model, to evaluate the generated test datasets. A one-class anomaly detection framework based on a neural network was studied in [53]. The proposed classifier was trained exclusively on normal ICS traffic data, yet it was able to detect abnormalities associated with advanced persistent threat (APT) attacks. Stefanidis and Voyiatzis [50] presented a new approach to intrusion detection in ICS environments using a hidden Markov model (HMM). The proposed method is well suited to real-time applications since it produces results on a per-packet basis. A decision tree classifier combined with session duration-based feature extraction for intrusion detection in a control system network was suggested by [85].
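To make the supervised setting described in this section concrete, i.e., a model that learns the mapping from feature vectors to labels using labeled pairs, the following minimal sketch fits a nearest-centroid classifier. The classifier is our illustrative choice rather than a method from the reviewed studies, and the two features and all data values are made up.

```python
from math import dist

def fit_centroids(X, y):
    """Learn one centroid per label from labeled pairs (x_i, y_i)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, x):
    """Map a new feature vector to the label of its closest centroid."""
    return min(centroids, key=lambda label: dist(centroids[label], x))

# Toy feature vectors (assumed): packets per second and mean payload length.
X_train = [[10.0, 60.0], [12.0, 64.0], [11.0, 58.0], [90.0, 200.0], [95.0, 210.0]]
y_train = ["normal", "normal", "normal", "attack", "attack"]

model = fit_centroids(X_train, y_train)
print(predict(model, [13.0, 61.0]))
print(predict(model, [88.0, 190.0]))
```

Any of the supervised algorithms surveyed here (SVM, RF, DT, and so on) plays the same role as `fit_centroids` and `predict`; only the form of the learned mapping differs.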
Detection of a particular attack, i.e., man-in-the-middle, in industrial control networks was discussed in [86], where a machine learning algorithm, i.e., KNN with Bregman divergence, was proposed to characterize normal behavior. Samdarshi et al. [45] discussed a number of ML algorithms, i.e., DT, RF, NB, and AdaBoost, for SCADA security. The proposed IDS technique was built on a three-layer detection system. By analyzing the telemetry data of an ICS network, Ponomarev and Atkison [82] classified network traffic using several ML algorithms, i.e., bagging, dagging, decision stump, LR, REPT, DT, NB, NB multinomial, and Ridor. A fuzzy logic-based decision tree for detecting anomalies in ICS networks was exploited in [83]. The proposed method combined DT and genetic programming. One-class classification, i.e., training with samples from only one class, for detecting intrusions in industrial systems was presented in [49]. Two different approaches were studied: support vector data description (SVDD) and kernel principal component analysis (KPCA).
Subsequently, different kinds of machine learning algorithms, i.e., KNN, SVM, LR, and DT, were employed to detect abnormal traffic in a DCS, with several effective features obtained using a dual window scheme [91]. In the same vein, Beaver et al. [48] benchmarked several ML techniques, i.e., NB, […] A combination of random subspace learning and K-nearest neighbor to defend against forged commands that target the industrial control process was studied in [100]. Zong et al. [43] adopted an SVM classifier for intrusion detection based on traffic research in industrial control systems. Furthermore, the imbalanced data problem in anomaly detection for IIoT was studied in [92]. The paper investigated the efficiency of artificial neural networks in detecting anomalies under different imbalance ratios. An evaluation of two machine learning algorithms, i.e., SVM and RF, for intrusion detection in SCADA systems was conducted in [54]. The experimental results revealed that RF detected intrusions effectively, with an F1 score above 99%. Unlike ordinary individual classifiers, classifier ensembles train multiple classifiers and combine them for prediction [117]. A classifier ensemble is generally considered more accurate than its individual classifiers. This motivated [68] to explore the suitability of classifier ensembles as a means of detecting power system cyber-attacks. The proposed detection model relied on several different ensemble schemes, i.e., adaptive boosting, bagging, majority voting, and RF. In addition, Vávra and Hromada [67] utilized majority voting to combine three ML algorithms, i.e., IB1, RF, and SVM, to evaluate the predictive model for intrusion detection in ICSs.
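The majority voting scheme mentioned above can be sketched as follows. The three base classifiers are hypothetical threshold rules standing in for trained models, and the feature names and thresholds are assumptions made for illustration; they are not the IB1, RF, or SVM models used in the cited work.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier labels for one sample; ties go to the label seen first."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

def ensemble_predict(classifiers, x):
    """Ask every base classifier for a label and return the most frequent one."""
    return majority_vote([clf(x) for clf in classifiers])

# Hypothetical base classifiers, each looking at a single assumed feature.
def by_rate(x):
    return "attack" if x["rate"] > 50 else "normal"

def by_length(x):
    return "attack" if x["length"] > 150 else "normal"

def by_port(x):
    return "attack" if x["port"] not in (502, 20000) else "normal"

classifiers = [by_rate, by_length, by_port]
suspicious = {"rate": 80, "length": 100, "port": 4444}
benign = {"rate": 10, "length": 200, "port": 502}

print(ensemble_predict(classifiers, suspicious))
print(ensemble_predict(classifiers, benign))
```

Each sample is misjudged by one base classifier here, yet the vote still returns the label that the other two agree on, which is the intuition behind combining diverse classifiers.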
As mentioned, deep learning algorithms have received tremendous interest in the intrusion detection field. Kravchik and Shabtai [75] studied a 1D convolutional neural network (CNN) for detecting cyber-attacks on ICSs; a variety of deep neural architectures, including different variants of convolutional and recurrent networks, were applied. Furthermore, a deep belief network (DBN)-based threat detection model for SCADA systems was investigated in [110]. The proposed model provided a mechanism that adapts to dynamic changes in new malware variants. Yang et al. [101] proposed deep learning-based intrusion detection for SCADA systems. The proposed method used a CNN to characterize salient temporal patterns of SCADA traffic and identify the time windows in which attacks occur. Furthermore, rather than proposing anomaly-based intrusion detection, Potluri and Diedrich [42] used a deep neural network (DNN) to identify the different types of attacks in an IDS.
Using a similar method, Liu et al. [93] proposed a two-level anomaly detection framework. In the first level, a CNN was used for feature extraction and anomaly identification, while a process state transfer algorithm was applied in the second level. Vávra and Hromada [112] introduced a genetic algorithm to optimize a recurrent neural network for industrial network anomaly detection. Two different recurrent neural network architectures, i.e., LSTM and gated recurrent unit (GRU), were proposed for intrusion detection on the gas pipeline dataset [56]. Similarly, Yang et al. [111] proposed stealthy attack detection in ICSs using multi-dimensional data fusion, with LSTM deployed to model the normal behavior of ICSs. The work in [41] evaluated the performance of a detection mechanism that combines DL and ML techniques. Two ML algorithms, i.e., softmax regression and SVM, and two deep learning algorithms, i.e., stacked autoencoder and DBN, were used in the benchmark. Upadhyay et al. [69] focused on selecting the most promising features using a gradient boosting feature selector. Süzen [102] found that DBN was a preferred method for detecting malicious attacks in network traffic. Hidden layers were updated using contrastive divergence, while the output layer was combined with a softmax classifier. Robles-Durazno et al. [103] used energy-based features and compared five traditional machine learning algorithms for real-time anomaly detection in a water supply system. Ramotsoela et al. [114] proposed a voting-based ensemble technique to enhance a behavior-based IDS in a water distribution system. Priyanga et al. [76] presented hypergraph-based anomaly detection with enhanced PCA and CNN. Phillips et al. [58] evaluated the viability of ML techniques in detecting new security threats specific to SCADA systems. Likewise, Onoda [44] compared supervised and unsupervised IDS methods and concluded that supervised methods could achieve the same performance as unsupervised ones given sufficient training samples.
Neha et al. [77] presented a sine-cosine optimization-based RNN to detect cyber-physical attacks against SCADA systems. MR et al. [78] proposed a multi-layer perceptron (MLP) model for anomaly detection in ICSs, in which a cumulative sum is integrated with the MLP to detect abnormal deviations in sensor values caused by attacks. Mozaffari et al. [70] presented a comparison of supervised ML methods for classifying power system behaviors and detecting future attacks. Liu et al. [59] proposed a bidirectional generative adversarial network for ICS intrusion detection. The proposed method showed better accuracy and shorter detection time than other baselines. Lan et al. [106] benchmarked several ML methods for classifying ICS network traffic data to detect man-in-the-middle attacks. Hassan et al. [71] improved the trustworthiness of an IIoT network through a scalable and reliable cyberattack detection model. Specifically, a random subspace ensemble model with a random tree classifier was employed to overcome the overfitting problem.
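The idea of applying a cumulative sum to the gap between predicted and observed sensor values can be illustrated with a generic two-sided CUSUM sketch. This is not the implementation of [78]: the constant `predicted` series is a placeholder for the output of a trained model, and the `drift` and `threshold` values, as well as the sensor readings, are assumed.

```python
def cusum_alarms(residuals, drift=0.5, threshold=4.0):
    """Two-sided CUSUM: return indices where the accumulated deviation exceeds the threshold."""
    pos = neg = 0.0
    alarms = []
    for i, r in enumerate(residuals):
        pos = max(0.0, pos + r - drift)   # accumulates sustained positive deviations
        neg = max(0.0, neg - r - drift)   # accumulates sustained negative deviations
        if pos > threshold or neg > threshold:
            alarms.append(i)
            pos = neg = 0.0               # restart accumulation after an alarm
    return alarms

# Made-up sensor readings with a sustained shift starting at index 4.
observed = [20.1, 19.9, 20.0, 20.2, 22.0, 22.1, 21.9, 22.2, 22.0]
predicted = [20.0] * len(observed)        # placeholder for model predictions
residuals = [o - p for o, p in zip(observed, predicted)]

alarms = cusum_alarms(residuals)
print(alarms)
```

Small fluctuations around the prediction are absorbed by the drift term, so an alarm is raised only after the deviation persists over several samples, which helps limit false alarms caused by sensor noise.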
Hallaji et al. [61] employed a combination of feature selection techniques, called multi-subspace feature selection, to perform intrusion detection in a smaller subspace, which improved both efficiency and accuracy. Haghnegahdar and Wang [72] applied a whale optimization algorithm to initialize and adjust the ANN's weight vector to achieve the minimum mean square error. The proposed model could address the challenges of attacks, failure prediction, and failure detection in a power system. Gumaei et al. [73] considered CFS-based feature selection to remove irrelevant features, while KNN was used to classify normal and cyberattack events. Gao et al. [47] proposed a stacking ensemble to fuse LSTM and a feedforward neural network. Combining the two through an ensemble approach further improved IDS performance, reaching an F1 score of 99.68% regardless of the temporal correlations of the data packets.
Egger et al. [108] benchmarked various ML techniques for addressing security concerns in the ICS domain. Specifically, both supervised and unsupervised learning methods were assessed for intrusion detection in substations, which use the asynchronous communication protocol International Electrotechnical Commission (IEC) 60870-5-104. Das et al. [81] designed a rule-based system to detect changes in the behavior of sensor measurements caused by an attack. The rules were extracted from historical sensor measurements and can categorize the condition of a plant. Choubineh et al. [63] considered cost-sensitive learning and Fisher's (i.e., linear) discriminant analysis (FDA) to overcome class imbalance issues in SCADA system datasets using five different ML algorithms.

Unsupervised Learning
A new approach to detect malicious activities in the ICSs network using a clustering technique was considered in [46]. In order to detect abnormal patterns, a simple K-means algorithm was employed. Schuster et al. [94] discussed two popular unsupervised learning methods, i.e., one-class SVM and isolation forest, to build a self-adaptive anomaly detector. On top of that, another variant of deep learning that works in unsupervised mode, e.g., autoencoder had been introduced in [41,64,79,95,[103][104][105]113]. Using autoencoder, the proposed model could detect replay attack's abnormal traffic by learning the interpacket arrival time. Moreover, a classical frequent itemset mining algorithm, e.g., FP-Growth, was taken into account in [45]. Another frequent itemset mining, e.g., Apriori for state-based IDS in an industrial network, was suggested by [84].
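A clustering-based detector of the kind described at the start of this section can be sketched with a plain K-means implementation. This is a generic illustration rather than the method of [46]: the two-dimensional traffic features, the deterministic initialization from the first `k` points, and the distance `radius` used to flag anomalies are all assumptions.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def is_anomalous(point, centroids, radius):
    """Flag a point that lies farther than `radius` from every learned centroid."""
    return min(dist(point, c) for c in centroids) > radius

# Made-up, unlabeled traffic features forming two groups of regular behavior.
traffic = [[1.0, 1.0], [9.0, 9.0], [1.2, 0.8], [0.9, 1.1], [9.1, 8.8], [8.9, 9.2]]
centroids = kmeans(traffic, k=2)

print(is_anomalous([5.0, 5.0], centroids, radius=1.5))
print(is_anomalous([1.1, 1.0], centroids, radius=1.5))
```

No labels are needed to fit the centroids; the detector simply treats anything far from all clusters of observed behavior as abnormal, so the choice of `k` and `radius` governs the trade-off between missed attacks and false alarms.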
Similarly, an autoencoder and a DBN were used in [41] for feature extraction in order to achieve the best intrusion detection performance in network control systems. An unsupervised anomaly-based IDS based on a clustering technique was proposed in [66]. The clustering approach was made up of four main processes, i.e., data preprocessing, cluster analysis, feature generation from clusters, and state classification using a fuzzy inference system. Furthermore, Mantere et al. [109] used self-organizing maps (SOMs) for anomaly detection in ICS networks. Hassan et al. [107] used a restricted Boltzmann machine to extract features from unlabeled data, while SVM and RF were used to detect the unlabeled attacks. Elnour et al. [80] combined isolation forest and CNN as a hybrid attack detection approach for ICSs. The proposed approach was applied to the SWaT testbed and showed an improvement over other works in terms of detection capability. Chaithanya et al. [74] proposed an outlier detection approach using a salp swarm optimization-based isolation forest. The proposed model was used to build an efficient SCADA intrusion detection system and was tested on the power system dataset.

Semi-supervised Learning
A study in [55] discussed semi-supervised learning to generate large-scale training datasets from a few labeled data samples using the K-means algorithm and one-class SVM. Almalawi et al. [65] proposed KNN and a fixed-width clustering technique for detecting cyber-attacks. The proposed techniques provide considerable accuracy compared to well-known anomaly detection techniques. Joshi et al. [60] used an autoencoder in a semi-supervised way to detect malicious behavior in a SCADA system used to control a gas pipeline. Demertzis et al. [62] developed and tested an anomaly detection algorithm called Gryphon. It is a semi-supervised unary anomaly detection system built on an evolving spiking neural network one-class classifier.

Mapping Selected Studies w.r.t. IDS Approaches
Following an IDS taxonomy presented in [21], we classify the primary studies based on three primary IDS detection techniques, i.e., anomaly, misuse, and hybrid-based approaches (see Fig. 6). The greatest number of selected studies have taken into account the anomaly-based approach (about 67.57%), while misuse and hybrid-based approach share about 17.57% and 14.86% of the total selected studies, respectively. Besides, we also categorize the primary studies based on the area of concern. Tables 6, 7, 8 and 9 summarize 74 studies that propose intrusion detection for ICSs based on machine learning and deep learning techniques. These tables also show for each study the following information: (i) machine learning and deep learning task, (ii) the considered datasets, (iii) the utilized performance metrics, and (iv) remarks for the further research problem.
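As a quick arithmetic check, the reported shares are consistent with 50, 13, and 11 of the 74 selected studies falling into the anomaly, misuse, and hybrid categories, respectively. These counts are inferred from the percentages and are an assumption, not figures stated in the text; the short sketch below only verifies that they reproduce the reported values.

```python
# Counts are inferred from the reported percentages (an assumption, not stated in the text).
TOTAL_STUDIES = 74
assumed_counts = {"anomaly": 50, "misuse": 13, "hybrid": 11}


def share_percent(count: int, total: int) -> float:
    """Share of `count` in `total`, as a percentage rounded to two decimals."""
    return round(100.0 * count / total, 2)


# The assumed counts must cover every selected study exactly once.
assert sum(assumed_counts.values()) == TOTAL_STUDIES

shares = {name: share_percent(n, TOTAL_STUDIES) for name, n in assumed_counts.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```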

Empirical Study
Empirical evaluation is the most often used technique for assessing the performance of algorithms. This research extends the scope of the previous article by giving an empirical benchmark for numerous machine learning and deep learning methods used for IDS in industrial control networks. This section compares the performance of the algorithms used to address RQ4.

Classification Methods
This benchmark includes five classification algorithms implemented in R, i.e., random forest (RF), gradient boosting machine (GBM), XGBoost, deep neural network (DNN), and a stacked generalization ensemble. The classifiers were chosen because they have received relatively little attention in the current literature. Note that the currently available works involving ensemble learning for IDS in ICS, such as [126] and [127], use individual XGBoost and majority voting approaches, respectively. Hence, to justify the contribution of this empirical study, a stacked generalization [128,129] technique is proposed, since it has not previously been taken into account in the literature (Table 10).
The stacking combines several base learners, i.e., RF, GBM, XGBoost, and DNN, hence enhancing the diversity of the ensemble. In addition, a GBM is employed as a meta-classifier to produce the final prediction. The procedure used to construct the stacked generalization ensemble in the experiment is as follows: (i) each base classifier B is trained and validated using ten-fold cross-validation on the training set, and its prediction results R are collected; (ii) the prediction results of all base classifiers are combined into a new matrix G (the level-1 data), and the meta-classifier is trained on G in conjunction with the response vector; and (iii) to obtain the final prediction, the base classifiers and the meta-classifier are applied to the testing set. Algorithm 1 describes the complete process of constructing the stacked generalization ensemble.
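The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' H2O implementation: the base learners are hypothetical one-feature threshold classifiers standing in for RF, GBM, XGBoost, and DNN, the meta-classifier is an accuracy-weighted vote standing in for the GBM, and the base learners are refit on the full training set before prediction, which is a common convention that the text does not spell out. All function names are illustrative.

```python
from typing import Callable, List, Sequence

Row = Sequence[float]
Model = Callable[[Row], int]
Trainer = Callable[[List[Row], List[int]], Model]


def make_stump_trainer(feature: int) -> Trainer:
    """Hypothetical weak base learner: threshold one feature midway between class means."""
    def train(X: List[Row], y: List[int]) -> Model:
        pos = [row[feature] for row, label in zip(X, y) if label == 1]
        neg = [row[feature] for row, label in zip(X, y) if label == 0]
        if not pos or not neg:  # degenerate fold: fall back to a constant prediction
            constant = 1 if pos else 0
            return lambda row: constant
        mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        threshold = (mu_pos + mu_neg) / 2
        if mu_pos >= mu_neg:
            return lambda row: int(row[feature] > threshold)
        return lambda row: int(row[feature] < threshold)
    return train


def train_weighted_vote(G: List[Row], y: List[int]) -> Model:
    """Stand-in meta-classifier: vote weighted by each column's accuracy on the level-1 data."""
    n_cols = len(G[0])
    weights = [sum(int(g[j] == t) for g, t in zip(G, y)) / len(y) for j in range(n_cols)]
    total = sum(weights) or 1.0
    return lambda g: int(sum(w * v for w, v in zip(weights, g)) / total > 0.5)


def out_of_fold_predictions(trainer: Trainer, X: List[Row], y: List[int], k: int) -> List[int]:
    """Step (i): k-fold cross-validated predictions R of one base learner."""
    preds = [0] * len(X)
    for fold in range(k):
        train_idx = [i for i in range(len(X)) if i % k != fold]
        model = trainer([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in range(fold, len(X), k):  # held-out samples of this fold
            preds[i] = model(X[i])
    return preds


def fit_stack(base_trainers: List[Trainer], meta_trainer: Trainer,
              X: List[Row], y: List[int], k: int = 10) -> Model:
    columns = [out_of_fold_predictions(t, X, y, k) for t in base_trainers]
    G = [list(row) for row in zip(*columns)]   # step (ii): level-1 matrix G
    meta = meta_trainer(G, y)                  # step (ii): train the meta-classifier on G
    bases = [t(X, y) for t in base_trainers]   # refit base learners on all training data
    return lambda row: meta([b(row) for b in bases])  # step (iii): final prediction


# Toy data: features 0 and 2 are informative, feature 1 is noise.
y = [i % 2 for i in range(40)]
X = [[2.0 * t + 0.1 * (i % 3), float((i * 7) % 5), 1.0 - t + 0.05 * (i % 4)]
     for i, t in enumerate(y)]
stack = fit_stack([make_stump_trainer(j) for j in range(3)], train_weighted_vote, X, y)
print(stack([2.1, 3.0, 0.1]), stack([0.1, 3.0, 1.0]))
```

Training the meta-classifier on out-of-fold predictions, rather than on predictions for samples the base learners have already seen, is what keeps the level-1 data from leaking training labels.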
The experiment makes use of a machine learning framework named H2O [130] that offers an interface in R. All parameters were determined using the random search [131] command. The base classifiers used in this work, together with their optimum hyperparameters, are briefly described below.
(a) Random forest (RF) [132]. It has been intensively employed owing to its ability to reduce overfitting while improving classification accuracy. It grows many classification trees in the forest. Each tree provides a vote for the class, and the forest's final prediction is the class with the most votes. The forest error rate depends on the correlation between any two trees in the forest and on the strength of each individual tree. A total of 500 trees are used to build the forest, while the other learning parameters are set as follows: maximum depth = 2, nbins = 1024, nbins cats = 64, sample rate = 0.56, col sample rate change per level = 1.04, and col sample rate per tree = 0.62.
(b) Gradient boosting machine (GBM) [133]. Boosting rests on the idea that a weak classification algorithm can be converted into a strong classifier. GBM involves three elements: a loss function to be optimized, a weak classifier to make predictions, and an additive model that adds weak classifiers, via a gradient descent procedure, to minimize the loss function. Decision trees are used as the weak classifier in gradient boosting. In the experiment, we employed 500 decision trees, maximum depth = 19, minimum rows = 2, nbins = 1024, nbins cats = 64, learn rate = 0.05, col sample rate change per level = 1.1, learn rate annealing = 0.99, col sample rate = 0.80, and col sample rate per tree = 0.80.
(c) Extreme gradient boosting machine (XGB) [134]. It has been dominating applied ML benchmarks for tabular data and is an implementation of gradient boosted decision trees that focuses on computational speed and model performance. XGBoost follows the same principle as GBM; however, it uses a more regularized model to control overfitting. The optimal parameters are set as follows: number of trees = 500, maximum depth = 8, min rows = 5, learn rate = 0.05, sample rate = 0.42, col sample rate = 0.80, and col sample rate per tree = 0.39. A faster implementation of XGBoost using GPU-based computation is also enabled.
(d) Deep neural network (DNN) [14]. It is based on a multilayer feedforward neural network trained with stochastic gradient descent using back-propagation. Among DNN models, feedforward artificial neural networks (ANNs), or multilayer perceptrons, are the most prevalent and the only ones supported natively in H2O. The number of hidden layers is set to 3, with 258, 516, and 258 neurons in the first, second, and third hidden layers, respectively.

Summary of selected studies (rows from Tables 6-9):
Study | IDS approach | Task | Dataset | Performance metrics | Remarks
[43] | Anomaly | Classification | NSL-KDD [121] | Accuracy, detection rate, and false alarm rate | More types of attack features need to be addressed
Zolanvari et al. [92] | Anomaly | Classification on imbalanced dataset | Private | Accuracy, false alarm rate, undetected rate, sensitivity, and Matthews correlation coefficient | Only one classifier was used
Perez et al. [54] | Hybrid | Classification | Gas pipeline [48,120] | Accuracy, precision, recall, and F1 | Used only a limited number of classifiers
Chen et al. [68] | Anomaly | Classification | Power system [119] | Accuracy, precision, recall, and F1 | Testing on wider classification schemes is necessary
Kravchik and Shabtai [75] | Anomaly | Classification | SWaT [122] | F1 and AUC | Timeliness of the attack detection needs further investigation
Huda et al. [110] | Anomaly | Classification | Vx Heaven [123] | Accuracy, false positive rate, and false negative rate | Lacked GPU and parallel computation
Liu et al. [93] | Anomaly | Classification | Private | Accuracy, precision, recall, and F1 | Features extracted by CNN were less interpretable
Yang et al. [111] | Anomaly | Classification | GPNS | AUC | Proposed classifier was validated on a single dataset
Schuster et al. [94] | Anomaly | Cluster analysis and classification | Private | Precision, recall, and F1 | Some attacks were not addressed
Hong et al. [95] | Anomaly | Cluster analysis | Private | Not mentioned | Utilized small attack samples
Yang and Zhou [55] | Anomaly | Training data generation using few samples and classification | Gas pipeline and water storage [48,118,120] | Accuracy, detection rate, and false positive rate | Hybrid kernel function needs to be addressed further
Teixeira et al. [96] | Anomaly | Classification | Private | Accuracy and false positive rate | Generating more attacks is required
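For reference, the hyperparameters reported for the four base classifiers in items (a) to (d) can be collected in a single configuration mapping. The values below are copied from the text; the key names are the text's parameter spellings written with underscores, in the style of H2O estimator arguments, and are an assumption rather than a copy of the authors' scripts.

```python
# Hyperparameters as reported in the text; key names are assumed (H2O-style spelling).
BASE_LEARNER_PARAMS = {
    "RF": {
        "ntrees": 500, "max_depth": 2, "nbins": 1024, "nbins_cats": 64,
        "sample_rate": 0.56, "col_sample_rate_change_per_level": 1.04,
        "col_sample_rate_per_tree": 0.62,
    },
    "GBM": {
        "ntrees": 500, "max_depth": 19, "min_rows": 2, "nbins": 1024,
        "nbins_cats": 64, "learn_rate": 0.05,
        "col_sample_rate_change_per_level": 1.1, "learn_rate_annealing": 0.99,
        "col_sample_rate": 0.80, "col_sample_rate_per_tree": 0.80,
    },
    "XGBoost": {
        "ntrees": 500, "max_depth": 8, "min_rows": 5, "learn_rate": 0.05,
        "sample_rate": 0.42, "col_sample_rate": 0.80,
        "col_sample_rate_per_tree": 0.39,
    },
    # Three hidden layers with 258, 516, and 258 neurons.
    "DNN": {"hidden": [258, 516, 258]},
}

for name, params in BASE_LEARNER_PARAMS.items():
    print(name, params)
```

Keeping the settings in one mapping makes it easy to confirm that every tree-based learner uses the same number of trees and to pass each dictionary to the corresponding estimator as keyword arguments.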

Materials
This section discusses the datasets that are prevalently considered for ICS and IIoT cyber-attack detection, which we briefly outline below. We excluded several datasets, including Gas Pipeline, Water Storage Tank [135], and New Gas Pipeline [120], owing to known flaws and criticisms, such as machine learning misclassification errors, the ease with which machine learning algorithms can achieve 100 percent accuracy on them, and missing values in the data. The characteristics of each dataset are summarized in Table 12, which also reports the imbalance ratio, since the majority of the datasets are imbalanced. The imbalance ratio is defined as the ratio of the number of samples in the minority class to the number of samples in the majority class. In other words, a higher ratio means a less skewed dataset.
(a) Power systems [136]. The power system datasets¹ comprise fifteen sets, namely P1, P2, ..., P15, each with 128 input features and one target feature. Each dataset includes measurements related to the normal, disturbance, control, and cyber-attack behaviors of the electric transmission system. One hundred sixteen features were obtained from 29 types of measurements from each phasor measurement unit (PMU), while 12 features were obtained from control panel logs, Snort alerts, and relay logs. There are 37 power system event scenarios in total, which can be classified as natural events (8), no events (1), and attack events (28) (see Table 11). The target feature consists of a binary marker that indicates attack or natural traffic.
(b) WUSTL-IIoT-2018 [137]. The dataset² was developed using the SCADA system testbed presented in [137]. Several attacks were performed against the testbed, such as port scanner, address scan attack, device identification attack, device identification attack in aggressive mode, and exploit attack. After data pre-processing, the final dataset consists of six input features for the machine learning algorithms, namely source port, total packets, total bytes, source packets, destination packets, and source bytes. The number of samples is 7,037,983, where each row is labeled as 0 or 1, denoting natural traffic or attack traffic, respectively.
(c) UNSW-IoT-Botnet-2018 [138]. The dataset³ was captured from pcap files 69.3 GB in size and containing more than 72 million records. It was created by designing a realistic network environment incorporating normal and botnet traffic. The dataset comprises multiple attacks such as DDoS, DoS, service scan, keylogging, and data exfiltration attacks. In this study, we used the compact version of the dataset, which contains only 5% of the original samples. The extracted 5% includes about three million records with 16 input features.
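The imbalance ratio reported alongside the dataset characteristics can be computed directly from the label column. The sketch below assumes the ratio is taken as the minority-class count divided by the majority-class count, which is the orientation under which a higher ratio means a less skewed dataset and values below 1 arise; the function name and the toy labels are illustrative.

```python
from collections import Counter
from typing import Hashable, Iterable


def imbalance_ratio(labels: Iterable[Hashable]) -> float:
    """Minority-class count divided by majority-class count (assumed orientation).

    Returns a value in (0, 1]; 1.0 means perfectly balanced classes.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return min(counts.values()) / max(counts.values())


# Toy example: 90 natural samples (label 0) and 10 attack samples (label 1).
toy_labels = [0] * 90 + [1] * 10
print(round(imbalance_ratio(toy_labels), 3))
```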

Evaluation Result and Discussion
The experiment is run on a machine with an Intel Xeon Gold 6240 2.6 GHz processor, 32 GB of RAM, and six NVIDIA Tesla V100 Volta GPUs. We use a non-resampling validation technique (i.e., hold-out), where the ratio between training and testing samples is 70:30. The models' predictive performance is estimated using accuracy, F1, area under the ROC curve (AUC), and area under the precision-recall curve (AUCPR), which are better suited to binary classification involving the class imbalance problem [139]. For a binary classification problem, the above-mentioned performance metrics are formally defined as follows.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$
where the TP, TN, FP, and FN values can be obtained from the confusion matrix shown in Fig. 7. TN is not considered in AUCPR because, when the data are skewed, a high number of TNs often outweighs the impact of changes in other quantities, such as FPs. Therefore, AUCPR is much more sensitive to TPs, FPs, and FNs than AUC [140]. For the calculation of AUCPR, the interpolation between two points m and n in the AUCPR space is specified as a function of x, where x is any real value between TP_m and TP_n.

¹ https://bit.ly/38TGsbB
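The confusion-matrix metrics can be illustrated with a short sketch. Accuracy follows Eq. (2); precision, recall, and F1 use their standard definitions, which are assumed here to match the paper's. The interpolation function is likewise an assumption: it is the form commonly used in the precision-recall literature, in which FP is taken to grow linearly with TP between the two points, and it may differ in detail from the authors' expression. All function names are illustrative.

```python
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def interpolated_precision(x: float, tp_m: float, fp_m: float,
                           tp_n: float, fp_n: float) -> float:
    """Precision at x true positives between points m and n (requires tp_m != tp_n).

    Assumed form: FP increases linearly from fp_m to fp_n as TP goes from tp_m to tp_n.
    """
    slope = (fp_n - fp_m) / (tp_n - tp_m)
    return x / (x + fp_m + slope * (x - tp_m))


# Toy confusion matrix: 18 attacks (8 detected) and 82 natural samples (2 false alarms).
scores = binary_metrics(tp=8, tn=80, fp=2, fn=10)
print({name: round(value, 4) for name, value in scores.items()})

# Halfway between m = (TP 5, FP 5) and n = (TP 10, FP 30).
print(interpolated_precision(7.5, tp_m=5, fp_m=5, tp_n=10, fp_n=30))
```

Note that the interpolated precision is not a straight line between the two endpoint precisions, which is why linear interpolation in precision-recall space tends to overestimate the area.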
Fig. 6 Dissemination of chosen studies according to three IDS technique categories, namely anomaly (i.e., binary class), misuse (i.e., multi class), and hybrid

Summary of selected studies (rows from Tables 6-9):
Study | IDS approach | Task | Dataset | Performance metrics | Remarks
[112] | Anomaly | Optimization via genetic algorithm | SCADA network [124] | Not mentioned | Detection result was not clearly presented
Sokolov et al. [56] | Anomaly | Classification | Gas pipeline [120] | Accuracy, precision, and recall | Small traffic samples were used
Qassim et al. [46] | Anomaly | Cluster analysis | DARPA 1998 | Not mentioned | Focused on a particular attack
Anton et al. [57] | Anomaly | Classification on imbalanced datasets | Gas pipeline [120] | Accuracy, precision, recall, and F1 | Validation on some types of attacks is necessary

We first show the performance scores of all benchmarked algorithms. Fig. 8 compares the distribution of performance across multiple performance measures. The stacking ensemble outperforms the other techniques in all median scores except AUCPR. Additionally, there is a greater degree of fluctuation (i.e., dispersion) in the performance scores of DNN, which exhibit a positive skew. This indicates that DNN is less stable than any other benchmarked algorithm. In comparison, the performance of the other algorithms, i.e., XGBoost, RF, GBM, and Stacking, exhibits less dispersion, indicating that they perform consistently across datasets. Next, using the average performance score, hierarchical clustering was conducted on classifiers and datasets in order to better understand their relationships (see Fig. 9). The clustering was performed using the Euclidean distance and Ward's clustering criterion. This experiment identified two and three major clusters for classifiers and datasets, respectively. The clusters of classifiers are particularly robust, as the top and the worst-performing classifiers were grouped separately. Furthermore, the three clusters of datasets highlight the main peculiarities between datasets.
For instance, one cluster consists of datasets with an extremely low imbalance ratio, such as WUSTL SCADA and UNSW-IoT-Botnet, while another cluster contains datasets with relatively higher imbalance ratios (> 0.4), such as P6, P15, P12, P3, and P8. Statistical tests are used to evaluate the performance results in accordance with the recommendation in [141]. For statistical significance, a Friedman test [142] was utilized, followed by the Nemenyi posthoc test [143] to locate the statistically significant differences between classifiers. Statistical analysis results are typically presented as a critical difference plot [141]. The diagram depicts the average ranks of the classifiers and connects those whose average ranks differ by less than the critical difference. The critical difference is determined by the significance level (0.05 in our case). In the first evaluation scenario, the Friedman omnibus test indicates that there is a highly significant performance difference (p < 0.001) between at least two algorithms across all performance metrics. We then apply the Nemenyi posthoc test and visualize the critical difference plot in Fig. 10. Except for the AUC score, Stacking is clearly the top performer, outperforming the individual algorithms, i.e., GBM, RF, XGBoost, and DNN, across the board. In contrast, DNN has consistently performed poorly across all performance criteria.
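The quantities behind a critical difference plot can be sketched as follows. The score table is hypothetical toy data, not the paper's results; the Friedman statistic is the usual chi-square form without tie correction, and the Nemenyi critical values at a significance level of 0.05 are the commonly tabulated ones for two to six classifiers. Computing the p-value would additionally require a chi-square distribution function, which is omitted here.

```python
import math
from typing import List

# Commonly tabulated Nemenyi critical values q_alpha at alpha = 0.05, keyed by k classifiers.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}


def rank_row(scores: List[float]) -> List[float]:
    """Rank one dataset's scores (1 = best, higher score is better), averaging ties."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based ranks
        for pos in range(i, j + 1):
            ranks[order[pos]] = average
        i = j + 1
    return ranks


def average_ranks(score_table: List[List[float]]) -> List[float]:
    """Average rank of each classifier (columns) over all datasets (rows)."""
    rows = [rank_row(row) for row in score_table]
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]


def friedman_chi2(avg_ranks: List[float], n_datasets: int) -> float:
    """Friedman chi-square statistic (no tie correction)."""
    k = len(avg_ranks)
    return 12 * n_datasets / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_cd(k: int, n_datasets: int, q_alpha: float) -> float:
    """Two classifiers differ significantly if their average ranks differ by more than this."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


# Hypothetical scores: 4 datasets (rows) x 3 classifiers (columns).
toy_scores = [[0.95, 0.90, 0.80], [0.96, 0.91, 0.85],
              [0.97, 0.93, 0.70], [0.94, 0.92, 0.75]]
ranks = average_ranks(toy_scores)
chi2 = friedman_chi2(ranks, len(toy_scores))
cd = nemenyi_cd(len(ranks), len(toy_scores), Q_ALPHA_005[len(ranks)])
print(ranks, chi2, round(cd, 4))
```

In this toy example the first and third classifiers differ by two rank units, which exceeds the critical difference, while adjacent classifiers differ by only one and would be connected in the plot.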
As an important part of our study, we report the computational cost of the benchmarked classifiers, particularly the time necessary for the training and testing tasks (see Tables 13 and 14). On average, XGBoost requires a shorter training time than the other base learners, i.e., RF, GBM, and DNN, even though all base learners are trained using 10-fold cross-validation. Stacking needs substantially less training effort than the other methods, as it merely involves basic matrix manipulation (i.e., collecting the prediction values from the base classifiers). Furthermore, regardless of the size of the testing set, XGBoost obtained the quickest detection time, with an average of 0.40 seconds.

Conclusion and Further Research Directions
The paper presented a systematic mapping study that paid particular attention to reviewing the literature on machine learning and deep learning algorithms for intrusion detection in ICS environments. We posed the following RQs and answered them.
(i) RQ1: What is the research trend in machine learning and deep learning-based intrusion detection in ICSs? The research trend that we could observe is the use of various deep learning-based models in both supervised and unsupervised learning tasks. Our results suggest that there has been a steep rise in applying ML and DL techniques to IDS in industrial networks from 2017 onward.
(ii) RQ2: What types of learning algorithms have been employed to deal with the problems of IDSs in industrial networks? The vast majority of the algorithms presented in this study are supervised learning algorithms. Several classification techniques, such as SVM, RF, and KNN, are the most frequently utilized classifiers.
(iii) RQ3: Which types of intrusion detection techniques are prevalently used in ICSs? According to our mapping study, the anomaly-based detection technique is the most commonly considered, accounting for two-thirds of the total selected studies.
(iv) RQ4: What is the relative performance of AI algorithms for ICS-based IDS? This study compares the relative performance of the stacked generalization ensemble and several individual classifiers, i.e., RF, GBM, XGBoost, and DNN. On a binary classification task, it is demonstrated that the stacked generalization ensemble outperforms the individual classifiers significantly.

Summary of selected studies (rows from Tables 6-9):
Study | IDS approach | Task | Dataset | Performance metrics | Remarks
Elnour et al. [80] | Anomaly | Classification | SWaT | Accuracy, FPR | Higher FPR rate.
Egger et al. [108] | Hybrid | Classification | Private | AUC | More difficult attack vectors and more advanced ML algorithms will be further explored.
Demertzis et al. [62] | Anomaly | Clustering, classification | Gas pipeline, water storage, power system | Accuracy, precision, recall, F1, AUC | Online learning method using data stream will be further investigated.
Das et al. [81] | Anomaly | Classification | SWaT | Precision, recall, F1 | Detection on resource constrained device will be further explored.
Choubineh et al. [63] | Anomaly | Classification | Gas pipeline | Accuracy, FPR, TPR | Multi-attacks detection is not discussed.
Chaithanya et al. [74] | Anomaly | Classification | Power system | Accuracy, DR | The generalizability of the proposed model.
Al-Abassi et al. [64] | Anomaly | Feature representations and classification | Gas pipeline, SWaT | Accuracy, precision, recall, F1 | Identifying different attack types and their location will be further explored.
Several potential extensions to the work presented here are as follows. First, according to our findings in Tables 6, 7, 8, 9 and 10, there is still a significant research gap in the use of AI algorithms in unsupervised and semi-supervised learning modes. More precisely, one deep learning technique, i.e., the autoencoder, remains mostly unexplored, since just a few studies have utilized it thus far. Recently, there has been tremendous progress in the application of deep learning models to tabular data [144,145]. Therefore, further study is probably required in this area, particularly to determine whether deep learning models perform statistically better on tabular data. Second, as Zolanvari et al. [97] and Upadhyay et al. [127] pointed out, some features might degrade the accuracy of a machine learning algorithm; hence, taking the importance of the features into account is critical. The features are ranked based on how salient they are in contributing to the final prediction. Feature importance indicates how useful or valuable each feature was in the construction of the classification model. Lastly, there is a limited number of benchmark datasets available for comparing the algorithms' performance. Hence, it is necessary to have well-studied real-world or artificially generated ICS-based IDS datasets so that the performance comparison between algorithms can be conducted fairly.