1 Introduction

Cyber-physical systems (CPS) are a new category of embedded systems that interact with the physical world and adapt using a combination of computation, communication, and control theory. The CPS market is expected to reach $9563 million by 2025 [1]. CPS impact our daily lives through their use in critical infrastructures, including healthcare [2], military systems [3], autonomous and unmanned vehicles [4], etc. However, their networked nature and operation in hostile environments make CPS a target of malware attacks [5,6,7], such as ransomware [8,9,10], and physical tampering. Ransomware [8], a type of malware that threatens victims’ data or blocks access to it, was first reported in 1989, when it infected 2000 disk drives belonging to participants of the World Health Organization’s AIDS conference [11]. According to FireEye’s report [1, 12], ransomware can result in catastrophic consequences for the economy, the environment, and even human lives. Other kinds of malware attacks, such as return-oriented programming (ROP) attacks [13, 14] and code injection attacks [15,16,17], can also cause severe damage to life and property.

To secure CPS, different kinds of methods have been proposed. One commonly used technique is signature-based detection [18], where a unique identifier is established from static information, such as the file name and a code value, of a known threat. However, signature-based detection requires a new signature for every new malware sample and thus cannot deal with unknown threats [19]. Unlike signature-based detection, anomaly-based detection compares observed events with known, benign events to identify significant deviations [19]. Anomaly-based detection can be divided into two types: internal [20, 21] and external [22,23,24]. Internal approaches rely on software and/or built-in hardware (e.g., performance counters) to detect abnormal activities and events occurring on the target device. These methods [20, 21] can achieve high recognition rates for various attacks but consume substantial memory and/or CPU resources on the target device. Moreover, they cannot be applied to legacy devices that lack the required hardware. In contrast, external methods collect and examine side-channel traces (e.g., power, EM, timing, acoustics, etc.) of the target device using external equipment.

Considering granularity, the external methods can be further divided into two sub-types: coarse-grained and fine-grained [25]. The coarse-grained techniques can be applied to malware detection with repetitive features [26]. Some measure the loop time or perform spectrum analysis of the side-channel signals and compare them with benign cases [26,27,28]. The advantage of these methods is that they neither consume resources on the target device nor modify its original circuits. However, the coarse-grained methods have difficulty detecting small, malware-induced changes in code, suffer from high latency, and lack real-time monitoring capability [29].

The fine-grained methods capture measurements of the processor’s power consumption and compare them against stored signatures from benign software [24, 29]. In [29], the authors train a system model on EM emanations from an uncompromised device using a neural network and detect malware behaviors, such as DDoS and ransomware, on an Altera Nios-II soft processor. In [24], the authors adopt a power fingerprinting (PFP) monitor to detect unauthorized code in an RF transceiver using a trained LDA classifier. The target time window of these two methods spans dozens of instructions or even entire traces. Thus, even though they achieve a high malware detection rate, they cannot characterize power/EM traces at instruction-level granularity. Unlike these two methods [24, 29], instruction-level [30,31,32] or disassembly-based methods are much more accurate. Typically, power or EM traces are monitored with a trained classifier for each instruction in order to recover the instructions/program running inside the target device. Park et al. [32] propose a disassembler based on the power side channel that utilizes machine learning algorithms, Kullback-Leibler (KL) divergence, and principal component analysis (PCA). Experimental results demonstrate that the trained disassembler can recognize test AVR instructions, including register names, with 99% accuracy across various machine learning methods. However, these disassembly methods still rely on large and expensive equipment, such as oscilloscopes, to collect power traces and have never been implemented outside the lab.

In summary, the limitations of the existing methods are apparent. The signature-based method cannot cover new kinds of malware attacks. The internal anomaly detection methods require on-chip resources and do not apply to legacy devices. The external techniques rely on traditional side-channel systems which are impractical to deploy into the field. To resolve these limitations, we propose a new instruction-level, fine-grained verification methodology and successfully insert the verification classifier of the proposed methodology into a custom-designed, low-cost external monitoring platform called RASC (short for ‘remote access to side channels’). RASC miniaturizes the traditional side-channel system into a small PCB, has on-board processing capabilities for classification, and can remotely communicate with security administrators—either to perform additional analysis (e.g., verification) off-board or to transmit alerts of anomalous behavior occurring on the IoT/CPS device being monitored.

In this paper, our main contributions are summarized as follows:

  • First, we consider a realistic threat model where the authentic code running on the IoT/CPS system under monitor is known. We collect measurements from RASC and successfully train binary classifiers for each instruction. The effectiveness of RASC’s measurements and the classifiers is shown by analyzing the area under the curve (AUC) of the receiver operating characteristic (ROC) for all 92 AVR instructions.

  • Second, we develop a methodology that uses the aforementioned classifiers, including their confidence values, to classify an entire program as good, suspect (possibly malicious), or malicious. It can also localize the malicious changes by determining which instructions have been modified. Our approach is evaluated on six real benchmarks with and without malware (ROP- and code injection-based). Using RASC’s collected power traces and Matlab for program verification, the results show that the proposed methodology achieves at least a 90% successful detection rate for all six benchmarks, both malware-free and malware-infected. Further, the accuracy improves with additional measurements, e.g., of multiple loops in a code.

  • Third, we successfully implement SVM-based verification classifiers on RASC for real-time processing. This means that rather than transmitting power traces to administrators, RASC can process the traces on-board and send results to the admins. The classification results from RASC, while worse than Matlab’s, are promising. To our knowledge, this is the only case in the literature where external, fine-grained, side channel-based program verification has been performed in real time.

The rest of the paper is organized as follows. Section 2 discusses the overview of RASC and our threat model under consideration. Section 3 explains the details of the proposed methodology, including the generation of instruction-level binary classifiers, the description of the program verification methodology, and how to implement the classifiers in RASC’s FPGA. In Sect. 4, we examine the accuracy of our instruction-level classifiers and perform program verification (i.e., malware detection) on six real benchmarks, both on-RASC (real-time) and on a PC (offline). Section 5 discusses the related work corresponding to our methodology. The last section concludes the paper and discusses directions for future work.

2 Overview of RASC and threat model

2.1 RASC platform

Fig. 1

Depiction of RASC being deployed in the field. RASC is a small PCB that contains chips for signal processing, computation (FPGA), communication (Bluetooth), and memory (Flash). RASC measures the IoT/CPS system power traces to detect anomalies related to malware, hardware Trojans, and other types of tampering

Fig. 2

RASC schematic. The power module on RASC is powered by the Vdd pin of the target device. The input power traces are digitized by the power ADC circuit on the board. The external JTAG is connected to program RASC’s FPGA. An external UART module is used to transmit power traces in the training phase. In the testing phase, RASC uses a Bluetooth module to transmit pass/fail/suspect outcomes from the instruction-level classifiers. Besides these modules, 8 I/O ports remain on RASC for other potential functionalities

Figure 1 depicts the deployment scenario of RASC, while RASC’s detailed schematic is shown in Fig. 2. RASC contains an analog-to-digital converter (ADC) for digitizing power traces, a Bluetooth module for remote communication, and a Xilinx Spartan-3E (XC3S500E) [33] FPGA for data processing. The ADC is a TI ADC08200 [34], whose maximum sampling speed is 220 MS/s. The SESUB-PAN 14580 [33] is selected as the Bluetooth module since it has a tiny body (\(\hbox {8}\times \hbox {8 mm}\)) and a long communication range (20 m). The key reasons for choosing the Spartan-3E are its small footprint and large number of I/O ports.

As depicted in Fig. 1, RASC is attached to or arranged near the target device and connects to its power supply not only to power RASC but also to collect power traces of the target device for verification. Alternatively, an external battery can be used to power RASC. Its power side channel-based verification results can be transmitted via Bluetooth to a router or base station, where they can be forwarded over longer distances to security admins (see Fig. 3).

Fig. 3

RASC functional diagram. RASC collects power traces, samples and converts them to digital values, processes the traces on its FPGA, performs instruction and program verification, and transmits results to a security administrator using Bluetooth

Fig. 4

Dimensions comparison between RASC and a US quarter

Table 1 compares RASC, a medium-end oscilloscope, and the ChipWhisperer [35], a popular educational board for studying side-channel attacks. The areas where RASC shines are cost, size, and communication capabilities. A traditional side-channel analysis system is built around an oscilloscope. The oscilloscope (Tektronix MDO 3102) we use in our lab can sample data at over 5 GS/s but costs over $16,000. In contrast, the total cost of RASC’s power analysis board is around $300 (produced at low volume), which is comparable to the ChipWhisperer-Lite edition (produced at high volume). RASC’s tiny body is compared to a US quarter in Fig. 4. Measuring only 20 \(\times \) 20 mm, it is much smaller than a typical oscilloscope. RASC’s tiny footprint allows it to be easily situated close to the IoT/CPS system under monitor, such as within its enclosure or chassis. RASC supports Bluetooth communication over 20 meters. The combination of Bluetooth communication and its tiny size allows RASC to work in narrow spaces as a remote monitor; neither the ChipWhisperer nor the oscilloscope offers such features.

Table 1 Comparison between traditional side-channel analysis systems and RASC

RASC is weaker than the others in terms of test voltage range, sampling speed, and resolution. The voltage measurement range of RASC is 0 to 3.3 V. For the power ADC, this range is acceptable since it covers the power supply of most commercial FPGA/MCU development boards. Compared with RASC and the ChipWhisperer, the oscilloscope has a much more extensive testing range. The ADC on the RASC board is the ADC08200 (8 bit), giving it a resolution of 1.2 mV; the oscilloscope and the ChipWhisperer thus have a clear advantage in detecting faint differences in traces. In terms of sampling speed, the oscilloscope also has the greatest advantage: the Tektronix MDO 3054 can reach a maximum of 5 GS/s, whereas the sampling speeds of RASC and the ChipWhisperer are lower. In the malware detection area, the sampling speed is an important parameter; its optimal value depends on the target device and the desired accuracy of the detection method.

In conclusion, RASC offers a low price, a tiny body, and remote communication. At the same time, it is limited in sampling speed, amplitude resolution, and voltage testing range, some of which will be improved in future versions of RASC. Despite its lower sampling speed and resolution, RASC performs quite well in our later verification experiments. In other words, RASC can serve as a substitute for an oscilloscope in practical applications, especially when in-field, remote, and real-time monitoring is needed.

2.2 Threat model and verification assumptions

RASC is an external monitor that can be attached to a high-assurance IoT/CPS system in order to detect abnormal behavior related to malware, hardware Trojans, and physical tampering of the IoT/CPS system using power side-channels. The following summarizes our threat model and assumptions related to verification of the programs running on the IoT/CPS system, its side channels, and assurance of RASC.

  1.

    Defender’s (RASC’s) Prior Information We assume that the target device’s benign (known good or authentic) source code is available to RASC. Thus, it can be used to train RASC’s classification algorithms and verify the instruction-level behavior of the target device in the field. This could allow RASC to identify even the smallest changes in a target device’s code, such as a change to a single instruction, provided its instruction classification algorithms are robust. Note, however, that RASC need not know anything about the actual malware/Trojan inserted or its impact on side channels to achieve this.

  2.

    Adversary’s Capabilities and Prior Information We assume that the adversary has prior knowledge of the hardware and software of the target device, which can be exploited to inject malware or tamper with it. Further, the adversary has access to the target device at some point, either remotely, physically in a hostile environment, during fabrication, or during system integration (e.g., the alleged “Big Hack” of Supermicro server motherboards [36]).

  3.

    Security Assurance of RASC We assume that the hardware and software of RASC are trusted and assured. In contrast to the hardware of a potential IoT/CPS system, RASC is very cheap to manufacture and can be produced onshore by a trusted party. Further, RASC consists of basic commercial-off-the-shelf (COTS) components that can be purchased directly from original component manufacturers (OCMs). Although not included in the prototype discussed in Sect. 2.1, RASC can also rely on other elements that protect it from physical and remote attacks, such as tamper-proof PCB coatings (e.g.,  [37, 38]), secure boot/debug (e.g., on-FPGA or via TPM), and communication confidentiality and integrity of messages sent to security admins (e.g., through a lightweight cipher, HMAC, etc.). Note that the same or a subset of countermeasures might not be present in the IoT/CPS system, especially if it is a legacy system. Further, if RASC is constantly monitoring the IoT/CPS system and communicating with the security admins, removing RASC from the system is trivial to detect because it would create a significant anomaly in the side-channel measurements, i.e., a break in the power measurements.

In short, RASC is a cheap and effective root-of-trust to monitor IoT/CPS systems in the field from multiple attack vectors.

3 Instruction-level verification methodology

This section discusses the methodology and implementation of our real-time, instruction-level granularity verification method. Section 3.1 introduces more details about the experimental setup we use for the remainder of the paper. Section 3.2 discusses how we created training templates for each AVR instruction. Section 3.3 compares four different kinds of classifiers (QDA, LDA, linear SVM, and Naive Bayes) and explains why we eventually choose the linear SVM for real-time processing. In Sect. 3.4, we present a flow chart of the methodology and explain it in detail. Section 3.5 explains how to implement the instruction-level classifiers in RASC’s FPGA.

3.1 Experimental setup

The experimental setup for collecting power traces using RASC and an oscilloscope is shown in Fig. 5. The target device in this paper is an Arduino UNO, whose original core frequency is 16 MHz. For collecting power traces, RASC is directly connected to pin 8 of the ATmega328P chip on the Arduino UNO. The training power traces are collected by RASC and transmitted to a PC via UART whenever offline processing is performed. In the testing phase, the test traces are verified inside RASC in real time, and the pass/fail/suspect outcome is sent via the Bluetooth module to remote security administrators. The sampling speed of RASC is 128 MS/s, and the core frequency of the Arduino UNO is reduced to 1 MHz in our experiments.

Fig. 5

Experimental setup for collection of power traces from ATmega328P using RASC

3.2 Training templates

The templates we use to train RASC are similar to the ones used in [32]. As will be shown in Sect. 3.4, the example templates are used to collect power traces for each instruction type on the target device (Arduino UNO). The pipeline structure of the Arduino UNO is shown in Fig. 6a. One stage fetches an instruction while the other stage executes the previous instruction; thus, the instructions in both stages contribute to the shape of the power traces. The example template in Fig. 6b includes one sbi instruction for raising the start trigger signal, two nop instructions, a random instruction, the target instruction, another random instruction, two nops, and one cbi instruction for ending the trigger signal. The trigger signals let RASC know when to start and stop collecting power traces for a given template, and the nops separate the trigger signals from the target instruction.

Aside from the training template, we also adopt the same approach as [32] to alleviate the covariate shift problem caused by different loading times and process variations across target devices. Whenever we load a new program into the Arduino UNO, we collect a trace consisting entirely of nop instructions to compute a nop template trace. After collecting power traces from the example template, we subtract the nop template trace from them [32]. Thus, the power traces used for training the classifiers are normalized by the nop power traces.
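As a minimal sketch of this normalization (a Python stand-in for the actual Matlab/FPGA flow; the trace values and lengths are made up for illustration):

```python
import numpy as np

def normalize_trace(raw_trace, nop_template):
    """Subtract the per-board nop template from a raw power trace.

    The nop template absorbs board-specific baseline effects (loading
    time, process variation), so classifiers trained on one board can
    transfer to another, following [32].
    """
    return np.asarray(raw_trace, dtype=float) - np.asarray(nop_template, dtype=float)

# Toy example: a raw trace riding on a board-specific baseline.
baseline = np.full(8, 0.50)            # stand-in for the nop template trace
raw = baseline + np.array([0, 0, 0.12, 0.30, 0.25, 0.05, 0, 0])
normalized = normalize_trace(raw, baseline)
```

After subtraction, only the instruction-dependent component of the trace remains, which is what the classifiers are trained on.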

Fig. 6

a Instructions in the pipeline during clock cycles \(i-1\), i, and \(i+1\); b Example template used to train a classifier for the add instruction

3.3 Classification algorithms and instruction classifier generation

A binary classifier is generated for each instruction and is later used to verify that the instruction occurs in the target device’s known benign program code during in-field operation. To do so, RASC collects 900 power traces for each instruction using randomly generated example templates, as discussed in Sect. 3.2. The 900 normalized traces of the target instruction represent class 0, and 900 normalized power traces of other instructions represent class 1. Principal component analysis (PCA) is adopted for dimensionality reduction. After multiplying by the PCA coefficients, these 1800 power traces are used to train a binary classifier. We repeat this process for each AVR instruction, generating 92 classifiers for the 92 common AVR instructions. During the in-field monitoring phase, a binary classifier takes a power trace and outputs 0/1 if the instruction’s power trace matches/does not match the one in the known benign program.
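The per-instruction training pipeline can be sketched as follows (a Python/numpy stand-in; the synthetic trace data, the 40-sample trace length, and the 5-component reduction are illustrative assumptions, and the actual classifiers are trained in Matlab):

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_fit(X, n_components):
    """Center X and derive the PCA projection via SVD (cf. Eq. (1))."""
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T          # columns = principal axes

def make_dataset(target_traces, other_traces):
    """Class 0 = target instruction, class 1 = all other instructions."""
    X = np.vstack([target_traces, other_traces])
    y = np.r_[np.zeros(len(target_traces)), np.ones(len(other_traces))]
    return X, y

# Toy stand-ins for the 900 + 900 normalized traces of one instruction.
target = rng.normal(0.0, 0.1, size=(900, 40))
others = rng.normal(0.5, 0.1, size=(900, 40))
X, y = make_dataset(target, others)
mean, P = pca_fit(X, n_components=5)
Z = (X - mean) @ P                             # reduced features for the binary classifier
```

A binary classifier (linear SVM in our case) is then fit on `Z` and `y`; repeating this per instruction yields the 92 classifiers.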

In this paper, we consider and compare four instruction classifiers: QDA, LDA, linear SVM, and Naive Bayes. The four classifiers are tested in Matlab, and the most appropriate one is implemented in RASC for real-time detection. We first introduce PCA and then briefly describe the algorithms of these four classifiers.

  1.

    Principal Component Analysis (PCA) PCA is generally used for unsupervised dimensionality reduction [39]. It orthogonally projects the data onto a lower-dimensional linear space, known as the principal subspace, such that the variance of the projected data is maximized [40]. The principal component formula is listed in Eq. (1) and can be inferred from the information in [41].

    $$\begin{aligned} PCA(\mathbf {X})=((\mathbf {X}-\overline{\mathbf {X}}) \mathbf {U}_{\mathbf {X}-\overline{\mathbf {X}}}^{-1} \mathbf {S}_{\mathbf {X}-\overline{\mathbf {X}}}^{-1})^T. \end{aligned}$$
    (1)

    By default, PCA centers the data and uses the singular value decomposition (SVD) algorithm [41]. In Eq. (1), \(\mathbf {X}\) stands for the input matrix. \(\overline{\mathbf {X}}\) is the column average of \(\mathbf {X}\). \(\mathbf {U}_{\mathbf {X}-\overline{\mathbf {X}}}\) is the matrix of left singular vectors of \(\mathbf {X}-\overline{\mathbf {X}}\) in the singular value decomposition, and \(\mathbf {S}_{\mathbf {X}-\overline{\mathbf {X}}}\) is the matrix of singular values of \(\mathbf {X}-\overline{\mathbf {X}}\). T denotes the matrix transpose.
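The SVD relations behind Eq. (1) can be checked numerically. The sketch below (Python/numpy, with a made-up data matrix) verifies that projecting the centered data onto the right singular vectors yields the principal-component scores \(\mathbf{U}\mathbf{S}\), and that "undoing" \(\mathbf{U}\) and \(\mathbf{S}\) recovers the PCA coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
Xc = X - X.mean(axis=0)                  # center columns, as PCA does by default

# Economy SVD: Xc = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Principal-component scores: projecting onto V gives U * S.
scores = Xc @ Vt.T
assert np.allclose(scores, U * S)

# "Undoing" U and S from the centered data recovers the coefficient matrix V.
coeffs = (Xc.T @ U) / S                  # equals Vt.T, the PCA coefficients
assert np.allclose(coeffs, Vt.T)
```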

  2.

    Quadratic and Linear Discriminant Analysis (QDA and LDA). The discriminant formula determines the predicted class as follows [42]

    $$\begin{aligned} \hat{y}= \mathop {\arg \min }_{y=1,2,\ldots ,K} \sum _{k=1}^{K} \hat{P}(k\mid {x})C(y\mid {k}), \end{aligned}$$
    (2)

    where K stands for the number of classes, \(\hat{P}(k\mid {x})\) is the posterior probability of class k for observation x, and \(C(y\mid {k})\) is the cost of classifying an observation as y when its true class is k [42]. The discriminant analysis follows from the assumption that each class is drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix. In linear discriminant analysis, all classes share the same covariance matrix, whereas in quadratic discriminant analysis the covariance matrix varies by class.
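A minimal numerical illustration of the minimum-expected-cost rule in Eq. (2), assuming Gaussian classes with a shared covariance matrix (the LDA case) and made-up class parameters:

```python
import numpy as np

def discriminant_predict(x, means, cov, priors, cost):
    """Minimum-expected-cost rule of Eq. (2) for Gaussian classes with a
    shared covariance matrix (LDA). cost[y, k] is C(y|k)."""
    inv = np.linalg.inv(cov)
    # Unnormalized Gaussian likelihoods times priors -> posteriors P(k|x).
    lik = np.array([np.exp(-0.5 * (x - m) @ inv @ (x - m)) for m in means])
    post = priors * lik
    post /= post.sum()
    expected_cost = cost @ post            # row y: sum_k P(k|x) * C(y|k)
    return int(np.argmin(expected_cost))

# Two toy classes with 0-1 misclassification cost.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
cov = np.eye(2)
priors = np.array([0.5, 0.5])
cost = 1 - np.eye(2)
pred_a = discriminant_predict(np.array([0.2, -0.1]), means, cov, priors, cost)
pred_b = discriminant_predict(np.array([2.8, 3.1]), means, cov, priors, cost)
```

With 0-1 cost, the rule reduces to picking the class with the largest posterior; making the covariance class-dependent turns this into QDA.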

  3.

    Linear SVM Support vector machines (SVMs) use a hyperplane to divide data from the two classes. A hyperplane H is defined as follows [43]:

    $$\begin{aligned} f(\mathbf {x})={\mathbf {x}^T}\varvec{\beta }+b, \end{aligned}$$
    (3)

    where \(\mathbf {x}\) stands for the input observation, the vector \(\varvec{\beta }\) contains coefficients that define an orthogonal vector to the hyperplane, and the scalar b is the bias term. To better separate the data into two classes, the algorithm uses the method of Lagrange multipliers to optimize the hyperplane. Based on [43], the optimal hyperplane maximizes the margin surrounding itself between the two classes. For inseparable classes, the algorithm imposes a penalty on the length of the margin for every observation that is on the wrong side of its class boundary. In this paper, we use the SVM function “fitcsvm” from the Matlab machine-learning toolbox for offline cases.
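The decision function of Eq. (3) reduces to a dot product plus a bias; a toy Python sketch with stand-in (not trained) parameters:

```python
import numpy as np

def svm_decision(x, beta, b):
    """f(x) = x^T beta + b of Eq. (3). The sign of f(x) selects the class;
    |f(x)| / ||beta|| is the geometric distance to the hyperplane."""
    return float(x @ beta + b)

# Hypothetical hyperplane parameters (stand-ins, not from a real classifier).
beta = np.array([1.0, -2.0])
b = 0.5
side_a = svm_decision(np.array([2.0, 0.0]), beta, b)   # positive side
side_b = svm_decision(np.array([0.0, 2.0]), beta, b)   # negative side
```

This simplicity (one dot product, one addition, one comparison) is what later makes the linear SVM suitable for RASC's FPGA.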

  4.

    Naive Bayes Naive Bayes is a classification algorithm that applies density estimation to the data [44]. The posterior probability using Bayes’ theorem is defined as follows:

    $$\begin{aligned} \hat{P}(Y=k\mid {X_1,\ldots ,X_P})= \frac{\pi (Y=k) \prod \limits _{j=1}^{P} P(X_j\mid {Y=k})}{\sum \limits _{k=1}^{K}\pi (Y=k) \prod \limits _{j=1}^{P} P(X_j\mid {Y=k})}, \end{aligned}$$
    (4)

    where Y stands for the random variable corresponding to the class index of an observation, \(X_{1},\ldots ,X_{P}\) are the random predictors of an observation, and \(\pi (Y=k)\) is the prior probability that the class index is k. In this paper, we use the Naive Bayes function “fitcnb” from the Matlab machine-learning toolbox for instruction classification in offline mode.
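Eq. (4) can be illustrated with made-up per-predictor likelihoods (a sketch only; in practice Matlab's fitcnb estimates these densities from the training data):

```python
import numpy as np

def nb_posterior(likelihoods, priors):
    """Posterior of Eq. (4): priors times the product of per-predictor
    likelihoods, normalized over classes. likelihoods[k, j] = P(X_j | Y=k)."""
    joint = priors * likelihoods.prod(axis=1)
    return joint / joint.sum()

# Toy 2-class, 2-predictor example with made-up conditional densities.
lik = np.array([[0.8, 0.6],    # class 0: P(X_1|Y=0), P(X_2|Y=0)
                [0.1, 0.3]])   # class 1: P(X_1|Y=1), P(X_2|Y=1)
priors = np.array([0.5, 0.5])
post = nb_posterior(lik, priors)
```

The "naive" part is the product over predictors, i.e., the assumption that the \(X_j\) are conditionally independent given the class.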

In Sect. 4.1, we perform experiments to compare the performance of all these classifiers when applied on a PC. Due to the limited resources of RASC’s FPGA (Xilinx Spartan-3E), the instruction-level classifiers and program verification need to be simple enough for real-time detection on RASC. Thus, we try to bypass all resource-consuming calculations, such as multiplications, floating-point divisions, factorials, and integrals. This eliminates QDA and LDA from consideration since Eq. (2) is too complex, and implementing Naive Bayes on the FPGA is also impractical given RASC’s limited resources. In the end, we choose the linear SVM. In Sect. 3.5, we explain how Eq. (3) was simplified to implement the classifiers on RASC’s FPGA.

3.4 Overall methodology

Fig. 7

Flow chart of the verification methodology and application in experiments

Our overall methodology is presented in Fig. 7. It consists of two phases: training and testing. In the training phase (lefthand side of Fig. 7), we train the binary classifiers for each AVR instruction. After loading the example templates, RASC collects normalized power traces \(T_{t,i}\) 900 times for each instruction i. In our notation, t represents the index of the collected trace, and thus \(1\le t \le 900\). Further, \(1\le i \le 92\) because there are 92 common AVR instructions.

For a specific AVR instruction v, 900 power traces for instruction v are put into class 0, and 900 randomly sampled power traces of all instructions except v are put into class 1. Based on the conclusion in Sect. 3.3, the linear SVM is adopted to train a binary classifier (\(C_{v}\)) on these 1800 power traces for AVR instruction v. For calculating the confidence in the next step, we also save the margin (\(M_{v}\)) and the hyperplane (\(H_{v}\)). This process is repeated for all 92 instructions.

In the testing phase (center of Fig. 7), we load a real benchmark into the Arduino UNO, which consists of n instructions (\(B_{1}\), \(B_{2}\), \(B_{3}\), ..., \(B_{n}\)). For better testing results, we collect multiple power traces (e.g., 5) and average them to obtain \(T_{l}\). Then, \(T_{l}\) is divided into n non-overlapping power segments (\(CT_{l_1}\), \(CT_{l_2}\), \(CT_{l_3}\), ..., \(CT_{l_n}\)) matching the n instructions. Next, we verify the correctness of these power segments; that is, each power segment should match the instruction expected in the benign code. For power segment k, we use classifier \(C_{B_k}\) to predict its class (\(Class_{l_k}\)). If \(Class_{l_k}\) equals 0, the power segment \(CT_{l_k}\) is assumed to belong to the benign program. Otherwise, it is considered a malicious instruction that has been replaced by malware.

Besides predicting the class, we also calculate the distance (D) between the power segment \(CT_{l_k}\) and the hyperplane \(H_{B_k}\) of the classifier \(C_{B_k}\). If the distance is larger than the margin \(M_{B_k}\), the confidence (\(R_{l_k}\)) is taken to be 1. If the distance is within the margin, the confidence \(R_{l_k}\) is D/\(M_{B_k}\). After verifying the correctness of power segment k and its confidence \(R_{l_k}\), we move on to the other power segments in trace l. When all n power segments in trace l have been verified, we give a classification for the entire program’s power trace l. There are three possible evaluation results ("pass", "fail", and "suspect"), which are determined from the confidence, or reliability, of each instruction/segment. Specifically, if the confidence \(R_{l_k}\) of all n power segments in the power trace l (\(1\le k \le n\)) is larger than 70%, we assume all the classification results \(Class_{l_k}\) are reliable.

  • If the classification result is reliable (confidence greater than 70%) and all the power segments are classified as class 0, the evaluation result for power trace l is “pass”. That is, based on the power traces, we classify the program running on the target device as benign/correct.

  • If the classification result is reliable (confidence greater than 70%) and at least one of the power segments is classified as class 1, the evaluation result for power trace l is “fail”. In other words, we classify the program as malicious or containing malware.

  • If the confidence \(R_{l_k}\) of any power segment in the power trace l is less than 70%, we consider the classification result unreliable, and the program is labeled “suspect”.
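The decision rules above can be sketched as a small function (a minimal Python sketch, assuming a segment is unreliable when its confidence falls below the 70% threshold; the segment classes and confidences are made up):

```python
def classify_trace(classes, confidences, threshold=0.70):
    """Combine per-segment classifier outputs into a program-level
    verdict: 0 = segment matched the benign instruction, 1 = mismatch."""
    if any(r < threshold for r in confidences):
        return "suspect"                 # at least one unreliable segment
    if all(c == 0 for c in classes):
        return "pass"                    # every segment matches the benign code
    return "fail"                        # reliable, but some segment mismatched

verdict_pass = classify_trace([0, 0, 0], [0.9, 0.95, 1.0])
verdict_fail = classify_trace([0, 1, 0], [0.9, 0.95, 1.0])
verdict_susp = classify_trace([0, 0, 0], [0.9, 0.40, 1.0])
```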

In our later experiments, we collect \(T_{l}\) at least 30 times and calculate the average pass, fail, and suspect rates for benign programs and malware, as shown on the righthand side of Fig. 7.

3.5 Classifier implementation on RASC for real-time verification

To perform classification in real time on RASC, the key is to avoid time-consuming calculations. The linear SVM only needs one vector subtraction, one matrix multiplication, one scalar addition, and one scalar comparison per power segment. The relationship between an input power trace and its classification outcome is

$$\begin{aligned} f(\mathbf {x})={(\mathbf {x}-\mathbf {x_{nop}})}^T \mathbf {P_c} \varvec{\beta }+b, \end{aligned}$$
(5)

where \(\mathbf {x}\) stands for the input observation, the vector \(\varvec{\beta }\) contains coefficients that define an orthogonal vector to the hyperplane, the scalar b is the bias term of the SVM binary classifier, the vector \(\mathbf {x_{nop}}\) is the power segment for the nop instruction, and the matrix \(\mathbf {P_c}\) is the PCA coefficient matrix. Due to the FPGA memory limitation on RASC, we use fewer PCA coefficients when implementing the classifiers in RASC than when performing classification with Matlab on a PC. If \(f(\mathbf {x})\) is larger than 0, the power segment belongs to class 1 (malicious); otherwise, it belongs to class 0 (benign).

After obtaining Eq. (5), we still need to simplify the calculation because \(\mathbf {P_c}\) and \(\varvec{\beta }\) are floating-point values. After training the classifier with Matlab, we obtain the PCA coefficients of the observation matrix, the scalar b, and the vector \(\varvec{\beta }\) of the SVM classifier. We multiply the matrix \(\mathbf {P_c}\) and the vector \(\varvec{\beta }\) together, scale the product by 100, and take its integer approximation \(\mathbf {P_{c_{approx}}}\). The same is applied to the scalar b: we scale it by 100 and take its integer approximation \(b_{approx}\). The purpose of scaling by 100 and rounding is to avoid floating-point operations inside the FPGA. In the end, the simplified Eq. (6) is implemented in RASC:

$$\begin{aligned} f(\mathbf {x})={(\mathbf {x}-\mathbf {x_{nop}})}^T \mathbf {P_{c_{approx}}} + b_{approx}. \end{aligned}$$
(6)

When \(f(\mathbf {x})\) equals 0, Eq. (6) becomes the hyperplane (H in Sect. 3.4) of the SVM classifier:

$$\begin{aligned} {(\mathbf {x}-\mathbf {x_{nop}}})^T \mathbf {P_{c_{approx}}} + b_{approx}=0. \end{aligned}$$
(7)

To generate the margin for the classifier, we set it such that 95% of the training observations lie outside the margin and 5% inside. Here, we assume \(x_{m}\) is an observation lying exactly on the margin boundary. In this case, we obtain the following equation for the margin (M in Sect. 3.4):

$$\begin{aligned} {(\mathbf {x}-\mathbf {x_{nop}}})^T \mathbf {P_{c_{approx}}} + b_{approx}=f(x_{m}). \end{aligned}$$
(8)

To simplify the calculation of the confidence value inside RASC, the distance (\(D_{M}\)) between the margin M and the hyperplane H is taken as \(f(x_{m})\), and the distance (\(D_{X_{i'}}\)) between an input testing observation \(X_{i'}\) and the hyperplane is \(f(X_{i'})\). If the testing observation \(X_{i'}\) falls inside the margin, its confidence value is \(f(X_{i'})/f(x_{m})\); otherwise, the confidence value is 100%.
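The confidence rule reduces to a few lines. The following Python sketch is illustrative only (the function name and the use of absolute values to handle both sides of the hyperplane are our assumptions; RASC implements this in FPGA logic):

```python
def confidence(f_x, f_margin):
    """Confidence value for a testing observation.

    f_x      : decision value f(X_i') of the testing observation
    f_margin : decision value f(x_m) of an observation on the margin
    """
    # Inside the margin: confidence scales with distance to the hyperplane.
    if abs(f_x) < abs(f_margin):
        return abs(f_x) / abs(f_margin)
    # On or outside the margin: fully confident.
    return 1.0
```

For example, an observation halfway between the hyperplane and the margin would receive a confidence of 50%, while any observation outside the margin receives 100%.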

4 Experimental results and discussion

In this section, we validate the binary classifiers at the instruction level and our overall methodology using six real benchmarks and two types of malware (ROP and code injection). Our methodology is first investigated offline using Matlab on a PC and then in real-time using RASC's FPGA. Our classifiers are trained using power traces from one board and applied to other boards to illustrate their robustness to process variations.

4.1 Comparison of instruction-level classifiers

As described in Sect. 3.4, before generating an SVM-based verification classifier for each AVR instruction, we collect 900 power traces from the example benchmark as class 0 and 900 power traces of other instructions as class 1. To verify the correctness of these classifiers, we arrange the 92 AVR instructions in order and use Matlab to calculate the area under the curve (AUC) of the receiver operating characteristic (ROC) curve for each binary classifier. The result is presented in Fig. 8, where the index on the x-axis denotes the instruction.
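As a rough illustration of this per-instruction AUC evaluation (the paper trains linear-SVM classifiers in Matlab; here a difference-of-means linear score stands in for the trained SVM normal vector, and the data are synthetic), the AUC can be computed directly from the rank statistic:

```python
import numpy as np

def roc_auc(scores0, scores1):
    """AUC via the Mann-Whitney rank statistic: the probability that a
    class-1 score exceeds a class-0 score (ties count half)."""
    s0 = np.asarray(scores0)[:, None]
    s1 = np.asarray(scores1)[None, :]
    return ((s1 > s0).sum() + 0.5 * (s1 == s0).sum()) / (s0.size * s1.size)

def instruction_auc(class0, class1):
    """Score traces along the difference-of-means direction (a stand-in
    for the linear-SVM weight vector) and return the ROC AUC."""
    w = class1.mean(axis=0) - class0.mean(axis=0)  # hyperplane normal
    return roc_auc(class0 @ w, class1 @ w)

# Synthetic stand-in for 900 power traces per class:
rng = np.random.default_rng(0)
target_traces = rng.normal(0.0, 1.0, size=(900, 50))   # class 0
other_traces  = rng.normal(1.5, 1.0, size=(900, 50))   # class 1
```

With well-separated classes, as for most instructions in Fig. 8, the AUC approaches 1; heavily overlapping classes (e.g., the similar branch instructions) push it toward 0.5.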

Fig. 8
figure 8

Area under curve (AUC) value of ROC curves for all 92 AVR instructions

In Fig. 8, the minimum AUC value for Naive Bayes (NB) is much lower than for the other three kinds of classifiers. For linear-SVM, QDA, and LDA, the AUC values of the ROC curves of all 92 classifiers are close to 1, which means the verification classifiers can distinguish correct from incorrect instructions using the power traces.

Based on the AUC result of the linear-SVM classifier in Fig. 8, there are three instructions whose AUC is lower than 0.95, and all of them are branch instructions: BRNE (AUC=0.9), BRVC (AUC=0.93), and BRID (AUC=0.88). Distinguishing between instructions that share the same functionality and operands is more challenging than distinguishing instructions that differ in operands and functionality [32]. In the training phase, there are 20 branch instructions (instructions 45–64), and all of them share the same operand. This large number of similar branch instructions lowers the AUC value not only for the SVM classifier but also for the LDA, QDA, and Naive Bayes classifiers.

The results in Fig. 8 show that the performance of the SVM-based classifier is close to that of the QDA- and LDA-based classifiers. Meanwhile, the linear-mode SVM-based classifier has the advantage of avoiding all resource-intensive calculations. We therefore choose the linear-mode SVM-based classifier for our experiments from this point forward.

4.2 Benchmarks

This section discusses the benchmarks used for the remaining experiments. Each benchmark includes one numerical function necessary for real applications such as face recognition, self-driving cars, manufacturing, etc.

  1. 1.

    Timeloop The benchmark “Timeloop” counts the loop time of the whole routine in the system with a counter.

  2. 2.

    Matrixmultiplication The benchmark “Matrixmultiplication” loads an input 8\(\times \)8 matrix and multiplies it by a predefined 8\(\times \)8 matrix.

  3. 3.

    Decimaldivision The benchmark “Decimaldivision” loads an input decimal value and divides it by a decimal in the register.

  4. 4.

    Decimal2float The benchmark “Decimal2float” converts input decimal data into its float format and stores the result in a register.

  5. 5.

    ASCII The benchmark “ASCII” is an AVR routine that converts binary data to ASCII data. The conversion is done by repeated subtraction of the binary representations of the decimal numbers 10,000, 1,000, 100, and 10.

  6. 6.

    ADconverter The benchmark “ADconverter” is an 8-bit AD converter that measures an input signal in the range from 0.00 to 2.55 Volts and returns the result as a binary value in the range from 0x00 to 0xFF. The result, a voltage, is to be displayed on an LCD.
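The repeated-subtraction conversion performed by the “ASCII” benchmark can be sketched in Python as follows (the function name and the 16-bit input range are illustrative assumptions; the actual benchmark is an AVR assembly routine):

```python
def binary_to_ascii(value):
    """Convert a 16-bit value to its five decimal ASCII digits by
    repeated subtraction of 10000, 1000, 100, and 10, mirroring the
    subtraction loop in the AVR routine."""
    digits = []
    for base in (10000, 1000, 100, 10):
        count = 0
        # Subtract the base until the remainder drops below it;
        # the subtraction count is the digit for this position.
        while value >= base:
            value -= base
            count += 1
        digits.append(chr(ord('0') + count))
    digits.append(chr(ord('0') + value))  # remaining ones digit
    return ''.join(digits)
```

For instance, converting 54321 performs five subtractions of 10,000, four of 1,000, three of 100, and two of 10, leaving 1 as the ones digit.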

Fig. 9
figure 9

Sample code for benchmark “Matrixmultiplication” with color codes used to distinguish between normal and malware lines of code

Having introduced the functionality of the six real benchmarks, we also present the sample code of the benchmark “Matrixmultiplication” in Fig. 9. In Fig. 9, the asm volatile() block executes the matrix multiplication code. The AVR instructions in black represent the normal code, the red lines are code injected by the adversary, and the blue lines indicate malicious gadgets inside the memory. The purpose of the ROP attack and the code injection attack is the same: the adversary tries to hijack the control flow of the program and steal important data from the registers.

4.3 Malware types

For implementing the malware code, we assume that the attacker can access the running code and the ROM address of the device. There are two different malware attack scenarios included in this section: code injection attack and ROP attack. The details of the malware payloads and triggers are discussed below.

4.3.1 Triggers

Trigger for code injection attack To trigger the code injection attack, since we assume the attacker can access the code, a few lines are inserted into the benchmarks to alter the values of some parameters or print some sensitive data.

Trigger for ROP attack As stated earlier, we assume the attacker has preloaded the malicious code into some unused part of the memory. Thus, we preload a few malicious ASM lines in the function void setup(), since this function runs only once. To trigger the ROP attack, we simply insert one ASM line that calls the malicious code.

4.3.2 Payloads

For the code injection attack, we assume the attacker can access the running code and insert malicious code. For the ROP attack, we assume the attacker can load the malicious code into some unused part of the memory and call it from the main function. The payloads of the buffer overflow attack and the code injection attack in these six benchmarks aim to redirect the control flow of the target device to achieve malicious functionality. The ROP attack aims to call the pre-loaded malicious code to leak critical data stored in the memory.

4.4 Offline (PC) experimental results and discussion

In this subsection, we use the SVM classifiers trained in Sect. 4.1 and the methodology in Sect. 3.4 to determine the pass/fail/suspect rates on power traces of the 6 real benchmarks (Sect. 4.2) in 3 different cases (malware-free, attacked by ROP, and attacked by code injection). We collect testing traces from 3 different Arduino UNO boards, and the results are presented in Table 2. \(P_i\), \(F_i\), and \(S_i\) stand for the pass, fail, and suspect rates of board i. Timeloop_CI stands for the benchmark “Timeloop” attacked by code injection, and Timeloop_ROP means the benchmark “Timeloop” attacked by ROP. Note that the instruction-level classifiers were trained using only board 1 and then tested on power traces from all the boards.

Table 2 Experimental results for malware detection on 6 benchmarks in 3 scenarios on 3 boards using a PC for offline verification. \(P_i\), \(F_i\), and \(S_i\) stand for the pass, fail, and suspect rates of board i

The results show a high detection rate for all 6 malware-free benchmarks on the 3 different Arduino UNO boards. For board 1, the pass rate of the malware-free traces is close to 100%, and the fail rate of the malware traces is larger than 90%. Compared with the results from board 1, the results from boards 2 and 3 are slightly worse but still achieve at least a 70% pass rate for the malware-free traces and 96% rates against malware attacks. The reason is that we only use board 1 to execute the example template and therefore collect training traces from board 1 alone. The DC offset between different Arduino UNO boards causes the drop in the pass rate of testing traces from boards 2 and 3.

4.4.1 Accuracy vs. loops in average

Fig. 10
figure 10

Pass/fail/suspect rate versus average loops for benchmark “Matrixmultiplication” in a malware-free b ROP c code injection cases

The results can be improved when more loops are used to verify the code. In Fig. 10, we compare the pass/fail/suspect rate versus the number of averaged loops for the benchmark “Matrixmultiplication”. The pass rate is close to 100% when we average five loops of traces in the malware-free case, and the fail rate is close to 100% when we average four loops of traces in the two malware cases. Thus, in the methodology section, we average five loops in the testing phase to further remove the DC offset and lower the noise.
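The loop-averaging step can be sketched as follows. This is a minimal Python sketch; the mean-removal step reflects the DC-offset discussion above, and the exact preprocessing inside RASC may differ:

```python
import numpy as np

def average_loops(traces, n_loops=5):
    """Average power traces from consecutive loop iterations to suppress
    noise and per-board DC offset before classification.

    traces : (n_loops, trace_len) array, one row per loop iteration
    """
    avg = np.mean(traces[:n_loops], axis=0)
    # Subtracting the mean cancels any residual per-board DC offset.
    return avg - avg.mean()
```

Averaging N loops reduces uncorrelated noise by roughly a factor of sqrt(N), while the mean subtraction removes the board-to-board DC shift that degraded the pass rates of boards 2 and 3.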

4.5 Real-time (RASC) experimental results and discussion

In this section, we insert the SVM classifier into RASC based on Eq. (6). Then, we let RASC collect testing power traces and send the classification outcome of each power segment, together with its confidence value, to the security admin. Figure 11 presents a screenshot of the output from RASC. “Instr1” stands for the first instruction, and “p” means “pass”. If “f” shows up on the screen, RASC has detected a malware power segment. The integer 100 means 100% confidence. The Bluetooth module on RASC is a SESUB-PAN-d14580, and we use the DSPS app to receive the Bluetooth data on a smartphone.

Fig. 11
figure 11

Screenshot of pass/fail/suspect outcome and confidence per instruction sent by RASC through Bluetooth to the DSPS app on a smartphone

After receiving the classification outcomes and confidence values, the pass, fail, and suspect rates are calculated with Matlab, and the results are presented in Table 3.
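Tallying the received outcomes along these lines could look like the sketch below (the paper performs this step in Matlab; the “Instr1 p 100” message format is inferred from the screenshot description, so the field layout is an assumption):

```python
def tally_outcomes(lines):
    """Tally pass/fail/suspect rates from RASC outcome lines.

    Each line is assumed to look like "Instr1 p 100": an instruction
    label, an outcome flag (p / f / s), and a confidence percentage.
    """
    counts = {'p': 0, 'f': 0, 's': 0}
    for line in lines:
        _, flag, _conf = line.split()
        counts[flag] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}
```

The resulting fractions correspond directly to the \(P_i\), \(F_i\), and \(S_i\) columns of Table 3.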

Table 3 Experimental results for malware detection on 6 benchmarks in 3 scenarios on 3 boards inside RASC for real-time verification. \(P_i\), \(F_i\), and \(S_i\) stand for the pass, fail, and suspect rates of board i

Compared with Table 2, the pass rates listed in Table 3 are lower and the suspect/fail rates are higher. There are two possible reasons for this. First, as described in Sect. 3.5, we approximate the coefficients and the bias values with integer values. The approximation reduces the difficulty of implementing the verification classifiers in RASC but increases the risk of incorrect classification. The second reason is also mentioned in Sect. 3.5: due to the memory size of RASC, we use fewer coefficients when processing data inside RASC. This can be improved in future versions of RASC that use a more capable FPGA.

5 Related work

The proposed methodology can be used to non-invasively verify the instructions running within a program inside the target device, making it effective in defending against malware insertion and hardware Trojans that alter instructions and control flow. This section describes the related work in anomaly, malware, and hardware Trojan detection and compares it to the RASC prototype and methodology described in this paper. The comparison of related work is given in Table 4.

5.1 Side-channel based Malware detection

Malware Detection in Embedded Systems using Neural Network Model Khan et al. [29] use a panel antenna (roughly 0.38 \(\times \) 0.30 m in size) to capture EM radiation from an IoT device. They train an EM model of the reference device using a neural network, with the fingerprint of normal (authentic or benign) program activity as input. Later on, the EM traces from the target device are examined with the trained EM model to detect changes caused by malware. Khan et al. attack the target device with three different malware types (DDoS, ransomware, and code modification). The experimental results show 100% accuracy in detecting DDoS and ransomware when the signal-to-noise ratio (SNR) is larger than 5 dB and the antenna's distance is less than 2 m. Compared with this method, the target time window of our methodology is much smaller, allowing us to verify code at a higher (instruction-level) granularity. Thus, our approach can localize where the malware changes have been made. In addition, their monitoring setup is larger and more expensive than RASC.

Code Execution Tracking Liu et al. [45] propose a non-intrusive code execution tracking solution via the power side-channel. It models code execution on an MCU and its power side-channel behavior as a hidden Markov model (HMM). The testing traces are then analyzed with a revised Viterbi algorithm to recover the most likely executed instruction sequence. This method achieves a high detection rate for all 9 normal programs and detects abnormal code execution behavior even when only a single instruction is modified. In comparison, our methodology does not need to transform the power signal into the frequency domain and performs detection in real-time with high accuracy.

VirusMeter Liu et al. [20] propose a novel detection method called VirusMeter to detect anomalous behaviors on mobile devices. By internally collecting power consumption on a mobile device, VirusMeter trains multiple user-centric power models (using linear regression, neural networks, and decision trees) to detect abnormal power consumption related to malware in real-time. Moreover, Liu et al. verified this method on a Nokia 5500 Sport and successfully detected real cellphone malware, including FlexiSPY and Cabir, in real-time. Unlike RASC, this is an internal monitoring approach and only applies to mobile devices.

Mobile Device Profiling Buennemeyer et al. [21] propose a battery-sensing intrusion protection system (B-SIPS) for mobile computers. The Correlation Intrusion Detection Engine (CIDE) provides power profiling, and an extensive usability study was conducted to generate the CIDE features. Feedback from 31 expert participants is used to validate the condition of the system. Similar to [20], [21] is also an internal anomaly detection method. Our methodology collects traces and processes them externally; thus, it does not need any software installed on the target device.

Wattsupdoc Clark et al. [23] propose WattsUpDoc for detecting malware attacks on embedded medical devices. WattsUpDoc collects power traces at run-time and uses supervised machine learning (ML) with AC power traces of both normal and abnormal activity. The output of the supervised ML algorithm is a functional state, such as idle, booting, or shutdown, of the medical device. The algorithms achieve a 94% detection rate against known attacks and at least 85% accuracy for unknown malware. In comparison, our proposed methodology works at instruction-level granularity and only needs to know the code of the authentic (benign) program to be used.

NIPAD Xiao et al. [46] propose NIPAD to detect abnormal activities in a programmable logic controller (PLC). NIPAD collects power traces using an Agilent U2541A data acquisition system and extracts a discriminative feature set. The features of normal samples are then used to train a long short-term memory (LSTM) neural network, and abnormal behavior is identified by comparing the predicted sample with the actual sample. The experiments are performed in Matlab and show a high detection rate for the LSTM network (90.33%) even if only one line of the original program is modified. In addition, NIPAD compares three different methods that could be used for malware detection: a one-class SVM classifier, an LSTM network, and a correlation-based algorithm. LSTM achieves a higher recognition rate against modification and a lower equal error rate than the other two methods. In comparison, our approach is demonstrated in real-time using a custom-designed platform. Further, it can localize the instruction or instructions that have been changed and can communicate this information to remote security administrators.

Power Fingerprinting in SDR González et al. [24] introduce a novel approach called power fingerprinting (PFP) to perform integrity assessment of software-defined radios (SDR). This method relies on an external monitor to measure the processor's power consumption and compare it against stored signatures using pattern recognition and signal detection techniques. The results show this method can detect the execution of a tampered routine, such as a tampered transmission routine. Our approach operates at instruction-level granularity and can pinpoint altered instructions.

Power-aware Malware Detection Kim et al. [47] introduce a framework to monitor, detect, and analyze unknown energy-depletion threats. The framework is composed of a power monitor, which collects power samples and builds a power consumption history from them, and a data analyzer, which generates a power signature. The pre-stored signature in this framework can be used to verify the collected traces in the testing phase. The experimental results, from an HP iPAQ running the Windows Mobile OS, show a 99% true-positive (TP) rate and less than a 5% false-negative (FN) rate in classifying mobile malware. Our approach is not limited to energy-depletion threats.

5.2 Side-channel based Trojan detection

Region-Based Identification of Hardware Trojans Banga and Hsiao [48] propose an approach for identifying hardware Trojans. They partition the circuit into different regions and generate test vectors for each. Once they have identified the regions of interest (ROI), they create an activity peak on a per-region basis by maximizing the switching activity within the ROI and minimizing the switching activity for the rest of the circuit. The location of the Trojan is perceivable if the difference in activity between the Trojan-infected chip and the genuine chip (without the Trojan) exceeds the process variation. The experimental results show a close approximation of the infected regions. Compared with this method, our approach does not need to send any test vectors to the target device, nor does it need a Trojan-free (golden) chip for comparison. RASC only needs to monitor the power traces of the target device externally. When the hardware Trojan is triggered and alters the instructions, RASC can detect this change and send alerts to the security admins.

Power Signal Methods for Detecting Hardware Trojans Rad et al. [49] collect power supply transient signals from multiple individual power ports on the chip with a test sequence applied to the inputs of the core logic. To deal with process and environmental (PE) variation effects in the chip, a calibration circuit is inserted into each IC to support a calibration test. Transient anomalies are then detected through a statistical analysis that compares the two-dimensional scatterplots generated from the processed supply transient signals of the Trojan-free and Trojan-infected circuits. The simulation results show this method can detect Trojans as small as a single gate under noise-free conditions, and from one gate at 30 dB SNR to four gates at 10 dB SNR when noise and background switching activity are considered. In contrast, our method does not need to insert any calibration circuit into the original design. Further, we can identify abnormal behaviors in the power traces with a trained machine-learning model when the hardware Trojans are triggered, without the need for a golden chip.

On-Chip Sensor Circle Distribution In [50], Köse et al. perform real-time hardware Trojan detection by monitoring voltage and current consumption using on-chip sensors. In addition, the sensors are optimally distributed to localize the Trojan’s activity. The approach is validated using simulations and the authors localize Trojans to within 1% error for large power grid sizes. In comparison, our approach measures the power consumption from the outside of the chip and is validated with real hardware.

On-Chip EM Sensors In [51], He et al. present a method to design an on-chip EM sensor that can be deployed inside the chip. The proposed sensor achieves a higher signal-to-noise ratio (SNR) than external EM probes. The Euclidean distance between EM measurements for the known Trojan-inactive case and the run-time case is computed to detect hardware Trojans in the time and frequency domains. Simulation and silicon results show that the distance becomes larger when the Trojan is activated. Unlike [51], our approach does not require changes to the chip and uses power rather than EM. Further, our classification algorithms run in real-time.

On-chip Thermal Sensors and Temperature Tracking. Forte et al. [52] proposed to use on-chip thermal sensors to detect Trojans at run-time. A Kalman filter processes the sensor measurements to estimate the thermal profile of the chip over time. After hardware Trojan activation, the power consumption of the chip is altered, thereby affecting the thermal profile. An autocorrelation-based approach uses the growing error in the Kalman filter innovation caused by divergence between estimated profile and thermal sensor measurements to detect Trojans. Simulations on Trust-hub benchmarks show that Trojans can be detected as long as the change in power consumption is large enough. Compared to this approach, our method does not require on-chip sensors and our approach is validated with real hardware experiments.

5.3 Preliminary RASC results

A preliminary version of RASC was designed by Stern et al. for malware detection in [53]. The original RASC could visually identify code injection attacks, demonstrating its potential for real-time CPS monitoring and secure firmware upgrades. However, detection was neither automated nor performed in real-time on RASC. Compared with the original version of RASC, the prototype in this paper upgrades the sampling speed and provides a more flexible arrangement of I/O ports, on-board FPGA processing, and better Bluetooth communication.

Table 4 Comparison of related work in side-channel based anomaly, malware, and hardware Trojan detection

6 Conclusion and future work

In this paper, we demonstrated the correctness of the proposed instruction-level verification methodology using power measurements obtained by the miniature RASC system against two kinds of attacks (ROP and code injection) on 6 real benchmarks. After simplifying the algorithm, we successfully inserted the classifiers into RASC for a real-time demonstration. With the classification results from RASC, the malware-free and malware power traces of the 6 real benchmarks can still be identified with a high detection rate.

In future work, we plan to upgrade RASC to a more capable version with a larger-memory FPGA and an additional EM measurement capability. The larger-memory FPGA could host more advanced classification algorithms and process power and EM traces at the same time; we expect that using two modalities will outperform a single channel. With the improved version of RASC and more advanced algorithms, we also hope to demonstrate real-time disassembly.