1 Introduction

The development of AI-based systems in many IoT domains faces security challenges that are especially important given that these systems usually perform critical tasks with the help of sensitive data. For almost a decade, the machine learning (ML) security community (mainly structured around Adversarial Machine Learning [1] and Privacy-Preserving Machine Learning [2]) has worked unceasingly with a threefold objective: (1) shed light on attacks that target every step of the machine learning pipeline with an impressive diversity of attack vectors, (2) propose defense schemes to improve the robustness of models or systems, and (3) build sound evaluation methodologies to properly assess the intrinsic robustness of models and the real impact of protections.

However, most of these works focus on demonstrating (or defending against) attacks that exploit the inputs and outputs of a white-box or black-box target model seen as a pure algorithmic abstraction. These studies are obviously essential since they reveal theoretical flaws, but the attack surface also needs to encompass attack vectors related to the physical implementation of the models on specific hardware platforms. Interestingly, one can draw a parallel with cryptography-based systems, for which international standardization and certification are well established. For example, if we consider the current standard for symmetric encryption (AES, the Advanced Encryption Standard), this algorithm is considered secure as no practical cryptanalysis-based attack has been demonstrated (apart from the brute-force strategy). However, several physical attacks (typically, side-channel and fault injection analysis) have been successfully demonstrated on many platforms (ASIC, FPGA, microcontrollers...) to recover the secret key. Therefore, to claim a certain level of security, an AES-based system must be evaluated against a set of state-of-the-art physical attacks to obtain the related certification level, as is the case for the Common Criteria.Footnote 1

The purpose of this chapter is twofold. First, we outline a panorama of the threats against embedded machine learning models (especially state-of-the-art deep neural networks) as well as the available defense strategies. Second, we highlight advanced physical attacks against the confidentiality and integrity of models, since these threats still receive insufficient attention despite a recent surge of interest in these topics within the hardware security community.

2 Threat Models

The definition of a threat model makes it possible to precisely set the goal, knowledge and abilities of an adversary, as well as the most important features of the system to defend [3]. First, we define the formalism used in this chapter; then we briefly describe the different features that compose a threat model.

2.1 Formalism

First, we distinguish an ML model, \(\mathbb {M}\), seen as an abstract algorithm, from its implementations \(\mathcal {M}\) after deployment in different hardware and software environments. In some cases, these deployed models may be functionally different because of optimization techniques that modify their architecture or some parameters (e.g., quantization, pruning).

A supervised neural network model \(\mathbb {M}_W\) is a parametric model that maps an input space \(\mathcal {X}=\mathbb {R}^d\) to an output space \(\mathcal {Y}\). W is the set of parameters to be learned and \(\mathcal {D}\) is the data distribution. Typically, for a classification task, \(\mathcal {Y}\) is a finite set of labels \(\{0,...,C-1\}\). According to the Empirical Risk Minimization framework, \(\mathbb {M}_{W}\) is trained by minimizing a loss function \(\mathcal {L}\) (e.g., the cross-entropy loss) that quantifies the error between the prediction \(\hat{y} = \mathbb {M}_W(x)\) and the ground-truth label y, as defined in Eq. 1:

$$\begin{aligned} W^* = \mathop {\mathrm {arg\,min}}\limits _W\Big ( \mathop {\mathbb {E}}_{x,y\sim \mathcal {D}}\big [\mathcal {L}(\mathbb {M}_{W}(x);y)\big ] \Big ) \end{aligned}$$
(1)
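To make Eq. 1 concrete, the following minimal training sketch (PyTorch is our choice here; the model, data loader and hyper-parameters are placeholders, not elements of this chapter) minimizes the cross-entropy loss over mini-batches with SGD.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, epochs: int = 10, lr: float = 1e-3):
    """Minimal sketch of the empirical risk minimization of Eq. 1 (placeholders only)."""
    criterion = nn.CrossEntropyLoss()                      # loss L (cross-entropy)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in train_loader:                          # (x, y) sampled from D
            optimizer.zero_grad()
            loss = criterion(model(x), y)                  # L(M_W(x), y)
            loss.backward()
            optimizer.step()                               # update W towards W*
    return model
```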
Fig. 1

The (supervised) ML pipeline for an embedded ML model deployed on a mobile device. State-of-the-art threats cover every stage of the pipeline and can target both data and the model

Figure 1 illustrates the traditional (supervised) machine learning pipeline: the data used for training are samples of \(\mathcal {X}\) and compose the training data set \(X^{train}\). A fraction of that data set, \(X^{val}\), is used to intrinsically validate the training, i.e., to evaluate the behavior of \(\mathbb {M}_W\) on unseen data and therefore to measure the so-called generalization gap. At inference time, the learned model is used on new samples and outputs, for a new input \(x^{test}\), the vector of probabilities that it belongs to each class of \(\mathcal {Y}\).

2.2 Adversarial Objectives

Classically, the goal of an adversary is defined thanks to the confidentiality/integrity/availability triad that we develop in Sect. 3.

Confidentiality and privacy concern both the data and the model. First, extracting data–even partial information–captured or memorized within a model is a critical threat for many domains such as medical applications. Second, another major adversarial objective is to reverse-engineer a protected model by extracting information about its architecture or parameters. Model extraction is a growing concern in the security community with both API-based and physical attacks.

Integrity-based attacks aim to deflect the nominal behavior of a model. Classically, the objective may be to fool the prediction of a model on a global scale (i.e., to drop the average accuracy on a test set) or only for very specific inference inputs. The most popular threats are adversarial examples and poisoning attacks, at inference and training time respectively.

Threatening availability means that the adversary targets the overall system hosting the ML algorithms so that it becomes unavailable. This also encompasses adversaries who deteriorate the performance so significantly (in quality, or system-wise, e.g., processing time) that the system becomes useless.

2.3 The System Under Attack

The distinction between sensor-based and API-based systems is fundamental because the two types induce different ways of interacting with the model. For example, most of the works related to inference input-based attacks (such as adversarial examples, explained in Sect. 3.2.2 and Fig. 3) deal with API-based systems (more particularly cloud-based ML-as-a-service systems), for which an adversary may have full control over the inputs, contrary to a sensor-based system, which comes with additional physical (and software) layers.

Another essential point is the position of the system within the ML pipeline, more precisely whether it performs both the training and inference steps. For many IoT applications, ML models are trained beforehand on high-performance computing platforms (essentially GPU-based servers) and then deployed to connected devices for pure inference purposes only. This paradigm is no longer exclusive because of the fast development of powerful ML-compliant hardware architectures, the improvement of model optimization techniques, as well as advanced training strategies dedicated to IoT (e.g., federated learning). Even platforms with memory and power constraints will likely be able to support training tasks. In that case, the attack surface is significantly wider and an attacker may exploit potential flaws at training time (e.g., poisoning or backdoor attacks) to weaken the resulting learned model and perform efficient attacks at inference time.

2.4 Knowledge and Capacity of an Adversary

As previously mentioned, an important criterion is the ability of an adversary to perform an attack at training and/or at inference time. At training time, an adversary may have full access to the training set as well as to the training process itself. At inference time, the attacker generally exploits (at least) the inputs and outputs by querying the model, subject to potential restrictions. However, the most critical point is the level of knowledge of the adversary about the target model.

An adversary who performs a white-box attack has perfect knowledge of the parameters and architecture of the target model. This setting may be widened to knowledge of and access to the training set. On the contrary, with black-box attacks, the attackers have no (or only partial) information about the target model. They need to guess some information thanks to their expertise and knowledge of the task, or by querying the model.

Efforts on adversarial and privacy-preserving machine learning have highlighted (and still do) many traps in the evaluation of the robustness of ML systems, such as underestimating the attacker. An important evaluation standard is the definition of a so-called worst-case scenario, which assumes an advanced adversary who has enough expertise and knowledge about the protections to perform adaptive attacks that completely thwart these defenses with only minor changes to state-of-the-art attacks, as exposed in [4].

2.5 Attack Surface

As mentioned in the introduction, the complexity of defending embedded machine learning models mainly stems from the extent of the attack surface. This complex attack surface must encompass algorithmic threats and implementation-based threats, as illustrated in Fig. 2. The former exploit the theoretical flaws of the models, while the latter include powerful physical attacks such as side-channel analysis (SCA) and fault injection analysis (FIA), which both leverage characteristics of the (software or hardware) implementation of a model as well as of the hardware platform (e.g., memory types, instruction sets...).

Fig. 2

For embedded ML-based systems, the attack surface must encompass both algorithmic and implementation-based threats, including physical attacks

To illustrate that point, let us focus on the Rectified Linear Unit (ReLU) activation function, widely used in many deep neural network models, a piecewise linear function defined as \(ReLU(x) = max(0,x)\). From a mathematical standpoint, this function has several strong properties. For example, ReLU maps input values to \([0,+\infty )\) and thus cancels out negative inputs. Moreover, its second derivative is zero everywhere except at its critical point (\(x=0\)). These properties are exploited in some attacks, for example a cryptanalysis-based attack from Carlini et al. [5] that reverse-engineers the parameter values of a multilayer perceptron (MLP) model, or an attack on the training process that alters the initialization of the parameters [6]. These attacks are algorithmic ones since they directly exploit the definition of ReLU, regardless of how this function is implemented and deployed on a device. On the contrary, other attack vectors rely on implementation flaws of ReLU, such as timing attacks that exploit the fact that most ReLU implementations are not constant-time, which can leak information about the architecture of a protected model.
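The snippet below is a minimal illustration (in Python, our own assumption) of these two facets of ReLU: its mathematical definition and properties on the one hand, and a naive branchy implementation whose data-dependent control flow is the kind of behavior timing attacks can exploit on the other hand.

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x): maps inputs to [0, +inf), cancels out negative values,
    # and its second derivative is zero everywhere except at the critical point x = 0.
    return np.maximum(0.0, x)

def relu_branchy(x: float) -> float:
    # Illustrative only: a branchy, hence non-constant-time, scalar implementation.
    # The taken/not-taken branch depends on the (possibly secret-dependent) sign of x.
    if x > 0.0:
        return x
    return 0.0
```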

3 A Panorama of Algorithmic Attacks

3.1 Confidentiality and Privacy Threats

3.1.1 Data Leakages

In many critical domains, the training data is exclusively or partially composed of private information that may be captured by the model during the training process. It is obviously the case for models suffering from overfitting, but Carlini et al. [7] also demonstrated that large language models tend to memorize training data at the early steps of training. Note that privacy breaches may concern even low levels of information. The best example is the membership inference attack [8]: the adversary aims at guessing whether an inference input fed to the model belongs to the training dataset. This membership knowledge can be critical information in some cases, for example in medical prediction tasks or biometric-based access control systems.

3.1.2 Model Theft

The confidentiality of models is also an important issue that drives stakeholders to protect models against reverse-engineering attacks. There are many goals underlying a model theft attack, as detailed by Jagielski et al. [9], who highlight the concepts of fidelity and accuracy. In a fidelity context, an adversary aims to precisely extract the model's characteristics in order to obtain a clone model. In addition to the theft itself, the adversary may aim to steal a model in order to shift from a black-box to a white-box context and craft more efficient attacks. On the contrary, an accuracy objective refers to performing well on the underlying learning task of the original model. The attacker aims at stealing the performance of the model and, with little effort, reaching equal or even superior performance. In such a case, the exact extraction of neither the architecture nor the parameter values is necessary.

API-based approaches for model extraction exploit input/output pairs and potential information about the target model. We highlight a milestone work from Carlini et al., who cleverly consider the extraction of parameters as a cryptanalytic problem [5] and demonstrate significant improvements over [9]. The threat model assumes an adversary who knows the architecture of the target model but not the internal parameters. The attack relies on the ReLU properties, more precisely the fact that the second derivative is null everywhere except at a critical point. The authors demonstrate a complete extraction of a 100,000-parameter MLP (one hidden layer) with \(2^{21.5}\) queries and a worst-case extraction error of \(2^{-25}\).
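As an illustration of the criterion exploited by this attack, the sketch below (our own toy formulation, with hypothetical names) uses a second-order finite difference to detect whether a ReLU critical point lies on a small segment around an input: away from such points the model is locally linear and the difference is (numerically) zero.

```python
import numpy as np

def crosses_relu_boundary(f, x, direction, eps=1e-4, tol=1e-6):
    """Toy sketch: f is a scalar model output queried as a black box; x and
    direction are numpy vectors. Non-zero local curvature along the segment
    [x - eps*d, x + eps*d] reveals a nearby ReLU critical point."""
    d2 = f(x + eps * direction) - 2.0 * f(x) + f(x - eps * direction)
    return abs(d2) > tol
```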

In Sect. 4.1 we focus on implementation-based approaches for model extraction.

3.2 Integrity-Based Attacks

3.2.1 At Training Time

Attacking the integrity of a model at training time relies on a strong assumption about the ability of the adversary. Poisoning attacks aim to control the behavior of a model by manipulating its training data [10,11,12]. By injecting, modifying or removing training samples, an adversary aims at altering the decision boundary. Several objectives can therefore be linked to such attacks: for example, attackers may want to degrade the overall accuracy of the model, or they may target a very specific behavior when the model is fed with a trigger signal.

A first approach is to alter the training labels, as proposed in [13] in a healthcare context, with an additional model that learns the influence of each sample and therefore highlights the samples of highest interest to poison.

However, most state-of-the-art poisoning attacks operate on the inputs, by altering training samples or injecting new ones. Trigger-based data poisoning (also referred to as backdoor poisoning) aims at fooling a model at inference time with inputs containing a specific trigger (e.g., an image patch or a specific word sequence) that has been learned by the model thanks to poisoned data [14]. On the contrary, trigger-less (or feature-based) poisoning attacks only focus on the training process. Shafahi et al. [10] propose altering a clean training sample so that some features extracted from the model are very close to those of a target training sample that belongs to another class. The authors add an adversarial watermark in order to blend features between the clean and target samples.

3.2.2 At Inference Time

An adversarial example is an input \(x^*\), crafted from an initial sample x, that is misclassified by a model even though it results from small (even imperceptible) perturbations (see Fig. 3). Formally, Szegedy et al. [15] define \(x^*\) as in Eq. 2:

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{\delta } \Vert \delta \Vert \text { s.t. } \mathbb {M}_{\Theta }(x + \delta ) = l \ \ (l\ne \mathbb {M}_{\Theta }(x)) \text { with } x^* = x + \delta \in \mathcal {X} \end{aligned}$$
(2)
Fig. 3

From [16]: an illustration of a successful adversarial example that fools a model. The adversarial perturbation is magnified for visualisation

With \(\delta \) a perturbation applied to an input x such that \(x^*\) remains in \(\mathcal {X}\), and l the (mis)classification output. A classical setting bounds the perturbation with an \(l_p\) norm: \(||x-x^*||_{p}\le \epsilon \), with \(\epsilon \) the adversarial budget. Classically, the \(l_2\) and \(l_\infty \) norms are used, which lead to an alteration of every dimension of the input. State-of-the-art crafting methods are mainly based on the gradients of the loss w.r.t. the inputs, such as the PGD (projected gradient descent) attack proposed in [17], an iterative approach defined as:

$$\begin{aligned} x_0^* \sim \mathcal {B}\big (x,\epsilon \big ) \end{aligned}$$
$$\begin{aligned} x_{t+1}^* = \Pi _{\mathcal {B}(x,\epsilon )}\Big (x_{t}^* + \lambda \cdot \text {sign}\Big (\nabla _x\mathcal {L}\big (\mathbb {M}_\Theta (x_{t}^*),y\big )\Big )\Big ) \end{aligned}$$
(3)

Where \(\mathcal {B}(x,\epsilon )\) is the \(\epsilon \)-ball around x according to the \(l_p\) norm and \(\Pi _{\mathcal {B}(x,\epsilon )}\) the projection onto it. PGD is performed several times and, for each attempt, the initial state \(x_0^*\) is picked randomly in \(\mathcal {B}\).
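A minimal \(l_\infty \) PGD sketch following Eq. 3 is given below (PyTorch is our assumption, and inputs are assumed to be normalized in [0, 1]); it is meant to illustrate the iteration, not to reproduce the exact setting of [17].

```python
import torch

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Minimal l_inf PGD sketch (assumes [0,1]-normalized inputs)."""
    loss_fn = torch.nn.CrossEntropyLoss()
    # Random start in the eps-ball B(x, eps)
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()            # ascend the loss
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # project onto B(x, eps)
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```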

Interestingly, \(l_0\)-based attacks aim to minimize the number of perturbed dimensions, such as the so-called One-Pixel Attack [18], or [19], which reaches more than a 95% success rate on CIFAR-10 by perturbing at most 10 pixels while combining sparsity and imperceptibility (a mix of \(l_0\) and \(l_\infty \) constraints).

In a black-box setting, the adversary cannot directly compute the gradients \(\nabla _x\mathcal {L}(x,y)\) but can benefit from two types of approaches. First, the attacker can leverage the transferability property of adversarial examples [20]. Indeed, adversarial examples crafted on a model \(\mathbb {M}\) are likely to be successful on a model \(\mathbb {M'}\) that performs the same task. Therefore, an adversary can design and train a substitute model as close as possible to the target model, on which the gradients can be computed. Second, an adversary may approximate the gradients by exploiting a set of input/output pairs gathered by massively querying the model [21]. The complexity of these query-based attacks depends on the nature of the available outputs (logits, scores or labels only).

Some works shift digital adversarial examples into the physical world [22], such as Eykholt et al., who fool road sign classifiers with posters or stickers [23], or face recognition systems fooled with handcrafted glasses in [24]. Moreover, several works highlight that the classical gradient-based attacks (such as PGD), usually demonstrated on computer vision tasks, are effective without any adaptation on multivariate time series. Adversarial examples have also been successfully crafted for other applications such as speech-to-text [25], Q&A systems, malware detection [26] and even reinforcement learning [27, 28].

3.3 Availability

First, the availability threats encompass all the typical network-based attacks that aim at weakening a computing or communication infrastructure, such as Denial-of-Service (DoS) attacks. However, some works focus on specific availability-based attacks that target the training process. In [6], Grosse et al. propose exploiting the initialization of the weights as a way to fool the training. The attack aims to severely degrade the model's performance and increase training time. The basic principle of the attack is straightforward and relies on the property of the ReLU activation function to map negative inputs to zero. By controlling even a small proportion of the initial values of the weight matrix, the attacker leverages this ReLU property to zero out activation values, with a cascading effect in deeper models. In some cases, training is impossible since too many neurons are shut down.
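The toy example below (our own illustration, not the code of [6]) shows the mechanism: shifting the initial weights towards negative values drives the pre-activations negative, so that ReLU zeroes out almost all first-layer activations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100)                          # non-negative input (e.g., normalized pixels)
W = rng.standard_normal((50, 100)) - 5.0     # maliciously shifted initialization
a = np.maximum(0.0, W @ x)                   # first-layer ReLU activations
print((a == 0).mean())                       # ~1.0: almost all neurons are dead
```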

Another concerning attack vector is the batch-feeding process. To work properly, the standard mini-batch stochastic gradient descent (SGD) method used in deep learning relies on the assumption of a uniform random sampling of the training data. In [29], Shumailov et al. consider an adversary placed in a strong black-box threat model, with no knowledge of the model nor any prior knowledge of the training data. The basic principle of the attack is to interfere with the batching part of the ML pipeline in order to break the randomness assumption of mini-batch SGD, leading to strong convergence issues.

4 A Focus on Physical Attacks

The attacks described in the previous section are algorithmic threats, with models seen as pure mathematical abstractions. In this section, we focus on implementation-based attack vectors by highlighting the use of side-channel analysis (SCA) and fault injection analysis (FIA) for model extraction and integrity-based attacks. We highlight recent results concerning 32-bit MCUs, typically used for IoT applications. For a broader survey on the hardware security of deep neural networks, interested readers may refer to [30].

4.1 Model Extraction Based on Side-Channel Analysis

As previously detailed in Sect. 3.1.2, model extraction is becoming a major threat with different adversarial objectives (model cloning, functionality theft). Interestingly, even though several API-based strategies have been proposed, model extraction is also a threat significantly studied by the hardware security community because of the well-known efficiency of side-channel analysis in extracting critical information from an embedded program (both the data and the instructions).

4.1.1 Side-Channel Analysis

Side-channel analysis (SCA) refers to physical attacks relying on the observation of physical signals that depend on both the algorithm and the processed data. A classical attack targets the software or hardware implementation of a cryptographic primitive to recover secret information (e.g., a ciphering key) [31]. A typical setup is to feed the system with known inputs and capture the electromagnetic emanations of the target chip. The captured signals are stored for further statistical processing and are traditionally referred to as traces. A leakage model links these traces with known algorithmic steps that process the (known) inputs and the secret information (e.g., the SubBytes operation for the AES): it bridges the physical observations with the targeted algorithm and data. Classical choices for leakage models are the Hamming weight and the Hamming distance. Then, an adversary can make hypotheses on the secret values and correlate these hypotheses with the recorded traces thanks to the leakage model. The most likely hypothesis can be extracted because this value best explains what has been physically observed. This technique is known as Correlation Power Analysis (CPA), but simpler analyses are possible, for example by simply reading the traces (Simple Power Analysis, SPA), which may reveal patterns when the implementation has data-dependent (non-constant-time) behavior. Figure 4 illustrates an electromagnetic trace of two inferences of a MobileNet (v2) neural network deployed on a Cortex-M7 platform, with clear distinctions separating the different structural blocks of the network.
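The following toy sketch (our own simplification, with hypothetical function names) summarizes the CPA principle: for each hypothesis on a secret value, a Hamming-weight leakage model is correlated with the measured traces, and the hypothesis with the highest correlation is retained.

```python
import numpy as np

def hamming_weight(v: int) -> int:
    return bin(v).count("1")

def cpa(traces, inputs, secret_candidates, intermediate):
    """Toy CPA sketch. traces: (N, T) array of N traces with T time samples;
    inputs: N known inputs; secret_candidates: hypotheses on the secret value;
    intermediate(input, secret): the intermediate value handled by the device
    (e.g., a SubBytes output for the AES) -- all names here are assumptions."""
    traces = np.asarray(traces, dtype=float)
    scores = []
    for k in secret_candidates:
        hyp = np.array([hamming_weight(intermediate(x, k)) for x in inputs], dtype=float)
        hyp_c = hyp - hyp.mean()
        tr_c = traces - traces.mean(axis=0)
        # Pearson correlation between the hypothesis and every time sample
        corr = hyp_c @ tr_c / (np.linalg.norm(hyp_c) * np.linalg.norm(tr_c, axis=0) + 1e-12)
        scores.append(np.max(np.abs(corr)))
    return secret_candidates[int(np.argmax(scores))]   # most likely hypothesis
```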

Fig. 4

An electromagnetic trace (red) of two consecutive inferences of a MobileNet (v2) model running on a Cortex M7 microcontroller. We can observe the clear separations between the convolutional basic blocks of the model. The blue line is a trigger signal showing the start and stop time sample of each inference

4.1.2 Timing Analysis

Because the inference process of an embedded model is rarely constant-time, timing analysis is a classic way to infer information about the architecture and (in some cases) the parameter values of a model. For example, Gongye et al. exploit extra CPU cycles (on an x86 processor) for IEEE-754 multiplications or additions with subnormal values [32] in order to precisely recover the weights and biases of a 4-layer neural network model. Maji et al. [33] also demonstrate parameter extraction with timing analysis by exploiting the ReLU activation function and the multiplication operation with floating-point, fixed-point and binary models deployed on three platformsFootnote 2 without a floating-point unit (FPU).

4.1.3 SCA-Based Extractions

Several works [34, 35] highlight the use of side-channel analysis to extract the values of the internal parameters of a deep neural network model. Here, we focus on the software implementation of neural network models on typical microcontroller-based IoT platforms, such as Cortex-M 32-bit microcontrollers.

Exploitable leakages are related to the basic operation used in a neural network, that is, the multiplication between a secret parameter (also called weight) and an input value (i.e., the input data or the output of the previous layer). For a full-precision implementation of a neural network, this multiplication handles two IEEE-754 32-bit floating-point values. Recall that a 32-bit single-precision floating-point value a is composed of a sign, an exponent and a mantissa, as in Eq. 4:

$$\begin{aligned} a &= (-1)^{b_{31}}\times 2^{(b_{30}...b_{23})_2-127}\times \Big ( 1.b_{22}...b_{0} \Big )_2 \\ &= (-1)^{S_a} \times 2^{E_a - 127} \times \Big (1 + 2^{-23} \times M_a\Big ) \nonumber \end{aligned}$$
(4)

With \(S_a\), \(E_a\) and \(M_a\) respectively the sign, exponent and mantissa values. Then, the result of the multiplication \(o = x \times w\), with x and w the input and parameter values of a neuron, has a sign (\(S_o\)), exponent (\(E_o\)) and mantissa (\(M_o\)) as detailed in Eq. 5:

$$\begin{aligned} S_o = S_x \oplus S_w \text {, } E_o = E_x + E_w - 127 \text {, } M_o = M_x + M_w + 2^{-23} \times M_x \times M_w \end{aligned}$$
(5)
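For illustration, the helper below (a sketch in Python) extracts the three fields of Eq. 4 from a single-precision value; this is the kind of decomposition an attacker reasons with when building hypotheses on the exponent and mantissa bits.

```python
import struct

def fp32_fields(a: float):
    """Decompose an IEEE-754 single-precision value into (S_a, E_a, M_a), cf. Eq. 4."""
    bits = struct.unpack(">I", struct.pack(">f", a))[0]
    sign = bits >> 31                  # b31
    exponent = (bits >> 23) & 0xFF     # b30..b23 (biased by 127)
    mantissa = bits & 0x7FFFFF         # b22..b0
    return sign, exponent, mantissa

# Example: fp32_fields(-1.5) -> (1, 127, 4194304),
# i.e., -1.5 = (-1)^1 * 2^(127-127) * (1 + 2^-23 * 4194304)
```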

Joud et al. [35] demonstrate a coarse-to-fine strategy that exploits side-channel leakage to precisely extract the 32 bits of a parameter. The basic idea is to first focus on the exponent part, keep (with CPA) the most likely hypotheses, and then progressively extend the correlation analysis to the bits of the mantissa. With this approach, experiments on a Cortex-M7 target using the FPU (floating-point unit) demonstrate an extraction precision close to \(10^{-7}\) for the absolute value of a set of parameters of a shallow MLP model. Figure 5 shows the setup used for collecting the traces in these experiments.

However, some challenges remain open for a full extraction of the internal parameters [35]. First, the use of ReLU as activation function, by mapping the output of a layer to positive or null values, significantly increases the complexity of extracting the sign bit. Second, a single extraction error in a fully-connected network automatically and dramatically compromises the extraction of all the weights of the subsequent layers. Moreover, the exploitable leakage relies on the multiplication operation, which does not involve the bias values, and the addition operation is more complex to exploit than the multiplication with classical CPA methods.

Fig. 5

(Left) Experimental setup used in [35] for the high-precision extraction of 32-bit parameters of a shallow MLP model on a Cortex-M7 microcontroller. An electromagnetic probe is precisely positioned on the chip to capture EM traces. (Right) Laser bench used for laser fault injection against an embedded neural network model in a Cortex-M 32-bit microcontroller

4.2 Weight-Based Adversarial Attacks

Input-based integrity attacks are not the only attack vectors available to an adversary who aims to fool a model at inference time. In the past few years, several implementation-based attacks have been proposed and demonstrated that directly target the parameters stored in memory (e.g., DRAM or Flash memory). Alongside safety-related efforts that evaluate the robustness of ML models against random faults, these works highlight the lack of robustness of deep neural network models to fault injection attacks that alter the data, the parameters, or the instruction flow [36, 37].

4.2.1 Target the Parameters Stored in Memory

As formalized in [38] or [39], a parameter-based attack aims at maximizing the loss (i.e., increasing mispredictions) on a small set of test inputs, as represented in the first part of Eq. 6. As with the imperceptibility criterion of adversarial examples, the attacker may add a constraint on the perturbation by bounding the bit-level Hamming distance (HD) between the initial (W) and faulted (\(W'\)) parameters, corresponding to an adversarial budget S (second part of Eq. 6).

$$\begin{aligned} \max _{W'} \sum _{i=0}^{N-1}{\mathcal {L}\Big (M\big ( x_i; W'\big ),y_i\Big )} \text { s.t. } HD(W',W)\le S \end{aligned}$$
(6)

A state-of-the-art parameter-based attack is the (white-box) Bit-Flip Attack (hereafter BFA) initially proposed by Rakin et al. [38]. The goal is to decrease the performance of a model by selecting the most sensitive bits of the stored parameters and progressively flipping these bits until an adversarial goal is reached. Typically, the objective of an attacker is to ultimately degrade the model so that its accuracy drops to a random-guess level (as in [38] or [40]). The selection of the bits is based on the ranking of the gradients of the loss w.r.t. each bit, \(\nabla _b\mathcal {L}\), computed thanks to a small set of inputs. The most sensitive bit is permanently flipped in the direction of gradient ascent, as defined in [38] with Eq. 7 (with \(\hat{b}\) the bit after the bit-flip), and the process is repeated until the adversarial objective is met.

$$\begin{aligned} \hat{b}= b\oplus \big (\text {sign}(\nabla _b\mathcal {L})/2 + 0.5\big ) \end{aligned}$$
(7)
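A toy, NumPy-only sketch of one BFA iteration on 8-bit stored weights is given below (this is our simplification: how the bit-level gradients are obtained, and the quantization details, are left out).

```python
import numpy as np

def bfa_step(weights_uint8, grad_wrt_bits):
    """weights_uint8: 1-D array of quantized weights stored as 8-bit words.
    grad_wrt_bits:  (n_weights, 8) array of dL/db for every stored bit."""
    grad_wrt_bits = np.asarray(grad_wrt_bits, dtype=float)
    # Rank all bits by the magnitude of their gradient and pick the most sensitive one
    w_idx, bit_idx = np.unravel_index(np.abs(grad_wrt_bits).argmax(), grad_wrt_bits.shape)
    b = (int(weights_uint8[w_idx]) >> bit_idx) & 1
    # Eq. 7: b_hat = b XOR (sign(dL/db)/2 + 0.5)
    b_hat = b ^ int(np.sign(grad_wrt_bits[w_idx, bit_idx]) / 2 + 0.5)
    if b_hat != b:
        weights_uint8[w_idx] ^= (1 << bit_idx)   # permanently flip the selected bit
    return w_idx, bit_idx
```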

Many simulation-based works have been proposed since the introduction of the BFA [39,40,41,42,43,44], but fewer efforts have been made to practically implement the attack. Among these works, an important milestone is [37], which demonstrates the BFA with RowHammer on an Intel i7-3770 CPU platform. RowHammer [45] is a powerful attack vector that can only target DRAM memory cells. It relies on the electrical interaction between DRAM cells: by repeatedly stressing the memory rows adjacent to a target cell, an adversary can induce bit flips in the target's row.

4.2.2 Practical Experiments on Microcontrollers

In this section, we present experiments on a Cortex-M 32-bit microcontroller that embeds an 8-bit quantized neural network deployed with NNoM, an open-source deployment library (Neural Network on Microcontrollers [46]). Our attack vector (a BFA-like attack) and fault model enable the evaluation of the robustness of a model against an advanced adversary who aims at significantly altering its performance with a very limited number of faults.

Working on a 32-bit microcontroller with SRAM and Flash memory, we cannot exploit the RowHammer technique. A state-of-the-art means of fault injection, used in most certification and security testing labs, is laser fault injection (LFI). We considered an accurate fault model relevant for laser injection, previously explained and demonstrated on the Flash memory of Cortex-M MCUs by Colombier and Menu [47, 48]: the bit-set fault model, which consists of setting a targeted bit to logical 1. When targeting a Flash memory at read time, the induced bit-set is transient: it affects the data being read at that time while the stored value is left unmodified. LFI is a very accurate technique, both temporally and spatially. Depending on the laser spot diameter, up to two adjacent bits can be faulted simultaneously [48].
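Under this fault model, a word read from Flash can be modeled as the stored value OR-ed with a mask selecting the targeted bit, as in the short sketch below (our own formulation).

```python
def bit_set(read_value: int, bit: int) -> int:
    """Transient bit-set fault model: the bit is forced to logical 1 on the value
    being read, while the value stored in Flash remains unmodified."""
    return read_value | (1 << bit)

# Example: an 8-bit quantized weight 0b00000011 read while its MSB is targeted
# is seen by the inference as bit_set(0b00000011, 7) == 0b10000011.
```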

Figure 5b shows the laser setup we used for these experiments. The laser platform combines two independent near-infrared laser sources focused through the same lens (spot diameter ranging from 1.5 to 15 \(\upmu \)m) with a maximum power of 1,700 mW. An infrared camera is used to observe the spot location on the target, and an XY stage enables moving the objective over the entire surface of the target device.

Since faulting every bit of every parameter stored in memory may be impractical, we leverage the BFA principle to select the most sensitive bits to target with the laser and adapt it to our laser fault model (i.e., bit-set). As recommended in [42], we also adapted the adversarial objective by introducing an adversarial budget (20 bit-sets) representing the maximum number of faults we are able to perform.

We implement an MLP model trained on the standard digit recognition dataset (MNIST [49]). We reduced the complexity of our model by compressing the input data (784-pixel grayscale images) from \(\mathbb {R}^{784}\) to \(\mathbb {R}^{50}\) with principal component analysis. The model has one intermediate layer of 10 neurons and ReLU as activation function. The resulting model has 620 trainable parameters (including biases). After training, the model reaches 92% accuracy on the test set.
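For reference, a sketch of the evaluated architecture is given below (the framework and layer names are our assumptions; the parameter count matches the 620 trainable parameters mentioned above).

```python
import torch.nn as nn

# PCA-compressed MNIST inputs (784 -> 50), one hidden layer of 10 neurons with ReLU,
# and a 10-class output: 50*10+10 + 10*10+10 = 620 trainable parameters.
mlp = nn.Sequential(
    nn.Linear(50, 10),
    nn.ReLU(),
    nn.Linear(10, 10),
)
```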

Fig. 6

Experimental and simulation results of a laser fault injection (LFI) attack on an MLP model trained on MNIST, targeting the 20 most sensitive MSBs of the 2nd weight column

We ran a BFA simulation over all the weight columns and bit lines of the Flash memory, which pointed out the most significant bits of the second weight column as the most sensitive. The model accuracy was evaluated over 100 inferences. The blue curve in Fig. 6 represents our experimental results, while the red one shows the BFA simulations for the most significant bits. First, we can notice that experimental and simulation results are quite similar, meaning that we can guide our LFI evaluation with high reliability and confidence.

The fact that the experimental results are slightly more powerful than the simulations may be explained by the impact of the width of the laser spot on nearby memory cells. We observe that for an adversarial budget of only 5 bit-sets (0.1% of the bits faulted), the embedded model accuracy drops to 39%, which represents a significant loss and a strong integrity impact compared to the nominal performance of 92.5%. After 10 bit-sets (accuracy at 25%), the most effective faults have been injected and the accuracy does not decrease anymore: this positions the level of robustness of the model according to the adversarial budget.

5 Protecting ML Systems

In this section, we discuss ways of defending ML-based systems against the different threats presented in the previous sections. Before focusing on the specific protections and countermeasures available in the literature to thwart algorithmic and physical attacks, we first examine the protection of an ML system at a wider level. Indeed, in many IoT applications, an ML inference program is a critical part of an overall system that strongly interacts with other subsystems. Therefore, as in many critical information systems, strong integrity requirements apply to an ML inference to guarantee the authenticity of the inference process as well as of its inputs and outputs.

5.1 Embedded Authentication Mechanism

The traditional security model of a cyber-physical system is composed of several successive layers. The first layer consists of integrating in-depth physical countermeasures into the electronic design. The second layer is dedicated to reducing the attack surface to protect the ML system against cyber-security attacks. However, vulnerabilities still exist that may be exploited by advanced attackers. This is why the third layer consists of integrating anomaly detection mechanisms, and the fourth layer aims at tracing the events and the behavior of the system.

In [50], Paulin et al. propose dealing with the detection and traceability needs of an ML inference process at the device level. The platform, named HistoTrust, integrates the ML system on a System-on-Module (SoM) composed of two microcontrollers and hardware security components, such as a secure enclave for a Trusted Execution Environment (TEE) and a Trusted Platform Module (TPM), as illustrated and detailed in Fig. 7. This enables the integration of embedded security mechanisms as close as possible to the ML system.

Fig. 7

Design of the HistoTrust System-on-Module, which includes a TPM (ST33), a Cortex-A7 and a Cortex-M4 in an STM32MP1 platform [50]

HistoTrust embeds an attestation scheme, based on the TPM2 attestation principle, that provides evidence of the events and inferences issued by the embedded neural network. This evidence is integrated into transactions sent to an Ethereum blockchain [51].

This makes it possible to trace the inference outputs of the embedded AI and the events that may modify its behavior, and to monitor the integrity of the embedded neural network model. Each transaction authenticates the issuer device and the embedded AI in a history known to be immutable. In this way, decisions or choices resulting from the usage of an embedded neural network are authenticated through the use of secure hardware components and can be justified. This is useful, in the event of a failure, to understand the behavior of the AI and to attribute accountability to those who trained, configured, embedded or used it.

A typical scenario for such a device-centered authentication scheme is the integrity audit of a system composed of several embedded ML systems hosted in devices belonging to different stakeholders. In such a scenario, IoT devices perform a task thanks to an ML-based algorithm (in [50], inferences of a classical convolutional neural network model). A detected anomaly is linked to a detection timestamp that corresponds to a record in the ledger. Thus, we can consider the history of transactions starting from the detection timestamp. Each transaction/attestation embeds the authentication code (account address) of the device as well as the hash of the data produced by this device at the transaction timestamp. An independent auditor may ask each stakeholder to provide the certified raw data produced by their devices. The auditor checks that these data are authentic and not corrupted thanks to the transactions/attestations recorded in the ledger. These data can then be safely used and exploited to further investigate the source of the anomaly and potentially unveil integrity or authenticity breaches.
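As a purely illustrative sketch (the actual HistoTrust transaction format is not reproduced here; the field layout and function names are our assumptions), the audit boils down to recomputing a digest binding the device, the produced data and the timestamp, and comparing it with the one recorded in the ledger.

```python
import hashlib

def attestation_digest(device_account: str, payload: bytes, timestamp: int) -> str:
    """Toy digest binding a device, the data it produced and the timestamp."""
    material = device_account.encode() + payload + timestamp.to_bytes(8, "big")
    return hashlib.sha256(material).hexdigest()

def audit(device_account: str, payload: bytes, timestamp: int, ledger_digest: str) -> bool:
    # The auditor recomputes the digest from the certified raw data provided by the
    # stakeholder and checks it against the immutable record in the ledger.
    return attestation_digest(device_account, payload, timestamp) == ledger_digest
```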

5.2 Main Defenses Against Algorithmic Attacks

A first class of defenses encompasses all the input pre-processing techniques that basically aim at monitoring the inputs fed to the models, for example by excluding statistical outliers. Another approach is to purify the inputs or, on the contrary, to drown out potential alterations of the inputs.

A second class of approaches gathers so-called hiding techniques. In many API-based attacks, adversaries benefit from information provided by the system that may not be necessary to correctly achieve the task (such as all the prediction scores instead of only the predicted label).

Another important defense scheme is based on model hardening at training time. For example, differential privacy [52] and adversarial training [17] are standard ways to make models more robust against privacy- and integrity-based attacks. Moreover, when the characteristics and requirements of the system allow it, the use of ensemble methods is known to be a good strategy to weaken the impact of attacks.
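As an illustration of model hardening, the sketch below shows one adversarial-training step (PyTorch is our assumption; a single FGSM-like inner step is used for brevity, whereas [17] uses PGD for the inner maximization).

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=8/255):
    """Minimal sketch: train on worst-case perturbed inputs (inputs assumed in [0,1])."""
    x = x.clone().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x), y), x)[0]
    x_adv = (x + eps * grad.sign()).clamp(0, 1).detach()   # inner maximization (one step)
    optimizer.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()            # outer minimization on x_adv
    optimizer.step()
```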

5.3 Countermeasures Against Physical Attacks

To thwart timing analysis, some simple countermeasures are available, such as the flush-to-zero mode offered by many embedded platforms (e.g., ARM Cortex-M cores), which turns subnormal values into zeros, or favoring constant-time implementations of neural network primitives such as the ReLU activation function.
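For instance, a branch-free (hence more timing-uniform) ReLU can be written with a mask derived from the sign bit instead of a data-dependent branch, as in the sketch below for a signed fixed-point value (a simplified illustration in Python; a real deployment would apply the same idea in the embedded inference code).

```python
def relu_branch_free(x_q: int, width: int = 32) -> int:
    """Branch-free ReLU sketch for a two's-complement fixed-point value."""
    sign = (x_q >> (width - 1)) & 1   # 1 if negative, 0 otherwise
    mask = sign - 1                   # all-ones if positive or zero, 0 if negative
    return x_q & mask                 # negative inputs are mapped to 0
```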

For SCA-based parameter extraction, randomization may significantly increase the complexity of an attack and may take place at different levels. Traditional hiding techniques encompass the use of random dummy instructions that desynchronize the traces. Similarly, at the neuron level, the weighted sum between the parameters and the inputs as well as the addition of the bias can be processed in a randomized order (changed at each inference). Moreover, additional noise can be efficiently applied (i.e., with minor influence on the accuracy of the model) to bring uncertainty into the input layer values, which are important knowledge for the adversary when making hypotheses on the secret values (the parameters).
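A sketch of the neuron-level shuffling countermeasure is given below (our own illustration): the accumulation order of the weighted sum is redrawn at each inference so that traces are desynchronized across executions.

```python
import random

def neuron_forward_shuffled(weights, inputs, bias):
    """Hiding sketch: accumulate the dot product in a fresh random order per inference."""
    order = list(range(len(weights)))
    random.shuffle(order)                 # new order at every call
    acc = bias
    for i in order:
        acc += weights[i] * inputs[i]
    return max(0.0, acc)                  # ReLU
```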

In addition to traditional countermeasures against fault injection [53], specific defense schemes against the BFA encompass weight clipping, model pruning [54], clustering-based quantization [36, 55], code-based detectors [56] and adversarial training [39]. The practical evaluation of these defenses against fault injection means such as RowHammer, glitching or LFI is an important direction for future research efforts. As for adversarial examples (for which many defenses have regularly been broken afterwards), the definition of proper and sound evaluations of defenses against parameter-based attacks on embedded ML models is a research action of the highest importance.

6 Conclusion

The large-scale development and deployment of IoT systems relying on Artificial Intelligence solutions raises critical security issues that highlight two urgent needs. First, ML practitioners have to define security requirements at the early stages of their design and development processes: with so many demonstrated attacks at every stage of the ML pipeline, ML security breaches are unlikely to be simply patched afterwards as in the standard security fairy tales. Second, despite major efforts from the ML security community, the evaluation methodologies used to properly assess the impact of attacks, the robustness of models, and the benefits of defenses have to be strengthened and disseminated. The upcoming challenges for the security of machine learning models and systems essentially concern the design and development of robust defenses, including certifications or guarantees, as well as the consideration of more realistic attacks and threat models. Indeed, for now, most state-of-the-art research efforts are model-centered, whereas real-world applications need to consider security at a global, system-wide scale. These challenges could be overcome by joint efforts between the adversarial machine learning community and AI-system stakeholders (designers, developers, end-users).