1 Introduction

While deep neural networks (DNNs) have achieved notable success in various application domains [5, 31], their deployment on resource-constrained, embedded, real-time systems is currently impeded by their substantial demand for computing and storage resources [27]. Quantization is one of the most popular and promising techniques to address this issue [8, 39]. By converting the full-precision values in a DNN (such as parameters and/or activation values) into low bit-width fixed-point numbers, quantization compresses the network, yielding a quantized neural network (QNN) that is considerably more efficient.

While many techniques have been proposed to minimize the accuracy loss induced by quantization [8, 15, 21, 22, 32, 33, 42, 44, 48], an important side-effect of quantization has been overlooked: the risk of breaking desired critical properties, e.g., robustness [24, 41] and backdoor-freeness [13, 26, 34, 55]. This raises serious concerns, especially when quantized networks are deployed in safety-critical applications. While quantization-aware training techniques have been proposed to improve the robustness for a given fixed quantization strategy [23, 24, 41, 43], they fail to provide robustness guarantees. Therefore, it becomes imperative to devise a quantization strategy synthesis technique that ensures the resulting QNNs retain the desired properties. Note that although various verification methods for QNNs have been proposed [3, 9, 12, 52,53,54], they exclusively focus on post-hoc analysis rather than synthesis, namely, these methods merely verify or falsify the properties but offer no remedy for those that are falsified.

Fig. 1.

Visualized data distribution shift using 400 random samples centered around an input image. These inputs are processed through both a DNN (trained on MNIST [20]) and its counterparts quantized with bit-width \(Q\in \{4,6,8,10\}\). The resulting high-dimensional convex shapes are visualized in 2D. The blue and brown scatter points show the distributions of the output values of each affine layer of the DNN and the QNNs. (Color figure online)

Contributions. In this work, we propose the first quantization strategy synthesis method, named Quadapter, such that the desired properties are verifiably maintained by the quantization. Given a DNN \({\mathcal {N}}\) and a property \(\langle {\mathcal {I}},{\mathcal {O}}\rangle \) where \({\mathcal {I}}\) and \({\mathcal {O}}\) are the pre- and post-condition for the input and output, our general idea is first to compute the preimage of each layer w.r.t. the output region formed by \({\mathcal {O}}\). Then, considering the typical data distribution shift caused by quantization in each layer (cf. Fig. 1), we identify the minimal bit-width for each layer such that the shifted quantized reachable region w.r.t. \({\mathcal {I}}\) always remains within the corresponding preimage. This method allows us to derive a quantization strategy for the entire network, preserving the desired property \(\langle {\mathcal {I}}, {\mathcal {O}}\rangle \) after quantization.

A key technical question is how to represent and compute the preimage for each layer effectively and efficiently. In this work, we propose to compute an under-approximation of the preimage for each layer and represent it by adapting the abstract domain of DeepPoly [40]. Specifically, we devise a novel Mixed Integer Linear Programming (MILP) based method to propagate the (approximate) preimage layer-by-layer in a backward fashion, where we encode the affine transformations and activation functions precisely as linear constraints and compute the under-approximate preimages via MILP solving.

We implement our methods as an end-to-end tool Quadapter and extensively evaluate our tool on a large set of synthesis tasks for DNNs trained using two widely used datasets MNIST [20] and Fashion-MNIST [46], where the number of hidden layers varies from 2 to 6 and the number of neurons in each hidden layer varies from 100 to 512. The experimental results demonstrate the effectiveness and efficiency of Quadapter in synthesizing certified quantization strategies to preserve robustness and backdoor-freeness. The quantization strategy synthesized by Quadapter generally preserves the accuracy of the original DNNs (with only minor degradation). We also show that by slightly relaxing the under-approximate preimages of the hidden layers (without sacrificing the overall soundness), Quadapter can synthesize quantization strategies with much smaller bit-widths while preserving the desired properties and accuracy.

The remainder of this paper is organized as follows. Section 2 gives the preliminaries and formulates the problem. Section 3 presents the details of our approach and Sect. 4 demonstrates its applications. Section 5 reports our experimental results. We discuss related work in Sect. 6 and finally, Sect. 7 concludes. The source code for our tool, along with the benchmarks, is available in [50], which also includes a long version of the paper containing all missing proofs, design choices, implementation details, and additional experimental results.

2 Preliminaries

We denote by \(\mathbb {R}\) the set of real numbers. Given an integer n, let \([n]:=\{1,\ldots , n\}\) and \(\mathbb {R}^n\) be the set of the n-tuples of real numbers. We use bold lowercase letters (e.g., \({\textbf{x}}\)) and bold uppercase letters (e.g., \({\textbf{W}}\)) to denote vectors and matrices, respectively. We denote by \({\textbf{W}}_{i,:}\) (resp. \({\textbf{W}}_{:,i}\)) the i-th row (resp. column) vector of the matrix \({\textbf{W}}\), and by \({\textbf{x}}_j\) (resp. \({\textbf{W}}_{i,j}\)) the j-th entry of the vector \({\textbf{x}}\) (resp. \({\textbf{W}}_{i,:}\)). Throughout, M denotes a sufficiently large positive constant (the big-M).

A Deep Neural Network (DNN) with 2d layers is a function \({\mathcal {N}}:\mathbb {R}^{n_0}\rightarrow \mathbb {R}^{n_{2d}}\) such that \({\mathcal {N}}=f_{2d}\circ \cdots \circ f_1\), where \(f_1:\mathbb {R}^{n_0}\rightarrow \mathbb {R}^{n_1}\) is the input layer, \(f_{2d}:\mathbb {R}^{n_{2d-1}}\rightarrow \mathbb {R}^{n_{2d}}\) is the output layer, and the others are hidden layers. The hidden layers alternate between affine layers \(f_{2i}:\mathbb {R}^{n_{2i-1}}\rightarrow \mathbb {R}^{n_{2i}}\) and activation layers \(f_{2i+1}:\mathbb {R}^{n_{2i}}\rightarrow \mathbb {R}^{n_{2i+1}}\) for \(i\in [d-1]\). The semantics of each layer is defined as follows: \({\textbf{x}}^1 = f_1({\textbf{x}})={\textbf{x}}\), \({\textbf{x}}^{2i}=f_{2i}({\textbf{x}}^{2i-1})={\textbf{W}}^{2i}{\textbf{x}}^{2i-1}+{\textbf{b}}^{2i}\) for \(i\in [d]\) and \({\textbf{x}}^{2i+1}=f_{2i+1}({\textbf{x}}^{2i})=\text {ReLU}({\textbf{x}}^{2i})\) for \(i\in [d-1]\), where \({\textbf{W}}^{2i}\) and \({\textbf{b}}^{2i}\) are the weight matrix and the bias vector of the 2i-th layer, \(n_0=n_1\) and \(n_{2i}=n_{2i+1}\) for \(i\in [d-1]\). Note that for the sake of presentation, we regard affine and activation layers separately as hidden layers, whereas some prior work regards the composition of an affine layer and an activation layer as one hidden layer, e.g., [4, 25, 38]. Given a DNN \({\mathcal {N}}\) with 2d layers, we use \({\mathcal {N}}_{[i:j]}:\mathbb {R}^{n_{i-1}}\rightarrow \mathbb {R}^{n_{j}}\) to denote the composed function \(f_{j}\circ \cdots \circ f_{i}\). By \({\mathcal {N}}({\mathcal {I}})\) (resp. \({\mathcal {N}}({\mathcal {I}})_g\)), we refer to the output region of the network \({\mathcal {N}}\) (resp. neuron \({\textbf{x}}^{2d}_g\)) w.r.t. the input region \({\mathcal {I}}\).
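To make this layer indexing concrete, the following minimal sketch (our own illustration, not code from the paper) evaluates such a ReLU network, where the last affine layer plays the role of the output layer \(f_{2d}\):

```python
import numpy as np

def forward(weights, biases, x):
    """Evaluate a ReLU DNN given as sequences of NumPy weight matrices W^2, W^4, ...
    and bias vectors b^2, b^4, ... (one pair per affine layer)."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = W @ x + b                  # affine layer f_{2i}
        if i < len(weights) - 1:       # hidden activation layer f_{2i+1}
            x = np.maximum(x, 0.0)     # ReLU; the output layer has no activation
    return x
```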

A Quantized Neural Network (QNN) is structurally identical to a DNN but uses fixed-point values for its parameters and/or layer outputs. In this work, we focus on QNNs where only parameters are quantized using the most hardware-efficient quantization scheme, i.e., signed power-of-two quantization [33].

A quantization configuration \(\xi \) is a pair \(\langle Q, F\rangle \), where Q denotes the total bit-width and F denotes the bit-width for the fractional part of the value. Given a quantization configuration \(\xi \) and a real-valued number u, its fixed-point counterpart \(\hat{u}\) is defined as \(\hat{u} = \text {min}(\text {max}(\frac{\lfloor u\cdot 2^F\rceil }{2^F}, -2^{Q-1}), 2^{Q-1}-1)\), where \(\lfloor \cdot \rceil \) is the round-to-nearest operator. Given a DNN \({\mathcal {N}}:\mathbb {R}^{n_0}\rightarrow \mathbb {R}^{n_{2d}}\) with 2d layers and a set of quantization configurations for affine and output layers \(\varXi = \{\xi _1, \ldots ,\xi _{d}\}\), its quantized version \(\widehat{{\mathcal {N}}}:\mathbb {R}^{n_0}\rightarrow \mathbb {R}^{n_{2d}}\) is a composed function as \(\widehat{{\mathcal {N}}}=\hat{f}_{2d}\circ \cdots \circ \hat{f}_1\), where each layer is defined the same as that in the DNN \({\mathcal {N}}\) except that the parameters \({\textbf{W}}^{2i}\) and \({\textbf{b}}^{2i}\) for \(i\in [d]\) from the DNN \({\mathcal {N}}\) are quantized into fixed-point values \(\widehat{{\textbf{W}}}^{2i}\) and \(\widehat{{\textbf{b}}}^{2i}\) in the QNN \(\widehat{{\mathcal {N}}}\) according to the quantization configuration \(\xi _i\). In this work, we call the set \(\varXi \) a quantization strategy of the DNN \({\mathcal {N}}\).
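For illustration, the quantization of a single value and of one affine layer under a configuration \(\xi =\langle Q,F\rangle \) can be sketched as follows (a minimal Python sketch following the definition above; round-half-up stands in for the round-to-nearest operator \(\lfloor \cdot \rceil \)):

```python
import math

def quantize_value(u: float, Q: int, F: int) -> float:
    """Fixed-point counterpart of u under configuration <Q, F>: round to the
    nearest multiple of 2^-F, then clamp as in the definition above."""
    scaled = math.floor(u * 2**F + 0.5) / 2**F
    return min(max(scaled, -2**(Q - 1)), 2**(Q - 1) - 1)

def quantize_affine_layer(W, b, Q: int, F: int):
    """Quantize the weight matrix and bias vector of one affine layer element-wise."""
    W_hat = [[quantize_value(w, Q, F) for w in row] for row in W]
    b_hat = [quantize_value(v, Q, F) for v in b]
    return W_hat, b_hat
```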

Definition 1

Given a DNN \({\mathcal {N}}:\mathbb {R}^{n_0}\rightarrow \mathbb {R}^{n_{2d}}\), a property of \({\mathcal {N}}\) is a pair \(\langle \phi , \psi \rangle \) where \(\phi \) is a pre-condition over the input \({\textbf{x}}\in \mathbb {R}^{n_0}\) and \(\psi \) is a post-condition over the output \({\textbf{y}}={\mathcal {N}}({\textbf{x}})\in \mathbb {R}^{n_{2d}}\). \({\mathcal {N}}\) satisfies the property \(\langle \phi , \psi \rangle \), denoted by \({\mathcal {N}}\models \langle \phi ,\psi \rangle \), if \(\phi ({\textbf{x}})\Rightarrow \psi ({\mathcal {N}}({\textbf{x}}))\) holds for any input \({\textbf{x}}\in \mathbb {R}^{n_0}\).

Following prior work [49], we assume that the pre-condition \(\phi \) and post-condition \(\psi \) are expressible as polyhedra, denoted by \({\mathcal {I}}\) and \({\mathcal {O}}\), respectively. This assumption is reasonable since, for typical properties such as robustness, both conditions can be effectively represented by a set of linear constraints. For simplicity, we will use \(\langle {\mathcal {I}},{\mathcal {O}}\rangle \) to denote the property directly. We are now ready to define our problem.

Definition 2

Given a DNN \({\mathcal {N}}\) and a property \(\langle {\mathcal {I}},{\mathcal {O}}\rangle \) such that \({\mathcal {N}}\models \langle {\mathcal {I}},{\mathcal {O}}\rangle \), the problem of certified quantization strategy synthesis is to find a quantization strategy \(\varXi \) such that \(\widehat{{\mathcal {N}}}\models \langle {\mathcal {I}},{\mathcal {O}}\rangle \), where \(\widehat{{\mathcal {N}}}\) is the QNN obtained from \({\mathcal {N}}\) under the quantization strategy \(\varXi \).

Review of DeepPoly. The core idea of DeepPoly is to (approximately) represent the transformation of each layer using an abstract transformer and to compute lower/upper bounds for the output of each neuron. For a neuron \({\textbf{x}}^i_j\), its abstract element \({\mathcal {A}}^{i,\sharp }_j\) is given by a tuple \(\langle {\textbf{a}}^{i,{\le }}_{j}, {\textbf{a}}^{i,{\ge }}_j, l^i_j,u^i_j\rangle \), where \({\textbf{a}}^{i,{\le }}_{j}\) (resp. \({\textbf{a}}^{i,{\ge }}_{j}\)) is a symbolic lower (resp. upper) bound in the form of a linear combination of variables from its preceding layers, and \(l^i_j\) (resp. \(u^i_j\)) is the concrete lower (resp. upper) bound of \({\textbf{x}}^i_j\). We denote by \({\textbf{a}}^{i,\le }\) (resp. \({\textbf{a}}^{i,\ge }\)) the vector of symbolic bounds \({\textbf{a}}^{i,\le }_j\) (resp. \({\textbf{a}}^{i,\ge }_j\)) of the neurons \({\textbf{x}}^i_j\)’s in the i-th layer. The concretization of \({\mathcal {A}}^{i,\sharp }_j\) is defined as \(\gamma ({\mathcal {A}}^{i,\sharp }_j)=\{{\textbf{x}}^{i}_j\in \mathbb {R}\mid {\textbf{a}}^{i,{\le }}_j \le {\textbf{x}}^{i}_j \le {\textbf{a}}^{i,{\ge }}_j\}\). By repeatedly substituting each variable \(x_{j'}^{i'}\) in \({\textbf{a}}^{i,{\le }}_{j}\) (resp. \({\textbf{a}}^{i,{\ge }}_{j}\)) with \({\textbf{a}}^{i',{\le }}_{j'}\) or \({\textbf{a}}^{i',{\ge }}_{j'}\), depending on the sign of the coefficient of \(x_{j'}^{i'}\), until no further substitution is possible, \({\textbf{a}}^{i,{\le }}_{j}\) (resp. \({\textbf{a}}^{i,{\ge }}_{j}\)) becomes a linear combination over the input variables of the DNN. We denote by \(f^{i,{\le }}_j\) and \(f^{i,{\ge }}_j\) the resulting linear combinations of \({\textbf{a}}^{i,{\le }}_{j}\) and \({\textbf{a}}^{i,{\ge }}_{j}\). Then, the concrete lower bound \(l^i_j\) (resp. concrete upper bound \(u^i_j\)) of the neuron \({\textbf{x}}^i_j\) can be derived from the input region \({\mathcal {I}}\) and \(f^{i,{\le }}_j\) (resp. \(f^{i,{\ge }}_j\)). All the abstract elements \({\mathcal {A}}^{i,\sharp }_j\) are required to satisfy the domain invariant: \(\gamma ({\mathcal {A}}^{i,\sharp }_j) \subseteq [l^i_j,u^i_j].\) We denote by \({\mathcal {A}}^{i}_j\) the abstract element \(\langle f^{i,{\le }}_j, f^{i,{\ge }}_j, l^i_j,u^i_j\rangle \). For an affine function \({\textbf{x}}^{i}= {\textbf {W}}^{i}{\textbf{x}}^{i-1}+{\textbf {b}}^i\), the abstract affine transformer sets \({\textbf{a}}^{i,{\le }}={\textbf{a}}^{i,{\ge }}={\textbf {W}}^{i}{\textbf{x}}^{i-1}+{\textbf {b}}^i\). Given the abstract element \({\mathcal {A}}^{i,\sharp }_j=\langle {\textbf{a}}^{i,{\le }}_{j}, {\textbf{a}}^{i,{\ge }}_j, l^i_j,u^i_j\rangle \) of the neuron \({\textbf{x}}^{i}_j\), the abstract element \({\mathcal {A}}^{i+1,\sharp }_j\) of the neuron \({\textbf{x}}^{i+1}_j=\text {ReLU}({\textbf{x}}_j^i)\) is obtained by one of the following three cases, where \(\lambda ^i_j=\frac{u^i_{j}}{u^i_j-l^i_j}\): i) if \(l^i_j \ge 0\), then \({\textbf{a}}_{j}^{i+1,{\le }}={\textbf{a}}_{j}^{i+1,{\ge }}={\textbf{x}}_j^i\), \(l^{i+1}_{j}=l^{i}_{j}\), \(u^{i+1}_{j}=u^i_{j}\); ii) if \(u^i_{j} \le 0\), then \({\textbf{a}}_{j}^{i+1,{\le }}={\textbf{a}}_{j}^{i+1,{\ge }}=l^{i+1}_{j}=u^{i+1}_{j}=0\); iii) if \(l^{i}_j u^i_{j}<0\), then \({\textbf{a}}_{j}^{i+1,{\ge }}=\lambda ^i_j ({\textbf{x}}^{i}_{j}-l^i_{j})\), \({\textbf{a}}_{j}^{i+1,{\le }}=\kappa \cdot {\textbf{x}}^i_{j}\) where \(\kappa \in \{0,1\}\) is chosen such that the area of the shape enclosed by \({\textbf{a}}_{j}^{i+1,{\le }}\) and \({\textbf{a}}_{j}^{i+1,{\ge }}\) is minimal, \(l^{i+1}_j=\kappa \cdot l^i_j\), and \(u^{i+1}_j=u^i_j\).
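As an illustration of the ReLU case split above, the following is a simplified re-implementation sketch (not the actual DeepPoly code); a symbolic bound is represented here only by its slope and intercept with respect to the pre-activation neuron \({\textbf{x}}^i_j\):

```python
def relu_abstract_transformer(l: float, u: float):
    """Return ((sl_lo, ic_lo), (sl_hi, ic_hi), l_new, u_new): symbolic lower/upper
    bounds of ReLU(x) as linear functions sl*x + ic of the pre-activation x with
    concrete bounds [l, u], plus the new concrete bounds (the three DeepPoly cases)."""
    if l >= 0:                                # i) always active: ReLU(x) = x
        return (1.0, 0.0), (1.0, 0.0), l, u
    if u <= 0:                                # ii) always inactive: ReLU(x) = 0
        return (0.0, 0.0), (0.0, 0.0), 0.0, 0.0
    lam = u / (u - l)                         # iii) unstable neuron
    upper = (lam, -lam * l)                   # upper bound lam * (x - l)
    kappa = 1.0 if u > -l else 0.0            # lower bound minimizing the enclosed area
    return (kappa, 0.0), upper, kappa * l, u
```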

3 Our Approach

In the following, we fix a DNN \({\mathcal {N}}\) with 2d layers and a property \(\langle {\mathcal {I}},{\mathcal {O}}\rangle \).

3.1 Foundation of Quadapter

Given a function f and an output set Y, the preimage \(f^{-1}(Y)\) of the output set Y under f is the set \(\{x\mid f(x)\in Y\}\). An under-approximation of \(f^{-1}(Y)\) is a set \({\mathcal {P}}\) such that \({\mathcal {P}}\subseteq f^{-1}(Y)\).

Definition 3

A set \(\mathfrak {P}=\{{\mathcal {P}}^{2i}\mid i\in [d-1]\}\) is an under-approximate preimage of the output region \({\mathcal {O}}\) for the DNN \({\mathcal {N}}\) if for every \(i\in [d-1]\), \({\mathcal {P}}^{2i}\subseteq {\mathcal {N}}_{[2i+1:2d]}^{-1}({\mathcal {O}})\).

Intuitively, \({\mathcal {P}}^{2i}\) is an under-approximation of the preimage of the output region \({\mathcal {O}}\) under the sub-network \({\mathcal {N}}_{[2i+1:2d]}\) starting at the activation layer \(f_{2i+1}\), i.e., a set of values of \({\textbf{x}}^{2i}\) that are guaranteed to be mapped into \({\mathcal {O}}\) by the remaining layers, and \({\mathcal {P}}^{2i}_j\) denotes its constraint on the neuron \({\textbf{x}}^{2i}_j\). Since the bit-widths are computed only for the affine layers, it suffices to keep in \(\mathfrak {P}\) the preimages at the outputs of the affine layers (i.e., the inputs of the activation layers); preimages at the outputs of the activation layers are excluded.

Proposition 1

Let \(\widehat{{\mathcal {N}}}^{2i}\) be a network obtained from \({\mathcal {N}}\) by quantizing the first 2i layers. If \(\mathfrak {P}=\{{\mathcal {P}}^{2i}\mid i\in [d-1]\}\) is an under-approximate preimage of the output region \({\mathcal {O}}\) for the DNN \({\mathcal {N}}\), then \(\widehat{{\mathcal {N}}}^{2i}_{[1:2i]}({\mathcal {I}}) \subseteq {\mathcal {P}}^{2i} \Rightarrow \widehat{{\mathcal {N}}}^{2i}\models \langle {\mathcal {I}},{\mathcal {O}}\rangle .\)    \(\square \)

Intuitively, Proposition 1 states that regardless of the quantization configurations of the first 2i layers, the property \(\langle {\mathcal {I}},{\mathcal {O}}\rangle \) is always preserved in the resulting QNN, as long as the reachable region of the quantized layer \(\hat{f}_{2i}\) w.r.t. the input region \({\mathcal {I}}\) remains within the preimage \({\mathcal {P}}^{2i}\). This proposition allows us to repeatedly compute a quantization configuration \(\xi _i\) for each layer \(f_{2i}\) \((i\in [d])\), from the first affine layer to the output layer, that guarantees the reachable region of each quantized layer \(\hat{f}_{2i}\) remains within its respective preimage \({\mathcal {P}}^{2i}\). Putting all the quantization configurations of the affine layers and the output layer together yields a quantization strategy \(\varXi \) that preserves the desired property \(\langle {\mathcal {I}},{\mathcal {O}}\rangle \).

However, it is non-trivial to compute the preimages \({\mathcal {N}}_{[2i+1:2d]}^{-1}({\mathcal {O}})\) for \(i\in [d-1]\) directly, as each involves the entire sub-network \({\mathcal {N}}_{[2i+1:2d]}\). To resolve this issue, we propose to iteratively compute a preimage \({\mathcal {P}}^{2i}\) of each activation layer \(f_{2i+1}\), starting from the output layer and proceeding back to the first activation layer, by analyzing the two-layer function \({\mathcal {N}}_{[2i+1:2i+2]}\) instead of the function \({\mathcal {N}}_{[2i+1:2d]}\), according to the following proposition.

Proposition 2

Let \(\mathfrak {P}=\{{\mathcal {P}}^{2i}\mid i\in [d-1]\}\) be a set such that for every \(i\in [d-1]\), i) if \(i=d-1\), \({\mathcal {P}}^{2i}\subseteq {\mathcal {N}}_{[2i+1:2i+2]}^{-1}({\mathcal {O}})\); ii) if \(i\le d-2\), \({\mathcal {P}}^{2i}\subseteq {\mathcal {N}}_{[2i+1:2i+2]}^{-1}({\mathcal {P}}^{2i+2})\). Then \(\mathfrak {P}\) is an under-approximate preimage of the output region \({\mathcal {O}}\) for the DNN \({\mathcal {N}}\).    \(\square \)

3.2 Overview of Quadapter

Fig. 2.

An overview of our method.

Let \({\mathcal {P}}^{2d}={\mathcal {O}}\). The overall workflow of Quadapter is depicted in Fig. 2 which consists of the following two steps:

  • Step 1: Preimage Computation. We first compute an under-approximate preimage \({\mathcal {P}}^{2d-2}\) for the output layer s.t. \({\mathcal {P}}^{2d-2}\subseteq {\mathcal {N}}_{[2d-1:2d]}^{-1}({\mathcal {O}})\), and then propagate it through the network until reaching the first affine layer. Finally, we obtain the under-approximate preimage \(\mathfrak {P}=\{{\mathcal {P}}^{2i}\mid i\in [d-1]\}\) for the DNN \({\mathcal {N}}\) (the yellow part);

  • Step 2: Forward Quantization. We then conduct a forward quantization procedure layer-by-layer to find a quantization configuration \(\xi _i=\langle Q_i, F_i\rangle \) with minimal bit-width \(Q_i\) for each layer \(f_{2i}\), ensuring that the reachable region characterized by the quantized abstract element \(\widehat{{\mathcal {A}}}^{2i}\) (the blue part) is included in the preimage \({\mathcal {P}}^{2i}\), i.e., \(\gamma (\widehat{{\mathcal {A}}}^{2i})\subseteq {\mathcal {P}}^{2i}\) for \(1\le i\le d\).

The overall algorithm is given in Algorithm 1. Given a DNN \({\mathcal {N}}\), a property \(\langle {\mathcal {I}},{\mathcal {O}}\rangle \), and the minimum (resp. maximum) fractional bit-width \(\mathfrak {B}_l\) (resp. \(\mathfrak {B}_u\)) for each layer, we first apply DeepPoly on the DNN \({\mathcal {N}}\) w.r.t. the input region \({\mathcal {I}}\) to obtain the abstract elements \({\mathcal {A}}^{2i}\) for \(i\in [d]\). Then, the first for-loop computes the preimage by invoking the function UnderPreImage\(({\mathcal {N}},{\mathcal {A}}^{2i},{\mathcal {P}}^{2i+2})\), which propagates \({\mathcal {P}}^{2i+2}\) to the preceding activation layer and returns the approximate preimage \({\mathcal {P}}^{2i}\) with \({\mathcal {P}}^{2i}\subseteq {\mathcal {N}}^{-1}_{[2i+1:2i+2]}({\mathcal {P}}^{2i+2})\). The second for-loop performs a forward quantization procedure, where the i-th iteration computes the quantization configuration \(\xi _i\) for layer \(f_{2i}\). First, we obtain the minimal bit-width \(\mathfrak {I}\) for the integer part of the weights and biases to prevent overflow. Then, we iterate through the possible configurations \(\check{\xi }_i=\langle F+\mathfrak {I}, F\rangle \) by varying the fractional bit-width F from the smallest one \(\mathfrak {B}_{l}\) to the largest one \(\mathfrak {B}_{u}\). For each \(F\in [\mathfrak {B}_{l},\mathfrak {B}_{u}]\), we compute a partially quantized DNN \(\widehat{{\mathcal {N}}}^{2i}\), where only the first i of the affine and output layers are quantized, using \(\xi _1,\cdots , \xi _{i-1},\check{\xi }_i\). Next, we apply DeepPoly on \(\widehat{{\mathcal {N}}}^{2i}_{[1:2i]}\) w.r.t. the input region \({\mathcal {I}}\) to obtain the abstract element \(\widehat{{\mathcal {A}}}^{2i}\) of the quantized layer \(\hat{f}_{2i}\), resulting in the reachable region shown as the blue part in Fig. 2. We then check whether this reachable region is contained in the preimage \({\mathcal {P}}^{2i}\), i.e., \(\gamma (\widehat{{\mathcal {A}}}^{2i})\subseteq {\mathcal {P}}^{2i}\). If this is the case, we update \(\xi _i\) as \(\check{\xi }_i\), stop the iteration, and proceed to find the quantization configuration \(\xi _{i+1}\) for the next layer \(f_{2i+2}\). If there is no such quantization configuration, we return UNKNOWN.
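The control flow of Algorithm 1 can be summarized by the following Python-style sketch; the helpers deep_poly, under_preimage, min_integer_bits, quantize_prefix, and contained_in are hypothetical placeholders for the components described above, so this is an outline rather than the actual implementation.

```python
def quadapter(N, I, O, B_l, B_u):
    """Outline of Algorithm 1: backward preimage computation followed by a
    forward, layer-by-layer search for minimal bit-widths."""
    d = N.num_affine_layers
    A = deep_poly(N, I)                       # abstract elements A^{2i}, i in [d]

    # Step 1: backward preimage computation (P^{2d} is the output region O).
    P = {2 * d: O}
    for i in range(d - 1, 0, -1):             # i = d-1, ..., 1
        P[2 * i] = under_preimage(N, A[2 * i], P[2 * i + 2])

    # Step 2: forward quantization.
    xi = {}                                   # quantization strategy Xi
    for i in range(1, d + 1):
        I_bits = min_integer_bits(N, i)       # integer bits preventing overflow
        for F in range(B_l, B_u + 1):         # candidate <Q, F> = <F + I_bits, F>
            cand = (F + I_bits, F)
            N_hat = quantize_prefix(N, {**xi, i: cand})   # quantize first i layers
            A_hat = deep_poly(N_hat, I, up_to_layer=2 * i)
            if contained_in(A_hat[2 * i], P[2 * i]):      # gamma(A_hat^{2i}) in P^{2i}
                xi[i] = cand
                break
        else:
            return "UNKNOWN"                  # no feasible configuration for layer f_{2i}
    return xi
```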

Below, we present the details of function UnderPreImage\(({\mathcal {N}},{\mathcal {A}}^{2i},{\mathcal {P}}^{2i+2})\) and the method of checking the condition \(\gamma (\widehat{{\mathcal {A}}}^{2i})\subseteq {\mathcal {P}}^{2i}\). We first introduce the template of preimage \({\mathcal {P}}^{2i}\) utilized in this work.


3.3 Template \({\mathcal {T}}^{2i}\) of Preimage \({\mathcal {P}}^{2i}\)

Given the abstract elements \({\mathcal {A}}^{2i}=\{{\mathcal {A}}^{2i}_j\mid j\in [n_{2i}]\}\) of the neurons in the layer \(f_{2i}\), where \({\mathcal {A}}^{2i}_j=\langle f^{2i,{\le }}_j, f^{2i,{\ge }}_j, l^{2i}_j, u^{2i}_j\rangle \), we define the template \({\mathcal {T}}^{2i}\) of the preimage \({\mathcal {P}}^{2i}\) as \(\bigwedge _{j\in [n_{2i}]}{\mathcal {T}}^{2i}_j\), where \({\mathcal {T}}^{2i}_j=\{{\textbf{x}}^{2i}_j\in \mathbb {R}\mid f^{2i,{\le }}_j-\alpha ^{2i}_j\le {\textbf{x}}^{2i}_j\le f^{2i,{\ge }}_j+\beta ^{2i}_j\}\), \(\alpha _j^{2i}=\beta _j^{2i}=(\frac{u^{2i}_j-l^{2i}_j}{2})\chi ^{2i}\), and \(\chi ^{2i}\) is an additional variable over the domain \(\mathbb {R}\). Intuitively, \({\mathcal {T}}^{2i}_j\) is a scaling of \({\mathcal {A}}^{2i}_j\) using the scaling variable \(\chi ^{2i}\) and step \(\frac{u^{2i}_j-l^{2i}_j}{2}\). Thus, \({\mathcal {T}}^{2i}_j\) coincides with the region characterized by \({\mathcal {A}}^{2i}_j\) when \(\chi ^{2i}=0\), and is a super-region (resp. sub-region) of it when \(\chi ^{2i}>0\) (resp. \(\chi ^{2i}<0\)).
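For instance, if a neuron has concrete bounds \(l^{2i}_j=-1\) and \(u^{2i}_j=3\) (values chosen purely for illustration), the scaling step is \(\frac{u^{2i}_j-l^{2i}_j}{2}=2\), so \(\chi ^{2i}=0.5\) relaxes both the symbolic lower and upper bounds of \({\mathcal {T}}^{2i}_j\) by 1, whereas \(\chi ^{2i}=-0.25\) tightens them by 0.5.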

3.4 Details of Function UnderPreImage

We present an MILP-based method to implement UnderPreImage\(({\mathcal {N}},{\mathcal {A}}^{2i},{\mathcal {P}}^{2i+2})\). Given the abstract element \({\mathcal {A}}^{2i}\) and the preimage \({\mathcal {P}}^{2i+2}\), we construct a maximization problem with objective function \(\chi ^{2i}\) subject to the constraint \({\mathcal {T}}^{2i}\subseteq {\mathcal {N}}_{[2i+1:2i+2]}^{-1}({\mathcal {P}}^{2i+2})\), where \({\mathcal {T}}^{2i}\) is the template of \({\mathcal {P}}^{2i}\) with the scaling variable \(\chi ^{2i}\). The optimal value of \(\chi ^{2i}\) yields the tightest under-approximate preimage \({\mathcal {P}}^{2i}\) such that \({\mathcal {P}}^{2i}\subseteq {\mathcal {N}}_{[2i+1:2i+2]}^{-1}({\mathcal {P}}^{2i+2})\). The key step is therefore to handle the constraint \({\mathcal {T}}^{2i}\subseteq {\mathcal {N}}_{[2i+1:2i+2]}^{-1}({\mathcal {P}}^{2i+2})\). We first express it as the following maximization problem:

$$\begin{aligned} \text {maximize} \ \chi ^{2i} \ \text {s.t.}\ {\mathcal {N}}_{[2i+1:2i+2]}({\mathcal {T}}^{2i}) \subseteq {\mathcal {P}}^{2i+2}. \end{aligned}$$
(1)

However, Problem (1) is not an MILP, due to the “forall”-type of constraints. To address this issue, we construct the following minimization problem:

$$\begin{aligned} \text {minimize} \ \chi ^{2i} \ \text {s.t. } \ {\textbf{x}}^{2i+2}\in {\mathcal {N}}_{[2i+1:2i+2]}({\mathcal {T}}^{2i}) \wedge {\textbf{x}}^{2i+2}\notin {\mathcal {P}}^{2i+2}. \end{aligned}$$
(2)

Intuitively, given the optimal solution \(\chi ^{2i,*}_{\text {min}}\) of Problem (2), we can always obtain a value for \(\chi ^{2i}\) by subtracting an extremely small value from \(\chi ^{2i,*}_{\text {min}}\). The resulting value of \(\chi ^{2i}\) is close to the optimal solution of Problem (1), within a negligible margin of error. Such a transformation to an "existential" constraint provides an alternative way of handling \({\mathcal {T}}^{2i}\subseteq {\mathcal {N}}_{[2i+1:2i+2]}^{-1}({\mathcal {P}}^{2i+2})\), allowing the problem to be effectively tackled within the MILP framework.

Suppose \({\mathcal {T}}^{2i}_j= \{ {\textbf{x}}^{2i}_j \in \mathbb {R}\mid f^{2i,{\le }}_j-\alpha ^{2i}_j \le {\textbf{x}}^{2i}_j \le f^{2i,{\ge }}_j+\beta ^{2i}_j\}\) for \(j\in [n_{2i}]\) and \({\mathcal {P}}^{2i+2}_k=\{ {\textbf{x}}^{2i+2}_k \in \mathbb {R}\mid f^{2i+2,{\le }}_k-a^{2i+2}_k \le {\textbf{x}}^{2i+2}_k \le f^{2i+2,{\ge }}_k+b^{2i+2}_k\}\) for \(k\in [n_{2i+2}]\) and \(i\le d-2\). We reformulate Problem (2) as the following MILP problem:

$$\begin{aligned} \text {minimize}\ \chi ^{2i} \ \text {s.t.} \ \varPsi _{\in {\mathcal {I}}} \cup \varPsi _{{\mathcal {T}}^{2i}} \cup \varPsi _{{\mathcal {T}}^{2i+1}} \cup \varPsi _{{\mathcal {T}}^{2i+2}} \cup \varPsi _{\notin {\mathcal {P}}^{2i+2}}, \end{aligned}$$
(3)

where \(\varPsi _{\in {\mathcal {I}}}\) and \(\varPsi _{\notin {\mathcal {P}}^{2d}}\), which entail \({\textbf{x}}\in {\mathcal {I}}\) and \({\textbf{x}}^{2d}\notin {\mathcal {P}}^{2d}\) respectively, will be given in Sect. 4, as they depend on the property \(\langle {\mathcal {I}},{\mathcal {O}}\rangle \). \(\varPsi _{{\mathcal {T}}^{2i}}\), \(\varPsi _{{\mathcal {T}}^{2i+1}}\), \(\varPsi _{{\mathcal {T}}^{2i+2}}\), and \(\varPsi _{\notin {\mathcal {P}}^{2i+2}}\) are defined as follows (\(\{\eta _j^{2i+1}, \eta _j^{2i+2}, \zeta _j^{2i+2}\}\) are Boolean variables):

  • \(\varPsi _{{\mathcal {T}}^{2i}} = \{f^{2i,{\le }}_j-\alpha ^{2i}_j \le {\textbf{x}}^{2i}_j \le f_j^{2i,{\ge }}+\beta ^{2i}_j \mid j\in [n_{2i}]\}\) expressing template \({\mathcal {T}}^{2i}\);

  • \(\varPsi _{{\mathcal {T}}^{2i+1}} = \{ {\textbf{x}}^{2i+1}_j \ge 0, {\textbf{x}}^{2i+1}_j \ge {\textbf{x}}^{2i}_j, {\textbf{x}}^{2i+1}_j\le {\textbf {M}}\cdot \eta _j^{2i+1}, {\textbf{x}}^{2i+1}_j \le {\textbf{x}}^{2i}_j+{\textbf {M}}\cdot (1-\eta _j^{2i+1}) \mid j\in [n_{2i+1}] \}\) encoding the activation layer \(f_{2i+1}\) (cf. [54]);

  • \(\varPsi _{{\mathcal {T}}^{2i+2}} = \{ {\textbf{x}}^{2i+2}_j = {\textbf{W}}^{2i+2}_{j,:} {\textbf{x}}^{2i+1}+{\textbf{b}}^{2i+2}_j\mid j\in [n_{2i+2}]\}\) encoding the affine layer \(f_{2i+2}\) (cf. [54]). Note that \(\varPsi _{{\mathcal {T}}^{2i}}\), \(\varPsi _{{\mathcal {T}}^{2i+1}}\) and \(\varPsi _{{\mathcal {T}}^{2i+2}}\) together express the condition \({\textbf{x}}^{2i+2}\in {\mathcal {N}}_{[2i+1:2i+2]}({\mathcal {T}}^{2i})\).

  • \(\varPsi _{\notin {\mathcal {P}}^{2i+2}}= \left\{ \begin{array}{c} {\textbf{x}}^{2i+2}_j > f^{2i+2,{\ge }}_j+b^{2i+2}_j + {\textbf {M}}\cdot (\eta _j^{2i+2}-1),\\ {\textbf{x}}^{2i+2}_j \le f^{2i+2,{\ge }}_j+b^{2i+2}_j + {\textbf {M}}\cdot \eta _j^{2i+2},\\ {\textbf{x}}^{2i+2}_j \ge f^{2i+2,{\le }}_j-a^{2i+2}_j - {\textbf {M}}\cdot \zeta _j^{2i+2},\\ {\textbf{x}}^{2i+2}_j < f^{2i+2,{\le }}_j-a^{2i+2}_j - {\textbf {M}}\cdot (\zeta _j^{2i+2}-1), \\ j\in [n_{2i+2}] \wedge \sum _{k=1}^{n_{2i+2}} \big ( \eta _k^{2i+2}+\zeta _k^{2i+2}\big ) \ge 1 \end{array}\right\} \) expressing the condition \({\textbf{x}}^{2i+2}\notin {\mathcal {P}}^{2i+2}\).
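To illustrate how Problem (3) can be posed to an MILP solver, the following is a minimal sketch using gurobipy (the solver used by Quadapter). For brevity, the symbolic bounds \(f^{2i,{\le }}_j\) and \(f^{2i,{\ge }}_j\) are replaced by concrete interval bounds (so \(\varPsi _{\in {\mathcal {I}}}\) is omitted), strict inequalities are approximated with a small eps, and a slightly simplified (but still sound for this existential query) variant of \(\varPsi _{\notin {\mathcal {P}}^{2i+2}}\) is used; all names are our own.

```python
import gurobipy as gp
from gurobipy import GRB

def min_escaping_chi(W, b, lo, hi, P_lo, P_hi, M=1e5, eps=1e-6):
    """Sketch of Problem (3) for one ReLU + affine step: minimize the scaling chi
    such that some point of the scaled template, pushed through ReLU and W x + b,
    escapes the box preimage [P_lo, P_hi]."""
    n_in, n_out = len(lo), len(P_lo)
    m = gp.Model("under_preimage")
    m.Params.OutputFlag = 0

    chi = m.addVar(lb=-1.0, ub=100.0, name="chi")               # bounds chosen for the sketch
    x_in = m.addVars(n_in, lb=-GRB.INFINITY, name="x2i")        # template layer x^{2i}
    x_act = m.addVars(n_in, lb=0.0, name="x2i1")                # ReLU outputs x^{2i+1}
    x_out = m.addVars(n_out, lb=-GRB.INFINITY, name="x2i2")     # affine outputs x^{2i+2}
    eta = m.addVars(n_in, vtype=GRB.BINARY, name="eta_relu")
    up = m.addVars(n_out, vtype=GRB.BINARY, name="eta_out")     # escapes above the box
    dn = m.addVars(n_out, vtype=GRB.BINARY, name="zeta_out")    # escapes below the box

    for j in range(n_in):
        step = (hi[j] - lo[j]) / 2.0                            # scaling step of T^{2i}_j
        m.addConstr(x_in[j] >= lo[j] - step * chi)              # Psi_{T^{2i}}
        m.addConstr(x_in[j] <= hi[j] + step * chi)
        m.addConstr(x_act[j] >= x_in[j])                        # Psi_{T^{2i+1}}: big-M ReLU
        m.addConstr(x_act[j] <= M * eta[j])
        m.addConstr(x_act[j] <= x_in[j] + M * (1 - eta[j]))

    for k in range(n_out):
        m.addConstr(x_out[k] ==                                 # Psi_{T^{2i+2}}: affine layer
                    gp.quicksum(W[k][j] * x_act[j] for j in range(n_in)) + b[k])
        m.addConstr(x_out[k] >= P_hi[k] + eps - M * (1 - up[k]))  # up[k]=1 => above the box
        m.addConstr(x_out[k] <= P_lo[k] - eps + M * (1 - dn[k]))  # dn[k]=1 => below the box
    m.addConstr(gp.quicksum(up[k] + dn[k] for k in range(n_out)) >= 1)  # at least one escape

    m.setObjective(chi, GRB.MINIMIZE)
    m.optimize()
    return chi.X if m.Status == GRB.OPTIMAL else None
```

The returned value plays the role of \(\chi ^{2i,*}_{\text {min}}\): subtracting a small margin from it gives a scaling for which the template stays inside the preimage, as discussed for Problem (2).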

Theorem 1

Problems (2) and (3) are equivalent.    \(\square \)

3.5 Checking \(\gamma (\widehat{{\mathcal {A}}}^{2i})\subseteq {\mathcal {P}}^{2i}\)

Fix the abstract elements \(\widehat{{\mathcal {A}}}^{2i}=\{\widehat{{\mathcal {A}}}^{2i}_j\mid j\in [n_{2i}]\}\) for the quantized layer \(\hat{f}_{2i}\) with \(\widehat{{\mathcal {A}}}^{2i}_j=\langle \hat{f}^{2i,{\le }}_j, \hat{f}^{2i,{\ge }}_j, \hat{l}^{2i}_j, \hat{u}^{2i}_j \rangle \). Then \(\gamma (\widehat{{\mathcal {A}}}^{2i}_j)=\{{\textbf{x}}^{2i}_j \in \mathbb {R}\mid \hat{f}^{2i,{\le }}_j\le {\textbf{x}}^{2i}_j \le \hat{f}^{2i,{\ge }}_j\}\). Let \({\mathcal {P}}^{2i}_j=\{{\textbf{x}}^{2i}_j \in \mathbb {R}\mid f^{2i,{\le }}_j-a^{2i}_j \le {\textbf{x}}^{2i}_j \le f^{2i,{\ge }}_j+b^{2i}_j\}\) for \(j\in [n_{2i}]\) be the preimage obtained by the function UnderPreImage for \(i\le d-1\), where \(a^{2i}_j\) and \(b^{2i}_j\) are real-valued numbers.

Since directly reformulating the check \(\gamma (\widehat{{\mathcal {A}}}^{2i})\subseteq {\mathcal {P}}^{2i}\) as an MILP problem is infeasible because it is a "forall"-type constraint, we instead check the negation of this statement.

Let \(\varPhi _{\notin {\mathcal {P}}^{2i}}\) be the following set of the linear constraints:

$$\begin{aligned} \varPhi _{\notin {\mathcal {P}}^{2i}}= \varPsi _{\in {\mathcal {I}}} \cup \left\{ \begin{array}{c} f^{2i,{\ge }}_j +b^{2i}_j +{\textbf {M}}\cdot (\eta _j^{2i}-1)< \hat{f}^{2i,{\ge }}_j \le f^{2i,{\ge }}_j+ b^{2i}_j +{\textbf {M}}\cdot \eta _j^{2i}, \\ f^{2i,{\le }}_j-a^{2i}_j - {\textbf {M}}\cdot \zeta _j^{2i}\le \hat{f}^{2i,{\le }}_j < f^{2i,{\le }}_j-a^{2i}_j - {\textbf {M}}\cdot (\zeta _j^{2i}-1), \\ j\in [n_{2i}], \qquad \sum _{k=1}^{n_{2i}} \big ( \eta _k^{2i}+\zeta _k^{2i}\big ) \ge 1 \end{array}\right\} \end{aligned}$$

where \(\eta ^{2i}_j\) and \(\zeta ^{2i}_j\) are two additional Boolean variables, and \(\varPsi _{\in {\mathcal {I}}}\) and \(\varPhi _{\notin {\mathcal {P}}^{2d}}\) will be given in Sect. 4 such that \(\varPsi _{\in {\mathcal {I}}}\) entails \({\textbf{x}}\in {\mathcal {I}}\) and \(\lnot \varPhi _{\notin {\mathcal {P}}^{2d}}\) entails \(\gamma (\widehat{{\mathcal {A}}}^{2d}) \subseteq {\mathcal {P}}^{2d}\) respectively, as they depend on the property \(\langle {\mathcal {I}},{\mathcal {O}}\rangle \).

Theorem 2

If \(\varPhi _{\notin {\mathcal {P}}^{2i}}\) does not hold, then \(\gamma (\widehat{{\mathcal {A}}}^{2i})\subseteq {\mathcal {P}}^{2i}\).    \(\square \)
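Operationally, Theorem 2 turns the containment check into a pure (in)feasibility query: one asserts \(\varPsi _{\in {\mathcal {I}}} \cup \varPhi _{\notin {\mathcal {P}}^{2i}}\) and asks the solver whether any counterexample exists. A sketch, assuming a gurobipy model m already populated with these constraints:

```python
from gurobipy import GRB

def region_contained(m) -> bool:
    """Return True iff the encoding of Phi_{notin P^{2i}} is infeasible, i.e.,
    no point of the quantized reachable region escapes the preimage P^{2i}."""
    m.optimize()                       # pure feasibility query, no objective needed
    return m.Status == GRB.INFEASIBLE
```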

4 Applications: Robustness and Backdoor-Freeness

4.1 Certified Quantization for Robustness

We use Algorithm 1 to synthesize quantization strategies for preserving robustness.

Definition 4

Let \({\mathcal {N}}:\mathbb {R}^{n_0}\rightarrow \mathbb {R}^{n_{2d}}\) be a DNN, \({\mathcal {I}}^r_{\textbf{u}}=\{{\textbf{x}}\in \mathbb {R}^{n_0} \mid ||{\textbf{x}}-{\textbf{u}}||_\infty \le r\}\) be a perturbation region around an input \({\textbf{u}}\in \mathbb {R}^{n_0}\), and \({\mathcal {O}}_g=\{{\textbf{x}}^{2d}\in \mathbb {R}^{n_{2d}}\mid \text {argmax}({\textbf{x}}^{2d})=g\}\) be the output region corresponding to a specific class g. Then, \(\langle {\mathcal {I}}^r_{\textbf{u}},{\mathcal {O}}_g\rangle \) is a (local) robustness property of the DNN \({\mathcal {N}}\).

We now give the encoding details that are not covered in Sect. 3, i.e., \(\varPsi _{\in {\mathcal {I}}}\) and \(\varPsi _{\notin {\mathcal {P}}^{2d}}\) in Problem (3), and \(\varPhi _{\notin {\mathcal {P}}^{2d}}\) in Sect. 3.5, for the property \(\langle {\mathcal {I}}^r_{\textbf{u}},{\mathcal {O}}_g\rangle \):

  • \(\varPsi _{\in {\mathcal {I}}}=\{\text {max}({\textbf{u}}_j-r,0)\le {\textbf{x}}_j\le \text {min}({\textbf{u}}_j+r,1)\mid j\in [n_0]\}\) specifying the feasible input range \({\mathcal {I}}^r_{\textbf{u}}\);

  • \(\varPsi _{\notin {\mathcal {P}}^{2d}}= \left\{ \begin{array}{c} {\textbf{x}}^{2d}_g + {\textbf {M}}\cdot (\eta _j^{2d}-1) \le {\textbf{x}}^{2d}_j \le {\textbf{x}}^{2d}_g + {\textbf {M}}\cdot \eta _j^{2d}, \\ j\in [n_{2d}]\setminus g, \qquad \sum _{k\in [n_{2d}]\setminus g} \eta _k^{2d} \ge 1 \end{array}\right\} \) stating \({\textbf{x}}^{2d}\notin {\mathcal {O}}_g\), i.e., \(\text {argmax}({\textbf{x}}^{2d})\ne g\), where \(\eta _j^{2d}\) is a Boolean variable;

  • \(\varPhi _{\notin {\mathcal {P}}^{2d}}=\left\{ \begin{array}{c} \hat{f}^{2d,{\le }}_g + {\textbf {M}}\cdot (\eta _j^{2d}-1) \le \hat{f}^{2d,{\ge }}_j \le \hat{f}^{2d,{\le }}_g + {\textbf {M}}\cdot \eta _j^{2d}, \\ j\in [n_{2d}]\setminus g, \qquad \sum _{k\in [n_{2d}]\setminus g} \eta _k^{2d} \ge 1 \end{array}\right\} \) whose unsatisfiability ensures \(\gamma (\widehat{{\mathcal {A}}}^{2d})\subseteq {\mathcal {O}}_g\), where \(\eta _j^{2d}\) is a Boolean variable.

The soundness of the algorithm is captured by the theorem below.

Theorem 3

\(\varPsi _{\in {\mathcal {I}}}\Leftrightarrow {\textbf{x}}\in {\mathcal {I}}^r_{\textbf{u}}\), \(\varPsi _{\notin {\mathcal {P}}^{2d}}\Leftrightarrow {\textbf{x}}^{2d}\notin {\mathcal {O}}_g\), \(\lnot \varPhi _{\notin {\mathcal {P}}^{2d}}\Rightarrow \gamma (\widehat{{\mathcal {A}}}^{2d})\subseteq {\mathcal {O}}_g\).    \(\square \)
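For concreteness, the robustness-specific constraint sets \(\varPsi _{\in {\mathcal {I}}}\) and \(\varPsi _{\notin {\mathcal {P}}^{2d}}\) can be added to an MILP model along the following lines (a gurobipy sketch with our own names; a slightly simplified but sufficient variant of the output disjunction is used):

```python
import gurobipy as gp
from gurobipy import GRB

def add_robustness_constraints(m, x_in, x_out, u, r, g, M=1e5):
    """Add Psi_in_I (L-infinity ball of radius r around u, clipped to [0, 1]) and
    Psi_notin_P^{2d} (some class j != g scores at least as high as g) to model m,
    where x_in / x_out are the MILP variables of the input / output layers."""
    n0, n2d = len(x_in), len(x_out)
    # Psi_in_I: pixel-wise perturbation bounds clipped to the valid input range.
    for j in range(n0):
        m.addConstr(x_in[j] >= max(u[j] - r, 0.0))
        m.addConstr(x_in[j] <= min(u[j] + r, 1.0))
    # Psi_notin_P^{2d}: at least one eta_j is set, and eta_j = 1 forces x_out[j] >= x_out[g].
    eta = m.addVars(n2d, vtype=GRB.BINARY, name="eta_2d")
    for j in range(n2d):
        if j != g:
            m.addConstr(x_out[j] >= x_out[g] + M * (eta[j] - 1))
    m.addConstr(gp.quicksum(eta[j] for j in range(n2d) if j != g) >= 1)
```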

4.2 Certified Quantization for Backdoor-Freeness

Given a DNN \({\mathcal {N}}:\mathbb {R}^{n_0}\rightarrow \mathbb {R}^{n_{2d}}\) and an input \({\textbf{u}}\in \mathbb {R}^{n_0}\), assume that the 2D shape of \({\textbf{u}}\) is a rectangle \((h_u,w_u)\) (i.e., \(n_0=h_u\times w_u\)). A backdoor trigger is any 2D input \({\textbf{s}}\in \mathbb {R}^{h_s\times w_s}\) of rectangular shape \((h_s,w_s)\) such that \(h_s\le h_u\) and \(w_s\le w_u\). We use \({\textbf{u}}[x,y]\) to denote the element in the x-th row and y-th column of the 2D input \({\textbf{u}}\). Let \((h_p,w_p)\) denote the position of (i.e., the top-left corner of) the trigger \({\textbf{s}}\) such that \(h_p+h_s\le h_u\) and \(w_p+w_s\le w_u\). Then, \({\textbf{u}}^{\textbf{s}}\) is the stamped input where \({\textbf{u}}^{\textbf{s}}[x,y]={\textbf{s}}[x-h_p,y-w_p]\) if \(h_p \le x \le h_p+h_s \wedge w_p \le y \le w_p+w_s\), and \({\textbf{u}}^{\textbf{s}}[x,y]={\textbf{u}}[x,y]\) otherwise.
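The stamping operation can be read as the following small helper (our own illustration, using half-open index ranges):

```python
import numpy as np

def stamp(u: np.ndarray, s: np.ndarray, h_p: int, w_p: int) -> np.ndarray:
    """Return u^s: a copy of the 2D input u with the trigger s written at the
    top-left position (h_p, w_p); all other pixels are left unchanged."""
    h_s, w_s = s.shape
    u_s = u.copy()
    u_s[h_p:h_p + h_s, w_p:w_p + w_s] = s
    return u_s
```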

Definition 5

Let \({\mathcal {N}}:\mathbb {R}^{n_0}\rightarrow \mathbb {R}^{n_{2d}}\) be a DNN, and let \((h_s,w_s)\), \((h_p,w_p)\), t, and \(\theta \) be the shape, position, target class, and attack success rate threshold of potential triggers. Then, the DNN \({\mathcal {N}}\) satisfies the backdoor-freeness property if there does not exist a backdoor trigger \({\textbf{s}}\) with an attack success rate of at least \(\theta \), i.e., such that the probability that \({\mathcal {N}}({\textbf{u}}^{{\textbf{s}}})=t\) for a randomly drawn input \({\textbf{u}}\in \mathbb {R}^{n_0}\) is at least \(\theta \) [37].

Given an input \({\textbf{u}}\in \mathbb {R}^{n_0}\), let \(\langle {\mathcal {I}}^B_{\textbf{u}}, {\mathcal {O}}^B_t\rangle \) be a property such that \({\mathcal {I}}^B_{\textbf{u}}=\{{\textbf{u}}^{\textbf{s}}\in \mathbb {R}^{n_0}\mid {\textbf{s}}\in \mathbb {R}^{h_s\times w_s}\) is any trigger at position \((h_p,w_p)\}\) and \({\mathcal {O}}^{B}_t=\{{\textbf{x}}^{2d}\in \mathbb {R}^{n_{2d}}\mid \text {argmax}({\textbf{x}}^{2d})\ne t\}\). Intuitively, \(\langle {\mathcal {I}}^B_{\textbf{u}}, {\mathcal {O}}^B_t\rangle \) entails that no trigger exists whereby the input \({\textbf{u}}\), once stamped, would be classified as class t.


The overall algorithm is given in Algorithm 2, which applies hypothesis testing, namely the SPRT algorithm [1], with type I/II errors \(\sigma \)/\(\varrho \) and a half-width \(\delta \) of the indifference region. The while-loop repeatedly selects a random set of K properties and, for each, collects the preimage with the highest value of the scaling variable of the first affine layer, along with the property, until one of the hypotheses is accepted. When the null hypothesis \(H_0\) is accepted (line 9), we try to find a shared quantization strategy for all the properties collected so far, following Algorithm 1, with an innermost for-loop that traverses all the properties. Due to space limitations, details of the hypothesis testing and the input parameters are given in [50].

Table 1. Benchmarks of DNNs on MNIST and Fashion-MNIST.

We now give the encoding details that are not covered in Sect. 3, i.e., \(\varPsi _{\in {\mathcal {I}}}\) and \(\varPsi _{\notin {\mathcal {P}}^{2d}}\) in Problem (3), and \(\varPhi _{\notin {\mathcal {P}}^{2d}}\) in Sect. 3.5 for the property \(\langle {\mathcal {I}}^B_{\textbf{u}},{\mathcal {O}}^B_t\rangle \):

  • \(\varPsi _{\in {\mathcal {I}}}=\left\{ \begin{array}{c} 0\le {\textbf{x}}[a,b]\le 1 \text { if } h_p\le a\le h_p+h_s \wedge w_p\le b\le w_p+w_s,\\ {\textbf{x}}[a,b]={\textbf{u}}[a,b] \text { otherwise} \end{array}\right\} \);

  • \(\varPsi _{\notin {\mathcal {P}}^{2d}}=\{{\textbf{x}}^{2d}_t \ge {\textbf{x}}^{2d}_j \mid j\in [n_{2d}]\}\);

  • \(\varPhi _{\notin {\mathcal {P}}^{2d}}= \{ \hat{f}^{2d,{\le }}_j \le \hat{f}^{2d,{\ge }}_t \mid j\in [n_{2d}]\setminus t\}\).

Theorem 4

(1) \(\varPsi _{\in {\mathcal {I}}}\Leftrightarrow {\textbf{x}}\in {\mathcal {I}}^B_{\textbf{u}}\), \(\varPsi _{\notin {\mathcal {P}}^{2d}}\Leftrightarrow {\textbf{x}}^{2d}\notin {\mathcal {O}}^B_t\), \(\lnot \varPhi _{\notin {\mathcal {P}}^{2d}}\Rightarrow \gamma (\widehat{{\mathcal {A}}}^{2d})\subseteq {\mathcal {O}}^B_t\), and (2) there is sufficient evidence (subject to type 1 error \(\sigma \) and type 2 error \(\varrho \)) that there are no backdoor attacks with the featured triggers within the QNN obtained by Algorithm 2.    \(\square \)

5 Evaluation

We have implemented our methods as a tool Quadapter with Gurobi [11] as the back-end MILP solver. To address the numerical stability problems caused by big-M, we use alternative formulations for the ReLU activation function and tighter bounds for the other big-M constants; details are given in [50]. All experiments are run on a machine with an Intel(R) Xeon(R) Platinum 8375C CPU@2.90GHz, using 30 threads in total. The time limit for each task is 2 h.

Benchmarks. We train 8 DNNs on the MNIST [20] and Fashion-MNIST [46] datasets, chosen for their popularity in previous verification studies on networks of comparable size [9, 12, 19, 36, 37]. To evaluate the performance of Quadapter, these DNNs vary in architecture; details are given in Table 1, where \(x\times y\) means that the network has x hidden layers and y neurons per hidden layer. Hereafter, we use \(\text {MP}x\) (resp. \(\text {FP}x\)) with \(x\in \{1,2,3,4\}\) to denote the network of architecture \(\text {P}x\) trained on MNIST (resp. Fashion-MNIST).

5.1 Performance of UnderPreImage Function

We evaluate the effectiveness and efficiency of the MILP-based method introduced in Sect. 3.4 for computing the under-approximate preimages of the DNNs \(\text {MP}x\) with \(x\in \{1,2,3,4\}\) for robustness properties. Specifically, we randomly select 50 inputs from the test set of MNIST and set the perturbation radius as \(r\in \{2,4\}\), resulting in a total of 400 robustness properties, each of which can be certified using DeepPoly. The time limit for each computation task is 2 h. We also implement an abstraction-based method (ABS) to compute the preimages for comparative analysis; details are given in [50].

Fig. 3.

Results of preimage computation.

The results are depicted in Fig. 3. The boxplot shows the distribution of the values of the scaling variables obtained by the two methods for each layer, where Ax and Mx denote the results of layer \(f_x\) obtained by the ABS and MILP methods, respectively. (Note that some Ax and Mx may be missing because the DNN has no \(f_x\) layer.) The table reports the average computation time in seconds, where (i) indicates the number of tasks that run out of time within 2 h. We find that, compared to the MILP method, the ABS method tends to obtain significantly smaller values for the scaling variables in earlier layers, albeit requiring less time. This is mainly attributed to the inherent over-approximation of the abstract transformers. Note that the scaling variable for the last affine layer returned by the ABS method is typically larger than that obtained via the MILP method. However, we argue that the scaling variables of the preceding layers are more significant, since larger values there make the subsequent forward quantization process more likely to succeed. Therefore, we opt for the MILP method to implement UnderPreImage, despite its longer execution time. Integrating both methods is an interesting direction for future work.

Table 2. Certified quantization strategy synthesis results for robustness.

Unsurprisingly, we also observe that the scaling variables decrease as r increases or as the layer index decreases. The former is attributed to the enlargement of the reachable region of each neuron with increasing r, which leaves less room for amplification. The latter is because we propagate the preimage towards the input layer, and the preimage returned by UnderPreImage increasingly under-approximates the ground truth. Additionally, we find that the number of layers in a DNN has a more pronounced impact on the scaling than the number of neurons per layer. For example, when \(r=4\), while the scaling of the last affine layer is similar across MP2, MP3, and MP4, a notable divergence is observed as the preimage computation progresses to the preceding layer, i.e., the scaling of \(f_4\) in MP3 largely diminishes compared to that of \(f_2\) in MP2 and MP4, and even approaches zero in some tasks. We conjecture that as the DNN gets deeper and r gets larger, DeepPoly's symbolic propagation becomes relatively more effective, so that the region delineated by \({\mathcal {A}}^{2i+2}\) becomes significantly tighter than the region confined by \({\mathcal {N}}_{[2i+1:2i+2]}({\mathcal {A}}^{2i})\). Finally, we find that the preimage computation time is predominantly impacted by the number of neurons per layer (e.g., MP2 vs MP4).

5.2 Certified Quantization for Robustness

We evaluate Quadapter in terms of robustness properties on all the networks listed in Table 1 with the fractional bit-width range \([\mathfrak {B}_l,\mathfrak {B}_u]=[1,16]\). For each network, we randomly select 50 inputs from the test set of the respective dataset and set the perturbation radius as \(r\in \{1,2,3,4,5\}\). It results in a total of 250 synthesis tasks for each network, each of which can be certified by DeepPoly.

The results are reported in Columns 2 to 7 in Table 2. Columns (#S) and (#F) list the number of quantization successes and quantization failures due to small values of the scaling variables. Column (Bit-width) lists the average bit-width for each layer within the quantization strategies synthesized by Quadapter, and Column (Acc.) lists the average accuracy of the resulting QNNs. Columns (PTime) and (QTime) show the average execution time in seconds for the preimage computation and forward quantization procedures, respectively. Overall, Quadapter solves almost all the tasks of MPx and FPx for \(x\in \{1,2\}\), and most tasks of MP4 and FP4, where all timeout cases occur in the preimage computation process. For MP3 and FP3, all quantization failures are due to the excessively small preimages returned by UnderPreImage, which makes it very challenging to find a feasible quantization strategy, as the quantized region must be strictly included within the preimage. Given the distribution shift phenomenon shown in Fig. 1, we hypothesize that this may be alleviated by relaxing the strict-inclusion requirement on the quantization of early layers without compromising soundness. Thus, we next relax the restriction by permitting the quantized regions of a portion of the neurons (e.g., 25%) in each affine layer (except the output layer, to guarantee the soundness of the approach) to deviate from the preimage returned by UnderPreImage. Note that, when using the relaxed version of our tool, named Quadapter \(^*\), we set \(\mathfrak {B}_l=2\) to circumvent situations where the use of the smallest bit-width (specifically, 1-bit), while theoretically yielding a viable solution for the current layer, may leave no feasible quantization for subsequent layers. Experimental results are shown in Columns 8 to 13 in Table 2. We observe that Quadapter \(^*\) usually synthesizes quantization strategies with smaller bit-widths for earlier layers and larger bit-widths for the last layer, achieves better accuracy, and solves more tasks on average. While the accuracy drops slightly, a similar drop occurs with the same but non-certified quantization scheme, and our certified quantization achieves comparable accuracy [50].

Fig. 4.

Certified quantization strategy synthesis results for backdoor-freeness.

5.3 Certified Quantization for Backdoor-Freeness

We evaluate Quadapter in terms of backdoor-freeness on MP1, MP2, FP1 and FP2. For each network, we randomly select 5 trigger positions and consider all the 10 output classes as target labels of the backdoor attacks with two shapes of triggers, i.e., \(h_s=w_s=3\) and \(h_s=w_s=5\), resulting in \(5\times 10\times 2 =100\) backdoor-freeness properties. Following [37], we set the input parameters of Algorithm 2 as \((\mathfrak {B}_l,\mathfrak {B}_u)=(2,16)\), \(\theta =0.9\), \(K=5\), \(\epsilon =0.01\), and \(\sigma =\varrho =\delta =0.05\). Note that these parameters do not affect the soundness of Algorithm 2.

The results are given in Fig. 4. We observe that for \((h_s,w_s)=(3,3)\), Quadapter solves almost all the tasks of MP1 and FP1, and most tasks on MP2 and FP2. For \((h_s,w_s)=(5,5)\), over half of the tasks are solved by Quadapter. All the quantization failures (due to small values of the scaling variables) may be solvable with the relaxed version of Quadapter, which is left as future work. The histogram shows the distribution of target classes in the solved tasks on MP1 and FP1, where the x-axis gives the synthesis success rate. We also observe that Quadapter is more likely to successfully find certified quantization strategies w.r.t. target classes \(\{0,1,4,6,9\}\) on MP1 and target classes \(\{1,2,4,5,7,8,9\}\) on FP1, compared to its efficacy w.r.t. the other classes. Due to the black-box nature of DNNs, we currently cannot explain the performance discrepancy among target classes.

6 Related Work

Numerous methods have been proposed to verify (local) robustness of DNNs (e.g., [7, 10, 17, 40, 45, 47]) and QNNs (e.g., [9, 12, 14, 19, 52,53,54]). Recently, backdoor-freeness verification for DNNs has been explored leveraging a similar hypothesis testing method [37]. Methods for verifying quantization error bounds [30, 35, 36, 51] and Top-1 equivalence [16] between DNNs and QNNs have also been proposed. Except for [16], these works only verify properties without adjusting quantization strategies for falsified properties. The concurrent work [16] iteratively searches for a quantization strategy and verifies Top-1 equivalence after quantization, refining the strategy if equivalence is violated. However, it does not support general properties (e.g., backdoor-freeness or robustness of multi-label classification [6]). Additionally, [16] requires frequent equivalence verification, which is computationally expensive and inefficient (e.g., handling networks with 100 neurons already takes about 20 min). Comparison experiments are given in [50].

The primary contribution of this work is the first certified quantization strategy synthesis approach utilizing preimage computation as a crucial step. Hence, any (under-approximate) preimage computation methods can be integrated. [28] introduced an exact preimage computation method that, while precise, is impractical due to its exponential time complexity. The inverse abstraction approach [4] circumvents the intractability of exact preimage computation by using symbolic interpolants [2] for compact symbolic abstractions of preimages. However, it still faces scalability issues due to the complexity of the interpolation process. [18, 49] considered over-approximate preimages, which are unsuitable for our purpose.

Quantization-aware training has been studied to improve robustness for a given fixed quantization strategy [19, 23, 24, 41, 43], but only [19] provides robustness guarantees, by lifting abstract interpretation-based training [29] from DNNs to QNNs. In contrast, our work aims to obtain a better quantization strategy for preserving given properties. Thus, our work is orthogonal to and could be combined with these techniques, which we leave as interesting future work.

7 Conclusion

In this work, we have presented a pioneering method Quadapter to synthesize a fine-grained quantization strategy such that the desired properties are preserved within the resulting quantized network. We have implemented our methods as an end-to-end tool and conducted extensive experiments to demonstrate the effectiveness and efficiency of Quadapter in preserving robustness and backdoor-freeness properties. For future work, it would be interesting to explore the adaptation of Quadapter to other activation functions and network architectures, towards which this work makes a significant step.