1 Introduction

Deep neural networks (DNNs) have demonstrated impressive performance in a broad range of applications, including computer vision [1, 2], natural language processing [3, 4], and speech recognition [5, 6]. However, deploying large DNNs on resource-constrained devices, such as smartphones and embedded systems, is challenging due to their high memory and computational requirements. To address this issue, researchers have proposed various model compression techniques, including model quantization [7–11], pruning [12–16], and neural network distillation [17, 18]. Among these techniques, model quantization has become a critical approach for compressing DNNs because it preserves the network structure while achieving comparable performance. Model quantization reduces memory usage and accelerates inference by mapping network parameters from continuous 32-bit floating-point numbers to discrete low-bit integers, making it well-suited for resource-constrained devices.

While model quantization offers a viable solution for deploying DNNs on resource-constrained devices, it also presents challenges in ensuring the trustworthiness (such as robustness, fairness, and privacy) of quantized models in the real world [19–25]. DNNs are highly susceptible to adversarial examples [26–29], perturbations carefully designed to be imperceptible to human vision yet able to easily deceive DNNs, posing a significant threat to practical deep learning applications. For example, placing three small adversarial stickers at a road intersection can cause the Tesla Autopilot system [30] to misinterpret the lane markings and swerve into the wrong lane, posing a severe risk to human life. In addition, DNNs are vulnerable to natural corruption [31] such as snow and motion blur, which is common in real-world scenarios and can significantly reduce model accuracy. Moreover, system noise resulting from mismatches between software and hardware can also adversely affect model accuracy [32]. These vulnerabilities demonstrate that quantized networks deployed in safety-critical applications (such as autonomous driving and face recognition) are unreliable when faced with various perturbations in real-world scenarios.

Therefore, it is critical to conduct a comprehensive evaluation of the robustness of quantized models before their deployment to identify potential weaknesses and unintended behaviors. In recent years, researchers have developed robustness benchmarks [33–37] tailored for assessing the robustness of deep learning models, employing multiple adversarial attack methods to thoroughly evaluate floating-point networks across various tasks. Extensive experiments have revealed and substantiated several insights, such as the observation that larger model parameter sizes often lead to better adversarial robustness [33, 38], highlighting the significance of model complexity in determining robustness. While numerous studies have extensively investigated the robustness of floating-point networks, research on the robustness of quantized models [39–42] remains inadequate, lacking diversity in noise sources and relying solely on small datasets. Consequently, the current literature fails to thoroughly assess the robustness of quantized models, leaving a gap in the understanding of their vulnerabilities and strengths.

To bridge this gap, we build RobustMQ, a comprehensive robustness evaluation benchmark for quantized models. Our benchmark systematically assesses the robustness of quantized models using three popular quantization methods (i.e., DoReFa [43], PACT [44], and LSQ [45]) and four classical architectures (ResNet18 [46], ResNet50 [46], RegNetX600M [47], and MobileNetV2 [48]). Each method is evaluated at four commonly used bit-widths. To thoroughly study the robustness of quantized models against noise from different sources, our analysis comprises three progressive adversarial attacks (covering \(\ell _{1}\), \(\ell _{2}\), and \(\ell _{\infty}\) magnitudes, along with three different perturbation budgets), 15 types of natural corruption, and 14 types of systematic noise on the ImageNet benchmark. Our empirical results demonstrate that lower-bit quantized models exhibit better adversarial robustness but are more susceptible to natural corruption and systematic noise. In summary, increasing the quantization bit-width generally leads to a decrease in adversarial robustness, an increase in natural robustness, and an increase in systematic robustness. Moreover, our findings indicate that impulse noise and glass blur are the most harmful corruption methods for quantized models, while brightness has the least impact. Additionally, among systematic noise, the nearest neighbor interpolation has the highest impact, while bilinear interpolation, cubic interpolation, and area interpolation are the three least harmful. Our main contributions can be summarized as follows.

1) To the best of our knowledge, RobustMQ is the first to comprehensively evaluate the robustness of quantized models. RobustMQ covers three popular quantization methods, four common bit-widths, and four classical architectures across a range of noise types, including adversarial attacks, natural corruption, and systematic noise.

2) Through extensive experiments, RobustMQ uncovers valuable insights into the robustness of quantized models, shedding light on their strengths and weaknesses in comparison to floating-point models across various scenarios.

3) The RobustMQ benchmark provides a standardized framework for evaluating the robustness of quantized models, enabling further research and development in this field. It is publicly available on our website [49].

2 Related work

In this section, we review related works on network quantization, adversarial attacks, and the recent advancements in the integration of these fields.

2.1 Network quantization

Network quantization compresses DNN models by reducing the number of bits used to represent each weight, thereby lowering memory usage and speeding up model inference [50]. A classic quantization process involves both quantization and de-quantization operations. The quantization function maps real values to integers, while the de-quantization function allows approximate recovery of the real values from the quantized values. This process can be mathematically formulated as

$$ Q(r)=\operatorname{Int}(r/S)-Z,\qquad r^{\prime }=S \cdot \bigl(Q(r)+Z \bigr), $$
(1)

where Q is the quantization operator, r and \(r^{\prime }\) are the real value and de-quantized real value, respectively, and S and Z denote the scale and zero-point, respectively. Given t bits, the quantized range is \([-2^{t-1},2^{t-1}-1]\). After quantization, the recovered real value \(r^{\prime }\) may not be exactly equal to the original value r due to the rounding operation.
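As a concrete illustration of Eq. (1), the following minimal NumPy sketch quantizes and de-quantizes a tensor with an assumed scale and zero-point; real quantizers calibrate S and Z from the data, so this is only an illustrative example, not the benchmark's implementation.

```python
import numpy as np

def quantize(r, scale, zero_point, num_bits=8):
    """Eq. (1), forward: Q(r) = Int(r / S) - Z, clipped to the signed t-bit range."""
    q = np.round(r / scale).astype(np.int64) - zero_point
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return np.clip(q, q_min, q_max)

def dequantize(q, scale, zero_point):
    """Eq. (1), inverse: r' = S * (Q(r) + Z), recovering r only up to rounding error."""
    return scale * (q + zero_point)

# Example: 8-bit quantization with an assumed scale of 0.01 and zero-point of 0;
# the recovered values differ from the originals by at most S/2 (no clipping occurs here).
r = np.array([-1.20, -0.02, 0.47, 1.25], dtype=np.float32)
q = quantize(r, scale=0.01, zero_point=0)
r_rec = dequantize(q, scale=0.01, zero_point=0)
```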

Quantization methods can be broadly classified into two strategies: post-training quantization (PTQ) and quantization-aware training (QAT). PTQ methods are applied after the model is fully trained, without any adjustment during training, which often results in lower accuracy. By comparison, QAT methods fine-tune or retrain the model to achieve higher accuracy in the quantized form. We therefore focus on QAT methods and briefly review the commonly used ones here.

One line of research designs rules to fit the quantizer to the data distribution [43, 51]. For example, the DoReFa-Net method [43] simply clips activations to \([0,1]\) and then quantizes them, based on the observation that most activations fall into this range in many network architectures (e.g., AlexNet [1] and ResNet [46]). Other notable work focuses on learning appropriate quantization parameters during backpropagation [44, 45, 52]. In particular, the PACT technique [44] clips activations with a parameterized threshold and optimizes this clipping threshold during training. However, PACT has no gradient below the clipping value, which can lead to vanishing-gradient problems during backpropagation. To address this limitation, the LSQ method [45] estimates the gradient at each weight and activation layer to adaptively adjust the quantization step size. By learning the scale alongside the network parameters, it achieves more fine-grained quantization and improves the accuracy of quantized models.
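To make the idea of learnable clipping concrete, the following simplified PyTorch sketch combines a PACT-style learnable threshold with a straight-through rounding estimator. It illustrates the mechanism only; the class, parameter names, and initialization are our own assumptions and omit details of the published DoReFa, PACT, and LSQ implementations.

```python
import torch
import torch.nn as nn

class RoundSTE(torch.autograd.Function):
    """Rounding with a straight-through estimator: identity gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

class PACTActivationQuantizer(nn.Module):
    """Clip activations to [0, alpha] with a learnable alpha, then quantize uniformly."""
    def __init__(self, num_bits=4, init_alpha=6.0):
        super().__init__()
        self.num_bits = num_bits
        self.alpha = nn.Parameter(torch.tensor(init_alpha))

    def forward(self, x):
        levels = 2 ** self.num_bits - 1
        # PACT-style clipping: the gradient w.r.t. alpha is nonzero only where x exceeds
        # alpha, which is what allows the threshold to be learned by backpropagation.
        y = torch.minimum(torch.relu(x), self.alpha)
        scale = self.alpha / levels          # step size between quantization levels
        return RoundSTE.apply(y / scale) * scale

# Example: quantize a batch of activations to 4 bits.
quantizer = PACTActivationQuantizer(num_bits=4)
out = quantizer(torch.randn(2, 8))
```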

2.2 Adversarial attacks

Adversarial examples [27] are inputs with small perturbations that can easily mislead the DNNs. Formally, given a DNN \(f_{\Theta}\) and an input x with the ground truth label y, an adversarial example \({x}_{\text{adv}}\) satisfies

$$ f_{\Theta}({x}_{\text{adv}}) \neq {y} \quad \textit{s.t.} \quad \Vert {x}-{x}_{ \text{adv}} \Vert \leq \epsilon , $$
(2)

where \(\Vert \cdot \Vert \) is a distance metric, commonly the \(\ell _{p}\)-norm with \(p \in \{1,2,\infty \}\).

A long line of work has been dedicated to adversarial attacks [21, 26, 27, 38, 53–57], which can be broadly divided into white-box and black-box approaches based on the level of access to the target model. In white-box attacks, adversaries have complete knowledge of the target model and can fully access it, whereas in black-box attacks, adversaries have limited or no knowledge of the target model and cannot directly access it. Our study primarily employs white-box attacks to evaluate the adversarial robustness of target models, as they have stronger attack capabilities. Here, we introduce the attack methods relevant to our benchmark.

Fast gradient sign method (FGSM). FGSM [27] is a one-step attack under the \(\ell _{\infty}\)-norm. It computes the gradient of the loss function with respect to the input once and perturbs the input along the sign of this gradient to generate an adversarial example. Although FGSM has relatively weak attack capability, it generates adversarial examples very efficiently.
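A minimal PyTorch sketch of FGSM, under the assumptions that inputs are normalized to \([0,1]\) and the model returns logits; it illustrates the attack rather than reproducing our exact evaluation code.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """One-step FGSM (L_inf): perturb x along the sign of the input gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    x_adv = x + epsilon * grad.sign()
    return torch.clamp(x_adv, 0.0, 1.0).detach()  # keep pixels in the valid range
```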

Projected gradient descent (PGD). PGD [38] is regarded as one of the most powerful and widely used attacks due to its high attack success rate. It builds upon FGSM by introducing an iterative process with a projection step after each gradient update.
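Continuing the sketch above (and reusing its imports), a basic \(\ell _{\infty}\) PGD loop with the projection step looks as follows; the random start used in some PGD variants is omitted for brevity.

```python
def pgd_attack(model, x, y, epsilon, step_size, num_steps):
    """Iterative L_inf PGD: FGSM-like steps followed by projection into the epsilon ball."""
    x_adv = x.clone().detach()
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step_size * grad.sign()
        # Projection: stay within the L_inf ball of radius epsilon around the clean input.
        x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()
```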

AutoAttack. Croce and Hein [53] first proposed two automatic step size adjustment methods (APGD-CE and APGD-DLR) to address problems such as the suboptimal step size in PGD. They then combined these with two existing complementary attacks to form a parameter-free and computationally affordable attack suite (i.e., AutoAttack). AutoAttack has demonstrated superior performance compared to state-of-the-art attacks, making it a crucial tool for evaluating model robustness.
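For reference, if the publicly released autoattack package by Croce and Hein is used, the evaluation interface looks roughly like the sketch below; the exact argument names are an assumption based on its public README and may vary across versions.

```python
# A hedged usage sketch of the reference `autoattack` package; not our evaluation code.
from autoattack import AutoAttack

def run_autoattack(model, x_test, y_test, epsilon=0.5 / 255):
    # The 'standard' version chains APGD-CE, targeted APGD-DLR, FAB, and Square attacks.
    adversary = AutoAttack(model, norm='Linf', eps=epsilon, version='standard')
    x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=128)
    return x_adv
```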

2.3 Robustness of quantized models

A number of studies have been proposed to evaluate the robustness of floating-point networks [20, 22, 33, 34, 58]. For instance, Croce et al. [34] introduced RobustBench, a benchmark based on the CIFAR dataset, which employs AutoAttack to assess the robustness of models strengthened by various defense methods, including adversarial training, gradient masking, and label smoothing. In contrast to the evaluation of defense methods, Tang et al. [33] focused on investigating the robustness of model structures and training techniques using the large-scale ImageNet dataset, offering valuable insights into the methods of training robust models.

Comparatively, Merkle et al. [59] benchmarked the adversarial robustness of pruned models for several pruning methods, revealing that pruned models enjoy better robustness against adversaries. However, the robustness of quantized networks remains relatively underexplored. Bernhard et al. [39] utilized an ensemble of quantized models to filter adversarial examples, given that current adversarial attacks demonstrate limited transferability against quantized models. Lin et al. [40] proposed a defensive quantization method that suppresses the amplification of adversarial noise during propagation by controlling the Lipschitz constant of the network during quantization. Similarly, Alizadeh et al. [41] designed a regularization scheme to improve the robustness of quantized models by controlling the magnitude of adversarial gradients. It is worth mentioning that a benchmark proposed by Yuan et al. [60] also assesses the reliability of PTQ. However, there are fundamental distinctions between their objectives and ours. Yuan et al. [60] concentrated on examining the influence of different PTQ steps on model performance, encompassing steps such as constructing calibration datasets, assigning quantization settings, and optimizing quantization parameters. In contrast, we aim to thoroughly evaluate the robustness of quantized models against multiple test-time attacks, covering quantization methods, architectures, and quantization bit-widths.

3 RobustMQ benchmark

Existing research on the impact of quantization compression on neural network robustness is fragmented and lacks adherence to established principles in robustness evaluation. To address this issue, this study proposes RobustMQ, a comprehensive robustness evaluation benchmark for quantized models with consistent settings. RobustMQ provides researchers with a valuable tool to gain deeper insights into the impact of various perturbations on quantized model robustness, aiding in the development of more robust quantization methods for deploying reliable and secure deep learning models in real-world scenarios. The RobustMQ benchmark encompasses adversarial robustness, natural robustness, and systematic robustness, considering three quantization methods, four bit-widths, four architectures, three progressive adversarial attack methods (covering three magnitudes and three perturbation budgets), 15 types of natural corruption, and 14 types of systematic noise on ImageNet. The overall framework of RobustMQ is illustrated in Fig. 1.

Figure 1 Overview of the RobustMQ benchmark

3.1 Robustness evaluation approaches

Quantized models that are extensively deployed in edge devices are vulnerable to various perturbations in real-world scenarios. In accordance with the guidelines proposed by Tang et al. [33], we classify these perturbations into adversarial attacks, natural corruption, and systematic noise, and leverage them to thoroughly evaluate the robustness of quantized models.

3.1.1 Adversarial attacks

To model the worst-case scenario (i.e., the strongest adversaries), we consider attacks under the white-box setting, where the adversary has full access to the model architecture, training data, and gradient information. Specifically, we employ FGSM-\(\ell _{\infty}\), PGD-\(\ell _{1}\), PGD-\(\ell _{2}\), PGD-\(\ell _{\infty}\), and AutoAttack-\(\ell _{\infty}\) to craft adversarial perturbations. These three attack methods (FGSM, PGD, and AutoAttack) form a progressive evaluation, with both computational cost and attack capability increasing in that order, enabling a comprehensive assessment of the quantized model's adversarial robustness. Furthermore, we set three progressive perturbation budgets (small, middle, and large) for each attack method.

3.1.2 Natural corruption

To simulate natural corruption, we utilize 15 distinct corruption methods from the ImageNet-C benchmark [31]. These methods can be categorized into four groups: 1) noise, which includes Gaussian noise, shot noise, and impulse noise; 2) blur, which includes defocus blur, frosted glass blur, motion blur, and zoom blur; 3) weather, which includes snow, frost, fog, and brightness; and 4) digital, which includes contrast, elastic transform, pixelation, and JPEG compression. Each corruption type is evaluated at five levels of severity to account for variations in corruption intensity, yielding 75 corruption settings in total for natural corruption evaluation. In addition to single corrupted images, we also investigate corruption sequences generated from ImageNet-P [31]. Each sequence in ImageNet-P comprises more than 30 frames, allowing us to study the robustness of quantized models against dynamic and continuous corruption.
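The evaluation loop over corruption types and severities can be sketched as follows; evaluate_fn is a hypothetical helper standing in for inference on ImageNet-C style corrupted images and is not part of the benchmark's released code.

```python
CORRUPTIONS = {
    "noise":   ["gaussian_noise", "shot_noise", "impulse_noise"],
    "blur":    ["defocus_blur", "glass_blur", "motion_blur", "zoom_blur"],
    "weather": ["snow", "frost", "fog", "brightness"],
    "digital": ["contrast", "elastic_transform", "pixelate", "jpeg_compression"],
}
SEVERITIES = [1, 2, 3, 4, 5]

def evaluate_natural_corruption(evaluate_fn):
    """Loop over all 15 corruptions x 5 severities (75 settings in total).

    `evaluate_fn(name, severity)` is a hypothetical callable returning validation
    accuracy on images corrupted with the given method and severity (e.g., using
    precomputed ImageNet-C images).
    """
    results = {}
    for group, names in CORRUPTIONS.items():
        for name in names:
            results[name] = {s: evaluate_fn(name, s) for s in SEVERITIES}
    return results
```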

3.1.3 Systematic noise

Moreover, various types of system noise are always present when models are deployed on edge devices due to differences in hardware or software. To assess the influence of system noise on quantized models, we incorporate pre-processing operations from ImageNet-S [32], which involve image decoding and image resizing. Image decoding converts an original image file into an RGB tensor, with the inverse discrete cosine transform (iDCT) as a core step. However, discrepancies in the iDCT implementations of different image processing libraries lead to variations in the output: the pixel values of the final RGB tensor are affected, producing slightly different decoded images. Therefore, we employ the decoders from third-party libraries such as Pillow [61], OpenCV [62], and FFmpeg [63] to obtain systematic noise. Additionally, image resizing is used to change the image resolution. In the resizing process, different interpolation algorithms are employed to compute the new pixel values, potentially leading to slight variations. Thus, for image resizing, we incorporate the bilinear, nearest, cubic, hamming, lanczos, area, and box interpolation modes from the OpenCV and Pillow tools. In total, the systematic noise covers three frequently used decoders and seven commonly used resize modes.
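As an illustration of how such systematic noise arises, the sketch below decodes the same image with Pillow and OpenCV and resizes it with several interpolation modes; the exact decoder/resize combinations used in ImageNet-S may differ from this simplified selection.

```python
# Assumes Pillow and OpenCV are installed; in newer Pillow versions the resampling
# constants live under Image.Resampling.* instead of module-level names.
import cv2
import numpy as np
from PIL import Image

PIL_MODES = {"bilinear": Image.BILINEAR, "nearest": Image.NEAREST, "cubic": Image.BICUBIC,
             "hamming": Image.HAMMING, "lanczos": Image.LANCZOS, "box": Image.BOX}
CV2_MODES = {"nearest": cv2.INTER_NEAREST, "area": cv2.INTER_AREA}

def decode_and_resize_variants(path, size=(224, 224)):
    """Decode the same file with two libraries and resize with several interpolation modes."""
    pil_img = Image.open(path).convert("RGB")                   # Pillow decoder
    cv_img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)  # OpenCV decoder
    variants = {f"pil_{k}": np.asarray(pil_img.resize(size, v)) for k, v in PIL_MODES.items()}
    variants.update({f"cv2_{k}": cv2.resize(cv_img, size, interpolation=v)
                     for k, v in CV2_MODES.items()})
    return variants
```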

3.2 Evaluation metrics

3.2.1 Adversarial robustness

For specific adversarial attacks, we adopt the model accuracy to measure adversarial robustness (AR), which is calculated by subtracting the attack success rate (ASR) from 1. Mathematically, AR can be calculated using Eq. (3):

$$ \mathit{AR} = 1 - P_{({x},{y})\sim \mathcal{D}}{\bigl(f\bigl( \mathcal{A}_{\epsilon ,p}^{f}({x})\bigr) \neq {y}\bigr)}, $$
(3)

where f is the tested target model, \(\mathcal{D}\) is the validation set, \(\mathcal{A}_{\epsilon ,p}\) represents the adversary, P denotes the fraction of samples that satisfy the specified criterion, and ϵ and p denote the perturbation budget and distance norm, respectively. This metric quantifies the model's capability to function normally under attack, with a higher AR indicating a stronger model against the specific adversarial attack. Since we aim to compare the robustness of models with different clean accuracies, it is also crucial to measure the relative performance drop under adversarial attacks, denoted as the adversarial attack impact (AAI):

$$ \mathit{AAI} = \frac{\mathit{ACC} - \mathit{AR}}{\mathit{ACC}}, $$
(4)

where ACC represents the clean accuracy. A lower AAI value indicates that the model is more robust.

For the union of different attacks, we adopt worst-case adversarial robustness (WCAR) to measure adversarial robustness (a higher value indicates a more robust model) against them:

$$ \mathit{WCAR} = 1 - P_{({x},{y})\sim \mathcal{D}}{\mathrm{Any}_{\mathcal{A} \in \mathcal{A}s}\bigl(f \bigl(\mathcal{A}_{\epsilon ,p}^{f}({x})\bigr) \neq {y}\bigr) }, $$
(5)

where \(\mathcal{A}s\) represents a set of adversaries, and \(\mathrm{Any}(\cdot )\) is a function that returns true if any adversary \(\mathcal{A}\) in \(\mathcal{A}s\) attacks successfully. WCAR represents a lower bound on the model's adversarial robustness against various adversarial attacks. Specifically, we further refine WCAR based on the perturbation budget employed in the adversarial attacks: WCAR (small ϵ), WCAR (middle ϵ), and WCAR (large ϵ).
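The three adversarial metrics can be computed from per-sample correctness records, as in the following sketch; the function and argument names are ours for illustration and are not the benchmark's API.

```python
import numpy as np

def adversarial_metrics(correct_clean, correct_per_attack):
    """Compute AR (Eq. (3)), AAI (Eq. (4)), and WCAR (Eq. (5)) from boolean records.

    correct_clean:      (N,) bool array, True where the clean prediction is correct.
    correct_per_attack: dict mapping attack name -> (N,) bool array, True where the
                        prediction on that attack's adversarial example is still correct.
    """
    acc = correct_clean.mean()
    ar = {name: c.mean() for name, c in correct_per_attack.items()}   # AR per attack
    aai = {name: (acc - r) / acc for name, r in ar.items()}           # relative drop
    # WCAR counts a sample only if it survives every attack in the set.
    survived_all = np.logical_and.reduce(list(correct_per_attack.values()))
    wcar = survived_all.mean()
    return ar, aai, wcar
```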

3.2.2 Natural robustness

Natural robustness measures a model's ability to maintain its accuracy after being perturbed by natural noise. Therefore, given a corruption method c, we calculate the accuracy as its natural robustness:

$$ \mathit{ACC}_{c}=P_{({x},{y})\sim \mathcal{D}}{\bigl(f\bigl(c({x})\bigr)={y} \bigr)}. $$
(6)

Similar to AAI in adversarial robustness, we also define natural corruption impact (NCI) to measure the relative performance drop against natural corruption:

$$ \mathit{NCI} = \frac{\mathit{ACC} - \mathit{ACC}_{c}}{\mathit{ACC}}. $$
(7)

To aggregate the evaluation results among 15 types of corruption, we adopt the average accuracy of the quantized model on all corruption types to measure the mean natural robustness, denoted as mNR:

$$ \mathit{mNR} = \mathbb{E}_{c\sim C}\bigl({P_{({x},{y})\sim \mathcal{D}}{\bigl(f \bigl(c({x})\bigr)={y}\bigr)}}\bigr), $$
(8)

where c is a specific corruption method, C denotes the set of corruption methods, and \(\mathbb{E}\) calculates the average expectation. A higher value of mNR means better natural robustness.

For the corruption sequence \(\mathcal{S}\), we utilize the “flip probability” of model predictions to measure its robustness, denoted as FP:

$$ \mathit{FP} = P_{{x}\sim \mathcal{S}}{\bigl(f({x}_{j}) \neq f({x}_{j-1})\bigr)}, $$
(9)

where \(x_{j}\) is the j-th image in the sequence.

For sequences generated by multiple corruption methods, we average their flip probability to obtain mean flip probability (mFP). Note that a lower FP value indicates a model that performs more stably in the presence of dynamic and continuous corruption.
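Analogously, the natural robustness metrics reduce to simple aggregations over per-corruption records, as sketched below with hypothetical inputs.

```python
def natural_metrics(acc_clean, acc_per_corruption, flip_prob_per_sequence):
    """Compute NCI (Eq. (7)), mNR (Eq. (8)), and mFP from per-corruption records.

    acc_per_corruption:      dict corruption name -> accuracy ACC_c under that corruption.
    flip_prob_per_sequence:  dict corruption name -> flip probability FP (Eq. (9)) on the
                             corresponding ImageNet-P sequence.
    """
    nci = {c: (acc_clean - a) / acc_clean for c, a in acc_per_corruption.items()}
    mnr = sum(acc_per_corruption.values()) / len(acc_per_corruption)  # mean over 15 corruptions
    mfp = sum(flip_prob_per_sequence.values()) / len(flip_prob_per_sequence)
    return nci, mnr, mfp
```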

3.2.3 Systematic robustness

Systematic robustness measures a model's ability to withstand the software-dependent and component-dependent system noise encountered in diverse deployment environments. For a given decoding or resizing method s, we compute the accuracy as its systematic robustness using Eq. (10):

$$ \mathit{ACC}_{s}=P_{({x},{y})\sim \mathcal{D}}{\bigl(f \bigl(s({x})\bigr)={y}\bigr)}. $$
(10)

To emphasize the impact of noise, we introduce systematic noise impact (SNI) as a metric to quantify the relative performance drop:

$$ \mathit{SNI} = \frac{\mathit{ACC} - \mathit{ACC}_{s}}{\mathit{ACC}}. $$
(11)

Furthermore, to evaluate the robustness among all systematic noise, we calculate the standard deviation of their \({\mathit{ACC}}_{s}\) as systematic robustness (SR):

$$ \mathit{SR} = \mathbb{D}_{s\sim S}\bigl({P_{({x},{y})\sim \mathcal{D}}{\bigl(f \bigl(s({x})\bigr)={y}\bigr)}}\bigr), $$
(12)

where S denotes a set of decoding or resizing methods. A lower value of SR means better stability towards different types of systematic noise.
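The systematic robustness metrics follow the same pattern; the sketch below assumes per-noise accuracies have already been collected, and the helper name is ours.

```python
import numpy as np

def systematic_metrics(acc_clean, acc_per_noise):
    """Compute SNI (Eq. (11)) and SR (Eq. (12)) from per-noise accuracies ACC_s.

    acc_per_noise: dict mapping a decoder or resize mode -> accuracy ACC_s under it.
    """
    sni = {s: (acc_clean - a) / acc_clean for s, a in acc_per_noise.items()}
    # SR is the standard deviation of ACC_s across all decoders and resize modes;
    # a lower SR means more stable behavior across deployment environments.
    sr = float(np.std(list(acc_per_noise.values())))
    return sni, sr
```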

3.3 Evaluation objects

3.3.1 Dataset

Our RobustMQ aims to obtain broadly applicable results and conclusions for quantized models in the computer vision field. Therefore, we focus on the basic image classification tasks and follow well-established quantization literature [9, 4345] to employ the large-scale ImageNet [64] dataset, which is widely recognized as a standard benchmark within the realm of quantization and computer vision. ImageNet provides a more extensive collection of images and classes, making it more suitable as a benchmark for testing models in robustness evaluation, in contrast to commonly used small-scale datasets such as MNIST [65], CIFAR-10 [66], and CIFAR-100 [66]. With an expansive repository of 1.2 million training images, 50,000 validation images, and spanning 1000 different classes, the ImageNet dataset significantly bolsters the applicability of our evaluation methodology to real-world scenarios.

3.3.2 Network architectures

Our RobustMQ contains four architectures, including ResNet18 [46], ResNet50 [46], RegNetX600M [47], and MobileNetV2 [48]. 1) ResNet18 and ResNet50 are classic backbone architectures that have proven to be highly effective in various computer vision tasks. Both architectures are built on the concept of residual blocks, which employ skip connections to mitigate the vanishing gradient problem and facilitate the training of deep networks. 2) RegNetX600M is an advanced architecture discovered through model structure search, specifically optimized to achieve efficient and powerful feature extraction. It leverages group convolution to enable parallel processing and significantly reduce computational complexity, making it ideal for resource-constrained edge devices. 3) MobileNetV2 is a lightweight network designed for efficient deployment on edge devices. It employs depthwise separable convolutions, which separate the spatial and channel-wise convolutions, reducing the computational burden while maintaining performance.

3.3.3 Quantization methods

Within RobustMQ, we concentrate on three widely used quantization methods: DoReFa [43], PACT [44], and LSQ [45]. LSQ and DoReFa quantize both model weights and activations, whereas PACT applies its parameterized clipping to quantize activations while leveraging the DoReFa method for weight quantization. For the quantization bit-widths, we adopt the set commonly used in deployments (i.e., 2, 4, 6, and 8). For each architecture, we first quantize models on the ImageNet training set starting from the same floating-point model, and then evaluate their robustness against perturbations generated on the ImageNet validation set.
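The overall evaluation protocol can be summarized as a simple grid over architectures, methods, and bit-widths, as in the hypothetical sketch below; the three callables are stand-ins for loading the shared pre-trained model, running QAT, and applying the robustness evaluations of Sect. 3.1, not released benchmark code.

```python
ARCHITECTURES = ["resnet18", "resnet50", "regnetx600m", "mobilenetv2"]
METHODS = ["dorefa", "pact", "lsq"]
BIT_WIDTHS = [2, 4, 6, 8]

def run_benchmark(load_float_model, quantize_and_finetune, evaluate_robustness):
    """Evaluate every (architecture, method, bit-width) combination from one FP32 start."""
    results = {}
    for arch in ARCHITECTURES:
        float_model = load_float_model(arch)            # same FP32 starting point per arch
        for method in METHODS:
            for bits in BIT_WIDTHS:
                quantized = quantize_and_finetune(float_model, method=method, bits=bits)
                results[(arch, method, bits)] = evaluate_robustness(quantized)
    return results
```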

4 Experiments and analysis

In this section, we showcase the evaluation results of the quantized models under different types of noise. Subsequently, we consolidate several conclusive insights gleaned from the evaluations by addressing the following research questions: 1) How robust are quantized models compared with floating-point models? 2) Which quantization method or bit-width exhibits greater robustness against perturbations? 3) What is the impact of architecture or size on the robustness of quantized models? 4) To which type of noise is the quantized model most susceptible?

4.1 Clean accuracy

Table 1 reports the clean accuracies of the quantized models. Most of the quantized models maintain comparable accuracy to their 32-bit pre-trained models, while certain quantization methods may struggle to maintain comparable accuracy when using low bit-widths (e.g., 2-bit). Among the quantized models evaluated, 12 models fail to converge. For example, ResNet18 PACT 2-bit achieves a mere 2.61% accuracy. Therefore, we label these models as “NC” and exclude them from our evaluations to ensure fair and reliable assessments of robustness.

Table 1 Clean accuracy of quantized models and floating-point models. “NC” denotes not converged. The best performers in each architecture are highlighted in bold

4.2 Evaluation of adversarial attacks

Under middle and large perturbation budgets, the AR and WCAR values degrade to almost 0 due to the increasing attack strength. Therefore, in this section, we primarily present the results under small budgets to highlight the differences between models. The worst-case robustness results under all adversarial attacks with small budgets are presented in Table 2, and the results under the FGSM-\(\ell _{\infty}\) and PGD-\(\ell _{\infty}\) attacks are presented in Table 3 and Table 4, respectively. Other adversarial robustness evaluation results can be found on our website [49]. From the results, we can make several observations about quantized models as follows.

Table 2 Worst-case adversarial robustness of quantized models under all adversarial attacks with small budgets. The results are presented using WCAR (small ϵ). The best performers in each architecture are highlighted in bold
Table 3 Adversarial robustness of models under the FGSM-\(\ell _{ \infty}\) attack with a small budget (\(\epsilon =0.5/255\)). The results are shown with AR. The best performers in each architecture are highlighted in bold
Table 4 Adversarial robustness of quantized models under the PGD-\(\ell _{\infty}\) attack with a small budget (\(\epsilon =0.5/255\)). The results are shown with AR. The best performers in each architecture are highlighted in bold

1) Better adversarial robustness vs. floating-point models. In contrast to the decrease in clean accuracy, quantized models exhibit higher worst-case adversarial robustness and are almost always better than their floating-point counterparts. For example, the ACC of ResNet18 DoReFa 2-bit is 8.75% lower than that of the floating-point ResNet18 (see Table 1), while its WCAR is 4.25% higher (see Table 2). Moreover, Fig. 2 illustrates that under the same quantization method, the adversarial robustness of quantized ResNet18 models increases as the quantization bit-width decreases. These phenomena can also be observed in other network architectures, suggesting that quantization can provide a certain degree of defense against adversarial attacks.

Figure 2 Adversarial robustness of ResNet18 models under specific attacks. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. WCAR denotes the worst-case adversarial robustness. The "NC" models are assigned an AR value of 0 in the figure. Similar trends, with AR decreasing as the bit-width increases, are observed in the results of other architectures as well

2) At the same quantization bit-width, PACT outperforms the other quantization methods in terms of worst-case adversarial robustness. For instance, RegNetX600M PACT 4-bit achieves a WCAR of 6.21%, while the DoReFa and LSQ quantized models achieve WCAR values of 2.74% and 2.05%, respectively (see Table 2). However, when facing specific adversarial attacks, the robustness ranking of quantization methods is not consistent across model architectures (see Table 3 and Table 4). For the ResNet18, ResNet50, and MobileNetV2 architectures, LSQ demonstrates better robustness against the FGSM-\(\ell _{\infty}\) and PGD-\(\ell _{\infty}\) attacks, whereas for RegNetX600M, PACT exhibits better robustness.

3) Regarding model size and network architecture, quantized models show trends similar to those of floating-point models. In terms of model size, quantized models with larger network capacity exhibit better adversarial robustness, consistent with the findings in Ref. [33]. For instance, within the ResNet family, all quantized ResNet50 models demonstrate better robustness than ResNet18 against all adversarial attacks. In terms of network architecture, the adversarial robustness of quantized models follows the order: MobileNetV2 > RegNetX600M > ResNet50 > ResNet18, which coincides with the findings in Ref. [33]. This highlights the significance of architecture design in achieving better adversarial robustness.

4) For the adversarial attack methods, their attack capabilities in quantized models are generally consistent with those in floating-point models: AutoAttack > PGD > FGSM. However, quantized models demonstrate varying degrees of adversarial robustness against different attack methods. For instance, compared to DoReFa and PACT, LSQ performs better under FGSM and PGD attacks but is more vulnerable to AutoAttack.

4.3 Evaluation of natural corruption

We first report the aggregated evaluation results (i.e., mean natural robustness) among 15 corruption methods for all quantized models, as displayed in Table 5. Next, we explore the natural robustness of different quantized models under each corruption method. For instance, we provide the natural robustness results for the ResNet18, ResNet50, and MobileNetV2 architectures in Table 6, Fig. 3, and Fig. 4, respectively. Due to space limitations, more detailed results can be found on our website [49]. The natural robustness evaluation results for quantized models reveal several key observations as follows.

Figure 3 Natural robustness of ResNet50 models under specific corruption. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. ACC denotes the clean accuracy of the models, and mNR represents the mean natural robustness of the models. The "NC" models are labeled with "×" in the figure

Figure 4 Natural robustness of MobileNetV2 models under specific corruption. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. ACC denotes the clean accuracy of the models, and mNR represents the mean natural robustness of the models. The "NC" models are labeled with "×" in the figure

Table 5 Natural robustness of quantized models under all corruption methods. The results are presented using mNR and NCI. The best performers in each architecture are highlighted in bold
Table 6 Natural robustness results for ResNet18 models are shown with NCI for each corruption (e.g., Gauss). The most influential noise is marked in bold, and the least influential noise is underlined

1) Worse natural robustness vs. floating-point models. Despite achieving clean accuracy similar to that of the 32-bit model, quantized models are more susceptible to natural corruption. Notably, the 2-bit quantized model is severely affected, with a performance decrease that far exceeds its drop in clean accuracy. As an example, ResNet18 DoReFa 2-bit exhibits an NCI of 62.61%, higher than the 53.87% NCI of the corresponding floating-point model (see Table 5), indicating that quantization can negatively impact the model's robustness against such perturbations. Furthermore, Fig. 3 demonstrates that under the same quantization method, the natural robustness of quantized ResNet50 models generally increases as the quantization bit-width increases, although the rate of improvement gradually diminishes. Similar phenomena can also be observed in other network architectures, indicating that deploying quantized models in real scenarios requires more care than deploying floating-point models. For instance, leveraging corruption data to enhance natural robustness becomes crucial for maintaining the performance of quantized models.

2) At the same quantization bit-width, quantization methods exhibit varied performance on different architectures. As presented in Table 5, LSQ achieves a smaller relative performance drop (i.e., a lower NCI value) than DoReFa and PACT on the ResNet18 and ResNet50 architectures. However, DoReFa and PACT outperform LSQ across most bit-widths on the RegNetX600M and MobileNetV2 architectures. When examining a specific architecture, Fig. 3 reveals a consistent relative trend among different quantization methods across various corruption methods (other architectures also show this trend). Specifically, the order of impact on model performance is as follows: Noise > Blur > Weather (except brightness) > Digital. This consistent trend emphasizes the varying degrees of sensitivity of quantized models to different types of corruption, and it remains independent of the specific quantization method used.

3) Within the same network family, larger network capacity leads to better natural robustness. Similar to the adversarial evaluation, the quantized ResNet50 models present a higher mNR and a lower NCI value than ResNet18 under natural corruption. Across different network architectures, the natural robustness of the quantized models follows the order: ResNet > RegNetX600M > MobileNetV2, which is exactly opposite to the order observed under adversarial attacks.

4) Regarding natural corruption methods, our conclusions are as follows. For ResNet18, ResNet50, and RegNetX600M, impulse noise has the most severe impact on the model’s robustness. For MobileNetV2, glass blur is the most harmful corruption to the model’s robustness. Among all the corruption methods, brightness has the least impact on the model’s robustness. ResNet18 quantized models exhibit an average decrease of approximately 53.15% under impulse noise, whereas the average decrease is only approximately 11.31% under brightness weather corruption (see Table 6). Similarly, MobileNetV2 quantized models show an average decrease of approximately 49.04% under glass blur, while the average decrease is only approximately 12.47% under brightness.

For natural robustness under corruption sequences, we report the evaluation results in Table 7, Table 8, Fig. 5, and Fig. 6. Table 7 shows the mean flip probability of quantized models against multiple sequences. Table 8, Fig. 5, and Fig. 6 report the detailed robustness results under specific corruption sequences for ResNet18, ResNet50 and RegNetX600M, respectively. Other results can be found on our website [49]. From these results, we can draw several conclusions about natural robustness under dynamic and continuous corruption, which mostly align with our observations from static natural corruption evaluations.

Figure 5 Natural robustness of ResNet50 models under corruption sequences. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. mFP denotes the mean flip probability of the models. The "NC" models are labeled with "×" in the figure

Figure 6 Natural robustness of RegNetX600M models under corruption sequences. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. mFP denotes the mean flip probability of the models. The "NC" models are labeled with "×" in the figure

Table 7 Natural robustness of quantized models under all corruption sequences. The results are presented using mFP. The best performers in each architecture are highlighted in bold
Table 8 Natural robustness of ResNet18 under corruption sequences. Results for each corruption (e.g., Gauss) are shown using FP. The most influential noise is marked in bold, and the least influential noise is underlined

1) Worse dynamic natural robustness vs. floating-point models. Consistent with the static natural corruption results, quantized models also exhibit inferior performance compared to floating-point models under dynamic corruption sequences. Furthermore, the 2-bit quantized models demonstrate extreme instability. For example, ResNet18 DoReFa 2-bit exhibits an mFP of 0.283, nearly three times that of the floating-point ResNet18 (mFP of 0.098). As depicted in Fig. 5, the natural robustness increases with increasing bit-width (from right to left in each subfigure).

2) At the same quantization bit-width, quantization methods under dynamic natural corruption exhibit trends similar to those observed under static natural corruption. For example, LSQ achieves better robustness on the ResNet architectures but is less dominant on lightweight architectures such as RegNetX600M and MobileNetV2.

3) For specific corruption sequences, shot noise has the most severe impact on all models, while brightness remains the least harmful. Additionally, we observe that for RegNetX600M models, the floating-point model is not the most robust under shot noise (see Fig. 6), which is inconsistent with the trend observed in ResNet models (see Fig. 5).

4.4 Evaluation of systematic noise

We first report the aggregated systematic robustness results among 14 types of systematic noise for all quantized models (see Table 9). Then, we explore the performance of different quantized models under specific types of systematic noise. For instance, we provide the systematic robustness results for the ResNet18, ResNet50, and RegNetX600M architectures in Table 10, Fig. 7, and Fig. 8, respectively. Due to space limitations, more detailed results can be found on our website [49]. The evaluation results provide valuable insights into the systematic robustness of quantized models. Several key observations are as follows:

Figure 7 Systematic robustness of ResNet50 models under specific types of noise. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. ACC denotes the clean accuracy of the models, and "mean" represents the average accuracy of the models under all noise. The "NC" models are labeled with "×" in the figure

Figure 8 Systematic robustness of RegNetX600M models under specific types of noise. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. ACC denotes the clean accuracy of the models, and "mean" represents the average accuracy of the models under all noise. The "NC" models are labeled with "×" in the figure

Table 9 Mean systematic robustness of models under all types of noise. The results are shown with mean \(\mathit{ACC}_{s}\) and SR. The best performers in each architecture are highlighted in bold
Table 10 Systematic robustness of ResNet18 models under each type of systematic noise. The results for each type of noise (e.g., bilinear) are presented using SNI. The most influential noise is marked in bold, and the least influential noise is underlined

1) Worse systematic robustness vs. floating-point models. Although the system noise prevalent in deployment environments does not cause a significant drop in the prediction accuracy of quantized models, their performance across different deployment scenarios exhibits considerable instability. In contrast, floating-point models show little fluctuation. For instance, on the ResNet18 architecture, all quantized models exhibit at least 3.43 times higher instability (SR) than the floating-point ResNet18, confirming the sensitivity of quantized models to system noise. In particular, 2-bit quantized models are significantly affected by systematic noise (e.g., ResNet18 DoReFa 2-bit shows a drop of 4.49% in terms of \({\mathit{ACC}}_{s}\) and a 4.38-fold increase in instability). For the same quantization method, lower-bit models are generally less robust (i.e., less stable). This indicates that maintaining consistency between the deployment and training processes is crucial to avoid unnecessary accuracy loss.

2) At the same quantization bit-width, quantization methods demonstrate varied performance across different architectures. Generally, LSQ dominates on the ResNet and RegNetX600M architectures, while PACT performs better on MobileNetV2. When focusing on a specific architecture (such as ResNet50 in Fig. 7 and RegNetX600M in Fig. 8), we observe a similar relative impact across different decoders and resize modes for all three quantization methods.

3) Within the same network family, larger network capacity could lead to better systematic robustness. For instance, quantized ResNet50 models present a lower SR value than ResNet18 under systematic noise. Across different network architectures, the systematic robustness of quantized models follows the order: RegNetX600M > ResNet > MobileNetV2. It is worth noting that the RegNetX600M floating-point model exhibits the worst systematic robustness, while surprisingly, the quantized RegNetX600M models demonstrate the best systematic robustness among all architectures.

4) We draw the following conclusions when considering specific types of systematic noise. Among the 14 types of systematic noise, the nearest neighbor interpolation methods in the Pillow and OpenCV libraries have the most harmful impact on model performance, inducing nearly a 6% decrease in performance for the 2-bit ResNet18 models (see Table 10). By contrast, the least impactful types of noise are bilinear interpolation and cubic interpolation in the Pillow library, as well as area interpolation in the OpenCV library. For example, under bilinear interpolation from the Pillow library, 2-bit ResNet18 models present only a 1.67% decrease in performance on average. Regarding the different decoders, the performance of quantized models fluctuates only slightly across the three decoders.

5 Conclusion

This paper presents a benchmark named RobustMQ, which aims to evaluate the robustness of quantized models under various perturbations, including adversarial attacks, natural corruption, and systematic noise. The benchmark evaluates four classical architectures and three popular quantization methods with four different bit-widths. The comprehensive results empirically provide valuable insights into the performance of quantized models in various scenarios. Some of the key observations are summarized as follows. 1) Quantized models exhibit higher adversarial robustness than their floating-point counterparts, but are more vulnerable to natural corruption and systematic noise. 2) Using the same quantization method, we observe that as the quantization bit-width increases, the adversarial robustness decreases, the natural robustness increases, and the systematic robustness increases. 3) Among the 15 corruption methods, impulse noise consistently exhibits the most harmful impact on ResNet18, ResNet50, and RegNetX600M models, while glass blur is the most harmful corruption on MobileNetV2 models. By comparison, brightness is observed to be the least harmful corruption for all models. 4) Among the 14 types of systematic noise, the nearest neighbor interpolation has the highest impact, while bilinear interpolation, cubic interpolation in Pillow, and area interpolation in OpenCV are the three least harmful types of noise across different architectures. We hope that our benchmark will significantly contribute to the assessment of the robustness of quantized models. The insights gained from our study could further support the development and deployment of robust quantized models in real-world scenarios.