1 Introduction

Deep neural networks (DNNs) have demonstrated impressive performance in a broad range of applications, including computer vision [1, 2], natural language processing [3, 4], and speech recognition [5, 6]. However, deploying large DNNs on resource-constrained devices, such as smartphones and embedded systems, is challenging due to their high memory and computational requirements. To address this issue, researchers have proposed various model compression techniques, including model quantization [7–11], pruning [12–16], and neural network distillation [17, 18]. Among these techniques, model quantization has become a critical approach for compressing DNNs because it preserves the network structure while achieving comparable performance. Model quantization reduces memory usage and accelerates inference by mapping network parameters from continuous 32-bit floating-point numbers to discrete low-bit integers, making it well-suited for resource-constrained devices.

While model quantization offers a viable solution for deploying DNNs on resource-constrained devices, it also presents challenges in ensuring the trustworthiness (such as robustness, fairness, and privacy) of quantized models in the real world [19–25]. DNNs are highly susceptible to adversarial examples [26–29], perturbations carefully designed to be imperceptible to human vision yet able to easily deceive DNNs, posing a significant threat to practical deep learning applications. For example, placing three small adversarial stickers at a road intersection can cause the Tesla Autopilot system [30] to misinterpret the lane markings and swerve into the wrong lane, posing a severe risk to human life. In addition, DNNs are vulnerable to natural corruption [31] such as snow and motion blur, which is common in real-world scenarios and can significantly reduce model accuracy. Moreover, system noise resulting from mismatches between software and hardware can also adversely affect model accuracy [32]. These vulnerabilities demonstrate that quantized networks deployed in safety-critical applications (such as autonomous driving and face recognition) are unreliable when faced with various perturbations in real-world scenarios.

Therefore, it is critical to conduct a comprehensive evaluation of the robustness of quantized models before their deployment to identify potential weaknesses and unintended behaviors. In recent years, researchers have developed robustness benchmarks [33–37] tailored for assessing the robustness of deep learning models, employing multiple adversarial attack methods to thoroughly evaluate floating-point networks across various tasks. Extensive experiments have revealed and substantiated several insights, such as the observation that larger model parameter sizes often lead to better adversarial robustness [33, 38], highlighting the significance of model complexity in determining robustness. While numerous studies have extensively investigated the robustness of floating-point networks, research on the robustness of quantized models [39–42] remains inadequate, lacking diversity in noise sources and relying solely on small datasets. Consequently, the current literature fails to thoroughly assess the robustness of quantized models, leaving a gap in the understanding of their vulnerabilities and strengths.

To bridge this gap, we build RobustMQ, a comprehensive robustness evaluation benchmark for quantized models. Our benchmark systematically assesses the robustness of quantized models using three popular quantization methods (i.e., DoReFa [43], PACT [44], and LSQ [45]) and four classical architectures (ResNet18 [46], ResNet50 [46], RegNetX600M [47], and MobileNetV2 [48]). Each method is evaluated at four commonly used bit-widths. To thoroughly study the robustness of quantized models against noise from different sources, our analysis comprises three progressive adversarial attacks (covering \(\ell _{1}\), \(\ell _{2}\), and \(\ell _{\infty}\) magnitudes, along with three different perturbation budgets), 15 types of natural corruption, and 14 types of systematic noise on the ImageNet benchmark. Our empirical results demonstrate that lower-bit quantized models exhibit better adversarial robustness but are more susceptible to natural corruption and systematic noise. In summary, increasing the quantization bit-width generally leads to a decrease in adversarial robustness, an increase in natural robustness, and an increase in systematic robustness. Moreover, our findings indicate that impulse noise and glass blur are the most harmful corruption methods for quantized models, while brightness has the least impact. Additionally, among systematic noise, the nearest neighbor interpolation has the highest impact, while bilinear interpolation, cubic interpolation, and area interpolation are the three least harmful. Our main contributions can be summarized as follows.

1) To the best of our knowledge, RobustMQ is the first to comprehensively evaluate the robustness of quantized models. RobustMQ covers three popular quantization methods, four common bit-widths, and four classical architectures across a range of noise types, including adversarial attacks, natural corruption, and systematic noise.

2) Through extensive experiments, RobustMQ uncovers valuable insights into the robustness of quantized models, shedding light on their strengths and weaknesses in comparison to floating-point models across various scenarios.

3) The RobustMQ benchmark provides a standardized framework for evaluating the robustness of quantized models, enabling further research and development in this field. It is publicly available on our website [49].

2 Related work

In this section, we review related works on network quantization, adversarial attacks, and the recent advancements in the integration of these fields.

2.1 Network quantization

Network quantization compresses DNN models by reducing the number of bits used to represent each weight, thereby lowering memory usage and speeding up model inference [50]. A classic quantization process involves both quantization and de-quantization operations. The quantization function maps real values to integers, while the de-quantization function allows approximate recovery of the real values from the quantized values. This process can be mathematically formulated as

$$ Q(r)=\operatorname{Int}(r/S)-Z,\qquad r^{\prime }=S \cdot \bigl(Q(r)+Z \bigr), $$
(1)

where Q is the quantization operator, r and \(r^{\prime }\) are the real value and de-quantized real value, respectively, and S and Z denote the scale and zero-point, respectively. Given t bits, the quantized range is \([-2^{t-1},2^{t-1}-1]\). After quantization, the recovered real value \(r^{\prime }\) may not be exactly equal to the original value r due to the rounding operation.
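As a concrete illustration of Eq. (1), the following minimal NumPy sketch quantizes and de-quantizes a tensor with an assumed scale and zero-point; real quantizers calibrate S and Z from the data, so this is only an illustrative example, not the benchmark's implementation.

```python
import numpy as np

def quantize(r, scale, zero_point, num_bits=8):
    """Eq. (1), forward: Q(r) = Int(r / S) - Z, clipped to the signed t-bit range."""
    q = np.round(r / scale).astype(np.int64) - zero_point
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return np.clip(q, q_min, q_max)

def dequantize(q, scale, zero_point):
    """Eq. (1), inverse: r' = S * (Q(r) + Z), recovering r only up to rounding error."""
    return scale * (q + zero_point)

# Example: 8-bit quantization with an assumed scale of 0.01 and zero-point of 0;
# the recovered values differ from the originals by at most S/2 (no clipping occurs here).
r = np.array([-1.20, -0.02, 0.47, 1.25], dtype=np.float32)
q = quantize(r, scale=0.01, zero_point=0)
r_rec = dequantize(q, scale=0.01, zero_point=0)
```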

Quantization methods can be broadly classified into two strategies: post-training quantization (PTQ) and quantization-aware training (QAT). PTQ methods are applied after the model is fully trained, without any adjustment during training, which often results in lower accuracy. By comparison, QAT methods fine-tune or retrain the model to achieve higher accuracy in the quantized form. We therefore focus on QAT methods and briefly review the commonly used ones here.

One line of research designs rules to fit the quantizer to the data distribution [43, 51]. For example, the DoReFa-Net method [43] simply clips activations to \([0,1]\) and then quantizes them, based on the observation that most activations fall into this range in many network architectures (e.g., AlexNet [1] and ResNet [46]). Other notable work focuses on learning appropriate quantization parameters during backpropagation [44, 45, 52]. In particular, the PACT technique [44] clips activations with a parameterized threshold and optimizes this clipping threshold during training. However, PACT has no gradient below the clipping value, which can lead to vanishing-gradient problems during backpropagation. To address this limitation, the LSQ method [45] estimates the gradient at each weight and activation layer to adaptively adjust the quantization step size. By learning the scale alongside the network parameters, it achieves more fine-grained quantization and improves the accuracy of quantized models.
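To make the idea of learnable clipping concrete, the following simplified PyTorch sketch combines a PACT-style learnable threshold with a straight-through rounding estimator. It illustrates the mechanism only; the class, parameter names, and initialization are our own assumptions and omit details of the published DoReFa, PACT, and LSQ implementations.

```python
import torch
import torch.nn as nn

class RoundSTE(torch.autograd.Function):
    """Rounding with a straight-through estimator: identity gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

class PACTActivationQuantizer(nn.Module):
    """Clip activations to [0, alpha] with a learnable alpha, then quantize uniformly."""
    def __init__(self, num_bits=4, init_alpha=6.0):
        super().__init__()
        self.num_bits = num_bits
        self.alpha = nn.Parameter(torch.tensor(init_alpha))

    def forward(self, x):
        levels = 2 ** self.num_bits - 1
        # PACT-style clipping: the gradient w.r.t. alpha is nonzero only where x exceeds
        # alpha, which is what allows the threshold to be learned by backpropagation.
        y = torch.minimum(torch.relu(x), self.alpha)
        scale = self.alpha / levels          # step size between quantization levels
        return RoundSTE.apply(y / scale) * scale

# Example: quantize a batch of activations to 4 bits.
quantizer = PACTActivationQuantizer(num_bits=4)
out = quantizer(torch.randn(2, 8))
```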

2.2 Adversarial attacks

Adversarial examples [27] are inputs with small perturbations that can easily mislead the DNNs. Formally, given a DNN \(f_{\Theta}\) and an input x with the ground truth label y, an adversarial example \({x}_{\text{adv}}\) satisfies

$$ f_{\Theta}({x}_{\text{adv}}) \neq {y} \quad \textit{s.t.} \quad \Vert {x}-{x}_{ \text{adv}} \Vert \leq \epsilon , $$
(2)

where \(\Vert \cdot \Vert \) is a distance metric, commonly the \(\ell _{p}\)-norm with \(p \in \{1,2,\infty \}\).

A long line of work has been dedicated to adversarial attacks [21, 26, 27, 38, 53–57], which can be broadly divided into white-box and black-box approaches based on the level of access to the target model. In white-box attacks, adversaries have complete knowledge of the target model and can fully access it, whereas in black-box attacks, adversaries have limited or no knowledge of the target model and cannot directly access it. Our study primarily employs white-box attacks to evaluate the adversarial robustness of target models, as they have stronger attack capabilities. Here, we introduce the attack methods relevant to our benchmark.

Fast gradient sign method (FGSM). FGSM [27] is a one-step attack under the \(\ell _{\infty}\)-norm. It computes the gradient of the loss function with respect to the input once and perturbs the input along the sign of this gradient to generate an adversarial example. Although FGSM has relatively weak attack capability, it generates adversarial examples very efficiently.
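A minimal PyTorch sketch of FGSM, under the assumptions that inputs are normalized to \([0,1]\) and the model returns logits; it illustrates the attack rather than reproducing our exact evaluation code.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """One-step FGSM (L_inf): perturb x along the sign of the input gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    x_adv = x + epsilon * grad.sign()
    return torch.clamp(x_adv, 0.0, 1.0).detach()  # keep pixels in the valid range
```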

Projected gradient descent (PGD). PGD [38] is regarded as one of the most powerful and widely used attacks due to its high attack success rate. It builds upon FGSM by introducing an iterative process with a projection step after each gradient update.
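Continuing the sketch above (and reusing its imports), a basic \(\ell _{\infty}\) PGD loop with the projection step looks as follows; the random start used in some PGD variants is omitted for brevity.

```python
def pgd_attack(model, x, y, epsilon, step_size, num_steps):
    """Iterative L_inf PGD: FGSM-like steps followed by projection into the epsilon ball."""
    x_adv = x.clone().detach()
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step_size * grad.sign()
        # Projection: stay within the L_inf ball of radius epsilon around the clean input.
        x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()
```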

AutoAttack. Croce and Hein [53] first proposed two automatic step size adjustment methods (APGD-CE and APGD-DLR) to address problems such as the suboptimal step size in PGD. They then combined these with two existing complementary attacks to form a parameter-free and computationally affordable attack suite (i.e., AutoAttack). AutoAttack has demonstrated superior performance compared to state-of-the-art attacks, making it a crucial tool for evaluating model robustness.
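For reference, if the publicly released autoattack package by Croce and Hein is used, the evaluation interface looks roughly like the sketch below; the exact argument names are an assumption based on its public README and may vary across versions.

```python
# A hedged usage sketch of the reference `autoattack` package; not our evaluation code.
from autoattack import AutoAttack

def run_autoattack(model, x_test, y_test, epsilon=0.5 / 255):
    # The 'standard' version chains APGD-CE, targeted APGD-DLR, FAB, and Square attacks.
    adversary = AutoAttack(model, norm='Linf', eps=epsilon, version='standard')
    x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=128)
    return x_adv
```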

2.3 Robustness of quantized models

A number of studies have been proposed to evaluate the robustness of floating-point networks [20, 22, 33, 34, 58]. For instance, Croce et al. [34] introduced RobustBench, a benchmark based on the CIFAR dataset, which employs AutoAttack to assess the robustness of models strengthened by various defense methods, including adversarial training, gradient masking, and label smoothing. In contrast to the evaluation of defense methods, Tang et al. [33] focused on investigating the robustness of model structures and training techniques using the large-scale ImageNet dataset, offering valuable insights into the methods of training robust models.

Comparatively, Merkle et al. [59] benchmarked the adversarial robustness of pruned models for several pruning methods, revealing that pruned models enjoy better robustness against adversaries. However, the robustness of quantized networks remains relatively underexplored. Bernhard et al. [39] utilized an ensemble of quantized models to filter adversarial examples, given that current adversarial attacks demonstrate limited transferability against quantized models. Lin et al. [40] proposed a defensive quantization method that suppresses the amplification of adversarial noise during propagation by controlling the Lipschitz constant of the network during quantization. Similarly, Alizadeh et al. [41] designed a regularization scheme to improve the robustness of quantized models by controlling the magnitude of adversarial gradients. It is worth mentioning that a benchmark proposed by Yuan et al. [60] also assesses the reliability of PTQ. However, there are fundamental distinctions between their objectives and ours. Yuan et al. [60] concentrated on examining the influence of different PTQ steps on model performance, encompassing steps such as constructing calibration datasets, assigning quantization settings, and optimizing quantization parameters. In contrast, we aim to thoroughly evaluate the robustness of quantized models against multiple test-time attacks, covering quantization methods, architectures, and quantization bit-widths.

3 RobustMQ benchmark

Existing research on the impact of quantization compression on neural network robustness is fragmented and lacks adherence to established principles in robustness evaluation. To address this issue, this study proposes RobustMQ, a comprehensive robustness evaluation benchmark for quantized models with consistent settings. RobustMQ provides researchers with a valuable tool to gain deeper insights into the impact of various perturbations on quantized model robustness, aiding in the development of more robust quantization methods for deploying reliable and secure deep learning models in real-world scenarios. The RobustMQ benchmark encompasses adversarial robustness, natural robustness, and systematic robustness, considering three quantization methods, four bit-widths, four architectures, three progressive adversarial attack methods (covering three magnitudes and three perturbation budgets), 15 types of natural corruption, and 14 types of systematic noise on ImageNet. The overall framework of RobustMQ is illustrated in Fig. 1.

Figure 1 Overview of the RobustMQ benchmark

3.1 Robustness evaluation approaches

Quantized models that are extensively deployed in edge devices are vulnerable to various perturbations in real-world scenarios. In accordance with the guidelines proposed by Tang et al. [33], we classify these perturbations into adversarial attacks, natural corruption, and systematic noise, and leverage them to thoroughly evaluate the robustness of quantized models.

3.1.1 Adversarial attacks

To model the worst-case scenario (i.e., the strongest adversaries), we consider attacks under the white-box setting, where the adversary has full access to the model architecture, training data, and gradient information. Specifically, we employ FGSM-\(\ell _{\infty}\), PGD-\(\ell _{1}\), PGD-\(\ell _{2}\), PGD-\(\ell _{\infty}\), and AutoAttack-\(\ell _{\infty}\) to craft adversarial perturbations. These three attack methods (FGSM, PGD, and AutoAttack) form a progressive evaluation, with both computational cost and attack capability increasing in that order, enabling a comprehensive assessment of the quantized model's adversarial robustness. Furthermore, we set three progressive perturbation budgets (small, middle, and large) for each attack method.

3.1.2 Natural corruption

To simulate natural corruption, we utilize 15 distinct corruption methods from the ImageNet-C benchmark [31]. These methods can be categorized into four groups: 1) noise, which includes Gaussian noise, shot noise, and impulse noise; 2) blur, which includes defocus blur, frosted glass blur, motion blur, and zoom blur; 3) weather, which includes snow, frost, fog, and brightness; and 4) digital, which includes contrast, elastic transform, pixelation, and JPEG compression. Each corruption type is evaluated at five levels of severity to account for variations in corruption intensity, yielding 75 corruption settings in total for natural corruption evaluation. In addition to single corrupted images, we also investigate corruption sequences generated from ImageNet-P [31]. Each sequence in ImageNet-P comprises more than 30 frames, allowing us to study the robustness of quantized models against dynamic and continuous corruption.
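The evaluation loop over corruption types and severities can be sketched as follows; evaluate_fn is a hypothetical helper standing in for inference on ImageNet-C style corrupted images and is not part of the benchmark's released code.

```python
CORRUPTIONS = {
    "noise":   ["gaussian_noise", "shot_noise", "impulse_noise"],
    "blur":    ["defocus_blur", "glass_blur", "motion_blur", "zoom_blur"],
    "weather": ["snow", "frost", "fog", "brightness"],
    "digital": ["contrast", "elastic_transform", "pixelate", "jpeg_compression"],
}
SEVERITIES = [1, 2, 3, 4, 5]

def evaluate_natural_corruption(evaluate_fn):
    """Loop over all 15 corruptions x 5 severities (75 settings in total).

    `evaluate_fn(name, severity)` is a hypothetical callable returning validation
    accuracy on images corrupted with the given method and severity (e.g., using
    precomputed ImageNet-C images).
    """
    results = {}
    for group, names in CORRUPTIONS.items():
        for name in names:
            results[name] = {s: evaluate_fn(name, s) for s in SEVERITIES}
    return results
```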

3.1.3 Systematic noise

Moreover, various types of system noise are always present when models are deployed on edge devices due to differences in hardware or software. To assess the influence of system noise on quantized models, we incorporate pre-processing operations from ImageNet-S [32], which involve image decoding and image resizing. Image decoding converts an original image file into an RGB tensor, with the inverse discrete cosine transform (iDCT) as a core step. However, discrepancies in the iDCT implementations of different image processing libraries lead to variations in the output: the pixel values of the final RGB tensor are affected, producing slightly different decoded images. Therefore, we employ the decoders from third-party libraries such as Pillow [61], OpenCV [62], and FFmpeg [63] to obtain systematic noise. Additionally, image resizing is used to change the image resolution. In the resizing process, different interpolation algorithms are employed to compute the new pixel values, potentially leading to slight variations. Thus, for image resizing, we incorporate the bilinear, nearest, cubic, hamming, lanczos, area, and box interpolation modes from the OpenCV and Pillow tools. In total, the systematic noise covers three frequently used decoders and seven commonly used resize modes.
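As an illustration of how such systematic noise arises, the sketch below decodes the same image with Pillow and OpenCV and resizes it with several interpolation modes; the exact decoder/resize combinations used in ImageNet-S may differ from this simplified selection.

```python
# Assumes Pillow and OpenCV are installed; in newer Pillow versions the resampling
# constants live under Image.Resampling.* instead of module-level names.
import cv2
import numpy as np
from PIL import Image

PIL_MODES = {"bilinear": Image.BILINEAR, "nearest": Image.NEAREST, "cubic": Image.BICUBIC,
             "hamming": Image.HAMMING, "lanczos": Image.LANCZOS, "box": Image.BOX}
CV2_MODES = {"nearest": cv2.INTER_NEAREST, "area": cv2.INTER_AREA}

def decode_and_resize_variants(path, size=(224, 224)):
    """Decode the same file with two libraries and resize with several interpolation modes."""
    pil_img = Image.open(path).convert("RGB")                   # Pillow decoder
    cv_img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)  # OpenCV decoder
    variants = {f"pil_{k}": np.asarray(pil_img.resize(size, v)) for k, v in PIL_MODES.items()}
    variants.update({f"cv2_{k}": cv2.resize(cv_img, size, interpolation=v)
                     for k, v in CV2_MODES.items()})
    return variants
```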

3.2 Evaluation metrics

3.2.1 Adversarial robustness

For specific adversarial attacks, we adopt the model accuracy to measure adversarial robustness (AR), which is calculated by subtracting the attack success rate (ASR) from 1. Mathematically, AR can be calculated using Eq. (3):

$$ \mathit{AR} = 1 - P_{({x},{y})\sim \mathcal{D}}{\bigl(f\bigl( \mathcal{A}_{\epsilon ,p}^{f}({x})\bigr) \neq {y}\bigr)}, $$
(3)

where f is the tested target model, \(\mathcal{D}\) is the validation set, \(\mathcal{A}_{\epsilon ,p}\) represents the adversary, P denotes the fraction of samples that satisfy the specified criterion, and ϵ and p denote the perturbation budget and distance norm, respectively. This metric quantifies the model's capability to function normally under attack, with a higher AR indicating a stronger model against the specific adversarial attack. Since we aim to compare the robustness of models with different clean accuracies, it is also crucial to measure the relative performance drop under adversarial attacks, denoted as the adversarial attack impact (AAI):

$$ \mathit{AAI} = \frac{\mathit{ACC} - \mathit{AR}}{\mathit{ACC}}, $$
(4)

where ACC represents the clean accuracy. A lower AAI value indicates that the model is more robust.

For the union of different attacks, we adopt worst-case adversarial robustness (WCAR) to measure adversarial robustness (a higher value indicates a more robust model) against them:

$$ \mathit{WCAR} = 1 - P_{({x},{y})\sim \mathcal{D}}{\mathrm{Any}_{\mathcal{A} \in \mathcal{A}s}\bigl(f \bigl(\mathcal{A}_{\epsilon ,p}^{f}({x})\bigr) \neq {y}\bigr) }, $$
(5)

where \(\mathcal{A}s\) represents a set of adversaries, and \(\mathrm{Any}(\cdot )\) is a function that returns true if any adversary \(\mathcal{A}\) in \(\mathcal{A}s\) attacks successfully. WCAR represents a lower bound on the model's adversarial robustness against various adversarial attacks. Specifically, we further refine WCAR based on the perturbation budget employed in the adversarial attacks: WCAR (small ϵ), WCAR (middle ϵ), and WCAR (large ϵ).
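The three adversarial metrics can be computed from per-sample correctness records, as in the following sketch; the function and argument names are ours for illustration and are not the benchmark's API.

```python
import numpy as np

def adversarial_metrics(correct_clean, correct_per_attack):
    """Compute AR (Eq. (3)), AAI (Eq. (4)), and WCAR (Eq. (5)) from boolean records.

    correct_clean:      (N,) bool array, True where the clean prediction is correct.
    correct_per_attack: dict mapping attack name -> (N,) bool array, True where the
                        prediction on that attack's adversarial example is still correct.
    """
    acc = correct_clean.mean()
    ar = {name: c.mean() for name, c in correct_per_attack.items()}   # AR per attack
    aai = {name: (acc - r) / acc for name, r in ar.items()}           # relative drop
    # WCAR counts a sample only if it survives every attack in the set.
    survived_all = np.logical_and.reduce(list(correct_per_attack.values()))
    wcar = survived_all.mean()
    return ar, aai, wcar
```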

3.2.2 Natural robustness

Natural robustness measures a model's ability to maintain its accuracy after being perturbed by natural noise. Therefore, given a corruption method c, we calculate the accuracy as its natural robustness:

$$ \mathit{ACC}_{c}=P_{({x},{y})\sim \mathcal{D}}{\bigl(f\bigl(c({x})\bigr)={y} \bigr)}. $$
(6)

Similar to AAI in adversarial robustness, we also define natural corruption impact (NCI) to measure the relative performance drop against natural corruption:

$$ \mathit{NCI} = \frac{\mathit{ACC} - \mathit{ACC}_{c}}{\mathit{ACC}}. $$
(7)

To aggregate the evaluation results among 15 types of corruption, we adopt the average accuracy of the quantized model on all corruption types to measure the mean natural robustness, denoted as mNR:

$$ \mathit{mNR} = \mathbb{E}_{c\sim C}\bigl({P_{({x},{y})\sim \mathcal{D}}{\bigl(f \bigl(c({x})\bigr)={y}\bigr)}}\bigr), $$
(8)

where c is a specific corruption method, C denotes the set of corruption methods, and \(\mathbb{E}\) calculates the average expectation. A higher value of mNR means better natural robustness.

For the corruption sequence \(\mathcal{S}\), we utilize the “flip probability” of model predictions to measure its robustness, denoted as FP:

$$ \mathit{FP} = P_{{x}\sim \mathcal{S}}{\bigl(f({x}_{j}) \neq f({x}_{j-1})\bigr)}, $$
(9)

where \(x_{j}\) is the j-th image in the sequence.

For sequences generated by multiple corruption methods, we average their flip probability to obtain mean flip probability (mFP). Note that a lower FP value indicates a model that performs more stably in the presence of dynamic and continuous corruption.
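Analogously, the natural robustness metrics reduce to simple aggregations over per-corruption records, as sketched below with hypothetical inputs.

```python
def natural_metrics(acc_clean, acc_per_corruption, flip_prob_per_sequence):
    """Compute NCI (Eq. (7)), mNR (Eq. (8)), and mFP from per-corruption records.

    acc_per_corruption:      dict corruption name -> accuracy ACC_c under that corruption.
    flip_prob_per_sequence:  dict corruption name -> flip probability FP (Eq. (9)) on the
                             corresponding ImageNet-P sequence.
    """
    nci = {c: (acc_clean - a) / acc_clean for c, a in acc_per_corruption.items()}
    mnr = sum(acc_per_corruption.values()) / len(acc_per_corruption)  # mean over 15 corruptions
    mfp = sum(flip_prob_per_sequence.values()) / len(flip_prob_per_sequence)
    return nci, mnr, mfp
```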

3.2.3 Systematic robustness

Systematic robustness measures a model's ability to withstand the software-dependent and component-dependent system noise encountered in diverse deployment environments. For a given decoding or resizing method s, we compute the accuracy as its systematic robustness using Eq. (10):

$$ \mathit{ACC}_{s}=P_{({x},{y})\sim \mathcal{D}}{\bigl(f \bigl(s({x})\bigr)={y}\bigr)}. $$
(10)

To emphasize the impact of noise, we introduce systematic noise impact (SNI) as a metric to quantify the relative performance drop:

$$ \mathit{SNI} = \frac{\mathit{ACC} - \mathit{ACC}_{s}}{\mathit{ACC}}. $$
(11)

Furthermore, to evaluate the robustness among all systematic noise, we calculate the standard deviation of their \({\mathit{ACC}}_{s}\) as systematic robustness (SR):

$$ \mathit{SR} = \mathbb{D}_{s\sim S}\bigl({P_{({x},{y})\sim \mathcal{D}}{\bigl(f \bigl(s({x})\bigr)={y}\bigr)}}\bigr), $$
(12)

where S denotes a set of decoding or resizing methods. A lower value of SR means better stability towards different types of systematic noise.
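The systematic robustness metrics follow the same pattern; the sketch below assumes per-noise accuracies have already been collected, and the helper name is ours.

```python
import numpy as np

def systematic_metrics(acc_clean, acc_per_noise):
    """Compute SNI (Eq. (11)) and SR (Eq. (12)) from per-noise accuracies ACC_s.

    acc_per_noise: dict mapping a decoder or resize mode -> accuracy ACC_s under it.
    """
    sni = {s: (acc_clean - a) / acc_clean for s, a in acc_per_noise.items()}
    # SR is the standard deviation of ACC_s across all decoders and resize modes;
    # a lower SR means more stable behavior across deployment environments.
    sr = float(np.std(list(acc_per_noise.values())))
    return sni, sr
```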

3.3 Evaluation objects

3.3.1 Dataset

Our RobustMQ aims to obtain broadly applicable results and conclusions for quantized models in the computer vision field. Therefore, we focus on the basic image classification tasks and follow well-established quantization literature [9, 4345] to employ the large-scale ImageNet [64] dataset, which is widely recognized as a standard benchmark within the realm of quantization and computer vision. ImageNet provides a more extensive collection of images and classes, making it more suitable as a benchmark for testing models in robustness evaluation, in contrast to commonly used small-scale datasets such as MNIST [65], CIFAR-10 [66], and CIFAR-100 [66]. With an expansive repository of 1.2 million training images, 50,000 validation images, and spanning 1000 different classes, the ImageNet dataset significantly bolsters the applicability of our evaluation methodology to real-world scenarios.

3.3.2 Network architectures

Our RobustMQ contains four architectures, including ResNet18 [46], ResNet50 [46], RegNetX600M [47], and MobileNetV2 [48]. 1) ResNet18 and ResNet50 are classic backbone architectures that have proven to be highly effective in various computer vision tasks. Both architectures are built on the concept of residual blocks, which employ skip connections to mitigate the vanishing gradient problem and facilitate the training of deep networks. 2) RegNetX600M is an advanced architecture discovered through model structure search, specifically optimized to achieve efficient and powerful feature extraction. It leverages group convolution to enable parallel processing and significantly reduce computational complexity, making it ideal for resource-constrained edge devices. 3) MobileNetV2 is a lightweight network designed for efficient deployment on edge devices. It employs depthwise separable convolutions, which separate the spatial and channel-wise convolutions, reducing the computational burden while maintaining performance.

3.3.3 Quantization methods

Within RobustMQ, we concentrate on three widely used quantization methods: DoReFa [43], PACT [44], and LSQ [45]. LSQ and DoReFa quantize both model weights and activations, whereas PACT applies its parameterized clipping to quantize activations while leveraging the DoReFa method for weight quantization. For the quantization bit-widths, we adopt the set commonly used in deployments (i.e., 2, 4, 6, and 8). For each architecture, we first quantize models on the ImageNet training set starting from the same floating-point model, and then evaluate their robustness against perturbations generated on the ImageNet validation set.
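The overall evaluation protocol can be summarized as a simple grid over architectures, methods, and bit-widths, as in the hypothetical sketch below; the three callables are stand-ins for loading the shared pre-trained model, running QAT, and applying the robustness evaluations of Sect. 3.1, not released benchmark code.

```python
ARCHITECTURES = ["resnet18", "resnet50", "regnetx600m", "mobilenetv2"]
METHODS = ["dorefa", "pact", "lsq"]
BIT_WIDTHS = [2, 4, 6, 8]

def run_benchmark(load_float_model, quantize_and_finetune, evaluate_robustness):
    """Evaluate every (architecture, method, bit-width) combination from one FP32 start."""
    results = {}
    for arch in ARCHITECTURES:
        float_model = load_float_model(arch)            # same FP32 starting point per arch
        for method in METHODS:
            for bits in BIT_WIDTHS:
                quantized = quantize_and_finetune(float_model, method=method, bits=bits)
                results[(arch, method, bits)] = evaluate_robustness(quantized)
    return results
```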

4 Experiments and analysis

In this section, we showcase the evaluation results of the quantized models under different types of noise. Subsequently, we consolidate several conclusive insights gleaned from the evaluations by addressing the following research questions: 1) How robust are quantized models compared with floating-point models? 2) Which quantization method or bit-width exhibits greater robustness against perturbations? 3) What is the impact of architecture or size on the robustness of quantized models? 4) To which type of noise is the quantized model most susceptible?

4.1 Clean accuracy

Table 1 reports the clean accuracies of the quantized models. Most of the quantized models maintain comparable accuracy to their 32-bit pre-trained models, while certain quantization methods may struggle to maintain comparable accuracy when using low bit-widths (e.g., 2-bit). Among the quantized models evaluated, 12 models fail to converge. For example, ResNet18 PACT 2-bit achieves a mere 2.61% accuracy. Therefore, we label these models as “NC” and exclude them from our evaluations to ensure fair and reliable assessments of robustness.

Table 1 Clean accuracy of quantized models and floating-point models. “NC” denotes not converged. The best performers in each architecture are highlighted in bold

4.2 Evaluation of adversarial attacks

Under middle and large perturbation budgets, the AR and WCAR values degrade to almost 0 due to the increasing attack strength. Therefore, in this section, we primarily present the results under small budgets to highlight the differences between models. The worst-case robustness results under all adversarial attacks with small budgets are presented in Table 2, and the results under the FGSM-\(\ell _{\infty}\) and PGD-\(\ell _{\infty}\) attacks are presented in Table 3 and Table 4, respectively. Other adversarial robustness evaluation results can be found on our website [49]. From the results, we can make several observations about quantized models as follows.

Table 2 Worst-case adversarial robustness of quantized models under all adversarial attacks with small budgets. The results are presented using WCAR (small ϵ). The best performers in each architecture are highlighted in bold
Table 3 Adversarial robustness of models under the FGSM-\(\ell _{ \infty}\) attack with a small budget (\(\epsilon =0.5/255\)). The results are shown with AR. The best performers in each architecture are highlighted in bold
Table 4 Adversarial robustness of quantized models under the PGD-\(\ell _{\infty}\) attack with a small budget (\(\epsilon =0.5/255\)). The results are shown with AR. The best performers in each architecture are highlighted in bold

1) Better adversarial robustness vs. floating-point models. In contrast to the decrease in clean accuracy, quantized models exhibit higher worst-case adversarial robustness and are almost always better than their floating-point counterparts. For example, the ACC of ResNet18 DoReFa 2-bit is 8.75% lower than that of the floating-point ResNet18 (see Table 1), while its WCAR is 4.25% higher (see Table 2). Moreover, Fig. 2 illustrates that under the same quantization method, the adversarial robustness of quantized ResNet18 models increases as the quantization bit-width decreases. These phenomena can also be observed in other network architectures, suggesting that quantization can provide a certain degree of defense against adversarial attacks.

Figure 2 Adversarial robustness of ResNet18 models under specific attacks. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. WCAR denotes the worst-case adversarial robustness. The "NC" models are assigned an AR value of 0 in the figure. Similar trends, with AR decreasing as the bit-width increases, are observed in the results of other architectures as well

2) At the same quantization bit-width, PACT outperforms the other quantization methods in terms of worst-case adversarial robustness. For instance, RegNetX600M PACT 4-bit achieves a WCAR of 6.21%, while the DoReFa and LSQ quantized models achieve WCAR values of 2.74% and 2.05%, respectively (see Table 2). However, when facing specific adversarial attacks, the robustness ranking of quantization methods is not consistent across model architectures (see Table 3 and Table 4). For the ResNet18, ResNet50, and MobileNetV2 architectures, LSQ demonstrates better robustness against the FGSM-\(\ell _{\infty}\) and PGD-\(\ell _{\infty}\) attacks, whereas for RegNetX600M, PACT exhibits better robustness.

3) Regarding model size and network architecture, quantized models show trends similar to those of floating-point models. In terms of model size, quantized models with larger network capacity exhibit better adversarial robustness, consistent with the findings in Ref. [33]. For instance, within the ResNet family, all quantized ResNet50 models demonstrate better robustness than ResNet18 against all adversarial attacks. In terms of network architecture, the adversarial robustness of quantized models follows the order: MobileNetV2 > RegNetX600M > ResNet50 > ResNet18, which coincides with the findings in Ref. [33]. This highlights the significance of architecture design in achieving better adversarial robustness.

4) For the adversarial attack methods, their attack capabilities in quantized models are generally consistent with those in floating-point models: AutoAttack > PGD > FGSM. However, quantized models demonstrate varying degrees of adversarial robustness against different attack methods. For instance, compared to DoReFa and PACT, LSQ performs better under FGSM and PGD attacks but is more vulnerable to AutoAttack.

4.3 Evaluation of natural corruption

We first report the aggregated evaluation results (i.e., mean natural robustness) among 15 corruption methods for all quantized models, as displayed in Table 5. Next, we explore the natural robustness of different quantized models under each corruption method. For instance, we provide the natural robustness results for the ResNet18, ResNet50, and MobileNetV2 architectures in Table 6, Fig. 3, and Fig. 4, respectively. Due to space limitations, more detailed results can be found on our website [49]. The natural robustness evaluation results for quantized models reveal several key observations as follows.

Figure 3 Natural robustness of ResNet50 models under specific corruption. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. ACC denotes the clean accuracy of the models, and mNR represents the mean natural robustness of the models. The "NC" models are labeled with "×" in the figure

Figure 4 Natural robustness of MobileNetV2 models under specific corruption. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. ACC denotes the clean accuracy of the models, and mNR represents the mean natural robustness of the models. The "NC" models are labeled with "×" in the figure

Table 5 Natural robustness of quantized models under all corruption methods. The results are presented using mNR and NCI. The best performers in each architecture are highlighted in bold
Table 6 Natural robustness results for ResNet18 models are shown with NCI for each corruption (e.g., Gauss). The most influential noise is marked in bold, and the least influential noise is underlined

1) Worse natural robustness vs. floating-point models. Despite achieving clean accuracy similar to that of the 32-bit model, quantized models are more susceptible to natural corruption. Notably, the 2-bit quantized model is severely affected, with a performance decrease that far exceeds its drop in clean accuracy. As an example, ResNet18 DoReFa 2-bit exhibits an NCI of 62.61%, higher than the 53.87% NCI of the corresponding floating-point model (see Table 5), indicating that quantization can negatively impact the model's robustness against such perturbations. Furthermore, Fig. 3 demonstrates that under the same quantization method, the natural robustness of quantized ResNet50 models generally increases as the quantization bit-width increases, although the rate of improvement gradually diminishes. Similar phenomena can also be observed in other network architectures, indicating that deploying quantized models in real scenarios requires more care than deploying floating-point models. For instance, leveraging corruption data to enhance natural robustness becomes crucial for maintaining the performance of quantized models.

2) At the same quantization bit-width, quantization methods exhibit varied performance on different architectures. As presented in Table 5, LSQ achieves a smaller relative performance drop (i.e., a lower NCI value) than DoReFa and PACT on the ResNet18 and ResNet50 architectures. However, DoReFa and PACT outperform LSQ across most bit-widths on the RegNetX600M and MobileNetV2 architectures. When examining a specific architecture, Fig. 3 reveals a consistent relative trend among different quantization methods across various corruption methods (other architectures also show this trend). Specifically, the order of impact on model performance is as follows: Noise > Blur > Weather (except brightness) > Digital. This consistent trend emphasizes the varying degrees of sensitivity of quantized models to different types of corruption, and it remains independent of the specific quantization method used.

3) Within the same network family, larger network capacity leads to better natural robustness. Similar to the adversarial evaluation, the quantized ResNet50 models present a higher mNR and a lower NCI value than ResNet18 under natural corruption. Across different network architectures, the natural robustness of the quantized models follows the order: ResNet > RegNetX600M > MobileNetV2, which is exactly opposite to the order observed under adversarial attacks.

4) Regarding natural corruption methods, our conclusions are as follows. For ResNet18, ResNet50, and RegNetX600M, impulse noise has the most severe impact on the model’s robustness. For MobileNetV2, glass blur is the most harmful corruption to the model’s robustness. Among all the corruption methods, brightness has the least impact on the model’s robustness. ResNet18 quantized models exhibit an average decrease of approximately 53.15% under impulse noise, whereas the average decrease is only approximately 11.31% under brightness weather corruption (see Table 6). Similarly, MobileNetV2 quantized models show an average decrease of approximately 49.04% under glass blur, while the average decrease is only approximately 12.47% under brightness.

For natural robustness under corruption sequences, we report the evaluation results in Table 7, Table 8, Fig. 5, and Fig. 6. Table 7 shows the mean flip probability of quantized models against multiple sequences. Table 8, Fig. 5, and Fig. 6 report the detailed robustness results under specific corruption sequences for ResNet18, ResNet50 and RegNetX600M, respectively. Other results can be found on our website [49]. From these results, we can draw several conclusions about natural robustness under dynamic and continuous corruption, which mostly align with our observations from static natural corruption evaluations.

Figure 5 Natural robustness of ResNet50 models under corruption sequences. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. mFP denotes the mean flip probability of the models. The "NC" models are labeled with "×" in the figure

Figure 6 Natural robustness of RegNetX600M models under corruption sequences. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. mFP denotes the mean flip probability of the models. The "NC" models are labeled with "×" in the figure

Table 7 Natural robustness of quantized models under all corruption sequences. The results are presented using mFP. The best performers in each architecture are highlighted in bold
Table 8 Natural robustness of ResNet18 under corruption sequences. Results for each corruption (e.g., Gauss) are shown using FP. The most influential noise is marked in bold, and the least influential noise is underlined

1) Worse dynamic natural robustness vs. floating-point models. Consistent with the static natural corruption results, quantized models also exhibit inferior performance compared to floating-point models under dynamic corruption sequences. Furthermore, the 2-bit quantized models demonstrate extreme instability. For example, ResNet18 DoReFa 2-bit exhibits an mFP of 0.283, nearly three times that of the floating-point ResNet18 (mFP of 0.098). As depicted in Fig. 5, the natural robustness increases with increasing bit-width (from right to left in each subfigure).

2) At the same quantization bit-width, quantization methods under dynamic natural corruption exhibit trends similar to those observed under static natural corruption. For example, LSQ achieves better robustness on the ResNet architectures but is less dominant on lightweight architectures such as RegNetX600M and MobileNetV2.

3) For specific corruption sequences, shot noise has the most severe impact on all models, while brightness remains the least harmful. Additionally, we observe that for RegNetX600M models, the floating-point model is not the most robust under shot noise (see Fig. 6), which is inconsistent with the trend observed in ResNet models (see Fig. 5).

4.4 Evaluation of systematic noise

We first report the aggregated systematic robustness results among 14 types of systematic noise for all quantized models (see Table 9). Then, we explore the performance of different quantized models under specific types of systematic noise. For instance, we provide the systematic robustness results for the ResNet18, ResNet50, and RegNetX600M architectures in Table 10, Fig. 7, and Fig. 8, respectively. Due to space limitations, more detailed results can be found on our website [49]. The evaluation results provide valuable insights into the systematic robustness of quantized models. Several key observations are as follows:

Figure 7 Systematic robustness of ResNet50 models under specific types of noise. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. ACC denotes the clean accuracy of the models, and "mean" represents the average accuracy of the models under all noise. The "NC" models are labeled with "×" in the figure

Figure 8 Systematic robustness of RegNetX600M models under specific types of noise. The models are quantized by (a) DoReFa, (b) PACT, and (c) LSQ, respectively. ACC denotes the clean accuracy of the models, and "mean" represents the average accuracy of the models under all noise. The "NC" models are labeled with "×" in the figure

Table 9 Mean systematic robustness of models under all types of noise. The results are shown with mean \(\mathit{ACC}_{s}\) and SR. The best performers in each architecture are highlighted in bold
Table 10 Systematic robustness of ResNet18 models under each type of systematic noise. The results for each type of noise (e.g., bilinear) are presented using SNI. The most influential noise is marked in bold, and the least influential noise is underlined

1) Worse systematic robustness vs. floating-point models. Although the system noise prevalent in deployment environments does not cause a significant drop in the prediction accuracy of quantized models, their performance across different deployment scenarios exhibits considerable instability. In contrast, floating-point models show little fluctuation. For instance, on the ResNet18 architecture, all quantized models exhibit at least 3.43 times higher instability (SR) than the floating-point ResNet18, confirming the sensitivity of quantized models to system noise. In particular, 2-bit quantized models are significantly affected by systematic noise (e.g., ResNet18 DoReFa 2-bit shows a drop of 4.49% in terms of \({\mathit{ACC}}_{s}\) and a 4.38-fold increase in instability). For the same quantization method, lower-bit models are generally less robust (i.e., less stable). This indicates that maintaining consistency between the deployment and training processes is crucial to avoid unnecessary accuracy loss.

2) At the same quantization bit-width, quantization methods demonstrate varied performance across different architectures. Generally, LSQ dominates on the ResNet and RegNetX600M architectures, while PACT performs better on MobileNetV2. When focusing on a specific architecture (such as ResNet50 in Fig. 7 and RegNetX600M in Fig. 8), we observe a similar relative impact across different decoders and resize modes for all three quantization methods.

3) Within the same network family, larger network capacity could lead to better systematic robustness. For instance, quantized ResNet50 models present a lower SR value than ResNet18 under systematic noise. Across different network architectures, the systematic robustness of quantized models follows the order: RegNetX600M > ResNet > MobileNetV2. It is worth noting that the RegNetX600M floating-point model exhibits the worst systematic robustness, while surprisingly, the quantized RegNetX600M models demonstrate the best systematic robustness among all architectures.

4) We draw the following conclusions when considering specific types of systematic noise. Among the 14 types of systematic noise, the nearest neighbor interpolation methods in the Pillow and OpenCV libraries have the most harmful impact on model performance, inducing nearly a 6% decrease in performance for the 2-bit ResNet18 models (see Table 10). By contrast, the least impactful types of noise are bilinear interpolation and cubic interpolation in the Pillow library, as well as area interpolation in the OpenCV library. For example, under bilinear interpolation from the Pillow library, 2-bit ResNet18 models present only a 1.67% decrease in performance on average. Regarding the different decoders, the performance of quantized models fluctuates only slightly across the three decoders.

5 Conclusion

This paper presents a benchmark named RobustMQ, which aims to evaluate the robustness of quantized models under various perturbations, including adversarial attacks, natural corruption, and systematic noise. The benchmark evaluates four classical architectures and three popular quantization methods with four different bit-widths. The comprehensive results empirically provide valuable insights into the performance of quantized models in various scenarios. Some of the key observations are summarized as follows. 1) Quantized models exhibit higher adversarial robustness than their floating-point counterparts, but are more vulnerable to natural corruption and systematic noise. 2) Using the same quantization method, we observe that as the quantization bit-width increases, the adversarial robustness decreases, the natural robustness increases, and the systematic robustness increases. 3) Among the 15 corruption methods, impulse noise consistently exhibits the most harmful impact on ResNet18, ResNet50, and RegNetX600M models, while glass blur is the most harmful corruption on MobileNetV2 models. By comparison, brightness is observed to be the least harmful corruption for all models. 4) Among the 14 types of systematic noise, the nearest neighbor interpolation has the highest impact, while bilinear interpolation, cubic interpolation in Pillow, and area interpolation in OpenCV are the three least harmful types of noise across different architectures. We hope that our benchmark will significantly contribute to the assessment of the robustness of quantized models. The insights gained from our study could further support the development and deployment of robust quantized models in real-world scenarios.