RobustMQ: Benchmarking Robustness of Quantized Models

Quantization has emerged as an essential technique for deploying deep neural networks (DNNs) on devices with limited resources. However, quantized models exhibit vulnerabilities when exposed to various noises in real-world applications. Despite the importance of evaluating the impact of quantization on robustness, existing research on this topic is limited and often disregards established principles of robustness evaluation, resulting in incomplete and inconclusive findings. To address this gap, we thoroughly evaluated the robustness of quantized models against various noises (adversarial attacks, natural corruptions, and systematic noises) on ImageNet. The comprehensive evaluation results empirically provide valuable insights into the robustness of quantized models in various scenarios, for example: (1) quantized models exhibit higher adversarial robustness than their floating-point counterparts, but are more vulnerable to natural corruptions and systematic noises; (2) in general, increasing the quantization bit-width results in a decrease in adversarial robustness, an increase in natural robustness, and an increase in systematic robustness; (3) among corruption methods, impulse noise and glass blur are the most harmful to quantized models, while brightness has the least impact; (4) among systematic noises, nearest neighbor interpolation has the highest impact, while bilinear interpolation, cubic interpolation, and area interpolation are the three least harmful. Our research contributes to advancing the robust quantization of models and their deployment in real-world scenarios.


Introduction
Deep neural networks (DNNs) have demonstrated impressive performance in a broad range of applications, including computer vision [1,2], natural language processing [3,4], and speech recognition [5,6]. However, the deployment of large DNNs on resource-constrained devices, such as smartphones and embedded systems, poses challenges due to their high memory and computational requirements. To address this issue, researchers have proposed various model compression techniques, including model quantization [7-11], pruning [12-16], and neural network distillation [17,18]. Among these techniques, model quantization has become a critical approach for compressing DNNs due to its ability to preserve the network structure while achieving comparable performance. By mapping the network parameters from continuous 32-bit floating-point (FP) numbers to discrete low-bit integers, model quantization reduces memory usage and accelerates inference, making it well-suited for resource-constrained devices.
While model quantization offers a viable solution for deploying DNNs on resource-constrained devices, it also presents challenges in ensuring the trustworthiness (such as robustness, fairness, and privacy) of quantized models in the real world [19-29]. DNNs are highly susceptible to adversarial examples [20,25,26,30], perturbations carefully designed to be imperceptible to human vision yet able to easily deceive DNNs, posing a significant threat to practical deep learning applications. For example, the placement of three small adversarial stickers on a road intersection can cause the Tesla Autopilot system [31] to misinterpret the lane markings and swerve into the wrong lane, which severely endangers people's lives and can even lead to death. In addition, DNNs are vulnerable to natural corruptions [32] such as snow and motion blur, which are common in real-world scenarios and can significantly reduce the accuracy of DNN models. Moreover, systematic noises resulting from the mismatch between software and hardware can also have a detrimental impact on model accuracy [33]. These vulnerabilities demonstrate that quantized networks deployed in safety-critical applications (such as autonomous driving and face recognition) are unreliable when faced with various perturbations in real-world scenarios.
Therefore, it is critical to conduct a comprehensive evaluation of the robustness of quantized models before their deployment to identify potential weaknesses and unintended behaviors. In recent years, researchers have developed robustness benchmarks [34-38] tailored for assessing the robustness of deep learning models, employing multiple adversarial attack methods to thoroughly evaluate floating-point networks across various tasks. Through extensive experiments, researchers have revealed and substantiated several insights, such as the observation that larger model parameter sizes often lead to better adversarial robustness [34,39], which highlights the significance of model complexity in determining robustness. While numerous studies have extensively investigated the robustness of floating-point networks, research on the robustness of quantized models [40-43] remains inadequate, lacking diversity in noise sources and relying solely on small datasets. Consequently, the existing literature fails to thoroughly assess the robustness of quantized models, leaving a gap in the understanding of their vulnerabilities and strengths.
To bridge this gap, we build RobustMQ, a comprehensive robustness evaluation benchmark for quantized models. Our benchmark systematically assesses the robustness of quantized models using 3 popular quantization methods (i.e., DoReFa [44], PACT [45], and LSQ [46]) and 4 classical architectures (ResNet18 [47], ResNet50 [47], RegNetX600M [48], and MobileNetV2 [49]). Each method is evaluated at four commonly used bit-widths. To thoroughly study the robustness of quantized models against noises originating from different sources, our analysis comprises 3 progressive adversarial attacks (covering ℓ1, ℓ2, and ℓ∞ magnitudes, along with three different perturbation budgets), 15 natural corruptions, and 14 systematic noises on the ImageNet benchmark. Our empirical results demonstrate that lower-bit quantized models exhibit better adversarial robustness but are more susceptible to natural corruptions and systematic noises. In summary, increasing the quantization bit-width generally leads to a decrease in adversarial robustness, an increase in natural robustness, and an increase in systematic robustness. Moreover, our findings indicate that impulse noise and glass blur are the most harmful corruption methods for quantized models, while brightness has the least impact. Additionally, among systematic noises, nearest neighbor interpolation has the highest impact, while bilinear interpolation, cubic interpolation, and area interpolation are the three least harmful. Our main contributions can be summarized as follows:
- To the best of our knowledge, RobustMQ is the first to comprehensively evaluate the robustness of quantized models.

Related Work
In this section, we review related work on network quantization, adversarial attacks, and recent advancements at the intersection of these fields.

Network quantization
Network quantization compresses DNN models by reducing the number of bits required to represent each weight, saving memory and speeding up hardware inference [51]. A classic quantization process involves both quantization and de-quantization operations. The quantization function maps real values r to integers, while the de-quantization function allows approximate recovery of real values r′ from the quantized values. This process can be mathematically formulated as:

Q(r) = round(r / S) + Z,    r′ = S · (Q(r) − Z),

where Q is the quantization operator, r and r′ are the real value and the de-quantized real value respectively, and S and Z denote the scale and zero-point respectively. Given t bits, the range of the quantized integers is determined by t (e.g., [0, 2^t − 1] for unsigned quantization). After quantization, the recovered real values r′ may not be exactly equal to the original values r due to the rounding operation. Quantization methods can be broadly classified into two strategies: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ methods are applied after the model is fully trained, without any adjustments to the model during the training process, which often results in lower accuracy. In contrast, QAT methods involve fine-tuning or retraining the model with training data to achieve higher accuracy in the quantized form. Thus, we primarily focus on QAT methods, and here we provide a brief review of the commonly used ones.
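To make the mapping concrete, the following is a minimal NumPy sketch of the quantize/de-quantize pair described above; the function names, example values, and the unsigned-range convention are illustrative choices, not taken from the paper.

```python
import numpy as np

def quantize(r, S, Z, t=8):
    """Map real values r to t-bit unsigned integers: Q = clip(round(r/S) + Z)."""
    q = np.round(r / S) + Z
    return np.clip(q, 0, 2 ** t - 1).astype(np.int32)

def dequantize(q, S, Z):
    """Approximately recover real values: r' = S * (q - Z)."""
    return S * (q - Z)

# r' differs slightly from r because of rounding (and clipping at the range ends).
r = np.array([-1.3, 0.02, 0.7, 2.5])
S, Z = 0.02, 128  # example scale and zero-point
r_prime = dequantize(quantize(r, S, Z), S, Z)
```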
One line of research designed rules to fit the quantizer to the data distribution [44,52]. For example, DoReFa-Net [44] simply clips the activation to [0, 1] and then quantizes it, based on the observation that most activations fall into this range in many network architectures (e.g., AlexNet [1] and ResNet [47]). Other notable work focused on learning appropriate quantization parameters during the backpropagation process [45,46,53]. PACT [45] clips the activation by a handcrafted parameter and optimizes the clipping threshold. However, PACT has no gradient below the clip point, which can lead to gradient vanishing problems during backpropagation. Noticing this limitation of PACT, LSQ [46] estimates the gradient at each weight and activation layer to adaptively adjust the quantization step size. By learning the scale alongside network parameters, LSQ achieves more fine-grained quantization, which improves the accuracy of quantized models.
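As a rough illustration of the learned-threshold idea, here is a minimal PyTorch sketch of a PACT-style activation quantizer with a learnable clipping threshold α and a straight-through estimator for the rounding step; this is a simplification of the method in [45], not the authors' implementation, and the default values are arbitrary.

```python
import torch
import torch.nn as nn

class PACTActivation(nn.Module):
    """Sketch: clip activations to [0, alpha] with a learnable threshold,
    then uniformly quantize to t bits (straight-through gradient for round)."""
    def __init__(self, t=4, alpha_init=10.0):
        super().__init__()
        self.t = t
        self.alpha = nn.Parameter(torch.tensor(alpha_init))

    def forward(self, x):
        # PACT clipping: 0.5 * (|x| - |x - alpha| + alpha) equals clip(x, 0, alpha)
        # and lets the gradient flow to alpha for inputs above the clip point.
        y = 0.5 * (x.abs() - (x - self.alpha).abs() + self.alpha)
        scale = (2 ** self.t - 1) / self.alpha
        y_q = torch.round(y * scale) / scale       # uniform quantization in [0, alpha]
        return y + (y_q - y).detach()              # straight-through estimator
```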

Adversarial attacks
Adversarial examples are inputs with small perturbations that can easily mislead DNNs [30]. Formally, given a DNN f_Θ and an input x with the ground truth label y, an adversarial example x_adv satisfies

f_Θ(x_adv) ≠ y,    s.t. ∥x_adv − x∥ ≤ ϵ,

where ∥·∥ is a distance metric, commonly measured by the ℓp-norm (p ∈ {1, 2, ∞}), and ϵ is the perturbation budget.
A long line of work has been dedicated to performing adversarial attacks [20-22, 27, 30, 39, 54], which can be mainly divided into white-box and black-box manners based on access to the target model. For white-box attacks, adversaries have complete knowledge of the target model and can fully access it; for black-box attacks, adversaries have limited or even no knowledge of the target model and cannot directly access it. This paper primarily employs white-box attacks to evaluate the adversarial robustness of target models, as they offer stronger attack capabilities. Here, we introduce the attack methods relevant to our benchmark.
Fast Gradient Sign Method (FGSM). FGSM [30] is a one-step attack method under the ℓ∞-norm. It calculates the gradient of the loss function with respect to the input only once and adds a perturbation in the direction of the gradient's sign to generate an adversarial example. Although FGSM has a relatively weak attack capability, it is computationally efficient in generating adversarial examples.
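A minimal PyTorch sketch of FGSM follows; it assumes inputs in [0, 1] and a model that returns logits, and the function name and default budget are illustrative.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.5 / 255):
    """One-step l_inf attack: x_adv = clip(x + eps * sign(grad_x loss))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    x_adv = x + eps * grad.sign()
    return x_adv.clamp(0, 1).detach()
```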
Projected Gradient Descent (PGD). PGD [39] is regarded as one of the most powerful and widely used attacks due to its high attack success rate. It builds upon FGSM by introducing an iterative process that projects the perturbation back onto the allowed budget at each step.
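Reusing the imports from the FGSM sketch above, a PGD-ℓ∞ variant iterates the gradient step and projects back into the ϵ-ball after each update; the step-size heuristic below is a common convention, not a setting taken from the paper.

```python
def pgd_attack(model, x, y, eps=0.5 / 255, steps=10, alpha=None):
    """Iterative l_inf attack with projection onto the eps-ball around x."""
    alpha = alpha if alpha is not None else 2.5 * eps / steps  # common heuristic
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project onto the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```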
AutoAttack. Croce et al. [54] proposed two automatic step-size adjustment methods (APGD-CE and APGD-DLR) to address problems such as suboptimal step sizes in PGD. They then combined these with two existing complementary attacks to form a parameter-free and computationally affordable attack, AutoAttack. AutoAttack has demonstrated superior performance compared to state-of-the-art attacks, making it a crucial tool for evaluating model robustness.
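For reference, the authors of [54] released a standalone package; a typical invocation looks like the following sketch, assuming the reference `autoattack` package, inputs in [0, 1], and placeholder tensors `x_test` and `y_test`.

```python
# pip install git+https://github.com/fra31/auto-attack
from autoattack import AutoAttack

adversary = AutoAttack(model, norm='Linf', eps=0.5 / 255, version='standard')
x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=128)
```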

Robustness of quantized models
A number of studies have been proposed to evaluate the robustness of floating-point networks [24,28,34,35,55]. For instance, Croce et al. [35] introduced RobustBench, a benchmark based on the CIFAR dataset, which employs AutoAttack to assess the robustness of models strengthened by various defense methods, including adversarial training, gradient masking, and label smoothing. In contrast to the evaluation of defense methods, Tang et al. [34] focused on investigating the robustness of model structures and training techniques using the large-scale ImageNet dataset, offering valuable insights for training robust models.
Comparatively, Merkle et al. [56] benchmarked the adversarial robustness of pruned models for several pruning methods, revealing that pruned models enjoy better robustness against adversaries. However, the robustness of quantized networks has been relatively underexplored. Bernhard et al. [40] utilized an ensemble of quantized models to filter adversarial examples, given that current adversarial attacks demonstrate limited transferability against quantized models. Lin et al. [41] proposed a defensive quantization method to suppress the amplification of adversarial noise during propagation by controlling the Lipschitz constant of the network during quantization. Similarly, Alizadeh et al. [42] designed a regularization scheme to improve the robustness of quantized models by controlling the magnitude of adversarial gradients. In this paper, we aim to thoroughly evaluate the robustness of quantized models against multiple noises across several quantization methods, architectures, and bit-widths.

RobustMQ Benchmark
Existing research on the impact of quantization compression on neural network robustness is fragmented and lacks adherence to established principles in robustness evaluation. To address this issue, this paper proposes RobustMQ, a comprehensive robustness evaluation benchmark for quantized models with consistent settings. RobustMQ provides researchers with a valuable tool to gain deeper insights into the impact of various perturbations on quantized model robustness, aiding the development of more robust quantization methods for deploying reliable and secure deep learning models in real-world scenarios. The RobustMQ benchmark encompasses adversarial robustness, natural robustness, and systematic robustness, considering three quantization methods, four bit-widths, four architectures, three progressive adversarial attack methods (covering three magnitudes and three perturbation budgets), fifteen natural corruptions, and fourteen systematic noises on ImageNet.

Robustness evaluation approaches
Quantized models that are extensively deployed on edge devices are vulnerable to various perturbations in real-world scenarios. In accordance with the guidelines proposed by Tang et al. [34], we classify these perturbations into adversarial attacks, natural corruptions, and systematic noises, and leverage them to thoroughly evaluate the robustness of quantized models.

Adversarial attacks
To model the worst-case scenario (i.e., the strongest adversaries), we consider attacks conducted in a white-box manner, where the adversary has full access to the model architecture, training data, and gradient information. Specifically, we employ FGSM-ℓ∞, PGD-ℓ1, PGD-ℓ2, PGD-ℓ∞, and AutoAttack-ℓ∞ to craft adversarial perturbations. These three attack methods (FGSM, PGD, and AutoAttack) form a progressive evaluation, in which their computational cost and attack capability increase in turn, enabling a comprehensive assessment of the quantized model's adversarial robustness. Furthermore, we set three progressive perturbation budgets (small, middle, and large) for each attack method.

Natural corruptions
To simulate natural corruptions, we utilize 15 distinct perturbation methods from the ImageNet-C benchmark [32]. These methods can be categorized into four groups: (1) noise, which includes gaussian noise, shot noise, and impulse noise; (2) blur, which includes defocus blur, frosted glass blur, motion blur, and zoom blur; (3) weather, which includes snow, frost, fog, and brightness; and (4) digital, which includes contrast, elastic, pixelation, and JPEG compression. Each corruption type is evaluated at five levels of severity to account for variations in the intensity of corruptions, yielding 75 corruption settings in total for the natural robustness evaluation. In addition to single corruption images, we also investigate corruption sequences generated from ImageNet-P [32]. Each sequence in ImageNet-P comprises more than 30 frames, allowing us to study the robustness of quantized models against dynamic and continuous corruptions.

Systematic noises
Moreover, systematic noises are always present when models are deployed on edge devices due to changes in hardware or software. To assess their influence on quantized models, we incorporate pre-processing operations from ImageNet-S [33], which involve image decoding and image resizing. Image decoding refers to the process of converting an original image file into an RGB tensor, in which the inverse discrete cosine transform (iDCT) serves as a core step. However, discrepancies in the implementation of iDCT among various image processing libraries result in variations in the output. As a consequence, the pixel values of the final RGB tensor are affected, leading to slight differences in the decoded images. Therefore, we employ the decoders from third-party libraries such as Pillow [57], OpenCV [58], and FFmpeg [59] to obtain systematic noises. Additionally, image resizing is utilized to change the image resolution. In the resize process, different interpolation algorithms are employed to compute the new pixel values, potentially leading to slight variations between them. Thus, for image resizing, we incorporate the bilinear, nearest, cubic, hamming, lanczos, area, and box interpolation modes from the OpenCV and Pillow tools. In total, the systematic noises consist of three frequently used decoders and seven commonly used resize modes.
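To illustrate where these noises come from, the sketch below decodes and resizes the same image with Pillow and OpenCV; the resulting tensors differ slightly pixel-by-pixel even though the pipelines look equivalent. The filename and target resolution are placeholders, and this is only a subset of the decoder/interpolation combinations the benchmark uses.

```python
import cv2
from PIL import Image

# Two decoders for the same file: small pixel-level differences in the
# RGB tensors act as "systematic noise" at test time.
img_pil = Image.open('sample.jpg').convert('RGB')                    # Pillow decoder
img_cv = cv2.cvtColor(cv2.imread('sample.jpg'), cv2.COLOR_BGR2RGB)   # OpenCV decoder

# Different interpolation modes for the resize step.
resized = {
    'pillow_bilinear': img_pil.resize((224, 224), Image.BILINEAR),
    'pillow_nearest':  img_pil.resize((224, 224), Image.NEAREST),
    'opencv_area':     cv2.resize(img_cv, (224, 224), interpolation=cv2.INTER_AREA),
    'opencv_cubic':    cv2.resize(img_cv, (224, 224), interpolation=cv2.INTER_CUBIC),
}
```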

Adversarial robustness
For a specific adversarial attack, we adopt the model accuracy under attack to measure adversarial robustness (AR), which equals one minus the Attack Success Rate (i.e., 1 − ASR). Mathematically, AR can be calculated as:

AR = (1/|D|) · Σ_{(x,y)∈D} 1[ f(A_{ϵ,p}(x)) = y ],

where f is the tested target model, D is the validation set, A_{ϵ,p} represents the adversary, and ϵ and p denote the perturbation budget and distance norm respectively. This metric quantifies the model's capability to retain normal functioning under attack, with a higher AR indicating a stronger model against the specific adversarial attack. Since we aim to compare robustness among models with different clean accuracies, it is also crucial to measure the relative performance drop against adversarial attacks, denoted as AAI (Adversarial Attack Impact):

AAI = (ACC − AR) / ACC,

where ACC represents the clean accuracy. A lower AAI value indicates a more robust model.
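A minimal PyTorch sketch of these two metrics, assuming a hypothetical `attack(model, x, y)` callable (such as the FGSM/PGD sketches above) and a standard DataLoader; the function names are illustrative.

```python
import torch

def adversarial_robustness(model, loader, attack):
    """AR = fraction of adversarial examples still classified correctly (1 - ASR)."""
    correct, total = 0, 0
    for x, y in loader:
        x_adv = attack(model, x, y)  # e.g. the fgsm_attack / pgd_attack sketches
        with torch.no_grad():
            correct += (model(x_adv).argmax(1) == y).sum().item()
        total += y.numel()
    return correct / total

def adversarial_attack_impact(clean_acc, ar):
    """AAI = (ACC - AR) / ACC: relative accuracy drop under attack."""
    return (clean_acc - ar) / clean_acc
```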
For a union of different attacks, we adopt the Worst-Case Adversarial Robustness (WCAR) to measure adversarial robustness against all of them (a higher value indicates a more robust model):

WCAR = (1/|D|) · Σ_{(x,y)∈D} 1[ ¬Any_{A∈As}( f(A(x)) ≠ y ) ],

where As represents a set of adversaries and Any(·) is a function that returns true if any adversary A in As attacks successfully. WCAR represents a lower bound on the model's adversarial robustness against various adversarial attacks. Specifically, we further refine WCAR based on the perturbation budget employed in the adversarial attacks: WCAR (small ϵ), WCAR (middle ϵ), and WCAR (large ϵ).
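Under the same assumptions as the sketch above, WCAR can be computed by counting a sample as robust only if no attack in the set fools the model on it:

```python
def worst_case_ar(model, loader, attacks):
    """WCAR over a set of attacks: survive every adversary to count as robust."""
    robust, total = 0, 0
    for x, y in loader:
        survives = torch.ones_like(y, dtype=torch.bool)
        for attack in attacks:
            x_adv = attack(model, x, y)
            with torch.no_grad():
                survives &= (model(x_adv).argmax(1) == y)
        robust += survives.sum().item()
        total += y.numel()
    return robust / total
```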

Natural robustness
Natural robustness measures the ability of a model to maintain its accuracy after being perturbed by natural noise. Given a corruption method c, we calculate the accuracy on the c-corrupted validation set as its natural robustness NR_c. Similar to AAI in adversarial robustness, we define NCI (Natural Corruption Impact) to measure the relative performance drop against natural corruptions:

NCI = (ACC − NR_c) / ACC.

To aggregate the evaluation results over the 15 corruptions, we adopt the average accuracy of the quantized model over all corruptions as the mean natural robustness, denoted as mNR:

mNR = (1/|C|) · Σ_{c∈C} NR_c,

where C denotes the set of corruption methods. A higher mNR means better natural robustness, and the average relative performance drop can be calculated as (ACC − mNR)/ACC. For a corruption sequence x_1, ..., x_n, we utilize the "Flip Probability" of model predictions to measure robustness, denoted as FP:

FP = (1/(n − 1)) · Σ_{j=2}^{n} 1[ f(x_j) ≠ f(x_{j−1}) ].

For sequences generated by multiple corruption methods, we average their Flip Probabilities to obtain mFP (mean Flip Probability). Note that a lower FP value indicates a model that behaves more stably in the presence of dynamic and continuous corruptions.
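A small sketch of the flip-probability computation, reusing the torch import from the sketches above and assuming `frames` is a hypothetical list of batched tensors holding the same samples at consecutive frames of one sequence:

```python
def flip_probability(model, frames):
    """FP: fraction of consecutive frame pairs whose top-1 prediction changes."""
    with torch.no_grad():
        preds = [model(f).argmax(1) for f in frames]
    flips = sum((p != q).sum().item() for p, q in zip(preds[:-1], preds[1:]))
    return flips / ((len(preds) - 1) * preds[0].numel())
```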

Systematic robustness
Systematic robustness measures a model's ability to withstand the software- and component-dependent system noises present in diverse deployment environments. For a given decode or resize method s, we compute the accuracy ACC_s on the correspondingly pre-processed validation set as its systematic robustness. To emphasize the impact of the noise, we introduce SNI (Systematic Noise Impact) as a metric to quantify the relative performance drop:

SNI = (ACC − ACC_s) / ACC.

Furthermore, to evaluate the robustness across all systematic noises, we calculate the standard deviation of the ACC_s values as the systematic robustness (SR):

SR = StdDev_{s∈S}(ACC_s),

where S denotes the set of decode or resize methods.
A lower value of SR means better stability towards different systematic noises.
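The aggregation is a one-liner in NumPy; the accuracy dictionary below is a hypothetical example, not a result from the paper.

```python
import numpy as np

def systematic_robustness(acc_per_noise):
    """SR: std. dev. of accuracy across decode/resize variants (lower = more stable)."""
    accs = np.array(list(acc_per_noise.values()))
    return float(accs.std())

# Hypothetical accuracies under different pre-processing pipelines:
acc_per_noise = {'pillow_bilinear': 69.8, 'pillow_nearest': 66.4, 'opencv_area': 69.1}
sr = systematic_robustness(acc_per_noise)
```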

Dataset
Our RobustMQ aims to obtain broadly applicable results and conclusions for quantized models in the computer vision field. Therefore, we focus on the basic image classification task and employ the large-scale ImageNet [60] dataset. In contrast to widely used small-scale datasets like MNIST [61], CIFAR-10 [62], and CIFAR-100 [62], ImageNet provides a more extensive collection of images and classes, making it more suitable as a benchmark for robustness evaluation. The ImageNet dataset comprises 1.2 million training images and 50,000 validation images, covering a total of 1,000 different classes.

Quantization methods
Within RobustMQ, we concentrate on three widely used quantization methods: DoReFa [44], PACT [45], and LSQ [46]. LSQ and DoReFa apply their quantization to both model weights and activation values. In contrast, PACT applies its specific parameter truncation to quantize activation values, while leveraging the method in DoReFa for weight quantization. For the choice of quantization bits, we adopt the set commonly used in deployments (i.e., 2, 4, 6, and 8).
For each architecture, we quantize models on the ImageNet training set starting from the same floating-point model, then evaluate their robustness against perturbations generated on the ImageNet validation set.

Experiments and Analysis
In this section, we showcase the evaluation results of quantized models under different noises. We then consolidate several conclusive insights gleaned from the evaluations by addressing the following research questions: (1) How robust are quantized models compared with FP models? (2) Which quantization method or bit-width exhibits greater robustness against perturbations? (3) What is the impact of architecture or model size on the robustness of quantized models? (4) To which type of noise are quantized models most susceptible?

Clean accuracy
Tab. 1 reports the clean accuracies of the quantized models. Most of the quantized models maintain accuracy comparable to their 32-bit pre-trained models, while certain quantization methods struggle to do so at low bit-widths (e.g., 2-bit). Among the quantized models evaluated, 12 models fail to converge; for example, ResNet18 PACT 2-bit achieves a mere 2.61% accuracy. We therefore label these models as "NC" and exclude them from our evaluations to ensure fair and reliable assessments of robustness.

Evaluation of adversarial attacks
Under middle and large perturbation budgets, the values of AR and WCAR degrade to almost 0 due to the increasing attack strength. Therefore, in this section, we primarily present the results under small budgets to highlight the differences between models. The worst-case robustness with small budgets under all adversarial attacks is shown in Tab. 2, and the results under the FGSM-ℓ∞ and PGD-ℓ∞ attacks are shown in Tab. 3 and Tab. 4, respectively. Other adversarial robustness evaluation results can be found on our website [50].
From the results, we could make several observations for quantized models as follows.
(1) Better adversarial robustness vs. FP models. Unlike the decrease in clean accuracy, quantized models exhibit higher worst-case adversarial robustness than FP networks in almost all cases. For example, the ACC of ResNet18 DoReFa 2-bit is 8.75% lower than ResNet18 FP (see Tab. 1), while its WCAR is 4.25% higher than ResNet18 FP (see Tab. 2). Moreover, Fig. 1 illustrates that under the same quantization method, the adversarial robustness of quantized ResNet18 models increases as the quantization bit-width decreases. These phenomena can also be observed in other network architectures, suggesting that quantization can provide a certain degree of defense against adversarial attacks.
(2) At the same quantization bit-width, PACT outperforms the other quantization methods in worst-case adversarial robustness. For instance, RegNetX600M PACT 4-bit achieves a WCAR of 6.21%, while the DoReFa and LSQ quantized models achieve WCAR values of 2.74% and 2.05%, respectively (see Tab. 2). However, when facing specific adversarial attacks, the robustness ranking of the quantization methods is not consistent across model architectures (see Tab. 3 and Tab. 4). For the ResNet18, ResNet50, and MobileNetV2 architectures, LSQ demonstrates better robustness against the FGSM-ℓ∞ and PGD-ℓ∞ attacks, whereas for the RegNetX600M architecture, PACT exhibits better robustness.
(3) As for model size and network architecture, quantized models show trends similar to FP models. Regarding model size, we observe that quantized models with larger network capacity (e.g., FLOPs and Params) exhibit better adversarial robustness, consistent with the findings in [34]. For instance, within the ResNet family, all quantized ResNet50 models demonstrate better robustness than ResNet18 against all adversarial attacks. Regarding network architecture, the adversarial robustness of quantized models follows the order MobileNetV2 > RegNetX600M > ResNet, which coincides with the findings in [34]. This highlights the significance of architecture design in achieving better adversarial robustness.
(4) As for the adversarial attack methods, their attack capabilities on quantized models are generally consistent with those on FP models: AutoAttack > PGD > FGSM. However, quantized models demonstrate varying adversarial robustness against different attack methods. For instance, compared to DoReFa and PACT, LSQ performs better under the FGSM and PGD attacks but is more vulnerable to AutoAttack.

Evaluation of natural corruptions
We first report the aggregated evaluation results (i.e., mean natural robustness) over the fifteen corruption methods for all quantized models in Tab. 5. Next, we explore the natural robustness results for the different quantized models under each corruption method; for instance, we provide the results for the ResNet18, ResNet50, and MobileNetV2 architectures in Tab. 6, Fig. 2, and Fig. 3, respectively. Due to space limits, more detailed results can be found on our website [50]. The natural robustness evaluation results for quantized models reveal several key observations as follows.
(1) Worse natural robustness vs. FP models. Despite achieving similar clean accuracy compared to the 32-bit model, quantized models are more susceptible to natural corruptions. Notably, the 2-bit quantized model is severely affected, with a performance decrease that far exceeds the drop observed in clean accuracy. As an example, ResNet18 DoReFa 2-bit exhibits an NCI of 62.61%, which is higher than the 53.87% NCI of the corresponding FP model (see Tab. 5), indicating that quantization can negatively impact the model's robustness against such perturbations. Furthermore, Fig. 2 demonstrates that under the same quantization method, the natural robustness of quantized ResNet50 models generally increases as the quantization bit-width increases, although the rate of increase gradually diminishes. Similar phenomena can also be observed in other network architectures, indicating that in real scenarios, using quantized models requires more care than using FP models; for instance, leveraging corruption data to enhance natural robustness becomes crucial in maintaining the performance of quantized models.

(2) At the same quantization bit-width, quantization methods exhibit varied performance on different architectures. As shown in Tab. 5, on the ResNet18 and ResNet50 architectures, LSQ achieves a smaller relative performance drop (i.e., a lower NCI value) than the DoReFa and PACT methods. However, on the RegNetX600M and MobileNetV2 architectures, DoReFa and PACT may outperform the other quantization methods across most bit-widths. When examining a specific architecture, Fig. 2 reveals a consistent relative trend among the different quantization methods across the various corruption methods (other architectures also show this trend). Specifically, the order of impact on model performance is: Noise > Blur > Weather (except Brightness) > Digital. This consistent trend emphasizes the varying degrees of sensitivity of quantized models to different types of corruption, and it remains independent of the specific quantization method used.
(3) Under the same network architecture, larger network capacity leads to better natural robustness. Similar to the adversarial evaluation, quantized ResNet50 models present higher mNR and lower NCI than ResNet18 under natural corruptions. Across different network architectures, the natural robustness of quantized models follows the order ResNet > RegNetX600M > MobileNetV2, which is exactly opposite to the order observed under adversarial attacks.
(4) Regarding the natural corruption methods, our conclusions are as follows. For ResNet18, ResNet50, and RegNetX600M, impulse noise has the most severe impact on the model's robustness, while for MobileNetV2, glass blur is the most harmful corruption. Among all the corruption methods, brightness has the least impact on the model's robustness. As shown in Tab. 6, ResNet18 quantized models exhibit an average decrease of about 76.32% under impulse noise, whereas the average decrease is only about 16.27% under the brightness weather corruption. Similarly, MobileNetV2 quantized models show an average decrease of about 72.96% under glass blur, while the average decrease is only about 17.39% under brightness.
For natural robustness under corruption sequences, we report the evaluation results in Tab. 7, Tab. 8, Fig. 4, and Fig. 5. (1) Worse dynamic natural robustness vs. FP models. Consistent with the static natural corruptions, quantized models also exhibit inferior performance compared to FP models under dynamic corruption sequences. Furthermore, the 2-bit quantized models demonstrate extreme instability. For example, ResNet18 DoReFa 2-bit exhibits an mFP of 0.283, which is nearly three times that of the ResNet18 FP model (0.098 mFP). As depicted in Fig. 4, the natural robustness increases with increasing bit-width (from right to left in each subfigure).
(2) At the same quantization bit-width, quantization methods under dynamic natural corruptions exhibit trends similar to those observed under static natural corruptions. For example, LSQ achieves better robustness on the ResNet architecture but may not perform as dominantly on lightweight architectures such as RegNetX600M and MobileNetV2. (3) As for specific corruption sequences, shot noise has the most severe impact on all models, while brightness remains the least harmful. Additionally, we observe that for the RegNetX600M models, the FP model is not the most robust under shot noise (see Fig. 5), which is inconsistent with the behavior observed in the ResNet models (see Fig. 4).

Evaluation of Systematic Noises
We report the aggregated systematic robustness results over the fourteen systematic noises for all quantized models in Tab. 9. We then explore the performance of the different quantized models under each specific systematic noise; for instance, we provide the systematic robustness results for the ResNet18, ResNet50, and RegNetX600M architectures in Tab. 10, Fig. 6, and Fig. 7, respectively. Due to space limits, more detailed results can be found on our website [50]. The evaluation results for quantized models provide valuable insights into their systematic robustness. Several key observations are as follows: (1) Worse systematic robustness vs. FP models. Although the system noise prevalent in deployment environments does not cause a significant drop in the prediction accuracy of quantized models, their performance across various deployment scenarios exhibits considerable instability, whereas FP models demonstrate little fluctuation. For instance, on the ResNet18 architecture, all quantized models exhibit at least 3.43 times higher instability (SR) than the FP ResNet18, confirming the sensitivity of quantized models to system noise. In particular, 2-bit quantized models are significantly impacted by systematic noise (e.g., ResNet18 DoReFa 2-bit shows a drop of 4.49% in ACC_s and a 4.38-fold increase in instability). For the same quantization method, lower-bit models generally present less robustness (i.e., lower stability). This indicates that maintaining consistency between the deployment and training pre-processing pipelines is crucial to avoid unnecessary accuracy loss.
(2) At the same quantization bit-width, quantization methods show varied performance across different architectures. Generally, LSQ dominates on the ResNet and RegNetX600M architectures, while PACT performs better on MobileNetV2. When focusing on a specific architecture (such as ResNet50 in Fig. 6 and RegNetX600M in Fig. 7), we can observe a similar relative impact across the different decoders and resize modes under all three quantization methods.
(3) Under the same network architecture, larger network capacity leads to better systematic robustness. Quantized ResNet50 models present lower SR than ResNet18 under systematic noises. Across different network architectures, the systematic robustness of quantized models follows the order RegNetX600M > ResNet > MobileNetV2. It is worth noting that the RegNetX600M FP model exhibits the worst systematic robustness, while surprisingly, the quantized RegNetX600M models demonstrate the best systematic robustness among all architectures.
(4) When considering specific systematic noises, we draw the following conclusions. Among the 14 systematic noises, the nearest neighbor interpolation methods in the Pillow and OpenCV libraries have the most harmful impact on model performance, inducing nearly a 6% decrease for the 2-bit ResNet18 models (see Tab. 10). By contrast, the least impactful noises are bilinear interpolation and cubic interpolation in the Pillow library, as well as area interpolation in the OpenCV library. For example, under the bilinear interpolation of the Pillow library, 2-bit ResNet18 models present only a 1.67% decrease in performance on average. Regarding the different decoders, the performance of quantized models shows only minor fluctuations among the three.

Conclusion
This paper presents RobustMQ, a benchmark aiming to evaluate the robustness of quantized models under various perturbations, including adversarial attacks, natural corruptions, and systematic noises. The benchmark evaluates four classical architectures and three popular quantization methods at four different bit-widths. The comprehensive results empirically provide valuable insights into the performance of quantized models in various scenarios. Some of the key observations are as follows. (1) Quantized models exhibit higher adversarial robustness than their floating-point counterparts, but are more vulnerable to natural corruptions and systematic noises. (2) Under the same quantization method, as the quantization bit-width increases, the adversarial robustness decreases while the natural robustness and the systematic robustness increase. (3) Among the 15 corruption methods, impulse noise consistently exhibits the most harmful impact on the ResNet18, ResNet50, and RegNetX600M models, while glass blur is the most harmful corruption for the MobileNetV2 models; brightness is observed to be the least harmful corruption for all models. (4) Among the 14 systematic noises, nearest neighbor interpolation has the highest impact, while bilinear interpolation and cubic interpolation in Pillow and area interpolation in OpenCV are the three least harmful across different architectures. We hope that our benchmark will contribute significantly to the assessment of the robustness of quantized models, and that the insights gained from our study will further support the development and deployment of robust quantized models in real-world scenarios.
Data Availability: The datasets generated and analyzed during the current study are available on our website [50].
Competing Interests: The authors declare no competing interests.

Authors' Contributions:
The first draft of the manuscript was written by YX and AL. YX and TZ collected data and conducted the experiments. YX, AL, HQ, and JG contributed to the analysis and manuscript preparation, and XL finalized the manuscript. All authors commented on previous versions of the manuscript and have read and approved the final manuscript. AL and XL conceived the original concept and supervised the study.

Fig. 2
Fig. 2 Natural robustness of ResNet50 models under specific corruptions. From left to right: quantized by DoReFa, PACT, and LSQ respectively. The "NC" models are labeled with 'x' in the figure.

Fig. 3
Fig. 3 Natural robustness of MobileNetV2 models under specific corruptions. From left to right: quantized by DoReFa, PACT, and LSQ respectively. The "NC" models are labeled with 'x' in the figure.

Fig. 4
Fig. 4 Natural robustness of ResNet50 models under corruption sequences. From left to right: quantized by DoReFa, PACT, and LSQ respectively. The "NC" models are labeled with 'x' in the figure.

Fig. 5
Fig. 5 Natural robustness of RegNetX600M models under corruption sequences. From left to right: quantized by DoReFa, PACT, and LSQ respectively. The "NC" models are labeled with 'x' in the figure.

Fig. 6
Fig. 6 Systematic robustness of ResNet50 models under specific noises. From left to right: quantized by DoReFa, PACT, and LSQ respectively. The "NC" models are labeled with 'x' in the figure.

Fig. 7
Fig. 7 Systematic robustness of RegNetX600M models under specific noises. From left to right: quantized by DoReFa, PACT, and LSQ respectively. The "NC" models are labeled with 'x' in the figure.

Table 1
Clean accuracy of quantized models and FP models. "NC" denotes not converged.

Table 2
Worst-Case Adversarial Robustness of quantized models under all adversarial attacks with small budgets. Results are shown in WCAR (small ϵ)↑.
Table 3
Adversarial Robustness of models under the FGSM-ℓ∞ attack with small budget (ϵ = 0.5/255). Results are shown in AR↑(AAI↓).

Fig. 1
Fig. 1 Adversarial robustness of ResNet18 models under specific attacks. From left to right: quantized by DoReFa, PACT, and LSQ respectively. The "NC" models are assigned an AR value of 0 in the figure. Similar trends, with AR decreasing as the bit-width increases, are observed in the results of other architectures as well.

Table 5
Natural Robustness of quantized models under all corruption methods. Results are shown in mNR↑(NCI↓). The best performers in each architecture are highlighted in bold.

Table 6
Natural robustness results for ResNet18 models are shown in NR↑ for each corruption (e.g., Gauss). The most influential noise is marked in bold, and the least influential noise is underlined.

Table 7
Natural Robustness of quantized models under all corruption sequences. Results are shown in mFP↓. The best performers in each architecture are highlighted in bold.

Table 8
Natural Robustness of ResNet18 under corruption sequences. Results for each corruption (e.g., Gauss) are shown in FP↓. The most influential noise is marked in bold, and the least influential noise is underlined.

Table 9
Mean Systematic Robustness of models under all noises. Results are shown in ACC_s↑(SNI↓) ± SR↓.

Table 10
Systematic Robustness of ResNet18 models under each systematic noise. Results for each noise (e.g., Bilinear) are shown in ACC_s↑. The most influential noise is marked in bold, and the least influential noise is underlined.