Tiny adversarial multi-objective one-shot neural architecture search

The widely employed tiny neural networks (TNNs) in mobile devices are vulnerable to adversarial attacks. However, more advanced research on the robustness of TNNs is highly in demand. This work focuses on improving the robustness of TNNs without sacrificing the model’s accuracy. To find the optimal trade-off networks in terms of the adversarial accuracy, clean accuracy, and model size, we present TAM-NAS, a tiny adversarial multi-objective one-shot network architecture search method. First, we build a novel search space comprised of new tiny blocks and channels to establish a balance between the model size and adversarial performance. Then, we demonstrate how the supernet facilitates the acquisition of the optimal subnet under white-box adversarial attacks, provided that the supernet significantly impacts the subnet’s performance. Concretely, we investigate a new adversarial training paradigm by evaluating the adversarial transferability, the width of the supernet, and the distinction between training subnets from scratch and fine-tuning. Finally, we undertake statistical analysis for the layer-wise combination of specific blocks and channels on the first non-dominated front, which can be utilized as a design guideline for the design of TNNs.


I. INTRODUCTION
It is well known that deep neural networks are vulnerable to attacks that add small perturbations to the input data, which are almost imperceptible to the human visual system [1], [2]. The maliciously perturbed examples are commonly obtained by two operations. One is to add pixel-wise norm-bounded perturbations to the input data [3]. The other is to generate examples using unrestricted perturbations such as rotations and spatial translations [7]. In this work, we focus on defensive mechanisms against the former, although they can be extended to the latter.
To defend against such attacks, current research [3], [8], [?] commonly adopts ResNet [9] as the backbone network to explore the relationship between the capacity of the model and its adversarial robustness. It has been shown empirically that increasing the number of parameters of a neural network can improve robustness. However, little research pays attention to improving the robustness of tiny neural networks, which are widely used in mobile applications. Their sizes typically range from 10K to 2M parameters. Therefore, our work focuses on the balance between adversarial defensive ability and tiny model size.
The trade-off between the adversarial accuracy and the clean accuracy has been examined in [10]. The work in [10] uses one specific binary classifier to illustrate the trade-off, which can be generalized to multi-class problems under different assumptions [12]. Previous work has illustrated the importance of the network architecture for adversarial robustness [3], [8]. Most of these methods add a specific layer to an existing architecture to increase the adversarial accuracy, but they do not take into account the potential relationship between different layers in the whole network architecture. Our aim is to find the best trade-off neural networks with respect to the clean accuracy, the adversarial accuracy, and tiny model size. Hence, we need to take a global view to redesign tiny neural networks. Our assumption is that designing a tiny robust neural network without loss of clean accuracy can be transformed into a combinatorial optimization problem over which blocks and channel numbers to use in each layer. Hence, we propose a tiny adversarial multi-objective one-shot network architecture search (TAM-NAS) to search for the best trade-off solutions.
There are different defensive training strategies against adversarial attacks. The most common approach [2] is to train neural networks on adversarial examples. Another popular recent approach is to formulate adversarial training as a min-max robust optimization problem [3], [4]. The inner maximization problem finds the worst-case performance in the presence of perturbations of the input data, and the outer minimization problem finds the optimal model parameters given the worst-case perturbation. The state-of-the-art defensive result is reported in [12], which proposes a new classification-calibrated surrogate loss function to simultaneously minimize the clean error and the boundary error. The boundary error refers to the difference between the network outputs for the original input data and for the adversarial examples. However, the major issue of adversarial training is its high computational cost. To speed up each epoch of adversarial training, we use TRADES-YOPO-m-n from [17] as our adversarial training method. TRADES-YOPO-m-n not only incorporates the surrogate loss function from [12] but also reduces the computational cost by restricting most of the forward and backward propagations to the first layer of the network during adversarial training. As a result, we are able to reduce the training time to 2.5 minutes per epoch on a single V100 GPU.
Human experts have devoted considerable effort to devising several classical architectures, such as ResNet [9] and Inception [18]. However, it is not feasible to manually explore an infinite number of architectures. As an emerging automated machine learning technique, neural network architecture search (NAS) [20], [21] is prevalent because it requires much less expertise and effort to discover good architectures for a new task and dataset. Most NAS approaches employ reinforcement learning [24], evolutionary algorithms [27], or gradient-based methods [20] to design neural networks automatically. However, most of them treat NAS as a single-objective optimization problem, which is not well suited for solving the trade-off problem. Inspired by the work in [29], [?], [?], we employ a multi-objective NAS approach to find the architectures that give the best trade-off between the adversarial accuracy, the clean accuracy, and the model size. In our work, we mainly address the trade-off problem based on the ShuffleNetV2 architecture [32], the Xception block [33], the SE layer [34], the Non-Local block [35], and their variants.
Our contributions can be summarized as follows:
• To find the best trade-off neural networks between the adversarial accuracy, the clean accuracy, and the model size, we propose three novel tiny robust blocks. Due to the inertial self-attention mechanism, the layer-wise combination of these three blocks can increase robustness without significantly worsening clean performance. Finally, we rebuild a tiny neural network according to the provided guideline and find that it not only reduces the model size but also increases the adversarial accuracy and the clean accuracy.

II. RELATED WORK
Adversarial Training is the most common defensive mechanism against adversarial attacks [2]; it uses both clean and adversarial images for training. Originating from game theory [36], the work in [37] reformulates the min-max optimization problem of adversarial learning as a Nash equilibrium [38]. The game-theory-based optimization method [17] can effectively reduce the high computational cost without sacrificing adversarial accuracy. Most existing approaches focus only on improving robustness and ignore the deterioration of clean accuracy. In particular, the work in [10] has shown empirically that there is a trade-off between the adversarial accuracy and the clean accuracy. Kannan et al. [39] and Zhang et al. [12] construct surrogate loss functions that reduce the difference between the outputs for clean images and for their adversarial counterparts. Madry et al. [3] conclude that a larger network capacity can improve performance under adversarial attacks. However, most neural network models running on electronic devices are very small, because the devices constrain both the energy consumption and the storage available for the models. Hence, we aim to find out which kinds of tiny neural network architectures are resilient to adversarial perturbations.
Neural Network Architecture Search aims to replace handcrafted architecture design with automated machine learning techniques. Representative search algorithms include evolutionary algorithms [27], [26], reinforcement learning [24], [21], and gradient-based methods [20], [40], [41]. In one-shot NAS [42], the authors construct a supernet that can generate every possible architecture in the search space. The supernet is trained once, and at search time the fitness values of different subnets are obtained by sharing weights from the supernet. However, most of these methods employ a single-objective optimization approach, which is not well suited for solving the trade-off problem [10]. In order to solve the multi-objective optimization problem, we adopt the elitist non-dominated sorting genetic algorithm (NSGA-II) [43] as our search algorithm. Inspired by [44], we investigate the influence of neural network depth on network resilience under adversarial attack. Furthermore, we also investigate the relationship between the supernet and its subnets in our one-shot multi-objective NAS framework and give a hint on how to adversarially train the supernet to obtain better adversarial performance of the subnets.
Fig. 2. The overall framework of TAM-NAS. The first step is to design a supernet search space and uniformly sample new subnet candidates. Our sampling strategy is divided into two phases: the block sampling phase and the block and channel joint sampling phase. The second step is to adversarially train the subnets sampled from the supernet. The third step is a multi-objective search for the best trade-off subnets. The fitness values of the subnets are obtained by cloning the weights from the supernet. The final step is to train from scratch or fine-tune the non-dominated subnets.

III. TINY ADVERSARIAL MULTI-OBJECTIVE ONE-SHOT NAS
Our NAS approach consists of the following four steps. (1) Design a supernet search space and uniformly sample different candidates from the supernet, so that a single supernet can represent a large number of subnet architectures. (2) Train the candidates sampled from the supernet on adversarial examples to make them more robust in the presence of adversarial attacks. (3) Search for new subnets with the elitist non-dominated sorting genetic algorithm (NSGA-II) [43], evaluating the clean accuracy, the adversarial accuracy, and the number of parameters of each subnet by cloning its weights from the pretrained supernet. (4) Fine-tune each subnet on the first non-dominated front and evaluate its performance on the test dataset. Fig. 2 shows our overall framework.

A. Problem Definition
Without loss of generality, our supernet search space $A$ can be represented by a directed acyclic graph (DAG), denoted as $N(A, W)$, where $W$ is the weight of the supernet. A subnet architecture is a subgraph $a \in A$, denoted as $N(a, w)$, where $w$ is the weight of the subnet. $\Gamma(A)$ is a prior distribution of $a \in A$. $L_{\text{adv-train}}(\cdot)$ is the adversarial training loss function on the adversarial training examples. The most important requirement for TAM-NAS is that the performance of subnets using weights inherited from the supernet (without extra fine-tuning or training from scratch) should be highly predictive. In other words, the supernet weights $W_A$ should be optimized in such a way that all subnet architectures in the search space $A$ are optimized simultaneously, as expressed in Eq. (1). After the supernet has been trained, the next step is to find a set of Pareto-optimal subnets $a^* \in A$ in terms of our objectives, as expressed in Eq. (2), where $f_1$, $f_2$, and $f_3$ are the three objectives: the adversarial error, the clean error, and the model size, respectively. It is not possible to attain the minimum value of all three objectives simultaneously, since there is a strong trade-off between each pair of objectives. For instance, if the model size is larger, the adversarial error and the clean error will become smaller. Our aim is to obtain a tiny model with comparable performance on the adversarial dataset and the clean dataset. Fig. 1 shows the pipeline of multi-objective one-shot NAS.
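Eqs. (1) and (2) can be sketched as follows, assuming the standard single-path one-shot formulation and the notation defined above. The symbol $W(a)$, denoting the part of the supernet weights inherited by subnet $a$, and the vector form of the objective in Eq. (2) are assumptions made for illustration.

```latex
W_A = \operatorname*{arg\,min}_{W} \;
      \mathbb{E}_{a \sim \Gamma(A)}
      \left[ L_{\text{adv-train}}\big( N(a, W(a)) \big) \right]
      \tag{1}

a^* \in \operatorname*{arg\,min}_{a \in A} \;
      \big( f_1(a),\; f_2(a),\; f_3(a) \big)
      \tag{2}
```

In Eq. (2) the minimization is understood in the Pareto sense: $a^*$ is any subnet that is not dominated by another subnet in all three objectives.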

B. Search Space Design
Since we aim to search for tiny robust neural networks, our supernet adopts ShuffleNetV2 [32], one of the state-of-the-art hand-crafted tiny network architectures, as the backbone model. Since our experiments are mainly conducted on the CIFAR-10 [45] and SVHN [46] datasets, the depth and width of the supernet differ considerably from those of the original ShuffleNetV2, which was developed on the ImageNet dataset [47]. Table I shows the parameter settings of the overall supernet architecture. BN denotes the batch normalization layer. 3 × 3 Conv denotes a convolutional layer with a kernel size of 3. CB refers to the choice block chosen from our predefined block search space. SE refers to the SE layer [34]. Moreover, we design a search space for the channel number of each choice block. In total, we provide 22 block choices and 10 channel number choices. Below, we describe each search space in detail.
We use three kinds of blocks in the block search space, as follows.
1) Pure Tiny Blocks: Pure tiny blocks mainly come from ShuffleNetV2 [32]. We add a self-attention layer, the SE layer [34], to the main branch to balance accuracy and inference speed. Fig. 3 (a) and Fig. 3 (b) show the pure tiny blocks obtained when the non-local layers are removed from the main branch.
2) Pure Robust Blocks: First, we design a non-local block for image denoising, inspired by [35]. We also add another self-attention layer, the SE layer [34], as the last layer of the non-local block, since it has been found [35] that the self-attention mechanism can make a neural network more robust. Fig. 3 shows two non-local blocks, which are referred to as the Embedded Gaussian version and the Gaussian version [35]. Fig. 3 (b) shows the internal architecture of the pure robust blocks.
3) Tiny Robust Blocks: To make the pure tiny blocks more robust, we add the non-local block and the SE layer to the main branch of the original shufflev2 and shufflev2-xception blocks. Figs. 3 (a) and (b) show the internal architecture of the tiny robust blocks. The kernel size of the depthwise convolutional layer is chosen from 3, 5, and 7. Furthermore, since the non-local layer adds parameters to the shufflev2 or shufflev2-xception block, we make the non-local layer of these blocks optional, which means it forms another search space. When the non-local layer and the SE layer are removed from a tiny robust block, it works as a pure tiny block. In Fig. 3, the dashed line indicates the internal search space of each block. The choice blocks are encoded by assigning each block an index from 0 to 21.
4) Channel Search Spaces: The channel number plays an important role in a neural network's efficiency and computational cost. Apart from the adversarial accuracy and the clean accuracy, we select the total number of parameters as the third objective. We only search the intermediate channel number of each block, including pure robust blocks, pure tiny blocks, and tiny robust blocks. Heuristically, reducing the intermediate channel number does not cause the adversarial accuracy and the clean accuracy to deteriorate, which makes it well suited for balancing the performance of a neural network against its model size. Specifically, Fig. 4 shows how we add a channel selector to the intermediate part of our candidate blocks. The channel selector ratio ranges from 0 to 2 with an interval of 0.2. The choice channels are encoded by assigning each channel selector ratio an index from 0 to 10.

C. Uniform Sampling
Our supernet sampling strategy is to sample choice blocks first and then, once the warm-up training of the supernet is completed, to jointly sample choice blocks and choice channels, because we find that the supernet is difficult to converge when block choices and channel choices are sampled jointly from the beginning. We build a parameter table in advance to speed up the sampling procedure. Block sampling and channel sampling are described in more detail below.
1) Block Sampling: Our pilot studies suggest that the tiny supernet size ranges from 1.5M to 4M. In the block sampling phase, we only search block choices for the supernet architecture and set the constraint that the number of supernet parameters ranges from 1.823M to 2.375M. In all experiments, we train the supernet for 500 epochs in the block sampling phase and sample a new architecture every 20 epochs.
2) Block and Channel Jointly Sampling: After the block sampling phase, the supernet jointly searches block and channel choices. We refer to this phase as block and channel joint sampling. We set another constraint that the number of supernet parameters should range from 1.61M to 2.37M in this phase. In all experiments, we train the supernet for 500 epochs in the block and channel joint sampling phase and sample a new architecture every 20 epochs.
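The size-constrained uniform sampling of the block sampling phase can be sketched as rejection sampling against a precomputed parameter table. This is a minimal illustration, not the authors' implementation: `param_table`, `make_toy_table`, the number of layers, and the per-block parameter counts are all hypothetical; only the 22 block choices and the 1.823M to 2.375M constraint come from the text.

```python
import random

NUM_BLOCK_CHOICES = 22             # block identifiers 0..21, as in the text
PARAM_RANGE = (1.823e6, 2.375e6)   # block-sampling size constraint from the text


def sample_architecture(param_table, rng, param_range=PARAM_RANGE, max_tries=10_000):
    """Uniformly sample one block per layer and reject out-of-range sizes.

    param_table[layer][block] is a hypothetical lookup of parameter counts.
    """
    num_layers = len(param_table)
    for _ in range(max_tries):
        arch = [rng.randrange(NUM_BLOCK_CHOICES) for _ in range(num_layers)]
        total = sum(param_table[layer][block] for layer, block in enumerate(arch))
        if param_range[0] <= total <= param_range[1]:
            return arch, total
    raise RuntimeError("no architecture satisfied the size constraint")


def make_toy_table(num_layers=12, seed=0):
    """Build a made-up parameter table (values are illustrative only)."""
    rng = random.Random(seed)
    return [[rng.uniform(1.0e5, 2.5e5) for _ in range(NUM_BLOCK_CHOICES)]
            for _ in range(num_layers)]


table = make_toy_table()
arch, total = sample_architecture(table, random.Random(1))
print(arch, round(total))
```

Looking up sizes in a table avoids instantiating a network for every rejected candidate, which is the speed-up the parameter table is meant to provide. The joint phase would add a per-layer channel ratio to the encoding and use the 1.61M to 2.37M range.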

D. Adversarial Training
We aim to explore the influence of network architecture on robustness against adversarial attacks, so we focus on white-box adversarial attacks bounded in the $l_\infty$ norm. PGD [3] adversarial training is known to be computationally expensive and hard to converge. We follow [12], [17] and adopt the TRADES-YOPO-m-n algorithm [17] to speed up our adversarial training. This work adopts the loss function in [12], given in Eq. (3), where $L(\cdot, \cdot)$ denotes the cross-entropy loss function; $f_\theta(x)$ denotes the output vector of the neural network, which is parameterized by $\theta$; $y$ is the label-indicator vector; $\eta$ denotes the image noise (perturbation); $\lambda$ is a balancing hyperparameter; and $f_\theta(x + \eta)$ denotes the output vector of the neural network when the perturbation $\eta$ is added to the input. The aim of this loss function is to reduce the gap between adversarial and non-adversarial examples when the model performs the classification task, i.e., to make the classification boundary smoother. TRADES-YOPO-m-n borrows the idea of Pontryagin's Maximum Principle [50] to approximate back-propagation. One assumption of TRADES-YOPO-m-n is that the adversarial perturbation is only coupled with the weights of the first layer. TRADES-YOPO-m-n performs n gradient-descent steps to update the weights of the first layer and runs this iteratively m times for each data point. Zhang et al. [17] state that m × n should be slightly larger than the number of attack iterations for TRADES-YOPO-m-n to achieve a competitive result. In our experimental setting, the number of outer loops m is set to 5 and the number of inner loops n is set to 3 when we attack the models with 10 PGD iterations [3]. The training time for each epoch is 2.5 minutes on a single V100 GPU. The supernet adversarial training algorithm is presented in Algorithm 1.
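Eq. (3) can be sketched as follows, assuming the TRADES-style objective of [12] and the symbols defined above. The placement of $\lambda$ as a multiplier on the second term and the bound $\epsilon$ on $\|\eta\|_\infty$ are assumptions; with $\lambda = 1$, as used in the experiments, multiplying or dividing by $\lambda$ gives the same objective.

```latex
\min_{\theta} \; \mathbb{E}_{(x, y)}
\Big[ \max_{\|\eta\|_\infty \le \epsilon} \;
      L\big(f_\theta(x),\, y\big)
      + \lambda \, L\big(f_\theta(x + \eta),\, f_\theta(x)\big) \Big]
\tag{3}
```

The first term is the usual clean classification loss, and the second term penalizes the change in the network output caused by the perturbation, which is what smooths the classification boundary.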

E. Multi-Objective Search
Algorithm 2 (excerpt, Lines 11-16):
11 … ← Sc(Q0) // compute the clean accuracy of the (g+1)-th offspring
12 $f^{adv}_{q_0}$ ← Sa(Q0) // compute the adversarial accuracy of the (g+1)-th offspring
13 $f^{params}_{q_0}$ ← Scp(Q0) // calculate the model size of the (g+1)-th offspring
14 while g < G do
15 Rg ← Pg ∪ Qg // merge the parent and offspring populations
16 Fg ← Fast Nondominated Sort(Rg)
We use NSGA-II [43] as the multi-objective search algorithm. The first two objectives are the adversarial error and the clean error, and the third objective is the number of parameters of the searched subnet. The weights of each searched subnet are cloned from the corresponding part of the supernet, so there is no need to train any searched subnet during the search process. The first and second objective values are obtained by running inference with each searched subnet, and the number of parameters is obtained quickly by looking it up in our parameter table. In the initialization stage, from Line 1 to Line 13 in Algorithm 2, we randomly initialize the parent population $P_0$, and each individual (subnet) of the population is evaluated with three fitness values: the adversarial error, the clean error, and the number of parameters.
Then $P_0$ is sorted by non-dominated sorting. From Line 9 to Line 10 in Algorithm 2, we employ the tournament scheme and the mutation operator to generate an offspring population $Q_0$. During each optimization iteration (Lines 14-32), a combined population $R_g = P_g \cup Q_g$ is formed. $R_g$ is then sorted into fronts of individuals in ascending order by non-dominated sorting, as shown in Line 16. The new population $P_{g+1}$ is obtained from $R_g$ by selecting the elite solutions front by front in ascending order of front number, as shown in Lines 19-23. The selection continues until it reaches a front $F_g^i$ that cannot be accommodated in full; only part of its solutions are selected, in descending order of their crowding distance. We present the details of the crossover and mutation operators below.
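The front-by-front sorting described above can be illustrated with a short sketch. It uses a simple quadratic "peel off the non-dominated set" loop rather than the bookkeeping of the fast non-dominated sort in [43], and the five objective vectors (adversarial error, clean error, parameters in millions) are made-up values, not results from the experiments.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one.

    All objectives are minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_fronts(points):
    """Return fronts as lists of indices into points, best front first."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        # An individual is in the current front if nothing left dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Illustrative (adversarial error, clean error, parameters in M) triples.
points = [
    (0.55, 0.20, 1.9),
    (0.60, 0.25, 2.0),
    (0.50, 0.30, 2.2),
    (0.65, 0.18, 1.7),
    (0.70, 0.30, 2.3),
]
fronts = non_dominated_fronts(points)
print(fronts)  # [[0, 2, 3], [1], [4]]
```

Individuals 0, 2, and 3 each win on at least one objective against every other candidate, so they form the first front; NSGA-II would fill the next population from these fronts in order and break ties in the last admitted front by crowding distance.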
1) Crossover: Before crossover, our multi-objective search uses tournament selection to choose two parents. In the experimental settings, the number of individuals that participate in the tournament is 10. Our crossover operator inherits and recombines the blocks or channels of the two parents to generate a new subnet. To solve the channel dimension mismatch problem, we preallocate the weight matrix of each convolutional kernel with shape (max output channel, max input channel, kernel size). The maximum channel dimension is twice the original dimension. After crossover, the dimension of the weight matrix for the current batch is unchanged, but we only keep the values of the weight matrix for the current input and output channels, [:c_out, :c_in, :], and force the values of the other channels to zero. In this way, we not only solve the channel dimension inconsistency problem but also implement channel crossover and mutation conveniently. Moreover, we have to exclude the non-local block from the search space of the stride-2 layers, since the size of the output feature map of the non-local block cannot match the input size of the stride-1 layer.
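The preallocate-and-mask idea can be sketched with NumPy. This is an illustration under assumptions: the base width of 16 channels, the 3 × 3 kernel with two spatial axes, and the helper name `masked_kernel` are hypothetical; only the doubling of the maximum channel dimension and the `[:c_out, :c_in]` slicing follow the text.

```python
import numpy as np


def masked_kernel(full_weight, c_out, c_in):
    """Keep full_weight[:c_out, :c_in] and zero every other channel.

    The returned array has the same shape as full_weight, so subnets with
    different channel numbers can share one preallocated tensor.
    """
    masked = np.zeros_like(full_weight)
    masked[:c_out, :c_in, ...] = full_weight[:c_out, :c_in, ...]
    return masked


base_channels = 16                      # hypothetical original width
max_channels = 2 * base_channels        # maximum is twice the original dimension
rng = np.random.default_rng(0)
full = rng.standard_normal((max_channels, max_channels, 3, 3))

ratio = 1.2                             # one channel selector ratio
active = int(round(ratio * base_channels))
sub = masked_kernel(full, c_out=active, c_in=active)
print(sub.shape, active)
```

Because the tensor shape never changes, two parents with different channel ratios can be recombined by exchanging ratios alone; the mask is then recomputed for the child.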
2) Mutation: The mutation operator re-assigns each block or channel selector ratio of a subnet from the search space. The mutation operator is triggered if a uniformly drawn random number is smaller than the pre-defined mutation probability. Our block encoding ranges from 0 to 21, and the blocks that contain a non-local layer are numbered from 12 to 21. Blocks can be mutated arbitrarily in every layer except the stride-2 layers, because the size of the output feature map of the non-local block cannot match the input size of the stride-1 layer. To enhance the diversity of the population while avoiding completely different network architectures, we set the mutation probability to 0.1, which means that the subnet has a 10% chance of changing the block or the channel number.
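A minimal sketch of the block mutation follows. It assumes that the 10% probability is applied independently to each layer and that the positions of the stride-2 layers are known; both are assumptions, as are the names `mutate_blocks` and `stride2_layers`. The identifier ranges (0 to 21, with 12 to 21 containing a non-local layer) are taken from the text.

```python
import random

NUM_BLOCKS = 22                          # block identifiers 0..21
NON_LOCAL_BLOCKS = set(range(12, 22))    # blocks that contain a non-local layer
MUTATION_PROB = 0.1


def mutate_blocks(blocks, stride2_layers, rng, prob=MUTATION_PROB):
    """Return a mutated copy of a per-layer block encoding.

    Each layer is re-assigned with probability prob; stride-2 layers may
    not receive a block that contains a non-local layer.
    """
    child = list(blocks)
    for layer in range(len(child)):
        if rng.random() < prob:
            allowed = [b for b in range(NUM_BLOCKS)
                       if not (layer in stride2_layers and b in NON_LOCAL_BLOCKS)]
            child[layer] = rng.choice(allowed)
    return child


parent = [0] * 12                        # hypothetical 12-layer encoding
stride2 = {0, 4, 8}                      # hypothetical stride-2 positions
child = mutate_blocks(parent, stride2, random.Random(0))
forced = mutate_blocks(parent, stride2, random.Random(0), prob=1.0)
print(child, forced)
```

The channel selector ratios could be mutated in the same way by drawing a new index from the channel search space instead of a block identifier.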

F. Training From Scratch or Fine-Tuning
After finishing the multi-objective search, we obtain a set of non-dominated architectures. Generally speaking, there are two ways to handle each searched subnet on the non-dominated front. One is to inherit the weights from the supernet and fine-tune them; the other is to train the subnet from scratch. In both cases, we use TRADES-YOPO-m-n as the adversarial training algorithm. We examine the differences between these two approaches in the next section.

IV. EXPERIMENTAL RESULT AND ANALYSIS
In this section, we introduce our experimental settings for the overall framework. In addition, we give a guideline for how to devise neural network architectures that defend against adversarial attacks. We perform extensive studies on CIFAR-10 [45] and SVHN [46] to validate the effectiveness of our overall framework. On CIFAR-10, we zero-pad the images with 4 pixels on each side and randomly crop them back to the original size. Then we randomly flip the images horizontally and normalize them into [0, 1] for the CIFAR-10 and SVHN datasets. In order to better investigate the influence of the network architecture on robustness under adversarial attacks, we assume that the adversary has complete access to the neural network, including the architecture and all parameters. This is why we focus on white-box attacks on different neural network architectures.

A. Experimental Settings
1) Supernet: According to Algorithm 1, our supernet first enters the block sampling phase, in which the number of training epochs is set to 500. We provide 22 different blocks for the block sampling search space. We use stochastic gradient descent (SGD) as our optimizer, with a batch size of 512, a momentum of 0.9, and a weight decay of 5e-4. The initial learning rate in block sampling is set to 0.1 and is divided by 10 at epochs 200, 400, and 450. After that, we jointly sample the blocks and channels of each layer in our supernet. We increase the number of epochs to 1000 with a batch size of 512, a momentum of 0.9, and a weight decay of 5e-4. We set our channel selector ratio to 1.8 and 2.0 before epoch 520 and add one more channel selector ratio, 1.6, at epoch 540. The initial learning rate in block and channel joint sampling is set to 0.1 and is divided by 10 at epochs 600, 700, and 800. The λ of Eq. (3) is set to 1, which means that we try to balance the model performance on adversarial and non-adversarial examples.
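The step-wise learning rate schedule above can be written as a small function. This is a sketch that assumes each division by 10 takes effect at the start of the stated epoch; the function name `step_lr` is hypothetical. It behaves like a multi-step schedule such as PyTorch's `MultiStepLR` with `gamma=0.1`.

```python
def step_lr(epoch, base_lr=0.1, milestones=(200, 400, 450), gamma=0.1):
    """Learning rate after dividing base_lr by 10 at each milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Block sampling phase: milestones at epochs 200, 400, and 450.
block_phase = [step_lr(e) for e in (0, 199, 200, 400, 450)]

# Joint sampling phase: milestones at epochs 600, 700, and 800.
joint_phase = [step_lr(e, milestones=(600, 700, 800)) for e in (500, 600, 700, 800)]

print(block_phase)
print(joint_phase)
```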
2) NSGA-II: In our experimental settings, the total population size is 100. $P_0$ is sorted by non-dominated sorting, and the size of $P_0$ is 50. From Line 9 to Line 10 in Algorithm 2, we employ the tournament scheme and the mutation operator discussed in Sec. III-E1 and Sec. III-E2 to generate an offspring population $Q_0$ of size 50. The number of individuals that take part in the tournament is 10. During each optimization iteration (Lines 14-32), a combined population $R_t = P_t \cup Q_t$ of size 100 is formed. Note that the population sizes of $P_{t+1}$ and $Q_{t+1}$ are both 50. The number of generations is set to 20. We use the hypervolume (HV) to indicate whether our search algorithm has converged. Most of our experiments indicate that our multi-objective search algorithm has converged by the 18th generation.
3) Training From Scratch: According to Fig. 2, we obtain a set of non-dominated subnet architectures. We randomly initialize the weights of each subnet and set the number of training epochs for every subnet to 100, although we find that most subnets have converged by epoch 40. The initial learning rate of each subnet is 0.1 and is divided by 10 at epochs 20, 40, and 80. The optimizer is SGD with a batch size of 512, a momentum of 0.9, and a weight decay of 5e-3. We evaluate our models under white-box $l_\infty$-bounded PGD attacks with different epsilon values and numbers of steps. The epsilon for evaluation ranges from 2/255 to 8/255 with an interval of 2/255. The number of PGD attack steps for evaluation ranges from 10 to 50. The hyperparameters of the fine-tuning method are the same as those mentioned above. Fine-tuning inherits the weights of each subnet from the supernet as initialization, while training from scratch randomly initializes the weights of each subnet.
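The evaluation attack can be illustrated with a small $l_\infty$ PGD sketch. To stay self-contained it attacks a toy linear softmax classifier with an analytic input gradient instead of a searched subnet; the model, the random data, the random start, and the step size of eps/4 are all assumptions, while the epsilon of 8/255 and the 10 steps are among the settings listed above.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def pgd_linf(x, y_onehot, W, b, eps, step_size, steps, rng):
    """l_inf PGD against a linear softmax classifier with logits x @ W + b."""
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(steps):
        probs = softmax(x_adv @ W + b)
        grad = (probs - y_onehot) @ W.T              # d(cross-entropy)/d(input)
        x_adv = x_adv + step_size * np.sign(grad)    # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)     # project onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)             # keep a valid pixel range
    return x_adv


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(8, 20))              # toy inputs in [0, 1]
W = rng.standard_normal((20, 3))
b = np.zeros(3)
labels = np.argmax(x @ W + b, axis=1)                # toy labels: clean predictions
y = np.eye(3)[labels]

eps = 8 / 255
x_adv = pgd_linf(x, y, W, b, eps, step_size=eps / 4, steps=10, rng=rng)
adv_acc = np.mean(np.argmax(x_adv @ W + b, axis=1) == labels)
print(adv_acc)
```

Repeating the call for eps in {2/255, 4/255, 6/255, 8/255} and for 10 to 50 steps gives the grid of attack settings used in the evaluation.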

B. Supernet Transferability
In this section, we aim to understand whether using a weaker PGD attack for the adversarial training of the supernet largely deteriorates the adversarial performance of its subnets. Specifically, we adjust the strength of the PGD attack by changing the number of attack steps and the epsilon size. To begin with, we build a baseline for our best subnet on the CIFAR-10 and SVHN datasets, which is presented in the first row of […] steps for its own adversarial training. Moreover, we can easily observe that the adversarial accuracy of the subnets is strongly affected by the epsilon size but not by the number of attack steps. For example, the differences between the adversarial accuracies under $P^{10}_{2/255}$, $P^{30}_{2/255}$, and $P^{50}_{2/255}$ are very small regardless of how the supernet is trained. The same conclusion holds in Table III. Therefore, we think that this observation not only helps us save considerable training time by reducing the number of attack steps but also makes it possible for the subnets to obtain a stronger adversarial defensive ability.
2) Epsilon Size: Table IV indicates that the supernet is able to improve its subnets' representation abilities if it is not overloaded by the epsilon size. First, the subnet of $S^{10}_{2/255}$ achieves the highest clean accuracy, up to 81.95%, but its adversarial accuracy under the $P^{10}_{8/255}$, $P^{30}_{8/255}$, and $P^{50}_{8/255}$ attacks does not surpass 9%. Our assumption is that the subnet has not fully developed its resilience because its supernet could not learn from adversarial examples generated with a larger epsilon size. This assumption is supported by the observation that the subnets of $S^{10}_{4/255}$ and $S^{10}_{6/255}$ greatly increase the adversarial accuracy under the $P^{10}_{8/255}$ attack. Another assumption is that if the epsilon size exceeds what the supernet can handle, both the clean accuracy and the adversarial accuracy of the subnets are reduced. We can easily observe that the subnet of $S^{10}_{6/255}$ obtains the best adversarial performance, while $S^{10}_{8/255}$ largely weakens its subnet's performance, both in clean accuracy and under the $P^{10}_{8/255}$, $P^{10}_{6/255}$, and $P^{10}_{4/255}$ attacks. In other words, when the epsilon size exceeds 6/255 in our case, the subnet performance begins to decline gradually. The same conclusion holds in Table V.

C. The Model Size
Table VI shows that the adversarial performance of a model may not be closely correlated with the model size, which is inconsistent with the conclusion in [3]. When comparing $S^{2}_{8/255}$ with our baseline model $S^{10}_{8/255}$, the subnet model size drops by 3.751%, but the clean accuracy and the adversarial accuracy under $P^{10}_{8/255}$, $P^{10}_{6/255}$, $P^{10}_{4/255}$, and $P^{10}_{2/255}$ increase by 4.591%, 2.642%, 4.864%, 5.838%, and 6.420%, respectively. When comparing the subnets of $S^{10}_{6/255}$ and $S^{10}_{8/255}$, we find that the best subnet size drops by 3.640%, but the clean accuracy and the adversarial accuracy under $P^{10}_{6/255}$, $P^{10}_{4/255}$, and $P^{10}_{2/255}$ increase by 7.447%, 1.63%, 5.338%, and 8.256%, respectively. We therefore think that the performance of a subnet is strongly correlated with the training mode of its supernet rather than with its own model size.

D. Training From Scratch or Fine Tuning
Fig. 5 shows that there is a large gap between training from scratch and fine-tuning. The three axes of each graph represent the number of parameters, the adversarial error, and the clean error of each subnet, respectively. The subscript of each graph in Fig. 5 denotes the supernet training mode. The red points represent the solutions on the non-dominated front obtained by NSGA-II, where the weights of the subnets are randomly initialized. By contrast, the blue points are solutions obtained in the same way but with the weights of the subnets inherited from the supernet. We can clearly observe that, no matter how the supernet is trained, the subnets trained from scratch perform better on both adversarial and non-adversarial examples. We hypothesize that the role of the supernet in our NAS framework is to find the best architecture for the subnet rather than to deliver the best weights to the subnet. Fine-tuning is not always beneficial to training. The reason may be that the weights of each newly sampled subnet are not good enough, as each is trained for only 20 epochs. However, we can still find the best subnets among them by means of NSGA-II during the optimization in terms of the objectives.

E. Subnets Analysis
This section analyzes the architectures of the top ten subnets in terms of the clean accuracy and the adversarial accuracy, together with the training method of their corresponding supernets, on the CIFAR-10 and SVHN datasets. Our aim is to gain insights from our best results and to reveal rules for designing more robust tiny neural networks.
1) Adversarial Error, Clean Error and the Size of Neural Network: Fig. 6 shows the non-dominated fronts obtained by NSGA-II on the CIFAR-10 dataset, whose counterpart supernets are $2S^{1}_{8/255}$, $S^{10}_{6/255}$ and $S^{2}_{8/255}$, respectively. We use circles to represent the subnets, and the size of each circle indicates the subnet size (number of parameters). In Figs. 6(b) and (c), a trade-off between adversarial error, clean error and network size can be clearly observed. Fig. 7 shows that the order of the non-dominated subnets changes greatly after training from scratch. To illustrate this change, each subnet (circle) is drawn in a different color, and circles of the same color in Fig. 6 and Fig. 7 share the same network architecture. An important observation from Fig. 6 and Fig. 7 is that most tiny neural networks (small circles) achieve a significant reduction in both adversarial error and clean error after training from scratch. For instance, point G in Fig. 7(a), which has the lowest clean error and the lowest adversarial error, had larger adversarial and clean errors before training from scratch. The same holds for point F when comparing Fig. 6(b) with Fig. 7(b). Hence, we conclude that our pipeline can effectively increase the adversarial accuracy and clean accuracy of tiny neural networks.
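As an illustration of what the first non-dominated front means for the three objectives used here, the following minimal sketch filters a set of candidate subnets by Pareto dominance. It is not the full NSGA-II procedure (no ranking of later fronts, no crowding distance), and the function names and the numbers in the example are hypothetical.

```python
def dominates(a, b):
    """Return True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def first_nondominated_front(points):
    """Indices of the points that are not dominated by any other point."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]

# Each tuple is (adversarial error, clean error, parameters in millions);
# the values are made up for illustration only.
subnets = [
    (0.50, 0.20, 1.0),
    (0.40, 0.30, 1.2),
    (0.60, 0.30, 1.3),
    (0.50, 0.20, 0.9),
]
front = first_nondominated_front(subnets)
```

In this toy example the first subnet is dominated by the fourth (same errors, fewer parameters) and the third by the second, so only the second and fourth subnets remain on the front.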
2) Block and Channel Analysis: We further analyze which blocks and channel expansion ratios are used most frequently in the top 10 subnets from the search results. Our assumption is that finding the best trade-off tiny neural network architecture can be viewed as a combinatorial optimization problem. Namely, each layer of the network can be viewed as a combination of certain blocks and channels, and the optimal architecture as a combination of such layers. Table VII shows our block encoding scheme and layer statistics. The upper part of Table VII explains how the block for each block identifier is built. For instance, Block 0 is composed of a ShuffleV2 block and an SE layer. The kernel size of Block 0 is 3. The internal composition of each block is shown in Fig. 3(a). In addition, Fig. 8 shows which two blocks are most frequently adopted in each layer. The data in Fig. 8 come from Table VII. In our block encoding scheme, block identifiers less than eight denote pure tiny blocks. Block identifiers between 9 and 17 denote tiny blocks enhanced for robustness (tiny robust blocks). Block identifiers larger than 18 denote pure robust blocks. From Fig. 8, we observe that the top trade-off tiny neural networks prefer robust blocks or tiny robust blocks in the first four layers, while the last eight layers prefer pure tiny blocks. Our explanation is that, since the PGD attack mainly relies on pixel-wise perturbations, the robust blocks can mitigate the effect of the attack in the first several layers, and the tiny blocks help the network balance clean performance against the number of parameters. Fig. 9 shows which two channel expansion ratios are used most frequently in each layer. From Fig. 9, we observe that the top trade-off tiny neural networks prefer larger channel numbers in the first three layers, after which the channel number gradually declines. The top 10 tiny subnets use the smallest channel number in the last three layers. Our explanation is that, since the PGD attack mainly relies on pixel-wise perturbations, wider channels in the first several layers help mitigate the adversarial attack, and the gradually declining channel number keeps the network tiny.

Fig. 8. Block statistics for each layer of the top 10 subnets on CIFAR-10. The horizontal axis represents the layer id of the subnets. For instance, "0" denotes the first layer of the subnets. The vertical axis shows the two blocks most frequently adopted in each layer. For instance, block id 15 is the most frequently used in the first layer of the top 10 subnets, and block ids 13 and 4 are the second most frequently used in the first layer.
3) Assumption Verification: To verify our assumption, the most frequently used blocks and channels of each layer in Fig. 8 and Fig. 9 are selected as the block and channel choices, respectively, in our model. We call this new subnet architecture Lego-Net. Table VIII shows that Lego-Net performs better than our state-of-the-art subnet $2S^{1}_{8/255}$, especially in adversarial performance. Specifically, Lego-Net increases adversarial accuracy by 10.36% under $P^{10}_{8/255}$ and clean accuracy by 0.78%, while its size drops by 1.52%. We achieve the same result on the SVHN dataset, as shown in Table IX. In conclusion, to build tiny robust neural networks, we should place more pure robust or tiny robust blocks in the shallow layers and pure tiny blocks in the remaining layers. In terms of channel design, we should use wider intermediate channels in the shallow layers and gradually reduce the intermediate channels in the remaining layers.
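The layer-wise selection behind Lego-Net can be sketched as a per-layer majority vote over the top subnets. The function name `most_frequent_per_layer`, the three-layer toy encodings and the tie-breaking behaviour are assumptions made for illustration; the actual subnets have more layers and their statistics are those reported in Table VII.

```python
from collections import Counter

def most_frequent_per_layer(architectures):
    """For each layer index, return the choice that occurs most often
    across the given architectures (equal-length sequences).
    Ties go to the choice seen first, which is how Counter.most_common
    orders equal counts."""
    n_layers = len(architectures[0])
    return [
        Counter(arch[layer] for arch in architectures).most_common(1)[0][0]
        for layer in range(n_layers)
    ]

# Hypothetical encodings of three top subnets with three layers each.
top_blocks = [[15, 4, 2], [15, 13, 2], [13, 4, 1]]                  # block ids per layer
top_ratios = [[1.8, 1.6, 0.4], [1.8, 1.2, 0.4], [1.6, 1.6, 0.6]]    # expansion ratios per layer

lego_blocks = most_frequent_per_layer(top_blocks)
lego_ratios = most_frequent_per_layer(top_ratios)
```

The resulting pair of lists plays the role of the Lego-Net architecture description, which would then be trained from scratch like any other subnet.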

Fig. 1. Multi-objective one-shot NAS framework. The first step is to design the supernet search space and the loss function. The second step is to sample different subnet architectures from the supernet under a fixed distribution and train the subnets for a small number of epochs. The third step is to run the multi-objective search for the best trade-off subnets, obtaining the fitness values of the subnets by cloning the weights from the supernet. The fourth step is to train the non-dominated subnets.

Fig. 3. (a) and (b) represent the internal architectures of the tiny robust blocks. K refers to the kernel size, which is 3, 5 or 7. The dashed line indicates the internal search space of the tiny robust blocks. If the non-local layer is removed from the main branch of (a) and (b), they become pure tiny blocks. (c) represents the internal architectures of the pure robust blocks. The upper part of (c) represents the combination of an SE layer and an Embedded-Gaussian non-local layer. The bottom part of (c) represents the combination of an SE layer and a Gaussian non-local layer.

Fig. 4. (a), (b) and (c) show how we add the channel selector into the tiny robust blocks and the pure robust blocks, respectively. K refers to the kernel size, which is 3, 5 or 7. The dashed line indicates the internal search space of the tiny robust blocks. If the non-local layer is removed from the main branch of (a) and (b), they become pure tiny blocks.

Fig. 5. Difference between training from scratch and fine-tuning. The red points represent the solutions on the non-dominated front obtained by NSGA-II, where the weights of the subnets are randomly initialized. By contrast, the blue points are solutions obtained in the same way, except that the weights of the subnets are inherited from the supernet.

Fig. 6. The first non-dominated front before training from scratch with respect to adversarial error, clean error and the number of parameters of subnets on CIFAR-10. The size of each circle indicates the number of parameters of the subnet. The vertical axis represents the clean error of the subnets, while the horizontal axis represents their adversarial error.

Fig. 9. Channel statistics for each layer of the top 10 subnets on CIFAR-10. The horizontal axis represents the layer id of the subnets. The vertical axis shows the two channel expansion ratios most frequently used in each layer. For instance, channel expansion ratio 1.8 is the most frequently used in the first layer of the top 10 subnets, and channel expansion ratio 1.6 is the second most frequently used in the first layer.
• We explore a new adversarial training paradigm for the supernet. Because the subnets rely heavily on the supernet in one-shot NAS, the adversarial performance of the subnets can be further improved by using our proposed training paradigm. To this end, we analyse how the width of the supernet, the perturbation range and the number of attack steps used in the supernet adversarial training affect the performance of the subnets.
• We seamlessly integrate a multi-objective search algorithm with a one-shot NAS algorithm. After the search procedure finishes, we directly obtain the non-dominated front, which speeds up finding the best trade-off subnets. In addition, we discover that training from scratch outperforms fine-tuning for the non-dominated subnets.
• We draw a conclusion about how to design a tiny robust neural network. Firstly, the robust blocks, i.e. pure robust blocks and tiny robust blocks, should be placed in the shallow layers, while pure tiny blocks should be placed in the deep layers. Secondly, larger intermediate channels should be used in the shallow layers, and the intermediate channels should gradually decline in the remaining layers.

as the multi-objective search algorithm. The multi-objective search algorithm is presented in Algorithm 2. The first objective is the clean accuracy. It evaluates the performance of the model without being attacked. The second

Algorithm 2 Multi-Objective Search
Input: population size $N$, crossover probability $p_c$, mutation probability $p_m$, number of generations $G$, supernet $S_w$, clean accuracy predictor $S_c$, adversarial accuracy $S_a$, parameters calculator $S_{cp}$, $P_0$ the first generated population, $P_g$ the $g$th generated population, $F_0$ all non-dominated fronts of the first generated population, $F_g$ all non-dominated fronts of the $g$th generated population, $F^{0}_{g}$ the first non-dominated front of the $g$th generated population, $F^{i}_{g}$ the $i$th non-dominated front of the $g$th generated population, $Q_0$ the first offspring produced from the first population $P_0$, $Q_g$ the $g$th offspring produced from the $g$th population $P_g$
Output: $F^{0}_{G}$, the first non-dominated front of the $G$th generation
1 $g \leftarrow 0$ // initialize a generation counter
2 $P_0 \leftarrow$ Initialize Population $N$ from Supernet $S_w$

Table II and Table III. The first column denotes how we train the supernet. For instance, the subscript of $S^{10}_{8/255}$ is the epsilon size of the PGD attack for the supernet, which is set to 8/255, and the superscript is the number of attack steps, which is set to 10. The last column denotes the best adversarial accuracy of the subnet model under different degrees of attack. For example, $P^{10}_{8/255}$ means that the subnet is under a PGD attack with an epsilon size of 8/255 and 10 attack steps. Since we focus on the network architecture under adversarial attacks, the subnet presented in the following tables is the non-dominated architecture that achieves the best adversarial performance after training from scratch.
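The two quantities in this notation, the epsilon size and the number of attack steps, are the main parameters of an L-infinity PGD attack. The following minimal sketch shows where each enters, using a toy logistic-regression model in NumPy instead of a neural network. The step size `alpha`, the omission of a random start and the function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def logistic_loss_and_grad(w, x, y):
    """Loss log(1 + exp(-y * w.x)) and its gradient with respect to the
    input x, for a label y in {-1, +1}."""
    margin = y * np.dot(w, x)
    loss = np.log1p(np.exp(-margin))
    grad_x = -y * w / (1.0 + np.exp(margin))
    return loss, grad_x

def pgd_linf(w, x, y, eps=8 / 255, steps=10, alpha=2 / 255):
    """Projected gradient ascent on the loss within an L-infinity ball.
    eps corresponds to the subscript and steps to the superscript in the
    notation above; alpha is an assumed step size."""
    x_adv = x.copy()
    for _ in range(steps):
        _, grad = logistic_loss_and_grad(w, x_adv, y)
        x_adv = x_adv + alpha * np.sign(grad)      # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep a valid pixel range
    return x_adv

rng = np.random.default_rng(0)
w = rng.normal(size=16)            # toy model weights
x = rng.uniform(0.0, 1.0, 16)      # toy "image" with pixels in [0, 1]
x_adv = pgd_linf(w, x, y=1)        # S^{10}_{8/255}-style setting: eps = 8/255, 10 steps
```

Reducing `steps` to 2 or 1 gives the cheaper attack settings that are compared against the 10-step baseline.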
1) Number of Attack Steps: Table II indicates that the subnet performs better even if the supernet is adversarially trained with fewer attack steps. In comparison with $S^{10}_{8/255}$, $S^{2}_{8/255}$ increases the adversarial accuracy and clean accuracy by 6.4% and 4.5%, respectively, and reduces the subnet model size by 3.6%. In addition, the gap between $S^{1}_{8/255}$ and our baseline $S^{10}_{8/255}$ is very small for all objectives. We therefore surmise that the supernet retains strong transferability even if we reduce the number of attack

TABLE III
SUPERNET TRANSFERABILITY ON NUMBER OF ATTACK STEPS IN TERMS OF ACCURACY ON SVHN (%)
1 8/255, $S^{10}_{6/255}$ and $S^{2}_{8/255}$, respectively. We use circles to represent the subnets and the size of circle indicates its size (number of parameters). In Figs. 6 (b) and

TABLE IV
SUPERNET TRANSFERABILITY ON EPSILON SIZE IN TERMS OF ACCURACY ON CIFAR-10 (%)
Table VII shows our block encoding scheme and layer statistics. The upper part of Table VII explains how the block for each block identifier is built. For instance, Block 0 is composed of a ShuffleV2 block and an SE layer. The kernel

TABLE VIII
THE LEGO SUBNET EXPERIMENTS ON CIFAR-10 (%)
.4565.31 57.43 79.71 73.42 65.27 57.38 79.70 73.40 65.18 57.32
From Table X we can clearly see that LEGO $2S^{1}_{8/255}$ outperforms handcrafted tiny neural networks, such as ShuffleNetV2, MobileNetV3 and ResNet18, in both accuracy and size.