Introduction

Deep neural networks (DNNs) have made significant progress in various challenging tasks, including image classification [1,2,3,4], object detection [5,6,7], and segmentation [8, 9]. One of the key factors behind this progress lies in the innovation of neural architectures. For example, VGGNet [1] suggested the use of smaller convolutional filters and stacked a series of convolution layers to achieve better performance. ResNet [10] introduced residual blocks to benefit the training of deeper neural networks. DenseNet [11] designed densely connected blocks to stack features from different depths. Generally, manually designing a powerful and efficient neural network architecture requires a great deal of expert experience and domain knowledge. Only recently has a series of neural architecture search (NAS) methods been proposed, bringing great convenience to ordinary users and learners and allowing them to benefit from the success of deep neural networks.

Generally, a NAS task can be regarded as a complex optimization problem. In machine learning and computational intelligence, population-based intelligent optimization algorithms, such as the genetic algorithm (GA) and particle swarm optimization (PSO), have been widely adopted under the concept of neuroevolution to optimize the topology and hyperparameters of neural networks since the late 1990s [12,13,14]. Recently, many NAS methods employing population-based intelligent optimization algorithms as search strategies have attracted increasing attention. Although intelligent optimization algorithms, such as GA, have competitive search performance on various complex optimization tasks, they still suffer from high computational costs. This shortcoming is particularly pronounced in NAS tasks, since the NAS process involves a large number of architecture evaluations. More specifically, for the NAS task, each network architecture evaluation involves the complete training of a deep neural network from scratch on a large amount of data. For example, Hierarchical EA [15] consumes 300 GPU days, and AmoebaNet-A [16] consumes 3150 GPU days, to search architectures on CIFAR-10.

In addition, reinforcement learning (RL) has also been adopted to design neural architectures automatically, as in [7, 17, 18]. Despite their remarkable performance, RL-based NAS methods are also computationally expensive. For example, it takes 2000 GPU days for the typical RL-based method NASNet-A to obtain an optimized CNN architecture on CIFAR-10. These methods require a large amount of computational resources, which is unaffordable for most researchers and learners. To reduce the computational cost, ENAS [18] proposed a parameter-sharing strategy, which shares weights among architectures through the use of a super-network and has been adopted in various gradient-descent (GD)-based NAS methods, such as [19,20,21]. Compared with EA-based and RL-based NAS methods, GD-based NAS methods, which apply gradient descent to optimize the weights of candidate architectures, are usually more efficient. However, GD-based NAS methods still have some limitations, such as requiring excessive GPU memory during the search and converging prematurely to local optima [22, 23].

Recently, some population-based methods, such as the various EA-based methods [15, 16, 24,25,26,27,28], have been utilized for NAS tasks and have achieved some progress. The explosion gravitation field algorithm (EGFA) [29], inspired by the formation process of planets, is a novel intelligent optimization algorithm with excellent global optimization capability and remarkable efficiency compared with classical population-based optimization algorithms such as GA and PSO. Nowadays, computational time and resource limitations remain the major bottleneck in using and developing NAS methods. Thus, this paper attempts to develop a more efficient NAS method by utilizing the working mechanisms of EGFA, aiming to discover an optimal neural architecture with competitive learning accuracy while consuming little computational time and few resources. Specifically, the proposed EGFA-NAS utilizes EGFA and gradient descent jointly to optimize the weights of the candidate architectures. To reduce the computational cost, EGFA-NAS adopts a training strategy that utilizes the population mechanism of EGFA. To improve efficiency and performance, EGFA-NAS adopts a weight inheritance strategy for the dust individuals newly generated during the explosion operation. The main contributions of this paper are summarized as follows.

  1.

    A novel population-based NAS method, called EGFA-NAS, is proposed, which utilizes EGFA and gradient descent to optimize the weights of candidate architectures jointly, and is applicable to any micro search space with a fixed number of edges and a defined candidate operation set, such as NAS-Bench-201 and the DARTS search space.

  2.

    A training strategy is proposed to reduce the computational cost by utilizing the population mechanism. Specifically, all dust individuals cooperate to complete the training of the dataset at each epoch. Although each dust individual is trained on only part of the batches at each epoch, it will be trained on all batches over a large number of epochs.

  3.

    A weight inheritance strategy is proposed to improve performance and efficiency. Specifically, during the explosion operation, the weights \(w\) of each newly generated dust individual are inherited from its center dust. By utilizing this strategy, the newly generated individuals can be evaluated directly at the current epoch without retraining.

  4.

    The experimental results show that the optimal neural network architectures searched by EGFA-NAS have competitive learning accuracy and require the least computational cost, compared with four kinds of state-of-the-art NAS methods.

The remainder of the paper is organized as follows. “Related work” reviews the related work. “Proposed NAS method” describes the details of the proposed NAS method. The experimental design and results are presented in “Experimental design” and “Experimental results”, respectively. Finally, the conclusion is given in “Conclusion”.

Related work

General formulation of NAS task

NAS is an extremely complex optimization task, the primary objective of which is to transform the process of manually designing neural networks into automatically searching for optimal architectures. The process of NAS is depicted in Fig. 1. During the search, the search strategy samples a candidate architecture from the search space. The architecture is then trained to convergence and its performance is evaluated. Next, the search strategy selects another candidate architecture for training and evaluation according to the evaluation result of the previous one.

Fig. 1 Process of neural architecture search

In NAS tasks, denote a neural network architecture as \(A\) and the weights of all functions of the neural network as \(w_{A}\). Then the goal of NAS is to find an architecture \(A\), which can achieve the minimum validation loss \(L_{{\text{V}}}\) after being trained by minimizing the training loss \(L_{{\text{T}}}\), as shown in Eq. (1).

$$ \begin{gathered} \mathop {\min }\limits_{A} \, L_{{\text{V}}} \left( {w_{A}^{ * } ,A} \right) \, \hfill \\ {\text{s.t.}}\,w_{A}^{ * } = \arg \mathop {\min }\limits_{w} \, L_{{\text{T}}} \left( {w,A} \right), \hfill \\ \end{gathered} $$
(1)

where \(w_{A}^{ * }\) denotes the optimal weights of \(A\), which achieve the minimum loss on the training dataset. \(L_{{\text{T}}}\) and \(L_{{\text{V}}}\) are the losses on the training dataset and validation dataset, respectively. Both losses are determined not only by the architecture \(A\) but also by the weights \(w\). This is a bi-level optimization problem [30] with \(A\) as the upper-level variable and \(w\) as the lower-level variable.
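Eq. (1) couples the two variables: every candidate \(A\) is scored only after its weights have been fitted on the training split. The snippet below is a minimal, self-contained sketch of one alternating update of \(w\) (inner level, training loss) and \(A\) (outer level, validation loss), assuming a PyTorch-style toy model with two mixed operations; the names, dimensions, and optimizers are illustrative only, not the implementation used in this paper.

```python
# Minimal sketch of the bi-level optimization in Eq. (1): the inner step
# updates the weights w on the training split, the outer step updates the
# architecture parameters A (here `alpha`) on the validation split.
import torch
import torch.nn.functional as F

d = 16
ops = torch.nn.ModuleList([torch.nn.Linear(d, d), torch.nn.Identity()])
alpha = torch.zeros(len(ops), requires_grad=True)        # architecture logits A
head = torch.nn.Linear(d, 10)

def forward(x):
    p = F.softmax(alpha, dim=0)                           # relaxed operation choice
    return head(sum(pk * op(x) for pk, op in zip(p, ops)))

w_opt = torch.optim.SGD(list(ops.parameters()) + list(head.parameters()), lr=0.025)
a_opt = torch.optim.Adam([alpha], lr=3e-4)

x_t, y_t = torch.randn(8, d), torch.randint(0, 10, (8,))  # dummy training batch
x_v, y_v = torch.randn(8, d), torch.randint(0, 10, (8,))  # dummy validation batch

w_opt.zero_grad(); F.cross_entropy(forward(x_t), y_t).backward(); w_opt.step()
a_opt.zero_grad(); F.cross_entropy(forward(x_v), y_v).backward(); a_opt.step()
```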

NAS methods

The search strategy determines how to sample neural network architectures. According to the kind of search strategy, NAS methods can be roughly divided into three categories: EA-based, RL-based, and GD-based NAS methods.

EA-based NAS methods

EA-based NAS methods use evolutionary algorithms (EAs) to sample neural architectures. Early EA-based research on the optimization of networks was proposed under the concept of neuroevolution [12,13,14], which optimizes not only the network topology but also the hyperparameters and connection weights associated with the network. Over the past years, EA-based NAS methods have attracted increasing attention. For example, Xie et al. published the first EA-based NAS work, GeNet [31], in 2017, which encodes candidate architectures using fixed-length binary strings. Real et al. searched network architectures by EA, starting the search from trivial initial conditions [27]. Subsequently, Real et al. evolved an image classifier, AmoebaNet-A [16], which modifies tournament selection by introducing a concept of age and surpassed hand designs for the first time. Liu et al. proposed Hierarchical EA [15], which combines a novel hierarchical genetic representation scheme imitating the modularized design pattern with an expressive search space. Elsken et al. proposed LEMONADE [24], an evolutionary algorithm for multi-objective architecture search. Suganuma et al. constructed CNN architectures based on Cartesian genetic programming (CGP) [25]. Sun et al. proposed CNN-GA [26] and AE-CNN [32], which evolve CNN architectures using GA based on ResNet and DenseNet blocks. To accelerate the fitness evaluation in evolutionary deep learning, Sun and Wang et al. proposed an end-to-end offline performance predictor based on the random forest [33].

Although the neural network architectures searched by the above EA-based NAS methods have achieved competitive performance compared with state-of-the-art hand-designed CNNs, as population-based methods they still suffer from huge resource costs because they involve a large number of fitness evaluations. During the search phase, each newly generated candidate architecture needs to be trained on a training dataset and evaluated on a validation dataset. Consequently, most EA-based NAS methods are time-consuming. For example, to search architectures on the CIFAR-10 dataset, Hierarchical EA [15] needs 300 GPU days, AmoebaNet-A [16] needs 3150 GPU days, CNN-GA [26] needs 35 GPU days, and AE-CNN [32] needs 27 GPU days. Thus, it is essential to accelerate the evaluation process for EA-based NAS methods, especially under the condition of limited computational resources.

RL-based NAS methods

The agent, environment, and reward are the three elements of reinforcement learning (RL). In the context of NAS, sampling network architectures from the search space by a controller is defined as the action of the agent, the performance of the network is regarded as the reward, and the controller is updated based on the reward in the next iteration. The earliest RL-based NAS method was proposed by Zoph et al. in 2017, which used RNNs as controllers to sample network architectures and generated actions via policy gradients [7]. Subsequently, Zoph et al. used a proximal optimization strategy to optimize the RNN controller [17]. Cai et al. presented an RL-based algorithm, ProxylessNAS [34], which is an alternative strategy to handle hardware metrics. BlockQNN [35] automatically builds high-performance networks using the Q-learning paradigm with an epsilon-greedy exploration strategy.

Earlier RL-based NAS methods are usually computationally expensive. To reduce the computational cost, work [17] proposed the well-known NASNet search space, which allows us to search the best cell on the CIFAR-10 dataset and then apply this cell to the ImageNet dataset by stacking together more copies of this cell. ENAS [18] proposed a parameter-sharing strategy and the one-shot estimator (OSE), which regards all candidate architectures as the subgraphs of the super-network. Then all candidate architectures can share the parameters.

GD-based NAS methods

Recently, there has been increasing interest in adopting gradient descent (GD) methods for NAS tasks. A typical GD-based NAS method is DARTS [19], which optimizes the network architecture parameters by GD after converting the discrete search space into a continuous one through a relaxation strategy. Subsequently, Dong et al. proposed GDAS [20], which develops a learnable differentiable sampler to accelerate the search procedure. Xie et al. proposed SNAS [21], which trains neural operation parameters and architecture distribution parameters with a novel search gradient. The above-mentioned ProxylessNAS [34] also proposed a gradient-based approach to handle non-differentiable hardware objectives.

Compared with EA-based and RL-based NAS methods, GD-based NAS methods are very efficient, because they represent the structures of the candidate networks as directed acyclic graphs (DAGs) and use the parameter-sharing strategy. However, GD-based NAS methods have some drawbacks. For example, references [22, 23] point out that DARTS tends to select skip-connection operations, which leads to performance degradation of the searched architectures. To overcome this shortcoming of DARTS [19], several variants have been proposed, such as DARTS- [36], DARTS+ [37], RC-DARTS [38], and β-DARTS [39].

Besides the above three kinds of NAS methods, there are also other NAS methods that are not mentioned or do not fully fall into the above categories. For example, Liu et al. proposed the PNAS [40], which uses a sequential model-based optimization (SMBO) strategy.

Explosion gravitation field algorithm

The explosion gravitation field algorithm (EGFA) [29] is a novel optimization algorithm based on the original GFA [40,41,43], which simulates the formation process of planets based on SNDM [44]. It was proposed by our research team in 2019 and has achieved good performance on optimization problems and tasks, such as benchmark functions [29] and feature selection [45]. Compared with classical population-based intelligent algorithms, such as the genetic algorithm (GA) and particle swarm optimization (PSO), EGFA has better global optimization capability and remarkable efficiency. In addition, it has been proven that EGFA converges to the global best solution with probability 1 under certain conditions [29].

In EGFA, all individuals are mimicked as dust particles with mass, and each of them belongs to a certain group. In every group, the particle with the maximum mass value is regarded as the center dust and the others are surrounding dust particles. Based on the idea of SNDM [44], each center dust attracts its surrounding dust through the gravitation field, which makes all surrounding dust particles move toward their centers. In EGFA, each dust particle can be represented by a four-tuple \(\left( {\text{location, mass, group, flag}} \right)\), where \({\text{flag}}\) is a Boolean value indicating whether it is a center, location corresponds to a solution of the problem, group indicates the group number, and mass is the value of the objective function; a larger mass indicates a better solution. There are six basic operations in EGFA, as shown in Fig. 2: (1) dust sampling (DS), (2) initialize, (3) group, (4) move and rotate, (5) absorb, and (6) explode. The detailed process of EGFA is summarized as follows:

Fig. 2 Flow chart of EGFA

Step 1: Subspace location by dust sampling (DS). The task of DS is to efficiently locate a sufficiently small search subspace that is more likely to contain the optimal solution.

Step 2: Initialize the dust population randomly based on the subspace located by Step 1.

Step 3: Divide the dust population into several subgroups randomly, and calculate the mass value of all individuals. In each group, set the dust particle with the maximum mass value as the center and set its flag to 1; set the other individuals as surrounding dust particles and set their flags to 0.

Step 4: Check the stop condition. If the stop condition is met, return the best solution and terminate the algorithm; otherwise, go to Step 5.

Step 5: Perform the movement and rotation operation. In each group, each center attracts its surrounding dust particles by the gravitation field, and the gravitation field makes all surrounding dust particles move toward their centers.

Step 6: Perform the absorbing operation. Some surrounding dust particles that are sufficiently close to their centers are absorbed by the centers. The size of the dust population decreases in this process.

Step 7: Perform the explosion operation, in which some new dust particles are generated around the centers. When the explosion operation is accomplished, the algorithm returns to Step 4.

In addition, DS in Step 1 avoids a long iterative process, because the algorithm only searches in a subspace that is small compared with the original search space. The explosion operation maintains the size of the population and prevents the algorithm from stagnating due to falling into local optima.
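To make Steps 1–7 concrete, the following is a toy, self-contained sketch of the EGFA loop applied to a simple benchmark function (minimizing \(\left\| x \right\|^{2}\)); DS is omitted for brevity, and the grouping, movement, absorption, and explosion rules are simplified illustrations rather than the reference implementation of [29].

```python
# Toy sketch of the EGFA loop (Steps 2-7) on a simple benchmark function.
import numpy as np

rng = np.random.default_rng(0)

def mass(x):                        # larger mass = better solution
    return -np.sum(x ** 2)

def egfa(dim=5, n=20, g=2, absorb_rate=0.5, radius=0.1, iters=100):
    dust = rng.uniform(-1.0, 1.0, size=(n, dim))             # Step 2: initialize
    for _ in range(iters):                                    # Step 4: stop condition
        groups = np.array_split(rng.permutation(n), g)        # Step 3: random grouping
        new_dust = []
        for idx in groups:
            grp = dust[idx]
            center = grp[np.argmax([mass(x) for x in grp])]   # best mass -> center
            grp = grp + 0.1 * (center - grp)                  # Step 5: move toward center
            order = np.argsort([mass(x) for x in grp])
            keep = grp[order[int(len(grp) * absorb_rate):]]   # Step 6: absorb worst half
            born = center + radius * rng.standard_normal((len(grp) - len(keep), dim))
            new_dust.append(np.vstack([keep, born]))          # Step 7: explode around center
        dust = np.vstack(new_dust)
    return dust[np.argmax([mass(x) for x in dust])]

print(egfa())   # approaches the optimum at the origin
```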

In this work, we propose a NAS method based on the explosion gravitation field algorithm, EGFA-NAS for short. In EGFA-NAS, an individual (a dust particle) represents a candidate network architecture. EGFA-NAS aims to discover a network architecture with the best performance, such as accuracy on the test dataset. For the NAS task, a sufficiently small subspace containing the best architecture is hard to locate, and locating it is computationally intensive; therefore, EGFA-NAS abandons the first operation, DS. As a population-based method for the NAS task, there are several key issues to be addressed, namely: (1) which type of search space to search, (2) how to represent and encode a CNN, (3) how to accelerate the architecture evaluation process, and (4) how to use heuristic information to guide the search process.

Proposed NAS method

Micro search spaces, such as the NASNet [17], DARTS [19], and NAS-Bench-201 [23] search spaces, have been widely utilized for NAS tasks recently. They search for neural cells that form blocks, and construct the macro skeleton of the network by stacking multiple blocks, as in [16,17,18,19,20, 23, 46]. In this work, we propose an efficient NAS method for micro search spaces. To investigate the performance of our proposed method sufficiently, we choose two classical micro search spaces, NAS-Bench-201 and the DARTS search space, for testing.

Representation of search space

In this work, we search for a computation cell as the building block of the final architecture and represent a cell as a directed acyclic graph (DAG). Specifically, a node represents the information flow, e.g., a feature map in CNNs, and an edge between two nodes denotes a candidate operation, chosen from modules known to be successful and designed by human experts. We denote \(O\) as the candidate operation set. To process the intermediate nodes more efficiently in the forward propagation, two kinds of cells need to be searched: a normal cell with a stride of 1 and a reduction cell (block) with a stride of 2. Once the two kinds of cells are identified, we can stack multiple copies of the searched cells to build a whole neural network. In the rest of this section, we introduce the two search spaces, NAS-Bench-201 and the DARTS search space, respectively.

NAS-Bench-201

NAS-Bench-201 was proposed by Dong et al. [23] and is an algorithm-agnostic micro search space. Specifically, a cell from NAS-Bench-201 includes one input node and three computational nodes; the last computational node is also the output node for the next cell. Every edge in a cell has five candidate options. A cell in NAS-Bench-201 can therefore be represented as a fully connected DAG, and there are \(5^{{C_{4}^{2} }} = 15,625\) cell candidates in total. In NAS-Bench-201, the candidate operation set \(O\) contains the following five operations: (1) zeroize, (2) skip-connection, (3) \(1 \times 1\) convolution, (4) \(3 \times 3\) convolution, and (5) \(3 \times 3\) average pooling.
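As a quick sanity check of the search-space size, the cell has \(C_{4}^{2} = 6\) edges among its four nodes and each edge chooses one of the five operations listed above, giving \(5^{6} = 15,625\) candidate cells. The operation names in the snippet follow the wording above and may differ from the identifiers used in the NAS-Bench-201 API.

```python
# Search-space size of NAS-Bench-201: 6 edges, 5 operations per edge.
from math import comb

ops = ["zeroize", "skip-connection", "conv 1x1", "conv 3x3", "avg-pool 3x3"]
edges = comb(4, 2)            # 6 edges among the 4 nodes of a cell (one per pair i < j)
print(len(ops) ** edges)      # 15625 candidate cells
```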

As shown in Fig. 3, the macro skeleton of NAS-Bench-201 is mainly composed of three normal blocks connected by two reduction blocks. Each normal block consists of \(B\) normal cells. The reduction block is the basic residual block [10], which serves to down-sample the spatial size and double the channels of the input feature map. The skeleton begins with one \(3 \times 3\) convolution and ends with a global average pooling layer that flattens the feature map into a feature vector.

Fig. 3 Macro skeleton of NAS-Bench-201

Additionally, work [23] evaluates each candidate architecture of NAS-Bench-201 on three different datasets: CIFAR-10, CIFAR-100 [47], and ImageNet-16-120 [48]. Thus, once the final architecture is found, retraining is not necessary, and we can directly obtain the network's final performance via the API provided by [23].

DARTS search space

The DARTS [19] search space is a popular micro search space proposed by Liu et al. in 2019, which is similar to the NASNet [17] search space but removes some unused operations and adds some powerful ones. Specifically, a cell from the DARTS search space contains two input nodes, four computational nodes, and one output node. The output node is the concatenation of the four computational nodes. As depicted in Fig. 4, there are 14 searchable edges in a cell, and each edge has 8 options. Unlike NAS-Bench-201, the nodes in a cell are not fully connected during the search phase. Moreover, during the evaluation phase, each node only connects with two previous nodes. In the DARTS search space, the candidate operation set \(O\) contains the following eight operations: (1) identity, (2) zeroize, (3) \(3 \times 3\) depth-wise separable convolution, (4) \(3 \times 3\) dilated depth-wise separable convolution, (5) \(5 \times 5\) depth-wise separable convolution, (6) \(5 \times 5\) dilated depth-wise separable convolution, (7) \(3 \times 3\) average pooling, and (8) \(3 \times 3\) max pooling.

Fig. 4 Macro skeleton of DARTS search space

As shown in Fig. 4, \(B\) normal cells are stacked into one normal block. A given image passes through a \(3 \times 3\) convolution and then through three normal blocks with two reduction cells in between. In this paper, we follow [19] to set up the overall network architecture of the DARTS search space.

Overview of search process

Figure 5 shows an overview of the search process in EGFA-NAS: (a) the operations on the edges are initially unknown; (b) the search space is continuously relaxed, and candidate operations are sampled for the edges with the mixing probabilities; (c) the mixing probabilities and the weights of the cells are optimized simultaneously; (d) the final cell structure is inferred from the learned mixing probabilities.

Fig. 5 Overview of the search process

Representation and encoding of cell

As discussed in “Representation of search space”, the cells to search in this work can be represented by the DAGs. Specifically, each computational node represents one feature map, which is transformed from the previous feature map. Each edge in this DAG is associated with an operation transforming the feature map from one node to another node. All possible operations are selected from a candidate operation set \(O\). Then the output of any node \(j\) can be formulated as Eq. (2).

$$ I_{j} = \sum\limits_{i < j} {o_{i,j} \left( {I_{i} } \right)} , $$
(2)

where \(I_{i}\) and \(I_{j}\) represent the outputs of node \(i\) and node \(j\), respectively, and \(o_{i,j}\) represents the operation transforming the feature map from node \(i\) to node \(j\), which is selected from the candidate operation set \(O\).

In NAS-Bench-201 [23], a normal cell contains four nodes, i.e., \(\left\{ {I_{i} \left| {0 \le i \le 3} \right.} \right\}\). \(I_{0}\) is the output tensor of the previous layer, \(I_{1} ,I_{2} ,I_{3}\) are the output tensors of node 1, 2, and 3, calculated by Eq. (2). According to work [23], a normal cell contains six edges and each edge has five candidate operations.

In the DARTS search space, a cell contains seven nodes, i.e., \(\left\{ {I_{i} \left| {0 \le i \le 6} \right.} \right\}\). \(I_{0}\) and \(I_{1}\) are the input tensors, and \(I_{2} ,I_{3} ,I_{4}\) and \(I_{5}\) are the output tensors of nodes 2, 3, 4, and 5. \(I_{6}\) indicates the output of this cell, which is the concatenation of the four computational nodes, i.e., \(I_{6} = {\text{concat}}\left( {I_{2} ,I_{3} ,I_{4} ,I_{5} } \right)\).

Define \(e\) as the number of edges of a cell and \(\left| O \right|\) as the size of the candidate operation set \(O\). According to the above description of NAS-Bench-201 and the DARTS search space, a cell can be encoded as a tensor \(A\) of size \(e \times \left| O \right|\). In NAS-Bench-201, \(e = 6\), \(\left| O \right| = 5\), and \(A\) is a tensor of size \(6 \times 5\). In the DARTS search space, \(e = 14\), \(\left| O \right| = 8\), and \(A\) is a tensor of size \(14 \times 8\). A general representation of a cell is formulated as Eq. (3).

$$ A = \left[ \begin{gathered} \vec{a}_{0} \\ \vec{a}_{1} \\ \vdots \\ \vec{a}_{p} \\ \vdots \\ \vec{a}_{e - 1} \\ \end{gathered} \right] = \left[ \begin{gathered} a_{0}^{0} , \, a_{0}^{1} , \, \cdots ,a_{0}^{q} , \cdots , \, a_{0}^{\left| O \right| - 1} \hfill \\ a_{1}^{0} , \, a_{1}^{1} , \, \cdots ,a_{1}^{q} , \cdots , \, a_{1}^{\left| O \right| - 1} \hfill \\ \, \vdots \, \vdots \, \cdots \, \vdots \, \cdots \, \vdots \hfill \\ a_{p}^{0} , \, a_{p}^{1} , \, \cdots ,a_{p}^{q} , \cdots , \, a_{p}^{\left| O \right| - 1} \hfill \\ \, \vdots \, \vdots \, \cdots \, \vdots \, \cdots \, \vdots \hfill \\ a_{e - 1}^{0} ,a_{e - 1}^{1} , \cdots ,a_{e - 1}^{q} , \cdots ,a_{e - 1}^{\left| O \right| - 1} \hfill \\ \end{gathered} \right], $$
(3)

where \(\vec{a}_{p}\) represents the probabilities of sampling the \(\left| O \right|\) candidate operations for edge \(p\), and \(a_{p}^{q}\) is the \(q\)th element of \(\vec{a}_{p}\), representing the probability of sampling the \(q\)th candidate operation for edge \(p\). In fact, the encoding of a cell as in Eq. (3) can be used for any micro search space whose cells have a fixed number of edges \(e\) and a defined candidate operation set \(O\).
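A minimal sketch of this encoding, assuming NAS-Bench-201 dimensions (\(e = 6\), \(\left| O \right| = 5\)); the random values and the row-wise normalization are illustrative:

```python
# Cell encoding A from Eq. (3): an e x |O| matrix whose row p holds the
# sampling probabilities of the |O| candidate operations for edge p.
import numpy as np

rng = np.random.default_rng(0)
e, num_ops = 6, 5                                    # NAS-Bench-201 dimensions
logits = rng.standard_normal((e, num_ops))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # rows sum to 1
discrete_cell = A.argmax(axis=1)                     # one operation index per edge
print(A.shape, discrete_cell)
```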

Continuous relaxation of the search space

As described in “Representation of search space”, a neural network architecture consists of many copies of the cell. These cells are sampled from the NAS-Bench-201 or DARTS search space. Specifically, from node \(j\) to node \(i\), we sample the transformation function from the candidate operation set \(O\) with a discrete probability \(\alpha_{{\left( {i \leftarrow j} \right)}}\). During the search, we calculate each node in a cell by Eq. (4).

$$ I_{i} = \sum\limits_{j = 0}^{i - 1} {\sum\limits_{k = 0}^{\left| O \right| - 1} {\alpha_{{\left( {i \leftarrow j} \right)}}^{k} } } o^{k} \left( {I_{j} ,w_{{\left( {i \leftarrow j} \right)}}^{k} } \right), $$
(4)

where \(\left| O \right|\) is the number of candidate operations in the set \(O\), \(\alpha_{{\left( {i \leftarrow j} \right)}}^{k}\) represents the probability that \({\text{edge}}_{{\left( {i \leftarrow j} \right)}}\) (from node \(j\) to node \(i\)) selects the \(k\)th candidate operation as the transformation function, \(o^{k}\) represents the \(k\)th candidate operation, \(I_{j}\) is the output of node \(j\), and \(w_{{\left( {i \leftarrow j} \right)}}^{k}\) is the weight of the function \(o^{k}\) on \({\text{edge}}_{{\left( {i \leftarrow j} \right)}}\). To make the search space continuous, we relax the probability of a particular operation \(\alpha_{{\left( {i \leftarrow j} \right)}}^{k}\) to a softmax over all possible operations by Eq. (5).

$$ \alpha_{{\left( {i \leftarrow j} \right)}}^{k} = \frac{{\exp \left( {\left( {\alpha_{{\left( {i \leftarrow j} \right)}}^{k} + c_{k} } \right)/\tau } \right)}}{{\sum\nolimits_{{k^{\prime} = 0}}^{{\left| O \right| - 1}} {\exp \left( {\left( {\alpha_{{\left( {i \leftarrow j} \right)}}^{{k^{\prime}}} + c_{{k^{\prime}}} } \right)/\tau } \right)} }}, $$
(5)

where the \(c_{k}\) are i.i.d. samples from \({\text{Gumbel}}\left( {0,1} \right)\), i.e., \(c_{k} = - \log \left( { - \log \left( u \right)} \right)\) with \(u \sim {\text{Unif}}\left[ {0,1} \right]\). \(\tau\) is the softmax temperature; in this work, \(\tau\) is set to 10, the same as in study [23].
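A small numerical sketch of the relaxation in Eq. (5) for a single edge, assuming the standard Gumbel-softmax form in which the perturbed logits are divided by the temperature \(\tau = 10\); the logits here are random placeholders.

```python
# Gumbel-softmax relaxation of one edge's operation choice, as in Eq. (5).
import numpy as np

rng = np.random.default_rng(0)
alpha = rng.standard_normal(5)                     # raw logits for |O| = 5 operations
u = rng.uniform(1e-8, 1.0, size=alpha.shape)
c = -np.log(-np.log(u))                            # Gumbel(0, 1) noise
z = (alpha + c) / 10.0                             # temperature tau = 10
probs = np.exp(z) / np.exp(z).sum()                # relaxed operation probabilities
print(probs, probs.sum())
```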

Training strategy

In this work, we aim to reduce the computational cost by utilizing the population mechanism of EGFA-NAS. The main idea of the training strategy is illustrated in Fig. 6. Specifically, define \(D_{T}\) as the training dataset, \({\text{batch}}\_{\text{num}}\) as the number of batches of \(D_{T}\), and \(n\) as the population size. At each epoch, each dust individual is trained on \(k\) batches, where \(k = \left\lceil {{\text{batch}}\_{\text{num}}/n} \right\rceil\), and all dust individuals cooperate to complete the training of the dataset. This process repeats until the maximum number of epochs is reached. Each dust individual (architecture network) will be trained on many different batches, since the number of batches \({\text{batch}}\_{\text{num}}\) is usually larger than the population size \(n\) and the training process is repeated for a large number of epochs. In this work, we set \({\text{batch}}\_{\text{num}} = 98\), \(n = 20\), and \(k = 5\) for CIFAR-10, and set the maximum number of epochs as 80 and 200 for NAS-Bench-201 and the DARTS search space, respectively. Although each dust individual (architecture network) is trained on only a subset (about \(1/n\) of the training data) at each epoch, it will be trained on all training data over a large number of epochs by this training strategy.

Fig. 6 Training strategy for EGFA-NAS
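A minimal sketch of the batch assignment described above, with \({\text{batch}}\_{\text{num}} = 98\) and \(n = 20\) as used for CIFAR-10; the contiguous block assignment here is only one possible way to distribute the batches and is not necessarily the scheme used in the implementation.

```python
# At each epoch the batch list is split among the n dust individuals, so each
# individual trains on only ceil(batch_num / n) batches.
import math

batch_num, n = 98, 20
k = math.ceil(batch_num / n)                       # 5 batches per individual
assignment = {i: list(range(i * k, min((i + 1) * k, batch_num))) for i in range(n)}
print(k, assignment[0], assignment[19])            # the last individual gets the remainder
```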

In addition, because each dust individual is responsible for only part of the training work and the complete training of each epoch is done with the participation of all individuals, the efficiency of EGFA-NAS is not sensitive to the setting of the population size \(n\), which will be experimentally confirmed in “Parameter settings for NAS-Bench-201”.

Explosion operation and weights inheritance

In the context of neural architecture search, a dust particle in EGFA-NAS represents a candidate architecture. It not only maintains the original four attributes, i.e., location, mass, group number, and a Boolean flag indicating whether it is a center, as described in “Explosion gravitation field algorithm”, but also maintains an attribute \(w\) to record the weights of the functions in the cells. Each dust particle can thus be represented by a five-tuple \(\left( {{\text{location, \, w, \, mass, \, group, \, flag}}} \right)\). In EGFA-NAS, the location corresponds to the operation mixing probabilities \(A\), so a neural network architecture can be represented as \(\left( {A,w,{\text{mass, group, flag}}} \right)\).

As a population-based NAS method, the main computational bottleneck of EGFA-NAS is the large number of architecture evaluations. In this work, we attempt to reduce the computational cost by taking advantage of the working mechanism of EGFA. On the one hand, additional computational cost is incurred at each epoch because a number of newly generated dust particles (architectures) need to be trained during the explosion operation. On the other hand, the new dust particles are generated based on the center dust, so there is a close relationship between the newly generated dust particles and their center. Based on these two observations, we propose a weight inheritance strategy for the explosion operation. The details of the explosion operation in EGFA-NAS are described in Algorithm 1.

Algorithm 1 Explosion operation with weight inheritance in EGFA-NAS

As shown in Algorithm 1, the first part (lines 1–8) is the process of generating new individuals based on the center dust. The mixing probabilities \(A\) of the candidate operations of \(dust_{i}\) are computed as in line 4, and the weights \(w\) of the functions in cells are inherited from the center dust as in line 5. The second part (lines 9–14) calculates the mass value of each newly generated dust particle and updates the parameter \(w\). Line 15 combines the dust population \(Dust_{absorb}\) (the output of the previous absorption operation) with the newly generated dust population \(Dust_{new}\). The last part (lines 16–18) updates the center dust of each group. By utilizing weight inheritance, the newly generated dust particles can be evaluated directly at the current epoch without retraining.

Figure 7 illustrates the process of generating new dust particles by means of weight inheritance during the explosion operation. \(\vec{a}_{i}\) represents the probabilities of sampling the \(\left| O \right|\) candidate operations for edge \(i\), and \(w_{i}\) records the weights of the functions for edge \(i\). The right part of Fig. 7 shows the newly generated dust population of size \(m\); the mixing probabilities \(A\) of the new dust particles are derived from their center as in line 4 of Algorithm 1, and the parameters \(w\) are inherited from their center dust particle as in line 5 of Algorithm 1.

Fig. 7 Process of generating new dust particles by weight inheritance during the explosion operation
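A self-contained sketch of the explosion operation with weight inheritance, assuming a simple Dust container with NumPy arrays for \(A\) and \(w\); the Gaussian perturbation of the center's mixing probabilities stands in for line 4 of Algorithm 1 and is illustrative only.

```python
# Explosion with weight inheritance: each new dust particle perturbs the
# center's mixing probabilities A and copies the center's weights w, so it
# can be evaluated at the current epoch without retraining.
import copy
import numpy as np

class Dust:
    def __init__(self, A, w):
        self.A = A          # e x |O| mixing probabilities (the "location")
        self.w = w          # weights of the functions in the cells

def explode(center, m, radius, rng):
    new_dust = []
    for _ in range(m):
        A_new = center.A + radius * rng.standard_normal(center.A.shape)
        w_new = copy.deepcopy(center.w)              # weight inheritance from the center
        new_dust.append(Dust(A_new, w_new))
    return new_dust

rng = np.random.default_rng(0)
center = Dust(A=rng.random((6, 5)), w={"cell0": rng.random((16, 16))})
children = explode(center, m=10, radius=0.1, rng=rng)
print(len(children), np.allclose(children[0].w["cell0"], center.w["cell0"]))
```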

Process of EGFA-NAS

As described above, during the process of NAS, the two parameters, the architecture \(A\) and the weights \(w\), need to be optimized simultaneously. To solve this bi-level optimization problem, we divide the original training dataset into two parts, the new training dataset \(D_{{\text{T}}}\) and the validation dataset \(D_{{\text{V}}}\), then use \(D_{{\text{T}}}\) to optimize the parameter \(w\) and \(D_{{\text{V}}}\) to optimize the parameter \(A\). In EGFA-NAS, we apply EGFA and gradient descent jointly to optimize the weights \(w\) and the architecture \(A\) in an iterative way. The process of EGFA-NAS is described in detail as follows:

Step 1: Initialize all parameters, including the size of the dust population \(n\), the number of groups \(g\), the absorptivity \({\text{abs}}\) for the absorption operation, the number of epochs \(epoch_{\max }\), and the maximum radius \(r_{\max }\) and minimum radius \(r_{\min }\) for the explosion operation; initialize the dust population \(Dust = \left\{ {dust_{0} ,dust_{1} , \cdots ,dust_{n - 1} } \right\}\) randomly. For each \(dust_{i}\), the location (the \(i\)th cell architecture \(dust_{i} .A\)) is initialized randomly as an \(e \times \left| O \right|\) tensor as in Eq. (3). After initialization, each cell can be stacked into a neural network, and the loss on the training dataset \(L_{{\text{T}}}\) and the loss on the validation dataset \(L_{{\text{V}}}\) can be calculated. To optimize the two parameters \(w\) and \(A\) simultaneously, we use Eq. (6) to evaluate the performance of a network architecture and denote it as the mass value of \(dust_{i}\). Note that \(L_{{\text{T}}}\) and \(L_{{\text{V}}}\) are not the losses of the network architecture after full training, but the losses on the training dataset and the validation dataset at the current epoch, respectively.

$$ dust_{i} .mass = L_{{\text{T}}}^{2} + L_{{\text{V}}}^{2} , $$
(6)

where the losses \(L_{{\text{T}}}\) and \(L_{{\text{V}}}\) are calculated by the cross-entropy loss function [49] in Eq. (7).

$$ L = - \frac{1}{s}\sum\limits_{x} {\left( {y\ln \hat{y} + \left( {1 - y} \right)\ln \left( {1 - \hat{y}} \right)} \right)} , $$
(7)

where \(x\) represents a data sample, \(y\) is the true label, \(\hat{y}\) represents the predicted label, and \(s\) is the number of data samples.
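A small sketch of the mass computation in Eqs. (6) and (7), assuming PyTorch's cross-entropy; the logits and labels are random placeholders for the outputs of a partially trained candidate on the current training and validation batches.

```python
# Mass value of a dust particle as defined in Eq. (6): the squared training
# and validation losses at the current epoch, summed.
import torch
import torch.nn.functional as F

logits_t, y_t = torch.randn(8, 10), torch.randint(0, 10, (8,))   # training batch
logits_v, y_v = torch.randn(8, 10), torch.randint(0, 10, (8,))   # validation batch
L_T = F.cross_entropy(logits_t, y_t)
L_V = F.cross_entropy(logits_v, y_v)
mass = L_T ** 2 + L_V ** 2
print(float(mass))
```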

Step 2: Divide the dust population into \(g\) subgroups. In EGFA-NAS, the value of \(g\) is set to 2. In each group, the dust particle with the maximum mass is set as the center dust, and the others are the surrounding dust particles. For \(dust_{i}\), the attribute \(flag\) is set as in Eq. (8), where \(best\_mass_{j}\) is the maximum mass value in group \(j\).

$$ dust_{i} .flag = \left\{ \begin{gathered} 1,\quad dust_{i} .mass = best\_mass_{j} ,\;j < g \hfill \\ 0,\quad {\text{otherwise}} \hfill \\ \end{gathered} \right., $$
(8)

Step 3: Check the termination conditions. There are two termination conditions in EGFA-NAS: one is reaching the maximum number of epochs, and the other is based on the average change of the mass values of the dust population. Once either condition is met, the main loop of EGFA-NAS ends, the optimal network architecture \(A\) is returned, and the structure of the neural network is deduced; otherwise, go to Step 4.

Step 4: Perform the movement and rotation operation. The surrounding dust particles move toward the center dust. For each dust particle \(dust_{i}\), the pace of movement is calculated by Eq. (9).

$$ \Delta A_{1} = p * \left( {\exp \left( {center.A + 3} \right) - \exp \left( {dust_{i} .A + 3} \right)} \right) + q * A_{{{\text{random}}}} , $$
(9)

where \(center.A\) represents the cell structure of the center dust, \(dust_{i} .A\) represents the \(i\)th cell structure, and \(A_{{{\text{random}}}}\) is a \(6 \times 5\) tensor generated randomly. \(p\) is the pace of movement, and \(q\) is a value close to zero. In this work, we set \(p = 0.1\) and \(q = 0.001\), respectively. We denote the pace of the movement and rotation operation on the location of \(dust_{i}\) as \(\Delta A_{1}\). In addition, in EGFA-NAS, we also apply gradient descent to optimize the parameters \(A\) and \(w\). We denote the pace of gradient descent on the location of \(dust_{i}\) as \(\Delta A_{2}\), which is calculated by Eq. (10).

$$ \Delta A_{2} = - \xi_{2} \nabla_{{dust_{i} .A}} L_{V} \left( {dust_{i} .w,dust_{i} .A} \right), $$
(10)

where \(\xi_{2}\) is the learning rate, \(\nabla_{{dust_{i} .A}} L_{V}\) represents the architecture gradient on validation dataset.

As shown in Fig. 8, considering the impacts of the above two factors on the cell structure \(A\), the location of \(dust_{i}\) is updated as Eq. (11)

$$ dust_{i} .A = dust_{i} .A + \Delta A_{1} + \Delta A_{2} . $$
(11)
Fig. 8 Change of the \(i\)th cell structure \(A\) during the process of the movement and rotation operation

During this process, for each dust particle \(dust_{i}\), we not only need to optimize the parameter \(dust_{i} .A\), but also need to optimize the parameter \(dust_{i} .w\), which is updated by Eq. (12).

$$ dust_{i} .w = dust_{i} .w - \xi_{1} \nabla_{{dust_{i} .w}} L_{T} \left( {dust_{i} .w,dust_{i} .A} \right), $$
(12)

where \(\xi_{1}\) is the learning rate and \(\nabla_{{dust_{i} .w}} L_{T}\) represents the gradient of the training loss with respect to the weights.
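A compact sketch of one movement-and-rotation update combining Eqs. (9)–(12) for a single dust particle, with NumPy arrays standing in for \(A\) and \(w\); the learning rates, array shapes (NAS-Bench-201 case for \(A\)), and gradients are illustrative placeholders.

```python
# One update of dust_i: movement toward the center (Eq. 9), gradient step on
# A using the validation loss (Eqs. 10-11), and gradient step on w using the
# training loss (Eq. 12). Gradients here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
p, q = 0.1, 0.001                 # movement pace and small random component
xi1, xi2 = 0.025, 3e-4            # illustrative learning rates

A_i, A_center = rng.random((6, 5)), rng.random((6, 5))
w_i = rng.random((16, 16))
grad_A_val = rng.standard_normal(A_i.shape)     # placeholder for grad_A L_V
grad_w_train = rng.standard_normal(w_i.shape)   # placeholder for grad_w L_T

dA1 = p * (np.exp(A_center + 3) - np.exp(A_i + 3)) + q * rng.random(A_i.shape)  # Eq. (9)
dA2 = -xi2 * grad_A_val                                                          # Eq. (10)
A_i = A_i + dA1 + dA2                                                            # Eq. (11)
w_i = w_i - xi1 * grad_w_train                                                   # Eq. (12)
```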

Step 5: Perform the absorption operation. Some surrounding dust particles with small mass values will be absorbed by their center dust. During this process, the size of the dust population changes; the new size is determined by the absorptivity \({\text{abs}}\) as in Eq. (13).

$$ n = n * \left( {1 - {\text{abs}}} \right), $$
(13)

where \(n\) is the size of the dust population before absorption and \({\text{abs}}\) represents the absorptivity. In this work, we set \({\text{abs}}\) to 0.5.

Step 6: Perform the explosion operation. During Step 5, some dust particles with small mass values are absorbed by their center dust particles. To maintain the size of the dust population, some new dust particles are generated around the center dust particles during this process. This part is described in detail in “Explosion operation and weights inheritance”.

Once Step 6 finishes, go to Step 3.

According to the above detailed description of EGFA-NAS, the pseudo-code of EGFA-NAS is shown in Algorithm 2. Step 1 (lines 1–3) is the initialization. Step 2 (lines 4–5) is the grouping operation. Step 3 (line 6) checks the termination conditions. Step 4 (lines 7–12) is the process of movement and rotation. Step 5 (line 13) is the absorption operation. Step 6 (line 14) is the explosion operation.

Algorithm 2 Overall process of EGFA-NAS

Experimental design

The goal of EGFA-NAS is to automatically search for the optimal neural network architecture that can achieve satisfying performance on a complex task, such as image classification. For this purpose, a series of experiments is designed to demonstrate the advantages of the proposed EGFA-NAS compared with the state-of-the-art NAS methods. First, we utilize the proposed EGFA-NAS to search neural network architectures in the benchmark search space NAS-Bench-201, and evaluate the performance of the proposed EGFA-NAS by investigating the classification accuracy and computational cost of the searched architectures on CIFAR-10, CIFAR-100, and ImageNet-16-120. Second, we investigate the consistency of the relative evaluation with the absolute evaluation, in terms of accuracy and loss. Third, we investigate the effectiveness of the weight inheritance strategy. Finally, we examine the proposed EGFA-NAS in the larger and more practical DARTS search space, and investigate the performance and universality of EGFA-NAS.

We first run the proposed EGFA-NAS in the benchmark search space NAS-Bench-201. When the search process terminates, the absolute performance evaluation of the optimal architecture can be obtained directly via the NAS-Bench-201 API with negligible computational cost. By utilizing NAS-Bench-201, we verify the consistency of the relative performance evaluation with the absolute performance evaluation for the searched network architectures without retraining from scratch. In addition, we verify the effectiveness of weight inheritance in the NAS-Bench-201 search space. In contrast, when the search process in the DARTS search space terminates, the optimal network architecture needs to be retrained from scratch and tested on the test datasets. The test classification accuracy is reported as the result of our experiments.

In the rest of this section, we introduce the peer competitors to compare with this proposed EGFA-NAS, the benchmark datasets, and finally the parameter setting for the two typical search spaces: NAS-Bench-201 and DARTS search space.

Peer competitors

To demonstrate the advantages of the proposed EGFA-NAS, a series of competitors is chosen for comparison. “Competitors of NAS-Bench-201” introduces the competitors against which the optimal architecture searched by EGFA-NAS in the NAS-Bench-201 search space is compared, and “Competitors of DARTS search space” introduces the competitors against which the optimal architecture searched by EGFA-NAS in the DARTS search space is compared.

Competitors of NAS-Bench-201

Because NAS-Bench-201 (with only five candidate operations) is a smaller search space and its best architecture has lower classification accuracy than the best ones found in other search spaces, the performance of the optimal architecture searched by EGFA-NAS in NAS-Bench-201 is only compared with competitors that have reported results in the NAS-Bench-201 search space.

The selected competitors are mainly the efficient GD-based NAS methods, including DARTS-V1 [19], DARTS-V2 [19], SETN [50], iDARTS [51], and GDAS [20]. The other three selected NAS competitors, namely ENAS [18], RSPS [22], and EvNAS [52], utilize RL, random search, and EA as the search strategies for NAS tasks, respectively.

Competitors of DARTS search space

DARTS search space is a functional search space for NAS tasks, in which the optimal network architecture has promising performance compared with the state-of-the-art manually designed CNN architectures. To compare the performance of the optimal network architecture searched by EGFA-NAS in the DARTS search space, we select four different kinds of competitors for comparison.

  1.

    The first kind of competitors are the state-of-the-art CNN architectures, manually designed by domain experts, including ResNet-101 [10], DenseNet-BC [11], SENet [53], IGCV3 [54], ShuffleNet [55], VGG [1], and Wide ResNet [56].

  2.

    The second kind of competitors are the state-of-the-art EA-based NAS methods, including Hierarchical EA [15], AmoebaNet-A [16], LEMONADE [24], CGP-CNN [25], CNN-GA [26], AE-CNN [32], AE-CNN + E2EPP [33], LargeEvo [27], GeNet [31], SI-EvoNet [57], NSGA-Net [28], and MOEA-PS [58].

  3.

    The third kind of competitors utilize RL to search for CNN architectures, such as NASNet-A [17], NASNet-A + CutOut [17], Proxyless NAS [34], BlockQNN [35], DPP-Net [59], MetaQNN [60], and ENAS [18].

  4.

    The fourth kind of competitors are mainly the GD-based NAS methods, such as DARTS-V1 + CutOut [19], DARTS-V2 + CutOut [19], RC-DARTS [38], and SNAS [21]. In addition, PNAS [40] is also selected for comparison, which uses a sequential model-based optimization (SMBO) strategy.

Benchmark datasets

To investigate the performance of EGFA-NAS on NAS tasks, we test EGFA-NAS in two different search spaces: NAS-Bench-201 and the DARTS search space. All experiments involve three benchmark datasets, CIFAR-10, CIFAR-100 [47], and ImageNet-16-120 [48], which are widely adopted in experimental studies of state-of-the-art CNNs and NAS methods. In this work, each architecture searched in NAS-Bench-201 is trained and evaluated on CIFAR-10, CIFAR-100 [47], and ImageNet-16-120 [48]. Each architecture searched in the DARTS search space is trained and evaluated on CIFAR-10 and CIFAR-100. Each dataset is split into three subsets: a training set, a validation set, and a test set.

CIFAR-10: It is an image classification dataset consisting of 60K images in ten classes. The original set contains 50K training images and 10K test images. Due to the need for a validation set, the original training set is randomly split into two subsets of equal size, each containing 25K images in ten classes. In this work, we regard one subset as the new training set and the other as the validation set.

CIFAR-100: It has the same images as CIFAR-10, but categorizes them into 100 fine-grained classes. The original CIFAR-100 contains 50K images in the training set and 10K images in the test set. In this work, the original training set is randomly split into two subsets of equal size. One is regarded as the new training set and the other as the validation set.

ImageNet-16-120: ImageNet is a large-scale and well-known dataset for image classification. ImageNet-16-120 was built with \(16 \times 16\) pixels from the down-sampled variant of ImageNet [61] (i.e., ImageNet \(16 \times 16\)) and contains all images with labels \(\in \left[ {0,119} \right]\). In sum, ImageNet-16-120 consists of 151.7K images for training, 3K images for validation, and 3K images for testing, in 120 classes.

Parameter settings

This section introduces the parameter settings for EGFA-NAS in detail.

Parameter settings for NAS-Bench-201

For the NAS-Bench-201 search space, the parameter settings only involve the search process, because NAS-Bench-201 provides the absolute (final) performance evaluation for each architecture, and we can obtain the evaluation of the optimal architecture directly without retraining from scratch. We adopt the same skeleton network as [23], shown in Fig. 3. Specifically, we set the number of initial channels for the first convolution layer as 16 and the number of cells in one normal block \(B\) as 5. During the search, most parameter settings follow [23], as shown in Table 1. Specifically, we train each architecture via Nesterov momentum SGD, using the cross-entropy loss as the loss function with a batch size of 256. We set the weight decay as \(5 \times 10^{ - 4}\) and decay the learning rate from \(2.5 \times 10^{ - 2}\) to \(1 \times 10^{ - 3}\) with a cosine annealing scheduler.

Table 1 Hyperparameter settings of searching process

In the NAS-Bench-201 search space, we use the same hyperparameters on the three datasets CIFAR-10, CIFAR-100 [47], and ImageNet-16-120 [48], except for the data augmentation, due to the slight difference in image resolution. For CIFAR-10 and CIFAR-100, we use random flipping with a probability of 0.5, random cropping of a 32 × 32 patch with 4 pixels of padding, and normalization over the RGB channels. For ImageNet-16-120, we use the same strategies, except that we randomly crop a 16 × 16 patch with 2 pixels of padding.
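A sketch of the described augmentation pipeline for CIFAR-10/100, assuming torchvision-style transforms; the normalization statistics below are the commonly used CIFAR-10 mean and standard deviation, which are an assumption rather than values reported in the paper.

```python
# CIFAR training augmentation: random flip (p = 0.5), random 32x32 crop with
# 4-pixel padding, and per-channel normalization.
from torchvision import transforms

cifar_train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465),   # assumed CIFAR-10 statistics
                         std=(0.2470, 0.2435, 0.2616)),
])
```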

The parameters listed in Table 1 are related to neural network architecture. As a population-based method, EGFA-NAS has its own parameters. Specifically, we set the number of groups \(g\) as 2, set the absorptivity \({\text{abs}}\) as 0.5 for absorb operation, set the maximum radius \(r_{\max }\) as 0.1, and set the minimum radius \(r_{\min }\) as 0.001 for the explosion operation.

For a population-based NAS method, a larger number of epochs may lead to better performance, but the computational cost will also increase. We investigate the impact of the maximum number of epochs on the performance and computational cost on the CIFAR-10 dataset. The relative and absolute performance (accuracy) of the best architecture searched by EGFA-NAS on CIFAR-10 with different numbers of epochs is shown in Table 2. The relative performance of the searched architectures is evaluated at the last epoch of the search phase without retraining. The absolute performance of the searched architecture is queried via the API provided by NAS-Bench-201. From the results in Table 2, we can observe that the best performance (93.67% accuracy on CIFAR-10) is achieved when the number of epochs is set as 80. When the number of epochs is increased to 100, no improvement in the absolute performance is achieved, although the computational cost increases. Hence, we set the number of epochs as 80 in the experiments for NAS-Bench-201.

Table 2 Relative and absolute performance (accuracy) of best architectures searched by EGFA-NAS on CIFAR-10 with different number of epochs

Generally, population size is a vital factor for the performance and efficiency of a population-based method: a larger population size usually leads to better performance but also increases the search cost. However, EGFA-NAS adopts the proposed training strategy, which utilizes all dust individuals to complete the data training at each epoch. This training strategy reduces the sensitivity of performance to the population size, as verified by the results in Table 3. Specifically, EGFA-NAS achieves not only similar performance but also similar search cost (GPU days) with different population sizes. In addition, the architectures searched by EGFA-NAS achieve the best absolute performance when the population size \(n \ge 20\). In view of the above observations, we set the population size \(n\) to 20 in this work. In short, the absolute performance (accuracy) and search cost (GPU days) of EGFA-NAS are closely related to the maximum number of epochs but not much related to the population size.

Table 3 Relative and absolute performance (accuracy) of best architecture searched by EGFA-NAS on CIFAR-10 with different population size

Parameter settings for DARTS search space

The neural cells for CNNs are searched in DARTS search space on CIFAR-10/100 following [7, 17]. The macro skeleton of DARTS search space is shown as Fig. 4. The parameter setting for DARTS search space can be divided into two parts: (1) searching phase and (2) evaluation phase.

During the searching phase, we set the number of initial channels for the first convolutional layer as 16, the number of cells in a normal block \(B\) as 2, and the number of epochs as 200. For training the parameter \(w\), we optimize each architecture via Nesterov momentum SGD with a batch size of 256, set the initial learning rate as \(2.5 \times 10^{ - 2}\), and anneal it down to \(1 \times 10^{ - 3}\) with a cosine annealing scheduler. We set the momentum as 0.9 and the weight decay as \(5 \times 10^{ - 4}\). To optimize the parameter \(A\), we use the Adam optimizer with default settings.

During the evaluation phase, we train the searched network for 600 epochs in total. We set the number of initial channels as 33 and the number of cells in a normal block \(B\) as 6 or 8. We start with a learning rate of \(2.5 \times 10^{ - 2}\) and reduce it to 0 with a cosine scheduler. We set the path drop probability as 0.2 and the weight of the auxiliary tower as 0.4. The other parameters are set the same as in the searching phase (Table 4).

Table 4 Hyperparameter settings for DARTS search space

Compared with NAS-Bench-201 (\(e = 6\), \(\left| O \right| = 5\)), the DARTS search space (\(e = 14\), \(\left| O \right| = 8\)) is larger. We therefore set the number of epochs as 200 to explore the DARTS search space. The other parameters of EGFA-NAS, such as the population size \(n\), the number of groups \(g\), the absorptivity \({\text{abs}}\), the maximum radius \(r_{\max }\), and the minimum radius \(r_{\min }\), are set the same as in “Parameter settings for NAS-Bench-201”.

Experimental results

Overall results in NAS-Bench-201 search space

The experimental results of the optimal network discovered by EGFA-NAS and other competitors in NAS-Bench-201, in terms of classification accuracy and computational cost (GPU days), are presented in Table 5. The symbol “–” means that the corresponding result was not reported. The results of iDARTS [51] and EvNAS [52] are sourced from the original published papers, and the results of the other competitors are extracted from [23]. The results highlighted in bold are those of the theoretically optimal architecture and of the architectures searched by EGFA-NAS.

Table 5 Comparison of EGFA-NAS with the peer competitors in terms of the classification accuracy (%) and the computational cost (GPU days) on CIFAR-10, CIFAR-100, and ImageNet-16-120 datasets

From the results in Table 5, we can observe that EGFA-NAS achieves better performance than the peer competitors DARTS-V1 [19], DARTS-V2 [19], SETN [50], iDARTS [51], GDAS [20], ENAS [18], RSPS [22], and EvNAS [52]. Specifically, in the NAS-Bench-201 search space, EGFA-NAS discovers a network architecture with only 1.29M parameters, which consumes 0.048 GPU days and achieves 93.67% accuracy on CIFAR-10. For the CIFAR-100 dataset, EGFA-NAS achieves 71.29% accuracy with 1.23M parameters and consumes 0.094 GPU days. For ImageNet-16-120, the architecture searched by EGFA-NAS obtains 42.33% accuracy with 1.32M parameters at a cost of 0.236 GPU days. Limited by the small NAS-Bench-201 search space, the performance of the searched network architecture is not comparable with state-of-the-art manually designed CNN networks. However, among the competitors in the NAS-Bench-201 search space, the architecture searched by EGFA-NAS has the smallest gap to the theoretically optimal architecture (0.7% worse on CIFAR-10, 2.22% worse on CIFAR-100, and 4.95% worse on ImageNet-16-120). In addition, the proposed EGFA-NAS has the best efficiency among all selected peer competitors.

Note that the search cost (GPU days) of the competitors listed in Table 5 is extracted from [23]. However, reference [23] does not indicate to which dataset the results belong. The number of parameters (Params) of the peer competitors is obtained by running the code provided by [23] on the CIFAR-10 dataset. The search cost (GPU days) of EGFA-NAS is the computational cost measured for each of the three datasets on a platform with one NVIDIA GeForce RTX 3090 GPU card.

Effectiveness of the relative performance evaluation

Due to the fact that NAS-Bench-201 [23] provides the evaluation information for each candidate architecture, in this section we utilize the API provided by NAS-Bench-201 to obtain the absolute (final) performance evaluation (loss and accuracy) of the searched architectures without retraining, and verify the effectiveness of the evaluation strategy adopted by EGFA-NAS. Figure 9 compares the relative performance evaluation with the absolute performance evaluation, in terms of loss (Fig. 9a) and accuracy (Fig. 9b) on CIFAR-10. In Fig. 9, the label “rel” represents the relative performance, and the label “abs” represents the absolute performance. The relative performance of the searched architectures is obtained on the validation dataset at the current epoch during the architecture search phase. From the results in Fig. 9, we can observe that the relative performance of the searched architectures is not comparable with their absolute performance, because the architectures are not trained sufficiently during the search phase. Nevertheless, Fig. 9 illustrates that the trend of the relative performance is consistent with that of the absolute performance. In addition, we can observe that EGFA-NAS is unstable only for the first several epochs and achieves architectures with stable performance once the number of epochs exceeds 30. These observations verify the effectiveness of the evaluation strategy adopted by EGFA-NAS.

Fig. 9 Comparison of relative evaluation and absolute evaluation of the architecture searched by EGFA-NAS

Effectiveness of weight inheritance strategy

To improve the efficiency of EGFA-NAS and reduce the computational cost, we propose a weight inheritance strategy during the explosion operation, as described in “Explosion operation and weights inheritance”. Specifically, the parameters \(w\) of newly generated dust individuals are inherited from their centers. In this section, we verify the effectiveness of the weight inheritance strategy by replacing it with random generation of the parameter \(w\) on CIFAR-10, keeping all other settings unchanged. To observe the difference between the two strategies more clearly, we set the number of epochs as 300 in this experiment. The estimated (relative) performance of the network architectures searched with weight inheritance and with randomly generated \(w\) is shown in Fig. 10a and c, in terms of accuracy and loss, respectively. The final (absolute) performance of the network architectures searched by the two strategies is shown in Fig. 10b and d, in terms of accuracy and loss, respectively. The results in Fig. 10 show a large difference between the estimated (relative) performance of the two strategies. Although the final (absolute) performance of the architectures searched by the two strategies is similar on CIFAR-10, EGFA-NAS with the proposed weight inheritance reaches the best network architecture earlier than with randomly generated \(w\). In addition, the final performance of the architecture searched with weight inheritance is slightly better (93.67% accuracy) than that obtained by generating the parameter \(w\) randomly (96.36% accuracy).

Fig. 10 Comparison of the performance of EGFA-NAS using the weight inheritance strategy and generating the parameter \(w\) randomly on CIFAR-10

Overall results in DARTS search space

The experimental results of the optimal network discovered by EGFA-NAS in the DARTS search space, in terms of classification accuracy and computational cost (GPU days), are presented in Table 6. The symbol “–” means that the corresponding result was not reported. The symbol “*” means that the result is extracted from [19]. The notation “a/b” in Table 6 means that “a” is the result for CIFAR-10 and “b” is the result for CIFAR-100. The results of most competitors are extracted from the original published papers. B = 6 or 8 represents the number of normal cells in a normal block in the retraining phase. The results highlighted in bold are those of the architectures searched by EGFA-NAS.

Table 6 Comparison of EGFA-NAS with the peer competitors in terms of the classification accuracy (%) and the computational cost (GPU days) on CIFAR-10, CIFAR-100

The results in Table 6 show that EGFA-NAS (B = 8) achieves better performance than most state-of-the-art manually designed CNN networks, including ResNet-101, ResNet + CutOut, SENet, IGCV3, ShuffleNet, VGG, and Wide ResNet, but is slightly worse than DenseNet-BC (by 1.05% on CIFAR-100). Compared with VGG, the performance improvement of the optimal network architecture searched by EGFA-NAS (B = 8) is 13.9% on CIFAR-100 and 3.89% on CIFAR-10.

Compared with the 12 EA-based NAS methods, EGFA-NAS (B = 8) achieves better performance than Hierarchical EA, AmoebaNet-A, CGP-CNN, CNN-GA, AE-CNN, AE-CNN + E2EPP, LargeEvo, GeNet, SI-EvoNet, and MOEA-PS, but slightly worse than LEMONADE (by 0.19%) and NSGA-Net (by 0.02%) on CIFAR-10. EGFA-NAS (B = 8) achieves the best classification accuracy (81.85%) on CIFAR-100 and consumes the least search cost (0.21 GPU days) among all selected EA-based NAS methods.

Compared with the six RL-based NAS methods, EGFA-NAS (B = 8) achieves better performance than NASNet-A, NASNet-A + CutOut, BlockQNN, DPP-Net, MetaQNN, and ENAS, but a little worse than Proxyless NAS (0.86%) on the CIFAR-10. The performance improvement of the optimal network architecture searched by EGFA-NAS (B = 8) is 4.15% on the CIFAR-10, and 8.99% on the CIFAR-100, compared with MetaQNN. The proposed EGFA-NAS (B = 8) has the best efficiency and consumes the least GPU days even compared with the ENAS, which only consumes 0.5 GPU days on the CIFAR-10 in the published paper.

Compared with four GD-based NAS methods and PNAS, EGFA-NAS (B = 8) achieves better performance than DARTS-V1 + CutOut, RC-DARTS, and SNAS, but a little worse than DARTS-V2 + CutOut (0.61%) on the CIFAR-100. Although GD-based NAS methods usually have better efficiency than EA-based and RL-based methods, our proposed EGFA-NAS (B = 8) has the best efficiency compared to all selected GD-based NAS methods.

In addition, EGFA-NAS can obtain better final learning accuracy when a larger number of cells per normal block is used during the retraining phase, but this leads to a larger number of parameters. The overall results in Table 6 show that the proposed EGFA-NAS not only has competitive learning accuracy but also has the best efficiency compared with the four kinds of competitors.

Conclusion

This paper proposes an efficient population-based NAS method based on EGFA, called EGFA-NAS, which can discover an optimal neural architecture with competitive learning accuracy while consuming little computational cost. Specifically, EGFA-NAS relaxes the discrete search space to a continuous one and then utilizes EGFA and gradient descent jointly to optimize the weights of the candidate architectures. The proposed training and weight inheritance strategies reduce the computational cost of EGFA-NAS dramatically. The experimental results in two typical micro search spaces, NAS-Bench-201 and DARTS, demonstrate that EGFA-NAS is able to match or outperform state-of-the-art NAS methods on image classification tasks with a remarkable efficiency improvement. Specifically, when searching on CIFAR-10 on a platform with one NVIDIA GeForce RTX 3090 GPU card, EGFA-NAS obtains the optimal neural architecture in the NAS-Bench-201 search space with 93.67% accuracy while consuming only 0.048 GPU days, and discovers the optimal neural architecture in the DARTS search space with 97.23% accuracy at a cost of 0.21 GPU days.

Although EGFA-NAS is promising for designing high-performance neural networks automatically, it still has one limitation. Similar to the other NAS methods using the low-fidelity evaluation strategy, the relative evaluation adopted in EGFA-NAS during the search phase may lead to missing some promising architectures. In future work, we will attempt to design a better evaluation strategy with better rank consistency for lightweight NAS.