A Data-driven Approach to Neural Architecture Search Initialization

Algorithmic design in neural architecture search (NAS) has received a lot of attention, aiming to improve performance and reduce computational cost. Despite the great advances made, few authors have proposed to tailor initialization techniques for NAS. However, the literature shows that a good initial set of solutions facilitates finding the optima. Therefore, in this study, we propose a data-driven technique to initialize a population-based NAS algorithm. In particular, we propose a two-step methodology. First, we perform a calibrated clustering analysis of the search space, and second, we extract the centroids and use them to initialize a NAS algorithm. We benchmark our proposed approach against random and Latin hypercube sampling initialization using three population-based algorithms, namely a genetic algorithm, an evolutionary algorithm, and aging evolution, on CIFAR-10. More specifically, we use NAS-Bench-101 to leverage the availability of NAS benchmarks. The results show that, compared to random and Latin hypercube sampling, the proposed initialization technique achieves significant long-term improvements for two of the search baselines, in some cases across several search scenarios (i.e., training budgets). Moreover, we analyze the distributions of the solutions obtained and find that the population provided by the data-driven initialization technique enables retrieving local optima (maxima) of high fitness and similar configurations.


Introduction
Deep learning has successfully been applied to a wide variety of problems, showing (in many cases) superhuman performance [1,2]. However, this success has been accompanied by an increasing complexity of deep learning models, which in most cases are manually designed [3]. State-of-the-art deep neural networks (DNNs) may have several million parameters, thus automating the design of DNNs is the logical next step.
Neural architecture search (NAS) is the process of automating architecture engineering [3,4]. Great advances have been made in this matter, but in practice NAS has not been adopted yet [3,4]. From an optimization point of view, NAS is a challenging task. It requires dealing with huge search spaces (the deeper the model, the bigger the search space), mixed-type solutions (i.e., a combination of integer, discrete, and real values to represent the model), and solutions that are expensive to evaluate (i.e., training a DNN on a large data set may take several days).
Researchers are working to alleviate NAS challenges. Several (optimization) approaches have been tailored to design neural networks [5][6][7] and to speed up model evaluation [8,9], and considerable effort has been put into designing NAS search spaces [5,6]. Recently, some authors have released performance evaluation databases [10,11], aiming to democratize NAS (i.e., everyone can test a NAS algorithm regardless of having powerful computational resources) and to improve reproducibility [3].
Despite the great advances made so far, there remain open questions. For example, many NAS approaches can be categorized as population-based algorithms [3,4]. However, little attention has been drawn to the initialization of the population. Considering all the available resources, including NAS databases, we propose to address the following question: Can we improve the performance of a population-based NAS algorithm by initializing its population with a data-driven approach? To address this problem, we propose a two-step approach. First, a tailored clustering analysis of a target search space is performed. Second, after obtaining satisfying quantitative clustering results, the centroids are extracted and used to initialize a population-based NAS algorithm.
To validate our proposal, we selected three population-based NAS algorithms: an evolutionary algorithm (EA) [12], a genetic algorithm (GA) [13], and aging evolution (AE) [6]. We benchmarked our proposed initialization technique on NAS-Bench-101 [10] against the most popular initialization methods (random initialization and Latin hypercube sampling). The results show that centroids extracted using a Bayesian Gaussian mixture model (BGM) for clustering are a promising approach to initialize the population. In particular, our approach used with GA shows significant long-term improvements (after 2000 iterations, in test) over random initialization and Latin hypercube sampling. When used with EA, a faster convergence (in validation) and a significant long-term improvement over random initialization and Latin hypercube sampling are observed. Additional investigations on the distributions of the solutions found by the algorithms suggest that centroids enable retrieving local optima (maxima) of high fitness and similar configurations.
The remainder of this article is organized as follows: The following section introduces NAS and briefly summarizes the state of the art of initialization techniques. Section 3 describes the proposed methodology. Section 4 introduces the experimental setup. Section 5 presents the results. Finally, Section 6 outlines the conclusions and proposes future work.

Related Work
In this section, we summarize some of the most relevant works related to our proposal. First, we introduce NAS and highlight the state of the art, with a special emphasis on metaheuristic approaches. Second, we outline the population initialization problem.

Neural Architecture Search
NAS is the process of automating architecture engineering [3]. Currently, it is considered to be a subfield of AutoML [14]. However, its roots can be traced to the late 1980s, when the use of evolutionary computation was explored to design and train neural networks [15][16][17][18][19]. These ideas were gathered under the neuroevolution concept, which gained popularity in the 2000s thanks to the NeuroEvolution of Augmenting Topologies (NEAT) method [20], a genetic algorithm (GA) that evolves increasingly complex neural network topologies and weights. Later, with the advent of deep learning, neuroevolution research started to attract attention again [3,4].
From the optimization (algorithm) point of view, many approaches can be found in the neuroevolution literature, ranging from evolutionary algorithms (EA) [21], GA [22], harmony search [23], and a mixed-integer parallel efficient global optimization technique [24], to Bayesian optimization [7]. The literature is equally varied from the point of view of the neural network architecture, covering, e.g., recurrent neural networks [25], convolutional neural networks [26], and generative adversarial networks [27].
On the other hand, in the past few years, a new branch of NAS approaches has emerged based on continuous optimization (e.g., DARTS [5]). In particular, these approaches search over a large graph of overlapping configurations (i.e., the Super-Net) using a gradient-based approach. These recent improvements result in large speedups in terms of search time [3], but also, in some cases, in a lack of robustness and interpretability [28].
Regardless of the NAS approach used, the literature stresses the importance of optimizing the architecture of a deep neural network (given a particular problem). The main challenges of NAS are three-fold. First, the number of parameters increases in proportion to the number of layers, thus the search space is huge. Second, the search space is (usually) a mix of categorical (e.g., the type of operation, the activation functions, ...), real (e.g., the weights), and integer (e.g., the number of hidden layers, the number of neurons per layer, ...) or discrete (e.g., the adjacency matrix) values, resulting in a complicated problem, i.e., each parameter type requires a different optimization approach. Third, the evaluation of an architecture is extremely resource- and time-consuming. Therefore, NAS problems fall into the family of expensive optimization problems.
To cope with the latter problem, i.e., the evaluation cost, and aiming to improve reproducibility, much effort has been made to provide open-source benchmarks for NAS. Several areas of applied machine learning have been included in these benchmarks, including computer vision [11][29][30][31] and natural language processing [32], among others [3]. Also, some authors have explored techniques to speed up the performance evaluation [8,9]. The mixed search space problem has been addressed from multiple perspectives, ranging from tailored encodings [33,34] and specific operations [25] to mixed (hybrid) approaches [35].
Finally, to tackle the problems that arise due to the size of the search space (first challenge), several authors have invested time in tailoring the design of the search space [5,6], providing tools to assess its quality [36,37], and proposing techniques to adapt the search space [7,38], among others. Despite all the advances made in this regard, the initialization of NAS algorithms (especially the population-based ones) has not received much attention. However, it is important to remark that starting from a set of good solutions is key to solving large-scale optimization problems using a population of finite size [39].

Population Initialization Techniques
All population-based metaheuristic algorithms share a common step: population initialization. The goal of this initialization is to provide a first set of solutions that (normally) will be improved iteratively until the termination criterion is met. The quality of the initial population facilitates (or hinders) finding the optima [39][40][41], and this effect is more pronounced for large-scale optimization problems using a population of finite size [39], which is the most common case, especially for expensive-to-evaluate problems (like NAS). Therefore, the larger the search space (given a limited population size), the smaller the chance to cover promising regions of the search space [42].
In the past few decades, some authors have started to propose initialization techniques aiming to boost the performance of population-based metaheuristic algorithms (mainly EA) [43,44]. Great advances have been made. For example, [45] shows that initialization can increase the probability of finding global optima, another study shows that stability can be improved [46], and [47] shows that the solution quality is related to the initialization, among many others [43,44].
However, in black-box optimization problems, such as NAS, it is not possible to determine beforehand what is a good or a bad solution. Therefore, not all initialization techniques are suitable for NAS. Moreover, few practical rules of thumb are provided in the literature to choose an appropriate initialization technique. Thus, from a practitioner's perspective, it is unclear how to choose the right initialization technique [43].

A data-driven approach to initializing a NAS search strategy
This section introduces the proposed cluster analysis approach for enhancing the performance of a NAS algorithm. First, we describe the overall pipeline of the methodology. Second, we detail the feature engineering essential to the analysis.

Pipeline
This study aims to leverage knowledge about a search space to help improve the performance of a search strategy. In particular, it sets out to answer the following question: can we improve the convergence of a population-based NAS algorithm by initializing it with a data-driven approach?
Fig. 1 The pipeline of the proposed data-driven approach to NAS initialization.
To tackle this problem, we propose an approach consisting of two steps, depicted in Figure 1. First, we perform a performance-based clustering analysis of the search space. Given the search space and a machine learning task, we sample a set of N architectures. Each architecture is trained, evaluated, and encoded using the procedures described in Section 3.2. As the feature vector consists of an architecture representation and its test performance, the resulting clusters should relate to specific behaviors (performances) on the learned task. Moreover, since processing high-dimensional and sparse data can be difficult, we propose to facilitate the clustering by reducing the dimension of the input features. The clustering analysis is thus itself a sequence of sub-steps: we reduce the dimension of the samples, perform the clustering, and assess its results qualitatively and quantitatively. In addition, calibrating the number of clusters plays an important role in the results retrieved. Thus, we seek to identify values of this hyperparameter that provide satisfying results.
In the second step of the proposed methodology, we extract the centroids obtained in the clustering, use them to initialize a population-based algorithm, and assess its performance.
To summarize, this approach to search initialization is:
• Composite: It is a multistep initialization procedure relying on sampling a search space, clustering it, and initializing an algorithm with the centroids extracted.
• Generic: It is not application-specific. In fact, the clustering could be done on any type of search space given an encoding that includes a solution representation and its fitness evaluation.
• Deterministic or stochastic: The stochasticity of the procedure depends on the stochasticity of the tool selected for clustering.
Also, it is important to note that the clusters may be analyzed to obtain insights regarding the archetypes (i.e., the representative architectures), including the most frequent operations and the connection between the operations (i.e., the edges in the graph).
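The two steps of the pipeline can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name centroid_initial_population is hypothetical, and the way a centroid (which lives in the reduced 2D space) is mapped back to a concrete architecture, here the nearest sampled architecture, is an assumption that the text does not specify.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import BayesianGaussianMixture


def centroid_initial_population(features, n_clusters=19, seed=0):
    """Return the index of one sampled architecture per mixture component.

    `features` is an (N, d) array whose rows concatenate an encoded
    architecture and its test performance (hypothetical input format).
    """
    # Step 1a: reduce the sparse, high-dimensional features to 2 components.
    svd = TruncatedSVD(n_components=2, random_state=seed)
    reduced = svd.fit_transform(features)

    # Step 1b: cluster the reduced samples with a Bayesian Gaussian mixture.
    bgm = BayesianGaussianMixture(n_components=n_clusters, random_state=seed)
    bgm.fit(reduced)

    # Step 2 (assumption of this sketch): map every component mean back to
    # the closest sampled architecture in the reduced space.
    diffs = reduced[:, None, :] - bgm.means_[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)  # shape (N, n_clusters)
    return distances.argmin(axis=0)


# Toy usage on random features; real features would come from the benchmark.
toy_features = np.random.default_rng(0).random((200, 20))
initial_indices = centroid_initial_population(toy_features, n_clusters=5)
```

Note that two components may map to the same architecture, so a practical version would likely deduplicate the returned indices before building the population.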

Feature representation
To take full advantage of the information about the search space when clustering, we first introduce minimal feature engineering.
As we look to uncover models and structures relevant to NAS algorithms via clustering, we seek a feature representation encoding an architecture as well as its performance. As in [10], we consider neural architectures identified by an elementary component repeated in blocks, a feed-forward cell. This cell is a directed acyclic graph (DAG), with a maximum number of operations (nodes), a maximum number of transformations (edges), and a fixed set of possible operations (e.g., max pool, convolution 3x3) labeling each node. In practice, a cell is represented by a list of selected operations and an adjacency matrix of variable size.
Therefore, we construct two versions of the clustering feature representation, both in the form of vectors. The first one (Original, Short Encoding) consists of concatenating, for each model, its adjacency matrix, the list of operations, and the list of test performances for all available training durations {t_0, t_1, t_2, t_3}. Note that this is a variable-length feature representation due to the nature of the adjacency matrix.
Alternatively, the second representation (Binary, Long Encoding) corresponds to the expanded adjacency matrix, i.e., the matrix that considers all possible operations (according to the constraints of the search space). This is a fixed-length encoding. Moreover, for both encodings, the vector form of the adjacency matrix is obtained by flattening it in row-major fashion (C-style).
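As an illustration of the Short encoding, the sketch below concatenates a row-major flattening of the adjacency matrix, the operation list, and the test performances. The function name short_encoding and the integer codes in OP_CODES are hypothetical; the text does not state how operations are turned into numbers, nor whether the input and output nodes are part of the operation list.

```python
# Hypothetical integer codes for the node labels (not given in the text).
OP_CODES = {"input": 0, "conv3x3": 1, "conv1x1": 2, "maxpool3x3": 3, "output": 4}


def short_encoding(adjacency, operations, test_accuracies):
    """Concatenate flattened adjacency matrix, operation codes, and accuracies."""
    # Row-major (C-style) flattening: rows are appended one after the other.
    flat_adjacency = [entry for row in adjacency for entry in row]
    op_codes = [OP_CODES[op] for op in operations]
    return flat_adjacency + op_codes + list(test_accuracies)


# A toy 3-node cell: input -> conv3x3 -> output, plus a skip connection.
toy_adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
toy_operations = ["input", "conv3x3", "output"]
toy_accuracies = [0.5, 0.7, 0.8, 0.9]  # placeholder values for t_0..t_3

toy_vector = short_encoding(toy_adjacency, toy_operations, toy_accuracies)
```

Because the adjacency matrix of a cell with n nodes has n × n entries, the length of the resulting vector varies with n, which is the variable-length property noted above.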

Experimental Setup
The experiments performed aim to validate that the initialization of a population-based NAS algorithm can benefit from models identified via clustering analysis of a search space. In this section, first, we introduce the problem used to validate our proposal. Second, we present the parameters used for performing the experiments on clustering. Third, we detail the performance metrics used to assess the quality of the clustering. Fourth, we describe the three (3) baseline algorithms that we aim to initialize.

NAS-Bench-101
NAS-Bench-101 is a database of neural network architectures and their performance evaluated on the CIFAR-10 data set. It contains N = 450K unique architectures [10]. To tackle the given machine learning task of CIFAR-10, all models use a classical image classification structure similar to ResNet. Specifically, the backbone of a model contains a head, a body, and a tail. Its body is made by alternating three (3) times a block with a down-sampling module. Each block is obtained by repeating three (3) times a module called 'cell'. A cell is a computational unit that can be represented by a DAG. It consists of an input node, an output node, and intermediate nodes representing operations (convolution 3x3, convolution 1x1, maxpool 3x3), with connections indicating features being transformed. Therefore, architectures differ only by their cell. In practice, the DAG of a model is encoded by an adjacency matrix and a list of operations labelling the associated nodes. The constraints on such a DAG are the following: there can be at most 7 nodes and 9 edges in a cell.
Moreover, all models were trained for 108 epochs using the same experimental setting (i.e., learning rate, etc.), but performance evaluations in training, validation and test were also provided after 4, 12 and 36 epochs.
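The cell constraints can be checked with a few lines of code. The sketch below is illustrative only: the helper is_valid_cell is hypothetical, and it assumes that nodes are listed in topological order so that a DAG corresponds to a strictly upper-triangular 0/1 adjacency matrix.

```python
MAX_NODES = 7  # at most 7 nodes per cell
MAX_EDGES = 9  # at most 9 edges per cell


def is_valid_cell(adjacency, operations):
    """Check the size, acyclicity, and edge-count constraints of a cell."""
    n = len(adjacency)
    if n > MAX_NODES or n != len(operations):
        return False
    if any(len(row) != n for row in adjacency):
        return False
    # Assumption: nodes are topologically ordered, so any entry on or below
    # the diagonal would be a self-loop or a backward edge.
    if any(adjacency[i][j] for i in range(n) for j in range(i + 1)):
        return False
    return sum(sum(row) for row in adjacency) <= MAX_EDGES


small_cell = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
small_ops = ["input", "conv3x3", "output"]

# A fully connected 7-node DAG has 21 edges and violates the edge limit.
dense_cell = [[1 if j > i else 0 for j in range(7)] for i in range(7)]
dense_ops = ["input"] + ["conv3x3"] * 5 + ["output"]
```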

Hyperparameters for Clustering
The clustering experiments were done with a set of randomly sampled models. The size of the sample is identified in Section 5.2. The considered clustering algorithms are k-means, DBSCAN, BIRCH, spectral clustering, and a BGM. All were obtained from version 0.24.1 of the scikit-learn library [51], the latest at the time of writing. Table 1 shows the hyperparameters selected for each method, including the maximum number of iterations (Max iter), the number of samples used at initialization (N init), and other algorithm-specific parameters (Other). Note that they are either default (NA) or slightly modified to provide satisfying clustering performance.

Clustering performance evaluation
We use various ways of assessing the quality of the results for each step of the approach. Regarding step one (1), we propose to measure the clustering performance using the following three (3) standard metrics: the Silhouette coefficient [52], the Calinski-Harabasz index [53], and the Davies-Bouldin index [54]. These metrics indicate how well separated and dense the resulting clusters are. They all apply in the context of clustering with missing labels, which is relevant, as we seek to investigate relevant clusters and features for NAS algorithms without prior assumptions. The Silhouette coefficient is bounded between -1 and +1, with higher values associated with denser and better separated clusters. The Calinski-Harabasz index also assigns higher values to better defined clusters. In contrast, the Davies-Bouldin index measures a similarity between clusters, and thus provides smaller values for better clusterings. Additionally, we propose to corroborate the quantitative assessment with a qualitative analysis (visual inspection) for validation before step two.
Regarding step two (2), we assess the quality of the centroid-based initialization using the performance of the algorithm (best obtained accuracy in test). More importantly, we compare the performance against initializing the same algorithm with random and Latin hypercube sampling (LHS) initialization. Also, as a sanity check, we compare the results against random search.
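The three clustering metrics of step one are available in scikit-learn. The sketch below computes them on synthetic, well-separated 2D blobs; the data, the helper name clustering_scores, and the use of k-means here are placeholders for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def clustering_scores(points, labels):
    """Return the three label-free clustering metrics used in step one."""
    return {
        "silhouette": silhouette_score(points, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(points, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(points, labels),        # lower is better
    }


# Synthetic data: two tight blobs far apart (not benchmark data).
rng = np.random.default_rng(0)
points = np.vstack(
    [rng.normal(loc=center, scale=0.1, size=(50, 2)) for center in (0.0, 5.0)]
)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
scores = clustering_scores(points, labels)
```

Running the same function for several candidate numbers of clusters gives the kind of curves used to calibrate that hyperparameter.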

Baseline Algorithms
To benchmark the performance of the different initialization techniques, we propose to use three (3) population-based algorithms. In particular, we implemented a Genetic Algorithm (GA), an Evolutionary Algorithm (EA), and Aging Evolution (AE) [6], a popular EA-based algorithm specifically designed for NAS on computer vision (CV) problems.

Genetic Algorithm
A GA is a population-based metaheuristic algorithm inspired by natural evolution [13]. At a glance, a population of individuals (a.k.a. solutions) is evolved using selection, crossover, mutation, and replacement operations. In particular, we used the implementation available in version 1.3.1 of the DEAP library [55], the latest at the time of writing. Algorithm 1 presents a high-level view of the implemented GA.
In particular, an individual encodes a neural network architecture (in the given search space) by a mix of binary entries, which represent the adjacency matrix, and categorical entries, which represent the operations. An initial population of size pop_size is initialized by the function Initialize(•). We define three variations to initialize the population: random initialization, LHS, and our proposed method (Section 3). An individual is evaluated by the function Evaluate(D_1, D_2): the decoded architecture is trained using SGD on the D_1 data set and evaluated on the D_2 data set, and the resulting accuracy is its fitness. Then, the best solution (i.e., the one with the highest accuracy) of the population is selected by the Best(•) function.
Then, the evolution takes place. First, an offspring of size pop_size is created. More specifically, each offspring individual is created by a single-point crossover operation SinglePointCrossover(P_1, P_2, cx_p) with probability cx_p, where each parent P_i is selected using a binary tournament operation BinaryTournament(•). Note that with probability 1 - cx_p one of the parents P_i is returned unmodified. Next, each offspring individual is mutated with probability mut_p by the function Mutate(•). If mutated, each position is mutated using bit-flip (for the binary entries) or round-robin (for the categorical values) with probability mut_i. The offspring is evaluated using Evaluate(•), and the current population is replaced by the offspring. Finally, the best solution is updated, i.e., if the fitness of the best individual in the population is higher than that of the current best solution, then the best individual becomes the best solution.
Once max_evaluations is reached, the best solution is evaluated using the test data set.
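The generational loop described above can be sketched in plain Python. This is a simplified illustration, not the DEAP-based implementation used in the study: individuals are purely binary (the round-robin mutation of categorical entries is omitted), the fitness function is a stand-in passed as evaluate, and the default values of cx_p, mut_p, and mut_i are arbitrary.

```python
import random


def genetic_algorithm(initial_population, evaluate, max_evaluations,
                      cx_p=0.5, mut_p=0.2, mut_i=0.1, seed=0):
    """Generational GA on bit lists; returns (best_individual, best_fitness)."""
    rng = random.Random(seed)
    population = [list(ind) for ind in initial_population]
    fitness = [evaluate(ind) for ind in population]
    evaluations = len(population)
    best_fit, best = max(zip(fitness, population), key=lambda pair: pair[0])

    def binary_tournament():
        # Pick two individuals at random and keep the fitter one.
        a, b = rng.randrange(len(population)), rng.randrange(len(population))
        return population[a] if fitness[a] >= fitness[b] else population[b]

    while evaluations + len(population) <= max_evaluations:
        offspring = []
        for _ in range(len(population)):
            p1, p2 = binary_tournament(), binary_tournament()
            if rng.random() < cx_p:
                cut = rng.randrange(1, len(p1))  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = list(p1)  # a parent is returned unmodified
            if rng.random() < mut_p:
                # Bit-flip each position independently with probability mut_i.
                child = [1 - g if rng.random() < mut_i else g for g in child]
            offspring.append(child)
        # Generational replacement: the offspring becomes the population.
        population = offspring
        fitness = [evaluate(ind) for ind in population]
        evaluations += len(population)
        gen_fit, gen_best = max(zip(fitness, population), key=lambda pair: pair[0])
        if gen_fit > best_fit:
            best_fit, best = gen_fit, gen_best
    return best, best_fit


# Toy usage: maximize the number of ones, starting from all-zero individuals.
start = [[0] * 10 for _ in range(6)]
ga_best, ga_fitness = genetic_algorithm(start, sum, max_evaluations=600)
```

Swapping start for a list of centroid-derived individuals is all that changes between initialization techniques in such a loop.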

(µ + λ) Evolutionary Algorithm
The (µ + λ)EA [12], a generic population-based metaheuristic algorithm, evolves a population of µ individuals by creating λ offspring. Then, both the original population and the offspring are combined, and the best µ individuals replace the population. Algorithm 2 outlines the (µ + λ)EA. The population (refer to Section 4.4.1) of size µ is initialized using the Initialize(•) function. Then, the population is evaluated using the Evaluate(•) function (refer to Section 4.4.1). Then, the evolutionary process takes place. First, an offspring of size λ is generated by randomly sampling (with uniform probability) individuals from the population. Next, the offspring is mutated using the Mutate(•) function (refer to Section 4.4.1), and the offspring is evaluated. In the last evolutionary step, the population and the offspring are combined and ranked according to their fitness, and the top µ individuals are selected by the RankSelection(•) function to replace the current population.
Once the number of evaluations is greater than max_evaluations, the best individual of the population is selected by Best(•) as the solution. Finally, the solution is evaluated on the test data set.
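A compact sketch of the (µ + λ) scheme is given below. It is an illustration under assumptions: individuals are bit lists, bit_flip is a hypothetical stand-in for Mutate(•), the fitness function is supplied by the caller, and the loop stops just before the evaluation budget would be exceeded (a small simplification of the stopping rule).

```python
import random


def bit_flip(individual, rng, rate=0.1):
    """Stand-in mutation: flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in individual]


def mu_plus_lambda(initial_population, evaluate, mutate, lam,
                   max_evaluations, seed=0):
    """(mu + lambda) EA; returns (best_fitness, best_individual)."""
    rng = random.Random(seed)
    scored = [(evaluate(ind), list(ind)) for ind in initial_population]
    mu = len(scored)
    evaluations = mu
    while evaluations + lam <= max_evaluations:
        # Sample lambda parents uniformly at random, then mutate them.
        parents = [rng.choice(scored)[1] for _ in range(lam)]
        offspring = [mutate(parent, rng) for parent in parents]
        scored = scored + [(evaluate(child), child) for child in offspring]
        evaluations += lam
        # Rank the combined pool and keep the top mu individuals.
        scored = sorted(scored, key=lambda pair: pair[0], reverse=True)[:mu]
    return max(scored, key=lambda pair: pair[0])


# Toy usage: maximize the number of ones with mu = lambda = 4.
start = [[0] * 10 for _ in range(4)]
ea_fitness, ea_best = mu_plus_lambda(start, sum, bit_flip, lam=4,
                                     max_evaluations=400)
```

Because survivors are chosen from parents and offspring together, the best fitness in the population never decreases, which is consistent with a good initial population being retained rather than lost.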

Aging Evolution
A few years ago, Aging Evolution (AE) [6], an EA-based approach to NAS, became popular because it achieved state-of-the-art performance on classical CV benchmarks. Algorithm 3 outlines AE. Notice that the nomenclature does not exactly match the one proposed in [6]. Instead, the algorithm presents a version that is closer to Algorithms 1 and 2. Also, we used the implementation available in the NAS-Bench-101 repository. AE is a steady-state EA, where the oldest individual of the population is replaced by the offspring. In particular, the population of size pop_size is initialized using the function Initialize(•) and evaluated using the function Evaluate(•) (we are reusing the function defined above in this section), and the best solution is selected from the population by the function Best(•).
Then, the evolution begins. First, an individual (i.e., the offspring) is selected using a k-tournament selection. Second, the offspring is mutated by a two-step process Mutate(•): (i) a hidden state mutation, where the connections between operations in a graph-represented solution (cell) are modified, and (ii) an operation mutation, where the operation within the cell is modified. Then, the offspring is evaluated, and if its performance is higher than that of the best solution seen so far, the solution is replaced by the offspring using the function Best(•). In the last step of the evolution, the oldest individual of the population (i.e., the earliest evaluated one in the population) is replaced by the offspring using the Enqueue(•) (add the new one) and Dequeue(•) (remove the oldest one) functions. The authors of AE claim that there is a parallel between the introduced age-based removal and a regularization of the evolution.
The evolution continues until the number of evaluated candidate solutions exceeds the predefined budget max_evaluations. Finally, the best solution of the population is returned.
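The age-based replacement can be sketched with a double-ended queue. This is an illustrative simplification and not the NAS-Bench-101 repository implementation: bit_flip stands in for the two-step hidden-state and operation mutation, tournament contestants are drawn without replacement (an assumption), and the fitness function is supplied by the caller.

```python
import random
from collections import deque


def bit_flip(individual, rng, rate=0.1):
    """Stand-in mutation: flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in individual]


def aging_evolution(initial_population, evaluate, mutate, tournament_size,
                    max_evaluations, seed=0):
    """Steady-state EA with age-based removal; returns (best_fitness, best)."""
    rng = random.Random(seed)
    # The queue is ordered by age: the oldest individual sits on the left.
    population = deque((evaluate(ind), list(ind)) for ind in initial_population)
    evaluations = len(population)
    best = max(population, key=lambda pair: pair[0])
    while evaluations < max_evaluations:
        # k-tournament: the fittest of a random subset becomes the parent.
        contestants = rng.sample(list(population), tournament_size)
        parent = max(contestants, key=lambda pair: pair[0])[1]
        child = mutate(parent, rng)
        scored_child = (evaluate(child), child)
        evaluations += 1
        if scored_child[0] > best[0]:
            best = scored_child
        population.append(scored_child)  # Enqueue the newest individual
        population.popleft()             # Dequeue the oldest individual
    return best


# Toy usage: maximize the number of ones with a population of 5.
start = [[0] * 10 for _ in range(5)]
ae_fitness, ae_best = aging_evolution(start, sum, bit_flip,
                                      tournament_size=3, max_evaluations=200)
```

Since removal depends on age rather than fitness, every member of the initial population is discarded after pop_size steps, regardless of how good it is.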

Results
In this section, we present results on clustering for accelerating NAS algorithms. First, we show results on selecting the proper dimension reduction tool and hyperparameters for the clustering. Then, we show results on identifying the number of clusters providing satisfying clustering performance. We also present results on qualitatively assessing the quality of the clusters for various algorithms. Then, we provide results on improving NAS performance using a centroid-based initialization for three (3) NAS evolutionary algorithms. Last but not least, we show results of a quantitative assessment of the solutions found in the benchmarking, in the form of matrices of cell occurrence.

Dimension Reduction
To begin our experimental study, we seek to calibrate the dimension reduction of the input features.To do so, we arbitrarily fix the number of samples to N = 10000, in order to perform relevant experiments.
Figure 2 shows clustering performance as a function of the number of components of the input features. The blue and red curves display performance using the Short (Original) and the Long (Binary) encoding, respectively. The dimension reduction is performed using PCA and the clustering using k-means. Using the Short encoding, the three metrics favor a small number of components for the input features via PCA. Indeed, the smaller the number of components, the higher the Silhouette and Calinski-Harabasz scores, and the lower the Davies-Bouldin index, with optimal values when using two (2) components. The same is observed when using the Long encoding.
Using the Long encoding yields slightly better performance than the Short encoding, with a noticeable improvement for larger numbers of PCA components.
As both encodings are rather sparse (length of up to 58, or up to 298), we study the effect of this sparsity on the dimension reduction tool. Figure 3 shows clustering performance as a function of the number of components of the input features, for various reduction tools. The blue and red curves display performance using PCA and Truncated SVD as dimension reduction tools, respectively. Plots (a) and (b) display results using the Short and the Long encoding, respectively. The clustering is performed with k-means.
Trying an alternative dimension reduction tool (Truncated SVD) that is more suitable for highly sparse data does not worsen results on the Short encoding (see Figure 3a). Moreover, it allows for a slight improvement over PCA when using the Long encoding (see Figure 3b). To summarize, the findings show that reducing the dimensions of the input features to 2D provides the best performance on both encodings. Using the Long encoding improves the results. Also, using Truncated SVD shows slight improvements, as it is more suitable for sparse data. Given these findings, the following experiments are performed using Truncated SVD for a 2D reduction of the input.

Number of Samples
Having identified a suitable dimension reduction tool (Truncated SVD) and number of components to reduce to (two), we now seek to find a satisfying number of samples for the clustering.
Figure 4 shows clustering performance as a function of the number of clusters, for various sample sizes. All results were obtained when clustering with the Short encoding for feature representation. The blue, orange, green, red, purple, and brown curves correspond to 100, 500, 1000, 2500, 5000, and 10000 samples, respectively. This range of values enables us to consider small to intermediately large complexity for our proposal. Overall, all performance metrics point towards the use of a large number of clusters. Indeed, the higher the number of clusters, the higher the Silhouette and Calinski-Harabasz scores, and the lower the Davies-Bouldin index. These trends are observed for all sample sizes, with similar values for the Silhouette score and the Davies-Bouldin index. Only the Calinski-Harabasz index discriminates in favor of increasingly larger sample sizes.
Given these findings, we identify N = 10000 (the largest tested value) as the sample size to use for optimal clustering in future experiments.

Number of Clusters
Next, we look more closely into the number of clusters to use when clustering with various feature representations.
Figure 5 shows clustering performance as a function of the number of clusters. The blue and red curves display the performance results using the Short and the Long encoding, respectively. All input features were reduced to two (2) components using Truncated SVD, and clustering is performed with k-means. With both encodings, all performance metrics point towards the use of a large number of clusters. The higher the number of clusters, the higher the Silhouette and Calinski-Harabasz scores, and the lower the Davies-Bouldin index. Additionally, intermediate values of around twenty (20) clusters for the Original encoding and twenty-seven (27) clusters for the Binary encoding already seem to reach satisfying performance.
Therefore, results suggest using an intermediate (20 to 30) to large number of clusters for improving the k-means clustering performance, with a preference for the Short encoding.

Qualitative Cluster Analysis
As an additional way to validate the clustering results, we seek to visualize the clusters and compare them to the natural layout of the reduced data. When using the Short encoding (Figure 6a), the clusters seem to have a natural horizontal to diagonal (45 degree) layout. This layout is not well captured by the evaluated algorithms. The BGM seems to provide the most satisfying results, despite little calibration.
When using the Long encoding (Figure 6b), clusters naturally lay out in well-separated vertical columns. This is also best captured by BGM.
Overall, results suggest using BGM for robust clustering on both feature representations.

Initialization Benchmark
In order to assess the quality of the centroids extracted, we use them for initializing the baseline algorithms GA, (µ + λ)EA, and Aging Evolution. Figure 7 shows performance in validation for the three search baselines. Red stands for the random sampling initialization (rand), blue for LHS, and green for centroids (i.e., our approach). From left to right, the GA, EA, and Aging Evolution results are plotted. The top row corresponds to 36 epochs of training, and the bottom one to 108 epochs. In all cases, each algorithm is executed 100 independent times. Each plot shows the mean fitness of the current population (bold), complemented with the range of fitness (min/max). The centroids are obtained with the Short encoding and BGM, following the previous results. The population size is set to 19 (i.e., the number of centroids) in all cases.
For all three baselines, we observe that the centroid-based initialization provides the highest initial mean population fitness. On the other hand, both LHS and random sampling-based initialization provide a very low initial mean fitness (up to 20 percentage points of difference). The EA takes the best advantage of this improved initial population: it converges faster and has long-term improvements over an initialization with random sampling or LHS. Both GA and Aging Evolution fail to benefit from such improvements, as their mean population fitness plummets after a few iterations and reaches values similar to those of the alternative initialization techniques (rand or LHS). This is observed when searching after either 36 or 108 epochs of training.
Figure 8 summarizes the benchmark provided in Figure 7. In particular, it provides boxplots of performance in test for the best solutions found (100 runs) after 2000 search evaluations. It also complements the three baseline algorithms, i.e., GA, EA, and Aging Evolution, with random search (RS). The plot on the left corresponds to 36 epochs of training (i.e., the evaluation of the solutions), and the right one to 108 epochs.
Overall, performances in test after deployment (2000 evaluations) are similar to those in validation. Indeed, the ranking is preserved: the EA reaches the highest mean fitness for all initialization settings. EA and GA have very narrow fitness distributions, while Aging Evolution has a wider one. All the baselines improve over random search. For GA and EA, centroids help reach a higher mean test fitness than the other initialization techniques.
To complement these results, we performed a Wilcoxon rank-sum test. For GA, when selecting after 36 epochs of training, the p-value for the centroid-based initialization versus random sampling is 4.093 × 10^-7. Versus LHS, it is 2.324 × 10^-7. When selecting after 108 epochs, it is 0.69 versus random sampling, and 0.039 versus LHS. For EA, when selecting after 36 epochs of training, the p-value for the centroid-based initialization versus random sampling is 5.611 × 10^-8. Versus LHS, it is 3.767 × 10^-6. When selecting after 108 epochs, it is 0.006 versus random sampling, and 0.006 versus LHS. Thus, the centroid-based initialization significantly improves over LHS and random sampling for both EA and GA when selecting after 36 epochs of training. For EA, it improves significantly over LHS and random sampling for all training budgets.
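To make the statistical comparison concrete, the following is a minimal sketch of a two-sided Wilcoxon rank-sum test using the normal approximation without tie correction. It is not the code used in this study: the function names `average_ranks` and `rank_sum_test` are hypothetical, and the two samples are synthetic stand-ins for the 100 test fitness values obtained per initialization technique.

```python
import random
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns the z statistic and the p-value."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])  # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2
    std_w = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - mean_w) / std_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Synthetic fitness values (assumption: NOT the results reported above).
rng = random.Random(0)
centroid_runs = [rng.gauss(0.935, 0.004) for _ in range(100)]
random_runs = [rng.gauss(0.930, 0.004) for _ in range(100)]
z, p = rank_sum_test(centroid_runs, random_runs)
print(f"z = {z:.3f}, p = {p:.3g}")
```

A small p-value indicates that one sample tends to rank higher than the other; the sign of z tells which one (positive when the first sample ranks higher).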
Figure 9 depicts the results for the Long encoding benchmark. In all cases, the population size is 13 (µ = λ = 13). Accordingly, the number of evaluations is set to 1989 (153 generations of 13 evaluations each).
Similarly, centroids obtained with the Long encoding give all baseline algorithms an improved initial mean population fitness. Again, EA is the best at taking advantage of this initialization (centroids), with improved convergence up until 500 to 1000 evaluations.
Figure 10 summarizes the benchmark provided in Figure 9 with performances in test, in the same fashion as Figure 8.
Results of performance in test after deployment are similar to those obtained when using the Short encoding to find the initial population. However, the centroids do not improve the final performance of the baseline algorithms. We also notice that the centroid-based initialization worsens the distribution of fitness of the solutions found by Aging Evolution (i.e., larger variance).
To summarize, the centroids extracted from a fitness-based clustering of the search space seem to be a promising strategy to initialize a population-based search algorithm. We observe improved convergence and long-term performance of EA with a centroid-based initialization, over LHS and rand, when considering the Short encoding. When searching with a budget of only 36 epochs, it also improves final test performance for GA (Short encoding).
The limited improvements when clustering with the Long encoding might be explained by the fact that the baselines (EA, GA, and Aging Evolution) are deployed on models using the Short encoding. Note that experiments searching with the Long encoding were discarded because of the increased complexity of the search procedure. Future work might explore this option, as it could help better exploit the extracted population.

Visualization of the solutions found
Last but not least, we look to gain insights into the solutions found by the algorithms deployed in Section 5.5.
Figure 11 provides a visualization of the solutions found (100 independent runs) by the search baselines, for all initialization settings, considering the Short encoding. More precisely, it shows the frequency of connections on the adjacency matrix (100 solutions) for each baseline. The darker, the higher the frequency. Figures 11a and 11b show results when searching after 36 and 108 epochs of training, respectively. From left to right appear the results for GA, EA, and Aging Evolution. From top to bottom appear the results using random sampling (rand), centroids, and LHS as initialization.
Figure 12 provides the same visualization of solutions found but considering the Long encoding.
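As an illustration of how such a connection-frequency map can be computed, the sketch below averages a set of binary adjacency matrices element-wise and counts the cells that are activated at least once. The helper names and the three toy 3 × 3 matrices are assumptions made for the example; they are not solutions from the benchmark.

```python
def connection_frequency(adjacency_matrices):
    """Fraction of solutions in which each connection (i, j) is present."""
    n_solutions = len(adjacency_matrices)
    size = len(adjacency_matrices[0])
    counts = [[0] * size for _ in range(size)]
    for matrix in adjacency_matrices:
        for i in range(size):
            for j in range(size):
                counts[i][j] += matrix[i][j]
    return [[count / n_solutions for count in row] for row in counts]


def activated_cells(frequency):
    """Number of connections used by at least one solution."""
    return sum(1 for row in frequency for value in row if value > 0)


# Toy upper-triangular adjacency matrices (hypothetical solutions).
solutions = [
    [[0, 1, 1], [0, 0, 1], [0, 0, 0]],
    [[0, 1, 0], [0, 0, 1], [0, 0, 0]],
    [[0, 1, 1], [0, 0, 0], [0, 0, 0]],
]
frequency = connection_frequency(solutions)
for row in frequency:
    print(["%.2f" % value for value in row])
print("activated cells:", activated_cells(frequency))
```

Under this reading, a set of solutions with fewer activated cells and frequencies close to 1 corresponds to solutions that are more similar to each other.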
Overall, the connections gathered from the solutions found after 36 epochs of training differ from those found after 108 epochs. In the first case, the activations on the adjacency matrices form clusters that are more restricted, as opposed to the larger and more widespread clusters obtained when searching after 108 epochs of training.
We also observe a difference in the output depending on the algorithm used to find the solutions. EA and GA provide solutions whose connections are overall similar, in the form of widespread clusters. On the other hand, Aging Evolution yields patterns of connections that are grouped in slightly smaller clusters.
Furthermore, we analyzed the solutions based on the initialization technique used when deploying the search. Across all settings, the solutions obtained via LHS and random sampling appear more diverse (on average, more activated cells in the adjacency matrices) than those obtained with a centroid-based initialization.
To summarize, the longer the training allowed when selecting models, the more diverse the solutions retrieved. Also, EA and GA tend to find more diverse solutions than Aging Evolution. When it comes to initialization, the centroid-based approach results in solutions that are more similar to each other, with adjacency matrices that are less activated.
We find that the patterns highlighted in this section correlate with the findings of the authors in [37]. In that study, the authors show that, on the search space of NAS-Bench-101, the longer the training, the narrower the fitness distribution, with most solutions having close to the top fitness after 108 epochs of training. They also show that the fitness landscape becomes flat, with many local optima. Therefore, when searching with a training budget of 108 epochs and a fixed number of iterations, a search algorithm is likely to retrieve more diverse solutions than after 36 epochs, since most of them satisfy the criterion of high fitness.
The differences between algorithms could be explained by both the very rugged landscape (many local maxima) and the nature of the algorithms. Indeed, Aging Evolution provides non-diverse sets of solutions, which could be explained by it being stuck in local maxima and not diversifying enough, i.e., discarding old solutions.
Regarding the centroids, Section 5.5 already shows that they constitute an initial population of particularly high average fitness, with little variance. This could be explained by the centroids being potential local maxima of high fitness and of very diverse nature, since they come from distinct clusters.

Conclusion
In this study, we seek to gain insights into a search space of image classification models in order to improve the performance of NAS algorithms. More precisely, we want to know whether the convergence of a search strategy can be improved using a data-driven initialization technique that exploits the search space.
For this purpose, we propose a two-step approach to improve the performance of a NAS search strategy. First, we perform a clustering analysis of the search space, involving a sequence of sub-tasks: we sample models from a search space, reduce their dimension, and cluster them. After a careful tuning of the clustering pipeline (number of dimensions, clusters, etc.), we select the algorithm providing the best qualitative and quantitative results. Second, we extract the centroids and use them as the initial population of a search strategy.
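The two steps can be sketched as follows with scikit-learn, using Truncated SVD with two components and a Bayesian Gaussian mixture (BGM) as in the calibration above. This is a minimal sketch under stated assumptions, not the pipeline used in this study: the function name `centroid_population` is hypothetical, random binary vectors stand in for encoded architectures, and each centroid is mapped back to the nearest sampled model because a component mean in the reduced space is not itself a valid architecture.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import BayesianGaussianMixture


def centroid_population(encodings, n_clusters, seed=0):
    """Return indices of the sampled models closest to each cluster mean."""
    # Step 1a: reduce the encodings to two components.
    svd = TruncatedSVD(n_components=2, random_state=seed)
    reduced = svd.fit_transform(encodings)
    # Step 1b: cluster the reduced samples with a Bayesian Gaussian mixture.
    bgm = BayesianGaussianMixture(
        n_components=n_clusters, max_iter=500, random_state=seed
    )
    labels = bgm.fit(reduced).predict(reduced)
    # Step 2: for each non-empty cluster, keep the sample nearest to its mean
    # (assumption: one representative model per cluster).
    chosen = []
    for k in np.unique(labels):
        members = np.flatnonzero(labels == k)
        distances = np.linalg.norm(reduced[members] - bgm.means_[k], axis=1)
        chosen.append(int(members[np.argmin(distances)]))
    return chosen


# Synthetic stand-in for sampled, encoded architectures (assumption).
rng = np.random.default_rng(0)
encodings = rng.integers(0, 2, size=(300, 26)).astype(float)
initial_indices = centroid_population(encodings, n_clusters=19)
print(len(initial_indices), "models selected as the initial population")
```

Note that a BGM may leave some components empty, so the number of selected models can be smaller than `n_clusters`.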
We validate our proposal by initializing three evolutionary algorithms, namely a genetic algorithm (GA), an evolutionary algorithm (EA), and Aging Evolution (AE), and benchmark our data-driven initialization method against conventional initialization baselines, i.e., random initialization and Latin Hypercube Sampling (LHS). To test the algorithms, we query the NAS-Bench-101 dataset, which provides a search space of image classifiers and their fitness evaluation on CIFAR-10. Our results show that centroids extracted using BGM for clustering are a promising approach to initialize a population-based algorithm. In the scenario of selecting models trained for only 36 epochs, this approach used with GA shows significant long-term improvements (after 2000 iterations, in test) over random initialization and LHS, when using the Short encoding. When used with EA, it shows faster convergence (in validation) and significant long-term improvements over random initialization and LHS, when using the Short encoding and for all training budgets. Additional investigations of the distributions of the solutions found by the algorithms suggest that centroids enable retrieving local optima (maxima) of high fitness and similar configurations.
As future work, we propose to investigate the performance of this approach when selecting models with the Long encoding. We also propose to study the obtained clusters in depth to gain more insights into the observed performance. One might also explore the benefits of such a data-driven initialization method for other families of algorithms (Bayesian optimization, local search, etc.).

Fig. 2 Input feature reduction for clustering, with an arbitrary number of clusters N = 10.

Fig. 3 Identifying the proper reduction tool for sparse data, using an arbitrary number of clusters of N = 10.

Fig. 4 Identifying the proper sample size, when clustering using Truncated SVD for dimension reduction (N = 2 components) and k-means.

Fig. 5 Identifying the proper number of clusters, using Truncated SVD for dimension reduction (N = 2 components) and k-means.

Fig. 6 Qualitative analysis of clustering for both feature representations. Here we use Truncated SVD for dimension reduction, and N = 2 components.

Figure 6 depicts visual clustering results for five algorithms: k-means, spectral clustering, DBSCAN, Birch, and BGM. All input features were reduced to two (2) components using Truncated SVD. Plots (a) and (b) display results using the Short and the Long encoding, respectively.

Fig. 7 Performances in validation of various NAS algorithms, when clustering with the Short Encoding.

Fig. 8 Benchmark of NAS algorithm performances after 2000 iterations. The search is performed when training solutions for either 36 or 108 epochs. The data-driven initialization techniques involve the Short encoding.

Fig. 9 Performances in validation of various NAS algorithms, when clustering with the Long encoding.

Fig. 10 Benchmark of NAS algorithm performances after 2000 iterations. The search is performed when training solutions for either 36 or 108 epochs. The data-driven initialization techniques involve the Long encoding.

Table 1 Hyperparameters of the clustering algorithms.

Algorithm 1: Genetic Algorithm
input: The size of the population pop size, the crossover probability cx p, the mutation probabilities mut p and mut i, and the maximum number of evaluations max evaluations, alongside the train, validation, and test data sets.
Best(solution, population)
evaluations ← evaluations + pop size
end
performance ← Evaluate(solution, train, test)
return solution, performance

matrix of the architecture, and categorical values, that correspond to the operations on the edges of the adjacency matrix. Please refer to Section 4.1 for more details.
Algorithm 2 presents a high-level view of the basic (µ+λ)EA implementation provided by the latest version (1.3.1) of the DEAP library [55].
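For intuition, a (µ + λ) loop seeded with a given initial population (e.g., centroids) can be sketched in plain Python as below. This is a simplified illustration and not the DEAP implementation: it uses mutation only, sets µ = λ to the size of the initial population, and replaces the benchmark queries with a toy fitness (the number of ones in a bit string). The names `mu_plus_lambda` and `flip_bits` are hypothetical.

```python
import random


def flip_bits(individual, rng):
    """Toy mutation: flip each bit with probability 1 / len(individual)."""
    rate = 1 / len(individual)
    return tuple(bit ^ 1 if rng.random() < rate else bit for bit in individual)


def mu_plus_lambda(initial_population, fitness, mutate, n_generations, rng):
    """Simplified (mu + lambda) loop with mu = lambda = len(initial_population)."""
    mu = len(initial_population)
    population = list(initial_population)
    for _ in range(n_generations):
        # Produce lambda offspring by mutating randomly chosen parents.
        offspring = [mutate(rng.choice(population), rng) for _ in range(mu)]
        # Parents and offspring compete; the mu fittest individuals survive.
        population = sorted(population + offspring, key=fitness, reverse=True)[:mu]
    return population


rng = random.Random(0)
# Stand-in for an initial population of 13 individuals (assumption).
seeds = [tuple(rng.randint(0, 1) for _ in range(20)) for _ in range(13)]
final = mu_plus_lambda(seeds, sum, flip_bits, n_generations=153, rng=rng)
print("best initial fitness:", max(map(sum, seeds)))
print("best final fitness:", max(map(sum, final)))
```

Because parents compete with their offspring, the best fitness in the population never decreases, which is one reason a high-fitness initial population can translate into long-term gains for this family of algorithms.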