Automated Machine Learning, pp. 63–77
Neural Architecture Search
Abstract
Deep Learning has enabled remarkable progress over the last years on a variety of tasks, such as image recognition, speech recognition, and machine translation. One crucial aspect of this progress is novel neural architectures. Currently employed architectures have mostly been developed manually by human experts, which is a time-consuming and error-prone process. Because of this, there is growing interest in automated neural architecture search methods. We provide an overview of existing work in this field of research and categorize it according to three dimensions: search space, search strategy, and performance estimation strategy.
3.1 Introduction

Search Space. The search space defines which architectures can be represented in principle. Incorporating prior knowledge about properties well-suited for a task can reduce the size of the search space and simplify the search. However, this also introduces a human bias, which may prevent finding novel architectural building blocks that go beyond current human knowledge.

Search Strategy. The search strategy details how to explore the search space. It encompasses the classical exploration-exploitation trade-off since, on the one hand, it is desirable to find well-performing architectures quickly, while on the other hand, premature convergence to a region of suboptimal architectures should be avoided.

Performance Estimation Strategy. The objective of NAS is typically to find architectures that achieve high predictive performance on unseen data. Performance Estimation refers to the process of estimating this performance: the simplest option is to perform a standard training and validation of the architecture on data, but this is unfortunately computationally expensive and limits the number of architectures that can be explored. Much recent research therefore focuses on developing methods that reduce the cost of these performance estimations.
This chapter is based on a very recent survey article [23].
3.2 Search Space
The search space defines which neural architectures a NAS approach might discover in principle. We now discuss common search spaces from recent works.
Recent work on NAS [9, 11, 21, 22, 49, 75] incorporates modern design elements known from hand-crafted architectures, such as skip connections, which allow building complex, multi-branch networks, as illustrated in Fig. 3.2 (right). In this case the input of layer i can be formally described as a function \(g_i(L_{i-1}^{out}, \dots , L_{0}^{out})\) combining previous layer outputs. Employing such a function results in significantly more degrees of freedom. Special cases of these multi-branch architectures are (i) the chain-structured networks (by setting \(g_i(L_{i-1}^{out}, \dots , L_{0}^{out}) = L_{i-1}^{out}\)), (ii) Residual Networks [28], where previous layer outputs are summed (\(g_i(L_{i-1}^{out}, \dots , L_{0}^{out}) = L_{i-1}^{out}+L_{j}^{out},\ j<i\)), and (iii) DenseNets [29], where previous layer outputs are concatenated (\(g_i(L_{i-1}^{out}, \dots , L_{0}^{out}) = concat( L_{i-1}^{out}, \dots , L_{0}^{out})\)).
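The three special cases of the combination function g_i can be made concrete with a small NumPy sketch. The function names and the toy shapes are illustrative assumptions and are not taken from any of the cited works; `outputs[k]` plays the role of \(L_{k}^{out}\).

```python
import numpy as np

def chain_input(outputs):
    """g_i for a chain-structured network: only the previous layer's output."""
    return outputs[-1]

def residual_input(outputs, j):
    """g_i for a residual connection: previous output plus an earlier output L_j."""
    return outputs[-1] + outputs[j]

def dense_input(outputs):
    """g_i for a DenseNet-style connection: concatenate all previous outputs,
    ordered L_{i-1}, ..., L_0 as in the text."""
    return np.concatenate(outputs[::-1], axis=-1)

# Three earlier layers, each producing a batch of 2 samples with 4 features.
outputs = [np.full((2, 4), float(k)) for k in range(3)]
print(chain_input(outputs).shape)        # (2, 4)
print(residual_input(outputs, 0).shape)  # (2, 4)
print(dense_input(outputs).shape)        # (2, 12)
```

Summation requires matching shapes, whereas concatenation grows the feature dimension with depth, which is one reason the choice of g_i adds degrees of freedom to the search space.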
 1. The size of the search space is drastically reduced since cells can be comparatively small. For example, Zoph et al. [75] estimate a seven-times speed-up compared to their previous work [74] while achieving better performance.
 2. Cells can more easily be transferred to other datasets by adapting the number of cells used within a model. Indeed, Zoph et al. [75] transfer cells optimized on CIFAR-10 to ImageNet and achieve state-of-the-art performance.
Consequently, this cell-based search space was also successfully employed by many later works [11, 22, 37, 39, 46, 49, 72]. However, a new design choice arises when using a cell-based search space, namely how to choose the meta-architecture: how many cells shall be used and how should they be connected to build the actual model? For example, Zoph et al. [75] build a sequential model from cells, in which each cell receives the outputs of the two preceding cells as input, while Cai et al. [11] employ the high-level structure of well-known manually designed architectures, such as DenseNet [29], and use their cells within these models. In principle, cells can be combined arbitrarily, e.g., within the multi-branch space described above by simply replacing layers with cells. Ideally, the meta-architecture should be optimized automatically as part of NAS; otherwise one easily ends up doing meta-architecture engineering and the search for the cell becomes overly simple if most of the complexity is already accounted for by the meta-architecture.
One step in the direction of optimizing meta-architectures is the hierarchical search space introduced by Liu et al. [38], which consists of several levels of motifs. The first level consists of the set of primitive operations, the second level of different motifs that connect primitive operations via a directed acyclic graph, the third level of motifs that encode how to connect second-level motifs, and so on. The cell-based search space can be seen as a special case of this hierarchical search space where the number of levels is three, the second-level motifs correspond to the cells, and the third level is the hard-coded meta-architecture.
The choice of the search space largely determines the difficulty of the optimization problem: even for the case of the search space based on a single cell with fixed meta-architecture, the optimization problem remains (i) non-continuous and (ii) relatively high-dimensional (since more complex models tend to perform better, resulting in more design choices). We note that the architectures in many search spaces can be written as fixed-length vectors; e.g., the search space for each of the two cells by Zoph et al. [75] can be written as a 40-dimensional search space with categorical dimensions, each of which chooses between a small number of different building blocks and inputs. Similarly, unbounded search spaces can be constrained to have a maximal depth, giving rise to fixed-size search spaces with (potentially many) conditional dimensions.
In the next section, we discuss Search Strategies that are well-suited for these kinds of search spaces.
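To illustrate how a cell can be written as a fixed-length vector of categorical choices, the following sketch samples a toy cell encoding. The operation names, the number of blocks, and the layout of the vector are hypothetical and do not reproduce the 40-dimensional encoding of Zoph et al. [75].

```python
import random

# Hypothetical operation set; not the actual choices of any cited paper.
OPERATIONS = ["conv3x3", "conv5x5", "sep_conv3x3", "max_pool", "identity"]
NUM_BLOCKS = 5  # assumed number of blocks per cell

def sample_cell(rng):
    """Return a flat list of categorical choices encoding one cell.

    Block b picks two inputs (from the two cell inputs or from earlier
    blocks) and one operation for each of those inputs.
    """
    encoding = []
    for b in range(NUM_BLOCKS):
        num_inputs = 2 + b  # two cell inputs plus outputs of earlier blocks
        for _ in range(2):
            encoding.append(rng.randrange(num_inputs))  # which input to use
            encoding.append(rng.choice(OPERATIONS))     # which operation to apply
    return encoding

rng = random.Random(0)
cell = sample_cell(rng)
print(len(cell))  # 20 categorical dimensions for this toy cell
```

Drawing such vectors uniformly at random already constitutes the random search baseline; the number of valid values of each input dimension depends on its position, which is a simple instance of the conditional structure mentioned above.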
3.3 Search Strategy
Many different search strategies can be used to explore the space of neural architectures, including random search, Bayesian optimization, evolutionary methods, reinforcement learning (RL), and gradient-based methods. Historically, evolutionary algorithms were already used by many researchers to evolve neural architectures (and often also their weights) decades ago [see, e.g., 2, 25, 55, 56]. Yao [67] provides a literature review of work earlier than 2000.
Bayesian optimization celebrated several early successes in NAS since 2013, leading to state-of-the-art vision architectures [7], state-of-the-art performance for CIFAR-10 without data augmentation [19], and the first automatically-tuned neural networks to win competition datasets against human experts [41]. NAS became a mainstream research topic in the machine learning community after Zoph and Le [74] obtained competitive performance on the CIFAR-10 and Penn Treebank benchmarks with a search strategy based on reinforcement learning. While Zoph and Le [74] use vast computational resources to achieve this result (800 GPUs for three to four weeks), after their work, a wide variety of methods have been published in quick succession to reduce the computational costs and achieve further improvements in performance.
To frame NAS as a reinforcement learning (RL) problem [4, 71, 74, 75], the generation of a neural architecture can be considered to be the agent’s action, with the action space identical to the search space. The agent’s reward is based on an estimate of the performance of the trained architecture on unseen data (see Sect. 3.4). Different RL approaches differ in how they represent the agent’s policy and how they optimize it: Zoph and Le [74] use a recurrent neural network (RNN) policy to sequentially sample a string that in turn encodes the neural architecture. They initially trained this network with the REINFORCE policy gradient algorithm, but in follow-up work use Proximal Policy Optimization (PPO) instead [75]. Baker et al. [4] use Q-learning to train a policy which sequentially chooses a layer’s type and corresponding hyperparameters. An alternative view of these approaches is as sequential decision processes in which the policy samples actions to generate the architecture sequentially, the environment’s “state” contains a summary of the actions sampled so far, and the (undiscounted) reward is obtained only after the final action. However, since no interaction with an environment occurs during this sequential process (no external state is observed, and there are no intermediate rewards), we find it more intuitive to interpret the architecture sampling process as the sequential generation of a single action; this simplifies the RL problem to a stateless multi-armed bandit problem.
A related approach was proposed by Cai et al. [10], who frame NAS as a sequential decision process: in their approach the state is the current (partially trained) architecture, the reward is an estimate of the architecture’s performance, and the action corresponds to an application of function-preserving mutations, dubbed network morphisms [12, 63] (see also Sect. 3.4), followed by a phase of training the network. In order to deal with variable-length network architectures, they use a bidirectional LSTM to encode architectures into a fixed-length representation. Based on this encoded representation, actor networks decide on the sampled action. The combination of these two components constitutes the policy, which is trained end-to-end with the REINFORCE policy gradient algorithm. We note that this approach will not visit the same state (architecture) twice, so that strong generalization over the architecture space is required from the policy.
An alternative to using RL is neuroevolutionary approaches that use evolutionary algorithms for optimizing the neural architecture. The first such approach for designing neural networks we are aware of dates back almost three decades: Miller et al. [44] use genetic algorithms to propose architectures and use backpropagation to optimize their weights. Many neuroevolutionary approaches since then [2, 55, 56] use genetic algorithms to optimize both the neural architecture and its weights; however, when scaling to contemporary neural architectures with millions of weights for supervised learning tasks, SGD-based weight optimization methods currently outperform evolutionary ones (see footnote 1). More recent neuroevolutionary approaches [22, 38, 43, 49, 50, 59, 66] therefore again use gradient-based methods for optimizing weights and solely use evolutionary algorithms for optimizing the neural architecture itself. Evolutionary algorithms evolve a population of models, i.e., a set of (possibly trained) networks; in every evolution step, at least one model from the population is sampled and serves as a parent to generate offspring by applying mutations to it. In the context of NAS, mutations are local operations, such as adding or removing a layer, altering the hyperparameters of a layer, adding skip connections, as well as altering training hyperparameters. After training the offspring, their fitness (e.g., performance on a validation set) is evaluated and they are added to the population.
Neuroevolutionary methods differ in how they sample parents, update populations, and generate offspring. For example, Real et al. [50], Real et al. [49], and Liu et al. [38] use tournament selection [27] to sample parents, whereas Elsken et al. [22] sample parents from a multi-objective Pareto front using an inverse density. Real et al. [50] remove the worst individual from a population, while Real et al. [49] found it beneficial to remove the oldest individual (which decreases greediness), and Liu et al. [38] do not remove individuals at all. To generate offspring, most approaches initialize child networks randomly, while Elsken et al. [22] employ Lamarckian inheritance, i.e., knowledge (in the form of learned weights) is passed on from a parent network to its children by using network morphisms. Real et al. [50] also let an offspring inherit all parameters of its parent that are not affected by the applied mutation; while this inheritance is not strictly function-preserving, it might also speed up learning compared to a random initialization. Moreover, they also allow mutating the learning rate, which can be seen as a way of optimizing the learning rate schedule during NAS.
Real et al. [49] conduct a case study comparing RL, evolution, and random search (RS), concluding that RL and evolution perform equally well in terms of final test accuracy, with evolution having better anytime performance and finding smaller models. Both approaches consistently perform better than RS in their experiments, but with a rather small margin: RS achieved test errors of approximately 4% on CIFAR-10, while RL and evolution reached approximately 3.5% (after “model augmentation” where depth and number of filters were increased; the difference on the actual, non-augmented search space was approximately 2%). The difference was even smaller for Liu et al. [38], who reported a test error of 3.9% on CIFAR-10 and a top-1 validation error of 21.0% on ImageNet for RS, compared to 3.75% and 20.3% for their evolution-based method, respectively.
Bayesian Optimization (BO, see, e.g., [53]) is one of the most popular methods for hyperparameter optimization (see also Chap. 1 of this book), but it has not been applied to NAS by many groups since typical BO toolboxes are based on Gaussian processes and focus on low-dimensional continuous optimization problems. Swersky et al. [60] and Kandasamy et al. [31] derive kernel functions for architecture search spaces in order to use classic GP-based BO methods, but so far without achieving new state-of-the-art performance. In contrast, several works use tree-based models (in particular, tree-structured Parzen estimators [8], or random forests [30]) to effectively search very high-dimensional conditional spaces and achieve state-of-the-art performance on a wide range of problems, optimizing both neural architectures and their hyperparameters jointly [7, 19, 41, 69]. While a full comparison is lacking, there is preliminary evidence that these approaches can also outperform evolutionary algorithms [33].
Architectural search spaces have also been explored in a hierarchical manner, e.g., in combination with evolution [38] or by sequential model-based optimization [37]. Negrinho and Gordon [45] and Wistuba [65] exploit the tree structure of their search space and use Monte Carlo Tree Search. Elsken et al. [21] propose a simple yet well-performing hill climbing algorithm that discovers high-quality architectures by greedily moving in the direction of better performing architectures without requiring more sophisticated exploration mechanisms.
In contrast to the gradient-free optimization methods above, Liu et al. [39] propose a continuous relaxation of the search space to enable gradient-based optimization: instead of fixing a single operation \(o_i\) (e.g., convolution or pooling) to be executed at a specific layer, the authors compute a convex combination from a set of operations \(\{o_1, \dots , o_m\}\). More specifically, given a layer input x, the layer output y is computed as \(y = \sum _{i=1}^{m} \lambda _i o_i(x), \lambda _i \ge 0, \sum _{i=1}^m \lambda _i = 1 \), where the convex coefficients \(\lambda _i\) effectively parameterize the network architecture. Liu et al. [39] then optimize both the network weights and the network architecture by alternating gradient descent steps on training data for weights and on validation data for architectural parameters such as λ. Eventually, a discrete architecture is obtained by choosing the operation i with \( i = {\arg \max }_i \, \lambda _i\) for every layer. Shin et al. [54] and Ahmed and Torresani [1] also employ gradient-based optimization of neural architectures; however, they only consider optimizing layer hyperparameters or connectivity patterns, respectively.
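The stateless bandit view of RL-based NAS described above can be sketched with a small REINFORCE loop in which the whole architecture is sampled as a single action. Everything here is a toy assumption: the search space of four categorical decisions, the `reward` function standing in for validation performance, and the independent softmax policy (the cited works use an RNN policy instead).

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_DECISIONS, NUM_CHOICES = 4, 3      # toy search space with 3**4 architectures
TARGET = np.array([2, 0, 1, 2])        # hypothetical "best" architecture

def reward(arch):
    """Stand-in for validation accuracy: fraction of decisions matching TARGET."""
    return float(np.mean(arch == TARGET))

logits = np.zeros((NUM_DECISIONS, NUM_CHOICES))  # policy parameters
baseline, lr = 0.0, 0.5

for step in range(500):
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Sample the whole architecture as one action (one choice per decision).
    arch = np.array([rng.choice(NUM_CHOICES, p=p) for p in probs])
    r = reward(arch)
    # REINFORCE: the gradient of log pi(arch) w.r.t. the logits is one_hot(arch) - probs.
    grad_log_pi = np.eye(NUM_CHOICES)[arch] - probs
    logits += lr * (r - baseline) * grad_log_pi
    baseline = 0.9 * baseline + 0.1 * r   # moving-average baseline to reduce variance

best = logits.argmax(axis=1)
print(best, reward(best))
```

No state is observed between the individual decisions and the reward arrives only once per sampled architecture, which is why the loop has the shape of a bandit problem.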
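The following sketch shows tournament selection combined with two of the population update rules mentioned above (removing the oldest versus removing the worst individual). The encoding, the `fitness` function, and all parameter values are toy assumptions; in an actual NAS run, `fitness` would involve training the child network.

```python
import collections
import random

rng = random.Random(0)
NUM_DECISIONS, NUM_CHOICES = 6, 4

def fitness(arch):
    """Stand-in for validation accuracy after training (toy objective)."""
    return sum(arch) / (NUM_DECISIONS * (NUM_CHOICES - 1))

def mutate(arch):
    """Local mutation: re-sample one randomly chosen decision."""
    child = list(arch)
    child[rng.randrange(NUM_DECISIONS)] = rng.randrange(NUM_CHOICES)
    return child

def evolve(population_size=20, tournament_size=5, steps=200, remove="oldest"):
    population = collections.deque()
    for _ in range(population_size):
        arch = [rng.randrange(NUM_CHOICES) for _ in range(NUM_DECISIONS)]
        population.append((fitness(arch), arch))
    best = max(population)
    for _ in range(steps):
        # Tournament selection: the fittest of a random sample becomes the parent.
        parent = max(rng.sample(list(population), tournament_size))
        child = mutate(parent[1])
        individual = (fitness(child), child)
        population.append(individual)
        if remove == "oldest":
            population.popleft()                 # aging: drop the oldest individual
        else:
            population.remove(min(population))   # drop the worst individual
        best = max(best, individual)
    return best

print(evolve(remove="oldest"))
print(evolve(remove="worst"))
```

Removing the oldest individual means that even a very fit architecture eventually leaves the population unless its descendants keep performing well, which is the sense in which aging decreases greediness.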
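A minimal sketch of the mixed operation is given below. It assumes that the convex coefficients are obtained as a softmax over unconstrained parameters `alpha`, which guarantees non-negativity and summation to one; the three candidate operations are hypothetical stand-ins for convolution or pooling layers, and the gradient-based training itself is omitted.

```python
import numpy as np

# Toy candidate operations on a feature vector (hypothetical stand-ins).
operations = {
    "identity": lambda x: x,
    "double":   lambda x: 2.0 * x,
    "zero":     lambda x: np.zeros_like(x),
}

def mixed_output(x, alpha):
    """Compute y = sum_i lambda_i * o_i(x) with lambda = softmax(alpha)."""
    lam = np.exp(alpha - alpha.max())
    lam /= lam.sum()
    y = sum(l * op(x) for l, op in zip(lam, operations.values()))
    return y, lam

def discretize(alpha):
    """Keep only the operation with the largest coefficient."""
    return list(operations)[int(np.argmax(alpha))]

x = np.array([1.0, -2.0, 3.0])
alpha = np.array([0.1, 1.5, -0.3])
y, lam = mixed_output(x, alpha)
print(lam.sum())          # coefficients sum to 1 (up to rounding)
print(discretize(alpha))  # double
```

Because y is differentiable in `alpha`, the architecture parameters can be updated by gradient descent just like the network weights, and the discrete architecture is only extracted at the end.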
3.4 Performance Estimation Strategy
The search strategies discussed in Sect. 3.3 aim at finding a neural architecture A that maximizes some performance measure, such as accuracy on unseen data. To guide their search process, these strategies need to estimate the performance of a given architecture A they consider. The simplest way of doing this is to train A on training data and evaluate its performance on validation data. However, training each architecture to be evaluated from scratch frequently yields computational demands in the order of thousands of GPU days for NAS [49, 50, 74, 75].
To reduce this computational burden, performance can be estimated based on lower fidelities of the actual performance after full training (also denoted as proxy metrics). Such lower fidelities include shorter training times [69, 75], training on a subset of the data [34], on lower-resolution images [14], or with fewer filters per layer [49, 75]. While these low-fidelity approximations reduce the computational cost, they also introduce bias in the estimate as performance will typically be underestimated. This may not be problematic as long as the search strategy only relies on ranking different architectures and the relative ranking remains stable. However, recent results indicate that this relative ranking can change dramatically when the difference between the cheap approximations and the “full” evaluation is too big [69], arguing for a gradual increase in fidelities [24, 35].
Another possible way of estimating an architecture’s performance builds upon learning curve extrapolation [5, 19, 32, 48, 61]. Domhan et al. [19] propose to extrapolate initial learning curves and terminate those predicted to perform poorly to speed up the architecture search process. Baker et al. [5], Klein et al. [32], Rawal and Miikkulainen [48], and Swersky et al. [61] also consider architectural hyperparameters for predicting which partial learning curves are most promising. Training a surrogate model for predicting the performance of novel architectures is also proposed by Liu et al. [37], who do not employ learning curve extrapolation but support predicting performance based on architectural/cell properties and extrapolate to architectures/cells with larger size than seen during training. The main challenge for predicting the performances of neural architectures is that, in order to speed up the search process, good predictions in a relatively large search space need to be made based on relatively few evaluations.
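One way to realize a gradual increase in fidelities is a successive-halving-style schedule, sketched below. This is an illustrative choice and is not claimed to be the exact procedure of the cited works; the `evaluate` function, its noise model, and the parameters `min_budget` and `eta` are assumptions.

```python
import random

rng = random.Random(0)

def evaluate(quality, budget):
    """Toy low-fidelity estimate: true quality plus noise that shrinks with budget."""
    return quality + rng.gauss(0.0, 1.0 / budget)

def successive_halving(qualities, min_budget=1, eta=2):
    """Evaluate all candidates cheaply, keep the top 1/eta, and repeat with more budget."""
    candidates = list(range(len(qualities)))
    budget = min_budget
    while len(candidates) > 1:
        scores = {c: evaluate(qualities[c], budget) for c in candidates}
        keep = max(1, len(candidates) // eta)
        candidates = sorted(candidates, key=scores.get, reverse=True)[:keep]
        budget *= eta  # survivors are re-evaluated at a higher fidelity
    return candidates[0]

qualities = [rng.random() for _ in range(16)]  # hidden "true" performances
winner = successive_halving(qualities)
print(winner, qualities[winner])
```

Early rounds rank many candidates with noisy, cheap estimates, while the expensive, more reliable evaluations are reserved for the few candidates that survive.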
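The idea of terminating runs based on an extrapolated learning curve can be sketched as follows. A single power law fitted by least squares in log space is used here as a deliberate simplification; it is not the model of Domhan et al. [19], and the observed error values and thresholds are made up.

```python
import numpy as np

def extrapolate_error(errors, final_epoch):
    """Fit err(t) = a * t**(-b) to the observed curve and predict err(final_epoch)."""
    t = np.arange(1, len(errors) + 1)
    slope, intercept = np.polyfit(np.log(t), np.log(errors), deg=1)
    return float(np.exp(intercept) * final_epoch ** slope)

def should_stop(errors, final_epoch, best_error_so_far):
    """Stop a run whose predicted final error is worse than the best seen so far."""
    return extrapolate_error(errors, final_epoch) > best_error_so_far

observed = [0.50, 0.40, 0.35, 0.32, 0.30]  # made-up validation errors per epoch
print(extrapolate_error(observed, final_epoch=50))
print(should_stop(observed, final_epoch=50, best_error_so_far=0.10))  # True
```

Only five epochs are trained before the decision is made, so the saving relative to a full 50-epoch run is substantial whenever the prediction is reliable enough.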
Another approach to speed up performance estimation is to initialize the weights of novel architectures based on weights of other architectures that have been trained before. One way of achieving this, dubbed network morphisms [64], allows modifying an architecture while leaving the function represented by the network unchanged [10, 11, 21, 22]. This allows increasing capacity of networks successively and retaining high performance without requiring training from scratch. Continuing training for a few epochs can also make use of the additional capacity introduced by network morphisms. An advantage of these approaches is that they allow search spaces without an inherent upper bound on the architecture’s size [21]; on the other hand, strict network morphisms can only make architectures larger and may thus lead to overly complex architectures. This can be attenuated by employing approximate network morphisms that allow shrinking architectures [22].
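A simple example of a function-preserving modification is inserting an additional layer that is initialized to the identity, sketched here for a small fully connected ReLU network. The network sizes and names are illustrative assumptions; the construction relies on the fact that ReLU leaves non-negative inputs unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Parent network: one hidden layer with 8 units.
W1, b1 = rng.normal(size=(5, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

def parent(x):
    return relu(x @ W1 + b1) @ W2 + b2

# Child network: an extra hidden layer initialised to the identity mapping.
W_new, b_new = np.eye(8), np.zeros(8)

def child(x):
    h = relu(x @ W1 + b1)
    h = relu(h @ W_new + b_new)  # relu(h) == h because h is already non-negative
    return h @ W2 + b2

x = rng.normal(size=(4, 5))
print(np.allclose(parent(x), child(x)))  # True
```

The child starts with exactly the parent's predictions but has more capacity, so continued training can only be expected to start from the parent's performance rather than from scratch.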
One-Shot Architecture Search is another promising approach for speeding up performance estimation, which treats all architectures as different subgraphs of a supergraph (the one-shot model) and shares weights between architectures that have edges of this supergraph in common [6, 9, 39, 46, 52]. Only the weights of a single one-shot model need to be trained (in one of various ways), and architectures (which are just subgraphs of the one-shot model) can then be evaluated without any separate training by inheriting trained weights from the one-shot model. This greatly speeds up performance estimation of architectures, since no training is required (only evaluating performance on validation data). This approach typically incurs a large bias as it underestimates the actual performance of architectures severely; nevertheless, it allows ranking architectures reliably, since the estimated performance correlates strongly with the actual performance [6]. Different one-shot NAS methods differ in how the one-shot model is trained: ENAS [46] learns an RNN controller that samples architectures from the search space and trains the one-shot model based on approximate gradients obtained through REINFORCE. DARTS [39] optimizes all weights of the one-shot model jointly with a continuous relaxation of the search space obtained by placing a mixture of candidate operations on each edge of the one-shot model. Bender et al. [6] only train the one-shot model once and show that this is sufficient when deactivating parts of this model stochastically during training using path dropout. While ENAS and DARTS optimize a distribution over architectures during training, the approach of Bender et al. [6] can be seen as using a fixed distribution. The high performance obtainable by the approach of Bender et al. [6] indicates that the combination of weight sharing and a fixed (carefully chosen) distribution might (perhaps surprisingly) be the only required ingredients for one-shot NAS.
Related to these approaches is meta-learning of hypernetworks that generate weights for novel architectures and thus require only training the hypernetwork but not the architectures themselves [9]. The main difference here is that weights are not strictly shared but generated by the shared hypernetwork (conditional on the sampled architecture).
A general limitation of one-shot NAS is that the supergraph defined a priori restricts the search space to its subgraphs. Moreover, approaches which require that the entire supergraph resides in GPU memory during architecture search will accordingly be restricted to relatively small supergraphs and search spaces, and are thus typically used in combination with cell-based search spaces. While approaches based on weight sharing have substantially reduced the computational resources required for NAS (from thousands to a few GPU days), it is currently not well understood which biases they introduce into the search if the sampling distribution of architectures is optimized along with the one-shot model. For instance, an initial bias in exploring certain parts of the search space more than others might lead to the weights of the one-shot model being better adapted for these architectures, which in turn would reinforce the bias of the search to these parts of the search space. This might result in premature convergence of NAS and might be one advantage of a fixed sampling distribution as used by Bender et al. [6]. In general, a more systematic analysis of biases introduced by different performance estimators would be a desirable direction for future work.
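The weight-sharing mechanism can be illustrated with a toy supergraph in which every layer offers several candidate operations, each with its own weight matrix, and an architecture simply selects one operation per layer. All sizes and names are assumptions, and the shared weights are left at their random initialization here, whereas an actual one-shot model would be trained first.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, NUM_OPS, WIDTH = 3, 2, 4

# One weight matrix per (layer, candidate operation): the one-shot model.
# Untrained in this sketch; training the supergraph is omitted.
shared = rng.normal(scale=0.5, size=(NUM_LAYERS, NUM_OPS, WIDTH, WIDTH))

def forward(x, arch):
    """Run the subgraph selected by arch (one operation index per layer),
    reusing the shared weights instead of training new ones."""
    for layer, op in enumerate(arch):
        x = np.tanh(x @ shared[layer, op])
    return x

def proxy_score(arch, x_val, y_val):
    """Negative mean squared error on validation data as a cheap ranking signal."""
    return -float(np.mean((forward(x_val, arch) - y_val) ** 2))

x_val = rng.normal(size=(16, WIDTH))
y_val = rng.normal(size=(16, WIDTH))
archs = [(a, b, c) for a in range(NUM_OPS) for b in range(NUM_OPS) for c in range(NUM_OPS)]
ranking = sorted(archs, key=lambda a: proxy_score(a, x_val, y_val), reverse=True)
print(ranking[0])
```

Ranking all eight architectures requires only forward passes, since two architectures that agree on a layer's operation use the very same weight matrix for that layer.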
3.5 Future Directions
In this section, we discuss several current and future directions for research on NAS. Most existing work has focused on NAS for image classification. On the one hand, this provides a challenging benchmark since a lot of manual engineering has been devoted to finding architectures that perform well in this domain and are not easily outperformed by NAS. On the other hand, it is relatively easy to define a well-suited search space by utilizing knowledge from manual engineering. This in turn makes it unlikely that NAS will find architectures that substantially outperform existing ones, since the found architectures cannot differ fundamentally. We thus consider it important to go beyond image classification problems by applying NAS to less explored domains. Notable first steps in this direction are applying NAS to language modeling [74], music modeling [48], image restoration [58] and network compression [3]; applications to reinforcement learning, generative adversarial networks, semantic segmentation, or sensor fusion could be further promising future directions.
An alternative direction is developing NAS methods for multi-task problems [36, 42] and for multi-objective problems [20, 22, 73], in which measures of resource efficiency are used as objectives along with the predictive performance on unseen data. Likewise, it would be interesting to extend RL/bandit approaches, such as those discussed in Sect. 3.3, to learn policies that are conditioned on a state that encodes task properties/resource requirements (i.e., turning the setting into a contextual bandit). A similar direction was followed by Ramachandran and Le [47] in extending one-shot NAS to generate different architectures depending on the task or instance on-the-fly. Moreover, applying NAS to searching for architectures that are more robust to adversarial examples [17] is an intriguing recent direction.
Related to this is research on defining more general and flexible search spaces. For instance, while the cell-based search space provides high transferability between different image classification tasks, it is largely based on human experience on image classification and does not generalize easily to other domains where the hard-coded hierarchical structure (repeating the same cells several times in a chain-like structure) does not apply (e.g., semantic segmentation or object detection). A search space which allows representing and identifying more general hierarchical structure would thus make NAS more broadly applicable; see Liu et al. [38] for first work in this direction. Moreover, common search spaces are also based on predefined building blocks, such as different kinds of convolutions and pooling, but do not allow identifying novel building blocks on this level; going beyond this limitation might substantially increase the power of NAS.
The comparison of different methods for NAS is complicated by the fact that measurements of an architecture’s performance depend on many factors other than the architecture itself. While most authors report results on the CIFAR-10 dataset, experiments often differ with regard to search space, computational budget, data augmentation, training procedures, regularization, and other factors. For example, for CIFAR-10, performance substantially improves when using a cosine annealing learning rate schedule [40], data augmentation by CutOut [18], by MixUp [70] or by a combination of factors [16], and regularization by Shake-Shake regularization [26] or scheduled drop-path [75]. It is therefore conceivable that improvements in these ingredients have a larger impact on reported performance numbers than the better architectures found by NAS. We thus consider the definition of common benchmarks to be crucial for a fair comparison of different NAS methods. A first step in this direction is the definition of a benchmark for joint architecture and hyperparameter search for a fully connected neural network with two hidden layers [33]. In this benchmark, nine discrete hyperparameters need to be optimized that control both architecture and optimization/regularization. All 62,208 possible hyperparameter combinations have been pre-evaluated such that different methods can be compared with low computational resources. However, the search space is still very simple compared to the spaces employed by most NAS methods. It would also be interesting to evaluate NAS methods not in isolation but as part of a full open-source AutoML system, in which hyperparameters [41, 50, 69] and the data augmentation pipeline [16] are also optimized along with NAS.
While NAS has achieved impressive performance, so far it provides little insight into why specific architectures work well and how similar the architectures derived in independent runs would be. Identifying common motifs, providing an understanding of why those motifs are important for high performance, and investigating whether these motifs generalize over different problems would be desirable.
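In the multi-objective setting, candidate architectures are typically compared via Pareto dominance rather than a single score. The sketch below computes the Pareto front for two objectives that are both minimized; the objective values are made up for illustration.

```python
def pareto_front(points):
    """Return the points that are not dominated by any other point
    (both objectives are to be minimised)."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# (validation error, number of parameters in millions): made-up values.
candidates = [(0.05, 30.0), (0.06, 10.0), (0.07, 12.0), (0.04, 50.0), (0.06, 9.0)]
print(pareto_front(candidates))
# [(0.05, 30.0), (0.04, 50.0), (0.06, 9.0)]
```

Each remaining architecture represents a different trade-off between predictive performance and resource consumption, so the final choice can be deferred until the deployment constraints are known.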
Footnotes
 1. Some recent work shows that evolving even millions of weights is competitive with gradient-based optimization when only high-variance estimates of the gradient are available, e.g., for reinforcement learning tasks [15, 51, 57]. Nonetheless, for supervised learning tasks gradient-based optimization is by far the most common approach.
Notes
Acknowledgements
We would like to thank Esteban Real, Arber Zela, Gabriel Bender, Kenneth Stanley and Thomas Pfeil for feedback on earlier versions of this survey. This work has partly been supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant no. 716721.
Bibliography
 1.Ahmed, K., Torresani, L.: Maskconnect: Connectivity learning by gradient descent. In: European Conference on Computer Vision (ECCV) (2018)Google Scholar
 2.Angeline, P.J., Saunders, G.M., Pollack, J.B.: An evolutionary algorithm that constructs recurrent neural networks. IEEE transactions on neural networks 5 1, 54–65 (1994)CrossRefGoogle Scholar
 3.Ashok, A., Rhinehart, N., Beainy, F., Kitani, K.M.: N2n learning: Network to network compression via policy gradient reinforcement learning. In: International Conference on Learning Representations (2018)Google Scholar
 4.Baker, B., Gupta, O., Naik, N., Raskar, R.: Designing neural network architectures using reinforcement learning. In: International Conference on Learning Representations (2017a)Google Scholar
 5.Baker, B., Gupta, O., Raskar, R., Naik, N.: Accelerating Neural Architecture Search using Performance Prediction. In: NIPS Workshop on MetaLearning (2017b)Google Scholar
 6.Bender, G., Kindermans, P.J., Zoph, B., Vasudevan, V., Le, Q.: Understanding and simplifying oneshot architecture search. In: International Conference on Machine Learning (2018)Google Scholar
 7.Bergstra, J., Yamins, D., Cox, D.D.: Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In: ICML (2013)Google Scholar
 8.Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyperparameter optimization. In: ShaweTaylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24. pp. 2546–2554 (2011)Google Scholar
 9.Brock, A., Lim, T., Ritchie, J.M., Weston, N.: SMASH: oneshot model architecture search through hypernetworks. In: NIPS Workshop on MetaLearning (2017)Google Scholar
10. Cai, H., Chen, T., Zhang, W., Yu, Y., Wang, J.: Efficient architecture search by network transformation. In: Association for the Advancement of Artificial Intelligence (2018a)
11. Cai, H., Yang, J., Zhang, W., Han, S., Yu, Y.: Path-Level Network Transformation for Efficient Architecture Search. In: International Conference on Machine Learning (Jun 2018b)
12. Chen, T., Goodfellow, I.J., Shlens, J.: Net2net: Accelerating learning via knowledge transfer. In: International Conference on Learning Representations (2016)
13. Chollet, F.: Xception: Deep learning with depthwise separable convolutions. arXiv:1610.02357 (2016)
14. Chrabaszcz, P., Loshchilov, I., Hutter, F.: A downsampled variant of imagenet as an alternative to the CIFAR datasets. CoRR abs/1707.08819 (2017)
15. Chrabaszcz, P., Loshchilov, I., Hutter, F.: Back to basics: Benchmarking canonical evolution strategies for playing atari. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. pp. 1419–1426. International Joint Conferences on Artificial Intelligence Organization (Jul 2018)
16. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: Learning Augmentation Policies from Data. In: arXiv:1805.09501 (May 2018)
17. Cubuk, E.D., Zoph, B., Schoenholz, S.S., Le, Q.V.: Intriguing Properties of Adversarial Examples. In: arXiv:1711.02846 (Nov 2017)
18. Devries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint abs/1708.04552 (2017)
19. Domhan, T., Springenberg, J.T., Hutter, F.: Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In: Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI) (2015)
20. Dong, J.D., Cheng, A.C., Juan, D.C., Wei, W., Sun, M.: Dpp-net: Device-aware progressive search for pareto-optimal neural architectures. In: European Conference on Computer Vision (2018)
21. Elsken, T., Metzen, J.H., Hutter, F.: Simple And Efficient Architecture Search for Convolutional Neural Networks. In: NIPS Workshop on Meta-Learning (2017)
22. Elsken, T., Metzen, J.H., Hutter, F.: Efficient Multi-objective Neural Architecture Search via Lamarckian Evolution. In: International Conference on Learning Representations (2019)
23. Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: A survey. arXiv:1808.05377 (2018)
24. Falkner, S., Klein, A., Hutter, F.: BOHB: Robust and efficient hyperparameter optimization at scale. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 1436–1445. PMLR, Stockholmsmässan, Stockholm Sweden (10–15 Jul 2018)
25. Floreano, D., Dürr, P., Mattiussi, C.: Neuroevolution: from architectures to learning. Evolutionary Intelligence 1(1), 47–62 (2008)
26. Gastaldi, X.: Shake-shake regularization. In: International Conference on Learning Representations Workshop (2017)
27. Goldberg, D.E., Deb, K.: A comparative analysis of selection schemes used in genetic algorithms. In: Foundations of Genetic Algorithms. pp. 69–93. Morgan Kaufmann (1991)
28. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: Conference on Computer Vision and Pattern Recognition (2016)
29. Huang, G., Liu, Z., Weinberger, K.Q.: Densely Connected Convolutional Networks. In: Conference on Computer Vision and Pattern Recognition (2017)
30. Hutter, F., Hoos, H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: LION. pp. 507–523 (2011)
31. Kandasamy, K., Neiswanger, W., Schneider, J., Poczos, B., Xing, E.: Neural Architecture Search with Bayesian Optimisation and Optimal Transport. arXiv:1802.07191 (Feb 2018)
32. Klein, A., Falkner, S., Springenberg, J.T., Hutter, F.: Learning curve prediction with Bayesian neural networks. In: International Conference on Learning Representations (2017a)
33. Klein, A., Christiansen, E., Murphy, K., Hutter, F.: Towards reproducible neural architecture and hyperparameter search. In: ICML 2018 Workshop on Reproducibility in ML (RML 2018) (2018)
34. Klein, A., Falkner, S., Bartels, S., Hennig, P., Hutter, F.: Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets. In: Singh, A., Zhu, J. (eds.) Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 54, pp. 528–536. PMLR, Fort Lauderdale, FL, USA (20–22 Apr 2017b)
35. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: bandit-based configuration evaluation for hyperparameter optimization. In: International Conference on Learning Representations (2017)
36. Liang, J., Meyerson, E., Miikkulainen, R.: Evolutionary Architecture Search For Deep Multitask Networks. In: arXiv:1803.03745 (Mar 2018)
37. Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.J., Fei-Fei, L., Yuille, A., Huang, J., Murphy, K.: Progressive Neural Architecture Search. In: European Conference on Computer Vision (2018a)
38. Liu, H., Simonyan, K., Vinyals, O., Fernando, C., Kavukcuoglu, K.: Hierarchical Representations for Efficient Architecture Search. In: International Conference on Learning Representations (2018b)
39. Liu, H., Simonyan, K., Yang, Y.: Darts: Differentiable architecture search. In: International Conference on Learning Representations (2019)
40. Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts. In: International Conference on Learning Representations (2017)
41. Mendoza, H., Klein, A., Feurer, M., Springenberg, J., Hutter, F.: Towards Automatically-Tuned Neural Networks. In: International Conference on Machine Learning, AutoML Workshop (Jun 2016)
42. Meyerson, E., Miikkulainen, R.: Pseudo-task Augmentation: From Deep Multitask Learning to Intratask Sharing and Back. In: arXiv:1803.03745 (Mar 2018)
43. Miikkulainen, R., Liang, J., Meyerson, E., Rawal, A., Fink, D., Francon, O., Raju, B., Shahrzad, H., Navruzyan, A., Duffy, N., Hodjat, B.: Evolving Deep Neural Networks. In: arXiv:1703.00548 (Mar 2017)
44. Miller, G., Todd, P., Hedge, S.: Designing neural networks using genetic algorithms. In: 3rd International Conference on Genetic Algorithms (ICGA’89) (1989)
45. Negrinho, R., Gordon, G.: DeepArchitect: Automatically Designing and Training Deep Architectures. arXiv:1704.08792 (2017)
46. Pham, H., Guan, M.Y., Zoph, B., Le, Q.V., Dean, J.: Efficient neural architecture search via parameter sharing. In: International Conference on Machine Learning (2018)
47. Ramachandran, P., Le, Q.V.: Dynamic Network Architectures. In: AutoML 2018 (ICML workshop) (2018)
48. Rawal, A., Miikkulainen, R.: From Nodes to Networks: Evolving Recurrent Neural Networks. In: arXiv:1803.04439 (Mar 2018)
49. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Aging Evolution for Image Classifier Architecture Search. In: AAAI Conference on Artificial Intelligence (2019)
50. Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Le, Q.V., Kurakin, A.: Large-scale evolution of image classifiers. International Conference on Machine Learning (2017)
51. Salimans, T., Ho, J., Chen, X., Sutskever, I.: Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint (2017)
52. Saxena, S., Verbeek, J.: Convolutional neural fabrics. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, pp. 4053–4061. Curran Associates, Inc. (2016)
53. Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., de Freitas, N.: Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE 104(1), 148–175 (Jan 2016)
54. Shin, R., Packer, C., Song, D.: Differentiable neural network architecture search. In: International Conference on Learning Representations Workshop (2018)
55. Stanley, K.O., D’Ambrosio, D.B., Gauci, J.: A hypercube-based encoding for evolving large-scale neural networks. Artif. Life 15(2), 185–212 (Apr 2009), URL https://doi.org/10.1162/artl.2009.15.2.15202
56. Stanley, K.O., Miikkulainen, R.: Evolving neural networks through augmenting topologies. Evolutionary Computation 10, 99–127 (2002)
57. Such, F.P., Madhavan, V., Conti, E., Lehman, J., Stanley, K.O., Clune, J.: Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint (2017)
58. Suganuma, M., Ozay, M., Okatani, T.: Exploiting the potential of standard convolutional autoencoders for image restoration by evolutionary search. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 4771–4780. PMLR, Stockholmsmässan, Stockholm Sweden (10–15 Jul 2018)
59. Suganuma, M., Shirakawa, S., Nagao, T.: A genetic programming approach to designing convolutional neural network architectures. In: Genetic and Evolutionary Computation Conference (2017)
60. Swersky, K., Duvenaud, D., Snoek, J., Hutter, F., Osborne, M.: Raiders of the lost architecture: Kernels for bayesian optimization in conditional parameter spaces. In: NIPS Workshop on Bayesian Optimization in Theory and Practice (2013)
61. Swersky, K., Snoek, J., Adams, R.P.: Freeze-thaw bayesian optimization (2014)
62. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the Inception Architecture for Computer Vision. In: Conference on Computer Vision and Pattern Recognition (2016)
63. Wei, T., Wang, C., Chen, C.W.: Modularized morphing of neural networks. arXiv:1701.03281 (2017)
64. Wei, T., Wang, C., Rui, Y., Chen, C.W.: Network morphism. In: International Conference on Machine Learning (2016)
65. Wistuba, M.: Finding Competitive Network Architectures Within a Day Using UCT. In: arXiv:1712.07420 (Dec 2017)
66. Xie, L., Yuille, A.: Genetic CNN. In: International Conference on Computer Vision (2017)
67. Yao, X.: Evolving artificial neural networks. Proceedings of the IEEE 87(9), 1423–1447 (Sept 1999)
68. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions (2016)
69. Zela, A., Klein, A., Falkner, S., Hutter, F.: Towards automated deep learning: Efficient joint neural architecture and hyperparameter search. In: ICML 2018 Workshop on AutoML (AutoML 2018) (2018)
70. Zhang, H., Cissé, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. arXiv preprint abs/1710.09412 (2017)
71. Zhong, Z., Yan, J., Wu, W., Shao, J., Liu, C.L.: Practical block-wise neural network architecture generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2423–2432 (2018a)
72. Zhong, Z., Yang, Z., Deng, B., Yan, J., Wu, W., Shao, J., Liu, C.L.: Blockqnn: Efficient block-wise neural network architecture generation. arXiv preprint (2018b)
73. Zhou, Y., Ebrahimi, S., Arik, S., Yu, H., Liu, H., Diamos, G.: Resource-efficient neural architect. In: arXiv:1806.07912 (2018)
74. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. In: International Conference on Learning Representations (2017)
75. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: Conference on Computer Vision and Pattern Recognition (2018)
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.