Neural Architecture Search for Dense Prediction Tasks in Computer Vision

The success of deep learning in recent years has led to a rising demand for neural network architecture engineering. As a consequence, neural architecture search (NAS), which aims at automatically designing neural network architectures in a data-driven manner rather than manually, has evolved into a popular field of research. With the advent of weight-sharing strategies across architectures, NAS has become applicable to a much wider range of problems. In particular, there are now many publications for dense prediction tasks in computer vision that require pixel-level predictions, such as semantic segmentation or object detection. These tasks come with novel challenges, such as higher memory footprints due to high-resolution data, the need to learn multi-scale representations, longer training times, and more complex and larger neural architectures. In this manuscript, we provide an overview of NAS for dense prediction tasks by elaborating on these novel challenges and surveying ways to address them, in order to ease future research and the application of existing methods to novel problems.


Introduction
With the advent of deep learning, features are no longer manually designed but rather learned in an end-to-end fashion from data, resulting in impressive results for various problems, such as image recognition (Krizhevsky et al, 2012), speech recognition (Hinton et al, 2012), machine translation (Bahdanau et al, 2015), or reasoning in games (Silver et al, 2016). This, however, led to a new design problem: the feature engineering process is replaced by engineering neural network architectures (Simonyan and Zisserman, 2015; He et al, 2016a; Szegedy et al, 2016, 2017; Howard et al, 2017; Goodfellow et al, 2014; Zhang et al, 2018; Long et al, 2015; Girshick et al, 2014; Girshick, 2015; Ren et al, 2015; Redmon et al, 2016; Liu et al, 2016; Ronneberger et al, 2015; Tan and Le, 2019; Mohan and Valada, 2020; Cheng et al, 2020a; Zhong et al, 2020b). This architectural engineering is especially prevalent for dense prediction tasks in computer vision, such as semantic segmentation, object detection, optical flow estimation, or disparity estimation. These tasks typically require complex neural architectures, often composed of various components, each having a different purpose, e.g., extracting features at different scales, fusing features across levels, or dedicated architectural heads for, e.g., generating bounding boxes or making class predictions.
Fig. 1 Visualization of the widely differing architecture search processes of Auto-DeepLab (Liu et al, 2019a) and the dense prediction cell (DPC) (Chen et al, 2018). Left: illustration of the overall architecture and which components of the architecture are searchable. Chen et al (2018) fix the encoder and search for a dense prediction cell to encode multi-scale information, while Liu et al (2019a) search for the encoder and augment it with a fixed module for multi-scale feature aggregation. DPC employs a simple blackbox optimization strategy, namely a combination of random search and local search, while Auto-DeepLab leverages a one-shot model and gradient-based NAS (Liu et al, 2019b). Right: summary of (i) the different training phases (pretraining, architecture search, re-ranking, re-training, and final evaluation), (ii) non-searchable components in each stage, and (iii) parameters that are optimized in each stage (weights w_ns associated with the non-searchable architectural components, weights w_s associated with the searchable architectural components, and searchable architectural components α_s).
Unfortunately, manually designing neural network architectures comes with some major drawbacks, reminiscent of the drawbacks of manually designing features. Firstly, it is a time-consuming and error-prone process requiring human expertise. This dramatically limits access to deep learning technologies since architecture engineering expertise is rare. Secondly, performance is limited by human imagination. Inspired by learning features from data rather than manually designing them, it seems natural to also replace manual architecture design by learning architectures from data. This process of automating architectural engineering is commonly referred to as neural architecture search (NAS).
Until recently, NAS research has mostly focused on image classification problems, such as CIFAR-10 or ImageNet, due to the computational resources in the order of hundreds or thousands of GPU days that early methods required (Zoph and Le, 2017; Zoph et al, 2018; Real et al, 2019). Compared to image classification, dense prediction tasks have barely been addressed, even though they are of high practical relevance for applications such as autonomous driving (Huang and Chen, 2020) or medical imaging (Litjens et al, 2017). These problems are intrinsically harder than image classification for several reasons: they typically come with longer training times as well as higher memory footprints due to high-resolution data, and they also require more complex neural architectures. These differences lead to even higher computational demands and make the application of many NAS approaches problematic. Early works on NAS for dense prediction tasks (e.g., Chen et al (2018); Ghiasi et al (2019)) are thus limited to optimizing only small parts of the overall architectures, while still requiring enormous computational resources even when employing various tricks to speed up the search process.
Fortunately, recent weight-sharing approaches via one-shot models (Saxena and Verbeek, 2016; Bender et al, 2018; Pham et al, 2018; Liu et al, 2019b; Cai et al, 2019; Xie et al, 2019) have dramatically reduced the computational costs to essentially the same order of magnitude as training a single network, making NAS applicable to a much wider range of problems. This has led to an increasing interest in developing NAS approaches tailored towards dense prediction tasks. However, due to the complex nature of the problem, these approaches vary vastly, as illustrated in Figure 1. With this survey, we aim to provide guidance on the most important design decisions.
This manuscript is structured as follows: in Section 2, we briefly review NAS. We then discuss NAS for dense prediction tasks in general in Section 3. In the remaining sections, we focus on the specific problems of semantic segmentation (Section 4) and object detection (Section 5) and conclude by discussing other, less-studied but promising applications (Section 6).

A Brief Recap of NAS
We briefly review neural architecture search; please refer to the survey by Elsken et al (2019b) for a more comprehensive overview. NAS is commonly characterized along three dimensions: the search space, the search strategy, and the performance estimation strategy. The search space defines which architectures can be discovered in principle. Searchable components of an architecture can be architectural hyperparameters, such as the number of layers, the number of filters, or the kernel sizes of convolutional layers, but also the layer types themselves, e.g., whether to use a convolutional or a pooling layer. Furthermore, NAS methods can optimize how layers are connected to each other, i.e., they search for the topology of the graph associated with a neural network.
Building prior knowledge about neural network architectures into a search space can simplify the search. For instance, inspired by popular manually designed architectures, such as ResNet (He et al, 2016b) or Inception-v4 (Szegedy et al, 2017), Zhong et al (2018) and Zoph et al (2018) proposed to search for repeatable building blocks (referred to as cells) rather than the whole architecture. These building blocks are then simply stacked in a pre-defined manner to build the full model.
Restricting the search space to repeating building blocks limits methods to only optimizing these building blocks rather than also discovering novel connectivity patterns and ways of constructing architectures on a macro level from a set of building blocks. Yang et al (2020) show that the most commonly used search space is indeed very narrow in the sense that almost all architectures perform well. As a consequence, simple search methods, such as random search, can be competitive (Li and Talwalkar, 2019; Yu et al, 2020a; Elsken et al, 2017). We note that this does not necessarily hold for richer, more diverse search spaces (Bender et al, 2020; Real et al, 2020). In contrast, one could also build as little prior knowledge as possible into the search space, e.g., by searching over elementary mathematical operations (Real et al, 2020); however, this significantly increases the search cost. In general, there is a trade-off between search efficiency and the diversity of the search space.
Common search strategies for finding an optimal architecture within a search space are black-box optimizers, such as evolutionary algorithms (Stanley and Miikkulainen, 2002; Real et al, 2017; Liu et al, 2018a; Real et al, 2019; Elsken et al, 2019a), reinforcement learning (Zoph and Le, 2017; Baker et al, 2017a; Zhong et al, 2018; Zoph et al, 2018), or Bayesian optimization (Swersky et al, 2013; Mendoza et al, 2016; Kandasamy et al, 2018; Oh et al, 2019; White et al, 2019; Ru et al, 2021b). As these methods typically require training hundreds or thousands of architectures and thus incur high computational costs, several newer methods tailored towards NAS go beyond this black-box view. A popular approach to speeding up the search is to employ a continuous relaxation of the architecture search space (Liu et al, 2019b), which allows for gradient-based optimization. In this line of research, rather than making a discrete decision for choosing one out of many candidate operations (such as convolution or pooling), a weighted sum of candidates is used, where the weights can be interpreted as a parameterization of the architecture.
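As an illustration, the continuous relaxation can be sketched in a few lines. The toy operations and the 1-D "feature map" below are illustrative stand-ins, not the implementation of any specific method:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Toy candidate operations on a 1-D "feature map" (stand-ins for
# convolution, pooling, etc. in a real search space).
OPS = {
    "identity": lambda x: x,
    "avg_pool3": lambda x: np.convolve(x, np.ones(3) / 3.0, mode="same"),
    "zero": lambda x: np.zeros_like(x),
}

def mixed_op(x, alpha):
    """Continuously relaxed layer: instead of a hard choice between
    candidate operations, output their softmax(alpha)-weighted sum.
    The architecture parameters alpha can then be optimized by gradient
    descent jointly with the network weights."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, OPS.values()))

x = np.array([1.0, 2.0, 3.0, 4.0])
alpha = np.array([2.0, 0.0, -2.0])  # learnable architecture parameters
y = mixed_op(x, alpha)

# After the search, the relaxation is discretized by keeping the
# operation with the largest architecture weight.
chosen = list(OPS)[int(np.argmax(alpha))]  # -> "identity"
```

After convergence of the architecture parameters, the weighted sum collapses to a discrete architecture by the argmax step shown at the end.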
The objective function to be optimized by NAS methods is typically the performance an architecture would obtain after running a predefined (or also optimized) training procedure. However, this true performance is typically too expensive to evaluate. Therefore, various methods for estimating the performance have been developed. A common strategy to speed up training is to employ lower-fidelity estimates, e.g., training for fewer epochs, training on subsets of data or downscaled images, and using downscaled architectures in the search phase (Chrabaszcz et al, 2017; Baker et al, 2017b; Zoph et al, 2018; Zela et al, 2018; Zhou et al, 2020). Another popular approach is to employ weight sharing between architectures within one-shot models (Saxena and Verbeek, 2016). Further approaches include learning surrogate models that predict performance (Wen et al, 2020; Siems et al, 2020; Dudziak et al, 2020), considering learning curves (Domhan et al, 2015; Ru et al, 2021a; Baker et al, 2017b; Klein et al, 2017), or zero-cost methods that are typically based on the statistics of an architecture or a single forward pass through the architecture (Mellor et al, 2021; Lee et al, 2019; Abdelfattah et al, 2021).
We refer the interested reader to White et al (2021) for a recent overview and comparison of such approaches.
Hardware-awareness. Recently, many researchers have also considered the resource consumption of neural networks, e.g., in terms of latency, model size, or energy consumption, as objectives in NAS, since these are severely limited in many applications of deep learning. The importance of this fact is reflected by a whole line of research on manually designing top-performing yet resource-efficient architectures (Iandola et al, 2016; Howard et al, 2017; Sandler et al, 2018; Zhang et al, 2018; Ma et al, 2018; Gholami et al, 2018). Many NAS methods for dense prediction tasks by now also consider such requirements, e.g., Zhang et al (2019) and Chen et al (2020a). Typically this is achieved either by adding a regularizer penalizing excessive resource consumption to the objective function (Cai et al, 2019; Tan et al, 2019) or by multi-objective optimization (Elsken et al, 2019a; Lu et al, 2019). We refer the interested reader to Benmeziane et al (2021) for a more general discussion of this topic.
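As a hedged sketch, a latency regularizer for a relaxed search space could look as follows. The per-operation latency table and the penalty weight `lam` are hypothetical placeholders; real methods measure or predict latency on the target hardware:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Hypothetical per-operation latencies (ms), e.g. from a lookup table
# measured on the target hardware for each candidate operation.
LATENCY_MS = np.array([0.1, 1.0, 5.0])  # zero-op, pooling, convolution

def expected_latency(alpha):
    """Differentiable latency estimate of a relaxed architecture:
    the softmax(alpha)-weighted average of per-candidate latencies."""
    return float(softmax(alpha) @ LATENCY_MS)

def regularized_objective(task_loss, alpha, lam=0.01):
    # Add a penalty on expected resource consumption to the task loss,
    # steering the search towards cheaper architectures.
    return task_loss + lam * expected_latency(alpha)

alpha = np.array([4.0, 0.0, 0.0])  # strongly favours the cheapest op
loss = regularized_objective(task_loss=0.5, alpha=alpha)
```

Because the expected latency is differentiable in the architecture parameters, it can be minimized jointly with the task loss by gradient descent.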

NAS for Dense Prediction Tasks
Before discussing specific tasks in later sections, we first look at the aforementioned dimensions (search space, search strategy, and performance estimation) more generally in the context of dense prediction tasks, which come with novel challenges. For instance, compared to image classification problems, where one typically searches for encoder-like architectures, dense prediction tasks usually require much more complex architectures, e.g., to generate multi-scale features (which have been shown to be helpful for dense prediction tasks (Lin et al, 2017)). We refer to Figure 3 for an illustrative comparison of commonly employed high-level architectures for image classification, semantic segmentation, and object detection. Training models for dense prediction tasks is typically also much more demanding than for image classification, for at least three reasons: firstly, the spatial resolution is often higher, e.g., 32 × 32 for CIFAR-10 (Krizhevsky, 2009) or 224 × 224 for ImageNet (Russakovsky et al, 2015) compared to 1024 × 2048 for Cityscapes (Cordts et al, 2016). Secondly, the network's output is considerably larger, as it requires a per-pixel prediction rather than a single prediction for the whole image. Consequently, the employed neural networks also tend to be much bigger. Both of these reasons lead to higher computational costs for training as well as higher memory footprints. Lastly, data sets are typically smaller due to the higher annotation effort. Consequently, networks are often pretrained on other data sets to increase performance, again resulting in increased computational costs and complexity of the training pipeline.
Search space.Due to the increased architectural complexity, many researchers focus on optimizing one component of the architecture for simplicity and efficiency.
For the encoder (commonly referred to as the backbone), the search spaces are often similar to search spaces for image classification, but more complex architectural building blocks are used. For image classification, these blocks are typically elementary operations, such as convolution or pooling layers. For dense prediction tasks, on the other hand, it is common to employ already pre-optimized blocks from state-of-the-art image classification networks (Shaw et al, 2019; Wu et al, 2019a; Bender et al, 2020; Chen et al, 2019; Guo et al, 2020), such as MobileNetV2 (Sandler et al, 2018), MobileNetV3 (Howard et al, 2019), or ShuffleNetV2 (Ma et al, 2018), and to solely search over their architectural hyperparameters (e.g., the kernel size or the number of filters). For other parts of the architectures, search spaces are typically built around well-performing manually designed architectures. For example, Chen et al (2018) search for a dense prediction cell inspired by operations from DeepLab (Chen et al, 2017; Chen et al, 2018) and PSPNet (Zhao et al, 2017), Xu et al (2019) build their space to contain FPN (Lin et al, 2017) and PANet (Liu et al, 2018b), and the space of Liu et al (2019a) contains the architectures proposed by Noh et al (2015), Newell et al (2016), and Chen et al (2017).
Performance estimation plays an important role in making the search costs feasible. While lower-fidelity estimates, such as conducting the search on downscaled models and training for fewer iterations, are employed just like in NAS for image classification, NAS for dense prediction tasks also saves computational costs in other ways. Common approaches include employing pretrained models (Chen et al, 2018; Nekrasov et al, 2019; Guo et al, 2020; Wang et al, 2020; Chen et al, 2020a), caching features generated by a backbone (Chen et al, 2018; Wang et al, 2020; Nekrasov et al, 2019), or using not just a smaller but potentially also a different backbone architecture (Chen et al, 2018; Ghiasi et al, 2019) during the search process. Lower-fidelity estimates are often used in multiple search phases. In the first stage, architectures are screened in a setting where they are cheap to evaluate (e.g., by using the tricks discussed above). Once a pool of well-performing architectures or candidate operations is identified, this pool is re-evaluated in a setting closer to the target setting (e.g., by scaling the model up to the target size or by training for more iterations). For example, Chen et al (2018) explore 28,000 architectures in a first stage with a downscaled and pretrained backbone, which is frozen during the search. The authors then choose the top 50 architectures found and train all of them fully to convergence. Rather than selecting top-performing architectures, Guo et al (2020) propose a sequential screening of the search space to identify and remove poorly performing operations. All components of the architecture can then be jointly optimized on the reduced search space, which would have been infeasible on the full space due to memory limitations.
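The multi-stage screening described above can be summarized in a short sketch. The candidate pool and the two evaluation functions are toy stand-ins for real proxy and target-task evaluations:

```python
def multi_fidelity_search(candidates, cheap_eval, full_eval, k=50):
    """Two-stage performance estimation: screen all candidates with a
    low-fidelity proxy (e.g. downscaled/frozen backbone, few training
    iterations), then re-evaluate only the top-k candidates in a
    setting closer to the target task."""
    # Stage 1: cheap screening of the whole pool.
    ranked = sorted(candidates, key=cheap_eval, reverse=True)
    # Stage 2: expensive re-ranking of the shortlist.
    return max(ranked[:k], key=full_eval)

# Toy usage: "architectures" are integers, and the cheap proxy is a
# slightly biased version of the true objective (both hypothetical).
candidates = range(100)
proxy = lambda a: -(a - 60) ** 2   # cheap, slightly biased estimate
true = lambda a: -(a - 62) ** 2    # expensive "ground-truth" score
best = multi_fidelity_search(candidates, proxy, true, k=10)  # -> 62
```

Note how the second stage corrects the bias of the proxy, provided the true optimum survives the first-stage shortlist; choosing k too small risks discarding it.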
Search strategies employed for dense prediction tasks are often built upon image classification methods. For example, many methods (Xu et al, 2019; Liu et al, 2019a; Saikia et al, 2019; Zhang et al, 2019; Guo et al, 2020) use gradient-based techniques such as DARTS (Liu et al, 2019b), often in its first-order approximation for computational reasons. Reinforcement learning-based approaches are also re-used (Ghiasi et al, 2019; Chen et al, 2020a; Du et al, 2020; Wang et al, 2020; Bender et al, 2020), and Chen et al (2019) employ an evolutionary algorithm in combination with a one-shot model, as proposed by Guo et al (2019) for image classification.
4 Semantic Image Segmentation

Design Principles
Semantic segmentation refers to the task of assigning a class label to each pixel of an image. The semantic segmentation model is trained to learn a mapping f : R^(w×h×c) → P^(w×h), where w × h refers to the spatial resolution, c to the number of input channels, and P = {1, . . . , C} to the set of class labels, with C being the number of classes. Long et al (2015) proposed to address this problem with deep learning by adapting image classification networks to produce dense outputs with fully convolutional neural networks. Related tasks to which the NAS methods discussed below can be applied without considerable changes are instance segmentation, which requires segmenting each object instance, and panoptic segmentation, which unifies semantic and instance segmentation. Popular data sets for semantic segmentation include PASCAL VOC (Everingham et al, 2015), Cityscapes (Cordts et al, 2016), ADE20K (Zhou et al, 2016, 2017), CamVid (Brostow et al, 2008a,b), and MS COCO (Lin et al, 2014).
Several years of manual neural architecture engineering for semantic segmentation have identified several concepts that can be used when designing a search space for NAS:

1. Encoder-decoder macro-architecture (Long et al, 2015; Ronneberger et al, 2015): while the input and output of a semantic segmentation model have the same spatial resolution, addressing the "what is in an image?" question typically requires integrating long-range spatial dependencies in the input. As this becomes easier on downsampled representations of the input, a popular approach is to use an encoder that gradually decreases the spatial resolution while generating increasingly abstract representations of the input. Using such a scale-decreased encoder has the additional advantages of being more computationally efficient and of allowing the use of adapted feature extractors that were pretrained for image classification on ImageNet as encoders. Answering the "where?" question at full spatial resolution is the task of the decoder, which learns to gradually upsample the lower-resolution output of the encoder to the full image resolution.

2. Skip connections (Ronneberger et al, 2015): while the encoder-decoder macro-architecture is efficient in addressing the "what?" question, the low-resolution bottleneck between encoder and decoder loses spatial precision, which makes it unnecessarily difficult to adequately answer the "where?" question. One way of addressing this is to add higher-resolution skip connections between encoder and decoder that bypass the bottleneck. The popular U-Net architecture (Ronneberger et al, 2015) introduces these skip connections between identical spatial resolutions of the encoder and the decoder in order to avoid any loss of detail through the bottleneck.

Fig. 3 High-level illustration of architectures employed for different tasks. Top left: typical encoder-like architecture (blue) for image classification problems; predictions are made based on low-resolution but semantically strong features. Top right: typical architecture for tasks like semantic segmentation; semantically strong features are generated for all scales by augmenting the encoder with a decoder (red). Bottom: semantically strong features from all scales serve as the input for the object detection head (green); note that the feature maps within the encoder and decoder might be densely connected, and feature maps in the decoder might be connected to any other feature map in the encoder as well as the decoder.
3. Common building blocks: building blocks used in neural architectures for image classification, such as residual or dense blocks, can be readily re-used in the encoder architecture. Moreover, search spaces of neural cells for image classification can also be utilized when applying NAS to the encoder part of semantic segmentation.

4. Multi-scale integration: augmenting encoder-decoder-like macro-architectures with a specific component that supports multi-scale integration helps capture long-range dependencies. (Atrous) spatial pyramid pooling (ASPP) (He et al, 2015; Chen et al, 2018) is one popular approach to this.
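A minimal numerical sketch of design principles 1, 2, and 4, assuming nearest-neighbour resizing and simple averaging in place of learned convolutional blocks:

```python
import numpy as np

def down2(x):
    # 2x average pooling: the encoder's scale-decreasing "what?" path
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up2(x):
    # 2x nearest-neighbour upsampling: the decoder's "where?" path
    return x.repeat(2, axis=0).repeat(2, axis=1)

def encoder_decoder_with_skip(x):
    skip = x                   # high-res features bypass the bottleneck
    bottleneck = down2(x)      # abstract, low-resolution representation
    decoded = up2(bottleneck)  # restore the spatial resolution
    # U-Net-style skip connection: fuse encoder and decoder features of
    # identical resolution (here by averaging instead of concatenation).
    return 0.5 * (skip + decoded)

def spatial_pyramid_pool(x, levels=(1, 2)):
    # Crude multi-scale integration: pool at several grid sizes and
    # broadcast the pooled context back to full resolution.
    h, w = x.shape
    branches = [x]
    for g in levels:
        pooled = x.reshape(g, h // g, g, w // g).mean(axis=(1, 3))
        branches.append(pooled.repeat(h // g, axis=0).repeat(w // g, axis=1))
    return np.stack(branches)  # "concatenate" along a channel axis

x = np.arange(16.0).reshape(4, 4)
y = encoder_decoder_with_skip(x)  # same spatial resolution as x
z = spatial_pyramid_pool(x)       # 3 "channels" of 4x4 context
```

In a real architecture, each of these hand-written steps (pooling factors, fusion operation, pyramid levels) is exactly the kind of discrete choice a NAS search space can expose.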
A NAS search space for semantic segmentation based on these design principles can thus learn (a) building blocks/cells used in the encoder, (b) the downsampling strategy of the encoder, (c) the building blocks/cells of the decoder, (d) the upsampling strategy of the decoder, (e) where and how to add skip connections between encoder and decoder, and (f) how to perform multi-scale integration.
Learning only (a) and/or (c) is similar to so-called micro search, since the backbone/decoder is fixed and the NAS algorithm only searches for the optimal structure of the building blocks. In contrast, we refer to approaches that optimize at least one of the other components besides (a) and/or (c) as macro search. While the components above are the canonical components to be optimized, we also note that a promising direction for future work on NAS for semantic segmentation is to define search spaces that allow exploring architectures that do not follow the predominant encoder-decoder design principle (Du et al, 2020).

NAS for Semantic Segmentation
We refer to Table 1 for an overview and comparison of the methods that we discuss in the following.The table is structured according to the criteria discussed in Section 3.
Backbone Search. Auto-DeepLab (Liu et al, 2019a) builds upon the DARTS (Liu et al, 2019b) search space and algorithm, which were initially designed for learning optimal convolutional cells for image classification. The authors extend DARTS to the semantic segmentation task by also considering the macro architecture: the method does not only search for an optimal cell structure but also for the optimal spatial resolution of the feature maps that each cell processes. More specifically, a cell C_{l,s} at layer l that outputs a tensor with spatial resolution s can learn to process input tensors from previous layers whose output tensors have resolution s/2, s, or 2s. This is achieved by continuously relaxing these discrete choices, just as for the operation choices inside the cells. This results in multiple network outputs, each having a different spatial resolution. Each of these outputs is connected to an ASPP module (Chen et al, 2018). For optimizing the architectural weights (of both the cell and the macro architecture), the authors utilize first-order DARTS.
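The relaxation over spatial resolutions can be sketched as follows. This is a simplified stand-in for Auto-DeepLab's actual formulation: `beta` denotes assumed architecture weights over the three resolution choices, and average pooling/nearest-neighbour upsampling replace learned resizing operations:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def down2(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up2(x):
    return x.repeat(2, axis=0).repeat(2, axis=1)

def relaxed_resolution_input(f_half, f_same, f_double, beta):
    """Continuous relaxation of the macro architecture: a cell at target
    resolution s may consume features at resolution s/2, s, or 2s.  All
    three candidates are resized to s and combined with softmax weights
    beta, which are learned jointly with the cell parameters."""
    w = softmax(beta)
    return w[0] * up2(f_half) + w[1] * f_same + w[2] * down2(f_double)

s = 4
f_half = np.ones((s // 2, s // 2))    # coarser features (resolution s/2)
f_same = np.ones((s, s))              # same-resolution features
f_double = np.ones((2 * s, 2 * s))    # finer features (resolution 2s)
beta = np.zeros(3)                    # uniform weights at search start
out = relaxed_resolution_input(f_half, f_same, f_double, beta)
```

After the search, the resolution path is discretized, e.g., by following the largest weights, yielding a concrete downsampling trajectory through the network.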
A line of follow-up work improves Auto-DeepLab in various directions. Chen et al (2020b), Shaw et al (2019), and Lin et al (2020) search for efficient architectures (e.g., by means of latency) by adding a regularizer for hardware costs. Various more powerful search spaces have also been proposed, e.g., covering channel expansion ratios and multi-branch architectures (Chen et al, 2020b), employing stronger building blocks such as inverted residual blocks (Shaw et al, 2019) rather than simple convolutions, or removing the typical constraint that the cell topology is shared across the whole architecture (Lin et al, 2020). Zhang et al (2021) address DARTS' (and therefore also Auto-DeepLab's) problem of keeping the entire one-shot model in memory; this is done by sampling paths in the one-shot model rather than training the entire model at once, similar to the approaches by Xie et al (2019) and Dong and Yang (2019). Due to this memory efficiency, the search is conducted directly on the target search space and data set rather than employing a proxy task.
Semantic segmentation is particularly important for medical image analysis (Ronneberger et al, 2015), and consequently, NAS methods have also been applied to medical image data sets. NAS-Unet (Weng et al, 2019) employs ProxylessNAS (Cai et al, 2019) to automatically search for a set of downsampling and upsampling cells that are connected using a U-Net-like (Ronneberger et al, 2015) backbone. Yu et al (2020b) and Zhu et al (2019) consider 3D medical image segmentation. For this task, Coarse-to-Fine NAS (C2FNAS) (Yu et al, 2020b) uses a search space inspired by the one employed in Auto-DeepLab (Liu et al, 2019a) and an evolutionary strategy operating on clusters of similar networks to search for the macro structure of the model, whilst the operation choices inside the cells of the macro structure are randomly sampled, similar to the protocol of Li and Talwalkar (2019). Finally, V-NAS (Zhu et al, 2019) extends DARTS to encoder-decoder architectures for volumetric medical image segmentation.
Multi-Scale Feature Search. Chen et al (2018) employ NAS for dense prediction tasks in order to search for a better multi-scale feature extractor, called dense prediction cell (DPC), given a fixed backbone network. The proposed search space is a micro search space that contains, e.g., atrous separable convolutions with different rates and average spatial pyramid pooling, inspired by DeepLabv3 (Chen et al, 2018). They run a combination of random search and local search to optimize the dense prediction cell given a fixed, pretrained backbone, which, despite the use of a series of proxy tasks, still required 2600 GPU days.
Nekrasov et al (2019) also consider a fixed encoder network and search for an optimal decoder architecture together with the respective connections to the encoder layers. The decoder architecture is modeled as a sequence of cells sharing the same structure, which processes the inputs from the encoder layers. The authors utilize various heuristics to speed up the architecture search. For example, they freeze the weights of the encoder network and train only the decoder part (as already done for DPC) and stop the training of architectures with poor performance early. Moreover, a knowledge distillation loss (Hinton et al, 2015) is employed, as well as an auxiliary cell to reduce the training time. Rather than using random and local search, a controller trained with reinforcement learning is employed to sample candidate architectures, similar to Zoph et al (2018).
In follow-up work, Nekrasov et al (2020) extend their approach to semantic video segmentation by learning a dynamic cell that aggregates the information from previous and current frames to output segmentation masks.

Since the search costs depend on the hardware and are also not explicitly mentioned in every paper, we only categorize them as "small" and "high". We assign the cost label "small" to methods that can be run within a week on a server with eight GPUs, i.e., in less than 56 GPU days. Methods with "high" search costs typically employ a large-scale, distributed infrastructure, resulting in hundreds or thousands of GPU/TPU days of compute.
Joint Search and Novel Design Principles. While previous work considers optimizing either the encoder or the decoder, Customizable Architecture Search (CAS) (Zhang et al, 2019) searches for both an optimal backbone and a multi-scale feature extractor, albeit in a sequential manner. For the backbone, a normal cell (which preserves the spatial resolution and the number of feature maps) and a reduction cell (which reduces the spatial resolution and increases the number of feature maps) are optimized. Once these two cells have been determined, a multi-scale cell is optimized to learn how to integrate spatial information from the backbone. Rather than searching for optimized building blocks for the encoder and/or the decoder, Wu et al (2019b) propose to search for the connectivity pattern between the two components, which is typically fixed in other work. The encoder and decoder are first densely connected, with each connection weighted by a real-valued parameter. This real-valued parameterization of the connections allows for gradient-based optimization as in DARTS (Liu et al, 2019b). The authors also propose a loss function for inducing a sparse connectivity pattern.
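The idea of a real-valued, sparsity-regularized connectivity pattern can be sketched as follows. The fusion by weighted summation, the L1 penalty weight, and the pruning threshold are simplifying assumptions, not the exact formulation of Wu et al (2019b):

```python
import numpy as np

def fuse_with_learned_connectivity(encoder_feats, gamma, l1_weight=0.01):
    """Sketch of searching over encoder-decoder connectivity: every
    encoder feature map is connected to a decoder node with a learnable
    real-valued weight gamma[i], enabling gradient-based optimization.
    An L1 penalty on gamma encourages a sparse connectivity pattern, so
    that most connections can be pruned after the search."""
    fused = sum(g * f for g, f in zip(gamma, encoder_feats))
    sparsity_loss = l1_weight * np.abs(gamma).sum()
    return fused, sparsity_loss

feats = [np.full((4, 4), v) for v in (1.0, 2.0, 3.0)]
gamma = np.array([0.9, 0.0, 0.1])  # hypothetical learned weights
fused, penalty = fuse_with_learned_connectivity(feats, gamma)

# Connections with near-zero weight (here the second one) are dropped
# when the searched connectivity is discretized.
kept = [i for i, g in enumerate(gamma) if abs(g) > 0.05]  # -> [0, 2]
```

The L1 term pushes unneeded connection weights towards zero during the search, so the final thresholding step removes little information.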
Closely related to NAS techniques, Li et al (2020) extend Auto-DeepLab by considering a differentiable gating function that learns data-dependent routes that propagate information at different scales depending on the input image.Moreover, they also consider budget constraints in their objective and use a one-shot model to find routing schemes that require fewer FLOPs compared to Auto-DeepLab.

Design Principles
Object detection (Liu et al, 2020) refers to the task of identifying whether and how many objects of predetermined categories are present in an input (e.g., an image) and, for each identified object, determining its category as well as its spatial localization. Spatial localization can be represented in different ways, the most common one being a 2D bounding box in image space, encoded by a 4D real-valued vector. However, other representations, such as pixel-wise segmentations, are possible as well. We note that in contrast to semantic segmentation, deep learning-based object detection often has a post-processing step that maps from dense network outputs to a sparse set of object detections, e.g., using non-maximum suppression. However, this post-processing is typically fixed and not used during training, and it is thus also ignored during NAS (we note that applying NAS to this post-processing would be an interesting future direction). Moreover, deep learning-based object detection can be split into one-stage and two-stage approaches. Two-stage approaches first identify the presence and extent of an arbitrary object at a position and thereupon apply a region classifier to the identified object region to classify the category of the object and (optionally) refine its spatial localization. In contrast, single-stage approaches directly predict the presence of an object, its class, and its spatial localization in a single forward pass.
Since objects can have vastly different scales, multi-scale approaches are typically applied for single-stage object detection. This can be achieved either by attaching "detection heads" at layers of different spatial resolutions or by combining features of different layers; effectively, this results in certain network outputs (those corresponding to lower resolutions) specializing on larger objects and higher-resolution outputs on smaller ones. In this case, the dense prediction task can be framed as f : R^(w×h×c) → [P^(w×h) × R^(w×h×b), P^(w/2×h/2) × R^(w/2×h/2×b), . . .], where w × h refers to the spatial resolution, c to the number of input channels, b to the number of parameters encoding the spatial localization, and P = {−1, 1, . . . , C} to the set of class labels, with C being the number of classes and −1 corresponding to the "no object" class.
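To make this output structure concrete, here is a toy sketch of heads attached to a feature pyramid. The linear per-pixel heads, the random weights, and all sizes are hypothetical; real detectors use convolutional heads:

```python
import numpy as np

NUM_CLASSES = 3  # hypothetical; the "no object" class gets its own logit
BOX_DIMS = 4     # e.g. (x, y, width, height) of a 2D bounding box

def detection_head(features, rng):
    """Toy single-stage head: per-pixel class scores (including the
    "no object" class) and box regression from channel features."""
    h, w, c = features.shape
    w_cls = rng.normal(size=(c, NUM_CLASSES + 1))
    w_box = rng.normal(size=(c, BOX_DIMS))
    return features @ w_cls, features @ w_box

# A feature pyramid at decreasing resolution (e.g. from an FPN-style
# decoder); coarse levels specialize on large objects, fine on small.
rng = np.random.default_rng(0)
pyramid = [rng.normal(size=(8, 8, 16)),
           rng.normal(size=(4, 4, 16)),
           rng.normal(size=(2, 2, 16))]
outputs = [detection_head(f, rng) for f in pyramid]
```

Each pyramid level yields one (class scores, box parameters) pair per pixel, matching the list-of-resolutions codomain of f above.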
Many of the design principles of semantic segmentation carry over to object detection. However, there are also notable differences:
- Since the network requires a dense and multi-scale output, a further design choice is how the "detection heads" generating these multi-scale outputs are attached to the main network.
- The heads' architecture itself is another open design choice.
- Two-stage object detection can impose complex interdependencies between the architectures of the two stages, making the design of a search space covering both stages together challenging.

NAS for Object Detection
We summarize the methods that we discuss in the following in Table 2. The table is again structured according to the criteria discussed in Section 3. Early work on NAS for object detection focuses on optimizing either the backbone or the multi-scale feature extractor. We start by discussing these two orthogonal directions and then move on to methods that jointly search all components and to other search space design principles.

Backbone Search. Since NAS originates from image classification, and researchers typically employ image classification architectures as backbones for object detection, it is not surprising that a comprehensive line of research adapts existing methods to optimize the backbone (Chen et al, 2019; Bender et al, 2020). Bender et al (2020) propose TuNAS, inspired by ProxylessNAS (Cai et al, 2019) and ENAS (Pham et al, 2018), for image classification and also evaluate it on object detection, with only minor hyperparameter adjustments required. In contrast to most other work, Bender et al (2020) train a one-shot model from scratch directly on the target task rather than employing pretraining. To improve scalability with respect to the search space, the authors propose more aggressive weight sharing across candidate choices, e.g., sharing filter weights across convolutions with different numbers of filters. Furthermore, the memory footprint when training the one-shot model is dramatically reduced by "rematerialization", i.e., recomputing intermediate activations rather than storing them. The authors also propose a novel hardware regularizer that allows finding models close to a desired hardware cost. In follow-up work (Xiong et al, 2020), the performance of TuNAS is further improved through a more powerful search space.
Rather than searching for an architecture from scratch, Peng et al (2019) propose to transform a given, well-performing backbone. Motivated by improving the effective receptive field size of convolutions, the authors search over various dilation rates. For each dilation rate, channels are grouped to allow different dilation rates for different groups. The parameters of the convolutional kernels are inherited from a pretrained model and shared across all rates to avoid additional parameters. Gradient-based architecture search on a continuous relaxation of the search space is then used to find the optimal dilation rate for each channel group. In a similar fashion, Jiang et al (2020) modify an existing, well-performing and pretrained backbone by applying network morphisms (Wei et al, 2016), which are commonly used in NAS (Elsken et al, 2017, 2019a; Cai et al, 2018a,b), to improve the backbone. Since network morphisms transfer the performance of the parent network to the child network, the child network does not need to be trained from scratch, and thus the authors avoid pretraining all candidate architectures on ImageNet, which would be infeasible. In a first search phase, a purely sequential model is optimized, while a second search phase adds parallel branches to enable more powerful architectures.
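As a toy illustration of a function-preserving network morphism (a generic Net2Net-style widening operator on a linear model, not the specific operators used by the methods above), duplicating a hidden unit and halving its outgoing weights leaves the computed function unchanged:

```python
def mlp_forward(x, W1, W2):
    """Tiny 2-layer linear model (no nonlinearity, for simplicity):
    y = W2 @ (W1 @ x), with weights stored as lists of rows."""
    h = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]

def widen(W1, W2, unit):
    """Function-preserving widening morphism: replicate hidden `unit`
    (incoming weights copied) and split its outgoing weights in half,
    so parent and child networks compute exactly the same outputs."""
    W1_new = W1 + [W1[unit][:]]
    W2_new = [row[:unit] + [row[unit] / 2] + row[unit + 1:] + [row[unit] / 2]
              for row in W2]
    return W1_new, W2_new
```

Because the child starts with the parent's performance, training can continue from there instead of restarting from scratch, which is the property these methods exploit to avoid repeated ImageNet pretraining.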
Multi-Scale Feature and Head Search. Orthogonal to the methods discussed above, another line of research focuses on optimizing multi-scale feature extractors as well as the object detection head. Ghiasi et al (2019) employ a reinforcement learning-based NAS framework (Zoph and Le, 2017; Zoph et al, 2018) to search for NAS-FPN, an improved feature pyramid network (FPN) (Lin et al, 2017) yielding multi-scale features. In follow-up work, Chen et al (2020a) extend the search space and employ MnasNet (Tan et al, 2019) as a search method to optimize not only performance but also latency in order to find efficient networks. This is in contrast to NAS-FPN, where lightweight architectures are searched only after manually adapting non-searchable components to be efficient. As both NAS-FPN and MnasFPN are based on expensive, black-box NAS methods that train around 10,000 architectures, they require substantial computational resources, on the order of hundreds of TPU days.
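The basic merge primitive that such FPN search spaces build on can be sketched as follows (a minimal illustration using nested lists for feature maps and a fixed upsample-and-sum merge; NAS-FPN searches over which maps to combine and with which operations):

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D feature map (list of lists)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.extend([wide, list(wide)])             # repeat each row
    return out

def merge_sum(low_res, high_res):
    """FPN-style merge step: bring the coarser map to the finer resolution
    and combine the two elementwise."""
    up = upsample2x(low_res)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(up, high_res)]
```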
While the aforementioned work focuses on the FPN module, Wang et al (2020) additionally optimize the object detection head on top of the multi-scale features. This is done by using RL to first search for an FPN-like module and afterwards for a detection head. For the FPN module, similar to Ghiasi et al (2019), the RL controller chooses feature maps from a list of candidates, an elementary operation to process them, and the way to merge them with another candidate. Once an optimal FPN is found, it is used to search for a suitable head. While typically the weights of the head are shared across all levels of the feature pyramid, Wang et al (2020) also search over an index indicating from which layer on to share weights, while all layers of the head architecture before this index can have different weights for different pyramid levels. As the backbone architecture is not optimized, they pre-compute the output features of the backbone to make the search more efficient. Xu et al (2019) propose Auto-FPN, a method for searching for a multi-scale feature extractor and a detection head, based on a continuous relaxation and gradient-based optimization as done by Liu et al (2019b). Again, a cell-based search space is used for both components. The search is conducted in a sequential manner (i.e., the FPN is searched first and the head afterward) as in DARTS. Since the employed search strategy requires keeping the whole one-shot model in memory, it does not allow for a joint optimization in the considered setting. Zhong et al (2020a) also employ DARTS to search for a detection head. To mitigate memory problems, they propose a more efficient scheme for sharing representations across operations with different receptive field sizes by re-using intermediate representations.
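The continuous relaxation underlying such DARTS-based methods can be sketched for a single edge of the search space (scalar inputs and toy candidate operations, purely for illustration):

```python
import math

def softmax(alphas):
    """Numerically stable softmax over the architecture parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, ops, alphas):
    """DARTS-style mixed operation: the edge output is a softmax-weighted
    sum over all candidate operations, so the architecture parameters
    `alphas` can be optimized by gradient descent alongside the weights."""
    return sum(w * op(x) for w, op in zip(softmax(alphas), ops))
```

This is also where the memory pressure mentioned above comes from: every candidate operation on every edge must be evaluated (and its activations kept) in each forward pass of the one-shot model.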
Joint Search and Novel Design Principles. The works discussed so far focus on optimizing different parts of object detection architectures, but they all employ some non-searchable components. Given enough data and compute power, optimizing all components jointly should in principle dominate this approach. To give a concrete example of interaction effects between architecture components: while NAS-FPN (Ghiasi et al, 2019) in combination with a ResNet-50 (He et al, 2016a) outperforms the original FPN module combined with a ResNet-50 (suggesting that NAS-FPN yields richer multi-scale features), and the DetNAS backbone (Chen et al, 2019) in combination with FPN outperforms FPN in combination with the original ResNet-50 backbone (suggesting that DetNAS is a better backbone), Guo et al (2020) showed that the combination of the DetNAS backbone and NAS-FPN yields worse performance than ResNet-50 in combination with NAS-FPN. Therefore, they propose to search for all three components jointly. The main concern with this approach is that it can easily become computationally infeasible. To overcome this problem, a hierarchical search is proposed. In the first search phase, conducted on a small proxy task, a rich search space (built around FBNet (Wu et al, 2019a)) for all three components is explored with the goal of shaping the search space by pruning building blocks that are unlikely to be optimal. Notably, this allows starting with the same set of candidate operations for all three components, while in prior work the set of candidates is typically adapted to the specific component to be optimized (Xu et al, 2019). By imposing a regularizer enforcing sparsity among the architectural parameters in the one-shot model in the first phase, suboptimal candidates can naturally be pruned away. The second search phase uses the resulting pruned subspace to determine an optimal architecture. Both search phases employ gradient-based optimization for efficiency. Furthermore, the authors
penalize architectures with high computational costs by adding an appropriate regularization term. Similarly, Yao et al (2020) first search for the best combination of backbone, multi-scale feature extractor, region-proposal network, and detection head, with a set of possible candidates for each component (e.g., ResNet or MobileNetV2 as a backbone, or different versions of FPNs for multi-scale feature fusion). In the second stage, the best-performing combinations of these components are then fine-tuned on a more fine-grained level, e.g., by optimizing the number of channels in the chosen backbone. While all previously discussed work is guided by manually designed architectures consisting of a scale-decreasing backbone followed by multi-scale feature fusion, Du et al (2020) question this design principle and propose to search for a single network covering both components. Their approach permutes the layers of the network and searches for a better connectivity pattern between them. We highlight that for this search space consisting of layer permutations it is unclear how one-shot models could be employed, and thus the authors rely on the computationally more expensive black-box optimization via RL.
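The pruning step in such a sparsity-driven first search phase can be illustrated schematically (the softmax normalization and threshold are our illustrative choices, not the exact mechanism of the methods above):

```python
import math

def sparsify_and_prune(alphas, threshold=0.05):
    """Given architecture parameters trained under a sparsity regularizer,
    normalize them and keep only the indices of candidate operations whose
    weight exceeds the threshold, yielding the pruned subspace for the
    second search phase."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [i for i, w in enumerate(weights) if w >= threshold]
```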
6 Outlook: Promising Application Domains and Future Work

So far, most NAS research on dense prediction has focused on semantic segmentation or object detection, but there are many more dense prediction task domains to which the discussed methods could be applied or adapted.
For example, disparity estimation can be solved in an end-to-end fashion with encoder-decoder architectures (Mayer et al, 2016), and first studies in this direction have already been conducted. Saikia et al (2019) propose AutoDispNet, which extends the typical image classification search space consisting of a normal and a reduction cell by an upsampling cell in order to search for encoder-decoder architectures. The first-order approximation of DARTS is used to allow an efficient search, followed by a hyperparameter optimization of the discovered architectures using the popular multi-fidelity Bayesian optimization method BOHB (Falkner et al, 2018). Cheng et al (2020b) build upon AutoDispNet by also searching for a matching network on top of the feature extractor, inspired by recent manually designed networks for disparity estimation. Architectures discovered for disparity estimation (Saikia et al, 2019) or semantic segmentation (Nekrasov et al, 2019) have also been evaluated on depth estimation. Ulyanov et al (2018) showed that the structure of an encoder-decoder architecture employed as a generative model is already sufficient to capture statistics of natural images without any training. Thus, such architectures can be seen as a "deep image prior" (DIP), which can be used to parameterize images. On a variety of tasks, such as image denoising, super-resolution, or inpainting, a natural image could successfully be generated from random noise and a randomly initialized encoder-decoder architecture. As the authors noted that the best results are obtained by tuning the architecture for a particular task, Ho et al (2020) and Chen et al (2020c) employed NAS to search for deep image prior architectures via evolution and reinforcement learning, respectively. Differentiable architecture search has also been adapted for image denoising by Gou et al (2020).
Other promising tasks are panoptic segmentation, with some first work by Wu et al (2020), and 3D detection and segmentation (Tang et al, 2020). Finally, optical flow estimation (Dosovitskiy et al, 2015; Ilg et al, 2017; Sun et al, 2018) is a problem that has not been considered by NAS researchers so far, and it is conceivable that NAS methods could further improve performance on this task.
See, e.g., Wistuba et al (2019) for a more thorough overview. Neural architecture search (NAS) is typically framed as a bi-level optimization problem

min_{A ∈ A} L_val(D_val, A, w*_A)  s.t.  w*_A ∈ argmin_w L_train(D_train, A, w),

with the goal of finding an optimal neural network architecture A within a search space A with respect to a validation loss function L_val, a validation data set D_val, and weights w*_A of the architecture obtained by minimizing a training loss function L_train on a training data set D_train. NAS methods can be categorized along three dimensions (Elsken et al, 2019b): search space, search strategy, and performance estimation; compare Figure 2.
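A first-order alternating scheme for this bi-level problem, in the spirit of first-order DARTS, can be sketched on a toy problem (losses, learning rate, and step count are invented for illustration; here the inner optimum is w* = a, so dw*/da = 1 and the hypergradient of the validation loss is exactly 2(w* − 1)):

```python
def bilevel_search(steps=200, lr=0.1):
    """Toy bi-level optimization: alternate one gradient step on the weights
    w (training loss) with one step on the architecture parameter a
    (validation loss). Toy losses:
        L_train(w, a) = (w - a)^2   -> inner optimum w* = a
        L_val(w)      = (w - 1)^2   -> hypergradient via dw*/da = 1
    Both variables converge to 1."""
    w, a = 0.0, 0.0
    for _ in range(steps):
        w -= lr * 2.0 * (w - a)    # inner step: fit weights to current arch
        a -= lr * 2.0 * (w - 1.0)  # outer step: hypergradient, w approximates w*
    return w, a
```

In real NAS the inner problem is a full network training run, which is why weight sharing and such first-order approximations are needed to make the bi-level problem tractable.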

Table 1 Overview of different NAS methods for semantic segmentation. For search methods, EA, LS, GB, RL, and RS refer to evolutionary algorithm, local search, gradient-based, reinforcement learning, and random search, respectively. Weight sharing refers to weight sharing via one-shot models (Bender et al, 2018; Pham et al, 2018). Pretraining refers to ImageNet pretraining.
Fang et al (2020) also employ network morphisms, or more general parameter remapping methods, to initialize a one-shot model from a pretrained model and thus avoid pretraining the one-shot model.

Table 2
Overview of different NAS methods for object detection. For search methods, EA, LS, GB, and RL refer to evolutionary algorithm, local search, gradient-based, and reinforcement learning, respectively. Weight sharing refers to weight sharing via one-shot models (Bender et al, 2018; Pham et al, 2018). Pretraining refers to ImageNet pretraining. Since the search costs depend on the hardware and are not explicitly mentioned in each paper, we only categorize them as "small" and "high". We assign the label "small" to methods that can be run within a week on a server with eight GPUs, i.e., in less than 56 GPU days; methods with "high" search costs typically employ a large-scale, distributed infrastructure, resulting in hundreds or thousands of GPU/TPU days of compute.