Improved differentiable neural architecture search for single image super-resolution

Deep learning has shown clear superiority over other machine learning algorithms in Single Image Super-Resolution (SISR). To reduce the effort and resources spent on manually designing deep architectures, we apply differentiable neural architecture search (DARTS) to SISR. Because neural architecture search was originally developed for classification tasks, our experiments show that applying DARTS directly to super-resolution tasks produces many skip connections in the searched architecture, which results in poor performance of the final architecture. DARTS therefore needs to be adapted before it can be applied to SISR. Based on the characteristics of SISR, we remove redundant operations and redesign some operations in the cell to obtain an improved DARTS. We then use the improved DARTS to search for convolutional cells that serve as the nonlinear mapping part of a super-resolution network. The resulting super-resolution architecture shows its effectiveness on benchmark datasets and the DIV2K dataset.


Introduction
As a branch of computer vision [20], image super-resolution reconstructs a higher-resolution image from an observed lower-resolution counterpart, which is a notoriously challenging ill-posed inverse problem. Benefiting from the strong capacity of deep neural networks to extract effective high-level abstractions that bridge the low-resolution (LR) space and the high-resolution (HR) space, many SISR methods [19,21,22,33] are built on deep neural networks. Designing a good neural network architecture is time-consuming and laborious. To reduce the effort and resources spent on manually designing network architectures, researchers have turned their attention to neural architecture search (NAS). Neural architecture search [24,31,45] is currently a popular approach for finding effective architectures and has been applied to image classification [44], image segmentation [39], object recognition [45] and other fields. The key problem for SISR methods based on deep neural networks is to find a network architecture with better SR performance.
Generally speaking, NAS methods are based on reinforcement learning (RL) [44,45], evolutionary algorithms (EA) [31], gradient descent (GD) [4,39], Bayesian optimization [29] and so on. Compared with most of the NAS methods mentioned above, DARTS, a prominent representative of GD-based methods, greatly reduces training time and hardware requirements [25]. This leads us to apply DARTS to search for a cell-based architecture that serves as the nonlinear mapping part of a super-resolution network. However, in the original DARTS, as the number of training epochs increases, the proxy network tends to choose more skip-connect operations in the searched architecture, which leads to the collapse of the model. Our experiments demonstrate that this problem is even more serious when DARTS is applied directly to super-resolution tasks. DARTS therefore needs to be adapted before it can be applied to SR. In this paper, we optimize the connection between the intermediate nodes and the output node of the cell. Because SISR tasks require a large amount of memory and are difficult to overfit, we remove pooling operations to accelerate the training procedure. Because the proxy network prefers to choose many skip-connection operations in the searched architecture, we add an identity mapping to the convolution operations to avoid this phenomenon.

Related works
Single image super-resolution refers to the task of restoring a high-resolution image from a single low-resolution observation of the same scene. At present, image super-resolution research can be divided into three main categories: interpolation-based, reconstruction-based and learning-based methods. The earliest methods are interpolation-based, such as bicubic interpolation [18] and Lanczos resampling [11]. They are fast and simple but have some obvious shortcomings. Firstly, they assume that the gray value of a pixel changes continuously and smoothly, which is not completely true in practice. Secondly, during reconstruction the SR image is computed from a pre-defined conversion function only, and the image degradation model is not taken into account, which often results in blurring, jagged edges and other artifacts in the restored images. To address these problems, reconstruction-based SR methods [8,27,35,40] start from the degradation model of the image. They assume that the low-resolution image is obtained from the high-resolution image through motion transformation, blurring, and noise. These methods constrain the generation of super-resolution images by extracting key information from the low-resolution images and combining it with prior knowledge about the unknown super-resolution images. They can generate flexible and sharp details, but also bring some problems. As the scale factor increases, the performance of many reconstruction-based methods degrades rapidly, and these methods are usually time-consuming.
Machine learning is now widely used in different fields, such as intelligent transportation systems [14], recommendation [13,41], data translation [23], extubation failure [6] and dynamic reconfiguration [12]. Deep learning is likewise widely used in super-resolution reconstruction algorithms. Learning-based methods [5,9,10,19,21,22,32,33,34] use a large amount of training data to learn a correspondence between low-resolution and high-resolution images, and then predict the high-resolution image corresponding to a low-resolution image based on the learned mapping. Learning-based methods perform better, but designing the neural network requires a lot of manpower, computing resources and time.
To reduce the effort and resources spent on manually designing architectures, neural architecture search has attracted the attention of researchers. Most NAS approaches fall into two categories: macro search and micro search. Macro search algorithms aim to discover the entire neural network directly. Reinforcement learning (RL) [44,45], evolutionary algorithms (EA) [31] and Bayesian optimization [29] are representative, but these methods need long training time and consume many resources. Micro search algorithms aim to discover neural cells and build a neural architecture by stacking many copies of the discovered cell. Since NASNet [45] successfully searched for neural cells in the NASNet search space, more researchers have proposed methods [4,26,43] based on this search space. Notably, DARTS is simpler than many existing approaches, as it does not involve any controllers [1,30,44], hypernetworks [3] or performance predictors [24]. In addition, DARTS reduces the architecture search time to several GPU days. Considering that deep learning in super-resolution is currently used mainly to learn the mapping between low-resolution images and the corresponding high-resolution images, using DARTS to search for a nonlinear mapping network is well worth trying. In summary, considering training time and resource consumption, we use a simpler neural architecture search method (DARTS) to find an efficient architecture for super-resolution tasks.

Preliminary of differentiable architecture search
For convolutional neural networks, DARTS [25] searches for a normal cell and a reduction cell to build up the final architecture. A cell is a directed acyclic graph consisting of N nodes. Each node $x_i$ is a feature map in the cell, and each edge $(i, j)$ is associated with some operation $O^{(i,j)}$ that transforms $x_i$. Each convolutional cell has two inputs and one output. The two inputs are the outputs of the previous two cells. The output of the cell is obtained by applying a reduction operation (e.g. concatenation) to all intermediate nodes in the cell. Each intermediate node is computed from all of its predecessors (the details are shown in Fig. 1):

$$x_i = \sum_{j < i} O^{(j,i)}(x_j)$$

where $O^{(j,i)}$ is an operation applied to $x_j$, and all the resulting feature maps are summed to obtain $x_i$. In DARTS, the authors specify that the feature map of each intermediate node is obtained by operating on the feature maps of all previous nodes. Therefore, the task of learning the cell is transformed into learning the operations on its edges. Let $\mathcal{O}$ be the set of all candidate operations (e.g., convolution, zero, skip-connection), where each operation represents some function $o$ to be applied to $x_i$. The blending weights for the edge from node $i$ to node $j$ are represented by the vector $\alpha^{(i,j)}$. To make the search space continuous, the categorical choice of a particular operation is relaxed to a softmax over all possible operations:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})} \, o(x)$$

where $\alpha_o^{(i,j)}$ is the weight of operation $o$ on edge $(i, j)$ and $o(x)$ is the operation applied to feature map $x$. The weight represents the importance of the operation: the larger the weight, the more important the operation.
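The continuous relaxation described above can be illustrated with a minimal sketch. It is not the authors' implementation: feature maps are flat Python lists, and `toy_conv` is a hypothetical stand-in for a learned convolution. Only the softmax weighting over candidate operations and the summation over predecessors follow the text.

```python
import math


def softmax(alphas):
    """Numerically stable softmax over a list of architecture weights."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


# Toy candidate operations acting on a feature map stored as a flat list.
def zero(x):
    return [0.0 for _ in x]


def skip_connect(x):
    return list(x)


def toy_conv(x):
    # Hypothetical stand-in for a learned convolution: scales every value by 2.
    return [2.0 * v for v in x]


def mixed_op(x, ops, alphas):
    """Softmax-weighted sum of all candidate operations applied to x."""
    weights = softmax(alphas)
    outputs = [op(x) for op in ops]
    return [sum(w * out[k] for w, out in zip(weights, outputs))
            for k in range(len(x))]


def node_output(predecessors, ops, alphas_per_edge):
    """Intermediate node: sum of the mixed operations over all predecessors."""
    mixed = [mixed_op(x, ops, a) for x, a in zip(predecessors, alphas_per_edge)]
    return [sum(vals) for vals in zip(*mixed)]
```

With equal architecture weights every operation contributes one third of the output; as one weight grows, the mixed operation approaches that single operation, which is what makes the later discretization step meaningful.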
In the following, the authors refer to $\alpha$ as the encoding of the architecture and obtain the corresponding operations from the learned blending weights $\alpha$. For the search procedure, we denote by $L_{train}$ and $L_{val}$ the training and validation loss, respectively. The architecture parameters are then learned by solving the following bilevel optimization problem:

$$\min_{\alpha} \; L_{val}(w^{*}(\alpha), \alpha) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w} L_{train}(w, \alpha)$$

In DARTS, the authors propose an approximate iterative optimization procedure in which $w$ and $\alpha$ are optimized by alternating between gradient descent steps in the weight space and the architecture space. The details are shown in Table 1.
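To make the alternating procedure concrete, the following sketch runs it on a scalar toy problem. Everything in it is an assumption made for illustration: the losses L_train(w, α) = (w − α)² and L_val(w, α) = (wα − 1)², the step sizes, and the names `toy_search` and `xi` (the step size of the one-step look-ahead on the weights) are hypothetical and are not taken from the paper. Setting `xi=0` gives the first-order approximation.

```python
def train_loss_grad_w(w, alpha):
    """Gradient of the toy training loss (w - alpha)^2 w.r.t. w."""
    return 2.0 * (w - alpha)


def val_loss(w, alpha):
    """Toy validation loss (w * alpha - 1)^2."""
    return (w * alpha - 1.0) ** 2


def val_grad_alpha(w, alpha, xi):
    """d/d(alpha) of L_val(w - xi * dL_train/dw, alpha), derived by hand."""
    w_virtual = w - xi * train_loss_grad_w(w, alpha)  # one look-ahead step on w
    residual = w_virtual * alpha - 1.0
    # For these toy losses, d(w_virtual)/d(alpha) = 2 * xi.
    return 2.0 * residual * (w_virtual + alpha * 2.0 * xi)


def toy_search(steps=500, lr_w=0.1, lr_alpha=0.1, xi=0.1, w=0.5, alpha=0.5):
    """Alternate one weight step on L_train with one architecture step on L_val."""
    for _ in range(steps):
        w = w - lr_w * train_loss_grad_w(w, alpha)
        alpha = alpha - lr_alpha * val_grad_alpha(w, alpha, xi)
    return w, alpha
```

In this toy setting the weight tracks the architecture parameter while the architecture parameter is pushed toward the value that minimizes the validation loss, so both settle near 1. A real DARTS search replaces the hand-derived gradients with automatic differentiation through the proxy network.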
As shown in Table 1, DARTS optimizes the architecture $\alpha$ and the network weights $w$ by alternating iterations. In the gradient back-propagation phase, the network weights $w$ are updated with the training loss, and the architecture $\alpha$ is updated with the validation loss. In step $k$, given the current architecture $\alpha_{k-1}$, the proxy network obtains $w_k$ by moving $w_{k-1}$ in the direction that minimizes the training loss $L_{train}(w_{k-1}, \alpha_{k-1})$. Then, keeping the weights $w_k$ fixed, the architecture is updated so as to minimize the validation loss after a single step of gradient descent w.r.t. the weights:

$$L_{val}\big(w_k - \xi \nabla_{w} L_{train}(w_k, \alpha_{k-1}), \; \alpha_{k-1}\big)$$

where $\xi$ is the learning rate for this virtual step. After obtaining the continuous architecture $\alpha$, we retain the 2 strongest predecessors for each intermediate node, where the strength of an edge is defined as:

$$\max_{o \in \mathcal{O},\, o \neq zero} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}$$
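The discretization step can be sketched as follows, assuming (as in the original DARTS paper) that the strength of an edge is the largest softmax weight among its non-zero operations. The operation names and the dictionary layout of `node_edges` are illustrative choices, not the authors' code.

```python
import math


def softmax(alphas):
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def edge_strength(alphas, op_names):
    """Return (strength, best_op): the largest softmax weight over non-zero ops."""
    weights = softmax(alphas)
    candidates = [(w, name) for w, name in zip(weights, op_names) if name != "zero"]
    return max(candidates)


def select_predecessors(node_edges, op_names, k=2):
    """Keep the k strongest incoming edges of one intermediate node.

    node_edges maps a predecessor index to the alpha vector of that edge.
    """
    scored = []
    for j, alphas in node_edges.items():
        strength, name = edge_strength(alphas, op_names)
        scored.append((strength, j, name))
    scored.sort(reverse=True)
    return [(j, name) for _, j, name in scored[:k]]


# Example: three incoming edges, three candidate operations per edge.
op_names = ["zero", "skip_connect", "sep_conv_3x3"]
node_edges = {
    0: [3.0, 0.0, 1.0],  # dominated by "zero", so its non-zero strength is small
    1: [0.0, 2.0, 0.0],  # dominated by "skip_connect"
    2: [0.0, 0.0, 1.0],  # dominated by "sep_conv_3x3"
}
selected = select_predecessors(node_edges, op_names)
```

In this example the edges from predecessors 1 and 2 are kept, with a skip connection and a separable convolution respectively. The same mechanism explains the failure mode discussed in the paper: once skip-connection weights dominate, they win on every retained edge.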

The phenomenon of performance drop caused by intractable skip connections
PDARTS [7] identified a severe issue underlying DARTS: after a certain number of search epochs, the number of skip connections in the selected architecture increases dramatically, which results in poor performance of the selected architecture. Why does this happen? In DARTS, the categorical choice of a particular operation is relaxed to a softmax over all possible operations. This weighted sum resembles the basic residual module in ResNet [15,16], where the identity mapping ensures that information is propagated directly back to any shallower layer, which helps to train deeper neural networks. The ResNet experiments showed that the learned residual functions generally have small responses, suggesting that identity mappings provide reasonable preconditioning. Therefore, the architectural weight of the skip connection increases much faster than the architectural weights of the other operations. In a residual module, the identity mapping and the residual function work together to reach a better result, but the final architecture is obtained by keeping only the top-performing operation among all candidates, which breaks this cooperation.
In image classification, it has been observed that DARTS selects many skip connections, which makes the architecture shallow and its performance poor. To see whether the same problem occurs in super-resolution tasks, we run DARTS twice with different random seeds. On the DIV2K super-resolution dataset, the alpha value of the skip connection becomes very large when the number of search epochs is large, and thus the number of skip connections in the selected architecture increases, as shown by the blue line in Fig. 2. Following DARTS, we select the 8 top-performing operations per cell. The dominant skip-connection operations (those with the highest softmax(α) among all operations on an edge) make up a large part of all operations selected by DARTS. In addition, in the left panel of Fig. 2, all top-performing operations in the selected architecture are skip connections, which indicates that the proxy network has not learned useful operations. Given this situation, we need to prevent the search stage from selecting too many skip-connect operations.

Improved DARTS for super-resolution tasks
In this section, we adjust the search space based on the characteristics of some notable super-resolution networks. After that, we describe our proposed SR network architecture based on differentiable architecture search. Finally, we present the proposed network and compare it with the baseline.

Remove redundant operations
Like the search space proposed in DARTS, we search for a basic cell structure and then stack it to construct the final convolutional neural network, which is used as the nonlinear mapping part of the super-resolution network. In some super-resolution networks [21,22], a large initial number of channels means that the convolution operations produce more feature maps, thereby preserving more information about the LR image and yielding higher-quality SR images. A reduction cell increases the number of output channels, so if the initial number of channels is too large, using reduction cells will run out of memory. In addition, the height and width of the output feature maps are halved, which results in information loss and requires more upsampling operations. Therefore, to keep the initial number of channels relatively large, we removed the reduction cell in our experiments.

Fig. 2 The number of skip connections accounts for a large portion of all operations searched by DARTS. Train and validation PSNR are also drawn to show the level of convergence
For SISR tasks, the input LR image and the output HR image are strongly correlated, since the LR image is obtained by down-sampling the HR image. Pooling operations often lead to information loss, so they may harm the final performance, and we remove them from the search space. Removing redundant operations also speeds up the architecture search stage. To reduce the number of parameters in the cells, unlike the cell output in DARTS, the output of a cell in our search space is the mean of the feature maps of all intermediate nodes in the cell.

Adding identity mapping on convolution operations
Since the proxy network tends to choose the skip connection as the dominant operation, we decided to add residual block units to the search space. Instead of adding an extra residual module to the search space, we add an identity mapping to the original convolution operations in DARTS. Moreover, some notable networks [21,22] have achieved good performance on SR tasks by stacking residual block units, which indicates that the residual module is useful. As in a residual block, we add a skip connection around each convolution operation (the details are shown in Fig. 3). In image classification, the search tends to select many skip-connection operations, which causes the collapse of the final model. Adding a skip connection inside the convolution operations not only improves performance, but also avoids choosing more skip connections in the architecture search stage. We then use experiments to verify that this method really brings an improvement.
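The idea of folding the identity mapping into each convolution candidate can be sketched with a small wrapper. This is an illustrative assumption about the structure, not the authors' code: `with_identity` and `toy_conv` are hypothetical names, and feature maps are flat lists whose shape the wrapped operation must preserve.

```python
def with_identity(op):
    """Wrap an operation so that it computes x + op(x), as in a residual block."""
    def residual_op(x):
        return [a + b for a, b in zip(x, op(x))]
    return residual_op


def toy_conv(x):
    # Hypothetical stand-in for a convolution with small responses.
    return [0.1 * v for v in x]


def dead_conv(x):
    # A convolution whose response is zero, e.g. early in training.
    return [0.0 for _ in x]


res_conv = with_identity(toy_conv)
res_dead = with_identity(dead_conv)
```

Even when the wrapped convolution contributes nothing, the wrapped operation still passes its input through unchanged, so a convolution candidate no longer has to compete against a separate skip connection to provide an identity path.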
For comparison, as before, we run the search twice with different random seeds; the number of skip connections stays steady, as shown in Fig. 4. In the next section, we verify experimentally that the searched architecture performs better. The result obtained by the original DARTS in Fig. 3 suggests that choosing more skip connections gives a higher PSNR value during the search phase. In the architecture search phase, each edge consists of six competing operations, and the proxy network as a whole performs better. However, the final architecture is obtained by keeping only the top-performing operation among all candidates, which causes the performance drop of the final architecture. In the experiment section, we verify that the results of the improved DARTS are better than those of the original DARTS.

The architecture of the proposed network
At the beginning of the network, we use convolution operations to extract low-level features. We then use the cell structure searched by DARTS as the nonlinear mapping part of the super-resolution network (Fig. 5). The network searched by DARTS also contains skip connections, which allows the network to be deeper. After that, we use element-wise addition to fuse the low-level features with the high-level features produced by the nonlinear mapping part. Finally, we use sub-pixel upsampling to obtain the corresponding HR images. During forward propagation through the network, the dimensions of the features remain the same until the upsampling step.
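The sub-pixel upsampling step can be sketched as the usual channel-to-space rearrangement. The sketch below works on nested Python lists and assumes the common channel ordering in which output pixel (y·r + i, x·r + j) of channel c is read from input channel c·r² + i·r + j; the paper does not state which ordering its implementation uses.

```python
def pixel_shuffle(feature, r):
    """Rearrange C*r*r channels of size H x W into C channels of size (H*r) x (W*r).

    feature is a list of channels, each a list of H rows with W values.
    """
    in_channels = len(feature)
    if in_channels % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    out_channels = in_channels // (r * r)
    h = len(feature[0])
    w = len(feature[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(out_channels)]
    for c in range(out_channels):
        for i in range(r):
            for j in range(r):
                src = feature[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out


# Four 1 x 1 channels become one 2 x 2 channel for scale factor r = 2.
lr_features = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
hr_image = pixel_shuffle(lr_features, 2)
```

Because the rearrangement only moves values, all convolutions before it run at the low resolution, which keeps the memory cost of the nonlinear mapping part independent of the scale factor.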

Experiments and results
The experiment includes two parts: architecture search and architecture evaluation. In the first part, we use DARTS to search for a cell structure (including the connection pattern and the operation used on each connection). In the architecture evaluation phase, we train the searched model from scratch. For testing, we use five standard benchmark datasets. The SR results are evaluated with PSNR and SSIM [38] on the Y channel of the transformed YCbCr space.

PSNR/SSIM for measuring reconstruction quality
PSNR and SSIM are the quantitative criteria used in most super-resolution papers. Given two images X and Y, both with N pixels, the peak signal-to-noise ratio is defined as:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{MAX_I^2}{MSE}\right), \qquad MSE = \frac{1}{N}\sum_{i=1}^{N}(X_i - Y_i)^2$$

where MSE is the mean square error between the original image and the SR image, and $MAX_I$ is the maximum value of the image color (255 for 8-bit samples). PSNR evaluates image quality by comparing the gray value differences of corresponding pixels in the two images. The higher the PSNR value, the better the image obtained by super-resolution. The structural similarity index (SSIM) is defined as:

$$\mathrm{SSIM}(X, Y) = \frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}{(\mu_X^2 + \mu_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)}$$

where $\mu_X$ and $\mu_Y$ are the mean values of the images X and Y, $\sigma_X$ and $\sigma_Y$ are their standard deviations, $\sigma_{XY}$ is the covariance of X and Y, and $C_1$ and $C_2$ are constants. SSIM evaluates the similarity of the two images from three aspects: brightness, contrast, and structure. SSIM is a number between 0 and 1. The larger the SSIM value, the better the super-resolution result.
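Both metrics can be computed directly from their definitions. The sketch below treats an image as a flat list of pixel values and evaluates SSIM once over the whole image. Two points are assumptions rather than statements from the paper: practical SSIM implementations average the index over local windows, and the constants C1 = (0.01 · MAX)² and C2 = (0.03 · MAX)² are the commonly used defaults, since the text only says that they are constants.

```python
import math


def mse(x, y):
    """Mean square error between two equally sized pixel lists."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and of equal size")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def psnr(x, y, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(x, y)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def ssim_global(x, y, max_value=255.0):
    """SSIM evaluated once over the whole image (no sliding window)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * max_value) ** 2  # assumed default constant
    c2 = (0.03 * max_value) ** 2  # assumed default constant
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

For the reported results, these functions would be applied to the Y channel after converting both images to YCbCr, as stated in the experimental setup.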

Architecture search
We include the following operations in O: the proposed 3 × 3 and 5 × 5 separable convolutions, the proposed 3 × 3 and 5 × 5 dilated separable convolutions, skip-connection and zero. All operations have stride one (if applicable), and the convolved feature maps are padded to preserve their spatial resolution. We use the Conv-ReLU-Conv order for the proposed separable convolution and dilated separable convolution. Our convolutional cell consists of N = 7 nodes, and the output of the cell is the mean of the feature maps of all intermediate nodes in the cell (input nodes excluded). The rest of the setup follows DARTS [25]. We used half of the DIV2K training data as the validation set and the other half as the training set. To find a suitable architecture faster, we chose a small network with only 8 cells at this stage. The network is trained using DARTS for 50 epochs, with batch size 16 and 64 initial channels. We use the SGD optimizer to optimize the network weights w, with momentum 0.9, initial learning rate 1e-2 and regularization coefficient 1e-3. At the same time, we use the Adam optimizer to adjust the architecture α, with initial learning rate 3e-4, momentum β = (0.9, 0.999) and regularization coefficient 1e-3. The search took 8 hours on one NVIDIA TITAN Xp GPU. The searched results are shown in Fig. 6.

As the ground truth of the test dataset is not released, we report and compare the performance on the validation dataset. We train our model with 800 training images and use 10 validation images during training.
We used a large 20-layer network and trained it for 300 epochs, with input patch size 192, 256 initial channels and batch size 16. The initial learning rate is set to 1e-4 and is halved every 200 epochs. We augment the training data with random horizontal flips and 90° rotations. We pre-process all images by subtracting the mean RGB value of the DIV2K dataset. We train our model with the ADAM optimizer, setting β = (0.9, 0.999) and ε = 1e-8. Because super-resolution models are hard to converge, we removed the regularization term and the dropout layers. Training took 5 days on three NVIDIA TITAN Xp GPUs. The result is shown in Fig. 7. Our baseline model, EDSR, uses 32 residual blocks and 256 filters to reach its best performance and has 43M parameters, whereas our proposed model has 26.68M parameters. For training, we use RGB input patches of size 48 × 48 from the low-resolution image together with the corresponding high-resolution patches. Under the same input conditions, EDSR requires 1853 GFLOPs and our proposed model requires 1162 GFLOPs. Other models do not clearly report their parameter counts and FLOPs, so we do not compare with them here.
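The learning-rate schedule and the complexity comparison above reduce to a few lines of arithmetic. The helper names below are hypothetical, and the step schedule assumes that the rate is halved exactly at epochs 200, 400 and so on, counting from epoch 0.

```python
def learning_rate(epoch, initial_lr=1e-4, halve_every=200):
    """Step schedule: the rate is halved after every `halve_every` epochs."""
    return initial_lr * 0.5 ** (epoch // halve_every)


def relative_cost(ours, baseline):
    """Cost of the proposed model as a fraction of the baseline's cost."""
    return ours / baseline


# Figures quoted in the text for the proposed model and EDSR.
param_ratio = relative_cost(26.68, 43.0)    # parameters, in millions
flops_ratio = relative_cost(1162.0, 1853.0)  # GFLOPs on a 48 x 48 input patch
```

With 300 training epochs the schedule only halves the rate once, at epoch 200. Both ratios come out at roughly 0.62 to 0.63, so the searched model uses a little under two thirds of the parameters and computation of the EDSR baseline.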

Benchmark results
We provide quantitative and qualitative comparisons. Besides the DIV2K validation set, we evaluate our proposed model on four standard benchmark datasets: Set5 [2], Set14 [42], B100 [28], and Urban100 [17]. We compare our model with some notable methods, including SRCNN [9], A+ [36], VDSR [19] and SRResNet [21]. We only compare at scale factor 4. For comparison, we measure PSNR on the Y channel. Table 2 summarizes the quantitative evaluation on several datasets. Our model shows a significant improvement over the other models. In addition to PSNR, we also use SSIM as a second metric to measure the performance of the model on the benchmark datasets. On the more complex datasets B100 and Urban100, the improved DARTS is better than the original DARTS.

Conclusion
In this work, we have presented a novel cell-based super-resolution method using very deep networks. We redesign the convolution operations, including dilated convolution and separable convolution, by adding skip connections. By removing unnecessary operations from the search operation set, we shorten the architecture search time. Our proposed SISR model surpasses some notable works. Moreover, neural architecture search offers a feasible way for engineers to compress existing popular human-designed models.