1 Introduction

Image super-resolution (SR) is one of the most important and challenging tasks in computer vision. It aims to generate a visually pleasing high-resolution (HR) image from a low-resolution (LR) input. This task benefits many applications, such as medical imaging, virtual reality, and video surveillance.

Single image SR is an ill-posed problem because the given LR image has lost the high-frequency information of the original scene. The inverse problem becomes particularly pronounced as the scaling factor increases. Learning-based approaches attempt to solve this ill-posed problem by directly or indirectly learning the mapping between an LR image and its HR counterpart.

In recent years, with the help of deep convolutional neural networks (DCNNs), several frameworks for image SR, e.g., [3, 10, 11, 15, 18], were proposed and performance has grown rapidly. Dong et al. [3] proposed the Super-Resolution Convolutional Neural Network (SRCNN), the first method to apply a CNN to image SR, and achieved significant improvements. The performance of SRCNN was limited by its shallow structure. In [10, 11], Kim et al. increased the network depth to 20 layers, achieving notable improvements over SRCNN. To obtain higher performance, networks have tended to become deeper and deeper. Recently, Lim et al. [18] built a very wide network, EDSR, and a very deep one, MDSR, using simplified residual blocks, and achieved very strong performance on super-resolution tasks. Since then, many super-resolution models integrating dense connections have been proposed to effectively utilize hierarchical features, including SRDenseNet [27] and RDN [32].

Although these latest models have produced promising results by learning deeper hierarchical features, several bottlenecks remain. A major issue is that the great majority of network models produce relatively poor results at large scaling factors such as 8 ×, as shown in Fig. 1. One of the main reasons is that they may not be able to effectively combine multi-scale feature information with hierarchical feature information for reconstruction.

Fig. 1 Visual comparison for 8 × super-resolution on Urban100 [9]. PSNR: Bicubic (15.89 dB), LapSRN [15] (18.27 dB), EDSR [18] (19.53 dB), MSRN [17] (19.41 dB), MSLapSRN [14] (18.52 dB), and PRNet (ours) (20.16 dB). (Zoom in for best view)

Our findings

For SR, because objects in images appear at different scales and from different perspectives, combining the hierarchical features and multi-scale features of a deeper network provides more clues for reconstruction. However, previous SR methods such as EDSR [18] and RDN [32] only gained and incorporated more contextual knowledge through deeper networks and intricate skip connections; it is difficult to integrate complementary multi-scale feature information with a single-stream structure. Further, [28, 33] found that increasing the width of a deep network may be more beneficial than increasing its depth. Therefore, a multi-stream structure and a widened network may be effective for facilitating information integration in image SR. Finally, the existing multi-scale super-resolution approach [17] did not fully utilize and fuse multi-scale feature information, and it was not conducive to building and training deeper network structures on limited hardware.

Our Contributions

Inspired by these observations and findings, we propose a progressive residual network (PRNet) for single image SR. Our network performs well at large scaling factors, as shown in Fig. 1 (PRNet). We summarize our research and contributions as follows:

  • We propose a novel framework, PRNet, for high-quality image SR. The network integrates multi-scale and hierarchical feature information from the original LR image by using a multi-stream structure.

  • We propose a Progressive Residual Module (PRM) for local multi-scale feature representation in PRNet, which not only extracts deep multi-scale features from the image through its progressive structure, but also fully utilizes all the layers of different sizes within it via local dense connections. Simultaneously, residual learning is introduced to promote the flow of gradients and information. It is worth mentioning that the larger the scaling factor, the more multi-scale feature information is fused by adjusting the width of the PRM.

  • We propose an effective multi-scale features fusion architecture for image reconstruction in PRNet, which can adaptively fuse global feature information of different scales to improve the performance of image reconstruction.

  • Extensive experimental results show that our model achieves state-of-the-art performance on several popular benchmarks. Moreover, the larger the scaling factor, the more pronounced the advantage of PRNet, owing to its specially designed structure.

The remainder of this paper is organized as follows. Section 2 reviews the related works. In Section 3, we elaborate on the proposed methods. Experimental results and analysis are reported in Section 4. Finally, Section 5 summarizes our work.

2 Related work

Single image super-resolution (SR) research can be divided into three classes: interpolation-based [31], reconstruction-based [30], and learning-based [3, 16, 22,23,24, 27]. The most popular class is learning-based, including neighbor embedding [6], sparse coding [25, 29], and random forests [21]. Methods in this class learn the complex mapping relationship between LR and HR images from large training datasets. As an implementation of the learning-based approach, SRCNN [3] employed a three-layer fully convolutional neural network to learn an end-to-end mapping between low- and high-resolution images, which was the first successful attempt to apply a CNN to SR problems. To expand the receptive field, DRCN [11] and VDSR [10] increased the network depth through skip connections and residual learning. These works often used pre-processed LR images as input, upscaled to the HR space by an up-sampling operator such as bicubic interpolation. However, ESPCN [22] showed that such pre-upsampling increases computational complexity and produces visible artifacts.

To address this, many deep neural network models [16, 18, 23] were introduced that take advantage of the hierarchical features of the original low-resolution (LR) image without any preprocessing. Among these methods, the most well-known is EDSR [18], the champion of the NTIRE2017 SR Challenge. It builds on SRResNet [16] and enhances performance by removing the batch normalization layers and using a deeper and wider network structure.

Further, to make full use of the information across several layers or blocks, the pattern of multiple or dense skip connections between layers or modules has been adopted in SR. Inspired by densely connected convolutional networks (DenseNet [8]), which achieved high performance in image classification, [13, 27] utilized the DenseNet structure as building modules to reuse learned feature maps and introduced dense skip connections to fuse features at different levels. Compared with SRDenseNet [27], RDN [32] used a residual dense connection method that extracts and fuses multi-level features from the original LR image to further improve performance.

The aforementioned methods showed impressive super-resolution performance, but most of them lose useful hierarchical features and ignore the multi-scale features of the original LR image. Meanwhile, we believe that as the scaling factor grows, more multi-scale information should be integrated. Although Li et al. [17] introduced two fixed-size convolution kernels (3 × 3 and 5 × 5) to detect image features of different scales, such fixed kernels cannot adaptively select multi-scale features for different scaling factors. To address these issues, we propose the progressive residual network (PRNet) to efficiently integrate hierarchical features and multi-scale features from all the layers in the LR space. We detail our PRNet in the next section.

3 Progressive residual networks

In this section, we describe the design methodology of the proposed PRNet. First, we introduce the network architecture of PRNet in Section 3.1. Then, Sections 3.2 and 3.3 detail its two main parts: the progressive residual module and the reconstruction net with multi-scale features fusion.

3.1 Network architecture

In PRNet, the aim is to estimate a super-resolution image ISR from a low-resolution input image ILR; we use IHR to denote the high-resolution image, from which ILR is obtained by a down-sampling operation. Figure 2 shows the architecture of our proposed PRNet. It consists of three parts: an initial multi-scale features extraction net (MSFENet), multiple stacked progressive residual modules (PRMs), and a reconstruction net with multi-scale features fusion (ReconNet).

Fig. 2 Illustrating the architecture of the proposed progressive residual network. a Initial multi-scale features extraction net (MSFENet). b Progressive residual modules (PRMs). c Reconstruction net with multi-scale features fusion (ReconNet). The horizontal and vertical directions correspond to the hierarchical depth of the network and the scale of the feature maps, respectively. Upper-right legend: 3 × 3 resblock = a ReLU between two 3 × 3 convolutions, upscale = pixel shuffle with different up-sampling scale factors, identity = no operation

Specifically, a convolutional layer and several up-sampling layers are used in MSFENet to extract the initial multi-scale shallow features \(\textbf {M}_{0}=\{{M_{0}^{s}}, s=0,1,2,3\}\). \({M_{0}^{0}}\) is extracted from ILR and the formula is as follows:

$$ {M_{0}^{0}} = \mathcal{F}_{Ext} (I^{LR})~, $$
(1)

where \(\mathcal {F}_{Ext}(\cdot )\) denotes a 3 × 3 convolution operation. \({M_{0}^{0}}\) is then used for further multi-scale feature extraction and for global residual learning. We can further obtain

$$ {M_{0}^{s}} = \mathcal{F}_{Up}(M_{0}^{s-1})~, $$
(2)

where \(\mathcal {F}_{Up} (\cdot )\) denotes the 2 × up-sampling operation used in [18, 22], and s ∈{1,2,3} indexes the scale (s = i corresponds to scaling factor \(2^{i}\)). M0 denotes the set of extracted initial multi-scale features, which is sent to the first progressive residual module (PRM).
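To make Eqs. (1)-(2) concrete, the following is a minimal PyTorch sketch of MSFENet; the class and parameter names (MSFENet, Upsampler2x, n_feats) are our own illustrative choices, and we assume a pixel-shuffle up-sampler in the style of [18, 22] for \(\mathcal {F}_{Up}\):

```python
import torch
import torch.nn as nn

class Upsampler2x(nn.Module):
    """2x up-sampling via convolution + pixel shuffle, as in ESPCN/EDSR."""
    def __init__(self, n_feats):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, 4 * n_feats, 3, padding=1),
            nn.PixelShuffle(2),
        )

    def forward(self, x):
        return self.body(x)

class MSFENet(nn.Module):
    """Initial multi-scale feature extraction: Eq. (1), then Eq. (2) repeated."""
    def __init__(self, n_colors=3, n_feats=64, n_scales=3):
        super().__init__()
        self.extract = nn.Conv2d(n_colors, n_feats, 3, padding=1)   # F_Ext
        self.ups = nn.ModuleList([Upsampler2x(n_feats) for _ in range(n_scales)])

    def forward(self, lr):
        m = [self.extract(lr)]            # M_0^0
        for up in self.ups:               # M_0^s = F_Up(M_0^{s-1})
            m.append(up(m[-1]))
        return m                          # [M_0^0, ..., M_0^3]
```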

Supposing T progressive residual modules are stacked to act as the feature mapping, the output Mt of the t-th PRM can be obtained by:

$$ \begin{array}{@{}rcl@{}} \textbf{M}_{t} &=& \mathcal{F}_{Prn,t} (\textbf{M}_{t-1})\\ &=& \mathcal{F}_{Prn,t} (\mathcal{F}_{Prn,t-1} (\ldots(\mathcal{F}_{Prn,1} (\textbf{M}_{0}))\ldots)),\quad t\in 1,\ldots,T, \end{array} $$
(3)

where \(\mathcal {F}_{Prn,t}(\cdot )\) denotes the operations of the t-th PRM. \(\mathcal {F}_{Prn,t}(\cdot )\) can be a composite function of operations, such as convolution and rectified linear units (ReLU).

Finally, our model uses the up-sampling and convolution layers in ReconNet to fuse multi-scale features and reconstruct residual images. Therefore, our PRNet can be formulated as:

$$ \begin{array}{@{}rcl@{}} I^{SR} &=& \mathcal{F}_{PRNet} (I^{LR})\\ &=& \mathcal{F}_{Rec}(\mathcal{F}_{Prn,t} (\mathcal{F}_{Prn,t-1} (\ldots(\mathcal{F}_{Prn,1} (\textbf{M}_{0}))\ldots))\\ &&+{M_{0}^{0}})~, \end{array} $$
(4)

where \(\mathcal {F}_{PRNet} (\cdot )\) and \(\mathcal {F}_{Rec}(\cdot )\) denote the function of our PRNet and the reconstruction net with multi-scale features fusion respectively.
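As a structural sketch of Eq. (4), assuming the MSFENet sketch above and treating each PRM and the reconstruction net as sub-modules defined later in this section (the names prm_blocks and recon are placeholders of ours):

```python
import torch.nn as nn

class PRNet(nn.Module):
    """Skeleton of Eq. (4): MSFENet -> T stacked PRMs -> ReconNet, with the
    global residual on the LR-scale stream handled inside the reconstruction net."""
    def __init__(self, msfenet, prm_blocks, recon):
        super().__init__()
        self.msfenet = msfenet                 # yields M_0 = [M_0^0, ..., M_0^3]
        self.prms = nn.ModuleList(prm_blocks)  # T progressive residual modules
        self.recon = recon                     # reconstruction net (Section 3.3)

    def forward(self, lr):
        m0 = self.msfenet(lr)
        m = m0
        for prm in self.prms:                  # Eq. (3): M_t = F_Prn,t(M_{t-1})
            m = prm(m)
        return self.recon(m, m0[0])            # Eq. (4): F_Rec(... + M_0^0)
```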

Given a training set \(\left \{I_{i}^{HR},I_{i}^{LR}\right \}_{i=1,\ldots ,N}\), where N is the number of training patches and \(I_{i}^{HR}\) is the ground truth of the low-quality patch \(I_{i}^{LR}\), the loss function of our PRNet with parameter set Θ is

$$ \mathcal{L}^{SR}(\boldsymbol{\varTheta}) = \frac{1}{N}\sum\limits_{i=1}^{N} \lVert \mathcal{F}_{PRNet} (I_{i}^{LR})-I_{i}^{HR} \rVert_{1}. $$
(5)

3.2 Progressive residual module

In order to synthesize local hierarchical features and local multi-scale features, we propose a progressive residual module (PRM), as shown in Fig. 3. Here we present this structure in more detail; it contains local cross-scale feature fusion (LCSFF) and local residual learning (LRL).

Fig. 3 Progressive residual module (PRM) architecture

Local cross-scale feature fusion

is exploited to adaptively fuse the states from the preceding PRM and to fully utilize the hierarchical and multi-scale information in the current PRM. Different from previous studies, we construct a multi-stream sub-network in which complementary multi-scale feature information can be detected and fused across different streams. In this way, the hierarchical and multi-scale features of the deep network can be integrated to provide more clues for image reconstruction. The operation can be formulated as:

$$ \begin{array}{@{}rcl@{}} M_{t,LC}^{0}&=&p_{t}*\boldsymbol{\sigma}(q_{t}*M_{t-1}^{0} ),\\ M_{t,LC}^{1}&=&{W_{t}^{1}}*[M_{t-1}^{1},(M_{t,LC}^{0}){\uparrow_{2}} ],\\ \ldots\ldots&&\\ M_{t,LC}^{s}&=&{W_{t}^{s}}*[M_{t-1}^{s},(M_{t,LC}^{s-1}){\uparrow_{2}} ]\\ &&+((M_{t,LC}^{s-2}){\uparrow_{2^{2}}}+{\ldots} +(M_{t,LC}^{0}){\uparrow_{2^{s}}})~, \end{array} $$
(6)
$$ \textbf{M}_{t,LC}=\{M_{t,LC}^{s}, s= 0,1,2,3\}, $$
(7)

where ∗ is the spatial convolution operator, \({\uparrow _{2^{s}}}\) is the up-sampling operator with scaling factor \(2^{s}\), and \(p_{t}\), \(q_{t}\), \(\textbf {W}_{t} = \{{W_{t}^{s}}, s=1,2,3\}\) are convolutional layers at stage t. \(\sigma (x)=\max (0,x)\) stands for the ReLU function, and \([x,y]\) denotes the concatenation of x and y.

Local residual learning

is applied to further facilitate the information flow. Formally, we define LRL as:

$$ \textbf{M}_{t}=\textbf{M}_{t,LC} + \textbf{M}_{t-1}, $$
(8)

where Mt− 1 and Mt represent the input and output of the PRM, respectively. The operation + is performed by a shortcut connection with element-wise addition. LRL not only greatly reduces computational complexity, but also promotes the performance of the network.
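Putting Eqs. (6)-(8) together, a minimal PyTorch sketch of one PRM might look as follows; the layer names are ours, and the choice of nearest-neighbor interpolation for the intra-module up-sampling operators is an assumption rather than a detail stated in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PRM(nn.Module):
    """One progressive residual module (Eqs. (6)-(8)) over n_scales + 1 streams."""
    def __init__(self, n_feats=64, n_scales=3):
        super().__init__()
        self.n_scales = n_scales
        # p_t * sigma(q_t * .): two 3x3 convolutions with a ReLU in between
        self.q = nn.Conv2d(n_feats, n_feats, 3, padding=1)
        self.p = nn.Conv2d(n_feats, n_feats, 3, padding=1)
        # W_t^s: 3x3 convolutions fusing the concatenated streams
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * n_feats, n_feats, 3, padding=1) for _ in range(n_scales)]
        )

    def forward(self, m_prev):               # m_prev = [M_{t-1}^0, ..., M_{t-1}^S]
        lc = [self.p(F.relu(self.q(m_prev[0])))]           # M_{t,LC}^0
        for s in range(1, self.n_scales + 1):
            up = F.interpolate(lc[s - 1], scale_factor=2)   # assumed nearest-neighbor
            out = self.fuse[s - 1](torch.cat([m_prev[s], up], dim=1))
            # dense cross-scale skips: add all earlier LC features, up-sampled to scale s
            for j in range(s - 1):
                out = out + F.interpolate(lc[j], scale_factor=2 ** (s - j))
            lc.append(out)
        return [lc_s + prev_s for lc_s, prev_s in zip(lc, m_prev)]  # Eq. (8): LRL
```

With the same number of channels on every stream, the element-wise additions of Eqs. (6) and (8) are well defined once the spatial sizes are matched by up-sampling.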

3.3 Reconstruction net with multi-scale features fusion

After extracting local features at different scales with a set of PRMs, we further propose a multi-scale features fusion structure for reconstruction, as shown in Fig. 4, which can adaptively fuse global multi-scale feature information. It consists of global residual learning (GRL) and multi-scale features fusion (MSFF).

Fig. 4 Reconstruction net with multi-scale features fusion

Global residual learning

is introduced to obtain the feature maps before up-sampling by

$$ {M_{F}^{0}}= h_{f}* {M_{T}^{0}}+{M_{0}^{0}}, $$
(9)

where \({M_{0}^{0}}\) denotes the initial shallow feature maps, \({M_{T}^{0}}\) denotes the zeroth element of MT (the output of the last PRM of PRNet), ∗ is the spatial convolution operator, and hf is the convolutional layer at the global residual stage. All the layers before global feature fusion are fully exploited by our proposed progressive residual modules (PRMs). The PRMs generate multi-level local features of different sizes, which are further adaptively fused into the image up-sampling process to enhance reconstruction performance.

Multi-scale features fusion

is then utilized to adaptively fuse global feature information of different scales in the up-sampling process. We name this operation multi-scale features fusion (MSFF), formulated as:

$$ \begin{array}{@{}rcl@{}} &&{M_{F}^{1}}={W_{f}^{1}} *[{M_{T}^{1}},({M_{F}^{0}}){\uparrow_{2}}]~, \\ &&\ldots\ldots\\ &&{M_{F}^{s}}={W_{f}^{s}} *[{M_{T}^{s}},(M_{F}^{s-1}){\uparrow_{2}}]~, \end{array} $$
(10)
$$ I^{SR}=r_{f} * {M_{F}^{s}}, $$
(11)

where ∗ is the spatial convolution operator, \({\uparrow _{2}}\) is the up-sampling operator with scaling factor 2, and \(\textbf {W}_{f} = \{{W_{f}^{s}}, s=1,2,3\}\) and rf are convolutional layers at the multi-scale features fusion stage. \([x,y]\) denotes the concatenation of x and y.
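Following the same conventions as the sketches above, Eqs. (9)-(11) can be expressed compactly; here \(h_{f}\), \({W_{f}^{s}}\), and \(r_{f}\) become ordinary 3 × 3 convolutions, and the number of fusion stages is assumed to equal log2 of the target scaling factor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconNet(nn.Module):
    """Reconstruction with global residual learning (Eq. (9)) and
    multi-scale features fusion (Eqs. (10)-(11))."""
    def __init__(self, n_feats=64, n_colors=3, n_scales=3):
        super().__init__()
        self.h_f = nn.Conv2d(n_feats, n_feats, 3, padding=1)
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * n_feats, n_feats, 3, padding=1) for _ in range(n_scales)]
        )
        self.r_f = nn.Conv2d(n_feats, n_colors, 3, padding=1)
        self.n_scales = n_scales

    def forward(self, m_T, m_0_0):             # m_T = [M_T^0, ..., M_T^S], m_0_0 = M_0^0
        f = self.h_f(m_T[0]) + m_0_0            # Eq. (9): global residual learning
        for s in range(1, self.n_scales + 1):   # Eq. (10): fuse stream s with up-sampled f
            up = F.interpolate(f, scale_factor=2)
            f = self.fuse[s - 1](torch.cat([m_T[s], up], dim=1))
        return self.r_f(f)                      # Eq. (11): I^SR
```

For a × 4 model, for example, n_scales would be 2, so only the first two streams are fused before the final convolution.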

4 Experiments

In this section, we first describe the implementation and training details of our network. We then validate the contributions of different components in the proposed network and compare the proposed PRNet with several state-of-the-art SR methods on benchmark datasets. We present the quantitative evaluation and qualitative comparison. Finally, we apply our approach to real photos.

4.1 Implementation and training details

Datasets and Metrics

In our work, we choose DIV2K [1] as the training dataset, a high-quality image dataset built for image restoration challenges. DIV2K consists of 800 training images, 100 validation images, and 100 test images. We train all of our models on the 800 training images and use 14 validation images during training. For testing, we use five standard benchmark datasets: Set5 [2], Set14 [29], B100 [19], Urban100 [9], and Manga109 [20]. These datasets contain a wide variety of images that can fully verify our model. Following previous works, all testing is performed on the luminance channel in YCbCr space, and scaling factors × 2, × 3, × 4, and × 8 are used for training and testing.
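Since all reported numbers are computed on the luminance channel, the evaluation protocol can be sketched as follows; the BT.601 RGB-to-Y conversion is the one commonly used in the SR literature, and shaving `scale` border pixels before measuring follows common practice rather than a detail stated here:

```python
import numpy as np

def rgb_to_y(img):
    """Convert an HxWx3 uint8 RGB image to the BT.601 luminance (Y) channel."""
    img = img.astype(np.float64)
    return 16.0 + (65.738 * img[..., 0] + 129.057 * img[..., 1] + 25.064 * img[..., 2]) / 256.0

def psnr_y(sr, hr, scale):
    """PSNR between two RGB images on the Y channel, shaving `scale` border pixels."""
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    if scale > 0:
        y_sr = y_sr[scale:-scale, scale:-scale]
        y_hr = y_hr[scale:-scale, scale:-scale]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```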

Training Setting

Following the settings of [18], in each training batch we randomly extract 16 LR RGB patches of size 48 × 48 as inputs. We randomly augment the patches by flipping and rotating before training. To preserve image details, instead of transforming the RGB patches into YCbCr space, we use the 3-channel RGB information for training. The entire network is optimized by Adam [12] with the L1 loss, setting β1 = 0.9 and β2 = 0.999. The learning rate is initially set to \(10^{-4}\) and halved every \(2\times 10^{5}\) mini-batch updates, for \(3\times 10^{5}\) total mini-batch updates. All experiments are conducted using PyTorch [5] and MATLAB R2015b on NVIDIA TITAN Xp GPUs.
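A minimal training-loop sketch consistent with these settings; the dataset object `train_set` is assumed to yield randomly augmented 48 × 48 LR patches and their HR counterparts (that loader is not shown), and the step-based halving mirrors the schedule above:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_set, total_updates=300_000, halve_every=200_000, device="cuda"):
    """Optimize the model with L1 loss and Adam, halving the learning rate on schedule."""
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=16, shuffle=True, num_workers=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=halve_every, gamma=0.5)
    criterion = nn.L1Loss()

    step = 0
    while step < total_updates:
        for lr_patch, hr_patch in loader:
            lr_patch, hr_patch = lr_patch.to(device), hr_patch.to(device)
            optimizer.zero_grad()
            loss = criterion(model(lr_patch), hr_patch)   # Eq. (5)
            loss.backward()
            optimizer.step()
            scheduler.step()                              # per-update LR schedule
            step += 1
            if step >= total_updates:
                break
```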

4.2 Model analysis

In this section, we first analyze the effect of the number of PRMs, which is closely related to the depth of the network. We then use ablation experiments to evaluate several key design choices of PRNet.

PRMs and Network depth

We investigate a basic network parameter: the number of PRMs (denoted T). We use the performance of SRCNN [3] as a reference. For illustration purposes, we train the proposed model with different numbers of PRMs, i.e., different depths of the whole PRNet, choosing T = 8, 16, 32. As shown in Fig. 5, a larger T leads to higher performance, mainly because the network becomes deeper. On the other hand, although PRNet with a smaller T suffers some performance degradation during training, it still outperforms SRCNN [3] due to its powerful feature representation and fusion capabilities. More importantly, PRNet allows deeper and wider networks, with larger T and s, where more hierarchical features and more multi-scale features are extracted for higher performance.
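In terms of the earlier sketches, varying T simply changes how many PRM blocks are stacked; the helper below is purely illustrative and reuses the hypothetical MSFENet, PRM, ReconNet, and PRNet classes sketched in Section 3:

```python
def build_prnet(T, n_feats=64, n_scales=3):
    """Assemble a PRNet variant with T stacked PRMs (classes from the earlier sketches)."""
    return PRNet(
        MSFENet(n_feats=n_feats, n_scales=n_scales),
        [PRM(n_feats=n_feats, n_scales=n_scales) for _ in range(T)],
        ReconNet(n_feats=n_feats, n_scales=n_scales),
    )

variants = {T: build_prnet(T) for T in (8, 16, 32)}   # the three depths studied here
```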

Fig. 5 Convergence analysis of PRNet with different numbers of PRMs

Ablation Investigation

Table 1 shows the ablation investigation on the effects of local cross-scale feature fusion (LCSFF), local residual learning (LRL), and multi-scale features fusion (MSFF). The eight networks have the same number of PRMs (T = 16). The baseline is obtained without LCSFF, LRL, or MSFF and performs very poorly (PSNR = 31.34 dB). This is due to the difficulty of training and also demonstrates that simply stacking many basic blocks in a very deep network does not yield better performance.

Table 1 Ablation investigation of local residual learning (LRL), local cross-scale feature fusion (LCSFF), and multi-scale features fusion (MSFF)

We then add one of LRL, LCSFF, or MSFF to the baseline (the 2nd to 4th combinations in Table 1). The results show that each component clearly improves the performance of the baseline. We further remove one of LRL, LCSFF, or MSFF from PRNet to verify the validity of each component's design. The quantitative results in Table 1 show that each component significantly improves the performance of the network.

This is because LRL promotes the flow of information and gradients, LCSFF fully combines multi-scale and hierarchical features during feature extraction, and MSFF effectively fuses multi-scale features during reconstruction. A similar phenomenon can be seen when we use these three components simultaneously (denoted as the full model). PRNet with all three components performs the best.

We also visualize the convergence process of the above combinations in Fig. 6. The convergence curves are consistent with our analyses and indicate that LCSFF, LRL, and MSFF stabilize the training process without obvious performance degradation. These quantitative and visual analyses demonstrate the benefits and effectiveness of the proposed LCSFF, LRL, and MSFF.

Fig. 6 Convergence analysis on LCSFF, LRL, and MSFF. The curves for each combination are based on the PSNR on Set14 [29] with scaling factor × 2 over 200 epochs

4.3 Comparisons with state-of-the-art methods

To confirm the ability of the proposed network, we perform several experiments and analyses. We compare our network with 11 state-of-the-art SR algorithms: SRCNN [3], SelfExSR [9], FSRCNN [4], VDSR [10], DRRN [23], SRDCNN [13], LapSRN [15], EDSR [18], MSRN [17], D-DBPN [7], and RDN [32]. Similar to [18, 32], we also adopt the self-ensemble strategy [26] to further improve our PRNet and denote the self-ensembled PRNet as PRNet+.
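The self-ensemble strategy [26] averages predictions over the eight flip/rotation variants of the input (geometric self-ensemble); a minimal sketch, assuming `model` maps an LR tensor of shape (N, C, H, W) to its SR output:

```python
import torch

@torch.no_grad()
def self_ensemble(model, lr):
    """Geometric self-ensemble: average SR outputs over the 8 flip/rotation variants."""
    outputs = []
    for hflip in (False, True):
        for n_rot in range(4):
            x = torch.flip(lr, dims=(3,)) if hflip else lr
            x = torch.rot90(x, k=n_rot, dims=(2, 3))
            y = model(x)
            y = torch.rot90(y, k=-n_rot, dims=(2, 3))   # undo rotation
            if hflip:
                y = torch.flip(y, dims=(3,))            # undo horizontal flip
            outputs.append(y)
    return torch.stack(outputs).mean(dim=0)
```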

Table 2 shows quantitative comparisons for × 2, × 3, × 4, and × 8 SR. It is worth noting that D-DBPN [7] divides each image in Manga109 into four parts and then calculates PSNR separately, which significantly inflates the measured super-resolution performance. For a fair comparison, we do not compare against the results of D-DBPN [7] on the Manga109 dataset for the 8 × scaling factor; the results on the other datasets are cited from their paper.

Table 2 Quantitative evaluation of state-of-the-art SR algorithms: average PSNR / SSIM for scale factors 2 ×, 3 ×, 4 × and 8 ×

When compared with all previous methods, our PRNet+ performs the best on all the datasets for all scaling factors. Even without self-ensemble, our PRNet also achieves the best average results on most datasets. Specifically, for the scaling factor × 3, our PRNet does not hold a similar advantage over RDN [32], mainly because we adjust the number of feature-map channels of our network (252 and 28). However, our results are still better than those of the remaining models. Moreover, we have better applicability than D-DBPN [7]. For the scaling factors × 2, × 4, and × 8, our PRNet performs the best on all datasets. On the other hand, as the scaling factor becomes larger (e.g., × 8), the gain of our PRNet over EDSR [18] also becomes larger.

We also provide visual comparison results as qualitative comparisons. Figure 7 shows visual comparisons at the 4 × scale. For images “21077” and “HighschoolKimengumi_vol 20”, we observe that most of the compared methods produce blurred artifacts and distorted edges. By contrast, our PRNet restores sharper and more natural edges and produces more faithful results. For the distant texture in image “img_093”, none of the compared methods can reconstruct it correctly, while our PRNet clearly can, mainly because PRNet exploits multi-scale feature information through multi-scale features fusion.

Fig. 7 Visual comparison for 4 × SR on BSD100 [19], Urban100 [9], and Manga109 [20] datasets. The best results are highlighted. (Zoom in for best view)

To further illustrate the analysis above, we show visual comparisons for 8 × SR in Fig. 8. For image “img_092”, due to the large scaling factor, the result of bicubic interpolation loses details and produces an incorrect texture structure. This false pre-amplification also leads some state-of-the-art methods (e.g., SRCNN, VDSR, and DRRN) to reconstruct totally erroneous structures. Even with the original LR image as input, the other methods cannot reconstruct the right structure either, while our PRNet recovers it correctly and clearly. Similar observations hold for image “Hamlet”. Our proposed PRNet integrates multi-scale and hierarchical feature information to enhance the ability of feature representation and improve reconstruction performance.

Fig. 8 Visual comparison for 8 × SR on Urban100 [9] and Manga109 [20] datasets. The best results are highlighted. (Zoom in for best view)

4.4 Super-resolving real-world photos

We also conduct SR experiments on two historical real-world images, “banner” (400 × 270 pixels) and “building” (400 × 327 pixels). In this case, the original HR images are not available and the degradation model is unknown. We compare our PRNet with VDSR [10], LapSRN [15], EDSR [18], and RDN [32]. As shown in Fig. 9, our PRNet recreates finer details that are more faithful to the real-world scene than those of other state-of-the-art methods. These results further indicate the benefits of learning multi-scale features from the original input image: combining hierarchical features and multi-scale features is robust to unknown degradation models.

Fig. 9 Visual results on real-world images with scaling factor × 4. The two rows show SR results for images “banner” and “building” respectively. “CEWV” is the ground truth of “banner” and denotes the “Committee to End the War in Vietnam”. (Zoom in for best view)

5 Conclusion

In this paper, we propose a Progressive Residual Network (PRNet) for image SR, in which the progressive residual module (PRM) serves as the basic building module. In each PRM, densely connected up-sampling convolution layers allow full usage of local multi-scale features. Local residual learning (LRL) further improves the flow of information and gradients. Moreover, we propose multi-scale features fusion (MSFF) to fuse the multi-scale features extracted by previous PRMs during reconstruction. By fully using local and global multi-scale features, our PRNet achieves a dense fusion of hierarchical and multi-scale features. We use the same PRNet structure to handle the bicubic degradation model and real-world data. Extensive benchmark evaluations demonstrate that our PRNet yields superior results and outperforms other state-of-the-art methods at large scaling factors such as 4 × and 8 × enlargement.