Introduction

Image super-resolution (SR) aims to restore a high-resolution (HR) image from its degraded low-resolution (LR) counterpart. Image SR is widely used in medical imaging, surveillance, and face recognition. However, SR is an ill-posed problem: multiple HR images can be reconstructed from a single LR image. Many SR methods have been introduced to tackle this issue, including interpolation-based, reconstruction-based, and learning-based methods.

The emergence of convolutional neural networks (CNN) has recently instigated a profound revolution in SR tasks. From SRCNN1 (incorporating only three convolutional layers) to RCAN2 (encompassing over 400 layers), there has been a consistent increase in network depth, width, and complexity. This trend strengthens network representation capabilities and yields reconstruction performance improvements. For example, the enhanced deep SR network (EDSR)3 and non-local sparse attention (NLSA)4, with 43M and 42M parameters, respectively, produce significant restoration effects through powerful nonlinear learning. Nevertheless, the practical application of most of these CNN-based methods in real-world scenarios remains challenging, primarily owing to their demanding memory and computational requirements. Despite various efforts directed toward reducing the number of network parameters and operations, most methods struggle to maintain good reconstruction performance. The deeply recursive convolutional network (DRCN)5 has fewer model parameters thanks to its recursive paradigm, but its reconstruction accuracy is lower. The cascading residual network (CARN)6 implemented a cascading residual architecture that is lightweight overall but underperforms. To better balance network performance and computational cost, researchers have introduced attention mechanisms into the SR task. LatticeNet7, DRSAN8, and A2F9 exploited different attention mechanisms to focus on informative features, boosting reconstruction performance while maintaining moderate computational demands. Notably, most existing attention mechanisms lack structural priors, which are crucial for recovering image details. Therefore, it is essential to devise an effective and lightweight network that exploits structural information within attention mechanisms to reconstruct high-quality HR images.

On the other hand, CNN-based SR approaches struggle to model global dependencies, primarily because of the inherently local nature of convolution. As an alternative to CNN, the Transformer captures global interactions between contexts through self-attention and has therefore been widely adopted in SR. The Swin Transformer10 has exhibited significant promise by harnessing the advantages of both CNN and Transformer. Since then, hybrid CNN-Transformer structures have gradually become a mainstream research trend. The efficient long-range attention network (ELAN)11 proposed a shared-attention technique to speed up the calculation in its group multi-head self-attention. The hybrid network of CNN and Transformer (HNCT)12 combined CNN and Transformer to extract deep features that account for both local and non-local priors. Similarly, the cross-receptive focused inference network (CFIN)13 elegantly integrated CNN and Transformer and achieved competitive performance. ACT14 aggregated enriched features extracted from both CNN and Transformer branches, exploiting multi-scale local and non-local attributes to improve SR quality. Benefiting from the advantages of this hybrid architecture, we further explore the processing of local and global information to obtain more valuable cues for HR reconstruction.

In this study, a lightweight interactive feature inference network (IFIN) is implemented for image SR tasks. Specifically, a series of interactive feature aggregation modules (IFAM) captures increasingly abstract deep features in a coarse-to-fine fashion. IFAM is supported by a structure-aware attention block (SAAB), a Swin Transformer block (SWTB), and an enhanced spatial adaptive block (ESAB), which complement and integrate different features synergistically. SAAB and SWTB extract local structural and global context priors, respectively. These two kinds of features are merged and fused in ESAB, recovering natural and realistic textures of HR images. As shown in Fig. 1, our proposed networks deliver a favorable trade-off between performance and model size, outperforming most renowned SR models.

Figure 1

PSNR and model size comparison of our methods (red star) with mainstream SR networks on Set14 for scale factor \(\times \)2.

In brief, we make three primary contributions.

  1. We introduce a lightweight and efficient model, dubbed IFIN, which utilizes chain-stacked IFAMs to infer image features from coarse to fine granularity. Supported by SAAB, SWTB, and ESAB, the IFAM effectively leverages both local and global prior knowledge, thereby enhancing the network's discriminative capability. IFIN achieves favorable performance with modest computing requirements, surpassing most well-known lightweight approaches.

  2. We propose SAAB, which incorporates asymmetric convolution within the attention mechanism. This integration facilitates the learning of intricate structural information and the generation of more generalized weights, effectively emphasizing critical target regions.

  3. We propose ESAB, which synergistically aggregates local structural information from SAAB and global context information from SWTB. In this way, ESAB enhances the network's adaptability to various image contents and scenes, thereby significantly improving image reconstruction performance.

Related work

CNN-based image SR

Dong et al.1 were pioneers in applying CNN to the SR domain, developing the SRCNN model, which outperformed traditional methods and achieved superior SR results. Inspired by this idea, Kim et al.15 increased the network depth to 20 layers and further improved reconstruction performance. Later, a wide variety of CNN-based SR designs emerged to improve reconstruction accuracy, such as increasing network depth, expanding network width, and devising complex architectures. For instance, the enhanced deep SR network (EDSR)3, residual dense network (RDN)16, holistic attention network (HAN)17, and dual interactive implicit neural network (DIINN)18 were very deep networks with dominant restoration accuracy, but they suffered from very large numbers of parameters and computations. Instead of designing huge networks, efficient SR methods seek a good balance between performance and model capacity. CARN6 leveraged group convolution and a cascading scheme to decrease model capacity and enhance network representation. IDN19 distilled more useful information for SR reconstruction via a distillation technique to reduce network parameters. LatticeNet7 designed lattice blocks that favor a lightweight SR framework, reducing the number of parameters by about half while maintaining similar SR performance. Additionally, to promote the efficiency of feature utilization, several works have incorporated attention mechanisms into the SR field. MemNet20 and the channel-wise and spatial feature modulation network (CSFM)21 aggregated channel attention and spatial attention, exploring the interdependencies between channel and spatial attributes. In addition, PAN22 and DRSAN8 employed attention mechanisms that adaptively rescale features using three-dimensional (3D) attention maps, resulting in improved SR outcomes. Although these solutions produce different lightweight SR results, they ignore the exploration and use of structural priors, which are beneficial for reconstructing image details.

Transformer-based image SR

As an alternative to CNN, the Transformer, which adopts the self-attention mechanism, has boosted the accuracy of various computer vision tasks. One pioneering work is the Vision Transformer (ViT)23, which flattened two-dimensional (2D) image patches into vectors and fed them into the Transformer structure, obtaining remarkable performance gains. Shortly afterward, an increasing number of Transformer-based approaches appeared in SR tasks. The image processing Transformer (IPT)24, based on ViT, acquired better restoration results in denoising, deraining, and SR tasks. Instead of the standard self-attention, the Swin Transformer10 adopted Swin Transformer blocks and incorporated convolutional layers within the block to enforce local connectivity. Currently, a popular research direction in the SR domain is the hybrid structure of CNN and Transformer. Many studies have demonstrated the effectiveness of this hybrid architecture, mainly because the CNN backbone extracts local features while the Transformer backbone establishes global dependencies. For instance, the efficient super-resolution Transformer (ESRT)25, ELAN11, hierarchical patch Transformer (HIPA)26, and ACT27 extracted and enhanced feature representations by hybridizing CNN and Transformer backbones, acquiring better performance than most Transformer-based and CNN-based methods. Indeed, effectively incorporating both local features and global information into lightweight networks is crucial for achieving high-performance results. In this study, we aim to enhance the flexibility and robustness of local structural and global feature priors, thereby achieving HR image restoration.

Proposed method

Overall network architecture

In this work, we construct a lightweight interactive feature inference network (IFIN) for image SR tasks. As depicted in Fig. 2, the entire workflow of IFIN consists of a shallow feature extraction module, several interactive feature aggregation modules (IFAM), and an upsampling part. Firstly, the LR image passes through a 3 \(\times \) 3 convolution to extract shallow features, which can be defined as:

Figure 2

The architecture of our proposed IFIN, which consists of T IFAMs to gradually infer rich contextual features.

$$\begin{aligned} {F_0} = {H_{SFE}}\left( {{I^{LR}}} \right) \end{aligned}$$
(1)

where \({I^{LR}} \in {{\mathbb {R}}^{H \times W \times 3}}\) denotes the LR input image, with H and W indicating its height and width. \({H_{SFE}}( \cdot )\) is the 3\(\times \)3 convolution operation and \({F_0} \in {{\mathbb {R}}^{H \times W \times C}}\) is the extracted shallow feature map, where C is the number of channels. Then, \({F_0}\) is passed to T chain-stacked IFAMs for learning more abstract high-level features. Each IFAM is composed of a structure-aware attention block (SAAB), a Swin Transformer block (SWTB), and an enhanced spatial adaptive block (ESAB), described in “Interactive feature aggregation module (IFAM)”. The process can be expressed as follows:

$$\begin{aligned} F_{IFAM}^t = H_{\mathrm{{IFAM}}}^t\left( {F_{IFAM}^{t - 1}} \right) = H_{\mathrm{{IFAM}}}^t\left( {H_{\mathrm{{IFAM}}}^{t - 1}\left( { \cdots H_{\mathrm{{IFAM}}}^1\left( {{F_0}} \right) \cdots } \right) } \right) \end{aligned}$$
(2)

where \(H_{\mathrm{{IFAM}}}^t( \cdot )\) indicates the operation of the t-th IFAM. \({F_{IFAM}^{t - 1}} \in {{\mathbb {R}}^{H \times W \times C}}\) and \({F_{IFAM}^t} \in {{\mathbb {R}}^{H \times W \times C}}\) are the input and output feature maps of the t-th IFAM. Finally, the extracted deep features \({F_{IFAM}^T \in {{\mathbb {R}}^{H \times W \times C}}}\) are upsampled to the target HR image size, which can be expressed as:

$$\begin{aligned} {I^{HR}} = {H_{HU}}({F_{_{IFAM}}^T}) + {H_{LU}}({I^{LR}}) \end{aligned}$$
(3)

where \({I^{HR}} \in {{\mathbb {R}}^{rH \times rW \times 3}}\) is the reconstructed HR image and r is the scale factor. \({H_{HU}}(\cdot )\) and \({H_{LU}}(\cdot )\) denote the upsampling operations for the deep features and the input LR image, respectively. Similar to work28, both operations combine a convolution (3 \(\times \) 3 for \({H_{HU}}(\cdot )\) and 5 \(\times \) 5 for \({H_{LU}}(\cdot )\)) with a sub-pixel convolutional layer. This technique improves the stability of network training.
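To make the data flow above concrete, the following PyTorch-style sketch assembles Eqs. (1)–(3): a 3 \(\times \) 3 shallow convolution, T chained IFAMs, and the two sub-pixel upsampling branches. It is a minimal illustration rather than the released implementation; the `ifam_block` constructor is a placeholder for the IFAM described in the next subsection, and activation and initialization details are omitted.

```python
import torch
import torch.nn as nn

class IFIN(nn.Module):
    """Sketch of the overall pipeline: shallow feature extraction (Eq. (1)),
    T chained IFAMs (Eq. (2)), and two upsampling branches (Eq. (3))."""
    def __init__(self, ifam_block, num_ifam=5, channels=50, scale=4):
        super().__init__()
        self.sfe = nn.Conv2d(3, channels, 3, padding=1)                      # H_SFE
        self.ifams = nn.Sequential(*[ifam_block(channels) for _ in range(num_ifam)])
        # H_HU: 3x3 convolution + sub-pixel upsampling of the deep features
        self.hu = nn.Sequential(nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
                                nn.PixelShuffle(scale))
        # H_LU: 5x5 convolution + sub-pixel upsampling of the LR input (global skip)
        self.lu = nn.Sequential(nn.Conv2d(3, 3 * scale ** 2, 5, padding=2),
                                nn.PixelShuffle(scale))

    def forward(self, lr):
        f0 = self.sfe(lr)                    # F_0
        deep = self.ifams(f0)                # F_IFAM^T
        return self.hu(deep) + self.lu(lr)   # I^HR
```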

We utilize the \({L_1}\) norm as the objective function of the proposed IFIN. Given a training dataset \(\{ I_i^{LR},I_i^{SR}\} _{i = 1}^N\), where \(I_i^{LR} \in {{\mathbb {R}}^{H \times W \times 3}}\) and \(I_i^{SR} \in {{\mathbb {R}}^{rH \times rW \times 3}}\) denote the i-th LR image and the corresponding ground-truth image, respectively, the loss of the non-linear mapping \({H_{IFIN}}(\cdot )\), which captures the relationship between \(I_i^{LR}\) and \(I_i^{SR}\) under the \(L_1\) norm, is defined as:

$$\begin{aligned} \begin{aligned} L\left( \mathrm{{\Theta }} \right)&= \frac{1}{N}\mathop \sum \limits _{i = 1}^N {\left\| {I_i^{SR} - {H_{IFIN}}\left( {I_i^{LR}} \right) } \right\| _1} \\&= \frac{1}{N}\mathop \sum \limits _{i = 1}^N {\left\| {I_i^{SR} - I_i^{HR}} \right\| _1} \end{aligned} \end{aligned}$$
(4)

where \(\Theta \) is the learnable parameter set of IFIN.
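Concretely, Eq. (4) is a mean absolute error over the training pairs. A minimal sketch, assuming `model` is an IFIN instance and the LR/HR tensors are batched in NCHW layout:

```python
import torch.nn.functional as F

def l1_objective(model, lr_batch, hr_batch):
    """Eq. (4): mean L1 distance between ground truth and reconstruction."""
    sr_batch = model(lr_batch)            # I^HR = H_IFIN(I^LR)
    return F.l1_loss(sr_batch, hr_batch)  # averaged over the mini-batch
```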

Interactive feature aggregation module (IFAM)

As the backbone of IFIN, IFAM allows collaborative exploration of the local and global priors of the image, helping to reconstruct texture-rich HR images. IFAM is made up of SAAB, SWTB, and ESAB, which are described as follows.

Structure-aware attention block (SAAB)

Asymmetric convolution explores structural information by leveraging vertical and horizontal gradient information in parallel, not only reducing model operations but also helping to recover high-quality images. For instance, Tian et al.29 introduced ACNet, which utilizes asymmetric blocks for higher efficiency with fewer parameters. Analogously, Xu et al.30 proposed asymmetric attention convolution (AAConv) to gradually extract advanced spatial patterns and spectral features. Considering the excellent structural prior of asymmetric convolution, we embed it into the attention mechanism to focus on more important structural features and improve network representation. Therefore, we propose a structure-aware attention block (SAAB), which embeds asymmetric convolution within the attention path and modulates the convolutional path with it to acquire rich structure-aware features. In contrast to AAConv, we leverage structural priors to enable the attention path to learn more generalized weights, followed by adaptive reweighting of the convolutional path to emphasize essential target structural information.

As shown in Fig. 3, SAAB starts with 1 \(\times \) 3 and 3 \(\times \) 1 convolutions for structural information exploration, then passes the result to three 3 \(\times \) 3 convolutions for feature learning, followed by a sigmoid function to generate 3D modulation coefficients \({\alpha ^t} \in {{\mathbb {R}}^{H \times W \times C}}\). Additionally, to gather more important generalized features \(F_{gen}^t \in {{\mathbb {R}}^{H \times W \times C}}\), we use two 3 \(\times \) 3 convolutions that are independent of the attention path. Finally, the generalized features are recalibrated by the 3D modulation coefficients, yielding a rich discriminative feature representation \(F_{SAAB}^t \in {{\mathbb {R}}^{H \times W \times C}}\) for accurate SR reconstruction. The above process can be formulated as follows:

$$\begin{aligned} {\alpha ^t} = \sigma \left( {{H_{3 \times 3}}({f_{1 \times 3}}(F_{IFAM}^{t - 1}) + {f_{3 \times 1}}(F_{IFAM}^{t - 1}))} \right) \end{aligned}$$
(5)
$$\begin{aligned} F_{gen}^t = {f_{3 \times 3}}({f_{3 \times 3}}(F_{IFAM}^{t - 1})) \end{aligned}$$
(6)
$$\begin{aligned} F_{SAAB}^t = {\alpha ^t} \cdot F_{gen}^t + F_{gen}^t \end{aligned}$$
(7)

where \(\sigma \left( \cdot \right) \) denotes the sigmoid function. \(H\left( \cdot \right) \) and \(f\left( \cdot \right) \) are different convolution operations, where the subscripts indicate the kernel sizes.
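A possible PyTorch realization of Eqs. (5)–(7) is sketched below. The layer counts follow the description above (three 3 \(\times \) 3 convolutions in the attention path, two in the convolutional path); the intermediate ReLU activations and the absence of grouping are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SAAB(nn.Module):
    """Structure-aware attention block sketch (Eqs. (5)-(7))."""
    def __init__(self, channels=50):
        super().__init__()
        # asymmetric convolutions explore vertical/horizontal structure
        self.conv1x3 = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.conv3x1 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        # attention path: three 3x3 convolutions followed by a sigmoid
        self.att = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())
        # convolutional path: two 3x3 convolutions producing F_gen
        self.gen = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        alpha = self.att(self.conv1x3(x) + self.conv3x1(x))  # Eq. (5)
        f_gen = self.gen(x)                                  # Eq. (6)
        return alpha * f_gen + f_gen                         # Eq. (7)
```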

Figure 3

The architecture of SAAB, which concentrates on rich structure-aware features.

Swin transformer block (SWTB)

SWTB is derived from the literature31, which introduces local attention and a shifted window mechanism to decrease model complexity and achieve efficient learning. We adopt SWTB to learn global context and thereby acquire more valuable information for detail restoration.

Figure 4 presents the structure of two consecutive SWTBs, each containing LayerNorm (LN) layers, a multi-head self-attention block, residual connections, and a multi-layer perceptron (MLP). The window-based multi-head self-attention (W-MSA) unit and the shifted window-based multi-head self-attention (SW-MSA) unit are employed in the two successive Transformer blocks, respectively. \(F_{IFAM}^{t - 1} \in {{\mathbb {R}}^{H \times W \times C}}\) is linearly projected and reshaped into \({\hat{F}}_{IFAM}^{t - 1} \in {{\mathbb {R}}^{N \times C}}\), where \(N = H \times W\). The input feature is then separated into non-overlapping windows, each containing M \(\times \) M patches (M is set to 8 by default). With the window partitioning mechanism, the consecutive SWTBs can be represented as:

$$\begin{aligned} {\hat{F}}_{SWTB}^t = W{\text{-}}MSA\left( {LN\left( {{\hat{F}}_{IFAM}^{t - 1}} \right) } \right) + {\hat{F}}_{IFAM}^{t - 1} \end{aligned}$$
(8)
$$\begin{aligned} {\tilde{F}}_{SWTB}^t = MLP\left( {LN\left( {{\hat{F}}_{SWTB}^t} \right) } \right) + {\hat{F}}_{SWTB}^t \end{aligned}$$
(9)
$$\begin{aligned} {\bar{F}}_{SWTB}^t = SW{\text{- }}MSA\left( {LN\left( {\tilde{F}_{SWTB}^t} \right) } \right) + {\tilde{F}}_{SWTB}^t \end{aligned}$$
(10)
$$\begin{aligned} F_{SWTB}^t = MLP\left( {LN\left( {{\bar{F}}_{SWTB}^t} \right) } \right) + {\bar{F}}_{SWTB}^t \end{aligned}$$
(11)

where \({\hat{F}}_{SWTB}^t\), \({\tilde{F}}_{SWTB}^t\), \({\bar{F}}_{SWTB}^t\), and \(F_{SWTB}^t\) are the outputs of the W-MSA module, the first MLP, the SW-MSA module, and the second MLP of the t-th block, respectively. The self-attention computed in W-MSA and SW-MSA can be summarized by the following formula:

$$\begin{aligned} \mathrm{{Attention}}\left( {Q,K,V} \right) = \mathrm{{softmax}}\left( {\frac{{Q{K^T}}}{{\sqrt{d} }} + B} \right) V \end{aligned}$$
(12)

where Q, K, \(V \in {{\mathbb {R}}^{{M^2} \times d}}\) indicate the query, key, and value matrices, respectively, d indicates the dimension of the query and key, and \(B \in {{\mathbb {R}}^{{M^2} \times {M^2}}}\) indicates the relative position bias.
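Eq. (12) is the standard scaled dot-product attention over M \(\times \) M windows with a learnable relative position bias. A minimal single-head sketch (the multi-head projections and the window partitioning/shifting logic of SWTB are omitted):

```python
import torch

def window_attention(q, k, v, bias):
    """Eq. (12): q, k, v have shape (num_windows, M*M, d); bias has shape
    (M*M, M*M) and encodes the relative positions within a window."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5 + bias   # (nW, M*M, M*M)
    return torch.softmax(scores, dim=-1) @ v             # (nW, M*M, d)
```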

Figure 4

The architecture of SWTB that can model global information effectively.

Enhanced spatial adaptive block (ESAB)

It is recognized that both the local and global priors of an image contribute to the reconstruction of rich texture details. While SAAB and SWTB consider the local and non-local features of images independently, there is room to further enhance the flexibility of their fusion. By exploiting local and global priors jointly, the network becomes more robust to changes in the input image and better able to handle different image contents and scenes, favouring the recovery of richer high-frequency information.

Figure 5 illustrates the structure of ESAB. Firstly, the output features \(F_{SAAB}^t\) and \(F_{\mathrm{{SWTB}}}^t\) are concatenated and processed by a 1\(\times \)1 convolutional layer to harvest diverse fused characteristics. Then, a 3\(\times \)3 convolutional layer is exploited to generate the modulation parameters \({\alpha _1^t}\) and \({\beta _1^t}\), which are updated with the mean and standard deviation of the fused characteristics. Subsequently, a sigmoid operation is applied to yield modulation coefficients. Finally, the spatial features \(F_{SAAB}^t\), enhanced by a 1\(\times \)1 convolution, are multiplied with the modulation coefficients and then added to \({\beta _1^t}\) to acquire the spatial modulation features \({{\hat{F}}_{SAAB}^t} \in {{\mathbb {R}}^{H \times W \times C}}\).

$$\begin{aligned} {\hat{F}}_{SAAB}^t = {f_{1 \times 1}}\left( {F_{SAAB}^t} \right) \cdot \sigma \left( {{\alpha _1^t}} \right) + {\beta _1^t} \end{aligned}$$
(13)

Analogously, the modulated features \({{\hat{F}}_{SAAB}^t}\) are convolved to produce another pair of modulation parameters \({\alpha _2^t}\) and \({\beta _2^t}\), which are multiplied with and added to the enhanced global features \({\hat{F}}_{\mathrm{{SWTB}}}^t\), distilling the global modulation features \(F_{ESAB}^t\), which also serve as the output features \(F_{IFAM}^t\) of the t-th IFAM.

$$\begin{aligned} F_{_{IFAM}}^t = F_{ESAB}^t = {f_{1 \times 1}}\left( {{\hat{F}}_{SWTB}^t} \right) \cdot \sigma \left( {{\alpha _2^t}} \right) + {\beta _2^t} \end{aligned}$$
(14)

where \(\sigma \left( \cdot \right) \) denotes the sigmoid function and \({f_{1 \times 1}}\left( \cdot \right) \) denotes a 1\(\times \)1 convolution.
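A rough PyTorch sketch of Eqs. (13) and (14) is given below. How the \(\alpha \)/\(\beta \) pairs are split from the convolution outputs and how the statistics of the fused features enter the modulation are simplified here; only the overall modulation flow is intended to be faithful.

```python
import torch
import torch.nn as nn

class ESAB(nn.Module):
    """Enhanced spatial adaptive block sketch (Eqs. (13)-(14))."""
    def __init__(self, channels=50):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 1)             # fuse concat(F_SAAB, F_SWTB)
        self.mod1 = nn.Conv2d(channels, 2 * channels, 3, padding=1)  # -> alpha_1, beta_1
        self.enh_local = nn.Conv2d(channels, channels, 1)            # f_1x1 on F_SAAB
        self.mod2 = nn.Conv2d(channels, 2 * channels, 3, padding=1)  # -> alpha_2, beta_2
        self.enh_global = nn.Conv2d(channels, channels, 1)           # f_1x1 on F_SWTB

    def forward(self, f_saab, f_swtb):
        fused = self.fuse(torch.cat([f_saab, f_swtb], dim=1))
        a1, b1 = self.mod1(fused).chunk(2, dim=1)
        f_saab_mod = self.enh_local(f_saab) * torch.sigmoid(a1) + b1   # Eq. (13)
        a2, b2 = self.mod2(f_saab_mod).chunk(2, dim=1)
        return self.enh_global(f_swtb) * torch.sigmoid(a2) + b2        # Eq. (14)
```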

Figure 5

The architecture of ESAB that autonomously aggregates local structure and global feature priors.

Experiments

Datasets and metrics

IFIN-S and IFIN are trained on the DIV2K32 dataset, which provides 800 high-quality images. We then test on five benchmark datasets: Set533, Set1434, B10035, Urban10036, and Manga10937. Besides, three degradation models, known as bicubic (BI), blur-downscale (BD), and downscale-noise (DN), are used to demonstrate the effectiveness of IFIN. The experimental outcomes are evaluated with two metrics, the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM), computed on the Y channel of the transformed YCbCr color space.
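For reference, Y-channel PSNR is obtained by converting RGB to YCbCr with the usual ITU-R BT.601 coefficients and evaluating the log mean-squared-error ratio on the luma plane. The sketch below shows this convention; the optional border cropping is a common practice in SR evaluation and is an assumption rather than a statement of our exact protocol.

```python
import numpy as np

def psnr_y(sr, hr, shave=0):
    """PSNR on the Y channel; sr and hr are uint8 RGB arrays of shape (H, W, 3)."""
    def to_y(img):
        img = img.astype(np.float64)
        return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1]
                       + 24.966 * img[..., 2]) / 255.0
    y_sr, y_hr = to_y(sr), to_y(hr)
    if shave:  # optionally crop a border (often equal to the scale factor)
        y_sr, y_hr = y_sr[shave:-shave, shave:-shave], y_hr[shave:-shave, shave:-shave]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```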

Implementation details

To build a lightweight architecture, we devise two network variants, referred to as IFIN-S and IFIN. The number of channels C is set to 50. IFIN-S stacks three IFAMs and sets the group size of the 3 \(\times \) 3 convolutions in SAAB to 2. IFIN stacks five IFAMs.

Following the BI degradation model, we downsample the datasets by scale factors of \(\times \)2, \(\times \)3, and \(\times \)4 to produce the corresponding LR images. For the BD and DN degradation models, we process the datasets only with a scale factor of \(\times \)3. Each mini-batch comprises 16 image patches of size 60 \(\times \) 60. To augment the data, both the LR and HR image patches undergo random horizontal flipping as well as rotations of 90\(^\circ \), 180\(^\circ \), and 270\(^\circ \). Before feeding the mini-batch into the model, we normalize it by subtracting the average RGB value calculated over the entire training dataset. The \({L_1}\) loss is minimized by the Adam optimizer with \(\beta _1 = 0.9\), \(\beta _2 = 0.999\), and \(\varepsilon = 10^{-8}\). The learning rate is initialized to 1e−3 and halved every 500 epochs. The detailed hyperparameters of our network architecture are listed in Table 1. IFIN-S and IFIN are implemented in the PyTorch framework on an NVIDIA TESLA V100 GPU.
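The augmentation and optimization settings above can be reproduced approximately as follows; the data pipeline, patch sampling, and epoch bookkeeping are omitted, and the function names are ours.

```python
import random
import torch

def augment(lr_patch, hr_patch):
    """Random horizontal flip plus a random rotation of 0/90/180/270 degrees,
    applied identically to LR and HR patches of shape (C, H, W)."""
    if random.random() < 0.5:
        lr_patch, hr_patch = torch.flip(lr_patch, [2]), torch.flip(hr_patch, [2])
    k = random.randint(0, 3)
    return torch.rot90(lr_patch, k, [1, 2]), torch.rot90(hr_patch, k, [1, 2])

def build_optimizer(model):
    """Adam with beta1=0.9, beta2=0.999, eps=1e-8; the learning rate starts at
    1e-3 and is halved every 500 epochs (scheduler.step() called once per epoch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), eps=1e-8)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.5)
    return optimizer, scheduler
```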

Table 1 Values of hyperparameters for our network.

Ablation study

In this section, we conduct ablation studies to demonstrate the effectiveness of the various components of IFIN in enhancing reconstruction accuracy. We remove SAAB, ESAB, and SWTB in turn, thus obtaining three additional models. Table 2 reports model capacity, PSNR/SSIM, and time consumption for the four models on five benchmark datasets. The time consumption tests are performed on an NVIDIA GeForce RTX 3060 GPU, with results averaged across the datasets. Additionally, we explore the impact of varying the number of IFAMs on network performance, aiming to identify the optimal balance between computational efficiency and enhancement efficacy.

Table 2 Ablation studies on effects of SAAB, SWTB, and ESAB.
  (1) Investigation of SAAB. Our proposed SAAB inherits the property of asymmetric convolution and can learn the structural features of the image for better detail restoration. As reported in Table 2, IFIN equipped with SAAB acquires a substantial performance improvement, especially on the structurally complex Urban100 and Manga109 datasets, where SSIM improves by 0.0027 and 0.0035, respectively. Although the addition of SAAB increases the parameter count by 328K, the performance improvement is significant, and the increase in time consumption is only 7%. As expected, SAAB facilitates the recovery of high-quality images by embedding structural priors. Figure 6 shows visual heatmaps at different stages of IFAM, produced by IFIN w/o SAAB and IFIN, respectively. IFIN enabled by SAAB (Fig. 6f–j) effectively outlines clear and sharp edge information, validating its ability to explore structural textures, whereas IFIN w/o SAAB (Fig. 6a–e) not only has lower reconstruction accuracy but also displays blurred and distorted structural information.

    Figure 6

    Visualized feature maps of IFIN with and without SAAB. (a–e) Heatmaps of IFIN without SAAB; (f–j) heatmaps of IFIN with SAAB.

  (2) Investigation of SWTB. SWTB has strong representation ability and global information utilization, thus facilitating the recovery of more useful characteristics. As shown in Table 2, introducing SWTB yields gains of at least 0.10 dB in the reconstruction results, which indicates the importance of global dependence in image restoration. The convergence analysis is presented in Fig. 7. We find that IFIN w/o SWTB converges faster, while the other models converge relatively more slowly. Inevitably, the inference time of the model becomes longer with the introduction of the SWTB architecture. However, the extra inference time is a necessary concession for the improvement in reconstruction accuracy.

    Figure 7

    Convergence results of different models on Set5 for scale factor \(\times \)4.

  (3) Investigation of ESAB. As one of the key components of IFIN, the proposed ESAB effectively exploits local structure and global dependence from SAAB and SWTB for better feature aggregation. As shown in Table 2, IFIN augmented with ESAB attains improvements of 0.06 dB in PSNR and 0.0010 in SSIM on Urban100, albeit with an increase in the number of model parameters by 8.2% and in Multi-Adds by 20%. Additionally, we give visual heatmaps of IFIN with and without ESAB in Fig. 8 to observe how ESAB acts on local and global responses. In Fig. 8a–e, the high-frequency texture features present blurry and checkerboard artifacts in the absence of ESAB. On the contrary, detailed features of the image are clearer and more comprehensive after ESAB processing in Fig. 8f–j; the model not only focuses on repeated small patterns but also emphasizes sharp edge details. Taken together, SAAB, ESAB, and SWTB exhibit great reasonableness and effectiveness.

    Figure 8

    Visualized feature maps of IFIN with and without ESAB. (a–e) Heatmaps of IFIN without ESAB; (f–j) heatmaps of IFIN with ESAB.

  (4) Investigation of IFAM. In Fig. 9, we illustrate visual features to analyze the interaction between the modules within the last IFAM. Based on the visualization, the output features of the three modules pay minimal attention to the low-frequency regions. Specifically, the output feature of SAAB focuses on texture structure details, such as lines and small patterns. In contrast, the output feature of SWTB shows an even distribution of activation values across the feature map. More importantly, the output of ESAB fuses global and local properties, resulting in a more pronounced representation of the target area and higher overall activation values. This observation suggests that the complementary fusion of local and global features facilitates the generation of additional high-frequency information, thereby aiding the reconstruction of high-quality images.

    Figure 9

    Average feature visualization inside the last IFAM.

  (5) Analysis of different numbers of IFAMs. To investigate the impact of model depth on network performance, we vary the number of IFAMs, setting T to 2, 3, 4, 5, and 6. As reported in Table 3, increasing the depth of the model by adding more IFAMs generally leads to better performance in terms of both PSNR and SSIM, but the gains slow down once T exceeds 5. Therefore, we opt for T = 5 IFAMs to balance model performance against computational cost.

    Table 3 Average PSNR and SSIM results on Set14 for scale factor \(\times \)4.

Results with BI degradation

We compare the proposed IFIN-S and IFIN against existing SR approaches under the BI degradation model: SRCNN1, FSRCNN38, VDSR15, DRCN5, LapSRN39, DRRN40, MemNet20, CARN6, CBPN41, AWSRN-M42, OISR-RK2-s43, A2F-S9, LESRCNN44, SPBP-L45, RMUN46, FALSR47, WMRN48, LMAN-s49, MADNet-L150, MSWSR51, Cross-SRN52, ACNet29, CRMBN53, DRSAN-48m8, FMEN54, AFAN55, ESRT56, LBNet57, CFGN58, and Ngswin59.

Quantitative comparison

For a practical comparison that aligns with real-world application needs, we focus on mainstream models with fewer than 2000K network parameters. To make the comparison more comprehensible, we report Multi-Adds calculated for recovering a 1280 \(\times \) 720 (720P) HR image. According to the results presented in Tables 4, 5 and 6, the proposed IFIN-S and IFIN exhibit competitive or superior PSNR and SSIM at different scales compared to popular SR networks. Compared to CNN-based methods, IFIN-S and IFIN exhibit better reconstruction performance at similar computational complexity. Note that our IFIN-S shows comparable results to CFGN, Cross-SRN, and FMEN, which require more parameters and computations than ours, and IFIN-S achieves competitive results with ESRT while requiring less model capacity. Additionally, IFIN stands out by producing promising SR results with modest network parameters and Multi-Adds, even when compared to recently proposed Transformer-based methods. For example, in Table 6, IFIN has 0.19 dB higher PSNR and 0.0016 higher SSIM on Urban100 for \(\times \)4 than Ngswin. Although the Multi-Adds of IFIN are higher than those of Ngswin, the judicious increase in Multi-Adds is a necessary trade-off for improving accuracy. In essence, the superiority of our proposed methods is even more remarkable for large scale factors. This can be attributed to the fact that LR images contain fewer pixels at larger scale factors, necessitating the extraction of richer features to accurately restore HR images. Our proposed models, which combine local and global strategies, exhibit strong representational capabilities and thus capture intricate details more effectively, significantly enhancing SR performance.

Table 4 Quantitative comparison on benchmark datasets for scale factor \(\times \)2.
Table 5 Quantitative comparison on benchmark datasets for scale factor \(\times \)3.
Table 6 Quantitative comparison on benchmark datasets for scale factor \(\times \)4.
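The Multi-Adds figures in Tables 4, 5 and 6 follow the common convention of counting multiply-accumulate operations for reconstructing a 1280 \(\times \) 720 HR image. For a single convolutional layer this is simply the product of kernel area, input and output channels, and the spatial size of its output feature map, as in the simplified estimate below (bias terms and non-convolutional layers are ignored).

```python
def conv_multi_adds(k, c_in, c_out, out_h, out_w):
    """Multiply-accumulate count of one k x k convolution producing a
    feature map with c_out channels and spatial size out_h x out_w."""
    return k * k * c_in * c_out * out_h * out_w

# e.g. one 3x3 convolution with 50 input/output channels operating at the x2
# LR resolution (360 x 640) when the target HR output is 720P:
# conv_multi_adds(3, 50, 50, 360, 640) == 5_184_000_000, i.e. about 5.2G
```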

Qualitative comparison

We offer visual comparisons on selected portions of the benchmark datasets, as depicted in Figs. 10, 11, 12, 13 and 14. Our IFIN exhibits superior restoration of stripe and line patterns, producing finer and more accurate super-resolved images. As Fig. 10 depicts, our IFIN produces sharper details that are close to the HR image. In Figs. 12 and 13, ESRT, LBNet, and CFGN can yield stripe characteristics but with visible blurring. For Figs. 11 and 14, which are rich in stripe information, the comparison methods show severe distortions and deformations. Conversely, our IFIN effectively mitigates these issues, recovering finer and more accurate details. This efficacy stems from IFIN's specialization in capturing minute textures and extracting high-frequency cues, resulting in clearer and more precise image restorations.

Figure 10

Qualitative comparison of popular networks on Set14 for scale factor \(\times \)2.

Figure 11

Qualitative comparison of popular networks on Urban100 for scale factor \(\times \)2.

Figure 12

Qualitative comparison of popular networks on B100 for scale factor \(\times \)3.

Figure 13

Qualitative comparison of popular networks on Set5 for scale factor \(\times \)4.

Figure 14

Qualitative comparison of popular networks on Urban100 for scale factor \(\times \)4.

Results with BD and DN degradations

In this section, we conduct SR experiments with the BD and DN degradation models to further demonstrate the effectiveness and robustness of the proposed IFIN. In this comparison, IFIN is evaluated against several popular SR methods, including SRCNN1, FSRCNN38, VDSR15, IRCNN_G60, IRCNN_C60, SRMDNF61, RDN16, and AFAN55. According to the results presented in Table 7, our IFIN achieves the best or second-best reconstruction results. Note that our IFIN outperforms RDN on the DN degradation model while requiring far fewer parameters and operations: the parameters and operations of IFIN are around 4.5% (0.98M vs. 22M) and 4.7% (107G vs. 2282G) of those of RDN, respectively. In addition, IFIN is superior to AFAN because it excels at exploring local and global priors, which considerably boosts the network's discriminative ability. All experiments indicate that our IFIN strikes an advantageous trade-off between model capacity and reconstruction accuracy when compared with state-of-the-art SR methods.

Table 7 Quantitative comparison on benchmark datasets for scale factor \(\times \)3.

Results on real remote-sensing images

To further demonstrate the efficacy of our proposed methods, we test them on remote-sensing images of relatively low quality and spatial resolution. Following the methodologies established in works62,63, we use two test sets named RS-T1 and RS-T2 from the UC Merced dataset64. Both RS-T1 and RS-T2 contain 120 images and cover diverse scenes with complicated image patterns. We compare against existing remote-sensing SR methods, including SRCNN1, VDSR15, LGCNet65, LapSRN39, IDN19, LESRCNN44, CARN-M6, FENet63, FDENet66, and DRAN67. All the aforementioned methods are evaluated directly on the remote-sensing data using pre-trained models provided by the respective authors. These approaches are all trained on the DIV2K dataset, ensuring a fair comparison.

As presented in Table 8, our IFIN obtains the best PSNR and SSIM scores, surpassing advanced remote-sensing SR methods such as LGCNet, FENet, DRAN, and FDENet. For instance, the proposed IFIN obtains gains of 0.04-0.16 dB in PSNR and 0.0004-0.0051 in SSIM over the lightweight DRAN. It is important to note that IFIN-S, which has fewer parameters than DRAN, attains competitive results on the RS-T1 and RS-T2 datasets, exhibiting strong flexibility and stability. Figures 15 and 16 give visual comparisons of several methods. As we can see, our methods exhibit a better reconstruction effect than the comparison networks, particularly in terms of object outlines and texture details. In Fig. 16, LGCNet, LESRCNN, and FENet cannot recover the stripe information, resulting in serious blurring, distortion, and artifacts. In contrast, our IFIN-S and IFIN, equipped with SAAB, SWTB, and ESAB, reconstruct details more clearly and accurately, visually consistent with the HR image.

Table 8 Quantitative comparison of remote-sensing datasets on RS-T1 and RS-T2.
Figure 15

Qualitative comparison of popular networks on RS-T1 for scale factor \(\times \)3.

Figure 16

Qualitative comparison of popular networks on RS-T2 for scale factor \(\times \)4.

Model complexity

While model parameters and computational operations are important when designing lightweight methods, time consumption emerges as a critical metric for assessing the suitability of these methods for real-time applications. To this end, we perform time testing on several representative SR approaches using the same device with an NVIDIA GeForce RTX 3060 GPU. Notably, we measure the time consumption of each method four times and compute the average score as the final result.

We report the model parameters, Multi-Adds, PSNR/SSIM, and time consumption on the B100 dataset in Table 9. Compared to CNN-based methods, the difference in performance and time cost between IFIN-S and AWSRN-M is not significant, but the computational complexity of IFIN-S is half that of AWSRN-M. Additionally, the inference time of CFGN is 24.7% longer than that of IFIN-S. Crucially, our IFIN obtains the highest PSNR and SSIM scores while utilizing fewer model parameters and calculations. However, the inclusion of a self-attention mechanism, which necessitates a greater number of multiplication operations, somewhat diminishes the computational efficiency of our method. Nevertheless, our IFIN-S and IFIN exhibit superior computational efficiency in comparison to Transformer-based SR approaches. For instance, the time costs associated with DRSAN-48m and Ngswin are 8 and 2.5 times greater than that of our IFIN, respectively. Consequently, our proposed IFIN-S and IFIN present a beneficial balance among network complexity, reconstruction accuracy, and time consumption.

Table 9 Comparison of model complexity on B100 for scale factor \(\times \)3.

Limitations

Our proposed method exhibits slower inference speeds than most CNN-based methods, primarily due to the higher computational complexity of the Transformer. In addition, although the designed structure achieves high performance, it is limited by its fixed upsampling strategy. In future work, we intend to focus on improving the inference speed of the network and on designing adaptive upsampling strategies while ensuring reconstruction accuracy.

Conclusion

In this study, an efficient and lightweight interactive feature inference network (IFIN) is devised for the image SR task. Specifically, we propose an interactive feature aggregation module (IFAM), which consists of three key components: a structure-aware attention block (SAAB), a Swin Transformer block (SWTB), and an enhanced spatial adaptive block (ESAB). The SAAB focuses on capturing locally salient structural features using asymmetric convolution, aiding the restoration of texture details. Additionally, it collaborates with the SWTB to integrate global information efficiently into the ESAB. The ESAB plays a crucial role in seamlessly fusing and complementing local and global characteristics, generating more expressive feature representations and reconstructing natural and realistic image details. Extensive experiments indicate that our IFIN-S and IFIN are superior with respect to model capacity and reconstruction performance, exceeding mainstream lightweight SR methods.