1 Introduction

Single image super-resolution (SISR) has applications in many fields, including surveillance [1], medical imaging [2], and remote sensing imaging [3]. The goal of SISR is to reconstruct a high-resolution (HR) image from its low-resolution (LR) observation by recovering the lost high-frequency information. However, this is an inherently ill-posed inverse problem, since a given LR image may correspond to multiple plausible HR images.

In the last decade, deep learning (DL)-based methods have significantly improved the performance of many computer vision tasks compared to traditional methods. For example, DL-based traffic sign detectors can effectively detect small and fuzzy traffic signs in the wild, offering potential for autonomous driving. Zhang et al. [4] proposed ReYOLO to learn rich contextual information and sense scale variations, showing strong performance on traffic sign datasets such as CCTSDB2021 [5].

Due to the powerful feature extraction capabilities of convolutional neural networks (CNNs), many CNN-based SR models [6,7,8,9,10,11,12] have been proposed to learn the mapping between LR and HR images from a large dataset. Dong et al. were the first to develop a CNN-based SR model (SRCNN) [6]. To improve the feature extraction ability of CNNs, Lim et al. further increased the network depth by stacking more than 30 residual blocks and proposed an enhanced deep SR (EDSR) network [7], yielding high reconstruction accuracy. To effectively extract channel and spatial features and capture global contextual information from high-resolution remote-sensing images, Zhou et al. [13] proposed a split depth-wise separable graph convolutional network (SGCN) and achieved strong performance on road extraction. ClassSR [14] combines classification and SR in a unified framework to accelerate SR networks on large images. However, SR images generated by the aforementioned networks are still blurry at edges and textures. Based on the powerful generative adversarial network (GAN) [15], Ledig et al. [16] proposed SRGAN for photorealistic SR. To reduce the impact of inter-class similarities among feature maps, NIGAN [17] embedded a remote sensing road scene neighbourhood probability enhancement method into a conditional GAN to extract meaningful features. Inspired by fluid dynamics, Zhang et al. [18] delved into the mechanism of image evolution during the SR process and proposed FM-Net, which models the motion of pixels during reconstruction as fluid flow. Based on the consistency between thermal particles in a thermal field and pixels in the image domain, Zhang et al. [19] further proposed HTI-Net for image SR; these methods also improve the effectiveness of hardware Trojan detection systems, offering potential for real-world applications [20].

Attention mechanisms aim at guiding the network to focus on important signals or features and to suppress irrelevant ones. Due to the high information redundancy between channels, a channel attention mechanism was proposed in SENet [21] to re-weigh the channels, achieving great success in image classification. Similarly, RCAN [10] introduced a first-order channel attention mechanism for SISR. SAN [22] computes the covariance of feature maps to predict second-order channel attention maps, further improving SISR performance. Liu et al. [11] further enhanced spatial attention and proposed RFANet. In MFA [23], a frequency-channel attention mechanism was proposed to perform attention across both channel and frequency dimensions and rescale the important frequency regions.

The aforementioned channel or spatial attention mechanisms generate attention maps from mean statistics, which use only the lowest frequency component while discarding high-frequency information. However, capturing high-frequency information from feature maps is crucial for SISR in order to recover the lost high-frequency components such as fine details and textures. It has been demonstrated that CNNs are biased towards learning from low-frequency signals [24], making high-frequency signals hard to propagate through the network. To address this problem, FcaNet [25] transfers the feature maps from the spatial domain to the frequency domain and generates global descriptors for each channel based on its discrete cosine transform (DCT) coefficients. In a similar way, Magid et al. [24] proposed a matrix multi-spectral channel attention mechanism that re-weighs channels according to their frequency spectrum. These frequency-domain-based channel attention mechanisms integrate several manually selected high-frequency components. While such methods can capture high-frequency information when generating global descriptors, they require manual selection of DCT coefficients and lack guidance when generating channel-wise scaling factors.

Motivated by these issues, we propose to apply a channel attention mechanism with contrastive learning to make the network focus more on learning high-frequency features. In particular, we treat this problem from both channel and spatial perspectives. Our channel-wise approach is to construct a high-frequency channel attention (HFCA) module. We conducted experiments to evaluate representations of high-frequency features in the network. Figure 1 shows the correlation coefficients between the feature maps and the output of a high-pass filter applied to them. The results show that the high-frequency content of different channels varies considerably relative to the corresponding feature maps. In general, channels with abundant high-frequency information should be assigned high weights. To realize this design, the proposed HFCA module predicts channel-wise global descriptors by calculating similarity statistics between the original feature maps and their high-frequency components. Compared to previous channel attention mechanisms [10, 22, 24], this method captures high-frequency information using a more powerful statistical approach and adaptively adjusts each channel-wise scaling factor according to its high-frequency similarity. Our spatial approach is to introduce a contrastive learning loss to separate high-frequency and low-frequency features in the spatial domain. Considering that higher-frequency areas of the feature maps should be assigned larger scaling factors, we propose a frequency contrastive learning (FCL) module that trains a spatial mask by optimizing a contrastive learning loss constraint. This trained spatial mask adaptively makes the network focus on high-frequency areas and allows high-frequency signals to propagate through the network more easily.

Fig. 1

Correlation coefficients between feature maps and the output of high-pass filters for each channel. Results were obtained from the feature maps of the 12th residual block of an EDSR-baseline model with 40 channels

We incorporate the proposed HFCA and FCL into a standard residual block to obtain the high-frequency channel contrastive block (HFCCB). By integrating HFCCBs into an EDSR baseline, we construct a lightweight high-frequency channel contrastive network (HFCCN) with only around 1M parameters. As shown in Fig. 2, compared to models of the same parameter scale, the proposed network achieves improved results. The main contributions of this paper are threefold. (1) An HFCA module is proposed to rescale channel-wise features based on their high-frequency similarity, while introducing few additional parameters. (2) An FCL loss is used to train a spatial mask that rescales high-frequency features. (3) A lightweight HFCCN is proposed for highly accurate image SR, obtaining improved performance over other CNN-based SR models of the same parameter scale.

Fig. 2

PSNR performance comparison between the proposed HFCCN and other lightweight networks with similar model sizes. Models were evaluated on Set5 with a scale factor of 2

2 Related work

2.1 CNN-based super-resolution models

Over the past decade, CNN-based models have outperformed traditional methods in low-level computer vision tasks [26], due to their powerful nonlinear representation abilities. Dong et al. [6] first proposed a CNN with three convolutional layers to construct a nonlinear mapping between LR and HR images. To increase the receptive field, Kim et al. proposed VDSR [27] with 20 layers, introducing residual learning to accelerate the convergence of the network. DRCN [28] and DRRN [29] incorporated recursive blocks to further improve overall performance and reduce the number of model parameters. Shi et al. [8] proposed an efficient sub-pixel convolutional layer at the end of the network, allowing the learning operations to be performed on LR feature maps and reducing computation time and model parameters. To obtain more accurate reconstruction results, Lim et al. [7] removed unnecessary modules in the residual block and stacked these modules to construct a very deep and wide network, EDSR. Utilizing the benefits of the dense structure, Tong et al. [30] proposed SRDenseNet, which fully utilizes the residual information from previous layers. Zhang et al. [31] combined residual connections [32] and dense connections [33] in a dense residual block and proposed RDN. These CNN-based SR models tend towards ever deeper architectures; although they achieve good reconstruction performance, they incur considerable increases in model complexity and computational cost. MFFN [34] proposed a multi-level feature fusion network that fuses deep-layer with shallow-layer feature information, yielding a lightweight SR model. Similarly, we construct our lightweight CNN-based SR network based on an EDSR baseline model as a trade-off between the number of parameters and restoration performance.

2.2 Attention mechanism

In recent years, attention mechanisms have been widely used to guide network resources towards important features. To exploit the redundancy between different channels, SENet [21] was proposed to fully utilize the relationships among channels, improving performance on image classification. RCAN [10] then incorporated a channel attention mechanism into an SR network. In its channel attention block, global average pooling (GAP) is applied to each input channel to produce a channel descriptor; these descriptors are then fed into several convolutional layers to further explore channel-wise relationships, and a sigmoid function produces channel-wise scaling factors to rescale the input channels. SAN [22] was designed with a second-order channel attention (SOCA) module, which characterizes the correlation of channel-wise features by computing a covariance matrix. DRLN [35] was proposed with a Laplacian module, which adaptively rescales feature maps across multiple feature spaces. RFANet [11] combines a residual feature aggregation block, which utilizes the features on the residual branch, with an enhanced spatial attention block that generates more powerful spatial attention maps. Considering the interdependencies between pixels on each feature map, SARN [36] proposed a novel spatial attention block to rescale pixel-wise features. Hu et al. [9] combined channel and spatial attention mechanisms in the channel-wise and spatial feature modulation network (CSFM) for SISR. However, most of these channel attention methods do not generate global descriptors based on the frequency content of the input features. To overcome this, Magid et al. [24] proposed a matrix multi-spectral channel attention mechanism that incorporates high-frequency components into its design. To reduce the impact of background information and distractors, Zhang et al. [37] proposed a novel Siamese anchor-free network based on criss-cross attention, obtaining more accurate and robust tracking results. Considering that most attention mechanisms operate at a single resolution, Liu et al. [38] proposed a cross-resolution feature attention mechanism to progressively reconstruct images at different scale factors. Similarly, we adapt and improve channel and spatial attention mechanisms to enhance reconstruction performance, explicitly focusing on the learning of high-frequency signals. Existing methods use only a few manually selected frequency components and cannot capture meaningful high-frequency components adaptively, whereas our model generates global descriptors by calculating the similarity between features and their high-frequency components. In this way, channels with abundant high-frequency content are given larger weights, improving the performance of the channel attention mechanism.

2.3 Contrastive learning

Contrastive learning has recently been used for self-supervised learning and has shown impressive performance. It aims to pull anchors closer to positive samples and push them away from negative samples, maximizing mutual information in the representation space, and has previously been used in high-level tasks [39, 40]. However, there have been few attempts to explore contrastive learning in SISR. Wang et al. [41] proposed a contrastive self-distillation framework by introducing a contrastive loss and knowledge transfer. Wu et al. [42] proposed a novel contrastive loss and demonstrated its effectiveness for improving the quality of reconstructed images. Xia et al. [12] applied contrastive learning to further separate relevant and irrelevant features. RLFN [43] re-selects the features used in the contrastive loss by replacing the ReLU activation function of a randomly initialized two-layer feature extractor with the Tanh function. CIPPSRNet [44] introduced contrastive learning to optimize a pretrained network and improve its adaptability and robustness. These models adapted contrastive learning to better extract feature representations by identifying similarities and differences in unlabelled data. Herein, we adapt contrastive learning to extract high-frequency representations as priors without supervision. The aim is to increase the gap between high-frequency and low-frequency features using a spatial mask that adaptively assigns a scaling factor to every pixel. By incorporating contrastive learning, the network can focus more on learning the high-frequency areas of the feature maps, allowing high-frequency signals to propagate through the network more easily.

3 Methodology

The proposed high-frequency channel contrastive network (HFCCN) consists of two novel modules that can be readily integrated into an existing SR architecture (e.g. EDSR [7]): the high-frequency channel attention (HFCA) module (Sect. 3.1) and the frequency contrastive learning (FCL) module (Sect. 3.2). HFCA and FCL adopt channel and spatial approaches, respectively, to make the network focus on learning high-frequency information. HFCA rescales channels using a similarity statistic between features and their high-frequency details. FCL separates high- and low-frequency information to allow better propagation of high-frequency components through the network.

Fig. 3

Illustration of the proposed high-frequency channel contrastive block (HFCCB). a Structure of HFCCB, containing two modules: the high-frequency channel attention (HFCA) module and the frequency contrastive learning (FCL) module. b Architecture of the HFCA module. HPF represents the Laplacian high-pass filter

3.1 High-frequency channel attention

Firstly, we revisit the residual blocks in EDSR [7]. The residual block contains two \(3 \times 3\) convolutional layers and a ReLU activation function. We denote \({x_{in}}\) and \({x_{res}}\) as the input and the output of the residual block, which can be formulated as:

$$\begin{aligned} {x_{res}} = {W_{3 \times 3}^2} * \phi \left( {{W_{3 \times 3}^1} * {x_{in}}} \right) \end{aligned}$$
(1)

where \({W_{3 \times 3}^1}\) and \({W_{3 \times 3}^2}\) are the weights of \(3 \times 3\) convolutional layers, and \(\phi \) denotes the ReLU function.
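For concreteness, a minimal PyTorch sketch of this residual body is given below; the channel width n_feats corresponds to the 40 intermediate feature channels used in our lightweight setting (Sect. 4.2):

```python
import torch
import torch.nn as nn

class ResidualBody(nn.Module):
    """Two 3x3 convolutions with a ReLU in between, as in Eq. (1)."""
    def __init__(self, n_feats: int = 40):
        super().__init__()
        self.conv1 = nn.Conv2d(n_feats, n_feats, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(n_feats, n_feats, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x_in: torch.Tensor) -> torch.Tensor:
        # x_res = W2 * phi(W1 * x_in)
        return self.conv2(self.relu(self.conv1(x_in)))
```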

As shown in Fig. 3a, our proposed HFCA module and FCL module can be easily integrated into a normal residual block to construct the high-frequency channel contrastive block (HFCCB). \({x_{out}}\) represents the output of HFCCB, defined as:

$$\begin{aligned} {x_{out}} = {x_{hfa}} \cdot {x_{cl}} + {x_{in}} \end{aligned}$$
(2)

where \({x_{hfa}}\) and \({x_{cl}}\) represent the outputs of HFCA module and FCL module, respectively.

In general, the challenge of SISR lies in restoring high-frequency components such as textures and edges. Previous channel attention mechanisms [10, 21, 22] average the features across the spatial domain with an average pooling layer, hence neglecting most high-frequency components that may be crucial for restoring textures and details. Since high-frequency signals are more difficult to recover in image reconstruction, more focus should be placed on learning high-frequency features. Considering the limited ability of the mean statistic to describe network features, we propose a more powerful statistical metric that describes the high-frequency information of each channel more accurately. It has been demonstrated that channels with abundant high-frequency information are critical for SR performance. Based on this idea, we propose to use a high-pass filter to obtain high-frequency features and take the similarity statistic between them and the original features as the channel-wise global descriptor. In this way, channels with more high-frequency content are allocated larger weights.

As shown in Fig. 3b, we refine the generation process of the channel scaling factors. Since most similarity metrics, such as the mean squared error (MSE), L1 norm and root mean square (RMS), only capture pixel-level differences, they often perform poorly when comparing details and structural similarities. To better represent the differences between high-frequency features, we follow the SSIM [45] metric and extract three key features: luminance, contrast and structure, represented by three statistics: mean, standard deviation and covariance. Each statistic is multiplied by a learnable scaling factor, and the three terms are combined into the adopted similarity metric. The process of generating these three statistics and the similarity metric is formulated as follows.

$$\begin{aligned} MeanSim({x_{res}},{x_H})= & {} \alpha {{2{\mu _{res}}{\mu _H}} \over {\mu _{res}^2 + \mu _H^2}} \end{aligned}$$
(3)
$$\begin{aligned} StdSim({x_{res}},{x_H})= & {} \beta {{2{\sigma _{res}}{\sigma _H}} \over {\sigma _{res}^2 + \sigma _H^2}} \end{aligned}$$
(4)
$$\begin{aligned} CovSim({x_{res}},{x_H})= & {} \chi {{Cov({x_{res}},{x_H})} \over {{\sigma _{res}}{\sigma _H}}} \end{aligned}$$
(5)
$$\begin{aligned} Sim({x_{res}},{x_H})= & {} MeanSim({x_{res}},{x_H})\nonumber \\{} & {} * StdSim({x_{res}},{x_H}) \nonumber \\{} & {} * CovSim({x_{res}},{x_H}) \end{aligned}$$
(6)

where \({x_H}\) is the high-frequency feature maps passed from a Laplacian high-pass filter. \(\mu \) and \(\sigma \) denote mean and standard deviation, respectively. \({Cov({x_{res}},{x_H})}\) represents the covariance between original feature maps and high-frequency maps. \(\alpha \), \(\beta \) and \(\chi \) represent three learnable scaling factors.
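The following PyTorch sketch shows how these per-channel statistics could be computed; the small constant eps guarding against division by zero is our addition, as Eqs. (3)-(5) leave it implicit:

```python
import torch

def channel_similarity(x_res: torch.Tensor, x_h: torch.Tensor,
                       alpha, beta, chi, eps: float = 1e-8) -> torch.Tensor:
    """Per-channel similarity descriptor between features and their
    high-frequency maps, Eqs. (3)-(6). Input shape (N, C, H, W);
    output shape (N, C, 1, 1)."""
    mu_r = x_res.mean(dim=(2, 3), keepdim=True)
    mu_h = x_h.mean(dim=(2, 3), keepdim=True)
    sd_r = x_res.std(dim=(2, 3), keepdim=True)
    sd_h = x_h.std(dim=(2, 3), keepdim=True)
    cov = ((x_res - mu_r) * (x_h - mu_h)).mean(dim=(2, 3), keepdim=True)

    mean_sim = alpha * 2 * mu_r * mu_h / (mu_r ** 2 + mu_h ** 2 + eps)  # Eq. (3)
    std_sim = beta * 2 * sd_r * sd_h / (sd_r ** 2 + sd_h ** 2 + eps)    # Eq. (4)
    cov_sim = chi * cov / (sd_r * sd_h + eps)                           # Eq. (5)
    return mean_sim * std_sim * cov_sim                                 # Eq. (6)
```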

We take \(Sim({x_{res}},{x_H})\) as the channel attention descriptor. To fully capture channel-wise dependencies from the similarity descriptor, as in most channel attention mechanisms, we apply two convolutional layers and a sigmoid function to generate the scaling factors, which form the output of HFCA (\({x_{hfa}}\)):

$$\begin{aligned} {x_{hfa}} = f\left( {{W_{1 \times 1}^2} * \phi \left( {{W_{1 \times 1}^1} * Sim({x_{res}},{x_H})} \right) } \right) \end{aligned}$$
(7)

where \(f\left( \cdot \right) \) and \(\phi \left( \cdot \right) \) denote the sigmoid gating and ReLU functions, respectively. \({W_{1 \times 1}^1}\) and \({W_{1 \times 1}^2}\) are the weights of two \(1 \times 1\) convolutional layers. By adopting HFCA, the network can adaptively focus on channels that contain abundant high-frequency features.
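Putting Eqs. (3)-(7) together, a sketch of the HFCA module is given below. It reuses the channel_similarity helper above; the bottleneck ratio r of the two \(1 \times 1\) convolutions is our assumption, since the text does not specify the intermediate width:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HFCA(nn.Module):
    """High-frequency channel attention (Eq. 7). The Laplacian kernel is
    applied depthwise; the bottleneck ratio r is an assumption."""
    def __init__(self, n_feats: int = 40, r: int = 4):
        super().__init__()
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        self.register_buffer("lap", lap.view(1, 1, 3, 3).repeat(n_feats, 1, 1, 1))
        self.alpha = nn.Parameter(torch.ones(1))  # learnable scaling factors
        self.beta = nn.Parameter(torch.ones(1))
        self.chi = nn.Parameter(torch.ones(1))
        self.conv1 = nn.Conv2d(n_feats, n_feats // r, kernel_size=1)
        self.conv2 = nn.Conv2d(n_feats // r, n_feats, kernel_size=1)

    def forward(self, x_res: torch.Tensor) -> torch.Tensor:
        # HPF: depthwise Laplacian filtering of the residual features
        x_h = F.conv2d(x_res, self.lap, padding=1, groups=x_res.size(1))
        sim = channel_similarity(x_res, x_h, self.alpha, self.beta, self.chi)
        return torch.sigmoid(self.conv2(F.relu(self.conv1(sim))))  # x_hfa
```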

Fig. 4

Illustration of the proposed frequency contrastive learning (FCL) module. High-frequency features are used as positive samples and low-frequency features as negative samples

3.2 Frequency contrastive learning

CNNs are biased towards learning from low-frequency signals, making high-frequency components hard to propagate through the network [24]. To address this problem, high-frequency areas of the feature maps should be assigned larger weights. Previous spatial attention mechanisms [11, 46] generate attention maps by average or max pooling, which cannot make the network focus on high-frequency areas. We instead introduce contrastive learning to train a spatial mask that adaptively assigns a scaling factor to every pixel. The goal of adopting contrastive learning is to increase the gap between high-frequency and low-frequency features in the feature representation. An overview of the proposed frequency contrastive learning (FCL) module is shown in Fig. 4. The output of FCL, \({x_{cl}}\), is defined as

$$\begin{aligned} {x_{cl}} = {M_{cl}} \cdot {x_{res}} \end{aligned}$$
(8)

where \({M_{cl}}\) represents the spatial mask.
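Equation (8) applies the mask elementwise. The text does not detail how \({M_{cl}}\) is parameterized (it is learned through the contrastive loss below), so the sketch that follows is only one plausible minimal realization, assuming a single \(1 \times 1\) convolution with a sigmoid that produces one scaling factor per pixel:

```python
import torch
import torch.nn as nn

class FCLMask(nn.Module):
    """Spatial mask of Eq. (8). How M_cl is parameterized is not detailed
    in the text; a 1x1 convolution plus sigmoid is purely our assumption."""
    def __init__(self, n_feats: int = 40):
        super().__init__()
        self.conv = nn.Conv2d(n_feats, 1, kernel_size=1)  # hypothetical

    def forward(self, x_res: torch.Tensor) -> torch.Tensor:
        m_cl = torch.sigmoid(self.conv(x_res))  # one scaling factor per pixel
        return m_cl * x_res                     # x_cl = M_cl . x_res
```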

In each FCL module, we take the high-frequency features as positive samples and the low-frequency features as negative samples. The high-frequency and low-frequency features are generated by a high-pass filter and a low-pass filter, respectively; here, we use the Laplacian high-pass filter and the Gaussian low-pass filter. The Laplacian operator is a second-order linear differential operator and is isotropic in all directions. Therefore, using the Laplacian operator as the high-pass filter does not require separate detection along the X and Y directions, making it readily adoptable for extracting meaningful high-frequency features. In addition, the convolution kernel of the Gaussian low-pass filter assigns different weights to different regions according to the Gaussian distribution, which retains the overall distribution of the feature maps. We compared various other filters (i.e. the mean low-pass filter, Sobel high-pass filter and Scharr high-pass filter), with results shown in Sect. 4.4. The outputs of the filters are:

$$\begin{aligned} {x_H}= & {} Laplacian({x_{res}}) \end{aligned}$$
(9)
$$\begin{aligned} {x_L}= & {} Gaussian({x_{res}}) \end{aligned}$$
(10)
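Both filters can be implemented as fixed depthwise convolutions. The sketch below uses the standard \(3 \times 3\) Laplacian kernel and a \(3 \times 3\) binomial Gaussian kernel; the Gaussian kernel size and weights are our assumptions, since the text does not specify them:

```python
import torch
import torch.nn.functional as F

def laplacian_hp(x: torch.Tensor) -> torch.Tensor:
    """Depthwise Laplacian high-pass filter, Eq. (9)."""
    k = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                     device=x.device)
    k = k.view(1, 1, 3, 3).repeat(x.size(1), 1, 1, 1)  # one kernel per channel
    return F.conv2d(x, k, padding=1, groups=x.size(1))

def gaussian_lp(x: torch.Tensor) -> torch.Tensor:
    """Depthwise Gaussian low-pass filter, Eq. (10); 3x3 binomial
    approximation, an assumption since the kernel is unspecified."""
    k = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]],
                     device=x.device) / 16.0
    k = k.view(1, 1, 3, 3).repeat(x.size(1), 1, 1, 1)
    return F.conv2d(x, k, padding=1, groups=x.size(1))
```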

Given positive and negative samples, we can construct a frequency contrastive learning loss \({L_{cl}}\) for training the spatial mask as:

$$\begin{aligned} {L_{cl}} = -{1 \over N}\sum \limits _{i = 1}^N { \log {{{1 \over C}\sum \nolimits _{j = 1}^C {{{\exp ( Sim({x_{cl}},{x_H}) )} / \tau }} } \over {{1 \over C}\sum \nolimits _{j = 1}^C {{{\exp ( Sim({x_{cl}},{x_L}) )} / \tau }} + \zeta }}} \end{aligned}$$
(11)

where N indicates the input size, C is the number of channels, \(\zeta \) is a small value to keep the training stable, and \(\tau \) is the temperature factor to control the penalty for negative samples. Here, we set the value of \(\zeta \) to \(1{e^{ - 6}}\) and \(\tau \) to 1. \(Sim({x_{cl}},{x_H})\) represents similarity (e.g. correlation coefficient) between the output of FCL (\(x_{cl}\)) and the output of high-pass filter (\(x_H\)), while \(Sim({x_{cl}},{x_L})\) represents similarity between the output of FCL (\(x_{cl}\)) and the output of low-pass filter (\(x_L\)).
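Under these definitions, Eq. (11) can be sketched as follows, using the per-channel correlation coefficient as \(Sim(\cdot ,\cdot )\) (this choice is motivated below):

```python
import torch

def corr_coef(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Per-sample, per-channel Pearson correlation over spatial positions.
    Input shape (N, C, H, W), output shape (N, C)."""
    a = a.flatten(2) - a.flatten(2).mean(dim=2, keepdim=True)
    b = b.flatten(2) - b.flatten(2).mean(dim=2, keepdim=True)
    return (a * b).mean(dim=2) / (a.std(dim=2, unbiased=False)
                                  * b.std(dim=2, unbiased=False) + eps)

def fcl_loss(x_cl, x_h, x_l, tau: float = 1.0, zeta: float = 1e-6):
    """Frequency contrastive loss of Eq. (11)."""
    pos = (torch.exp(corr_coef(x_cl, x_h)) / tau).mean(dim=1)  # 1/C sum over j
    neg = (torch.exp(corr_coef(x_cl, x_l)) / tau).mean(dim=1)
    return -torch.log(pos / (neg + zeta)).mean()               # 1/N sum over i
```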

There are several metrics to measure the similarity between the anchor and the samples, such as the L1 norm, L2 norm or cosine similarity. However, to better compare the structural similarity between high-frequency components and feature maps, we take the correlation coefficient as the similarity metric. An ablation analysis comparing different similarity metrics is given in Sect. 4.4. We denote \({I^{HR}}\) and \({I^{SR}}\) as the high-resolution image and the recovered super-resolution image, respectively. Finally, the overall loss function L of the HFCCN model is designed as:

$$\begin{aligned} {L_{rec}}= & {} {\left\| {{I^{HR}} - {I^{SR}}} \right\| _1} \end{aligned}$$
(12)
$$\begin{aligned} L= & {} {L_{rec}} + {\lambda _{cl}}{L_{cl}} \end{aligned}$$
(13)

where \({L_{rec}}\) is the L1 loss used in most SR models, and \({\lambda _{cl}}\) is a trade-off factor, which we set to \(1{e^{ - 4}}\). By minimizing the contrastive learning loss constraint, the spatial mask is trained to assign larger weights to high-frequency areas, hence improving the network's ability to learn the more difficult high-frequency details.
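The combined objective of Eqs. (12)-(13) then reduces to a few lines; fcl_loss is the helper sketched above, and the features of a single FCL module are used here for illustration:

```python
import torch.nn.functional as F

lambda_cl = 1e-4  # trade-off factor from the text

def total_loss(sr, hr, x_cl, x_h, x_l):
    """Overall objective: L1 reconstruction (Eq. 12) plus the weighted
    frequency contrastive term (Eq. 13)."""
    l_rec = F.l1_loss(sr, hr)                             # Eq. (12)
    return l_rec + lambda_cl * fcl_loss(x_cl, x_h, x_l)   # Eq. (13)
```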

Fig. 5

Overall pipeline of the proposed high-frequency channel contrastive network (HFCCN). One HFCCB is embedded after every two residual blocks. The total number of residual blocks and HFCCBs is 24

Fig. 6

Correlation coefficients between the feature maps of an intermediate layer and the corresponding high-pass filtering results as training progresses. The results were obtained on the 12th residual block of the EDSR-baseline and HFCCN models

3.3 High-frequency channel contrastive attention network

To demonstrate the effectiveness of the HFCA and FCL modules, we integrate them into an EDSR baseline model with 24 residual blocks to form a lightweight HFCCN. The overall structure of the proposed high-frequency channel contrastive network (HFCCN) is shown in Fig. 5. The HFCA and FCL modules are incorporated into a standard residual block to form the HFCCB. One residual block is replaced by an HFCCB after every two residual blocks, so that our model parameters remain comparable with other lightweight SR models. Finally, HFCCN consists of 16 residual blocks and 8 HFCCBs, for a total of 24 blocks.
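Assembling the sketches above, the trunk layout can be written as follows; ResidualBody, HFCA and FCLMask refer to the illustrative modules from Sects. 3.1 and 3.2, and the wrapper classes here are likewise sketches rather than our exact implementation:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard EDSR-style residual block: x + body(x)."""
    def __init__(self, n_feats: int = 40):
        super().__init__()
        self.body = ResidualBody(n_feats)

    def forward(self, x):
        return x + self.body(x)

class HFCCB(nn.Module):
    """High-frequency channel contrastive block, assembled per Eq. (2)."""
    def __init__(self, n_feats: int = 40):
        super().__init__()
        self.body = ResidualBody(n_feats)
        self.hfca = HFCA(n_feats)
        self.fcl = FCLMask(n_feats)

    def forward(self, x_in):
        x_res = self.body(x_in)
        x_cl = self.fcl(x_res)  # also fed to the FCL loss during training
        return self.hfca(x_res) * x_cl + x_in  # Eq. (2)

def build_trunk(n_feats: int = 40) -> nn.Sequential:
    """Every third block is an HFCCB: 16 residual blocks + 8 HFCCBs = 24."""
    blocks = [HFCCB(n_feats) if (i + 1) % 3 == 0 else ResidualBlock(n_feats)
              for i in range(24)]
    return nn.Sequential(*blocks)
```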

4 Experiments

4.1 Datasets and metrics

The HFCCN was trained on the DIV2K [47] dataset, which contains 800 HR images. The LR images were obtained by bicubic downsampling of the HR images. For evaluation, we compared performance on four widely used benchmark datasets: Set5 [48], Set14 [49], BSD100 [50] and Urban100 [51], which contain 5, 14, 100 and 100 images, respectively. Set5, Set14 and BSD100 contain natural scenes, while Urban100 contains urban scene images with more edges and higher resolutions. The SR results were evaluated with the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [45], popular reconstruction quality measures for super-resolution. PSNR is defined via the maximum pixel value L and the mean squared error between images. Given the high-resolution (ground-truth) image \({I^{HR}}\) with N pixels and the super-resolved image \({I^{SR}}\), it is defined as:

$$\begin{aligned} PSNR = 10 \cdot {\log _{10}}\left( {{{{L^2}} \over {{1 \over N}\sum \nolimits _{i = 1}^N {{{\left( {{I^{SR}}\left( i \right) - {I^{HR}}\left( i \right) } \right) }^2}} }}} \right) \end{aligned}$$
(14)

where L equals 255 for 8-bit images. As human visual perception is highly adapted for extracting structural information from a scene, the SSIM metric compares luminance, contrast and structure between images. We denote \({\mu _I}\) and \({\sigma _I}\) as the mean and standard deviation of the image intensity, respectively. The luminance comparison \({C_l}\left( {{I^{SR}},{I^{HR}}} \right) \) and contrast comparison \({C_c}\left( {{I^{SR}},{I^{HR}}} \right) \) are defined as follows:

$$\begin{aligned} {C_l}\left( {{I^{SR}},{I^{HR}}} \right)= & {} {{2{\mu _{{I^{SR}}}}{\mu _{{I^{HR}}}} + {C_1}} \over {\mu _{{I^{SR}}}^2 + \mu _{{I^{HR}}}^2 + {C_1}}} \end{aligned}$$
(15)
$$\begin{aligned} {C_c}\left( {{I^{SR}},{I^{HR}}} \right)= & {} {{2{\sigma _{{I^{SR}}}}{\sigma _{{I^{HR}}}} + {C_2}} \over {\sigma _{{I^{SR}}}^2 + \sigma _{{I^{HR}}}^2 + {C_2}}} \end{aligned}$$
(16)

where \({C_1} = {({k_1}L)^2}\) and \({C_2} = {({k_2}L)^2}\) are constants to ensure stability when the denominator is close to zero (\({k_1} \ll 1\) and \({k_2} \ll 1\)). The structure comparison \({C_s}\left( {{I^{SR}},{I^{HR}}} \right) \) is defined as:

$$\begin{aligned} {C_s}\left( {{I^{SR}},{I^{HR}}} \right) = {{{\sigma _{{I^{SR}}{I^{HR}}}} + {C_3}} \over {{\sigma _{{I^{SR}}}}{\sigma _{{I^{HR}}}} + {C_3}}} \end{aligned}$$
(17)

where \({\sigma _{{I^{SR}}{I^{HR}}}}\) is the covariance between \({I^{HR}}\) and \({I^{SR}}\), and \({C_3} = {({k_3}L)^2}\). Finally, the SSIM is given by:

$$\begin{aligned} SSIM\left( {{I^{SR}},{I^{HR}}} \right)= & {} {\left[ {{C_l}({I^{SR}},{I^{HR}})} \right] ^\alpha }{\left[ {{C_c}({I^{SR}},{I^{HR}})} \right] ^\beta }\nonumber \\{} & {} {\left[ {{C_s}({I^{SR}},{I^{HR}})} \right] ^\gamma } \end{aligned}$$
(18)

where \(\alpha \), \(\beta \), \(\gamma \) denote the relative importance of each component. Both the PSNR and SSIM [45] metrics are measured on the Y channel in the YCbCr space.
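As an illustration, Eq. (14) evaluated on the Y channel can be sketched as follows; the BT.601 luma weights used for the RGB-to-Y conversion are a common convention rather than something specified in the text:

```python
import torch

def psnr_y(sr: torch.Tensor, hr: torch.Tensor, L: float = 255.0) -> torch.Tensor:
    """PSNR of Eq. (14) on the Y channel of YCbCr. Inputs are RGB tensors
    of shape (3, H, W) with values in [0, 255]; BT.601 luma weights assumed."""
    w = torch.tensor([65.481, 128.553, 24.966], device=sr.device) / 255.0
    y_sr = (w.view(3, 1, 1) * sr).sum(dim=0)  # the +16 offset cancels in the MSE
    y_hr = (w.view(3, 1, 1) * hr).sum(dim=0)
    mse = torch.mean((y_sr - y_hr) ** 2)
    return 10.0 * torch.log10(L ** 2 / mse)
```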

Table 1 Comparison of EDSR baseline, channel attention (CA), spatial attention (SA), the proposed frequency contrastive learning (FCL), and the proposed high-frequency channel attention (HFCA)
Table 2 Comparisons of network performance between mean low-pass filter, Gaussian low-pass filter, Sobel high-pass filter, Scharr high-pass filter, and Laplacian high-pass filter

4.2 Implementation and training details

For a lightweight network, we set the number of residual blocks to 16 and the number of HFCCBs to 8, giving a total of 24 convolutional blocks. The frequency contrastive learning loss was added to the L1 loss after 50 epochs. We set the convolutional kernel size to \(3 \times 3\) and the number of intermediate feature channels to 40. During training, we employed data augmentation by randomly rotating by 90, 180 or 270 degrees and flipping horizontally. The mini-batch size was set to 32 and each patch size to \(64 \times 64\). The training optimizer was ADAM [52] with \({\beta _1} = 0.9\), \({\beta _2} = 0.999\) and \(\varepsilon = {10^{ - 8}}\). The learning rate was initialized to \(2 \times {10^{ - 4}}\) and then halved every 200 epochs. We implemented the HFCCN model in the PyTorch [53] framework. Training one epoch (1000 batches) on DIV2K at scale \(\times \)4 takes 177.3 s on average, and the model is typically trained for up to 1000 epochs. The inference time of our method is 0.1088 s on the Urban100 dataset for scale factor \(\times \)4 SR. All experiments were conducted and tested on a single RTX 3090 GPU (Tables 1-4 and Figs. 6, 7).
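The setup above can be summarized in the following sketch; model, train_loader and collect_fcl_loss are hypothetical names standing in for our network, the DIV2K patch loader and a helper that gathers Eq. (11) over all FCL modules:

```python
import torch
import torch.nn.functional as F

# model, train_loader and collect_fcl_loss are assumed to exist elsewhere.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4,
                             betas=(0.9, 0.999), eps=1e-8)
# Halve the learning rate every 200 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
lambda_cl = 1e-4

for epoch in range(1000):
    for lr_patch, hr_patch in train_loader:  # 64x64 patches, mini-batch 32
        sr = model(lr_patch)
        loss = F.l1_loss(sr, hr_patch)
        if epoch >= 50:  # the FCL loss is added to the L1 loss after 50 epochs
            loss = loss + lambda_cl * collect_fcl_loss(model)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```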

4.3 Results

Comparisons with state-of-the-art methods: HFCCN was compared with 12 state-of-the-art lightweight models, including SRCNN [6], FSRCNN [54], VDSR [27], DRCN [28], LapSRN [55], DRRN [29], MemNet [56], CARN [57], IMDN [58], ACNet [59], RLFN [43] and DeFiAN [60].

Quantitative evaluation results in terms of average PSNR and SSIM on the four benchmark datasets are shown in Table 4. The size of our model is 1.08M, 1.13M and 1.19M parameters for \(\times \)2, \(\times \)4 and \(\times \)8 SR, respectively. Under similar model sizes, the proposed HFCCN outperformed other models by a marked margin on almost all benchmarks and at all up-sampling scales. Adding the HFCA and FCL modules to an EDSR-baseline model showed a significant improvement, also outperforming other lightweight SR models. In particular, the proposed HFCCN brought a further improvement of 0.04 dB on Urban100 at scale \(\times \)4 compared with the second-best SR model. The greater performance gain of HFCCN on larger datasets demonstrates its enhanced generalization ability. In addition to the PSNR metric, the proposed model achieved higher SSIM values on all datasets, demonstrating improved perceptual quality; higher SSIM shows that the proposed HFCCN can better recover details and textures. This improved performance is attributable to the HFCA mechanism and the frequency contrastive learning strategy, which make the network learn more high-frequency features. The table also shows comparisons with models of a similar number of parameters. In particular, even though HFCCN has about 300-600K fewer parameters than ACNet [59] and CARN [57], it achieved better results than these models on all datasets. Compared to RLFN [43], which has a slightly smaller number of parameters, the proposed model achieved marked improvements on most datasets, especially in more difficult scenarios such as Urban100, with more noticeable improvements at scale \(\times \)2. As high-frequency details are heavily attenuated at large scales, we further conducted experiments at a larger scale (\(\times \)8 SR), with results shown in Table 4. At this scale, HFCCN surpassed MemNet [56] by a marked margin of 0.48 dB on the Urban100 benchmark, which further proves the effectiveness of the proposed method and its ability to reconstruct high-frequency details.

Examples of qualitative results on B100 and Urban100 are shown in Fig. 7. HFCCN was more effective for detail restoration at scale factor \(\times \)4. Most models could not recover much of the high-frequency fine detail and texture and suffered from blurring and aliasing artefacts. In contrast, HFCCN yielded favourable recovered images. These results indicate the effectiveness of the proposed HFCA module and FCL loss in HFCCN.

Similar to DeFiAN [60], our method also processes high-frequency and low-frequency information in a divide-and-conquer manner, which can effectively reconstruct high-frequency details while preserving low-frequency smoothness. The difference is that DeFiAN uses interpretable multi-scale Hessian filtering to learn an adaptive operator that processes high-frequency and low-frequency information separately, while our method applies contrastive learning and attention mechanisms to separate high-frequency and low-frequency components. The experimental results, shown in Table 4, demonstrate marked improvements of the proposed method over DeFiAN on all datasets and at all up-sampling scales under similar model parameters.

Effectiveness analysis: HFCA and FCL were devised to assign larger scaling factors to areas or channels with more high-frequency detail. To investigate this effect, we conducted an experiment to understand the behaviour of HFCA and FCL in the SR model. We constructed an EDSR baseline model and an HFCCN model, each with 24 residual blocks, and calculated the correlation coefficient between the feature maps and the high-pass filtering results at the 12th block during training. The results are shown in Fig. 6. For the baseline model, the correlation coefficients between the intermediate layer and the high-frequency features varied little with increasing epochs, implying that high-frequency signals had difficulty passing through the network. With HFCA and FCL, the correlation coefficients gradually increased. This verifies that HFCCN learns high-frequency signals better than the original EDSR, making it easier for high-frequency signals to propagate through the network.

Table 3 Comparison of network performance between L1, L2, Cosine and correlation coefficient as the similarity metric in FCL loss
Table 4 Quantitative results on four benchmark datasets
Fig. 7

Visual comparison for \(\times 4\) SR on Urban100 and B100 dataset

4.4 Ablation studies

To examine the effectiveness of the proposed attention mechanism, we first used the EDSR-baseline to ablate the roles of the different attention modules. We conducted experiments replacing HFCA or FCL with other attention modules; specifically, we selected channel attention (CA) and spatial attention (SA) with the same configurations as in CBAM [61]. The results are shown in Table 1. In this experiment, CA and SA improved the PSNR by 0.022 dB and 0.013 dB, respectively, whereas the proposed HFCA and FCL gained 0.079 dB and 0.029 dB over the baseline. These results demonstrate that the proposed HFCA and FCL are more effective than the original CA and SA. Moreover, combining HFCA and FCL elevates the performance by an even larger margin, validating their complementary advantages.

To demonstrate the effectiveness of taking the correlation coefficient as the similarity metric in the FCL loss, we compared four similarity metrics: L1 norm, L2 norm, cosine similarity and correlation coefficient, all of which are widely used in computer vision and natural language processing. The experimental results indicate that cosine similarity had the worst performance: because it distinguishes differences in direction rather than in absolute values, it can lead to an obvious performance drop in feature map comparison. L1 and L2 are commonly used loss functions in SR for determining similarities between reconstructed and high-resolution images; however, neither can effectively compare the structural similarity between images. As shown in Table 3, using the correlation coefficient performed slightly better than the other three metrics. As human visual perception is highly adapted for extracting structural information, taking the correlation coefficient as the similarity metric can reflect such information between high-frequency features and hence is arguably more suitable.

We further conducted a detailed ablation study on the high-pass and low-pass filters used in the HFCA and FCL modules. The results of HFCCN using different filters are reported in Table 2. Compared with the combination of the mean low-pass filter and the Laplacian high-pass filter, the combination of the Gaussian low-pass filter and the Laplacian high-pass filter gained a 0.067 dB improvement, demonstrating its effective contribution to the network. The mean low-pass filter assigns the same weight to all regions of the feature maps, while the Gaussian low-pass filter better retains their overall distribution. We also conducted experiments on three different high-pass filters: Scharr, Sobel and Laplacian. Both the Scharr and Sobel high-pass filters require separate operations along the X and Y directions, while the Laplacian operator is isotropic in all directions and can extract meaningful high-frequency information. The experimental results indicate that the combination of the Gaussian low-pass filter and the Laplacian high-pass filter performs best.

5 Conclusion

In this paper, a lightweight high-frequency channel contrastive network (HFCCN) is proposed for single image super-resolution. The HFCCN consists of a high-frequency channel attention module and a frequency contrastive learning module, making the network focus more on learning high-frequency signals from both channel and spatial perspectives. The HFCA module re-weighs channels with respect to their high-frequency abundance. The frequency contrastive learning module introduces a contrastive learning loss to train a spatial mask that assigns larger scaling factors to higher-frequency areas. Extensive experiments show that the proposed HFCCN obtains improved or highly competitive performance compared to state-of-the-art SR models in a lightweight setting. To further enhance the ability to extract high-frequency features, we will evaluate the effect of more adaptive filters and improve the contrastive learning loss function to achieve more robust performance. In addition, we will introduce the proposed mechanisms to other low-level CV tasks such as image deblurring and denoising. In particular, noise and natural images have distinct characteristics in the frequency domain, and the proposed mechanism of using contrastive learning to separate high-frequency and low-frequency information may offer promising potential for these low-level CV tasks and other applications.