1 Introduction

Image restoration (IR) is a crucial task that involves enhancing the quality of low-quality (LQ) images to obtain high-quality (HQ) images. This task encompasses various sub-problems, including JPEG compression artifact reduction (CAR), image denoising, and super-resolution (SR). The IR problem has gained significant attention due to its diverse applications in computer vision, such as image recognition and object detection. In the past decades, numerous deep learning-based IR methods have been developed to establish the mapping between LQ and HQ images.

Recent works have shown tremendous improvements in the field of IR, and they can be broadly divided into two main categories. The first category is convolutional neural network (CNN)-based methods [1, 2], which treat an image as pixels in a matrix form and extract local features in a sliding window fashion. These methods are efficient and effective due to their unique properties such as localization, weight sharing, and scaling invariance. The second category is vision transformer (ViT)-based methods [3, 4], which conceptualize an image as a series of patches and fuse the information from different patches adaptively via global-range self-attention. Such methods focus on global features and abandon the inductive bias inherent in CNNs, resulting in improved generalization performance [5]. Moreover, there are also attempts to combine these two types of approaches, such as HNCT [6] and TECDNet [7]. While CNNs leverage the locality prior and ViTs excel in generalization [8], each also has its own limitations: CNNs sacrifice the global receptive field, while ViTs require significant memory and GPU resources. As a result, neither exhibits a decisive performance advantage over the other. Therefore, there is an urgent need for a new feature extraction paradigm in IR that differs from both CNNs and ViTs.

Lately, clustering-based methods have demonstrated their powerful capabilities in computer vision tasks, such as image classification [9] and instance segmentation [10, 11]. These methods [12] conceptualize an image as a collection of unorganized points and group these points into clusters. Within each cluster, point features are aggregated into a single center, which is then adaptively dispatched to all the points. Compared to CNNs and ViTs, clustering-based methods have two notable advantages. Firstly, they exhibit strong generalization abilities across different data domains, as images are treated as a collection of data points. Secondly, the clustering process provides clear interpretability for feature extraction. In essence, clustering models break new ground for the computer vision community. However, there are still three open issues that need to be addressed. Firstly, clustering-based methods have not been extensively explored for low-level vision tasks, such as JPEG CAR, image denoising, and image SR. Secondly, the simple position embedding approach used in clustering-based models may limit their performance on IR tasks involving inputs of multiple resolutions. Lastly, the potential style (e.g., texture) of middle layers is not fully exploited, which hampers their feature representation capabilities.

In this paper, we propose two novel frameworks called style-guided context cluster U-Net (SCoC-UNet) and style-guided clustered point interaction U-Net (SCPI-UNet) for IR based on the concept of clustering. In contrast to CNNs and ViTs, our methods treat an image as an unorganized set of points and extract features via a clustering-based method. Both SCoC-UNet and SCPI-UNet use a symmetric Encoder–Decoder architecture for hierarchical feature representation. They incorporate position information into each point using a continuous relative position embedding (CRPE) method. To enhance computational efficiency, we employ symmetric point reducers and point increasers in the Encoder and Decoder to gradually reduce and recover the number of points. For SCoC-UNet, we introduce a style-guided context cluster (SCoC) block as the core component. This block utilizes a simplified clustering algorithm in the trunk branch for robust feature extraction and dynamically recalibrates feature weights through the mask branch. For SCPI-UNet, we present a style-guided clustered point interaction (SCPI) block as a fundamental component. This block facilitates effective information interaction across clusters that would otherwise be isolated. Compared to SCoC-UNet, SCPI-UNet exhibits superior feature extraction and more efficient information processing capabilities. In summary, our main contributions are as follows:

  1.

    We propose a novel clustering-based backbone called SCoC-UNet for general IR tasks. To the best of our knowledge, this work represents the first attempt to apply a clustering method to low-level vision tasks. Moreover, we introduce a trainable CRPE method to adapt to input images of varying resolutions and efficiently migrate IR models trained on low-resolution images to high-resolution ones.

  2.

    To efficiently extract context features, we propose a SCoC block, which utilizes an enhanced style-based recalibration module to guide the clustering process for feature extraction. By incorporating style information, our framework achieves more accurate and meaningful clustering results. Building on the SCoC block, we devise a U-shaped architecture, SCoC-UNet, which consists of a symmetric Encoder–Decoder structure.

  3.

    We introduce a SCPI block that can be seamlessly integrated into the U-shaped architecture as a more effective alternative to the SCoC block for establishing connections of feature points across different clusters. The SCPI-UNet, built on SCPI blocks, facilitates clustering among feature points and enables information interaction between diverse clusters.

  4.

    Extensive experimental results demonstrate that our SCoC-UNet and SCPI-UNet achieve superior performance over state-of-the-art methods on several typical IR tasks. Notably, our methods outperform large-scale CNN- and ViT-based models on the JPEG CAR task while utilizing fewer parameters, which highlights the effectiveness of the proposed clustering-based method.

2 Related works

2.1 Image restoration

Early IR methods [13,14,15] typically employed model-based approaches that utilized handcrafted priors to obtain HQ images. However, these methods are gradually being replaced by deep learning-based approaches, which have shown significant advancements in recent years, particularly with the emergence of CNN- and ViT-based methods.

The pioneering works of JPEG CAR, image denoising, and image SR were ARCNN [16], DnCNN [17], and SRCNN [18], respectively. They established deep CNNs to learn the mapping between LQ and HQ images and achieved better performance than traditional methods. Subsequently, a flurry of CNN-based methods has been proposed to further improve IR performance. These methods leverage larger and deeper network structures [1, 19,20,21] along with various learning techniques, including residual connections [22,23,24], dense connections [25], batch normalization (BN) [26], and others [27, 28]. Additionally, some methods aim to simplify network architectures through pruning or distillation strategies, such as CARN [29], IDN [30], IMDN [31], and RFDN [32], effectively enhancing training efficiency. However, these CNN-based approaches often struggle to capture long-range dependencies between pixels.

To tackle this issue, several ViT-based methods have emerged in the field of IR. Notably, IPT [33] introduced a generalized framework for various IR tasks based on the standard Transformer [34]. While this method achieved promising IR performance, it required a large number of parameters and extensive training datasets. In response, SwinIR [4] and Swin2SR [35] employed the advanced Swin Transformer [36] in IR, which reduces model complexity by utilizing shifted window-based self-attention. Furthermore, researchers have explored the combination of attention mechanisms and CNNs, leading to channel attention [3, 37], non-local attention [38], and adaptive patch aggregation [39]. For instance, RNAN [38] proposed local and non-local attention blocks to efficiently capture long-range dependencies between pixels, overcoming the limitation of local convolution operations that treat feature pixels equally. Nevertheless, these methods typically rely on a large number of parameters (e.g., over 115 M in IPT and 12 M in SwinIR) and huge computational resources, placing high demands on hardware.

It is important to note that while CNN- and ViT-based methods have achieved notable results, there is not a significant performance gap between them due to their respective limitations. Therefore, there is a strong anticipation for a new feature extraction paradigm that can provide fresh perspectives and advancements in the field of IR.

2.2 Clustering in image processing

Early clustering-based works such as SuperPixel [40] and SLIC [41] were initially employed for image segmentation by grouping pixels with similar features. However, these methods are gradually being replaced by deep learning approaches with superior feature representation capabilities. As a result, some works have attempted to integrate clustering into deep learning methods. For instance, SSN [42] trained a CNN to extract pixel features and utilized an iterative K-means clustering module for superpixel segmentation; this differentiable approach enables end-to-end training and reduces runtime. In recent years, clustering methods have been applied to specific vision tasks [9, 43]. GroupViT [11] introduced clustering into ViTs and employed nonparametric grouping to achieve better zero-shot semantic segmentation. Besides, kMaX-DeepLab [10] proposed a single-head K-means clustering to replace multi-head cross-attention by exploiting the relationship between clustering and attention. More recently, Xu et al. [12] presented a generalized context cluster framework for multiple visual representations, achieving competitive results in various high-level vision tasks such as 3D point cloud classification, object detection, and instance segmentation. Nevertheless, clustering methods have not yet been extensively explored for low-level vision tasks. To the best of our knowledge, we are the first to apply the clustering method to image restoration tasks, aiming to investigate its effectiveness in recovering HQ images.

3 Method

In this section, we present the proposed IR methods, SCoC-UNet and SCPI-UNet, focusing on their network architectures, core components, and an analysis of their advantages.

3.1 Style-guided context cluster U-Net (SCoC-UNet)

The overall architecture of SCoC-UNet is shown in Fig. 1. Inspired by [44], we adopt a U-shaped architecture for our network, which consists of four parts: position embedding, Encoder, Decoder, and reconstruction block. Given an input LQ image \(I_{LQ} \in \mathbb {R}^{h \times w \times 3}\), a log-spaced continuous relative position bias is first embedded to retain the position information. These embedded features are then fed into the Encoder. In the Encoder, two fundamental components, dubbed the style-guided context cluster (SCoC) block and the point reducer layer, are employed to acquire hierarchical feature representations. In correspondence with the Encoder, a symmetric SCoC-based Decoder with point increasers is designed to obtain high-frequency information and recover the resolution of feature maps. Furthermore, we use skip connections to compensate for the spatial information loss caused by the point reduction operation by fusing multi-level features from the Encoder and Decoder. Finally, a reconstruction block, consisting of a convolutional (ConV) layer and a pixel shuffle layer [45], is employed to obtain the residual map. This residual map is then added to the bilinear up-sampling of the input to generate the final SR image \(I_{SR} \in \mathbb {R}^{H\times W\times 3}\), where \(H>h\), \(W>w\). For tasks that do not involve up-sampling, such as JPEG CAR and image denoising, a single ConV layer is used for reconstruction.

Fig. 1
figure 1

Illustration of the proposed SCoC-UNet for image SR. For the JPEG CAR and image denoising tasks, pixel shuffle and bilinear up-sampling are removed since their input and output have the same resolution

3.1.1 Style-guided context cluster block

As depicted in Fig. 2, the proposed SCoC block primarily comprises two branches: the trunk branch and the mask branch. These two branches are combined using an attention mechanism [46], enabling the recalibration of channel feature mappings. This structural design allows the network to emphasize or ignore certain features, thereby obtaining more accurate feature representations. In this section, we provide a detailed description of these two branches.

Fig. 2
figure 2

Pipeline of SCoC block which includes a trunk and a mask branch. The trunk consists of two cascaded CoC modules as shown in (b). (c1) Structure of style-based recalibration module (SRM) in [47], (c2) mask branch used in our method

Trunk branch. The trunk branch is responsible for feature extraction, accomplished through two cascaded context cluster (CoC) modules [12] (ICLR’23). As illustrated in Fig. 2b, each CoC module consists of a cluster operation followed by a multi-layer perceptron (MLP) with a GELU activation, aiming to obtain discriminative features. Below, we elaborate on the pipeline of the cluster operation.

First, the cluster operation groups a given feature point set \(P\in \mathbb {R}^{n \times c}\) into several clusters based on feature similarity. To achieve this, p centers are uniformly distributed in the feature space, and their values are computed by averaging their nearest points. Then, the similarity between each point and every center is calculated, and each point is assigned to its most similar center, yielding p clusters. Each cluster may contain a different number of points.

Then, the feature points within each cluster undergo dynamic aggregation. Consider a cluster that contains m \((m{\le }n)\) points, with the similarity between each point and the cluster center denoted as \(s\in \mathbb {R}^{m}\). These feature points are linearly mapped to a value space, resulting in \(P_v \in \mathbb {R}^{m \times c'}\). In the value space, the cluster center proposal \(v_{p}\) is obtained by adaptive average pooling over \(P_v\). The aggregated feature f is computed as

$$\begin{aligned} f&= \frac{1}{b}\left( v_{p}+\sum \limits _{i = 1}^m {sig\left( \alpha s_i + \beta \right) } * v_{i}\right) , \\ b&= 1 + \sum \limits _{i = 1}^m {sig\left( \alpha s_i + \beta \right) }, \end{aligned}$$
(1)

where \(\alpha\) and \(\beta\) are learnable scalars, \(sig(\cdot )\) denotes the sigmoid function, b is a normalization factor, and \(v_i\,(i=1,\ldots ,m)\) is the i-th point in \(P_v\).

Next, these aggregated features are distributed to all points within a cluster to facilitate point communication and information sharing. To achieve this, each point is updated by

$$\begin{aligned}&\ {p'_i} = {p_i} + FC\left( {sig\left( {\alpha {s_i} + \beta } \right) * f} \right) ,&\end{aligned}$$
(2)

where \(FC(\cdot )\) denotes a fully connected operation to recover the original channel dimension (\(c' \rightarrow c\)).
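To make the aggregation and dispatch steps concrete, the following is a minimal PyTorch sketch of the cluster operation in Eqs. 1–2. It assumes a square grid of p center proposals obtained by adaptive average pooling and hard argmax assignment, consistent with the description above; all class and variable names are illustrative rather than the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusterOp(nn.Module):
    """Minimal sketch of the cluster operation (Eqs. 1-2)."""

    def __init__(self, c, c_v, p=4):
        super().__init__()
        self.p = p                            # number of proposed centers
        self.to_value = nn.Linear(c, c_v)     # map points to the value space
        self.fc = nn.Linear(c_v, c)           # recover channel dim (c' -> c)
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, points, h, w):          # points: (B, N, C), N = h*w
        b, n, c = points.shape
        grid = points.transpose(1, 2).reshape(b, c, h, w)
        side = int(self.p ** 0.5)             # assumes p is a perfect square
        centers = F.adaptive_avg_pool2d(grid, side).flatten(2).transpose(1, 2)

        # pairwise cosine similarity between centers and points: (B, p, N)
        sim = F.normalize(centers, dim=-1) @ F.normalize(points, dim=-1).transpose(1, 2)
        assign = sim.argmax(dim=1)                                  # hard assignment, (B, N)
        mask = F.one_hot(assign, self.p).transpose(1, 2).float()    # (B, p, N)
        gate = torch.sigmoid(self.alpha * sim + self.beta) * mask   # sig(alpha*s+beta), zero outside own cluster

        v = self.to_value(points)                                   # (B, N, c')
        v_p = (mask @ v) / mask.sum(-1, keepdim=True).clamp(min=1)  # center proposal in value space

        # Eq. 1: similarity-weighted aggregation with normalizer b = 1 + sum sig(.)
        f = (v_p + gate @ v) / (1.0 + gate.sum(-1, keepdim=True))   # (B, p, c')

        # Eq. 2: dispatch each cluster's aggregate back to its points,
        # gated by each point's own similarity to its center
        f_pt = torch.gather(f, 1, assign.unsqueeze(-1).expand(-1, -1, f.size(-1)))
        g_pt = gate.sum(dim=1).unsqueeze(-1)   # masked sum keeps only the own-cluster gate
        return points + self.fc(g_pt * f_pt)   # (B, N, C)
```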

Mask branch. To capture the potential style of LQ images, we propose a mask branch that adaptively recalibrates the intermediate feature maps using a modified version of the style-based recalibration module (SRM) [47]. The original SRM structure, depicted in Fig. 2c1, consists of two components: style pooling and style integration. Style pooling extracts style information via channel-wise average and standard deviation statistics. Style integration estimates recalibration weights using a channel-wise fully connected (CFC) layer, followed by a BN layer and a sigmoid activation.

However, we note that the original SRM suffers from certain limitations: (1) it is only applicable to pixel-based methods and cannot be used with unordered point sets; (2) its pooling and integration strategies are suboptimal and may discard important details, reducing model accuracy. To address these limitations, we propose a mask branch based on SRM with three modifications. First, we introduce local importance-based pooling (LIP) [48] in style pooling to enrich the style features. Second, the summation is replaced by a concatenation operation in the channel dimension. Third, we use point reducer and point increaser layers instead of the CFC layer to better scale the recalibration weights and capture point-wise dependencies.

The structure of mask branch is illustrated in Fig. 2c2. Formally, given an input point set \(X=[x_1,x_2,\ldots ,x_C] \in \mathbb {R}^{C\times N}\), where C represents the number of channels and N denotes the number of points in each channel, the style features \(T\in \mathbb {R}^{3C\times 1}\) are learned by

$$\begin{aligned}&\ \mu _c = F(x_c) = \frac{1}{N} \sum \limits _{n = 1}^N {x_c\left( n\right) } ,&\end{aligned}$$
(3)
$$\begin{aligned}&\ \sigma _c =\sqrt{\frac{1}{N} \sum \limits _{n = 1}^N {\left( x_c\left( n\right) -\mu _c\right) ^2}} ,&\end{aligned}$$
(4)
$$\begin{aligned}&\ \ell _c = \dfrac{F\left( x_c \cdot exp(\mathcal {G}(x_c))\right) }{F\left( exp(\mathcal {G}(x_c))\right) },&\end{aligned}$$
(5)
$$\begin{aligned}&\ t_c = \left[ \mu _c, \sigma _c, \ell _c \right] ,&\end{aligned}$$
(6)

where \(x_c\left( n\right)\) represents the n-th value of the c-th channel of the point set X. \(F(\cdot )\) denotes the average pooling function, \(\mathcal {G}(\cdot )\) represents a linear mapping function, and \(t_c\) is the c-th element of T. The channel-wise style weight \(W\in \mathbb {R}^{C\times 1}\) is calculated by

$$\begin{aligned}&\ W= sig \left( W_R \delta (W_I T)\right) ,&\end{aligned}$$
(7)

where \(sig (\cdot )\) and \(\delta (\cdot )\) represent a sigmoid layer and a BN layer, respectively. \(W_R\) and \(W_I\) denote the weight sets of linear embedding layers, which act as point reducer layer (with ratio 3r) and point increaser layer (with ratio r), respectively.
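A minimal sketch of this mask branch is given below, assuming X arrives as a (B, C, N) tensor. Since Eq. 5 is a ratio of averages, the LIP statistic reduces to a softmax-weighted sum over points; the per-channel affine form of \(\mathcal {G}\) and the reduction ratio r are our assumptions, not the authors' exact settings.

```python
import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    """Sketch of the mask branch (Eqs. 3-7): style pooling stacks the
    channel-wise mean, std, and LIP statistics into T (size 3C); style
    integration maps T back to C sigmoid recalibration weights."""

    def __init__(self, c, r=4):
        super().__init__()
        self.g_scale = nn.Parameter(torch.ones(c, 1))    # G(x) = a_c * x + b_c (assumed form)
        self.g_bias = nn.Parameter(torch.zeros(c, 1))
        self.reduce = nn.Linear(3 * c, c // r)           # "point reducer" on the style vector
        self.bn = nn.BatchNorm1d(c // r)
        self.expand = nn.Linear(c // r, c)               # "point increaser" back to C weights

    def forward(self, x):                                # x: (B, C, N)
        mu = x.mean(dim=-1)                              # Eq. 3: channel-wise mean
        sigma = x.std(dim=-1)                            # Eq. 4: channel-wise std
        logits = self.g_scale * x + self.g_bias
        imp = torch.softmax(logits, dim=-1)              # exp + normalize == Eq. 5 as a ratio of averages
        lip = (x * imp).sum(dim=-1)                      # LIP statistic per channel
        t = torch.cat([mu, sigma, lip], dim=1)           # Eq. 6: style features, (B, 3C)
        w = torch.sigmoid(self.expand(self.bn(self.reduce(t))))  # Eq. 7
        return x * w.unsqueeze(-1)                       # recalibrated point set
```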

3.1.2 Position embedding

Since our network is a clustering-based method, the position information of each point is not explicitly considered during feature extraction. Therefore, position coordinates are incorporated at the beginning of our network. Previous works [12, 49, 50] employed an absolute position embedding method. However, we note that this approach generates discrete and fixed coordinates, which is suboptimal when handling inputs with varying resolutions. To tackle this issue, we adopt a trainable CRPE method [51], which accommodates input images of different resolutions while remaining learnable.

Absolute position embedding (APE). Given an input LQ image \(I_{LQ} \in \mathbb {R}^{h\times w\times 3}\), the 2D coordinate of each pixel (ij) is represented as \(\left[ \frac{i}{w}-0.5, \frac{j}{h}-0.5\right]\).

CRPE. The CRPE introduces two improvements over the absolute position embedding method. Firstly, it incorporates a coordinate transformation from linear-space to log-space, which is mathematically expressed as

$$\begin{aligned}&\ \Delta x = {\rm{sign}}(x) \cdot \log (1+\left| x\right| ),&\end{aligned}$$
(8)
$$\begin{aligned}&\ \Delta y = {\rm{sign}}(y) \cdot \log (1+\left| y\right| ),&\end{aligned}$$
(9)

where x, y and \(\Delta x\), \(\Delta y\) denote the linear-space and log-space coordinates, respectively.

Secondly, a simple meta-network is employed to obtain continuous relative position bias instead of absolute position bias:

$$\begin{aligned}&\ B\left( {\Delta x,\Delta y} \right) = \psi \left( {\Delta x,\Delta y} \right) ,&\end{aligned}$$
(10)

where \(\psi\) represents a simple network; we use a 2-layer MLP followed by a sigmoid activation.
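The sketch below illustrates how CRPE could be realized, assuming the meta-network emits a 2-channel bias per point (matching the two position channels concatenated with the RGB features in Sect. 3.1.3); the hidden width is an assumption.

```python
import torch
import torch.nn as nn

class CRPE(nn.Module):
    """Sketch of the continuous relative position embedding (Eqs. 8-10)."""

    def __init__(self, hidden=64, out_dim=2):
        super().__init__()
        self.mlp = nn.Sequential(                # the 2-layer meta-network psi
            nn.Linear(2, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim), nn.Sigmoid(),
        )

    def forward(self, h, w):
        # normalized 2D coordinates in [-0.5, 0.5], as in the APE baseline
        ys = (torch.arange(h, dtype=torch.float32) / h - 0.5).view(h, 1).expand(h, w)
        xs = (torch.arange(w, dtype=torch.float32) / w - 0.5).view(1, w).expand(h, w)
        coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)     # (h*w, 2)
        coords = torch.sign(coords) * torch.log1p(coords.abs())   # Eqs. 8-9: log-space transform
        return self.mlp(coords)                                   # Eq. 10: learnable bias per point
```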

3.1.3 Encoder and decoder

In the Encoder, there are three stages to obtain the hierarchical feature representation. In stage 1, we first employ a linear embedding layer to project points with color features and position information to an arbitrary dimension C (\(h\times w, 5 \rightarrow h\times w, C\)). Subsequently, a SCoC block is applied to capture the potential style of the middle layers while preserving the feature resolution and channel dimension. In stages 2 and 3, the SCoC block is followed by a point reducer, which reduces the number of points by concatenating and fusing nearby points. In our network, each point reducer not only reduces the number of points by a factor of \(2 \times 2 = 4\), but also doubles the number of channels. As a result, the output resolutions of stages 2 and 3 are \((\frac{h}{2} \times \frac{w}{2}, 2C)\) and \((\frac{h}{4} \times \frac{w}{4}, 4C)\), respectively. The output of stage 3 serves as the input to the Decoder.

In the Decoder, there are also three stages corresponding to the Encoder to gradually recover texture information. In contrast to the point reducer, we utilize a point increaser in the first and second stages of the Decoder to expand the resolution of the feature map by a factor of 2 and halve the channel dimension \(\left( (\frac{h}{4} \times \frac{w}{4}, 4C) \rightarrow (\frac{h}{2} \times \frac{w}{2},2C) \rightarrow (h\times w,C) \right)\). After these two stages, the resolution of the feature map is restored to match the input, so only a single SCoC block is employed in the third stage without an additional point increaser. In Sect. 4.2, we provide a detailed discussion on the impact of different point increasers; sketches of both resampling layers are shown below.
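The following is a hedged sketch of the two resampling layers, assuming points are laid out on an h×w grid. The concat-then-linear reducer follows the description of "concatenating and fusing nearby points", and the increaser follows the "transposed ConV + linear" variant chosen in Sect. 4.2; the exact layer choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointReducer(nn.Module):
    """Sketch: fold each 2x2 neighborhood into one point (N -> N/4)
    and double the channels (C -> 2C)."""
    def __init__(self, c):
        super().__init__()
        self.fuse = nn.Linear(4 * c, 2 * c)

    def forward(self, x, h, w):                   # x: (B, N, C), N = h*w
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, h, w)
        patches = F.pixel_unshuffle(grid, 2)      # (B, 4C, h/2, w/2): concat 2x2 neighbors
        return self.fuse(patches.flatten(2).transpose(1, 2))   # (B, N/4, 2C)

class PointIncreaser(nn.Module):
    """Sketch of "transposed ConV + linear": a transposed convolution
    recovers spatial resolution, a linear layer halves the channels."""
    def __init__(self, c):
        super().__init__()
        self.up = nn.ConvTranspose2d(c, c, kernel_size=2, stride=2)
        self.proj = nn.Linear(c, c // 2)

    def forward(self, x, h, w):                   # x: (B, N, C), N = h*w
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, h, w)
        up = self.up(grid)                        # (B, C, 2h, 2w)
        return self.proj(up.flatten(2).transpose(1, 2))         # (B, 4N, C/2)
```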

3.2 Style-guided clustered point interaction U-Net (SCPI-UNet)

In SCoC-UNet, the CoC module focuses on calculating the similarity between points within the same cluster. However, this approach may ignore long-range dependencies between different clusters, which can limit the model’s ability to capture complex relationships and patterns across clusters. Here, we use an attention mechanism to establish connections between feature points in different clusters, and thus propose a style-guided clustered point interaction (SCPI) block. As illustrated in Fig. 3, the SCPI block shares the SCoC block's overall structure of a trunk branch and a mask branch. The key difference lies in its trunk branch, which comprises a CoC module and a cluster cross-attention (CCA) module instead of two consecutive CoC modules. The SCPI block can be seamlessly inserted into SCoC-UNet as a superior alternative to the SCoC block. The resulting model is dubbed SCPI-UNet, which introduces cross-cluster connections while maintaining efficient computation.

Fig. 3
figure 3

Pipeline of SCPI block, whose trunk branch consists of a CoC module and a CCA module

The pipeline of the CCA module is shown in Fig. 4. Given the hidden states of the i-th layer \({P^i} \in {\mathbb {R}^{n \times c}}\), we obtain the initialized cluster centers through a center proposal operation. Assuming the predefined number of clusters is p, the center proposal operation first uses a depth-wise separable convolution to change the dimension (\(\mathbb {R}^{n \times c} \rightarrow \mathbb {R}^{n \times pc}\)). Then, an aggregation operation (we use global average pooling) is applied to obtain the initialized cluster centers \(C\in \mathbb {R}^{p \times c}\). Note that the center proposal operation is only required in the first CCA module; all subsequent centers are updated from the previous CCA module.

Fig. 4
figure 4

Pipeline of CCA module. \(\otimes\) represents matrix multiplication, \(\odot\) is element-wise multiplication

Then, feature points with high similarity are aggregated into the same cluster. Unlike the clustering operation in the CoC module, each cluster here must contain the same number of points to facilitate parallel computation of the subsequent cross-attention. Let the similarity matrix between the hidden states and the cluster centers be denoted as \(s\in \mathbb {R}^{p\times n}\); we first assign a cluster ID to each point in the hidden states

$$\begin{aligned}&{I_j} = \mathrm{{argmax}} \left( s_{:j} \right) , 1 \le j \le n, I_j\in I,&\end{aligned}$$
(11)
$$\begin{aligned}&P^{\rm{sort}}, I' = \mathrm{{argsort}} \left( P^i, I \right) ,&\end{aligned}$$
(12)

where \(s_{:j}\) is the j-th column of s, and \(I_j\) is the assigned cluster ID of the j-th point in the hidden states. The function \(\rm{argmax}\) assigns each feature point to its most similar center by taking the maximum similarity value. The function \(\rm{argsort}\) sorts features based on cluster IDs. \(P^{\rm{sort}}\) represents the sorted feature points of \(P^{i}\), and \(I'\) denotes the original position indices of the shuffled feature points. The sorted feature points are then allocated into p clusters of the same size,

$$\begin{aligned}&I^C[k] = I'\left( {k\left[ {\frac{n}{p}} \right] :\left( {k + 1} \right) \left[ {\frac{n}{p}} \right] } \right) , 0 \le k < p,&\end{aligned}$$
(13)
$$\begin{aligned}&P^C[k] = {P^i}\left[ {I^C[k]} \right] , 0 \le k < p,&\end{aligned}$$
(14)

where \(I^C[k]\) denotes the IDs of feature points contained in the k-th cluster. \(P^{C}[k]\in \mathbb {R}^{m\times c}\) is the k-th clustered hidden states, where \(m=n/p\) represents the number of points in one cluster.
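A compact sketch of this assignment-and-sorting step (Eqs. 11–14) for a single sample is given below; it assumes n is divisible by p and uses a plain dot product as the similarity measure. Note that chunking the sorted order into equal slices means a crowded cluster may spill some points into the next chunk, which is the price of equal-size clusters for parallelism.

```python
import torch

def cluster_and_sort(p_i, centers):
    """Sketch of Eqs. 11-14: assign each point to its most similar center,
    sort points by cluster ID, and chunk them into p equal-size clusters.
    p_i: (n, c) hidden states; centers: (p, c) cluster centers."""
    p = centers.size(0)
    s = centers @ p_i.t()                  # similarity s: (p, n)
    ids = s.argmax(dim=0)                  # Eq. 11: cluster ID per point
    order = ids.argsort()                  # Eq. 12: sort point indices by cluster ID
    n = p_i.size(0)
    m = n // p                             # points per cluster
    idx_c = order.reshape(p, m)            # Eq. 13: I^C[k], original IDs of cluster k
    p_c = p_i[order].reshape(p, m, -1)     # Eq. 14: clustered hidden states, (p, m, c)
    return p_c, idx_c
```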

Next, we concatenate all feature points within the p clusters, \(P^{C} = {\rm{concat}}(P^{C}[k]), k=0,\ldots , p-1\), \(P^{C} \in \mathbb {R}^{m\times c \times p}\), and perform a cross-attention computation. Formally, we utilize linear transformation functions to calculate the corresponding query (\(Q\in \mathbb {R}^{m\times p\times c}\)), key (\(K\in \mathbb {R}^{m\times c\times p}\)), and value (\(V\in \mathbb {R}^{m\times p\times c}\)) as:

$$\begin{aligned}&Q=P^{C}W_q, K=P^{C}W_k, V=P^{C}W_v,&\end{aligned}$$
(15)

where \(W_q\), \(W_k\), and \(W_v\) are the weight matrices for calculating the query, key, and value, respectively. Moreover, we design a channel attention on the value to capture the dependencies between channels; the attention weight \(Q^c\) of each channel is obtained by a linear transformation of the updated center f computed by Eq. 1. These weights are used to adaptively rescale the value in a channel-wise manner, which can be formulated as:

$$\begin{aligned}&Q^c=fW_c,&\end{aligned}$$
(16)
$$\begin{aligned}&V^c=V\odot Q^c, V^c\in \mathbb {R}^{m\times p\times c},&\end{aligned}$$
(17)

where \(W_c\) is the weight matrix for calculating \(Q^c\), and \(\odot\) represents element-wise multiplication. Then the relevance of Q and K is computed to obtain the attention weights, which are used to aggregate the value and produce the output

$$\begin{aligned}&Y= \mathrm{{softmax}}\left( Q{K}/\sqrt{D}\right) V^c,&\end{aligned}$$
(18)

where \(\sqrt{D}\) is a scaling factor that prevents vanishing gradients, with D set to the number of channels.

Finally, the output is mapped back to the initial order according to the point IDs, which can be formulated as:

$$\begin{aligned}&I^C={\rm{concat}}(I^{C}[k]), k=0,\ldots , p-1,&\end{aligned}$$
(19)
$$\begin{aligned}&Y[I^C] = \rm{clone}(Y).&\end{aligned}$$
(20)

Our cross-attention not only takes into account the dependencies between channels, but also computes the similarities between the queries and keys of different clusters, realizing information exchange across clusters. Therefore, alternating CCA and CoC modules is more conducive to point interaction between clusters as well as adequate feature learning. In addition, since attention is computed on the clustered data, Eq. 18 has a computational complexity of \(\mathcal {O}(npc)\) rather than the usual quadratic complexity \(\mathcal {O}(n^2c)\) of self-attention [52], where n is much larger than p. In summary, the SCPI block is a better alternative to the SCoC block from the perspectives of both feature extraction and computational complexity. In Sect. 4, we validate the effectiveness and efficiency of SCPI-UNet through extensive experiments.
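As a rough numerical illustration (assuming the \(\times\)4 SR training setting of Sect. 4.1, i.e., a 64 \(\times\) 64 input at stage 1 with p = 32 clusters): n = 4096 points give m = n/p = 128, so Eq. 18 evaluates \(m \cdot p^2 = np = 131{,}072\) query–key similarities per channel instead of the \(n^2 \approx 16.8\)M required by full self-attention, a reduction by a factor of n/p = 128.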

3.3 Network advantage analysis

To the best of our knowledge, SCoC-UNet and SCPI-UNet are the first clustering-based IR networks and have several advantages over CNN- or ViT-based methods as follows.

  • Enhanced feature extraction: CNN-based methods can only learn features at locally fixed positions within a small convolution kernel. In contrast, our methods capture contextual interdependencies by calculating the similarity between cluster centers and each point in the image. As a result, our methods extract more discriminative features from the data, leading to improved performance on various image restoration tasks.

  • Improved efficiency: Our methods improve computational efficiency by dividing the data into multiple clusters, so that updating a cluster center only requires computing the interrelationships within that cluster. Our methods can thus handle high-dimensional data at reduced computational cost, making them more efficient than traditional CNN- and ViT-based methods.

  • Flexibility in data representation and image manipulation: Unlike CNN- and ViT-based methods that rely on fixed data representations, our approach dynamically adapts the learning process to the inherent characteristics of the data by grouping similar data points together. This adaptability enhances the model’s ability to generalize well to unseen data, which is crucial for achieving high performance.

  • Clear interpretability of feature extraction: By categorizing similar data points into the same cluster, our methods can help to understand the inherent structure and patterns in the data and provide deeper insights into the data, leading to a more comprehensive and intuitive description.

Overall, our clustering-based methods present a promising alternative to traditional CNN- and ViT-based methods, offering enhanced feature extraction, improved efficiency, flexibility in data representation and image manipulation, and clear interpretability of feature extraction.

4 Experiments

In this section, we present a comprehensive analysis of our extensive experiments on the JPEG CAR, image denoising, and image SR tasks. First, we give the essential experimental setup. Then, ablation studies are conducted to evaluate the effects of CRPE, the SCoC block, input resolution, and the point increaser. In addition, we provide a comprehensive evaluation and comparison of our SCoC-UNet and SCPI-UNet. Next, the computational costs of different methods on the three IR tasks are analyzed with respect to parameters, Multi-Adds, and inference time. Finally, we showcase both quantitative and qualitative results on the aforementioned IR tasks.

4.1 Experimental setup

Here, we provide implementation details of the experiments, including training data, model architecture, and training setting for each task.

Training data. We use the 800 training images from DIV2K [53] for JPEG CAR, color and grayscale image denoising, and image SR. Data augmentation is performed by random \(90^{\circ }\) rotation and horizontal flipping. For JPEG CAR, we generate compressed LQ images using the OpenCV JPEG encoder following [35]. For color and grayscale image denoising, Gaussian noise with certain noise levels (15, 25, and 50) is added to the HQ images to obtain the noisy images. For image SR, following [22], we use bicubic down-sampling to obtain low-resolution images with scaling factors of \(\times\)2, \(\times\)3, and \(\times\)4.

Model architecture. In the three stages of both SCoC-UNet and SCPI-UNet, the channel numbers are set to 64, 128, and 256, respectively. In SCoC-UNet, the number of proposal centers is fixed to 4. In SCPI-UNet, we use dynamically adjusted numbers of centers: 32, 16, and 8 in the three stages, respectively. Following [12], we employ a multi-head computing method to enhance our models, with the head numbers set to 4, 4, and 8, respectively.

Training setting. For fairness, we train SCoC-UNet and SCPI-UNet using the same settings. The mean absolute error (MAE) [22] is used as the loss function. For optimization, we employ the Adam optimizer with \(\beta _1=0.9\) and \(\beta _2=0.99\). The batch size is set to 16. Our models are trained for 600 epochs for image denoising and JPEG CAR, and 800 epochs for image SR. The initial learning rate is set to 5e-4 and is halved every 200 epochs. During the training phase, the input patch size is set to 128 \(\times\) 128 for JPEG CAR and image denoising. For the image SR task, the input is the low-resolution image corresponding to a high-resolution image of size 256 \(\times\) 256. Distinctively, we do not use the point reducer and point increaser in \(\times\)4 SR in order to preserve more of the original image features. The entire project is implemented in PyTorch 1.8.1 and trained on four NVIDIA Tesla V100 GPUs.
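For reference, the training configuration above translates into roughly the following PyTorch skeleton; the model and data loader are stand-ins for illustration, not the released training code.

```python
import torch

# Minimal sketch of the stated training setup: MAE loss,
# Adam(0.9, 0.99), lr 5e-4 halved every 200 epochs, batch size 16.
model = torch.nn.Conv2d(3, 3, 3, padding=1)    # stand-in for SCoC-UNet / SCPI-UNet
loader = [(torch.rand(16, 3, 128, 128), torch.rand(16, 3, 128, 128))]  # dummy 128x128 patches

criterion = torch.nn.L1Loss()                  # mean absolute error (MAE)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.99))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)

for epoch in range(600):                       # 800 epochs for image SR
    for lq, hq in loader:
        optimizer.zero_grad()
        loss = criterion(model(lq), hq)
        loss.backward()
        optimizer.step()
    scheduler.step()
```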

4.2 Ablation study

For the ablation study, we conduct a series of experiments to investigate the effects of various components in our SCoC-UNet and SCPI-UNet. To carry out these experiments, we train SCoC-UNet and SCPI-UNet for \(\times\)4 image SR on the DIV2K dataset and subsequently test them on four widely used benchmark datasets, namely Set5 [54], Set14 [55], BSD100 [56], and Urban100 [57].

Effect of CRPE. Table 1 shows the results (average PSNR) of evaluating the effect of the position embedding method on performance. We find that CRPE improves PSNR on most of the test datasets for both SCoC-UNet and SCPI-UNet. These results illustrate the effectiveness of the adopted CRPE approach in enhancing overall performance.

Table 1 Ablation study on position embedding method

Effect of SCoC block. Table 2 provides the ablation results that investigate the effects of the SCoC block in SCoC-UNet. In case 1, the trunk branch is replaced by cascaded 3\(\times\)3 ConV layers. Case 2 removes the mask branch entirely, while case 3 replaces it with the raw SRM. Firstly, by comparing the results of cases 1 and 4, it is evident that the trunk branch has a significant impact on model performance. The pivotal reason is that the CoC module is capable of aggregating and decomposing point information, leading to more discriminative features. Secondly, when comparing case 4 with cases 2 and 3, we observe that case 4 achieves the highest PSNR values, illustrating the effectiveness of the mask branch in recalibrating intermediate feature maps. Lastly, case 3 outperforms case 2 on all benchmark datasets, demonstrating that our mask branch is more effective than the raw SRM. The mask branch provides potential style guidance for the network with enhanced feature learning and expression capabilities. These results collectively demonstrate the effectiveness of both trunk and mask branches in SCoC block.

Table 2 Ablation study on SCoC block

Effect of input resolution. The test results of SCoC-UNet and SCPI-UNet with training input resolutions of 48 \(\times\) 48, 64 \(\times\) 64, and 80 \(\times\) 80 are shown in Table 3. It can be observed that increasing the input resolution improves model performance, which can be attributed to the fact that larger input resolutions allow more useful information to be learned. However, higher input resolutions also incur a significant increase in computational burden. To ensure efficient training, we choose an input resolution of 64 \(\times\) 64 to evaluate our SCoC-UNet and SCPI-UNet.

Table 3 Ablation study on input resolution in training phase

Effect of point increaser. Since the point increaser is not used in the \(\times\)4 SR task, we evaluate the effects of different point increasers on image denoising at a challenging noise level of 50. As shown in Table 4, we conduct experiments on SCoC-UNet with bilinear interpolation, transposed ConV, pixel shuffle, and transposed ConV \(+\) linear. The difference between transposed ConV and transposed ConV \(+\) linear is that the former solely employs the transposed ConV layer to recover both the spatial resolution and channel dimension, while the latter uses a transposed ConV to recover the spatial resolution and a linear embedding layer to recover the channel dimension. Experimental results indicate that transposed ConV \(+\) linear performs better on most datasets, so it is chosen as the final point increaser.

Table 4 Ablation study on the effect of point increaser

SCoC-UNet or SCPI-UNet? We evaluate the effectiveness of SCoC-UNet and SCPI-UNet using different trunk combinations of CoC+CoC (i.e., SCoC-UNet), CCA+CCA, and CoC+CCA (i.e., SCPI-UNet); the experimental results are shown in Table 5. FLOPs are calculated for a 256 \(\times\) 256 HR image, and the inference time is measured on Set5 (\(\times\)4) in a feedforward pass. SCPI-UNet achieves competitive or even better image restoration results than SCoC-UNet, but with fewer parameters, fewer FLOPs, and shorter inference time. This is because, compared to the CoC module, our CCA module replaces the fully connected layer with linear attention. In addition, the CCA module eliminates the MLP layer, which further reduces computational costs. Although the network with CCA+CCA modules has the lowest computational cost, it achieves less effective image restoration results. Therefore, our SCPI-UNet with the CoC+CCA combination is a good choice in terms of both restoration quality and computational cost.

Table 5 SCoC-UNet vs SCPI-UNet

4.3 Computational cost analysis

Figs. 5, 6, and 7 show the comparison of our SCoC-UNet and SCPI-UNet with other state-of-the-art methods in terms of model parameters, multiply-accumulate operations (Multi-Adds), and inference time. All comparative experiments are implemented in PyTorch using the officially provided source code. The JPEG CAR results are obtained on LIVE1 at a JPEG quality factor of 40, the denoising results on BSD68 with a noise level of 50, and the SR results on Set5 with \(\times\)4 upscaling. As depicted in Fig. 5, our methods obtain higher PSNR values with fewer parameters than the other competitors on the JPEG CAR and denoising tasks. On the SR task, although the number of parameters is much higher than that of the competitors, our methods achieve the best SR results. We also report the Multi-Adds and inference time of the various methods calculated for a 1280 \(\times\) 720 HR image. As shown in Figs. 6 and 7, our methods require fewer Multi-Adds and shorter inference time while obtaining higher PSNR values than the other competitors. These results can be attributed to the unique structure of our model, particularly the SCPI and SCoC blocks, which enhance the ability to mine and recalibrate deep features. In conclusion, our proposed methods are highly efficient and effective.

Fig. 5
figure 5

Average PSNR versus parameters on JPEG CAR, image denoising, and image SR

Fig. 6
figure 6

Average PSNR versus Multi-Adds on JPEG CAR, image denoising, and image SR

Fig. 7
figure 7

Average PSNR versus inference time on JPEG CAR, image denoising, and image SR

4.4 Results on JPEG CAR

To evaluate the effectiveness of our SCoC-UNet and SCPI-UNet, we conduct both quantitative and qualitative analysis. We compare their performance with several state-of-the-art JPEG CAR methods, including QGAC [58], RNAN [38], MWCNN [27], ESCNet [59], IDCN [60], RDN [25], Swin2SR [35, 61], CODE [62], DRUNet [28], and BFeCarNet [63]. The evaluation is carried out on Classic5 [64] and LIVE1 [65] datasets, using three different JPEG quality factors of 20, 30, and 40.

Quantitative analysis. Table 6 shows the quantitative results of our methods and other competitors. Following [28], PSNR, SSIM, and PSNR-B in the transformed YCbCr space serve as the evaluation metrics. As one can see, our approaches achieve superior performance to the compared methods across most datasets and JPEG quality factors. It is worth mentioning that RDN is a CNN-based method, while RNAN, Swin2SR, and CODE are ViT-based methods; our approaches surpass all of them. This can be attributed to the fact that our approach utilizes a fundamentally different feature extraction paradigm, and the clustering method introduces a promising avenue for further research. The U-shaped structure of our network enables the extraction of multi-level features, while the proposed SCoC and SCPI blocks facilitate adaptive recalibration of features, resulting in improved performance. Notably, SCPI-UNet outperforms SCoC-UNet on almost all datasets, indicating that the proposed SCPI block is able to realize the interaction of feature points and obtain more discriminative features. Moreover, compared to the advanced Swin2SR and DRUNet, our approaches have fewer parameters (only 6.7 M for SCoC-UNet and 4.8 M for SCPI-UNet), whereas DRUNet has 32.7 M parameters and Swin2SR has approximately 12.0 M (see Fig. 5). These results collectively illustrate the effectiveness of our methods. Furthermore, the performance of our methods improves significantly at a quality factor of 40 due to the retention of more high-frequency information. However, at a quality factor of 10, our methods are limited in their ability to improve performance due to the loss of high-frequency information. In future research, we will consider improving the performance at low quality factors.

Table 6 Quantitative results (average PSNR, SSIM, and PSNR-B) of different methods for JPEG CAR with quality factors of 20, 30, and 40

Qualitative analysis. Fig. 8 shows the visual comparisons of different JPEG CAR methods. The compared images “boats”, “bikes”, and “caps” are taken from the Classic5 and LIVE1 datasets. From the enlarged views, we can observe that our methods effectively remove artifacts, restore sharper edges, and retain more natural texture information. Specifically, in the restored images of “bikes”, our methods recover the markings on the bike more faithfully to the reference image, whereas some other methods exhibit artifacts and produce unclear images.

Fig. 8
figure 8

Visual comparisons of different JPEG CAR methods with quality factor of 20

4.5 Results on image denoising

We further compare our SCoC-UNet and SCPI-UNet with several state-of-the-art methods on the image denoising task. The compared methods include BRDNet [26], FOCNet [66], RNAN [38], RDN [25], DeamNet [67], Restormer [68], IRCNN [1], FFDNet [21], DSNet [23], and RPCNN [69]. The evaluation is conducted on Set12 [17], BSD68 [56], Urban100 [57], CBSD68 [56], Kodak24 [70], and McMaster [71], considering both color and grayscale image denoising tasks.

Results on grayscale image denoising. Table 7 displays the quantitative results of different methods on grayscale image denoising at three noise levels of 15, 25, and 50. As one can see, our SCPI-UNet performs better than all other competitors except Restormer on almost all benchmarks and noise levels. Although our approaches show lower PSNR values than Restormer on some datasets, they are arguably more efficient: Restormer's number of parameters (see Fig. 5b), Multi-Adds (see Fig. 6b), and inference time (see Fig. 7b) are much higher than those of our methods. Notably, compared to the advanced CNN-based method RDN, SCPI-UNet achieves improvements of 0.12 dB on Set12, 0.16 dB on BSD68, and 0.21 dB on Urban100 at a challenging noise level of 50. Furthermore, compared to the non-local attention-based method RNAN, our network also demonstrates significant performance gains. These promising improvements are attributed to the efficient network structure and the novel SCPI block, highlighting the substantial ability of clustering-based methods to enhance IR performance.

Table 7 Quantitative results (average PSNR) of different methods for grayscale image denoising with noise levels of 15, 25, and 50

Results on color image denoising. Table 8 presents the quantitative results of different methods on color image denoising. As one can see, our models outperform the compared methods except Restormer at most noise levels. Figure 9 showcases the visual comparison results of different methods at a noise level of 50. The two compared images are “208001” and “157055” from the CBSD68 dataset. From the enlarged views, we can observe that our methods are more effective at removing noise and retaining sharper textures than the other competitors.

Table 8 Quantitative results (average PSNR) of different methods for color image denoising with noise levels of 15, 25, and 50
Fig. 9
figure 9

Visual comparisons with state-of-the-art methods for color image denoising with noise level of 50

4.6 Results on image SR

We compare our methods with several state-of-the-art image SR methods, including VDSR [22], DRCN [24], MemNet [72], IDN [30], ACNet [19], LESRCNN [20], LAPAR-B [73], SRMDNF [74], CARN [29], IMDN [31], DefRCN [75], and RFDN [32]. To ensure a fair comparison, all methods are trained on the same dataset, and low-resolution images are obtained using the same bicubic interpolation method.

Quantitative analysis. Table 9 presents the quantitative results of different SR methods for three scale factors (i.e., \(\times\)2, \(\times\)3, and \(\times\)4). As one can see, the proposed SCoC-UNet and SCPI-UNet achieve superior performance on most benchmark datasets and scale factors. The maximum PSNR gains of SCoC-UNet and SCPI-UNet reach 0.18 dB and 1.14 dB on Urban100 for \(\times\)2 image SR. These results highlight the effectiveness of our approaches in recovering the texture details of high-resolution images.

Table 9 Quantitative results (average PSNR and SSIM) of different methods for image SR with three scale factors (\(\times\)2, \(\times\)3, and \(\times\)4) on four benchmark datasets

Qualitative analysis. Fig. 10 showcases the visual comparisons of different image SR methods at a scale factor of \(\times\)4. The two compared images “253027” and “img096” are from the BSD100 and Urban100 datasets, respectively. From the enlarged images, it is evident that the super-resolved images recovered by our methods closely resemble the real high-resolution images compared to those of the other competitors. Specifically, in the super-resolved image of “253027”, our methods successfully recover clearer and sharper lines on the animal, while the other competitors produce artifacts or blurring.

Fig. 10
figure 10

Visual comparisons of different image SR methods with scale factor of \(\times\)4

5 Conclusion and limitation

In this paper, we propose two novel clustering-based frameworks of SCoC-UNet and SCPI-UNet for multiple IR tasks. Specifically, we utilize the trainable CRPE instead of the original absolute position embedding method to handle inputs with different resolutions. To capture the potential style of middle layers, we propose a SCoC block to efficiently extract context features and adaptively recalibrate feature maps. To capture the long-range dependencies, we propose a SCPI block to establish the connections of feature points between different clusters. Furthermore, a U-shaped architecture consisting of a symmetric Encoder–Decoder based on SCoC block and SCPI block is established to obtain hierarchical feature representations. Extensive experimental results on several typical IR tasks demonstrate that our SCoC-UNet and SCPI-UNet achieve superior quantitative and qualitative performance compared to the state-of-the-art methods, and comprehensive ablation analysis validates the effectiveness of individual components of our networks.

While our methods have made significant progress on several image restoration tasks, there are still some limitations that need to be addressed. (1) Sensitivity to initialization: clustering algorithms, including our SCoC-UNet and SCPI-UNet, can be sensitive to the initial cluster centers. This sensitivity might lead to different results with different initializations, potentially impacting the stability and consistency of the method. (2) Difficulty in determining the optimal number of clusters: one challenge in our methods is determining the appropriate number of clusters, which requires manual intervention or additional heuristics. In our model, selecting the optimal number of clusters remains an open question and might require further experimentation or domain expertise. Therefore, we will continue to improve and optimize our method to make it more robust and stable in future research. Moreover, we will explore ways to reduce the dependency on the initial cluster centers and the optimal number of clusters, making our model more user-friendly in practical applications.