Decouple Learning for Parameterized Image Operators

Fan, Qingnan; Chen, Dongdong; Yuan, Lu; Hua, Gang; Yu, Nenghai; Chen, Baoquan

doi:10.1007/978-3-030-01261-8_27

Qingnan Fan^17,19,
Dongdong Chen¹⁸,
Lu Yuan²⁰,
Gang Hua²⁰,
Nenghai Yu¹⁸ &
…
Baoquan Chen^17,21

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 11217))

Included in the following conference series:

European Conference on Computer Vision

2900 Accesses
36 Citations

Abstract

Many different deep networks have been used to approximate, accelerate or improve traditional image operators, such as image smoothing, super-resolution and denoising. Among these traditional operators, many contain parameters which need to be tweaked to obtain the satisfactory results, which we refer to as “parameterized image operators”. However, most existing deep networks trained for these operators are only designed for one specific parameter configuration, which does not meet the needs of real scenarios that usually require flexible parameters settings. To overcome this limitation, we propose a new decouple learning algorithm to learn from the operator parameters to dynamically adjust the weights of a deep network for image operators, denoted as the base network. The learned algorithm is formed as another network, namely the weight learning network, which can be end-to-end jointly trained with the base network. Experiments demonstrate that the proposed framework can be successfully applied to many traditional parameterized image operators. We provide more analysis to better understand the proposed framework, which may inspire more promising research in this direction. Our codes and models have been released in https://github.com/fqnchina/DecoupleLearning.

Q. Fan and D. Chen—Equal Contribution.

You have full access to this open access chapter, Download conference paper PDF

A Multiscale Image Denoising Algorithm Based on Dilated Residual Convolution Network

Fine-grained scale space learning for single image super-resolution

Article 06 July 2022

Learning a Deep Convolutional Network for Image Super-Resolution

1 Introduction

Image operators are fundamental building blocks for many computer vision tasks, such as image smoothing [16, 42], super resolution [25, 27] and denoising [33]. To obtain the desired results, many of these operators contain some parameters that need to be tweaked. We refer them as “parameterized image operators” in this paper. For example, parameters controlling the smoothness strength are widespread in most smoothing methods, and a parameter denoting the target upsampling scalar is always used in image super resolution.

Recently, many CNN based methods [16, 25, 44] have been proposed to approximate, accelerate or improve these parameterized image operators and achieved significant progress. However, we observe that the networks in these methods are often only trained for one specific parameter configuration, such as edge-preserving filtering [16] with a fixed smoothness strength, or super resolving low-quality images [25] with a particular downsampling scale. Many different models need to be retrained for different parameter settings, which is both storage-consuming and time-consuming. It also prohibits these deep learning solutions from being applicable and extendable to a much broader corpus of images.

In fact, given a specific network structure, when training separated networks for different parameter configurations $\overrightarrow{\gamma }_k$ as [16, 25, 44], the learned weights $W_k$ are unconstrained and probably very different for each $\overrightarrow{\gamma }_k$. But can we find a common convolution weight space for different configurations by explicitly building their relationships? Namely, $W_k = h(\overrightarrow{\gamma }_k)$, where h can be a linear or non-linear function. In this way, we can adaptively change the weights of the single target network based on h in the runtime, thus enabling continuous parameter control.

To verify our hypothesis, we propose the first decouple learning framework for parameterized image operators by decoupling the weights from the target network structure. Specifically, we employ a simple weight learning network $\mathcal {N}_{weight}$ as h to directly learn the convolution weights of one task-oriented base network $\mathcal {N}_{base}$. These two networks can be trained end-to-end. During the runtime, the weight learning network will dynamically update the weights of the base network according to different input parameters, thus making the base network generate different objective results. This should be a very useful feature in scenarios where users want to adjust and select the most visually pleasant results interactively.

We justify the effectiveness of the proposed framework for many different types of applications, such as edge-preserving image filtering with different degrees of smoothness, image super resolution with different scales of blurring, and image denoising with different magnitudes of noise. We also demonstrate the extensibility of our proposed framework on multiple input parameters for a specific application, and combination of multiple different image processing tasks. Experiments show that the proposed framework is able to learn as good results as the one solely trained with a single parameter value.

As an extra bonus, the proposed framework makes it easy to analyze the underlying working principle of the trained task-oriented network by visualizing different parameters. The knowledge gained from this analysis may inspire more promising research in this area. To sum up, the contributions of this paper lie in the following three aspects.

We propose the first decouple learning framework for parameterized image operators, where a weight learning network is learned to adaptively predict the weights for the task-oriented base network in the runtime.
We show that the proposed framework can be learned to incorporate many different parameterized image operators and achieve very competitive performance with the one trained for a single specific parameter or operator.
We provide a unique perspective to understand the working principle of the trained task-oriented network with some valuable analysis and discussion, which may inspire more promising research in this area.

2 Related Work

In the past decades, many different image operators have been proposed for low level vision tasks. Previous work [24, 42, 45, 50] proposed different priors to smooth images while preserving salient structures. Some work [2, 15] utilized the spatial relationship and redundancy to remove unpleasant noise in the image. Some other papers [37, 39, 46] aimed to recover a high-resolution image from a low-resolution image. Among them, many operators are allowed to tune some built-in parameters to obtain different results, which is the focus of this paper.

Recently, deep learning has been applied to many different tasks, like recognition [8, 9, 11, 12, 29, 48, 49], generation [28, 30, 35], and image to image translation [3,4,5, 17, 23, 32]. For the aforementioned image operators, some methods like [16, 31, 44] are also proposed to approximate, accelerate and improve them. But their common limitation is that one model can only handle one specific parameter. To enable all other parameters, enormous different models need to be retrained, which is both storage-consuming and time-consuming. By contrast, our proposed framework allows us to input continuous parameters to dynamically adjust the weights of the task-oriented base network. Moreover, it can even be applied to multiple different parameterized operators with one single network.

Recently, Chen et al. [6] conducted a naive extension for parameterized image operators by concatenating the parameters as extra input channels to the network. Compared to their method, where both the network structure and weights maintain the same for different parameters, the weights of our base network are adaptively changed. Experimentally we find our framework outperforms their strategy by integrating multiple image operators. By decoupling the network structure and weights, our proposed framework also makes it easier to analyze the underlying working principle of the trained task-oriented network, rather than leaving it as a black box as in many previous works like [6].

Our method is also related to evolutionary computing and meta learning. Schmidhuber [36] suggested the concept of fast weights in which one network can produce context-dependent weight changes for a second network. Some other works [1, 7, 41] casted the design of an optimization algorithm as a learning problem, Recently, Ha et al. [22] proposed to use a static hypernetwork to generate weights for a convolutional neural network on MNIST and Cifar classification. They also leverage a dynamic hypernetwork to generate weights of recurrent networks for a variety of sequence modelling tasks. The purpose of their paper is to exploit weight sharing property across different convolution layers. But in our cases, we pay more attention to the common shared property among numerous input parameters and many different image operators.

3 Method

3.1 Problem Definition and Motivation

The input color image and the target parameterized image operators are denoted as $\mathcal {I}$ and $f(\overrightarrow{\gamma }, \mathcal {I})$ respectively. $f(\overrightarrow{\gamma }, \mathcal {I})$ transforms the content of $\mathcal {I}$ locally or globally without changing its dimension. $\overrightarrow{\gamma }$ denotes the parameters which determine the transform degree of f and may be a single value or a multi-value vector. For example, in $L_0$ smoothing [43], $\overrightarrow{\gamma }$ is the balance weight controlling the smoothness strength, while in RTV filter [45], it includes one more spatial gaussian variance. In most cases, f is a highly nonlinear process and solved by iterative optimization methods, which is very slow in runtime.

Our goal is to implement parameterized operator f with a base convolution network $\mathcal {N}_{base}$. In previous methods like [31, 44], given a specific network structure of $\mathcal {N}_{base}$, separated networks are trained for different parameter configuration $\overrightarrow{\gamma }_k$. In this way, the learned weights $\overrightarrow{W}_k$ of these separated networks are highly unconstrained and probably very different. But intuitively, for one specific image operator, the weights $\overrightarrow{W}_k$ of different $\overrightarrow{\gamma }_k$ might be related. So retraining separated models is too redundant. Motivated by this, we try to find a common weight space for different $\overrightarrow{\gamma }_k$ by adding a mapping constraint: $\overrightarrow{W}_k = h(\overrightarrow{\gamma }_k)$, where h can be a linear or non-linear function.

In this paper, we directly learn h with another weight learning network $\mathcal {N}_{weight}$ rather than design it by handcraft. Assuming $\mathcal {N}_{base}$ is a fully convolutional network having a total of n convolution layers, we denote their weights as $\overrightarrow{W}_k=(W_1, W_2, ..., W_n)$ respectively, then

$$\begin{aligned} \begin{aligned} (W_1, W_2, ..., W_n) = \mathcal {N}_{weight}(\overrightarrow{\gamma }) \end{aligned} \end{aligned}$$

(1)

where the input of $\mathcal {N}_{weight}$ is $\overrightarrow{\gamma }$ and the outputs are these weight matrices. In the training stage, $\mathcal {N}_{base}$ and $\mathcal {N}_{weight}$ can be jointly trained. In the inference stage, given different input parameter $\overrightarrow{\gamma }$, $\mathcal {N}_{weight}$ will adaptively change the weights of the target base network $\mathcal {N}_{base}$, thus enabling continuous parameter control.

Besides the original input image $\mathcal {I}$, the computed edge maps are shown to be a very important input signal for the target base network in [16]. Therefore, we also pre-calculate the edge map E of $\mathcal {I}$ and concatenate it to the original image as an extra input channel:

$$\begin{aligned} \begin{aligned} E_{x,y} = \frac{1}{4}\sum _c(|\mathcal {I}_{x,y,c}-\mathcal {I}_{x-1,y,c}| + |\mathcal {I}_{x,y,c}-\mathcal {I}_{x+1,y,c}| \\ +\ |\mathcal {I}_{x,y,c}-\mathcal {I}_{x,y-1,c}| + |\mathcal {I}_{x,y,c}-\mathcal {I}_{x,y+1,c}|) \end{aligned} \end{aligned}$$

(2)

where x, y are the pixel coordinates and c refers to the color channels.

To jointly train $\mathcal {N}_{base}$ and $\mathcal {N}_{weight}$, we simply use pixel-wise L2 loss in the RGB color space as [6] by default:

$$\begin{aligned} \begin{aligned} \mathcal {L} = ||\mathcal {N}_{base}(\mathcal {N}_{weight}(\overrightarrow{\gamma }),\mathcal {I}, E) - f(\overrightarrow{\gamma }, \mathcal {I}) ||^2 \end{aligned} \end{aligned}$$

(3)

3.2 Network Structure

As shown in Fig. 1, our base network $\mathcal {N}_{base}$ follows a similar network structure as [16]. We employ 20 convolutional layers with the same $3\times 3$ kernel size, among which the intermediate 14 layers are formed as residual blocks. Except the last convolution layer, all the former convolutional layers are followed by an instance normalization [40] layer and a ReLU layer. To enlarge the receptive field of $\mathcal {N}_{base}$, the third convolution layer downsamples the dimension of feature maps by 1/2 using stride 2, and the third-to-last deconvolution layer (kernel size of $4\times 4$) upsamples the downsampled feature maps to the original resolution symmetrically. In this way, the receptive field is effectively enlarged without losing too much image detail, and meanwhile the computation cost of intermediate layers is reduced. To further increase the receptive field, we also adopt dilated convolution [47] as [6], more detailed network structure can be found in the supplementary material.

In this paper, the weight learning network $\mathcal {N}_{weight}$ simply consists of 20 fully connected (fc) layers by default. The $i_{th}$ fc layer is responsible to learn the weights $W_i$ for the $i_{th}$ convolutional layer, which can be written as following:

$$\begin{aligned} \begin{aligned} W_i = A_i\overrightarrow{\gamma } + B_i, \qquad \forall i \in \{1,2,...,20\} \end{aligned} \end{aligned}$$

(4)

where $A_i, B_i$ are the weight and bias of the $i_{th}$ fc layer. Assuming the parameter $\overrightarrow{\gamma }$ has a dimension of m and $W_i$ has a dimension of $n_{wi}$. The dimension of $A_i$ and $B_i$ would be $n_{wi}\times m$ and $n_{wi}$ respectively.

Note in this paper, we don’t intend to design an optimal network structure neither for the base network $\mathcal {N}_{base}$ nor the weight learning network $\mathcal {N}_{weight}$. On the contrary, we care more about whether it is feasible to learn the relationship of the weights of $\mathcal {N}_{base}$ and different parameter configurations $\overrightarrow{\gamma }$ even by such a simple weight learning network $\mathcal {N}_{weight}$.

4 Experiments

4.1 Choice of Image Operators

To evaluate the proposed framework on a broad scope of parameterized image operators, we leverage two representative types of image processing tasks: image filtering and image restoration. Within each of them, more than four popular operators are selected for detailed experiments.

Image Filtering. Here we employ six popular image filters, denoted as $L_0$ [42], WLS [18], RTV [45], RGF [50], WMF [51] and shock filter [34], which have been developed to work especially well for many different applications, such as image abstraction, detail exaggeration, texture removal and image enhancement. However, previous deep learning based approaches [16, 31, 44] are only able to deal with one single parameter value in one trained model, which is far from practical.

Image Restoration. The goal of image restoration is to recover a clear image from a corrupted image. In this paper we deal with four representative tasks in this venue: super resolution [14, 27], denoising [26, 33], deblocking [13, 38] and derain [20, 49], which have been studied with deep learning based approaches extensively. For example, image super resolution is dedicated to increasing the resolution or enhancing the lost details from a low-resolution blurry image. To generate the pairwise training samples, previous work used to downsample a clear image by a specific scale with bicubic interpolation to synthesize a low-resolution image. Likewise, many previous works have typically been developed to fit a specific type of input image, such as a fixed upsampling scale.

4.2 Implementation Details

Dataset. We take use of the 17k natural images in the PASCAL VOC dataset as the clear images to synthesize the ground truth training samples. The PASCAL VOC images are picked from Flicker, and consists of a wide range of viewing conditions. To evaluate our performance, 100 images from the dataset are randomly picked as the test data for the image filtering task. While for the restoration tasks, we take the well-known benchmark for each specific task for testing, which is specifically BSD100 (super resolution), BSD68 (denoise), LIVE1 (deblock), RAIN12 (derain). For the filtering task, we filter the natural images with the aforementioned algorithms to produce ground truth labels. As for the image restoration tasks, the clear natural image is taken as the target image while the synthesized corrupted image is used as input.

Parameter Sampling. To make our network able to handle various parameters, we generate training image pairs with a much broader scope of parameter values rather than a single one. We uniformly sample parameters in either the logarithm or the linear space depending on the specific application. Regarding the case of logarithm space, let l and u be the lower bound and upper bound of the parameter, the parameters are sampled as follows:

$$\begin{aligned} y = e^x, \text { where } x \in [{\ln l},{\ln u}] \end{aligned}$$

(5)

In other words, we first uniformly sample x between $\ln l$ and $\ln u$, then map it back by the exponential function, similar to the one used in [6]. Note if the upper bound u is tens or even hundreds of times larger than the lower bound l, the parameters are sampled in the logarithm space to balance their magnitudes, otherwise they are sampled in the linear space.

Table 1. Quantitative absolute difference between the network trained with a single parameter value and numerous random values for each image smoothing filter.

Full size table

4.3 Qualitative and Quantitative Comparison

Image Filtering. We first experiment with our framework on five image filters. To evaluate the performance of our proposed algorithm, we train one network for each parameter value ($\lambda $) in one filter, and also train a network jointly on continuous random values sampled from the filter’s parameter range, which can be inferred from the $\lambda $ column in Table 1. The performance of the two networks is evaluated on the test dataset with PSNR and SSIM error metrics. Since our goal is to measure the performance difference between these two strategies, we directly compute the absolute difference of their errors and demonstrate the results in Table 1. The results of the other two filters (RGF and WMF) are shown in the supplemental material due to space limitations.

As can be seen, though our proposed framework lags a little behind the one trained on a single parameter value, their difference is too small to be noticeable, especially for the SSIM error metric. Note that for each image filter, our algorithm only requires one jointly trained network, but previous methods need to train separate networks for each parameter value. Moreover even if the five filters are dedicated to different image processing applications, and varies a lot in their implementation details, our proposed framework is still able to learn all of them well, which verifies the versatility and robustness of our strategy.

Some visual results of our proposed framework are shown in Fig. 2. As can be seen, our single network trained on continuous random parameter values is capable of predicting high-quality smooth images of various strengths.

Image Restoration. We then evaluate the proposed framework on three popular image restoration tasks as shown in Table 2, which perform essentially different from image filtering. Unlike the above operators which employ the filtered images as target to learn, this task takes the clear images as the ground truth label while the corrupted images as input. That is to say, as for the former task, given an input image, our network learns different filtering effects, while regarding the latter one, our model learns to recover from different corrupted images.

Table 2. Quantitative absolute difference in PSNR and SSIM between the network trained on a single parameter value and numerous random values on the three image restoration tasks. Their parameters specifically mean downsampling scale (s), Gaussian standard deviation ($\sigma $) and JPEG quality (q).

Full size table

As shown in Table 2, our results trained jointly on continuous random parameter values also show no big difference from the one trained solely on an individual parameter value, which further validate our algorithm in a broader image processing literature.

4.4 Extension to Multiple Input Parameters

Except for experimenting on a single input parameter, we also demonstrate our results on inputting multiple types of parameters, which is still very common for many image processing tasks.

In this section, we evaluate our performance on the famous texture removal tool RTV [45]. Likewise in previous experiments, we leverage $\lambda $ which balances between the data prior term and smoothness term in its energy function as one parameter, and $\sigma $ which controls the spatial scale for computing the windowed variation and is even more effective in removing textures. To generate the training samples, we randomly sample these two parameters. Therefore, the input parameter $\overrightarrow{\gamma }$ of the weight learning network is a two-element vector $[\lambda , \sigma ]$.

To evaluate the performance of our network on this two dimensional parameter space compared with the single parameter setting case, we sample a few parameters along one dimension while fixing another as shown in Table 3. We can see that for most of the 10 parameter settings, all achieve very close results to the one trained with an individual parameter setting. This verifies the effectiveness of our proposed network on this more difficult case.

Table 3. Quantitative comparison between the network trained on a single parameter setting and numerous random settings under the condition of multiple input parameters. Their absolute difference is shown besides the value of nume. The results are tested by fixing one parameter while varying another.

Full size table

Table 4. Numerical results (PSNR (above) and SSIM (bottom)) of our proposed framework jointly trained over different number of image operators (#operators). “6/4” refers to the results jointly trained over either the front 6 filtering based approaches or the last 4 restoration tasks. “10" is the results of jointly training all 10 tasks.

Full size table

4.5 Extension to Joint Training of Multiple Image Operators

Intuitively, another challenging case for our proposed framework is to incorporate multiple distinct image operators into a single learned neural network, which is much harder to be trained due to their different implementation details and purposes. To explore the potential of our proposed neural network, we experiment by jointly training over (i). 6 filtering based operators, (ii). 4 image restoration operators or (iii). all the 10 different operators altogether. To generate training images of each image operator, we sample random parameter values continuously within its parameter range. For the shock filter and derain task, we leverage its default parameter setting for training.

The input to the weight learning network now takes two parameters, one indicates the specific image operator while the other is the random parameter values assigned to the specified filter. These 10 image operators are denoted simply by 10 discrete values that range from 0.1 to 1.0 in the input parameter vector. Since the absolute parameter range may differ a lot from operator to operator, for example, [2,4] for super resolution and [0.002,0.2] for $L_0$ filter, we rescale the parameters in all the operators into the same numerical range to enable consistent back-propagated gradient magnitude.

As shown in Table 4, training on each individual image operator achieves the highest numerical score (#ope.=1), which is averaged over multiple different parameter settings just like in previous tables. While jointly training over either 6 image filters or 4 restoration tasks (#ope.=6/4), even for the case where all 10 image operators are jointly trained (#ope.=10), their average performance degrades but still achieves close results to the best score. It means with the same network structure, our framework is able to incorporate all these different image operators together into a single network without losing much accuracy.

Note that for the image restoration tasks, it is more meaningful not to specify parameters since in real life, users usually do not know the corruption degree of the input image. Therefore, we disable specifying parameters for the four restoration operators in this experiment. Surprisingly, we do not observe much performance degradation with this modification. Though it degrades the necessity of learning continuous parameter settings for image restoration tasks, it still makes a lot of sense by jointly training multiple image operators.

4.6 Comparison with State-of-the-art Image Operators

Note that we do not argue for the best performance in each specific task, since this is not the goal of this paper. Essentially, the performance on image operators is determined by the base network structure, which is not our contribution, but many others [16, 31, 44] which develop more complex and advanced networks on each specific task. Even if this is not our goal, we still provide comparisons to demonstrate that our general framework performs comparably or even better than many previous work (one operator with one parameter).

Regarding image filtering, the best performance is achieved by [16]. For the WLS filter example, with our simple and straightforward base network trained with continuous parameter settings, we achieve very comparable results with [16] (PSNR/SSIM: 41.07/0.991 vs. 41.39/0.994), which are superior to [31] (PSNR /SSIM: 38.29/0.983) and [44] (PSNR/SSIM: 33.92/0.963).

As for image restoration, our framework trained with all four image restoration tasks performs better than DerainNet [19] on the derain task (PSNR:30.32 vs 28.94 on RAIN12 dataset). And our model also achieves better PSNR (26.02) than many previous approaches BM3D [10] (25.62), EPLL [52](25.67), WNNM [21] (25.87) on the BSD68 dataset for the denoising task.

4.7 Understanding and Analysis

To better understand the base network $\mathcal {N}_{base}$ and the weight learning network $\mathcal {N}_{weight}$, we will conduct some analysis experiments in this section.

The Effective Receptive Field. In neuroscience, the receptive field is the particular region of the sensory space in which a stimulus will modify the firing of one specific neuron. The large receptive field is also known to be important for modern convolutional networks. Different strategies are proposed to increase the receptive field, such as deeper network structure or dilated convolution. Though the theoretical receptive field of one network may be very large, the real effective receptive field may vary with different learning targets. So how is the effective receptive field of $\mathcal {N}_{base}$ changed with different parameters $\overrightarrow{\gamma }$ and $\mathcal {I}$ ? Here we use $L_0$ smoothing [43] as the default example operator.

In Fig. 3, we study the effective receptive field of a non-edge point, a moderate edge point, and a strong edge point with different smoothing parameters $\lambda $ respectively. To obtain the effective receptive field for a specific spatial point p, we first feed the input image into the network to get the smoothing result, then propagate the gradients back to the input while masking out the gradient of all points except p. Only the points whose gradient value is large than $0.025*grad_{max}$ ($grad_{max}$ is the maximum gradient value of input gradient) are considered within the receptive field and marked as green in Fig. 3. From Fig. 3, we observe three important phenomena: (1) For a non-edge point, the larger the smoothing parameter $\lambda $ is, the larger the effective field is, and most effective points fall within the object boundary. (2) For a moderate edge point, its receptive field stays small until a relatively large smoothing parameter is used. (3) For a strong edge point, the effective receptive field is always small for all the different smoothing parameters. It means, on one hand, the weight learning network $\mathcal {N}_{weight}$ can dynamically change the receptive field of $\mathcal {N}_{base}$ based on different smoothing parameters. On the other hand, the base network $\mathcal {N}_{base}$ itself can also adaptively change its receptive field for different spatial points.

Decomposition of the Weight Learning Network. To help understand the connection between the base network $\mathcal {N}_{base}$ and the weight learning network $\mathcal {N}_{weight}$, we decompose the parameter vector $\overrightarrow{\gamma }$ and the weight matrix $A_i$ into independent elements $\gamma _1,...,\gamma _m$ and $A_{i0}, ...,A_{im}$ respectively, then:

$$\begin{aligned} \begin{aligned} (A_i\overrightarrow{\gamma } + B_i)\otimes x = \sum _{k=1}^m \gamma _k A_{ik} \otimes x + B_i\otimes x \end{aligned} \end{aligned}$$

(6)

where $\otimes $ denotes convolution operation, and m is the dimension of $\overrightarrow{\gamma }$. In other words, the one convolution layer, whose weights are learned with one single fc layer, is exactly equivalent to a multi-path convolution block as shown in Fig. 6. Learning the weight and bias of the single fc layer is equivalent to learning the common basic convolution kernels $B_i, A_{i1}, A_{i2},...,A_{im}$ in the convolution block.

Visualization of the Learned Convolution Weights. The learned convolution weights can be generally classified into two classes: kernels generated by different parameter values of a single image operator, and kernels generated by different image operators. We analyse both groups of kernels on the model trained on 10 image operators which is introduced in Subsect. 4.5. In this case, the input to the weight learning network takes two parameters, hence the learned convolution weights for a specific layer i in the base network should be,

$$\begin{aligned} {\begin{matrix} W_i {}&= \gamma _{1} A_{i1} + \gamma _{2} A_{i2} + B_i \end{matrix}} \end{aligned}$$

(7)

where $\gamma _{1}$ refers to the input parameter value of a specific operator, and $\gamma _{2}$ indicates the type of the operator, which is defined by ten discrete numbers that range from “0.1” to “1.0” for different operators separately. $A_{i1}$ and $A_{i2}$ are its corresponding weights in the fully connected layer. Therefore, for a single image operator, $\gamma _{2} A_{i2} + B_i$ is a fixed value and the only modification to its different parameter values is $\gamma _{1} A_{i1}$, which scales a high-dimension value. That is to say, each time when one adjusts the operator parameter by $\gamma _{1}$, the learned convolution weights are only shifted to some extent in a fixed high-dimensional direction. Similar analysis also applies to the transformation of different operators.

We visualize the learned convolution kernels via t-SNE in Fig. 5. Each color indicates one image operator, and for each operator, we randomly generate 500 groups of convolution weights with different parameters. As can be seen, the distance of every two adjacent operator is almost the same, it shifts along the x dimension for a fixed distance. For a single filter, while adjusting the parameters continuously, the convolution weights shift along the y dimension. This figure just conforms to our analysis about the convolution weights in the high-dimensional space. It is very surprising that all different kinds of learned convolution weights can be related with a high-dimensional vector, and the transformation between them can be represented by a very simple linear function.

As analyzed in the supplemental material, the solution space of an image processing task could be huge in the form of learned convolution kernels. Two exactly same results may be represented by very different convolution weights. The linear transformation in our proposed weight learning network actually connects all the different image operators and constrains their learned convolution weights in a limited high dimensional space.

5 Conclusion

In this paper, we propose the first decouple learning framework for parameterized image operators, where the weights of the task-oriented base network $\mathcal {N}_{base}$ are decoupled from the network structure and directly learned by another weight learning network $\mathcal {N}_{weight}$. These two networks can be easily end-to-end trained, and $\mathcal {N}_{weight}$ dynamically adjusts the weights of $\mathcal {N}_{base}$ for different parameters $\overrightarrow{\gamma }$ during the runtime. We show that the proposed framework can be applied to different parameterized image operators, such as image smoothing, denoising and super resolution, while obtaining comparable performance as the network trained for one specific parameter configuration. It also has the potential to jointly learn multiple different parameterized image operators within one single network. To better understand the working principle, we also provide some valuable analysis and discussions, which may inspire more promising research in this direction. More theoretical analysis is worthy of further exploration in the future.

References

Andrychowicz, M., et al.: Learning to learn by gradient descent by gradient descent. In: Advances in Neural Information Processing Systems, pp. 3981–3989 (2016)
Google Scholar
Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 2, pp. 60–65. IEEE (2005)
Google Scholar
Chen, D., Liao, J., Yuan, L., Yu, N., Hua, G.: Coherent online video style transfer. In: Proceedings of the International Conference on Computer Vision (ICCV) (2017)
Google Scholar
Chen, D., Yuan, L., Liao, J., Yu, N., Hua, G.: Stylebank: an explicit representation for neural image style transfer. In: Proceedings of the CVPR, vol. 1, p. 4 (2017)
Google Scholar
Chen, D., Yuan, L., Liao, J., Yu, N., Hua, G.: Stereoscopic neural style transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 10 (2018)
Google Scholar
Chen, Q., Xu, J., Koltun, V.: Fast image processing with fully-convolutional networks. In: IEEE International Conference on Computer Vision, vol. 9 (2017)
Google Scholar
Chen, Y., Hoffman, M.W., Colmenarejo, S.G., Denil, M., Lillicrap, T.P., de Freitas, N.: Learning to learn for global optimization of black box functions. In: International Conference on Machine Learning (2017)
Google Scholar
Cheng, B., et al.: Robust emotion recognition from low quality and low bit rate video: a deep learning approach. In: 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 65–70. IEEE (2017)
Google Scholar
Cheng, B., Wei, Y., Shi, H., Feris, R., Xiong, J., Huang, T.: Revisiting RCNN: on awakening the classification power of faster RCNN. In: ECCV (2018)
Chapter Google Scholar
Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process. 16(8), 2080–2095 (2007)
Article MathSciNet Google Scholar
Dai, X., Ng, J.Y.H., Davis, L.S.: FASON: first and second order information fusion network for texture recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 7352–7360 (2017)
Google Scholar
Dai, X., Singh, B., Zhang, G., Davis, L.S., Qiu Chen, Y.: Temporal context network for activity localization in videos. In: The IEEE International Conference on Computer Vision (ICCV) (2017)
Google Scholar
Dong, C., Deng, Y., Change Loy, C., Tang, X.: Compression artifacts reduction by a deep convolutional network. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 576–584 (2015)
Google Scholar
Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 184–199. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10593-2_13
Chapter Google Scholar
Elad, M., Aharon, M.: Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process. 15(12), 3736–3745 (2006)
Article MathSciNet Google Scholar
Fan, Q., Yang, J., Hua, G., Chen, B., Wipf, D.: A generic deep architecture for single image reflection removal and image smoothing. In: Proceedings of the 16th International Conference on Computer Vision (ICCV), pp. 3238–3247 (2017)
Google Scholar
Fan, Q., Yang, J., Hua, G., Chen, B., Wipf, D.: Revisiting deep intrinsic image decompositions (2018)
Google Scholar
Farbman, Z., Fattal, R., Lischinski, D., Szeliski, R.: Edge-preserving decompositions for multi-scale tone and detail manipulation. In: ACM Transactions on Graphics (TOG), vol. 27, p. 67. ACM (2008)
Article Google Scholar
Fu, X., Huang, J., Ding, X., Liao, Y., Paisley, J.: Clearing the skies: a deep network architecture for single-image rain removal. IEEE Trans. Image Process. 26(6), 2944–2956 (2017)
Article MathSciNet Google Scholar
Fu, X., Huang, J., Zeng, D., Huang, Y., Ding, X., Paisley, J.: Removing rain from single images via a deep detail network. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1715–1723 (2017)
Google Scholar
Gu, S., Zhang, L., Zuo, W., Feng, X.: Weighted nuclear norm minimization with application to image denoising. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2862–2869 (2014)
Google Scholar
Ha, D., Dai, A., Le, Q.V.: Hypernetworks. In: ICLR (2018)
Google Scholar
He, M., Chen, D., Liao, J., Sander, P.V., Yuan, L.: Deep exemplar-based colorization. ACM Trans. Graph. 37, 47 (2018). Proceedings of SIGGRAPH 2018
Google Scholar
Karacan, L., Erdem, E., Erdem, A.: Structure-preserving image smoothing via region covariances. ACM Trans. Graph. (TOG) 32(6), 176 (2013)
Article Google Scholar
Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654 (2016)
Google Scholar
Kligvasser, I., Shaham, T.R., Michaeli, T.: xUnit: learning a spatial activation function for efficient image restoration. In: CVPR (2018)
Google Scholar
Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: CVPR, vol. 2, p. 4 (2017)
Google Scholar
Li, D., He, X., Huang, Q., Sun, M.T., Zhang, L.: Generating diverse and accurate visual captions by comparative adversarial learning. arXiv preprint arXiv:1804.00861 (2018)
Li, Y., Dixit, M., Vasconcelos, N.: Deep scene image classification with the MFAFVNet. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5746–5754 (2017)
Google Scholar
Lin, K., Li, D., He, X., Zhang, Z., Sun, M.T.: Adversarial ranking for language generation. In: Advances in Neural Information Processing Systems, pp. 3155–3165 (2017)
Google Scholar
Liu, S., Pan, J., Yang, M.-H.: Learning recursive filters for low-level vision via a hybrid neural network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 560–576. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_34
Chapter Google Scholar
Ma, S., Fu, J., Chen, C.W., Mei, T.: DA-GAN: instance-level image translation by deep attention generative adversarial networks
Google Scholar
Mao, X., Shen, C., Yang, Y.B.: Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In: Advances in Neural Information Processing Systems, pp. 2802–2810 (2016)
Google Scholar
Osher, S., Rudin, L.I.: Feature-oriented image enhancement using shock filters. SIAM J. Numer. Anal. 27(4), 919–940 (1990)
Article Google Scholar
Qi, G.J., Zhang, L., Hu, H., Edraki, M., Wang, J., Hua, X.S.: Global versus localized generative adversarial nets. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Google Scholar
Schmidhuber, J.: Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Comput. 4(1), 131–139 (1992)
Article Google Scholar
Sun, J., Xu, Z., Shum, H.Y.: Image super-resolution using gradient profile prior. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, pp. 1–8. IEEE (2008)
Google Scholar
Tai, Y., Yang, J., Liu, X., Xu, C.: MemNet: A persistent memory network for image restoration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4539–4547 (2017)
Google Scholar
Tipping, M.E., Bishop, C.M.: Bayesian image super-resolution. In: Advances in Neural Information Processing Systems, pp. 1303–1310 (2003)
Google Scholar
Ulyanov, D., Vedaldi, A., Lempitsky, V.: Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis. In: Proceedings of the CVPR (2017)
Google Scholar
Wichrowska, O., et al.: Learned optimizers that scale and generalize. In: International Conference on Machine Learning (2017)
Google Scholar
Xu, L., Lu, C., Xu, Y., Jia, J.: Image smoothing via L 0 gradient minimization. In: ACM Transactions on Graphics (TOG), vol. 30, p. 174. ACM (2011)
Google Scholar
Xu, L., Lu, C., Xu, Y., Jia, J.: Image smoothing via l0 gradient minimization. ACM Trans. Graph. 30, 174 (2011). SIGGRAPH Asia
Google Scholar
Xu, L., Ren, J., Yan, Q., Liao, R., Jia, J.: Deep edge-aware filters. In: International Conference on Machine Learning, pp. 1669–1678 (2015)
Google Scholar
Xu, L., Yan, Q., Xia, Y., Jia, J.: Structure extraction from texture via relative total variation. ACM Trans. Graph. (TOG) 31(6), 139 (2012)
Google Scholar
Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE Trans. Image Process. 19(11), 2861–2873 (2010)
Article MathSciNet Google Scholar
Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016)
Google Scholar
Zhang, D., Dai, X., Wang, X., Wang, Y.F.: S3D: Single shot multi-span detector via fully 3D convolutional network. In: British Machine Vision Conference (BMVC) (2018)
Google Scholar
Zhang, H., Patel, V.M.: Density-aware single image de-raining using a multi-stream dense network. In: CVPR (2018)
Google Scholar
Zhang, Q., Shen, X., Xu, L., Jia, J.: Rolling guidance filter. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8691, pp. 815–830. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10578-9_53
Chapter Google Scholar
Zhang, Q., Xu, L., Jia, J.: 100+ times faster weighted median filter (WMF). In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2830–2837 (2014)
Google Scholar
Zoran, D., Weiss, Y.: From learning models of natural image patches to whole image restoration. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 479–486. IEEE (2011)
Google Scholar

Download references

Acknowledgement

This work was supported in part by: National 973 Program (2015CB352501), NSFC-ISF (61561146397), the Natural Science Foundation of China under Grant U1636201 and 61629301.

Author information

Authors and Affiliations

Shandong University, Jinan, China
Qingnan Fan & Baoquan Chen
University of Science and Technology of China, Hefei, China
Dongdong Chen & Nenghai Yu
Beijing Film Academy, Beijing, China
Qingnan Fan
Microsoft Research, Beijing, China
Lu Yuan & Gang Hua
Peking University, Beijing, China
Baoquan Chen

Authors

Qingnan Fan
View author publications
You can also search for this author in PubMed Google Scholar
Dongdong Chen
View author publications
You can also search for this author in PubMed Google Scholar
Lu Yuan
View author publications
You can also search for this author in PubMed Google Scholar
Gang Hua
View author publications
You can also search for this author in PubMed Google Scholar
Nenghai Yu
View author publications
You can also search for this author in PubMed Google Scholar
Baoquan Chen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Baoquan Chen .

Editor information

Editors and Affiliations

Google Research, Zurich, Switzerland
Vittorio Ferrari
Carnegie Mellon University, Pittsburgh, PA, USA
Martial Hebert
Google Research, Zurich, Switzerland
Cristian Sminchisescu
Hebrew University of Jerusalem, Jerusalem, Israel
Yair Weiss

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 121 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Fan, Q., Chen, D., Yuan, L., Hua, G., Yu, N., Chen, B. (2018). Decouple Learning for Parameterized Image Operators. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds) Computer Vision – ECCV 2018. ECCV 2018. Lecture Notes in Computer Science(), vol 11217. Springer, Cham. https://doi.org/10.1007/978-3-030-01261-8_27

Download citation

DOI: https://doi.org/10.1007/978-3-030-01261-8_27
Published: 06 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01260-1
Online ISBN: 978-3-030-01261-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Decouple Learning for Parameterized Image Operators

Abstract

Similar content being viewed by others

A Multiscale Image Denoising Algorithm Based on Dilated Residual Convolution Network

Fine-grained scale space learning for single image super-resolution

Learning a Deep Convolutional Network for Image Super-Resolution

1 Introduction

2 Related Work