1 Introduction

Noise reduction is one of the most important problems to solve in the design of an imaging pipeline. The most straightforward solution is to collect as much light as possible when taking a photograph. This can be addressed in camera hardware through the use of a large-aperture lens, sensors with large photosites, and high-quality A/D conversion. However, relative to larger standalone cameras, e.g. a DSLR, modern smartphone cameras have compromised on each of these hardware elements. This makes noise much more of a problem in smartphone capture.

Fig. 1. Denoising on a real raw burst from [19]. Our method is able to perform high levels of denoising on low-light bursts while maintaining details.

Another way to collect more light is to use a longer shutter time, allowing each photosite on the sensor to integrate light over a longer period of time. This is commonly done by placing the camera on a tripod. The tripod is necessary because any motion of the camera will cause the collected light to blur across multiple photosites. This technique is limited, though. First, any moving objects in the scene and residual camera motion will cause blur in the resulting photo. Second, the shutter time can only be set for as long as the brightest objects in the scene do not saturate the electron-collecting capacity of a photosite. This means that for high dynamic range scenes, the darkest regions of the image may still exhibit significant noise while the brightest ones might saturate.

In our method we also collect light over a longer period of time, by capturing a burst of photos. Burst photography addresses many of the issues above: (a) it is available on inexpensive hardware, (b) it can capture moving subjects, and (c) it is less likely to suffer from blown-out highlights. In using a burst, we make the design choice of leveraging a computational process to integrate light instead of a hardware process, such as in [19, 29]. In other words, we turn to computational photography.

Our computational process runs in several steps. First, the burst is stabilized by finding a homography for each frame that geometrically registers it to a common reference. Second, we employ a fully convolutional deep neural network (CNN) to denoise each frame individually. Third, we extend the CNN with a parallel recurrent network that integrates the information of all frames in the burst.

The paper presents our work as follows. In Sect. 2 we review previous single-frame and multi-frame denoising techniques. We also look at super-resolution, which can leverage multi-frame information. In Sect. 3 we describe our recurrent network in detail and discuss training. In order to compare against previous work, the network is trained on simulated Gaussian noise. We also show that our solution works well when trained on Poisson-distributed noise, which is typical of a real-world imaging pipeline [18]. In Sect. 4, we show a significant increase in reconstruction quality on burst sequences in comparison to state of the art single-frame denoising, and performance on par with or better than recent state of the art multi-frame denoising methods. In addition, we demonstrate that burst capture coupled with our recurrent network architecture generalizes well to super-resolution.

In summary our main contributions are:

  • We introduce a recurrent architecture which is a simple yet effective extension to single-frame denoising models,

  • Demonstrate that bursts provide a large improvement over the best deep learning based single-frame denoising techniques,

  • Show that our model achieves performance on par with or better than recent state of the art multi-frame denoising methods, and

  • Demonstrate that our recurrent architecture generalizes well by applying it to super-resolution.

2 Related Work

This work addresses a variety of inverse problems, all of which can be formulated as consisting of (1) a target “restored” image, (2) a temporally-ordered set or “burst” of images, each of which is a corrupted observation of the target image, and (3) a function mapping the burst of images to the restored target. Such tasks include denoising and super-resolution. Our goal is to craft this function, either through domain knowledge or through a data-driven approach, to solve these multi-image restoration problems.

Denoising

Data-driven single-image denoising research dates back to work that leverages block-level statistics within a single image. One of the earliest works of this nature is Non-Local Means [3], a method for taking a weighted average of blocks within an image based on similarity to a reference block. Dabov, et al.  [9] extend this concept of block-level filtering with a novel 3D filtering formulation. This algorithm, BM3D, remains the de facto baseline against which single-image methods are compared today.

Learning-based methods have proliferated in the last few years. These methods often make use of neural networks that are purely feed-forward [1, 4, 15, 25, 43, 48, 49], recurrent [44], or a hybrid of the two [7]. Methods such as Field of Experts [38] have been shown to be successful in modeling natural image statistics for tasks such as denoising and inpainting with contrastive divergence. Moreover, related tasks such as demosaicing and denoising have been shown to benefit from joint formulations when posed in a learning framework [15]. The recent work of [5] applied a recurrent architecture in the context of denoising ray-traced sequences, and [6] used a simple fully connected RNN for video denoising which, while failing to beat VBM4D [32, 33], proved the feasibility of using RNNs for video denoising.

Multi-image variants of denoising methods exist and often focus on the best ways to align and combine images. Tico [40] returns to a block-based paradigm, but this time, blocks “within” and “across” images in a burst can be used to produce a denoised estimate. VBM3D [8] and VBM4D  [32, 33] provide extensions on top of the existing BM3D framework. Liu, et al.  [29] showed how similar denoising performance in terms of PSNR could be obtained in one tenth the time of VBM3D and one one-hundredth the time of VBM4D using a novel “homography flow” alignment scheme along with a “consistent pixel” compositing operator. Systems such as FlexISP [22] and ProxImaL [21] offer end-to-end formulations of the entire image processing pipeline, including demosaicing, alignment, deblurring, etc., which can be solved jointly through efficient optimization.

We in turn also make use of a deep model and base our CNN architecture on current state of the art single-frame methods [27, 36, 48].

Super-Resolution

Super-resolution is the task of taking one or more images of a fixed resolution as input and producing a fused or hallucinated image of higher resolution as output.

Nasrollahi, et al.  [35] offers a comprehensive survey of single-image super-resolution methods and Yang, et al.  [45] offers a benchmark and evaluation of several methods. Glasner, et al.  [16] show that single images can be super-resolved without any need of an external database or prior by exploiting block-level statistics “within” the single image. Other methods make use of sparse image statistics [46]. Borman, et al. offers a survey of multi-image methods [2]. Farsiu, et al.  [13] offers a fast and robust method for solving the multi-image super-resolution problem. More recently convolutional networks have shown very good results in single image super-resolution with the works of Dong et al.  [11] and the state of the art Ledig et al.  [27].

Our single-frame architecture takes inspiration from recent deep super-resolution models such as [27].

2.1 Neural Architectures

It is worth noting that while image restoration approaches have often been learning-based in recent years, there is also great diversity in how those learning problems are modeled. In particular, neural network-based approaches have experienced a gradual progression in architectural sophistication over time.

In the work of Dong, et al.  [10], a single feed-forward CNN is used to super-resolve an input image. This is a natural design, as it leveraged what were then new advances in discriminatively trained neural networks designed for classification and applied them to a regression task. The next step in architectural evolution was to use Recurrent Neural Networks (RNNs) in place of the convolutional layers of the previous design. One or more RNNs can be used either to increase the effective depth, and thus the receptive field, of a single-image network [44], or to integrate observations across many frames in a multi-image network. Our work makes use of this latter principle.

While the introduction of RNNs led to network architectures with more effective depth and thus a larger receptive field with more context, the success of skip connections in classification networks [20] and segmentation networks [37, 39] motivated their use in restoration networks. The work of Remez, et al.  [36] illustrates this principle by computing additive noise predictions from each level of the network, which then sum to form the final noise prediction.

We also make use of this concept, but rather than use skip connections directly, we extract activations from each level of our network which are then fed into corresponding RNNs for integration across all frames of a burst sequence.

3 Method

In this section we first identify a number of interesting goals we would like a multi-frame architecture to meet and then describe our method and how it achieves such goals.

3.1 Goals

Our goal is to derive a method which, given a sequence of noisy images, produces a denoised sequence. We identified several desirable properties that a multi-frame denoising technique should satisfy:

  1. Work for single-frame denoising. A corollary to the second criterion below is that our method should be competitive in the single-frame case.

  2. Generalize to any number of frames. A single model should produce competitive results for any number of frames that it is given.

  3. Denoise the entire sequence. Rather than simply denoise a single reference frame, as is the goal in most prior work, we aim to denoise the entire sequence, putting our goal closer to video denoising.

  4. Be robust to motion. Most real-world burst capture scenarios will exhibit both camera and scene motion.

  5. Be temporally coherent. Denoising the entire sequence requires that we do not introduce flickering in the result.

  6. Generalize to a variety of image restoration tasks. As discussed in Sect. 2, tasks such as super-resolution can benefit from multi-frame methods, albeit trained on different data.

In the remainder of this section we will first describe a single-frame denoising model that produces competitive results with current state of the art models. Then we will discuss how we extend this model to accommodate an arbitrary number of frames for multi-frame denoising and how it meets each of our goals.

3.2 Single Frame Denoising

We treat image denoising as a structured prediction problem, where the network is tasked with regressing a pixel-aligned denoised image \(\tilde{I_s} = f_s(N, \theta _s)\) from noisy image N, given the model parameters \(\theta _s\). Following [50] we train the network by minimizing the L1 distance between the predicted output and the ground-truth target image, I.

$$\begin{aligned} E_{\mathrm {SFD}} = | I - f_s(N, \theta _s) | \end{aligned}$$
(1)

To be competitive in the single-frame denoising scenario, and to meet our 1st goal, we take inspiration from the state of the art to derive an initial network architecture. Several existing architectures [27, 36, 48] consist of the same base design: a fully convolutional architecture consisting of L layers with C channels each.

We follow suit and choose this simple architecture to be our single frame denoising (SFD) baseline, with \(L=8\), \(C=64\), \(3\times 3\) convolutions and ReLU [31] activation functions, except on the last layer.
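For concreteness, a minimal sketch of this baseline is shown below. The paper's implementation uses Caffe2; this PyTorch version, including the padding choice and the fact that the network regresses the image directly rather than a residual, is an illustrative assumption.

```python
# Illustrative sketch of the SFD baseline: L = 8 layers, C = 64 channels,
# 3x3 convolutions, ReLU on every layer except the last.
import torch
import torch.nn as nn

class SFD(nn.Module):
    def __init__(self, num_layers=8, channels=64, image_channels=3):
        super().__init__()
        layers = [nn.Conv2d(image_channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(num_layers - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(channels, image_channels, 3, padding=1))  # no activation on the last layer
        self.net = nn.Sequential(*layers)

    def forward(self, noisy):
        return self.net(noisy)

# Training with the L1 objective of Eq. (1), e.g.:
#   loss = torch.abs(clean - model(noisy)).mean()
```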

3.3 Multi-frame Denoising

Following goals 1-3, our model should be competitive in the single-frame case while being able to denoise the entire input sequence. In other words, using a set of noisy images as input, forming the sequence \(\{N^t\}\), we want to regress a denoised version of each noisy frame, \(\tilde{I^t_m} = f^t_m(\{N^t\}, \theta _m)\), given the model parameters \(\theta _m\). Formally, our complete training objective is:

$$\begin{aligned} \begin{aligned} E&= \sum _t^F E_{\mathrm {SFD}}^t + E_{\mathrm {MFD}}^t\\&= \sum _t^F | I^t - f_s(N^t, \theta _s) | + | I^t - f^t_m(\{N^t\}, \theta _m) | \end{aligned} \end{aligned}$$
(2)
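The objective of Eq. (2) can be sketched as follows, assuming per-frame predictions from both tracks are available; variable names are illustrative and not taken from the paper's code.

```python
import torch

# Hedged sketch of Eq. (2): L1 terms from the single-frame (SFD) and
# multi-frame (MFD) outputs, summed over all F frames of the burst.
# Averaging (rather than summing) over pixels is an assumption.
def burst_loss(clean_frames, sfd_outputs, mfd_outputs):
    loss = 0.0
    for clean, pred_s, pred_m in zip(clean_frames, sfd_outputs, mfd_outputs):
        loss = loss + torch.abs(clean - pred_s).mean() + torch.abs(clean - pred_m).mean()
    return loss
```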
Fig. 2. Global recurrent architecture (left). Our model takes as input F noisy frames \(N^t\) and predicts F clean frames \(\tilde{I^t_m}\). Local recurrent architecture (right). The top part of our model is a single-frame denoiser (SFD, in light blue): it takes as input a noisy image \(N^t\) and regresses a clean image \(\tilde{I^t_s}\); its features \(S_i^t\) are fed to the multi-frame denoiser (MFD, in darker blue), which also makes use of recurrent connections from the previous state (dotted lines) to output a clean image \(\tilde{I^t_m}\).

A natural approach, already popular in the natural language and audio processing literature [47], is to process temporal data with recurrent neural network (RNN) modules [23]. RNNs operate on sequences and maintain an internal state which is combined with the input at each time step. As can be seen in Fig. 2, our model makes use of recurrent connections to aggregate activations produced by our SFD network for each frame. This satisfies our second goal, as it allows for an arbitrary input sequence length.

Unlike [5, 42], which use a single-track network design, we use a two-track network architecture, with the top track dedicated to SFD and the bottom track dedicated to fusing those results into a final prediction for MFD. This two-track design decouples per-frame feature extraction from multi-frame aggregation, enabling a network to be pre-trained rapidly using only single-frame data. In practice, we found that this pre-training not only accelerates the learning process, but also produces significantly better results in terms of PSNR than training the entire MFD from scratch. The core intuition is that by first learning good features for SFD, we put the network in a good state for learning how to aggregate those features across observations.

It is also important to note that the RNNs are connected in such a way as to permit the aggregation of features in several different ways. Temporal connections within the RNNs help aggregate information “across” frames, but lateral connections “within” the MFD track permit the aggregation of information at different physical scales and at different levels of abstraction.
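One possible realization of this connectivity is sketched below: each level of the MFD track is a simple convolutional recurrent cell that combines the SFD features \(S_i^t\) at that level (fed in from the top track), the output of the cell below it (lateral connection), and its own state from the previous frame (temporal connection). The cell design and the way the three inputs are combined are our assumptions; the paper specifies only the overall connectivity.

```python
import torch
import torch.nn as nn

class ConvRecurrentCell(nn.Module):
    """One MFD level: fuses SFD features, the lateral input from the level
    below, and this level's state from the previous frame."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(3 * channels, channels, 3, padding=1)

    def forward(self, sfd_feat, lateral, prev_state):
        return torch.relu(self.conv(torch.cat([sfd_feat, lateral, prev_state], dim=1)))

class MFDTrack(nn.Module):
    def __init__(self, num_levels=8, channels=64, image_channels=3):
        super().__init__()
        self.cells = nn.ModuleList([ConvRecurrentCell(channels) for _ in range(num_levels)])
        self.to_image = nn.Conv2d(channels, image_channels, 3, padding=1)

    def forward(self, sfd_feats, prev_states):
        # sfd_feats: per-level SFD activations S_i^t for the current frame
        # prev_states: per-level MFD states from the previous frame (zeros for t = 0)
        lateral = torch.zeros_like(sfd_feats[0])
        new_states = []
        for cell, feat, prev in zip(self.cells, sfd_feats, prev_states):
            lateral = cell(feat, lateral, prev)   # temporal + lateral aggregation
            new_states.append(lateral)
        return self.to_image(lateral), new_states
```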

4 Implementation and Results

We evaluate our method with the goals from Sect. 3 in mind, and examine: single-image denoising (goal 1), multi-frame denoising (goals 2–5), and multi-frame super-resolution (goal 6). In Sect. 4.5 we compare different single-frame denoising approaches, showing that quality is plateauing despite the use of deep models and that our simple single-frame denoiser is competitive with the state of the art. In Sect. 4.6 we show that our method significantly outperforms the reference state of the art video denoising method VBM4D [32]. Finally, in Sect. 4.7 we compare our method to the state of the art burst denoising methods HDR+ [19], FlexISP [22] and ProxImaL [21] on the FlexISP dataset.

4.1 Data

We trained all the networks in our evaluation using a dataset consisting of Apple Live Photos. Live Photos are burst sequences captured by the Apple iPhone 6S and above. This dataset is very representative as it captures what mobile phone users often photograph, and exhibits a wide range of scenes and motions. Approximately 73k public sequences with a resolution of \(360\times 480\) were scraped from a social media website. We apply a burst stabilizer to each sequence, resulting in approximately 54.5k successfully stabilized sequences. In Sect. 4.2 we describe our stabilization procedure in more detail. 50k sequences were used for training, with an additional 3.5k reserved for validation and 1k reserved for testing.

4.2 Stabilization

We implemented burst sequence stabilization using OpenCV. In particular, we use a Lucas-Kanade tracker [30] to find correspondences between successive frames, and then a rotation-only motion model with a static focal length guess to arrive at a homography for each frame. We warp all frames of a sequence back into a reference frame’s pose, then crop and scale the sequence to maintain the original size and aspect ratio, but with the region of interest contained entirely within the valid regions of the warp. The stabilized sequences still exhibit some residual motion, either from moving objects or people, or from camera motion which cannot be represented by a homography. This residual motion forces the network to adapt to non-static scenes. Stabilization and training on any residual motion make our system robust to motion, achieving our 4th goal. As we show in the supplementary material, stabilization improves the final results, but is not a requirement.
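A simplified stabilization sketch using OpenCV is shown below. Note that it estimates a full homography with RANSAC rather than the rotation-only model with a fixed focal length guess described above, and omits the final crop-and-scale step.

```python
import cv2
import numpy as np

def stabilize(frames):
    """frames: list of uint8 BGR images; returns frames warped to the first frame's pose."""
    ref_gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    h, w = ref_gray.shape
    warped = [frames[0]]
    H_to_ref = np.eye(3)
    prev_gray = ref_gray
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01, minDistance=8)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        good = status.ravel() == 1
        # Homography mapping the current frame back to the previous one
        H_step, _ = cv2.findHomography(nxt[good], pts[good], cv2.RANSAC, 3.0)
        H_to_ref = H_to_ref @ H_step          # chain to obtain current -> reference
        warped.append(cv2.warpPerspective(frame, H_to_ref, (w, h)))
        prev_gray = gray
    return warped
```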

4.3 Training Details

We implemented the neural network from Sect. 3 using the Caffe2 framework. Each model was trained using 4 Tesla M40 GPUs. As described in Sect. 3, training took place in two stages. First, a single-frame model was trained. This model used a batch size of 128 and was trained for 500 epochs in approximately 5 hours. Using this single-frame model as initialization for the multi-frame (8-frame) model, we continued training with a batch size of 32 to accommodate the increased size of the multi-frame model. This second stage was trained for 125 epochs in approximately 20 hours.

We used Adam [26] with a learning rate of \(10^{-4}\) which decays to zero following a square root law. We trained on \(64\times 64\) crops with random flips. Finally, we train the multi-frame model using back-propagation through time [41].
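The exact form of the square-root decay is not spelled out here; the snippet below shows one plausible reading (learning rate proportional to the square root of the remaining fraction of training) and is an assumption rather than the schedule actually used.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the SFD/MFD model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

total_steps = 100_000  # illustrative; not the paper's value
# Learning rate decays from 1e-4 to zero following a square-root law (assumed form).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps) ** 0.5)
```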

4.4 Noise Modelling

In order to make comparison possible with previous methods, such as VBM4D, we first evaluate our architecture using additive white Gaussian noise with \(\sigma =15, 25, 50\) and 75. Additionally, to train a denoiser for real burst sequences, we implement a simulated camera processing pipeline. First, real-world sensor noise is generated following [14]. Separate models are trained using Poisson noise, labelled a in [14], with intensity ranging from 0.001 to 0.01. We simulate a Bayer mosaic on a linearized version of our training data and add the Poisson noise to this. Next, we reconstruct an RGB image using bilinear interpolation followed by conversion to sRGB and clipping. In both the Gaussian and Poisson cases we add synthetic noise before stabilization. While it is possible to obtain a single “blind” model by training on multiple noise levels at once [49], doing so typically results in a small loss in accuracy. We thus follow the protocol established by [36, 48] and train a separate model for each noise level, without loss of generality.
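A hedged sketch of this simulated pipeline is given below. The gamma-only linearization, the RGGB Bayer layout, and the use of OpenCV's (bilinear) Bayer demosaicing are our assumptions; the text above only specifies the overall sequence of steps.

```python
import cv2
import numpy as np

def simulate_noisy(rgb, a=0.005, gamma=2.2):
    """rgb: float image in [0, 1] with even height and width."""
    lin = np.clip(rgb, 0.0, 1.0) ** gamma                 # approximate linearization
    h, w, _ = lin.shape
    bayer = np.empty((h, w), np.float64)
    bayer[0::2, 0::2] = lin[0::2, 0::2, 0]                # R
    bayer[0::2, 1::2] = lin[0::2, 1::2, 1]                # G
    bayer[1::2, 0::2] = lin[1::2, 0::2, 1]                # G
    bayer[1::2, 1::2] = lin[1::2, 1::2, 2]                # B
    noisy = np.random.poisson(bayer / a) * a              # signal-dependent Poisson noise of intensity a
    mosaic16 = (np.clip(noisy, 0.0, 1.0) * 65535).astype(np.uint16)
    demosaiced = cv2.cvtColor(mosaic16, cv2.COLOR_BayerBG2RGB) / 65535.0
    return np.clip(demosaiced, 0.0, 1.0) ** (1.0 / gamma)  # back to (approximate) sRGB, clipped
```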

Table 1. Single frame additive white Gaussian noise denoising comparison on BSD68 (PSNR). Our baseline SFD models match BM3D at 8 layers and get close to both DnCNN and DenoiseNet at 20 layers.

4.5 Single Frame Denoising

Here, we compare our baseline single-frame denoiser with current state of the art methods on additive white Gaussian noise. This shows that single-frame denoising has reached a point of diminishing returns, where significant model complexity is needed to improve quality by more than \({\sim }0.2\,\mathrm{dB}\).

Fig. 3. Effect of pre-training on multi-frame denoising with Gaussian noise \(\sigma =50\). Each color corresponds to the average PSNR of the frames in a sequence: \(1^{st}\) (red), \(2^{nd}\) (blue), \(3^{rd}\) (purple), \(4^{th}\) (grey), \(5^{th}\) (yellow) and \(6^{th}\) (pink). As we can see, the pre-trained model shows a constant lead of 0.5 dB over the model trained from scratch, and reaches a stable state much more quickly.

We compare our own SFD, which is composed of 8 layers, with two 20-layer networks: DenoiseNet (2017) [36] and DnCNN (2017) [48]. For the sake of comparison, we also include a 20-layer version of our SFD. All models were trained for 2000 epochs on 8000 images from PASCAL VOC2010 [12], using the training split from [36]. We also compare with traditional approaches, such as BM3D (2009) [9] and TNRD (2015) [7].

All models were tested on BSD68 [38], a set of 68 natural images from the Berkeley Segmentation Dataset [34]. In Table 1, we can see diminishing returns in single-frame denoising PSNR over the years despite the use of deep neural networks, which confirms what Levin, et al.  describe in [28]. We can see that our simpler 20-layer SFD model underperforms both DenoiseNet and DnCNN by only \({\sim }0.2\,\mathrm{dB}\). However, as we show in the following section, the PSNR gains brought by multi-frame processing vastly outshine fractional single-frame PSNR improvements.

4.6 Burst Denoising

We evaluate our method on a held-out test set of Live Photos with synthetic additive white Gaussian noise added. In Table 3, we compare our architecture with single frame models as well as the multi-frame method VBM4D [32, 33]. We show qualitative results with \(\sigma =50\) in Fig. 5.

Table 2. Ablation study on the Live Photos test sequences with additive white Gaussian noise of \(\sigma = 50\). All models were trained on 8-frame sequences. C2F, C4F and C8F represent Concat models which were trained on 2, 4, and 8 concatenated frames as input, respectively. Ours nostab was trained and tested on the unstabilized sequences.

Ablation Study. We now evaluate our architecture choices, where we compare our full model, with 8 layers and trained on sequences of 8 frames with other variants.

Concat. We first compare our method with a naive multi-frame denoising approach, dubbed Concat, in which n frames are concatenated and fed as a single input to a feed-forward denoiser. We evaluated this architecture with \(L = 20\) as well as \(n = 2, 4\) and 8. As we can see in Table 2, this model performs significantly worse than ours.

Number of Layers. We also evaluate the impact of the depth of the network by experimenting with \(L = 4, 8, 12, 16\) and 20. As can be seen in Fig. 2, the 16- and 20-layer networks fail to surpass the 8- and 12-layer ones after 125 epochs of training, likely because training becomes unstable with increased depth and parameter count [20]. While the 12-layer network shows a marginal 0.18 dB increase over the 8-layer model, we decided to go with the latter, as we did not think that the modest increase in PSNR was worth the \(50\%\) increase in both memory and computation time.

Fig. 4. (a) Impact of the length F of the training sequences at test time. We test 3 models, trained with \(F = 2, 4\) and 8, on 16-frame test sequences. (b) Effect of frame ordering at test time. We can see the burn-in period on the first pass (red) as well as on the repeat pass. Feeding the sequence forward, then backward, mostly alleviates this problem.

Length of Training Sequences. Perhaps the most surprising result we encountered while training our recurrent model was the importance of the number of frames in the training sequences. In Fig. 4a, we show that models trained on sequences of both 2 and 4 frames fail to generalize beyond their training sequence length. Only models trained with 8 frames were able to generalize to longer sequences at test time, and, as we can see, they continue to denoise beyond 8 frames.

Pre-training. One of the main advantages of using a two-track network is that we can first train the SFD track independently. As just mentioned, a sequence length of 8 is required to ensure generalization to longer sequences, which makes the training of the full model much slower than training the single-frame pass. As we show in Fig. 3, pre-training makes training the MFD significantly faster.

Frame Ordering. Due to its recurrent nature, our network exhibits a period of burn-in, where the first frames are denoised to a lesser extent than the later ones. In order to denoise an entire sequence to a high quality level, we explored different options for frame ordering. As we show in Fig. 4b, by feeding the sequence twice to the network, we are able to achieve comparable denoising quality on all frames, thus obtaining a higher average PSNR. We propose two variants: either repeat the sequence in the same order, or reverse it the second time (named forward-backward). As we show in Fig. 4b, the forward-backward schedule does not suffer from burn-in and remains temporally coherent, meeting our 5th goal. We use forward-backward for all our experiments.
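A minimal sketch of the forward-backward schedule is shown below; the model call signature is hypothetical, and keeping the outputs of the second (backward) pass is our reading of the procedure.

```python
# Hedged sketch of the forward-backward schedule: the burst is fed once in
# order to warm up the recurrent state, then again in reverse; the outputs of
# the second pass are kept, so every frame benefits from a warmed-up state.
def denoise_forward_backward(model, frames, init_states):
    states = init_states
    for frame in frames:                 # forward pass: burn-in only
        _, states = model(frame, states)
    outputs = []
    for frame in reversed(frames):       # backward pass: keep these outputs
        denoised, states = model(frame, states)
        outputs.append(denoised)
    return list(reversed(outputs))       # restore the original frame order
```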

Fig. 5. Multi-frame Gaussian denoising on stabilized Live Photo test data with \(\sigma = 50\). We can see that our MFD produces a significantly sharper image than both our SFD and VBM4D.

Fig. 6. Denoising results on two real and two synthetic bursts on the FlexISP dataset [22]. From top to bottom: darkpaintcans, livingroom, flickr doll and kodak fence. Our recurrent model is able to match the quality of FlexISP on flickr doll and beats it by 0.5 dB on kodak fence.

Table 3. Multi-frame denoising comparison on the Live Photos sequences (left) and the FlexISP images (right). Left: average PSNR over all frames on 1000 16-frame test sequences with additive white Gaussian noise. Best results are in bold. The thick line separates single-frame methods from multi-frame ones.
Fig. 7. Denoising results on two real bursts on the HDR+ dataset [19]. Our method produces a high level of denoising while keeping sharp details and maintaining information in highlights.

4.7 Existing Datasets

Here we evaluate our method on existing datasets, demonstrating generalization and allowing us to compare with other state of the art denoising approaches. In Figs. 1 and 7 we demonstrate that our method is capable of denoising real sequences. This evaluation was performed on real noisy bursts from HDR+ [19]. Please see our supplementary material for more results.

In Fig. 6 we show the results of our method on the FlexISP dataset, comparing it with FlexISP, HDR+ and ProxImaL. The dataset consists of 4 noisy sequences: 2 synthetic (flickr doll and kodak fence) and 2 real ones (darkpaintcans and livingroom). The synthetic sequences were generated by randomly warping the input images and introducing, for flickr doll, additive and multiplicative white Gaussian noise with \(\sigma = 25.5\), and, for kodak fence, additive Gaussian noise with \(\sigma = 12\) while simulating a Bayer filter. We trained a model for each synthetic scene by replicating the corresponding noise conditions on our Live Photos dataset. To match the evaluation of previous work, we used only the first 8 frames from each sequence for denoising.

Table 3 shows that our method matches FlexISP on flickr doll and achieves a significant advantage of 0.5 dB over FlexISP on kodak fence. Interestingly, our method reaches a higher PSNR than FlexISP despite showing some slight demosaicing artifacts on the fence (see Fig. 6). This is likely due to the absence of high-frequency demosaicing artifacts in our training data and would probably be fixed by generating training data following the same protocol as the test data.

Unfortunately, it is difficult to compare thoroughly with ProxImaL, as the public implementation does not include code for their experiments. We attempted to reimplement burst denoising using their publicly available framework, but were unable to produce stable results. As ProxImaL only shows denoising results on flickr doll, this limits us to a less comprehensive comparison on only one scene, where our method falls short.

Like HDR+, we do not report quantitative results on the real scenes (darkpaintcans and livingroom), as we were unable to correct for a color shift between the ground truth long exposure images and the noisy bursts. However, Fig. 6 shows that our method is able to recover a lot of details while removing the noise on these bursts.

Fig. 8. Multi-frame \(4{\times }\) super-resolution on stabilized Live Photo test data. While our single frame model achieves a good upsampling, the increase in sharpness from our multi-frame approach brings a significant quality improvement.

4.8 Super Resolution

To illustrate that our approach generalizes to tasks beyond denoising, and to meet our 6th goal, we trained our model to perform \(4{\times }\) super-resolution, while keeping the rest of the training procedure identical to that of the denoising pipeline. That is, instead of using a burst of noisy images as input, we provide our network with a burst of low-resolution images and task it with producing a sharp high-resolution output. To keep the architecture the same, we do not feed smaller images to the network, but instead remove high-frequency details by first downsampling each input patch \(4{\times }\) and then resizing it back to its original size with bilinear interpolation. Figure 8 shows how our multi-frame model is able to recover high-frequency details, such as the crisp contours of the lion and the railing on top of the pillar.
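A small sketch of this input degradation, assuming bilinear interpolation for both the downsampling and upsampling steps (the downsampling mode is not specified in the text):

```python
import torch.nn.functional as F

def degrade_for_sr(patch, factor=4):
    """Remove high-frequency detail while keeping the original resolution."""
    low = F.interpolate(patch, scale_factor=1.0 / factor,
                        mode='bilinear', align_corners=False)
    return F.interpolate(low, size=patch.shape[-2:],
                         mode='bilinear', align_corners=False)
```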

5 Limitations

Our single-frame architecture, based on [27, 36, 48], makes use of full-resolution convolutions. These are, however, both memory- and compute-intensive, and have a small receptive field for a given network depth. Using multiscale architectures, such as U-Nets [37], could help alleviate both issues by reducing the computational and memory load while increasing the receptive field. While not strictly necessary, we trained our network on pre-stabilized sequences; we observed a drop in accuracy on unstabilized sequences, as can be seen in Table 2, as well as instability on longer sequences. It would be interesting to train the network to stabilize the sequence by warping inside the network, such as in [17, 24]. Finally, the low resolution of our training data prevents the model from recovering high-frequency details; a higher-resolution dataset would likely mitigate this issue.

6 Conclusion

We have presented a novel deep neural architecture to process bursts of images. We improve on a simple single-frame architecture by making use of recurrent connections, and show that while single-frame models are reaching performance limits, our recurrent architecture vastly outperforms such models on multi-frame data. We carefully designed our method to align with the goals stated in Sect. 3.1. As a result, our approach achieves state-of-the-art performance on our Live Photos dataset, and matches or beats existing multi-frame denoisers on challenging existing real-noise datasets.