
Context-Enhanced Representation Learning for Single Image Deraining


Abstract

Perceiving content and structure in images degraded by rainstreaks or raindrops is challenging, and robust deraining algorithms are often required to remove the diversified rainy effects. Much progress has been made in the design of advanced encoder–decoder single image deraining networks. However, most existing networks are built in a blind manner and often produce over- or under-deraining artefacts. In this paper, we point out, for the first time, that these unsatisfactory results are caused by the highly imbalanced distribution between rainy effects and varied background scenes. Ignoring this phenomenon causes the representation learned by the encoder to be biased towards rainy regions while paying less attention to the valuable contextual regions. To resolve this, a context-enhanced representation learning and deraining network is proposed with a novel two-branch encoder design. Specifically, one branch takes the rainy image directly as input to learn a mixed representation depicting the variation of both rainy and contextual regions, while the other branch is guided by a carefully learned soft attention mask to learn an embedding depicting only the contextual regions. By combining the embeddings from these two branches with a carefully designed co-occurrence modelling module, and then improving the semantic property of the co-occurrence features via a bi-directional attention layer, the underlying imbalanced learning problem is resolved. Extensive experiments are carried out on removing rainstreaks and raindrops from both synthetic and real rainy images, and the proposed model is demonstrated to produce significantly better results than state-of-the-art models. In addition, comprehensive ablation studies are performed to analyze the contributions of the different designs. Code and pre-trained models will be publicly available at https://github.com/RobinCSIRO/CERLD-Net.git.


Notes

  1. The detailed steps for obtaining the figures are provided in the Appendix.

  2. https://github.com/yenchenlin/pix2pix-tensorflow.

3. This layer conveys information learned by the encoder, and also serves as input for the decoder to generate images, thus the behaviour of this layer determines the quality of the restored images. This claim is also supported by the disentangled representation learning theory in generative models (Kingma and Welling 2013; Tschannen et al. 2018; Chen et al. 2016; Bengio et al. 2013; Locatello et al. 2018).

  4. https://xueyangfu.github.io/projects/cvpr2017.html.

  5. https://github.com/XiaLiPKU/RESCAN.

  6. https://github.com/stevewongv/SPANet.

  7. \(\alpha \) and \(\beta \) can also be interpreted as the quantified contribution of different patches to the final prediction.

  8. https://cloud.google.com/vision.

References

  • Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation.

  • Gonzalez-Garcia, A., van de Weijer, J., & Bengio, Y. (2018). Image-to-image translation for cross-domain disentanglement. In Advances in Neural Information Processing Systems (NeurIPS).

  • Barnum, P. C., Narasimhan, S., & Kanade, T. (2010). Analysis of rain and snow in frequency space. International Journal of Computer Vision, 86(2), 256–275.


  • Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828.


  • Bossu, J., Hautière, N., & Tarel, J. P. (2011). Rain or snow detection in image sequences through use of a histogram of orientation of streaks. International Journal of Computer Vision, 93(3), 348–367.


  • Brewer, N., & Liu, N. (2008). Using the shape characteristics of rain to identify and remove rain from video. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition and Structural and Syntactic Pattern Recognition.

  • Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS).

  • Chen, Y., & Pock, T. (2016). Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1256–1272.


  • Chen, Y. L., & Hsu, C. T. (2013). A generalized low-rank appearance model for spatio-temporally correlated rainstreaks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Choi, Y., Choi, M., Kim, M., Ha, J. W., Kim, S. & Choo, J. (2018). StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Dabov, K., Foi, A., Katkovnik, V., & Egiazarian, K. (2007). Color image denoising via sparse 3D collaborative filtering with grouping constraint in luminance-chrominance space. In Proceedings of the IEEE Conference on Image Processing (ICIP).

  • Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR).

  • Eigen, D., Krishnan, D., & Fergus, R. (2013). Restoring an image taken through a window covered with dirt or rain. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

  • Fu, X., Huang, J., Ding, X., Liao, Y., & Paisley, J. (2017a). Clearing the skies: A deep network architecture for single-image rain removal. IEEE Transactions on Image Processing, 26(6), 2944–2956.


  • Fu, X., Huang, J., Zeng, D., Huang, Y., Ding, X., & Paisley, J. (2017b). Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Garg, K., & Nayar, S. K. (2004). Detection and removal of rain from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Garg, K., & Nayar, S. K. (2007). Vision and rain. International Journal of Computer Vision, 75(1), 3–27.


  • Ge, Y., Li, Z., Zhao, H., Yin, G., Yi, S., & Wang, X. (2018). FD-GAN: Pose-guided feature distilling GAN for robust person re-identification. In Advances in Neural Information Processing Systems (NeurIPS).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Huang, D. A., Kang, L. W., Wang, Y. C., & Lin, C. W. (2013). Self-learning based image decomposition with applications to single image denoising. IEEE Transactions on Multimedia, 16(1), 83–93.


  • Huang, G., Liu, Z., van der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Huang, X., Liu, M., Belongie, S., & Kautz, J. (2018). Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV).

  • Iizuka, S., Simo-Serra, E., & Ishikawa, H. (2017). Globally and locally consistent image completion. ACM Transactions on Graphics, 36(4), 107–123.


  • Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Kang, L. W., Lin, C. W., & Fu, Y. H. (2011). Automatic single-image-based rain streaks removal via image decomposition. IEEE Transactions on Image Processing, 21(4), 1742–1755.


  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).

  • Li, G., He, X., Zhang, W., Chang, H., Dong, L., & Lin, L. (2018a). Non-locally enhanced encoder–decoder network for single image deraining. In Proceedings of the ACM International Conference on Multimedia (ACMMM).

  • Li, S., Araujo, I. B., Ren, W., Wang, Z., Tokuda, E. K., Junior, R. H., et al. (2019). Single image deraining: A comprehensive benchmark analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Li, X., Wu, J., Lin, Z., Liu, H., & Zha, H. (2018b). Recurrent squeeze-and-excitation context aggregation net for single image deraining. In Proceedings of the European conference on computer vision (ECCV).

  • Li, Y., Tan, R. T., Guo, X., Lu, J., & Brown, M. S. (2016). Rain streak removal using layer priors. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Lin, T. Y., RoyChowdhury, A., & Maji, S. (2015). Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

  • Locatello, F., Bauer, S., Lucic, M., Gelly, S., Schölkopf, B., & Bachem, O. (2018). Challenging common assumptions in the unsupervised learning of disentangled representations. In Advances in Neural Information Processing Systems (NeurIPS).

  • Luo, Y., Xu, Y., & Ji, H. (2015). Removing rain from a single image via discriminative sparse coding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

  • Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., & Frey, B. (2016). Adversarial autoencoders. In Proceedings of the International Conference on Learning Representations (ICLR).

  • Mao, X., Shen, C., & Yang, Y. (2016). Image restoration using convolutional auto-encoders with symmetric skip connections. In Advances in Neural Information Processing Systems (NeurIPS).

  • Mao, X., Li, Q., Xie, H., Lau, R., Wang, Z., & Smolley, S. P. (2017). Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

  • Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

  • Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., & Efros, A. A. (2016). Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Qian, R., Tan, R. T., Yang, W., Su, J., & Liu, J. (2018). Attentive generative adversarial network for raindrop removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Ren, D., Zuo, W., Hu, Q., Zhu, P., & Meng, D. (2019). Progressive image deraining networks: A better and simpler baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Ren, W., Ma, L., Zhang, J., Pan, J., Cao, X., Liu, W., et al. (2018). Gated fusion network for single image dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).

  • Santhaseelan, V., & Asari, V. K. (2015). Utilizing local phase information to remove rain from video. International Journal of Computer Vision, 112(1), 71–89.


  • Shih, Y. F., Yeh, Y. M., Lin, Y. Y., Weng, M. F., Lu, Y. C., & Chuang, Y. Y. (2017). Deep co-occurrence feature learning for visual object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR).

  • Sun, S. H., Fan, S. P., & Wang, Y. C. (2016). Exploiting image structural similarity for single image rain removal. In Proceedings of the IEEE Conference on Image Processing (ICIP).

  • Tran, L., Yin, X., & Liu, X. (2017). Disentangled representation learning GAN for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Tschannen, M., Bachem, O., & Lucic, M. (2018). Recent advances in autoencoder-based representation learning. In Advances in Neural Information Processing Systems (NeurIPS), Bayesian Deep Learning Workshop.

  • Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 3371–3408.


  • Wang, G., Sun, C., & Sowmya, A. (2019a). ERL-Net: Entangled representation learning for single image de-raining. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

  • Wang, T., Yang, X., Xu, K., Chen, S., Zhang, Q., & Lau, R. W. (2019b). Spatial attentive single-image deraining with a high quality real rain dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Wei, W., Meng, D., Zhao, Q., Xu, Z., & Wu, Y. (2019). Semi-supervised transfer learning for image rain removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Yang, W., Tan, R. T., Feng, J., Guo, Z., Yan, S., & Liu, J. (2019). Joint rain detection and removal from a single image with contextualized deep networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(6), 1377–1393.


  • Yang, W., Tan, R. T., Wang, S., Fang, Y., & Liu, J. (2020). Single image deraining: From model-based to data-driven and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Yasarla, R., & Patel, V. M. (2019). Uncertainty guided multi-scale residual learning using a cycle spinning CNN for single image deraining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Yasarla, R., & Patel, V. M. (2020). Confidence measure guided single image de-raining. IEEE Transactions on Image Processing, 29, 4544–4555.


  • You, S., Tan, R. T., Kawakami, R., Mukaigawa, Y., & Ikeuchi, K. (2015). Adherent raindrop modeling, detection and removal in video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9), 1721–1733.


  • Yu, F., & Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. In Proceedings of the International Conference on Learning Representations (ICLR).

  • Zhang, H., & Patel, V. M. (2018). Density-aware single image deraining using a multi-stream dense network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Zhang, K., Zuo, W., Chen, Y., Meng, D., & Zhang, L. (2017a). Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7), 3142–3155.


  • Zhang, K., Zuo, W., Gu, S., & Zhang, L. (2017b). Learning deep CNN denoiser prior for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Zhang, K., Zuo, W., & Zhang, L. (2018). FFDNet: Toward a fast and flexible solution for CNN-based image denoising. IEEE Transactions on Image Processing, 27(9), 4608–4622.


  • Zhou, B., Bau, D., Oliva, A., & Torralba, A. (2019). Interpreting deep visual representations via network dissection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9), 2131–2145.



Author information


Correspondence to Changming Sun.

Additional information

Communicated by Vishal Patel.


Appendix

1.1 (1) How to obtain the figures illustrating the long-tailed distribution?

To obtain a more precise description of the imbalanced distribution in the existing rainstreak and raindrop datasets, the method for obtaining Fig. 16 is designed as follows:

Fig. 16: Long-tailed distribution of the existing (a) rainstreak dataset and (b) raindrop dataset. Red bars indicate the per-class counts of the rainstreak/raindrop patterns, and blue bars indicate the per-class counts of the background patterns. A severely imbalanced distribution can be observed in both of the selected datasets (Color figure online)

For the rainstreak dataset RS-Data (Zhang and Patel 2018) used in our paper, the number of images in each class (covering both rainstreaks and background content) is calculated as follows. For the labels describing the distribution of rainstreaks, we directly use the three-level class labels (heavy, medium, and light) provided by Zhang and Patel (2018), where each category contains 4,000 images. For the labels describing the distribution of background content, we employ the Google Vision API (see footnote 8) to generate labels for all 12,000 ground-truth images in the training set of RS-Data, and then count the number of images in each category (in our case, there are 86 background classes). With the class counts for both rainstreaks and background content, the label distribution of RS-Data is plotted in Fig. 16a.

For the raindrop dataset RD-Data (Qian et al. 2018) used in our paper, the per-class counts are calculated in the same way as for RS-Data. Specifically, the Google Vision API is employed to generate the label distribution of the background content from the ground-truth images (in our case, there are 42 background classes). For the raindrop distribution, three researchers were asked to classify all 861 training images into three groups based on the distribution of the raindrops. With the class counts for both raindrops and background content, the label distribution of RD-Data is plotted in Fig. 16b.
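For concreteness, the snippet below sketches the label-counting step described above using the google-cloud-vision Python client. The directory layout (GT_DIR) and the decision to keep only the top-scoring label per image are illustrative assumptions on our side rather than details taken from the paper.

```python
import os
from collections import Counter

from google.cloud import vision

GT_DIR = "RS-Data/train/groundtruth"  # assumed layout, not from the paper

client = vision.ImageAnnotatorClient()
background_counts = Counter()

for fname in sorted(os.listdir(GT_DIR)):
    with open(os.path.join(GT_DIR, fname), "rb") as f:
        image = vision.Image(content=f.read())
    # Ask the Vision API for semantic labels of the ground-truth image.
    response = client.label_detection(image=image)
    if response.label_annotations:
        # Treat the top-scoring label as the background class of this image.
        background_counts[response.label_annotations[0].description] += 1

# Sorting by frequency yields the long-tailed curve of Fig. 16a; the three
# rain-level classes (4,000 images each) are added as separate bars.
for label, count in background_counts.most_common():
    print(f"{label}: {count}")
```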

1.2 (2) Comparison of the proposed network design with AGAN

Overall, the model architecture and loss formulation are similar to those of AGAN. To clearly demonstrate the contribution of our model, a comprehensive analysis of the differences between our design and AGAN is provided:

Fig. 17: Illustration of the AGAN framework for raindrop removal

Part-A: Analyzing the differences in learning/using the attention map

(1) For the design of AGAN (as shown in Fig. 17), the attention map is concatenated with the rainy image as input for deraining with an encoder–decoder network. In contrast, in CERLD-Net, the attention map is combined with the rainy image via element-wise multiplication (a minimal sketch contrasting the two fusion schemes follows this list). Furthermore, our use of the attention map is aimed mainly at addressing the imbalanced representation learning issue, which has not been considered in any existing deraining model.

(2) The network architecture used for generating the attention map is different. In AGAN, Qian et al. (2018) design a complicated recurrent network for coarse-to-fine attention map generation. Compared with this complex design, we directly use the lightweight pix2pix to generate the attention map. In the following, we compare the time consumption and deraining results when using either the recurrent network (in AGAN) or pix2pix (in our CERLD-Net) for generating the attention map:

(a) Time complexity: to generate the attention map for a \(512 \times 512\) pixel raindrop image, pix2pix takes only 0.09 s while the recurrent network in AGAN takes 0.21 s.

(b) Effect on the raindrop removal results (because AGAN is specifically designed for the raindrop removal task, we only compare results on RD-Data): keeping the encoder–decoder deraining network structure fixed, we combine different attention generation networks with the encoder–decoder network and evaluate the differences in the deraining results. As demonstrated in Table 15, when using the encoder–decoder in AGAN as the deraining network, the comparison between Setup-I and Setup-II shows that using pix2pix as the attention map generator yields better quantitative results. The same conclusion holds when the encoder–decoder network is replaced with the one used in CERLD-Net. This comparison demonstrates that the complex recurrent network in AGAN is unnecessary for attention map generation and only adds substantial time consumption.

(3) In AGAN, the authors also propose to incorporate the attention map into the discriminator design to improve the deraining results. We tried this design in our CERLD-Net but found no improvement; it only caused very unstable training of the overall model. Instead, we propose a novel discriminator design that generates both local and global adversarial losses to improve the recovery of details in the deraining output.

Table 15: Quantitative comparison on RD-Data for raindrop removal models formed by combining (1) different attention map generators with (2) different encoder–decoder (Enc–Dec) networks
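The fragment below is a minimal TensorFlow sketch of the two fusion schemes compared in item (1) above; the tensor shapes are illustrative only, and the real networks wrap these inputs in full encoder–decoder architectures.

```python
import tensorflow as tf

rainy = tf.random.uniform((1, 512, 512, 3))      # rainy input image
attention = tf.random.uniform((1, 512, 512, 1))  # soft attention mask in [0, 1]

# AGAN-style fusion: the attention map is appended as an extra input channel,
# leaving the encoder to learn how to exploit it.
agan_input = tf.concat([rainy, attention], axis=-1)  # shape (1, 512, 512, 4)

# CERLD-Net-style fusion: element-wise multiplication directly suppresses the
# rainy regions, so the guided branch mainly sees contextual content.
cerld_input = rainy * attention                      # shape (1, 512, 512, 3)
```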

Part-B: Analyzing the differences in the encoder–decoder network design

Encoder–decoder networks have been widely used as the model design for many image-to-image translation tasks. In particular, for single image deraining, many models proposed by Qian et al. (2018), Yasarla and Patel (2020), Yasarla and Patel (2019), and Wang et al. (2019a) use an encoder–decoder structure. However, compared with these designs (especially AGAN (Qian et al. 2018)), our model contains several contributions:

Table 16: The difference between AGAN and CERLD-Net in the weight configurations for combining the reconstruction losses calculated on outputs of different sizes
(1) The design of the basic convolutional block is novel in combining the dense block and the residual block to (a) enable the learning of multi-scale representations for handling rainstreaks/raindrops of different densities, directions, and scales, and (b) avoid gradient vanishing issues through the dense and residual connections (an illustrative block sketch follows this list).

(2) The traditional skip connection (connecting layers from the encoder to the decoder) is improved by incorporating a residual block (RB) to solve the feature compatibility issue pointed out in Wang et al. (2019a).

(3) As shown in Table 15, comparing the results of Setup-I and Setup-III (or Setup-II and Setup-IV) clearly shows that the deraining results from our encoder–decoder design are much better than those from the one used in AGAN when the network architecture for attention map generation is kept the same.

(4) The encoder in our CERLD-Net consists of two branches, specifically designed to learn representations that are robust to the imbalanced distribution, which has not been considered in any existing deraining network design.
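As a rough illustration of item (1), the following Keras sketch combines dense connectivity with a residual path in one convolutional block; the layer count, growth rate, and activation are assumptions of ours, since the paper's exact hyper-parameters are not reproduced here.

```python
import tensorflow as tf
from tensorflow.keras import layers


def residual_dense_block(x, growth=32, n_layers=4):
    """Densely connect n_layers convolutions, then add a residual path."""
    features = [x]
    for _ in range(n_layers):
        # Each layer sees all preceding feature maps (dense connectivity).
        inp = layers.Concatenate()(features) if len(features) > 1 else features[0]
        features.append(
            layers.Conv2D(growth, 3, padding="same", activation="relu")(inp)
        )
    # Fuse the concatenated dense features back to the input width, then add
    # the residual connection to ease gradient flow.
    fused = layers.Conv2D(x.shape[-1], 1, padding="same")(
        layers.Concatenate()(features)
    )
    return layers.Add()([x, fused])


inputs = layers.Input(shape=(128, 128, 64))
outputs = residual_dense_block(inputs)
model = tf.keras.Model(inputs, outputs)
```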

Part-C: Analyzing the multi-scale loss design

The multi-scale loss formulation is the same as the one in AGAN (Qian et al. 2018). However, the weights assigned to outputs of different sizes are different in our work. We carry out an extensive ablation study to find the best weight configuration, as shown in Table 16.

Using the two weight configurations in Table 16 to formulate the reconstruction loss for training CERLD-Net on both the raindrop and rainstreak removal tasks, our proposed configuration yields better PSNR, outperforming the weight configuration proposed by AGAN (Qian et al. 2018) by 0.21 dB on average across the two tasks.

Besides, the overall loss weights for combining the (1) reconstruction loss, (2) adversarial loss, and (3) perceptual loss differ between our proposed CERLD-Net and the AGAN model.
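To make the loss composition concrete, here is one plausible TensorFlow formulation of a weighted multi-scale reconstruction loss combined with adversarial and perceptual terms. The per-scale weights and the LAMBDA_* trade-off coefficients are placeholders; the configurations actually compared in our experiments are those listed in Table 16.

```python
import tensorflow as tf

def multi_scale_reconstruction_loss(outputs, target, scale_weights):
    """outputs: list of predictions ordered from coarse to fine resolution."""
    loss = 0.0
    for pred, w in zip(outputs, scale_weights):
        # Match the ground truth to each output resolution before comparing.
        gt = tf.image.resize(target, pred.shape[1:3])
        loss += w * tf.reduce_mean(tf.abs(pred - gt))  # L1 term per scale
    return loss

# Assumed trade-off weights for the three loss terms; not the paper's values.
LAMBDA_REC, LAMBDA_ADV, LAMBDA_PER = 1.0, 0.01, 0.1

def total_loss(outputs, target, adv_loss, perceptual_loss, scale_weights):
    rec = multi_scale_reconstruction_loss(outputs, target, scale_weights)
    return LAMBDA_REC * rec + LAMBDA_ADV * adv_loss + LAMBDA_PER * perceptual_loss
```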


Cite this article

Wang, G., Sun, C. & Sowmya, A. Context-Enhanced Representation Learning for Single Image Deraining. Int J Comput Vis 129, 1650–1674 (2021). https://doi.org/10.1007/s11263-020-01425-9
