
Semantic-Aware Visual Decomposition for Image Coding

International Journal of Computer Vision

Abstract

In this paper, we propose a novel image coding framework based on semantic-aware visual decomposition for extremely low bitrate compression. At the encoder side, an input image is decomposed into a semantic map, which serves as the structural representation, and semantic-wise texture representations; both are then compressed into bitstreams. At the decoder side, the received bitstreams of the dual-layer representations are decoded, and the target image is synthesized from them with generative models. An attention mechanism is introduced into the model architecture for texture representation modeling, and a coherency regularization is proposed to further optimize the texture representation space by aligning it with the source pixel space for higher synthesis quality. We also propose a cross-channel entropy module and control the quantization scale to facilitate rate-distortion optimization. Compressing the decomposed components into the bitstream is a simple yet effective representation philosophy that benefits image compression in several respects. First, in terms of compression performance, compact representations and high visual synthesis quality bring remarkable advantages. Second, the proposed framework yields a physically explainable bitstream composed of the structural segment and semantic-wise texture segments. Third, and most importantly, subsequent vision tasks (e.g., content manipulation) receive fundamental support from the semantic-aware visual decomposition and synthesis mechanism. Extensive experimental results demonstrate the superiority of the proposed framework in efficient visual representation learning, high-efficiency image compression (\(<0.1\) bpp), and intelligent visual applications (e.g., manipulation and analysis).
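To make the bitrate figure concrete, the following minimal sketch shows how the rate of a dual-layer bitstream could be expressed in bits per pixel (bpp). It is an illustration only, not the authors' implementation: the function name `bits_per_pixel`, the class labels, and all segment sizes are hypothetical and are not taken from the paper.

```python
from typing import Mapping


def bits_per_pixel(structure_bytes: int,
                   texture_bytes: Mapping[str, int],
                   width: int,
                   height: int) -> float:
    """Total rate of a two-layer bitstream in bits per pixel (bpp).

    structure_bytes: size of the compressed semantic map (structural segment).
    texture_bytes:   size of each compressed semantic-wise texture segment,
                     keyed by a semantic class label.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    total_bits = 8 * (structure_bytes + sum(texture_bytes.values()))
    return total_bits / (width * height)


# Hypothetical segment sizes for a 512x512 image (assumed values, not results
# reported in the paper).
example_texture = {"skin": 96, "hair": 80, "background": 64}
example_bpp = bits_per_pixel(structure_bytes=1200,
                             texture_bytes=example_texture,
                             width=512, height=512)
print(f"{example_bpp:.4f} bpp")  # prints "0.0439 bpp"
```

With these assumed sizes the structural segment dominates the rate, and the total stays below the 0.1 bpp regime the abstract targets.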


Notes

  1. For reproducible research, the source code of our method will be made public when this paper is accepted.

  2. https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM.


Author information


Correspondence to Jian Zhang or Siwei Ma.

Additional information

Communicated by Ming-Hsuan Yang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported in part by the National Natural Science Foundation of China under Grants 62025101 and 62088102, the Shenzhen Research Project under Grant JCYJ20220531093215035, and the Young Elite Scientist Sponsorship Program by BAST under Grant No. BYSS2022019.


Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Chang, J., Zhang, J., Li, J. et al. Semantic-Aware Visual Decomposition for Image Coding. Int J Comput Vis 131, 2333–2355 (2023). https://doi.org/10.1007/s11263-023-01809-7


  • DOI: https://doi.org/10.1007/s11263-023-01809-7
