Abstract
Compressed video super-resolution (VSR) aims to restore high-resolution frames from their compressed low-resolution counterparts. Most recent VSR approaches enhance an input frame by “borrowing” relevant textures from neighboring video frames. Although some progress has been made, it remains challenging to effectively extract and transfer high-quality textures from compressed videos, where most frames are usually highly degraded. In this paper, we propose a novel Frequency-Transformer for compressed video super-resolution (FTVSR) that conducts self-attention over a joint space-time-frequency domain. First, we divide a video frame into patches and transform each patch into DCT spectral maps, in which each channel represents a frequency band. Such a design enables fine-grained self-attention on each frequency band, so that real visual textures can be distinguished from artifacts and further utilized for video frame restoration. Second, we study different self-attention schemes and find that a “divided attention”, which conducts a joint space-frequency attention before applying temporal attention on each frequency band, leads to the best video enhancement quality. Experimental results on two widely used video super-resolution benchmarks show that FTVSR outperforms state-of-the-art approaches on both uncompressed and compressed videos by clear visual margins. Code is available at https://github.com/researchmm/FTVSR.
This work was done when Z. Qiu was an intern at Microsoft Research.
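The patch-to-spectrum step described in the abstract can be sketched as follows: a frame is split into fixed-size patches, each patch is transformed with a 2-D DCT, and the coefficients are rearranged so that each output channel collects one frequency band across all patch locations. This is a minimal illustration only; the patch size, normalization, and band layout here are assumptions, not the authors' exact design.

```python
import numpy as np
from scipy.fftpack import dct

def frame_to_dct_bands(frame, patch=8):
    """Split a (H, W) frame into patch x patch blocks and apply a 2-D DCT-II.

    Returns an array of shape (patch*patch, H//patch, W//patch): each of the
    patch*patch channels holds one frequency band's coefficients across all
    patch locations, so attention can operate per frequency band.
    """
    H, W = frame.shape
    assert H % patch == 0 and W % patch == 0
    # Rearrange into blocks: (H//p, p, W//p, p) -> (H//p, W//p, p, p)
    blocks = frame.reshape(H // patch, patch, W // patch, patch).transpose(0, 2, 1, 3)
    # Separable 2-D DCT-II with orthonormal scaling on each block
    coeffs = dct(dct(blocks, axis=-1, norm="ortho"), axis=-2, norm="ortho")
    # Move the (p, p) frequency axes to the front as channels
    return coeffs.reshape(H // patch, W // patch, patch * patch).transpose(2, 0, 1)

frame = np.random.rand(32, 32)
bands = frame_to_dct_bands(frame)
print(bands.shape)  # (64, 4, 4): 64 frequency bands over a 4x4 grid of patches
```

Because the DCT is orthonormal here, the mapping is lossless and invertible, so no information is discarded before the frequency-band attention is applied.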
Acknowledgment
This work was supported by the Scientific and Technological Innovation of Shunde Graduate School of University of Science and Technology Beijing (No. BK20AE004 and No. BK19CE017).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Qiu, Z., Yang, H., Fu, J., Fu, D. (2022). Learning Spatiotemporal Frequency-Transformer for Compressed Video Super-Resolution. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13678. Springer, Cham. https://doi.org/10.1007/978-3-031-19797-0_15
Print ISBN: 978-3-031-19796-3
Online ISBN: 978-3-031-19797-0