Abstract
Compressed video super-resolution (VSR) aims to restore high-resolution frames from their compressed low-resolution counterparts. Most recent VSR approaches enhance an input frame by “borrowing” relevant textures from neighboring video frames. Although some progress has been made, it remains challenging to effectively extract and transfer high-quality textures from compressed videos, where most frames are usually highly degraded. In this paper, we propose a novel Frequency-Transformer for compressed video super-resolution (FTVSR) that conducts self-attention over a joint space-time-frequency domain. First, we divide a video frame into patches and transform each patch into DCT spectral maps, in which each channel represents a frequency band. Such a design enables fine-grained self-attention on each frequency band, so that real visual textures can be distinguished from artifacts and further utilized for video frame restoration. Second, we study different self-attention schemes and find that a “divided attention”, which conducts a joint space-frequency attention before applying temporal attention on each frequency band, leads to the best video enhancement quality. Experimental results on two widely used video super-resolution benchmarks show that FTVSR outperforms state-of-the-art approaches on both uncompressed and compressed videos by clear visual margins. Code is available at https://github.com/researchmm/FTVSR.
This work was done when Z. Qiu was an intern at Microsoft Research.
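The patch-to-spectrum step described in the abstract can be sketched as follows: a frame is split into fixed-size patches, each patch is transformed with a 2-D DCT, and the coefficients are rearranged so that each output channel collects one frequency band across all patch locations. This is a minimal illustration only; the patch size, normalization, and band layout here are assumptions, not the authors' exact design.

```python
import numpy as np
from scipy.fftpack import dct

def frame_to_dct_bands(frame, patch=8):
    """Split a (H, W) frame into patch x patch blocks and apply a 2-D DCT-II.

    Returns an array of shape (patch*patch, H//patch, W//patch): each of the
    patch*patch channels holds one frequency band's coefficients across all
    patch locations, so attention can operate per frequency band.
    """
    H, W = frame.shape
    assert H % patch == 0 and W % patch == 0
    # Rearrange into blocks: (H//p, p, W//p, p) -> (H//p, W//p, p, p)
    blocks = frame.reshape(H // patch, patch, W // patch, patch).transpose(0, 2, 1, 3)
    # Separable 2-D DCT-II with orthonormal scaling on each block
    coeffs = dct(dct(blocks, axis=-1, norm="ortho"), axis=-2, norm="ortho")
    # Move the (p, p) frequency axes to the front as channels
    return coeffs.reshape(H // patch, W // patch, patch * patch).transpose(2, 0, 1)

frame = np.random.rand(32, 32)
bands = frame_to_dct_bands(frame)
print(bands.shape)  # (64, 4, 4): 64 frequency bands over a 4x4 grid of patches
```

Because the DCT is orthonormal here, the mapping is lossless and invertible, so no information is discarded before the frequency-band attention is applied.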
Acknowledgment
This work was supported by the Scientific and Technological Innovation of Shunde Graduate School of University of Science and Technology Beijing (No. BK20AE004 and No. BK19CE017).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Qiu, Z., Yang, H., Fu, J., Fu, D. (2022). Learning Spatiotemporal Frequency-Transformer for Compressed Video Super-Resolution. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13678. Springer, Cham. https://doi.org/10.1007/978-3-031-19797-0_15
Print ISBN: 978-3-031-19796-3
Online ISBN: 978-3-031-19797-0