Hybrid CNN-Transformer Architecture for Efficient Large-Scale Video Snapshot Compressive Imaging

Abstract

Video snapshot compressive imaging (SCI) uses a low-speed 2D detector to capture a high-speed scene: the dynamic scene is modulated by different masks and then compressed into a single snapshot measurement, after which a reconstruction algorithm recovers the high-speed video frames. Although state-of-the-art (SOTA) deep learning-based reconstruction algorithms have achieved impressive results, they still face the following challenges due to excessive model complexity and GPU memory limitations: (1) these models incur high computational cost, and (2) they are usually unable to reconstruct large-scale video frames at high compression ratios. To address these issues, we develop an efficient network for video SCI, dubbed EfficientSCI++, which uses hierarchical residual-like connections and a hybrid CNN-Transformer structure within a single residual block. The EfficientSCI++ network effectively exploits spatial-temporal correlation by using convolution in the spatial domain and Transformers in the temporal domain. We demonstrate, for the first time, that a UHD color video (\(1644 \times 3840 \times 3\)) with a high compression ratio of 40 can be reconstructed from a single snapshot 2D measurement by an end-to-end deep learning model, with PSNR above 34 dB. Moreover, a mixed-precision model is trained to further accelerate the video SCI reconstruction process and reduce the memory footprint. Extensive results on both simulation and real data demonstrate that, compared with previous SOTA methods, our proposed EfficientSCI++ and EfficientSCI achieve comparable reconstruction quality at much lower computational cost and with better real-time performance. Code is available at https://github.com/mcao92/EfficientSCI-plus-plus.
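
For readers unfamiliar with SCI, the modulation-and-compression process described above follows the standard video SCI forward model from the literature (background context, not stated explicitly in the abstract): \(B\) high-speed frames \(\mathbf{X}_k\) are element-wise modulated by masks \(\mathbf{M}_k\) and summed into a single 2D measurement,

\[ \mathbf{Y} = \sum_{k=1}^{B} \mathbf{M}_k \odot \mathbf{X}_k + \mathbf{N}, \]

where \(\odot\) denotes the Hadamard (element-wise) product, \(B\) is the compression ratio, and \(\mathbf{N}\) is measurement noise; the reconstruction network learns to invert this many-to-one mapping.

The mixed-precision training mentioned in the abstract can be sketched with a generic PyTorch automatic mixed precision (AMP) step. This is a minimal illustrative sketch under assumed tensor shapes, with a stand-in convolutional network in place of the real architecture; it is not the authors' released EfficientSCI++ code:

    # Generic PyTorch AMP sketch (illustrative; the stand-in conv network and
    # the shapes below are assumptions, not the EfficientSCI++ architecture).
    import torch
    import torch.nn as nn

    device = "cuda"
    B, H, W = 8, 256, 256                          # compression ratio, frame size
    model = nn.Sequential(                         # stand-in reconstruction net
        nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, B, 3, padding=1),
    ).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()           # rescales loss to avoid fp16 underflow

    frames = torch.rand(1, B, H, W, device=device)             # ground-truth video
    masks = (torch.rand(1, B, H, W, device=device) > 0.5).float()
    meas = (masks * frames).sum(1, keepdim=True)               # snapshot measurement Y
    inp = torch.cat([meas, masks.sum(1, keepdim=True)], 1)     # naive 2-channel input

    with torch.cuda.amp.autocast():                # forward pass runs in float16
        recon = model(inp)
        loss = nn.functional.mse_loss(recon, frames)
    scaler.scale(loss).backward()                  # backward on the scaled loss
    scaler.step(opt)                               # unscale gradients, then update
    scaler.update()

Float16 activations roughly halve the memory of a float32 forward pass, which is what makes large-scale (e.g., UHD) reconstruction feasible on a single GPU.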

Data availability

Data underlying the results are available from the corresponding author upon reasonable request.

Acknowledgements

We would like to thank the Research Center for Industries of the Future (RCIF) at Westlake University for supporting this work.

Funding

This work was supported by the National Natural Science Foundation of China (62271414), the Science Fund for Distinguished Young Scholars of Zhejiang Province (LR23F010001), and the Key Project of Westlake Institute for Optoelectronics (Grant No. 2023GD007).

Author information

Corresponding author

Correspondence to Xin Yuan.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below are the links to the electronic supplementary material.

Supplementary file 1 (mp4 1103 KB)

Supplementary file 2 (mp4 931 KB)

Supplementary file 3 (mp4 1396 KB)

Supplementary file 4 (mp4 393 KB)

Supplementary file 5 (mp4 123 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cao, M., Wang, L., Zhu, M. et al. Hybrid CNN-Transformer Architecture for Efficient Large-Scale Video Snapshot Compressive Imaging. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02101-y
