Hybrid CNN-Transformer Architecture for Efficient Large-Scale Video Snapshot Compressive Imaging

Abstract

Video snapshot compressive imaging (SCI) uses a low-speed 2D detector to capture a high-speed scene: the dynamic scene is modulated by different masks and then compressed into a single snapshot measurement, after which a reconstruction algorithm recovers the high-speed video frames. Although state-of-the-art (SOTA) deep learning-based reconstruction algorithms have achieved impressive results, they still face the following challenges due to excessive model complexity and GPU memory limitations: (1) these models incur high computational cost, and (2) they are usually unable to reconstruct large-scale video frames at high compression ratios. To address these issues, we develop an efficient network for video SCI, dubbed EfficientSCI++, which uses hierarchical residual-like connections and a hybrid CNN-Transformer structure within a single residual block. The EfficientSCI++ network effectively exploits spatial-temporal correlation by using convolution in the spatial domain and Transformers in the temporal domain. We demonstrate, for the first time, that a UHD color video (\(1644 \times 3840 \times 3\)) with a high compression ratio of 40 can be reconstructed from a single snapshot 2D measurement by an end-to-end deep learning model, with PSNR above 34 dB. Moreover, a mixed-precision model is trained to further accelerate the video SCI reconstruction process and reduce the memory footprint. Extensive results on both simulation and real data demonstrate that, compared with previous SOTA methods, our proposed EfficientSCI++ and EfficientSCI achieve comparable reconstruction quality at much lower computational cost and with better real-time performance. Code is available at https://github.com/mcao92/EfficientSCI-plus-plus.
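
For readers unfamiliar with SCI, the modulation-and-compression process described above follows the standard video SCI forward model from the literature (background context, not stated explicitly in the abstract): \(B\) high-speed frames \(\mathbf{X}_k\) are element-wise modulated by masks \(\mathbf{M}_k\) and summed into a single 2D measurement,

\[ \mathbf{Y} = \sum_{k=1}^{B} \mathbf{M}_k \odot \mathbf{X}_k + \mathbf{N}, \]

where \(\odot\) denotes the Hadamard (element-wise) product, \(B\) is the compression ratio, and \(\mathbf{N}\) is measurement noise; the reconstruction network learns to invert this many-to-one mapping.

The mixed-precision training mentioned in the abstract can be sketched with a generic PyTorch automatic mixed precision (AMP) step. This is a minimal illustrative sketch under assumed tensor shapes, with a stand-in convolutional network in place of the real architecture; it is not the authors' released EfficientSCI++ code:

    # Generic PyTorch AMP sketch (illustrative; the stand-in conv network and
    # the shapes below are assumptions, not the EfficientSCI++ architecture).
    import torch
    import torch.nn as nn

    device = "cuda"
    B, H, W = 8, 256, 256                          # compression ratio, frame size
    model = nn.Sequential(                         # stand-in reconstruction net
        nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, B, 3, padding=1),
    ).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()           # rescales loss to avoid fp16 underflow

    frames = torch.rand(1, B, H, W, device=device)             # ground-truth video
    masks = (torch.rand(1, B, H, W, device=device) > 0.5).float()
    meas = (masks * frames).sum(1, keepdim=True)               # snapshot measurement Y
    inp = torch.cat([meas, masks.sum(1, keepdim=True)], 1)     # naive 2-channel input

    with torch.cuda.amp.autocast():                # forward pass runs in float16
        recon = model(inp)
        loss = nn.functional.mse_loss(recon, frames)
    scaler.scale(loss).backward()                  # backward on the scaled loss
    scaler.step(opt)                               # unscale gradients, then update
    scaler.update()

Float16 activations roughly halve the memory of a float32 forward pass, which is what makes large-scale (e.g., UHD) reconstruction feasible on a single GPU.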

Data availability

Data underlying the results are available from the corresponding author upon reasonable request.

Acknowledgements

We would like to thank the Research Center for Industries of the Future (RCIF) at Westlake University for supporting this work.

Funding

This work was supported by the National Natural Science Foundation of China (62271414), the Science Fund for Distinguished Young Scholars of Zhejiang Province (LR23F010001), and the Key Project of Westlake Institute for Optoelectronics (Grant No. 2023GD007).

Author information

Corresponding author

Correspondence to Xin Yuan.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below are the links to the electronic supplementary material.

Supplementary file 1 (mp4 1103 KB)

Supplementary file 2 (mp4 931 KB)

Supplementary file 3 (mp4 1396 KB)

Supplementary file 4 (mp4 393 KB)

Supplementary file 5 (mp4 123 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cao, M., Wang, L., Zhu, M. et al. Hybrid CNN-Transformer Architecture for Efficient Large-Scale Video Snapshot Compressive Imaging. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02101-y
