
Usage of compressed domain in fast frameworks

Abstract

There has been considerable progress in applying Convolutional Neural Networks (CNNs) to computer vision tasks with RGB images. A few studies have investigated improving performance by replacing the RGB representation with block-wise Discrete Cosine Transform (DCT) coefficients. DCT coefficients, which are readily available during JPEG decoding, may be competitive with the output of the computationally costly initial CNN layers fed by the RGB representation. Despite the attractiveness of this approach, to the best of our knowledge, only a single study has targeted the use of DCT coefficients with low-latency models. In this paper, we investigate the usage of DCT coefficients, first with MnasNet, a mobile image classification model that processes thousands of images per second on a single modern GPU, and second with Yolov5, which holds benchmark performance in terms of Average Precision (AP) and latency. After applying our methods to MnasNet (1.0) and evaluating on the ImageNet dataset, we observe accuracy competitive with RGB-based MnasNet (1.0) and significantly higher processing speed than RGB-based MnasNet (0.5). After applying our methods to Yolov5, we evaluate performance on three benchmark datasets. The resulting DCT-based object detection model processes up to 519 more images per second, with an AP drop of up to 4.7% on the MSCOCO test-dev set, up to 5.1% on the Pascal VOC 2007 test set, and up to 3.8% on the CrowdHuman (full-body) validation set.
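To make the idea concrete: a JPEG codec already represents each 8x8 pixel block by its 64 DCT coefficients, so a DCT-domain model receives a 64-channel feature map at 1/8 of the input resolution instead of a 3-channel RGB image at full resolution. The following is a minimal, illustrative Python sketch of that input pipeline, not the paper's implementation: it recomputes the luminance DCT from decoded pixels rather than reading coefficients from the JPEG bitstream, and the adapter width of 40 channels is a placeholder value chosen here for illustration.

    import numpy as np
    from scipy.fft import dctn
    import torch
    import torch.nn as nn

    def blockwise_dct(gray: np.ndarray) -> np.ndarray:
        """8x8 block-wise DCT of a grayscale image, mimicking the luminance
        coefficients a JPEG decoder exposes. Output shape: (H//8, W//8, 64)."""
        h, w = gray.shape
        assert h % 8 == 0 and w % 8 == 0, "pad the image to a multiple of 8 first"
        # Rearrange pixels into (block_row, block_col, 8, 8) tiles.
        blocks = gray.reshape(h // 8, 8, w // 8, 8).transpose(0, 2, 1, 3)
        coeffs = dctn(blocks, axes=(2, 3), norm="ortho")  # DCT each 8x8 tile
        return coeffs.reshape(h // 8, w // 8, 64)         # 64 coefficients per block

    # Example: adapt the 64-channel, stride-8 DCT map to a backbone whose
    # early (stride 1 to 8) layers have been removed. The output width 40 is
    # an assumed placeholder, not the paper's configuration.
    img = np.random.rand(224, 224).astype(np.float32)
    x = torch.from_numpy(blockwise_dct(img)).permute(2, 0, 1).unsqueeze(0)  # (1, 64, 28, 28)
    adapter = nn.Conv2d(64, 40, kernel_size=1)  # map DCT channels to backbone width
    features = adapter(x)
    print(features.shape)  # torch.Size([1, 40, 28, 28])

Because the DCT map already sits at 1/8 of the input resolution, the stride-1 to stride-8 portion of an RGB backbone can be dropped, which is where the speedup of such approaches comes from.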



Notes

  1. https://www.nvidia.com/en-us/data-center/v100/.

  2. https://competitions.codalab.org/competitions/20794.



Acknowledgements

This work was funded by Rakuten Inc. The authors would like to thank Rajasekhar Sanagavarapu for his support at key stages of this work.

Author information


Corresponding author

Correspondence to Hasan Sait Arslan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The authors Hasan Sait Arslan and Denis Miller completed this work during their employment at Rakuten Inc.


About this article


Cite this article

Arslan, H.S., Archambault, S., Bhatt, P. et al. Usage of compressed domain in fast frameworks. SIViP 16, 1763–1771 (2022). https://doi.org/10.1007/s11760-022-02133-2

