Abstract
There has been considerable progress in applying Convolutional Neural Networks (CNNs) to computer vision tasks with RGB images. A few studies have investigated improving performance by replacing the RGB representation with block-wise Discrete Cosine Transform (DCT) coefficients. DCT coefficients, which are readily available during JPEG decoding, may be competitive with the output of the computationally costly initial CNN layers fed by the RGB representation. Despite the attractiveness of this approach, to the best of our knowledge, only a single study has targeted the use of DCT coefficients in low-latency models. In this paper, we investigate the use of DCT coefficients first with MnasNet, a mobile image classification model that processes thousands of images per second on a single modern GPU, and second with Yolov5, which holds benchmark performance in terms of Average Precision (AP) and latency. After applying our methods to MnasNet (1.0) and evaluating on the ImageNet dataset, we observe accuracy competitive with RGB-based MnasNet (1.0) and significantly higher processing speed than RGB-based MnasNet (0.5). After applying our methods to Yolov5, we evaluate performance on three benchmark datasets. The resulting DCT-based object detection model processes up to 519 more images per second, at the cost of an AP drop of up to 4.7% on the MS COCO test-dev set, up to 5.1% on the Pascal VOC 2007 test set, and up to 3.8% on the CrowdHuman (full-body) validation set.
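To make the representation concrete, below is a minimal sketch (not the paper's implementation; function names and shapes are illustrative) of the block-wise DCT features discussed in the abstract: a luminance plane is split into 8 x 8 blocks, and each block is transformed with an orthonormal type-II 2-D DCT, the transform a JPEG decoder would otherwise invert. The resulting tensor already has the stride-8 spatial resolution that the early layers of an RGB-based CNN must compute explicitly.

```python
# Illustrative sketch of block-wise DCT features, assuming a single
# luminance plane; in a real pipeline these coefficients would be read
# directly from the partially decoded JPEG instead of recomputed.
import numpy as np

def dct_basis(n: int = 8) -> np.ndarray:
    """Orthonormal type-II DCT matrix of size n x n."""
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    basis[0] *= np.sqrt(1 / n)
    basis[1:] *= np.sqrt(2 / n)
    return basis

def blockwise_dct(img: np.ndarray, n: int = 8) -> np.ndarray:
    """img: (H, W) luminance plane with H and W divisible by n.
    Returns an (H//n, W//n, n*n) tensor: one 64-vector per 8x8 block."""
    h, w = img.shape
    d = dct_basis(n)
    # Regroup the image into a (H//n, W//n) grid of n x n blocks.
    blocks = img.reshape(h // n, n, w // n, n).transpose(0, 2, 1, 3)
    # 2-D DCT of every block at once: D @ X @ D^T.
    coeffs = d @ blocks @ d.T
    return coeffs.reshape(h // n, w // n, n * n)

img = np.random.rand(224, 224).astype(np.float32)  # stand-in for a Y plane
feat = blockwise_dct(img)
print(feat.shape)  # (28, 28, 64): stride-8, like an early CNN feature map
```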

References
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28, 91–99 (2015)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks. arXiv:1905.11946 (2019)
Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv:2004.10934 (2020)
Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: Yolact: real-time instance segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9157–9166 (2019)
Gueguen, L., Sergeev, A., Kadlec, B., Liu, R., Yosinski, J.: Faster neural networks straight from JPEG. Adv. Neural Inf. Process. Syst. 31, 3933 (2018)
Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., Le, Q.V.: MnasNet: platform-aware neural architecture search for mobile. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828 (2019)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Jocher, G., Stoken, A., Borovec, J., NanoCode012, ChristopherSTAN, Changyu, L., Laughing, tkianai, yxNONG, Hogan, A., lorenzomammana, AlexWang1900, Chaurasia, A., Diaconu, L., Marc, wanghaoyang0106, ml5ah, Doug, Durgesh, Ingham, F., Frederik, Guilhen, Colmagro, A., Ye, H., Jacobsolawetz, Poznanski, J., Fang, J., Kim, J., Doan, K., Yu, L.: ultralytics/yolov5: v4.0 - nn.SiLU() activations, Weights & Biases logging, PyTorch Hub integration (2021). https://doi.org/10.5281/zenodo.4418161
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: common objects in context. In: European Conference on Computer Vision, pp. 740–755. Springer (2014)
Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal visual object classes challenge: a retrospective. Int. J. Comput. Vis. 111(1), 98 (2015)
Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., Sun, J.: CrowdHuman: a benchmark for detecting human in a crowd. arXiv:1805.00123 (2018)
Deguerre, B., Chatelain, C., Gasso, G.: Object detection in the DCT domain: is luminance the solution? arXiv:2006.05732 (2020)
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: single shot multibox detector. In: European Conference on Computer Vision, pp. 21–37. Springer (2016)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)
Fukushima, K., Miyake, S.: Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. In: Competition and Cooperation in Neural Nets, pp. 267–285. Springer (1982)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
Hendrycks, D., Gimpel, K.: Gaussian error linear units (GELUs). arXiv:1606.08415 (2016)
Elfwing, S., Uchibe, E., Doya, K.: Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 107, 3 (2018)
Ramachandran, P., Zoph, B., Le, Q.V.: Swish: a self-gated activation function. arXiv:1710.05941 (2017)
Wang, C.Y., Liao, H.Y.M., Wu, Y.H., Chen, P.Y., Hsieh, J.W., Yeh, I.H.: CSPNet: a new backbone that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 390–391 (2020)
Huang, Z., Wang, J., Fu, X., Yu, T., Guo, Y., Wang, R.: DC-SPP-YOLO: dense connection and spatial pyramid pooling based YOLO for object detection. Inf. Sci. 522, 241 (2020)
Nagi, J., Ducatelle, F., Di Caro, G.A., Cireşan, D., Meier, U., Giusti, A., Nagi, F., Schmidhuber, J., Gambardella, L.M.: Max-pooling convolutional neural networks for vision-based hand gesture recognition. In: 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), pp. 342–347. IEEE (2011)
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: an imperative style, high-performance deep learning library. arXiv:1912.01703 (2019)
Marcel, S., Rodriguez, Y.: Torchvision the machine-vision package of torch. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 1485–1488 (2010)
Buslaev, A., Iglovikov, V.I., Khvedchenya, E., Parinov, A., Druzhinin, M., Kalinin, A.A.: Albumentations: fast and flexible image augmentations. Information (2020). https://doi.org/10.3390/info11020125
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge 2012 (VOC2012) results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge 2007 (VOC2007) results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
Acknowledgements
This work was funded by Rakuten Inc. The authors would like to thank Rajasekhar Sanagavarapu for his support at key stages of this work.
Additional information
The authors Hasan Sait Arslan and Denis Miller completed the work on this article during their employment at Rakuten Inc.
Cite this article
Arslan, H.S., Archambault, S., Bhatt, P. et al. Usage of compressed domain in fast frameworks. SIViP 16, 1763–1771 (2022). https://doi.org/10.1007/s11760-022-02133-2