
EdgeViT: Efficient Visual Modeling for Edge Computing

  • Conference paper
Wireless Algorithms, Systems, and Applications (WASA 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13473)

Abstract

With the rapid growth of edge intelligence, deep neural networks must be computed with ever greater efficiency. Visual intelligence, a core component of artificial intelligence, particularly deserves further exploration. As the cornerstone of modern visual modeling, convolutional neural networks (CNNs) have developed greatly over the past decades, and light-weight CNN variants have been proposed to address the challenge of heavy computation in mobile settings. Although the spatial inductive biases of CNNs allow them to learn representations with fewer parameters across different vision tasks, these models remain spatially local. To reach the next level of model performance, the vision transformer (ViT) is now a viable alternative owing to the potential of the multi-head attention mechanism. In this work, we introduce EdgeViT, an accelerated deep visual modeling method that combines the benefits of CNNs and ViTs in a light-weight and edge-friendly manner. Our proposed method achieves a top-1 accuracy of 77.8% with only 2.3 million parameters and 79.2% with 5.6 million parameters on the ImageNet-1k dataset. It reaches an mIoU of up to 78.3 on PASCAL VOC segmentation while using only 3.1 million parameters, which is only half the parameter budget of MobileViT.
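
As a minimal sketch of the hybrid idea the abstract describes, the PyTorch block below pairs a depthwise convolution (local, CNN-style inductive bias) with multi-head self-attention (global, ViT-style modeling). The module name, layer choices, and dimensions are assumptions made purely for illustration; this is not the EdgeViT architecture itself, whose details are not reproduced on this page.

```python
# Illustrative only: a tiny hybrid block mixing a depthwise convolution
# (local spatial mixing, CNN-style) with multi-head self-attention
# (global token mixing, ViT-style). NOT the EdgeViT architecture from
# the paper; names and dimensions are assumptions for this sketch.
import torch
import torch.nn as nn


class HybridConvAttentionBlock(nn.Module):
    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        # Depthwise 3x3 convolution: cheap local spatial mixing.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        # Multi-head self-attention: global mixing across all positions.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        x = x + self.local(x)                     # local (convolutional) branch
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)     # -> (batch, h*w, channels)
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed)
        tokens = tokens + attn_out                # global (attention) branch
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = HybridConvAttentionBlock(dim=64, num_heads=4)
    out = block(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Running the block on a random 64-channel feature map returns a tensor of the same shape, the usual contract for a drop-in backbone block that composes local and global mixing.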


References

  1. Brown, T.B., et al.: Language models are few-shot learners. In: NeurIPS (2020)
  2. Chen, C.F., Fan, Q., Panda, R.: CrossViT: cross-attention multi-scale vision transformer for image classification. In: ICCV, pp. 347–356 (2021)
  3. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv abs/1706.05587 (2017)
  4. Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z.: Mobile-Former: bridging MobileNet and transformer. arXiv abs/2108.05895 (2021)
  5. Chen, Z., Chen, D., Yuan, Z., Cheng, X., Zhang, X.: Learning graph structures with transformer for multivariate time-series anomaly detection in IoT. IEEE Internet Things J. 9, 9179–9189 (2022)
  6. Chen, Z., Jiaze, E., Zhang, X., Sheng, H., Cheng, X.: Multi-task time series forecasting with shared attention. In: ICDMW, pp. 917–925 (2020)
  7. Chen, Z., Shi, M., Zhang, X., Ying, H.: ASM2TV: an adaptive semi-supervised multi-task multi-view learning framework. In: AAAI (2022)
  8. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: CVPR, pp. 1800–1807 (2017)
  9. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv abs/1602.02830 (2016)
  10. Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: AutoAugment: learning augmentation strategies from data. In: CVPR, pp. 113–123 (2019)
  11. Dai, Z., Liu, H., Le, Q.V., Tan, M.: CoAtNet: marrying convolution and attention for all data sizes. In: NeurIPS (2021)
  12. d'Ascoli, S., Touvron, H., Leavitt, M.L., Morcos, A.S., Biroli, G., Sagun, L.: ConViT: improving vision transformers with soft convolutional inductive biases. In: ICML (2021)
  13. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
  14. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2020)
  15. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes challenge: a retrospective. IJCV 111, 98–136 (2014)
  16. Graham, B., et al.: LeViT: a vision transformer in ConvNet's clothing for faster inference. In: ICCV, pp. 12239–12249 (2021)
  17. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In: NeurIPS (2016)
  18. Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for efficient neural networks. In: NeurIPS (2015)
  19. Hariharan, B., Arbeláez, P., Bourdev, L.D., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: ICCV, pp. 991–998 (2011)
  20. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
  21. He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., Han, S.: AMC: AutoML for model compression and acceleration on mobile devices. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 815–832. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_48
  22. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: ICCV, pp. 1398–1406 (2017)
  23. Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J.: Rethinking spatial dimensions of vision transformers. In: ICCV, pp. 11916–11925 (2021)
  24. Howard, A.G., et al.: Searching for MobileNetV3. In: ICCV, pp. 1314–1324 (2019)
  25. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv abs/1704.04861 (2017)
  26. Huang, G., Liu, Z., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR, pp. 2261–2269 (2017)
  27. Jacob, B., et al.: Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: CVPR, pp. 2704–2713 (2018)
  28. Jin, J., Dundar, A., Culurciello, E.: Flattened convolutional neural networks for feedforward acceleration. arXiv abs/1412.5474 (2015)
  29. Li, Y., Zhang, K., Cao, J., Timofte, R., Gool, L.V.: LocalViT: bringing locality to vision transformers. arXiv abs/2104.05707 (2021)
  30. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
  31. Liu, Z., et al.: Swin Transformer: hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
  32. Liu, Z., Mu, H., Zhang, X., Guo, Z., Yang, X., Cheng, K., Sun, J.: MetaPruning: meta learning for automatic neural network channel pruning. In: ICCV, pp. 3295–3304 (2019)
  33. Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: ICCV, pp. 2755–2763 (2017)
  34. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. In: ICLR (2016)
  35. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
  36. Ma, N., Zhang, X., Zheng, H.-T., Sun, J.: ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11218, pp. 122–138. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_8
  37. Mao, Y., You, C., Zhang, J., Huang, K., Letaief, K.B.: A survey on mobile edge computing: the communication perspective. IEEE CST 19, 2322–2358 (2017)
  38. Mehta, S., Rastegari, M.: MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv abs/2110.02178 (2021)
  39. Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H.: ESPNet: efficient spatial pyramid of dilated convolutions for semantic segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 561–580. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_34
  40. Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: ICCV, pp. 12159–12168 (2021)
  41. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115, 211–252 (2015)
  42. Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: CVPR, pp. 4510–4520 (2018)
  43. Shi, W., Cao, J., Zhang, Q., Li, Y., Xu, L.: Edge computing: vision and challenges. IEEE IoTJ 3, 637–646 (2016)
  44. Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck transformers for visual recognition. In: CVPR, pp. 16514–16524 (2021)
  45. Tan, M., Chen, B., Pang, R., Vasudevan, V., Le, Q.V.: MnasNet: platform-aware neural architecture search for mobile. In: CVPR, pp. 2815–2823 (2019)
  46. Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks. In: ICML (2019)
  47. Tan, M., Le, Q.V.: MixConv: mixed depthwise convolutional kernels. arXiv abs/1907.09595 (2019)
  48. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML (2021)
  49. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. In: ICCV, pp. 32–42 (2021)
  50. Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
  51. Wang, K., Liu, Z., Lin, Y., Lin, J., Han, S.: HAQ: hardware-aware automated quantization with mixed precision. In: CVPR, pp. 8604–8612 (2019)
  52. Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: ICCV, pp. 548–558 (2021)
  53. Wu, H., Xiao, B., Codella, N.C.F., Liu, M., Dai, X., Yuan, L., Zhang, L.: CvT: introducing convolutions to vision transformers. In: ICCV, pp. 22–31 (2021)
  54. Wu, Z., Liu, Z., Lin, J., Lin, Y., Han, S.: Lite transformer with long-short range attention. In: ICLR (2020)
  55. Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., Girshick, R.B.: Early convolutions help transformers see better. In: NeurIPS (2021)
  56. Yu, H., et al.: FedHAR: semi-supervised online learning for personalized federated human activity recognition. IEEE Transactions on Mobile Computing (2021)
  57. Yuan, L., et al.: Tokens-to-Token ViT: training vision transformers from scratch on ImageNet. In: ICCV, pp. 538–547 (2021)
  58. Zhang, H., Cissé, M., Dauphin, Y., Lopez-Paz, D.: mixup: beyond empirical risk minimization. arXiv abs/1710.09412 (2018)
  59. Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: AAAI (2020)
  60. Zhou, D., et al.: DeepViT: towards deeper vision transformer. arXiv abs/2103.11886 (2021)


Author information

Corresponding author

Correspondence to Yanwei Zheng.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Chen, Z., Zhong, F., Luo, Q., Zhang, X., Zheng, Y. (2022). EdgeViT: Efficient Visual Modeling for Edge Computing. In: Wang, L., Segal, M., Chen, J., Qiu, T. (eds) Wireless Algorithms, Systems, and Applications. WASA 2022. Lecture Notes in Computer Science, vol 13473. Springer, Cham. https://doi.org/10.1007/978-3-031-19211-1_33


  • DOI: https://doi.org/10.1007/978-3-031-19211-1_33


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19210-4

  • Online ISBN: 978-3-031-19211-1

  • eBook Packages: Computer Science, Computer Science (R0)
