Abstract
With the rapid growth of edge intelligence, deep neural networks must be computed ever more efficiently. Visual intelligence, as a core component of artificial intelligence, particularly merits further exploration. As the cornerstone of modern visual modeling, convolutional neural networks (CNNs) have developed greatly over the past decades, and light-weight CNN variants have been proposed to address the challenge of heavy computation in mobile settings. Although CNNs’ spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks, these representations are spatially local. To reach the next level of model performance, the vision transformer (ViT) has become a viable alternative owing to the potential of its multi-head attention mechanism. In this work, we introduce EdgeViT, an accelerated deep visual modeling method that combines the benefits of CNNs and ViTs in a light-weight, edge-friendly manner. On the ImageNet-1k dataset, our method achieves 77.8% top-1 accuracy with only 2.3 million parameters and 79.2% with 5.6 million parameters. On PASCAL VOC segmentation, it reaches an mIoU of up to 78.3 with only 3.1 million parameters, half the parameter budget of MobileViT.
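To make the hybrid CNN/ViT idea concrete, below is a minimal PyTorch sketch of a block that pairs a local, depthwise-convolutional branch with a global multi-head self-attention branch, in the spirit of light-weight hybrid designs such as MobileViT. This is an illustrative assumption only: the block name `HybridBlock` and all hyperparameters are ours, not the actual EdgeViT architecture.

```python
# Illustrative sketch (NOT the EdgeViT architecture): a light-weight block that
# mixes a local convolutional branch with global multi-head self-attention,
# in the spirit of hybrid CNN/ViT designs. All names and sizes are assumed.
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local branch: depthwise 3x3 conv followed by pointwise 1x1 conv,
        # the standard light-weight (MobileNet-style) factorization.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 1),
        )
        # Global branch: multi-head self-attention over spatial tokens.
        # channels must be divisible by num_heads.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.local(x)                  # local (convolutional) residual
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) spatial tokens
        t = self.norm(tokens)
        attn_out, _ = self.attn(t, t, t)       # global (attention) residual
        tokens = tokens + attn_out
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = HybridBlock(channels=64)
    y = block(torch.randn(1, 64, 32, 32))
    print(y.shape)  # torch.Size([1, 64, 32, 32])
```

The convolutional branch supplies the spatial inductive bias cheaply, while the attention branch models long-range dependencies; keeping both as residual updates preserves resolution and keeps the parameter count low.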
References
Brown, T.B., et al.: Language models are few-shot learners. In: NeurIPS (2020)
Chen, C.F., Fan, Q., Panda, R.: CrossViT: cross-attention multi-scale vision transformer for image classification. In: ICCV, pp. 347–356 (2021)
Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv abs/1706.05587 (2017)
Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z.: Mobile-Former: bridging MobileNet and transformer. arXiv abs/2108.05895 (2021)
Chen, Z., Chen, D., Yuan, Z., Cheng, X., Zhang, X.: Learning graph structures with transformer for multivariate time-series anomaly detection in IoT. IEEE Internet Things J. 9, 9179–9189 (2022)
Chen, Z., Jiaze, E., Zhang, X., Sheng, H., Cheng, X.: Multi-task time series forecasting with shared attention. In: ICDMW, pp. 917–925 (2020)
Chen, Z., Shi, M., Zhang, X., Ying, H.: ASM2TV: an adaptive semi-supervised multi-task multi-view learning framework. In: AAAI (2022)
Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: CVPR, pp. 1800–1807 (2017)
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv abs/1602.02830 (2016)
Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: AutoAugment: learning augmentation strategies from data. In: CVPR, pp. 113–123 (2019)
Dai, Z., Liu, H., Le, Q.V., Tan, M.: CoAtNet: marrying convolution and attention for all data sizes. In: NeurIPS (2021)
d’Ascoli, S., Touvron, H., Leavitt, M.L., Morcos, A.S., Biroli, G., Sagun, L.: ConViT: improving vision transformers with soft convolutional inductive biases. In: ICML (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)
Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: a retrospective. IJCV 111, 98–136 (2014)
Graham, B., et al.: LeViT: a vision transformer in convnet’s clothing for faster inference. In: ICCV, pp. 12239–12249 (2021)
Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In: ICLR (2016)
Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for efficient neural network. In: NeurIPS (2015)
Hariharan, B., Arbeláez, P., Bourdev, L.D., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: ICCV, pp. 991–998 (2011)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., Han, S.: AMC: AutoML for model compression and acceleration on mobile devices. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 815–832. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_48
He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: ICCV, pp. 1398–1406 (2017)
Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J.: Rethinking spatial dimensions of vision transformers. In: ICCV, pp. 11916–11925 (2021)
Howard, A.G., et al.: Searching for MobileNetV3. In: ICCV, pp. 1314–1324 (2019)
Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv abs/1704.04861 (2017)
Huang, G., Liu, Z., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR, pp. 2261–2269 (2017)
Jacob, B., et al.: Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: CVPR, pp. 2704–2713 (2018)
Jin, J., Dundar, A., Culurciello, E.: Flattened convolutional neural networks for feedforward acceleration. arXiv abs/1412.5474 (2014)
Li, Y., Zhang, K., Cao, J., Timofte, R., Gool, L.V.: LocalViT: bringing locality to vision transformers. arXiv abs/2104.05707 (2021)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
Liu, Z., Mu, H., Zhang, X., Guo, Z., Yang, X., Cheng, K., Sun, J.: MetaPruning: meta learning for automatic neural network channel pruning. In: ICCV, pp. 3295–3304 (2019)
Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: ICCV, pp. 2755–2763 (2017)
Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. In: ICLR (2017)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
Ma, N., Zhang, X., Zheng, H.-T., Sun, J.: ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11218, pp. 122–138. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_8
Mao, Y., You, C., Zhang, J., Huang, K., Letaief, K.B.: A survey on mobile edge computing: the communication perspective. IEEE CST 19, 2322–2358 (2017)
Mehta, S., Rastegari, M.: MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv abs/2110.02178 (2021)
Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H.: ESPNet: efficient spatial pyramid of dilated convolutions for semantic segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 561–580. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_34
Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: ICCV, pp. 12159–12168 (2021)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115, 211–252 (2015)
Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: CVPR, pp. 4510–4520 (2018)
Shi, W., Cao, J., Zhang, Q., Li, Y., Xu, L.: Edge computing: vision and challenges. IEEE IoTJ 3, 637–646 (2016)
Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck transformers for visual recognition. In: CVPR, pp. 16514–16524 (2021)
Tan, M., Chen, B., Pang, R., Vasudevan, V., Le, Q.V.: MnasNet: platform-aware neural architecture search for mobile. In: CVPR, pp. 2815–2823 (2019)
Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks. In: ICML (2019)
Tan, M., Le, Q.V.: MixConv: mixed depthwise convolutional kernels. arXiv abs/1907.09595 (2019)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML (2021)
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. In: ICCV, pp. 32–42 (2021)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Wang, K., Liu, Z., Lin, Y., Lin, J., Han, S.: HAQ: hardware-aware automated quantization with mixed precision. In: CVPR, pp. 8604–8612 (2019)
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: ICCV, pp. 548–558 (2021)
Wu, H., Xiao, B., Codella, N.C.F., Liu, M., Dai, X., Yuan, L., Zhang, L.: CvT: introducing convolutions to vision transformers. In: ICCV, pp. 22–31 (2021)
Wu, Z., Liu, Z., Lin, J., Lin, Y., Han, S.: Lite transformer with long-short range attention. In: ICLR (2020)
Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., Girshick, R.B.: Early convolutions help transformers see better. In: NeurIPS (2021)
Yu, H., et al.: FedHAR: semi-supervised online learning for personalized federated human activity recognition. IEEE TMC (2021)
Yuan, L., et al.: Tokens-to-token ViT: training vision transformers from scratch on ImageNet. In: ICCV, pp. 538–547 (2021)
Zhang, H., Cissé, M., Dauphin, Y., Lopez-Paz, D.: mixup: beyond empirical risk minimization. arXiv abs/1710.09412 (2018)
Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: AAAI (2020)
Zhou, D., et al.: DeepViT: towards deeper vision transformer. arXiv abs/2103.11886 (2021)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Chen, Z., Zhong, F., Luo, Q., Zhang, X., Zheng, Y. (2022). EdgeViT: Efficient Visual Modeling for Edge Computing. In: Wang, L., Segal, M., Chen, J., Qiu, T. (eds) Wireless Algorithms, Systems, and Applications. WASA 2022. Lecture Notes in Computer Science, vol 13473. Springer, Cham. https://doi.org/10.1007/978-3-031-19211-1_33
Print ISBN: 978-3-031-19210-4
Online ISBN: 978-3-031-19211-1