Abstract
With the rapid growth of edge intelligence, deep neural networks must be computed ever more efficiently. Visual intelligence, as a core component of artificial intelligence, particularly merits further exploration. As the cornerstone of modern visual modeling, convolutional neural networks (CNNs) have developed greatly over the past decades, and light-weight CNN variants have been proposed to address the challenge of heavy computation in mobile settings. Although CNNs’ spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks, these representations are spatially local. To reach the next level of model performance, the vision transformer (ViT) has become a viable alternative owing to the potential of its multi-head attention mechanism. In this work, we introduce EdgeViT, an accelerated deep visual modeling method that combines the benefits of CNNs and ViTs in a light-weight, edge-friendly manner. On the ImageNet-1k dataset, our method achieves 77.8% top-1 accuracy with only 2.3 million parameters and 79.2% with 5.6 million parameters. On PASCAL VOC segmentation, it reaches an mIoU of up to 78.3 with only 3.1 million parameters, half the parameter budget of MobileViT.
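To make the hybrid CNN/ViT idea concrete, below is a minimal PyTorch sketch of a block that pairs a local, depthwise-convolutional branch with a global multi-head self-attention branch, in the spirit of light-weight hybrid designs such as MobileViT. This is an illustrative assumption only: the block name `HybridBlock` and all hyperparameters are ours, not the actual EdgeViT architecture.

```python
# Illustrative sketch (NOT the EdgeViT architecture): a light-weight block that
# mixes a local convolutional branch with global multi-head self-attention,
# in the spirit of hybrid CNN/ViT designs. All names and sizes are assumed.
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local branch: depthwise 3x3 conv followed by pointwise 1x1 conv,
        # the standard light-weight (MobileNet-style) factorization.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 1),
        )
        # Global branch: multi-head self-attention over spatial tokens.
        # channels must be divisible by num_heads.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.local(x)                  # local (convolutional) residual
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) spatial tokens
        t = self.norm(tokens)
        attn_out, _ = self.attn(t, t, t)       # global (attention) residual
        tokens = tokens + attn_out
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = HybridBlock(channels=64)
    y = block(torch.randn(1, 64, 32, 32))
    print(y.shape)  # torch.Size([1, 64, 32, 32])
```

The convolutional branch supplies the spatial inductive bias cheaply, while the attention branch models long-range dependencies; keeping both as residual updates preserves resolution and keeps the parameter count low.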
References
Brown, T.B., et al.: Language models are few-shot learners. In: NeurIPS (2020)
Chen, C.F., Fan, Q., Panda, R.: CrossViT: cross-attention multi-scale vision transformer for image classification. In: ICCV, pp. 347–356 (2021)
Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv abs/1706.05587 (2017)
Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z.: Mobile-Former: bridging MobileNet and transformer. arXiv abs/2108.05895 (2021)
Chen, Z., Chen, D., Yuan, Z., Cheng, X., Zhang, X.: Learning graph structures with transformer for multivariate time-series anomaly detection in IoT. IEEE Internet Things J. 9, 9179–9189 (2022)
Chen, Z., Jiaze, E., Zhang, X., Sheng, H., Cheng, X.: Multi-task time series forecasting with shared attention. In: ICDMW, pp. 917–925 (2020)
Chen, Z., Shi, M., Zhang, X., Ying, H.: ASM2TV: an adaptive semi-supervised multi-task multi-view learning framework. In: AAAI (2022)
Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: CVPR, pp. 1800–1807 (2017)
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv abs/1602.02830 (2016)
Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: AutoAugment: learning augmentation strategies from data. In: CVPR, pp. 113–123 (2019)
Dai, Z., Liu, H., Le, Q.V., Tan, M.: CoAtNet: marrying convolution and attention for all data sizes. In: NeurIPS (2021)
d’Ascoli, S., Touvron, H., Leavitt, M.L., Morcos, A.S., Biroli, G., Sagun, L.: ConViT: improving vision transformers with soft convolutional inductive biases. In: ICML (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)
Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: a retrospective. IJCV 111, 98–136 (2014)
Graham, B., et al.: LeViT: a vision transformer in convnet’s clothing for faster inference. In: ICCV, pp. 12239–12249 (2021)
Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In: ICLR (2016)
Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for efficient neural network. In: NeurIPS (2015)
Hariharan, B., Arbeláez, P., Bourdev, L.D., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: ICCV, pp. 991–998 (2011)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., Han, S.: AMC: AutoML for model compression and acceleration on mobile devices. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 815–832. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_48
He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: ICCV, pp. 1398–1406 (2017)
Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J.: Rethinking spatial dimensions of vision transformers. In: ICCV, pp. 11916–11925 (2021)
Howard, A.G., et al.: Searching for MobileNetV3. In: ICCV, pp. 1314–1324 (2019)
Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv abs/1704.04861 (2017)
Huang, G., Liu, Z., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR, pp. 2261–2269 (2017)
Jacob, B., et al.: Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: CVPR, pp. 2704–2713 (2018)
Jin, J., Dundar, A., Culurciello, E.: Flattened convolutional neural networks for feedforward acceleration. arXiv abs/1412.5474 (2014)
Li, Y., Zhang, K., Cao, J., Timofte, R., Gool, L.V.: LocalViT: bringing locality to vision transformers. arXiv abs/2104.05707 (2021)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
Liu, Z., Mu, H., Zhang, X., Guo, Z., Yang, X., Cheng, K., Sun, J.: MetaPruning: meta learning for automatic neural network channel pruning. In: ICCV, pp. 3295–3304 (2019)
Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: ICCV, pp. 2755–2763 (2017)
Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. In: ICLR (2017)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
Ma, N., Zhang, X., Zheng, H.-T., Sun, J.: ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11218, pp. 122–138. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_8
Mao, Y., You, C., Zhang, J., Huang, K., Letaief, K.B.: A survey on mobile edge computing: the communication perspective. IEEE CST 19, 2322–2358 (2017)
Mehta, S., Rastegari, M.: MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv abs/2110.02178 (2021)
Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H.: ESPNet: efficient spatial pyramid of dilated convolutions for semantic segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 561–580. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_34
Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: ICCV, pp. 12159–12168 (2021)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115, 211–252 (2015)
Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: CVPR, pp. 4510–4520 (2018)
Shi, W., Cao, J., Zhang, Q., Li, Y., Xu, L.: Edge computing: vision and challenges. IEEE IoTJ 3, 637–646 (2016)
Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck transformers for visual recognition. In: CVPR, pp. 16514–16524 (2021)
Tan, M., Chen, B., Pang, R., Vasudevan, V., Le, Q.V.: MnasNet: platform-aware neural architecture search for mobile. In: CVPR, pp. 2815–2823 (2019)
Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks. In: ICML (2019)
Tan, M., Le, Q.V.: MixConv: mixed depthwise convolutional kernels. arXiv abs/1907.09595 (2019)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML (2021)
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. In: ICCV, pp. 32–42 (2021)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Wang, K., Liu, Z., Lin, Y., Lin, J., Han, S.: HAQ: hardware-aware automated quantization with mixed precision. In: CVPR, pp. 8604–8612 (2019)
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: ICCV, pp. 548–558 (2021)
Wu, H., Xiao, B., Codella, N.C.F., Liu, M., Dai, X., Yuan, L., Zhang, L.: CvT: introducing convolutions to vision transformers. In: ICCV, pp. 22–31 (2021)
Wu, Z., Liu, Z., Lin, J., Lin, Y., Han, S.: Lite transformer with long-short range attention. In: ICLR (2020)
Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., Girshick, R.B.: Early convolutions help transformers see better. In: NeurIPS (2021)
Yu, H., et al.: FedHAR: semi-supervised online learning for personalized federated human activity recognition. IEEE TMC (2021)
Yuan, L., et al.: Tokens-to-token ViT: training vision transformers from scratch on ImageNet. In: ICCV, pp. 538–547 (2021)
Zhang, H., Cissé, M., Dauphin, Y., Lopez-Paz, D.: mixup: beyond empirical risk minimization. arXiv abs/1710.09412 (2018)
Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: AAAI (2020)
Zhou, D., et al.: DeepViT: towards deeper vision transformer. arXiv abs/2103.11886 (2021)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Chen, Z., Zhong, F., Luo, Q., Zhang, X., Zheng, Y. (2022). EdgeViT: Efficient Visual Modeling for Edge Computing. In: Wang, L., Segal, M., Chen, J., Qiu, T. (eds) Wireless Algorithms, Systems, and Applications. WASA 2022. Lecture Notes in Computer Science, vol 13473. Springer, Cham. https://doi.org/10.1007/978-3-031-19211-1_33
Print ISBN: 978-3-031-19210-4
Online ISBN: 978-3-031-19211-1