Abstract
In recent years, driven by practical demands, lightweight models have become one of the most active research streams in face recognition. However, typical lightweight face recognition models become less effective when dealing with large variations in facial features (e.g., age variation, pose variation). In this paper, we present a lightweight face recognition model, named MobileFaceFormer. It combines the effectiveness of convolutional neural networks (CNNs) at capturing local features with the effectiveness of vision transformers at modeling global dependencies, yielding richer interpretations of facial features. To this end, a CNN branch and a vision transformer branch run in parallel, and a bi-directional feature fusion bridge connecting the two branches is designed to concurrently retain local facial features and global facial interpretations. To enhance feature interpretation on both branches, a convolutional token initialization method is proposed for the transformer branch to perceive long-range facial information, while depthwise separable convolutions and attention mechanisms are adopted in the CNN branch to strengthen local facial feature extraction. Further, an attentive global depthwise convolution (AGDC) is proposed to concentrate the model on key facial information. Experiments on standard face recognition benchmarks show that MobileFaceFormer achieves higher recognition performance, e.g., 99.60% on the LFW dataset compared to 99.28% for MobileFaceNets. Meanwhile, MobileFaceFormer has lower model complexity: in terms of computation cost, it requires 65M multiply-accumulate operations (MAdds) versus 221M for MobileFaceNets at a similar parameter size.
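The MAdds advantage claimed above rests largely on depthwise separable convolutions, which factor a standard convolution into a per-channel depthwise pass plus a 1×1 pointwise projection. The sketch below is illustrative only: the layer shape (56×56 map, 64→128 channels, 3×3 kernel) is a hypothetical example, not a layer from MobileFaceFormer, but it shows how the multiply-accumulate count is derived and why the separable form is an order of magnitude cheaper.

```python
def conv_madds(h, w, cin, cout, k):
    """MAdds for a standard k x k convolution over an h x w
    feature map (stride 1, 'same' padding)."""
    return h * w * cin * cout * k * k

def dw_separable_madds(h, w, cin, cout, k):
    """MAdds for a depthwise separable convolution: a k x k
    depthwise pass followed by a 1 x 1 pointwise projection."""
    depthwise = h * w * cin * k * k     # one k x k filter per input channel
    pointwise = h * w * cin * cout      # 1 x 1 cross-channel projection
    return depthwise + pointwise

# Hypothetical layer: 56x56 map, 64 -> 128 channels, 3x3 kernel.
std = conv_madds(56, 56, 64, 128, 3)
sep = dw_separable_madds(56, 56, 64, 128, 3)
print(f"standard: {std:,}  separable: {sep:,}  ratio: {std / sep:.1f}x")
```

For this example layer, the separable form needs roughly one eighth of the MAdds of the standard convolution; savings of this kind, accumulated over the whole CNN branch, are what keep the model's total cost in the tens of millions of MAdds.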
Data availability
The data are available from the corresponding author upon reasonable request.
Funding
This work was supported by the National Key Research and Development Program of China (2019YFB2204200) and the National Natural Science Foundation of China (U1832217).
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, J., Zhou, L. & Chen, J. MobileFaceFormer: a lightweight face recognition model against face variations. Multimed Tools Appl 83, 12669–12685 (2024). https://doi.org/10.1007/s11042-023-15954-1