
1 Introduction

Electricity Grid Engineering projects span large construction areas and long construction cycles. The handover and acceptance of construction materials cover the whole construction cycle, involve many handover points, and involve many participating units. These factors create risks for material storage and for confirming the identity of material handover personnel: handover responsibilities are difficult to clarify, and unauthorized personnel may carry out handovers.

With the continuous advance of power grid informatization reform and increasing information security requirements, it is necessary to informatize Electricity Grid Engineering processes and improve their intelligent management capability. Automatic identity verification of engineering personnel transforms material handover and the assignment of responsibility from a loose, extensive management mode to a centralized, lean one, forming a sound, centralized, lean, and efficient management system. An efficient and reliable face verification algorithm not only improves the management services of Electricity Grid Engineering but also effectively strengthens the information protection and security of its personnel.

Currently, high-precision face verification models are mostly built on deep convolutional neural networks. These models are trained on large amounts of data, are structurally complex, and contain a very large number of parameters, so they demand substantial computational resources. They are therefore difficult to deploy on the mobile and embedded devices that dominate Electricity Grid Engineering scenarios. Lightweight neural networks with low memory and computational footprints have consequently become a trend in current research.

Non-lightweight face verification networks such as DeepFace [1] and FaceNet [2] achieve high verification accuracy but are computationally intensive. To address the above problems, this paper proposes a lightweight face verification network based on Dynamic Convolution, using the lightweight network MobileNetV2 [3] as the baseline. By learning multiple sets of convolution kernels within a single convolution operation, the feature extraction capability of the lightweight network is improved, enabling it to reach good face verification accuracy. At the same time, the network adds only a very limited amount of computation to the baseline MobileNetV2 and meets the demand for real-time verification.

2 Dynamic Convolution-Based Face Verification Network

2.1 Dynamic Convolution

Dynamic Convolution is a network substructure [4] that can be easily embedded into existing network structures. The core idea is to give a convolution layer the ability to learn multiple groups of convolution kernels so that a single convolution operation has stronger feature extraction and representation capability. At the same time, an attention mechanism [5] is introduced to learn the weight of each group of convolutional kernel parameters, so that effective kernel parameters receive high weights and the remaining parameters receive low weights. This prompts the model to adaptively emphasize the high-weight kernel parameters according to the input, improving the performance of existing convolutional neural networks, especially lightweight ones. By introducing the Dynamic Convolution operation into a lightweight neural network, the network can extract and learn face features more efficiently. The overall structure of Dynamic Convolution is shown in Fig. 1.

Fig. 1. The overall structure of Dynamic Convolution

In the first step, the Squeeze operation is performed on the input channels. That is, the input features are spatially compressed so that each two-dimensional feature channel becomes a single real number with a global receptive field. The number of resulting outputs equals the number of input feature channels. The Squeeze operation used is global average pooling:

$${F}_{s}\left({u}_{k}\right)=\frac{1}{W\times H}\sum\nolimits_{i=1}^{W}\sum\nolimits_{j=1}^{H}{u}_{k}(i,j)$$
(1)

where \({u}_{k}\) is the input feature of the k-th channel, \(W\) and \(H\) are the width and height of an input channel feature, and \({F}_{s}\) is the result of the Squeeze operation, a vector whose length equals the number of channels k.
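As a minimal sketch (in PyTorch, which the paper uses for training), the Squeeze step of Eq. (1) is a global average pooling over the spatial dimensions; the function name is illustrative:

```python
import torch

def squeeze(u: torch.Tensor) -> torch.Tensor:
    """Squeeze step of Eq. (1): global average pooling.

    Compresses each 2-D feature channel of u, shaped (batch, k, H, W),
    into one real number with a global receptive field, giving (batch, k).
    """
    return u.mean(dim=(2, 3))
```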

In the second step, the Excitation operation is performed on the result of the Squeeze operation. It outputs the weight of each group of convolution kernel parameters, enabling the network to adaptively select appropriate convolution kernels according to the input features:

$${F}_{e}\left({F}_{s},W\right)=\sigma ({W}_{2}\delta ({W}_{1}{F}_{s}))$$
(2)

where \({W}_{1}\) and \({W}_{2}\) are the parameters of the fully connected layers. The dimension of \({W}_{1}\) is rk × k, where r is a scaling factor that reduces the hidden dimension to lower the computational cost of the attention mechanism; r = 0.25 is used in this paper. The dimension of \({W}_{2}\) is T × rk, yielding a vector of length T, where T is the number of groups of convolution kernel parameters. δ is the nonlinear activation function ReLU [6], and \(\sigma \) is the softmax function, which normalizes the output weight vector \({F}_{e}\) so that its entries lie in the interval [0, 1] and sum to 1; the length of \({F}_{e}\) is T.

In actual training, to ensure that all groups of convolutional kernel parameters participate in training from the start and to avoid falling into local optima early on, a temperature-controlled softmax is used:

$${F}_{e,t}= \frac{exp({F}_{e,t}/\tau )}{{\sum }_{j}exp({F}_{e,j}/\tau )}$$
(3)

where \(\tau \) is the temperature parameter. It is set to a larger value at the beginning of the training and decreases until it becomes 1 as the training progresses.
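A sketch of the temperature-controlled softmax of Eq. (3); the linear annealing schedule and the initial value of τ below are illustrative assumptions, since the text only states that τ starts large and decays to 1:

```python
import torch

def temperature_softmax(logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Eq. (3): softmax over the T kernel groups, softened by temperature tau."""
    return torch.softmax(logits / tau, dim=-1)

def anneal_tau(epoch: int, tau_init: float = 30.0, anneal_epochs: int = 10) -> float:
    """Hypothetical schedule: tau decays linearly from tau_init to 1."""
    return max(1.0, tau_init - (tau_init - 1.0) * epoch / anneal_epochs)
```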

In the third step, each group of convolution kernel parameters is weighted by the weights \({F}_{e}\) obtained from the Excitation operation to produce the actual convolution kernel parameters used in the convolution operation:

$$W=\sum\nolimits_{t=1}^{T}{F}_{e,t}{W}^{t},\quad b=\sum\nolimits_{t=1}^{T}{F}_{e,t}{b}^{t},\quad \mathrm{s.t.}\ 0\le {F}_{e,t}\le 1,\ \sum\nolimits_{t=1}^{T}{F}_{e,t}=1$$
(4)

where \({W}^{t}\) and \({b}^{t}\) are the t-th group of convolutional kernel parameters and \({F}_{e,t}\) is the t-th attention weight, corresponding to the probability of using the t-th group. The adaptive convolutional kernel parameters are obtained by weighting each group of parameters and summing. Because the weights produced by softmax have a probabilistic interpretation, the scale of the aggregated kernel parameters remains stable. The application of the attention mechanism allows the network to automatically adapt the parameters used for convolution to the input, greatly increasing the feature extraction and learning capability of the network.

Finally, the convolution operation is performed with the aggregated kernel parameters:

$${v}_{k}=W{u}_{k}+b$$
(5)

where \({u}_{k}\) is the convolutional input feature and \({v}_{k}\) is the output feature of the Dynamic Convolution. After the Dynamic Convolution, the features can be normalized with a standard Batch Normalization layer [7], followed by a nonlinear activation function such as ReLU or PReLU [8].
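Putting the three steps together, a minimal PyTorch sketch of a Dynamic Convolution layer might look as follows; the class, its default values, and the grouped-convolution trick for per-sample kernels are illustrative, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Sketch of Dynamic Convolution: T kernel groups aggregated by
    input-dependent attention weights, following Eqs. (1)-(5)."""

    def __init__(self, in_ch, out_ch, kernel_size, T=4, r=0.25,
                 stride=1, padding=0):
        super().__init__()
        self.T, self.stride, self.padding = T, stride, padding
        hidden = max(1, int(in_ch * r))      # reduced attention dimension rk
        self.fc1 = nn.Linear(in_ch, hidden)  # W1 in Eq. (2)
        self.fc2 = nn.Linear(hidden, T)      # W2 in Eq. (2)
        # T groups of kernel parameters W^t and b^t from Eq. (4)
        self.weight = nn.Parameter(
            torch.randn(T, out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(T, out_ch))

    def forward(self, u, tau=1.0):
        B, C, H, W = u.shape
        s = u.mean(dim=(2, 3))                # Squeeze, Eq. (1)
        a = self.fc2(F.relu(self.fc1(s)))     # Excitation, Eq. (2)
        a = torch.softmax(a / tau, dim=1)     # Eq. (3), shape (B, T)
        # Weighted aggregation of kernels per sample, Eq. (4)
        w = torch.einsum('bt,toihw->boihw', a, self.weight)
        b = torch.einsum('bt,to->bo', a, self.bias)
        # Grouped convolution trick: apply a different kernel to each sample
        u = u.reshape(1, B * C, H, W)
        w = w.reshape(B * w.shape[1], C, w.shape[3], w.shape[4])
        v = F.conv2d(u, w, b.reshape(-1), stride=self.stride,
                     padding=self.padding, groups=B)  # Eq. (5)
        return v.reshape(B, -1, v.shape[2], v.shape[3])
```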

2.2 Bottleneck Layer Structure Design

To solve the degradation problem of deep neural networks and accelerate network convergence, MobileNetV2 introduces the Inverted Residuals Block bottleneck structure [3], as shown in Fig. 2. The traditional residual structure [9] is shaped like an hourglass, narrow in the middle and wide at the ends, so only a small number of convolutional kernels are used to extract features in the middle, which weakens feature extraction. Since the number of convolutional kernels in each layer of a lightweight feature extraction network is already limited, using the traditional residual structure would prevent the network from extracting enough information and degrade its performance. Therefore, this paper uses the inverted residual structure, shaped like a spindle with a wide middle and narrow ends: the feature data are first up-dimensioned by a 1 × 1 Conv, features are then extracted by a convolution operation, and the data are finally down-dimensioned again by a 1 × 1 Conv. This preserves the feature extraction effect while keeping the parameters and computation of the network under control.

The backbone of the Inverted Residuals Block is divided into three main blocks. The first block has a structure similar to the third, consisting of a 1 × 1 Conv, BN, and ReLU6. The 1 × 1 Conv is a convolutional layer with kernel size 1, used mainly to change the number of feature channels. BN is the Batch Normalization layer, which normalizes the features after the convolution. ReLU6 is the activation function, which provides the nonlinear mapping capability. Note that the third block does not contain an activation function. The second block consists of a 3 × 3 DwiseConv, BN, and ReLU6 [10], where 3 × 3 DwiseConv is a Depthwise Convolution with kernel size 3 [11] (Fig. 2).

Fig. 2. The Inverted Residuals Block

The Inverted Residuals Block is an important component of MobileNetV2. With a large number of Inverted Residuals Blocks, the input information can flow sufficiently through the network, giving the network enough parameters to understand the input and record its characteristics. In this structure, we empirically replace the 1 × 1 convolution in the third block with a Dynamic Convolution layer. On the one hand, this replacement alone is sufficient to improve the face verification performance of MobileNetV2. On the other hand, although Dynamic Convolution adds very few operations, it adds a considerable number of parameters. Replacing only the last 1 × 1 convolutional layer in the Inverted Residuals Block therefore also prevents the model from growing so large that it could no longer be deployed on grid-side devices. The modified block is called the Dynamic Inverted Residuals Block.
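A sketch of the resulting block, reusing the DynamicConv2d sketch above; the layer layout follows the description in the text, while the default expansion factor and stride are placeholders:

```python
import torch.nn as nn

class DynamicInvertedResidual(nn.Module):
    """Sketch of the Dynamic Inverted Residuals Block: the last 1x1
    convolution of the Inverted Residuals Block is replaced by Dynamic
    Convolution; the third block carries no activation function."""

    def __init__(self, in_ch, out_ch, expand=6, stride=1):
        super().__init__()
        mid = in_ch * expand
        self.use_res = stride == 1 and in_ch == out_ch
        self.block1 = nn.Sequential(  # 1x1 Conv + BN + ReLU6 (up-dimension)
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True))
        self.block2 = nn.Sequential(  # 3x3 DwiseConv + BN + ReLU6
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                      groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True))
        self.block3 = DynamicConv2d(mid, out_ch, kernel_size=1)  # down-dimension
        self.bn3 = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        out = self.bn3(self.block3(self.block2(self.block1(x))))
        return x + out if self.use_res else out
```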

2.3 Network Architecture Design

The input image size used in this paper is 112 × 112. Based on MobileNetV2, the Inverted Residuals Blocks are replaced with the Dynamic Inverted Residuals Blocks described above. As shown in Table 1, the network consists of four parts. The first part applies an ordinary convolution with kernel size 3, stride 2, padding 1, and 64 output channels to obtain a 56 × 56 feature map rich in face feature information. The second part consists of six Dynamic Inverted Residuals Blocks in different configurations. The third part contains three convolution operations: first, a 1 × 1 convolution expands the number of feature channels, outputting a 7 × 7 feature map with 512 channels; then a 7 × 7 convolution layer produces 512 features of size 1 × 1; finally, a 1 × 1 convolution performs a feature transform, and after flattening, a 512-dimensional face feature vector is obtained. The fourth part is a fully connected layer that implements face classification during training.

Table 1. Network structure

In Table 1, op denotes the operation, e is the channel expansion factor, c is the number of output channels (dimensions), d indicates whether dropout is used, r is the number of repetitions of the block, and s is the stride (only the first repetition of a block uses stride s; the remaining repetitions use stride 1).
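As a rough sketch of the third part (the embedding head) described above: the text does not specify whether the 7 × 7 convolution is depthwise, so the depthwise choice below, which keeps the head lightweight, is an assumption:

```python
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Sketch of the third part: 1x1 expand -> 7x7 conv -> 1x1 transform,
    flattened into a 512-dimensional face feature vector."""

    def __init__(self, in_ch, emb_dim=512):
        super().__init__()
        self.expand = nn.Conv2d(in_ch, 512, 1)           # 7x7x512 feature map
        self.conv7 = nn.Conv2d(512, 512, 7, groups=512)  # 512 features of 1x1
        self.transform = nn.Conv2d(512, emb_dim, 1)      # feature transform
        self.flatten = nn.Flatten()

    def forward(self, x):
        return self.flatten(self.transform(self.conv7(self.expand(x))))
```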

3 Analysis of Experimental Results

3.1 Data Set and Experimental Setup

The public dataset CASIA-WebFace [12] contains 494,414 images of 10,575 individuals. In this paper, CASIA-WebFace is used as the training dataset, and the face verification benchmark LFW [13] is used to evaluate the algorithm under different conditions. LFW contains 13,233 face images of 5,749 people, covering variations in pose, illumination, and background. There is no overlap between the training and test data.

The input face image size of the model is 112 × 112, so the data must be processed before the face recognition network is trained. A face detection algorithm extracts the coordinates of face regions and key points; based on these coordinates, the face is aligned for correction, and the aligned face image is scaled to 112 × 112. Data augmentation includes image mirroring, translation, and adjustments of brightness, color, contrast, and sharpness. Before training, each face image is normalized by subtracting 127.5 from the pixel values and then dividing by 128.
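The normalization step can be written directly; the function name is illustrative:

```python
import numpy as np

def normalize_face(img: np.ndarray) -> np.ndarray:
    """Normalize an aligned 112x112 face image: (pixel - 127.5) / 128."""
    return (img.astype(np.float32) - 127.5) / 128.0
```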

The experimental hardware platform runs the Ubuntu 18.04 operating system with an Intel Core CPU and an NVIDIA Tesla V100 graphics card. The experiments in this paper are based on the PyTorch deep learning framework [14].

All experiments in this paper are trained with a stochastic gradient descent optimizer [15]. To speed up convergence and reduce oscillation during model convergence, a momentum factor of 0.9 is used; the weight decay is set to 5e−4 and the initial learning rate to 0.01. The learning rate is multiplied by 0.1 at epochs 40, 50, and 60, and the model is trained for 70 epochs in total.
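In PyTorch, this training setup corresponds roughly to the following, where model stands in for the network described in Sect. 2.3:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# Multiply the learning rate by 0.1 at epochs 40, 50, and 60.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40, 50, 60], gamma=0.1)

for epoch in range(70):  # 70 epochs in total
    # ... one training pass over CASIA-WebFace ...
    scheduler.step()
```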

The loss function used during training is the AdaCos [16] adaptive-scale loss. Compared with loss functions commonly used for face recognition, such as CosFace [17] and ArcFace [18], AdaCos achieves good optimization without manual tuning of the loss hyperparameters.
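To illustrate why no manual scale tuning is needed, here is a minimal sketch of the fixed-scale variant of AdaCos, whose scale is derived analytically from the number of classes (the paper uses the adaptive variant, which further updates the scale during training):

```python
import math
import torch
import torch.nn.functional as F

def adacos_fixed_loss(embeddings, class_weights, labels):
    """Fixed AdaCos sketch: cosine-softmax loss with analytic scale
    s = sqrt(2) * log(C - 1), where C is the number of classes."""
    num_classes = class_weights.shape[0]
    s = math.sqrt(2.0) * math.log(num_classes - 1)
    cos = F.linear(F.normalize(embeddings), F.normalize(class_weights))
    return F.cross_entropy(s * cos, labels)
```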

3.2 Analysis of Experimental Results

The comparison between the lightweight face recognition algorithm model based on Dynamic Inverted Residuals Block and the baseline network MobileNetV2 on the LFW validation set is shown in Table 2.

Table 2. The comparison on the LFW validation set

As shown in Table 2, introducing Dynamic Convolution increases the computation from 292.6M to 305.3M, only a 4.34% increase, while the face recognition accuracy rises from 98.58% to 99.28%, a substantial 50.7% reduction in error rate. Such an improvement is nontrivial for a task like face recognition, where accuracy is already close to saturation. The number of model parameters and the forward inference time remain at the same order of magnitude as the baseline network, ensuring that the model can be deployed on all types of grid-side end devices.

In order to fully verify the performance of this algorithm model, an experimental comparison with the current mainstream algorithms in the field of face recognition was conducted, as shown in Table 3.

Table 3. The comparison with other algorithms

LMobileNetE and Light CNN achieve higher recognition accuracy, but their training datasets contain 4M and 3.8M images, and their parameter counts are 12.8M and 26.7M, an order of magnitude higher than the model in this paper, making them significantly harder to migrate to mobile platforms. MobileID and ShuffleNet have smaller models but weaker performance, failing to reach 99% accuracy, which does not meet the standards required by Electricity Grid Engineering. By introducing Dynamic Convolution, the model proposed in this paper achieves a good trade-off among recognition accuracy, computation, and model size, meeting the accuracy requirements while remaining efficient on mobile devices.

4 Conclusion

In this paper, we propose a lightweight face recognition network based on Dynamic Convolution to address the personnel management problems common in Electricity Grid Engineering. The Dynamic Convolution operation not only gives a single convolution richer feature extraction and learning capability but also makes the convolution operation adaptive, automatically constructing different convolution kernel parameters for different inputs. Experiments show that the proposed lightweight face recognition network achieves a good balance between operational efficiency and recognition accuracy.