
1 Introduction

Electricity Grid Engineering projects span large construction areas and long construction cycles. The handover and acceptance of construction materials cover the whole construction cycle, involve many handover points, and involve many participating units. These factors create risks for material storage and for confirming the identity of material handover personnel: handover responsibilities are difficult to clarify, and unauthorized personnel may carry out handovers.

With the continuous advance of power grid informatization reform and increasing information security requirements, it is necessary to informatize Electricity Grid Engineering processes and improve their intelligent management capability. Automatic identity verification of engineering personnel transforms material handover and the assignment of responsibility from a loose, extensive management mode to a centralized, lean one, forming a sound, centralized, lean, and efficient management system. An efficient and reliable face verification algorithm not only improves the management services of Electricity Grid Engineering but also effectively strengthens the information protection and security of its personnel.

Currently, high-precision face verification models are mostly built on deep convolutional neural networks. These models are trained on large amounts of data, are structurally complex, and contain a very large number of parameters, so they demand substantial computational resources. They are therefore difficult to deploy on the mobile and embedded devices that dominate Electricity Grid Engineering scenarios. Lightweight neural networks with low memory and computational footprints have consequently become a trend in current research.

Non-lightweight face verification networks such as DeepFace [1] and FaceNet [2] achieve high verification accuracy but are computationally intensive. To address the above problems, this paper proposes a lightweight face verification network based on Dynamic Convolution, using the lightweight network MobileNetV2 [3] as the baseline. By learning multiple sets of convolution kernels within a single convolution operation, the feature extraction capability of the lightweight network is improved, enabling it to reach good face verification accuracy. At the same time, the network adds only a very limited amount of computation to the baseline MobileNetV2 and meets the demand for real-time verification.

2 Dynamic Convolution-Based Face Verification Network

2.1 Dynamic Convolution

Dynamic Convolution is a network substructure [4] that can be easily embedded into existing network structures. The core idea is to give a convolution layer the ability to learn multiple groups of convolution kernels so that a single convolution operation has stronger feature extraction and representation capability. At the same time, an attention mechanism [5] is introduced to learn the weight of each group of convolutional kernel parameters, so that effective kernel parameters receive high weights and the remaining parameters receive low weights. This prompts the model to adaptively emphasize the high-weight kernel parameters according to the input, improving the performance of existing convolutional neural networks, especially lightweight ones. By introducing the Dynamic Convolution operation into a lightweight neural network, the network can extract and learn face features more efficiently. The overall structure of Dynamic Convolution is shown in Fig. 1.

Fig. 1. The overall structure of Dynamic Convolution

In the first step, the Squeeze operation is performed on the input channels. That is, the input features are spatially compressed so that each two-dimensional feature channel becomes a single real number with a global receptive field. The number of resulting outputs equals the number of input feature channels. The Squeeze operation used is global average pooling:

$${F}_{s}\left({u}_{k}\right)=\frac{1}{W\times H}\sum\nolimits_{i=1}^{W}\sum\nolimits_{j=1}^{H}{u}_{k}(i,j)$$
(1)

where \({u}_{k}\) is the input feature of the k-th channel, \(W\) and \(H\) are the width and height of an input channel feature, and \({F}_{s}\) is the result of the Squeeze operation, a vector whose length equals the number of channels k.
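As a minimal sketch (in PyTorch, which the paper uses for training), the Squeeze step of Eq. (1) is a global average pooling over the spatial dimensions; the function name is illustrative:

```python
import torch

def squeeze(u: torch.Tensor) -> torch.Tensor:
    """Squeeze step of Eq. (1): global average pooling.

    Compresses each 2-D feature channel of u, shaped (batch, k, H, W),
    into one real number with a global receptive field, giving (batch, k).
    """
    return u.mean(dim=(2, 3))
```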

In the second step, the Excitation operation is performed on the result of the Squeeze operation. It outputs the weight of each group of convolution kernel parameters, enabling the network to adaptively select appropriate convolution kernels according to the input features:

$${F}_{e}\left({F}_{s},W\right)=\sigma ({W}_{2}\delta ({W}_{1}{F}_{s}))$$
(2)

where \({W}_{1}\) and \({W}_{2}\) are the parameters of the fully connected layers. The dimension of \({W}_{1}\) is rk × k, where r is a scaling factor that reduces the hidden dimension to lower the computational cost of the attention mechanism; r = 0.25 is used in this paper. The dimension of \({W}_{2}\) is T × rk, yielding a vector of length T, where T is the number of groups of convolution kernel parameters. δ is the nonlinear activation function ReLU [6], and \(\sigma \) is the softmax function, which normalizes the output weight vector \({F}_{e}\) so that its entries lie in the interval [0, 1] and sum to 1; the length of \({F}_{e}\) is T.

In actual training, to ensure that all groups of convolutional kernel parameters participate in training from the start and to avoid falling into local optima early on, a temperature-controlled softmax is used:

$${F}_{e,t}= \frac{exp({F}_{e,t}/\tau )}{{\sum }_{j}exp({F}_{e,j}/\tau )}$$
(3)

where \(\tau \) is the temperature parameter. It is set to a larger value at the beginning of the training and decreases until it becomes 1 as the training progresses.
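A sketch of the temperature-controlled softmax of Eq. (3); the linear annealing schedule and the initial value of τ below are illustrative assumptions, since the text only states that τ starts large and decays to 1:

```python
import torch

def temperature_softmax(logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Eq. (3): softmax over the T kernel groups, softened by temperature tau."""
    return torch.softmax(logits / tau, dim=-1)

def anneal_tau(epoch: int, tau_init: float = 30.0, anneal_epochs: int = 10) -> float:
    """Hypothetical schedule: tau decays linearly from tau_init to 1."""
    return max(1.0, tau_init - (tau_init - 1.0) * epoch / anneal_epochs)
```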

In the third step, each group of convolution kernel parameters is weighted by the weights \({F}_{e}\) obtained from the Excitation operation to produce the actual convolution kernel parameters used in the convolution operation:

$$W=\sum\nolimits_{t=1}^{T}{F}_{e,t}{W}^{t},\quad b=\sum\nolimits_{t=1}^{T}{F}_{e,t}{b}^{t},\quad \mathrm{s.t.}\ 0\le {F}_{e,t}\le 1,\ \sum\nolimits_{t=1}^{T}{F}_{e,t}=1$$
(4)

where \({W}^{t}\) and \({b}^{t}\) are the t-th group of convolutional kernel parameters and \({F}_{e,t}\) is the t-th attention weight, corresponding to the probability of using the t-th group. The adaptive convolutional kernel parameters are obtained by weighting each group of parameters and summing. Because the weights produced by softmax have a probabilistic interpretation, the scale of the aggregated kernel parameters remains stable. The application of the attention mechanism allows the network to automatically adapt the parameters used for convolution to the input, greatly increasing the feature extraction and learning capability of the network.

Finally, the convolution operation is performed with the aggregated kernel parameters:

$${v}_{k}=W{u}_{k}+b$$
(5)

where \({u}_{k}\) is the convolutional input feature and \({v}_{k}\) is the output feature of the Dynamic Convolution. After the Dynamic Convolution, the features can be normalized with a standard Batch Normalization layer [7], followed by a nonlinear activation function such as ReLU or PReLU [8].
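Putting the three steps together, a minimal PyTorch sketch of a Dynamic Convolution layer might look as follows; the class, its default values, and the grouped-convolution trick for per-sample kernels are illustrative, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Sketch of Dynamic Convolution: T kernel groups aggregated by
    input-dependent attention weights, following Eqs. (1)-(5)."""

    def __init__(self, in_ch, out_ch, kernel_size, T=4, r=0.25,
                 stride=1, padding=0):
        super().__init__()
        self.T, self.stride, self.padding = T, stride, padding
        hidden = max(1, int(in_ch * r))      # reduced attention dimension rk
        self.fc1 = nn.Linear(in_ch, hidden)  # W1 in Eq. (2)
        self.fc2 = nn.Linear(hidden, T)      # W2 in Eq. (2)
        # T groups of kernel parameters W^t and b^t from Eq. (4)
        self.weight = nn.Parameter(
            torch.randn(T, out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(T, out_ch))

    def forward(self, u, tau=1.0):
        B, C, H, W = u.shape
        s = u.mean(dim=(2, 3))                # Squeeze, Eq. (1)
        a = self.fc2(F.relu(self.fc1(s)))     # Excitation, Eq. (2)
        a = torch.softmax(a / tau, dim=1)     # Eq. (3), shape (B, T)
        # Weighted aggregation of kernels per sample, Eq. (4)
        w = torch.einsum('bt,toihw->boihw', a, self.weight)
        b = torch.einsum('bt,to->bo', a, self.bias)
        # Grouped convolution trick: apply a different kernel to each sample
        u = u.reshape(1, B * C, H, W)
        w = w.reshape(B * w.shape[1], C, w.shape[3], w.shape[4])
        v = F.conv2d(u, w, b.reshape(-1), stride=self.stride,
                     padding=self.padding, groups=B)  # Eq. (5)
        return v.reshape(B, -1, v.shape[2], v.shape[3])
```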

2.2 Bottleneck Layer Structure Design

To solve the degradation problem of deep neural networks and accelerate network convergence, MobileNetV2 introduces the Inverted Residuals Block bottleneck structure [3], as shown in Fig. 2. The traditional residual structure [9] is shaped like an hourglass, narrow in the middle and wide at the ends, so only a small number of convolutional kernels are used to extract features in the middle, which weakens feature extraction. Since the number of convolutional kernels in each layer of a lightweight feature extraction network is already limited, using the traditional residual structure would prevent the network from extracting enough information and degrade its performance. Therefore, this paper uses the inverted residual structure, shaped like a spindle with a wide middle and narrow ends: the feature data are first up-dimensioned by a 1 × 1 Conv, features are then extracted by a convolution operation, and the data are finally down-dimensioned again by a 1 × 1 Conv. This preserves the feature extraction effect while keeping the parameters and computation of the network under control.

The backbone of the Inverted Residuals Block is divided into three main blocks. The first block has a structure similar to the third, consisting of a 1 × 1 Conv, BN, and ReLU6. The 1 × 1 Conv is a convolutional layer with kernel size 1, used mainly to change the number of feature channels. BN is the Batch Normalization layer, which normalizes the features after the convolution. ReLU6 is the activation function, which provides the nonlinear mapping capability. Note that the third block does not contain an activation function. The second block consists of a 3 × 3 DwiseConv, BN, and ReLU6 [10], where 3 × 3 DwiseConv is a Depthwise Convolution with kernel size 3 [11] (Fig. 2).

Fig. 2. The Inverted Residuals Block

The Inverted Residuals Block is an important component of MobileNetV2. With a large number of Inverted Residuals Blocks, the input information can flow sufficiently through the network, giving the network enough parameters to understand the input and record its characteristics. In this structure, we empirically replace the 1 × 1 convolution in the third block with a Dynamic Convolution layer. On the one hand, this replacement alone is sufficient to improve the face verification performance of MobileNetV2. On the other hand, although Dynamic Convolution adds very few operations, it adds a considerable number of parameters. Replacing only the last 1 × 1 convolutional layer in the Inverted Residuals Block therefore also prevents the model from growing so large that it could no longer be deployed on grid-side devices. The modified block is called the Dynamic Inverted Residuals Block.
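A sketch of the resulting block, reusing the DynamicConv2d sketch above; the layer layout follows the description in the text, while the default expansion factor and stride are placeholders:

```python
import torch.nn as nn

class DynamicInvertedResidual(nn.Module):
    """Sketch of the Dynamic Inverted Residuals Block: the last 1x1
    convolution of the Inverted Residuals Block is replaced by Dynamic
    Convolution; the third block carries no activation function."""

    def __init__(self, in_ch, out_ch, expand=6, stride=1):
        super().__init__()
        mid = in_ch * expand
        self.use_res = stride == 1 and in_ch == out_ch
        self.block1 = nn.Sequential(  # 1x1 Conv + BN + ReLU6 (up-dimension)
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True))
        self.block2 = nn.Sequential(  # 3x3 DwiseConv + BN + ReLU6
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                      groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True))
        self.block3 = DynamicConv2d(mid, out_ch, kernel_size=1)  # down-dimension
        self.bn3 = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        out = self.bn3(self.block3(self.block2(self.block1(x))))
        return x + out if self.use_res else out
```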

2.3 Network Architecture Design

The input image size used in this paper is 112 × 112. Based on MobileNetV2, the Inverted Residuals Blocks are replaced with the Dynamic Inverted Residuals Blocks described above. As shown in Table 1, the network consists of four parts. The first part applies an ordinary convolution with kernel size 3, stride 2, padding 1, and 64 output channels to obtain a 56 × 56 feature map rich in face feature information. The second part consists of six Dynamic Inverted Residuals Blocks in different configurations. The third part contains three convolution operations: first, a 1 × 1 convolution expands the number of feature channels, outputting a 7 × 7 feature map with 512 channels; then a 7 × 7 convolution layer produces 512 features of size 1 × 1; finally, a 1 × 1 convolution performs a feature transform, and after flattening, a 512-dimensional face feature vector is obtained. The fourth part is a fully connected layer that implements face classification during training.

Table 1. Network structure

In Table 1, op denotes the operation, e is the channel expansion factor, c is the number of output channels (dimensions), d indicates whether dropout is used, r is the number of repetitions of the block, and s is the stride (only the first repetition of a block uses stride s; the remaining repetitions use stride 1).
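As a rough sketch of the third part (the embedding head) described above: the text does not specify whether the 7 × 7 convolution is depthwise, so the depthwise choice below, which keeps the head lightweight, is an assumption:

```python
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Sketch of the third part: 1x1 expand -> 7x7 conv -> 1x1 transform,
    flattened into a 512-dimensional face feature vector."""

    def __init__(self, in_ch, emb_dim=512):
        super().__init__()
        self.expand = nn.Conv2d(in_ch, 512, 1)           # 7x7x512 feature map
        self.conv7 = nn.Conv2d(512, 512, 7, groups=512)  # 512 features of 1x1
        self.transform = nn.Conv2d(512, emb_dim, 1)      # feature transform
        self.flatten = nn.Flatten()

    def forward(self, x):
        return self.flatten(self.transform(self.conv7(self.expand(x))))
```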

3 Analysis of Experimental Results

3.1 Data Set and Experimental Setup

The public dataset CASIA-WebFace [12] contains 494,414 images of 10,575 individuals. In this paper, CASIA-WebFace is used as the training dataset, and the face verification benchmark LFW [13] is used to evaluate the algorithm under different conditions. LFW contains 13,233 face images of 5,749 people, covering variations in pose, illumination, and background. There is no overlap between the training and test data.

The input face image size of the model is 112 × 112, so the data must be processed before the face recognition network is trained. A face detection algorithm extracts the coordinates of face regions and key points; based on these coordinates, the face is aligned for correction, and the aligned face image is scaled to 112 × 112. Data augmentation includes image mirroring, translation, and adjustments of brightness, color, contrast, and sharpness. Before training, each face image is normalized by subtracting 127.5 from the pixel values and then dividing by 128.
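The normalization step can be written directly; the function name is illustrative:

```python
import numpy as np

def normalize_face(img: np.ndarray) -> np.ndarray:
    """Normalize an aligned 112x112 face image: (pixel - 127.5) / 128."""
    return (img.astype(np.float32) - 127.5) / 128.0
```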

The experimental hardware platform runs the Ubuntu 18.04 operating system with an Intel Core CPU and an NVIDIA Tesla V100 graphics card. The experiments in this paper are based on the PyTorch deep learning framework [14].

All experiments in this paper are trained with a stochastic gradient descent optimizer [15]. To speed up convergence and reduce oscillation during model convergence, a momentum factor of 0.9 is used; the weight decay is set to 5e−4 and the initial learning rate to 0.01. The learning rate is multiplied by 0.1 at epochs 40, 50, and 60, and the model is trained for 70 epochs in total.
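In PyTorch, this training setup corresponds roughly to the following, where model stands in for the network described in Sect. 2.3:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# Multiply the learning rate by 0.1 at epochs 40, 50, and 60.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40, 50, 60], gamma=0.1)

for epoch in range(70):  # 70 epochs in total
    # ... one training pass over CASIA-WebFace ...
    scheduler.step()
```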

The loss function used during training is the AdaCos [16] adaptive-scale loss. Compared with loss functions commonly used for face recognition, such as CosFace [17] and ArcFace [18], AdaCos achieves good optimization without manual tuning of the loss hyperparameters.
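To illustrate why no manual scale tuning is needed, here is a minimal sketch of the fixed-scale variant of AdaCos, whose scale is derived analytically from the number of classes (the paper uses the adaptive variant, which further updates the scale during training):

```python
import math
import torch
import torch.nn.functional as F

def adacos_fixed_loss(embeddings, class_weights, labels):
    """Fixed AdaCos sketch: cosine-softmax loss with analytic scale
    s = sqrt(2) * log(C - 1), where C is the number of classes."""
    num_classes = class_weights.shape[0]
    s = math.sqrt(2.0) * math.log(num_classes - 1)
    cos = F.linear(F.normalize(embeddings), F.normalize(class_weights))
    return F.cross_entropy(s * cos, labels)
```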

3.2 Analysis of Experimental Results

The comparison between the lightweight face recognition algorithm model based on Dynamic Inverted Residuals Block and the baseline network MobileNetV2 on the LFW validation set is shown in Table 2.

Table 2. The comparison on the LFW validation set

As shown in Table 2, introducing Dynamic Convolution increases the computation from 292.6M to 305.3M, only a 4.34% increase, while the face recognition accuracy rises from 98.58% to 99.28%, a substantial 50.7% reduction in error rate. Such an improvement is nontrivial for a task like face recognition, where accuracy is already close to saturation. The number of model parameters and the forward inference time remain at the same order of magnitude as the baseline network, ensuring that the model can be deployed on all types of grid-side end devices.

In order to fully verify the performance of this algorithm model, an experimental comparison with the current mainstream algorithms in the field of face recognition was conducted, as shown in Table 3.

Table 3. The comparison with other algorithms

LMobileNetE and Light CNN achieve higher recognition accuracy, but their training datasets contain 4M and 3.8M images, and their parameter counts are 12.8M and 26.7M, an order of magnitude higher than the model in this paper, making them significantly harder to migrate to mobile platforms. MobileID and ShuffleNet have smaller models but weaker performance, failing to reach 99% accuracy, which does not meet the standards required by Electricity Grid Engineering. By introducing Dynamic Convolution, the model proposed in this paper achieves a good trade-off among recognition accuracy, computation, and model size, meeting the accuracy requirements while remaining efficient on mobile devices.

4 Conclusion

In this paper, we propose a lightweight face recognition network based on Dynamic Convolution to address the personnel management problems common in Electricity Grid Engineering. The Dynamic Convolution operation not only gives a single convolution richer feature extraction and learning capability but also makes the convolution operation adaptive, automatically constructing different convolution kernel parameters for different inputs. Experiments show that the proposed lightweight face recognition network achieves a good balance between operational efficiency and recognition accuracy.