
1 Introduction

With the rapid development and construction of the State Grid, the amount of circuit and power transmission equipment keeps growing. Because power line equipment is installed outdoors and exposed to the natural environment and human activity, pole towers suffer from rusted joints, collapse, wear and other defects. To ensure reliable power delivery, outdoor power towers and related equipment must be inspected frequently, and determining whether power equipment is anomalous by analyzing patrol photos remains a challenging problem.

Deep learning-based image analysis is currently a popular topic in artificial intelligence. Machine learning methods not only improve detection efficiency significantly but also reduce cost. Patrol images are peculiar in that the vast majority of captured images are fault-free and only a few contain anomalies. Most researchers focus on improving the quality of the raw data acquired by image acquisition terminals in order to obtain transmission equipment patrol images suitable for intelligent analysis; as a result, many framing correction techniques based on angle perception have been developed for end devices. Researchers also aim at real-time detection of abnormal feature quantities and fast filtering of low-quality, repetitive images. However, the limited computational resources of terminal equipment restrict the methods available for analyzing transmission equipment inspection images. An effective, fast and low-power image detection method is therefore essential for circuit device inspection.

This paper focuses on feature learning for power tower transmission equipment inspection images, which is essentially the problem of detecting anomalies in the images of the dataset. The proposed model, named CNN Transformer Detect Anomalies (CTran_DA), combines the advantages of the Convolutional Neural Network (CNN) [1] and the Transformer [2]: the CNN learns local features of the image, while the Transformer learns global features. According to the data characteristics, we construct three datasets from the full set of patrol photos. Compared with traditional computer vision classification methods, CTran_DA achieves the best performance on our datasets, while being much smaller than the other algorithms in terms of the number of parameters. Various experimental results show that the proposed model is not only effective at detecting anomalies in images but also lightweight.

2 Related Work

In recent years, Convolutional Neural Networks (CNNs) have achieved breakthrough results in various fields related to pattern recognition [3]. In image processing in particular, CNNs greatly reduce the number of parameters compared with fully connected artificial neural networks, which motivates researchers to use large CNNs for complex tasks. One of the biggest strengths of CNNs is that they learn local features and image details very well, and a well-designed model can be trained with only a small number of samples.

The basic structure of a CNN can be divided into four key components: the input layer, the convolutional layer, the pooling layer and the fully connected layer. The convolutional layer, as the core of a CNN, can significantly reduce the complexity of the model by controlling the size of its output, which is achieved by setting three hyperparameters: kernel size, stride and padding. Inspired by CNNs, increasingly effective models such as AlexNet [4], VGG [5], GoogLeNet [6] and ResNet [7] have emerged. All of these models have achieved excellent results in computer vision and continue to be improved.

The Transformer, a deep neural network built mainly on a self-attention mechanism, was first applied in natural language processing [2]. Many recent NLP systems adopt the Transformer structure and achieve excellent results on various NLP tasks [2, 8, 9]. Inspired by this success in Natural Language Processing (NLP), researchers have recently applied the Transformer to computer vision (CV) tasks [10]. Alexey Dosovitskiy et al. [11] proposed the Vision Transformer (ViT), which applies a pure Transformer directly to sequences of image patches [10]. Since ViT consumes substantial computational resources and has too many parameters, Wenhai Wang et al. [12] proposed the Pyramid Vision Transformer (PVT). PVT not only filters out redundant information in the ViT model, making it lighter, but also achieves better results on various CV tasks. Microsoft Research Asia used the structural design concepts of CNNs to build a new Transformer structure named the Swin Transformer [13]. Borrowing strong models from other fields and transferring their learning [14] to new tasks provides a new way of thinking for researchers in this area [15].

3 Method

3.1 Overall Architecture

Fig. 1. Overall architecture of the proposed CTran_DA.

Our goal is to fully learn the features in an image and the relationships between them. An overview of CTran_DA is depicted in Fig. 1. Our model consists of three stages: a CNN block, a Transformer encoder and fully connected layers. The output of each stage is the input to the next, and the final prediction is produced by the fully connected layers.
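To make the three-stage pipeline concrete, the following is a minimal PyTorch sketch of the architecture. The channel counts, token dimension, number of heads and classifier head are illustrative assumptions rather than the paper's exact configuration; the CNN block and encoder internals are detailed in Sects. 3.2 and 3.3.

```python
# Minimal sketch of the three-stage CTran_DA pipeline (assumed hyperparameters).
import torch
import torch.nn as nn

class CTranDA(nn.Module):
    def __init__(self, token_dim=196, num_heads=4, num_layers=1, num_classes=2):
        super().__init__()
        # Stage 1: CNN block learns local features and details (Sect. 3.2).
        self.cnn_block = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(32),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
        )
        # Stage 2: Transformer encoder learns global relations (Sect. 3.3).
        # token_dim is assumed to equal H'*W' of the CNN output feature maps.
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=num_heads)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Stage 3: fully connected layer produces the anomaly prediction.
        self.head = nn.Linear(token_dim, num_classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        feat = self.cnn_block(x)               # (B, n, H', W') local feature maps
        tokens = feat.flatten(2)               # (B, n, m): one m-dim token per channel
        tokens = tokens.transpose(0, 1)        # (n, B, m) as nn.TransformerEncoder expects
        tokens = self.encoder(tokens)          # global feature learning
        return self.head(tokens.mean(dim=0))   # pool over tokens, then classify
```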

3.2 CNN Block

In the first stage, given an input image of size \(H\times W\times 3\), we use the CNN to learn local features and details of the image. The CNN block contains convolution layers (Conv), batch normalization layers (BN), activation layers (LeakyReLU [16]) and max pooling layers. The processing flow of the CNN block is shown in Fig. 2.

Fig. 2. Flow chart of the CNN block.

Convolution Layer.

The convolutional layer extracts features from the input data; its way of processing images resembles how the human brain recognizes them: it first perceives each feature in the image locally, and then combines them to obtain global information [3]. The convolution operation can be expressed as:

$${P}_{out}^{i}=f\left(P*W\right)+b.$$
(1)

where \(P\in {\mathbb{R}}^{h\times h}\) denotes the input image, \(W\) and \(b\) are the weight matrix and bias of the convolution kernel, respectively, and \({P}_{out}^{i}\) denotes the convolution output of the \(i\)-th layer.

Batch Normalization Layer.

The BN layer first computes the mean and variance of each batch of data, then normalizes the data by subtracting the mean and dividing by the standard deviation, and finally applies two learnable parameters (scale and shift) [17]. The BN layer has three roles: 1. it speeds up convergence; 2. it prevents gradient explosion and gradient vanishing; 3. it mitigates overfitting. The result of the convolution, \({P}_{out}^{i}\), is fed into the BN layer, which can be expressed as:

$${B}_{out}^{i}=BN\left({P}_{out}^{i}\right).$$
(2)

Activation Layer.

An important role of the activation function is to introduce nonlinearity, mapping features into a nonlinear space and allowing the network to solve problems that a linear model cannot. In the activation layer we use LeakyReLU [16], defined as:

$$LeakyReLU\left(x\right)=\left\{\begin{array}{l} x , \quad if \,x\ge 0\\ \alpha x, \quad otherwise\end{array}\right. .$$
(3)

Max Pooling Layer.

The pooling layer, also known as the downsampling layer, reduces the resolution of the features to reduce the number of parameters in the model and the complexity of the computation, enhancing the robustness of the model.

After the CNN block, we obtain feature maps encoding the local features of the image. Each feature map is reshaped into an m-dimensional vector, and the n vectors (one per channel) are combined into an n×m embedding that serves as the input to the Transformer encoder.
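As a small illustration of this reshaping step (with an assumed CNN output of 32 channels of 14×14 feature maps):

```python
# Tokenization sketch: each h'×w' feature map becomes an m-dimensional token
# (m = h'*w'), and the n channels form the token sequence for the encoder.
import torch

feat = torch.randn(8, 32, 14, 14)     # (batch, n channels, h', w') from the CNN block
tokens = feat.flatten(start_dim=2)    # (batch, n, m) with m = 14 * 14 = 196
print(tokens.shape)                   # torch.Size([8, 32, 196])
```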

3.3 Transformer Encoder

The Transformer was first used in natural language processing for machine translation tasks [2]. Our encoder contains a Layer Normalization (LN) layer, a multi-head attention layer, a DropPath layer and an MLP block.

Layer Normalization.

LN works similarly to BN. Since the length of each sequence may differ when processing natural language, LN is used to normalize the input embeddings.

Multi-Head Attention.

Multi-head attention is a mechanism that improves the performance of the self-attention layer. In the self-attention layer, each input vector is first transformed into three different vectors: the query vector q, the key vector k and the value vector v, which are packed into the matrices Q, K and V. The attention function over the input vectors is then calculated as follows:

  • Step 1: Compute the scores between the query matrix Q and the key matrix K: \(S=Q\cdot {K}^{T}\)

  • Step 2: Scale the scores for gradient stability: \({S}_{n}=S/\sqrt{{d}_{k}}\)

  • Step 3: Convert the scores into probabilities using the softmax function: \(P=softmax\left({S}_{n}\right)\).

  • Step 4: Obtain the weighted value matrix: \(\mathrm{Attention}=\mathrm{P}\cdot \mathrm{V}\).

This whole process can be unified into a single formula:

$$Attention\left(Q,K,V\right)=softmax\left(\frac{\left(Q\cdot {K}^{T}\right)}{\sqrt{{d}_{k}}}\right)\cdot V$$
(4)
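A minimal PyTorch sketch of Steps 1–4 and Eq. (4) follows; the tensor shapes are illustrative, and in the model Q, K and V would come from linear projections of the input tokens:

```python
# Scaled dot-product attention, following Steps 1-4 / Eq. (4).
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = K.size(-1)
    S = Q @ K.transpose(-2, -1)        # Step 1: raw scores, shape (..., n, n)
    S_n = S / d_k ** 0.5               # Step 2: scale for gradient stability
    P = F.softmax(S_n, dim=-1)         # Step 3: scores -> probabilities
    return P @ V                       # Step 4: weighted sum of the values

Q = K = V = torch.randn(8, 32, 196)    # (batch, tokens, dim), illustrative shapes
out = attention(Q, K, V)               # (8, 32, 196)
```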

However, self-attention is insensitive to positional information: nothing in the attention score computation encodes token positions. To solve this problem, a positional encoding of the same dimension is added to the original input embedding, given by the following equations:

$$PE\left(pos,2i\right)=sin\left(\frac{pos}{{10000}^{\frac{2i}{{d}_{model}}}}\right)$$
(5)
$$PE\left(pos,2i+1\right)=cos\left(\frac{pos}{{10000}^{\frac{2i}{{d}_{model}}}}\right)$$
(6)

where \(pos\) represents the position of the token in the sequence and \(i\) denotes the current dimension of the positional encoding; \({d}_{model}\) is the embedding dimension defined by our model.
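A small sketch of Eqs. (5)–(6); the number of positions and the value of \({d}_{model}\) used here are placeholders:

```python
# Sinusoidal positional encoding from Eqs. (5)-(6).
import torch

def positional_encoding(num_positions, d_model):
    pe = torch.zeros(num_positions, d_model)
    pos = torch.arange(num_positions, dtype=torch.float).unsqueeze(1)        # (pos, 1)
    div = 10000 ** (torch.arange(0, d_model, 2, dtype=torch.float) / d_model)
    pe[:, 0::2] = torch.sin(pos / div)   # even dimensions: Eq. (5)
    pe[:, 1::2] = torch.cos(pos / div)   # odd dimensions:  Eq. (6)
    return pe                            # added to the input embeddings

pe = positional_encoding(32, 196)        # one encoding per token
```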

In the multi-head attention mechanism, we are given the input vectors and a number of heads \(h\). The input vectors are converted into three groups of vectors: the query group, the key group and the value group, and the dimensions of each group are divided equally among the \(h\) heads. The total attention is then the combination of the attention of the individual heads:

$$MultiHead\left({Q}^{\mathrm{^{\prime}}},{K}^{\mathrm{^{\prime}}},{V}^{\mathrm{^{\prime}}}\right)=Concat\left(hea{d}_{1},\dots ,hea{d}_{h}\right){W}^{O}$$
(7)

where \(hea{d}_{i}=Attention\left({Q}_{i},{K}_{i},{V}_{i}\right)\) and \({W}^{O}\in {\mathbb{R}}^{{d}_{model}\times {d}_{model}}\) is a linear projection matrix.
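A minimal multi-head attention sketch following Eq. (7); the model dimension and number of heads are assumptions:

```python
# Multi-head attention: project, split into h heads, attend, concatenate, apply W^O.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=196, num_heads=4):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # W^O in Eq. (7)

    def forward(self, x):                        # x: (B, n, d_model)
        B, n, _ = x.shape
        # Project and split into h heads: (B, h, n, d_head)
        q = self.q_proj(x).view(B, n, self.h, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, n, self.h, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, n, self.h, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        heads = F.softmax(scores, dim=-1) @ v    # attention within each head
        heads = heads.transpose(1, 2).reshape(B, n, -1)  # Concat(head_1, ..., head_h)
        return self.w_o(heads)
```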

DropPath.

DropPath is a regularization strategy that randomly deactivates entire branches of the multi-branch structures in a deep learning model [18].

MLP.

The MLP is a traditional neural network designed to solve nonlinear problems that a single-layer perceptron cannot. In addition to the input and output layers, it can have multiple hidden layers in between.
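As a sketch, a typical Transformer-encoder MLP block looks like the following; the hidden width, activation and dropout rate are assumptions, since the paper does not specify them:

```python
# A common MLP block: expand, apply a nonlinearity, then project back to d_model.
import torch.nn as nn

def mlp_block(d_model=196, hidden=4 * 196, drop=0.1):
    return nn.Sequential(
        nn.Linear(d_model, hidden),
        nn.GELU(),
        nn.Dropout(drop),
        nn.Linear(hidden, d_model),
        nn.Dropout(drop),
    )
```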

3.4 Model Optimization

In this model, because patrol photos are highly imbalanced, the class-balanced loss weighting (cb_loss) [19] is used to handle the dataset, and Focal Loss [20] is selected as the loss function.
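The following is a hedged sketch of how class-balanced weighting [19] can be combined with Focal Loss [20]; the values of beta, gamma and the per-class counts are assumptions, not values reported in the paper:

```python
# Class-balanced focal loss sketch (assumed hyperparameters).
import torch
import torch.nn.functional as F

def cb_focal_loss(logits, targets, samples_per_class, beta=0.9999, gamma=2.0):
    # Class-balanced weights: (1 - beta) / (1 - beta^n_c), normalized over classes.
    counts = torch.tensor(samples_per_class, dtype=torch.float, device=logits.device)
    weights = (1.0 - beta) / (1.0 - beta ** counts)
    weights = weights / weights.sum() * len(samples_per_class)
    ce = F.cross_entropy(logits, targets, reduction='none')  # per-sample -log(p_t)
    p_t = torch.exp(-ce)
    focal = (1.0 - p_t) ** gamma * ce                        # focal modulation
    return (weights[targets] * focal).mean()
```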

4 Experiment

In the experiments, the learning rate is 0.001 and the batch size is 64. All experiments were run with PyTorch 1.6 on a GeForce RTX 3080.
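A sketch of this training configuration (learning rate 0.001, batch size 64), reusing the CTranDA and cb_focal_loss sketches above; the optimizer, epoch count, class counts and dummy data are assumptions:

```python
# Training-loop sketch; CTranDA and cb_focal_loss are the sketches defined earlier.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the patrol-photo training split (random data, for shapes only).
train_set = TensorDataset(torch.randn(256, 3, 56, 56), torch.randint(0, 2, (256,)))

model = CTranDA(num_layers=1)                   # e.g. the CTran-1 variant
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

for epoch in range(50):
    for images, labels in train_loader:
        optimizer.zero_grad()
        # samples_per_class here uses MIDDLE-like counts purely for illustration.
        loss = cb_focal_loss(model(images), labels, samples_per_class=[1616, 270])
        loss.backward()
        optimizer.step()
```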

4.1 Datasets

We obtained 1,886 samples in total by manually screening the patrol photos, of which 270 are positive samples. To address the imbalanced sample distribution, we used two different methods to construct two new datasets. Firstly, we filtered the negative samples and removed images with less obvious features, leaving only 282 negative images alongside the 270 positive images; this balanced small dataset is named SMALL [21]. Secondly, we replicated the 270 positive samples of the original data six times to reach 1,620 images, which roughly balances the 1,616 negative samples; this dataset is called LARGE [21]. The original dataset is named MIDDLE [21]. The datasets are summarized in Table 1. In our experiments we split each dataset into training, validation and test sets in the ratio 8:1:1. We train the model on the training set, tune the hyperparameters on the validation set, and finally evaluate the model on the test set [21].

Table 1. Summary of the datasets.

4.2 Result

In the field of computer vision, many image classification methods have achieved excellent results. We therefore select several of these models and modify their final output layers to serve as baselines for comparison in our experiments. Given the nature of the images and the task, the goal is to detect whether a photo contains a positive (anomalous) sample.

Table 2. Comparison results of proposed model and other methods on three different datasets.

The residual network (ResNet) [7] solves the degradation problem of deep neural networks well and achieves strong results on image benchmarks such as ImageNet and CIFAR-10; it also converges faster at the same depth. VGG [5] is a classical network structure that adjusts model capacity by stacking different numbers of CNN layers, so VGG11 and VGG13 are selected as reference baselines. MLP-Mixer [22] builds a pure MLP architecture that mixes information along two different dimensions. ViT [11] is a pure Transformer model that is applied directly to sequences of image patches. PVT [12] introduces a pyramid structure into the Transformer on the basis of ViT, which not only achieves good results but also greatly reduces the number of model parameters. The Swin Transformer [13] is a hierarchical Transformer built by borrowing the hierarchical structure of CNNs. For ViT, PVT and the Swin Transformer we use the same settings: 12 attention heads and a Transformer block depth of 6.

We build three variants of our model by varying the number of Transformer encoder layers. Setting the number of layers to 1, 3 and 5 yields CTran-1, CTran-3 and CTran-5, respectively. We compare our model with the above methods on three metrics: recall, area under the ROC curve (AUC) and accuracy (ACC). The comparison results are shown in Table 2.

Table 3. Number of parameters and FLOPs of each model.

The experimental results on the three datasets show that balancing the samples by replication (LARGE) achieves the best results. The SMALL dataset, a balanced dataset with few samples, is also slightly better than the original dataset on all three metrics. Compared with the traditional convolutional approaches, our method achieves the best results on all three datasets, which indicates that using convolution alone for representation learning misses the global information of the image. Compared with the latest Transformer-based models, neither the pure Transformer model ViT nor its simplified variants achieve strong results: when images are split into patches, a pure Transformer easily loses the details of complex images, and tasks that hinge on such details are then difficult to identify accurately. Table 3 shows the number of parameters and the amount of computation for each model; our model achieves better results on each dataset while using fewer parameters and fewer FLOPs.

5 Conclusion

For the problem of anomaly detection in patrol photos, this paper proposes a novel scheme in which local and global features of the images are learned simultaneously. We propose a new model, CTran_DA, which effectively learns both the fine details and the global structure of the images, and which has a lighter structure than current mainstream image classification models. The results on three different datasets show that the proposed model is both effective and lightweight. The model also offers a new idea for other researchers and is well suited to resource-constrained terminal devices, providing a new solution for tasks that are highly complex yet require lightweight models.