Abstract
With the development of the State Grid, the scale of power lines, equipment and transmission is expanding. To ensure the stability and safety of the electricity supply, power towers and other equipment must be patrolled and inspected. With the help of deep learning, neural networks can be used to learn the features in patrol images. In this paper, a feature learning model named CNN Transformer Detect Anomalies (CTran_DA) is proposed to detect anomalies in patrol images. CTran_DA uses a CNN to learn local features in the image and a Transformer to learn global features, thereby combining the strengths of both to capture local details as well as global feature associations. In comparative experiments on our self-constructed dataset, the model outperforms state-of-the-art baselines. Moreover, the model requires fewer floating point operations (FLOPs) and fewer parameters than the other algorithms. Overall, CTran_DA is an efficient and lightweight model for detecting anomalies in images.
1 Introduction
With the rapid development and construction of the State Grid, the amount of circuit equipment and power transmission equipment is constantly increasing. Because power line equipment is located outdoors and is exposed to the natural environment and human factors, pole towers can suffer from interface rust, collapse, wear and other defects. To ensure the reliable transmission of electricity, frequent patrol inspections of outdoor power towers and other equipment are required. Determining whether power equipment is abnormal by analyzing patrol photos is a challenging problem.
Image analysis with deep learning is currently a popular topic in artificial intelligence. Machine learning methods can both significantly improve detection efficiency and reduce cost. Owing to the nature of patrol images, the vast majority of captured images are fault-free and only a few contain anomalies. Most researchers focus on improving the quality of the raw data acquired by image acquisition terminals in order to obtain the transmission equipment patrol images needed for intelligent analysis. As a result, many framing correction techniques based on angle perception, as well as studies of end devices, have emerged. Researchers have also worked on real-time detection of certain abnormal features and fast filtering of low-quality, repetitive images. However, the limited computational resources of terminal equipment restrict the methods available for analyzing transmission equipment inspection images. An effective, fast and low-power image detection method is therefore essential for circuit device inspection.
This paper focuses on feature learning for inspection images of power tower transmission equipment, which is essentially the problem of detecting anomalies in the images of a dataset. The proposed model, named CNN Transformer Detect Anomalies (CTran_DA), combines the advantages of the Convolutional Neural Network (CNN) [1] and the Transformer [2]. We use the CNN to learn local features in the image and the Transformer to learn global features. According to the characteristics of the data, we construct three datasets from the full set of patrol photo samples. Compared with traditional computer vision classification methods, CTran_DA achieves the best performance on our datasets. CTran_DA also has considerably fewer parameters than the other algorithms and models. The experimental results show that the proposed model is both efficient at detecting anomalies in images and lightweight.
2 Related Work
In recent years, Convolutional Neural Networks (CNNs) have achieved breakthrough results in various fields related to pattern recognition [3]. In image processing in particular, CNNs can reduce the number of parameters of artificial neural networks, which motivates researchers to use large CNNs to solve complex tasks. One of the main strengths of CNNs is that they learn local features and image details very well, and a well-designed model can be learned from only a small number of samples.
The basic functionality of CNNs can be divided into four key sections: the input layer, the convolutional layer, the pooling layer and the fully connected layer. The convolutional layer, as the core layer in CNNs, can significantly reduce the complexity of the model by optimizing its output, which can be achieved by setting three hyperparameters: kernel size, stride and padding. Inspired by CNNs, increasingly effective models such as AlexNet [4], VGG [5], GoogLeNet [6] and ResNet [7] have emerged. All these models have achieved excellent results in the field of computer vision and are constantly being improved.
The Transformer was first applied in natural language processing (NLP) and is a deep neural network based mainly on the self-attention mechanism [2]. Many recent NLP systems adopt the Transformer structure and have achieved excellent results in various NLP tasks [2, 8, 9]. Inspired by this success, researchers have recently applied the Transformer to computer vision (CV) tasks [10]. Alexey Dosovitskiy et al. [11] proposed the Vision Transformer (ViT) model, which applies a pure Transformer directly to sequences of image patches [10]. Because ViT consumes substantial computational resources and has a large number of parameters, Wenhai Wang et al. [12] proposed the Pyramid Vision Transformer (PVT). PVT not only filters out some redundant information in the ViT model to make the model lightweight, but also achieves better results in various CV tasks. Microsoft Research Asia used the structural design concepts of CNNs to build a new Transformer structure named Swin Transformer [13]. Borrowing strong models from one field and transferring them [14] to other tasks offers researchers a new way of thinking [15].
3 Method
3.1 Overall Architecture
Our goal is to fully learn the features of an image and the relationships between them. An overview of CTran_DA is depicted in Fig. 1. Our model consists of three stages: a CNN block, a Transformer encoder and fully connected layers. The output of each stage is the input of the next stage, and the final result is the output of the fully connected layers.
3.2 CNN Block
In the first stage, given an input image of size \(H\times W\times 3\), we use a CNN to learn the local features and details of the image. The CNN block contains convolution layers (Conv), batch normalization layers (BN), activation layers (LeakyReLU [16]) and max pooling layers. The process of the CNN block is shown in Fig. 2.
Convolution Layer.
The convolutional layer extracts features from the input data, processing images much as the human brain recognizes them: it first perceives each feature in the image locally and then combines these local responses to obtain global information [3]. This convolution operation can be expressed as:
\[{P}_{out}^{i}=W\ast P+b\]
where \(P\in {\mathbb{R}}^{h\times h}\) denotes the image input, \(W\) and \(b\) are the parameter matrix and bias of the convolution kernel, respectively, \(\ast\) denotes the convolution operation, and \({P}_{out}^{i}\) denotes the convolution output of the \(i\)th layer.
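As a minimal sketch of the convolution described above, the following pure-Python function slides a kernel over a square single-channel input and adds a bias. The stride of 1, the absence of padding, the scalar bias and the function name `conv2d` are assumptions made for illustration; the text does not specify these settings.

```python
def conv2d(image, kernel, bias=0.0):
    """Valid 2D convolution (cross-correlation form, as used in CNNs).

    Assumes a square single-channel image, a square kernel,
    stride 1 and no padding.
    """
    h, k = len(image), len(kernel)
    out_size = h - k + 1
    out = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            acc = bias
            # Weighted sum over the k x k window anchored at (r, c).
            for i in range(k):
                for j in range(k):
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(acc)
        out.append(row)
    return out


P = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
W = [[1, 0],
     [0, 1]]
P_out = conv2d(P, W, bias=1.0)
```

With a 3x3 input and a 2x2 kernel, the output is 2x2, which shows how the kernel size alone already reduces the spatial resolution.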
Batch Normalization Layer.
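The BN layer first computes the mean and variance of each batch of data, then subtracts the mean from the data and divides by the standard deviation, and finally applies two learnable parameters for scaling and shifting [17]. The BN layer has three roles: (1) speeding up convergence; (2) mitigating exploding and vanishing gradients; (3) reducing overfitting. Taking the result of the convolution, \({P}_{out}^{i}\), as the input, the BN layer can be expressed as:
\[\mathrm{BN}\left({P}_{out}^{i}\right)=\gamma \cdot \frac{{P}_{out}^{i}-{\mu }_{B}}{\sqrt{{\sigma }_{B}^{2}+\epsilon }}+\beta\]
where \({\mu }_{B}\) and \({\sigma }_{B}^{2}\) are the mean and variance of the batch, \(\epsilon\) is a small constant for numerical stability, and \(\gamma\) and \(\beta\) are the two learnable parameters [17].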
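The normalization step can be illustrated with a small sketch over a one-dimensional batch of activations. In a CNN the statistics would be computed per channel over the batch and spatial positions; the flat list used here, the function name `batch_norm` and the default values of `gamma`, `beta` and `eps` are simplifying assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch to zero mean and unit variance, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    # Divide by the standard deviation (sqrt of variance plus eps).
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

During training `gamma` and `beta` would be learned; here they are fixed so that the output is simply the standardized batch.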
Activation Layer.
One important role of the activation function is to introduce nonlinearity, mapping features into a high-dimensional nonlinear space so that the model can solve problems that linear models cannot. In the nonlinear activation layer, we use LeakyReLU [16] as the activation function, defined as follows:
\[\mathrm{LeakyReLU}\left(x\right)=\left\{\begin{array}{ll}x,& x>0\\ \alpha x,& x\le 0\end{array}\right.\]
where \(\alpha\) is a small positive slope.
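A direct sketch of the LeakyReLU activation is given below. The slope value 0.01 is an illustrative assumption; the text does not state which slope is used in CTran_DA.

```python
def leaky_relu(x, negative_slope=0.01):
    """Pass positive inputs unchanged; scale non-positive inputs by a small slope."""
    return x if x > 0 else negative_slope * x


activations = [leaky_relu(v) for v in [-2.0, 0.0, 3.0]]
```

Unlike the plain ReLU, negative inputs keep a small non-zero response, so their gradient does not vanish entirely.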
Max Pooling Layer.
The pooling layer, also known as the downsampling layer, reduces the resolution of the features to reduce the number of parameters in the model and the complexity of the computation, enhancing the robustness of the model.
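The downsampling effect can be sketched as follows, assuming non-overlapping 2x2 windows (the window size and stride are not given in the text, and `max_pool2d` is a hypothetical helper name).

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: stride equals the window size.

    Trailing rows or columns that do not fill a whole window are dropped.
    """
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [7, 0, 3, 3],
        [1, 6, 2, 8]]
pooled = max_pool2d(fmap)
```

A 4x4 feature map becomes 2x2, so each pooling layer of this kind reduces the number of activations by a factor of four.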
After the CNN block, we obtain feature maps containing the local features of the image. Each feature map is reshaped into an \(m\)-dimensional vector, and the vectors from the \(n\) channels are combined into an \(n\times m\) embedding matrix that serves as the input of the Transformer encoder.
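The reshaping step can be sketched as below, where each of the \(n\) channels becomes one token of length \(m\). The row-major flattening order and the helper name `feature_maps_to_tokens` are assumptions for illustration.

```python
def feature_maps_to_tokens(feature_maps):
    """Flatten n feature maps (each h x w) into n vectors of length m = h * w."""
    return [[value for row in fmap for value in row] for fmap in feature_maps]


# Three channels (n = 3), each a 2 x 2 feature map (m = 4).
maps = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
tokens = feature_maps_to_tokens(maps)
```

The result is a sequence of `n` embeddings, which is the shape the Transformer encoder expects.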
3.3 Transformer Encoder
The Transformer was first used in natural language processing for machine translation tasks [2]. Our encoder contains a Layer Normalization (LN) layer, a multi-head attention layer, a Dropout layer and an MLP block.
Layer Normalization.
LN works similarly to BN, but normalizes over the features of each individual sample rather than over the batch. Because the length of each input may differ when processing sequences such as natural language, LN is used to normalize the input embeddings.
Multi-Head Attention.
Multi-head attention is a mechanism that improves the performance of the self-attention layer. In a self-attention layer, each input vector is first transformed into three different vectors: the query vector \(q\), the key vector \(k\) and the value vector \(v\). These vectors are packed into the matrices \(Q\), \(K\) and \(V\). The attention function of the input vectors is then calculated as follows:
- Step 1: Compute the scores between the query matrix \(Q\) and the key matrix \(K\): \(S=Q\cdot {K}^{T}\).
- Step 2: Normalize the scores for gradient stability: \({S}_{n}=S/\sqrt{{d}_{k}}\), where \({d}_{k}\) is the dimension of the key vectors.
- Step 3: Convert the scores to probabilities using the softmax function: \(P=\mathrm{softmax}\left({S}_{n}\right)\).
- Step 4: Obtain the weighted value matrix: \(\mathrm{Attention}=P\cdot V\).
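This whole process can be unified into a single formula:
\[\mathrm{Attention}\left(Q,K,V\right)=\mathrm{softmax}\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V\]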
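The four steps can be traced in a short pure-Python sketch of scaled dot-product attention. The helper names (`softmax`, `matmul`, `transpose`, `attention`) and the toy matrices are illustrative assumptions; a practical implementation would use a tensor library.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attention(Q, K, V):
    d_k = len(K[0])
    S = matmul(Q, transpose(K))                              # Step 1: scores
    S_n = [[s / math.sqrt(d_k) for s in row] for row in S]   # Step 2: scale
    P = [softmax(row) for row in S_n]                        # Step 3: probabilities
    return matmul(P, V), P                                   # Step 4: weighted values


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to one, so each output row is a convex combination of the rows of `V`, weighted towards the keys that best match the query.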
However, self-attention is not sensitive to position: no position information enters the calculation of the attention scores. To solve this problem, a position encoding of the same dimension is added to the original input embedding. The position encoding is given by the following equations:
\[PE\left(pos,2i\right)=\mathrm{sin}\left(pos/{10000}^{2i/{d}_{model}}\right)\]
\[PE\left(pos,2i+1\right)=\mathrm{cos}\left(pos/{10000}^{2i/{d}_{model}}\right)\]
where \(pos\) represents the position of the word in the sentence and \(i\) denotes the current dimension of the positional encoding. \({d}_{model}\) is the dimension initially defined by our model.
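A sketch of the position encoding is given below, assuming the sinusoidal form of [2], which matches the symbols \(pos\), \(i\) and \({d}_{model}\) defined above. The function name `positional_encoding` and the toy sizes are assumptions.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        # k plays the role of 2i in the formula (the even dimension index).
        for k in range(0, d_model, 2):
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)
    return pe


pe = positional_encoding(num_positions=3, d_model=4)
```

The resulting rows have the same dimension as the embeddings, so they can be added element-wise to the input of the encoder.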
In the multi-head attention mechanism, we are given an input vector and the number of heads \(h\). The input vectors are converted into three groups of vectors: the query group, the key group and the value group. Within each group, the dimensions are divided equally among the \(h\) heads. The total attention is then the combination of the attention of the individual heads:
\[\mathrm{MultiHead}\left(Q,K,V\right)=\mathrm{Concat}\left(hea{d}_{1},\dots ,hea{d}_{h}\right){W}^{O}\]
where \(hea{d}_{i}=Attention\left({Q}_{i},{K}_{i},{V}_{i}\right)\) and \({W}^{O}\in {\mathbb{R}}^{{d}_{model}\times {d}_{model}}\) is a linear projection matrix.
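The splitting and recombination of dimensions across heads can be sketched as follows. This illustrates only the bookkeeping: the per-head attention and the learned projection \({W}^{O}\) are omitted, and the helper names `split_heads` and `concat_heads` are hypothetical.

```python
def split_heads(X, h):
    """Split each d_model-dimensional row of X into h equal chunks.

    Returns a list of h matrices, each with d_model // h columns.
    """
    d_model = len(X[0])
    if d_model % h != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    d_head = d_model // h
    return [[row[i * d_head:(i + 1) * d_head] for row in X] for i in range(h)]


def concat_heads(heads):
    """Concatenate per-head outputs back into d_model-dimensional rows."""
    num_tokens = len(heads[0])
    return [
        [value for head in heads for value in head[t]]
        for t in range(num_tokens)
    ]


X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
heads = split_heads(X, h=2)
restored = concat_heads(heads)
```

In the full mechanism, attention would be applied to each element of `heads` separately before concatenation, and the concatenated result would then be multiplied by \({W}^{O}\).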
DropPath.
DropPath is a regularization strategy that randomly deactivates branches of the multi-branch structure in a deep learning model [18].
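One common way to realize this idea is sketched below: during training a whole branch output is zeroed with probability `drop_prob`, and kept outputs are rescaled so that the expected value is unchanged. The rescaling, the per-call granularity and the name `drop_path` are assumptions; the text does not describe how DropPath is implemented in CTran_DA.

```python
import random


def drop_path(branch_output, drop_prob, training=True, rng=random):
    """Randomly drop an entire branch; rescale kept branches by 1 / keep_prob."""
    if not training or drop_prob == 0.0:
        return list(branch_output)
    if rng.random() < drop_prob:
        # The whole branch is deactivated for this forward pass.
        return [0.0] * len(branch_output)
    keep_prob = 1.0 - drop_prob
    return [x / keep_prob for x in branch_output]


branch = [1.0, 2.0]
train_out = drop_path(branch, drop_prob=0.5)
eval_out = drop_path(branch, drop_prob=0.5, training=False)
```

At inference time the branch is always kept and left unscaled, which is why `eval_out` equals the input.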
MLP.
The MLP is a traditional neural network designed to solve nonlinear problems that cannot be solved by a single-layer perceptron. In addition to the input and output layers, it can have multiple hidden layers in between.
3.4 Model Optimization
In this model, because patrol photos are highly imbalanced between classes, the class-balanced loss (cb_loss) [19] is adopted to handle the imbalance in the dataset, with Focal Loss [20] selected as the underlying loss function.
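The combination can be sketched for a single sample as follows, using the class-balanced weight \((1-\beta)/(1-{\beta}^{n})\) from [19] and the focal term \(-{(1-{p}_{t})}^{\gamma}\mathrm{log}({p}_{t})\) from [20]. The values \(\beta = 0.999\) and \(\gamma = 2\) are illustrative assumptions, the normalization of the weights across classes is omitted, and the sample counts 270 and 1616 are the positive and negative counts of the original dataset.

```python
import math


def class_balanced_weight(n_samples, beta=0.999):
    """Inverse of the effective number of samples for a class with n_samples."""
    return (1.0 - beta) / (1.0 - beta ** n_samples)


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample; p_true is the probability of the true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def cb_focal_loss(p_true, n_samples_of_class, beta=0.999, gamma=2.0):
    """Class-balanced focal loss: class weight times focal loss."""
    return class_balanced_weight(n_samples_of_class, beta) * focal_loss(p_true, gamma)


w_pos = class_balanced_weight(270)    # minority (anomalous) class
w_neg = class_balanced_weight(1616)   # majority (fault-free) class
```

The rarer class receives the larger weight, and the focal term further down-weights samples that are already classified with high confidence.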
4 Experiment
In the experiments, the learning rate is 0.001 and the batch size is 64. Our experiments were run with PyTorch 1.6 on a GeForce RTX 3080.
4.1 Datasets
We obtained a total of 1,886 samples by manually screening the patrol photos, of which 270 were positive samples. To address the imbalanced sample distribution, we used two different methods to construct two new datasets. First, we filtered the negative samples and removed images with less obvious features to obtain a small dataset, which we named SMALL [21]. In this dataset, the negative samples were reduced to 282 images, which together with the 270 positive images gives a balanced dataset. Second, we replicated the 270 positive samples of the original data 6 times to reach 1,620 positive samples. Together with the 1,616 negative samples, this gives a balanced dataset, which we call LARGE [21]. The original dataset is named MIDDLE [21]. The datasets are detailed in Table 1. In our experiments, we divide each dataset into a training set, a validation set and a test set in the ratio 8:1:1. We train the model on the training set, tune the parameters on the validation set, and finally test the model on the test set [21].
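The sample counts and the 8:1:1 split can be checked with a short calculation. The sketch takes the 270 samples to be the minority (positive) class, as the stated totals imply, and the rounding rule in `split_counts` (floor for the training and validation sets, remainder to the test set) is an assumption, since the text gives only the ratio.

```python
def split_counts(total, ratios=(8, 1, 1)):
    """Split a sample count into train/validation/test counts by integer ratios."""
    unit = sum(ratios)
    train = total * ratios[0] // unit
    val = total * ratios[1] // unit
    test = total - train - val  # remainder goes to the test set
    return train, val, test


total, positives = 1886, 270
negatives = total - positives            # samples without anomalies
oversampled_positives = positives * 6    # replication used for LARGE
small_total = 282 + positives            # SMALL: reduced negatives plus positives
middle_split = split_counts(total)
```

The arithmetic confirms that six-fold replication of the 270 minority samples yields 1,620 samples, close to the 1,616 samples of the majority class.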
4.2 Result
In computer vision, many image classification methods have achieved excellent results. We therefore select several of these models and modify their final output layer so that they serve as baselines in our experiments. Given the nature of the images and of the task, the models are required to detect whether a photo is a positive sample.
The residual network [7] solves the degradation problem of deep neural networks well and achieves strong results on image tasks such as ImageNet and CIFAR-10. It also converges faster with the same number of layers. VGG [5] is a classical network structure that adjusts model performance by stacking different numbers of CNN layers; VGG11 and VGG13 are selected as baselines. MLP-Mixer [22] builds a pure MLP architecture that mixes information along two different dimensions. ViT [11] is a pure Transformer model applied directly to sequences of image patches. PVT [12] introduces a pyramid structure into the Transformer on the basis of ViT, which not only achieves good results but also greatly reduces the number of model parameters. The Swin Transformer [13] is a hierarchical Transformer built by borrowing the hierarchical structure of CNNs. For ViT, PVT and Swin Transformer, we use the same settings: 12 attention heads and a transformer block depth of 6.
We build variants of our model that differ in the number of Transformer blocks. We set the number of layers of the Transformer encoder to 1, 3 and 5, and name the variants CTran-1, CTran-3 and CTran-5, respectively. We compare our model with the above methods on three metrics: recall, area under the ROC curve (AUC) and accuracy (ACC). The comparison results are shown in Table 2.
The experimental results on the three datasets show that balancing the samples by replication (LARGE) achieves the best results. The SMALL dataset, a small balanced dataset, also scores slightly higher than the original dataset on all three metrics. Compared with the traditional convolutional approaches, our method achieves the best results on all three datasets, which suggests that learning representations with convolution alone misses the global information of the image. Compared with the latest Transformer-based models, neither the pure Transformer model ViT nor the simplified ViT achieves strong results. When images are split into patches, details in complex images are easily lost if only the Transformer is used to learn them, so tasks that depend on fine details are difficult to handle accurately. Table 3 shows the number of parameters and the amount of computation for each model. Our model achieves better results on each dataset while using fewer parameters and fewer FLOPs.
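For reference, the three metrics can be computed as in the sketch below. The toy labels and scores, the 0.5 decision threshold and the pairwise formulation of AUC are illustrative assumptions and are not taken from the experiments.

```python
def recall(y_true, y_pred):
    """Fraction of positive samples that are predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
```

Recall matters most when anomalies are rare, because a model that labels every photo as fault-free can still reach high accuracy on an imbalanced dataset while missing every anomaly.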
5 Conclusion
For the problem of anomaly detection in patrol photos, this paper proposes a novel scheme that learns local and global image features simultaneously. First, the proposed model, CTran_DA, can effectively learn both the feature details and the global structure of the images. Second, it is a lightweight model with a lighter structure than current mainstream image classification models. The results on three different datasets show that the proposed model is both effective and lightweight. The model is well suited to resource-constrained terminal devices and may offer other researchers a useful direction, providing a new solution for tasks that are highly complex yet require a lightweight model.
References
1. Kalchbrenner, N., Grefenstette, E., Blunsom, P.: A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188 (2014)
2. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 2017, pp. 5998–6008 (2017)
3. Albawi, S., Mohammed, T.A., Al-Zawi, S.: Understanding of a convolutional neural network. In: 2017 International Conference on Engineering and Technology (ICET) 2017, pp. 1–6. IEEE (2017)
4. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 2012, vol. 25, pp. 1097–1105 (2012)
5. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
6. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, pp. 1–9 (2015)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, pp. 770–778 (2016)
8. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
9. Brown, T.B., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)
10. Han, K., et al.: A survey on visual transformer. arXiv preprint arXiv:2012.12556 (2020)
11. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
12. Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122 (2021)
13. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)
14. Torrey, L., Shavlik, J.: Transfer learning. In: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pp. 242–264. IGI Global (2010)
15. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2009)
16. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of the ICML 2013, vol. 30, p. 3. Citeseer (2013)
17. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning 2015, pp. 448–456. PMLR (2015)
18. Larsson, G., Maire, M., Shakhnarovich, G.: FractalNet: ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648 (2016)
19. Cui, Y., Jia, M., Lin, T.-Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, pp. 9268–9277 (2019)
20. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision 2017, pp. 2980–2988 (2017)
21. Chen, J., Luo, W., Hao, Y., Xu, H., Wu, J., Ju, X.: Using convolution neural networks to build a LightWeight anomalies detection model. In: 2021 IEEE 4th International Conference on Automation, Electronics and Electrical Engineering (AUTEEE) 2021, pp. 157–160. IEEE (2021)
22. Tolstikhin, I., et al.: MLP-mixer: an all-MLP architecture for vision. arXiv preprint arXiv:2105.01601 (2021)
Acknowledgments
This work is supported by the State Grid Zhejiang Electric Power Co., Ltd. science and technology project (5211nb200139), "Key technology and terminal development of lightweight image elastic sensing and recognition based on AI chip".
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
About this paper
Cite this paper
Zhou, H., Qin, R., Wu, J., Qian, Y., Ju, X. (2022). CTran_DA: Combine CNN with Transformer to Detect Anomalies in Transmission Equipment Images. In: Qian, Z., Jabbar, M., Li, X. (eds) Proceeding of 2021 International Conference on Wireless Communications, Networking and Applications. WCNA 2021. Lecture Notes in Electrical Engineering. Springer, Singapore. https://doi.org/10.1007/978-981-19-2456-9_67
DOI: https://doi.org/10.1007/978-981-19-2456-9_67
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-2455-2
Online ISBN: 978-981-19-2456-9
eBook Packages: Engineering, Engineering (R0)