LSTM power quality disturbance classification with wavelets and attention mechanism

Efficient detection and classification of power quality disturbances are required as multi-energy systems, such as microgrids fed by renewable energy resources, become more widespread. Machine learning approaches are popular for generating useful and optimal features from data to improve the classification performance. This paper analyses the classification performance of a hybrid model that combines multi-resolution analysis with a long short-term memory network. The proposed model uses a four-level wavelet decomposition to increase the resolution of the input signals, giving a multi-band signal representation. Spatial and temporal feature representations of the wavelet coefficients are highlighted using an attention mechanism before being fed into the long short-term memory network for sequence feature extraction. The sequence feature output is then passed into multiple dense layers for classification. Synthetic disturbance signals are used as training samples. The performance tests include signals with a 20–50 dB signal-to-noise ratio, where additive white Gaussian noise is added to the test samples.


Introduction
The integration of multiple energy systems, especially renewable energy microgrids, has proved efficient in reducing the carbon footprint [1]. However, the growing integration of multiple energy systems also challenges utilities in providing a good quality of power supply [2,3]. Power quality disturbances (PQD) are among the major concerns in improving the functionality and efficiency of a microgrid [4]. A PQD is defined as a series of disruptions to the magnitude or frequency of the sinusoidal power supply waveform [5]. The occurrence of PQD causes problems such as reduced lifespan of electrical devices, malfunction of sensitive electronic devices such as computers, unwanted power tripping, financial losses, and decreased productivity. PQD detection and classification is thus an important tool for monitoring, preventive actions these statistical features never clearly justified [15]. Optimal feature selection methods were proposed to select the most important of the generated statistical features [18].
Most traditional three-stage PQD classification models use independent feature extraction, where the extracted features are not tied to the classification performance, even though the classification performance of the network depends highly on the feature extraction stage. The handcrafted statistical features may not be conducive to classification, and some may even reduce the accuracy. Handcrafting features often requires professional knowledge, and the features usually change across scenarios. A closed-loop feedback system between the feature extraction and classification stages, which these models lack, has been highlighted in [19] as an essential element for achieving automatic feature extraction and classification. Deep neural network (DNN) models [20-23] have been proposed to attain automatic feature extraction and classification without human intervention.
Hybrid methods combining the advantages of the wavelet transform (WT) with artificial neural networks (ANN) were first presented by Santoso et al. [24,25]. The outputs of the WT are squared to form squared wavelet transform coefficients. The extracted features are processed using ANN layers, and a simple thresholded vote is used to make the classification decision. In [26], statistical features extracted from the WT are used with a radial basis function neural network for classification. In [27], the WT is used for signal processing, and the extracted statistical features are classified using wavelet networks. Khokhar et al. use the WT as a feature extractor, pass the extracted statistical features into an artificial bee colony optimiser for feature selection, and finally perform the classification using probabilistic neural networks [15]. Most of these hybrid models thus rely on statistical features for the classification process.
To solve the above-mentioned problems, a novel hybrid approach combining signal analysis with a DNN is proposed. Instead of extracting specific statistical features, the proposed multi-resolution attention long short-term memory (LSTM) model embeds and aligns the wavelet transform coefficients using a perceptron layer. An attention mechanism is applied to the embedded features to improve the generalisation capability of the network. The highlighted features are then passed into LSTM layers for feature extraction. Finally, the sequence features extracted by the LSTM layers are passed into fully connected layers for classification. An overview of the proposed method is depicted in Fig. 1.

Multi-resolution attention LSTM method
Multi-resolution signal decomposition (MSD) is well suited to transforming time-series input signals into multiple frequency-component signals, or time-frequency domain signals [28]. An attention mechanism is proposed alongside MSD to achieve better classification performance, especially in noisy environments. The proposed method, shown in Fig. 1, uses MSD with an attention mechanism and LSTM layers to achieve automatic feature extraction, which in turn enables automatic PQD detection and classification. To achieve the best feature extraction and classification performance, a series of feature manipulations has been tested. Two different attention mechanisms have been tested: temporal feature attention (TFA) and spatial feature attention (SFA). The output coefficients of MSD are first passed through a feature align layer, which embeds the different bands into similar dimensions. The TFA weights are multiplied element-wise with the temporally aligned features from the feature align layer to obtain the temporal attention vector, while the spatial attention vector is obtained by multiplying the spatially aligned features with the SFA attention weights. The spatially or temporally highlighted feature vectors are then arranged temporally before being passed into the LSTM layers for feature extraction. Two fully connected layers and a softmax activation function are used to classify the features into the respective disturbance classes. The details of the proposed mechanism are explained in the following sections.

Multi-resolution signal decomposition
The wavelet transform (WT) [29] has proved efficient in detecting discontinuities in signals. The varying window sizes of the WT allow it to achieve an optimal time-frequency resolution and to detect non-stationary signals, which is what most PQDs are. The wavelet transform expresses the input $f(x)$ as an expansion over wavelet expansion functions $\psi_{i,j}$ weighted by discrete wavelet transform (DWT) expansion coefficients $a_{i,j}$, where $i$ and $j$ are the scaling and shifting parameters, respectively [29]. The wavelet basis functions $\psi_{i,j}$ are generated from the mother wavelet $\psi$ by tuning the scaling and shifting parameters.
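For reference, one common textbook form of these relations, written with the symbols defined above, is sketched below. It is offered as an assumption rather than as the exact notation of [29]: the dyadic normalisation $2^{-i/2}$ and the sign convention of the scale index differ between references, and the coefficient formula assumes an orthonormal, real-valued wavelet family.

```latex
\begin{align}
f(x) &= \sum_{i}\sum_{j} a_{i,j}\,\psi_{i,j}(x), \\
a_{i,j} &= \int_{-\infty}^{\infty} f(x)\,\psi_{i,j}(x)\,dx, \\
\psi_{i,j}(x) &= 2^{-i/2}\,\psi\!\left(2^{-i}x - j\right).
\end{align}
```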
MSD can be achieved using the two-scale equations for $\psi_{i,j}$ and $\phi_{i,j}$, where $g(j) = (-1)^{j} h(1-j)$. $h(j)$ and $g(j)$ can be viewed as high-pass and low-pass filter coefficients. MSD with an $I$-level decomposition [29] allows the signal to be band-filtered into levels of frequencies. The original signal is filtered using a high-pass filter (HPF) to extract the high-frequency components, while a low-pass filter (LPF) is used to extract the low-frequency components. The low-pass filter output of each level is used as the input to the next level of decomposition until the desired decomposition level is reached. Figure 2 shows MSD with four levels of decomposition. The detail coefficients, D1 to D4, are the outputs of the high-pass filters at the respective decomposition levels, while A4 is the low-pass filter output of the last decomposition level.
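The four-level filter bank described above can be illustrated with a minimal sketch. The mother wavelet is not named in this section, so the Haar wavelet is assumed here purely because it is the simplest to write out. The 50 Hz fundamental is also an assumption, and the function names `haar_step` and `msd` are hypothetical.

```python
import math

def haar_step(signal):
    """One decomposition level: return (approximation, detail), each half as long."""
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[k] + signal[k + 1]) * s for k in pairs]  # low-pass branch
    detail = [(signal[k] - signal[k + 1]) * s for k in pairs]  # high-pass branch
    return approx, detail

def msd(signal, levels=4):
    """Return the bands [D1, ..., D_levels, A_levels]."""
    bands = []
    approx = list(signal)
    for _ in range(levels):
        # The low-pass output of each level feeds the next level.
        approx, detail = haar_step(approx)
        bands.append(detail)
    bands.append(approx)
    return bands

# Assumed test signal: 10 periods of a 50 Hz sine sampled at 3200 Hz (640 samples).
fs, f0 = 3200, 50
signal = [math.sin(2 * math.pi * f0 * n / fs) for n in range(640)]
bands = msd(signal, levels=4)
print([len(b) for b in bands])  # [320, 160, 80, 40, 40]
```

With a 3200 Hz sampling frequency, D1 to D4 nominally cover 800–1600 Hz, 400–800 Hz, 200–400 Hz, and 100–200 Hz, while A4 covers 0–100 Hz. The unequal band lengths are what the feature align layer has to bring to a common dimension.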

Attention mechanism
The attention mechanism is used to highlight the important characteristics of the input signal. The input signal $x_t$ with $T$ dimensions is passed through a dense layer of similar dimension to compute the Attention Score, where $w_{d,t}$ denotes the trainable weight kernel vectors of the dense layer. A softmax layer normalises the Attention Score into the range (0, 1), forming the Attention Weight, $a_d$. The Attention Weight is then multiplied element-wise with the input signal, resulting in the highlighted feature vector, $a_d x_t$.
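The three steps above (dense-layer score, softmax normalisation, element-wise multiplication) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dense layer is assumed to be a square weight matrix without a bias term, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Normalise scores into weights in (0, 1) that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, weights):
    """x: input of length T; weights: T x T matrix whose row d holds w_{d,t}."""
    # Attention Score: one dense-layer output per dimension d (no bias assumed).
    scores = [sum(w * xt for w, xt in zip(row, x)) for row in weights]
    # Attention Weight a_d.
    a = softmax(scores)
    # Highlighted feature vector: element-wise product of a_d and the input.
    highlighted = [ad * xd for ad, xd in zip(a, x)]
    return a, highlighted

# Toy input with identity weights, so the scores equal the input itself.
x = [0.5, -1.0, 2.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
a, highlighted = attention(x, identity)
print(a, highlighted)
```

In a trained model the weight matrix is learned, so the largest Attention Weight falls on whichever part of the input helps classification most, not necessarily on the largest input value as in this toy case.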

Long short-term memory
The main operating mechanism of LSTM is governed by three "gates" with the aid of activation functions [30]. LSTM achieves its "memory" through the cell state $c_t$ in its architecture. LSTM takes in sequential inputs and processes them using the three gates: the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$. The input gate updates the LSTM cell state with new information, the forget gate filters out unwanted information present in the cell state, and the output gate outputs the hidden state, or temporal encoding, of each timestep. The input is also passed through the hyperbolic tangent (tanh) activation function, producing a new candidate, $\tilde{c}_t$, which the input gate scales before it updates the cell state. Each LSTM unit consists of one set of trainable weights, including $W_f$. A new cell state, $c_t$, is produced at every timestep by forgetting irrelevant information while learning new information. In the cell state update, the previous cell state, $c_{t-1}$, is multiplied element-wise with the forget gate to remove unwanted information, while the candidate cell state, $\tilde{c}_t$, is multiplied element-wise with the input gate to learn new information, giving $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$.
The third gate in an LSTM cell is the output gate, $o_t$, which controls the output information from the cell. The information in the new cell state, $c_t$, is filtered by the output gate to form the hidden state output, $h_t$. The output of the LSTM layer represents the encoded sequence feature based on the input pattern.
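A single LSTM timestep can be sketched in plain Python using the standard LSTM formulation. The names below are assumptions for illustration: the parameter keys (`W_f`, `W_i`, `W_c`, `W_o` and the matching biases), the concatenated input $[h_{t-1}, x_t]$, and the function `lstm_step` are not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bk for row, bk in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One timestep; p maps assumed names W_f, b_f, ... to weights and biases."""
    v = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(p["W_f"], p["b_f"], v)]      # forget gate
    i_t = [sigmoid(z) for z in affine(p["W_i"], p["b_i"], v)]      # input gate
    c_hat = [math.tanh(z) for z in affine(p["W_c"], p["b_c"], v)]  # candidate
    o_t = [sigmoid(z) for z in affine(p["W_o"], p["b_o"], v)]      # output gate
    # Forget part of the old cell state and add the gated candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    # Filter the new cell state through the output gate.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Sanity check with hidden size 1, input size 1, and all parameters zero:
# every gate is 0.5 and the candidate is 0, so the cell state halves.
zero = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_c", "W_o")}
zero.update({name: [0.0] for name in ("b_f", "b_i", "b_c", "b_o")})
h_t, c_t = lstm_step([1.0], [0.0], [1.0], zero)
print(h_t, c_t)
```

Running the step over each timestep of the arranged feature sequence, while carrying `h_t` and `c_t` forward, yields the encoded sequence feature that the dense layers classify.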

Experiment setup
The classification and generalisation capability of the proposed method is tested with 16 synthetic models of PQDs [19,31], including the normal signal waveform, single disturbance waveforms, and multiple disturbance waveforms, as listed in Table 1. All experiments are carried out on an AMD Ryzen 7 3800X 8-core processor with an Nvidia P6000 graphics processing unit, using the PyTorch framework. The proposed method takes a 10-period waveform as input. A total of 76800 10-period PQD samples are randomly generated, with 4800 samples for each disturbance class. The sampling frequency is 3200 Hz. As noise is always present in real-world data collection, additive white Gaussian noise (AWGN) at signal-to-noise ratio (SNR) levels of 20–50 dB is added randomly to the generated training samples. Tenfold cross-validation is carried out, with 90% of the samples used for training and 10% for validation. Five testing data sets are generated for benchmarking the models: a set of noiseless samples, and sets with AWGN added at 20 dB, 30 dB, 40 dB, and 50 dB SNR. Each testing set consists of 1000 samples per PQD class. The SNR is defined as $\mathrm{SNR} = 10\log_{10} P_{\mathrm{signal}} - 10\log_{10} P_{\mathrm{noise}}$, where $P_{\mathrm{signal}}$ and $P_{\mathrm{noise}}$ are the signal and noise powers.
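The noise injection can be sketched directly from the SNR definition above: the noise power is chosen so that the power ratio, expressed in decibels, equals the target SNR. This is an illustrative sketch under stated assumptions: a 50 Hz fundamental, a uniform draw over 20–50 dB for the training noise level, and the hypothetical function names `add_awgn` and `measured_snr_db`.

```python
import math
import random

def mean_power(values):
    return sum(v * v for v in values) / len(values)

def add_awgn(signal, snr_db, rng):
    """Return the signal plus zero-mean Gaussian noise at the target SNR in dB."""
    p_noise = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(p_noise)  # noise standard deviation
    return [s + rng.gauss(0.0, sigma) for s in signal]

def measured_snr_db(clean, noisy):
    """SNR = 10 log10(P_signal) - 10 log10(P_noise), computed from the samples."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10.0 * math.log10(mean_power(clean)) - 10.0 * math.log10(mean_power(noise))

rng = random.Random(0)
fs, f0 = 3200, 50  # 50 Hz fundamental is an assumption
clean = [math.sin(2 * math.pi * f0 * n / fs) for n in range(640)]  # 10 periods

# Training-style sample: the SNR level is drawn at random (uniform draw assumed).
train_sample = add_awgn(clean, rng.uniform(20.0, 50.0), rng)

# Test-style sample at a fixed 20 dB SNR.
test_sample = add_awgn(clean, 20.0, rng)
print(round(measured_snr_db(clean, test_sample), 1))
```

Because the noise is random, the measured SNR of any single 640-sample waveform only approximates the target value.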
The main evaluation metric used in these experiments is the classification accuracy. The classification accuracy of an individual class, $Acc_n$, is the number of true positives, $TP_n$, over the total test samples, $S_j$, for the $m$ classes of PQD.
In this paper, two types of input arrangement are used to evaluate the proposed multi-resolution attention LSTM model. The features are arranged in either the TFA or the SFA arrangement, as described in Sect. 2. As shown in Fig. 1, the feature align layer consists of a perceptron layer that encodes the band output coefficients of different dimensions into the same dimension and reshapes the output into either a spatial or a temporal arrangement before it is passed into the attention network. In the temporal feature setup, attention is applied over the respective bands, whereas in the spatial feature setup, attention is applied across bands. The proposed method is also benchmarked against a multi-resolution LSTM model without the attention mechanism, a deep LSTM model [31], and a deep convolutional neural network (DCNN) model [19,31]. The details of the compared models are given in Table 2.
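The accuracy metric defined at the start of this section can be computed from a confusion matrix of counts, as sketched below. The sketch assumes one reading of the definition: the per-class accuracy is $TP_n$ divided by the number of test samples of class $n$, and the overall accuracy is the sum of true positives over all test samples. The counts are toy values for three hypothetical classes, not results from the experiments.

```python
def per_class_accuracy(confusion):
    """confusion[n][k] counts class-n test samples that were predicted as class k."""
    return [row[n] / sum(row) for n, row in enumerate(confusion)]

def overall_accuracy(confusion):
    """Total true positives over the total number of test samples."""
    correct = sum(row[n] for n, row in enumerate(confusion))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Toy counts: 10 test samples for each of three hypothetical classes.
confusion = [
    [9, 1, 0],
    [2, 8, 0],
    [0, 0, 10],
]
acc_n = per_class_accuracy(confusion)
acc = overall_accuracy(confusion)
print(acc_n, acc)
```

Dividing each row by its total also gives the normalised confusion values of the kind quoted in the confusion matrix analysis, where an off-diagonal entry is the fraction of one class mistaken for another.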

Performance analysis of the proposed method
The classification performance of the proposed method and the benchmarking models is tabulated in Table 3. Comparing the classification performance of the Deep LSTM and WT LSTM models shows that the proposed MSD signal transformation increases the overall classification performance across different noise levels. This shows that hybrid model using MSD with DNN increases the classification per-   [19]. Results show that the proposed WT-SFA LSTM model performs better under the highest-noise condition of 20 dB SNR.

Classification accuracy analysis
The individual class accuracy of the proposed WT-SFA LSTM and Deep CNN models are given in Tables 4 and 5,

Confusion matrix analysis
The in-depth classification performance of the model can be visualised using a confusion matrix. Fig. 3 shows that confusion occurs on classes P0-Normal, P8-Notch and P13-Flicker with harmonics. Class P0-Normal is confused with class P13-Sag with Harmonics. There is only a slight difference in the boundary that separates the normal class from the sag or swell disturbance classes, and a higher level of noise can easily disrupt the average magnitude of the signal. The confusion of the slow disturbance classes can be explained by the strong resemblance of high-noise signals to the harmonics classes. However, it can be noticed that the fast transient
The confusion matrix of the WT-SFA LSTM model tested with 20 dB SNR AWGN is shown in Fig. 4. From the confusion matrix, it can be noticed that classes P0-Normal, P8-Notch, and P9-Flicker have higher classification confusion with another class. Classes P0 and P9 are more often confused with class P10-Sag with Harmonics, with confusion values of 0.23 and 0.12, respectively. Both classes (P9 and P13) are categorised as slow transient disturbances. The flicker class can only be detected with multiple periods of waveforms. The confusion of classes P0 and P9 arises because a high-noise condition resembles a higher level of harmonics. The confusion of class P8-Notch with class P15-Flicker with Swell also occurs in the proposed method, but with a 38% improvement: the confusion drops from 0.26 in the deep CNN model to only 0.16 in the WT-SFA LSTM model. This shows that splitting the signal into multiple frequency bands with MSD contributes to the generalisation of classifying PQDs across different frequency bands.

Model complexity analysis
The model complexity comparison is shown in Table 6. The proposed WT-SFA LSTM model has the highest classification accuracy, 91.78%, on the high-noise 20 dB SNR AWGN test. Although its 272 thousand parameters and model size of 1.069 MB are the highest among the models, the training time per epoch is kept at 34 s, which is comparatively low given the improved performance under high-noise conditions.

Conclusion
Automatic feature extraction is a vital process for accurate PQD detection and classification. In this paper, a novel model consisting of a wavelet transform, an attention mechanism, and LSTM is proposed. Multi-level signal decomposition using the wavelet transform decomposes signals into different frequency components. Results show that the wavelet transform helps improve the overall classification accuracy across different noise levels. The classification accuracy under the highest noise level improved from 88.48% to 89.77%. Two attention mechanisms have been examined: temporal feature attention (TFA) and spatial feature attention (SFA). The classification accuracies of TFA and SFA under 20 dB SNR AWGN are 89.87% and 91.78%, respectively, which shows increased classification performance under high-noise conditions. The proposed model has also been benchmarked against a state-of-the-art deep CNN model and shows better performance under high-noise conditions. The complexity of the models has also been compared in the experiment. In future work, the model can be further simplified by simplifying the feature align layer, and the LSTM layer can be replaced with a transformer, which comes with an attention mechanism.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.