1 Introduction

The integration of multiple energy systems, especially renewable energy microgrids, has proven effective in reducing carbon footprints [1]. However, the rapid integration of multiple energy systems also challenges utilities in delivering a high-quality power supply [2, 3]. Power quality disturbances (PQDs) are among the major concerns in improving the functionality and efficiency of a microgrid [4]. A PQD is defined as a disruption of the magnitude or frequency of the sinusoidal power supply waveform [5]. The occurrence of PQDs shortens the lifespan of electrical devices, causes malfunctions in sensitive electronic devices such as computers, triggers unwanted power tripping, incurs financial losses, and decreases productivity. PQD detection and classification is thus an important tool for monitoring, for preventive action [6], and for finding the root cause [7] of power quality disturbances within power systems. The ability to identify the presence of PQDs in a system helps guide energy management operations.

Fig. 1 Multi-resolution attention LSTM model with attention mechanism

PQD detection and classification mainly involves three stages: signal analysis, feature extraction, and classification. Frequency-domain analysis techniques include the fast Fourier transform [8] and the discrete Fourier transform (DFT) [9]. Time-frequency domain analyses such as the short-time Fourier transform (STFT) [10], the wavelet transform (WT) [11, 12], the wavelet packet transform [13], and the S-transform [14] have been widely applied, as they address the shortcomings of the DFT and STFT. These signal analysis methods have been studied and proven effective for the detection and classification of PQDs.

Statistical feature extraction is usually carried out after the signal analysis stage [15, 16]. Statistical features such as the mean, median, RMS, standard deviation, variance, and norm are extracted and passed to the next stage for higher-order feature extraction. Good feature extraction can reduce the computational cost and improve classification performance [17]. Traditional hand-engineered methods use as many statistical features as can be generated to ensure better classification accuracy, but the use of these statistical features is rarely clearly justified [15]. Optimal feature selection methods have been proposed to select the most important of the generated statistical features [18].
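For illustration, a minimal sketch of the kind of hand-engineered statistics typically computed from one transform sub-band is given below; the exact feature set used in the cited works may differ.

```python
import numpy as np

def band_statistics(coeffs):
    """Common hand-engineered statistics for one sub-band of transform coefficients.

    `coeffs` is a 1-D array of wavelet (or other transform) coefficients.
    The exact feature set varies between studies; this list is illustrative only.
    """
    return {
        "mean": np.mean(coeffs),
        "median": np.median(coeffs),
        "rms": np.sqrt(np.mean(coeffs ** 2)),
        "std": np.std(coeffs),
        "variance": np.var(coeffs),
        "norm": np.linalg.norm(coeffs),
    }
```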

Most traditional three-stage PQD classification models use independent feature extraction, where the extracted features are not tied to classification performance. The classification performance of the network therefore depends heavily on the feature extraction stage. However, the handcrafted statistical features may not be informative, and some features might even degrade accuracy. This process often requires expert knowledge and usually changes across scenarios. The lack of a closed-loop feedback between the feature extraction and classification stages has been highlighted in [19] as the missing element for achieving automatic feature extraction and classification. Deep neural network (DNN) models [20,21,22,23] have been proposed to attain automatic feature extraction and classification without human intervention.

Hybrid methods combining the advantages of the WT with artificial neural networks (ANNs) were first presented by Santoso et al. [24, 25]. The WT outputs are squared to form squared wavelet transform coefficients, the extracted features are processed by ANN layers, and a simple thresholded vote provides the classification decision. In [26], statistical features extracted from the WT are classified with a radial basis function neural network. In [27], the WT is used for signal processing, and the extracted statistical features are classified using wavelet networks. Khokhar et al. use the WT as a feature extractor, pass the extracted statistical features to an artificial bee colony optimiser for feature selection, and finally perform classification with a probabilistic neural network [15]. Notably, most hybrid models rely on statistical features for the classification process.

To solve the above-mentioned problems, a novel hybrid approach combining signal analysis with a DNN is proposed. Instead of extracting specific statistical features, the proposed multi-resolution attention LSTM model embeds and aligns the wavelet transform coefficients using a perceptron layer. An attention mechanism is applied to the embedded features to improve the generalisation capability of the network. The highlighted features are then passed into LSTM layers for feature extraction. Finally, the sequence features extracted by the LSTM layer are passed into fully connected layers for classification. An overview of the proposed method is depicted in Fig. 1.

2 Multi-resolution attention LSTM method

Multi-resolution signal decomposition (MSD) is a well-suited tool for transforming time-series input signals into multiple frequency-component signals or time-frequency domain signals [28]. An attention mechanism is combined with MSD to achieve better classification performance, especially in noisy environments. The proposed method, shown in Fig. 1, uses MSD with an attention mechanism and LSTM layers to achieve automatic feature extraction, which in turn enables automatic PQD detection and classification. To obtain the best feature extraction and classification performance, a series of feature manipulations has been tested. Two different attention mechanisms are examined: temporal feature attention (TFA) and spatial feature attention (SFA). The output coefficients of the MSD are first passed through a feature align layer, which embeds the different bands into the same dimension. For TFA, the attention weights are multiplied element-wise with the temporally aligned features from the feature align layer to obtain the temporal attention vector, whereas the spatial attention vector is obtained by multiplying the spatially aligned features with the SFA attention weights. The spatially or temporally highlighted feature vectors are then arranged temporally before being passed into the LSTM layers for feature extraction. Two fully connected layers and a softmax activation function classify the features into their respective disturbance classes. The details of the proposed mechanism are explained in the following sections, and a code-level sketch of the pipeline is given below.
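The sketch below illustrates the pipeline in PyTorch. The band count, the zero-padded band length, the hidden sizes, the ReLU in the classifier, and the single LSTM layer are all assumptions made only for illustration; they are not the exact configuration reported in Table 2.

```python
import torch
import torch.nn as nn

class MultiResolutionAttentionLSTM(nn.Module):
    """Sketch of the pipeline: feature align -> attention -> LSTM -> fully connected classifier."""

    def __init__(self, n_bands=5, band_len=128, align_dim=64,
                 lstm_units=64, n_classes=16, spatial_attention=True):
        super().__init__()
        # Feature align layer: a perceptron embedding each (zero-padded) band into a common dimension.
        self.align = nn.Linear(band_len, align_dim)
        # Dense attention-score layer (Eq. 7): same dimension as the axis it attends over.
        att_dim = n_bands if spatial_attention else align_dim
        self.attention = nn.Linear(att_dim, att_dim)
        self.spatial_attention = spatial_attention
        # The LSTM sees a sequence of length align_dim with n_bands features per step.
        self.lstm = nn.LSTM(input_size=n_bands, hidden_size=lstm_units, batch_first=True)
        # Two fully connected layers; softmax is applied by the cross-entropy loss.
        self.classifier = nn.Sequential(
            nn.Linear(lstm_units, 32), nn.ReLU(), nn.Linear(32, n_classes))

    def forward(self, bands):                               # bands: (batch, n_bands, band_len)
        x = self.align(bands)                               # (batch, n_bands, align_dim)
        if self.spatial_attention:                          # SFA: attend across bands
            x = x.transpose(1, 2)                           # (batch, align_dim, n_bands)
        weights = torch.softmax(self.attention(x), dim=-1)  # Eq. 8
        x = weights * x                                     # Eq. 9: element-wise highlighting
        if not self.spatial_attention:                      # TFA: rearrange temporally for the LSTM
            x = x.transpose(1, 2)                           # (batch, align_dim, n_bands)
        _, (h_n, _) = self.lstm(x)                          # sequence feature extraction
        return self.classifier(h_n[-1])                     # class logits


# Usage sketch: five wavelet bands, zero-padded to 128 samples each.
model = MultiResolutionAttentionLSTM()
logits = model(torch.randn(8, 5, 128))                     # -> (8, 16)
```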

2.1 Multi-resolution signal decomposition

The wavelet transform [29] has proven efficient in detecting discontinuities in signals. The varying window sizes of the WT allow it to achieve an optimal time-frequency resolution. They also allow wavelets to detect non-stationary signals, which characterise most PQDs. The wavelet transform can be expressed as [29]

$$\begin{aligned} f(x) = \sum _{i,j} a_{i,j}\psi _{i,j} (x), \end{aligned}$$
(1)

where \(a_{i,j}\) represents the discrete wavelet transform (DWT) expansion coefficients of the input f(x), with scaling and shifting parameters i and j, respectively, while \(\psi _{i,j}\) represents the wavelet expansion function. The DWT coefficients can be expressed as:

$$\begin{aligned} a_{i,j} = \int _{-\infty }^{\infty } f(x)\psi _{i,j} (x)\,dx. \end{aligned}$$
(2)

Wavelet basis functions can be generated from the mother wavelet, \(\psi\), by tuning the scaling and shifting parameters:

$$\begin{aligned} \psi _{(i,j)} (x) = 2^{-i/2}\psi (2^{-i}x-j), \end{aligned}$$
(3)

where i and j are the scaling and shifting parameters. MSD can be achieved through the two-scale equations for \(\psi _{i,j}\) and \(\phi _{i,j}\):

$$\begin{aligned} \psi _{i,j}(x) = 2^{-i/2} h(j) \psi (2^{-i}x-j), \end{aligned}$$
(4)
$$\begin{aligned} \phi _{i,j} (x) = 2^{-i/2} g(j) \phi (2^{-i}x-j), \end{aligned}$$
(5)

where \(g(j) = (-1)^j h(1-j)\). Here h(j) and g(j) can be viewed as high-pass and low-pass filter coefficients, respectively. MSD is achieved with an \(I^{th}\)-level decomposition as follows [29]:

$$\begin{aligned} f_i (x) = \sum _{i=1}^{I} a_{i,j}\psi _{i,j} (x) + \sum _{i=1}^{I} a_{i,j}\phi _{i,j} (x). \end{aligned}$$
(6)

MSD band-filters the signal into levels of frequencies. The original signal is filtered with a high-pass filter (HPF) to obtain the high-frequency components, while a low-pass filter (LPF) extracts the low-frequency components. The low-pass output of each level is used as the input to the next decomposition level until the desired decomposition level is reached. Figure 2 shows MSD with four levels of decomposition. The detail coefficients D1 to D4 are the high-pass filter outputs of the respective decomposition levels, while A4 is the low-pass filter output of the last decomposition level.

Fig. 2 MSD with four levels of decomposition
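For reference, the four-level decomposition of Fig. 2 can be reproduced with the PyWavelets library. The db4 mother wavelet and the 50 Hz fundamental used in this sketch are assumptions; the text does not fix them at this point.

```python
import numpy as np
import pywt

fs = 3200                                  # sampling frequency used later in Sect. 3
t = np.arange(0, 10 / 50, 1 / fs)          # ten periods, assuming a 50 Hz fundamental
signal = np.sin(2 * np.pi * 50 * t)        # clean sinusoid standing in for a PQD waveform

# Four-level multi-resolution decomposition; returns [A4, D4, D3, D2, D1].
coeffs = pywt.wavedec(signal, wavelet="db4", level=4)
A4, D4, D3, D2, D1 = coeffs
print([len(c) for c in coeffs])            # band lengths roughly halve at each level
```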

2.2 Attention mechanism

The attention mechanism highlights the important characteristics of the input signal. The input signal \(x_t\) of dimension T is passed through a dense layer of the same dimension to obtain the attention score, \(y_d\), as follows:

$$\begin{aligned} y_d=\sum _{t=0}^{T}{w_{d,t} \cdot x_t}, \end{aligned}$$
(7)

where \(x_t\) represents the input signal and \(w_{d,t}\) denotes the trainable weight kernel of the dense layer. A softmax layer normalises the computed attention score into the range (0, 1), forming the attention weight, \(a_{d}\),

$$\begin{aligned} a_d=\hbox {softmax}(y_{d}) = \frac{e^{y_d}}{\sum _{j=0}^{T}{e^{y_{d_j}}}}. \end{aligned}$$
(8)

The attention weight, \(a_{d}\), is then multiplied element-wise with the input signal, resulting in the highlighted feature vector, \(a_d x_t\), which can be expressed as:

$$\begin{aligned} a_{d}x_t=a_d \odot x_t. \end{aligned}$$
(9)
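Equations (7)–(9) map directly onto a dense layer followed by a softmax and an element-wise product. A minimal PyTorch sketch is shown below; the dimension T = 64 is arbitrary.

```python
import torch
import torch.nn as nn

T = 64                                   # dimension of the input feature vector (illustrative)
dense = nn.Linear(T, T, bias=False)      # trainable weight kernel w_{d,t} of Eq. (7)

x = torch.randn(1, T)                    # input signal x_t
y = dense(x)                             # attention score y_d, Eq. (7)
a = torch.softmax(y, dim=-1)             # attention weight a_d, Eq. (8)
highlighted = a * x                      # element-wise highlighted feature vector, Eq. (9)
```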

2.3 Long short-term memory

The main operating mechanism of the LSTM is governed by three “gates” with the aid of activation functions [30]. The LSTM achieves its “memory” through the cell state \(c_t\) in its architecture. The LSTM takes sequential inputs and processes them using three gates: the input gate, \(i_t\), the forget gate, \(f_t\), and the output gate, \(o_t\). The input gate updates the LSTM cell state with new information, the forget gate removes unwanted information from the cell state, and the output gate produces the hidden state, or temporal encoding, of each timestep. A new candidate, \({\tilde{c}}\), is produced by the hyperbolic tangent (tanh) activation function and, scaled by the input gate, is used to update the cell state. Each LSTM unit consists of one set of trainable weights, \(W_f\), \(W_i\), \(W_o\), \(W_c\), \(U_f\), \(U_i\), \(U_o\), \(U_c\), and biases, \(b_f\), \(b_i\), \(b_o\), \(b_c\). The dimensions of these weights can be varied by changing the number of neuron units in the LSTM cell; increasing the number of units increases the number of trainable weights and hence the learning capability of the LSTM layer.

$$\begin{aligned} f_{t}=\sigma (W_f x_t + U_f h_{t-1} +b_f), \end{aligned}$$
(10)
$$\begin{aligned} i_{t}=\sigma (W_i x_t + U_i h_{t-1} +b_i), \end{aligned}$$
(11)
$$\begin{aligned} {\tilde{c}}_{t}=\hbox {tanh} (W_c x_t + U_c h_{t-1} +b_c). \end{aligned}$$
(12)

A new cell state, \(c_t\), is produced at every timestep by forgetting irrelevant information while learning new information. The equation below represents the cell state update mechanism: the previous cell state, \(c_{t-1}\), is multiplied element-wise with the forget gate to remove unwanted information, while the candidate cell state, \({\tilde{c}}_{t}\), is multiplied element-wise with the input gate to learn new information.

$$\begin{aligned} c_{t}=f_{t} \odot c_{t-1}+i_{t} \odot \tilde{c_{t}}. \end{aligned}$$
(13)

The third gate in an LSTM cell is the output gate, \(o_t\), which controls the information output from the LSTM cell. The information in the new cell state, \(c_t\), is filtered by the output gate to form the hidden state output, \(h_t\). The output of the LSTM layer represents the encoded sequence feature based on the input pattern. The output gate and hidden state output are calculated as follows:

$$\begin{aligned} o_t=\sigma (W_o x_t+U_o h_{t-1}+b_o), \end{aligned}$$
(14)
$$\begin{aligned} h_{t}=o_{t} \odot \tanh (c_{t}). \end{aligned}$$
(15)
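For completeness, Eqs. (10)–(15) can be written out as a single timestep update. The sketch below is explicit for readability; in practice a library cell such as torch.nn.LSTMCell implements the same update in fused form.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep written out as in Eqs. (10)-(15).

    W, U, b are dictionaries of trainable parameters keyed by gate name
    ('f', 'i', 'c', 'o').
    """
    f_t = torch.sigmoid(x_t @ W['f'] + h_prev @ U['f'] + b['f'])    # forget gate, Eq. (10)
    i_t = torch.sigmoid(x_t @ W['i'] + h_prev @ U['i'] + b['i'])    # input gate, Eq. (11)
    c_tilde = torch.tanh(x_t @ W['c'] + h_prev @ U['c'] + b['c'])   # candidate state, Eq. (12)
    c_t = f_t * c_prev + i_t * c_tilde                              # cell state update, Eq. (13)
    o_t = torch.sigmoid(x_t @ W['o'] + h_prev @ U['o'] + b['o'])    # output gate, Eq. (14)
    h_t = o_t * torch.tanh(c_t)                                     # hidden state, Eq. (15)
    return h_t, c_t
```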

3 Experiment setup

Table 1 Classes of power quality disturbances

The classification and generalisation capability of the proposed method is tested with 16 synthetic PQD models [19, 31], comprising the normal waveform, single-disturbance waveforms, and multiple-disturbance waveforms, as listed in Table 1. All experiments are carried out on an AMD Ryzen 7 3800X 8-core processor with an Nvidia P6000 graphics processing unit, using the PyTorch framework. The proposed method takes a 10-period waveform as input. A total of 76,800 10-period PQD samples are randomly generated, with 4800 samples per disturbance class, at a sampling frequency of 3200 Hz. As noise is always present in real-world data collection, additive white Gaussian noise (AWGN) at signal-to-noise ratio (SNR) levels of 20–50 dB is added randomly to the generated training samples. Tenfold cross-validation is carried out, with 90% of the samples used for training and 10% for validation in each fold. Five testing data sets are generated for model benchmarking: a noiseless set and sets with 20 dB, 30 dB, 40 dB, and 50 dB SNR AWGN. Each testing set consists of 1000 samples per PQD class. The SNR is defined as

$$\begin{aligned} \hbox {SNR} = 10\log _{10}{P_\textrm{signal}} - 10\log _{10}{P_\textrm{noise}}. \end{aligned}$$
(16)
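One common way to inject noise at a target SNR consistent with Eq. (16) is sketched below; the exact noise-generation procedure used to build the data sets is not detailed here, so this is only an assumption of a typical implementation.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise so that the SNR of Eq. (16) equals snr_db."""
    rng = rng if rng is not None else np.random.default_rng()
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)

# e.g. a random training-noise level in the 20-50 dB range used in this paper:
# noisy = add_awgn(clean_waveform, snr_db=np.random.uniform(20, 50))
```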

The main evaluation metric used in these experiments is classification accuracy. The classification accuracy of an individual class, \(\hbox {Acc}_n\), is the number of true positives, \(TP_n\), divided by the total number of test samples over the m PQD classes, \(S_j\):

$$\begin{aligned} \hbox {Acc}_n = \frac{TP_n}{\sum _{j=0}^{m} S_j}. \end{aligned}$$
(17)
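A sketch of how Eq. (17) and the confusion matrices discussed in Sect. 4.2 can be computed is given below, assuming integer class labels 0–15.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, n_classes=16):
    """Acc_n of Eq. (17): true positives of class n over the total number of test samples.

    Note: dividing by the per-class sample count instead would give the
    conventional per-class recall reported in most confusion-matrix tables.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = len(y_true)
    return np.array([np.sum((y_true == k) & (y_pred == k)) / total for k in range(n_classes)])

def confusion_matrix(y_true, y_pred, n_classes=16):
    """Row-normalised confusion matrix of the kind visualised in Figs. 3 and 4."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(np.asarray(y_true), np.asarray(y_pred)):
        cm[t, p] += 1
    return cm / cm.sum(axis=1, keepdims=True)   # assumes every class appears in y_true
```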

In this paper, two types of input arrangement are used to evaluate the proposed multi-resolution attention LSTM model: the features are arranged for either TFA or SFA, as described in Sect. 2, and a shape-level sketch of the two arrangements is given below. As shown in Fig. 1, the feature align layer consists of a perceptron layer that encodes the band output coefficients of different dimensions into the same dimension and reshapes the output into either the spatial or the temporal arrangement before it is passed into the attention network. In the temporal arrangement, attention is applied within each band, whereas in the spatial arrangement, attention is applied across bands. The proposed method is also benchmarked against a multi-resolution LSTM model without the attention mechanism, a deep LSTM model [31], and a deep convolutional neural network (DCNN) model [19, 31]. The details of the compared models are given in Table 2.
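At a shape level, the two arrangements differ only in which axis of the aligned feature tensor the attention softmax acts over. The sketch below uses random stand-in scores and an assumed five bands of aligned length 64 purely for illustration.

```python
import torch

aligned = torch.randn(8, 5, 64)         # (batch, bands, aligned length) from the feature align layer

# Temporal arrangement (TFA): attention over the aligned axis within each band.
tfa_scores = torch.randn(8, 5, 64)      # stand-in for the dense-layer scores of Eq. (7)
tfa = torch.softmax(tfa_scores, dim=-1) * aligned
tfa = tfa.transpose(1, 2)               # -> (batch, 64, 5): temporal arrangement for the LSTM

# Spatial arrangement (SFA): attention across bands at each aligned position.
spatial = aligned.transpose(1, 2)       # (batch, 64, bands)
sfa_scores = torch.randn(8, 64, 5)      # stand-in for the dense-layer scores
sfa = torch.softmax(sfa_scores, dim=-1) * spatial   # already in (batch, seq, features) order
```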

4 Performance analysis of the proposed method

The classification performance of the proposed method and the benchmarking models is tabulated in Table 3. Comparing the Deep LSTM and WT LSTM models shows that the proposed MSD signal transformation increases the overall classification performance across different noise levels, i.e. a hybrid model combining MSD with a DNN improves the classification performance of the DNN classifier. Comparing WT-TFA LSTM and WT-SFA LSTM against WT LSTM shows that the attention mechanism helps improve classification under noisy conditions, especially in the high-noise 20 dB SNR test. Spatial feature attention in the WT-SFA LSTM model shows the greater improvement, achieving the highest classification accuracy of \(91.78\%\) in the high-noise 20 dB SNR AWGN test. The results show that the SFA mechanism improves classification performance under high-noise conditions. The classification performance of the proposed WT-SFA LSTM is also compared with the deep CNN model from the literature [19]; the proposed WT-SFA LSTM model performs better at the highest noise level of 20 dB SNR.

Table 2 Details of the benchmarking models
Table 3 Classification performance comparisons
Table 4 Performance of the proposed WT-SFA LSTM model tested with 20–50 dB AWGN and noiseless conditions
Table 5 Performance of the deep CNN model trained with random synthetic PQD data and tested with 20–50 dB AWGN and noiseless conditions

4.1 Classification accuracy analysis

The individual class accuracies of the proposed WT-SFA LSTM and the Deep CNN models are given in Tables 4 and 5, respectively. From Table 4, it can be seen that the model has weaker classification performance on class P0-Normal and class P8-Notch under the 20 dB SNR condition, indicating difficulty in identifying the normal and notch classes under high noise. As additive noise is introduced, the notching effect, which has negative magnitude, can easily be neutralised by the AWGN. High-noise conditions may also cause the classifier to confuse the signal with harmonic disturbances. On the other hand, the deep CNN model in Table 5 shows weaker classification performance on classes P8-Notch and P13-Flicker with Harmonics under 20 dB SNR. The neutralised magnitude of the notching effect has a more serious impact on the deep CNN model, and harmonic confusion negatively affects class P13 in the deep CNN model.

Table 6 Model complexity comparisons

4.2 Confusion matrix analysis

Fig. 3 Confusion matrix of the DCNN model in the 20 dB SNR AWGN test

The in-depth classification performance of a model can be visualised using its confusion matrix. Figure 3 shows that confusion occurs for classes P0-Normal, P8-Notch, and P13-Flicker with Harmonics. Class P0-Normal is confused with class P13-Sag with Harmonics. There is only a slight difference in the boundary between the normal class and the sag or swell disturbance classes, and a higher level of noise can easily disrupt the average signal magnitude. The confusion of the slow-disturbance classes can be explained by the strong resemblance of highly noisy signals to the harmonics classes. However, it can be noticed that the fast-transient class P8-Notch has a high confusion of 0.26 with the slow-transient class P15-Sag with Harmonics. This shows that the performance of the Deep CNN can be seriously affected by disturbances across different frequencies. The confusion of class P13-Flicker with Harmonics with classes P5-Spike and P6-Harmonics again shows this classifier's confusion between fast- and slow-transient disturbances.

The confusion matrix of the WT-SFA LSTM model tested with 20 dB SNR AWGN is shown in Fig. 4. From the confusion matrix, it can be noticed that classes P0-Normal, P8-Notch, and P9-Flicker have higher classification confusion with other classes. Classes P0 and P9 are mostly confused with class P10-Sag with Harmonics, with confusion values of 0.23 and 0.12, respectively. Both classes (P9 and P13) are categorised as slow-transient disturbances, and flicker can only be detected over multiple periods of the waveform. The confusion of classes P0 and P9 arises because the high-noise condition resembles a higher level of harmonics. The confusion of class P8-Notch with class P15-Flicker with Swell also occurs in the proposed method, but with a 38% improvement, i.e. from a confusion of 0.26 in the deep CNN model to only 0.16 in the WT-SFA LSTM model. This shows that splitting the signal into multiple frequency bands through MSD contributes to the generalisation of classifying PQDs across different frequency bands.

Fig. 4 Confusion matrix of the WT-SFA LSTM model in the 20 dB SNR AWGN test

4.3 Model complexity analysis

The model complexity comparison is shown in Table 6. The proposed WT-SFA LSTM model achieves the highest classification accuracy of 91.78% in the high-noise 20 dB SNR AWGN test. Although the model has the largest parameter count, about 272 thousand parameters, and the largest model size of 1.069 MB among the compared models, the training time per epoch is still kept at 34 s, which is comparatively low for the improved performance under high-noise conditions.
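The parameter counts and model sizes of Table 6 can be reproduced for any PyTorch model with a short helper such as the sketch below; float32 parameter storage is assumed when converting to megabytes.

```python
import torch.nn as nn

def model_complexity(model: nn.Module):
    """Trainable parameter count and approximate size in MB, assuming float32 storage."""
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return n_params, n_params * 4 / (1024 ** 2)   # 4 bytes per float32 parameter
```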

5 Conclusion

Automatic feature extraction is a vital process for accurate PQD detection and classification. In this paper, a novel model combining the wavelet transform, an attention mechanism, and LSTM is proposed. Multi-level signal decomposition using the wavelet transform decomposes signals into different frequency components. The results show that the wavelet transform helps improve the overall classification accuracy across different noise levels; the classification accuracy under the highest noise level improved from 88.48 to 89.77%. Two attention mechanisms have been examined: temporal feature attention (TFA) and spatial feature attention (SFA). The classification accuracies of TFA and SFA under 20 dB SNR AWGN are 89.87% and 91.78%, respectively, demonstrating improved classification performance under high-noise conditions. The proposed model has also been benchmarked against a state-of-the-art deep CNN model and shows better performance under high-noise conditions. Model complexity has also been compared in the experiments. In future work, the model could be further simplified by simplifying the feature align layer, and the LSTM layer could be replaced with a transformer, which comes with a built-in attention mechanism.