Introduction

Large-scale power systems have become an essential foundation of modern society. Their stability faces many challenges from both internal and external factors, such as equipment failures and extreme weather [1]. Numerous incidents have shown that power system instability causes significant economic losses and other serious consequences for society, so power system stability assessment is becoming increasingly important. As one aspect of power system stability, short-term voltage stability (STVS) describes the ability of a power system to quickly restore the voltage of each bus to an acceptable level after a fault [2]. Predicting voltage instability in real time leaves more time for protective measures to be taken, thereby minimizing losses.

As an effective means of realizing digital management of the power grid and enhancing system safety, phasor measurement units (PMUs) have been widely deployed in actual power grids in recent years [3]. Meanwhile, with advances in data science technologies such as machine learning and deep learning, together with progress in communication technology, data-driven STVS assessment methods have developed rapidly and are gradually being applied in engineering practice.

Literature review

PMUs dynamically collect various power system data in real time, producing high-dimensional time series. Building on the extensive grid operating information contained in PMU signal sequences, data-driven STVS assessment models have undergone rich development. Time-domain simulation (TDS) is the traditional method for STVS assessment [4]: it describes the power system structure and its inputs by constructing complex mathematical models. However, its high computational complexity prevents TDS from meeting the latency requirements of online STVS assessment, and it suffers from further drawbacks such as limited accuracy. It is gradually being replaced by new methods based on machine learning [5].

In the literature, many classical machine learning models have been introduced into STVS assessment and reported to perform well. Support vector machines (SVMs) [6] and decision trees (DTs) [7] were introduced first; building on these two classical model families, Zhang et al. [8] used random forests for STVS assessment, and the method proposed by Zhu et al. [9] further improved the response speed of the model. However, classical machine learning models depend heavily on additional expert knowledge and feature engineering, which limits their application in complex real-world environments. Recurrent neural networks (RNNs), whose structure effectively captures the temporal information of sequence data, often yield concise, direct model architectures and perform exceptionally well across various sequence tasks. James et al. [10] and Gupta et al. [11] constructed STVS assessment models based on long short-term memory (LSTM) and gated recurrent units (GRUs), respectively, and achieved good results.

The above studies assume that the input PMU data are complete. For instance, in RNN-based models, each cell's output is computed only from the previous hidden state and the current input, so RNN-based models require input time series with identical sampling intervals [12]. In practical engineering environments, however, the PMU data received by a model often suffer random losses due to network delays, equipment malfunctions, and other causes, as shown in Fig. 1. The PMU sequences fed to the model therefore have randomly irregular intervals, and the intervals between consecutive sampling points may differ greatly. In addition, since each PMU measures a different part of the grid, randomly missing PMUs make the grid topology captured by each frame of the input sequence inconsistent [13]. Weerakody et al. [14] noted that most existing time series models, especially RNN-based ones, cannot directly handle this type of data. Some studies have proposed imputing incomplete PMU data during preprocessing so that classification models designed for complete data can be used; for example, Gao et al. [15] used matrix completion, and Ren et al. [16] used deep residual learning.

Fig. 1 Incomplete PMU signal sequence

Obviously, simple missing data imputation methods reduce model classification accuracy, while complex imputation methods significantly increase the computational cost of model inference. Moreover, because PMU measurement sequences exhibit different characteristics when the grid is stable and when it is unstable, imputing missing data in both types of sequences with the same method inevitably introduces bias. As a result, these models degrade significantly when faced with more severe missing data.

Some other models can perform STVS assessment directly on incomplete PMU data. Such methods are generally built on classical machine learning algorithms, especially tree models [17]. Guo and Milanović [18] designed special splitting rules that enable tree models to handle incomplete PMU data, and Zhang et al. [13] recovered the system topology represented by the data as much as possible through a clustering rule. However, to minimize the impact of incomplete data, these models construct a large number of submodels, which significantly increases computational complexity and reduces their practical value.

The Transformer, a fully attention-based model proposed by Vaswani et al. [19], provides a new approach for handling incomplete PMU data. The Transformer is insensitive to the order of input sequences: its flexible positional encoding serves only as auxiliary information, making it possible to describe the relative relationships between data points in a sequence. Its multi-head attention mechanism has been shown to be robust on incomplete data [20] and can automatically learn to reconstruct the original topology of the data from incomplete information. The Transformer therefore shows great potential for directly extracting the information contained in incomplete PMU measurements.

Contributions

Both complete PMU signal sequences and STVS assessment results are highly valuable in grid operation. In existing studies, imputation must be performed first to obtain a complete PMU signal sequence before STVS assessment, which incurs a large computational overhead, and no model can deliver complete PMU signal sequence prediction and STVS assessment simultaneously. This paper recognizes the relationship between the two tasks and the Transformer's ability, owing to its weak assumptions about input sequences, to adapt to incomplete PMU signal sequences. We introduce the Transformer into the field of STVS assessment and, on this basis, fuse the two tasks of missing completion and stability classification, proposing for the first time a multi-task learning model that handles both tasks simultaneously. In addition, the contributions of this paper include the following:

  1. The proposed model shows a significant improvement in single-task performance compared with traditional single-task models, demonstrating stronger generalization ability and robustness. It performs well even in the short observation window and high missing data rate scenarios that challenge traditional single-task models.

  2. Thanks to the hard sharing mechanism for the underlying parameters, the proposed model has significantly fewer parameters and lower computational complexity than single-task models, and its submodels can be selectively used to further save inference time according to practical needs. This gives the model high practical value.

  3. This paper demonstrates the necessity and rationality of constructing and training a multi-task learning model that combines the two tasks through a hard sharing mechanism, an attention mechanism, and dynamic weight average (DWA).

Methodology

The framework of multi-task learning

Figure 2 compares the PMU signal sequences under stable and unstable system conditions. In most cases, when the system remains stable throughout, the voltage signal sequence also remains steady; when the system is on the verge of instability, the voltage signal sequence exhibits complex fluctuations. This indicates an underlying correlation between the PMU missing data filling task and the STVS classification task, and it motivates constructing a multi-task learning model to extract the implicit connection between the two tasks.

Fig. 2 PMU signal sequence in two grid states

Multi-task learning can be considered a form of transfer learning that, in effect, adds additional constraints to each task [21]. With proper design, by optimizing the objectives of two different tasks, a multi-task learning model can better extract the information embedded in the data and avoid overfitting, while reducing computational complexity and training difficulty (e.g., avoiding local optima).

The two main challenges in building a multi-task learning model are choosing the right parameter sharing scheme and the loss function balancing mechanism. Traditionally, multi-task learning models fall into two categories, as shown in Fig. 3: (a) hard sharing and (b) soft sharing [22]. In the hard sharing mechanism, all tasks share the same underlying hidden layers, while a few task-specific output layers produce the results of the different tasks simultaneously. The soft sharing mechanism, also known as "constraint-based sharing", is more flexible: it associates subtasks by regularizing the distance between the parameters of the task models. Hard sharing suits strongly related tasks, whereas soft sharing suits weakly related ones. Moreover, because the computational complexity of soft sharing is too high for the timeliness required by STVS assessment, the model proposed in this paper adopts the hard sharing mechanism.

Fig. 3 Two types of multi-task learning frameworks

The differences in the nature of the two tasks, missing filling and stability classification, must be considered, since the hard sharing mechanism often struggles when subtasks differ too much. Liu et al. [23] used a soft-attention mechanism [24, 25] in their multi-task attention network (MTAN) to make task-specific selections from the features extracted by the shared layers, allowing shared underlying features to be selected and combined with task-specific focus before entering each task submodel. This relaxes multi-task learning's assumption of low intertask variation.

The two tasks of missing data filling and sequence classification are regression and classification tasks, respectively, and their loss functions live in different metric spaces. The balance between the loss functions therefore needs careful design to avoid training difficulties caused by the different convergence rates of the tasks. The DWA method proposed in [23] is used to dynamically adjust the weights of each task during training. The weight \({\lambda }_{k}(t)\) of task \(k\) at epoch \(t\) is calculated according to Eqs. (1) and (2):

$${\lambda }_{k}\left(t\right)=\frac{\mathrm{exp}\left(\frac{{\omega }_{k}\left(t-1\right)}{T}\right)}{\sum_{i}\mathrm{exp}\left(\frac{{\omega }_{i}\left(t-1\right)}{T}\right)} \quad \left(i=1,2,\dots ,k\right)$$
(1)
$${\omega }_{k}\left(t-1\right)=\sum_{j=1}^{5}\frac{{\mathcal{L}}_{k}\left(t-j\right)}{{\mathcal{L}}_{k}\left(t-j-1\right)}$$
(2)

where \({\mathcal{L}}_{k}\left(t\right)\) is the average loss of task \(k\) at epoch \(t\), and \(T\) is a relaxation (temperature) factor: a larger \(T\) softens the weighting and distributes the weights more evenly across tasks. Because of the exponential operations in Eq. (1), the log-sum-exp stabilization trick shown in Eq. (3) is used in the actual calculation to avoid overflow by noting \( b = \max _i x_i \):

$$\frac{\mathrm{exp}\left({x}_{i}\right)}{\sum_{j}\mathrm{exp}\left({x}_{j}\right)}=\frac{\mathrm{exp}\left({x}_{i}-b\right)}{\sum_{j}\mathrm{exp}\left({x}_{j}-b\right)}.$$
(3)
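To make the weighting scheme concrete, the sketch below computes the DWA weights of Eqs. (1)–(3) from a per-epoch loss history. It is a minimal NumPy sketch under our own naming (`dwa_weights` and `loss_history` are not from the paper); Eq. (2)'s five-epoch ratio sum and the log-sum-exp stabilization of Eq. (3) are implemented as written above.

```python
import numpy as np

def dwa_weights(loss_history, T=5.0, window=5):
    """Dynamic weight average per Eqs. (1)-(3).

    loss_history: array of shape (epochs, num_tasks) holding the average
    training loss of each task per epoch. Returns one weight per task.
    """
    loss_history = np.asarray(loss_history, dtype=float)
    epochs, num_tasks = loss_history.shape
    if epochs < window + 1:
        # Not enough history yet: fall back to equal weights.
        return np.full(num_tasks, 1.0 / num_tasks)

    # Eq. (2): omega_k(t-1) = sum_{j=1}^{5} L_k(t-j) / L_k(t-j-1),
    # i.e., a sum of ratios of consecutive epoch losses.
    ratios = loss_history[-window:] / loss_history[-window - 1:-1]
    omega = ratios.sum(axis=0)

    # Eq. (1) with the log-sum-exp trick of Eq. (3): subtract b = max_i x_i
    # before exponentiating to avoid overflow.
    x = omega / T
    x -= x.max()
    e = np.exp(x)
    return e / e.sum()
```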

Proposed method composition

Shared feature extraction submodel

For incomplete PMU data \({\varvec{x}}\in {\mathbb{R}}^{f}\) with \(f\) input features, the Transformer first performs an embedding that projects \({\varvec{x}}\) into a \(d\)-dimensional space, \({\varvec{Y}}\in {\mathbb{R}}^{f\times d}\). In the model implementation, position information is represented by an additional positional encoding \({\varvec{P}}\in {\mathbb{R}}^{f\times d}\), which has the same dimensions as \({\varvec{Y}}\) and can therefore be added directly. In this paper, the sine–cosine encoding shown in Eq. (4) represents the relative position relationship:

$$ \begin{aligned} {\varvec{P}}(i,2j) &= \mathrm{sin} (w_{j}) \\ {\varvec{P}}(i,2j + 1) &= \mathrm{cos} (w_{j}), \quad i \in [1,f] \end{aligned}$$
(4)

where \({w}_{j}=\frac{i}{10{,}000^{2j/d}}\), \(j\in \left[1,\frac{d}{2}-1\right]\).

Considering two positional encoding vectors \({\varvec{P}}\left(i\right)={\left({\varvec{P}}\left(i,2j\right),{\varvec{P}}\left(i,2j+1\right)\right)}^{T}\) and \({\varvec{P}}\left(i+k\right)={\left({\varvec{P}}\left(i+k,2j\right),{\varvec{P}}\left(i+k,2j+1\right)\right)}^{T}\) with a relative position of \(k\), \({\varvec{P}}(i+k)\) can be obtained from \({\varvec{P}}(i)\) through a simple linear transformation, as shown in Eq. (5):

$${\varvec{P}}\left(i+k\right)=\left[\begin{array}{cc}\mathrm{cos}({w}_{j}k)& \mathrm{sin}\left({w}_{j}k\right)\\ -\mathrm{sin}\left({w}_{j}k\right)& \mathrm{cos}\left({w}_{j}k\right)\end{array}\right]{\varvec{P}}\left(i\right).$$
(5)

This shows that the Transformer can capture the relative positional information of incomplete PMU voltage signal sequences. In addition, this positional encoding maintains consistency across time sequences of different lengths, which is one reason why the model proposed in this paper has good adaptability to data with different missing rates.
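As an illustration, a minimal PyTorch sketch of the sine–cosine encoding of Eq. (4) follows. The function name is ours, and the embedding dimension \(d\) is assumed to be even.

```python
import torch

def sinusoidal_positional_encoding(f: int, d: int) -> torch.Tensor:
    """Sine-cosine positional encoding P in R^{f x d} (Eq. (4)); d even."""
    position = torch.arange(f, dtype=torch.float32).unsqueeze(1)          # i
    div = 10_000.0 ** (torch.arange(0, d, 2, dtype=torch.float32) / d)    # 10000^{2j/d}
    w = position / div                                                    # w_j = i / 10000^{2j/d}
    P = torch.zeros(f, d)
    P[:, 0::2] = torch.sin(w)   # even dimensions: sine
    P[:, 1::2] = torch.cos(w)   # odd dimensions: cosine
    return P

# The encoding is simply added to the embedded input:
#   Y = embedding(x) + sinusoidal_positional_encoding(f, d)
```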

Multi-head attention is the core of the Transformer. Each self-attention head independently attends to different portions of the abstract space \({\mathbb{R}}^{f\times d}\). The Transformer concatenates the information extracted by each head, capturing long- and short-range dependencies and effectively extracting information at different levels and different position combinations in the sequence [26]. The output of the feature extraction submodel (FES) therefore has the same dimensions as its input; in this paper, the extracted information is denoted as \({\varvec{E}}\in {\mathbb{R}}^{f\times d}\).
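For concreteness, the sketch below assembles an FES from PyTorch's built-in Transformer encoder. It is a sketch, not the paper's exact architecture: the per-point scalar embedding, the layer sizes, and the head count are illustrative assumptions, and it reuses the positional encoding sketch above.

```python
import torch.nn as nn

class FeatureExtractionSubmodel(nn.Module):
    """Shared FES: a Transformer Encoder over the embedded PMU sequence."""
    def __init__(self, f: int, d_model: int = 64, nhead: int = 4,
                 num_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)   # project each scalar point to R^d
        self.register_buffer("pos", sinusoidal_positional_encoding(f, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):                            # x: (batch, f)
        y = self.embed(x.unsqueeze(-1)) + self.pos   # Y + P: (batch, f, d)
        return self.encoder(y)                       # E: (batch, f, d)
```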

Missing completion submodel and stability classification submodel

Because the Transformer Encoder is used as the FES, the missing completion submodel (MCS) in this paper is in effect the decoder of a "sequence to sequence" (seq2seq) architecture [27]. The MCS must directly output the complete PMU data \(\mathcal{Z}\in {\mathbb{R}}^{t\times b}\), where \(t\) is the observation window and \(b\) is the number of buses. Since the completed PMU signal sequence is no longer unequally spaced, while the grid topology information contained in the sequence is unchanged, traditional RNN-type models have been shown to perform well on this task. In practice, the decoder can be either the decoder part of the Transformer or a model based on the traditional RNN structure. Compared with the more complex Transformer decoder, using the GRU [28], which achieves state-of-the-art (SOTA) performance among RNN-type models, as the decoder has shown excellent results in practice and is easier to train. Figure 4 is a schematic of the seq2seq structure, where \({\varvec{z}}\) is a row vector of \(\mathcal{Z}\).

Fig. 4 Seq2seq structure
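A minimal sketch of such a GRU decoder is shown below. It unrolls a `GRUCell` for \(t\) steps and feeds each predicted row of \(\mathcal{Z}\) back as the next input; pooling the shared features into the initial hidden state is our own simplification, not necessarily the paper's exact wiring, and teacher forcing is omitted.

```python
import torch
import torch.nn as nn

class MissingCompletionSubmodel(nn.Module):
    """MCS: a GRU decoder that emits the complete sequence Z in R^{t x b}."""
    def __init__(self, d_model: int, num_buses: int, window: int):
        super().__init__()
        self.window = window
        self.init_h = nn.Linear(d_model, d_model)
        self.gru = nn.GRUCell(num_buses, d_model)
        self.out = nn.Linear(d_model, num_buses)

    def forward(self, E):                            # E: (batch, f, d)
        h = torch.tanh(self.init_h(E.mean(dim=1)))   # pool features into h0
        z = torch.zeros(E.size(0), self.out.out_features, device=E.device)
        outputs = []
        for _ in range(self.window):                 # one row z of Z per step
            h = self.gru(z, h)
            z = self.out(h)
            outputs.append(z)
        return torch.stack(outputs, dim=1)           # Z: (batch, t, b)
```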

For the missing voltage signal sequence completion task, an important objective is shape similarity between the forecasted sequence and the ground truth. As described in "The framework of multi-task learning", the shape of the voltage signal sequence carries important information about the grid state; if the shape of the filled sequence differs substantially from the real one, subsequent analyses using these data become invalid. The traditional MSE loss measures the distance between two sequences point by point, whereas the dynamic time warping (DTW) index, illustrated in Fig. 5, better captures the shape similarity between two sequences [29]: a smaller DTW indicates more similar shapes. The obstacle to using DTW as a loss function has been that the classical DTW index is not differentiable and therefore cannot be used directly as the loss function of a neural network. Cuturi et al. [30] proposed the differentiable Soft-DTW loss function, and Le Guen et al. [31] subsequently proposed the DIstortion LOss including shApe and TimE (DILATE), which adds an evaluation of temporal accuracy on top of Soft-DTW. Le Guen et al. [31] showed that this helps the model reach an optimal DTW index without neglecting the optimization of "point-to-point" indexes such as MSE, which capture temporal differences, and thus better meets the objective of voltage signal sequence completion.

Fig. 5 The difference between DTW and the traditional Euclidean distance index
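For reference, the sketch below implements the Soft-DTW recursion of Cuturi et al. [30] in PyTorch: each cell of the alignment cost table replaces the hard minimum of classical DTW with a smooth soft-minimum, which is what makes the index differentiable. This is a plain O(nm) illustration rather than the batched implementation used in practice; DILATE then combines this shape term with a temporal penalty through the compromise parameter \(\alpha\).

```python
import torch

def soft_dtw(x, y, gamma: float = 0.01):
    """Soft-DTW between sequences x (n, b) and y (m, b).

    Classical DTW takes a hard min over the three predecessor cells;
    Soft-DTW uses -gamma * logsumexp(-(.)/gamma), which is smooth.
    """
    n, m = x.size(0), y.size(0)
    cost = torch.cdist(x, y) ** 2    # pairwise squared Euclidean distances
    inf = torch.tensor(float("inf"), dtype=x.dtype, device=x.device)
    zero = torch.zeros((), dtype=x.dtype, device=x.device)
    prev_row = [zero] + [inf] * m    # R[0, :]: only R[0, 0] = 0 is reachable
    for i in range(1, n + 1):
        row = [inf]                  # R[i, 0]
        for j in range(1, m + 1):
            # soft-min over R[i-1, j], R[i, j-1], R[i-1, j-1]
            prev = torch.stack([prev_row[j], row[j - 1], prev_row[j - 1]])
            softmin = -gamma * torch.logsumexp(-prev / gamma, dim=0)
            row.append(cost[i - 1, j - 1] + softmin)
        prev_row = row
    return prev_row[m]               # soft alignment cost R[n, m]
```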

Similar to the MCS, the stability classification submodel (SCS) in the proposed method can be any suitable machine learning classifier. To reduce the number of model parameters, a more straightforward design is adopted: the features selected by the soft-attention mechanism pass sequentially through fully connected and dropout layers, and the submodel outputs the probabilities of the two classes. The loss function used for the stability classification task is the cross-entropy loss.
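A minimal sketch of such a classification head is given below; the layer width and dropout rate are illustrative assumptions, and PyTorch's `CrossEntropyLoss` (which includes the softmax) is applied to the raw logits.

```python
import torch.nn as nn

class StabilityClassificationSubmodel(nn.Module):
    """SCS: attention-selected features -> FC -> ReLU -> dropout -> 2 logits."""
    def __init__(self, d_model: int, f: int, hidden: int = 128, p: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                    # (batch, f, d) -> (batch, f*d)
            nn.Linear(f * d_model, hidden),
            nn.ReLU(),
            nn.Dropout(p),
            nn.Linear(hidden, 2),            # stable / unstable
        )

    def forward(self, E_selected):
        return self.net(E_selected)

# Training: loss_cls = nn.CrossEntropyLoss()(logits, labels)
```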

Proposed method summary

The method proposed in this paper, illustrated in Fig. 6, fully considers computational complexity and actual online application. A multi-task learning model handling the two subtasks of voltage stability assessment and missing completion for incomplete PMU data is constructed on a hard sharing mechanism, exploiting the respective strengths of the Transformer and the GRU. The Transformer Encoder forms the FES: incomplete PMU sequence data are first passed through an embedding layer, and multi-head attention automatically extracts different aspects of the information contained in the input data. Before each task-specific submodel, soft attention selects and combines the underlying information that the two tasks need separately. The SCS consists of a fully connected layer with an associated ReLU activation function and dropout layers. In the MCS, the GRU sequentially outputs the predicted values of the complete sequence. As an open framework, these specific components can be replaced by other models with similar properties. This paper uses the DILATE loss function, which considers both the shape similarity of the predicted sequence and its accuracy in the time domain and thus suits the practical requirements of PMU voltage signal sequences, and DWA is introduced to balance the weights of the two tasks during training.

Fig. 6 Proposed method framework
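Putting the pieces together, the hard-sharing skeleton below composes the submodel sketches defined earlier. The per-feature sigmoid gates stand in for the MTAN-style soft attention of Liu et al. [23] and are a simplification, as are all layer sizes.

```python
import torch.nn as nn

class MultiTaskSTVSModel(nn.Module):
    """Hard sharing: one shared FES, soft-attention gates, MCS and SCS heads.

    Reuses FeatureExtractionSubmodel, MissingCompletionSubmodel, and
    StabilityClassificationSubmodel from the sketches above.
    """
    def __init__(self, f, d_model, num_buses, window):
        super().__init__()
        self.fes = FeatureExtractionSubmodel(f, d_model)
        self.gate_mcs = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.gate_scs = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.mcs = MissingCompletionSubmodel(d_model, num_buses, window)
        self.scs = StabilityClassificationSubmodel(d_model, f)

    def forward(self, x):
        E = self.fes(x)                            # shared features
        Z = self.mcs(E * self.gate_mcs(E))         # completed sequence
        logits = self.scs(E * self.gate_scs(E))    # stability logits
        return Z, logits

# Per batch, with DWA weights lam = (lam_mcs, lam_scs):
#   loss = lam[0] * dilate_loss(Z, Z_true) + lam[1] * cross_entropy(logits, y)
```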

Experiment

Datasets and configurations

The IEEE New England 10-machine 39-bus test system is a benchmark for research on power system stability [32]. In this paper, a dataset of incomplete PMU signals is constructed from the open dataset proposed by Sarajcev et al. [33]. That dataset was obtained by simulation with the MATLAB®/Simulink electromechanical transient simulation toolbox and contains 9360 valid samples that systematically cover different load levels, fault types, and fault locations, with a balanced number of stable and unstable samples and good representativeness. The PMU data are recorded at a 1/60 s sampling interval with a total record length of 3 s. Different levels of missing data are obtained by randomly dropping a given number of data points, as shown in Fig. 1 (e.g., a missing data percentage of 20% means that 20% of the \(39\times \mathrm{sequence\ length}\) data points are randomly dropped). The missing data percentage is defined by Eq. (6):

$$ {\text{Missing}}\,{\text{data}}\,{\text{percentage}} = \frac{{{\text{Total}}\,{\text{missing}}\,{\text{number}}\,{\text{of}}\,{\text{data}}\,{\text{points}}\,{\text{in}}\,{\text{a}}\,{\text{sample}}}}{{{\text{Total}}\,{\text{number}}\,{\text{of}}\,{\text{data}}\,{\text{points}}\,{\text{contained}}\,{\text{in}}\,{\text{a}}\,{\text{sample}}}} \times 100\% $$
(6)

In addition, 80% of the valid samples are used as the training set, and 20% are used as the test set.
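A sketch of this masking step is shown below; the function and its signature are our own, and in the paper the preprocessing is applied offline to the dataset of Sarajcev et al. [33].

```python
import numpy as np

def drop_random_points(sample: np.ndarray, missing_pct: float, seed=None):
    """Randomly drop points from a (window, 39) PMU sample per Eq. (6).

    missing_pct is a fraction (e.g., 0.2 for 20%). Returns the kept values
    plus their (time, bus) indices, mimicking an irregularly sampled,
    incomplete PMU record.
    """
    rng = np.random.default_rng(seed)
    total = sample.size                     # window * 39 data points
    n_drop = int(round(missing_pct * total))
    mask = np.ones(total, dtype=bool)
    mask[rng.choice(total, size=n_drop, replace=False)] = False
    mask = mask.reshape(sample.shape)
    times, buses = np.nonzero(mask)
    return sample[mask], times, buses
```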

The observation window of the input data determines the practical value of the model. A long observation window improves the accuracy of the model but shortens the time available for protective measures before the power system fails, while a short observation window makes it more difficult to determine the true performance of the model under incomplete PMU measurements [34]. Figure 7 shows the distribution, over the unstable samples of the dataset proposed by Sarajcev et al., of the time from fault clearance to the moment the power system loses stability. In the subsequent tests, the observation windows are set to 20 sample points (0.33 s) and 25 sample points (0.42 s) after fault clearance, so that predictions are completed before 99% and 95% of the unstable cases lose stability, respectively.

Fig. 7 Histogram and kernel density estimate of the time from fault clearance to loss of stability for the unstable samples

The main hyperparameters used in the follow-up tests are shown in Table 1.

Table 1 The main hyperparameters used in the experiments

In addition to the model hyperparameters mentioned above, following the research of Le Guen et al. [31], the compromise parameter \(\alpha \) in the DILATE loss function is set to 0.5, and the smoothing parameter \(\gamma \) is set to 0.01 to balance the model’s ability to fill in missing values in both temporal and shape domains. Consistent with the original setting of DWA proposed by Liu et al. [23], the relaxation factor \(T\) in DWA is set to 5. The Adam optimizer is used for training.

Experimental analysis

Multi-task learning training process

Figure 8 shows the evolution of the loss function values during training with fixed weights and with DWA, respectively, for an observation window of 0.42 s and a missing data level of 50%. With DWA, the two loss functions descend at a more coordinated rate, which makes it easier to find a common convergence point for the two tasks. This shows that DWA effectively balances the loss functions of the two tasks.

Fig. 8 Changes in loss during training under fixed weights and DWA

The abrupt surge in Fig. 8 stems from a change in the loss function of the missing data completion task. During the experiments, it was found that using the DILATE loss function from the start leads to slow convergence and high computational cost on the incomplete PMU data completion task. Therefore, the model is first trained for 20 epochs with the MSE loss function and then switched to the DILATE loss function for subsequent training. As a training trick compensating for these drawbacks of the DILATE loss function, this only affects the number of training iterations without changing the final performance of the model.

Performance of missing completion subtask

In this paper, the root mean square error (RMSE), mean absolute percentage error (MAPE), and DTW indexes are used to measure the performance of the missing completion submodel. DTW has been discussed in "Missing completion submodel and stability classification submodel", and Eqs. (7) and (8) give the definitions of MAPE and RMSE:

$$\mathrm{MAPE}=\frac{1}{n}\sum_{t=1}^{n}\left|\frac{{x}_{t}-\widehat{{x}_{t}}}{{x}_{t}}\right|\times 100\%$$
(7)
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{t=1}^{n}{\left({x}_{t}-\widehat{{x}_{t}}\right)}^{2}}$$
(8)

where \({x}_{t}\) is the true value, \(\widehat{{x}_{t}}\) is the predicted value, and \(n\) is the length of the sequence. RMSE characterizes the absolute deviation of the predicted values from the true values, and MAPE characterizes the relative deviation; smaller values of both metrics indicate better performance [35].
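Both metrics are straightforward to compute; a short NumPy sketch follows (the factor of 100 reports MAPE in percent, matching how the results are quoted below).

```python
import numpy as np

def rmse(x_true, x_pred):
    """Eq. (8): root mean square error."""
    x_true, x_pred = np.asarray(x_true), np.asarray(x_pred)
    return np.sqrt(np.mean((x_true - x_pred) ** 2))

def mape(x_true, x_pred):
    """Eq. (7): mean absolute percentage error, in percent."""
    x_true, x_pred = np.asarray(x_true), np.asarray(x_pred)
    return np.mean(np.abs((x_true - x_pred) / x_true)) * 100.0
```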

Two types of experiments are designed to verify the effectiveness of the proposed model on the subtasks. Using the Transformer Encoder to construct the FES and using Soft-DTW as the loss function for the missing completion task are core parts of the proposed method; the first experiment checks whether these components outperform traditional alternatives. Table 2 shows the performance obtained by replacing the FES with BiGRU- and LSTM-based structures and replacing the completion loss function with MSE. All modified models perform worse than the proposed model. The proposed combination of Transformer Encoder and Soft-DTW loss not only achieves a low DTW but also outperforms the comparison methods on the "point-to-point" metrics (RMSE and MAPE). This demonstrates the ability of the proposed method to attend to both the shape and the temporal accuracy of the completed sequence.

Table 2 Performance comparison of different structure models for missing completion tasks

The second experiment compares the proposed method with established, representative PMU signal sequence completion models and missing data preprocessing methods used in STVS assessment with incomplete PMU data. Two commonly used methods are selected as benchmarks: (1) LSTM, the main method current models use to fill missing values before STVS assessment on incomplete PMU data [34, 36]; and (2) the extreme gradient boosting tree (XGBoost) [37], which has been proven to perform well in many applications and is widely used in engineering practice; since most existing models for incomplete PMU data processing are tree based, XGBoost serves as the representative tree model. Table 3 and Fig. 9 show the performance of the MCS under different experimental settings and the comparison with the other methods.

Table 3 Performance comparison of different methods for missing completion tasks
Fig. 9 Performance comparison of different methods for missing completion tasks (the error bars are the variances of the performance at different observation windows)

Of the three models, XGBoost performs worst, with the highest errors in the RMSE and DTW indexes. Although LSTM is close to the proposed method in RMSE and DTW, its performance varies more across observation windows. The proposed method performs best on all three indexes: RMSE is below 0.001, MAPE is below 1%, and DTW is below 0.01, with improvements of 20–50% or more over the comparison models across indicators and scenarios. Its performance is almost unaffected by the missing data level and shows the smallest variance across observation windows.

Performance of stability classification subtask

Similar to "Performance of missing completion subtask", the experiments in this section comprise two parts: a validation of the model components and a comparison with established models in the domain. The results in Table 4 show that models using other structures or methods also perform worse than the proposed model on the stability assessment task, and that using the Transformer for the FES improves performance on the stability classification subtask as well. Notably, the modification introduced for the missing completion task (the Soft-DTW loss) also improves the accuracy of the stability classification task. This is one piece of evidence that the two subtasks are related and suggests that optimizing sequence morphology benefits STVS assessment. Together with the results in Table 2, this demonstrates the effectiveness of the proposed method structure.

Table 4 Accuracy comparison of different structure models for stability classification tasks

Five representative STVS assessment models are compared with the model presented in this paper.

  1. Super-resolution perception-based incremental broad learning (SRP-BL), a SOTA STVS assessment model mainly based on broad learning [16].

  2. Robust feature ensemble learning (RFEL), which clusters incomplete data and then constructs many submodels to produce STVS assessment results; RFEL is a popular machine learning STVS assessment model [13].

  3. LSTM-based reinforcement learning (LSTM-RL), which uses double deep Q-learning in conjunction with LSTM to produce STVS assessment results [34].

  4. A recurrent convolutional neural network (RCNN) [38], which automatically captures temporal and spatial features in PMU measurement sequences for STVS assessment.

  5. Mean imputation (MI) [39], which fills in missing values with the mean of each feature. This is the baseline for handling missing values; here it denotes the model in which MI serves as the MCS.

Figure 10 shows the voltage stability assessment accuracy of the different models under different observation windows and missing data levels. The proposed model achieves more than 99% assessment accuracy in all scenarios, the highest among the compared models, and is minimally affected by the observation window and missing data rate.

Fig. 10 Accuracy comparison of different methods for stability classification tasks

To quantify how much classification accuracy drops as the missing data level increases, Zhang et al. [40] proposed the average accuracy drop slope \(\overline{s}\), shown in Eq. (9), as the metric:

$$\overline{s} = \frac{1}{D}\sum_{i=1}^{D} \frac{{A}_{0} - {A}_{i}}{{d}_{i}}$$
(9)

where \(D\) is the number of missing data levels tested, \({d}_{i}\) is the \(i\)th missing data level, \({A}_{i}\) is the model's classification accuracy at the \(i\)th missing data level averaged over the different observation windows, and \({A}_{0}\) is the classification accuracy achieved with complete data. The comparison in Table 5 and Fig. 11 shows that the proposed method has clear advantages over the established models and is minimally affected by changes in the missing data level. The MI filling method is equivalent to adding noise to the STVS assessment task, analogous to the additional constraints other tasks would impose in a multi-task model if the tasks had no implicit relationship. The comparison of the proposed model with MI therefore provides further evidence that the two tasks are indeed associated.
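Computing \(\overline{s}\) is straightforward; the sketch below mirrors Eq. (9) with our own argument names.

```python
import numpy as np

def accuracy_drop_slope(acc_complete, accs, missing_levels):
    """Eq. (9): average accuracy drop slope s-bar.

    acc_complete: accuracy A_0 with complete data; accs: A_i averaged over
    observation windows at each missing level d_i (e.g., 0.2 for 20%).
    """
    accs = np.asarray(accs, dtype=float)
    d = np.asarray(missing_levels, dtype=float)
    return np.mean((acc_complete - accs) / d)
```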

Table 5 Average accuracy drop slope of different methods
Fig. 11 Average accuracy drop slope of different methods

Conclusion

In practical applications, PMU data are often missing for a variety of reasons, which renders traditional STVS assessment methods ineffective. Existing STVS assessment methods for incomplete PMU data suffer from high computational complexity and are strongly influenced by the missing data level. At the same time, complete PMU data support other power system analyses. In this paper, we propose a multi-task learning model that performs the two tasks of stability classification and missing data completion in parallel, significantly reducing the number of model parameters and the training difficulty.

In the proposed model, the soft-attention mechanism is introduced to address the problem of large differences between tasks, and the DWA method is used to balance the weights of the two tasks during training. The order-insensitive Transformer Encoder serves as the shared feature extraction layer; the SCS consists of a fully connected (BP) network, and the MCS consists of a GRU trained with the DILATE loss function, which allows the model to attend to both shape and temporal differences between the forecasted sequence and the ground truth. As an open framework, other models with similar properties can replace the Transformer and GRU structures mentioned above.

The simulation experiments and comparisons with related models support the following conclusions: (1) the proposed model fills missing data with high accuracy while attending to both the shape and time-domain differences from the real sequence, with MAPE below 1% and DTW below 0.01; the performance improvement over the benchmarks is generally 20–50% or more. (2) The proposed model achieves high stability classification accuracy, exceeding 99% across scenarios. (3) The proposed model is the least affected by the observation window and the missing data rate, especially at high missing data rates. These conclusions indicate that the two tasks of missing completion and stability assessment can be effectively combined in parallel by an attention-based multi-task learning model, and the proposed model has good robustness and high application value.