Enhanced damage classification accuracy on a transmission by extending existing datasets with generative adversarial networks

In many areas of drive technology, condition monitoring of transmissions and drive systems is becoming an increasingly important discipline. Condition monitoring systems are often used in combination with machine learning algorithms. A sufficient amount of data per condition class is needed to ensure the training stability and accuracy of the applied algorithms. Especially in early development phases, sufficient data is often not available. In this paper, a Generative Adversarial Network is applied to generate synthetic data and thereby extend existing measurement data sets. Acceleration data in three different condition classes, collected on a gearbox as part of the PHM Data Challenge 2009, is used. In order to highlight relevant features and reduce the number of data points, the data is pre-processed with an appropriate signal analysis technique, in this case the spectral kurtosis. It is shown that, in this use case, the data generated synthetically by a Generative Adversarial Network has the same feature characteristics as the real measured data sets. Augmenting the existing data set also improves the accuracy with which artificial neural networks classify the different system states.

The manuscript is the extended version of a paper presented at the International Conference on Gears 2022 (VDI).
Timo König, timo.koenig@hs-aalen.de

Introduction
In industrial practice, high reliability and efficient operating behavior are required for electromechanical drive systems. Installed transmissions in drive systems have an effect on the overall system in terms of reliability and efficiency. Condition monitoring of transmissions therefore plays an important role in ensuring an efficient and reliable operation. However, machines and drive systems are often not maintained in a condition-oriented or predictive manner. The drive systems are normally maintained as part of a preventive or reactive maintenance strategy. In this context, different condition variables, such as vibration signals, can be used to monitor transmissions and thus enable conclusions to be drawn about the condition of a drive system [1,2].
The acquisition of condition data is a relevant issue for the implementation of condition monitoring systems. In reality, the acquisition of a sufficient data base is the main challenge, as this is very cost-intensive and time-consuming in practice. However, for the implementation of machine learning algorithms to improve condition monitoring systems, a sufficient data base is necessary to ensure a valid classification as well as damage detection. Generative Adversarial Networks (GAN) represent a suitable approach to augment and extend existing real data sets. The features in the measured signals are learned by the algorithms, based on the real and existing data sets, to replicate the signal features. This creates new data with similar properties to the real measured data. Appropriate pre-processing methods are required to highlight relevant features, reduce the number of data points and replicate signal characteristics in the best possible and most realistic way.
In the following use case, a gearbox data set is used from a data challenge of the PHM Society. The PHM Data Challenge from 2009 focuses on, among other things, fault detection of a generic gearbox using accelerometer data [3]. Vibration data at identical operating conditions are extracted from the data set, so that data from different damage classes and system conditions is available [3]. However, only a very small data base is available, which is not adequate for machine learning algorithms to classify measurement data. The focus of the following work is to show an application for GAN-based signal generation to increase the classification accuracy of the used gearbox condition data by extending the existing data set. Furthermore, the performance of data generation shall be improved by the use of input features with a low number of data points and a shallow network architecture [4].

State of the art
Condition monitoring systems enable the monitoring of machines and drive systems as well as their components on the basis of various condition variables, such as vibration, acoustics and temperature [2]. In many applications, for example bearings and transmissions, vibration analysis is used because of the advance warning time it provides before damage occurs [1]. During vibration analysis, it is important to ensure that the performance of the sensor system is matched to the particular application and that a suitable sensor position is selected, in this case on the transmission [5,6]. The noise caused by other parts and components must also be kept as low as possible in order to identify the signal components that are relevant to damage [7]. When condition monitoring and the classification of faulty systems are implemented with machine learning algorithms, not enough data is often available, especially at the beginning of development processes. This means that the algorithms cannot unambiguously identify and classify fault-free or faulty states. In general, there are different possibilities to generate data and extend existing data sets [8]. One possibility is to use GAN to generate synthetic signals similar to real signals [9].
A GAN consists of two competing Artificial Neural Networks (ANN) and enables the generation of synthetic data sets. One ANN is called the generator and the other the discriminator. The generator tries to generate synthetic data, which the discriminator checks against real measured data until real and synthetic data ideally become inseparable. The two networks are coupled mathematically via their loss functions, which are optimized during the training process. The discriminator thus has the task of distinguishing the real data from the synthetically generated data. After several training epochs, the synthetic data becomes so similar to the real data that the respective data sets are difficult to distinguish from each other. The probability distribution of the synthetic data should correspond as closely as possible to that of the real data [10,11].
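As an illustration of how the two loss functions are coupled, the widely used minimax value function of the original GAN formulation can be written as follows. This is the standard textbook form and is given here as an assumption; the text above does not state which formulation is meant. $D(x)$ denotes the discriminator's estimated probability that $x$ is real, $G(z)$ the generator output for a noise vector $z$, and $p_{\text{data}}$ and $p_z$ the data and noise distributions.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator maximizes $V$ by assigning high probabilities to real samples and low probabilities to generated ones, while the generator minimizes the same expression by producing samples that the discriminator rates as real.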
One application of GAN is to augment existing real data sets with synthetic data in order to increase accuracy [12]. Measuring a sufficient data set is often difficult or impossible, especially in early development phases of products [12]. However, in the context of measurement data acquisition, GAN are mostly used to balance unbalanced data sets, because in many use cases significantly less data is available from the damaged system condition than from the new system condition. Balanced data sets can significantly increase the performance and stability of machine learning algorithms [10,11,13].
A problem in fault diagnosis of rotating machines is that an unbalanced condition class distribution negatively influences the diagnosis with classification algorithms [14]. Liu et al. demonstrate a multi-class fault diagnosis method for unbalanced data sets on rotating machines [14]. The method is validated, among others, with a data set collected on transmissions [14]. In another paper, a transfer learning approach based on GAN is used for intelligent fault diagnosis to balance unbalanced data sets [15]. In certain applications, optimized algorithms, so-called Wasserstein Generative Adversarial Networks (WGAN), are used to improve the accuracy and the stability of the algorithms [13]. WGAN allow a more stable training and reduce the risk of mode collapse or vanishing gradients. For WGAN, the Wasserstein loss function is used. The function generates a continuous loss output instead of exponentially decreasing loss values [9]. The WGAN value function is given in Eq. 1 [16]:

$$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_r}\left[D(x)\right] - \mathbb{E}_{\tilde{x} \sim P_G}\left[D(\tilde{x})\right] \qquad (1)$$

Here, $P_r$ is the distribution of the real data $x$, while $P_G$ is the generator model distribution of the generated data $\tilde{x} = G(z)$ based on the noise vector input $z$. $\mathcal{D}$ represents the set of 1-Lipschitz functions. The generator $G$ tries to minimize the difference between the discriminator's output values for the real data $D(x)$ and for the generated data $D(\tilde{x})$, while the discriminator tries to maximize it [16].
In a WGAN, the discriminator does not classify its input as real or synthetic, but assigns it a score; the difference between the scores for real and synthetic inputs is evaluated. It is therefore called a critic instead of a discriminator in a WGAN [9].
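For a stable training, it is necessary that the critic satisfies the Lipschitz constraint [16]. To enforce the constraint, a gradient penalty on the critic output is used. Equation 2 shows the corresponding loss function of the critic with gradient penalty [16]:

$$L = \mathbb{E}_{\tilde{x} \sim P_G}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim P_r}\left[D(x)\right] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right] \qquad (2)$$

The distribution $P_{\hat{x}}$ of the random samples $\hat{x}$ is obtained by sampling uniformly along straight lines between pairs of points drawn from $P_G$ and $P_r$. The penalty coefficient $\lambda$ is used to weight the gradient penalty [16].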
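The critic loss with gradient penalty referred to as Eq. 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the network used in this work: the critic here is a hypothetical one-hidden-layer tanh network whose input gradient can be written analytically, and the names `critic`, `critic_input_gradient` and `wgan_gp_critic_loss` are chosen for this example only.

```python
import numpy as np

def critic(x, W, b, v):
    """Hypothetical tiny critic D(x) = v . tanh(W x + b); x has shape (batch, n)."""
    return np.tanh(x @ W.T + b) @ v

def critic_input_gradient(x, W, b, v):
    """Analytic gradient of D with respect to the input x, shape (batch, n)."""
    h = np.tanh(x @ W.T + b)           # hidden activations, shape (batch, hidden)
    return ((1.0 - h**2) * v) @ W      # chain rule through tanh

def wgan_gp_critic_loss(x_real, x_fake, W, b, v, lam=10.0, rng=None):
    """Return (critic loss with gradient penalty, gradient penalty term)."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample x_hat uniformly on straight lines between real and generated samples.
    eps = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = eps * x_real + (1.0 - eps) * x_fake
    grad_norm = np.linalg.norm(critic_input_gradient(x_hat, W, b, v), axis=1)
    penalty = np.mean((grad_norm - 1.0) ** 2)
    # Wasserstein part: critic score of generated data minus score of real data.
    wasserstein = np.mean(critic(x_fake, W, b, v)) - np.mean(critic(x_real, W, b, v))
    return wasserstein + lam * penalty, penalty

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, hidden, batch = 256, 8, 2       # signal length and batch size as in the text
    W = 0.05 * rng.normal(size=(hidden, n))
    b = rng.normal(size=hidden)
    v = rng.normal(size=hidden)
    x_real = rng.normal(size=(batch, n))   # placeholder data, not measurements
    x_fake = rng.normal(size=(batch, n))   # placeholder generator output
    loss, penalty = wgan_gp_critic_loss(x_real, x_fake, W, b, v, lam=10.0, rng=rng)
    print(f"critic loss: {loss:.4f}, gradient penalty: {penalty:.4f}")
```

In a deep-learning framework the input gradient would be obtained by automatic differentiation; the analytic form is used here only to keep the sketch free of external dependencies.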
Combining a GAN with an autoencoder can also improve data generation [17,18]. GAN are also applied to synthesize and extend captured data sets for other machine elements, such as bearings [19,20,21].
In most cases, the recorded vibration data must be pre-processed to highlight relevant features and allow conclusions about the system condition. In many applications, calculating condition indicators in the time and frequency domain already allows conclusions to be drawn about the condition of transmissions. However, a detailed damage analysis of the transmissions is not possible with a parameter analysis. A frequency analysis is often used to identify damage and defects. There are also different pre-processing methods in the frequency domain, such as envelope demodulation, which highlight periodically repeating frequencies. In combination with machine learning algorithms for classifying different operating and damage states, different pre-processing methods can have a significant impact on the classification accuracy [8,22,23,24,25].
Vibration data pre-processed by envelope demodulation is represented in a large data set with a large number of data points. Other pre-processing methods, such as the spectral kurtosis, are used to minimize the number of data points and reduce the noise. This allows a shorter training time and a more efficient data generation. Antoni et al. show that spectral kurtosis is well suited for the detection and characterization of non-stationary signals in vibration-based condition monitoring, even with high signal noise [26]. The suitability of spectral kurtosis in research for fault diagnosis on rolling bearings is shown, for example, by Vrabie et al. [27]. An exact fault diagnosis is also made possible with the spectral kurtosis on bearings installed in electric motors [28]. However, spectral kurtosis is not only used for fault detection on rolling bearings, it can also be used for fault detection, diagnosis and prognosis on rotating machinery [29,30].
Therefore, the measured data of the transmission is preprocessed using spectral kurtosis, which indicates a non-Gaussian behavior in the frequency domain, and synthesized using a WGAN.

Test setup and data acquisition
In the scope of this paper, acceleration data collected on a generic gearbox (see Fig. 1) of the PHM Data Challenge 2009 is used. The focus of the challenge was on fault detection in order to identify the type, location and extent of damage in the transmission system. Only selected data at identical and constant operating points are used for the paper. Data in new condition (Class 0), from a defect of the input shaft (Class 1) and from a defect of the inner bearing (Class 2) are used. Measurements in the given data set are not labelled, but Al-Atat et al. have categorized the data with a high probability [31]. Several signal processing and feature extraction methods are used there to extract the features in the signals. To identify the different condition classes, the raw time signal, the time synchronous signal and the envelope spectrum, among others, are applied. Furthermore, a segmentation approach is used to determine the speed and load of the transmission. With the described method, it is thus possible to split the data into system and operating states. The classes used are based on this categorization. Each class contains 4 separate measurements of approx. 1.5 s measurement duration. The data used was measured with an accelerometer placed on the housing close to the input shaft of the transmission. The damage cases at hand are to be identified and detected with a classification algorithm. Because so little data is available, a sufficient damage classification with machine learning algorithms is impractical [3,31].
Fig. 1 Inside view of transmission used for data collection (a) and location of input shaft accelerometer (b) [3]
An application of ANN compared to traditional analysis algorithms is preferred due to easier representation of the result probability. In addition, a machine learning approach can use more informative features such as spectral kurtosis for easier data classification than, for example, a simple parameter analysis.

Signal pre-processing and data generation using GAN
The measurement data of the three condition classes considered are limited to 1 s duration per measurement in order to obtain uniform signal lengths. Thus, four signals for each of the three condition classes are used for the analysis. For further processing, the spectral kurtosis is calculated from the existing amplitude-time signals. The spectral kurtosis provides a measure of the impulsivity of a signal as a function of the frequency. The spectral kurtosis can be calculated based on the Short-Time Fourier Transform (STFT), as described in Eq. 3. A simple approach is to shift a short time window along the signal and obtain the spectrum as a function of the time shift. The window that is moved along the recording is described by $w(t)$ and the recorded signal by $x(t)$ [30,32]:

$$S(t, f) = \int_{-\infty}^{\infty} x(\tau)\, w(\tau - t)\, e^{-j 2 \pi f \tau}\, d\tau \qquad (3)$$
Based on the STFT, which tracks how the frequency content changes over time, the spectral kurtosis $K(f)$ can be calculated. The kurtosis for each frequency is obtained by taking the fourth power of the magnitude of $S(t, f)$ at each point in time and averaging over time. The values are normalized by dividing by the square of the mean square value. If 2 is subtracted from the quotient, the result is zero for a Gaussian signal. The relationship is given in Eq. 4, where $\langle \cdot \rangle$ denotes the average over time [30,32]:

$$K(f) = \frac{\left\langle \lvert S(t, f) \rvert^4 \right\rangle}{\left\langle \lvert S(t, f) \rvert^2 \right\rangle^2} - 2 \qquad (4)$$

Examples of the corresponding time signals and spectral kurtosis signals are given in Fig. 2. The spectral kurtosis of the signals is used as an input for the machine learning algorithms. The spectral kurtosis is calculated with MATLAB, whereby a size of 64 data points per window is selected for the underlying spectrogram calculation. The data points of the signals are reduced to 256, corresponding to 256 Hz of the spectral kurtosis with a resolution of 1 data point per Hz.
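The calculation described by Eq. 3 and Eq. 4 can be sketched with NumPy as follows. This is not the MATLAB routine used in this work: the Hann window, the 50 % overlap and the function names `stft_frames` and `spectral_kurtosis` are assumptions made for illustration, and the example signals are synthetic placeholders rather than gearbox measurements. Only the window length of 64 data points is taken from the text.

```python
import numpy as np

def stft_frames(x, window_length=64, hop=32):
    """Short-time Fourier transform of a real signal, shape (frames, bins).

    Assumes a Hann window and a hop of half the window length."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(window_length)
    n_frames = 1 + (len(x) - window_length) // hop
    # Index matrix: one row of sample indices per windowed frame.
    idx = np.arange(window_length)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.fft.rfft(x[idx] * window, axis=1)

def spectral_kurtosis(x, window_length=64, hop=32):
    """K(f) = <|S|^4> / <|S|^2>^2 - 2, averaged over the time frames."""
    power = np.abs(stft_frames(x, window_length, hop)) ** 2
    return np.mean(power**2, axis=0) / np.mean(power, axis=0) ** 2 - 2.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.normal(size=65536)       # stationary Gaussian placeholder signal
    impulsive = noise.copy()
    impulsive[::1000] += 30.0            # add sparse impulses
    k_noise = spectral_kurtosis(noise)
    k_impulsive = spectral_kurtosis(impulsive)
    # Interior bins only: the DC and Nyquist bins are real-valued and behave differently.
    print("mean K, Gaussian noise :", np.mean(k_noise[3:30]))
    print("mean K, with impulses  :", np.mean(k_impulsive[3:30]))
```

For the Gaussian placeholder signal the interior frequency bins give values close to zero, while the impulsive signal gives clearly positive values, which is the property that makes the spectral kurtosis useful as an indicator of impulsive, damage-related content.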
Due to the windowing in the calculation of spectral kurtosis, the number of data points per signal is reduced. This enables a more computationally efficient synthetization and classification of the data. Another advantage is that the noise component in the spectral kurtosis is relatively low. Noise in the signals affects the accuracy of synthetic data generation by GAN, which is presumably due to the fact that noise can only be learned and synthesized by a GAN to a limited extent because of random variation. A WGAN with gradient penalty is used to synthesize vibration data. The used network architecture is shown in Fig. 3.
The WGAN is trained over 10,000 epochs. The batch size is 2. A normally distributed random vector with a length of 64 data points is used as noise input. During an epoch, the generator is trained for 1 cycle and the critic for 5 cycles. This, together with the limitation of the gradient changes per epoch via the gradient penalty, serves to stabilize the training. The gradient penalty is weighted with a coefficient of 10. The optimizer used is Adam for the generator and RMSprop for the critic.
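The training settings listed above can be collected in a small configuration object, together with a sketch of the alternating update order. The numeric values are taken from the text; the class name `WganGpConfig`, the function `update_schedule` and the choice to run the critic cycles before the generator cycle within an epoch are assumptions for illustration. The actual network updates (Adam for the generator, RMSprop for the critic) are not implemented here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WganGpConfig:
    epochs: int = 10_000                   # training epochs
    batch_size: int = 2
    noise_length: int = 64                 # length of the normally distributed noise vector
    critic_steps: int = 5                  # critic cycles per epoch
    generator_steps: int = 1               # generator cycles per epoch
    gradient_penalty_weight: float = 10.0  # coefficient lambda

def update_schedule(config, epochs=None):
    """Yield (epoch, network) pairs in the order the networks would be updated.

    Assumption: the critic cycles run before the generator cycle in each epoch."""
    n_epochs = config.epochs if epochs is None else epochs
    for epoch in range(n_epochs):
        for _ in range(config.critic_steps):
            yield epoch, "critic"
        for _ in range(config.generator_steps):
            yield epoch, "generator"

if __name__ == "__main__":
    config = WganGpConfig()
    for epoch, network in update_schedule(config, epochs=2):
        print(epoch, network)
```

Over the full 10,000 epochs this schedule amounts to 50,000 critic updates and 10,000 generator updates.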
In order to allow an objective evaluation of the synthetic data, the GAN is not trained on all existing real data. One signal per class is retained for the test data set of the classification ANN. This leaves 3 signals per class for the data synthetization. The GAN is trained separately for each class, resulting in 3 separately trained GAN. For each class, 100 synthetic signals are generated to form the training data set for the subsequent classification.

Comparison of real and synthetic signals
In order to examine the suitability of the synthetically generated data for the extension of the existing data set, distribution functions of the averaged synthetic data per class are compared with the distribution function of the averaged real training data of the corresponding class. The distribution functions are shown in Fig. 4 as histograms with approximated distribution curves for a) Class 0, b) Class 1 and c) Class 2.
The distribution functions of the real training data and the synthetically generated data for all classes are sufficiently approximated. For further analyses, the real and synthetic signals are compared. Figure 5a shows the spectral kurtosis of a selected measurement for each class. Figure 5b shows a synthetically generated spectral kurtosis for each class.
The features of the real data are reproduced in the synthetic data. It is clear that the synthetic data is noisier than the real data, which is presumably due to the small amount of real training data. As data synthetization using GAN is not a duplication of existing data, but a new generation of data based on learned distribution functions, the feature expression and the amplitude levels of the synthetic data vary and deviate from those of the real data. This is particularly evident in Fig. 6, which shows a two-parameter analysis of the spectral kurtosis signals of the real training data (circles) in comparison to synthetic data (squares). The Root-Mean-Square and the kurtosis of the spectral kurtosis signals are considered.
Fig. 5 Samples of the spectral kurtosis of the real data (a) and synthetic data (b) by classes
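The two parameters used in this analysis can be computed per signal as sketched below. The function names are chosen for this example, and the kurtosis is taken here as the non-excess form (fourth central moment divided by the squared variance, which gives 3 for a Gaussian sample); which normalization underlies Fig. 6 is not stated, so this is an assumption.

```python
import numpy as np

def rms(signal):
    """Root-Mean-Square of a signal."""
    signal = np.asarray(signal, dtype=float)
    return float(np.sqrt(np.mean(signal**2)))

def kurtosis(signal):
    """Non-excess kurtosis: fourth central moment over squared variance."""
    signal = np.asarray(signal, dtype=float)
    centered = signal - signal.mean()
    return float(np.mean(centered**4) / np.mean(centered**2) ** 2)

def two_parameter_features(signals):
    """Return an array of shape (n_signals, 2) with columns (RMS, kurtosis)."""
    return np.array([[rms(s), kurtosis(s)] for s in signals])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder arrays standing in for spectral kurtosis signals of length 256.
    placeholder_signals = rng.normal(size=(4, 256))
    print(two_parameter_features(placeholder_signals))
```

Plotting the two columns against each other for real and synthetic signals of each class gives a scatter plot of the kind described for Fig. 6.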
It can be seen that the distribution of the parameters of the synthetic data approaches the ones of the real data. For Class 0 and Class 1 there is a scattering of the synthetic data and an overlap of the parameters, but the distributions of the synthetic data are concentrated in the areas of real data. It is evident that the synthetic data of Class 2 approximates the distribution function of the real data (see Fig. 4c), but does not reproduce the maxima and minima of the real data. This is presumably due to the high signal deviation of Class 2 with simultaneously little training data.
Finally, the suitability of the synthetic data for classification is checked. The real data set and a combined data set of real and synthetic data are classified and the results are compared. Figure 7 shows the inputs and the used network architecture.
The classification accuracy of the test data set is used as a measure for evaluating the results. To classify the real data set, it is divided into 3 signals per class for the training data set and 1 signal per class for the test data set (Fig. 7). Due to the small data base, a repeated cross validation is performed. The network is trained 1000 times, each time randomly splitting the data into training and test data. Randomization enables a more valid evaluation of the results. Since there is no specific pattern underlying the partitioning, it is possible that certain training and test combinations occur more frequently. This should be compensated by the high number of repetitions. Subsequently, a combined data set of real and synthetic data is classified. The procedure corresponds to the classification of the real data set. In addition, 27 synthetic signals per class are incorporated into the training data set, which extends it by a factor of 10 (from 3 to 30 signals per class). The synthetic data is randomly selected from the synthetic data sets of the classes so that the potential scatter of the generated synthetic signals is considered in the evaluation. Due to the randomized partitioning of the whole real data set, the real GAN training data underlying the synthetic data is also used as classification test data. However, since the synthetic data exhibits variations of the features and not duplications, this circumstance is tolerated for the evaluation. Because so little data is available, no validation data is used during the training on the real data set. For reasons of comparability, the same procedure is used for the combined data set. Training is carried out over 100 epochs; the batch size is 3. An Adam optimizer is used. Figure 8 shows an example of the model accuracies over the trained epochs of the classification ANN for a) the real data set and b) the combined data set.
Fig. 9 Confusion matrices with test results for real and combined training data set after 1000 trainings
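One repetition of the randomized partitioning described above can be sketched as follows. The function name `split_and_augment`, the array layout and the drawing of synthetic signals without replacement are assumptions for illustration; the arrays in the usage example are placeholders, not the measured or generated data.

```python
import numpy as np

def split_and_augment(real, synthetic, n_synthetic=27, rng=None):
    """Build one random train/test split with optional synthetic augmentation.

    real:      array of shape (classes, 4, n), real signals per class
    synthetic: array of shape (classes, 100, n), generated signals per class
    Returns x_train, y_train, x_test, y_test."""
    rng = np.random.default_rng() if rng is None else rng
    x_train, y_train, x_test, y_test = [], [], [], []
    for label in range(real.shape[0]):
        order = rng.permutation(real.shape[1])
        test_idx, train_idx = order[0], order[1:]   # 1 test signal, 3 training signals
        # Assumption: synthetic signals are drawn without replacement.
        picks = rng.choice(synthetic.shape[1], size=n_synthetic, replace=False)
        x_train.append(real[label, train_idx])
        x_train.append(synthetic[label, picks])
        y_train += [label] * (len(train_idx) + n_synthetic)
        x_test.append(real[label, test_idx][None, :])
        y_test.append(label)
    return (np.concatenate(x_train), np.array(y_train),
            np.concatenate(x_test), np.array(y_test))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(3, 4, 256))         # placeholder for the real signals
    synthetic = rng.normal(size=(3, 100, 256))  # placeholder for the GAN output
    x_train, y_train, x_test, y_test = split_and_augment(real, synthetic, rng=rng)
    print(x_train.shape, x_test.shape)          # 30 training signals per class, 1 test signal
```

Calling the function with `n_synthetic=0` gives the split for the real data set alone; repeating the call 1000 times with a fresh classifier each time corresponds to the repeated cross validation described in the text.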
It is obvious that training for the combined data set is more stable. A stable training accuracy is already achieved after approx. 30 epochs, while the accuracy varies continuously during the training of the real data set. After 1000 trainings, the mean test accuracy for the real data set is 89.6 %. In a comparison with the mean training accuracy of 96.9 %, an overfitting of the ANN can be suspected, presumably due to little available training data. The mean test accuracy of the combined data set of real and synthetic data is 99.9 %, the mean training accuracy is also 99.9 %.
In Fig. 9, the resulting confusion matrices with the test results of the real and combined training data sets for 1000 network trainings are shown.
As can be seen, the classification of Class 0 and Class 1 is not sufficient for the real training data set. The combined training data set results in just 2 misclassifications for Class 0 of the test data in the 1000 trainings of the network.

Conclusion
The results show that, for the considered use case, an extension of the existing data set leads to a significant improvement of the test classification accuracy of about 10.3 percentage points (from 89.6 % to 99.9 %). This can be achieved with a conventional ANN architecture as well as input features with relatively few data points, which improves the network performance and the user friendliness. However, the data base is not sufficient for an in-depth validation of the results, which is why no statement can be made about the transferability to other use cases. The limitation of the proposed approach is that only features contained in the training data can be replicated. Therefore, it is necessary that the measured data contains all relevant features regarding the respective condition state. The described procedure can be optimized in further studies by hyperparameter optimization and optimization of the training duration of the networks. It has been shown that data augmentation using GAN is a suitable approach for optimizing and simplifying the condition monitoring of machine components.
Funding Open Access funding enabled and organized by Projekt DEAL.
Conflict of interest T. König, F. Wagner, M. Kley and M. Liebschner declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.