1 Introduction

Measuring and analyzing gearbox vibration signals is a critical technical challenge in factories. Historically, mechanical systems required operational delays for inspection or maintenance. By analyzing measured vibration signals, experts can detect gear faults and develop appropriate maintenance plans to prevent sudden transmission system failures.

Standards for vibration evaluation, such as ISO 7919 and ISO 10816 for rotating machinery, have been established [1, 2]. Researchers are now focusing on developing machine learning algorithms to enhance the speed and accuracy of gear fault detection. In 2011, Heidari et al. [3] used a convolutional neural network (CNN) to classify four types of gear faults, achieving a training accuracy of 97.68% and a testing accuracy of 97.18%. In 2018, Liu Yang et al. [4] proposed a CNN with weight values, achieving high accuracy in gearbox fault diagnosis. However, their study used only simulated data with white noise and limited feature values, thus missing much signal information. A Principal Component Analysis (PCA) technique was used before training the CNN to obtain more input data, resulting in a test accuracy of 98.5% [5]. Alternatively, researchers such as Long Wen et al. [6] successfully applied a Transfer Learning (TL) model with time-domain signals as input, enabling automatic feature extraction and data classification without pre-processing.

Despite their efficiency, these methods rely on time-domain signals, which are sensitive to noise. This can obscure fault symptoms of the gear and hinder the model’s ability to classify new data accurately. Randall highlighted this issue in [7], particularly under varying load conditions affecting the vibration signals of an operating gearbox.

To address this issue, the time–frequency representation (TFR) is a suitable input for a deep learning (DL) network [8] because it effectively captures non-linear transient signals. The TFR helps identify the type of gear fault based on its mechanical vibration properties [9]. Several methods exist for converting a time-domain signal into its TFR, including the continuous wavelet transform (CWT) [10,11,12,13], the short-time Fourier transform (STFT) [14,15,16,17], and variants of the latter such as the Fourier synchrosqueezing transform (FSST) [18,19,20]. Unlike STFT, which uses fixed-duration windows to analyze local time intervals and extract their frequency components, CWT analyzes a signal at various frequencies with resolutions that adapt to those frequencies. As a result, CWT achieves good time resolution at high frequencies (capturing rapid changes) at the cost of relatively poor frequency resolution, and good frequency resolution at low frequencies (identifying subtle variations) at the cost of less precise time resolution [20]. This makes CWT particularly well suited to non-stationary signals with time-varying frequency components, as it excels at capturing transient or localized spectral information. Additionally, a DL network reduces the size of images after several convolution blocks, which leads to information loss, particularly with FSST images. Thus, CWT is selected as the pre-processing step for the DL network.

Most CNN architectures can be trained to automatically extract image features. However, after many layers, the extracted information can deviate significantly from the original image. To address this, ResNet-50 [6, 21], a specific CNN architecture, uses additional connections to retain information across DL layers, preventing signal characteristics from being lost. By enabling the direct flow of information, the network can learn more efficiently and reduce the time required to reach the desired performance level. This work proposes an optimized approach using the ResNet-50 network with the CWT technique as a pre-processing step, named Fine-tuned Wavelet Resnet-50 (FWR50). Compared to published works, the main contributions of this work are as follows:

  • The input evaluation results show that the CWT has suitable properties for DL classification models.

  • The new processing scheme works accurately regardless of different working conditions.

  • Fine-tuning techniques based on TFR reduce the input data needed.

  • Trained models can be used in real-time automatic diagnostics with high reliability.

The remainder of the paper is structured as follows. Section 2 highlights the signal analysis based on the mechanical features of the gear, enabling the selection of appropriate pre-processing methods. Section 3 proposes a suitable DL network for the gear fault dataset. Numerical results in Sect. 4 demonstrate the effectiveness of the proposed method compared to the state-of-the-art techniques. Finally, Sect. 5 provides further discussions and potential developments in gear fault diagnosis.

2 Background theory

2.1 Continuous Wavelet Transform (CWT)

CWT aims to extract more information from the raw data by addressing the limitations of the time-domain signal representation, which can sometimes obscure important features. CWT provides a more informative representation by using basic wavelet functions, known as mother wavelets, to represent any signal \(x\left(t\right)\). This is generally expressed by the formula [13]:

$$y_{s,\tau } \left( t \right) = \frac{1}{\sqrt s }y_{0} \left( {\frac{t - \tau }{s}} \right)$$
(1)

In this formula, \(\tau \in {\mathbb{R}}\) represents the translation coefficient, \(s\in {\mathbb{R}}^{*}\) with \(s>0\) represents the scale coefficient, and \({y}_{0}\) is the base wavelet function. Changing the parameter \(s\) alters the scale of the wavelet functions, leading to variations in time and frequency resolution in different regions. This property is unique to the wavelet transform and particularly useful for analyzing non-stationary signals and signals with rapidly changing frequencies over time.
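To make Eq. (1) concrete, the following minimal sketch evaluates a scaled and translated wavelet and approximates a single wavelet coefficient by direct summation. The real-valued Morlet-like base wavelet, the constant `W0`, the function names, and the synthetic 50 Hz tone are all assumptions chosen for illustration; the text does not state which mother wavelet was used.

```python
import math

W0 = 5.0  # assumed centre frequency (rad/s) of the illustrative base wavelet


def base_wavelet(t: float) -> float:
    """Illustrative base wavelet y0(t): a cosine under a Gaussian envelope."""
    return math.exp(-0.5 * t * t) * math.cos(W0 * t)


def scaled_wavelet(t: float, s: float, tau: float) -> float:
    """Eq. (1): y_{s,tau}(t) = y0((t - tau) / s) / sqrt(s), with s > 0."""
    if s <= 0:
        raise ValueError("scale s must be positive")
    return base_wavelet((t - tau) / s) / math.sqrt(s)


def cwt_coefficient(signal, dt: float, s: float, tau: float) -> float:
    """Riemann-sum approximation of the inner product of the signal with y_{s,tau}."""
    return sum(x * scaled_wavelet(i * dt, s, tau) for i, x in enumerate(signal)) * dt


# Synthetic 50 Hz tone sampled at 2 kHz for 0.2 s (illustrative values only).
dt = 1.0 / 2000.0
tone = [math.cos(2 * math.pi * 50.0 * i * dt) for i in range(400)]

# The wavelet oscillates at W0 / s rad/s, so the scale matching 50 Hz is W0 / (2*pi*50).
matched_scale = W0 / (2 * math.pi * 50.0)
matched = abs(cwt_coefficient(tone, dt, matched_scale, tau=0.1))
mismatched = abs(cwt_coefficient(tone, dt, matched_scale / 4.0, tau=0.1))
```

In this sketch the coefficient at the matched scale is much larger than the one at a quarter of that scale, which is the behaviour a TFR image encodes as color intensity.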

The color intensity in the TFR of the signal indicates the magnitude of the wavelet coefficients. Figure 1a shows the TFR of an example gear vibration signal. The wavelet coefficients vary in both the time and frequency domains. When viewed along the frequency axis, the TFR becomes a frequency spectrum, as shown in Fig. 1b. Thus, based on the frequency standard for troubleshooting gear faults [22], gear fault features can be extracted from the TFR image.

Fig. 1 Example of TFR: (a) 3D view and (b) TFR-frequency view

Figure 2 illustrates a typical frequency spectrum of a gear fault. Gear faults are generally characterized by the gear mesh frequency (GMF) and its high-order harmonics, often accompanied by surrounding sidebands. While most gear faults exhibit distinct spectra, some may share similar spectral features. Therefore, exploiting information in the time domain for each frequency range is crucial. Using the TFR as input, the DL network can simultaneously examine the relationships between regions of high coefficient magnitude through a CNN. This allows these features to be analyzed and the gear conditions to be classified. The feature values extracted by DL serve a similar function to traditional feature values, such as the mean coefficient, negative peak value, and positive peak value. These feature values form the basis of gear fault classification in the next step.

Fig. 2 Troubleshooting gear fault by frequency spectrum
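As a point of comparison with the learned features, the traditional feature values named above can be computed directly from a vector of coefficients. The sketch below is illustrative only: the function name `peak_features` and the sample values are assumptions and do not come from the measured data.

```python
from statistics import fmean


def peak_features(samples):
    """Return the mean, positive peak, and negative peak of a sequence of values."""
    if not samples:
        raise ValueError("samples must not be empty")
    return {
        "mean": fmean(samples),
        "positive_peak": max(samples),
        "negative_peak": min(samples),
    }


# Arbitrary illustrative values, not measured coefficients.
example = [0.2, -1.5, 0.7, 2.1, -0.3]
features = peak_features(example)
```

Hand-crafted values like these summarize a signal with a few numbers, whereas the CNN operates on the full TFR image.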

2.2 Transfer Learning (TL)—Resnet 50

In the era of Industry 4.0, artificial intelligence (AI) resources have become increasingly abundant. Alongside this development, the availability of high-quality, accurate pre-trained models has grown significantly. Transfer Learning, initially explored experimentally by Lorien Pratt in 1993 [23] and later formalized mathematically in 1998, embodies the concept of transferring knowledge between models in a manner akin to human learning. This method is highly efficient and saves computational resources in machine learning (ML) [24]. It involves using a model developed for one dataset as the initial framework for another model with a different dataset. Consequently, the new model benefits from the optimized network structure derived from the prior training process.

Figure 3 illustrates the difference between traditional ML and TL models. Traditional ML involves training models on a specific task or dataset to make predictions. In contrast, TL leverages knowledge from previously trained models to improve performance on a different task. This process entails fine-tuning a model pre-trained on a source task or dataset so that it can be applied effectively to a target task.

Fig. 3 The difference between traditional ML and TL [25]

In this paper, we use TL with the ResNet-50 neural network [21]. ResNet-50 is a deep CNN architecture widely used in computer vision tasks, renowned for its ability to overcome the vanishing gradient problem and achieve exceptional performance. The ResNet-50 architecture consists of multiple residual blocks, as shown in Fig. 4. The fully connected layer at the end of the network plays a crucial role in mapping the extracted features to the new output classes, and it is this layer that is trained on the new data.

Fig. 4 The ResNet-50 architecture
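The idea behind a residual block can be sketched in a few lines: the block output is the input plus a learned correction, so the input can pass through even when the learned part contributes nothing. The sketch below is a simplification under stated assumptions. Small dense layers stand in for the convolution and normalization layers of the real ResNet-50, and all names and weights are hypothetical.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def linear(v, weights, bias):
    """Dense layer: one output per weight row (stand-in for a convolution)."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def residual_block(x, w1, b1, w2, b2):
    """Compute relu(F(x) + x), where F is linear -> relu -> linear."""
    fx = linear(relu(linear(x, w1, b1)), w2, b2)
    return relu([f + xi for f, xi in zip(fx, x)])  # skip connection adds x back


# With all-zero weights F(x) is zero, so the block passes x through unchanged.
zeros_w = [[0.0] * 3 for _ in range(3)]
zeros_b = [0.0] * 3
x_in = [1.0, 2.0, 3.0]
x_out = residual_block(x_in, zeros_w, zeros_b, zeros_w, zeros_b)
```

The pass-through case illustrates why the additional connections help retain input information across many layers.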

3 Proposed method: Fine-tuned Wavelet Resnet-50 (FWR50)

3.1 Fine-tuning hyperparameters

One significant challenge in the automatic classification of vibration signal data for gearboxes using DL networks is the limited data availability. To effectively implement TL, the hyperparameters of the ResNet-50 network have to be fine-tuned to align with the specific dataset. Table 1 outlines hyperparameter fine-tuning techniques suitable for the characteristics of the investigated image data.

Table 1 TFR characteristics and corresponding DL network optimization techniques

3.2 FWR50 flowchart

The data processing follows the flowchart depicted in Fig. 5 and comprises the following steps:

  1. Measuring: vibration data, which include background noise, are measured on the experimental gearbox housing.

  2. Pre-processing: the measured signals are denoised using the Tunable Q-factor Wavelet Transform (TQWT) [26], yielding pure signals. These pure signals are then split into training, validation, and testing sets.

  3. Feature extraction: CWT is applied to obtain TFR images, which serve as the input for the DL network. The network model is trained using ResNet-50 to automatically extract feature vectors. The network model is trained multiple times with optimal hyperparameters to achieve the highest possible training accuracy.

  4. Classification: the classification model is validated using the testing set. Once verified, the model is employed to diagnose new data.

Fig. 5 FWR50 flowchart

The network model is trained several times with optimal hyperparameters until the highest possible training accuracy is attained. Subsequently, the model's performance is evaluated using the testing dataset. Finally, the trained network is deployed to classify the gear faults in new datasets.
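The split into training, validation, and testing sets in the pre-processing step can be sketched as a reproducible shuffle of sample indices. This is a minimal sketch under assumptions: the function name `split_indices`, the fixed seed, the sample count, and the fractions are hypothetical, and the text does not say how the split was implemented.

```python
import random


def split_indices(n_samples, train_frac, val_frac, seed=0):
    """Shuffle sample indices and cut them into train / validation / test lists."""
    if not (0.0 < train_frac and 0.0 <= val_frac and train_frac + val_frac < 1.0):
        raise ValueError("fractions must be non-negative and sum to less than 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # everything left over is held out for testing
    return train, val, test


# Hypothetical example: 200 signal segments, 10% for training, 10% for validation.
train_idx, val_idx, test_idx = split_indices(n_samples=200, train_frac=0.10, val_frac=0.10)
```

Splitting by index before the CWT step keeps every TFR image tied to exactly one subset, so no segment appears in both training and testing.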

4 Experiment model and numerical results

The experiment was carried out on a gearbox test rig constructed at Hanoi University of Science and Technology to investigate the influence of different gear faults on the acceleration measured at the gearbox housing. The gearbox ratio is 10/20. The pinion was tested in four conditions (healthy, chipped, wear, and pitting) under two load levels, 1.5 Nm and 3 Nm, as shown in Fig. 6.

Fig. 6 Experiment model and four fault types: (a) Normal gear; (b) Chipped gear; (c) Pitting gear; (d) Wear gear

A PID controller was used to control torque and motor speed. A high-resolution encoder was employed to enable continuous measurement and interpolation of the shaft torque. The vibration signal was collected with an Endevco 2228C accelerometer mounted on the gearbox housing in the radial measurement direction, at a sampling frequency of 20 kHz.

The data is divided into two groups: Group 1 consists of vibration signals measured with torque 1.5 Nm, and Group 2 is measured with torque 3 Nm. The authors train the network with Group 1 data and then test the fault classification ability with both groups to prove the effectiveness of the trained model.

According to the experiment setup, the rotation frequency \({f}_{r}\) and the mesh frequency \({f}_{z}\) of the second gear can be calculated as follows:

$$f_{r} = n = \frac{1390}{60} \approx 23.17\,\left( {\mathrm{Hz}} \right)$$
(2)

$$f_{z} = f_{r} z = \frac{1390}{60} \times 28 \approx 648.67\,\left( {\mathrm{Hz}} \right)$$
(3)
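The arithmetic in Eqs. (2) and (3) can be checked with a short script, which also lists GMF harmonics and nearby sidebands. The shaft speed and tooth count are taken from the equations. The helper names are hypothetical, and spacing the sidebands at the rotation frequency is an assumption based on common gear diagnostics practice, not a statement from the text.

```python
SHAFT_SPEED_RPM = 1390.0  # shaft speed used in Eq. (2)
TEETH = 28                # tooth count z used in Eq. (3)

f_r = SHAFT_SPEED_RPM / 60.0  # rotation frequency in Hz
f_z = f_r * TEETH             # gear mesh frequency (GMF) in Hz


def gmf_harmonics(mesh_freq, max_order):
    """Harmonics k * f_z for k = 1 .. max_order."""
    return [k * mesh_freq for k in range(1, max_order + 1)]


def sidebands(center_freq, rot_freq, n_pairs):
    """Sidebands center +/- m * f_r for m = 1 .. n_pairs (assumed spacing)."""
    return [center_freq + m * rot_freq for m in range(-n_pairs, n_pairs + 1) if m != 0]


print(f"f_r = {f_r:.2f} Hz, f_z = {f_z:.2f} Hz")
print("harmonics:", [round(f, 2) for f in gmf_harmonics(f_z, 5)])
print("sidebands:", [round(f, 2) for f in sidebands(f_z, f_r, 2)])
```

Computed without intermediate rounding, the mesh frequency is about 648.67 Hz, and its third and fifth harmonics fall near 1946 Hz and 3243 Hz.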

The vibration signal after CWT is compared with those in the time domain and after STFT to evaluate the effectiveness of different DL network input data. Figure 7 illustrates the signals after CWT pre-processing, where the color regions indicate the magnitude of the signal energy values.

Fig. 7 Four types of signal samples after CWT: (a) Normal gear, (b) Wear gear, (c) Pitting gear, and (d) Chipped gear

Although the transformed images from these methods may differ in appearance, they all exhibit two prominent visual features. Firstly, the images of faulty gears show high-amplitude regions in the high-frequency range, located at the gear mesh frequency (648.67 Hz), the natural frequency of the transmission (around 1950 Hz), and high-order harmonics of the mesh frequency (around 3250 Hz). Secondly, there is a correlation between nearby regions with large amplitude variations. Capturing this correlation is an advantage of image analysis, as it retains information in both the time and frequency domains.

Table 2 presents the training results of the dataset using three types of input data for the trained network.

Table 2 Classification results

According to Table 2, the dataset is divided into training and testing parts in different ratios to evaluate the effectiveness of the network model, and the validation accuracy is calculated by performing five iterations and taking the average result. The accuracy is the ratio of correctly classified instances to the total number of testing instances. The statistics indicate that the DL network performs very well when 80% of the data is used for training and the remaining 20% for testing: the data transformed using CWT and STFT achieved 100% validation accuracy in this case, while the time-domain data achieved a lower accuracy. Overall, for all three types of input data, the more data used for training, the more accurate the classification results. With CWT images as input, a high average validation accuracy of 99.45% requires only 10% of the dataset. Moreover, the model with CWT-transformed input reaches near-100% validation accuracy more quickly. The results in Table 2 demonstrate that the TL network with CWT-preprocessed data can perform well with limited data, which is beneficial in practical scenarios where standard training data are limited.

After training with different ratios of training data, the model trained quickly and achieved high accuracy with the best set of hyperparameters: an image size of 224 × 224 × 3, 30 epochs, 5% dropout, validation every two epochs, a mini-batch size of 32, and an initial learning rate of 0.001. The average training time with 10% of the training data was 5.2 min.
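For reference, the reported hyperparameters can be collected in a single configuration object. The values below are those listed above. The class name, the field names, and the `iterations_per_epoch` helper are assumptions made for illustration and are not tied to any particular training framework.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    image_size: tuple = (224, 224, 3)    # height, width, channels of the TFR image
    max_epochs: int = 30
    dropout_rate: float = 0.05           # 5% dropout
    validation_every_epochs: int = 2
    mini_batch_size: int = 32
    initial_learning_rate: float = 1e-3

    def iterations_per_epoch(self, n_train_images: int) -> int:
        """Mini-batches per epoch, counting a final partial batch (an assumption)."""
        return math.ceil(n_train_images / self.mini_batch_size)


config = TrainingConfig()
```

Keeping the settings in one immutable object makes it easier to rerun training with the same values when the share of training data is varied.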

Figure 8 shows how the accuracy of the network model increases as the amount of training data increases. With CWT input data, the classification accuracy approaches 100%. In contrast, STFT input data, despite enhancing the high-energy components and reducing the low-energy components, does not yield good training results with a small amount of data. However, when the share of training data reaches 10%, the accuracy with STFT data surpasses that with time-domain data.

Fig. 8 Training results of the three methods when training data size varies

Figure 9 illustrates the stability of the training results over the repeated training runs. As the classification results show, the deviations for CWT-based input data are lower than those for the time-domain or STFT inputs. Therefore, CWT provides the most reliable results in terms of accuracy and stability. Training with a large share of the data consistently yields 100% classification accuracy, which may indicate overfitting and make the classification model less reliable. Thus, to evaluate the effectiveness of the trained model under different load conditions, the comparison below uses the classification model trained with 5% of the data from Group 1 with CWT pre-processing.

Fig. 9 Box charts of classification results of the 3 types of input data
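The stability comparison rests on the spread of accuracies over repeated runs, which can be summarized by their mean and standard deviation. The sketch below shows one way to compute such a summary. The function name is hypothetical, and the five accuracies are placeholders chosen for illustration; they are not values from Table 2 or Fig. 9.

```python
from statistics import fmean, stdev


def summarize_runs(accuracies):
    """Mean, sample standard deviation, and range of repeated-run accuracies."""
    if len(accuracies) < 2:
        raise ValueError("at least two runs are needed to estimate the spread")
    return {
        "mean": fmean(accuracies),
        "std": stdev(accuracies),
        "min": min(accuracies),
        "max": max(accuracies),
    }


# Placeholder accuracies for five repeated runs; NOT results from the paper.
placeholder_runs = [0.99, 1.00, 0.98, 1.00, 0.99]
summary = summarize_runs(placeholder_runs)
```

A smaller standard deviation for one input type than for another corresponds to a narrower box in a chart such as Fig. 9.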

Figure 10 depicts CWT images of the data from Group 2. Compared with Fig. 7, the color intensity of the images indicates that the high-load case displays larger vibration amplitudes and more occurrences of substantial amplitude variations. Nevertheless, as in the low-load case, there are particular time points where the wavelet coefficient magnitude is most distinct.

Fig. 10 Four types of signal samples with high loading after CWT: (a) Normal gear, (b) Wear gear, (c) Pitting gear, and (d) Chipped gear

Applying the model to the data from the two groups gives the confusion matrices shown in Fig. 11. The classification accuracy is 97.8% for the low-load case and 96.8% for the high-load case. Misclassifications occur mostly between similar fault types, such as between the wear and normal gear classes. These errors are caused by the similarity between the faults and the limited training data. The confusion matrices indicate that the classification model is highly reliable.

Fig. 11 Confusion matrices for the model trained with 5% of the data: (a) Group 1; (b) Group 2
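Overall accuracy and per-class recall follow directly from a confusion matrix, as the sketch below shows. The matrix is a hypothetical example with rows as true classes and columns as predicted classes. Its counts are invented for illustration and are not the values in Fig. 11; the function names and class order are also assumptions.

```python
CLASSES = ["normal", "chipped", "pitting", "wear"]  # assumed class order


def overall_accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_recall(cm):
    """For each true class, the fraction of its samples predicted correctly."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]


# Hypothetical counts: rows are true classes, columns are predicted classes.
example_cm = [
    [48, 0, 0, 2],   # normal, occasionally predicted as wear
    [0, 50, 0, 0],   # chipped
    [0, 0, 50, 0],   # pitting
    [3, 0, 0, 47],   # wear, occasionally predicted as normal
]
acc = overall_accuracy(example_cm)
recall = dict(zip(CLASSES, per_class_recall(example_cm)))
```

Off-diagonal counts concentrated in one pair of classes, as in this made-up example, are the pattern described above for similar fault types.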

Table 3 provides an overview of various fault diagnosis methods for different research objects, highlighting the complexity of vibration data, preprocessing algorithms, machine learning algorithms, percentage of data used for training, and maximum accuracy achieved. Notably, the proposed method outperforms others in several aspects. First, it applies two load conditions instead of one, enhancing accuracy and applicability in more realistic scenarios. Second, the proposed method uses Time–Frequency Representation and TQWT denoising, improving the quality of input data. Third, by employing TL-Based ResNet50, it leverages pre-trained models, reducing training time and requiring less data to achieve high accuracy. Remarkably, the proposed method achieves a maximum validation accuracy of 100% with only 10% of the training data, demonstrating its efficiency and data-saving advantage. Compared to other studies, this accuracy is either superior or on par with methods that use more data and only single load conditions. This indicates that the proposed method is not only effective but also versatile and suitable for various working conditions. These results affirm the superiority and wide application potential of the proposed method in diagnosing multiple faults with imbalanced datasets.

Table 3 Comparison of the proposed methodology with the published literature

5 Conclusions

This study opens up a research direction on the relationship between the amplitude regions of wavelet coefficient distributions in the time–frequency domain. It is possible to extract features related to the boundary frequency range of the signal, separating them from gear meshing frequency to obtain more explicit images.

This paper presents an automatic classification method for gear faults based on DL networks, with input data pre-processed using CWT. The training model applies TL from the ResNet-50 model, optimized to suit the characteristics of the gearbox dataset. An experimental test with four gear faults evaluated the effectiveness of the proposed technique. The classification accuracy based on time–frequency analysis by CWT proved stable during training.

Moreover, the network model is also applied to different load conditions in the gearbox. Notably, the gear fault classification result is good even with a small amount of training data, with an average accuracy of 99.45% achieved with only 10% of the data participating in the training process. With the proposed approach, the trained model can be integrated into a compact device to perform real-time diagnosis without interrupting the operation of rotating machines. This solution can help reduce dependence on experts and save time and costs in monitoring and maintaining rotating equipment. These findings also help reduce the training data needed for new research subjects, thereby saving costs in industrial fault classification.