A wavelet convolutional capsule network with modified super resolution generative adversarial network for fault diagnosis and classification

Fault diagnosis and classification have gained tremendous attention across modern industry. However, the performance of traditional fault diagnosis techniques depends solely on handcrafted features based on expert knowledge, which are difficult to design in advance and have failed in several applications. Deep learning (DL) has achieved remarkable performance in hierarchical feature extraction and in learning distinctive features of datasets drawn from related distributions. However, a challenge with DL models is that the max-pooling operation usually leads to loss of spatial details during high-level feature extraction. Another concern is the low quality of 2D time-frequency images, which is mostly caused by noise and poor resolution. This paper proposes a modified wavelet convolutional capsule network with a modified enhanced super resolution generative adversarial network plus for fault diagnosis and classification. It uses the continuous wavelet transform to convert raw signals to 2D time-frequency images, applies a super resolution generative adversarial technique to enhance the quality of these images, and finally uses the convolutional capsule network to learn the extracted high-level features without loss of spatial details for the diagnosis and classification of faults. We validated our proposed model on the well-known motor bearing dataset from Case Western Reserve University. The experimental results show that our proposed fault diagnostic model obtains a diagnosis accuracy of 99.84%, outperforming most traditional deep learning models including state-of-the-art methods.


Introduction
Fault diagnosis plays a crucial role in modern industry [1]. Data-driven fault diagnosis, a typical form of fault diagnosis, has attracted much attention in recent years [2]. It uses large volumes of historical data to extract the underlying knowledge about system variables, especially for complicated systems where it is hard to establish explicit models or signal symptoms [3]. The development of smart manufacturing has eased the process of data collection [4], which has brought not only a new perspective to the industry but also some challenges [5]. Therefore, it becomes crucial to discover an effective data-driven method for fault diagnosis [6]. Traditional machine learning (ML) methods utilize predesigned handcrafted features, and these features determine the upper bound of the achievable prediction accuracy [7]. In 2006, Deep Learning (DL) became the center of attraction for most researchers in the field of machine learning [8]. DL can automatically extract features from raw data in a hierarchical representation [9,10]. This advantage enables DL to avoid the errors encountered in handcrafted features designed by domain experts, and DL has consequently shown remarkable prospects for fault diagnosis [11,12]. Traditional machine learning methods perform well under the assumption that the training and testing data are drawn from exactly the same distribution. When the data are drawn from differing distributions, performance drops significantly. Moreover, this assumption does not hold in many applications.
1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China
DL methods encounter the same bottleneck. To address this challenge, a well-known approach called Transfer Learning (TL) is used to perform the learning task on training and testing datasets drawn from related distributions. TL offers better adaptability in extracting high-level features than shallow architectures and has recorded tremendous progress in many applications [13]. However, a challenge associated with TL models is that the max-pooling operation usually leads to loss of spatial details during high-level feature extraction. Zellinger et al. [14] suggested a strategy for unsupervised domain adaptation of neural networks that relies on regularization with a metric-based learning procedure. By decreasing the proposed Central Moment Discrepancy (CMD) metric, the regularization tries to maximize the similarity of domain-specific activation distributions. The authors also stated that CMD addresses the instability that occurs when using integral probability metrics based on polynomial function spaces, and that in the dual space the metric can be interpreted as the sum of the differences between higher-order central moments of the associated activation distributions [14]. Studies have shown that deep learning models combined with the wavelet transform significantly improve classification performance compared to traditional stand-alone deep learning models [15]-[17]. This paper proposes a modified wavelet convolutional capsule network with a modified enhanced super resolution generative adversarial network plus (MWCCN-MESRGAN+) for fault diagnosis and classification.
It uses the continuous wavelet transform to convert raw signals to 2D RGB images (scalograms), applies a super resolution generative adversarial technique to enhance the quality of the scalograms, and finally uses the convolutional capsule network to learn the extracted high-level features without loss of spatial details for the diagnosis and classification of faults. We conducted several experiments to examine the diagnosis performance of our proposed MWCCN-MESRGAN+ model. The results show that MWCCN-MESRGAN+ not only has promising potential in fault diagnosis but also outperforms several TL models and fault diagnosis methods. The remainder of this paper is structured as follows. Related works are presented in Sect. 2. The methodology and a detailed discussion of the proposed framework are presented in Sect. 3. Section 4 presents the experiments conducted. Section 5 presents the evaluation of the proposed framework, and the final section concludes the paper.

Related works
Fault diagnosis reduces the risk of unforeseen breakdown and ensures the safety and reliability of industrial systems. Generally, fault diagnosis methods can be categorized into four domains: signal domain, model domain, hybrid/active domain, and knowledge domain methods [18]. The knowledge domain method is a data-driven technique that is better suited to systems that are too complicated to yield a specific system model or signal symptoms [18]. Machine learning techniques such as artificial neural networks (ANN), expert systems, Support Vector Machines (SVM) and fuzzy logic are well-known analytical tools in data-driven fault diagnosis.
The first data-driven fault diagnosis approach to become widely known was developed in the 1980s using expert systems [19]. This method requires the expert to learn a set of rules from previous experience. The authors in [20] suggested an extended neural network for fault diagnosis of internal combustion engines. The study in [21] investigated the merits of the SVM method for fault diagnosis and reviewed recent progress. The authors in [22] proposed an intelligent framework based on a fuzzy genetic algorithm, which was applied to automatically detect failures in aircraft.
The authors in [23] investigated the possible applications of DL to machine condition monitoring. A recurrent neural network with a dynamic Bayesian modeling approach was suggested by the authors in [24] to detect faults in induction motors. A stacked Auto-Encoder (SAE) for the classification of induction motor faults was investigated by the authors in [25]. A two-layer unsupervised neural network with a sparse filtering technique was proposed in [26] for fault diagnosis. The authors in [27] applied the stacked de-noising auto-encoder (AE) to fault diagnosis in rotary machinery. The authors in [28] suggested an improved deep belief network (DBN) for fault diagnosis in rolling bearings. A deep neural network with auto-encoder (AE) was investigated by the authors in [29] for smart fault diagnosis.
Despite the huge achievements of machine learning methods, they still fail in many applications because they assume that the training and test data follow the same distribution [30]. The authors in [31] presented a study analyzing different methods, while the work of [32] presented some of the applications of TL. A transfer component analysis method was proposed by the authors in [33] to achieve feature transformation, with the aim of discovering common latent features with similar marginal distributions while keeping the intrinsic structure of the input data.
A method for handling the heterogeneous features of both videos and images by transferring the SVM was proposed in [34]. An unsupervised stacked de-noising AE was proposed in [35] to transform the input space and find a consistent latent feature space. Transfer learning by reusing a stacked de-noising AE was investigated by the authors in [36]. The authors in [37] utilized domain adaptation with a neural network for fault diagnosis. It is worth mentioning that rolling bearing fault data are raw time-series signals that need to be converted to 2D time-frequency images by means of the continuous wavelet transform (CWT). These time-frequency images are the absolute values of the CWT coefficients and are characterized by low quality. All the methods mentioned above contributed remarkably to the study of fault diagnosis, but none of them addressed the effect of low-quality 2D time-frequency scalograms on fault diagnosis performance. The method proposed in this research is designed to solve the problem of low-quality 2D time-frequency scalograms, with the aim of obtaining high performance in fault diagnosis.

Methodology
This section will introduce the dataset utilized in this paper, the pre-processing of the fault dataset, the modified enhanced super resolution generative adversarial network plus, followed by the wavelet convolution capsule network, and the time-frequency scalogram construction. Finally, the experimental setup and details conclude this section.

Dataset
In this study, we collected the raw fault signals from the Case Western Reserve University dataset [38], which consists of 10 health conditions with 1024 sample points each. Out of the 10 conditions, only 1 belongs to the normal label, while the other 9 are fault conditions with different damage points, fault diameters and load conditions. The raw signals of the 10 health conditions are sampled at a frequency of 12 kHz. For the purpose of our study, we split the dataset into training, validation and test sets.

Pre-processing of fault dataset
The raw fault signals are collected from the well-known bearing data repository of Case Western Reserve University [38]. Since our proposed model requires the input data to be in image format, we converted the raw signals into time-frequency scalograms (images) using the continuous wavelet transform (CWT). However, the time-frequency scalograms are grayscale with 1 channel, so they are converted to the 3-channel RGB format. Finally, the input image is normalized and reshaped to a dimension of 224 × 224 × 3 to match the input size of our proposed model. The dataset is subdivided into training, validation and test sets.
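The channel conversion and normalization step can be sketched as follows. This is a minimal illustration, assuming that the grayscale scalogram is simply replicated across three channels and min-max normalized to [0, 1]; the text does not state which conversion or normalization was actually used, and the function name `to_normalized_rgb` is hypothetical. Resizing to 224 × 224 is omitted.

```python
def to_normalized_rgb(gray):
    """Min-max normalize a 2D grayscale image to [0, 1] and copy it into 3 channels.

    gray is a list of rows; the result is indexed as [row][col][channel].
    Channel replication and min-max scaling are assumptions for illustration.
    """
    flat = [p for row in gray for p in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant image
    return [[[(p - lo) / span] * 3 for p in row] for row in gray]


# Toy 2 x 2 "scalogram" with 8-bit intensities (made-up values).
toy = [[0, 128], [255, 64]]
rgb = to_normalized_rgb(toy)
```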

Modified enhanced super resolution GAN Plus (MESRGAN+)
In this study, our aim is to enhance the low-quality scalograms (2D images) to super-resolution before passing them through the wavelet convolutional capsule network for bearing fault diagnosis and classification. In this section, we present the proposed modified enhanced super resolution generative adversarial network plus (MESRGAN+) architecture and its structural improvements for balancing perceptual quality and PSNR. First, we briefly highlight the transition from SRGAN to MESRGAN+.

Transition of super resolution by GAN
SRGAN [39] uses the basic blocks of a deep residual network to recover realistic image details, with batch normalization (BN) following each convolutional layer as depicted in Fig. 1. The transition from SRGAN to ESRGAN [40] is based on two modifications: the first is the removal of all BN layers in the generator structure, and the second is the replacement of the original basic block with the Residual-in-Residual Dense Block (RRDB) as shown in Fig. 1. Finally, the transition from ESRGAN [40] to ESRGAN+ [41] is based on introducing an additional level of residual learning at every two layers inside the dense block, as illustrated in Fig. 1, without changing the convolutional structure.

Architecture of the proposed MESRGAN+
In our proposed super resolution architecture, the overall structural configuration of the Residual-in-Residual Dense Block (RRDB) in ESRGAN+ is kept the same, as shown in Fig. 1. We made a few modifications to the generator structure of the ESRGAN+ network by expanding it with two additional convolutional layers and two ReLU activation functions. Directly mapping high-dimensional LR features to HR feature vectors results in high computational complexity, because the dimension of the LR features is normally very large. To address this bottleneck, we use a 1 × 1 convolutional layer as the second layer to reduce the computational cost by shrinking the LR feature dimension while maintaining the same kernel size of 64 after the first layer. To maintain consistency with ESRGAN+ and preserve its performance, we use a 3 × 3 filter size and kernel size of 64 for the third and fourth convolutional layers.
To produce high-resolution images from the scale-adaptive module, the scale factor is increased to 4. The generator network produces $v_{k+1} = G_k(v_k)$. The feature map is extracted before the final activation function in order to calculate the perceptual loss. The pixel-wise loss is measured, and the generated image is forwarded to the discriminator network, which differentiates between the generated image $v_{k+1}$ and the actual image $\hat{v}_{k+1}$. The actual image $\hat{v}_{k+1}$ is also fed to the discriminator network for training, so that the generator network learns to generate new images that look like the real image. When training begins, the generator produces obviously fake data, and the discriminator quickly learns to tell that it is fake. As training progresses, the generator gets closer to producing output that can fool the discriminator. During discriminator training, the discriminator classifies both real images and fake images from the generator, and the discriminator loss penalizes the discriminator for misclassifying a real instance as fake or a fake instance as real. If generator training goes well, the discriminator gets worse at telling the difference between real and fake: it starts to classify fake data as real, and its accuracy decreases. The process is complete only when the discriminator network can no longer tell the difference between real and generated images. As long as the discriminator can still tell the fake image from the original, the generator loss penalizes the generator for failing to fool the discriminator.
We train the generator function $G_k$ to approximate the HR image $\hat{v}_{k+1}$ that the LR input $v_k$ represents. The total loss of the super-resolution network is given in (1).

In (1), $L_{Gen}$ is the generator loss and $L_{Perceptualloss}$ is the perceptual loss. $L_{G}^{Ra}$ is the adversarial loss, i.e. the loss of the relativistic generator, $L_1$ is the content loss and $L_{Dis}^{Ra}$ is the discriminator loss. μ and η are the coefficients that balance the loss terms.
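As a hedged illustration only, one total generator loss consistent with the terms named here is the ESRGAN-style weighted sum below. This form is an assumption based on the listed symbols and on how ESRGAN combines the same three terms; the exact expression in (1) may differ, and the discriminator is assumed to be trained separately with its own relativistic loss.

```latex
L_{Gen} = L_{Perceptualloss} + \mu \, L_{G}^{Ra} + \eta \, L_{1}
```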

Perceptual loss
Perceptual loss improves the texture and visual accuracy of the generated images [42]. The Euclidean distance is used to compare the feature maps of the original image $\hat{v}_{k+1}$ and the generated image $v_{k+1}$. Following [42], the feature map is extracted before the generator network's final activation function, because extracting feature maps after the activation function made the model inconsistent and directly degraded its output.
When recovering HR from LR, it provides close supervision between feature maps. It is well understood that scalograms are not sufficiently high-resolution, and this aspect boosts model re-generation dramatically. The feature map $\alpha_{ij}$ is obtained after the $j$-th convolution and before the max-pooling layer. The loss is measured as the distance between the feature representations of the super-resolution image $G_k(v_k)$ and the real image $\hat{v}_{k+1}$. The formal calculation between feature maps is given in (2).
Rather than encouraging the pixels of the output image $v_{k+1}$ to exactly match the pixels of the target image $\hat{v}_{k+1}$, perceptual loss encourages them to have similar feature representations as computed by the loss network.
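The feature-space comparison described above can be sketched in a few lines. This is a minimal illustration, assuming the feature maps have already been extracted (before activation) from the same layer of a fixed loss network, and assuming the squared Euclidean distance is averaged over all elements; the normalization used in (2) is not stated in the text, and `perceptual_loss` is a hypothetical name.

```python
def perceptual_loss(feat_sr, feat_hr):
    """Mean squared Euclidean distance between two equally shaped feature maps.

    Each argument is a nested list indexed as [channel][row][col] holding
    pre-activation feature values for the generated and the real image.
    """
    total, count = 0.0, 0
    for ch_sr, ch_hr in zip(feat_sr, feat_hr):
        for row_sr, row_hr in zip(ch_sr, ch_hr):
            for a, b in zip(row_sr, row_hr):
                total += (a - b) ** 2
                count += 1
    return total / count


# Toy single-channel 2 x 2 feature maps (made-up values).
feat_generated = [[[1.0, 2.0], [3.0, 4.0]]]
feat_real = [[[1.0, 0.0], [3.0, 2.0]]]
loss_value = perceptual_loss(feat_generated, feat_real)
```

Because the distance is taken between feature maps rather than pixels, two images that differ slightly pixel by pixel can still produce a small loss when their features agree.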

Content loss
By pushing the HR image $v_{k+1}$ to be close to the ground truth $\hat{v}_{k+1}$, the network improves pixel-level accuracy. Equation (3) gives the L1-norm distance between the SR image $G_k(v_k)_{xy}$ and the ground truth $(\hat{v}_{k+1})_{xy}$.
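A minimal sketch of the pixel-wise L1 content loss follows, assuming the distance is averaged over all pixel positions (x, y); whether (3) uses a mean or a sum is not stated in the text, and `content_loss_l1` is a hypothetical name.

```python
def content_loss_l1(sr, hr):
    """Mean absolute difference between an SR image and its ground truth.

    Both images are 2D lists of pixel intensities with the same shape.
    """
    diffs = [
        abs(a - b)
        for row_sr, row_hr in zip(sr, hr)
        for a, b in zip(row_sr, row_hr)
    ]
    return sum(diffs) / len(diffs)


# Toy 2 x 2 images with intensities in [0, 1] (made-up values).
sr_image = [[0.5, 1.0], [0.0, 0.25]]
hr_image = [[0.0, 1.0], [0.5, 0.25]]
l1_value = content_loss_l1(sr_image, hr_image)
```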

Relativistic loss
Most preliminary research focused on the standard GAN. In contrast, we employ a relativistic discriminative loss in our SR network, ensuring that HR images are not stylized or unrealistic. Equation (4) gives the classification of images by the standard GAN discriminator $D_{is}$.
Equation (4) reflects the regular GAN's operation. $D_{is}$ is the discriminator's output, which classifies whether an image is real or artificial. The discriminator's feature vector is represented as $f_d(\cdot)$, and σ stands for the sigmoid function. The adversarial loss comes from a binary classifier that differentiates between real data and data generated by the generative network. We use the relativistic GAN [36] to distinguish between the real $\hat{v}_{k+1}$ and the generated data $G_k(v_k)$, with the distance computed in (5).
RGAN produces images with sharp edges when used in a relativistic model and provides more visual and detail information than a typical GAN, as presented in (6). The first expression in (6) measures how realistic a real image is compared to a fake one, and the second measures how fake a generated image is compared to a real one.
Here, $E(\cdot)$ is the average over all real or fake data in the sample. This slight modification makes the model more efficient than the standard discriminator network. The discriminator network loss is given in (7).
The adversarial loss for the RGAN is given in (8).
The network is trained concurrently on both the actual image $\hat{v}_{k+1}$ and the generated image $G_k(v_k)$ to minimize the losses of the discriminator and generator networks. When the discriminator reaches its optimum, the gradient gets close to zero, which provides little feedback to the generator, thereby slowing or completely stopping learning. At this stage, a standard GAN does not learn how to create more realistic images. In comparison, the RGAN learns from both images, and its gradient depends on both terms, i.e., $\hat{v}_{k+1}$ and $G_k(v_k)$.
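The relativistic comparison can be sketched as below. This is a minimal illustration that assumes the relativistic average formulation popularized by ESRGAN, in which each raw discriminator output $f_d(\cdot)$ is compared against the batch average $E(\cdot)$ of the opposite class before the sigmoid; equations (5) to (8) may differ in detail, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean(xs):
    return sum(xs) / len(xs)


def relativistic_losses(real_logits, fake_logits):
    """Return (discriminator_loss, generator_adversarial_loss) for one batch.

    real_logits / fake_logits are raw discriminator outputs f_d(.) taken
    before the sigmoid, for real and generated images respectively.
    """
    mean_real, mean_fake = mean(real_logits), mean(fake_logits)
    # How realistic each real image is compared to the average fake one.
    d_real = [sigmoid(r - mean_fake) for r in real_logits]
    # How realistic each fake image is compared to the average real one.
    d_fake = [sigmoid(f - mean_real) for f in fake_logits]
    eps = 1e-12  # keeps log() finite when a probability saturates
    d_loss = (-mean([math.log(p + eps) for p in d_real])
              - mean([math.log(1.0 - p + eps) for p in d_fake]))
    # The generator loss swaps the roles, so both real and fake terms
    # contribute gradient to the generator.
    g_loss = (-mean([math.log(1.0 - p + eps) for p in d_real])
              - mean([math.log(p + eps) for p in d_fake]))
    return d_loss, g_loss
```

When the discriminator separates the two classes well, its loss is small and the generator loss is large; when it cannot tell them apart, the two losses coincide.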

Modified wavelet convolutional capsule network (MWCCN)
Presently, the research literature shows that CNNs have generated excellent results in feature extraction for classification problems. Conventional CNNs use scalar neurons to express the likelihood that distinguishing features are present, which severely restricts their performance. Figure 2 illustrates a fine-tuned VGG-19 model with discrete wavelet transform (DWT) pooling used for the feature extraction of scalograms. The traditional VGG-19 has 16 convolution layers and 3 fully connected layers. We made a few modifications using the pre-trained weights: we kept the first block and replaced the max-pooling in the subsequent blocks with discrete wavelet pooling.
Low-level features such as curves, color, edges and texture are extracted from the first block, and high-level properties are extracted as the network goes deeper. The fundamental goal is to replace max-pooling with DWT pooling to reduce the loss of spatial details. In this study, after the feature extraction stage, we discarded the 3 fully connected layers of the pre-trained VGG-19, so that the last block outputs a 14 × 14 × 512 feature map. For dimensionality match, this last block is connected to the primary capsule layer of the capsule network using a 14 × 14× convolutional layer with kernel size of 256 and stride of 3, represented as $f_1$ in Fig. 2.
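The idea of DWT pooling can be sketched on a single feature map. This is a minimal illustration, assuming a one-level orthonormal Haar transform in which only the low-low (approximation) subband is kept; the text does not state which wavelet or which subbands the model uses, and `haar_dwt_pool` is a hypothetical name.

```python
def haar_dwt_pool(fmap):
    """Halve the height and width of a 2D feature map with the Haar LL subband.

    For every non-overlapping 2 x 2 block [[a, b], [c, d]] the orthonormal
    Haar approximation coefficient (a + b + c + d) / 2 is kept, so every
    input value contributes to the output, unlike max-pooling.
    Height and width are assumed to be even.
    """
    height, width = len(fmap), len(fmap[0])
    pooled = []
    for i in range(0, height, 2):
        row = []
        for j in range(0, width, 2):
            a, b = fmap[i][j], fmap[i][j + 1]
            c, d = fmap[i + 1][j], fmap[i + 1][j + 1]
            row.append((a + b + c + d) / 2.0)
        pooled.append(row)
    return pooled


# Toy 4 x 4 feature map (made-up values).
feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled_map = haar_dwt_pool(feature_map)
```

Max-pooling would keep only 6, 8, 14 and 16 from this map and discard the rest, whereas the wavelet approximation summarizes all four values in each block.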
Loss of spatial information is one of the causes of CNNs' low classification efficiency. To properly identify and categorize bearing defects, we propose a Modified Wavelet Convolutional Capsule Network (MWCCN). The capsule network for classification problems was first proposed by the authors in [43]. Unlike standard CNNs, a capsule network consists of capsules, which are vectorial entities. A capsule is a group of neurons arranged as a vector [44]. The instantiation parameters of a capsule represent a specific type of entity, and the length of the capsule indicates the probability that the entity exists. Capsule networks outperform regular CNNs in capturing intrinsic and discriminative features of entities [43]-[45]. To attain excellent fault detection and classification performance, we adapted the original capsule model and fine-tuned the network by including a pre-trained VGG-19 framework as the feature extractor, yielding a deep convolutional capsule framework. Figure 2 shows a concise representation of the proposed wavelet convolutional capsule framework. Low-level attributes of the scalograms such as curves, color, edges, and texture are obtained from the first few convolutional layers of the network, while high-level attributes are obtained as the convolutional layers grow deeper. To improve the classification performance and maintain the integrity of high-level features, we implemented a pooling operation called discrete wavelet transform (DWT) pooling to achieve down-sampling. This minimizes the loss of spatial information while still allowing dimension reduction. After the features are extracted by the CNN framework, they are passed to another convolutional layer to match dimensions before being reshaped into primary capsules. Routing by agreement is applied to map the features between the primary and digit capsules.
As shown in equation (9), the total input to a capsule is the weighted sum of all predictions obtained from the capsules within the capsule network:
$$C_j = \sum_i a_{ij}\, U_{j|i} \tag{9}$$
where $C_j$ is the entire input to capsule $j$, $a_{ij}$ is the coupling coefficient that indicates the degree to which capsule $i$ activates capsule $j$, and $U_{j|i}$ is the prediction for capsule $j$ from capsule $i$, as given in (10):
$$U_{j|i} = W_{ij}\, U_i \tag{10}$$
$W_{ij}$ is the network weight mapping capsule $i$ to $j$, and $U_i$ is the output of capsule $i$. A routing-by-agreement algorithm determines the coupling coefficients between the primary and digit capsules, which sum to 1 [43]. This routing approach takes into account both the length and the instantiation parameters of a capsule when activating another capsule, whereas a conventional CNN relies only on the evaluated probability. In a nutshell, capsule networks are better at abstracting distinct inherent features. The capsule's length is used to assess the probability that an entity exists. To obtain a valid probability, a non-linear activation function called the squashing function is applied, so that capsules with short vectors are assigned low probability and those with long vectors are assigned high probability, while the orientation is retained. The squashing function is given in (11).
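Routing by agreement and the squashing function can be sketched together. This is a minimal illustration following the standard formulation in [43]; the prediction vectors $U_{j|i}$ are assumed to be precomputed, the choice of 3 routing iterations is an assumption, and the names `squash`, `softmax` and `route` are hypothetical.

```python
import math


def squash(v):
    """Shrink vector v so its length lies in [0, 1) while keeping its direction."""
    sq = sum(x * x for x in v)
    if sq == 0.0:
        return [0.0] * len(v)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in v]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(u_hat, iterations=3):
    """Routing by agreement.

    u_hat[i][j] is the prediction vector of lower capsule i for upper capsule j.
    Returns (outputs, coupling), where coupling[i][j] is the coefficient a_ij
    used in the last iteration and each row of coupling sums to 1.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    logits = [[0.0] * n_out for _ in range(n_in)]
    for _ in range(iterations):
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(n_out):
            # Weighted sum of all predictions for capsule j, as in (9).
            c_j = [
                sum(coupling[i][j] * u_hat[i][j][d] for i in range(n_in))
                for d in range(dim)
            ]
            outputs.append(squash(c_j))
        # Increase the logit wherever a prediction agrees with the output.
        for i in range(n_in):
            for j in range(n_out):
                logits[i][j] += sum(p * o for p, o in zip(u_hat[i][j], outputs[j]))
    return outputs, coupling


# Two lower capsules agree on upper capsule 0 and disagree on capsule 1
# (made-up prediction vectors).
predictions = [
    [[1.0, 0.0], [0.0, 0.1]],
    [[1.0, 0.0], [0.1, 0.0]],
]
capsule_outputs, coupling_coeffs = route(predictions)
```

In this toy case both lower capsules send most of their output to upper capsule 0, because their predictions for it agree.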
The high-level entity abstraction is then passed through 2 fully connected layers, and a Softmax classifier is used for the classification task. To train the capsule network successfully for classification, the margin loss [43] is applied. Equation (12) defines the margin loss $L_k$ for class $k$.
In the Softmax layer, $U_k$ denotes the output of the capsule. If the training sample is an instance of class $k$, then $T_k$ is set to 1; otherwise, $T_k$ is set to 0. $m^+$ is set to 0.9 and $m^-$ is set to 0.1, which are the upper and lower bounds on the probability that a training sample is or is not an instance of class $k$, respectively. $n$ is the weight regularizer, which is set to 0.5 by default. The total loss of the capsule network is the summation of all digit capsule losses.
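The margin loss with the stated constants can be sketched as follows, assuming the standard form from [43], in which the capsule length $\|U_k\|$ is compared against $m^+$ for the true class and against $m^-$ for every other class. The function name and the example lengths are hypothetical.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, n=0.5):
    """Total margin loss for one sample, summed over all class capsules.

    lengths[k] is the length of the output vector of class capsule k and
    target is the index of the true class (T_k = 1 only for k == target).
    """
    total = 0.0
    for k, length in enumerate(lengths):
        t_k = 1.0 if k == target else 0.0
        present = t_k * max(0.0, m_plus - length) ** 2
        absent = n * (1.0 - t_k) * max(0.0, length - m_minus) ** 2
        total += present + absent
    return total


# A confident, correct prediction incurs no loss.
confident = margin_loss([0.95, 0.05], target=0)
# An undecided prediction is penalized by both terms.
undecided = margin_loss([0.5, 0.5], target=0)
```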

The Proposed Modified Wavelet Convolutional capsule Network with MESRGAN+ (MWCCN-MESRGAN+)
Our proposed MWCCN-MESRGAN+ integrates a super resolution GAN and a wavelet convolutional capsule network for diagnosing rolling bearing faults, as presented in Fig. 3. The first stage of the proposed architecture is the super resolution part, which handles image enhancement by reconstructing high-resolution images from their low-resolution counterparts. The second stage is the wavelet convolutional capsule network, which extracts and learns high-dimensional feature vectors from the super resolution imagery generated by the first stage for fault diagnosis and classification. We adopted evaluation metrics such as receiver operating characteristic (ROC), accuracy (ACC), sensitivity, specificity (SPE), and precision (PRE). Details of the dataset utilized in the paper are described in Sect. Experiments.

Constructing the time-frequency scalograms
Convolutional neural networks process images as either 2D grayscale or RGB. To this end, the raw fault signals are converted to 2D time-frequency scalograms with abundant fault features using CWT. This paper adopts the cmor3-3 wavelet for CWT due to its excellent time-frequency analysis ability. Each raw signal segment contains 1024 sample points in the time series, and CWT is executed over the same 1024 points.
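The scalogram construction can be sketched with a direct, unoptimized CWT. This is a minimal illustration, assuming that cmor3-3 denotes a complex Morlet wavelet with bandwidth 3 and center frequency 3 and that the scalogram is the absolute value of the CWT coefficients; the scales, the short test signal and the function names are hypothetical, and a production pipeline would normally use an FFT-based library routine instead of this triple loop.

```python
import cmath
import math


def cmor(t, bandwidth=3.0, center=3.0):
    """Complex Morlet wavelet with the given bandwidth and center frequency."""
    envelope = math.exp(-t * t / bandwidth) / math.sqrt(math.pi * bandwidth)
    return cmath.exp(2j * math.pi * center * t) * envelope


def cwt_scalogram(signal, scales):
    """Return |CWT| as a list of rows: one row per scale, one column per sample."""
    n = len(signal)
    rows = []
    for s in scales:
        row = []
        for b in range(n):
            acc = 0j
            for k in range(n):
                # Correlate the signal with the scaled, shifted wavelet.
                acc += signal[k] * cmor((k - b) / s).conjugate()
            row.append(abs(acc) / math.sqrt(s))
        rows.append(row)
    return rows


# Toy signal: a sinusoid at 0.125 cycles per sample, 64 samples long.
# A wavelet at scale s responds to center / s cycles per sample, so the
# matching scale here is 3 / 0.125 = 24.
signal = [math.cos(2 * math.pi * 0.125 * k) for k in range(64)]
scales = [12, 24, 48]
scalogram = cwt_scalogram(signal, scales)
```

The row for scale 24 carries most of the energy, which is the behavior that lets a scalogram separate fault signatures by frequency content over time.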

Experimental Setup and Details
We built our proposed model using the well-known CWRU rolling bearing dataset [38]. The vibration signal of the motor bearing is collected using an acceleration sensor. The bearing dataset is structured as normal data, drive end fault data with sampling frequencies of 12 kHz and 48 kHz, and fan end fault data with a sampling frequency of 12 kHz. The drive end fault data at 12 kHz are used in this paper. These data were measured under different load conditions (0, 1, 2, and 3 hp) with single-point faults introduced by electro-discharge machining. This paper adopts the damage point of 6 o'clock and the load condition of 0 hp to build the data split for our proposed model. Since there are 10 conditions with 1024 signal sampling points, we randomly select 400 samples to build the training set and 200 samples to build the testing set for each condition, which amounts to a total of 4000 training samples and 2000 testing samples. Additionally, a validation set is constructed from 25% of the training set, which leaves 3000 samples for training and 1000 samples for validation. Since the proposed model requires the input data to be a 2D image with three channels (RGB), we adopted CWT to convert the raw signal into time-frequency scalograms (2D images) and reshaped the input dimension to 224 × 224 × 3. The Adam optimizer with a learning rate of 0.0002, a batch size of 16 and 30 epochs is used during the training of our model, and the number of class labels in the dataset is 10. We implemented our proposed model in the Keras framework with TensorFlow as the backend on an NVIDIA GTX 1080 GPU workstation.
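The split sizes above can be reproduced with a short sketch. This is a minimal illustration, assuming each condition provides exactly 600 candidate segments and that the 25% validation share is taken per condition; the text does not say whether the validation split is stratified, and `build_split` and the sample indices are hypothetical.

```python
import random


def build_split(n_conditions=10, n_train=400, n_test=200, val_fraction=0.25, seed=0):
    """Randomly assign (label, sample_id) pairs to train, validation and test sets."""
    rng = random.Random(seed)
    split = {"train": [], "val": [], "test": []}
    for label in range(n_conditions):
        # Hypothetical indices of the segments available for this condition.
        ids = list(range(n_train + n_test))
        rng.shuffle(ids)
        train_ids, test_ids = ids[:n_train], ids[n_train:]
        n_val = int(n_train * val_fraction)
        split["val"] += [(label, i) for i in train_ids[:n_val]]
        split["train"] += [(label, i) for i in train_ids[n_val:]]
        split["test"] += [(label, i) for i in test_ids]
    return split


split = build_split()
sizes = {name: len(items) for name, items in split.items()}
```

With the default arguments the sizes come out as 3000 training, 1000 validation and 2000 testing samples, matching the totals stated above.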

Evaluation
This section presents the results of our study in two parts. The first part evaluates our proposed modified super resolution GAN plus network in terms of peak signal to noise ratio (PSNR) and Perceptual Index (PI). The second part evaluates the fault diagnosis performance of our proposed wavelet convolutional capsule network (MWCCN). We made a few comparisons with state-of-the-art models, including some selected pre-trained models. The metrics adopted to evaluate the diagnosis performance of our proposed MWCCN are as follows: accuracy (ACC), precision (PRE), sensitivity (SEN), specificity (SPE), area under curve (AUC) and F1-Score.
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{15}$$
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{16}$$
where $TP$, $FP$, $TN$, and $FN$ indicate the outcomes of true positive, false positive, true negative, and false negative, respectively. The results of the CWT conversion of raw signals to scalograms are presented in Fig. 4 (load condition = 0 hp). According to Fig. 4, the fault types are difficult to distinguish from the raw time domain signals. However, CWT makes it easy to distinguish the differences between the time-frequency scalograms of the individual fault categories, which makes them suitable for our proposed MESRGAN+ to extract abundant features for image regeneration. Furthermore, Fig. 5 shows the performance of our proposed super-resolution model, MESRGAN+, and other state-of-the-art models, namely SRGAN, ESRGAN and ESRGAN+. For fair comparison, we employed their source codes available online with the same CWRU dataset. One of the aims of this research is to check the PSNR and perceptual index (PI) of the super-resolution models, and our model gives the best results in both cases. MESRGAN+ produces more appropriate images, removes artifacts, and improves the clarity of the extracted features by extending the convolutional layers of the generator structure of the residual block and removing batch normalization.
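The sensitivity and specificity definitions in (15) and (16), together with the other count-based metrics named in this section, can be computed as in the sketch below. Accuracy, precision and F1-Score use their standard definitions, which are assumed here because their equations are not shown; the confusion counts are made-up numbers, not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute count-based metrics for one class treated as the positive class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for a single fault class in a one-vs-rest setting.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
```

For a 10-class problem such as this one, the counts would be taken per class from the confusion matrix and the resulting metrics averaged.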

Results
The accuracy and loss curves of our proposed MWCCN-MESRGAN+ are presented in Figs. 6 and 7. We observed that after 10 epochs, the training and validation curves started converging smoothly and stably, indicating the efficacy of the model. The diagnostic accuracy of the model is obtained by using the trained model to classify the testing dataset. To ensure the stability of our proposed model, the experiment was repeated 7 times under the same conditions, and the accuracy, sensitivity, specificity, and precision are presented in Table 2.
Figure caption: The PI score is shown on the left while the PSNR score is shown on the right

Model tweaking
For the purpose of understanding the influence of the super resolution approach and discrete wavelet pooling on image [...] Table 3. The four different models are trained on the first category of sub-data class, called data-class A, and the number of training epochs is set to 30. All training parameters are kept the same, the training time for each model to complete one epoch is recorded, and the results are given in Table 4. We observed that the time required for the proposed MWCCN-MESRGAN+ model to train for one epoch is 9 s, 16 s, and 20 s longer than that of the other models, as shown in Fig. 8. This clearly indicates that more parameters mean more training time. More to the point, MWCCN- [...]
Most deep learning models utilize max-pooling layers to perform down-sampling and reduce the dimensionality of the feature vector, but this process usually leads to loss of spatial features even though the computational time is reduced. A capsule network has a longer training time due to the size of its feature dimension; however, it maintains the integrity of its high-level features without loss of spatial information, which gives capsule-based networks a competitive advantage over traditional convolutional neural networks in learning specific features from the dataset, resulting in higher accuracy and faster convergence as a trade-off between computational time and accuracy. It is worth mentioning that our proposed MWCCN-MESRGAN+ model achieves nearly 100% diagnosis accuracy without overfitting on a small dataset.

Investigation of dataset generalization
To further investigate the generalization of our proposed MWCCN-MESRGAN+ model, as presented in Table 4, we constructed several categories of sub-data class under different load conditions as follows:
Data-class B: 400 samples are selected randomly under the load condition of 1 hp for each health condition as the training set, and 200 samples are selected randomly under the same load condition for the testing set. For the 10 conditions, [...]
Data-class F: 150 samples are selected randomly under the load conditions of 0, 1, and 2 hp for each health condition as the training set, and 80 samples are selected randomly under the same load conditions for the testing set. For the 10 conditions, the total training and testing samples are 4500 and 2400, respectively.
Data-class G: 100 samples are selected randomly under the load conditions of 0, 1, 2, and 3 hp for each health [...]
Table 4 Investigating dataset generalization. NC stands for normal condition, LC for load condition, and HC for health condition.
Additionally, a 25% split of the training set of each data-class category above is used to build the validation dataset during training. The proposed model is trained on data-classes B to H. The training parameters and conditions are kept the same as for the previous data-class A. The proposed MWCCN-MESRGAN+ is trained 7 times repeatedly, and the experimental outcome is presented in Table 5. The 6 o'clock damage point is used for all data-class categories. The results indicate that our proposed model achieves nearly 100% across the different data-class categories.
Notably, the proposed MWCCN-MESRGAN+ achieves excellent diagnosis accuracy on fault data under different load conditions. Data-class H is specially constructed to test the proposed MWCCN-MESRGAN+ on an unseen load condition: the model is trained on the load conditions (0, 1 hp), and a new load condition of 2 hp is introduced. The model nevertheless still classified the bearing faults, which indicates that our proposed MWCCN-MESRGAN+ model is robust and generalizes very well.
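The sample counts quoted for data-class F can be checked with a short sketch of the random selection procedure. It assumes that the stated 150 training and 80 testing samples are drawn per load condition and per health condition, which is consistent with the reported totals of 4500 and 2400. The pool size of 300 candidate samples per group is purely hypothetical, as is the choice to count the 25% validation split on the pooled training set.

```python
import numpy as np

rng = np.random.default_rng(0)

loads_hp = (0, 1, 2)        # load conditions of data-class F
n_conditions = 10           # health conditions
n_train, n_test = 150, 80   # samples per (load, health condition) group
pool_size = 300             # hypothetical number of available samples per group

train_idx, test_idx = {}, {}
for load in loads_hp:
    for cond in range(n_conditions):
        order = rng.permutation(pool_size)                 # random selection without replacement
        train_idx[(load, cond)] = order[:n_train]
        test_idx[(load, cond)] = order[n_train:n_train + n_test]  # disjoint from training

total_train = sum(len(v) for v in train_idx.values())
total_test = sum(len(v) for v in test_idx.values())
n_validation = int(0.25 * total_train)                     # 25% of the training set

print(total_train, total_test, n_validation)
```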

Comparison with other MWCCN (Inception V3, EfficientNet) models
Our proposed MWCCN in this paper is constructed using VGG-19 as the base model. To examine the efficacy of MWCCN (VGG-19), we compared it with other MWCCN models built on Inception V3 and EfficientNet. We fine-tuned only the last layer of the pre-trained models by replacing the neurons in the Softmax layer to match the number of class labels in our dataset, which in our case is 10. All versions of the MWCCN model, including our proposed one, are trained on the first sub-dataset category, data-class A. The computational time each of the three models needs to train for one epoch is presented in Fig. 9. Figure 10 shows the training and validation accuracy curves of each MWCCN model. We observed that MWCCN (Inception V3), MWCCN (EfficientNet), and MWCCN (VGG-19) require approximately the same training time to complete one epoch, but MWCCN (VGG-19) converges faster. To further ascertain the performance of our proposed MWCCN (VGG-19), we conducted another experiment on all categories of the sub-dataset (data-classes A to H).
Fig. 9 The average training time for the different MWCCN models. On one hand, MWCCN is modified using Inception V3 as the base model; on the other hand, EfficientNet is used as the base model. We compared these two modifications with our proposed model, which uses VGG-19 as the base model.
Fig. 10 The accuracy curves of the various MWCCN models on the data-class A category.

Comparison with other fault diagnosis methods
In the course of our work, we reviewed the literature on fault diagnosis based on artificial intelligence, and the proposed MWCCN-MESRGAN+ model is compared with other fault diagnosis methods. As seen in Table 7, some studies reported only a few performance indicators to support their claims. Our proposed model achieves better performance, with more indicators reported, than the other fault diagnosis methods cited from the literature.
The authors in [44] proposed a wavelet-based multi-fractal feature learning approach with an SVM classifier. The authors in [45] adopted ELM to diagnose bearing faults. An interesting work based on SVM and EEMD for fault diagnosis was proposed in [46]. The authors in [47] integrated wavelets into auto-encoder learning and combined the framework with ELM to diagnose faults. The authors in [48] addressed the problem of hierarchical recognition in machines by using a DBN to diagnose faults. Quite an interesting work combining a DNN with SSAE was suggested by the authors in [49] for fault diagnosis. The authors in [26] utilized 2D and CNN for the diagnosis of bearing faults. Random forest learning with a CNN was suggested by the authors in [50] to diagnose faults. We observed that our proposed method is superior to conventional machine learning approaches and other fault diagnosis models. Compared with the methods suggested by the authors in [44] and [45], our proposed MWCCN-MESRGAN+ model shows an improvement by a large margin. It also outperforms the remaining methods, including the deep learning models. Compared with the method in [50] under data-class F, the accuracy of our method is higher by 0.5%, showing that our model generalizes very well.
Table 7 Comparison results of our proposed model with other fault diagnosis methods

Method
The experimental results show that our proposed architecture outperforms other fault diagnosis models and some selected deep learning models. For fairness, we selected some deep learning models and implemented them from their source code using the same data-class A. As presented in Table 8, MobileNet V2 achieves the lowest sensitivity score of 93.6%, whereas ResNet50 obtains the lowest specificity score of 92.5%. Our proposed model outperforms all the pre-trained models, with a sensitivity score of 99.78% and a specificity score of 99.69%. Another important metric is the receiver operating characteristic (ROC) curve, which summarizes overall performance in terms of the area under the curve (AUC), as shown in Fig. 11. Figure 11 shows that our model achieves a satisfactory balance between sensitivity and specificity by minimizing the false positive rate and maximizing the true positive rate.
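To make the link between the ROC curve and the AUC explicit, the following sketch builds a ROC curve by sweeping the decision threshold over the predicted scores and integrates it with the trapezoidal rule. It treats a single class against the rest; for the 10-class problem considered here, one such curve per class would be an assumed (one-vs-rest) way of applying it. The function names and the toy scores are illustrative and do not come from the paper.

```python
import numpy as np

def roc_curve_binary(y_true, scores):
    """False and true positive rates obtained by sweeping the score threshold."""
    order = np.argsort(-scores, kind="stable")       # highest score first
    y, s = y_true[order], scores[order]
    # Keep one point per distinct score so that ties are handled together.
    last_of_each = np.r_[np.where(np.diff(s) != 0)[0], len(s) - 1]
    tps = np.cumsum(y)[last_of_each]
    fps = np.cumsum(1 - y)[last_of_each]
    tpr = np.concatenate(([0.0], tps / tps[-1]))
    fpr = np.concatenate(([0.0], fps / fps[-1]))
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoidal integration of TPR over FPR."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example: 1 marks the class of interest, 0 marks every other class.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr = roc_curve_binary(y_true, scores)
auc_value = area_under_curve(fpr, tpr)
print(auc_value)
```

An AUC close to 1 corresponds to a curve that hugs the top-left corner, that is, a high true positive rate at a low false positive rate.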
Moreover, the accuracy of the pre-trained models is reported alongside that of our proposed model in Table 8. Our model performs better than the pre-trained models, achieving an accuracy of 99.92%. We also show that our proposed model converges smoothly and steadily with a moderate reduction in loss.

Conclusion
In this work, we proposed a GAN-based super-resolution framework with a wavelet convolutional capsule network for fault diagnosis and classification, with the aim of handling the low-quality characteristics of scalograms obtained from raw fault signals by continuous wavelet transform (CWT). To summarize the contributions of this work, CWT is used to convert the raw fault signals into 2D time-frequency scalograms compatible with 2D CNN operation. The 2D time-frequency images (scalograms) are reshaped to 224×224×3 RGB format as input to the GAN-based super-resolution network for quality enhancement. The reconstructed high-resolution images become the new input images to the wavelet convolutional capsule network for fault diagnosis and classification. The proposed MWCCN-MESRGAN+ model is validated on the well-known rolling bearing fault dataset from CWRU, achieving 99% accuracy, 99% specificity, 99% sensitivity, and 100% precision, which outperforms the other fault diagnosis methods, including some deep learning models. We carried out an ablation study to evaluate the generalization performance of our proposed model, and the results demonstrate, by a clear margin, that our proposed MWCCN-MESRGAN+ model achieves excellent fault diagnosis performance.
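The first stage of the pipeline summarized above, converting a 1D signal into a 224×224×3 scalogram image, can be sketched as follows. This is only a minimal illustration under several assumptions not stated in the paper: a complex Morlet wavelet with centre frequency parameter 6, a direct convolution-based CWT, nearest-neighbour resizing, and a grayscale scalogram replicated across three channels rather than a colour map. The input is a synthetic tone, not CWRU data.

```python
import numpy as np

def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet; w0 = 6 is an assumed, commonly used value."""
    return np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-0.5 * t ** 2)

def cwt_scalogram(signal, scales):
    """Return |CWT| with shape (len(scales), len(signal)), unit sampling period."""
    n = len(signal)
    t = np.arange(n) - n // 2                     # time axis centred on zero
    rows = []
    for s in scales:
        psi = morlet(t / s) / np.sqrt(s)          # scaled, energy-normalised wavelet
        kernel = np.conj(psi[::-1])               # correlation written as a convolution
        rows.append(np.abs(np.convolve(signal, kernel, mode="same")))
    return np.array(rows)

def to_rgb_224(scalogram, size=224):
    """Min-max normalise, resize by nearest neighbour, replicate to 3 channels."""
    lo, hi = scalogram.min(), scalogram.max()
    norm = (scalogram - lo) / (hi - lo + 1e-12)
    r = np.linspace(0, norm.shape[0] - 1, size).round().astype(int)
    c = np.linspace(0, norm.shape[1] - 1, size).round().astype(int)
    return np.stack([norm[np.ix_(r, c)]] * 3, axis=-1)

n_samples = 512
signal = np.sin(2 * np.pi * 0.05 * np.arange(n_samples))   # synthetic 0.05 cycles/sample tone
scales = np.arange(4, 64)
scalogram = cwt_scalogram(signal, scales)
image = to_rgb_224(scalogram)
peak_scale = scales[np.argmax(scalogram[:, n_samples // 2])]
print(scalogram.shape, image.shape, peak_scale)
```

For this Morlet parameterization, a tone of frequency f is expected to respond most strongly near scale 6 / (2πf), which gives a quick sanity check on the scalogram before it is passed to the super-resolution stage.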
Even though this study achieves a high level of accuracy in classifying faults, it has certain drawbacks. The suggested strategy, which has high classification accuracy on the CWRU dataset, might not obtain the same classification accuracy on an imbalanced fault dataset, because the class labels may consist of imbalanced data samples owing to differences in labeling. To address this challenge, the AI model should be trained using imbalanced class label data acquired at various times and locations. Aside from the diversity of the data, the allocation of the data classes is also significant, since disparity in class sizes has a detrimental impact on training. The accuracy of classification is also affected by the data augmentation strategies employed to correct the imbalance. In light of this constraint, our future work will employ a wider range of imbalanced class data and possibly optimization strategies that are more efficient in terms of computation time.