Deep convolutional tree-inspired network: a decision-tree-structured neural network for hierarchical fault diagnosis of bearings

The fault diagnosis of bearings is crucial in ensuring the reliability of rotating machinery. Deep neural networks have opened new opportunities for condition monitoring owing to their powerful ability to learn fault-related knowledge. However, the poor interpretability and low generalization ability of fault diagnosis models still hinder their practical application. To address this issue, this paper explores a decision-tree-structured neural network, that is, the deep convolutional tree-inspired network (DCTN), for the hierarchical fault diagnosis of bearings. The proposed model integrates the advantages of the convolutional neural network (CNN) and decision tree methods by rebuilding the output decision layer of the CNN according to the hierarchical structural characteristics of the decision tree, rather than simply combining the two models. The proposed DCTN model has unique advantages in 1) the hierarchical structure that can support more accurate and comprehensive fault diagnosis, 2) the better interpretability of the model output with hierarchical decision making, and 3) more powerful generalization capabilities for samples across fault severities. A multiclass fault diagnosis case and a cross-severity fault diagnosis case are executed on a multicondition aeronautical bearing test rig. The experimental results demonstrate the feasibility and superiority of the proposed method.


Introduction
Bearings are widely used in rotating machinery, and their condition monitoring is crucial to the precision and reliability of mechanical systems [1]. In recent years, with the development of sensor technology and information science, research on data-driven mechanical fault diagnosis has advanced rapidly. In particular, the emergence of deep learning (DL) technology has allowed fault diagnosis based on deep neural networks (DNNs) to redefine the state-of-the-art performance [2,3]. Different from the top-down physics-based fault diagnosis methods, data-driven methods can resist the effect of environmental noise and equipment complexity, and update the model in a timely manner as the monitoring data increase to obtain more accurate fault recognition performance. Compared with traditional machine learning methods, DNNs have more powerful feature extraction capabilities and rely less on prior knowledge or handcrafted features. As bottom-up condition monitoring approaches, DL-based fault diagnosis methods enjoy an evident advantage in saving resources and have attracted extensive attention due to their better effectiveness and robustness.
Researchers tackle data-driven fault diagnosis mainly for fault type discrimination and fault severity identification, where the former locates the fault within the components, and the latter analyzes the degradation level related to the physical size of defects [4,5]. The DL-based fault diagnosis approaches tend to learn the signal patterns associated with a particular fault type or severity by DNN methods, e.g., autoencoder [6], generative adversarial net [7], recurrent neural network [8], deep belief network [9], and convolutional neural network (CNN) [10]. Generally, vibration signals, acoustic emission signals, electrical signals, temperatures, pressures, and sound signals can be used for condition monitoring and fault diagnosis of bearings. Among them, vibration signals are widely used in the fault diagnosis of bearings [11]. According to the different structures of the networks, the vibration data are usually transformed into different forms for analysis. For example, the time domain data or frequency domain data of the signal are generally used as the input of the recurrent neural network, which is more suitable for the analysis of sequence data; the CNN model is suitable for processing high-dimensional data and performs well in analyzing the time-frequency distribution (TFD) of the signal, such as the continuous wavelet transform (CWT) distribution [12], the short-time Fourier transform distribution [13], and the Chirplet transform [14,15]. Most researchers regard the diagnosis problem as a single-level multiclassification problem and attempt to achieve higher classification accuracies by designing a more complex network. Most researchers also ignore the logic of fault diagnosis and focus only on the judgment of fault type, but not on the fault severity associated with the magnitude of the failure [16].
A few approaches consider fault types and severities together by transforming the diagnosis task into a common classification task, where each combination of fault mode and fault severity is treated as a specific label. For example, Zhao et al. [17] converted the raw signals of bearings into grey images and directly adopted the CNN model for fault diagnosis. In the experiment, three fault types and three fault severities were considered at most, and the fault diagnosis task was transformed into a common 10-category classification task for processing. Analogously, Minhas et al. [18] treated the recognition of different fault types and severities of bearings as a single-level multiclassification problem solved by complementary ensemble empirical mode decomposition and a support vector machine (SVM) classifier. Pan et al. [19] proposed a novel symplectic geometry matrix machine method for the classification of bearings with different fault types and severities. Wen et al. [20] adopted a hierarchical CNN model for the classification of bearings with different fault types and severities. These explorations are beneficial to obtaining precise fault recognition no matter which data form or network structure is used. However, considering the actual application scenarios faced by fault diagnosis, the following issues are often overlooked: a) The logic of fault diagnosis is usually ignored. Although these models meet the demand of joint diagnosis of fault types and severities, they greatly increase the complexity of the classification task, because the number of labels grows with the product of the numbers of fault types and severities, and they require more labeled data as well as complex models for the expected diagnosis performance. This increase in classification complexity brings greater challenges to the classification models.
Moreover, mixing different fault attributes for identification is not in line with the logical cognition of experts. b) Most of the works only consider the input and output but not the justifiable prediction of the intermediate process. Although DNN-based fault diagnosis networks have strong knowledge learning capabilities, explaining the discriminative details of intermediate decisions is still difficult, which makes the diagnosis results provided by DNN models hard to trust. The interpretability of the model has always been recognized as a topic worth exploring and is of great importance for fault diagnosis tasks [21,22]. c) Data-driven fault diagnosis methods often have weak generalization ability for new categories but a strong dependence on the label information of the samples [23,24]. However, the fault severities of collected samples will not be exhaustive in real cases; consequently, the diagnostic model often fails in dealing with test samples belonging to unseen fault severity classes. This limitation of cross-severity generalization is a large obstacle to applying existing models in engineering.
The existing research has made several useful explorations on these problems. To deal with the first problem, the hierarchical diagnosis strategy that identifies the type and severity has been adopted in several works. However, the existing methods still stop at acquiring a hierarchical output by using a hierarchical Softmax classifier [25,26] rather than making a hierarchical decision in the intermediate stage of the diagnosis models. In our view, a more effective approach would be to apply hierarchical decision rules to deal with this problem. The fault type should be judged first, and then the fault severity judgment can be made based on the prior knowledge provided by the fault type judgment. Regarding the second problem, the interpretability of DL models has been continuously explored in various fields and is still a difficult challenge worth numerous studies. An accepted way to improve the interpretability of the model is the estimation of the decision uncertainty [27,28]. For the third problem, the cross-severity identification of bearing faults is a new subject to the best of our knowledge. The hierarchical diagnosis framework and hierarchical decision rules can help the cross-severity generalization of fault type diagnosis for samples with unseen fault severities. The effective usage of fault type knowledge in the training data will greatly support the decision making of test samples across severities, but this is still difficult to achieve with existing methods.
To address the mentioned issues, a novel deep convolutional tree-inspired network (DCTN) is explored for the hierarchical fault diagnosis of bearings. The proposed model integrates the advantages of CNN and decision tree methods by designing an output decision layer similar to the decision tree structure, which is then used to fine-tune the weights of the convolutional layers through back-propagation. The signals are converted into TFDs by the CWT method because the time-domain information is conducive to fault severity analysis, and the frequency domain information is more sensitive to different fault types [29,30]. The CNN-based architecture is used as a pre-training model. During pre-training, a Softmax classifier is connected to the backbone CNN. The powerful feature learning ability of CNN can ensure the effectiveness of fault-related feature extraction from the samples. After that, the Softmax classification layer and the fully-connected layer are replaced by the tree-structured decision layer to execute the hierarchical fault diagnosis decision in sequence. The hierarchical diagnosis helps reduce the task complexity of diagnosis and improve the accuracy of fault diagnosis. More importantly, the tree-inspired network designed in this paper enables the model to diagnose across fault severities of bearings. The multiclass fault diagnosis case and cross-severity fault diagnosis case are executed on a multicondition aero-engine bearing test rig to verify the feasibility and superiority of the proposed method. Compared with the state-of-the-art works in fault diagnosis, the proposed DCTN-based fault diagnosis approach has unique advantages in the following aspects:
a) The tree-structured hierarchy is helpful for more accurate and comprehensive diagnosis decision-making. The judgment of the fault type provides a priori knowledge for fault severity diagnosis, which is beneficial to improving diagnosis accuracy. The multilevel decision information with the progressive determination of fault type and fault severity is more in line with maintenance cognition in engineering.
b) The interpretability of the model output is explored through the hierarchy structure. The decision tree model is one of the most interpretable machine learning methods, but its weak knowledge learning ability has always limited its application. The proposed DCTN model effectively integrates decision tree with the CNN model and can provide the hierarchy and uncertainties of decision-making to improve the interpretability of the model output.
c) The proposed model has more powerful generalization capabilities for samples with unseen fault severity categories. The final diagnosis decision of fault attributes is made from multiple views by the embedded hierarchical tree-structured decision layers. The trained model can be generalized well for fault type diagnosis even if the sample has an unseen fault severity category. To our best knowledge, this paper carries out the first exploration in cross-severity fault diagnosis of bearings.
The rest of this paper is organized as follows. Section 2 presents the methodologies of the proposed DCTN model. Section 3 presents the DCTN-based fault diagnosis approach of bearings. Section 4 shows the two case studies for the fault diagnosis of aeronautical bearings and discusses in detail the superiority of the DCTN method over other related works. Finally, Section 5 presents the conclusions and conceivable future works.
2 Proposed deep convolutional tree-inspired network
Figure 1 shows the schematic view of the proposed DCTN model, which mainly consists of three convolutional layers, one pre-trained fully-connected layer, and one tree-structured decision layer. The proposed DCTN model takes CNN as the backbone network to learn the fault-related features in the TFDs of bearing signals. A tree-structured decision layer is then embedded into the pre-trained CNN to replace the fully-connected layer for fine-tuning. The weights of the fully-connected layer in the CNN are used to derive the node attribute representation of the tree-structured decision layer. Different types of nodes are given corresponding weights according to the logical relationship in the tree-structured decision layer. By defining a new supervision loss function and then fine-tuning the model weights, the leaf nodes and seed nodes can support the effectiveness of hierarchical fault diagnosis. The leaf nodes can also acquire an effective fault classification ability in the proposed tree-structured decision layer to deal with cross-severity fault diagnosis tasks, which lays the foundation for the better generalization of the model.

Learning the neural backbone of the seed nodes
Fault type discrimination and fault severity identification have an inherent logical relationship, to which the decision making of the model should also correspond. The decision tree model has good interpretability because the decision of each node has a clear physical meaning. However, its weak knowledge learning ability has greatly limited its application for a long time [31-33]. Although decision trees are interpretable and simple to use, they are prone to overfitting, can be less robust to small changes in training data, and generally rely on heuristic algorithms. In recent years, researchers have made several attempts to improve the performance of the decision tree model, such as the random forest model [34], the deep forest model [35], and several deep decision tree models [36,37]. These methods optimize the structure of the decision tree to improve the classification performance and retain interpretability. However, the performances of these methods are still not as good as the state-of-the-art DNN models even on small data sets.
As a feedforward neural network, CNN has shown strong feature extraction ability in the processing of sequence, image, and video data. Generally, the basic structure of a CNN includes two kinds of layers. The first is the feature extraction layer, in which the input of each neuron is connected to the local receptive field of the previous layer, and the local features are extracted. Once a local feature is extracted, the positional relationship between it and other features is also determined. The second is the feature mapping layer, in which each layer is connected with multiple feature maps, and each feature map corresponds to a classification plane. With the deepening of exploration, several studies have found that embedding the decision tree model into DNN models can better guarantee the recognition accuracy of the model [38-40]. Hence, the tree-structured decision layer is embedded into the CNN model to ensure the performance of hierarchical fault diagnosis and reinforce the interpretability of the model. The constructed DCTN model takes the convolutional layers as the backbone. Inspired by the decision tree, a tree-structured decision layer is embedded in the backbone model to provide hierarchical diagnosis logic and is endowed with the ability to diagnose across fault severities. To better understand the distribution of embedded features extracted by the convolutional layers, the weights of the final fully-connected layer are used to induce the hierarchy and embedded decision rules.
The constructed CNN neural backbone has three convolutional layers and one fully-connected layer. The convolutional layers learn the fault-related features from the TFDs of the signals by model training. The fully-connected layer reduces the dimension of the features learned by the convolutional layers to match the dimension of the seed nodes in the tree-structured decision layer. The layer details of the backbone CNN are shown in Table 1, where N represents the number of samples, R×R represents the dimension of the TFD, and K is usually equal to the number of sample categories. The weights of the model are initialized by pre-training to guarantee the accuracy of the entire model. During pre-training, a Softmax classifier is connected to the backbone CNN. After that, the Softmax classification layer and the fully-connected layer are replaced by the tree-structured decision layer for fine-tuning. The Adam optimization algorithm is used to optimize the model and speed up the convergence. The cross-entropy function is used as the loss function in the model pre-training.
The cross-entropy of the prediction loss H(p, q) is

$$H(p, q) = -\sum_{x} q(x) \log p(x), \tag{1}$$

where x is the input features of the Softmax classifier, p(·) is the probability distribution of the predicted output, and q(·) is the probability distribution of the actual output.
Inspired by the decision tree model, the sequential decision of the hierarchical fault diagnosis is carried out by the tree-structured decision layer. The structure of the decision sequence is designed according to the underlying logic of the fault diagnosis task. Two decision hierarchies are generated, yielding the two main fault diagnosis levels, namely, fault type and fault severity. Figure 2 shows that the first hierarchy determines the fault type of the input sample to acquire the corresponding superclass attribute, whereas the second hierarchy determines the fault severity of the input sample to acquire the corresponding subclass attribute. The parentheses indicate the calculation method of the prediction probability at each node. The input-output relationship between leaf nodes and root nodes is not as simple as that in the neural network.

According to the structure of the decision tree, the decision nodes in the first decision level are defined as the leaf nodes, and the decision nodes in the second decision level are defined as the seed nodes. The number of leaf nodes is consistent with the number of sample fault types, whereas the number of seed nodes is consistent with the number of sample categories K. The seed nodes correspond to the weights of the fully-connected layer of the pre-trained backbone CNN. The weight matrix $W \in \mathbb{R}^{K \times L}$ of the fully-connected layer is obtained by the back-propagation training with the Softmax classifier. According to the network structure setting in Table 1, the dimension value L of the sample feature after the convolutional layers should be 64.
In the pre-training stage, distance $d_j$ between the feature and each classification hyperplane is

$$d_j = \langle w_j, x \rangle = w_j^{\mathrm{T}} x, \quad j = 1, 2, \ldots, K, \tag{2}$$

where $w_j \in \mathbb{R}^{L}$ is the jth weight vector in weight matrix W of the fully-connected layer, and $x \in \mathbb{R}^{L}$ refers to the input feature vector of the tree-structured decision layer, which is also the output of the final convolutional layer. The multiclassification model sets a hyperplane for each category and divides the feature space through multiple hyperplanes. One region corresponds to one category. Distance $d_j$ refers to the similarity between the test sample and the labeled samples with respect to the jth hyperplane. The prediction scores corresponding to the K categories are acquired by the fully-connected layer. Then, they are mapped to prediction probabilities by the Softmax classifier as

$$y_j = \frac{\exp(d_j)}{\sum_{k=1}^{K} \exp(d_k)}, \tag{3}$$

where $y_j$ refers to the predicted probability for the jth category and satisfies

$$\sum_{j=1}^{K} y_j = 1. \tag{4}$$

The weights of the seed nodes are directly adopted from the pre-trained fully-connected layer. In this way, the identification ability of the subclass is equivalent to that of the pre-trained CNN model, which guarantees the performance of the model. The corresponding weight $sw_i$ (i = 1, 2, or 3) of the ith leaf node is obtained by adding the weights of its seed nodes. Taking the structure shown in Fig. 2 as an example, the following relationship can be obtained:

$$sw_i = \sum_{j \in S_i} w_j, \quad i = 1, 2, 3, \tag{5}$$

where $S_i$ denotes the set of seed nodes attached to the ith leaf node.

Fig. 2 Weight propagation of tree-structured decision layer.
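The weight propagation from seed nodes to leaf nodes described above can be illustrated with a minimal NumPy sketch. It assumes that the score of each category is the inner product between its weight vector and the feature vector, and that the seven categories are grouped under three fault types as N-1 | I-2, I-3, I-4 | R-5, R-6, R-7. The array `SEED_TO_LEAF`, the helper names, and the random stand-in weights are hypothetical and only serve the illustration.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Assumed grouping: leaf (fault type) index of each seed (category) node.
SEED_TO_LEAF = np.array([0, 1, 1, 1, 2, 2, 2])

def leaf_weights(W, seed_to_leaf, n_leaves):
    """Sum the seed-node weight vectors that belong to each leaf node."""
    sw = np.zeros((n_leaves, W.shape[1]))
    np.add.at(sw, seed_to_leaf, W)  # sw[leaf] += W[seed] for every seed
    return sw

rng = np.random.default_rng(0)
K, L = 7, 64                      # categories and feature dimension
W = rng.normal(size=(K, L))       # stand-in for pre-trained FC weights
x = rng.normal(size=L)            # stand-in for one feature vector

d = W @ x                         # scores of the K seed nodes
y = softmax(d)                    # subclass probabilities
sw = leaf_weights(W, SEED_TO_LEAF, n_leaves=3)
p_super = softmax(sw @ x)         # superclass probabilities
```

Because the leaf weights are plain sums, the leaf scores equal the sums of the seed scores before fine-tuning, which is why the pre-trained accuracy is inherited at this stage.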

Fine-tuning with decision loss
Fine-tuning is of great importance to improve the overall performance of the model. If only the weights obtained by the pre-training model are used and the weights are determined according to Eq. (5), the accuracy of the overall model will be the same as that of the pre-training, and the advantages of the hierarchical structure cannot be exploited. The Softmax function at each decision node is used to generate the corresponding decision probabilities because probabilities are naturally more interpretable. Taking the structure in Fig. 2 as an example, the relation in Eq. (6) is met after the fine-tuning, in which $w'_j$ refers to the weight vector of the jth node of the tree-structured decision layer after fine-tuning. The classification of the superclass can provide prior knowledge for the identification of the subclass through the fine-tuning. The DCTN model designed in this paper fine-tunes the weights of the backbone model and the tree-structured decision layer, and performs Softmax classification on all nodes to make the final fault diagnosis decision according to the path probabilities. In detail, the probability of correct prediction for seed nodes is defined as P(subclass). The probability of correct prediction of leaf nodes is defined as P(superclass). Hence, the probability of overall correct prediction of the model is calculated as

$$P(\text{class}) = P(\text{path}) = P(\text{superclass}) \cdot P(\text{subclass}), \tag{7}$$

where P(path) refers to the path probabilities of the tree-structured decision layer, and P(class) refers to the overall prediction. The final class prediction is defined as

$$q = \arg\max_{k} P_k(\text{path}), \tag{8}$$

where $P_k(\text{path})$ is the predicted probability of the tree-structured decision layer for the kth category. The loss function of the tree-structured decision layer is calculated based on the cross-entropy function as

$$\mathcal{L} = -\sum_{k=1}^{K} y_k \log \hat{y}_k - \omega \sum_{k=1}^{K} y_k^{\mathrm{tree}} \log P_k(\text{path}), \tag{9}$$

where $y_k$ refers to the true labels of the pre-trained network, $\hat{y}_k$ refers to the predicted probabilities of the pre-trained network, and $y_k^{\mathrm{tree}}$ refers to the true labels of the tree-structured decision layer.
The first term on the right side of the equation represents the same cross-entropy function as the pre-trained network to maintain the effectiveness of the original training. The second term is the newly added loss term, which corresponds to all predictions related to the tree decision path probabilities. Hyperparameter ω is the weight balancing the pre-trained decision and the tree-structured decision.
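The path-probability decision and the weighted loss described above can be sketched in plain Python. This is a minimal illustration under two assumptions that the text does not state explicitly: the subclass probability is obtained by a Softmax over the sibling seed nodes of the same leaf node, and the tree-related loss term is a cross-entropy on the path probability of the true category. The function names and the example scores are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax for a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def path_probabilities(super_scores, sub_scores, seed_to_leaf):
    """P(path to seed k) = P(superclass of k) * P(subclass k among its siblings)."""
    p_super = softmax(super_scores)
    probs = [0.0] * len(sub_scores)
    for leaf in sorted(set(seed_to_leaf)):
        members = [k for k, s in enumerate(seed_to_leaf) if s == leaf]
        p_sub = softmax([sub_scores[k] for k in members])
        for k, p in zip(members, p_sub):
            probs[k] = p_super[leaf] * p
    return probs

def cross_entropy(probs, true_index):
    """Cross-entropy against a one-hot label."""
    return -math.log(probs[true_index])

def dctn_loss(flat_probs, path_probs, true_index, omega):
    """Pre-trained cross-entropy plus omega times the tree-path cross-entropy."""
    return cross_entropy(flat_probs, true_index) + omega * cross_entropy(path_probs, true_index)

seed_to_leaf = [0, 1, 1, 1, 2, 2, 2]          # assumed N | I, I, I | R, R, R grouping
super_scores = [0.2, 2.0, 0.1]                # illustrative leaf-node scores
sub_scores = [0.0, 1.5, 0.3, 0.2, 0.1, 0.4, 0.0]  # illustrative seed-node scores

path = path_probabilities(super_scores, sub_scores, seed_to_leaf)
prediction = max(range(len(path)), key=path.__getitem__)
flat = softmax(sub_scores)
loss = dctn_loss(flat, path, true_index=1, omega=10.0)
```

With sibling-wise normalization the path probabilities sum to one, so the largest path probability can be read directly as the confidence of the final decision.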

Proposed DCTN-based fault diagnosis approach of bearings
To analyze the ability of multi-fault identification and the capacity of generalization for the superclasses of the proposed DCTN-based fault diagnosis approach, two fault diagnosis tasks, namely, multiclass fault diagnosis and cross-severity fault diagnosis of aeronautical bearings, are carried out. Different fault diagnosis networks are designed corresponding to different tasks, which are described in detail in Subsections 3.2 and 3.3.

Aeronautical bearing test rig
The bearing dataset is collected on the Politecnico di Torino rolling bearing test rig [41], which is shown in Fig. 3. The aeronautical bearing at the B1 position can be easily removed from its support, which allows the response of the system to be checked when bearings with different fault types and severities are installed. The bearings of the spindle are grease lubricated, and their temperature is limited by a liquid refrigeration circuit. Table 2 shows the serial number of the damaged bearing, the fault locations and severities, the subclass labels, and the superclass labels. Among them, the superclass is determined according to the location of the fault, marked in the table as N for no fault, I for the inner ring fault, and R for the roller fault. Rockwell tools are used to produce localized defects on the elements, resulting in conical indentations on the inner ring or individual rollers. The set fault size is shown in Table 2.
Given such a small fault severity, observing the specific fault size is difficult with the existing signal processing methods. The XYZ triaxial sensor of the B1 bearing is installed at the A1 position. According to experience, the signals collected in the Y direction can better reflect the health status of the bearing.
The operating condition details of the aeronautical bearing are shown in Table 3. Data are collected from aeronautical bearings operating under 17 loads and speeds with a sampling frequency of 51.2 kHz and a sampling time of 10 s. The raw signals of the seven bearings are shown in Fig. 4. The raw signal is unstable and contains some noise in the real case. In our view, this instability of the raw signals makes directly using the signals in the time domain for fault diagnosis difficult. To obtain a better expression of fault features, TFD is used as the basic data form for analysis.
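As a small worked example of the data volume implied by these settings, the sketch below cuts one 10 s record sampled at 51.2 kHz into non-overlapping windows. The 0.1 s window length and the non-overlapping scheme are assumptions made for illustration (0.1 s is the signal length used for the TFDs later in the paper); the random array only stands in for a measured vibration signal.

```python
import numpy as np

FS = 51_200        # sampling frequency in Hz (from the text)
DURATION_S = 10    # recording length in seconds (from the text)
SEGMENT_S = 0.1    # assumed window length in seconds

def segment_signal(signal, fs=FS, segment_s=SEGMENT_S):
    """Cut a 1-D signal into non-overlapping windows of segment_s seconds."""
    seg_len = int(round(fs * segment_s))
    n_segments = len(signal) // seg_len
    return signal[: n_segments * seg_len].reshape(n_segments, seg_len)

# Stand-in for one raw record of one bearing under one operating condition.
signal = np.random.default_rng(0).normal(size=FS * DURATION_S)
segments = segment_signal(signal)
print(segments.shape)  # (100, 5120)
```

Under these assumptions, each record yields 100 segments of 5120 points, which matches the 100 signal samples per fault category used in the case studies.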

Time-frequency analysis based on CWT
The CWT method can effectively represent the local characteristics of signals in the time-frequency domain and has proven to be quite suitable for fault analysis of mechanical equipment [13,42]. For signal s(t) in time t and the specified mother wavelet ψ(t), the CWT function is defined as follows:

$$\mathrm{CWT}_s(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} s(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) \mathrm{d}t,$$

where a > 0 is the stretch factor, b ≥ 0 is the shift factor, and * refers to the conjugate operation. The complex Morlet wavelet with bandwidth frequency and center frequency of 3 is selected as the mother wavelet, the scale sequence length is set as 256, and a and b are set as 2 and 5 empirically, respectively. The TFD of 0.1 s length signals collected in the Y direction from the bearing under condition C17 is shown in Fig. 5. The occurrence of faults is usually accompanied by an increase of the impact components in the time-frequency domain. Furthermore, the impact component distribution varies when the fault location of the bearing is different. The impacts of the I-2 and R-5 bearings are located in different frequency bands. For bearings with the same fault position, the vibration component in the signal increases gradually with the increase of the fault size, that is, from 150 to 450 µm. However, distinguishing bearings N-1, I-4, and R-7 only by observation of the TFDs is difficult. Therefore, more effective models are needed to distinguish different faulty bearings. In general, the CWT-TFD can effectively characterize the difference of bearing signals in different health states, which lays a good foundation for fault diagnosis.
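The CWT with a complex Morlet mother wavelet can be sketched directly from its definition with NumPy. This is a simplified illustration rather than the implementation used in the paper: scales are expressed in samples, the integral is replaced by a discrete sum evaluated through convolution, the wavelet is truncated to a finite support, and only 32 scales are used to keep the example fast (the paper uses a scale sequence of length 256). The parameters `fb`, `fc`, and `support` and the 6.4 kHz test tone are illustrative assumptions.

```python
import numpy as np

def complex_morlet(tau, fb=3.0, fc=3.0):
    """Complex Morlet mother wavelet with bandwidth fb and center frequency fc."""
    return np.exp(2j * np.pi * fc * tau) * np.exp(-tau**2 / fb) / np.sqrt(np.pi * fb)

def cwt_magnitude(signal, scales, fb=3.0, fc=3.0, support=5.0):
    """Return |CWT(a, b)| with one row per scale a and one column per shift b."""
    signal = np.asarray(signal, dtype=float)
    tfd = np.empty((len(scales), len(signal)))
    for i, a in enumerate(scales):
        half = int(np.ceil(support * a))          # truncate the dilated wavelet
        k = np.arange(-half, half + 1)
        psi = complex_morlet(k / a, fb, fc)
        # Convolving with the reversed conjugate equals correlating with conj(psi).
        kernel = np.conj(psi)[::-1] / np.sqrt(a)
        tfd[i] = np.abs(np.convolve(signal, kernel, mode="same"))
    return tfd

fs = 51_200                                       # sampling frequency in Hz
t = np.arange(1024) / fs
signal = np.sin(2 * np.pi * 6400 * t)             # synthetic 6.4 kHz test tone

scales = np.arange(8, 72, 2)                      # 32 scales, in samples
tfd = cwt_magnitude(signal, scales)
freqs = 3.0 * fs / scales                         # frequency matched to each scale
dominant = freqs[np.argmax(tfd.mean(axis=1))]     # frequency with the largest mean magnitude
```

Each row of `tfd` corresponds to the frequency `fc * fs / a`, so scales below about `2 * fc` samples would exceed the Nyquist limit and are excluded from the example.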

DCTN-based hierarchical multiclass fault diagnosis network
The designed DCTN-based hierarchical multiclass fault diagnosis network is shown in Fig. 6. The DCTN-based hierarchical multiclass fault diagnosis approach differs from the existing methods in the strategy for hierarchical decision making. In decision making, the different fault    Table 2.

DCTN-based cross-severity fault diagnosis network
The cross-severity fault diagnosis approach is a new attempt for fault diagnosis models. The designed DCTN-based cross-severity fault diagnosis network is shown in Fig. 7. The cross-severity samples corresponding to the same superclass labels are identified for decision making. For example, bearing I-3 has the same superclass as I-2 and I-4, which corresponds to the fault located on the inner ring, and bearing R-6 has the same superclass as R-5 and R-7, which corresponds to the fault located on the roller. The model shown in Fig. 6 is trained by the samples from bearings N-1, I-2, I-4, R-5, and R-7, and tested by the samples from bearings I-3 and R-6. The purpose of prediction is to successfully identify the superclass labels of the test bearings, whose node weights are initialized by seed nodes that do not contain the corresponding subclass labels of the test bearings.
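The cross-severity protocol described above amounts to holding out one severity per fault type and scoring only the superclass. A minimal sketch of this data split is given below; the helper names and the placeholder feature values are hypothetical, and the superclass is read from the letter prefix of the bearing identifier.

```python
BEARINGS = ["N-1", "I-2", "I-3", "I-4", "R-5", "R-6", "R-7"]
HELD_OUT = {"I-3", "R-6"}   # severities never seen during training

def superclass(bearing_id):
    """Fault type label: N (no fault), I (inner ring), or R (roller)."""
    return bearing_id.split("-")[0]

def cross_severity_split(samples):
    """Split (bearing_id, features) pairs into seen-severity and unseen-severity sets."""
    train = [(b, x) for b, x in samples if b not in HELD_OUT]
    test = [(b, x) for b, x in samples if b in HELD_OUT]
    return train, test

# Placeholder samples: two dummy feature values per bearing.
samples = [(b, i) for b in BEARINGS for i in range(2)]
train, test = cross_severity_split(samples)

train_types = {superclass(b) for b, _ in train}
test_targets = [superclass(b) for b, _ in test]
print(sorted(train_types), test_targets)  # ['I', 'N', 'R'] ['I', 'I', 'R', 'R']
```

Every superclass in the test set also appears in the training set, while none of the test subclasses do, which is the condition that makes the task a test of generalization across severities.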

Case studies
This section discusses two fault diagnosis cases, namely, the multiclass bearing fault diagnosis task and the cross-severity bearing fault diagnosis task. The first case verifies the diagnosis performance of the proposed DCTN model. The adopted strategy of hierarchical decision-making is expected to be conducive to improving the precision of fault identification. The second case verifies the generalization ability across different fault severities. The proposed model is built with the PyTorch framework and implemented on a computer with a 64-bit Windows 10 system, an i5 CPU, 16 GB of RAM, and an NVIDIA RTX 2080 GPU.

Case one: multiclass fault diagnosis of bearings
This subsection discusses the diagnosis task of seven fault categories that belong to different fault types and severities. The input data of the model are the TFD matrices generated by the CWT method. For convenience of processing, all the TFD matrices are resized to the dimension of 100×100 by bilinear interpolation as the standard input of the network model. Each fault category corresponds to 100 signal samples, a given proportion of which are randomly selected as the training set and the rest as the test set. In the training, the batch size of the model is set as 16, and the learning rate is set as 0.01. Moreover, 10% of the training data are randomly extracted as the validation data set. All the training processes can achieve convergence within 200 epochs. To reduce the influence of random error, the fault identification accuracies given in the experiment are averaged over 10 repeated experiments.
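The input preparation described above can be sketched with NumPy as follows. The bilinear resizing uses the corner-aligned convention, and the split takes the validation samples out of the training portion; both choices, as well as the function names and the rounding of sample counts, are assumptions made for illustration.

```python
import numpy as np

def resize_bilinear(matrix, out_rows=100, out_cols=100):
    """Resize a 2-D array by bilinear interpolation (corner-aligned grid)."""
    matrix = np.asarray(matrix, dtype=float)
    in_rows, in_cols = matrix.shape
    r = np.linspace(0, in_rows - 1, out_rows)    # fractional source rows
    c = np.linspace(0, in_cols - 1, out_cols)    # fractional source columns
    r0 = np.floor(r).astype(int)
    c0 = np.floor(c).astype(int)
    r1 = np.minimum(r0 + 1, in_rows - 1)
    c1 = np.minimum(c0 + 1, in_cols - 1)
    wr = (r - r0)[:, None]                       # vertical interpolation weights
    wc = (c - c0)[None, :]                       # horizontal interpolation weights
    top = matrix[np.ix_(r0, c0)] * (1 - wc) + matrix[np.ix_(r0, c1)] * wc
    bottom = matrix[np.ix_(r1, c0)] * (1 - wc) + matrix[np.ix_(r1, c1)] * wc
    return top * (1 - wr) + bottom * wr

def split_indices(n_samples, train_ratio, val_fraction=0.1, seed=0):
    """Random train/validation/test indices; validation is taken from the training part."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train_total = int(round(n_samples * train_ratio))
    n_val = int(round(n_train_total * val_fraction))
    return order[n_val:n_train_total], order[:n_val], order[n_train_total:]

# A linear ramp stands in for a 256 x 512 TFD matrix.
tfd = np.arange(256)[:, None] + 2.0 * np.arange(512)[None, :]
resized = resize_bilinear(tfd)
train_idx, val_idx, test_idx = split_indices(100, train_ratio=0.5)
print(resized.shape, len(train_idx), len(val_idx), len(test_idx))  # (100, 100) 45 5 50
```

Bilinear interpolation reproduces a linear ramp exactly, which makes the ramp a convenient sanity check for the resizing step.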
Figure 8 illustrates the fault diagnosis performance of bearings under operating condition C17. The ratio of the training data in the whole dataset ranges from 0.1 to 0.9. Theoretically, using more samples in model training is more conducive to the model achieving higher recognition accuracies. The analysis under different training sample sizes is beneficial to comparing the fault diagnosis performance of the proposed model more comprehensively. Parameter sensitivity analysis is also executed, in which the proportion weight ω in Eq. (9) is set as 0.1, 0.5, 1, 5, 10, 50, 100, and 500. Theoretically, parameter ω cannot be very large or very small. A large ω will lead to the reduction or even loss of the feature learning ability achieved by pre-training in the fine-tuning of the model, whereas a small ω will prevent the advantages of the designed hierarchical decision from being reflected. Figure 8 shows that the results are consistent with the above analysis. When the training data ratio is higher than 0.5, the model can achieve an accuracy of 100% under different settings. When the ratio is less than 0.5, the recognition accuracy shows a downward trend along with the decrease of the training data. In comparison, the fault diagnosis performance is unsatisfactory when ω is 0.1, 0.5, or 1. When ω is 100 or 500, the performance of the model is relatively better but not as good as the performance of the model when ω = 10. Therefore, we can conclude that the hierarchical diagnosis strategy of the model proposed in this paper is beneficial to improving the accuracy of fault diagnosis, but the knowledge learning ability of the CNN model should also be retained. A better fault diagnosis performance can be reached by the reasonable allocation of the two parts in the final decision. Table 4 shows the fault diagnosis performance of bearings with different training data ratios under 17 operating conditions.
The mean accuracies on the right of the table correspond to the recognition accuracy rate under specific conditions. The mean accuracy corresponding to the data collected under condition C4 is the lowest, which is 97.08%. The mean accuracies of the data collected under conditions C5 and C13 are the highest, which are both 99.89%. The performance of fault diagnosis varies under different working conditions, but the overall identification can be considered very effective. The mean accuracies on the bottom of the table correspond to the recognition accuracy rate under a specific ratio. In a comprehensive consideration of all working condition data, the mean recognition accuracy also increases with the increase of the ratio of training samples, and it can be maintained at 100% when the ratio exceeds 0.7. When the ratio is 0.1, the mean recognition rate is the lowest, which is 96.04%. Overall, the proposed DCTN-based multiclass bearing fault diagnosis approach is effective under different operating conditions.
To analyze the performance of the proposed DCTN-based fault diagnosis approach more objectively, it is compared with nine other typical fault diagnosis approaches:
1) The TFD-CNN approach that has the same input and structure as the pre-trained backbone network.
2) The TFD-local binary convolutional neural network (LBCNN) approach that has the same input as the proposed approach and uses the LBCNN for fault identification. The used network structure is the same as the model in Ref. [12].
3) The TFD-PCA-SVM approach that uses the principal component analysis (PCA) method to acquire the sample features from TFDs and adopts the SVM method for fault identification. The penalty parameter and kernel parameters in SVM are selected automatically by grid searching.
4) The TFD-PCA-KNN approach that uses the PCA features and the k-nearest neighbor (KNN) method for fault identification.
5) The TFD-PCA-extreme learning machine (ELM) approach that uses the PCA features and the ELM method for fault identification. The weight matrix and bias of hidden layers in the ELM model can be adjusted automatically.
6) The time-features-SVM approach that uses 14 typical time-domain features of bearing signals, namely, maximum value, minimum value, mean value, peak-to-peak value, rectified mean value, variance, standard deviation, kurtosis, skewness, root-mean-square, corrugation factor, crest factor, impulse factor, and margin factor, and the SVM method for fault identification.
7) The time-features-KNN approach that uses 14 typical time-domain features and the KNN method for fault identification.
8) The time-features-ELM approach that uses 14 typical time-domain features and the ELM method for fault identification.
9) The raw-data-wide deep convolution neural network (WDCNN) approach that uses the raw signal as input and WDCNN [43,44] for fault identification. The used network structure is the same as the model in Ref. [44].
The fault diagnosis performance of the various approaches under different ratios of training data is shown in Fig. 9. With the same classifiers, the time-frequency analysis exhibits evident advantages over the typical time-domain analysis, showing that TFD is an effective form of data analysis for the joint diagnosis of bearing fault type and bearing fault severity. The DL models, namely, CNN, WDCNN, LBCNN, and DCTN, are more accurate than the other methods. Although the SVM, KNN, and ELM models are all typical small-sample-analysis methods, their diagnosis performance is not satisfactory when the sample size is small. The proposed DCTN model shows overall higher fault diagnosis accuracies than the CNN model with the same convolutional layers, which illustrates that the proposed hierarchical decision strategy is beneficial to improving the decision-making ability of the model.
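The 14 time-domain features named in approach 6) above could be computed roughly as follows. This is a minimal sketch, not the authors' implementation: the paper does not give its formulas, so the common textbook definitions are assumed, "corrugation factor" is read as the waveform (shape) factor, and the function name `time_domain_features` is hypothetical.

```python
import numpy as np


def time_domain_features(x):
    """Return 14 common time-domain statistics of a 1-D vibration signal.

    The formulas are standard textbook definitions and are an assumption;
    the paper does not spell out its exact definitions.
    """
    x = np.asarray(x, dtype=float)
    abs_x = np.abs(x)
    mean = x.mean()
    std = x.std()                       # population standard deviation
    rms = np.sqrt(np.mean(x ** 2))      # root-mean-square
    rect_mean = abs_x.mean()            # rectified (absolute) mean
    peak = abs_x.max()                  # largest absolute amplitude
    centered = x - mean
    return {
        "max": x.max(),
        "min": x.min(),
        "mean": mean,
        "peak_to_peak": x.max() - x.min(),
        "rectified_mean": rect_mean,
        "variance": x.var(),
        "std": std,
        "kurtosis": np.mean(centered ** 4) / std ** 4,
        "skewness": np.mean(centered ** 3) / std ** 3,
        "rms": rms,
        # Assumed reading of "corrugation factor" in the text.
        "waveform_factor": rms / rect_mean,
        "crest_factor": peak / rms,
        "impulse_factor": peak / rect_mean,
        "margin_factor": peak / np.mean(np.sqrt(abs_x)) ** 2,
    }
```

The resulting 14-element vector would then be passed to the SVM, KNN, or ELM classifier. A constant signal has zero standard deviation, so the ratio features are undefined there and would need guarding in practice.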

Case two: cross-severity fault diagnosis of bearings
The cross-severity fault diagnosis task attempts to identify the fault type of test samples whose fault categories are unseen in the training samples. The cross-severity fault diagnosis tasks are listed in Table 5. Specifically, for the aeronautical bearings with failures on the inner ring or a roller, the monitored signals corresponding to different fault severities are used for training and testing. The test data in each task correspond to two fault types, namely, the defined superclasses I and R, and to the same fault severity, namely, 150, 250, or 450 µm. The training data contain all the fault types but lack the fault severity of the test data.
Figure 10 shows the predicted superclass labels of the test samples under condition C17 with six fault categories. The model parameters used in the experiment are consistent with those set in Section 4.1. Cross-severity fault diagnosis is effective because most of the predicted labels are correct, especially for the bearing samples corresponding to R-5, where the predictions of the superclass labels are all correct. In addition, the incorrectly predicted samples of labels I-4 and R-7 are identified as label N, which is reasonable because these two sets of data correspond to small bearing faults. The predicted labels show the effectiveness of the DCTN model in superclass identification.
Figure 11 shows the prediction probabilities of the test samples with six fault categories in three tasks. The bar chart shows the mean of the prediction probabilities of all the samples corresponding to each category. The error bars show the range of the prediction probabilities for each superclass label. The prediction probability of the superclass category to which the test sample belongs is the largest, which is the fundamental basis for the realization of cross-severity fault diagnosis because all decisions are inferred according to the probability value.
The correct prediction probability of the samples corresponding to the R-5 category is close to 1, which is the best prediction performance among the six categories. The predicted labels and probabilities demonstrate the effectiveness of the proposed DCTN model for cross-severity fault diagnosis tasks, which can support better generalization of the model.
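The data partition behind these tasks can be sketched as below: the held-out severity is removed from training entirely, and only the I and R samples of that severity are tested. This is an illustration under assumptions, not the authors' code. The record keys `fault_type`, `severity_um`, and `signal`, the function name `cross_severity_split`, and the toy records are all hypothetical, and label N is read here as the normal condition.

```python
def cross_severity_split(samples, held_out_severity, test_types=("I", "R")):
    """Split samples so the held-out severity never appears in training.

    Each sample is a dict with hypothetical keys "fault_type" and
    "severity_um" (None for the normal class); "signal" is a placeholder.
    """
    train, test = [], []
    for s in samples:
        if s["severity_um"] == held_out_severity:
            # Only the evaluated superclasses are kept for testing.
            if s["fault_type"] in test_types:
                test.append(s)
        else:
            train.append(s)
    return train, test


# Toy records; the values are illustrative, not data from the test rig.
samples = [
    {"fault_type": t, "severity_um": sev, "signal": None}
    for t in ("I", "R")
    for sev in (150, 250, 450)
] + [{"fault_type": "N", "severity_um": None, "signal": None}]

train, test = cross_severity_split(samples, held_out_severity=250)
```

With this split, a model can only succeed on the test set by recognizing the fault type independently of the severity it was trained on.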
The proposed DCTN-based cross-severity fault diagnosis approach can reduce the requirements for labeled data in practical application and is more consistent with engineering needs. Moreover, the three cross-severity fault diagnosis tasks listed in Table 5 are performed using the comparison methods selected in Section 4.1. The fault diagnosis accuracies of each category, the mean accuracies of each task, and the mean accuracies of each approach are listed in Table 6. The highest mean prediction accuracies overall and for each task are shown in bold. The following conclusions can be drawn from the results in Table 6:
a) Most methods are completely ineffective in the identification of most fault categories when the results of the samples in each category are analyzed separately. Several methods reach a recognition rate of 100% in one category even though their fault classification is ineffective. For example, the diagnosis accuracy of the TFD-PCA-SVM approach in the R-6 fault category is 100%, but the accuracy of I-3 is 0 in the same task. Analysis of the predicted labels reveals that the model identifies the labels of all the samples as R, that is, the classifier loses its discriminability. Hence, although the mean recognition accuracy of this method in Task 2 is 50%, the approach is not valid for this task.
b) The mean fault diagnosis accuracy of the proposed DCTN-based approach reaches 93.83%, which is remarkably higher than that of the other methods. The results show that the proposed DCTN method is more suitable for cross-severity fault diagnosis tasks. In the three tasks, the highest recognition accuracy achieved by the proposed approach is 99%. In comparison, most of the accuracies of the other methods are less than 50%, and several methods are completely ineffective with an accuracy of 0. Overall, the effectiveness of the fault diagnosis approach proposed in this paper is verified in each task. Compared with the other approaches, it shows an evident advantage in the cross-severity fault diagnosis task.
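The point made in conclusion a) can be reproduced with a toy calculation: a classifier that labels every sample as R scores 100% on R, 0 on I, and 50% on average when the two categories are equally sized. The sample counts and the helper name `per_class_accuracy` below are hypothetical and are not taken from Table 6.

```python
def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of samples of that label predicted correctly}."""
    totals, correct = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Toy test set with equal numbers of inner-ring (I) and roller (R) samples.
y_true = ["I"] * 50 + ["R"] * 50
# A degenerate classifier that labels every sample as R.
y_pred = ["R"] * 100

acc = per_class_accuracy(y_true, y_pred)   # {"I": 0.0, "R": 1.0}
mean_acc = sum(acc.values()) / len(acc)    # 0.5
```

This is why the per-category accuracies must be read together with the task mean: a mean of 50% on a two-class task can hide a classifier with no discriminability at all.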

Conclusions
Aiming at the problems of poor interpretability and weak generalization ability that commonly exist in deep-learning-based fault diagnosis methods, this paper proposes a DCTN-based hierarchical fault diagnosis method that effectively combines the advantages of the decision tree and the CNN model. The proposed DCTN model uses the convolutional layers in the CNN model for sample characterization and replaces the fully-connected layer in the CNN model with a novel tree-structured decision layer, in which the leaf nodes and seed nodes are set for fault type and fault severity identification, respectively. The ability of hierarchical decision making is given to these nodes in the model through pre-training and fine-tuning with exclusive loss functions. The final fault diagnosis decision is made according to the overall path probabilities in the tree structure.
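A decision based on path probabilities can be sketched as follows. This is only an illustration under assumptions, not the DCTN decision layer itself: it assumes a two-level tree in which the first level scores the fault type, the second level scores the severity given the type, and the probability of a path is the product of the two. The function names and all numbers are hypothetical.

```python
def path_probabilities(p_super, p_sub_given_super):
    """Multiply probabilities along each root-to-leaf path of a two-level tree."""
    paths = {}
    for sup, p in p_super.items():
        for sub, q in p_sub_given_super[sup].items():
            paths[(sup, sub)] = p * q
    return paths


def decide(p_super, p_sub_given_super):
    """Return the (superclass, subclass) path with the largest probability."""
    paths = path_probabilities(p_super, p_sub_given_super)
    return max(paths, key=paths.get)


# Illustrative numbers only; they are not outputs of the DCTN model.
p_super = {"N": 0.1, "I": 0.7, "R": 0.2}
p_sub = {
    "N": {"normal": 1.0},
    "I": {"150um": 0.2, "250um": 0.5, "450um": 0.3},
    "R": {"150um": 0.6, "250um": 0.3, "450um": 0.1},
}
best = decide(p_super, p_sub)   # ("I", "250um"), path probability 0.35
```

Because the superclass probability appears in every path below it, the fault type can still be read off reliably even when the severity branches are uncertain.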
Hierarchical multiclass fault diagnosis experiments and cross-severity fault diagnosis experiments are executed to analyze the generalization of the proposed model. The proposed DCTN-based fault diagnosis approach achieves a relatively higher multiclass recognition performance. In particular, the diagnosis accuracy of this model is even higher than that of the backbone CNN, indicating that the hierarchical decision-making strategy adopted in the model is beneficial to fault diagnosis. Moreover, the proposed method shows a more powerful generalization ability in the cross-severity fault diagnosis experiments, which is meaningful in practice because the collected training samples can rarely cover all fault severities. The experiments highlight the effectiveness and superiority of the proposed method in fault diagnosis. This paper makes a useful exploration of the decision interpretability of the fault diagnosis model and, more importantly, provides a feasible way to realize cross-severity fault diagnosis of bearings. All these are beneficial to improving the confidence level of the fault diagnosis model and facilitating the solution of practical engineering problems. As a fully data-driven method, the proposed model places few limitations on the objects to which it is applied; thus, it can also generalize to other devices.
The purpose of this work is not to provide a complete solution but rather to suggest an alternative approach to deliver improved interpretability and generalization performance of bearing fault diagnosis. Several issues are still worthy of further exploration: 1) In terms of model interpretability, the proposed method still has difficulty explaining whether the convolutional layers have learned useful fault-related knowledge or in which way the model can effectively learn the knowledge. Therefore, the interpretability of the CNN model and other DNN models in fault diagnosis needs to be explored further. Certainly, this is a very challenging task on which many researchers are seeking a breakthrough. 2) In terms of the cross-severity fault diagnosis task, the diagnosis results of the proposed method are limited to the accurate judgment of superclass labels, that is, fault types. It would be more meaningful if the approximate range of the fault severity to which the test sample belongs could also be identified accurately, which can be the direction of our next efforts.