1 Introduction

As a critical part of modern equipment, bearings always work under harsh conditions and suffer from time-varying loads, which results in a significant risk of failure [1]. Bearing failure is a main cause of machinery breakdowns and can sometimes lead to huge economic losses and severe casualties [2, 3]. To ensure the operational reliability of equipment, researchers have conducted related studies on bearing fault diagnosis and proposed many effective methods [4,5,6,7,8].

Among these methods, deep learning based methods have shown excellent performance in recent years; they can learn diagnosis knowledge from large amounts of labeled data and reduce the dependence on expertise [9,10,11,12]. Although deep learning based methods are not expertise-dependent, they are heavily data-dependent. Unfortunately, collecting labeled machinery failure data is expensive or even impossible [13]. Under such circumstances, transfer learning [14] has begun to attract researchers' attention, as it can transfer the related knowledge of a fully labeled source domain to enhance the fault diagnosis performance in a sparsely labeled or unlabeled target domain [15, 16].

Han et al. [17] introduced adversarial learning for feature distribution adaptation and transferred the source domain fault diagnosis model to the target domain. Guo et al. [18] utilized both maximum mean discrepancy (MMD) and adversarial learning to adapt the feature distribution, which can transfer the fault diagnosis model to other machines. Yang et al. [19] transferred a fault diagnosis model from laboratory bearings to locomotive bearings with MMD-based multi-layer feature alignment and pseudo label learning. Wang et al. [20] utilized conditional maximum mean discrepancy (CMMD) to align feature distributions for cross-domain bearing fault diagnosis. Li et al. [21] proposed a novel cross domain fault diagnosis method, which takes full advantage of the availability of target domain health labels.

These transfer learning methods have achieved great success in unsupervised domain adaptation, which can build fault diagnosis models for an unlabeled target domain. However, these works tend to adapt only the marginal or the conditional distribution (MDA or CDA) between the source and target domains. In practice, both marginal and conditional distribution discrepancies have significant but different influences on the domain divergence [22]. Recently, researchers have carried out some work on joint distribution adaptation (JDA) [23, 24], which adapts the marginal and conditional distributions simultaneously. Although these works have achieved better performance, they allocate equal weights to the marginal and conditional distribution discrepancies, and thus cannot quantify the different contributions of these discrepancies.

In this paper, a dynamic distribution adaptation based transfer network (DDATN) is proposed for cross domain bearing fault diagnosis, which utilizes the proposed instance-weighted dynamic maximum mean discrepancy (IDMMD) for dynamic distribution adaptation (DDA). The main contributions of the paper are as follows.

(1) Introduce the DDA framework for cross domain bearing fault diagnosis, which can dynamically adjust the weights of the marginal and conditional distribution discrepancies in domain adaptation.

(2) Propose a novel dynamic distribution discrepancy metric (IDMMD) for unsupervised DDA. IDMMD uses a novel dynamic factor estimation method to dynamically estimate the contributions of MDA and CDA, and further considers the contribution of CDA for each class. In addition, it takes the confidence of the target domain pseudo labels into account when calculating the conditional distribution discrepancy.

The remainder of the paper is organized as follows. The theoretical and technical bases are introduced in Section 2. Section 3 describes the details of the proposed DDATN, which is experimentally evaluated in Section 4. Finally, the conclusion is drawn in Section 5.

2 Preliminaries

2.1 Dynamic Distribution Adaptation

Marginal and conditional distributions make different contributions to the domain divergence, and their contributions change dynamically during the transfer learning procedure. To improve transfer learning performance, DDA [22] was proposed as a general transfer learning framework that considers the different and ever-changing contributions of the marginal and conditional distributions to the domain divergence. In DDA, the dynamic distribution discrepancy has the general form

$$ \begin{aligned} D_{d} \left( {\Omega_{s} ,\Omega_{t} } \right) & = \left( {1 - \mu } \right)D\left( {P_{s} ,P_{t} } \right) + \\ & \quad \mu \sum\limits_{c = 1}^{C} {D^{\left( c \right)} \left( {Q_{s} ,Q_{t} } \right)} , \\ \end{aligned} $$
(1)

where Ps and Qs are the marginal and conditional distributions of the source domain Ωs, respectively; Pt and Qt are the marginal and conditional distributions of the target domain Ωt, respectively; D(Ps, Pt) is the marginal distribution discrepancy, D(c)(Qs, Qt) is the conditional distribution discrepancy for class c, and C is the number of classes; μ is the dynamic weight, which changes as the training goes on.

From Eq. (1), DDA degenerates to MDA and CDA when μ=0 and μ=1, respectively. Therefore, DDA can be regarded as a more general distribution adaptation framework.
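For illustration, a minimal sketch of Eq. (1) in Python is given below. The function name and the discrepancy values are hypothetical placeholders; any concrete metric (such as the MMD introduced next) could supply them.

```python
# Minimal sketch of Eq. (1). The discrepancy values are placeholders;
# any concrete metric (e.g., MMD) could supply them.
def dynamic_discrepancy(d_marginal, d_conditional_per_class, mu):
    """D_d = (1 - mu) * D(Ps, Pt) + mu * sum_c D^(c)(Qs, Qt)."""
    return (1.0 - mu) * d_marginal + mu * sum(d_conditional_per_class)

# mu = 0 recovers pure marginal adaptation (MDA);
# mu = 1 recovers pure conditional adaptation (CDA).
```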

2.2 Maximum Mean Discrepancy

Maximum mean discrepancy (MMD) [25] is an effective distribution discrepancy metric widely used in transfer learning. Given datasets Xs and Xt sampled from distributions P(Xs) and P(Xt), the MMD between P(Xs) and P(Xt) can be calculated as

$$ MMD_{\mathcal{H}} \left( {P\left( {{\varvec{X}}_{s} } \right),P\left( {{\varvec{X}}_{t} } \right)} \right) = \left\| {\frac{1}{{n_{s} }}\sum\limits_{i = 1}^{{n_{s} }} {\phi \left( {{\varvec{x}}_{i}^{s} } \right)} - \frac{1}{{n_{t} }}\sum\limits_{j = 1}^{{n_{t} }} {\phi \left( {{\varvec{x}}_{j}^{t} } \right)} } \right\|_{\mathcal{H}}^{2} , $$
(2)

where xis ∈ Xs and xjt ∈ Xt; ns and nt are the numbers of samples in Xs and Xt, respectively; \(\phi\) is a nonlinear mapping function into the reproducing kernel Hilbert space (RKHS) \({\mathcal{H}}\).

From Eq. (2), the MMD is expressed as the distance in \({\mathcal{H}}\) between mean embeddings of Xs and Xt.
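As a concrete illustration, the squared norm in Eq. (2) can be expanded with the kernel trick into three kernel-matrix means. The sketch below uses PyTorch and a Gaussian kernel; the kernel choice and bandwidth are assumptions, since the paper does not fix them here.

```python
import torch

def rbf_kernel(a, b, sigma=1.0):
    # Pairwise Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    sq_dist = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(x_s, x_t, sigma=1.0):
    # Eq. (2) expanded via the kernel trick:
    # MMD^2 = mean(K_ss) - 2 * mean(K_st) + mean(K_tt).
    k_ss = rbf_kernel(x_s, x_s, sigma)
    k_st = rbf_kernel(x_s, x_t, sigma)
    k_tt = rbf_kernel(x_t, x_t, sigma)
    return k_ss.mean() - 2.0 * k_st.mean() + k_tt.mean()
```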

3 Dynamic Distribution Adaptation Based Transfer Network

In this paper, DDA is introduced to improve cross domain bearing fault diagnosis performance. The framework of DDATN is shown in Figure 1. It consists of two parts: DDA and supervised learning. The DDA part constrains the feature extractor Gf to extract domain-invariant features by minimizing the proposed dynamic distribution discrepancy IDMMD. The supervised learning part, realized by minimizing the supervised loss LC, guides Gf to extract features that are discriminative for bearing health conditions and trains an effective classifier Gy that can accurately diagnose bearing faults with these features.

Figure 1 Framework of DDATN

3.1 Supervised Learning

DDATN is proposed for unsupervised domain adaptation, in which the target domain data are totally unlabeled. The supervised learning is realized using the labeled source domain data, and its loss can be defined as

$$ L_{C} = \frac{1}{{n_{s} }}\sum\limits_{i = 1}^{{n_{s} }} {J\left( {G_{y} \left( {G_{f} \left( {{\varvec{x}}_{i}^{s} } \right)} \right),{\varvec{y}}_{i}^{s} } \right)} , $$
(3)

where \(J\left( { \cdot , \cdot } \right)\) is the cross-entropy loss function, and yis is the label of source domain sample xis.
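In PyTorch terms, Eq. (3) reduces to a standard cross-entropy over the source batch. The sketch below assumes Gf and Gy are torch modules (names as in Figure 1).

```python
import torch.nn.functional as F

def supervised_loss(G_f, G_y, x_s, y_s):
    # Eq. (3): cross-entropy on labeled source samples only;
    # y_s holds class indices of shape (n_s,).
    logits = G_y(G_f(x_s))
    return F.cross_entropy(logits, y_s)
```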

3.2 Instances-weighted Dynamic Maximum Mean Discrepancy (IDMMD)

In unsupervised domain adaptation, the target domain cannot provide label information. The final fault diagnosis can only be conducted by the shared classifier Gy, which is trained with the labeled source domain data. To prevent interference from target domain specific features and domain divergence, it is important to extract domain-invariant features.

In DDATN, a novel dynamic distribution discrepancy metric, IDMMD, is proposed to constrain Gf to extract domain-invariant features. IDMMD is based on the DDA framework, which considers the ever-changing contributions of the marginal and conditional distribution discrepancies to the domain divergence. In addition, IDMMD further considers the different contributions of the conditional distribution discrepancies of different classes and the confidence of the target domain samples' pseudo labels. The IDMMD between Ωs and Ωt is defined as:

$$ \begin{aligned} IDMMD\left( {\Omega_{s} ,\;\Omega_{t} } \right) & = \left( {1 - \sum\limits_{c = 1}^{C} {\mu^{\left( c \right)} } } \right)IDMMD_{M} \left( {\Omega_{s} ,\Omega_{t} } \right) \\ & \quad + \sum\limits_{c = 1}^{C} {\mu^{\left( c \right)} IDMMD_{C}^{\left( c \right)} \left( {\Omega_{s} ,\Omega_{t} } \right)} , \\ \end{aligned} $$
(4)

where IDMMDM is the marginal distribution discrepancy, IDMMDC(c) is the conditional distribution discrepancy for class c, and μ(c) is the dynamic factor for IDMMDC(c). They are defined as

$$ IDMMD_{M} \left( {\Omega_{s} ,\Omega_{t} } \right) = \left\| {\frac{1}{{n_{s} }}\sum\limits_{i = 1}^{{n_{s} }} {\phi \left( {{\varvec{x}}_{i}^{s} } \right)} - \frac{1}{{n_{t} }}\sum\limits_{j = 1}^{{n_{t} }} {\phi \left( {{\varvec{x}}_{j}^{t} } \right)} } \right\|_{\mathcal{H}}^{2} , $$
(5)
$$ IDMMD_{C}^{\left( c \right)} \left( {\Omega_{s} ,\Omega_{t} } \right) = \left\| {\frac{{\sum\limits_{i = 1}^{{n_{s} }} {y_{i}^{\left( c \right)} \phi \left( {{\varvec{x}}_{i}^{s} } \right)} }}{{\sum\limits_{i = 1}^{{n_{s} }} {y_{i}^{\left( c \right)} } }} - \frac{{\sum\limits_{j = 1}^{{n_{t} }} {\hat{y}_{j}^{\left( c \right)} \phi \left( {{\varvec{x}}_{j}^{t} } \right)} }}{{\sum\limits_{j = 1}^{{n_{t} }} {\hat{y}_{j}^{\left( c \right)} } }}} \right\|_{\mathcal{H}}^{2} , $$
(6)
$$ \mu^{\left( c \right)} = \frac{{IDMMD_{C}^{\left( c \right)} \left( {\Omega_{s} ,\Omega_{t} } \right)}}{{IDMMD_{M} \left( {\Omega_{s} ,\Omega_{t} } \right) + \sum\limits_{c = 1}^{C} {IDMMD_{C}^{\left( c \right)} \left( {\Omega_{s} ,\Omega_{t} } \right)} }}, $$
(7)

where yi(c) is the c-th element of the one-hot label of xis, and ŷj(c) is the predicted probability of xjt for class c.

IDMMDM takes the original form of MMD. For unsupervised domain adaptation, the labels of the target domain samples, which are necessary for calculating the conditional distribution discrepancy, are unavailable. Therefore, the predictions on Ωt are regarded as its soft labels. Considering the confidence of the soft labels, different weights are allocated to different target domain samples when calculating IDMMDC(c), namely their predicted probabilities for class c. Intuitively, MMD calculates the distance between the centers of two datasets in the embedded feature space. With the proposed weight allocation, the center of each class in the target domain will tend to be closer to the samples that have higher prediction probabilities. Therefore, the negative effects of misclassification are diminished.

To quantify the ever-changing contributions of the marginal and conditional distributions, the dynamic factor for each class is calculated by Eq. (7). A class with a larger IDMMDC value will be allocated a larger dynamic factor, and the dynamic factor of the marginal distribution follows from Eq. (4). The proposed dynamic factor allocation method aims at guiding the DDA to focus on the main cause of the domain shift (see the sketch below).
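The following is a sketch of Eqs. (4)-(7), reusing rbf_kernel from the MMD sketch above. The weighted class means of Eq. (6) are evaluated with the kernel trick; detaching the dynamic factors so that they act as weights rather than gradient paths is an implementation assumption, not stated in the paper.

```python
import torch

def weighted_mmd2(k_ss, k_st, k_tt, w_s, w_t):
    # ||sum_i w_i phi(x_i^s) - sum_j v_j phi(x_j^t)||_H^2 via the kernel
    # trick, for weight vectors w_s and w_t that each sum to one.
    return w_s @ k_ss @ w_s - 2.0 * (w_s @ k_st @ w_t) + w_t @ k_tt @ w_t

def idmmd(feat_s, feat_t, y_s_onehot, y_t_prob, sigma=1.0):
    # Kernel matrices on the extracted features G_f(x).
    k_ss = rbf_kernel(feat_s, feat_s, sigma)
    k_st = rbf_kernel(feat_s, feat_t, sigma)
    k_tt = rbf_kernel(feat_t, feat_t, sigma)

    # Marginal term, Eq. (5): uniform instance weights.
    n_s, n_t = feat_s.shape[0], feat_t.shape[0]
    u_s = torch.full((n_s,), 1.0 / n_s)
    u_t = torch.full((n_t,), 1.0 / n_t)
    d_m = weighted_mmd2(k_ss, k_st, k_tt, u_s, u_t)

    # Conditional terms, Eq. (6): source one-hot labels vs. target
    # prediction probabilities (soft labels) as instance weights.
    d_c = []
    for c in range(y_s_onehot.shape[1]):
        w_s = y_s_onehot[:, c] / y_s_onehot[:, c].sum()
        w_t = y_t_prob[:, c] / y_t_prob[:, c].sum()
        d_c.append(weighted_mmd2(k_ss, k_st, k_tt, w_s, w_t))
    d_c = torch.stack(d_c)

    # Dynamic factors, Eq. (7), detached so that they weight the loss
    # without being back-propagated themselves (an assumption).
    mu = (d_c / (d_m + d_c.sum())).detach()

    # Eq. (4): dynamically weighted combination.
    return (1.0 - mu.sum()) * d_m + (mu * d_c).sum()
```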

3.3 General Procedure of DDATN

As mentioned above, DDATN contains two parts: supervised learning and DDA. Therefore, the total loss function of DDATN can be defined as

$$ L_{{{\text{total}}}} = L_{C} + \lambda \cdot IDMMD\left( {\Omega_{s} ,\Omega_{t} } \right), $$
(8)

where λ is the trade-off factor.

The procedures of DDATN are presented in Figure 2 and summarized as follows.

(1) Dataset generation. The source and target domain signals are segmented and standardized to form the source and target domain datasets (Ωs and Ωt), respectively.

(2) Batch optimization. Batches of source and target domain samples are generated from Ωs and Ωt, respectively. These batches are forward propagated to calculate the loss Ltotal, which is then back-propagated to update the whole network (see the sketch after this list).

(3) Traverse datasets. Repeat step 2 to traverse Ωs and Ωt.

(4) Iterative optimization. Repeat step 3 for nepochs times.

(5) Model evaluation. Evaluate the final fault diagnosis model with the testing dataset.
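A hypothetical training loop following steps (1)-(4) is sketched below; it reuses idmmd from Section 3.2 and the settings of Section 4.2 (Adam optimizer, λ = 1). Names such as source_loader, target_loader, n_epochs and num_classes are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(
    list(G_f.parameters()) + list(G_y.parameters()),
    lr=0.001, betas=(0.9, 0.999))
lam = 1.0  # trade-off factor lambda in Eq. (8)

for epoch in range(n_epochs):                                       # step (4)
    for (x_s, y_s), (x_t, _) in zip(source_loader, target_loader):  # steps (2)-(3)
        feat_s, feat_t = G_f(x_s), G_f(x_t)
        y_t_prob = torch.softmax(G_y(feat_t), dim=1)  # target soft labels
        loss = F.cross_entropy(G_y(feat_s), y_s) + lam * idmmd(
            feat_s, feat_t,
            F.one_hot(y_s, num_classes).float(), y_t_prob)          # Eq. (8)
        optimizer.zero_grad()
        loss.backward()   # back-propagate L_total to update the whole network
        optimizer.step()
```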

Figure 2 Flowchart of DDATN

4 Experiment

4.1 Dataset Description

In this section, the CWRU [26] bearing dataset (CW) and the bearing dataset from our laboratory (OL) are utilized to verify the effectiveness of DDATN.

The test rig of the CWRU bearing dataset is shown in Figure 3; it consists of a drive motor (left), a torque transducer and an encoder (middle), and a dynamometer (right). The test bearing is installed at the output side of the motor and supports the motor shaft. This bearing test includes four health conditions with four fault diameters (0.18, 0.36, 0.53, 0.71 mm), and it was conducted under four different loads (0, 1, 2, 3 hp) with sampling rates of 12 kHz and 48 kHz. The details of the used part of the data, whose sampling rate is 12 kHz, are listed in Table 1.

Figure 3 CWRU bearing test rig

Table 1 Details of the CWRU bearing dataset

The bearing test rig of our laboratory is shown in Figure 4. The test rig is driven by a motor, and the power is transmitted through a belt drive to the shaft, which is supported by the test bearing. The loading device exerts a radial force on the shaft to simulate the bearing load. In this test, inner and outer race faults of size 0.5 mm are introduced into the test bearing by wire-electrode cutting. The acceleration signals are collected at a sampling rate of 12 kHz. The details of this dataset are listed in Table 2.

Figure 4 Bearing test rig

Table 2 Details of the OL bearing dataset

For each health condition, 100 samples are segmented from the original vibration signals. Therefore, 300 and 1200 samples (300 for each speed) are obtained from the CW and OL bearing datasets, respectively. The length of each sample is set to 2048 points.
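A sketch of this segmentation step is given below, under stated assumptions: non-overlapping segments and per-sample standardization (the paper does not specify the stride or normalization details).

```python
import numpy as np

def make_samples(signal, n_samples=100, length=2048):
    # Segment a 1-D vibration signal into fixed-length samples and
    # standardize each sample to zero mean and unit variance.
    # Non-overlapping segmentation is an assumption.
    segments = np.stack([signal[i * length:(i + 1) * length]
                         for i in range(n_samples)])
    mean = segments.mean(axis=1, keepdims=True)
    std = segments.std(axis=1, keepdims=True)
    return (segments - mean) / (std + 1e-8)
```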

4.2 Comparison Setting

Thirty-six cross equipment tasks are conducted to verify the effectiveness of DDATN, as listed in Table 3. For the target domain dataset, half of the samples are used for training and the rest serve as the testing dataset. In Table 3, S denotes the source domain dataset, T denotes the target domain training dataset, and OL500 (150) denotes the OL bearing data with a speed of 500 r/min and 150 samples (50 samples for each health condition). CW0.18_1 (300) denotes the CWRU data with 0.18 mm fault diameter and 1 hp load, with 300 samples (100 samples for each health condition).

Table 3 Cross equipment tasks

The structures of the feature extractor Gf and the classifier Gy are presented in Table 4, where Conv1D denotes a 1D convolutional layer, MP1D denotes a 1D max pooling layer, and FC denotes a fully connected layer. Gf and Gy are trained by the Adam optimizer (learning rate = 0.001, β1 = 0.9, β2 = 0.999). The trade-off factor λ is set to 1. An illustrative structure sketch follows Table 4.

Table 4 Structures of Gf and Gy
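For orientation, a runnable sketch of this network family is given below. Table 4 fixes the actual structure, so the layer counts and channel/kernel sizes here are assumptions.

```python
import torch.nn as nn

# Illustrative only: Table 4 specifies the actual structure. The layer
# counts and channel/kernel sizes below are assumptions.
G_f = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=64, stride=8, padding=28), nn.ReLU(),  # Conv1D
    nn.MaxPool1d(2),                                                    # MP1D
    nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),             # Conv1D
    nn.MaxPool1d(2),                                                    # MP1D
    nn.Flatten(),        # Flatten layer (used for t-SNE in Section 4.4)
)
G_y = nn.Sequential(
    nn.LazyLinear(256), nn.ReLU(),   # FC; LazyLinear infers the input size
    nn.Linear(256, 3),               # FC; 3 health conditions (N, IR, OR)
)
```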

The details of the comparison methods are as follows. They use the same CNN structure as DDATN.

  • Method 1 (DDC): Deep domain confusion (DDC) [27] is a deep transfer learning method proposed by Tzeng et al., which utilizes MMD for single-layer feature alignment.

  • Method 2 (FTNN): The feature-based transfer neural network (FTNN) [19] is proposed by Yang et al., which applies MMD-based multi-layer feature alignment and pseudo label learning to transfer fault diagnosis knowledge from laboratory bearings to locomotive bearings.

  • Method 3 (DTN): The deep transfer network (DTN) [23] is a cross-domain fault diagnosis method proposed by Han et al. It utilizes MMD and CMMD to evaluate the marginal and conditional distribution discrepancies, respectively, which are given equal weights for single-layer feature joint distribution adaptation.

  • Method 4 (IWC): IWC is derived from DDATN, which only takes the conditional part of IDMMD as the evaluation of the domain divergence.

  • Method 5 (IWCM): IWCM is also derived from DDATN, which allocates equal weights to the marginal and conditional parts of the IDMMD.

4.3 Result and Discussion

The experiment is conducted on a computer with two E5-2630 v3 CPUs, an Nvidia GeForce RTX 2080 Ti GPU (11 GB memory), and 64 GB memory. To avoid the influence of randomness, each task is repeated 10 times. The mean accuracies, standard deviations, and training and testing times are listed in Table 5. The overall accuracy curves of these methods are presented in Figure 5. In addition, for each method, the average accuracy and standard deviation over all tasks are presented in the Avg row.

Table 5 Comparison results
Figure 5 Comparison result

The comparison shows that IWC, IWCM and DDATN perform better than the other methods. DDC has the worst accuracy in almost all tasks except task 14. FTNN and DTN are both derived from DDC, but they improve it in different directions. FTNN extends the single-layer adaptation to multi-layer adaptation and introduces pseudo label learning for further improvement, which is demonstrated to be effective in this case. DTN extends MDA to JDA and achieves a higher average accuracy than FTNN.

IWC achieves an average accuracy of 91.24% with a standard deviation of 11.16%, which indicates that the proposed conditional distribution discrepancy metric is effective and robust. IWCM shows better performance than IWC in all tasks; this comparison proves the extension from CDA to JDA valid. In addition, IWCM can be regarded as a variation of DTN that replaces the pseudo label strategy with the instance-weighted strategy when calculating the conditional distribution discrepancy. The comparison between IWCM and DTN demonstrates the effectiveness of the instance-weighted strategy.

Specifically, DDATN outperforms the other methods in almost all tasks and achieves the highest average accuracy of 98.43% with the lowest standard deviation, which indicates its superior effectiveness and robustness. In tasks 3, 7, 13, 21 and 23, DDATN does not perform best, but it still attains an accuracy very close to the highest one. In tasks 4 (CW0.18_1 to OL1400) and 8 (CW0.18_2 to OL1400), the accuracies of DDATN are relatively low (67.07% for both tasks), which indicates that DDATN cannot attain very high accuracy in some transfer tasks. However, the accuracies of DDATN in these tasks are still higher than those of the other methods. In some difficult transfer tasks, DDATN may not attain very high accuracy, but it can still improve the performance to some degree.

In summary, the comparison indicates that the proposed conditional distribution discrepancy metric is effective and robust, and that the extension from MDA, CDA and JDA to DDA can further improve the cross-domain fault diagnosis performance.

4.4 Feature Visualization

All the methods used in this experiment are feature-based transfer learning methods. To further evaluate the feature alignment performance of DDATN, t-distributed stochastic neighbor embedding (t-SNE) [28] is utilized for feature visualization, and tasks 33 and 36 are selected. For DDC, IWC, IWCM and DDATN, the feature visualization is conducted on the Flatten layer, whereas it is conducted on the FC_2 layer for FTNN and DTN. In Figures 6 and 7, the legend consists of two parts: the bearing health condition (outside the bracket) and the domain label (inside the bracket). For example, IR (T) denotes an inner race fault sample of the target domain. The markers of source and target domain samples are circles and triangles, respectively. The color represents the health condition, e.g., blue represents Normal (N), red represents Inner Race fault (IR), and green represents Outer Race fault (OR).
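A sketch of this visualization step is given below, assuming the per-sample condition and domain tags (labels, domains) have been prepared alongside the stacked Flatten-layer features (feats); all variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# feats: stacked Flatten-layer features of source and target samples;
# labels / domains: per-sample health-condition and domain tags
# (both assumed prepared beforehand as NumPy arrays).
emb = TSNE(n_components=2, random_state=0).fit_transform(feats)
colors = {"N": "blue", "IR": "red", "OR": "green"}
for cond, color in colors.items():
    for dom, marker in (("S", "o"), ("T", "^")):  # circle: source, triangle: target
        mask = (labels == cond) & (domains == dom)
        plt.scatter(emb[mask, 0], emb[mask, 1], c=color,
                    marker=marker, label=f"{cond} ({dom})")
plt.legend()
plt.show()
```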

Figure 6 Feature visualization of task 33 (CW0.53_3 to OL500)

Figure 7 Feature visualization of task 36 (CW0.53_3 to OL1400)

In task 33, the features of IWC, IWCM and DDATN show good fusion of the source and target domains, and great discriminability with respect to the bearing health conditions is also observed. The features of the other methods still have relatively good interclass discriminability, but the aggregation of the source and target domains is poor. In particular, the source and target domain samples can be linearly separated with the features of DDC.

In task 36, the features of IWC, IWCM and DDATN still show superior performance compared with those of the other methods, with good fusion of the source and target domains. However, there is a significant degeneration of the interclass separability for IWC and IWCM, whereas DDATN still holds excellent interclass separability. For DDC, FTNN and DTN, the interclass separability and intraclass aggregation are both poor, and the fusion of the source and target domains is hardly observed.

The feature visualization demonstrates that DDATN can effectively adapt the target domain feature distributions to those of the source domain. The target samples can be accurately aggregated to the corresponding source clusters, and the extracted features show good fusion of the domains and excellent discriminability of the bearing health conditions.

5 Conclusions

This article proposes a novel unsupervised domain adaptation method termed DDATN, which introduces DDA for cross domain bearing intelligent fault diagnosis. The DDA is realized by the proposed IDMMD, which combines a novel dynamic factor estimation method and an instance-weighted conditional distribution discrepancy metric. Cross domain bearing fault diagnosis experiments are conducted to verify the effectiveness of DDATN, and DDATN achieves better performance than other state-of-the-art cross domain fault diagnosis methods. The results demonstrate that the proposed conditional distribution discrepancy metric and the dynamic factor calculation method are effective and robust for DDA. Therefore, DDATN can effectively adapt the target domain feature distributions to those of the source domain for better cross domain bearing fault diagnosis.