1 Introduction

Condition monitoring of rotating components such as bearings requires a solid understanding of the correlation between diagnostic measures, i.e., high-dimensional data such as vibration signals, and structural defects of the component [20]. The typical scope of condition monitoring involves prompt interrogation techniques using system-integrated sensors along with advanced signal-processing tools for extracting defect-sensitive features [21]. Statistical variations in the dynamic properties of rotating machinery have to be constantly monitored and evaluated to find the critical point at which the trend changes from a healthy state to a damaged state. Thus, separating damage-sensitive features of healthy and damaged conditions in high-dimensional data measurements has become an active research topic. It involves data classification, dimensionality reduction, pattern recognition, and trend analysis.

A vibration signal collected from a rotating machine, a typical example of high-dimensional data, exhibits a broad spectrum that represents the dynamics of interaction among all rotating components, i.e., bearings, gears, background resonances, driving frequencies, and defects. Such signals are also masked by measurement noise and environmental factors such as humidity, temperature, and lubrication conditions. Thus, vibration signals from a rotating system carry a highly mixed, multi-dimensional data profile induced by an unknown combination of various factors, and it is a challenging task to extract evidence of a defect from a vast amount of vibration signals. Since the high dimensionality of the original vibration signal hinders knowledge discovery and visualization, the dimensionality needs to be reduced [3]. Feature extraction has been an active research area in the pattern recognition, statistics, and data mining communities [10, 12]. The main idea of feature extraction is to choose a subset of the extracted input features by eliminating features with little or no predictive information. Feature extraction can significantly improve the comprehensibility of the resulting unsupervised or supervised learning models and often yields a model that generalizes better to unseen signals. Relevant features aid in learning models, whereas irrelevant features can hurt their performance, and with more features, more training samples may be needed to obtain good performance [6]. Therefore, choosing relevant features is essential in fault detection. This is especially true when classifying vibration signals in which the features outnumber the samples by several orders of magnitude [13]. In this paper, we analyze simulated time-history responses with \(N=4{,}096\) time points (number of features) for three different fault-types, assuming that only 100 sample signals (number of samples) are available from each fault-type. Therefore, it is critical to extract a few damage-sensitive features from the original time-history responses.

Principal component analysis (PCA) is one of the most widely used feature extraction techniques for dimensionality reduction and has been studied extensively in the literature [16, 19, 24]. However, due to the variance-maximizing nature of PCA, the relationship between the first few principal components (PCs) and the sources of distinction among faults cannot be interpreted physically. For example, the location of vibration signal clusters in a low-dimensional feature space (i.e., 2D or 3D for visualization) does not explain what kind of statistical characteristics cause the distinction among fault-types. This deficiency restricts its interpretability to specific problems of process control and thus limits its broader usefulness.

In this context, we propose a feature extraction approach that uses frequency-based signal energy, the so-called wavelet scalogram, for interpretable visualization and classification of fault-generated vibration signals. While the development of fault-sensitive features normally requires sufficient prior knowledge or fault condition information, which is hard to obtain in many environments, our approach does not require such information and is therefore suitable for an unsupervised learning scheme. Wavelet analysis is capable of revealing hidden aspects of the vibration data that other signal analysis techniques fail to detect [15]. The main advantage of wavelets is the capacity to perform local analysis of a signal, i.e., to zoom in on any interval of time or space. The wavelet scalogram is a vector of scalewise energies in a signal and is the wavelet analogue of the well-known periodogram from the spectral analysis of time series [1, 11]. It decomposes the energy of a signal into wavelet components at different frequency levels [22]. We can then interpret the distinction among clusters as a difference in frequency components. Our approach visualizes fault classes by separating frequency components in vibration signals; in the supervised learning setting, we additionally use silhouette statistics for more effective visualization of several fault classes.

The contribution of this paper can be summarized as follows. This paper introduces a methodology for interpretably visualizing and classifying very high-dimensional vibration signals from a bearing system. To overcome the deficient interpretability of PCA, this study exploits the multiscale energy analysis of the discrete wavelet transformation (DWT) in an unsupervised manner. In a supervised learning scheme, it is combined with silhouette statistics for more effective visualization of the main sources of different fault classes and for classifying signals into fault classes. The simulation results of fault detection in a bearing system show that the proposed approach can be successfully applied to classify several different faults using nonparametric multi-class classifiers such as CART and kNN, even in the presence of a considerable amount of noise. The proposed feature extraction approach has some desirable attributes. First, the feature extraction is conducted in an unsupervised learning mode; that is, data from the faulty structure are not needed. This ability is very important because data from faulty machinery are typically not available for most real-world engineering structures. Second, the approach is attractive for the development of an automated continuous monitoring system because of its simplicity and because it requires minimal interaction with users.

The organization of this paper is as follows. In Sect. 2, PCA-based fault cluster visualization of vibration signals and its limitation are described. The proposed methodologies and their backgrounds are outlined in Sect. 3. In Sect. 4, an illustrative example of a bearing system with high-dimensional vibration signals is explained. In Sect. 5, the performance of the proposed method is evaluated in terms of visualization and classification accuracy. Finally, Sect. 6 contains the conclusions and future research topics.

2 PCA-based fault cluster visualization and its limitation

In this section, we briefly introduce PCA for fault cluster visualization and discuss its main deficiency.

PCA is one of the most widely used multivariate statistical techniques for dimensionality reduction. Given a random vector of \(p\) variables, \(\mathbf{X}^T=[X_1, X_2, \ldots , X_p]\), observed on a sample of \(n\) observations, and its covariance (or correlation) matrix \(\Sigma \), the first step of PCA is to transform the original variables \(\mathbf{X}\) into uncorrelated linear combinations \(\mathbf{Z}=\alpha ^T \mathbf{X}\), where \(\mathbf{Z}^T=[Z_1, Z_2, \ldots , Z_p]\). Using eigenvalue analysis of \(\Sigma \), these \(Z_i\) (\(i=1, 2, \ldots , p\)) are ordered such that the \(Z_i\) with the highest eigenvalue of \(\Sigma \) corresponds to the first PC and describes the largest amount of the variability in the original data; the second highest is the second PC, etc. Significant dimension reduction can be achieved if only the first few PCs are needed to represent most of the variability in the original high-dimensional space. This is common when there is high multicollinearity among the original variables. It should be noted that if the units of the original variables are different, the data should be normalized before performing PCA.
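The eigenvalue-based procedure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in this study: the function name `pca_scores` and the choice of the covariance (rather than correlation) matrix are assumptions made for the example.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project the rows of X (n samples x p variables) onto the leading PCs.

    Hypothetical helper for illustration; uses the sample covariance matrix.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = np.cov(Xc, rowvar=False)           # p x p sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]        # reorder so the largest eigenvalue is first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Z = Xc @ eigvecs[:, :n_components]       # PC scores (uncorrelated columns)
    explained = eigvals[:n_components] / eigvals.sum()
    return Z, explained
```

The columns of `Z` are the coordinates used for a two-dimensional cluster plot, and `explained` gives the fraction of total variance carried by each retained PC. If the variables have different units, each column of `X` would be standardized first, which corresponds to using the correlation matrix.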

Although PCA is effective for dimensionality reduction, one cannot interpret the physical meaning of each PC, which characterizes the axial location of the classes. The difficulty of physically interpreting the relationship between the first few PCs and the sources of distinction among fault classes is due to the variance-maximizing nature of PCA: PCA merely identifies a lower-dimensional variable set that explains most of the variability of the original variable set. Thus, the location of vibration signal clusters in a low-dimensional feature space does not explain what kind of statistical characteristics cause the distinction among fault types. This deficiency restricts its interpretability to specific problems of fault-monitoring systems.

In this paper, to overcome this deficiency, we exploit the multiscale energy analysis of the discrete wavelet transformation (DWT), i.e., the wavelet scalogram, in an unsupervised manner. The next section briefly reviews the related methodologies.

3 The proposed methodologies and their backgrounds

In this section, we provide an overview of the methodologies used in this paper. First, we give a brief overview of the DWT and wavelet scalograms, which serve as an effective feature extraction step that overcomes the deficiency of PCA. Readers interested in the mathematical aspects of wavelets are directed to [17, 25]. Then, we describe silhouette statistics, which produce a more effective visualization of fault classes in a supervised learning scheme. Finally, the section ends with a brief description of the nonparametric classifiers used to evaluate the classification performance.

3.1 Wavelet scalogram

A wavelet is a function \(\psi (t) \in L^2(R)\) with the following basic properties

$$\begin{aligned} \int \limits _{R} \psi (t) \, \hbox {d}t = 0 \text{ and } \int \limits _{R} \psi ^2(t) \, \hbox {d}t = 1, \end{aligned}$$
(1)

where \(L^2(R)\) is the space of square integrable real functions defined on the real line \(R\). Wavelets can be used to create a family of time-frequency atoms, \(\psi _{s, u} (t) = s^{1/2} \psi (s t - u)\), via the dilation factor \(s\) and the translation \(u\). We also require a scaling function \(\phi (t) \in L^2(R)\) that satisfies

$$\begin{aligned} \int \limits _{R} \phi (t) \, \hbox {d}t \not = 0 \text{ and } \int \limits _{R} \phi ^2(t) \, \hbox {d}t = 1 . \end{aligned}$$
(2)

Selecting the scaling and wavelet functions as \(\{ \phi _{L,k} (t) = 2^{L/2} \phi ( 2^{L} t - k); k \in Z \}\) and \(\{ \psi _{j,k} (t) = 2^{j/2} \psi ( 2^j t - k); j \ge L, k \in Z \}\), respectively, one can form an orthonormal basis to represent a signal function \(f(t) \in L^2(R)\) as follows.

$$\begin{aligned} f(t) = \sum _{k \in Z}c_{L,k}\phi _{L,k}(t) + \sum _{j \ge L}\sum _{k \in Z} d_{j,k}\psi _{j,k}(t) \end{aligned}$$
(3)

where \(Z\) denotes the set of all integers \(\{ 0, \pm 1, \pm 2, \ldots \}\), and the coefficients \(c_{L,k} = \int \nolimits _{R} f(t) \phi _{L,k}(t) \, \hbox {d}t\) are considered to be the coarser-level coefficients characterizing smoother data patterns, and \(d_{j,k} = \int \nolimits _{R} f(t) \psi _{j,k}(t) \, \hbox {d}t\) are viewed as the finer-level coefficients describing (local) details of data patterns. In practice, the following finite version of the wavelet series approximation is used:

$$\begin{aligned} \tilde{f}(t) = \sum _{k \in Z}c_{L,k}\phi _{L,k}(t) + \sum _{j=L}^{J} \sum _{k \in Z}d_{j,k}\psi _{j,k}(t)\,, \end{aligned}$$
(4)

where \(J > L\) and \(L\) corresponds to the coarsest resolution level.

Let \(\mathbf{y}=( y_1, y_2, \ldots , y_{N})^T\) be a data vector of length \(N=2^n\). The discrete wavelet transform (DWT) of \(\mathbf{y}\) is defined as \( \mathbf{d} = {\varvec{W}}\mathbf{y}\) where \({\varvec{W}}\) is the orthonormal \(N \times N\) DWT-matrix. We can write \(\mathbf{d}=(\mathbf{c}_L, \mathbf{d}_L, \mathbf{d}_{L+1}, \ldots , \mathbf{d}_J)^T\), where \(\mathbf{c}_L=(c_{L,0}, \ldots , c_{L,2^{L}-1})^T, \mathbf{d}_L=(d_{L,0}, \ldots , d_{L,2^{L}-1})^T, \ldots , \mathbf{d}_J=(d_{J,0}, \ldots , d_{J,2^{J}-1})^T\). Using the inverse DWT, the \(N\times 1\) vector \(\mathbf{y}\) of the original signal curve can be reconstructed as \(\mathbf{y} = {\varvec{W}}^{T} \mathbf{d}\). By applying the DWT to the data \(\mathbf{y}\), i.e., \(\mathbf{d} = {\varvec{W}}\mathbf{y}\), we obtain the following in the wavelet domain: \(d_{j,k}\) for \(j=L,\ldots , J,\, k=0,1, \ldots , 2^{j}-1\), and \(c_{L,k}\), for \(k=0,1, \ldots , 2^{L}-1\), where \(J=\log _2 N-1\). To simplify the notation, instead of using \(\mathbf{d}=(\mathbf{c}_L, \mathbf{d}_L, \mathbf{d}_{L+1}, \ldots , \mathbf{d}_J)^T\), we write \(\mathbf{d} = (\mathbf{d}_{(1)}, \mathbf{d}_{(2)}, \ldots , \mathbf{d}_{(I)})^T\) for the components of \(\mathbf{d}\), where \(I=J-L+2\); this causes no ambiguity.

If \(\mathbf{d}_{(i)}\) is defined as \(\mathbf{d}_{(i)}=(d_{(i),1}, d_{(i),2}, \ldots , d_{(i),m^i})^T\) where \(m^i\) is the number of wavelet coefficients at the \((i)\)th resolution level, the energy \(s(i)\) of \(\mathbf{d}\) at \((i)\)th resolution level is defined as [11, 25]

$$\begin{aligned} s(i)=\sum _{k=1}^{m^i}d_{(i),k}^2, \ \ \ i=1, 2, \ldots , I \end{aligned}$$
(5)

which is the sum of squares of all wavelet coefficients at the \((i)\)th resolution level.

The wavelet scalogram of \(\mathbf{d}\) is the vector of energies \(S=(s(1), \ldots , s(I))\). The wavelet scalogram provides measures of signal energy at various frequency bands [22]; that is, it represents the scalewise distribution of energies in a signal. Its dimension is therefore very low and is determined by the predetermined depth of decomposition. Moreover, each element of the wavelet scalogram in Eq. (5) has a straightforward interpretation in terms of the characteristics of the signal, which is an advantage over PCA-based dimension reduction approaches.
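As a rough illustration of the scalewise energies in Eq. (5), the following sketch computes a scalogram with the Haar wavelet, chosen here only because its pyramid algorithm fits in a few lines without external wavelet libraries; any other orthonormal wavelet would be handled by the same energy sums. The function name `haar_scalogram` is hypothetical, and the output ordering (approximation energy first, then detail energies from coarsest to finest) is assumed to match \(S=(s(1), \ldots , s(I))\).

```python
import numpy as np

def haar_scalogram(y, depth):
    """Scalewise energies of a signal under an orthonormal Haar DWT.

    Returns [s(1), ..., s(depth + 1)]: approximation energy first,
    then detail energies ordered from the coarsest to the finest level.
    """
    approx = np.asarray(y, dtype=float)
    if approx.size % (2 ** depth) != 0:
        raise ValueError("signal length must be divisible by 2**depth")
    detail_energies = []
    for _ in range(depth):
        even, odd = approx[0::2], approx[1::2]
        detail = (even - odd) / np.sqrt(2.0)   # finer-level (detail) coefficients
        approx = (even + odd) / np.sqrt(2.0)   # coarser-level (smooth) coefficients
        detail_energies.append(np.sum(detail ** 2))
    # The loop produced the finest details first; reverse to coarsest-first.
    return np.array([np.sum(approx ** 2)] + detail_energies[::-1])
```

For a signal of length 4,096 and depth 6, the result has seven elements. Because the transform is orthonormal, the elements sum to the total energy of the signal, which provides a quick sanity check on any implementation.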

To use scalogram features, we first need to decide which type of wavelet basis function should be used. For a detailed description of the mathematical properties of wavelet basis functions, refer to [9]. The most widely used wavelet is the Daubechies basis function, whereas the Haar filter is best suited to representing step signals or piecewise constant signals. An adequate depth of decomposition for the DWT is also required to avoid the risk of eliminating important features and over-smoothing the signal. For a signal of length \(N\) in an off-line mode, or a window of length \(N\) in an on-line mode, empirical evidence suggests that the depth of decomposition should be half the maximum possible depth. According to [7], for a dyadic data length \(N=2^j\), the recommended depth of decomposition is \(j/2\) (i.e., depth of decomposition \(=\frac{\log _2 N}{2}\)).

In the next section, we introduce how to select the best features in wavelet scalogram in order to visualize the classes in low-dimensional space.

3.2 Silhouette statistics

Silhouette statistics were originally used to assess the quality of clustering by measuring how well an object is assigned to its corresponding cluster [23]. In this section, we extend this concept to the discriminant power function used in this paper. For signal pattern classification, assume that we are given a dataset \(Z=(\mathbf{X}_j,G(j))\), for \(j=1,\ldots ,n\), consisting of \(n\) sample signals with well-defined class labels. Here, \(\mathbf{X}_j=(x_{1j},x_{2j},\ldots ,x_{Ij})\) is the signal vector of the \(j\)th sample, described by \(I\) predictor variables, and \(G(j)\in G=\{G_1,G_2,\ldots ,G_k\}\) is the class label associated with \(\mathbf{X}_j\). Note also that \(k\) is the number of classes and \(n_k\) is the number of \(\mathbf{X}_j\) in \(G_k\). The discriminant power function based on silhouette statistics at the \(i\)th element of the signal vector is then defined as

$$\begin{aligned} H_{i}=\frac{1}{n} \sum ^{n}_{j=1} \frac{b_i(\mathbf{X}_j)-w_i(\mathbf{X}_j)}{\max \{w_i(\mathbf{X}_j),b_i(\mathbf{X}_j)\}}, \ \ \ i=1,2,\cdots , I \end{aligned}$$
(6)

where, for \(\mathbf{X}_j \in G_k,\, w_i(\mathbf{X}_j)=\frac{1}{n_k -1} \sum _{\mathbf{X}_{j'} \in G_k} D_i(\mathbf{X}_j, \mathbf{X}_{j'}),\, b_i(\mathbf{X}_{j})=\min _{s\ne k}\frac{1}{n_s} \sum _{\mathbf{X}_{j'} \in G_s} D_i(\mathbf{X}_{j}, \mathbf{X}_{j'})\) and \(D_i(\mathbf{X}_{j},\mathbf{X}_{j'})=(x_{ij}-x_{ij'})^2\). In other words, \(w_i(\mathbf{X}_{j})\) is the average distance between \(\mathbf{X}_{j}\) and all other sample signals in the same class with respect to the \(i\)th element of the sample signals (within-class distance), and \(b_i(\mathbf{X}_{j})\) is the minimum, over the other classes, of the average distance from \(\mathbf{X}_{j}\) to all sample signals in that class with respect to the \(i\)th element (between-class distance).

The discriminant power function \(H_{i}\), computed with respect to the \(i\)th element of the signal vector, returns a discriminant power score in the range from \(-\)1 to +1 and indicates how well all sample signals can be assigned to their own class in terms of that element. Intuitively, an element with a large silhouette value classifies the sample signals well, an element with a small silhouette value leaves the sample signals lying between classes, and an element with a negative value classifies them poorly. In this study, the scalogram of a signal is taken as \(\mathbf{X}_{j}\).
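A minimal sketch of this per-feature discriminant power is given below. It averages the per-sample silhouette values over all \(n\) samples for each feature, using the squared difference as the distance. The function name `discriminant_power` is hypothetical, and the between-class term is taken here as the mean distance to the members of each other class; every class is assumed to contain at least two samples.

```python
import numpy as np

def discriminant_power(X, labels):
    """Mean silhouette value of each feature (column) of X given class labels.

    Illustrative sketch: X is n samples x I features; each class must
    contain at least two samples.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, n_features = X.shape
    classes = np.unique(labels)
    H = np.zeros(n_features)
    for i in range(n_features):
        x = X[:, i]
        D = (x[:, None] - x[None, :]) ** 2       # pairwise squared differences
        scores = np.zeros(n)
        for j in range(n):
            same = labels == labels[j]
            # Within-class distance: the self-distance is zero, so divide by n_k - 1.
            w = D[j, same].sum() / (same.sum() - 1)
            # Between-class distance: smallest mean distance to any other class.
            b = min(D[j, labels == c].mean() for c in classes if c != labels[j])
            denom = max(w, b)
            scores[j] = (b - w) / denom if denom > 0 else 0.0
        H[i] = scores.mean()
    return H
```

Applied to a matrix whose rows are scalograms, the features with the largest returned values are the natural candidates for the axes of a low-dimensional cluster plot.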

3.3 Nonparametric classifiers

A classification problem with \(k\) classes and \(n\) training observations consists of a set of instances whose class membership is known. Let \(Z=\{ (\mathbf{X}_1, G(1)), (\mathbf{X}_2, G(2)), \ldots , (\mathbf{X}_n, G(n))\}\) be a set of \(n\) training samples where each instance \(\mathbf{X}_i\) belongs to a domain \(X\). Each label is an element of the set \(G=\{G_1, \ldots , G_k\}\). A multi-class classifier is a function \(f:X \rightarrow G\) that maps an instance \(\mathbf{X}\in X \subset \mathbb {R}^D\) into an element of \(G\). The task is to find a definition for the unknown function, \(f(\mathbf{X})\), given the set of training instances.

A primary method for the supervised classification of data uses the maximum likelihood decision rule, which is based on statistical theory. This type of classifier, called parametric, has been the most commonly applied classification technique because of its well-developed theoretical base and its successful application to different data types and classification schemes. With the parametric technique, the classifier must be trained with class signatures defined by a statistical summary (mean vector and covariance matrix). However, the maximum likelihood classifier has mathematical limitations and assumes certain properties of the sample data. The sample covariance matrix must be invertible, i.e., not singular. A singular covariance matrix may result if the sample is too homogeneous or if the sample size is too small compared to the dimensionality. The classifier also assumes that the sample data are normally distributed, a condition that is sometimes violated for certain classes. Another type of classifier, called nonparametric, makes no assumptions about the distribution of the data. It assigns samples to classes based on the sample’s position in a discretely partitioned feature space. Aside from its independence from the properties of the sample data, the nonparametric classifier can also have a performance advantage over the parametric classifier. Widely used nonparametric classifiers include the classification and regression tree (CART) and the \(k\)-nearest neighbors (\(k\)-NN) classifier. In this study, because no distributional assumptions can be made about the vibration signals, we use these nonparametric classifiers to evaluate the classification performance in several different domains, namely the original time, wavelet, and scalogram domains. A brief introduction to these classifiers is given below.

The classification and regression tree (CART) model is a nonparametric procedure for predicting a dependent variable from categorical and/or continuous predictor variables, in which the data are partitioned into nodes on the basis of conditional binary responses to questions involving the predictor variables. CART models use a binary tree to recursively partition the predictor space into subsets in which the distribution of the response variable \(y \in G\) is successively more homogeneous [4]. In this way, the CART procedure derives the conditional distribution of \(y\) given \(\mathbf{X}\). A decision tree with \(t\) terminal nodes is used for communicating the classification decision. For a thorough discussion of the CART model, readers are referred to [2, 4].

The \(k\)-nearest neighbors rule is another well-known and widely used nonparametric method for classification. The method consists of storing a set of prototypes that represent the knowledge of the problem. To classify a new instance \(\mathbf{X}\), the \(k\) prototypes nearest to \(\mathbf{X}\) (the \(k\)-nearest neighbors) are obtained, and \(\mathbf{X}\) is assigned to the most frequent class among these \(k\) neighbors. The \(k\)-NN method is used mainly because of its simplicity and its ability to achieve error rates comparable with those of much more complex methods [8]. The best choice of \(k\) depends on the data; generally, larger values of \(k\) reduce the effect of noise on the classification but make the boundaries between classes less distinct. A good \(k\) can be selected by various heuristic techniques, but this problem is outside the main scope of this paper. The special case in which the class is predicted to be the class of the closest training sample (i.e., \(k = 1\)) is called the nearest neighbor algorithm.
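The voting rule just described can be sketched as follows. This is a brute-force illustration under stated assumptions, not the classifier implementation used in the experiments: the function name `knn_predict` is hypothetical, distances are Euclidean, and ties in the vote are broken in favor of the smallest label.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=1):
    """Classify each row of X_new by majority vote among its k nearest prototypes."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    y_train = np.asarray(y_train)
    # Squared Euclidean distance between every new instance and every prototype.
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]      # indices of the k closest prototypes
    preds = []
    for row in nearest:
        values, counts = np.unique(y_train[row], return_counts=True)
        preds.append(values[np.argmax(counts)])  # most frequent class; ties -> smallest label
    return np.array(preds)
```

With `k=1` this reduces to the nearest neighbor algorithm. In the setting of this paper, the rows of `X_train` would be feature vectors in the time, wavelet, or scalogram domain.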

4 Illustrative example: vibration signal visualization and classification

This section presents an example of a bearing system and a mathematical approach for simulating vibration signals from both healthy and damaged states. A bearing system is one of the most critical components of rotating machinery because it physically connects the rotating and nonrotating parts of the machine while simultaneously isolating the interactions between them. It is well known that the presence of a defect in a bearing causes periodic spikes in the time-domain vibration signal. Here, we explain a realistic simulation method used to create datasets for validating the proposed condition monitoring algorithm.

4.1 Bearing system and vibration signal acquisition

The most widely used data for monitoring the condition of a bearing are vibration measurements obtained through accelerometers mounted on the surface of the outer frame or casing. Once the data have been collected, signal-processing techniques are applied to extract damage-sensitive features. Figure 1 depicts a schematic drawing of a bearing system with its data acquisition peripherals.

Fig. 1 Typical experimental setup for monitoring bearing vibration signal through accelerometer and data acquisition system

Typically, a bearing system consists of an inner-race, an outer-race, and balls or rollers, as shown in Fig. 2a. A bearing system experiences defects such as cracks, pits, and spalls between rolling components. As shown in Fig. 2b, if a local crack occurs on the surface of the inner- or outer-race of a bearing, it becomes a physical singularity or discontinuity from the perspective of the rolling balls. The rotating motion of the bearing then creates repeated impacts between the defect and its contact surface. These repeated impulses develop into stress waves that penetrate and vibrate the whole system, including the mounting frame. The frequency of the impacts is proportional to the rotating speed of the bearing. The stress wave is inherently identical to the impulse response of a dynamic system and can be monitored through accelerometers. The challenging part of condition monitoring is to separate the impulses caused by the bearing defect from the background vibrations normally induced by the rotating shaft and motor.

Fig. 2 Rotating direction of rolling elements of bearing system with an inner-race crack (a) and impulsive stress wave generated by repeated collisions of crack (b)

4.2 Simulation of bearing defect and dynamics

Once a crack occurs in a bearing system, its severity grows steadily and the level of vibration also increases due to repeated loading. Thus, developing a reliable condition monitoring technique for predictive maintenance requires repeatable simulation schemes that faithfully represent bearing vibrations with and without a defect. Instead of using experimental data, this study employs a state-space model for bearing defect simulation, in which we can conveniently change the intensity and the frequency of defect-induced spikes to validate the proposed algorithm. Previously, bearing simulation has been investigated by other researchers using sinusoidal terms with damping [5, 18]. Note that the frequency of the peaks can be calculated from the collision frequency of the rotating elements. The state-space model of the simulated bearing signal is expressed as a sixth-order dynamical system:

$$\begin{aligned} \dot{x}(t)&= Ax(t) + B u(t) + n z(t)\end{aligned}$$
(7)
$$\begin{aligned} A&= \begin{bmatrix} 0&\quad I \\ -M^{-1}K&\quad -M^{-1}\Theta \end{bmatrix}\end{aligned}$$
(8)
$$\begin{aligned} B&= \begin{bmatrix} 1\\ M^{-1} \end{bmatrix} \end{aligned}$$
(9)

A bearing system exhibits modes of broad spectrum. Thus, a state-space model representing the overall dynamics of the bearing system can be used for bearing simulation [14]. The repeating peaks caused by bearing defect are simulated through periodic impulses to the state-space system. Consider a three-degree-of-freedom, spring-mass-damper system:

$$\begin{aligned} M&= \begin{bmatrix} 10&\quad 0&\quad 0 \\ 0&\quad 10&\quad 0 \\ 0&\quad 0&\quad 10 \end{bmatrix}\end{aligned}$$
(10)
$$\begin{aligned} K&= \begin{bmatrix} 300&\quad -200&\quad 0 \\ -200&\quad 400&\quad -200 \\ 0&\quad -200&\quad 200 \end{bmatrix}\end{aligned}$$
(11)
$$\begin{aligned} \Theta&= \alpha M + \beta K \end{aligned}$$
(12)

Here, \(M,\, K\), and \(\Theta \) are the mass, stiffness, and proportional damping matrices, respectively. In Eq. (7), \(A\) represents the overall bearing system, the impact applied by the bearing defect is denoted by \(u(t)\), and \(n z(t)\) is the measurement noise. Considering the structural dynamics of the bearing system, we assume proportional damping with \(\alpha =0.0001\) and \(\beta =0.0004\), which produces an exponentially decaying impulse response. To quantify the level of damage severity (\(SL\)) in the bearing simulation, the amplitude of the impact input is scaled from 0 to 1, where \(SL=0\) means no defect and \(SL=1\) indicates the most severe state of damage. Figure 3 shows the time-history data generated from the bearing simulation for three different fault severities, i.e., \(SL=0\), 0.2, 0.4. The relative level between the amplitude of the bearing defect and the background noise \(n z(t)\) is denoted as SNR.
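To make the simulation scheme concrete, the following sketch builds the three-degree-of-freedom system from the matrices above and drives it with periodic impulses scaled by \(SL\). Several details are assumptions made only for this illustration and are not taken from the study: the function name `simulate_bearing`, the sampling interval (`dt`), the impulse period in samples, the choice of the first mass as the impact location, the velocity of the third mass as the measured output, Gaussian white noise added to the output, and an impulse that acts only on the velocity states (i.e., a zero block above \(M^{-1}\) in the input matrix). The state transition uses a fourth-order Taylor approximation of the matrix exponential.

```python
import numpy as np

def simulate_bearing(sl, n_steps=4096, dt=0.01, impulse_period=256,
                     noise_std=0.0, seed=0):
    """Simulate a 3-DOF spring-mass-damper response to periodic impulses.

    Illustrative sketch; dt, impulse_period, the input/output locations,
    and the noise model are assumptions, not values from the paper.
    """
    M = 10.0 * np.eye(3)
    K = np.array([[300.0, -200.0, 0.0],
                  [-200.0, 400.0, -200.0],
                  [0.0, -200.0, 200.0]])
    alpha, beta = 1e-4, 4e-4
    Theta = alpha * M + beta * K              # proportional damping matrix
    Minv = np.linalg.inv(M)

    # Continuous-time state matrix; state = [displacements, velocities].
    A = np.block([[np.zeros((3, 3)), np.eye(3)],
                  [-Minv @ K, -Minv @ Theta]])

    # Discrete transition matrix: 4th-order Taylor series of expm(A * dt).
    Adt = A * dt
    Ad = np.eye(6)
    term = np.eye(6)
    for order in range(1, 5):
        term = term @ Adt / order
        Ad = Ad + term

    # Assumption: an ideal impulse of size `sl` on mass 1 changes its velocity by sl / m.
    kick = np.zeros(6)
    kick[3] = sl * Minv[0, 0]

    rng = np.random.default_rng(seed)
    x = np.zeros(6)
    y = np.empty(n_steps)
    for t in range(n_steps):
        if t % impulse_period == 0:
            x = x + kick                      # defect-induced impact
        y[t] = x[5]                           # assumption: observe velocity of mass 3
        x = Ad @ x                            # free response until the next sample
    return y + noise_std * rng.standard_normal(n_steps)
```

Calling `simulate_bearing(0.0)`, `simulate_bearing(0.2)`, and `simulate_bearing(0.4)` with a nonzero `noise_std` gives signals of the three severity levels. Because the model is linear, the noise-free response scales in proportion to \(SL\).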

Fig. 3 Vibration signals from several different conditions

Figure 3 shows a sample set of vibration signals of length \(N=4{,}096\) from several different conditions, generated by the simulation described in this section. Each column of Fig. 3 corresponds to a different crack severity level (\(SL\)) of the rolling-element bearing, and each row to a different signal-to-noise ratio (SNR) of the measurement error. For each combination of crack severity level and SNR, 100 signals were generated, and a single sample signal is shown in each panel.

5 Visualization and classification results and discussions

In this section, we compare the visualization results of PCA-based and wavelet scalogram-based signal discrimination. We also demonstrate the performance of our wavelet scalogram approach on the machine fault classification problem in terms of the 10-fold misclassification error rate.

5.1 Visualization of PCA-based signal discrimination

PCA performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the correlation matrix of the data is constructed and the eigenvectors of this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the PCs) can then be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system. The original space is thereby reduced (with data loss, but hopefully retaining the most important variance) to the space spanned by a few eigenvectors. Figure 4 shows the PCA-based cluster visualization of the three fault-type signals in Fig. 3. The symbols in Fig. 4 represent different fault-types as follows: (\(\cdot \)) for fault-type 0 (i.e., healthy condition with \(SL=0\)), (\(+\)) for fault-type 1 (i.e., \(SL=0.2\)), and (\(\square \)) for fault-type 2 (i.e., \(SL=0.4\)). For more details about \(SL\), refer to Sect. 4.2. Each row of Fig. 4 shows the PCA results for a different SNR case and each column for a different data domain (time on the left and wavelet on the right).

Fig. 4 PCA-based cluster visualization of three fault-type signals in time (left) and wavelet (right) domains; the distinction among three fault-types (i.e., ’dot’, ’square’ and ’plus sign’) is not clear in time domain (left panels). In the case of wavelet domain (right panels), three fault classes are a lot better separated, but the axial location of each class is not consistent in noise environments with different SNR

The distinction among the three fault-types (i.e., \(SL=0, 0.2, 0.4\)) is not clear in the time domain (all left panels of Fig. 4). In the wavelet domain, the three fault classes are much better separated, but the axial location of each class is not consistent across noise environments with different SNR. Thus, it is difficult to interpret the physical meaning of the first two PCs, which characterize the axial location of the classes in the two-dimensional space.

5.2 Visualization of wavelet scalogram-based signal discrimination

To obtain a wavelet scalogram vector, we first need to decide which type of wavelet basis function should be used. Several types of wavelet basis functions are available in the literature, including the Haar, Daubechies, coiflet, symlet, and bi-orthogonal wavelets. Desirable properties of a basis function include good time-frequency localization, various degrees of smoothness, and a large number of vanishing moments. For a detailed description of the mathematical properties of wavelet basis functions, refer to [9]. The most widely used wavelet is the Daubechies basis function [7]. Thus, in this study, we used the Daubechies wavelet (db5) with a decomposition depth of 6 (i.e., \(N=4{,}096\), \(\frac{\log _2 N}{2}=6\)).

Figure 5 shows the results of a boxplot analysis based on the wavelet scalogram at different scales for three different fault-types of bearing signals with \(N=4{,}096\) and decomposition level 6 (i.e., \(S=( s(1), s(2), \ldots , s(7))\)) for the vibration signals in Fig. 3a, b, and c (i.e., 100 signals for each fault-type). In this case, the comparatively high-frequency scales \(s(5),\, s(6)\), and \(s(7)\) are less sensitive to the existence of faults in the bearing, while the approximation level \(s(1)\) and the low scales \(s(2),\, s(3)\), and \(s(4)\) are quite sensitive.

Fig. 5

Boxplot analysis based on wavelet scalogram at different scales; comparatively high-frequency scales such as \(s(5),\, s(6),\, s(7)\) are less sensitive to the existence of faults while the approximation level \(s(1)\) and those low scales such as \(s(2),\, s(3)\) and \(s(4)\) are quite sensitive

Following the silhouette statistics introduced in Sect. 3.2, we use \(H_i\) to choose a few important features of the wavelet scalogram for further cluster visualization. In Fig. 6, we plot the mean silhouette statistic \(H_i\) (\(i=1,\ldots ,7\)) for each element of the wavelet scalogram. The largest silhouette statistic \(H_i\) is from \(s(2)\), the second largest from \(s(3)\), and so on.
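The feature ranking described above can be sketched as follows. The exact definition of \(H_i\) is given in Sect. 3.2 and is not reproduced here, so this sketch assumes the standard silhouette width \((b-a)/\max(a,b)\), computed separately on each scalogram element with absolute difference as the distance and then averaged over all signals. The names `mean_silhouette_1d` and `rank_features` are hypothetical.

```python
def mean_silhouette_1d(values, labels):
    """Mean silhouette width of one scalar feature under known class labels.

    Assumption: standard silhouette (b - a) / max(a, b), where a is the mean
    distance to the other members of the same class and b is the smallest
    mean distance to any other class.
    """
    n = len(values)
    classes = sorted(set(labels))
    total = 0.0
    for i in range(n):
        mean_dist = {}
        for c in classes:
            d = [abs(values[i] - values[j])
                 for j in range(n) if labels[j] == c and j != i]
            if d:
                mean_dist[c] = sum(d) / len(d)
        a = mean_dist.get(labels[i])
        others = [v for c, v in mean_dist.items() if c != labels[i]]
        if a is None or not others:
            s = 0.0  # convention for a singleton class or a single class
        else:
            b = min(others)
            denom = max(a, b)
            s = (b - a) / denom if denom > 0 else 0.0
        total += s
    return total / n


def rank_features(X, labels):
    """Return (feature indices sorted by decreasing H, list of H values).

    X is a list of feature vectors, e.g. one 7-element scalogram per signal.
    """
    n_features = len(X[0])
    H = [mean_silhouette_1d([row[f] for row in X], labels)
         for f in range(n_features)]
    order = sorted(range(n_features), key=lambda f: H[f], reverse=True)
    return order, H
```

Taking the first two entries of `order` would give the pair of scalogram elements to plot against each other in a two-dimensional cluster view.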

Fig. 6

Mean silhouette statistics for elements of the wavelet scalogram; the largest silhouette statistic, indicating the most effective feature for cluster visualization, is from \(s(2)\), the second largest from \(s(3)\), and so on

Figure 7 shows the results of a cluster visualization based on the wavelet scalogram for the vibration signals with different SNR levels in Fig. 3. The two elements of the wavelet scalogram with the largest and second largest silhouette statistics were used as features for cluster visualization. The figure shows an approximately linear relationship between crack severity level and the amount of frequency-wise energy, and this relationship is robust against noise level. Although one pair of classes (’\(\cdot \)’ and ’\(+\)’) is somewhat harder to separate, the plots indicate that a larger crack severity level corresponds to a larger mean and variance of frequency-wise energies such as \(s(2)\) and \(s(3)\). This interpretation of vibration signals in a very low-dimensional space cannot be drawn from the PCA-based clustering results. This example illustrates the potential of the wavelet scalogram for signal classification.

Fig. 7

Cluster visualization based on wavelet scalogram for vibration signals with different SNR levels; these figures show that there is a certain linear pattern of the relationship between crack severity levels and the amount of frequency-wise energy, which is robust against noise levels

5.3 Classification performance using scalogram

In this section, nonparametric multi-class classifiers, namely classification and regression tree (CART) and k-nearest neighbors (kNN), are used to quantitatively evaluate the performance of our approach for the machine fault classification problem in terms of the 10-fold misclassification error rate. In 10-fold cross-validation, the original simulation sample is partitioned into 10 subsamples. Of the 10 subsamples, a single subsample is retained as the validation data for testing the classification model, and the remaining 9 subsamples are used as training data. The cross-validation process is then repeated 10 times (the folds), with each of the 10 subsamples used exactly once as the validation data. The 10 results from the folds are then averaged to produce a single estimate of the misclassification error rate. The misclassification error rate estimates for CART and kNN in the different domains (time, wavelet, and scalogram) are shown in Tables 1 and 2, respectively. All results were obtained using an Intel(R) Core(TM) i5-3570K CPU running at 3.40 GHz with 8 GB RAM and the Windows 7 Ultimate K 64-bit operating system.
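The cross-validation procedure described above can be sketched with a plain kNN classifier. This is an illustrative sketch rather than the implementation used for the reported results: the fold assignment here is a simple shuffled, unstratified split, the distance is Euclidean, and the names `knn_predict` and `kfold_error` are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=7):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    scored = sorted(
        ((sum((a - b) ** 2 for a, b in zip(row, x)), label)
         for row, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


def kfold_error(X, y, k_neighbors=7, n_folds=10, seed=0):
    """Average misclassification rate over n_folds cross-validation folds."""
    if len(X) < n_folds:
        raise ValueError("need at least one sample per fold")
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)  # assumption: unstratified random folds
    folds = [idx[f::n_folds] for f in range(n_folds)]
    fold_errors = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        wrong = sum(
            knn_predict(train_X, train_y, X[i], k_neighbors) != y[i]
            for i in test
        )
        fold_errors.append(wrong / len(test))
    return sum(fold_errors) / n_folds
```

Passing the 7-element scalogram vectors as `X` instead of the 4,096-point time-domain signals shortens every distance computation, which is consistent with the lower computation time reported for the scalogram domain.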

Table 1 Misclassification rate and average computation time elapsed for CART
Table 2 Misclassification rate and average computation time elapsed for kNN (\(k=7\))

5.3.1 CART

Table 1 shows that CART in the wavelet and scalogram domains produced more accurate classification than in the time domain. Table 1 also compares the elapsed computation time of CART across the domains; CART in the scalogram domain runs faster than in the other domains. Thus, for CART, the efficiency of the proposed scalogram domain is supported by both the computation time and the misclassification error rate. Table 3 in the Appendix gives more detailed confusion matrix results.

5.3.2 k-NN

Table 2 summarizes the classification results for \(k=7\). The \(k\)-NN classifier (\(k=7\)) in the scalogram domain produced much more accurate classification than in the time and wavelet domains, and it also ran much faster. Thus, for the \(k\)-NN classifier, the scalogram domain outperforms the time and wavelet domains in terms of both computation time and misclassification error rate. More detailed confusion matrix results for various choices of \(k\) (\(k=1, 3, 5, 7\)) are shown in Tables 4, 5, 6, and 7 in the Appendix.

6 Conclusion and future research

This study investigates a methodology for visualizing and classifying high-dimensional data such as vibration signals in machine fault detection applications. It exploits the multiscale energy analysis of the discrete wavelet transformation (DWT), the so-called wavelet scalogram, in an unsupervised manner. The wavelet scalogram allows us to first obtain a very low-dimensional feature subset of the data, which is strongly correlated with the characteristics of the data. In this paper, the characteristics of the data are the scale-wise energies in a signal. In a supervised learning scheme, the scalogram can then be combined with silhouette statistics for more effective visualization of the main sources of difference between classes and for classifying signals into different classes. Finally, nonparametric multi-class classifiers, namely classification and regression tree (CART) and k-nearest neighbors (kNN), are used to quantitatively evaluate the performance of our approach for the machine fault classification problem in terms of the 10-fold misclassification error rate.

It should be pointed out that the procedure developed has only been verified on relatively simple laboratory test specimens. To verify that the proposed method is truly robust, it will be necessary to examine many time records corresponding to a wide range of operational and environmental cases, a wide range of healthy and faulty structures, as well as different fault scenarios.