Introduction

Ball bearings are essential elements of rotating machines. They enable rotational or linear motion while minimizing friction and supporting dynamic mechanical loads. Defects and faults inside these bearings may lead to catastrophic failures. Some bearing defects appear gradually over time, while others occur suddenly with little warning. Early identification and diagnosis of defects in ball bearings can therefore increase machinery uptime and form part of a preventive maintenance strategy. For example, Frosini et al. reported that 40–50% of induction motor failures in industry are caused by bearing damage [1]. Consequently, condition monitoring (CM) of ball bearings plays a significant role in early fault detection and is considered an integral part of any preventive maintenance plan.

The persistent rotational movement of bearings during extended periods of operation causes friction, elevated temperatures, and increased vibrations, leading to localized and distributed primary defect categories. Localized defects are defined as single-point faults, such as cracks, pits, spalls, and even small particles in the lubrication fluid [2]. These defects are usually identified according to their locations (on the outer race, inner race, or the rolling element itself) [3]. The second type of bearing faults are called extended or distributed defects [4,5,6]. They refer to imperfections that are spread out over the surface of any of the bearing components. These defects, usually related to manufacturing or installation mistakes, can include surface roughness and waviness [6]. They can also arise due to the progression of minor localized defects. Surface defects (localized or distributed) generate undesirable frequencies during the rotational motion of their supporting mechanisms and can also excite them at one of their resonances. However, they are usually hard to diagnose if represented only in the frequency domain. The presence of multiple simultaneous defects at various locations in the bearing may bring more complications when interrogating the spectrum. Therefore, investigating and studying these faults and how to detect them is especially important.

Different methods have been developed to detect and diagnose defects in bearings. Among these techniques are vibration analysis, acoustic emission (AE) monitoring, and motor current analysis [7, 8]. Bearings tend to generate vibration and noise when defects are present, because the defects act as obstacles that impede the smooth motion of the bearing balls. However, even a bearing with perfect geometry generates vibration due to continuous changes in its total stiffness. This change in stiffness is due to the finite number of rolling elements that carry the total load [9]. Singh et al. [10] stated that the vibration signals generated by defects are caused by the restressing of the rolling elements. Consequently, when the ball hits the end of a defect, the impulse generates vibration signals and enlarges the defect. In contrast, other explanations attribute the short force impulse to the compression of the ball between the inner and outer races [11, 12]. In either case, the spectrum of a defect-free ball bearing differs from that of a ball bearing with one or more defects.

Different attempts were made to develop numerical or mathematical models that could predict the dynamic behavior of damaged bearings. However, these models always require validation compared to experimental results. Therefore, damaged bearings with specific types and sizes of damage are always needed. Using manufacturing techniques, such as Electric Discharge Machining (EDM) or punching, the damage could be intentionally seeded on the inner or outer raceways of bearings. The defective bearings would then be mounted on an experimental test rig to study their responses. One of the fault-testing approaches is run-to-failure, which consists of running the bearing under abnormal conditions, such as over-loading, over-speeding, or poor lubrication, until defects occur [9]. Vibration data is periodically or continuously recorded to study the time evolution of the vibration signal. Another approach is to seed several defects on different bearings [13] and test them separately under the same operating conditions to compare their readings with signals extracted from healthy bearings. The latter approach is adopted in this study. Subsequently, the vibration analysis could be done in the time domain, the frequency domain, or the time–frequency domain [9, 14, 15].

Recently, machine learning (ML) was also adopted in both the diagnosis and prognosis of bearings' faults. ML could be based on supervised or unsupervised learning algorithms. Supervised learning (SL) is the process in which the machine will learn with the help of given data containing features and labeled data. Features are the independent parameters, while dependent parameters are called labels. In contrast, unsupervised learning (UL) is about labeling unlabeled data using algorithms and then running processes to produce analyzed data. ML models usually use a considerable percentage of data to train the system and then use the rest to test the prediction accuracy before conducting the prediction on foreign data [16]. Multiple ML techniques have been employed in the last two decades to detect defects in rolling element bearings, using distinct test data parameters to train the algorithms.

Artificial neural networks (ANN) are usually adopted when dealing with complicated problems with many trainable parameters. Several types of ANN have been adopted in bearing troubleshooting and condition-based maintenance (CBM), such as convolutional neural networks (CNN) and recurrent neural networks (RNN). Eren [17] presented a one-dimensional CNN model to monitor bearing health using a single-learning body model and achieved 97% fault detection accuracy. Hoang and Kang [18] used a novel CNN model that transforms 1D signals into 2D ones and approached 100% accuracy in defect detection using the Case Western Reserve University (CWRU) public bearing data set. RNN is another branch of ANN that processes data recurrently rather than in a purely feed-forward manner, allowing outputs to be fed back as inputs while maintaining a hidden state in the hidden layers [19, 20]. However, one disadvantage of RNN is the possible exploding and vanishing of gradients in time series during backpropagation [21]. Long short-term memory (LSTM) can be used to address this. LSTM is a recurrent neural network block designed to mitigate the vanishing gradient problem [22], and it is suitable for processing and predicting data with gaps and delays in a time series based on complex historical fault data. Liu et al. [23] proposed a Gated Recurrent Unit-based denoising autoencoder, which outperformed other classifiers with more than 99.5% diagnosis accuracy. Several works have adopted the K-Nearest Neighbors classifier [24,25,26] for bearing fault detection as an alternative to ANN classifiers. Moreover, the Multi-Layer Perceptron (MLP), the classical feed-forward neural network, is regularly used in CBM of bearings [27,28,29]. Many other ML approaches have been implemented in the CBM of bearings, such as Neural Fuzzy Networks [30], Generative Adversarial Networks [31], and the Naive Bayes classifier [32]. Regression techniques were also adopted in different bearing CBM models [33,34,35].

On the other hand, Tree Ensemble (TE) methods such as Decision Trees (DT), Extra Tree Classifier (ETC), Random Forest (RF), and Gradient Boost (GBoost) are increasingly adopted by researchers in classification and regression problems of bearing fault detection [36,37,38] due to their low computational demands. In addition, TE classification of defects has been enhanced by combining it with other algorithms such as Support Vector Machine (SVM) [39], fuzzy classifiers [40], envelope signal-based feature extraction [41], and the 2D discrete wavelet transform [42, 43]. For example, Patil and Phalle [44] found that adopting ETC to classify bearing faults achieved an accuracy of 98.12%. Moreover, Nistane and Harsha [45] used ETC supported by a stationary wavelet transform algorithm and compared its performance with RF and MLP regression. RF is another algorithm that combines the predictions of multiple decision trees into a single model, making it robust for regression and classification tasks; however, the training time increases in this case. The main difference between RF and ETC is that RF searches for the best split at each node, while ETC draws split points at random, which tends to reduce variance and training time. Some researchers proposed ML models using RF [40, 46]. In contrast, others [47] combined a refined composite multi-scale reverse dispersion entropy technique with an RF classifier, achieving 100% maximum prediction accuracy and a 97.3% average classification accuracy.

In addition to ETC and RF, boosting is a general ensemble method that creates a robust classifier from several weak classifiers. One of the boosting techniques is Gradient Boosting (GBoost), which builds DT-based models sequentially so that each new model reduces the prediction error left by the previous ones. It has gained much attention in ML due to its efficiency in making predictions and its ability to handle large and sparse data. However, Extreme Gradient Boosting (XGBoost) has a computational advantage over GBoost, whose training progresses slowly. XGBoost is an SL algorithm used to solve classification and regression problems [48] and is commonly used in bearing fault diagnosis [46, 49, 50]. Qi et al. [51] included different classifiers in their models and achieved 90.42%, 95.76%, and 97.21% prediction accuracies when using DT, XGBoost, and Weighted Extreme Gradient Boosting (WXGB), respectively. Xia et al. [52] later adopted XGBoost in their federated learning model using a privacy-preserving technique. Some works have combined AdaBoost and EMD [53, 54], while in [55], researchers used a DT classifier followed by AdaBoost and compared it with SVM, achieving 96% and 92% maximum testing accuracy, respectively. One more classifier is the Light Gradient-Boosting Machine (LightGBM), which was implemented in bearing fault detection in [56,57,58] and, for the same purpose but supported by CNN, in [59,60,61,62].

In predictive maintenance and condition monitoring, machine learning techniques have gained significant traction for their ability to extract valuable insights from complex industrial data. Recent advancements in tree-based ensemble methods, such as Decision Trees, Random Forests, and XGBoost, have demonstrated remarkable performance in diagnosing faults in rolling element bearings using vibration signals [63]. Complementing these approaches, deep learning architectures have emerged as powerful tools for time series forecasting, potentially enabling proactive maintenance strategies [64]. However, industrial data's growing complexity and scale pose challenges regarding computational efficiency and privacy concerns. In this regard, distributed frameworks for training XGBoost models have been proposed, leveraging parallel computing to handle large-scale datasets [65]. Additionally, federated learning techniques have been explored for collaborative bearing fault diagnosis, enabling privacy-preserving data sharing across multiple parties [66]. Recognizing the diversity of machine learning approaches, comprehensive reviews have been conducted to evaluate traditional and deep learning methods for fault diagnosis in rotating machinery [67]. These studies provide valuable insights into the strengths and limitations of various techniques, guiding practitioners in selecting appropriate methods for their specific application scenarios. Beyond the realm of rotating machinery, machine learning has also found applications in monitoring and diagnosing faults in actuators, which are critical components of many industrial systems [68]. Integrating intelligent algorithms into actuator systems can enhance their reliability and performance, further underscoring the pervasive impact of machine learning in industrial operations. While significant progress has been made, implementing machine learning algorithms in real-world industrial environments remains challenging [69]. 
Data quality, sensor reliability, and system complexity must be carefully considered to ensure accurate and robust condition monitoring solutions.

The models developed in the literature using ensemble methods and ANN algorithms were assessed based only on overall accuracy. Moreover, those models usually classify using only part of the feature set while ignoring features with lower contribution scores. A generalized multi-parameter bearing fault classification model, however, requires a mapping between every signal feature and the target defect. Therefore, this paper proposes and compares three tree ensemble machine learning models (Decision Tree, Random Forest, and XGBoost) for diagnosing and prognosing faults in rolling element bearings using vibration data. It uses 17 time-domain statistical features extracted from the vibration signals as inputs to the machine learning models. A thorough feature importance analysis is conducted to understand which vibration signal features contribute most significantly to the performance of each machine learning model in detecting bearing faults and their locations.

Experimental Testing, Results, and Data Setup

This study proposes and compares three tree-ensemble ML models that can diagnose and prognose faults of rolling element bearings. Each of the three models can detect the existence of a defect and its location, whether on the outer or inner ring. Moreover, data from four additional bearings were experimentally collected on the same testing machine used for QU-DMBF [62] and were used to train and then test the system. The introduced models support CBM decisions for rotating equipment and assess the importance of the time-domain signal parameters used to identify a defective bearing. In this paper, we examine each model and the features it deems essential to make a decision, referred to as feature importance. This aspect is often overlooked: prior studies have gravitated toward benchmarking and accuracy scores rather than reporting how individual features contribute to model performance in relation to the problem domain. While acknowledging accuracy as a vital component, this study explores the importance of features in model performance and in the problem domain.

This paper utilizes benchmark experimental data to measure vibrations and validate defect diagnosis and prognosis functionalities using multiple time domain parameters. This section introduces the experimental test rigs and the set of bearings employed to gather vibrational data, along with incorporating these bearings into the existing QU-DMBF dataset. Subsequently, the time domain parameters essential for bearing fault analysis will be elaborated upon, detailing all the time domain parameters employed. The section concludes by discussing the data setup and model training criteria.

Qatar University Dual-Machine Bearing Fault Benchmark (QU-DMBF) Dataset

Figure 1 shows the test rig used in this research. This rig was originally a Machinery Fault Simulator from SpectraQuest Inc., USA. The rig includes a DC motor (0.5 HP, 90 VDC, 5 A) with a maximum rotational speed of 2500 RPM. The original shaft of the machine was removed and replaced by a new, larger one to accommodate the tested bearings (type NSK-6208). Other supporting mechanical components were also redesigned for strength and stability. The machine has an approximate weight of 5 kg and overall dimensions of (100 \(\times\) 63 \(\times\) 53) cm.

Fig. 1 Defect insertions: (a) on the outer ring; (b) on the inner ring

Initially, 19 different bearing configurations were considered in the investigation: one healthy, nine with a defect on the outer ring, and nine with a defect on the inner ring. The defect sizes vary between 0.35 mm and 2.35 mm. However, three more healthy bearings were added to the QU-DMBF published data set to balance healthy to faulty samples. One extra bearing with a 0.5 mm defect located on the outer race was exclusively used for testing. Hence, 23 bearings in total (same size and same brand) were employed in generating the dataset.

Six ICP® accelerometers (PCB Piezotronics, Model No. 352C33) were used to extract vibrational signals. These accelerometers were fixed on the motor and in different locations on the machine's mounting base, and they were also fixed in various orientations and distances from the bearing rotation point at the end of the shaft (Fig. 2).

Fig. 2 Experimental testing machine where numbers 1–6 refer to accelerometers’ locations

Two four-channel NI-9234 sound and vibration input modules acquired the signals at a sampling frequency of 4.096 kHz. For each bearing, signals were recorded for 30 s at five different speeds: 240, 360, 480, 720, and 1020 RPM. Hence, the total recording time of faulty bearings was 16,200 s (18 bearings \(\times\) 6 accelerometers \(\times\) 5 speeds \(\times\) 30 s). The recording time for each healthy bearing was increased to 270 s to ensure proper balance in the training data, giving a total of 32,400 s (4 bearings \(\times\) 6 accelerometers \(\times\) 5 speeds \(\times\) 270 s) for the healthy bearings.
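The recording totals quoted above can be checked with a few lines of arithmetic. The sketch below only restates the counts given in the text; the function and variable names are illustrative and are not part of the original acquisition software.

```python
# Recording-time bookkeeping using the counts quoted in the text.
# All names here are illustrative assumptions.
ACCELEROMETERS = 6
SPEEDS_RPM = [240, 360, 480, 720, 1020]
SAMPLING_RATE_HZ = 4096  # 4.096 kHz


def total_recording_seconds(n_bearings, seconds_per_run):
    """Signal duration summed over bearings, accelerometers, and speeds."""
    return n_bearings * ACCELEROMETERS * len(SPEEDS_RPM) * seconds_per_run


faulty_seconds = total_recording_seconds(n_bearings=18, seconds_per_run=30)
healthy_seconds = total_recording_seconds(n_bearings=4, seconds_per_run=270)

# Number of samples in one 30 s record from a single accelerometer.
samples_per_faulty_run = SAMPLING_RATE_HZ * 30

print(faulty_seconds, healthy_seconds, samples_per_faulty_run)
```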

Time Domain Analysis

Vibration responses due to defective bearings could be studied using different approaches, such as the time domain, frequency domain, or time–frequency domain [9, 15]. Several statistical parameters could be extracted from the time domain signals. These indicators could be training parameters in fault detection models for bearings CBM. Different mathematical operators can impact the statistical characteristics of the signals, and a change in the mathematical operator can either increase or decrease the informational value of the signals. For example, the Root Mean Square (RMS), also known as the quadratic mean or the square root of the arithmetic mean of the squares of the values, is related to the vibration energy of a signal in the time domain [71]. Furthermore, the Crest Factor (CRSF) is the dimensionless waveform metric that displays the ratio of the peak amplitude divided by the RMS value of the signal. It usually indicates how extreme the peaks are in a waveform. Signal spikiness affects how sensitive the crest factor is. Its sensitivity to signal spikiness can provide an early indication of significant changes in vibration readings. The CRSF increases when the signal has distinct spikes or peaks. A waveform containing random or periodic spikes scattered throughout the signal would have a greater CRSF than a pure sinusoidal wave. Researchers observed that the CRSF is more sensitive than skewness to the effects of radial load on bearing vibration.
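As a rough illustration of the two indicators described above, the following sketch computes the RMS and the crest factor of a pure sine wave and of the same wave with one added spike. It assumes the peak is taken as the maximum absolute amplitude; the exact definitions used in this work are those of Table 1 and may differ in detail. The test signals are synthetic and are not taken from the QU-DMBF data.

```python
import math


def rms(signal):
    """Root mean square: square root of the mean of the squared samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def crest_factor(signal):
    """Peak absolute amplitude divided by the RMS value (assumed definition)."""
    return max(abs(x) for x in signal) / rms(signal)


# Synthetic example: five full cycles of a unit sine over 1000 samples.
n = 1000
sine = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]

# The same wave with a single artificial spike, mimicking an impact.
spiky = list(sine)
spiky[250] = 5.0

print(crest_factor(sine))   # close to sqrt(2) for a pure sine
print(crest_factor(spiky))  # noticeably larger because of the spike
```

The single spike barely changes the RMS but multiplies the peak, which is why the crest factor reacts strongly to isolated impacts.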

One more parameter is the Kurtosis (KU), which is a measure of the "tailedness" of the probability distribution of a random variable. In probability theory and statistics, KU is defined as the fourth central moment of a time series signal normalized by the square of its variance. It is more sensitive to impacts and degradation than CRSF and is reported to be the most sensitive parameter for identifying faults in rolling element bearings [72, 73]. A KU of about 3 usually represents the normal behavior of a healthy machine, whereas a KU higher than 3 indicates advanced states of fault progression (a distribution with heavier tails than a normal one) [74]. KU is a valuable instrument for keeping track of the health of rotating machinery, including gears, bearings, and other components.
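A minimal sketch of the kurtosis definition above (fourth central moment divided by the squared variance) is given below. It uses the population form of the moments, which is an assumption; the formula listed in Table 1 may use a different normalization. The Gaussian and impulsive signals are synthetic stand-ins for healthy and faulty vibration records.

```python
import random


def kurtosis(signal):
    """Fourth central moment divided by the squared variance (population form)."""
    n = len(signal)
    mean = sum(signal) / n
    m2 = sum((x - mean) ** 2 for x in signal) / n
    m4 = sum((x - mean) ** 4 for x in signal) / n
    return m4 / (m2 ** 2)


# Synthetic "healthy" signal: Gaussian noise, whose kurtosis is close to 3.
rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(50000)]

# Synthetic "faulty" signal: the same noise with a periodic impulse added.
impulsive = [x + (10.0 if i % 500 == 0 else 0.0) for i, x in enumerate(gaussian)]

print(kurtosis(gaussian))
print(kurtosis(impulsive))
```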

Time domain indicators have proved their effectiveness in training ML models to detect bearing failures [75]. Different time domain indicators were adopted during the last two decades to train ML models, such as intrinsic mode functions (IMFs) [76] and Zero Crossing features (ZC) [77]. Many of these time domain parameters were obtained from the data and implemented in the proposed model. Of all time domain indicators in literature [78,79,80,81], seventeen (17) were used in training and testing the current work's model. Their formulas and abbreviations are displayed in Table 1.

Table 1 Time domain indicators used to train and test the proposed model

Experimental Data Setup

Bearing fault detection is considered a classification problem. The objective is to determine the bearing's health and the location of the defect if the bearing is defective. Hence, each class was given a label, as shown in Table 2. To develop the ML model, 80% of the dataset was used for training, while the remaining 20% was used for testing. As explained in “Qatar University Dual-Machine Bearing Fault Benchmark (QU-DMBF) Dataset” section, the dataset was stratified to ensure a balanced presence of examples for each class label.

Table 2 Defect classes and corresponding labels

Machine Learning Models

In this section, we benchmark the dataset obtained from the experiments using three ML models: Decision Tree (DT), Random Forest (RF), and extreme GradientBoosting (XGBoost). These three models use the fundamentals of the DT algorithm. Thus, a feature importance analysis will be conducted, which provides insights into how influential a particular input is in the predictions of each model.

Methodology

Before presenting the models, several assumptions need to be stated. First, the data is assumed to follow a particular distribution, which can affect the models' performance. Additionally, the models assume that the features used for training are independent of each other, which might not always hold in real-world scenarios. The assumption of balanced classes is also crucial for the models to learn effectively from the data. Furthermore, the models assume that the selected features are relevant and contribute significantly to the prediction task.

Moreover, it is worth remembering that the dataset consists of vibration signals obtained from accelerometer readings on bearings with different defect conditions (healthy, inner race defect, outer race defect) collected at Qatar University, with details on the experimental setup and data collection procedure provided. The study used time-domain statistical features extracted from the vibration signals as input features for the machine learning models, employing 17 different time-domain features. The models were treated as classification problems to predict the bearing condition (healthy, inner race defect, outer race defect) based on the input features. The dataset was split into training (80%) and testing (20%) sets in a stratified manner to ensure class balance.
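The stratified 80/20 split mentioned above can be sketched as follows. This is a plain-Python illustration of the idea, not the splitting routine actually used in the study; the integer class labels (0, 1, 2) and the class sizes are hypothetical placeholders.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps the same train/test proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 0 = healthy, 1 = outer-race defect, 2 = inner-race defect.
labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, test_idx = stratified_split(labels)
test_counts = {c: sum(1 for i in test_idx if labels[i] == c) for c in (0, 1, 2)}
print(len(train_idx), len(test_idx), test_counts)
```

Splitting per class rather than over the pooled samples is what keeps the class proportions identical in the training and testing sets.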

The metrics used to validate and evaluate the models' performance are precision, recall, and F1-score. Precision is the fraction of predicted positives that are true positives. Recall is the fraction of actual positives (true positives plus false negatives) that the model correctly identifies. Finally, the F1-score combines precision and recall into a single measure that accounts for both false positives and false negatives.
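The three metrics can be written out directly from counts of true positives, false positives, and false negatives. The sketch below evaluates them for one class at a time (one-versus-rest); the class names and the small prediction lists are invented for illustration and are not results from this study.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented example labels, for illustration only.
y_true = ["healthy", "healthy", "inner", "inner", "outer", "outer"]
y_pred = ["healthy", "inner", "inner", "inner", "outer", "healthy"]

for name in ("healthy", "inner", "outer"):
    print(name, per_class_metrics(y_true, y_pred, name))
```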

Figure 3 illustrates the proposed approach for implementing the machine learning models. It comprises three parts. The first part is data preparation: categorical variables are encoded into a numerical format, Min–Max normalization is applied so that all features share the same scale, and a stratified split provides a consistent target class distribution in the training and testing sets. This division guards against data imbalance and potential model bias stemming from the under- or overrepresentation of specific target classes. The second part is model implementation, where the grid search method is applied to fine-tune the model's hyperparameters through an iterative process of building and testing. The process starts by defining a grid of candidate values for the model's hyperparameters; each combination in the grid is then evaluated. The grid search reports the hyperparameter values that achieve the best accuracy on the validation set. The third and final part applies the tuned model to the entire dataset and reports the resulting accuracy.
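Two of the steps described above, Min–Max normalization and exhaustive grid search, are simple enough to sketch in plain Python. The scoring function below is a toy stand-in for validation accuracy, and the hyperparameter names and candidate values are hypothetical; in the actual workflow each combination would be scored by training and validating a model.

```python
from itertools import product


def min_max_scale(column):
    """Rescale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant feature: nothing to scale
    return [(x - lo) / (hi - lo) for x in column]


def grid_search(param_grid, score_fn):
    """Evaluate every combination in the grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Toy stand-in for validation accuracy; its peak location is arbitrary."""
    return -abs(params["max_depth"] - 6) - abs(params["min_samples_leaf"] - 3)


# Hypothetical grid, for illustration only.
grid = {"max_depth": [3, 6, 9, 12], "min_samples_leaf": [1, 3, 5]}
best_params, best_score = grid_search(grid, toy_score)
print(min_max_scale([2, 4, 6]), best_params, best_score)
```

The cost of a grid search grows with the product of the candidate list lengths (here 4 × 3 = 12 evaluations), which is why the grids are usually kept coarse.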

Fig. 3 Methodology flow chart

Ensemble Methods

Decision Tree (DT)

A decision tree is a versatile supervised machine-learning algorithm for classification and regression problems. It creates a flowchart-like tree structure (Fig. 4), where each internal node denotes a test on a feature, branches denote decision rules, and leaf nodes denote the algorithm's result. The algorithm works by recursively splitting the dataset into subsets so as to maximize the homogeneity of the target variable in each subset, until a stage is reached where further splitting is not possible; the final node of each branch is called a leaf node.

Fig. 4 Decision tree algorithm

The primary challenge in decision tree implementation is identifying which attribute to place at the root node and at each subsequent level, known as attribute selection. The decision tree learning process employs a divide-and-conquer strategy by conducting a greedy search to identify the optimal split points within a tree. In classification problems, the Gini impurity Gini(D), shown in Eq. (18), measures the degree of impurity in a subset of the tree. The information gain IG(D), which measures the reduction in disorder [36] yielded by splitting the data based on a feature (F), is given by Eq. (19), where the entropy or disorder E(D) is given by Eq. (20) [36].

$$Gini(D)= 1-\sum_{i=1}^{N}{p}_{i}^{2}$$
(18)
$$IG (D) = E(D)- {\sum }_{v\in Values(F)}\frac{\left|{D}_{v}\right|}{\left|D\right|}E({D}_{v})$$
(19)
$$E(D) = -\sum_{i=1}^{N}{p}_{i}{log}_{2}({p}_{i})$$
(20)

where \({p}_{i}\) is the probability of class \(i\) occurring in the node, and N is the total number of classes, and \({D}_{v}\) is a subset of the collected samples set \(D\) [36].
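Equations (18) to (20) translate almost line by line into code. The sketch below is a direct transcription for small label lists; the function names and the toy labels are illustrative, and a production decision tree would evaluate these quantities over candidate thresholds of continuous features rather than over a categorical feature as done here.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency p_i of each class present in the node."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]


def gini(labels):
    """Eq. (18): 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Eq. (20): negative sum of p_i * log2(p_i)."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def information_gain(labels, feature_values):
    """Eq. (19): parent entropy minus the size-weighted entropy of the subsets D_v."""
    subsets = {}
    for label, value in zip(labels, feature_values):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Toy node with two classes, split by a perfectly informative feature
# and by an uninformative one.
node = ["defect", "defect", "healthy", "healthy"]
print(gini(node), entropy(node))
print(information_gain(node, ["x", "x", "y", "y"]))
print(information_gain(node, ["x", "y", "x", "y"]))
```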

However, DTs are prone to overfitting and underfitting problems. Hence, to obtain a model with high classification performance, the hyperparameters of the model must be optimized. The grid-search optimization technique was therefore applied to the model, resulting in the following hyperparameter values: the entropy criterion [70] to measure the quality of a split, a maximum tree depth of 9, a minimum of three samples to create a leaf, and a minimum of seven samples to split a node. The model achieved an accuracy of 82% on the testing dataset and 94% on the training dataset. Table 3 shows the model's precision, recall, and F1 score, and Fig. 5 shows the confusion matrix.

Table 3 Performance metrics of the DT model on the testing data
Fig. 5 Confusion matrix of the DT model on the testing data

Random Forest (RF)

Random forest (RF) is an ensemble supervised learning method for classification, regression, and other tasks that operates by constructing a forest of decision trees and combining their outputs (Fig. 6).

Fig. 6 Random forest algorithm

RF was first introduced by Breiman and Cutler [82], who also trademarked it. The algorithm grows a forest of trees, where each tree is trained on a randomly chosen subset of the training data, which introduces variation among the trees. The decision at each node is selected by a randomized procedure rather than a deterministic optimization. The RF algorithm is an extension of the bagging method and uses both bagging and feature randomness to create an uncorrelated forest of decision trees. Feature randomness generates a random subset of features, which ensures low correlation among the decision trees. The algorithm has three primary hyperparameters that must be set before training: the node size, the number of trees, and the number of features sampled. For classification, each decision tree in the ensemble is often built using the Gini impurity Gini(D), as in the decision tree described earlier. In summary, RF is a popular machine learning algorithm for classification and regression problems in various societal and industrial sectors. It builds multiple decision trees via randomness injection and takes a vote among the trees to make robust and accurate predictions. Among the key advantages of this method are its ease of use, flexibility, and ability to handle missing values. RF models are also relatively robust to random noise in the data. While no single equation encapsulates random forests, the algorithm involves creating multiple decision trees, each trained on a different bootstrap sample of the data, and making predictions by aggregating the results of these trees.
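The three ingredients named above (bootstrap sampling, random feature subsets, and majority voting) can be sketched independently of any tree-building code. The helper names below are illustrative, the subset size of 4 out of the 17 features is a hypothetical choice, and the tree predictions are invented; the sketch does not train any trees.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (bagging)."""
    return [rows[rng.randrange(len(rows))] for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider for a split."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))                       # stand-in for training samples
sample = bootstrap_sample(rows, rng)         # may contain repeated rows
features = random_feature_subset(17, 4, rng)  # 17 features, 4 tried per split

# Invented predictions from five trees for a single test sample.
votes = ["inner", "outer", "inner", "healthy", "inner"]
print(sample, features, majority_vote(votes))
```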

A grid search was applied to tune the RF model's hyperparameters and optimize the dataset's results. After several grid searches, it was found that the optimal number of trees in the forest is 60 trees, the max depth of the tree is 10, and the minimum number of samples mandated to split a node is 2. The classification report shows that the model achieved an accuracy of 91% on the testing data, as displayed in Table 4. Moreover, the confusion matrix, displayed in Fig. 7, shows a recall score of 100% for the No-Defect class.

Table 4 Performance metrics of the RF model on the testing data
Fig. 7 Confusion matrix of the RF model on the testing data

XGBoost

XGBoost, short for Extreme Gradient Boosting, is a software library that implements optimized, distributed gradient boosting algorithms under the gradient boosting framework. Like other boosting methods, it builds models sequentially, with each new model learning from and improving upon the previous ones. In addition, XGBoost utilizes more accurate approximations when computing splits, which helps limit overfitting and improves speed. Gradient boosting minimizes a loss L as new models are added: the gradients of the loss function \(L({y}_{i},{y}_{i}^{\prime})\) are used to estimate the descent direction, so each new model predicts the residuals (errors) of the prior models and gradually boosts the predictions. For multiclass classification problems, the loss function is given in Eq. (22) [50], where N is the number of samples, M is the number of classes, \({y}_{i,j}\) is the true label of the \(i\)th sample for the \(j\)th class, and \({p}_{i,j}\) is the predicted probability that the \(i\)th sample belongs to the \(j\)th class.

Like random forests, XGBoost models are made up of decision trees. However, while a random forest trains its trees in parallel, gradient boosting trains trees sequentially. XGBoost is particularly popular due to its ability to handle large datasets, achieve state-of-the-art performance in many machine learning tasks such as classification and regression, and efficiently handle missing values without requiring significant pre-processing. It also has built-in support for parallel processing, making it possible to train models on large datasets quickly. Since its introduction, XGBoost has become a popular choice among data scientists and machine learning engineers, and it is known for its speed, ease of use, and performance on large datasets. It can be run with its default parameters immediately after installation, without further configuration, although tuning can improve its performance (Fig. 8). Additionally, XGBoost counters the overfitting and underfitting problems of the RF and DT models by incorporating regularization techniques such as feature subsampling and a learning rate. The regularization term Ω(f) is defined in Eq. (23).
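The idea that each new model fits the residuals of the previous ones can be illustrated with a toy sketch. The code below is not XGBoost itself: it is plain gradient boosting for squared-error regression with one-split stumps, written in pure Python. The function names `fit_stump` and `boost`, the toy data, and the choice of 50 rounds are all hypothetical; only the learning rate of 0.1 echoes the value of eta discussed in this section.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x whose two-leaf fit minimizes squared error."""
    best = None
    # Exclude the largest value so the right-hand leaf is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=50, eta=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(y) / len(y)              # initial prediction: the mean target
    prediction = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate eta.
        prediction = [pi + eta * stump(xi) for xi, pi in zip(x, prediction)]
        mse = sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)) / len(y)
        mse_history.append(mse)
    return base, stumps, mse_history


# Toy data for illustration only (not taken from the bearing dataset).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
base, stumps, mse_history = boost(x, y)
```

In this sketch the training error never increases from one round to the next, because each stump removes part of the remaining residual. A smaller `eta` slows that reduction and requires more rounds.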

Fig. 8 XGBoost algorithm

$$L({y}_{i},{y}_{i}^{\prime}) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}{y}_{i,j}\log({p}_{i,j})$$
(22)
$$\Omega ({\text{f}}) =\mathrm{\gamma T}+\frac{1}{2}\uplambda \sum_{j=1}^{T}{\omega }_{j}^{2}$$
(23)

The regularization term in Eq. (23), which controls the model complexity [49], consists of two parts: \(\mathrm{\gamma T}\) and \(\frac{1}{2}\uplambda \sum_{j=1}^{T}{\omega }_{j}^{2}\). In the first part, γ is the regularization parameter for tree complexity and T is the number of terminal nodes (leaves) in the tree; together, they penalize the complexity of the tree according to its number of terminal nodes. The second part is the L2 regularization penalty applied to the weights \({\omega }_{j}\) of the terminal nodes, where \(\uplambda\) is the L2 regularization parameter that controls the strength of the penalty.
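As a worked illustration of the two formulas above, the following pure-Python sketch evaluates the multiclass log loss and the regularization term Ω(f) on small hand-made inputs. The function names `multiclass_log_loss` and `tree_penalty` and all numeric values are assumptions chosen for illustration; they are not taken from the study or from the XGBoost source code.

```python
import math


def multiclass_log_loss(y_true, p_pred):
    """Average negative log-likelihood.

    y_true: one-hot rows (N samples x M classes).
    p_pred: predicted class probabilities with the same shape.
    """
    n = len(y_true)
    total = 0.0
    for y_row, p_row in zip(y_true, p_pred):
        for y_ij, p_ij in zip(y_row, p_row):
            if y_ij:  # with one-hot labels only the true class contributes
                total += y_ij * math.log(p_ij)
    return -total / n


def tree_penalty(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * sum of squared leaf weights."""
    n_leaves = len(leaf_weights)  # T, the number of terminal nodes
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


# Two samples, three classes (hypothetical values).
y_true = [[1, 0, 0], [0, 1, 0]]
p_pred = [[0.8, 0.1, 0.1], [0.25, 0.5, 0.25]]
loss = multiclass_log_loss(y_true, p_pred)   # -(ln 0.8 + ln 0.5) / 2

# A tree with three leaves (hypothetical weights and parameters).
penalty = tree_penalty([0.5, -1.0, 2.0], gamma=1.0, lam=2.0)  # 3 + 5.25 = 8.25
```

The penalty grows both with the number of leaves (through γ) and with the magnitude of the leaf weights (through λ), which is how the term discourages overly complex trees.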

However, to capture complex relationships and obtain a model that generalizes well, the XGBoost hyperparameters should be optimized so that the model reaches its best performance. Hence, the grid-search method with fivefold cross-validation was used for hyperparameter tuning; five folds were determined to be the optimal choice, balancing computational cost and accuracy, and cross-validation also helps mitigate overfitting. As a result, the model's optimal hyperparameters are 0.1 for eta (an alias for the learning rate), 3 for the tree depth, and 250 for the number of boosting rounds. Figure 9 shows the log loss as a function of the number of epochs. As observed, the model's performance stops improving around the 180th iteration; correspondingly, Fig. 10 shows that the classification error on the test set plateaus around the 180th iteration. Hence, a learning rate of 0.1 is suitable for the dataset. It is worth noting that early stopping was set to 15 iterations: if the model does not improve for 15 consecutive iterations, training stops.
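The early-stopping rule described above can be sketched in a few lines of pure Python. This is a minimal illustration of the patience logic only, not the XGBoost implementation (where the corresponding setting is `early_stopping_rounds`); the function name `early_stopping_round` and the synthetic loss curve are hypothetical.

```python
def early_stopping_round(validation_losses, patience=15):
    """Return (best_iteration, stop_iteration) for a sequence of validation losses.

    Training stops once `patience` consecutive iterations pass without a new
    best (strictly lower) loss.
    """
    best_loss = float("inf")
    best_iteration = -1
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_iteration = loss, iteration
        elif iteration - best_iteration >= patience:
            return best_iteration, iteration
    # The sequence ended before the patience budget was exhausted.
    return best_iteration, len(validation_losses) - 1


# Synthetic curve: improves for 180 iterations (indices 0..179), then stays flat.
synthetic_losses = [1.0 / (i + 1) for i in range(180)] + [1.0 / 180] * 70
best_iteration, stop_iteration = early_stopping_round(synthetic_losses, patience=15)
# best_iteration == 179, stop_iteration == 194
```

With a patience of 15, the synthetic run stops 15 iterations after its last improvement instead of using all 250 rounds.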

Fig. 9 XGBoost log loss function versus the number of epochs

Fig. 10 XGBoost classification error (y-axis) versus the number of epochs (x-axis)

As expected and as shown in Table 5, XGBoost outperformed all the other models, with an accuracy score of 92% on the testing dataset.

Table 5 Performance metrics of the XGBoost model on the testing data

The confusion matrix in Fig. 11 summarizes the performance of the XGBoost model; like the Random Forest model, it predicts the No Defect class with 100% accuracy.

Fig. 11 Confusion matrix of the XGBoost model on the testing data
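For reference, the per-class precision, recall, and F1-score reported for the models can all be derived from a confusion matrix. The sketch below shows the arithmetic in pure Python. The function names and the 3 × 3 matrix are made up for illustration and do not reproduce the paper's results.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n_classes = len(confusion)
    metrics = []
    for k in range(n_classes):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                               # row total minus hits
        fp = sum(confusion[i][k] for i in range(n_classes)) - tp  # column total minus hits
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Made-up counts for three classes (illustration only, not the paper's data).
toy_confusion = [
    [50, 0, 0],
    [0, 40, 10],
    [0, 5, 45],
]
toy_metrics = per_class_metrics(toy_confusion)
toy_accuracy = accuracy(toy_confusion)   # 135 / 150 = 0.9
```

In this toy matrix the first class has a recall of 100% because its row contains no off-diagonal entries, which is the pattern described above for the No Defect class.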

Feature Importance

The tree ensemble models, Random Forest and XGBoost, build multiple decision trees to derive the output. The importance of the dataset's features in these models is derived from the collective behavior of these trees. In random forests, the Gini importance is typically used: it measures how much the splits made on a feature reduce the Gini impurity, accumulated across all the trees created by the model. Features that are used for splitting more often, and whose splits reduce impurity more, receive higher importance scores. XGBoost models, however, utilize the gain importance method, where features are ranked by the gain (the improvement in the training objective) that splits on the feature bring. The gain is calculated at each split, and the contribution of each feature is aggregated over all the trees created by the model. Finally, the Decision Tree model utilizes the Gini importance as well; however, since DT builds only a single tree, the importance of each feature comes from its contribution to the reduction in impurity or entropy over that single tree.
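The impurity reduction underlying the Gini importance can be made concrete with a short pure-Python sketch. It computes the Gini impurity of a node and the impurity decrease achieved by one split. The function names and class labels are hypothetical, and a full importance score would additionally sum such decreases over every split that uses a given feature.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = ((len(left) / n) * gini(left)
                         + (len(right) / n) * gini(right))
    return gini(parent) - weighted_children


# Hypothetical node with two balanced defect classes.
parent = ["inner"] * 4 + ["outer"] * 4            # Gini impurity = 0.5

# A split that separates the classes perfectly removes all impurity.
perfect = impurity_decrease(parent, ["inner"] * 4, ["outer"] * 4)    # 0.5

# A weaker split leaves one misplaced sample on each side.
weak = impurity_decrease(
    parent,
    ["inner"] * 3 + ["outer"],
    ["inner"] + ["outer"] * 3,
)                                                                    # 0.125
```

A feature that repeatedly produces splits like the first one accumulates a higher importance than a feature that only produces splits like the second.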

On the other hand, the precision, recall, and F1 score values only report the models’ performance without providing insights into feature selection or the driving factors behind that performance. For the data under study, each tree-based method relies on its own list of essential features when mapping the inputs to the targeted output. Therefore, a high-accuracy model is not the only target: the extent to which the algorithm considers the full set of training features when detecting the bearing defect must also be taken into account. After training the models, a feature importance analysis was conducted to illustrate each feature's contribution to the targeted prediction, as shown in Fig. 12.

Fig. 12 Relative importance analysis results for the three models: (a) decision tree, (b) random forest, and (c) XGBoost

In Fig. 12a, the DT approach identifies the SF as the main feature dominating the bearing defect behavior, with an importance score of around 35%. Other input parameters, such as CRSF, CLF, or RSSQ, are also relevant but have lower importance scores. Furthermore, the algorithm neglects the accelerometer location by allocating it a zero importance score, even though accelerometer location is a crucial feature that contributes to the signal time-histogram intensity. Therefore, the model might suffer from collinearity, where other highly correlated features are used to make up for the accelerometer location feature.

Meanwhile, RF allocates importance to every input feature without neglecting any. Nevertheless, two primary experimental conditions (accelerometer location and motor speed) were ranked as unimportant. Finally, XGBoost gave more consideration to the accelerometer location and the motor speed. Although four features were neglected, their values depend strongly on other features, such as RMS and STD, that are already included in the model's essential features.

Based on the previous discussion, the proposed method utilizes tree-based machine learning models (Decision Tree, Random Forest, and XGBoost) for condition monitoring and predictive maintenance of rotating machinery, specifically for diagnosing faults in rolling element bearings. These models support several applications, including early detection and diagnosis of bearing faults by identifying defects and pinpointing their location within the bearing using vibration signal data. They also support condition-based maintenance (CBM) of bearings by monitoring vibration features and enabling proactive maintenance scheduling to prevent catastrophic failures. The models can further facilitate fault prognosis by allowing defect progression to be tracked over time, and they can be integrated into automated condition monitoring systems to continuously monitor bearing health in industrial machinery. The study provides a comparative benchmark of different tree-based machine learning techniques for bearing fault diagnosis, with the primary application being reliable and early fault detection and diagnosis from vibration data in rolling element bearings of rotating equipment across various industrial sectors. This approach enables timely maintenance and helps prevent unexpected failures that could result in costly downtime.

Moreover, the research reveals that the Decision Tree method identifies specific features that predominantly influence defect behavior, potentially reflecting the localized impact of faults within the bearing structure. In contrast, Random Forest assigns considerable importance to all input features, suggesting a holistic approach to fault detection that considers the collective influence of various parameters. XGBoost, with its balanced consideration of essential features such as accelerometer location and motor speed, showcases a nuanced understanding of the interplay between different variables in determining bearing health and fault location.

The research delves into the importance of various time domain parameters like Root Mean Square (RMS), Crest Factor (CRSF), and Kurtosis (KU) in analyzing bearing faults. These parameters offer a tangible understanding of the vibration energy, signal spikiness, and distribution characteristics, which are crucial for fault detection in rotating machinery. The models, including Decision Tree, Random Forest, and XGBoost, showcase high accuracy rates and highlight the significance of specific features in predicting bearing defects. Furthermore, the feature importance analysis reveals how different models prioritize input features. For instance, Decision Trees emphasize features like SF, while Random Forests assign significant importance to all input features without neglecting any. On the other hand, XGBoost demonstrates a balanced consideration for features like accelerometer location and motor speed, showcasing its ability to handle complex relationships and optimize model performance effectively.

Conclusion

This research paper provides a detailed analysis of the experimental testing, data setup, and results related to bearing fault diagnosis and prognosis. The study utilizes the Qatar University Dual-Machine Bearing Fault Benchmark (QU-DMBF) Dataset, which includes various bearing configurations with defects on the outer and inner rings. Vibration signals were acquired using accelerometers, and time domain analysis was conducted to study defective bearings. Three machine learning models were benchmarked using this dataset: Decision Tree (DT), Random Forest (RF), and Extreme Gradient Boosting (XGBoost). These models were evaluated based on performance metrics such as precision, recall, and F1-score. The Decision Tree model achieved an accuracy of 82% on the testing dataset, with precision, recall, and F1-score values reported for each defect class. Random Forest achieved an accuracy of 91% on the testing data, with a recall score of 100% for the No-Defect class. XGBoost, known for its optimized distributed gradient boosting algorithms, outperformed the other models with an accuracy score of 92% on the testing dataset.

The dataset preparation in the study involved several vital steps. First, categorical variables were encoded into a numerical format, a common practice in machine learning that ensures the algorithms can work with the data effectively. Second, the Min–Max data normalization technique was applied so that all features share the same scale, which is essential for many machine learning algorithms to perform optimally. Finally, a stratified split was implemented to ensure a consistent class distribution in the training and testing sets. This division helps to prevent data imbalance and the potential model bias that could arise from the underrepresentation or overrepresentation of specific target classes.

The study highlights that while Decision Trees and Random Forests offer simplicity and interpretability, they often fall short in predictive accuracy compared to XGBoost, especially when dealing with noisy, high-dimensional, and multicollinear data. XGBoost's ability to balance bias and variance effectively, incorporate regularization techniques, and fine-tune hyperparameters is pivotal in mitigating overfitting and enhancing defect location predictions. Moreover, XGBoost outshines its counterparts by providing more accurate feature importance scores and by iteratively building robust ensemble models that focus on previous mistakes and refine the feature importance scores with each boosting round.
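Two of the dataset preparation steps summarized above, Min–Max normalization and the stratified split, can be sketched in pure Python as follows. This is an illustrative sketch only: the function names, the 20% test fraction, the random seed, and the class counts are assumptions, since the study's exact settings are not restated here.

```python
import random
from collections import defaultdict


def min_max_scale(column):
    """Rescale one feature column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


scaled = min_max_scale([2.0, 4.0, 6.0])            # [0.0, 0.5, 1.0]

# Hypothetical, imbalanced label vector (not the QU-DMBF class counts).
labels = ["none"] * 10 + ["inner"] * 20 + ["outer"] * 20
train_idx, test_idx = stratified_split(labels)
```

Because the split is performed class by class, the minority class in this example contributes the same 20% share to the test set as the larger classes.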