This section describes the different modules of the BSD predictive maintenance system, which are depicted in Fig. 3. The first part is the raw data, generated in lifetime experiments using the camera system described in [30]. The generated dataset has the specific property that it shows a continuous progression of failures and hence depicts the whole wear history from the start of operation until the mechanical breakdown of the component. On this dataset, a so-called pitting detection model is built which returns a bounding box approximating the size of a defect. This result can be further refined with a new approach for calculating the defect area that combines a classical threshold-based method with a convolutional neural network (CNN) predicting the threshold used for the area calculation. The extracted failure area is then processed in a forecasting module which is trained on historical defect data. The resulting model can finally be used to predict the future size of new failures.
Data set
Using a BSD test bench to artificially wear ball screw drives, the authors generated images of pittings in a temporal relationship. Early images show no or only small pittings, which then grow over time until the component fails. A progression of pittings is depicted in Fig. 14. As stopping criterion for the experiments, the authors defined the mechanical breakdown of the system. During the experiments, the authors mounted a camera system, described in detail in [30], close to the nut of the ball screw drive such that it looks radially onto the spindle and returns images of the spindle surface. The experimental setup is depicted in Fig. 4.
Every four hours, the whole spindle is scanned by the system. The experiments are undertaken with an axial load of ~ 14 kN. The external conditions can be regarded as similar to an industrial environment since no special protection measures were taken during the experiments, and pollution as well as lubricants are visible on the spindle. The BSD nuts are prepared with standard wipers. According to [31], changing the axial load does not influence the way a surface defect grows (the growth function is the same) but only the speed of growth.
The camera has a resolution of 2592 × 1944 pixels, and LED lighting is used. Because of the kinematics of the BSD, the whole raceway passes the camera lens, and the authors automatically crop images of 190 × 190 pixels from the larger frames. This setup can easily be adjusted to specific needs. This process is depicted together with exemplary training images in Fig. 5. The authors extracted 230 images in total, of which 60 are set aside for testing the model; a 70/30 split of the remaining images is used for training and validation of the defect detection model. For the training of the threshold prediction model, which is described in more detail later on, the authors used 600 images from the same data source, but without drawing bounding boxes around the defects. There is an intersection between the sets, but this does not influence the results since the models are distinct. These images are divided into six lighting categories for the threshold prediction. A 70/30 train/validation split is used for model training.
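As an illustration, the patch extraction and data split could be implemented along the following lines; the directory layout, the crop stride and the random seed are assumptions and not taken from the original setup.

```python
import glob
import random
import cv2

PATCH = 190  # patch edge length in pixels, as described above

def crop_patches(image_path, stride=PATCH):
    """Cut a full 2592 x 1944 camera frame into 190 x 190 patches."""
    img = cv2.imread(image_path)
    h, w = img.shape[:2]
    patches = []
    for y in range(0, h - PATCH + 1, stride):
        for x in range(0, w - PATCH + 1, stride):
            patches.append(img[y:y + PATCH, x:x + PATCH])
    return patches

# Hypothetical split: 60 test images set aside, remaining 170 split 70/30.
images = sorted(glob.glob("raceway_scans/*.png"))  # assumed directory layout
random.seed(42)
random.shuffle(images)
test, rest = images[:60], images[60:]
split = int(0.7 * len(rest))
train, val = rest[:split], rest[split:]
```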
Defect detection
To be able to predict the evolution of the size of defects, the defects must first be located in the image. The TensorFlow Object Detection API is used to set up the object detection model. As a pre-trained model, EfficientDet D0 512 × 512, pre-trained on the COCO dataset, is used. EfficientDet employs EfficientNet as a backbone network, which is pre-trained on the ImageNet dataset. The feature network is the weighted bi-directional feature pyramid network (BiFPN) (Tan et al. 2020). For the object detection part, the last layer of the pre-trained model is fine-tuned on the 120 images of pittings; the convolutional base is not changed. The model is trained on NVIDIA Tesla T4 hardware for 2000 epochs.
If an image with a pitting is passed through the model, the model detects the pitting and outlines it with a bounding box as depicted in Fig. 6. The system yields a validation accuracy of 92%. These bounding boxes serve as the region of interest for the following threshold model. In cases where the approximate size of a pitting is sufficient, this model could also be used as a standalone input for the forecasting step.
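As an illustration, inference with the fine-tuned detector could look as follows, assuming the model has been exported as a TensorFlow SavedModel; the export path and the score threshold are assumptions.

```python
import tensorflow as tf
import numpy as np
import cv2

# Assumed path to the exported, fine-tuned EfficientDet D0 SavedModel.
detect_fn = tf.saved_model.load("exported_models/efficientdet_d0/saved_model")

def detect_pitting(image_bgr, score_threshold=0.5):
    """Return normalized bounding boxes (ymin, xmin, ymax, xmax) above a score threshold."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    input_tensor = tf.convert_to_tensor(rgb[np.newaxis, ...], dtype=tf.uint8)
    detections = detect_fn(input_tensor)
    boxes = detections["detection_boxes"][0].numpy()
    scores = detections["detection_scores"][0].numpy()
    return boxes[scores >= score_threshold]
```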
By passing on only the content of the bounding box, this area then contains fewer disturbing factors, and the contour of the pitting can be determined more reliably. During the detection step, it could happen that the model cannot find an object in a specific image. In that case, it has to be differentiated if the model has detected a pitting at an earlier point in time. If this is the case, by domain knowledge, it is known that there has to be a pitting why it helps use a slightly larger bounding box as used on the same position at t − 1. This is beneficial because it is not possible that the failure has shrunken or disappeared entirely. To summarize: If there is a bounding box in step t, there has to be one in step t + 1. This expert behaviour based on domain knowledge is implemented by the expert system described later on. The authors provided the model with 60 additional test images containing defects of different sizes to validate the detection model. Figure 7 shows the distribution of defects together with the information if the pitting was detected correctly (the whole pitting lies within the bounding box) or not. All pittings corrrectly detected by the object detection model are depicted in green. The defects marked in blue are the pittings for which no bounding box was found. The model accurately detects most defects and only misses some small pittings. Figure 8 shows examples for which the model failed to detect the failures. However, since the goal is to use the bounding box as a region of interest for the next step, it is possible, as already mentioned, to use the previously found bounding box of t − 1 on the same position and assume that there must be a pitting.
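A minimal sketch of this fallback rule could look as follows; the box format (pixel coordinates) and the enlargement margin are assumptions.

```python
def fallback_box(current_box, previous_box, margin=5):
    """If detection fails at time t but a pitting was seen at t-1,
    reuse the previous box, slightly enlarged (a pitting cannot shrink)."""
    if current_box is not None:
        return current_box
    if previous_box is None:
        return None  # no pitting observed so far
    x_min, y_min, x_max, y_max = previous_box  # assumed pixel-coordinate format
    return (x_min - margin, y_min - margin, x_max + margin, y_max + margin)
```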
Threshold prediction
Processing the regions inside the found bounding boxes, the goal of the threshold prediction model is to find an appropriate threshold such that the failure can be extracted as precisely as possible. To extract the failure, the authors use a classical threshold-based contour extraction (findContours) provided by OpenCV [1].
The algorithm yields the best results on a binary image since there the regions are maximally distinguishable. Hence, the goal is to convert the image into a black-and-white image separating the failure area. Because of the diversity of the BSD images, there is no single appropriate global threshold value for all images. The threshold value is an integer in the range [0,…,255] marking the border below which all pixel values are set to 0, while all values above are set to 255. Experiments show that automatic global and local threshold-finding algorithms, such as the method of Otsu, a widely used algorithm to find the global threshold automatically [28], also did not work properly, as depicted in Fig. 9.
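The contrast between Otsu's automatic threshold and a fixed, image-specific threshold can be illustrated with a short OpenCV sketch; the example threshold value is only illustrative.

```python
import cv2

def binarize(roi_bgr, fixed_threshold=45):
    """Compare Otsu's automatic global threshold with a fixed, class-specific threshold."""
    gray = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2GRAY)
    # Otsu picks a single global threshold from the histogram (often unsuitable here).
    otsu_value, otsu_binary = cv2.threshold(gray, 0, 255,
                                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Fixed threshold, e.g. one of the six labelled values (35, 40, 45, 52, 62, 72).
    _, fixed_binary = cv2.threshold(gray, fixed_threshold, 255, cv2.THRESH_BINARY)
    return otsu_value, otsu_binary, fixed_binary
```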
The method of Otsu works especially well with bimodal images by choosing the threshold as a point between the modes. This is not applicable in the case presented here since the images are not bimodal and the modes change across images. Instead, the authors trained a CNN model to classify images into their appropriate threshold values. The basic idea of this approach is to determine the value of a parameter, which is otherwise calculated by statistical methods such as a mean value method, with the help of a CNN. This method could be used in other areas where model parameters have to be determined as well. An example could be simulations where models must be parametrized based on some raw input data. Especially when it is difficult to determine an appropriate set of values in advance and finding a value is done by trial and error, the presented approach can help to accelerate the process of finding appropriate parameter values. The authors used a total of 600 images in a 70/30 train/validation split. The images used to train the CNN model are divided into six classes for six different background conditions, resulting in six threshold values. The six background conditions are labelled by the authors based on different lighting and pollution conditions, which in turn influence the threshold. Different conditions are shown, e.g., in Fig. 5 above. For example, images taken on a spindle that is already worn often have a much darker background because dirt particles, discoloured lubricant and wear particles are visible on the spindle. It turned out that six classes are sufficiently fine-grained; more classes do not aid the model. As threshold values, the authors defined the values 35, 40, 45, 52, 62 and 72. With these thresholds, all contours could be satisfactorily represented. Figure 10 shows the labelling process in which all threshold values are applied to all images and the most suitable threshold value is chosen.
It can be clearly seen that for each image, a different threshold value is needed for optimal contour recognition. A box surrounds the best value in each case. This process can be understood as labelling images with useful thresholds.
In the next step, the authors trained a CNN model using the images together with the threshold labels. The CNN model is a manually built model with four convolutional layers, each followed by a 2 × 2 max pooling operation. The convolutional base is followed by two fully connected layers with 64 neurons each. ReLU activation is used in all convolutional and dense layers. The dense layers are followed by dropout layers with a rate of 0.1. The final layer consists of six neurons applying softmax activation. The optimizer used is Adam with a learning rate of 0.0001. The training was done on NVIDIA Tesla T4 hardware for 1000 epochs.
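A minimal Keras sketch of such a model is given below; the filter counts, kernel sizes and input resolution are assumptions, while the layer structure, dropout rate and optimizer settings follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Filter counts, kernel size and input size are assumptions; the description specifies
# four conv layers with 2x2 max pooling, two dense layers of 64 neurons followed by
# 0.1 dropout, a 6-way softmax output, and Adam with learning rate 1e-4.
model = models.Sequential([
    layers.Input(shape=(190, 190, 3)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(6, activation="softmax"),  # one class per threshold value
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```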
The authors achieved a validation accuracy of 92%, which shows that the model can accurately predict the best-suited threshold value. Before applying the method, the authors implemented a four-step image pre-processing to aid the segmentation results. The four steps are thresholding, bitwise inversion, morphological dilation and morphological erosion (Fig. 11).
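A sketch of the four pre-processing steps followed by contour-based area extraction could look as follows; the kernel size and iteration counts are assumptions.

```python
import cv2
import numpy as np

def pitting_area(roi_gray, threshold):
    """Four-step pre-processing (threshold, bitwise invert, dilation, erosion)
    followed by contour extraction; returns the largest contour area in pixels.
    roi_gray is the grayscale content of the detected bounding box."""
    kernel = np.ones((3, 3), np.uint8)  # kernel size is an assumption
    _, binary = cv2.threshold(roi_gray, threshold, 255, cv2.THRESH_BINARY)
    inverted = cv2.bitwise_not(binary)            # pitting becomes foreground
    dilated = cv2.dilate(inverted, kernel, iterations=1)
    eroded = cv2.erode(dilated, kernel, iterations=1)
    contours, _ = cv2.findContours(eroded, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0.0
    return max(cv2.contourArea(c) for c in contours)
```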
Expert system
Once the area of a failure is calculated, it could be used in the forecasting model to predict the expected failure size. As mentioned above, in technical domains there is usually valuable domain knowledge about the failure's visual characteristics as well as the wear mechanisms, which could aid the machine learning approach. This is true for many areas where substantial domain knowledge is available. By intelligently combining the domain knowledge with machine learning systems, the intelligent systems can be improved by human experts introducing information that is not easily learnable from the available data (expert system). One example is the fact that a defect on the spindle cannot shrink over time but only grow larger. It is important to mention that the knowledge base for the expert system is implicitly encoded in the experience of the expert and is available with zero additional data points. Because this knowledge already exists, the task is to properly formulate and implement it in an algorithm to support data-driven approaches. In the case presented here, a strong characteristic of the pitting is, e.g., that it has sharp corners and a somewhat darker colour than the surroundings. This knowledge was not explicitly formulated above but was implicitly used by the contour finding algorithm to find the borders of the failures. With the steps described above, it is possible to measure the size of a pitting very accurately, but due to oil or pollution on the ball screw, the appearance of the pitting area can vary. In these cases, the calculated pitting area would deviate from reality. Figure 12 compares the progression of pitting measured by the model (quantification) with the results of the expert system and the ground truth data over 28 time steps after the first pitting has been observed. The y-axis represents the pitting's size at the respective time steps (x-axis). The blue data series reflects the ground truth data, whilst the grey data series represents the values measured by the model without the expert system. The green line represents the results after implementing the expert system. To make the prediction process clearer, predictions are made successively, as they would be in practice at different points in time. For each step, a linear regression model (purple line) trained on the data already processed by the expert system is added.
It can be seen that the model without the expert system (grey) matches the ground truth data (blue) at the beginning. However, there are some strong outliers at later time steps. Additionally, the model partly overestimates the size of the pitting at later time steps and also predicts smaller sizes for later time steps. By domain knowledge, however, it is known that decreasing pitting sizes are not possible. Hence, the model has either overestimated the size of a pitting at earlier time steps or underestimates its current size. To address this issue, domain knowledge is introduced by applying a two-step algorithm, represented by the green line. As a baseline for the expert system, three possible cases are distinguished.
First, the area at time step t + 1 is slightly larger than the area at time step t. In this case, the measured value is valid. Second, the area at time step t + 1 is disproportionately large, which may be caused by severe pollution on the surface. In this case, the average of the defect size at time step t + 1 and the defect sizes at time steps t and t − 1 is calculated and used as the predicted size. In this way, rough outliers are averaged out, which yields a smoother curve. Third, the area at t + 1 is smaller than at time step t, which is impossible. In this case, the pitting size at t + 1 is set to the size at t and hence remains the same.
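A sketch of these three rules is given below; the cutoff deciding when a growth step counts as disproportionate is an assumption, since the exact criterion is not specified.

```python
def expert_correction(history, measured_area, jump_factor=1.5):
    """Apply the three expert rules to the area measured at time step t+1.

    history: list of already corrected areas up to time step t.
    jump_factor: cutoff above which a growth step counts as disproportionate
                 (the exact value is an assumption, not taken from the text).
    """
    if not history:
        return measured_area
    previous = history[-1]
    if measured_area < previous:
        # Case 3: a pitting cannot shrink -> keep the previous size.
        return previous
    if measured_area > jump_factor * previous and len(history) >= 2:
        # Case 2: disproportionate jump, likely pollution -> average with t and t-1.
        return (measured_area + history[-1] + history[-2]) / 3.0
    # Case 1: slightly larger than before -> accept the measurement.
    return measured_area
```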
Therefore, the expert system as the last step of the pipeline compensates for erroneous measurements in the previous steps and results in a more accurate prediction model which shows that implementing the expert system aids the prediction.
Forecasting function
The linear regression fitted to the data above seems to fit the data well. To verify this assumption, the authors fitted a set of functions to the data and measured the root mean squared error (RMSE) between the predictions and the ground truth data. As functions, the authors chose linear regression, a second- and a third-order polynomial, and an exponential function. As a base method for determining the forecasting quality, the classical RMSE is used:
$$E=\frac{1}{J}\sum_{j=1}^{J}\sqrt{{({\widehat{a}}_{x}^{t+j}-{a}_{x}^{t+j})}^{2}}$$
where \({\widehat{a}}_{x}^{t+j}\) is the size predicted by the expert system \(j\) timesteps ahead of \(t\) and \({a}_{x}^{t+j}\) is the ground truth value at the same time step. Hence the sum of the distances between all predicted and ground truth points is measured. The closer the predicted values match the true values, the better the function fits the data.
The issue here is that this only gives a measure on already observed points, which is not appropriate in practice, since the higher the degree of the polynomial, the better the function will fit the observed data. A polynomial of degree n can match n + 1 data points with E = 0, though such a function will probably not be a good estimator for future points. The goal is not to have a function that is precise on already observed points but on future points. Hence, the model is created on points up to time step \(t\), and the prediction precision is measured on all points \(j\) in the future.
From a practical point of view, predictions that are correct in the very near future and predictions that are correct in the very far future are less important than middle-term predictions. This is because, to maintain a component in time, it is necessary to look some time ahead, covering the time needed to prepare the maintenance. Hence, it is of little value to have a model that is very accurate for the very near future but fails in the middle- and long-term future, since then the time to plan the maintenance is not sufficient. The same is true for predictions very far in the future: the preparations for maintenance take some time p, and if the prediction horizon is much greater than p, this is of little extra value since it will not change the planning behaviour. Hence, there is a middle-term “sweet spot” in which a model should be as accurate as possible. The selection of this “sweet spot” horizon differs between companies and processes. To implement this behaviour, the formulation of the RMSE is extended to incorporate a time component:
$${E}_{\alpha }=\frac{1}{J}\sum_{j=1}^{J}\sqrt{{f(j)({\widehat{a}}_{x}^{t+j}-{a}_{x}^{t+j})}^{2}}$$
where \(\alpha\) is a parameter determining how far in the future the highest weight is placed. \(\alpha\) should be odd for mathematical convenience; the authors chose \(\alpha = 7\). The function \(f(j)\) is chosen as a bell-shaped function with \(f\left(j\right)={e}^{-0.15\cdot {\left(ceil\left(\frac{\alpha }{2}\right)-j\right)}^{2}}\), where \(ceil(.)\) rounds the resulting float up to the next integer, which is 4 in this case. The function \(f\left(j\right)\) is symmetric, takes its maximum value of 1 at \(j=4\), and takes smaller values for \(j>ceil\left(\frac{\alpha }{2}\right)\) and \(j<ceil\left(\frac{\alpha }{2}\right)\). Hence, the resulting RMSE is weighted, and the value which lies 4 time steps in the future receives the highest weight.
Additionally, from a practical point of view, a function is wanted that is as data efficient as possible, i.e., it should fulfil the above criteria with as little data as possible. To implement this behaviour, the loss function is further extended to incorporate the reciprocal of the number of data points used for training. The model is trained with progressively more data points, where the minimum number of data points required before the error calculation starts is set to four. The error is then summed over all \(RMS{E}_{\beta }\), where \(\beta\) is the number of data points available for training. The final loss term is:
$$E=\sum_{\beta }^{N}{\beta }^{-1}\left(\frac{1}{J}\sum_{j=1}^{J}\sqrt{{f\left(j\right)\left({\widehat{a}}_{x}^{t+j}-{a}_{x}^{t+j}\right)}^{2}}\right)=\sum_{\beta }^{N}{\beta }^{-1}{RMSE}_{\beta }^{\alpha }$$
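A sketch of how this error term could be evaluated is given below; the candidate-function interface and variable names are assumptions, while the weighting function, the minimum of four training points and the summation over \(\beta\) follow the description above.

```python
import numpy as np

ALPHA = 7  # weighting parameter from the text

def weight(j, alpha=ALPHA):
    """Bell-shaped weight f(j), maximal at j = ceil(alpha / 2) = 4."""
    return np.exp(-0.15 * (np.ceil(alpha / 2) - j) ** 2)

def weighted_rmse(predicted, truth):
    """E_alpha for one training set size: weighted distance over the J future points."""
    j = np.arange(1, len(truth) + 1)
    return np.mean(np.sqrt(weight(j) * (predicted - truth) ** 2))

def total_error(fit_fn, areas, min_points=4):
    """Sum of beta^-1 * RMSE_beta over all training set sizes beta.

    fit_fn(x, y) must return a callable model; this interface is an assumption.
    areas: ground-truth pitting sizes ordered in time.
    """
    areas = np.asarray(areas, dtype=float)
    t = np.arange(len(areas))
    total = 0.0
    for beta in range(min_points, len(areas)):
        model = fit_fn(t[:beta], areas[:beta])     # train on the first beta points
        predicted = model(t[beta:])                # predict all remaining points
        total += (1.0 / beta) * weighted_rmse(predicted, areas[beta:])
    return total
```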
This final error term is used to compare the functions. The result is shown in Fig. 13. The bar plot validates the assumption that the linear function yields the best results. The linear function closely matches the data, which is plausible because, by inspection, the data follows a linear trend. The predicted curves indicate that all higher-order polynomials fail to match the data because they are too oscillatory. The linear model is therefore chosen as the appropriate model to fit the evolution of the defect size.
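Building on the sketch above, the candidate functions could be compared as follows; the exponential parametrization and its initial guess are assumptions, and `total_error` and `areas` refer to the previous sketch.

```python
import numpy as np
from scipy.optimize import curve_fit

def poly_fit(degree):
    """Return a fit function producing a polynomial model of the given degree."""
    def fit(x, y):
        coeffs = np.polyfit(x, y, degree)
        return lambda x_new: np.polyval(coeffs, x_new)
    return fit

def exp_fit(x, y):
    """Exponential candidate a * exp(b * x) + c; the initial guess is an assumption."""
    f = lambda x, a, b, c: a * np.exp(b * x) + c
    params, _ = curve_fit(f, x, y, p0=(1.0, 0.01, 0.0), maxfev=10000)
    return lambda x_new: f(x_new, *params)

candidates = {
    "linear": poly_fit(1),
    "quadratic": poly_fit(2),
    "cubic": poly_fit(3),
    "exponential": exp_fit,
}
# `areas` is the ground-truth size series; `total_error` comes from the sketch above.
errors = {name: total_error(fn, areas) for name, fn in candidates.items()}
# The candidate with the smallest error (here: the linear model) is selected.
```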