
1 Introduction

Condition monitoring enables efficient and failure-free operation of production plants, machines and manufacturing processes. One way to obtain information about the condition of a machine or a machine part is to monitor its vibration. For successful monitoring, it is commonly advised to place a vibration sensor on each part that is to be monitored. With sensors that output preprocessed data, for example the root mean square value or the peak-to-peak value in a given time interval, simple processes can be monitored by comparing the sensor output to predefined thresholds [1, 2].

When the sensor is placed directly on the part of interest, it is often assumed that the sensor values mainly reflect the vibrations emitted by this part and that vibrations emitted by other parts hardly influence the sensor values.

For complex processes and machines, using thresholds and assuming that the sensor values depend only on the machine part the sensor is attached to may not be feasible. In these cases, using machine learning to extract the desired information from the values of one or several sensors can yield very good results [3, 4].

Instead of using machine learning to improve the condition monitoring of a single machine part, we use it to simultaneously monitor several machine parts with only a single vibration sensor. As preprocessed sensor data may not contain enough information to extract the condition of each machine part, we use the raw vibration data. Nguyen et al. [5] proposed to treat monitoring several machine parts with a single vibration sensor as a blind source separation problem. However, their approach detects the overall condition of a machine and is not able to classify the individual condition of each machine part. For the classification of the condition of individual parts, the use of a support vector machine has been proposed in [6] for agro-industrial machinery and in [7] for centrifugal pumps. In both cases, the authors calculated statistical properties of the time-domain signal and the corresponding frequency-domain signal and used them as features. These features included, amongst others, the mean, standard deviation, skewness, kurtosis, crest factor and root mean square value.

We propose to use a convolutional neural network (CNN) to classify the condition of several machine parts using only a single vibration sensor. This approach eliminates the need for explicit feature extraction and selection. Furthermore, the CNN may find features more suitable for the task than statistical properties of the time or frequency domain signal and thus yield better results. To the best of our knowledge, simultaneous condition monitoring using CNNs has not been studied before.

2 Theoretical Background and Methods

2.1 Convolutional Neural Network

Deep learning is a part of machine learning and offers the possibility to learn complex relationships in data using artificial neural networks (ANNs) [8]. ANNs are inspired by the brains of humans and animals and are based on the mathematical concept of artificial neurons [9]. Artificial neurons are arranged in interconnected representation layers that filter information from the data layer by layer [10].

Convolutional neural networks (CNNs) were introduced in [11] as a form of ANNs used in particular for image and speech recognition as well as time-series tasks. CNNs use convolutional layers to extract local features. A convolutional layer can have multiple convolution matrices, also called kernels or filters, each of which creates a feature map. The feature maps are stacked and passed on to the next layer. To reduce the influence of shifts and distortions, a convolutional layer is usually followed by a pooling layer, which performs a local averaging or subsampling and thus reduces the resolution. Because the convolution operation is linear, a non-linear activation function must be applied to the output [8].

2.2 Discrete Fourier Transform

For periodic time signals, it is usually beneficial to investigate the spectrum of the signal. Transforming a discrete-time signal into the discrete frequency domain using the discrete Fourier transform (DFT) is therefore a commonly used preprocessing step [12]. The DFT maps N discrete samples of a signal \(x_n\) onto N complex spectral values \(X_k\) with the index \(0 \le k \le N-1\) and is defined as

$$\begin{aligned} X_{k}=\sum _{n=0}^{N-1} x_{n} {\text {e}}^{-{\text {i}} 2 \pi n k / N} \text { .} \end{aligned}$$
(1)
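As a sanity check, Eq. (1) can be implemented directly in a few lines of NumPy. This is a naive O(N²) sketch for illustration only; a real implementation would use the FFT, which computes the same values.

```python
import numpy as np

def dft(x):
    """Naive O(N^2) DFT, term by term as in Eq. (1):
    X_k = sum_{n=0}^{N-1} x_n * exp(-i 2 pi n k / N)."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N) @ x

# Example: a 40 Hz sine sampled at 6.4 kHz (the paper's sampling rate)
fs, N = 6400, 1600
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 40 * t)

X = dft(x)
assert np.allclose(X, np.fft.fft(x))                  # matches NumPy's FFT
assert np.argmax(np.abs(X[:N // 2])) == 40 * N // fs  # energy at bin 10
```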

2.3 Methods

In all of our experiments, we applied automated hyperparameter optimization using a Bayesian optimization algorithm to find the optimal parameters for the models. Figure 1 displays the structure and the hyperparameter space for the optimization. Our architecture contains one or more convolutional (conv) blocks, each consisting of a convolutional layer and an average or max pooling layer, followed by a batch normalization and a ReLU activation layer. The optimization algorithm can choose between one and five conv blocks, with a default of three. It can also choose the number of filters and the kernel size in each convolutional layer. The number of filters ranges from four to 64 with a step size of four and a default of eight. The kernel size ranges from three to 59 with a step size of eight and a default of 27. Because we chose a maximum kernel size of 59, we had to limit the number of conv blocks to five. After the conv blocks, the algorithm chooses between a global average and a global max pooling layer. Next, it has the option to add a dropout layer with a dropout rate between 0.2 and 0.5, with a step size of 0.05 and a default of 0.25. At the end, there is a flatten layer followed by a dense layer with a ReLU activation and, finally, a dense layer with a softmax activation function.

Fig. 1. The architecture we used for the hyperparameter search. The convolution block is highlighted using a dashed line.
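For concreteness, the search space described above can be written down as a small configuration. The structure and names below (`n_conv_blocks`, `filters`, and so on) are our own shorthand, not the API of any particular tuning library.

```python
# Our shorthand for the search space described in the text;
# not tied to any specific hyperparameter-optimization framework.
search_space = {
    "n_conv_blocks": {"min": 1, "max": 5, "step": 1, "default": 3},
    "filters":       {"min": 4, "max": 64, "step": 4, "default": 8},
    "kernel_size":   {"min": 3, "max": 59, "step": 8, "default": 27},
    "global_pooling": ["avg", "max"],
    "dropout_rate":  {"min": 0.2, "max": 0.5, "step": 0.05, "default": 0.25},
}

def options(p):
    """Enumerate the discrete values of a range parameter."""
    n = round((p["max"] - p["min"]) / p["step"]) + 1
    return [p["min"] + i * p["step"] for i in range(n)]

assert options(search_space["filters"]) == list(range(4, 65, 4))
assert options(search_space["kernel_size"]) == [3, 11, 19, 27, 35, 43, 51, 59]
```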

It would have been possible to add more layers and enlarge the hyperparameter space, but to reduce the number of trials, we fixed a few basic parameters based on the current state of research, preliminary assessments and experience. As the optimizer, we use Adam, which was proposed in [13]. As the loss function, we chose categorical cross-entropy to match the one-hot encoded labels. The optimization metric is the classification accuracy on the test data set. As the batch size, we chose 32, as recommended in [14]. The output layer is, as mentioned above, a fully connected layer with eight neurons and a softmax activation function.

We use raw and transformed data to train the CNN model and compare the performance of both by their correct classification rates. The transformed data is generated by taking the absolute value of the discrete Fourier transform of the raw data. Every run consists of three steps. First, we use hyperparameter optimization to find an architecture that fits the data best. Next, we create ten models based on this architecture and train them separately. In the last step, we average their classification accuracies. For every run, we make sure that the data is imported anew and distributed randomly into the training, test and validation sets to avoid a favorable distribution. Finally, we calculate the mean and the standard deviation of the test accuracy over the ten models of each run.
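The transformation step can be sketched in NumPy as follows; the batch shape (windows, time samples, sensor axes) and the batch size of 32 are ours for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a batch of raw windows: 32 windows, 1600 samples, 3 axes
raw = rng.standard_normal((32, 1600, 3))

# Transformed data: absolute value of the DFT along the time axis
transformed = np.abs(np.fft.fft(raw, axis=1))

assert transformed.shape == raw.shape
assert np.all(transformed >= 0)
```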

3 Experimental Setup

To create training data, we used a hardware setup with motors and a vibration sensor installed on a perforated plate. We used one DC motor and two servo motors. This setup emulates a machine that was retrofitted with a single vibration sensor. The positions of the motors were chosen randomly, whereas the position of the sensor was chosen deliberately; in our scenario this is possible because the sensor was installed after the initial setup of the machine. Each motor has its own characteristic curve and therefore a specific vibration pattern. In our case, the DC motor runs continuously at the same frequency, while the servo motors alternate between movement and standstill, each with a different moving speed and idle time. The vibration sensor measures acceleration on three axes with a sampling frequency of 6.4 kHz and provides raw vibration data without preprocessing.

Table 1. Mapping of motor combinations to classes.

We used software developed in-house to assign characteristics to the motors, control their start and stop times, and trigger the recording of the vibration data with a predefined duration. Using three motors results in eight possible combinations. These combinations, as well as the class labels we assigned to them, are listed in Table 1. For each class, 20 recordings were made.
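One way the eight combinations could be enumerated is a binary on/off encoding, sketched below. This mapping is hypothetical; the actual label assignment is the one given in Table 1.

```python
from itertools import product

# Hypothetical encoding -- the authoritative mapping is Table 1.
# Each motor is either off (0) or on (1); three motors give 2**3 = 8 classes.
motors = ("dc", "servo_1", "servo_2")
combinations = list(product((0, 1), repeat=len(motors)))
classes = {combo: label for label, combo in enumerate(combinations)}

assert len(classes) == 8
assert classes[(0, 0, 0)] == 0  # all motors off -> first class in this encoding
assert classes[(1, 1, 1)] == 7  # all motors on  -> last class in this encoding
```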

The data is normalized with the L2 norm, applied to each sample independently. Preliminary investigations showed that this normalization achieved better results than, for example, a min-max scaler or standardization.
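A minimal sketch of the per-window L2 normalization, assuming each window is scaled independently of the others:

```python
import numpy as np

def l2_normalize(window, eps=1e-12):
    """Scale one window so its L2 norm is 1; each window is
    normalized independently of the rest of the data set."""
    return window / (np.linalg.norm(window) + eps)

rng = np.random.default_rng(1)
w = l2_normalize(rng.standard_normal(1600))
assert np.isclose(np.linalg.norm(w), 1.0)
```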

We used 70% of the recordings for training. The remaining 30% were split equally between validation and test data. Ahead of the split, the data was shuffled randomly without a fixed seed, so that the distribution of the data differed from previous runs. This ensured that an initially advantageous distribution of the data was avoided.
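The split can be sketched as follows; the recording-level indices are shuffled without a fixed seed, as described above.

```python
import numpy as np

rng = np.random.default_rng()  # deliberately no fixed seed
n_recordings = 160
idx = rng.permutation(n_recordings)

n_train = int(0.70 * n_recordings)     # 112 recordings for training
n_val = (n_recordings - n_train) // 2  # 24 recordings each for val and test
train, val, test = np.split(idx, [n_train, n_train + n_val])

assert len(train) == 112 and len(val) == 24 and len(test) == 24
assert set(idx) == set(train) | set(val) | set(test)  # disjoint, complete split
```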

From the recordings, we created windows of 1600 data points, which corresponds to a quarter of a second. We chose this time interval to keep the inference time to a minimum while still retaining enough information to reach a satisfying classification rate. To capture the transitions between windows, we chose a shift of 200 data points. Overall, we had 160 recordings, which resulted in 16000 windows across the training, validation and test sets. The split into training, validation and test data was made before creating the windows to ensure that all windows of a recording belong to the same data set.
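The windowing can be sketched with NumPy's `sliding_window_view`. The recording length of 21400 samples is inferred from the numbers above (100 windows per recording require (100 − 1) · 200 + 1600 samples); it is not stated explicitly in the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

WINDOW, SHIFT = 1600, 200

def make_windows(recording):
    """Slice one recording of shape (samples, axes) into overlapping windows."""
    views = sliding_window_view(recording, WINDOW, axis=0)[::SHIFT]
    return views.transpose(0, 2, 1)  # -> (n_windows, WINDOW, axes)

# Inferred length: 16000 windows / 160 recordings = 100 windows per recording,
# which requires (100 - 1) * 200 + 1600 = 21400 samples (~3.3 s at 6.4 kHz).
rec = np.zeros((21400, 3))
wins = make_windows(rec)
assert wins.shape == (100, 1600, 3)
```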

4 Results

4.1 Data Transformation

Figure 2 shows examples of raw data of the classes 3 and 7. In Fig. 2a, only the DC motor is running. In Fig. 2b, the DC motor and both servo motors are in use. Both plots look very similar due to the strong vibrations emitted by the DC motor. In Fig. 2b, a small distortion is visible on the y axis between data points 100 and 800. This may be caused by the servo motors but could also have an external origin.

Fig. 2. Raw data of the classes 3 and 7, given as acceleration in g over 1600 data points.

Fig. 3. Amplitude of the DFT with a length of 1600 for the classes 3 and 7.

The absolute values of the Fourier transforms of the time signals in Fig. 2 are shown in Fig. 3. The transformation into the frequency domain was applied over the whole time axis (1600 samples). As with the raw data, the dominance of the DC motor is visible: the high amplitudes are located in the low-frequency part of the spectrum. Additionally, Fig. 3b displays spectral components between the frequency bins 80 and 120 as well as between 200 and 300. The amplitudes of these frequencies are small compared to the ones caused by the DC motor.

Considering our goal of running this classification algorithm on a sensor with limited resources, we decided to reduce the window size, and thus the number of input values, to 1200, 800 and 400 samples, respectively, by cropping the windows before applying the DFT.
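A sketch of this preprocessing variant: each window is cropped to its first n samples before the amplitude spectrum is computed (cropping from the start of the window is our assumption).

```python
import numpy as np

def cropped_spectrum(window, n):
    """Crop a raw window to its first n samples, then take the
    amplitude of the DFT, reducing the model's input size to n."""
    return np.abs(np.fft.fft(window[:n], axis=0))

rng = np.random.default_rng(2)
window = rng.standard_normal((1600, 3))
for n in (1200, 800, 400):
    assert cropped_spectrum(window, n).shape == (n, 3)
```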

In Fig. 4, four examples of the spectrum of class 2 are shown. They reveal significant differences in the magnitudes of the amplitudes. Although the plots show a similar pattern, the maximum magnitude of the plot in the top right is more than double that of the bottom two plots. Also, a spike around frequency bin 5 is visible in the plots on the left, which is missing in the plots on the right.

Fig. 4. Examples of the spectra occurring in class 2.

4.2 Training and Evaluation

The hyperparameter search process indicates that decreasing the number of input values makes it harder to find a stable architecture. Especially for the number of convolutional layers, we observed that the smaller the input size, the more the results varied between runs. We decided to use one architecture for all input sizes except the raw data set, because preliminary investigations showed that the architecture found for 1600 samples also reached high accuracies for the other input sizes. The architecture is shown in Fig. 5. With five conv blocks, it uses the maximum number. Conv blocks one and three use 64 filters, the maximum possible number, whereas conv blocks four and five use the minimum of four filters. Conv block two uses the default of eight filters. With a kernel size of three, only conv block one uses a smaller kernel than the default size of 27; the remaining blocks use kernel sizes of 35, 59, 59 and 35, respectively, so the maximum kernel size of 59 is reached in the third and fourth conv block. All conv blocks use max pooling, except conv block two, which uses average pooling. After the conv blocks, a global average pooling layer is used.

Fig. 5. Architecture found by the optimization algorithm.
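As a plausibility check, the found architecture can be traced numerically. We assume three input channels (the sensor's three axes) and that each pooling layer halves the resolution, and we count only the Conv1D weights; batch normalization and the dense layers are omitted.

```python
# Filters and kernel sizes per conv block, as described in the text
filters = [64, 8, 64, 4, 4]
kernels = [3, 35, 59, 59, 35]

length, channels, params = 1600, 3, 0
for f, k in zip(filters, kernels):
    params += k * channels * f + f  # Conv1D weights + biases
    channels = f
    length //= 2                    # each pooling layer halves the length

assert length == 50        # 1600 -> 800 -> 400 -> 200 -> 100 -> 50
assert channels == 4       # global average pooling then sees 4 feature maps
assert params == 64_512    # conv parameters only, under these assumptions
```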

The classification results achieved with this architecture are listed in Table 2. The experiment with raw data in row one reached the lowest mean classification accuracy with 86.20%. Additionally, it has the highest standard deviation of all trials with 6.11%. The data shows that the highest mean classification accuracy of 94.07% is achieved when a Fourier transform with 1600 samples is applied before training the model. Reducing the number of samples to 1200, 800 and 400 leads to a steady decline in classification accuracy with 92.98%, 90.13% and 87.33%, respectively. Although the lowest standard deviation is reached with 1200 samples, it comes with a classification rate that is 1.09 percentage points lower than the mean accuracy with 1600 samples. The results also show that even with a DFT length of 400 samples, the classification accuracy is 1.13 percentage points higher and the standard deviation 3.1 percentage points lower than with no transformation beforehand.

Table 2. Mean accuracy and standard deviation of the test accuracy for ten runs each with raw data and Fourier transformed data with a DFT length of 1600, 1200, 800 and 400 samples.

4.3 Discussion

The results show that it is feasible to simultaneously monitor several vibration-emitting machine parts using only a single vibration sensor. Moreover, the plots in Fig. 4 suggest that simple thresholds cannot be used to classify the motors, which supports our decision to use neural networks instead.

Although the classification accuracy of the CNN model trained on raw data reached a satisfying level with an average of 86.20%, it also had a standard deviation of 6.11%. As shown in Fig. 2, this is likely due to the servo motors being masked by the DC motor, which dominates the shape of the time signal. As the motors exhibit different frequencies, we transformed the signal into the frequency domain to achieve better separability. With a DFT length of 1600, we achieved the highest classification accuracy of 94.07%. The steady decline of the accuracy when reducing the DFT length to 1200, 800 and 400 samples indicates that, with shorter windows, a window may lie completely in the pause between two movements of a servo motor; such an example then contains no vibrations from the servo motor despite being labeled as if it did.

The results also show that the model's greatest difficulty lies in distinguishing whether only the servo motor from class 2 or both servo motors are running. These misclassifications between classes 2 and 4 as well as between classes 6 and 7 represent 58.9% of all classification errors in the trial runs with 1600 samples.

Furthermore, when using hyperparameter optimization, we observed a tendency of the number of hyperparameters and the number of conv blocks to correlate with the input size. Whereas the decrease of the correct classification rate with less input data seems plausible, the average number of hyperparameters rose. As mentioned above, the optimization algorithm was not able to find a clear tendency for the number of conv blocks with a decreasing number of input values, which resulted in a lower average number of conv blocks. With larger inputs of 1200 or 1600 samples, the number of conv blocks tended towards the maximum of five.

Our use case was limited to monitoring three vibration-emitting machine parts. As this results in only eight possible combinations, we were able to use multiclass classification. However, this approach does not scale when more machine parts are in use. Therefore, we plan to introduce multi-task classification in the next step to achieve a scalable classification approach.

5 Conclusion

Our study shows that it is feasible to use a single sensor to monitor several vibration-emitting machine parts simultaneously. We reached an average correct classification rate of 94.07% by using a CNN model and transforming the data into the frequency domain beforehand. We also showed that using raw vibration data with a CNN model can produce uncertain results because of its dependency on the data distribution: although it reaches a high accuracy of 86.20%, it cannot match the results reached by the CNN models in combination with a Fourier transform. Reducing the input size proved not to be sensible due to considerably lower accuracy rates. With 1600 samples as input, however, the Fourier transform proved to be an effective preprocessing step to reach higher classification accuracies with CNN models. The results provide evidence that the classification of superposed vibration-emitting machine parts with a CNN model in combination with a Fourier transform is possible. Nonetheless, reducing the number of classification errors caused by similar vibration patterns, such as those of the servo motors, remains an issue for future research.