Introduction

The industrial Internet of Things (IoT), an implementation technology of Industry 4.0 [1], is used in manufacturing to control and monitor operations and processes: smart sensors detect the anomalous behaviors of machines, remotely control the input and output of each step in the process, and integrate physical production into interconnected networks.

Anomalous sound detection (ASD) is a smart data-driven technology at the edge of the IoT. Scientific methodology is used to identify the anomalous sound emitted from operational machines, and the detected warnings are sent to operators to mitigate the risk of breakdowns. For example, the modern textile industry uses a wide range of machines, especially massive heavy-duty industrial machines, e.g., woolen mill machines, thread winding machines, bleaching/dyeing machines, and scutching machines. The costs of detecting and fixing defects in those running machines in time are high, not only due to the expensive repair charges but also downtime. ASD, as an option for predictive maintenance technology, can detect fault conditions and automatically report them to operators in real time.

Beyond audio analysis for reducing maintenance costs, anomaly detection technology can also be applied to image, video and text analysis in traffic control, cybersecurity and forensics. For example, in the automotive industry, machine learning and artificial intelligence technology are adopted to recognize traffic lights using onboard sensors in vehicles to improve driving safety [2]. In utilities, AI-based smart sensors and anomaly detection methods are also widely used in traffic flow studies to improve the mobility of cities or crossing regions [3]. In cybersecurity, machine learning methods are applied to detect anomalies and reduce the vulnerability of sensors, e.g., in IoT-based smart grids (SGs) [4]. In forensics, anomaly detection with autonomous artificial intelligence is used to detect fraud or cybercrimes. AI-based anomaly detection technology is used to detect malicious and illicit events in the text analysis of posts in online social networks in dark web environments [5]. In addition, by analyzing the security logs of attacked servers, anomaly detection technology can help engineers trace threat-intention cyber behaviors and predict evidential locations [5]. Smart anomaly detection technology is a prominent approach in system automation and risk control in both industry and society.

However, ASD has become increasingly challenging in recent decades, despite the wide recognition of its importance in Industry 4.0. The major challenges in practice include the following:

Imbalanced training dataset. In practical applications, anomaly events are much rarer than long time series of normal data [6]. Such an imbalance between exhaustive continuous normal data and anomalous data in the training process significantly compromises the performance of popular ASD machine learning algorithms.

Stability of high performance. Maintaining a stable and highly accurate detection and prediction performance is another issue in real practice. Most deep learning algorithms, e.g., convolutional generative adversarial networks (GANs), can achieve high accuracy after sufficient training. However, the stability of the overall predictive performance is still a concern [7].

Hardcoded architecture. Differences in the background environments in which sounds are collected and in the types of sounds require different parameter settings in the algorithm. Manually selecting the parameters to reset the algorithm for each environment and each type of sound impairs the efficiency and accuracy of ASD.

Noise. On most occasions, the real environments in which sound data are collected are composed of multiple types of sound. Environmental noise is a traditional issue in audio studies [8].

High computation capability and computing cost requirements. Because of the high volume of the training dataset and the imbalance between normal and anomalous data, the algorithms applied in ASD, e.g., deep convolutional neural networks and generative adversarial networks, require one or more graphics processing units to process the data and generate good predictive results.

To resolve these issues in practice, the proposed algorithm integrates the dimension-reduction technology of incremental PCA with unsupervised DBSCAN. This algorithm is optimized with the automatic EPS calculation (AEC)-guided genetic algorithm to set the localized parameters for different test datasets [9]. The details of the algorithm are introduced in “Enhanced Incremental Principal Component Analysis-Based Density-Based Spatial Clustering of Applications with Noise”.

We extend our gratitude to Mr. Huang CS, who provided the audio files for the study, and the Department of Computer Science at the University of Hong Kong, who sponsored the study.

Enhanced Incremental Principal Component Analysis-Based Density-Based Spatial Clustering of Applications with Noise

Extraction of Acoustic Features

Acoustic features are used to represent and recognize a typical sound event or scenario computationally and to differentiate it from others. The input, discrete time-series audio data of machine sounds collected from the plant site, is converted from analog to digital, framed and partly labeled in the preprocessing stage; the acoustic features are then calculated from it by preset rules. These digital representations, or acoustic characteristics, capture the physical properties of the input audio data, for example, the signal energy, the tonality, the temporal shape and the spectral shape.

In recent decades, many different types of audio signal features have been proposed for sound recognition or description. Generally, the audio features can be categorized as either time domain or frequency domain. In the time domain, descriptors differ in their temporal extent: global descriptors are computed for the whole signal, whereas instantaneous descriptors are computed for each time frame. A time frame is a short segment of the signal with a regular duration. In this paper, the duration of the time frame is 20 ms. As the proposed study focuses on the signal analysis of time frames, we adopt the instantaneous features as the acoustic characteristics for machine learning [10]. In 2004, G. Peeters summarized a set of audio descriptors [11], including the temporal shape, temporal features, energy features, spectral shape features, and perceptual features. This paper adopts Peeters’ classification as the main method for extracting acoustic features to identify anomalous sounds. The descriptors for further machine learning processing include the following:

Temporal shape

Features (global or instantaneous) computed from the waveform or the signal energy (envelope). The attack time, temporal increase/decrease and effective duration are features of this category.

Frequency [12]

Frequency is one of the basic features when describing or recognizing audio signals. In contrast to the time domain, which calculates the distance between two domain samples, the frequency is used to calculate the period vibration of two frequency band index bins. In this paper, short-time Fourier transform (STFT)-based analysis is applied for linear frequency calculations of continuous audio signals.

Amplitude

The amplitude is a descriptor that represents the waveform shape with limited information. Similar to the processing steps of frequency, in this paper, the amplitude is calculated from the continuous signals after the STFT and is converted to the logarithmic decibel (dB) scale.
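The two STFT-based descriptors above can be illustrated with a minimal sketch. It assumes a Hann window, a hop of half a frame, and decibels measured relative to the largest magnitude; these choices and the helper names `stft_magnitude` and `amplitude_to_db` are assumptions of the sketch, not settings reported in this paper. The 882-sample frame corresponds to the 20 ms frame at a 44,100 Hz sample rate.

```python
import numpy as np

def stft_magnitude(x, sr, frame_len=882, hop=441):
    """Return (freqs_hz, magnitude) of a Hann-windowed STFT.

    frame_len=882 samples is 20 ms at 44,100 Hz; the half-frame hop
    is an assumption of this sketch.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)              # (n_frames, frame_len // 2 + 1)
    freqs_hz = np.fft.rfftfreq(frame_len, d=1.0 / sr)   # linear frequency axis in Hz
    return freqs_hz, np.abs(spectrum)

def amplitude_to_db(mag, floor=1e-10):
    """Convert linear magnitudes to decibels relative to the maximum."""
    mag = np.maximum(mag, floor)   # avoid log of zero
    return 20.0 * np.log10(mag / mag.max())

# Hypothetical test signal: a 1 kHz tone lasting one second.
sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
freqs_hz, mag = stft_magnitude(tone, sr)
db = amplitude_to_db(mag)
peak_hz = freqs_hz[mag.mean(axis=0).argmax()]   # frequency bin with the most energy
```

With these settings the frequency resolution is 50 Hz per bin, so the peak of the test tone falls on the 1 kHz bin.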

Temporal features

Autocorrelation coefficients [11]

The autocorrelation of a signal, as the inverse Fourier transform of the spectrum energy distribution of the signal, represents the signal spectral distribution in the time domain. Brown showed in 1998 that this descriptor is a valid description for classification. The formula is:

$$x_{corr} \left( k \right) = \frac{1}{{x\left( 0 \right)^{2} }}\mathop \sum \limits_{n = 0}^{N - k - 1} x\left( n \right) \cdot x\left( {n + k} \right)$$
(1)

Each coefficient is in the range of [−1, 1]. The faster the coefficients decrease with increasing lag, the whiter the signal is.
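A minimal sketch of the per-frame autocorrelation coefficients follows. As an assumption of this sketch, the sum in Eq. (1) is normalized by the zero-lag energy of the frame, which guarantees values in [−1, 1]; the function name and the default of 12 lags are illustrative.

```python
import numpy as np

def autocorr_coefficients(frame, n_lags=12):
    """Autocorrelation coefficients of one frame for lags 1..n_lags.

    Assumption: normalisation by the zero-lag energy sum(x^2), which
    bounds every coefficient to [-1, 1].
    """
    x = np.asarray(frame, dtype=float)
    energy = np.dot(x, x)
    if energy == 0.0:              # silent frame: no correlation structure
        return np.zeros(n_lags)
    # sum_{n=0}^{N-k-1} x(n) * x(n+k) for each lag k
    return np.array([np.dot(x[:len(x) - k], x[k:]) / energy
                     for k in range(1, n_lags + 1)])

# A periodic frame keeps large coefficients; white noise decays immediately.
n = np.arange(882)
periodic = np.sin(2 * np.pi * n / 20.0)          # period of 20 samples
noise = np.random.default_rng(0).standard_normal(882)
periodic_coeffs = autocorr_coefficients(periodic, n_lags=20)
noise_coeffs = autocorr_coefficients(noise)
```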

Zero-crossing rate [12]

The zero-crossing rate is a low-level feature used to describe the number of changes in signal values when crossing the zero axis. The concept assumes that the arithmetic mean of the audio signals is zero. The higher the zero-crossing rate is, the more high-frequency content there is, and the less periodic the audio signals are assumed to be.
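The zero-crossing rate can be sketched in a few lines. The mean is removed first so that the zero-mean assumption stated above holds; expressing the result as a fraction of adjacent sample pairs is a convention chosen for this sketch.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()                       # enforce the zero-mean assumption
    negative = np.signbit(x)
    return np.count_nonzero(negative[1:] != negative[:-1]) / (len(x) - 1)

# One 20 ms frame at 44,100 Hz: a 5 kHz tone crosses zero far more often
# than a 100 Hz tone.
t = np.arange(882) / 44100.0
zcr_low = zero_crossing_rate(np.sin(2 * np.pi * 100.0 * t))
zcr_high = zero_crossing_rate(np.sin(2 * np.pi * 5000.0 * t))
```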

Spectral shape features

Onset envelope

Onset is the percept related to the time a sound takes to start. The onset envelope is computed as a spectral flux onset strength envelope. The spectral flux measures the amount of change in the spectral shape as the average difference between consecutive STFT frames. The onset strength at frame n is determined by:

$$\sum_{m=1}^{M}H\left({X}_{log,filt}\left(n,m\right)-{X}_{log,filt}^{max}\left(n-\mu ,m\right)\right)$$
(2)

where \(H(x)=\frac{x+\left|x\right|}{2}\) is the half-wave rectifier, m indexes the M frequency bins, μ is the frame lag, and the reference \({X}_{log,filt}^{max}\left(n-\mu ,m\right)\) is the logarithmically scaled filtered spectrogram \({X}_{log,filt}(n,m)\) after local max filtering along the frequency axis [13].

Onset is correlated with the logarithm of the attack time [14].
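Equation (2) can be sketched as follows for a spectrogram that has already been filtered and logarithmically scaled. The three-bin maximum filter, the edge padding and the function name are assumptions of this sketch; the rectifier is written as max(0, x), which equals (x + |x|)/2.

```python
import numpy as np

def onset_strength(log_spec, lag=1, max_size=3):
    """Spectral-flux onset strength per frame.

    log_spec: array (n_frames, n_bins) holding X_log,filt(n, m).
    max_size: odd width (in bins) of the local maximum filter.
    """
    n_frames, n_bins = log_spec.shape
    half = max_size // 2
    padded = np.pad(log_spec, ((0, 0), (half, half)), mode="edge")
    # Local maximum along the frequency axis gives the reference X^max.
    ref = np.max(np.stack([padded[:, i:i + n_bins] for i in range(max_size)]),
                 axis=0)
    diff = log_spec[lag:] - ref[:-lag]              # X(n, m) - X^max(n - lag, m)
    flux = np.maximum(diff, 0.0).sum(axis=1)        # half-wave rectify, sum over bins
    return np.concatenate([np.zeros(lag), flux])    # first `lag` frames have no reference

# Hypothetical spectrogram: silence for five frames, then a sustained sound.
spec = np.vstack([np.zeros((5, 4)), np.ones((5, 4))])
envelope = onset_strength(spec)
```

Only the frame where the energy rises produces a nonzero value; a later drop in energy is suppressed by the rectifier.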

Spectral centroid [12]

The spectral centroid represents the center of gravity (COG) of spectral energy. It is calculated as the frequency-weighted sum of the spectrum normalized by its unweighted sum:

$$v_{SC} \left( n \right) = \frac{{\mathop \sum \nolimits_{k = 0}^{\frac{K}{2}} k \cdot \left| {X\left( {k,n} \right)} \right|^{2} }}{{\mathop \sum \nolimits_{k = 0}^{\frac{K}{2}} \left| {X\left( {k,n} \right)} \right|^{2} }}$$
(3)
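Equation (3) translates directly into array operations. The sketch below returns the centroid as a bin index; returning 0 for silent frames is a convention chosen here, not one stated in the text.

```python
import numpy as np

def spectral_centroid(power_spec):
    """Centre of gravity of each frame, in units of frequency bins.

    power_spec: array (n_frames, n_bins) holding |X(k, n)|^2.
    """
    k = np.arange(power_spec.shape[1])
    total = power_spec.sum(axis=1)
    safe_total = np.where(total > 0, total, 1.0)   # silent frames give centroid 0
    return (power_spec * k).sum(axis=1) / safe_total

frames = np.zeros((3, 16))
frames[0, 7] = 2.0                   # all energy in bin 7
frames[1, 2] = frames[1, 4] = 1.0    # equal energy in bins 2 and 4
centroids = spectral_centroid(frames)   # third frame is silent
```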

Spectral roll-off [12]

The spectral roll-off measures the bandwidth of the analyzed block n of the audio samples. The spectral roll-off point is the frequency bin i at which the accumulated magnitudes of the STFT X(k, n) reach a fraction κ of the overall sum of magnitudes:

$$v_{SR} \left( n \right) = i\,\Big|_{\;\mathop \sum \nolimits_{k = 0}^{i} \left| {X\left( {k,n} \right)} \right| = \kappa \cdot \mathop \sum \nolimits_{k = 0}^{\frac{K}{2}} \left| {X\left( {k,n} \right)} \right|}$$
(4)

A common value for κ is 0.85 (85%). The spectral roll-off range is [0, K/2], with K the STFT length as in Eq. (3).
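A minimal sketch of the roll-off computation follows. The fraction of the total magnitude (0.85 in the text) is called `kappa` here, and the roll-off is taken as the smallest bin whose cumulative magnitude reaches that fraction; both the name and the "smallest bin" reading are assumptions of this sketch.

```python
import numpy as np

def spectral_rolloff(mag_spec, kappa=0.85):
    """Roll-off bin per frame.

    mag_spec: array (n_frames, n_bins) holding |X(k, n)|.
    Returns the smallest bin index whose cumulative magnitude reaches
    kappa times the frame total.
    """
    cumulative = np.cumsum(mag_spec, axis=1)
    threshold = kappa * cumulative[:, -1:]
    return np.argmax(cumulative >= threshold, axis=1)   # first True per row

flat = np.ones((1, 100))      # flat spectrum: 85% is reached at bin 84
single = np.zeros((1, 100))
single[0, 10] = 1.0           # all magnitude in bin 10
rolloff_flat = spectral_rolloff(flat)
rolloff_single = spectral_rolloff(single)
```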

Mel-frequency cepstral coefficients (MFCC) [15]

The MFCC is defined as the compact description of the shape of the spectral envelope of an audio signal. It is calculated by applying the discrete cosine transform (DCT) to the logarithm of the mel-scaled spectrum, which is itself obtained from a Fourier transform (e.g., FFT). Since the MFCC was introduced in 1980, it has proven to be a valid measurement for audio signal classification that retains the principal information. In our approach, the number of coefficients is 20.

Other feature categories

Intensity

Intensity is a physical and measurable entity that is related to human perception of the magnitude of an audio signal. In this category, most features are instantaneous features, such as the root mean square and root mean square energy.

  • Root mean squared energy. The RMS energy (RMSE) can be calculated either directly from the audio samples or from a spectrogram. Calculating it from the audio samples is faster because it does not require STFT processing. It outputs the RMS of each frame. In this paper, we calculate the RMSE only directly from the audio signals.
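The frame-wise RMS computed directly from the audio samples can be sketched as follows. The 882-sample frame matches the 20 ms frame at 44,100 Hz; the half-frame hop and the function name are assumptions of this sketch.

```python
import numpy as np

def rms_energy(x, frame_len=882, hop=441):
    """Root mean square of each frame, computed directly from the samples."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.sqrt(np.mean(frames ** 2, axis=1))

# A unit-amplitude sine has an RMS of 1/sqrt(2); a constant 0.5 has an RMS of 0.5.
t = np.arange(44100) / 44100.0
rms_tone = rms_energy(np.sin(2 * np.pi * 1000.0 * t))
rms_const = rms_energy(np.full(44100, 0.5))
```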

Derived features
  • Tempogram [16]. A tempogram describes the tempo, i.e., the speed or pace of a given piece, which is usually measured in beats per minute (bpm). It is derived from the local autocorrelation of the onset strength envelope, computed for time t ϵ Z and time lag \(l\) ϵ [0, N], where W denotes a window function W: Z -> \({\mathbb{R}}\) centered at t = 0 with support [-N: N], N ϵ \({\mathbb{N}}\).

Enhanced Incremental Principal Component Analysis

Principal component analysis (PCA) is a classical multivariate statistical method for linear dimension reduction. It was introduced by Pearson as early as 1901 and by Hotelling in the 1930s. As an unsupervised algorithm, the principle of PCA is to seek the subspace of the largest variance in the dataset. In 1982, a neural network implementation of one-dimensional PCA based on Hebbian learning was introduced by Oja, and in 1989, it was expanded to hierarchical, multidimensional PCA by Sanger [17].

The enhanced incremental algorithm is based on the sequential Karhunen–Loeve (SKL) algorithm of Levy and Lindenbaum (2000) [18]. The computational advantages of the SKL algorithm are that it updates the original eigenspace and mean continuously with the learning rate and that the space and computational requirements are reduced to \(O(d(k+m))\) and \(O\left({dm}^{2}\right)\), respectively, so that both remain constant in the number n of samples already processed. The disadvantage is that it does not account for the change in the sample mean of the training data when new data are added. To resolve this issue, the enhanced incremental PCA appends an additional vector to the new training data to correct the time-varying mean [19].
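The mean-corrected update can be illustrated with a simplified sketch. It assumes samples are stored as columns, omits any forgetting factor, and takes a direct SVD of the concatenated matrix instead of the cheaper orthogonalization step of the SKL algorithm, so it shows the mean-correction idea rather than the complexity figures quoted above. The scaling of the extra column, sqrt(nm/(n+m)), follows from the standard identity for combining two scatter matrices and is assumed to be the correction the text refers to; `ipca_update` is a hypothetical name.

```python
import numpy as np

def ipca_update(U, S, mean, n, block, k):
    """One incremental PCA update with mean correction.

    U: (d, r) current basis, S: (r,) singular values, mean: (d,) running mean,
    n: number of samples seen so far, block: (d, m) new samples as columns,
    k: number of components to keep.
    """
    m = block.shape[1]
    block_mean = block.mean(axis=1)
    new_mean = (n * mean + m * block_mean) / (n + m)
    # Extra column that accounts for the shift between the old and new means.
    correction = np.sqrt(n * m / (n + m)) * (block_mean - mean)
    augmented = np.hstack([block - block_mean[:, None], correction[:, None]])
    # The SVD of [U diag(S), augmented] yields the updated eigenspace.
    U_new, S_new, _ = np.linalg.svd(np.hstack([U * S, augmented]),
                                    full_matrices=False)
    return U_new[:, :k], S_new[:k], new_mean, n + m

# Feed 60 hypothetical samples in three blocks of 20.
rng = np.random.default_rng(0)
data = rng.normal(size=(5, 60)) + np.arange(5)[:, None]
U, S, mean, n = np.zeros((5, 0)), np.zeros(0), np.zeros(5), 0
for start in range(0, 60, 20):
    U, S, mean, n = ipca_update(U, S, mean, n, data[:, start:start + 20], k=5)

# Reference: singular values of the centred data computed in one batch.
batch_S = np.linalg.svd(data - data.mean(axis=1, keepdims=True), compute_uv=False)
```

When all components are kept, as in this usage example, the incremental singular values agree with the batch result up to rounding; keeping fewer components makes the update an approximation.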

In this paper, the input parameter of the enhanced IPCA, the number of components, is selected by a genetic algorithm based on the most optimized historical results of different machine types, which will be introduced in detail in “Automatic EPS Calculation (AEC)—Guided Genetic Algorithm”.

Automatic EPS Calculation (AEC)—Guided Genetic Algorithm

The genetic algorithm (GA) belongs to the family of global stochastic search algorithms, which also includes other evolutionary algorithms, particle swarm optimization and other bio-inspired search methods; it is applied here for the selection of wrapper features [20]. Despite its capability for global search, the computational cost, which increases exponentially with each candidate parameter, restricts the efficiency of the GA. Therefore, a local optimization constraint is added to resolve this issue.

Automatic EPS calculations (AECs) on randomly selected training datasets are used to set the baseline for the initial range of estimated values of the candidate parameters. The wrapper parameters to be calculated in the guided genetic algorithm include the number of components for IPCA, the optimal epsilon value and the MinPts for DBSCAN. The AEC algorithm estimates the EPS and MinPts based on the density of the randomly selected training datasets and the distances between the points in the density region. In the proposed AEC algorithm, the densities are calculated by a Gaussian kernel after the training dataset is scaled by MinMax. Similarly, the distances are calculated by a KD-Tree query on the MinMax-scaled training dataset. The estimated EPS and the estimated MinPts are taken as the minimum values over all clusters. The range of the estimated number of components is set between 2 and 10 [21].
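The ingredients of the AEC step (MinMax scaling, Gaussian-kernel densities and neighbor distances) can be illustrated with a rough sketch. It is not the AEC algorithm itself: brute-force pairwise distances stand in for the KD-Tree query, and the rules that turn distances and neighbor counts into starting values (mean k-th neighbor distance for EPS, median neighbor count for MinPts), the bandwidth, and all names are assumptions made only for illustration.

```python
import numpy as np

def minmax_scale(X):
    """Scale every feature (column) to the range [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # constant columns stay at 0
    return (X - lo) / span

def aec_estimates(X, k=4, bandwidth=0.1):
    """Illustrative starting values for EPS and MinPts (assumed rules)."""
    Z = minmax_scale(np.asarray(X, dtype=float))
    # Brute-force pairwise distances (stand-in for a KD-Tree query).
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    # Unnormalised Gaussian-kernel density at every point.
    density = np.exp(-0.5 * (dists / bandwidth) ** 2).mean(axis=1)
    # Distance to the k-th nearest neighbour (column 0 is the point itself).
    kth = np.sort(dists, axis=1)[:, k]
    eps = float(kth.mean())
    neighbours = (dists <= eps).sum(axis=1)   # counts include the point itself
    min_pts = max(2, int(np.median(neighbours)))
    return eps, min_pts, density

# Hypothetical data: two tight clusters and one isolated point.
rng = np.random.default_rng(1)
cluster_a = rng.normal([0.0, 0.0], 0.05, size=(30, 2))
cluster_b = rng.normal([1.0, 1.0], 0.05, size=(30, 2))
X = np.vstack([cluster_a, cluster_b, [[0.5, 3.0]]])
eps, min_pts, density = aec_estimates(X)
```

In this toy example the isolated point receives the lowest density, which is the property a density-based estimate of MinPts relies on.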

The three locally optimized parameters are input as the baseline to set the range of values of the candidate parameters. Seven genes are used to construct the chromosome: the predicted value, the actual value, the difference between the predicted and the actual values, the mean squared error (MSE), the candidate number of components, the candidate EPS and the candidate MinPts. The fitness evaluation sets the reward value to 1 if the MSE is less than the target value of 0.4. Only the rewarded chromosomes form the population for crossover and mutation, which generate a new generation of populations with the preset crossover probability and specific mutation power [20].
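The reward, selection, crossover and mutation steps described above can be sketched as follows. The seven genes and the 0.4 target come from the text; the crossover and mutation operators, the default probability and power, and the `toy_evaluate` function (which replaces the real IPCA + DBSCAN evaluation with a made-up error model) are hypothetical choices for illustration only.

```python
import random
from dataclasses import dataclass

TARGET_MSE = 0.4  # target value given in the text

@dataclass
class Chromosome:
    predicted: float
    actual: float
    difference: float
    mse: float
    n_components: int
    eps: float
    min_pts: int

def reward(c):
    """Fitness: reward 1 if the MSE is below the target, otherwise 0."""
    return 1 if c.mse < TARGET_MSE else 0

def crossover(a, b, rng, p_cross):
    """With probability p_cross, take EPS and MinPts from the second parent."""
    if rng.random() < p_cross:
        return a.n_components, b.eps, b.min_pts
    return a.n_components, a.eps, a.min_pts

def mutate(params, rng, power):
    """Perturb each parameter; `power` bounds the relative change of EPS."""
    n_components, eps, min_pts = params
    n_components = min(10, max(2, n_components + rng.choice([-1, 0, 1])))
    eps = max(1e-6, eps * (1.0 + rng.uniform(-power, power)))
    min_pts = max(2, min_pts + rng.choice([-1, 0, 1]))
    return n_components, eps, min_pts

def next_generation(population, evaluate, rng, p_cross=0.8, power=0.1):
    """Breed a new population from the rewarded chromosomes only."""
    parents = [c for c in population if reward(c) == 1]
    if len(parents) < 2:          # not enough rewarded parents to breed
        return population
    children = []
    for _ in range(len(population)):
        a, b = rng.sample(parents, 2)
        children.append(evaluate(*mutate(crossover(a, b, rng, p_cross), rng, power)))
    return children

def toy_evaluate(n_components, eps, min_pts):
    """Hypothetical stand-in for running IPCA + DBSCAN and scoring the result."""
    predicted, actual = 10.0 * eps, 1.0
    difference = predicted - actual
    return Chromosome(predicted, actual, difference, difference ** 2,
                      n_components, eps, min_pts)

rng = random.Random(0)
population = [toy_evaluate(3, e, 4) for e in (0.05, 0.10, 0.12, 0.30)]
children = next_generation(population, toy_evaluate, rng)
```

In the usage example the chromosome with an EPS of 0.30 fails the reward test, so it contributes no genes to the next generation.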

Density-Based Spatial Clustering of Applications with Noise

DBSCAN was proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. As a density-based clustering algorithm, DBSCAN treats clusters as dense regions that are separated by low-density regions [22]. Because its density parameters are defined globally, DBSCAN can identify global anomalies around dense clusters of arbitrary shape but can fail to identify local anomalies. There are two main advantages of DBSCAN over other unsupervised ML algorithms. First, DBSCAN does not require the number of clusters as an input parameter; it finds clusters of arbitrary shape by itself. Second, DBSCAN can handle noise points. With these two advantages, DBSCAN performs well when training on and predicting large-volume and unbalanced datasets.

In DBSCAN, for any arbitrary object p belonging to dataset D, as shown in Fig. 1, the algorithm retrieves all objects that are density-reachable from p with respect to the ε and MinPts values [22]. There are three scenarios for any object p: it is a core object of a cluster if at least MinPts objects q in dataset D lie within a distance ε of p; it is a border object if it has fewer than MinPts objects within ε but is density-reachable from a core object; it is a noise object if it does not belong to any cluster. The algorithm continues until every object has been assigned to a cluster or to the noise group.

Fig. 1 Three scenarios of DBSCAN: core, border and noise points
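The three scenarios can be reproduced with a compact DBSCAN sketch. It uses brute-force distances and counts a point as its own neighbor; both are conventions chosen for this illustration, and the example coordinates are made up.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN. Returns (labels, is_core); label -1 marks noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    # Core objects have at least min_pts objects (including themselves) within eps.
    is_core = np.array([len(nb) >= min_pts for nb in neighbours])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = cluster
        stack = [i]                       # only core objects are ever pushed
        while stack:
            p = stack.pop()
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster   # q is density-reachable from p
                    if is_core[q]:
                        stack.append(q)   # border objects are not expanded
        cluster += 1
    return labels, is_core

# Four core points on a square, one border point next to it, one noise point.
points = [[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.25, 0.1], [2.0, 2.0]]
labels, is_core = dbscan(points, eps=0.16, min_pts=4)
```

The fifth point joins the cluster without being a core object (border), and the last point keeps the label -1 (noise).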

In the hybrid algorithm proposed in this study, automatic EPS calculation (AEC) is adopted to estimate the EPS based on the average distance between the points of the training dataset, and MinPts is based on the kernel density of the training dataset, which includes the extracted acoustic features of the audio files. The assumption of the experiment is that the frames of the normal and anomalous files have significantly different density characteristics so that they can be easily differentiated by the hybrid algorithm with reduced dimensions.

Experiments

Dataset and Preprocessing

The data were collected from machines in a plant in Suzhou City, China. The data consist of the normal/anomalous sounds of real machines. Each recording is a single-channel, 2-s-long audio clip containing both a target machine's operating sound and environmental noise. The sample rate was 44,100 Hz. The audio files for the experiments can be downloaded via the web link (sharontan6217/asd (github.com)).

In the experiment, the training dataset includes unlabeled normal and anomalous datasets, in which 190 files are randomly selected from 228 normal audio files and 20 files from 120 abnormal data files, 50 consecutive times. The test dataset includes 20 unlabeled abnormal audio files. Ten acoustic features, e.g., frequency and amplitude, from the audio files are extracted as the components for clustering.

Benchmark System and Results

A deep convolutional generative adversarial network (DCGAN) is adopted as the benchmark for the experiment. The DCGAN is a deep convolutional neural network architecture composed of a pair of adversarial models called the generator and the discriminator [23, 24]. The generator creates fake samples from a noise vector as input to the discriminator. The discriminator separates the real and fake data distributions with certain policies. The details of the parameters are listed in Table 1.

Table 1 Parameters of the DCGAN

Table 2 shows the results of the benchmark experiments. The benchmark algorithm of the DCGAN achieves an accuracy of 0.7. However, the average execution time of the DCGAN is 90 min with two GPUs. The computational cost of DCGANs is relatively high compared with that of classical machine learning algorithms.

Table 2 Experimental results of the DCGAN

Training Process

The algorithm first extracts acoustic features from audio files collected in a real manufacturing environment. After MinMax scaling, the normalized acoustic feature data are loaded into the optimization layer, which calculates the parameters used to construct the incremental principal component analysis for dimension reduction and the DBSCAN clustering algorithm that detects the anomalous sound files (see Fig. 2).

Fig. 2 Using self-adaptive IPCA-based DBSCAN to detect anomalous sound data

During training, when optimizing the parameters via the AEC-guided genetic algorithm, the ranges for defining each parameter are based on the number of generations and on the EPS and MinPts calculated via the AEC algorithm. The genetic algorithm selects the optimized parameters for which the compiled loss measure is less than the preset target value. Only the parameters selected by the guided genetic algorithm during training are loaded to construct the dimension-reduction layer and the clustering algorithm that predict anomalous sound on the test dataset.

Algorithm: AEC-guided genetic algorithm with IPCA-based DBSCAN

Results and Analysis

Table 3 shows a summary of the test results. According to Table 3, the performances of the AEC-guided GA and IPCA + DBSCAN are acceptable, with an average fitness of 0.843 and an average MSE of 0.16. The average execution time is less than 0.5 min for a total data size of 202,860,000 (training dataset: 185,220,000, test dataset: 17,640,000).

Table 3 Summary of performance evaluation

Figure 3 shows a sample of the prediction performance of the normal audio data changing to anomalous audio data. The normal class is set as “0”, and the anomalous class is set as “1”. The red line is the predicted clustering class, and the blue line is the actual class. The results show that the AEC-guided GA and IPCA-based DBSCAN models predict the turning point with high accuracy, and the AUC is 0.95.

Fig. 3 ROC curves of IPCA-based DBSCAN algorithms

Figure 4 shows the IPCA-based DBSCAN clustering results for a sample. With the optimized parameters, the normal and anomalous sound data are clearly clustered into two groups.

Fig. 4 Clustering of IPCA-based DBSCAN algorithms

Table 4 shows the AUC, NMI, and F1 measure comparisons among 6 unsupervised and semisupervised machine learning or deep learning algorithms: K-means++, one-class SVM, agglomerative clustering, DCGAN, DCNN-Autoencoder, and AEC-Guided GA and IPCA + DBSCAN.

Table 4 AUC comparison between unsupervised and semisupervised ML algorithms

The experimental results show that both the AEC-guided genetic algorithm and IPCA-based DBSCAN for the extracted acoustic features and the DCNN-autoencoder for the audio data show the highest accuracy, with average AUCs of 0.843 and 0.8188, respectively. However, for the stability measures, the AEC-guided GA and IPCA-based DBSCAN of extracted acoustic features show the highest stability among all six semisupervised or unsupervised algorithms [25], with the lowest Hamming loss of 0.16 and the highest Spearman rank correlation coefficient of 0.72.

Figure 5 shows the ROC curves of the six semisupervised and unsupervised machine learning algorithms. From the graph, it can be observed that the AEC-guided GA and IPCA-based DBSCAN algorithm applied to the extracted acoustic features reaches the highest AUC value of 0.95, while the DCNN-AE and DCGAN algorithms achieve lower AUCs of 0.84 and 0.719864, respectively. The performances of agglomerative clustering and k-means++ are the worst, at 0.65 and 0.60, respectively.

Fig. 5 ROC curves of the unsupervised and semisupervised ML/DL algorithms

Noise Tolerance Test

Another series of experiments is conducted to test the maximum noise tolerance of the AEC-guided genetic algorithm and IPCA-based DBSCAN. Based on the experimental results, the performance of the algorithm is impacted when the SNR is 13.0103 dB (SNR = 10*log10(1/0.05)), in which 0.05 is the noise significant factor [26]. This is because DBSCAN detects and filters isolated noise outliers rather than a continuous noise pattern added to the clean audio sample. This is the disadvantage observed when it is applied to lab experiments.
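The SNR figure quoted above can be checked with a one-line calculation. The sketch reads the 0.05 factor as a noise-to-signal power ratio, which is how the formula in the text treats it; the function name is illustrative.

```python
import math

def snr_db(noise_factor):
    """SNR in decibels for a given noise-to-signal power ratio."""
    return 10.0 * math.log10(1.0 / noise_factor)

snr = snr_db(0.05)   # 10 * log10(20)
```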

Comparison of the Hardcoded Architecture and Parameterized Architecture

In the experiments to compare the hardcoded architecture and the parameterized architecture, it is observed that the parameterized architecture requires less execution time and achieves high accuracy. In this experiment, the hardcoded architecture is set to an EPS of 0.07 and a MinPts of 2. The experimental results of 50 random test cases shown in Table 5 indicate that although the AUC of the hardcoded architecture is 0.82, the stability indicators, including the Jaccard score and Spearman rank correlation coefficient, are significantly lower than those of the parameterized architecture. Therefore, the performance of the hardcoded architecture is not as satisfactory as that of the parameterized architecture.

Table 5 Comparison between the hardcoded architecture and the parameterized architecture with 50 random test cases
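Two of the stability indicators used in the comparisons, the Hamming loss and the Jaccard score, can be sketched for binary labels (0 = normal, 1 = anomalous). The sketch assumes that the indicators are computed on per-item binary predictions, which the text does not state explicitly; the label vectors are made up.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of labels that are predicted incorrectly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def jaccard_score(y_true, y_pred):
    """Intersection over union of the positive (anomalous) class."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    union = np.logical_or(y_true, y_pred).sum()
    if union == 0:                 # no positives anywhere: perfect agreement
        return 1.0
    return float(np.logical_and(y_true, y_pred).sum() / union)

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
loss = hamming_loss(y_true, y_pred)
jaccard = jaccard_score(y_true, y_pred)
mse = float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```

For 0/1 labels the mean squared error coincides with the Hamming loss, because every squared difference is either 0 or 1.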

Conclusions and Future Work

The hybrid algorithm that integrates the AEC-guided genetic algorithm and IPCA with DBSCAN for anomalous sound detection seems to be a promising direction for ASD when handling different environmental issues and different types of audio files. Notably, when detecting rare events in multiple scenes (including silence and background sounds), the proposed unsupervised algorithm did not perform as well as it did on the machine sounds. This is possibly due to the quality of the collected sound because we used high-quality equipment to collect the machines’ sounds at the specific plant site. Future research will improve the noise tolerance of the algorithm for environments with mixed sounds.