Approach
In order to generate representative data of printing operations, we equipped a 3D printerFootnote 2 with two types of microphones. Four miniature condenser vibration pickupsFootnote 3 were mounted to the four stepper motors responsible for the X-, Y-, and Z-axis movements of the extruder as well as for conveying the printing material (filament). Another two pickups of the same type were fixed to the connecting part between the guidance rods of the X- and Y-axes and to the filament spool holder. Additionally, two hypercardioid instrument microphonesFootnote 4 with magnetic mounts were placed on the printer frame (Fig. 1).
For printing, we adjusted a given 3D model of a hollow cube without top and bottomFootnote 5 to a volume of 2 cm³. We then printed the cube three times under normal conditions. Thereafter, we induced three faults in the printer that are not unusual during printing operations and may affect print quality or lead to complete failure. For the first of the three printing errors, the grease on the guidance rods of the X-axis was removed. To create the second error condition, the screws of the extruder fan were loosened. For the third error, we reduced the tension of the hobbed bolt of the extruder, which causes issues with the conveying of the filament.
All printing operations were recorded on eight audio channels. For reasons of reproducibility, a synchronized video of the operations was additionally captured. The recordings were manually edited, labeled, and prepared for data analysis.
An informal evaluation of the recorded material conducted among the authors revealed that the error states of the printer cannot be aurally distinguished from the normal operation state. Given the authors’ expertise and experience in the fields of music and sound engineering, and their nevertheless unsuccessful attempts to tell the different printing states apart by ear, we refrained from further listening experiments with uninvolved subjects.
Data analysis
The analysis aimed at finding an appropriate method to retrieve the recorded printing states (error condition vs. normal operation) from the audio data. This method had to fulfill three main requirements:
1. Verify that the information (acoustic cues) is contained in the data.
2. Provide insight into where or how the information is contained in the recorded data.
3. Provide preferably low-dimensional data to keep the complexity of the data sonification as low as possible.
Using the raw spectral data of all microphones would have resulted in a high-dimensional (8 × frame size) input vector for the machine learning algorithm. In order to reduce data complexity, we therefore performed feature extraction and feature selection first. We built a suitable training set by framing the audio data of all recordings using a frame size of 65,536 (= 2¹⁶) samples and a hop size of 4096 samples. This rather large frame size facilitates a high frequency resolution as a basis for further processing. From the obtained spectral data, the following features were chosen for their general acceptance in audio machine learning applications: five MFCCs, root mean square (cf. [12]), spectral bandwidth, spectral centroid, and spectral roll-off. Feature calculation was based on the libROSA Python package [42]. Unlike [13], we did not run machine learning algorithms on the complete data generated via spectral analysis, but rather performed feature extraction and selectionFootnote 6 to achieve a quicker convergence of the machine learning algorithm. In this way, automatic feature selection also provided a means of gaining insight into the data.
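As an illustration, a minimal sketch of how such per-frame features could be computed for a single microphone channel with libROSA follows; the sample rate and the file handling are assumptions, as they are not specified above.

```python
import numpy as np
import librosa

FRAME = 2 ** 16   # 65,536 samples per analysis frame
HOP = 4096        # hop size between frames

def extract_features(path, sr=48000):
    """Per-frame features for one microphone channel (sketch)."""
    y, sr = librosa.load(path, sr=sr, mono=True)

    # Five MFCCs per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=5, n_fft=FRAME, hop_length=HOP)

    # Root mean square, spectral bandwidth, centroid, and roll-off
    rms = librosa.feature.rms(y=y, frame_length=FRAME, hop_length=HOP)
    bw = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=FRAME, hop_length=HOP)
    cent = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=FRAME, hop_length=HOP)
    roll = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=FRAME, hop_length=HOP)

    # Stack to shape (n_frames, n_features) for this channel
    return np.vstack([mfcc, rms, bw, cent, roll]).T
```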
As a result of the analysis, we obtained a table containing 18,760 labeled observations (audio frames) with 64 audio features (8 microphones × 8 audio features). The gathered dataset was subsampled to obtain a balanced distribution of 50% error states and 50% states of normal operation. For automatic feature selection, the chi-squared test (χ²) [43] was chosen for its generality, simplicity, and effectiveness [44]. The 15 most relevant features of all recordings were selected as input to the network model (cf. Fig. 2).
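A possible realization of the χ² feature selection with scikit-learn is sketched below; the placeholder data and the rescaling to non-negative values are assumptions, since the exact preprocessing is not detailed above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the real table of 18,760 labeled frames with 64 features
rng = np.random.default_rng(0)
X = rng.normal(size=(18760, 64))
y = rng.integers(0, 2, size=18760)   # 0 = normal operation, 1 = error state

# The chi-squared test requires non-negative inputs, so rescale to [0, 1] first
X_scaled = MinMaxScaler().fit_transform(X)

selector = SelectKBest(score_func=chi2, k=15)
X_selected = selector.fit_transform(X_scaled, y)

# Indices of the 15 most relevant feature columns (cf. Fig. 2)
selected_idx = selector.get_support(indices=True)
```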
The application of an SVM did not deliver satisfying results. Therefore, we utilized a neural network-based classifier (Fig. 3). The model was built using the Python libraries Keras [45] and TensorFlow [46]. While [13] applied a recurrent neural network (RNN) to their audio data in a similar approach, we opted for a standard feed-forward network as a starting point. After comparing several configurations (number of layers, neurons, and layer types), this network model proved to be sufficiently accurate for our purposes. It requires a relatively low number of features as input, which in turn reduces the requirements for real-time classification or the development of sonification models. However, a model that also makes use of past states (such as an RNN) is very likely to further improve the obtained accuracy, and we will consider this for future developments.
The obtained data was split into a training set, a validation set, and an independent test set. Using a training/validation split of 0.3 during the training process, we achieved an accuracy of > 93% on the independent test set. This indicates that the collected data was meaningful (cf. Table 1 and Fig. 4 for a receiver operating characteristic (ROC) plot using only independent test data).
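Since the exact architecture (Fig. 3) is not given in the text, the following is only a sketch of a small feed-forward classifier in Keras trained with a validation split of 0.3; the layer counts, widths, training parameters, and placeholder data are assumptions.

```python
import numpy as np
from tensorflow import keras

# Stand-ins for the 15 selected features and the binary labels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10000, 15)).astype("float32")
y_train = rng.integers(0, 2, size=10000).astype("float32")

# Small feed-forward network; the actual architecture (Fig. 3) may differ
model = keras.Sequential([
    keras.layers.Input(shape=(15,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # error probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training/validation split of 0.3, as in the study; test data kept separate
model.fit(X_train, y_train, validation_split=0.3, epochs=20, batch_size=64)
```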
Table 1 True/false positive/negative rates of the independent test set

We therefore conclude that the chosen model fulfills the requirements in terms of prediction reliability. Through feature selection, we were able to identify information-rich features, and the model allows the classification of system states and conditions. Thus, the hypothesis that information is contained in the data is confirmed. Furthermore, the network model generates low-dimensional data streams, which makes it particularly suitable for the subsequent sonification.
The results of our data analysis offered three starting points for sonification approaches:
1. Data of the identified most relevant features are directly mapped to a sonification model.
2. Data of the identified most relevant features are used as metadata to manipulate incoming audio signals of the monitored machines. Thus, relevant sonic information within these signals can be emphasized and conditioned.
3. The information on the confidence (error probability) of the model is used directly instead of thresholding this value to retrieve a classification. This provides a continuous data stream which is one-dimensional, meaningful, and already normalized (Fig. 5).
For reasons of simplicity and efficiency, we chose the third of these starting points as the data basis for our sonification model.
Design and application of auditory display
Through the application of machine learning algorithms, a highly complex input situation (eight channels of audio data) could be simplified to a one-dimensional data stream indicating the error probability of the monitored operations. Thus, the challenge frequently put on an auditory display, namely to deliver easily accessible and distinct information, could be greatly reduced. The error probability values of the previous condition classification comprised a range from 0.0 to 1.0 for each analysis frame (at about 12 frames per second). These incoming values were smoothed by a moving average window of 10 frames. To distinguish between normal operation states and error conditions, we applied a threshold of 0.7 to the smoothed data stream. Error probability values below that threshold were unambiguously considered normal operation states; values above it gradually indicated increased probabilities of errors.
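A minimal sketch of this smoothing and thresholding stage is given below; treating the incoming stream as a NumPy array is an assumption made for illustration.

```python
import numpy as np

THRESHOLD = 0.7   # values above this gradually indicate errors
WINDOW = 10       # moving average over the last 10 analysis frames (~12 fps)

def smooth_and_classify(error_probs):
    """Smooth the per-frame error probabilities and flag frames above the threshold."""
    kernel = np.ones(WINDOW) / WINDOW
    smoothed = np.convolve(error_probs, kernel, mode="valid")
    return smoothed, smoothed > THRESHOLD

# Example: a short stream of classifier outputs in [0.0, 1.0]
probs = np.array([0.1, 0.2, 0.1, 0.3, 0.6, 0.8, 0.9, 0.85, 0.9, 0.95, 0.9, 0.8])
smoothed, is_error = smooth_and_classify(probs)
```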
As a proof-of-concept, we designed three sonification models utilizing rather diverse approaches based on the following metaphors: (i) heartbeat, (ii) soundscape, and (iii) music listening. In doing so, we considered five fundamental requirements:
1. In terms of an auditory augmented reality approach, both the classification and the sonification processes are realized in quasi real time.
2. Normal states are unobtrusively represented by continuous sonification [47] to affirm that everything is working properly.
3. Error conditions are clearly distinguishable without being perceived as alarms.
4. Silence indicates a dropout of the complete system.
5. None of the represented states must acoustically hinder verbal communication (e.g., via radio).
The “heartbeat model” was chosen for its simplicity and its inherent connection to human activity. The characteristic double beat was generated by envelope-shaped sinusoids. By default, the basic meter was set to 60 bpm and represented normal operation states, reassuring the operator of a well-functioning system. As soon as the stream of error probability values exceeded the threshold, the meter started to fluctuate and speed up. Also, the volume of the heartbeats increased.
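The exact synthesis parameters (envelope shape, frequency, and the mapping from error probability to tempo and volume) are not specified, so the following NumPy sketch merely illustrates the idea of envelope-shaped sinusoids forming a double beat that becomes faster and louder above the threshold; the fluctuation of the meter is omitted for brevity.

```python
import numpy as np

SR = 48000  # sample rate (assumption)

def beat(freq=55.0, dur=0.12, amp=0.5):
    """One envelope-shaped sinusoidal thump."""
    t = np.arange(int(SR * dur)) / SR
    env = np.exp(-t * 30.0)                  # fast exponential decay
    return amp * env * np.sin(2 * np.pi * freq * t)

def heartbeat_block(error_prob, seconds=2.0):
    """Render a short block of heartbeats; tempo and volume grow with error probability."""
    bpm = 60.0 + 60.0 * max(0.0, error_prob - 0.7) / 0.3   # speeds up above the threshold
    amp = 0.4 + 0.5 * error_prob
    period = int(SR * 60.0 / bpm)
    gap = int(SR * 0.18)                                   # delay between the double beats
    out = np.zeros(int(SR * seconds))
    thump = beat(amp=amp)
    for start in range(0, len(out) - len(thump) - gap, period):
        out[start:start + len(thump)] += thump             # first beat
        out[start + gap:start + gap + len(thump)] += 0.7 * thump   # softer second beat
    return out
```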
For their dual-task experiment, Hildebrandt et al. [8] designed a soundscape based on a “forest” metaphor that included sounds such as a woodpecker pecking a tree or breaking twigs. We picked up this concept of utilizing nature sounds for the development of the “soundscape model.” Based on procedural synthesis models provided by [48], we implemented a natural environment that included bird tweets and flaps, crickets, wind, thunder, and rain. All parameters, such as wind speed, the triggering of chirps and tweets, and the positioning in the stereo panorama, were driven by random values. Only the individual contribution of the elements to the scene (i.e., the mixing) was controlled by the error probability values. Therefore, good weather conditions including sounds of birds and crickets represented normal operation states, while upcoming storm and rain sounds indicated an increase of the error probability above the threshold of 0.7.
As mentioned in Sect. 2.1, Barra et al. [34] developed and evaluated a continuous sonification model that included background music which was enriched by additional musical information. Based on this rather complex concept, we designed a much simpler “music listening model” that respects the habit that many operators have, according to our observations, of listening to music (via headphones or loudspeakers) during work.Footnote 7 Using our model, operators continue to listen to the music of their preference. However, in case of an increased error probability, a gradually narrowing bandpass filter is applied to the music playback (patina effect). In addition, the speed of the music starts to fluctuate in order to make operators aware of the increasing error probability. The implementation of the speed fluctuation is based on the “supervp~” external of the MuBu library provided by IRCAMFootnote 8 [49], which allows tempo manipulations independent of frequency shifts at decent quality.
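While the tempo fluctuation relies on the supervp~ external, the gradually narrowing bandpass (patina) filter can be sketched offline, for instance with SciPy; the center frequency, the sample rate, and the mapping from error probability to bandwidth are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

SR = 44100  # sample rate of the music playback (assumption)

def patina_filter(block, error_prob, center=1500.0):
    """Narrow a bandpass around `center` Hz as the error probability rises above 0.7."""
    if error_prob <= 0.7:
        return block                                  # normal state: leave the music untouched
    # Map (0.7, 1.0] to a bandwidth shrinking from ~4 octaves down to ~0.5 octaves
    amount = (error_prob - 0.7) / 0.3
    octaves = 4.0 - 3.5 * amount
    low = center / (2 ** (octaves / 2))
    high = center * (2 ** (octaves / 2))
    sos = butter(4, [low, high], btype="bandpass", fs=SR, output="sos")
    return sosfilt(sos, block)

# Demo: filter one second of noise as a stand-in for a block of music
noise = np.random.default_rng(0).normal(size=SR)
filtered = patina_filter(noise, error_prob=0.9)
```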
Results of 1st pilot study on error estimation sonification
Our results in detail are as follows:
1. Combining feature extraction and a custom artificial neural network (ANN), the applied model achieved a high accuracy (> 93%) in estimating error probabilities and distinguishing between operation states. None of these states could be identified by listening. An auditory augmentation, or rather the sonification of this classifier, thus provides a considerable benefit to process monitoring.
2. The data stream of error probability values was mapped into a sonification model, providing evidence about momentary operation states. Three models relying on different acoustic metaphors (heartbeat, soundscape, music listening) were implemented as a proof-of-concept. These models were designed to be unobtrusively perceived during normal conditions while clearly indicating error states without shifting into warning sound characteristics.
3. The system works in quasi real time; the analysis buffer causes a delay of about 85 ms, and the input and output buffers of the audio interface add another 10 ms.
Due to the simplicity of the sonification models and the one-dimensional, almost Boolean information stream, error conditions are easily distinguishable from normal states in all three models. We therefore decided to forgo a formal perceptual user study for now and restrict our approach to a proof-of-concept. Also, the general benefit of continuous sonification for the early identification of upcoming issues has already been evaluated in in vitro studies (see, e.g., [32, 34]). As the latter pointed out, long-term observations under real-world conditions are necessary in order to evaluate the impact, the benefit, and, most importantly, the willingness of operators to accept exposure to the provided acoustic information on a day-to-day 8-h basis. While we see a good chance for an implementation of the music listening model in manufacturing environments, we doubt the potential of the two other models since they appear quite uniform and fatiguing overall. For our 2nd proof-of-concept study, we therefore focused on the musical aspect.
Design and application of 2nd proof-of-concept study: process classification
As a next step of our research, we designed a second proof-of-concept study in situ on the shop floor of a metal working company. The fluctuating acoustic environment of a real-world production scenario poses additional challenges for airborne sound analysis and process categorization. In addition to noises caused by nearby machines, passing forklift trucks, or human activities, area-wide music playback all over the shop floor was also a source of acoustic emission that needed to be taken into consideration.
Similar to our procedure in the first study, we equipped a semi-automatic CNC punching machineFootnote 9 with 10 small-diaphragm condenser and contact microphonesFootnote 10 at strategic positions, for instance near the punching head, the work plate, the valve, the clutch, the compressor, and the transformer box. The aims of the study were as follows:
1. to test our previously established feature extraction and machine learning routines against environmental influences and to adapt them accordingly
2. to classify different operation phases during processesFootnote 11 with an accuracy similar to the one achieved in the first proof-of-concept study
3. to develop a sonification model that clearly displays and distinguishes operation phases and integrates them into the work environment
Process phases during operations
The processing of a single workpiece, i.e., a metal sheet, at the punching machine can be subdivided into five operation phases:
1. operator inserting the workpiece into the machine (manual operation)
2. punching processes (automatic operation)
3. re-arranging the workpiece (automatic operation)
4. punching processes (automatic operation)
5. operator withdrawing the workpiece (manual operation)
Our self-defined task for the process classification was to develop a method based on our first proof-of-concept study that automatically distinguishes between these phases with a comparable accuracy (i.e., > 93%).
We focused on recording the processing of one specific product type (“A”). The observed custom order comprised 500 workpieces, a sample size that we expected to deliver enough data for our analysis. The processes for this product type consisted of 10-mm-diameter stamps punching holes into a 0.55-mm electrogalvanized steel sheet. In order to be able to reproduce the operations recorded by the set of 10 microphones described above, we filmed the scenario with a video camera that was time-synchronized to the audio recordings. The manually labeled operation phases show a maximum difference of 3 s within each of the 5 operation phases (cf. Table 2), indicating that even the processes involving manual activities ran on a stable basis.
Table 2 Representative timestamps of operation phases after manual labeling

Combining the manual operations, i.e., the “inserting” and “withdrawing” of a workpiece, into an overall “handling” phase and considering the two “punching” phases as a single category, we obtain a characteristic temporal pattern of operation phases as displayed in Fig. 6.
Analogous to our previous procedure, we performed feature extraction on all audio recordings, which were framed to a buffer of 2¹⁵ samplesFootnote 12, using 12 MFCCs (computed from a mel spectrum with 128 mel bins), spectral centroid, spectral roll-off, and spectral bandwidth. Feature selection was again performed using χ² (Fig. 7).
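For illustration, the changed extraction parameters could enter a libROSA call as sketched below; the hop size and sample rate are assumptions, as they are not stated for this study.

```python
import numpy as np
import librosa

FRAME = 2 ** 15   # analysis buffer of the second study
HOP = 4096        # hop size (assumption)
SR = 48000        # sample rate (assumption)

def frame_features(y, sr=SR):
    """12 MFCCs from a 128-bin mel spectrum plus the three spectral shape features."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_mels=128,
                                n_fft=FRAME, hop_length=HOP)
    cent = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=FRAME, hop_length=HOP)
    roll = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=FRAME, hop_length=HOP)
    bw = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=FRAME, hop_length=HOP)
    return np.vstack([mfcc, cent, roll, bw]).T   # shape: (n_frames, 15) per channel

# Demo with one second of noise as a stand-in for a microphone channel
features = frame_features(np.random.default_rng(0).normal(size=SR))
```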
The 30 most relevant features of a dataset of 1206 frames were selected and fed into seven network modelsFootnote 13 for training and testing using a train/test split of 0.5. For each input frame, the network estimates the probability P(n) for each of the three classes representing the “handling” [0], “punching” [1], and “re-arranging” [2] phases of the operations. The class with the highest probability determines the allocation of the analyzed frame. While most of the tested models exhibit rather high confusion rates between phases [0] and [2], with the confusion matrix of the support vector machine (Table 3) and its overall accuracy of about 80% providing a representative example, the random forest model (Table 4) performed best with an accuracy of more than 96%.
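A hedged sketch of how such a comparison could be run with scikit-learn is given below; the placeholder data and the default hyperparameters are assumptions, since the seven tested models are not specified in detail.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Stand-in for the 1206 frames with the 30 selected features and labels 0-2
rng = np.random.default_rng(0)
X = rng.normal(size=(1206, 30))
y = rng.integers(0, 3, size=1206)    # 0 = handling, 1 = punching, 2 = re-arranging

# Train/test split of 0.5, as in the study
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for name, clf in [("SVM", SVC()), ("Random forest", RandomForestClassifier())]:
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(name, accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
```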
Table 3 Confusion matrix for classifiers [0–2] obtained by the application of a support vector machine network

Table 4 Confusion matrix for classifiers [0–2] obtained by the application of a random forest network

In order to obtain a more flexible solution for the challenges of future scenarios, we continued our research by developing a custom artificial neural network (Fig. 8) based on the one we had used in our first pilot study (Fig. 3), with superior modularity, expandability, and scalability. With an accuracy rate of about 94%, this model performed slightly worse than the random forest model (about 96%). According to the confusion matrix (Table 5), however, the confusion between class 0 (“handling”) and class 2 (“re-arranging”) is on a similar level to that exhibited by the random forest model and lower than in all the other tested network models. Since the ROC in Fig. 9 also displays an individual accuracy of 95% for class 2 and even better performances for classes 0 and 1, we conclude that we reached our stated target of achieving an accuracy comparable to the one we reached in our first proof-of-concept study.
Table 5 Confusion matrix for classifiers [0–2] obtained by the application of our custom artificial neural network (ANN)

The time-agnostic characteristics of the network model become evident in the noisy output of the original signal (Fig. 10). We smoothed these fluctuations by applying an infinite impulse response (IIR) filter H(z) to the output of the network before allocating the analyzed frames to their most probable class via argmax (Fig. 8). The filter was constructed using the following difference equation, with s being a smoothing constant:
$$ y(n)=\begin{cases} y(n-1)+\dfrac{x(n)-y(n-1)}{s}, & \text{for } x(n)<y(n-1)\\ x(n), & \text{for } x(n)\ge y(n-1)\end{cases} $$
resulting in the transfer function

$$ H(z)=\frac{1}{s+z^{-1}-s\,z^{-1}} $$

for a falling signal, and

$$ H(z)=1 $$

for a rising signal.
Figure 10 also shows that the accuracy of our model was substantially improved by this filtering of recent predictions. While recurrent neural networks would offer a logical next step to make the model truly aware of previous states, the presented model fulfills the given task in a satisfactory manner and can even be used to label additional collected data in order to train a more general model.
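A minimal NumPy sketch of this asymmetric smoothing applied to the per-class probabilities before the argmax is given below; the value of the smoothing constant s and the array layout are assumptions.

```python
import numpy as np

def smooth_probs(probs, s=8.0):
    """Asymmetric one-pole smoothing of per-class probabilities: rising values pass
    through immediately, falling values decay with smoothing constant s; then argmax."""
    y = np.zeros_like(probs)
    y[0] = probs[0]
    for n in range(1, len(probs)):
        rising = probs[n] >= y[n - 1]
        y[n] = np.where(rising, probs[n], y[n - 1] + (probs[n] - y[n - 1]) / s)
    return y, np.argmax(y, axis=-1)

# probs: (n_frames, 3) with P(n) for handling [0], punching [1], re-arranging [2]
probs = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.6, 0.1], [0.1, 0.8, 0.1]])
smoothed, labels = smooth_probs(probs)
```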
Sonification model
The auditory display of error probability estimations as performed in our first proof-of-concept study suggests the implementation of sonification models that map an increasing probability of faulty operations to sonic parameters with rather negative connotations. This can be realized by modeling bad weather conditions or by applying patina filters to high-end music recordings. Errors that are caused, for instance, by the deterioration of machines usually do not appear at once but develop gradually. The worsening of the generated weather conditions through upcoming rain and thunderstorms, or gradually applied filters according to the state of deterioration, will provide useful information to experienced operators so that they are well informed about the state of the machines and can decide at which point to take action.
The challenges for designing auditory displays that represent operation states are rather different, since these phases do not change gradually but immediately. The sonification should indicate the state clearly on a perceptually neutral basis without evaluating the quality of the processes. The provided information should assure operators that everything is working properly. Also, it must be kept in mind that the displayed sounds will be listened to over long periods of time. Therefore, a strategy is needed that respects the usual acoustic environment operators are accustomed to and does not essentially intrude into the auditory scene. The shop floor of the enterprise where we recorded the punching processes was permanently flooded with music. Listening to music during work has been a common experience for all operators who work there. Therefore, the development of a sonification model that incorporates listening to music can be expected to fulfill the stated criteria.
All three operation phases (“handling,” “punching,” “re-arranging”) should be displayed on a non-judgmental basis. One way to comply with this condition is the instrumentation of a musical piece. However, unlike audio effects such as patina or tempo fluctuations, instrumentation as a sonification parameter cannot be applied to already produced music recordings. For our second proof-of-concept study, we therefore arranged the jazz standard Autumn Leaves by Joseph Kosma manually according to the time sequence of phases given by the applied machine learning algorithm. While the plucked double bass and the laid-back drums (played with brushes) provide a continuous, stable basis over the complete scene, the handling phase is represented by a muted trumpet for the melody and a piano for the accompaniment. During the “automatic” operation phases (i.e., “punching” and “re-arranging”) of the punching machine, these two instruments were substituted by a lead and a rhythm guitar. In order to distinguish between “punching” and “re-arranging” phases, the latter were instrumented with an additional synthetic male choir (Table 6).
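A possible realization of the phase-to-instrumentation mapping (Table 6) with pre-rendered stems is sketched below; the stem names and the hard gain switching are assumptions made for illustration.

```python
# Which stems of the arranged "Autumn Leaves" play in each operation phase;
# double bass and drums form the continuous basis in every phase.
STEMS_BY_PHASE = {
    0: ["double_bass", "drums", "muted_trumpet", "piano"],            # handling
    1: ["double_bass", "drums", "lead_guitar", "rhythm_guitar"],      # punching
    2: ["double_bass", "drums", "lead_guitar", "rhythm_guitar",
        "synthetic_male_choir"],                                      # re-arranging
}

def stem_gains(phase, all_stems):
    """Return a gain per stem (1.0 = audible, 0.0 = muted) for the classified phase."""
    active = set(STEMS_BY_PHASE[phase])
    return {stem: 1.0 if stem in active else 0.0 for stem in all_stems}

ALL_STEMS = sorted({s for stems in STEMS_BY_PHASE.values() for s in stems})
print(stem_gains(1, ALL_STEMS))   # gains during a "punching" phase
```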
Table 6 Mapping of operation phases to musical instrumentation

Results of the 2nd pilot study on operation phase sonification
Our results in detail are as follows:
1. The adjusted model combining feature extraction and a custom artificial neural network appears to be robust against the environmental influences that occurred during the recording phases.
2. The model applied to estimate the probability of three different operation phases achieves an accuracy even higher (> 94%) than the one achieved in the first proof-of-concept study (> 93%). The robustness of the model could be further improved by the implementation of an IIR filter.
3. The three states of this classifier, representing the three operation phases, were acoustically displayed by characteristic and clearly distinguishable instrumentations of a musical piece. An intrusion into the auditory scene of operators is not expected as long as they are accustomed to listening to music during working hours, as many operators do according to our observations.