1 Introduction

A wide range of household equipment is available to inform the occupant of important events, from visitors at the door to gas leaks, all of which require prompt attention, though with quite different responses and urgency levels. Though the usual notification is a special sound signal, many products communicate by methods other than sound for hearing-impaired persons and those too distant to hear such signals. However, such remote warning devices are specialized in ways that make them much more expensive than ordinary devices [1, 2]. It would therefore be useful to have a single system that recognizes all these various alarm signals and communicates their messages through a single channel to their intended recipients, whether disabled, distant, or simply distracted.

Machine learning techniques have already been widely applied in fields such as image recognition, speech recognition [3], and automatic translation [4], so it seems reasonable to apply them to the classification of various alarm sounds, from smoke alarms to kitchen timers. Although there are studies aimed at detecting alarm sounds [5,6,7], they perform only a binary division into alarm and non-alarm sounds and do not classify the many different kinds of alarm sounds. In addition, they examine only the element technology; no investigation has been made of a system for communicating the occurrence of an alarm sound to the user.

In this paper, we propose a system that recognizes various alarm sounds using machine learning and transmits notifications to the smartphone of an individual user. We describe the results of initial prototyping of such a system.

2 Alarm Sound Classification

A conceptual image of the proposed system in use is shown in Fig. 1. The system classifies various alarm sounds by machine learning, notifies the user by vibrating the user's smartphone, and displays the sound source on its screen. Although special devices exist that signal an abnormality by means other than sound, such as light, for users with hearing disabilities, they are all individual, single-purpose devices. As shown in Fig. 1, a useful system can be realized at low cost by detecting the sounds of all the alarms, identifying them, and informing the user of the classification via smartphone by means of vibration and an on-screen display.

Fig. 1. Service image of proposed system

2.1 Feature Data Creation of Each Sound

Eight kinds of equipment were selected as indoor alarm-sound sources for this investigation: a door alarm, two smoke alarms, a gas alarm, an entrance bell, a kettle alarm, and two timer alarms. The appearance of each device and its spectrogram are shown in Fig. 2. The x-axis of each spectrogram is time (0 to 60 s), the y-axis is frequency (0 to 8000 Hz), and the sound level ranges from −40 to 40 dB.

Fig. 2. Appearance of alarm equipment and their spectrograms

Here, it is also necessary to consider the acoustic environment in which classification takes place, that is, what the classifier outputs when no alarm sound is present. We selected two indoor sound environments: one in which air conditioners and similar appliances are operating but no one is speaking, i.e. purely environmental sounds, and one with normal speech. These two sound backgrounds were included as classification targets. Their spectrograms are shown in Fig. 3. We confirmed that the spectrograms of the ten sound sources differ from one another, so classification should be possible with appropriate methods. To classify these sound sources, it was then necessary to select distinctive feature elements for each. In this investigation, we decided to use the power spectrum and the Mel-Frequency Cepstrum Coefficients (MFCC) used in speech recognition as the feature elements for classification.

Fig. 3. Spectrograms of environmental and speech sounds

Two data sets, each consisting of 60 s of signal, were acquired from each of the eight sound sources plus the environmental and speech sounds at a sampling rate of 16,000 Hz, and their power spectra up to 8 kHz were obtained for feature extraction. Each data set was divided into frames of 32 ms (512 samples) with a shift of 10 ms (160 samples), so that successive frames overlap. Thus, 6000 power-spectrum frames were created from each data set, for a total of 12,000 frames per sound source for use in creating the learned model described in Sect. 2.2. A Hamming window was applied to each frame before computing the power spectrum. To eliminate the influence of differences in sound volume due to differences in equipment and distance, each result was normalized, that is, divided by its maximum value.
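As an illustration, the framing and normalization described above can be sketched as follows in Python with NumPy. The per-frame maximum normalization and all names are our assumptions for illustration; the paper does not specify the implementation.

```python
import numpy as np

FS = 16_000    # sampling rate (Hz)
FRAME = 512    # 32 ms frame length in samples
HOP = 160      # 10 ms frame shift in samples

def power_spectrum_frames(signal):
    """Split a 1-D signal into overlapping Hamming-windowed frames and
    return the max-normalized power spectrum of each frame."""
    window = np.hamming(FRAME)
    n_frames = 1 + (len(signal) - FRAME) // HOP
    feats = np.empty((n_frames, FRAME // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * HOP : i * HOP + FRAME] * window
        spec = np.abs(np.fft.rfft(frame)) ** 2    # power up to 8 kHz (Nyquist)
        feats[i] = spec / (spec.max() + 1e-12)    # normalize by the maximum
    return feats

# A 60-s recording (960,000 samples) yields 5997 frames, matching the
# roughly 6000 frames per data set described above.
```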

The same data sets were used to extract MFCC feature elements. The number of Mel filter banks was 30, and 13-dimensional MFCCs were extracted from the 32-ms frames using the Speech Signal Processing Toolkit (SPTK) [8]. These coefficients were combined with their first- and second-order derivatives (delta and delta-delta), three components in all, to generate 39-dimensional features, which were used to create a learned model.
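The paper extracts MFCCs with SPTK [8]; as a hedged alternative, an equivalent extraction can be sketched with the librosa library, assuming the three combined components are the MFCCs plus their delta and delta-delta coefficients:

```python
import librosa
import numpy as np

def mfcc_features(path):
    """13 MFCCs per 32-ms frame (10-ms shift, 30 mel filter banks),
    stacked with delta and delta-delta to give 39 dimensions per frame."""
    y, sr = librosa.load(path, sr=16_000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=160, n_mels=30)
    d1 = librosa.feature.delta(mfcc)             # first-order derivatives
    d2 = librosa.feature.delta(mfcc, order=2)    # second-order derivatives
    return np.vstack([mfcc, d1, d2]).T           # shape: (n_frames, 39)
```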

2.2 Creation of Learned Model

The configuration of the neural network used for learning is shown in Fig. 4. The power spectrum and MFCC feature elements were each used for training with the same network architecture, producing one learned model for each feature type. The network has two intermediate layers, and dropout with 50% probability was applied to the intermediate layers to avoid over-fitting.

Fig. 4. Neural network configuration
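A minimal sketch of such a network in PyTorch follows. The paper specifies only two intermediate layers and 50% dropout; the hidden width, activation function, and input sizes shown here are assumptions for illustration.

```python
import torch.nn as nn

N_CLASSES = 10    # eight alarm sounds + environmental + speech
N_INPUT = 257     # power-spectrum bins (use 39 for the MFCC features)
N_HIDDEN = 256    # hidden width is not given in the paper; assumed here

model = nn.Sequential(
    nn.Linear(N_INPUT, N_HIDDEN), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(N_HIDDEN, N_HIDDEN), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(N_HIDDEN, N_CLASSES),   # softmax is applied by the loss / at inference
)
```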

The learning processes for the power spectrum and MFCC feature elements are shown in Fig. 5. Cross-entropy error was used as the loss function, and the mini-batch size was set to 20. We confirmed the convergence of both the mini-batch accuracy and the loss obtained from the error function. To obtain a valid learned model, the number of epochs was set to 1000, which was sufficient to reach a stable state. In terms of learning performance on the 60-s feature data, slightly better results were obtained with MFCC than with the power spectrum.

Fig. 5. Learning process
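Continuing the sketch above, a training loop matching the stated settings (cross-entropy loss, mini-batch size 20, 1000 epochs) might look as follows; the optimizer choice is an assumption, as the paper does not name one.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# X: float32 feature tensor (n_frames, N_INPUT); y: int64 label tensor (n_frames,)
loader = DataLoader(TensorDataset(X, y), batch_size=20, shuffle=True)
optimizer = torch.optim.Adam(model.parameters())   # optimizer is our assumption
loss_fn = nn.CrossEntropyLoss()                    # cross-entropy error, as stated

for epoch in range(1000):                          # 1000 epochs to reach a stable state
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```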

3 Classification Experiment

Given the importance of early detection of an alarm sound, classification performance was evaluated on the first 5 s of each alarm sound. Five seconds of data from each of the eight alarm sounds plus the environmental and speech sounds were used to evaluate the classification performance of the learned model obtained in Sect. 2. The sounds were played back through a loudspeaker from prerecorded WAV files. The mean and the standard deviation of the softmax output probabilities were examined together with the classification result to assess the quality of classification.
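The per-class softmax statistics can be computed as in the following sketch, which aggregates the frame-level predictions of a 5-s clip by majority vote (the aggregation rule used by the prototype in Sect. 4.2); the class names are illustrative only.

```python
import numpy as np
import torch
import torch.nn.functional as F

CLASSES = ["door", "smoke1", "smoke2", "gas", "bell",
           "kettle", "timer1", "timer2", "env", "speech"]  # illustrative names

def softmax_statistics(model, frames):
    """Classify every frame of a clip and report the majority-vote class
    together with the per-class mean and standard deviation of the softmax."""
    with torch.no_grad():
        logits = model(torch.as_tensor(frames, dtype=torch.float32))
        probs = F.softmax(logits, dim=1).numpy()
    majority = CLASSES[int(np.bincount(probs.argmax(axis=1)).argmax())]
    return majority, probs.mean(axis=0), probs.std(axis=0)
```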

3.1 Results by Power Spectrum

The classification performance when the power spectrum was used is shown as a confusion matrix in Table 1. The overall accuracy was 96.0% (4 misclassifications out of 100). Tables 2 and 3 show the average and the standard deviation of the softmax output values for each classification target; values exceeding 0.1 are shown in bold. These averages and standard deviations indicate that misclassification occurred between the door alarm and kettle sounds, as well as between the environmental and speech sounds.

Table 1. Result using power spectrum
Table 2. Average values of softmax function for power spectrum
Table 3. Standard deviation of softmax function for power spectrum

3.2 Results by MFCC

Classification performance when MFCC was used is shown in Table 4. The overall recognition accuracy was 87.0% (13 misclassifications out of 100). The erroneous classification between the environmental sound and the speech sound is similar to the power-spectrum result, but erroneous classification also occurred between timer 1 and timer 2. The spectrograms in Fig. 2 show that these two timers produce virtually the same sound in their initial time span. This is a very different outcome from the power-spectrum case.

Table 4. Result using MFCC

Tables 5 and 6 show the average value and the standard deviation of the softmax output values for each classification target, presented in the same manner as for the power spectrum. In addition to the confusion between environmental and speech sounds, erroneous classifications between timers 1 and 2 are evident, as expected.

Table 5. Average values of softmax function for MFCC
Table 6. Standard deviation of softmax function for MFCC

4 Development of Alarm Sound Notification System

4.1 System Requirements and Configuration

Based on the results in Sect. 3, we judged that a learned model capable of classifying each alarm sound had been created, and we designed and developed an alarm sound notification system using this model. The system notifies a user terminal such as a smartphone of an alarm and its source, as shown in Fig. 6. Vibration of the user terminal notifies the user that an alarm has occurred, and its display shows the source.

Fig. 6. Alarm notification system configuration

Since the classifier must remain powered on at all times, low power consumption is important, so a Raspberry Pi with Bluetooth was used in this first prototype system. The learned model created in Sect. 2.2 was installed on the Raspberry Pi, to which the microphone is connected, and a smartphone was selected as the user terminal. The microphone is omnidirectional, connects to the Raspberry Pi via USB, and is the same one used for data acquisition in the learning process.
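The paper does not name the audio-capture software; as one possibility on a Raspberry Pi, a 5-s clip could be recorded from the USB microphone with the sounddevice library:

```python
import sounddevice as sd

FS = 16_000

def record_clip(seconds=5):
    """Record a mono clip from the (USB) microphone at 16 kHz."""
    clip = sd.rec(int(seconds * FS), samplerate=FS, channels=1, dtype="float32")
    sd.wait()              # block until the recording is complete
    return clip[:, 0]
```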

4.2 System Design and Implementation

The system flowchart for alarm sound notification is shown in Fig. 7. When an abnormality is detected, the alarm device sounds continuously, and the system must detect the sound and notify the user as soon as possible. Therefore, in this prototype system, data were acquired every 5 s and divided into 32-ms frames with a 10-ms shift, yielding 500 frames per acquisition. These 500 frames produce 500 frame-level classification results, and the final result is determined by majority vote among them. For each piece of alarm equipment, 50 s of sound were provided, giving ten classification trials. Each time, the identification result is transmitted to the smartphone via Bluetooth.

Fig. 7. System flowchart for alarm notification
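The overall decision-and-notification loop of Fig. 7 might be sketched as follows, reusing the functions from the earlier sketches. The use of PyBluez over an RFCOMM link and the phone's Bluetooth address are assumptions; the paper states only that results are sent via Bluetooth.

```python
import bluetooth   # PyBluez; assumes an RFCOMM serial link to the phone

PHONE_ADDR = "00:11:22:33:44:55"   # placeholder Bluetooth MAC address
PORT = 1

def notify(result):
    """Send the final classification result to the smartphone."""
    sock = bluetooth.BluetoothSocket(bluetooth.RFCOMM)
    sock.connect((PHONE_ADDR, PORT))
    sock.send(result.encode())     # e.g. one of the classes A..H, or "No Alarm"
    sock.close()

while True:
    frames = power_spectrum_frames(record_clip(5))      # ~500 frames per 5-s clip
    majority, _, _ = softmax_statistics(model, frames)
    notify("No Alarm" if majority in ("env", "speech") else majority)
```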

When the smartphone receives a classification result corresponding to an alarm sound, i.e. one of the classes A, B, …, H, the phone vibrates and the source of the alarm sound is displayed on its screen. Environmental and speech sounds are reported as No Alarm.

The smartphone function that receives the classification result from the Raspberry Pi via Bluetooth, vibrates the phone, and displays the result was implemented using MIT App Inventor 2 [8]. This development environment allows smartphone functions to be implemented by dragging and dropping visual blocks representing each instruction and function in a graphical interface.

5 Initial System Evaluation

An initial evaluation of the alarm notification system constructed in Sect. 4 was carried out. In this evaluation, the distance between the microphone sensor and each alarm sound source was 1 m, and the alarm sound to be classified was emitted continuously. Five-second segments of sound from each of the eight alarm types plus the environmental and speech sounds were classified for evaluation.

An overview of the equipment used in the system is shown in Fig. 8: the sound-producing alarm devices, a microphone sensor, a Raspberry Pi, and a smartphone. The remaining device is a monitor for checking the output of the Raspberry Pi. As the photo shows, very inexpensive equipment suffices, so if a satisfactory level of classification performance can be secured, there is a high possibility that a useful system can be realized for people with impaired hearing.

Fig. 8. Overview of components of alarm notification system

The classification experiments were carried out by implementing the learned models created from the power spectrum and MFCC features on the Raspberry Pi. The results are shown as confusion matrices in Tables 7 and 8. There is little difference between the overall classification rates of 83.0% and 82.0%. In both cases, erroneous classification occurred between timers 1 and 2. When MFCC was used, however, misclassifications also occurred among smoke alarm 1, smoke alarm 2, and the gas alarm, whereas the power spectrum produced no additional errors. For the eight alarm sounds alone, the classification rate was 87.5% with the power spectrum and 77.5% with MFCC. Although environmental and speech sounds were sometimes confused with each other when the power spectrum was used, this error is not a problem in terms of the required functions of the proposed system, since both are reported as No Alarm. From these results, it can be said that better performance is obtained with the power spectrum when these particular alarm sounds are to be discriminated.

Table 7. Evaluation result by power spectrum
Table 8. Evaluation result by MFCC

An example of the alarm display of the smartphone is shown in Fig. 9. Although some classification errors occurred, it was confirmed that the system itself operated without problems.

Fig. 9. Example of alarm display of smartphone

6 Conclusion

In this paper, we described a method for classifying various alarm sound sources and evaluated its classification performance using eight kinds of alarm sounds as well as conversational speech and environmental sounds. We then proposed an alarm sound notification system using the resulting learned model, constructed the system as a prototype, and confirmed its basic functionality. In this study, the distance between the sound source and the microphone sensor was 1 m, and only two kinds of background sound were considered at classification time: normal room sounds and speech. In an actual living space, there will also be noise from music, home appliance operation, and so on. Conducting detailed evaluations, improving classification performance in such noisier environments, and examining evaluation methods that take these various noises into account are future tasks.