1 Introduction

Coughing, the forceful expulsion of air to clear the airway, is a common symptom of respiratory disease [1]. It is distinctive in nature and is an important indicator used by physicians for clinical diagnosis and health monitoring in more than 100 respiratory diseases [2], including tuberculosis (TB) [3], asthma [4], pertussis [5] and COVID-19 [6]. Machine learning algorithms can be applied to acoustic features extracted from cough audio for automatic cough detection and classification [7,8,9]. However, an audio-based monitoring system raises privacy concerns [10], especially when the audio is captured by a smartphone [11, 12], and complex filtering processes may be required to preserve privacy during continuous monitoring [13].

Acceleration measurements can be an alternative to audio. Due to the accelerometer’s much lower sampling rate, less computing and processing power is required than for audio [14]. Automatic cough detection based on accelerometer measurements is also possible when the device is placed on the patient’s body and the acceleration signals are used for feature extraction [15]. Since an accelerometer is insensitive to environmental and background noise, it can be used in conjunction with other sensors such as microphones, ECG and thermistors [16]. Body-attached accelerometers have, for example, proved useful in detecting coughs when placed in contact with a patient’s throat [17, 18] or at the laryngeal prominence (Adam’s apple) [15]. A cough monitoring system using a contact microphone and an accelerometer attached to the participant’s suprasternal (jugular) notch was developed in [19]. The participants moved around their homes while the cough audio and vibration were recorded. A similar ambulatory cough monitoring system, using an accelerometer attached to the skin of the participant’s suprasternal notch with a bioclusive transparent dressing, was developed in [20]. Here, the recorded signal is transmitted to a receiver carried in a pocket or attached to a belt. Two accelerometers, one placed on the abdomen and the second on a belt wrapped around the dorsal region, have been used to measure cough rate after cross-correlation of the two sensor signals [21]. Regression analysis, carried out on both audio and accelerometer signals gathered from 50 children, achieved 97.8% specificity and 98.8% sensitivity when the accelerometer was placed in the centre of the abdomen between the navel and the sternal notch [22]. Finally, multiple sensors, including ECG, thermistor, chest belt, accelerometer and audio microphones, were used for cough detection in [23].

However, attaching an accelerometer to the patient’s body is inconvenient and intrusive. We propose monitoring coughing based on the signals captured by the on-board accelerometer of an inexpensive consumer smartphone firmly attached to the patient’s bed, as shown in Fig. 1. This eliminates the need to wear measuring equipment, and since the system uses machine learning classifiers, for which promising results were reported in the studies mentioned above, cough detection is both automatic and non-invasive. The work presented here extends our previous study [24] by using three additional shallow classifiers along with the deep architectures in the cough detection process, and by comparing the performance of the proposed accelerometer-based classifiers with baseline systems that classify audio signals of the same cough events. Such audio-based cough detection systems have been reported to discriminate between coughing and other sounds with areas under the ROC curve (AUCs) as high as 0.96 [7] and specificities as high as 99% [25]. Although we find that audio-based cough detection still outperforms accelerometer-based detection, we demonstrate that the difference in performance is narrow: our best 50-layer residual architecture (Resnet50) cough detector achieves an AUC of 0.996 for audio-based and 0.989 for accelerometer-based detection. Thus we are able to demonstrate that an automatic non-invasive accelerometer-based cough detection system is a viable option for long-term monitoring of a patient’s recovery.

The remainder of this paper is structured as follows. Section 2 describes data collection, while Sect. 3 details the features we extract from these data. The classifiers used for experimentation are introduced in Sect. 4 and the classification process itself is elaborated in Sect. 5. The results are presented in Sect. 6 and discussed in Sect. 7. Finally, Sect. 8 concludes the paper.

2 Dataset Preparation

2.1 Data Collection

Data were collected at a small 24-hour TB clinic near Cape Town, South Africa, which can accommodate approximately 10 staff and 30 patients. The clinic contains 8 wards, each with four beds, so that four patients can be monitored inside a ward at any one time. The overall motivation of this study was to develop a practical method of automatic cough monitoring for the patients in this clinic, so that their recovery progress can be monitored.

Figure 1 shows the recording setup, where an enclosure housing an inexpensive consumer smartphone is firmly attached to the back of the headboard of each bed in a ward. An Android application, developed specifically for this study, monitors the accelerometer and the audio signals. The on-board smartphone accelerometer has a sampling frequency of 100 Hz. Although this sensor provides tri-axial measurements, we record only the vector magnitude. A BOYA BY-MM1 external microphone (visible in Fig. 1) was used to capture audio signals at a sampling rate of 22.05 kHz. Using a simple energy detector, activity on either the acceleration or the audio channel triggers the simultaneous recording of both. This results in a dataset consisting of a sequence of non-overlapping time intervals during which both acceleration and audio have been recorded.
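The triggering logic can be summarised by the following sketch. The data-gathering application itself is not published here, so the function names and threshold values are illustrative only; only the principle of an energy detector on each channel is taken from the text above.

```python
import numpy as np

def frame_energy(frame):
    """Mean squared amplitude of one buffered frame of samples."""
    return float(np.mean(np.asarray(frame, dtype=float) ** 2))

def should_record(accel_frame, audio_frame, accel_threshold=0.05, audio_threshold=0.01):
    """Trigger simultaneous recording of both channels if either shows activity.

    The threshold values are placeholders; the recorder in this study used a
    simple energy detector on each channel."""
    return (frame_energy(accel_frame) > accel_threshold
            or frame_energy(audio_frame) > audio_threshold)
```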

Figure 1
figure 1

Recording Equipment: A plastic enclosure housing an inexpensive smartphone (Samsung Galaxy J4) running data gathering software is attached behind the headboard of each bed. The acceleration signal from the on-board accelerometer as well as the audio signal from the external microphone (BOYA BY-MM1), connected via a 3.5 mm audio jack, are monitored. Recording is triggered if activity is detected in either of these two signals.

2.2 Data Annotation

A large volume of audio and accelerometer data was captured using this energy-threshold-based detection. Ceiling-mounted cameras simultaneously recorded continuous video to assist with the data annotation process. The audio signals and the video recordings allowed the presence or absence of a cough in an event to be unambiguously confirmed during manual annotation. In the remainder of this paper, we define an ‘event’ as any interval of activity in either the accelerometer or the audio signal.

The non-cough events are generated mostly by patients getting in and out of bed, moving while on the bed, sneezing or throat-clearing. Examples of the accelerometer magnitude signals for a cough event and a non-cough event (in this case due to the patient moving while on the bed) are shown in Fig. 2. The spectrogram representations of these two signals are shown in Fig. 3. Manual annotation was performed using the ELAN multimedia software tool, which allowed easy consolidation of the accelerometer, audio and video signals for accurate manual labelling [26].

Figure 2
figure 2

The accelerometer magnitudes for a cough event (red) and a non-cough event (blue). In this case, the non-cough event was the patient moving while on the bed.

Figure 3
figure 3

Spectrogram representation of the cough and non-cough events shown in Fig. 2: The cough event is shown in (a) and (c) and the non-cough event (patient moving on the bed) in (b) and (d). The accelerometer and audio signals are shown in (a) & (b) and (c) & (d) respectively. The audio signal has a higher sampling rate and thus contains more frequency and time-domain information than accelerometer measurements.

2.3 Final Dataset

The final dataset, summarised in Table 1, contains approximately 6000 cough and 68000 non-cough events from 14 adult male patients. Cough events are on average 1.90 sec long, with a standard deviation of 0.26 sec. Non-cough events are on average 1.70 sec long, with a standard deviation of 0.24 sec. The total lengths of all cough and non-cough events are 11397.60 sec (3.16 hours) and 115928.12 sec (32.20 hours) respectively. No other patient information was recorded, due to the ethical constraints of the study. This dataset was used to train and evaluate six classifiers, introduced in Sect. 4, within a leave-one-out cross-validation framework, described in Sect. 5.

Table 1 Ground Truth Dataset: ‘PATIENTS’: list of the patients; ‘COUGHS’: number of confirmed cough events; ‘NON COUGHS’: number of confirmed events that are not coughs; ‘COUGH TIME’: total amount of time (in sec) for cough events; ‘NON-COUGH TIME’: total amount of time (in sec) for non-cough events.

2.4 Dataset Balancing

Table 1 shows that cough events are outnumbered by non-cough events in our dataset. This imbalance can detrimentally affect machine learning classifiers [27, 28]. Instead of, for example, random oversampling, we have applied the synthetic minority oversampling technique (SMOTE) to create additional synthetic samples of the minority class while training the classifiers [29, 30]. This addresses the class imbalance for both the accelerometer and audio events. SMOTE has previously been applied successfully to cough detection and classification based on audio recordings [9, 24, 31].
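A minimal sketch of this balancing step is shown below, assuming the features are already arranged as one row per event; the data here are synthetic placeholders and the variable names are illustrative.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for one training fold: 60 cough (minority) vs 680 non-cough rows.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(740, 20))
y_train = np.array([1] * 60 + [0] * 680)

# SMOTE interpolates between nearest minority-class neighbours to generate
# synthetic cough samples until both classes are equally represented.
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(np.bincount(y_balanced))   # -> [680 680]
```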

3 Feature Extraction

The feature extraction process is illustrated for both accelerometer and audio signal in Fig. 4.

3.1 Accelerometer Features

The power spectrum, root mean square (RMS) value, kurtosis, moving averages and crest factor are extracted from the accelerometer magnitude samples. No de-noising was applied prior to feature extraction. Power spectra [32] have been used to represent sensor data for input to classifiers, including neural networks, in several studies [33,34,35,36]. RMS [37] values of sensor data have also been found to be useful features [38, 39]. Kurtosis is likewise useful for machine learning applications as it indicates the prevalence of high amplitudes [40]. Moving averages indicate the smoothed evolution of a signal over time and have been found to be useful features for sensor analysis [41]. Finally, the crest factor measures the ratio of the peak to the RMS signal amplitude and has also been found to aid machine learning prediction [42], including deep learning [43].
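A minimal sketch of these per-frame features is given below. The exact moving-average window is not specified above, so the frame mean is used as a stand-in, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import kurtosis

def accel_frame_features(frame):
    """Features for one accelerometer-magnitude frame of Psi samples.

    Returns a vector of length Psi/2 + 5: the power spectrum (Psi/2 + 1 bins)
    followed by RMS, kurtosis, moving average and crest factor."""
    frame = np.asarray(frame, dtype=float)
    power_spectrum = np.abs(np.fft.rfft(frame)) ** 2     # Psi/2 + 1 bins
    rms = np.sqrt(np.mean(frame ** 2))                   # root mean square
    kurt = kurtosis(frame)                               # prevalence of large amplitudes
    moving_average = np.mean(frame)                      # stand-in for a moving average
    crest_factor = np.max(np.abs(frame)) / rms           # peak-to-RMS ratio
    return np.concatenate([power_spectrum,
                           [rms, kurt, moving_average, crest_factor]])
```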

3.2 Audio Features

Mel-frequency cepstral coefficients (MFCCs), the zero crossing rate (ZCR) and kurtosis are extracted from the audio signal. MFCCs are widely used as features in audio analysis and especially in automatic speech recognition [44, 45]. They have been used successfully to differentiate dry coughs from wet coughs [46] and also to identify coughs associated with tuberculosis [47] and COVID-19 [9, 48]. We use the traditional MFCC extraction method, considering higher-resolution MFCCs along with the velocity (first-order difference, \(\Delta\)) and acceleration (second-order difference, \(\Delta \Delta\)), as adding these has improved classifier performance in the past [49]. The ZCR [50] is the number of times a signal changes sign within a frame and indicates the variability present in the signal. Finally, the kurtosis [51] indicates the prevalence of high amplitudes in the samples of an audio signal. These features have been extracted using the hyperparameters described in Table 2 for all cough and non-cough audio events.
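The sketch below illustrates these audio features using librosa. For simplicity it uses a conventional fixed hop, whereas Sect. 3.3 describes how frames are in fact placed so that every event yields a fixed number of frames; all parameter values shown are illustrative.

```python
import numpy as np
import librosa
from scipy.stats import kurtosis

def audio_features(event, sr=22050, n_mfcc=26, frame_length=1024, hop_length=512):
    """Per-frame MFCCs with velocity and acceleration, plus ZCR and kurtosis.

    Returns a (num_frames, 3 * n_mfcc + 2) feature matrix for one audio event."""
    mfcc = librosa.feature.mfcc(y=event, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length)
    velocity = librosa.feature.delta(mfcc, order=1)        # first-order difference
    acceleration = librosa.feature.delta(mfcc, order=2)    # second-order difference
    zcr = librosa.feature.zero_crossing_rate(event, frame_length=frame_length,
                                             hop_length=hop_length)
    frames = librosa.util.frame(event, frame_length=frame_length, hop_length=hop_length)
    kurt = kurtosis(frames, axis=0)[np.newaxis, :]          # kurtosis per frame

    # The framing conventions differ slightly, so trim to the common frame count.
    n = min(mfcc.shape[1], zcr.shape[1], kurt.shape[1])
    features = np.vstack([mfcc[:, :n], velocity[:, :n], acceleration[:, :n],
                          zcr[:, :n], kurt[:, :n]])
    return features.T                                       # (num_frames, 3 * n_mfcc + 2)

# Example: a synthetic 1.9 s event at 22.05 kHz.
features = audio_features(np.random.randn(int(1.9 * 22050)).astype(np.float32))
```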

Figure 4
figure 4

Feature extraction for both the accelerometer (top) and the audio (bottom) signals: Both acceleration and audio signals of the events, shown in Figs. 2 and 3, are split into a fixed number of overlapping frames. The length and number of these frames are \(\Psi\) and C for accelerometer signal & \(\mathcal {F}\) and \(\mathcal {S}\) for audio signal. For accelerometer measurements, the power spectrum, RMS, kurtosis, moving average and crest factor of each frame are extracted. For audio signals, the MFCCs, MFCC velocity (\(\Delta\)), MFCC acceleration (\(\Delta \Delta\)), ZCR and kurtosis are extracted. For the acceleration signal, this results in a feature matrix with dimensions (C, \(\frac{\Psi }{2}+5\)) while for the audio signal it generates a feature matrix with dimensions (\(\mathcal {S}, 3\mathcal {M} + 2\)) where \(\mathcal {M}\) is the number of extracted MFCCs.

3.3 Extraction Process

The features are extracted in a way that preserves information regarding the beginning and the end of an event, allowing time-domain patterns in the recordings to be discovered while maintaining the fixed input dimensionality expected by deep neural architectures such as a convolutional neural network (CNN).

For the accelerometer signal, the frame length (\(\Psi\)) and the number of segments (C) are the feature extraction hyperparameters, shown in Table 2. The power spectra have dimensions (C, \(\frac{\Psi }{2}+1\)) and each of the RMS, kurtosis, moving average and crest factor has dimensions (C, 1). Thus, the input feature matrix for the accelerometer signal, fed to the classifiers described in Sect. 4, has dimensions (C, \(\frac{\Psi }{2}+5\)), as illustrated in Fig. 4.

For the audio signal, the frame length (\(\mathcal {F}\)) and the number of segments (\(\mathcal {S}\)) are the feature extraction hyperparameters, shown in Table 2. Each of the MFCCs, MFCC velocity (\(\Delta\)) and MFCC acceleration (\(\Delta \Delta\)) has dimensions (\(\mathcal {S}, \mathcal {M}\)), where \(\mathcal {M}\) is the number of MFCCs. Each of the ZCR and kurtosis has dimensions (\(\mathcal {S}, 1\)). Thus, the input feature matrix for the audio signal, fed to the classifiers described in Sect. 4, has dimensions (\(\mathcal {S}, 3\mathcal {M} + 2\)), as illustrated in Fig. 4.

From every event, we extract a fixed number of frames (C and \(\mathcal {S}\)) by distributing the fixed-length analysis frames (\(\Psi\) and \(\mathcal {F}\)) uniformly over the time interval of the event and by varying the length of the frame skip, denoted \(\delta\) in Fig. 4. To calculate the frame skip, we divide the number of samples in an event by the number of segments and round up to the next positive integer. For a 1.2 sec long audio event and 100 extracted frames, the frame skip is \(\left\lceil \frac{1.2 \times 22050}{100} \right\rceil = \left\lceil \frac{26460}{100} \right\rceil = 265\) samples, since the audio sampling rate is 22.05 kHz.
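A minimal sketch of this frame placement is shown below; the zero-padding of the final frame is an assumption, since the text above does not state how frames extending past the end of an event are handled.

```python
import numpy as np

def split_into_frames(signal, num_frames, frame_length):
    """Place `num_frames` frames of `frame_length` samples over an event by
    varying the frame skip, so that every event yields the same number of frames."""
    signal = np.asarray(signal, dtype=float)
    skip = int(np.ceil(len(signal) / num_frames))          # frame skip in samples
    starts = [i * skip for i in range(num_frames)]
    pad = max(0, starts[-1] + frame_length - len(signal))  # assumption: zero-pad the tail
    padded = np.pad(signal, (0, pad))
    return np.stack([padded[s:s + frame_length] for s in starts])

# The worked example from the text: a 1.2 s audio event at 22.05 kHz, 100 frames.
print(int(np.ceil(1.2 * 22050 / 100)))    # -> 265 samples of frame skip
```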

The frame length (\(\Psi\)) used to extract features from the acceleration signal is shorter than the frame length (\(\mathcal {F}\)) used to extract features from the audio, both in this study (Table 2) and traditionally [52]. This is because, as noted in Fig. 1, the smartphone accelerometer has a much lower sampling rate (100 Hz) than the microphone (22.05 kHz), and longer frames lead to deteriorated performance since the signal properties can no longer be assumed to be stationary [53]. The lower sampling rate also reduces the amount of computation needed to extract features.

In contrast with the more conventional fixed, non-overlapping frame rates, this way of extracting features ensures that the entire event is captured within a fixed number of frames, allowing especially the CNN to discover useful temporal patterns and provide better classification performance. This method of feature extraction has also shown promising results in classifying COVID-19 coughs, breath and speech [9, 48].

Table 2 Feature extraction hyperparameters for both accelerometer and audio signals. For the accelerometer, frames of 16, 32 or 64 samples (i.e. 160, 320 and 640 msec) overlap in such a way that the number of frames, i.e. segments (5 or 10), is the same for all events in our dataset. Similarly, for the audio signals, the number of MFCCs is varied between 13 and 65 and the frame length between 256 samples (11.61 msec) and 4096 samples (185.76 msec), in such a way that the number of extracted frames, varied between 50 and 150, is fixed for all events in our dataset.

4 Classifier Training

We have trained and evaluated six machine learning classifiers on both audio and accelerometer signals. Table 3 lists the classifier hyperparameters that were optimised during leave-one-out cross-validation.

First, we establish the baseline results by training and evaluating three shallow classifiers: logistic regression (LR), support vector machine (SVM) and multilayer perceptron (MLP). Then, we improve the cough detection performance by implementing three deep neural network (DNN) classifiers: CNN, long short-term memory (LSTM) and Resnet50.

LR models have outperformed other, more complex classifiers such as classification trees, random forests and SVMs in several other clinical prediction tasks [3, 54, 55]. The gradient descent weight regularisation as well as the lasso (l1 penalty) and ridge (l2 penalty) estimators [56, 57] were the hyperparameters, listed in Table 3, optimised inside the nested cross-validation during training. SVM classifiers have also performed well in both detecting [58, 59] and classifying [60] cough events in the past. The independent term in the kernel function is the hyperparameter optimised for the SVM classifier. An MLP, consisting of multiple layers of neurons [61], is capable of learning non-linear relationships and has produced promising results in discriminating influenza coughs from other coughs [62]. MLPs have also been applied to classify TB coughs [47, 59] and to detect coughs in general [25, 63]. The penalty ratios, along with the number of neurons, are the hyperparameters optimised during leave-one-out cross-validation (Fig. 7 and Sect. 5).
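A sketch of these three shallow baselines and the kind of hyperparameters tuned for each is given below using scikit-learn; the grids shown are illustrative placeholders rather than the exact values of Table 3. Each (estimator, grid) pair can then be tuned inside the cross-validation described in Sect. 5.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

shallow_classifiers = {
    # Lasso (l1) and ridge (l2) penalties for logistic regression.
    "LR": (LogisticRegression(solver="saga", max_iter=5000),
           {"penalty": ["l1", "l2"], "C": [0.1, 1.0, 10.0]}),
    # coef0 is the independent term in the polynomial/sigmoid kernel functions.
    "SVM": (SVC(kernel="poly", probability=True),
            {"coef0": [0.0, 0.5, 1.0]}),
    # Number of neurons and l2 penalty ratio for the multilayer perceptron.
    "MLP": (MLPClassifier(max_iter=2000),
            {"hidden_layer_sizes": [(10,), (40,), (70,)], "alpha": [1e-4, 1e-2, 0.7]}),
}
```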

Figure 5
figure 5

CNN Classifier, trained and evaluated using leave-one-out cross-validation [64] on 14 patients. The results are shown in Tables 4, 5 and 6 for feature extraction hyperparameters mentioned in Table 2.

Figure 6
figure 6

LSTM classifier, trained and evaluated using leave-one-out cross-validation [64] on 14 patients. The results are shown in Tables 4, 5 and 6 for feature extraction hyperparameters mentioned in Table 2.

A CNN is a popular deep neural network architecture, primarily used in image classification [65], such as face recognition [66]. It has also performed well in classifying COVID-19 coughs, breath and speech [9, 48]. The CNN architecture [67, 68], shown in Fig. 5, contains \(\alpha _1\) 2D convolutional layers with kernel size \(\alpha _2\) and rectified linear units as activation functions. A dropout rate \(\alpha _3\) is applied along with max-pooling, followed by two dense layers containing \(\alpha _4\) and 8 units (the dimensionality of the output space) respectively, with rectified linear units as activation functions.

An LSTM model is a type of recurrent neural network which remembers previously-seen inputs when making its classification decision [69]. It has been used successfully in automatic cough detection [7, 24] and in other types of acoustic event detection [70, 71], including of COVID-19 coughs [9, 48]. The hyperparameters optimised for the LSTM classifier [72] are listed in Table 3 and illustrated in Fig. 6. The LSTM classifier, shown in Fig. 6, contains \(\beta _1\) LSTM units (i.e. the dimensionality of the output space is \(\beta _1\)) with rectified linear units as activation functions and a dropout rate \(\alpha _3\). Two dense layers containing \(\alpha _4\) and 8 units respectively are then applied, with rectified linear units as activation functions. For both the CNN and LSTM classifiers, a final softmax function produces one output for a cough event (i.e. 1) and another for a non-cough event (i.e. 0), as shown in Figs. 5 and 6. Features are fed into these two classifiers using a batch size of \(\xi _1\) for \(\xi _2\) epochs.

The 50-layer residual network (Resnet50) architecture (Table 1 of [73]) that we trained and evaluated is a very deep architecture containing skip connections, which has performed even better than existing deep architectures such as VGGNet on image classification tasks on datasets such as ILSVRC, CIFAR10 and the COCO object detection dataset [74]. This architecture has also performed best in detecting COVID-19 signatures in coughs, breaths and speech [9, 48]. Due to the extreme computational load, we use the default Resnet50 structure given in Table 1 of [73].
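The following Keras sketch illustrates how the CNN and LSTM architectures described above can be assembled. The number of convolutional filters and the default values of \(\alpha _1\) to \(\alpha _4\) and \(\beta _1\) are placeholders, not the optimised values of Table 3.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.metrics import AUC

def build_cnn(input_shape, a1=2, a2=3, a3=0.3, a4=32):
    """CNN of Fig. 5: a1 conv layers of kernel size a2, dropout a3, dense layers of a4 and 8 units."""
    model = models.Sequential([layers.Input(shape=input_shape)])
    for _ in range(a1):
        model.add(layers.Conv2D(8, kernel_size=a2, activation="relu", padding="same"))
    model.add(layers.MaxPooling2D())
    model.add(layers.Dropout(a3))
    model.add(layers.Flatten())
    model.add(layers.Dense(a4, activation="relu"))
    model.add(layers.Dense(8, activation="relu"))
    model.add(layers.Dense(2, activation="softmax"))   # cough (1) vs non-cough (0)
    return model

def build_lstm(input_shape, b1=64, a3=0.3, a4=32):
    """LSTM of Fig. 6: b1 LSTM units with dropout a3, then dense layers of a4 and 8 units."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.LSTM(b1, activation="relu", dropout=a3),
        layers.Dense(a4, activation="relu"),
        layers.Dense(8, activation="relu"),
        layers.Dense(2, activation="softmax"),
    ])

# Example input shapes: (C, Psi/2 + 5, 1) for the CNN and (S, 3M + 2) for the LSTM,
# here with C = 10, Psi = 64, S = 100 and M = 26.
cnn = build_cnn((10, 37, 1))
lstm = build_lstm((100, 80))
cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=[AUC()])
```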

Table 3 Classifier hyperparameters, optimised using the leave-one-patient-out cross-validation.

5 Classification Process

5.1 Hyperparameter Optimisation

Hyperparameters for both the classifiers and the feature extraction are optimised inside the leave-one-out cross-validation process and are listed in Tables 2 and 3. Different phases of an event might carry important information, and our feature extraction preserves this time-domain information. By varying the frame lengths and the number of frames extracted, this information was varied. The spectral resolution was also varied by varying the number of lower-order MFCCs retained from the audio signal.

5.2 Cross-Validation

Figure 7
figure 7

Leave-one-out cross-validation used to train and evaluate all six classifiers. Here, \(N = 14\) (Table 1). The development set (DEV), consisting of 1 patient, is used to optimise the hyperparameters while training on the TRAIN set, which consists of 12 patients. The final evaluation of the classifiers in terms of the AUC occurs on the TEST set, consisting of 1 patient.

All six classifiers were trained and evaluated using a leave-one-patient-out cross-validation scheme [64], as explained in Fig. 7. Our dataset contains only 14 patients, and this cross-validation scheme makes the best use of it, since a patient’s weight, coughing intensity and distance from the microphone can affect the accelerometer and audio signals, and we were not permitted to collect this information due to the ethical constraints of the study.

Figure 7 shows that one patient is left out of the 14 for later independent testing. Another patient is then removed from the remaining 13 to serve as the development set, on which the hyperparameters listed in Table 3 are optimised. The AUC is always the optimisation criterion in this cross-validation. The entire procedure is repeated until every patient has been used as the independent test set in the outer loop. The final performance is evaluated by calculating and averaging the AUC over these outer loops. The hyperparameters producing the highest AUC over the outer test sets are noted as the ‘best hyperparameters’ in Tables 4, 5 and 6. The performance achieved by each classifier for each set of hyperparameters is indicated by an ‘ID’ in these tables, and a sketch of the procedure follows below.
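In this minimal sketch of the nested procedure, the names `patients` and `build_and_train` and the rule for choosing the DEV patient are illustrative simplifications of the scheme in Fig. 7.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def leave_one_patient_out(patients, build_and_train, hyperparameter_grid):
    """Nested leave-one-patient-out cross-validation, optimising AUC.

    `patients` maps patient IDs to (X, y); `build_and_train(hp, X, y)` returns a
    fitted model exposing predict_proba."""
    outer_aucs = []
    for test_id in patients:                               # outer loop: TEST patient
        remaining = [p for p in patients if p != test_id]
        dev_id, train_ids = remaining[0], remaining[1:]    # 1 DEV patient, 12 TRAIN patients
        X_tr = np.vstack([patients[p][0] for p in train_ids])
        y_tr = np.concatenate([patients[p][1] for p in train_ids])

        # Inner loop: select the hyperparameters maximising AUC on the DEV patient.
        best_hp, best_auc = None, -np.inf
        for hp in hyperparameter_grid:
            model = build_and_train(hp, X_tr, y_tr)
            auc = roc_auc_score(patients[dev_id][1],
                                model.predict_proba(patients[dev_id][0])[:, 1])
            if auc > best_auc:
                best_hp, best_auc = hp, auc

        # Evaluate the selected configuration on the held-out TEST patient.
        model = build_and_train(best_hp, X_tr, y_tr)
        X_te, y_te = patients[test_id]
        outer_aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

    return float(np.mean(outer_aucs))                      # mean AUC over the outer folds
```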

6 Results

6.1 Accelerometer-Based Cough Detection

Table 4 lists the performance achieved by the shallow classifiers in systems C1 to C18 and Table 5 lists the performance achieved by the deep architectures in systems C19 to C36. These results are the averages over the 14 leave-one-patient-out testing partitions in the outer loop of the nested cross-validation.

The shallow classifiers provide the baseline classification performance. Table 4 shows that the LR classifier achieved its best performance, an AUC of 0.8135 with \(\sigma _{AUC}\) of 0.003, a specificity of 81.42%, a sensitivity of 81.28% and an accuracy of 81.35%, in system C4. The SVM produced, as its best performance, an AUC of 0.8252 with \(\sigma _{AUC}\) of 0.003, 80.91% specificity, 84.11% sensitivity and 82.51% accuracy when using ten 32-sample frames (system C10). However, an AUC of 0.8587, an accuracy of 85.67%, a specificity of 84.47% and a sensitivity of 86.89% were achieved by the MLP classifier with 40 neurons and an l2 penalty ratio of 0.7 using five 64-sample frames (system C17); this is the highest AUC achieved by a shallow classifier.

For the DNN classifiers, the lowest AUC of 0.9243 was achieved by a CNN classifier in system C19 of Table 5. Table 5 also shows that the best-performing CNN uses ten 64-sample (640 msec) frames to achieve an AUC of 0.9499, an accuracy of 85.82%, a specificity of 80.91% and a sensitivity of 90.73% (system C24). The optimal LSTM classifier achieves a slightly higher AUC of 0.9572 when features are extracted using ten 32-sample (320 msec) frames (system C28). However, the best performance is achieved by the Resnet50 architecture, with an AUC of 0.9888 after 50 epochs using ten 32-sample (320 msec) frames, along with 96.71% accuracy, 94.09% specificity and 99.33% sensitivity (system C34).

The deep architectures have produced higher AUCs and lower \(\sigma _{AUC}\) than the shallow classifiers on the accelerometer-based classification task. Figure 8 shows the mean ROC curves for the optimal LR, SVM, MLP, CNN, LSTM and Resnet50 classifiers, whose configurations are given in Table 3; the mean AUCs were calculated over the 14 cross-validation folds. The Resnet50 classifier is superior to all other classifiers over a wide range of operating points (Fig. 8).

Figure 8
figure 8

Mean ROC curves for accelerometer-based cough detection, for the best-performing classifiers whose hyperparameters are given in Table 3. The Resnet50 performs best, outperforming all other classifiers over a wide range of operating points and achieving an AUC of 0.9888 and an accuracy of 96.71%.

Table 4 Accelerometer-based cough detection results for the shallow classifiers. The values are averaged over 14 cross-validation folds. The highest AUC of 0.8587 was achieved by an MLP classifier.
Table 5 Accelerometer-based cough detection results for the DNN classifiers. The values are averaged over 14 cross-validation folds. The DNN classifiers outperform the shallow classifiers by a wide margin, and a Resnet50 produces the highest AUC of 0.9888 in detecting cough events.

6.2 Audio-Based Cough Detection

Table 6 Audio-based cough detection results. The values are averaged over 14 cross-validation folds and the best-three performances of each classifier are shown. All classifiers have performed well in detecting coughs but DNN classifiers have performed particularly well and their performances are very close to each other.

To place the performance of the accelerometer-based cough detection presented in the previous section into perspective, we have performed a matching set of experiments, this time using the audio signals to perform audio-based cough detection. These experiments are based on precisely the same events as the acceleration experiments, since our corpus contains both audio and acceleration signals for each.

Table 6 shows the best-three configurations for each of the six classifier architectures in systems D1 to D18. Again, the results indicate that shallow classifiers (LR, SVM and MLP) achieve good classification scores.

The LR classifier achieved its highest AUC of 0.9129 with \(\sigma _{AUC}\) of 0.003 when using 26 MFCCs, 512-sample frames and 100 extracted frames (system D1); this system also achieved a specificity of 87.52%, a sensitivity of 87.71% and an accuracy of 87.61%. The SVM achieved an AUC of 0.9066 with \(\sigma _{AUC}\) of 0.003 for 26 MFCCs, 1024-sample frames and 120 extracted frames (system D4), along with a specificity of 86.75%, a sensitivity of 86.91% and an accuracy of 86.83%. The MLP produced the highest AUC of 0.9254 with \(\sigma _{AUC}\) of 0.002 for 39 MFCCs, 2048-sample frames and 120 extracted frames (system D7), along with a specificity of 89.47%, a sensitivity of 90.10% and an accuracy of 89.78%. This is the best performance achieved by the shallow classifiers.

Again, the DNN classifiers outperform the shallow classifiers by a large margin. The best LSTM classifier produced an AUC of 0.9932 with \(\sigma _{AUC}\) of 0.002 for 26 MFCCs, 1024-sample frames and 70 extracted frames (system D10), along with a specificity of 94.57%, a sensitivity of 96.59% and an accuracy of 95.58%. The best CNN classifier produced an AUC of 0.9944 with \(\sigma _{AUC}\) of 0.002 for 26 MFCCs, 1024-sample frames and 100 extracted frames (system D13), along with a specificity of 93.24%, a sensitivity of 97.88% and an accuracy of 95.56%. However, the highest AUC of 0.9957 was again achieved by a Resnet50 classifier, with a \(\sigma _{AUC}\) of 0.001, for 26 MFCCs, 1024-sample (i.e. 46.44 msec) frames and 100 frames extracted from the entire event (system D16). This system also achieved a specificity of 96.74%, a sensitivity of 99.5% and an accuracy of 98.13%.

Figure 9
figure 9

Mean ROC curves for audio-based cough detection, for the best-performing classifiers whose hyperparameters are given in Table 3. The best performance is achieved by a Resnet50, with similar performances from the CNN and LSTM. The best Resnet50 produces an AUC of 0.9957 and an accuracy of 98.13%.

Again, the deep architectures have produced higher AUCs and lower \(\sigma _{AUC}\) than the shallow classifiers on the audio-based classification task. These best results for audio-based classification are shown in Fig. 9. Table 6 also indicates that, although the number of MFCCs was varied between 13 and 65, the best performance was achieved using 26 and 39 MFCCs. A frame length of 1024 samples and 100 extracted frames per event provided the best performance for most of the classifiers.

7 Discussion

The results shown in Tables 4, 5 and 6 indicate that audio-based cough detection is consistently more accurate than accelerometer-based classification. However, it is interesting to note that the performances offered by the two alternatives are fairly close. In fact, the deep architectures like Resnet50 offer almost equal performance for audio-based and accelerometer-based cough detection. It also seems that the CNN and LSTM find it easier to classify cough events based on audio rather than accelerometer signal. We postulate that this is due to the limited range of time and frequency information contained in accelerometer data, which is in turn due to the lower accelerometer sampling rate.

For the acceleration signals, the extraction of 10 frames, each with a length of 640 ms, produced the best result. For the audio, the extraction of 100 frames, each with a length of 46.44 ms, provided the optimal performance. These audio frame lengths are close to those traditionally used for feature extraction in automatic speech recognition. We also note that the performance of the deep classifiers is consistently better than that of the baseline shallow classifiers for both types of signal. Although the datasets differ, our system also appears to improve on recent work using the accelerometer integrated into a smartwatch [75] and on work distinguishing coughs from other audio events such as sneezing, speech and noise [7].

8 Conclusion and Future Work

We have demonstrated that an automatic, non-invasive, machine-learning-based cough detector can accurately discriminate between the accelerometer and audio signals produced by coughing and those produced by other movements, as captured by a consumer smartphone attached to a patient’s bed.

We have trained and evaluated six classifiers: three shallow classifiers, namely logistic regression (LR), a support vector machine (SVM) and a multilayer perceptron (MLP), and three deep neural network (DNN) classifiers, namely a convolutional neural network (CNN), a long short-term memory (LSTM) network and a 50-layer residual network (Resnet50). A specially-compiled corpus of manually-annotated acceleration and audio events, including approximately 6000 cough and 68000 non-cough events such as sneezing, throat-clearing and getting in and out of bed, gathered from 14 adult male patients in a small TB clinic, was used to train and evaluate these classifiers using a leave-one-out cross-validation scheme. For accelerometer-based classification, the best system uses a Resnet50 architecture and produces an AUC of 0.9888 as well as 96.71% accuracy, 94.09% specificity and 99.33% sensitivity, with features extracted from ten 32-sample (320 msec) frames. This demonstrates that it is possible to discriminate between cough events and other non-cough events by using very deep architectures such as a Resnet50, based on signals gathered from an accelerometer that is not attached to the patient’s body, but rather to the headboard of the patient’s bed.

We have also compared this accelerometer-based cough detection with audio-based cough detection for the same cough and non-cough events. For audio-based classification, the best result was also achieved by a Resnet50, with the highest AUC of 0.9957. This shows that accelerometer-based cough detection is almost as accurate as audio-based classification when using very deep architectures such as a Resnet50. Shallow classifiers and DNNs such as the CNN and LSTM, however, classify cough events better from audio than from accelerometer signals, as the audio signal carries more dynamic and diverse frequency content.

Accelerometer-based detection of cough events has been considered successfully before, owing to its lower sampling rates and lower demand for processing power, but only using sensors worn by the subjects, which is intrusive and can be inconvenient. This study shows that excellent discrimination is also possible when the sensor is attached to the patient’s bed, providing a less intrusive and more convenient solution. Furthermore, since using the acceleration signal avoids the need to gather audio, privacy is inherently protected. Therefore, a bed-mounted accelerometer inside an inexpensive consumer smartphone may represent a more convenient, cost-effective and readily accepted method of long-term patient cough monitoring.

In the future, we will attempt to optimise some of the Resnet50 metaparameters and to fuse the audio and accelerometer signals to achieve higher specificity and accuracy in cough detection. We are also in the process of incorporating the proposed system into an automatic non-invasive cough monitoring system. Finally, we note that manually annotated cough events sometimes contain multiple bursts of cough onsets, and we are currently investigating automatic methods that allow such bursts within a cough event to be identified.