1 Introduction

Respiratory events are episodes during sleep where respiratory activity is disturbed. The American Academy of Sleep Medicine (AASM) scoring manual describes two respiratory events, namely apnea and hypopnea [1]. According to the scoring manual, apnea is the disruption of breathing during sleep with a reduction in airflow of at least 90% for at least 10 s while hypopnea is a reduction of airflow of at least 30%, with an additional requirement of having an associated oxygen desaturation and/or arousal [1]. The apnea–hypopnea index (AHI) is the measure used to quantify sleep apnea and hypopnea and classify its severity. AHI is defined as the number of apnea and hypopnea events per hour during sleep. Polysomnography (PSG) is a multi-sensor overnight recording of sleep and is the gold standard for sleep diagnosis. It is used to obtain the AHI. The relevant respiratory sensors are oronasal thermistor, nasal pressure cannula, abdomen and thorax respiratory inductance plethysmography (RIP) belts, finger pulse oximeter, and electroencephalography [1].

Aside from the standard PSG, home sleep apnea testing devices, such as the level III device, are currently in use. According to the AASM guideline, a level III sleep study is a smaller PSG version using fewer signals in a portable device, where a minimum of four signals are required: heart rate, oxygen saturation and two channels of respiratory movement or respiratory movement and airflow [2]. Correspondingly, we will present a novel method for detecting respiratory events that is suitable for level III systems and yields comparable results to detection performed with full PSG equipment.

In a previous study, three home sleep apnea testing devices using a reduced number of sensors (Nox T-3 (Nox Medical), ARES (SleepMed Inc.), and WatchPAT (Itamar Medical)) were compared to PSG and showed excellent agreement, with an intra-class correlation coefficient (ICC) > 0.93 [3]. SpO2, measured by a pulse oximeter, is available in level III devices. It has been shown to be a reliable signal for detecting respiratory events: an automated detection based only on SpO2 reported an average accuracy of 91% and an average Cohen’s kappa of 0.71. However, because of the delay between apnea onset and the SpO2 response, the start and end times of respiratory events cannot be obtained directly, so a 25-second correction was applied [4]. We wanted to investigate whether respiratory events can be detected using RIPsum and SpO2, signals that are readily available in level III sleep devices, such that the predicted AHI is comparable to PSG scoring. The two sensors were combined to ensure that the precise time and duration of each respiratory event were also detected. A further aim was to provide a detection system that does not depend on the nasal pressure sensor and thus offers an alternative when nasal sensors fail in level III devices.

2 Materials and Methods

2.1 Subjects

A total of 98 patients, 77 with suspected sleep apnea and 21 without, were included in this study. Each patient underwent an overnight PSG recording at Advanced Sleep Research GmbH in Berlin, Germany or at Kepler University Hospital, Department of Neurology 2, in Linz, Austria. The Linz clinic used Somnoscreen Plus with Domino software (Somnomedics, Randersacker, Germany) while the Berlin clinic used the EMBLA N7000 system with RemLogic 3.4.1 software (Embla systems, Broomfield, CO, USA). Of the subjects, 65 were men and 33 were women. The mean age was 53 years (± 15.2). The mean body mass index (BMI) was 28.4 kg/m² (± 5.2). PSG data of the patients are shown in Appendix Table 1. The study protocol was approved by the ethics committee of the state of Upper Austria (B-130–17) and the Charité – Universitätsmedizin Berlin (EA1/127/16). Written, signed consent forms were obtained from the patients prior to inclusion in the study.

2.2 Respiratory Events and AHI Calculation

2.2.1 Procedure

Each PSG recording came with respiratory event annotations detected by the PSG software. Apneas and hypopneas were pooled together and referred to as “respiratory events.” The PSG AHI was calculated from the annotations as the number of respiratory events divided by the total sleep time (TST), where TST was derived from the PSG’s hypnogram. The PSG AHI served as the reference for our study.

The automatic system for detecting respiratory events proposed in this study used signals available in level III devices, the uncalibrated RIPsum and SpO2 signals. The algorithm outputs the time and duration of the detected respiratory events. The algorithm also calculates the predicted AHITRT per recording. Unlike in PSG, sleep staging is not readily available in level III devices. Therefore, the predicted AHI was calculated using the total recording time (TRT). This study compared the predicted AHITRT computed using the TRT to the PSG AHI that was computed using TST. The difference between AHIs computed using TRT and TST is affected by the distribution of respiratory events during sleep and wake periods. When predicted respiratory events occur mainly during sleep periods, the AHITRT would consequently be underestimated compared to the PSG AHI given that the TRT would be higher than the TST.
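The effect of the denominator can be illustrated with a small calculation. The sketch below is only an illustration: the function name `ahi` and the event count and durations are hypothetical values, not data from this study.

```python
def ahi(n_events: int, duration_hours: float) -> float:
    """Events per hour over the chosen duration (TST or TRT)."""
    if duration_hours <= 0:
        raise ValueError("duration must be positive")
    return n_events / duration_hours


# Hypothetical recording: all 120 events occur during sleep.
n_events = 120
tst_hours = 6.0  # total sleep time (from a hypnogram)
trt_hours = 8.0  # total recording time (always available)

ahi_tst = ahi(n_events, tst_hours)
ahi_trt = ahi(n_events, trt_hours)
print(f"AHI_TST = {ahi_tst:.1f}, AHI_TRT = {ahi_trt:.1f}")
```

Because the TRT is at least as long as the TST, the same event count yields an AHI based on TRT that is no higher than the one based on TST when events are confined to sleep.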

The severity level of each recording was determined and compared to the reference scoring. The levels of severity were as follows: mild if the AHI range was between 5 and 15, moderate when between 15 and 30 and severe when the AHI was above 30 [5]. The normal category was also considered in this study for those recordings with AHI < 5.
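The severity mapping can be written as a short function. This is a minimal sketch; the function name `severity` is hypothetical, and values lying exactly on a cutoff are assumed to fall into the higher category, in line with the cutoffs listed in Sect. 2.3.

```python
def severity(ahi: float) -> str:
    """Map an AHI value (events per hour) to a severity category.

    Assumption: a value exactly on a cutoff belongs to the higher category.
    """
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"


for value in (2.3, 5.0, 14.9, 15.0, 42.0):
    print(value, severity(value))
```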

2.2.2 Automatic Detection of RIPsum Events

RIPsum reduction events were detected by first deriving a smooth upper envelope signal of the uncalibrated RIPsum. The upper envelope signal was derived by identifying all positive peaks in the RIPsum and performing a spline interpolation. The locations of the major peaks of the smooth signal were identified, and segments of the RIPsum were created between the peaks. Each segment was automatically searched for any sub-segment with an amplitude lower than a threshold. The threshold was based on the amplitude of the peak at the beginning of the segment and was determined heuristically. A sub-segment identified as being lower than the threshold had to be at least 10 s in duration in order to be classified as a RIPsum event.
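The steps above can be sketched in plain Python. This is an illustrative approximation and not the implementation used in the study. Several choices are assumptions: linear interpolation stands in for the spline, a "major" peak is taken to be an envelope sample that is maximal within a ±30 s window, and the threshold is a hypothetical fraction (`drop_fraction`) of the amplitude at the start of each segment, since the heuristic value is not stated in the text.

```python
import math


def local_maxima(x):
    """Indices of samples higher than the left and not lower than the right neighbour."""
    return [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] >= x[i + 1]]


def upper_envelope(x):
    """Connect positive peaks linearly (assumption: the source uses a spline)."""
    peaks = [i for i in local_maxima(x) if x[i] > 0]
    if len(peaks) < 2:
        return list(x)
    env = [x[peaks[0]]] * len(x)          # hold the first peak value before it
    for a, b in zip(peaks, peaks[1:]):
        for i in range(a, b + 1):
            w = (i - a) / (b - a)
            env[i] = (1 - w) * x[a] + w * x[b]
    for i in range(peaks[-1], len(x)):    # hold the last peak value after it
        env[i] = x[peaks[-1]]
    return env


def major_peaks(env, half_window):
    """Assumed definition: samples that are maximal within +/- half_window samples."""
    return [i for i in range(len(env))
            if env[i] >= max(env[max(0, i - half_window): i + half_window + 1])]


def detect_ripsum_events(x, fs, drop_fraction=0.5, min_duration_s=10.0,
                         major_window_s=30.0):
    """Return (start_s, duration_s) of envelope drops lasting at least min_duration_s."""
    env = upper_envelope(x)
    bounds = sorted({0, len(env) - 1, *major_peaks(env, int(major_window_s * fs))})
    events = []
    for a, b in zip(bounds, bounds[1:]):
        threshold = drop_fraction * env[a]  # relative to the segment's starting peak
        start = None
        for i in range(a, b + 1):
            low = env[i] < threshold
            if low and start is None:
                start = i
            if start is not None and (not low or i == b):
                if (i - start) / fs >= min_duration_s:
                    events.append((start / fs, (i - start) / fs))
                start = None
    return events


# Synthetic breathing at 15 breaths/min, sampled at 10 Hz, with the
# amplitude reduced to 25% between 60 s and 80 s.
fs = 10.0
x = [(0.25 if 600 <= i < 800 else 1.0) * math.sin(2 * math.pi * i / 40)
     for i in range(1400)]
events = detect_ripsum_events(x, fs)
print(events)  # one event is expected, starting near 60 s
```

The window-based definition of major peaks keeps small fluctuations inside a reduced-amplitude stretch from splitting one event into several short ones; other definitions, such as a prominence criterion, would serve the same purpose.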

2.2.3 Automatic Detection of SpO2 Desaturation Events

We defined SpO2 desaturation events as events in the SpO2 signal with a desaturation of at least 3% and a subsequent return to the pre-desaturation oxygen level. The detection was performed by locating the major peaks of the SpO2 signal, as shown in Fig. 1, and measuring any desaturation in the signal between the peaks. The start of such an event was set at the start of the desaturation, and the event ended when SpO2 re-saturation was complete, before the next peak.
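A simplified version of this peak-to-peak search is sketched below. It is an assumption-laden illustration rather than the study's implementation: every plateau that is higher than its neighbours is treated as a peak (no separate selection of "major" peaks), and the function names are hypothetical.

```python
def runs(x):
    """Collapse consecutive equal samples into (value, first_index, last_index)."""
    out = []
    for i, v in enumerate(x):
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1], i)
        else:
            out.append((v, i, i))
    return out


def detect_desaturations(spo2, fs, min_drop=3.0):
    """Return (start_s, duration_s) of drops of at least min_drop percentage points."""
    r = runs(spo2)
    # A peak is a plateau higher than both neighbouring plateaus.
    peaks = [k for k in range(len(r))
             if (k == 0 or r[k - 1][0] < r[k][0])
             and (k == len(r) - 1 or r[k + 1][0] < r[k][0])]
    events = []
    for p, q in zip(peaks, peaks[1:]):
        pre_level = r[p][0]
        nadir_k = min(range(p + 1, q), key=lambda k: r[k][0])
        if pre_level - r[nadir_k][0] < min_drop:
            continue
        start = r[p][2] + 1   # first sample below the preceding peak
        end = r[q][1]         # fall back to the next peak
        for k in range(nadir_k, q + 1):
            if r[k][0] >= pre_level:  # re-saturation to the pre-event level
                end = r[k][1]
                break
        events.append((start / fs, (end - start) / fs))
    return events


# Hypothetical 1 Hz SpO2 trace: a 5-point dip followed by a 2-point dip.
spo2 = ([97] * 10 + [96, 95, 94, 93, 92, 93, 94, 95, 96]
        + [97] * 11 + [96, 95, 96] + [97] * 5)
desaturations = detect_desaturations(spo2, fs=1.0)
print(desaturations)
```

Only the first dip qualifies, because the second one stays below the 3% criterion.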

Fig. 1

Illustration of respiratory event detection using RIPsum and SpO2. As the RIPsum event and the SpO2 event occurred within 60 s of each other, a respiratory event was detected with the same time and duration as the RIPsum event.

2.2.4 RIPsum Events and SpO2 Desaturation Events for Detection

The detection of respiratory events was performed by pairing RIPsum events with associated SpO2 events, as illustrated in Fig. 1. To do so, the algorithm identified pairs of events with a RIPsum event followed by an SpO2 event with a maximum delay of 60 s. When such a pair was found, the algorithm labeled the event as a respiratory event, with the starting time and duration identical to the RIPsum event. The predicted AHITRT was calculated according to the total number of respiratory events and the TRT.
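The pairing rule can be sketched as follows. The text does not state between which time points the 60 s delay is measured, so this sketch assumes that the desaturation must start no earlier than the RIPsum event and no later than 60 s after the RIPsum event ends. The function names and the example events are hypothetical.

```python
def pair_events(rip_events, spo2_events, max_delay_s=60.0):
    """Keep RIPsum events that are followed by a desaturation.

    Both inputs are lists of (start_s, duration_s). Assumption: a desaturation
    counts as "following" if it starts at or after the RIPsum onset and at most
    max_delay_s after the RIPsum event ends.
    """
    respiratory_events = []
    for rip_start, rip_duration in rip_events:
        rip_end = rip_start + rip_duration
        for spo2_start, _ in spo2_events:
            if spo2_start >= rip_start and spo2_start - rip_end <= max_delay_s:
                # The respiratory event inherits the RIPsum timing.
                respiratory_events.append((rip_start, rip_duration))
                break
    return respiratory_events


def predicted_ahi_trt(respiratory_events, trt_hours):
    """Number of detected respiratory events per hour of recording."""
    return len(respiratory_events) / trt_hours


rip = [(100.0, 20.0), (300.0, 15.0), (600.0, 12.0)]
spo2 = [(115.0, 30.0), (700.0, 25.0)]
paired = pair_events(rip, spo2)
print(paired, predicted_ahi_trt(paired, trt_hours=8.0))
```

This sketch does not enforce one-to-one matching, so a single long desaturation could confirm more than one RIPsum event; whether that is desirable depends on the intended scoring rules.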

2.3 Statistical Analysis

The algorithm was evaluated by comparing the predicted AHITRT to the PSG AHI. The predicted AHITRT was computed using the TRT because the algorithm was designed for level III application, while the PSG AHI was computed using the TST. An evaluation was also carried out by computing a predicted AHITST using the TST, in order to assess the performance of the algorithm when sleep/wake information is available. While standard level III devices do not come readily equipped with sleep/wake information, some are compatible with portable electroencephalography (EEG) for sleep staging. Spearman’s r was calculated to evaluate the correlation between the predicted AHI and the PSG AHI. The ICC r was computed as the metric of reliability, i.e. the degree of correlation and agreement. An ICC \(r >0.90\) indicates excellent reliability [6]. Following the guideline presented in [6], we used the two-way mixed effects ICC model with absolute agreement and single rater type. The ICC r was computed with the upper and lower bounds of the 95% confidence interval (CI). Additionally, Bland–Altman analysis was performed to evaluate agreement with the PSG AHI by the mean difference and the limits of agreement, set at ± 1.96 standard deviations of the differences, i.e. the interval expected to contain about 95% of the differences [7].
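Two of these agreement measures are compact enough to sketch with the standard library. The code below is illustrative only: the function names and the four paired values are hypothetical, and the confidence interval of the ICC is not computed. The ICC follows the standard formula for a two-way model with absolute agreement and a single rater, \((MS_R - MS_E) / (MS_R + (k-1)MS_E + \tfrac{k}{n}(MS_C - MS_E))\), with k = 2 scorings of n recordings.

```python
from statistics import mean, stdev


def bland_altman(pred, ref):
    """Mean difference and limits of agreement (mean +/- 1.96 SD)."""
    diffs = [p - r for p, r in zip(pred, ref)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def icc_absolute_single(pred, ref):
    """Two-way ICC, absolute agreement, single rater, for two scorings."""
    rows = list(zip(pred, ref))
    n, k = len(rows), 2
    grand = mean(v for row in rows for v in row)
    row_means = [mean(row) for row in rows]
    col_means = [mean(col) for col in zip(*rows)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k / n * (ms_cols - ms_err))


# Hypothetical AHI pairs: the prediction is always exactly 1 point too high.
pred = [2.0, 3.0, 4.0, 5.0]
ref = [1.0, 2.0, 3.0, 4.0]
bias, lower, upper = bland_altman(pred, ref)
icc = icc_absolute_single(pred, ref)
print(bias, lower, upper, round(icc, 3))
```

In this toy example the two scorings are perfectly correlated, yet the ICC stays below 1 because the absolute-agreement form penalizes the constant offset. This is the reason for reporting it alongside a correlation coefficient.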

Aside from the comparison of the predicted AHI to the PSG AHI, the severity classifications were also compared. Severity categories were defined as follows [5]: mild (5 ≤ AHI < 15), moderate (15 ≤ AHI < 30), severe (AHI ≥ 30) and the normal category (AHI < 5). A confusion matrix and the accuracy were computed to compare the predicted severity to the PSG-classified severity.
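The comparison can be sketched as a 4 × 4 confusion matrix built from paired AHI values. The helper names and the five example pairs below are hypothetical; rows index the PSG category and columns the predicted category.

```python
from bisect import bisect_right

LABELS = ["normal", "mild", "moderate", "severe"]
CUTOFFS = [5, 15, 30]  # lower bounds of mild, moderate and severe


def level(ahi):
    """Index into LABELS; a value on a cutoff falls into the higher category."""
    return bisect_right(CUTOFFS, ahi)


def confusion_matrix(pred_ahi, ref_ahi):
    matrix = [[0] * len(LABELS) for _ in LABELS]
    for p, r in zip(pred_ahi, ref_ahi):
        matrix[level(r)][level(p)] += 1
    return matrix


def summarize(matrix):
    """Return (accuracy, overestimated count, underestimated count)."""
    size = len(matrix)
    total = sum(map(sum, matrix))
    correct = sum(matrix[i][i] for i in range(size))
    over = sum(matrix[i][j] for i in range(size) for j in range(size) if j > i)
    return correct / total, over, total - correct - over


# Hypothetical paired AHI values (reference first, then prediction).
ref_ahi = [2.0, 7.0, 20.0, 40.0, 14.0]
pred_ahi = [3.0, 16.0, 18.0, 29.0, 15.0]
matrix = confusion_matrix(pred_ahi, ref_ahi)
accuracy, over, under = summarize(matrix)
print(matrix, accuracy, over, under)
```

Counting entries above and below the diagonal separately makes it possible to report overestimated and underestimated severities in addition to the overall accuracy.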

3 Results

Figure 2 shows the predicted AHITRT compared to the PSG AHIs, whereas the full results are tabulated in Appendix Table 3. The PSG AHIs were calculated using the TST while the predicted AHIs were computed using the TRT. A comparison between the predicted AHITRT and the PSG AHI resulted in a low median absolute difference, \(\left| {\Delta {\text{AHI}}} \right|\) = 2.8, Spearman’s r = 0.96 (p < 0.001) and ICC r = 0.96 (0.95 – 0.97, p < 0.001). The ICC indicates excellent reliability, suggesting high correlation and agreement. Furthermore, 70 recordings out of 98 had an \(\left| {\Delta {\text{AHI}}} \right|\) ≤ 5. The Bland–Altman plot in Fig. 3 shows a mean difference of 0.6, where all but two recordings lie within or on the border of the limits of agreement. The only significant outlier in Fig. 2 is recording no. 19 (Appendix Table 3), with a total sleep efficiency of 67%.

Fig. 2

Predicted AHITRT vs. AHIPSG

Fig. 3

Bland–Altman plot between predicted AHITRT vs. AHIPSG

The algorithm was also evaluated using TST, to test performance of the algorithm when sleep information is available, with the predicted AHITST shown in Fig. 4. The predicted AHITST compared to PSG AHI resulted in Spearman’s r = 0.97 (p < 0.001), ICC r = 0.97 (0.96 – 0.98, p < 0.0001), and median \(\left| {\Delta {\text{AHI}}} \right|\)= 2.6. The Bland–Altman plot in Fig. 5 shows a mean difference of − 1.1.

Fig. 4

Predicted AHITST vs. AHIPSG

Fig. 5

Bland–Altman plot between predicted AHITST vs. AHIPSG

Table 1 shows the confusion matrix for severity classification. The classification was based on the predicted AHITRT. The algorithm classified the correct severity for 75.5% (n = 74) of the recordings. Of the remaining recordings, 16 were classified as a higher severity by the algorithm and 8 were classified as a lower severity. The average \(\left| {\Delta {\text{AHI}}} \right|\) for the underestimated recordings was 4.9, while for the overestimated recordings it was 5.0. The average absolute difference of the predicted AHI from the cutoff of the correct class (e.g. the point difference of the predicted AHI from 5 or 15 for the mild category) was 3.0 for the underestimated recordings and 3.2 for the overestimated recordings. No recording was misclassified by more than one severity level. The severity classification based on the predicted AHITST, shown in Table 2, performed better, with 80.6% (n = 79) accuracy, 11 overestimated recordings and 8 underestimated recordings.

Table 1 Severity classification confusion matrix. 75.5% (n = 74) of recordings correctly classified
Table 2 Severity classification confusion matrix. 80.6% (n = 79) of recordings correctly classified

4 Discussion

In this study, we developed an algorithm using only RIPsum and SpO2 to detect respiratory events, intended for use with level III home sleep apnea testing devices. We tested our algorithm on 98 patients, 77 with suspected sleep apnea and 21 without.

The predicted AHITRT performed well, with an ICC of r = 0.96 and Spearman’s r = 0.96 when compared to the PSG AHI. The results suggest that our algorithm has a high level of agreement and correlation with full PSG-based AHIs. The median absolute difference \(\left| {\Delta {\text{AHI}}} \right|\) was only 2.8, and 70 of the recordings had an \(\left| {\Delta {\text{AHI}}} \right|\) ≤ 5. The outlier shown in Fig. 3 was underestimated by 26.6 AHI points. This was caused by the predicted events occurring predominantly during sleep time: because the AHITRT divides the event count by the longer TRT rather than the TST, events concentrated in sleep periods yield a lower AHITRT.

Level III devices do not come readily equipped with sleep staging. Therefore, the predicted AHI was computed using the TRT. Nevertheless, sleep staging can be added when portable EEG devices are integrated into level III devices. To test whether our algorithm can perform reliably when sleep/wake information is added, we computed the predicted AHI using TST from the hypnogram. The predicted AHITST scored an ICC of r = 0.97. The algorithm improved when given the sleep and wake information. Nevertheless, even without the TST, the algorithm performed with excellent reliability.

The severity classification (based on the predicted AHI using TRT) scored 75.5% accuracy, with 16 of the recordings overestimated. If overestimated severity is considered acceptable, then 92% (n = 90) of the recordings were given a safe classification. This assumes that it is safer to overestimate the severity than to underestimate it, i.e. it is better to report a moderate AHI for a patient with a mild AHI than to report a normal AHI when it is actually mild. For the eight underestimated recordings, the predicted AHITRT was on average only 3.0 points away from the cutoff of its correct severity level, and only two of them were misclassified as normal instead of mild. For comparison, we also performed the severity classification using the predicted AHITST. Using the total sleep time, the accuracy of the predicted severity increased to 80.6%, as expected.

The results of our algorithm are comparable to those of studies on detection algorithms with fewer or novel sensors. Three different home sleep apnea testing devices scored ICC r = 0.93 − 0.97 compared to PSG scoring, showing high reliability [3]. The WatchPAT validation study for AHI estimation reported a Spearman’s r of 0.802 between the device’s rapid eye movement sleep (REM) and non-REM-based AHI versus PSG AHI scores [8]. Another study reported a predictive model for apneas and hypopneas using SleepView, a portable two-channel diagnostic device for sleep-related diseases: using a nasal pressure cannula transducer and a pulse oximetry sensor, a correlation of \(r^{2} = 0.84, p < 0.01\) was reported in 93 subjects between the AHI calculated from the TRT by the SleepView software and the PSG AHI based on the TST [9]. Our algorithm also performed on par with other apnea prediction systems that do not use airflow sensors. Using a microphone placed one meter above the bed to detect snoring and estimate AHI, a correlation coefficient of \(r^{2} = 0.81\) was achieved compared to AHIs scored according to the AASM scoring criteria [10]. Using a tracheal sound signal and pulse oximetry, a linear correlation score of 0.96 was reported between the estimated AHI and the manually scored PSG AHI [11]. A recent study using tracheal sounds to identify apneas reported 92.8% sensitivity and 99.7% specificity [12]. Using only the thoracic respiratory effort, a comparison based on sleep and wake periods between estimated AHI and scored AHI resulted in a correlation coefficient of \(r^{2} = 0.73\) for the training set and \(r^{2} = 0.55\) for the validation set [13]. A study estimating AHI using only SpO2 reported a Cohen’s kappa of 0.71 and an accuracy of 91% [4].

We calculated both Spearman’s r and the ICC between the AHI values, to evaluate not only correlation but also agreement. This is to show that the predicted AHIs not only have a positive correlation with the reference AHIs but also are not greatly misestimated, which helps to avoid misclassification of severity. We take note of the use of SpO2 alone to estimate the AHI. However, we make the case for using RIPsum because it offers the possibility of classifying events as obstructive, central, or mixed, which will be of interest for future work. Furthermore, with RIPsum, the precise location and duration of the respiratory events can be determined. One limitation of using respiratory effort instead of nasal sensors is that the distinction between apnea and hypopnea cannot be made. However, the aim of this algorithm is to provide an alternative system that is not dependent on the nasal pressure sensor, so that the patient’s comfort level can be increased and an alternative is available in the event that nasal sensors in level III devices fail.

5 Conclusion

Our results showed that our method using RIPsum and SpO2 has excellent agreement and correlation with PSG scoring. However, it must be noted that the difference in total recording time and total sleep time can affect the estimated AHI. Nevertheless, the algorithm can detect respiratory events without using airflow sensors, ensuring more comfortable sleep for patients. Another advantage is that the sensors needed for our algorithm are available in and compliant with level III sleep studies.