Introduction

Apnea and hypopnea are common sleep-related breathing disorders that affect 6% to 17% of adults, reaching up to 49% in older populations1. Apnea is marked by pauses in breathing, and hypopnea by abnormally shallow breathing. This can happen, for example, when the soft tissue at the back of the throat collapses and closes during sleep2. Because the blocked airway slows breathing, less oxygen is delivered from the lungs to the heart and the rest of the body. The CO2 level in the blood then rises due to impaired breathing, leading to episodes of sudden awakening or choking during sleep3. During these events, nasal airflow can cease or be reduced for a short time, and the brain and body fight to keep breathing. Sleep apnea and hypopnea have been associated with an increased risk of heart disease, diabetes, chronic kidney disease, stroke, depression, and cognitive impairment2. As age and the likelihood of having sleep apnea are positively correlated4, an aging population is expected to add further pressure on healthcare systems.

Typically, the diagnosis and detection of sleep apnea and hypopnea are based on polysomnography (PSG) tests conducted in a sleep facility or at home5. PSG requires the overnight recording and monitoring of several physiological signals, including electroencephalogram (EEG), electrocardiogram (ECG), electrooculogram, chin muscle activity, leg movements, respiratory effort, nasal airflow, and oxygen saturation (SpO2). This information is then used to compute the Apnea-Hypopnea Index (AHI), the number of times a person has apnea or certain types of hypopnea per hour of sleep, as defined by the American Academy of Sleep Medicine (AASM)6.

PSG tests have several limitations: they are time-consuming, expensive, and inconvenient7. As a result, it is estimated that more than 85% of people with sleep apnea or hypopnea remain undiagnosed4. Hence, the development of portable, easy-to-use, reliable, and affordable AHI estimation tools is crucial to improve patient care. Currently, there is only one method that estimates AHI8, using an oximetry signal. However, this method does not estimate when a patient is asleep, which is essential to compute AHI. Instead, it takes the sleep periods from the labels provided in the dataset (ground truth). As we will show later, oximetry signals on their own struggle to identify when patients are asleep, even when coupled with heart-rate signals. This limitation can lead to poor AHI estimation. Finally, the available method does not detect events or segment oximetry signals, i.e., it does not identify the exact periods of time when apneas and hypopneas occur.

For short time-series segments, many classification tools exist to decide whether an apnea event is present (see, for example, the comprehensive reviews in refs. 9,10,11). Most results are limited to obstructive sleep apnea (OSA), the most severe and common type of sleep apnea12. Such OSA classifiers have been based on manual feature extraction from the time and/or frequency domains of one or more physiological signals. These algorithms require expert knowledge to design and select the features to be extracted, which may still miss the most relevant information for classification13. The procedure can be hard and time-consuming, and is often subject to bias or lack of generalizability14,15. Deep learning, on the other hand, has shifted data modeling away from “expert-driven” feature engineering toward “data-driven” feature extraction8,11 and is now widespread across many other biomedical applications16,17.

Based on deep learning, this paper proposes a data-driven method, named DRIVEN, to estimate AHI and segment physiological signals. We focus on sensors that can be used in a home care environment and were available in the considered datasets, including abdominal and thoracic movement, oxygen saturation, and R-to-R interval data. Our study employs an extensive multicenter database containing 14,370 recordings of 10,752 patients of multiple ethnicities, with an average age of 68.7 years. To estimate AHI, DRIVEN detects apnea and hypopnea events and determines when patients are sleeping or awake. To our knowledge, it is the first standalone algorithm to estimate AHI. Using abdominal movement and oximetry sensors, DRIVEN correctly classifies 72.4% of all test patients into one of the four AHI classes, with 99.3% either correctly classified or placed one class away from the true one, a reasonable trade-off between the model’s performance and the patient’s comfort. Cross-validation and out-of-domain testing show that DRIVEN is highly generalizable, achieving high performance on large external datasets not included in the training process. Given the high performance and small computational cost of DRIVEN, we expect that it could be implemented as part of a home AHI estimator system.

Results

Data description

Three datasets were considered in this work: the Multi-Ethnic Study of Atherosclerosis (MESA)18, the Osteoporotic Fractures in Men Sleep Study (MrOS)19, and the Sleep Heart Health Study (SHHS)20. These datasets are publicly available from the National Sleep Research Resource (NSRR)21. In total, 14,370 PSG recordings were considered in our study. Each PSG recording consists of data collected from several physiological sensors during a patient’s overnight sleep (lasting 8.78 h on average). The recordings were collected during typical PSG evaluations in healthcare facilities.

This study focuses only on channels that can be easily implemented in a home environment and are available in all three datasets: pulse oximetry (SpO2), thoracic movement, and abdominal movement. Data collected from the various healthcare facilities had different sampling frequencies. These were harmonized by resampling all channels to the maximum sampling frequency (64 Hz). Hence, all signals for a given patient had the same length, which simplified the training of the neural networks. PSG recordings were excluded from the study if they met at least one of the following criteria: (1) poor quality, indicated by “unsure” or “noise” labels in more than a third of the total sleep time or by a single value repeated throughout the whole signal; or (2) the absence of one or more of the five channels initially selected for the study. After applying the exclusion criteria, a total of 10,643 sleep recordings were retained to develop and evaluate the proposed method. Table 1 summarizes the databases.

Table 1 Data description

The data were labeled following the AHI definition (ahi_a0h3a) from the American Academy of Sleep Medicine6. An apnea is defined as a nasal flow reduction of more than 90%, while a hypopnea is defined as a nasal flow reduction of more than 30%, both lasting at least 10 s. There are several types of hypopneas, of which only the following two count towards AHI: hypopneas with an oxygen desaturation of 3% or more within 45 s (labeled in this paper as hypopnea type 1) and hypopneas with an arousal within 5 s after the end of the event (hypopnea type 2)22. All other hypopneas (hypopnea type 3) are not considered in the calculation of AHI. In this paper, an AHI event refers to any of these events: apnea, hypopnea type 1, or hypopnea type 2. According to the AASM guidelines, patients are categorized by their AHI: healthy (fewer than 5 events/h), mild (5–15 events/h), moderate (15–30 events/h), and severe (more than 30 events/h).
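
The severity bands above can be written as a small helper. This is a minimal sketch: the function name is ours, and the assignment of the exact boundary values 15 and 30 is a convention choice, since the guideline text leaves them ambiguous.

```python
def ahi_severity(ahi: float) -> str:
    """Map an Apnea-Hypopnea Index (events/h) to an AASM severity class.

    Bands follow the text: healthy (<5), mild (5-15), moderate (15-30),
    severe (>30). Handling of the exact boundaries 15 and 30 is a
    convention choice, not specified in the text.
    """
    if ahi < 5:
        return "healthy"
    elif ahi < 15:
        return "mild"
    elif ahi <= 30:
        return "moderate"
    return "severe"

# Example: a patient with 22 detected events per hour of sleep.
print(ahi_severity(22.0))  # moderate
```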

DRIVEN: deep-learning model for the detection of AHI events

We developed a hybrid deep-learning model named DRIVEN to (1) classify segments as “normal” or AHI events (apneas, hypopneas type 1, and hypopneas type 2), (2) estimate the total sleep time, and (3) estimate the AHI. In the classification step, the inputs of DRIVEN are short segments of time-series data recorded during sleep. We explored window lengths of 10, 30, and 80 s, and selected 30 s windows as they outperformed the others at estimating AHI (see Methods for details). We focused on the following physiological signals: SpO2, thoracic movement, and abdominal movement. DRIVEN then learns whether a given input corresponds to a normal segment or an AHI event. Figure 1 illustrates DRIVEN’s pipeline. For each available physiological signal, a distinct deep convolutional neural network (CNN) is trained to extract global features from the raw one-dimensional signal. The extracted features from all CNNs are then combined in a light gradient-boosting machine (LightGBM)23 that performs the classification between normal and AHI events. This approach generalizes better than feeding all channels into a single CNN (see subsection “End-to-end CNN” in Methods). Finally, for each patient, DRIVEN counts all detected AHI events to estimate the AHI. We evaluate the performance of different sensors, window sizes, and deep-learning models. See Methods for details on data pre-processing and model training, validation, and testing.

Fig. 1: DRIVEN’s pipeline.

a Data are separated by channels and segmented into 30 s windows. b For each channel, a distinct trained deep CNN extracts features (outputs) from the raw signal (input). c The extracted features are concatenated and fed to a trained LightGBM that classifies each input window as normal or AHI event (apneas, hypopneas type 1, and hypopneas type 2).

DRIVEN was trained and assessed on raw physiological signals, avoiding human-selected features, which can introduce bias. The data were separated by center to avoid data leakage and to guarantee good performance on external datasets. SHHS was used to train and validate the model (both the CNNs and the LightGBM), while MESA and MrOS were used as test sets to independently evaluate DRIVEN on unseen, out-of-domain data from other sources. For each patient, sleep intervals were sampled into 30 s windows taken every 15 s (15 s overlap) and labeled as normal or AHI event, yielding more than 15 million samples. The training and validation data (SHHS database) were originally imbalanced, with a ratio of 6.4 normal windows per AHI event. To overcome the bias and inaccuracy associated with classification on imbalanced data, the number of normal windows was undersampled. Balancing the number of positive and negative samples is a common strategy in the literature24.
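
The windowing scheme (30 s windows taken every 15 s) reduces to simple index arithmetic. A hedged sketch: the function name is ours, and we assume a window must fit entirely inside the recording to be kept.

```python
def segment_windows(n_samples: int, fs: int = 64, win_s: int = 30, stride_s: int = 15):
    """Return (start, end) sample indices of overlapping windows.

    30 s windows taken every 15 s give a 15 s overlap between
    consecutive windows. Windows that would run past the end of the
    recording are dropped (an assumption on our part).
    """
    win, stride = win_s * fs, stride_s * fs
    return [(s, s + win) for s in range(0, n_samples - win + 1, stride)]

# One hour of a 64 Hz signal: a new 30 s window starts every 15 s.
windows = segment_windows(3600 * 64)
print(len(windows))       # 239
print(windows[1])         # (960, 2880), i.e. seconds 15-45
```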

Performance of DRIVEN on sleep/awake detection

The AHI is defined as the quotient of the total number of AHI events recorded overnight and the total time the patient slept (see subsection “AHI and severity class estimation from positive events” in Methods). Hence, to obtain an accurate estimate, DRIVEN also needs to estimate how long the patient slept through the night. We trained the same architecture proposed in Fig. 1, using window sizes of 30 s, with the sleep/awake labels annotated in the dataset. The training, validation, and testing methodology was the same as before. For the combination of abdominal and SpO2 sensors, we obtained an area under the receiver operating characteristic curve (AUROC) of 0.87 and an area under the precision-recall curve (AUPRC) of 0.90 on the test dataset. We also observed that adding RRI to any combination of sensors has little effect on the performance of the algorithm and that pulse oximetry (SpO2) alone is a poor sleep-state predictor. The complete sleep/awake classification results are reported in Supplementary Fig. 1 and Supplementary Table 1.

Performance of DRIVEN on classification of AHI events

We now evaluate the performance of DRIVEN on the classification task between normal and AHI events. Results for the complete test set (MESA and MrOS databases) are shown in Fig. 2a, b using either a single channel or a combination of channels (with the sleep/awake states also predicted by DRIVEN). AUROC and AUPRC show that models using the thoracic and abdominal movement sensors have very similar performance, even when the two sensors are used in combination, indicating very high mutual information between them. Adding SpO2, on the other hand, significantly improves the performance, indicating that this signal carries additional information. Note that the model using the SpO2 channel alone struggles to classify AHI events because of its poor sleep/awake estimation (Supplementary Fig. 1 shows a relatively low AUROC and AUPRC of 0.72 and 0.76, respectively, on sleep/awake detection when using only SpO2). Figure 2c shows performance metrics as a function of the threshold when using two sensors: abdominal movement and SpO2. Both the accuracy and F1-Event-Classification curves are relatively flat over a wide range of thresholds, showing that the model is robust to the choice of threshold (which was fixed on the validation results). Note that estimating hypopneas type 2 requires detecting short-term arousals, which are hard to infer from the three available sensors. In contrast, apneas and hypopneas type 1 are much easier to detect (Supplementary Figs. 2 and 11).

Fig. 2: Performance of DRIVEN on the classification of AHI events.

a Receiver operating characteristic and (b) precision-recall curves. Note the overlapping curves for the thoracic and abdominal sensors. c Threshold-dependent performance metrics of DRIVEN when using two input channels (abdominal movement and SpO2). The performance results are shown for all patients in the test dataset. The accuracy, precision, recall, and F1-Event-Classification are metrics on the classification of individual events. The F1-AHI-Classification measures the F1 score of predicting the AHI severity category (healthy, mild, moderate, severe) on whole sleep studies.

Performance of DRIVEN on AHI estimation

This section evaluates the performance of DRIVEN at estimating the AHI for each patient in the test set. The estimation consists of first predicting the total sleep time and then counting all detected AHI events throughout the night. Using both the abdominal movement and SpO2 sensors, Fig. 3a shows the real versus predicted AHI, while Fig. 3b contains the confusion matrix of the four severity categories. In total, 72.4% of patients are correctly classified, and 99.3% are either correctly classified or placed one class away from the true one. No healthy patient was classified as severe. Among the 5355 test patients, only one severe individual was classified as healthy; this patient had a short sleep duration of 3.5 h with frequent apneic bursts. DRIVEN exhibits consistent performance across all AHI severity levels, effectively identifying sleep-disordered breathing events in all patient groups. Compared with full polysomnography, the model provides a good trade-off between accuracy and the patient’s sleep comfort. Results for other combinations of sensors are reported in Supplementary Figs. 3–8.

Fig. 3: Performance of DRIVEN on the estimation of AHI.

a Real versus predicted AHI divided by the four AHI severity groups. b Confusion matrix. The performance is evaluated on the test data considering a threshold of 0.79 and a combination of two signals (abdominal movement and SpO2).

To investigate whether some physiological signals contribute more than others to estimating AHI, we calculated F1 scores for different combinations of sensors (Table 2; further details can be found in Supplementary Figs. 3–8). Similar to the classification results above, the thoracic and abdominal movement sensors have similar performance, while adding SpO2 significantly improves it. The combination of abdominal movement and SpO2 yields one of the highest performances while using only two sensors. We also compared the AHI-classification F1 scores when AHI is estimated (1) without sleep information, (2) with ground truth on the sleep/awake states, and (3) with the predicted sleep/awake states (Supplementary Table 2). Using the abdominal and SpO2 sensors, the F1 score drops around 4% from ground-truth to predicted sleep. For SpO2 alone, however, the drop is considerably larger at 12%, as this sensor alone cannot estimate sleep states as well.

Table 2 F1 scores for different combinations of sensors

Automatic detection of AHI events: segmentation of whole sleep studies

Using DRIVEN’s classification of AHI events, we can analyze whole sleep studies to identify apneas and the relevant hypopneas (types 1 and 2). Additionally, with DRIVEN’s sleep-state detection, we can identify when the patient is asleep and when the patient is awake. For each patient, recordings were sampled every 15 s into 30 s windows. Each window was first labeled as awake or sleep according to the sleep classifier. The segments classified as sleep were then further categorized into normal or AHI events. Figure 4 illustrates the method on a randomly chosen patient, revealing the sleep periods with the most AHI events, compared with the ground truth, as well as the awake segments.
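
The two-stage labeling of each window can be sketched as follows. This is an illustration, not the paper’s implementation: the 0.79 event threshold is the one reported later in Fig. 4’s caption, while the 0.5 sleep threshold and the function name are our assumptions.

```python
def label_window(p_sleep: float, p_event: float,
                 sleep_thr: float = 0.5, event_thr: float = 0.79) -> str:
    """Two-stage labeling of one 30 s window.

    First decide awake vs. sleep; only windows classified as sleep are
    then split into normal vs. AHI event. The event threshold 0.79 is
    from the paper; the sleep threshold is an illustrative assumption.
    """
    if p_sleep < sleep_thr:
        return "awake"
    return "AHI event" if p_event >= event_thr else "normal"

print(label_window(0.9, 0.85))  # AHI event
print(label_window(0.2, 0.99))  # awake: events are not scored while awake
```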

Fig. 4: Automatic labeling of AHI events for a random patient using two sensors (abdominal movement and SpO2).

The blue areas represent true events (zero indicates no event and one indicates an AHI event). The output of DRIVEN is illustrated with symbols that represent, for each 30 s window, the probability of the window being classified as an AHI event. The windows are colored according to their classification, depending on whether they are above or below the chosen threshold of 0.79. The black crosses represent the segments classified as awake, the green triangles those classified as normal (no AHI event), and the red stars the windows classified as AHI events. The second plot zooms in on a 1 h segment. Supplementary Fig. 10 increases the resolution further to a 15 min interval. Supplementary Fig. 11 includes the ground-truth labels divided into apnea and the different hypopnea types.

Discussion

This paper introduced a method, DRIVEN, to (1) classify AHI events and sleep states, (2) estimate AHI, and (3) segment whole sleep studies, using a single sensor or a combination of sensors that are easy to use at home: abdominal movement, thoracic movement, and oxygen saturation. We showed that deep neural networks can be effectively trained on an extensive database to extract relevant features from raw physiological data. Detection was enhanced by combining them with a LightGBM classifier, which generalizes better than a CNN on its own, as in ref. 25. Moreover, DRIVEN’s architecture allows each channel to have its own sampling frequency, and it estimates both AHI events and sleep states. The best trade-off between model performance and patient comfort was achieved when combining data from only two sensors (abdominal movement and pulse oximetry), yielding an F1 score of 72% at estimating AHI. Hence, DRIVEN can provide an estimate of AHI together with the detection and segmentation of whole sleep studies to aid physicians in making a final diagnosis. For patients, it can indicate when to seek medical attention.

The thoracic and abdominal movement sensors featured very similar performance, since they both measure breathing. Adding SpO2 significantly improved the performance, which is expected given the role of SpO2 in the clinical definition of AHI. However, SpO2 alone struggled to estimate the total sleep time per patient and, for this reason, should not be used on its own. Hence, the combination of SpO2 and abdominal movement sensors provided overall good performance both in detecting AHI events and in estimating the total sleep time per patient. Supplementary Table 3 reports the performance of DRIVEN on AHI-event classification using all possible combinations of the physiological signals available in the considered datasets. RRI data, extracted from ECGs, provided only a marginal improvement when combined with other sensors and performed poorly on its own. Nasal airflow was also excluded from the main analysis (Figs. 2 and 3), since it is an uncomfortable sensor to sleep with and provided performance similar to the thoracic or abdominal movement sensors.

The choice of CNNs for automatic feature discovery over domain-knowledge feature extraction was mainly motivated by the variety of channels to evaluate and the large available datasets. The flexibility of the deep-learning feature extraction approach allowed us to explore several choices of physiological channels (including all possible combinations). DRIVEN’s architecture also lets us easily and efficiently evaluate new sensors, such as an additional movement sensor if one is available. Several studies have shown that, when training datasets are large and there is enough computing power, deep learning performs as well as or better than other machine-learning algorithms26, as demonstrated in applications ranging from chess and speech recognition to computer vision27. In the particular case of obstructive apnea detection, ref. 28 reported consistently better results using unbiased deep-learning approaches and CNNs for segment classification in PSG studies.

A general limitation, not just of DRIVEN but of any method without access to EEG data, is detecting hypopneas type 2. This type of hypopnea requires estimating a short-term arousal following a drop of at least 30% in nasal airflow, which is particularly challenging to detect from the three sensors selected in this study (Supplementary Fig. 2). Figure 4 and Supplementary Fig. 10 also show a few false positives and some 30 s segments with a high probability of being AHI events that are in fact hypopneas type 3 (which do not count towards AHI, Supplementary Fig. 11). Indeed, detecting AHI events (apneas and hypopneas types 1 and 2) is considerably more challenging than detecting only obstructive sleep apneas (Supplementary Fig. 2), which is the main focus of most available algorithms28. A more specific limitation of DRIVEN is its use of 30 s windows, which can treat bursts of multiple short-lived apneas or hypopneas as a single long event. While the smoothing window has merits, effectively filtering noise and capturing longer events, it can lead to underestimating the number of apneas or hypopneas, especially in severe cases. This effect can be seen in Supplementary Fig. 10, a zoomed-in segment of Fig. 4: individual apnea events within bursts are merged into one detection, resulting in a lower AHI estimate. The complex influence of different bursting patterns of apnea events on AHI is also discussed in the literature; for example, ref. 29 demonstrated that very different bursting patterns can lead to the same AHI values. In future studies, we plan to explore new types of AI models, including selective state-space neural networks30, to potentially further improve performance and address these limitations. Finally, we investigated classifying sleep or awake states from combinations of the three sensors considered in the paper (Supplementary Table 1 and Supplementary Fig. 1). There was a decrease in DRIVEN’s performance when using predicted sleep (Fig. 2) instead of the annotated total sleep time obtained from the dataset (Supplementary Fig. 12), as summarized in Supplementary Table 2. In practice, DRIVEN’s performance could be further improved by combining these sensors with other easy-to-wear devices, such as smartwatches or under-mattress sensors.

This result demonstrates the potential of DRIVEN for automatic labeling applications. Compared to a polysomnographic technologist’s manual detection and diagnosis of OSA, computer-assisted signal-analysis systems can reduce errors related to inter- and intra-rater variability and to the fatigue induced by the arduous annotation process31. Furthermore, most computer-based analyses can be conducted more cost-effectively and quickly13. Overall, DRIVEN could reduce the burden on polysomnographic technologists and the costs to the healthcare system by making it feasible to examine patients over multiple nights using data collected at home with wearable devices.

Methods

Data specification

The polysomnography data are available in European Data Format (EDF)32, while the profusion files, in XML format, serve as the annotated label files. Sleep stages were annotated using the Rechtschaffen & Kales (R&K) criteria33. Each profusion file includes a sleep-stage annotation for every 30 s epoch and a list of respiratory events, with onset time and duration, as well as the sensor used to detect the anomaly. The databases are summarized as follows.

  1. The MrOS Sleep Study: MrOS is a substudy of the Osteoporotic Fractures in Men Study. Between 2000 and 2002, a baseline evaluation was performed on 5994 men aged 65 years or older, recruited from six clinical centers. Between December 2003 and March 2005, a total of 3135 of these participants underwent complete unattended polysomnography and 3- to 5-day actigraphy tests as part of the sleep study. The purpose of the sleep study was to determine the extent to which sleep disorders are linked to adverse health outcomes, such as an increased risk of death, fractures, falls, and cardiovascular disease.

  2. The MESA Sleep Study: The Multi-Ethnic Study of Atherosclerosis (MESA) is a collaborative six-center longitudinal investigation of factors associated with the development and progression of subclinical and clinical cardiovascular disease in 6814 Black, White, Hispanic, and Chinese-American men and women aged 45 to 84 years at enrollment in 2000–2002. Four follow-up examinations have been performed, in 2003–2004, 2004–2005, 2005–2007, and 2010–2011. Furthermore, 2237 participants took part in a Sleep Exam conducted by MESA Sleep between 2010 and 2012, which included an unattended overnight polysomnogram, seven days of wrist-worn actigraphy, and a sleep questionnaire. The sleep study aims to determine whether there is a correlation between subclinical atherosclerosis and sex, ethnic, or other demographic differences in sleep and sleep disorders.

  3. The SHHS Sleep Study: The Sleep Heart Health Study is a multicenter cohort study conducted by the National Heart, Lung, and Blood Institute. Its purpose was to investigate the effects of sleep-disordered breathing on the cardiovascular system and other aspects of health, and to determine whether breathing problems during sleep are related to an increased risk of coronary heart disease, stroke, all-cause mortality, and hypertension. Between November 1995 and January 1998, SHHS Visit 1 enrolled 6441 men and women aged at least 40 years. A second polysomnogram, known as SHHS Visit 2, was performed on 3295 participants during the third exam cycle (January 2001 to June 2003).

For the labeling of events, a window is considered AHI-event positive if it satisfies one of the AHI conditions (apnea, hypopnea type 1, or hypopnea type 2). To obtain the AHI, the total number of AHI events throughout the night is divided by the total number of hours of sleep.
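
The AHI computation above is a one-line ratio; a minimal sketch (the function name is ours):

```python
def estimate_ahi(n_events: int, sleep_seconds: float) -> float:
    """AHI = number of detected AHI events per hour of sleep."""
    return n_events / (sleep_seconds / 3600.0)

# Example: 120 AHI events over 6 h of sleep gives 20 events/h,
# which falls in the moderate band (15-30 events/h).
print(estimate_ahi(120, 6 * 3600))  # 20.0
```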

Data pre-processing and model training

First, the data are split by measurement channel (SpO2, abdominal movement, thoracic movement, RRI, and airflow). For each patient, the initial and final 30 min were removed from the recording, as they are considered setup time. Then, for each channel of each patient, the signal is standardized and normalized (to speed up training) by computing z-scores over 95% of the data points and applying min-max normalization, eliminating the bias of the mean and variance of the raw one-dimensional signal34. The RRI data were inferred from the ECG by measuring the time difference between consecutive heartbeats (from one R peak to the next) using the Pan-Tompkins algorithm35,36. Subsequently, each channel was resampled to 64 Hz, the maximum frequency among the measurements, using cubic-spline interpolation37; this method is simple to implement and does not attenuate the higher-frequency components of the signal. Finally, the signals were segmented into 30 s windows labeled as “normal” or “AHI event”. Data windows correspond to the same time instant across all channels (Fig. 1a).
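
The per-channel normalization can be sketched as follows. This is our reading of “z-scores of 95% of the data points”: the z-score statistics are computed on the central 95% of the sorted values (to reduce the influence of outliers), and min-max scaling then maps the result to [0, 1]. The function name and the exact percentile handling are assumptions.

```python
def normalize_channel(x):
    """Standardize then min-max normalize one raw channel.

    Mean and standard deviation are computed on the central 95% of the
    sorted values (our interpretation of the paper's "95% of the data
    points"); min-max scaling then maps the z-scores to [0, 1].
    """
    s = sorted(x)
    lo, hi = int(0.025 * len(s)), int(0.975 * len(s))
    core = s[lo:hi]
    mean = sum(core) / len(core)
    var = sum((v - mean) ** 2 for v in core) / len(core)
    std = var ** 0.5 if var > 0 else 1.0
    z = [(v - mean) / std for v in x]
    z_min, z_max = min(z), max(z)
    span = (z_max - z_min) or 1.0
    return [(v - z_min) / span for v in z]

# A short SpO2-like trace with one desaturation dip (80%):
sig = normalize_channel([97.0, 96.5, 95.0, 80.0, 96.0])
print(min(sig), max(sig))  # 0.0 1.0
```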

Normalized 30 s samples from each channel are then fed in parallel into distinct deep CNNs. Each CNN is independently trained on a specific channel to classify AHI-event versus normal windows, and is then used for automatic feature extraction (Fig. 1b). Training was performed with a balanced number of normal and AHI-event windows; for this, the normal samples were randomly downsampled to match the number of positive samples. Once the networks are trained, the features are extracted and concatenated to integrate information from the different physiological biomarkers. The concatenated features are used to train a LightGBM classifier for binary classification between normal and positive AHI-event windows (Fig. 1c).
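
The random undersampling step can be sketched as follows (a minimal illustration; the function name and fixed seed are ours, and the paper does not specify the sampling implementation):

```python
import random

def balance_by_undersampling(samples, labels, seed=0):
    """Randomly drop negative (normal) windows so that positives and
    negatives are equally represented before CNN training."""
    rng = random.Random(seed)
    pos = [s for s, l in zip(samples, labels) if l == 1]
    neg = [s for s, l in zip(samples, labels) if l == 0]
    neg = rng.sample(neg, len(pos))          # keep as many negatives as positives
    data = [(s, 1) for s in pos] + [(s, 0) for s in neg]
    rng.shuffle(data)
    return data

# Roughly 6.4 normal windows per AHI event, as in the SHHS training split:
data = balance_by_undersampling(list(range(74)), [1] * 10 + [0] * 64)
print(sum(l for _, l in data), len(data))  # 10 20
```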

The CNNs were trained and cross-validated with 7,753,500 samples (1,042,652 positives and 6,710,848 negatives) from 5288 PSG recordings in the SHHS database. We use the EfficientNetV2 architecture, a deep CNN with 479 layers introduced by Google in 202138. It is a modified and optimized version of EfficientNet39, a popular image-classification architecture that achieved state-of-the-art results on ImageNet in 201940. The architecture used in this paper was modified to handle one-dimensional data and perform binary classification. Each CNN was independently trained on raw one-dimensional physiological data from pulse oximetry, RRI, thoracic movement, abdominal movement, or nasal airflow. Categorical cross-entropy was used as the loss function and ADAM as the optimizer41. Training was terminated if the validation loss did not decrease for eight consecutive epochs. Once the networks were trained, the final two layers were removed, and the output of the last global average pooling layer was used to yield 1280 features per data channel.
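
The global average pooling step that produces the per-channel feature vector is simple to state numerically. A toy illustration (real EfficientNetV2 pooling operates on learned feature maps and yields 1280 values per channel; here we use three tiny hand-made maps):

```python
def global_average_pool(feature_maps):
    """Collapse each of C feature maps of length T into one number by
    averaging over time, giving a C-dimensional feature vector.

    In DRIVEN, C = 1280 per channel; here C = 3 for illustration.
    """
    return [sum(fm) / len(fm) for fm in feature_maps]

feats = global_average_pool([[1.0, 3.0], [0.0, 0.0], [2.0, 4.0]])
print(feats)  # [2.0, 0.0, 3.0]
```

Feature vectors produced this way for each channel are what get concatenated and passed to the LightGBM classifier.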

Feature classifier

After the neural networks are trained and the features extracted, the next step is to concatenate the features from all CNNs and use them to train a LightGBM classifier. In contrast to many other well-known algorithms, such as XGBoost42 and GBDT43, LightGBM grows trees leaf-wise rather than depth-wise. The leaf-wise algorithm can converge significantly faster than depth-wise growth, although it is prone to overfitting if appropriate hyperparameters are not used23. We use random search within a specified set of parameters to optimize the training and performance of the LightGBM; this samples a fixed number of parameter settings from a given distribution instead of testing all potential values44. The output of the feature classifier is the probability that the evaluated segment contains an AHI event according to the AHI definition. Training was performed on a balanced dataset. The complete datasets, however, are imbalanced (more negative windows than positive ones). Hence, on the validation data, we determined the probability threshold that optimizes the performance metrics of the classifier; it was chosen so that precision and recall have the same value on the validation dataset (Supplementary Fig. 9a).
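
The threshold-selection criterion (precision equal to recall on validation data) can be sketched as a grid search. A hedged illustration: the function name, the 0.01-step grid, and the tie-breaking are our assumptions, not the paper’s implementation.

```python
def pick_threshold(scores, labels, grid=None):
    """Choose the probability threshold at which precision and recall
    are closest to equal on validation data (the paper's criterion)."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best_thr, best_gap = 0.5, float("inf")
    for thr in grid:
        tp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 0)
        fn = sum(1 for s, l in zip(scores, labels) if s < thr and l == 1)
        if tp == 0:
            continue  # precision undefined without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        gap = abs(precision - recall)
        if gap < best_gap:
            best_thr, best_gap = thr, gap
    return best_thr

# Toy validation scores: precision == recall for thresholds in (0.4, 0.8].
print(pick_threshold([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.41
```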

Models and channels comparison

We evaluated longer and shorter window sizes for event classification and AHI estimation on a subset of 1000 randomly selected patients in a 3-fold cross-validation experiment, and selected 30 s windows for their higher performance on the validation set. Additionally, we noticed a delay in the response of the SpO2 signal when a respiratory event occurred; thus, we also evaluated the model performance using different delays in the SpO2 signal (10, 15, 20, and 25 s). For AHI-event classification, we evaluated feature extraction at different layers of the neural network (a model called LGBM-1 for the last layer and LGBM-3 for the third-to-last layer), as well as the performance of neural networks trained on each channel separately. In summary, we evaluated nine different channels (abdominal, thoracic, nasal airflow, RRI, SpO2, and SpO2 with 10, 15, 20, and 25 s delays), two different models (LGBM applied to the combination of different sensors, extracting the information at the third-to-last or at the last layer of each EfficientNetV2), and three different window sizes (10 s, 30 s, and 80 s). Note that every time SpO2 is used, all its delayed signals are also used as separate channels. A summary of these results is reported in Supplementary Table 3, and the complete set of results in Supplementary Table 4. These tables show that the 10 s windows yield consistently lower performance and that RRI signals provide negligible performance improvements. Hence, we focused on the 30 s and 80 s window sizes, as well as on the abdominal, thoracic, and SpO2 channels, using the training data as described below. The complete performance results are reported in Supplementary Table 5, which also includes the estimation of AHI. On the validation dataset, these tables show that a larger window size leads to better classification of AHI events, whereas a smaller window size results in better AHI estimation.
Larger windows are less suited for counting the AHI events needed for AHI estimation, as multiple AHI events may be counted as a single detection. Since most AHI events last 30 s or less, the 30 s window presents a good trade-off between event classification and AHI estimation. We therefore selected the 30 s windows, as the objective of this paper is AHI estimation. The 30 s windows also enable better segmentation of the whole sleep study.
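The SpO2 delay compensation described above can be sketched as a simple shift of the oximetry trace. The sampling rate and end-of-record padding below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def delayed_spo2(spo2, delay_s, fs=1):
    """Shift the SpO2 trace so that each analysis window sees the samples
    recorded delay_s seconds later, compensating for the lag between a
    respiratory event and the resulting desaturation. The end of the
    record is padded with the last observed value."""
    shift = int(delay_s * fs)
    return np.concatenate([spo2[shift:], np.full(shift, spo2[-1])])

# One minute of toy SpO2 at 1 Hz, drifting from 98% down to 90%.
spo2 = np.linspace(98.0, 90.0, 60)

# The four delayed copies used alongside the raw SpO2 channel.
delayed = {d: delayed_spo2(spo2, d) for d in (10, 15, 20, 25)}
```
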

End-to-end CNN

DRIVEN’s architecture consists of a combination of CNNs and LightGBM. Here, we compare its performance with that of a single CNN that takes all channels as input and performs the final classification directly. For this model, we used the same network, EfficientNetV2, and changed the input layer to accept a vector for each channel. As in the previous section, we compared the AHI-event classification as well as the AHI classification of patients using the same 1000 randomly selected patients. Supplementary Table 4 shows that the validation results are quite similar, with the end-to-end CNNs performing slightly worse than the combination of CNNs and LightGBM. However, the performance of the end-to-end CNNs drops significantly on the test datasets, which are also detailed in the same table. The end-to-end CNNs tend to overfit the training data, whereas the combination of CNNs and LightGBM shows better generalization across datasets. This result has also been observed by others (see, for example, ref. 25).

Dataset partition into train, validation, and test

We performed three different trials with the complete dataset: (1) SHHS and MROS patients as training and validation sets in an 80–20% random split and MESA patients as the test set, (2) SHHS patients as training and validation sets in an 80–20% random split and MESA and MROS as test sets, and (3) SHHS, MROS, and MESA patients as training, validation, and test sets in a 60–20–20% random split. Because some patients in the SHHS and MROS databases have two recordings, we always placed both recordings of a patient in the same partition to avoid data leakage. The performance metrics for the validation and test sets of each trial can be found in Supplementary Table 5. All the results and figures presented in the paper are from trial 2, as we trained and validated only with patients from the SHHS database and tested the model’s ability to close the domain gap when applied to the two other independent databases.
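The leakage-free partitioning can be sketched as a split over patient identifiers rather than individual recordings; the 80–20 ratio follows the paper, while the ID format is hypothetical:

```python
import random

def split_by_patient(recordings, train_frac=0.8, seed=0):
    """Split (patient_id, recording_id) pairs so that both recordings
    of any patient land in the same partition, avoiding data leakage."""
    patients = sorted({pid for pid, _ in recordings})
    random.Random(seed).shuffle(patients)
    n_train = int(train_frac * len(patients))
    train_ids = set(patients[:n_train])
    train = [r for r in recordings if r[0] in train_ids]
    val = [r for r in recordings if r[0] not in train_ids]
    return train, val

# Toy example: patients p3 and p7 each contribute two recordings.
recs = [("p1", "a"), ("p2", "a"), ("p3", "a"), ("p3", "b"),
        ("p4", "a"), ("p5", "a"), ("p6", "a"), ("p7", "a"), ("p7", "b")]
train, val = split_by_patient(recs)
```

A recording-level random split would risk placing a patient's first night in training and the second in validation; splitting on patient IDs rules this out by construction.
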

Sleep/awake classification

The calculation of AHI requires the total time the patient has slept, excluding awake periods throughout the night. Hence, it is important to discriminate between sleep and awake intervals. The profusion files of each patient were already labeled with this information, which we could use to compute the total sleeping time. This information can also be estimated from other sensors, such as photoplethysmograms (PPG)45 or accelerometers. Here, to provide a standalone solution, we investigated classifying sleep and awake states from combinations of only the three sensors considered in this paper. We used the same architecture (CNN+LGBM) and model comparison methodology to estimate whether a 30 s window corresponds to sleep or wakefulness. Additionally, we investigated adding an extra sensor, RRI, which had already been discarded for AHI-event prediction. The results are presented in Supplementary Table 1 and Supplementary Fig. 1. The performance is high for most combinations of sensors. The exception is SpO2 alone, which struggles to correctly differentiate between sleep and awake windows. Adding an RRI sensor does not improve sleep prediction performance, either alone or in combination with the best classifiers. This sleep prediction, with a threshold of 0.5 to classify sleep or awake, is then used for the AHI estimation of each patient.

AHI and severity class estimation from positive events

As the data were imbalanced, we determined the final classification threshold as the point of intersection of the precision and recall curves on the validation data (Supplementary Fig. 9a). Then, for each patient in the validation dataset, we went through the overnight study classifying all 30 s windows sampled every 15 s (hence, consecutive windows overlap by 15 s). Each 30 s window was first classified as awake or sleep, and the sleep windows were further classified as positive or negative for AHI events using the previously defined threshold (Fig. 4). Next, we calculated the ratio of the number of positive AHI events to the total number of sleep windows. With these ratios, one per patient, we fitted a linear regression to the AHI values of these patients. In this regression, we used weights to account for the imbalance in the number of patients across AHI classes. For the test set, we used the threshold selected before (Supplementary Fig. 9b) and obtained the AHI-event vs sleep-window ratio as before. Finally, we used the previously determined linear regression coefficients to obtain the AHI of the test patients. Recall that the patient severity class was determined following the rules provided by the American Academy of Sleep Medicine6.
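The ratio-to-AHI mapping can be sketched as a weighted least-squares fit on one ratio per validation patient; the ratios, AHI values, and class weights below are illustrative assumptions:

```python
import numpy as np

def positive_ratio(event_pred, sleep_pred):
    """Fraction of sleep windows classified as containing an AHI event.
    Both inputs are boolean arrays over the overnight 30 s windows."""
    n_sleep = np.sum(sleep_pred)
    return np.sum(event_pred & sleep_pred) / n_sleep if n_sleep else 0.0

# One ratio per validation patient, plus ground-truth AHI and weights
# that up-weight the rarer, more severe AHI classes.
ratios = np.array([0.02, 0.05, 0.10, 0.20, 0.35])
true_ahi = np.array([3.2, 5.0, 8.0, 14.0, 23.0])
weights = np.array([1.0, 1.0, 2.0, 3.0, 4.0])

# Weighted linear regression AHI ~ a * ratio + b, fitted on validation
# patients and later reused unchanged on test patients.
a, b = np.polyfit(ratios, true_ahi, deg=1, w=weights)
ahi_test = a * 0.15 + b  # estimated AHI for a test patient with ratio 0.15
```
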

Performance metrics

The performance of the models was evaluated based on the following metrics. Let true positive (TP) represent the number of AHI events that are accurately predicted. True negative (TN) represents the number of normal events that are accurately predicted. False negative (FN) represents the number of AHI events incorrectly predicted as normal events, and false positive (FP) represents the number of normal events incorrectly predicted as AHI events. The accuracy of the model is given by (TP + TN)/(TP + TN + FP + FN), indicating the probability of correctly identifying normal and AHI events; the recall is given by TP/(TP + FN), indicating the probability of identifying AHI events; the specificity is given by TN/(TN + FP), indicating the probability of detecting normal events; the precision is given by TP/(TP + FP), indicating the ratio between detected AHI events and all predicted AHI-event cases; and the F1-score is given by 2(precision × recall)/(precision + recall), the harmonic mean of precision and recall. For every method, the receiver operating characteristic curve and the precision-recall curve were obtained for each possible classification threshold. The area under both of these curves allows us to compare the global performance of each architecture. Finally, the patient severity classification results were evaluated by assessing the F1 score of each class and then averaging them.
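These metrics follow directly from the confusion-matrix counts; the example counts below are arbitrary:

```python
def classification_metrics(tp, tn, fp, fn):
    """Binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Arbitrary example: 100 true AHI events and 900 normal events,
# of which 80 events and 860 normals are correctly classified.
m = classification_metrics(tp=80, tn=860, fp=40, fn=20)
```
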