Introduction

Pediatric sleep apnea–hypopnea (SAH) is a significant sleep disorder observed in children, characterized by disruptions in breathing patterns during sleep. These disruptions can manifest as reduced airflow (hypopnea) or complete cessation of airflow (apnea), significantly impairing the quality of children's sleep. Such impairments can lead to daytime drowsiness, concentration difficulties, and diminished overall well-being1. Additionally, SAH is linked to cognitive, behavioral, and cardiovascular issues, posing potential long-term health concerns and impeding a child's development1.

SAH is classified into three main types: obstructive sleep apnea (OSA), central sleep apnea (CSA), and mixed sleep apnea (MSA)2. OSA is typically caused by a constricted or blocked airway, making breathing efforts ineffective. In contrast, CSA results from the brain's failure to generate or transmit appropriate signals for breathing initiation, leading to brief pauses in breath. MSA is a condition where both central and obstructive factors contribute to the sleep apnea events2.

The conventional diagnostic standard for pediatric SAH is the overnight polysomnography (PSG) test. This test, conducted in specialized sleep laboratories, involves monitoring a range of physiological signals such as the electroencephalogram (EEG), electromyogram (EMG), electrocardiogram (ECG), airflow (AF), chest and abdominal movements, blood oxygen saturation (SpO2), and photoplethysmogram (PPG)1,3. The collected data aids in calculating the apnea–hypopnea index (AHI), a critical clinical metric representing the average number of apnea and hypopnea events per hour of sleep. The AHI is instrumental in evaluating the presence and severity of SAH, with severity categorized into four groups: normal (AHI < 1), mild (1 ≤ AHI < 5), moderate (5 ≤ AHI < 10), and severe (AHI ≥ 10)4,5,6. Despite its effectiveness, PSG is a complex, costly, and time-consuming process, often uncomfortable for children, underscoring the need for simpler, more accessible diagnostic methods7.

One such promising diagnostic tool is the measurement of SpO2 using pulse oximetry. Pulse oximeters record the PPG signal, which is used to derive SpO28. SpO2, reflecting the oxygen content in blood hemoglobin, is extensively explored for its convenience in acquisition and compatibility with portable monitoring9,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29. Oximetry recordings are critical in revealing how apnea and hypopnea events lead to recurrent oxygen desaturation due to compromised airflow, causing irregular fluctuations in SpO2 signals in individuals with SAH9. According to the American Academy of Sleep Medicine (AASM) guidelines, apneas are identified by a decrease of ≥ 90% in the AF signal for at least two respiratory cycles, while hypopneas are defined as a decrease of ≥ 30% in AF, accompanied by at least a 3% reduction in SpO2 or an electroencephalographic arousal4. Given that oxygen desaturation typically begins 20–40 s after the start of an apneic episode, precise correlation between apneic events and subsequent desaturation is crucial for accurate detection10. Thus, SpO2 monitoring is invaluable for real-time evaluation of oxygen levels and essential in identifying SAH-related desaturation events.

Numerous studies have focused on feature engineering techniques and Machine Learning (ML) methods to analyze AF and SpO2 signals for detecting pediatric OSA11. These studies have utilized classical ML models such as logistic regression, support vector machines (SVM), and ensemble-learning adaptive boosting (AdaBoost) for binary classification tasks, distinguishing between OSA-positive and non-OSA patients12,13,14,17. Additionally, multilayer perceptron (MLP) neural networks have been employed for AHI estimation, with Hornero et al. using an MLP for AHI estimation from SpO2 recordings, categorizing subjects into four severity classes of OSA15. Barroso-García et al. explored AF and SpO2 recordings for AHI estimation, using recurrence plots (RP) and the 3% oxygen desaturation index (ODI3) from SpO2 signals for their MLP model16. Jiménez-García et al. addressed a 4-class classification task assessing pediatric OSA severity using AdaBoost, utilizing features from both AF and SpO2 signals17. These studies underscore the potential of ML in OSA screening.

However, deep learning (DL) algorithms present an advantage over traditional ML methods due to their ability to automatically extract complex features from raw data, thereby enhancing diagnostic accuracy and robustness18. The surge of deep learning innovations has significantly advanced the biomedical field, particularly in processing physiological signals. This has led to notable achievements in disease detection through DL applications, including blood pressure estimation19,20, sleep stage classification21, and cardiovascular risk assessment22. Unlike traditional methods, DL techniques can uncover deeper physiological information and enable automated integration of a variety of features. Consequently, there is a growing trend among researchers to explore the detection of SAH through DL techniques. Several studies have explored DL methods for detecting OSA in adults using PPG and ECG signals. These studies focus on the segment-level classification of signal segments as either apneic or non-apneic, and incorporate architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)23,24,25, including Long Short-Term Memory (LSTM) networks. All these studies manually determine the AHI to assess OSA severity. These approaches have shown promising results in enhancing OSA detection efficiency, demonstrating that CNNs can effectively extract deep features while RNNs can capture temporal dependencies in physiological signals such as PPG and ECG, thereby improving classification accuracy. In the domain of pediatric SAH detection, Vaquerizo-Villar et al. implemented a CNN-based model to classify segments of SpO2 signals for SAH detection26. To assess the severity of pediatric SAH, recent studies have adopted DL methods with CNN-based models for regression tasks to estimate the AHI from SpO2 signals27,28, occasionally in combination with AF signals29,31. Vaquerizo-Villar et al. also utilized a one-dimensional (1D) CNN model to enhance the diagnostic capabilities of oximetry for pediatric SAH27 and later for pediatric OSA28. Furthermore, another research initiative introduced a two-dimensional (2D) CNN framework for estimating pediatric OSA severity by analyzing both AF and SpO2 signals as 2D data29. In subsequent work, the same group combined a 2D CNN with an RNN to assess pediatric OSA severity from AF and SpO2 signals31. García-Vicente et al. employed a 1D CNN model to process overnight electrocardiogram (ECG) signals for pediatric OSA severity estimation30. These studies underscore the effectiveness of CNN-based models in extracting features from SpO2, AF, and ECG signals, both for regression approaches that assess SAH severity and for classification approaches that detect SAH. The insights derived from these studies are crucial for the development of reliable computer-aided diagnostic systems for managing childhood SAH.

In our research, we employ a regression method to assess SAH severity solely from SpO2 signals by directly estimating the AHI. SpO2 signals, which can be easily recorded with pulse oximeter sensors, have proven effective for SAH severity assessment in most previous studies. Recognizing the gap in the literature regarding enhanced CNN-based architectures for AHI estimation and SAH severity assessment from SpO2 signals, we explored the integration of residual block architecture and attention-based RNNs into CNN models for this task. In this study, we propose a novel method for pediatric SAH detection, employing SpO2 signals as 1D raw data. Our key contributions are as follows:

  • Development of Two Unique Models: We have pioneered the implementation of a 1D ResNet-based model with residual architecture and an attention-based hybrid CNN-RNN network. These architectures are novel in the context of pediatric SAH assessment, representing the first application of such advanced neural network structures for AHI estimation.

  • Comprehensive Apnea Detection: Moving beyond the predominant focus on obstructive events in the literature, our models are capable of detecting all oxygen desaturations related to various types of apneas, including OSA, CSA, and MSA. This broadened detection scope is particularly crucial in pediatric cases, where distinguishing between different apnea types based solely on desaturation patterns is challenging without additional chest and abdominal movement data48.

Through these methodologies, our study aims not only to enhance the accuracy of SAH detection but also to contribute to the development of more effective diagnostic tools for pediatric sleep disorders.

Data source and signal analysis

In this research, we utilized the Childhood Adenotonsillectomy Trial (CHAT) dataset, a comprehensive and publicly accessible database that includes 1638 sleep studies of 1232 pediatric subjects aged between 5 and 9.9 years, all diagnosed with mild to moderate obstructive sleep apnea. These studies, conducted between 2007 and 2012, are registered under the Clinical Trial Number NCT0056085932,33. The CHAT dataset, available through the National Sleep Research Resource (https://sleepdata.org/datasets/chat), categorizes these studies into three subgroups: Baseline (453 subjects), Follow-up (406 subjects), and Non-randomized (779 subjects). The participants in the CHAT dataset are divided into randomized and non-randomized groups. The Baseline group consists of subjects who were randomly selected for early adenotonsillectomy (eAT), while the Follow-up group includes individuals from the Baseline group who were observed over a 7-month period post-intervention. The PSG data within the CHAT dataset provides detailed annotations on the onset and duration of apneic events, which are crucial for labeling SpO2 signal segments in our study. The accurate linking of oxygen desaturation events to apneic episodes is essential for determining the number of apneic events present in each segment, serving as the foundation for training our algorithms. In accordance with the AASM 2012 guidelines, the AHI for this study was calculated considering all apneas and hypopneas that were accompanied by either an arousal or a minimum of 3% oxygen desaturation34. This computation was based on the original variables included in the dataset. Consequently, clinical variables that provide reference information were vital for validating the number of apnea–hypopnea events associated with ≥ 3% oxygen desaturations, as identified by our labeling algorithm. Due to the absence of these critical variables in the non-randomized group, our analysis exclusively utilized recordings from the Baseline and Follow-up groups of the CHAT dataset.

For the training and evaluation of our models, we divided the dataset into distinct sets: 60% for training, 30% for testing, and 10% for validation. To ensure the robustness and generalizability of our DL models, a threefold cross-validation method was implemented. Initially, 10% of the data was reserved as a fixed validation set. The remaining data were then randomly distributed into three groups, with careful consideration of the proportion of each SAH severity category. This strategic partitioning was designed to facilitate a comprehensive and balanced assessment of the models’ performance across various SAH severity levels.
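For concreteness, the following is a minimal sketch of one way to realize this partitioning with scikit-learn; the recording identifiers, the severity labels used for stratification, and the random seeds are placeholders, not values from the study.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rec_ids = np.arange(884)                       # one id per selected recording
severity = np.random.randint(0, 4, size=884)   # placeholder 4-class SAH labels

# Reserve a fixed 10% validation set, stratified by severity.
trainval_ids, val_ids = train_test_split(
    rec_ids, test_size=0.10, stratify=severity, random_state=0)

# Threefold split of the remainder: each fold serves once as the test set
# (~30% of all data), with the rest used for training.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
folds = [(trainval_ids[tr], trainval_ids[te])
         for tr, te in skf.split(trainval_ids, severity[trainval_ids])]
```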

Methodology

The methodology for AHI estimation in our study is outlined in Fig. 1, comprising an end-to-end pipeline with four key stages: (A) Signal Segmentation and Pre-processing; (B) Labeling; (C) Deep Learning Model; and (D) AHI Estimation.

Figure 1. End-to-end process flow for AHI estimation from SpO2 signals.

Signal segmentation and pre-processing

The SpO2 signals, acquired from PSG using a pulse oximeter finger probe, varied in sampling rate from 1 to 512 Hz. Our initial step involved re-sampling the SpO2 recordings to a unified rate of 1 Hz, with values rounded to two decimal places. This standardization, inspired by prior research28,34, aimed to reduce computational demands and achieve consistency across signals. Following re-sampling, we divided the SpO2 signals from each subject into non-overlapping segments of 20 min (1200 samples). This segmentation strategy facilitated the detection of sustained desaturation events, consistent with criteria defining desaturation clusters of at least 10 min duration34. To prepare the signals for analysis, we first addressed motion artifacts and zero-level artifacts, which commonly arise from sensor disconnections. Following the guidance of previous studies9,35, we eliminated abrupt changes exceeding 4% between consecutive one-second samples and disregarded any instances where oxygen saturation fell below 50%. Our algorithm, inspired by these methodologies, locates zero-level artifacts by detecting signal values below 50%, which are not physiologically plausible for a healthy individual, and identifies abrupt changes by checking for differences greater than 4% between consecutive signal values. These artifacts were removed and substituted with values derived from linear interpolation between their preceding and following values. This step was crucial for ensuring data integrity, as SpO2 drops below 50% and rapid fluctuations are typically indicative of measurement errors or sensor issues. Finally, to smooth the signal and reduce short-term variations, we applied a 3-s moving average (MA) filter, which effectively attenuated sharp spikes and ripples in the data35.
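A minimal NumPy sketch of this pre-processing chain is given below; the function names and the assumption that the input is already re-sampled to 1 Hz are ours, and the interpolation details are one plausible reading of the procedure described above.

```python
import numpy as np

def preprocess_spo2(spo2):
    """Clean a 1 Hz SpO2 recording: remove artifacts, interpolate, smooth."""
    x = np.asarray(spo2, dtype=float).copy()

    # Flag zero-level artifacts: values below 50% are not physiologically plausible.
    bad = x < 50.0

    # Flag abrupt changes: >4% difference between consecutive 1-s samples.
    bad[1:] |= np.abs(np.diff(x)) > 4.0

    # Replace flagged samples by linear interpolation between valid neighbors.
    idx = np.arange(len(x))
    x[bad] = np.interp(idx[bad], idx[~bad], x[~bad])

    # 3-s moving average filter to attenuate sharp spikes and ripples.
    return np.convolve(x, np.ones(3) / 3.0, mode="same")

def segment(spo2, seg_len=1200):
    """Split into non-overlapping 20-min segments (1200 samples at 1 Hz)."""
    n = len(spo2) // seg_len
    return spo2[: n * seg_len].reshape(n, seg_len, 1)
```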

Labeling of SpO2 signal segments

The labeling process for each 20-min SpO2 signal segment was crucial in our study. Based on annotations provided by sleep technicians, as referenced in32, our labeling algorithm was designed to identify all desaturation events associated with apneic episodes. The algorithm operates on the principle that a desaturation is linked to a respiratory event when its nadir (the lowest oxygen level during the desaturation) is reached within a 30-s span following the event's conclusion32. For each segment, the output label was the number of apnea and hypopnea events associated with a ≥ 3% oxygen desaturation occurring within the 20-min window. Figure 2 exemplifies this process, showcasing the correlation between apneic events and their subsequent oxygen desaturations in AF and SpO2 signals. This labeling was conducted in accordance with the annotation files provided in the CHAT dataset. To validate the effectiveness of our labeling algorithm, we conducted a comparative analysis, matching the number of detected apneic events linked with a ≥ 3% oxygen desaturation against the sum of the original PSG variables in the dataset, which record the number of each type of apneic event associated with a 3% oxygen desaturation. Only recordings with a labeling error margin below 10% were considered suitable for training and evaluating our models. This criterion led to the selection of 884 SpO2 recordings for our study. Table 1 presents the clinical and demographic data of the subjects from these selected recordings.
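The sketch below illustrates the counting rule under stated assumptions: event end times, desaturation nadir times, and desaturation depths are taken as already parsed from the CHAT annotation files, and the 30-s linking window follows the description above.

```python
import numpy as np

def label_segments(event_ends, desat_nadirs, desat_drops, total_sec, seg_sec=1200):
    """Per-segment count of apneic events linked to a >=3% desaturation.

    event_ends   : end times (s) of scored apneas/hypopneas
    desat_nadirs : times (s) of desaturation nadirs
    desat_drops  : desaturation depths (%) at each nadir
    """
    desat_nadirs = np.asarray(desat_nadirs)
    desat_drops = np.asarray(desat_drops)
    n_seg = total_sec // seg_sec
    labels = np.zeros(n_seg, dtype=int)
    valid = desat_nadirs[desat_drops >= 3.0]
    for end in event_ends:
        # An event counts if some >=3% desaturation nadir falls within
        # the 30 s following the event's conclusion.
        if np.any((valid >= end) & (valid <= end + 30)):
            seg = int(end // seg_sec)
            if seg < n_seg:
                labels[seg] += 1
    return labels
```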

Figure 2. Synchronous events of apnea and oxygen desaturation.

Table 1 The statistical characteristics of subjects included in this study.

Deep learning model

ResNet architecture for SpO2 signal analysis

CNN-based models have demonstrated significant effectiveness in diagnosing the severity of pediatric OSA, as shown in previous research27,28,29,30. However, these networks often encounter a major obstacle when increasing the number of convolutional layers: the training loss tends to plateau, a phenomenon largely attributed to the vanishing gradient problem. The Residual Network (ResNet) framework was developed to address this challenge, notably enhancing the accuracy of deep CNNs36. ResNet introduces the concept of residual learning, a paradigm shift from traditional deep network methodologies that attempt a direct mapping from input to output. Instead, ResNets focus on learning residual mappings. This is mathematically expressed as Y = F(X) + X, where Y denotes the desired output, F(X) represents the residual function, and X is the input. This approach allows the network to concentrate on learning the additional information (F(X)) needed to achieve the desired output Y, particularly beneficial when F(X) is close to zero. The benefits of ResNet include easing the training of deep networks by alleviating the gradient vanishing issue and enabling the construction of much deeper networks without sacrificing accuracy. Additionally, ResNet incorporates skip connections, which directly add the input X to the output of the residual function F(X), promoting smoother gradient flow during training. Figure 3 illustrates a typical residual block, showing how input X is transformed into its desired mapping Y.

Figure 3. Schematic of a residual learning block.

In our study, we adapted the ResNet-34 architecture, initially designed for image recognition36, into a 1D format suitable for analyzing SpO2 signals. This adaptation involved replacing 2D convolutional layers (Conv2D) with 1D convolutional layers (Conv1D), which are more apt for processing time-series data like SpO2 signals. Given the size of our dataset, we started with the relatively shallow ResNet-34 model, consisting of 34 layers. However, to mitigate overfitting, we reduced the model to 16 layers, which provided efficient performance with quicker convergence compared to deeper models.

Figure 4 presents a detailed depiction of our modified ResNet architecture, illustrating the sequence and function of each layer; a minimal code sketch follows the list:

  • Input Layer: Receives the raw 20-min SpO2 signal segment with a size of 1200 × 1.

  • Conv1D Layer: Applies learnable filters to the input, extracting fundamental patterns and low-level features.

  • Batch Normalization: Normalizes the activations from the Conv1D layer, stabilizing and expediting the training by reducing internal covariate shift.

  • ReLU Activation: Introduces non-linearity through the rectified linear unit (ReLU) function.

  • MaxPooling1D: Reduces the dimensionality of the data, maintaining essential features while lessening computational load.

  • Residual Blocks: Each block, consisting of two convolutional layers, forms the core component of the ResNet. Our model includes seven such blocks.

  • Flatten Layer: Transforms the output of the last residual block into a one-dimensional vector, ensuring compatibility with the subsequent fully connected layer.

  • Fully Connected Layer: Processes the received output to perform the final mapping, producing a numerical prediction.

  • Output Layer: Generates the final prediction for the number of apneic events (\(y_{pred}\)).
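The following is a minimal Keras sketch of this layer sequence, assuming TensorFlow 2.x; the filter counts and kernel sizes are illustrative placeholders, as the exact values appear in Fig. 4 rather than being reproduced here.

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size=7):
    """Two Conv1D-BatchNorm stages with a skip connection (see Fig. 3)."""
    shortcut = x
    y = layers.Conv1D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv1D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:   # 1x1 conv to match channels for the addition
        shortcut = layers.Conv1D(filters, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))

inputs = keras.Input(shape=(1200, 1))        # raw 20-min SpO2 segment
x = layers.Conv1D(32, 7, padding="same")(inputs)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.MaxPooling1D(pool_size=2)(x)
for _ in range(7):                           # seven residual blocks
    x = residual_block(x, 32)
x = layers.Flatten()(x)
y_pred = layers.Dense(1)(x)                  # predicted number of apneic events
model = keras.Model(inputs, y_pred)
```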

Figure 4. Architecture of the modified ResNet model for SpO2 signal analysis.

Development of the CNN-BiGRU-attention architecture

In our study, we designed a sophisticated model that combines a CNN with a Bidirectional Gated Recurrent Unit (BiGRU) and an attention mechanism. This integrative approach, inspired by its successful application in blood pressure estimation37, is adapted here to process SpO2 signals for AHI estimation. The incorporation of the attention mechanism is driven by its proven effectiveness in focusing on pertinent segments within datasets, a principle extensively utilized in diverse fields such as image captioning, machine translation, and speech recognition38. Our model aims to investigate the synergy of these distinct architectures, assessing how the RNN structure processes features extracted by the CNN and the role of the attention mechanism in augmenting RNN performance for AHI estimation. Figure 5 in our paper illustrates the architecture of the proposed model. The process begins with the CNN layer, which is responsible for extracting pertinent features from the input SpO2 signals. Following feature extraction, the BiGRU layer, recognized for its capability to handle long-term dependencies in sequential data, processes these features. This is achieved by analyzing the data in both forward and backward temporal directions, thereby capturing complex temporal dynamics inherent in the SpO2 signals. Furthermore, we integrate an attention mechanism with the outputs of the BiGRU layer. This mechanism assigns weights to different temporal features, enabling the model to concentrate its predictive capacity on the most crucial segments of the signal. This combination—CNN for initial feature extraction, BiGRU for in-depth temporal processing, and attention for targeted focus—constitutes a potent framework designed to enhance the accuracy of AHI estimation from SpO2 signals.

Figure 5. CNN-BiGRU-Attention model architecture for SpO2 signal analysis.

It is pertinent to note that while RNNs are pivotal in handling sequential data tasks such as speech recognition, they often struggle with long sequences due to the vanishing gradient problem39. LSTM networks, introduced by Hochreiter and Schmidhuber40, were developed to counter this issue, incorporating specialized gates to manage information flow. Bidirectional LSTMs (BiLSTMs) further refine this approach by processing data in both forward and backward directions, thus encompassing past and future contexts in the analysis. GRUs, a variant conceived by Cho et al.41, streamline the design of LSTMs by combining the input and forget gates into a single update gate, thereby reducing the model's complexity while retaining efficiency in processing sequence data. The architecture of the GRU cell employed in this analysis is portrayed in Fig. 6. At each time step, the GRU cell receives two input vectors: the preceding hidden output value \({h}_{t-1}\), which holds the feature values from the previous time step across all feature maps extracted by the CNN layer filters, and the current input vector \({x}_{t}\), which contains the current feature values from those feature maps. The current hidden output value of the cell \({h}_{t}\) is computed through the following equations:

$$ \begin{aligned} z_{t} & = \sigma \left( W_{z} \cdot \left[ h_{t-1} , x_{t} \right] \right) \\ r_{t} & = \sigma \left( W_{r} \cdot \left[ h_{t-1} , x_{t} \right] \right) \\ \tilde{h}_{t} & = \tanh \left( W_{h} \cdot \left[ r_{t} \odot h_{t-1} , x_{t} \right] \right) \\ h_{t} & = \left( 1 - z_{t} \right) \odot h_{t-1} + z_{t} \odot \tilde{h}_{t} \end{aligned} $$
(1)
Figure 6. Schematic of a Gated Recurrent Unit (GRU) cell.

Here, \({z}_{t}\) and \({r}_{t}\) denote the update and reset gate vectors, respectively. The weight matrices \({W}_{z}\), \({W}_{r}\) and \({W}_{h}\) are trainable and contribute to the gate operations. The term \({\widetilde{h}}_{t}\) signifies the candidate state, capturing how much new information is incorporated after the reset gate is applied. The activation functions \(\sigma (\cdot )\) and \(\text{tanh}(\cdot )\) denote the sigmoid and hyperbolic tangent functions, respectively, while \(\odot \) signifies element-wise multiplication. Unlike the conventional unidirectional GRU, a bidirectional GRU (BiGRU) is adopted in this study, which encompasses the hidden output values of both forward and backward layer cells. Figure 7 shows the architecture of a BiGRU layer, with one pair of GRU cells at each time step. The final hidden output vector of the BiGRU layer at time step t, \({\overrightarrow{h}}_{{out}_{t}}\), is a concatenation of the forward hidden output vector \({\overrightarrow{h}}_{t}\) (holding the forward layer cells' hidden output values) and the backward hidden output vector \({\overleftarrow{h}}_{t}\) (holding the backward layer cells' hidden output values):

$$ \begin{aligned} \vec{h}_{t} & = [ h_{tc_{1}} , h_{tc_{2}} , \ldots , h_{tc_{n-1}} , h_{tc_{n}} ] \\ \overleftarrow{h}_{t} & = [ h_{tc_{n}} , h_{tc_{n-1}} , \ldots , h_{tc_{2}} , h_{tc_{1}} ] \\ \vec{h}_{out_{t}} & = [ \vec{h}_{t} , \overleftarrow{h}_{t} ] \end{aligned} $$
(2)
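For concreteness, a minimal NumPy sketch of the single-cell update in Eq. (1) is shown below; the weight shapes and the concatenation convention \([h_{t-1}, x_t]\) are assumptions consistent with the equations, and bias terms are omitted as in Eq. (1).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x_t, Wz, Wr, Wh):
    """One GRU update following Eq. (1); each W acts on the concatenation [h, x]."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx)                                        # update gate z_t
    r = sigmoid(Wr @ hx)                                        # reset gate r_t
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                     # new hidden state h_t
```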
Figure 7. Schematic of a Bidirectional GRU (BiGRU) layer structure with one pair of GRU cells at each time step.

The attention mechanism, initially introduced by Bahdanau et al. in 201442 to address limitations in traditional sequence-to-sequence models, marked a significant breakthrough in natural language processing. This mechanism has since evolved into self-attention (intra-attention), finding widespread application in diverse DL tasks. In our work, we harness self-attention to enhance the model's ability to capture crucial temporal features within sequential SpO2 signals. By applying self-attention to the outputs of the BiGRU layer, our model dynamically assigns weights to different temporal features, prioritizing the most relevant signal segments for accurate AHI estimation. In developing the self-attention mechanism, we consider the final hidden state matrix \({H}_{s}\) of the BiGRU, whose rows are the hidden output vectors \({\overrightarrow{h}}_{{out}_{t}}\) at each time step t \(\in \) [1, N]. The significance score vector \(\overrightarrow{s}\) is computed by multiplying the hidden state matrix by a randomly initialized weight vector \(\overrightarrow{w}\), adding a bias vector \(\overrightarrow{b}\), and applying a score function \(score(\cdot )\), as outlined below:

$$ \vec{s} = score \left( H_{s} \vec{w} + \vec{b} \right) , \quad H_{s} = \begin{bmatrix} \vec{h}_{out_{1}} \\ \vec{h}_{out_{2}} \\ \vdots \\ \vec{h}_{out_{N}} \end{bmatrix} $$
(3)

For the score function, we initially experimented with the dot product, tanh, and ReLU functions. Ultimately, we selected ReLU as the score function because it provided better performance and faster convergence during training. After obtaining the importance score for each BiGRU hidden output vector at each time step and forming the score vector \(\overrightarrow{s}\), the attention weight \({\alpha }_{i}\) for the hidden output vector at time step \(i\) is determined by applying a softmax function to the score vector:

$$ \vec{s} = [ s_{1} , s_{2} , \ldots , s_{N} ] , \quad \alpha_{i} = \frac{\exp \left( s_{i} \right)}{\sum\limits_{j=1}^{N} \exp \left( s_{j} \right)} $$
(4)

This softmax function guarantees that the attention weights collectively sum to 1, effectively normalizing the significance scores across all time steps. The attention weights vector \(\overrightarrow{\alpha }\) collects the attention weight for the hidden output vector at each time step. The final output vector \(\overrightarrow{v}\) is derived by multiplying the attention weights vector by the hidden state matrix of the BiGRU:

$$ \vec{\alpha} = [ \alpha_{1} , \alpha_{2} , \ldots , \alpha_{N} ] , \quad \vec{v} = \vec{\alpha} \, H_{s} $$
(5)

This weighted sum offers a comprehensive representation of the input sequence, highlighting the contributions of distinct time steps based on their computed attention weights. To further clarify the effect of the attention layer on each part of the input signal segment, we plotted an attention map (heatmap) of the attention weights vector (attention scores) for a 20-min signal segment from the test set, as shown in Fig. 8. The heatmap, obtained by graphing the \(\overrightarrow{\alpha }\) vector resulting from the softmax output in Eq. (4), was scaled using log10 to better represent the distribution of attention score values.
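A minimal Keras sketch of this attention layer and its placement after the BiGRU is given below, assuming TensorFlow 2.x; the convolutional and recurrent layer sizes are illustrative placeholders rather than the exact configuration of Fig. 5.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class SelfAttention(layers.Layer):
    """Implements Eqs. (3)-(5): s = ReLU(H w + b), alpha = softmax(s), v = alpha H."""
    def build(self, input_shape):
        d = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(d, 1), initializer="random_normal")
        self.b = self.add_weight(name="b", shape=(1,), initializer="zeros")

    def call(self, h):                                   # h: (batch, N, d) BiGRU outputs
        s = tf.nn.relu(tf.matmul(h, self.w) + self.b)    # significance scores, (batch, N, 1)
        alpha = tf.nn.softmax(s, axis=1)                 # attention weights over time steps
        return tf.reduce_sum(alpha * h, axis=1)          # weighted sum v, (batch, d)

inputs = keras.Input(shape=(1200, 1))                    # raw 20-min SpO2 segment
x = layers.Conv1D(32, 7, padding="same", activation="relu")(inputs)
x = layers.MaxPooling1D(pool_size=2)(x)
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
v = SelfAttention()(x)
y_pred = layers.Dense(1)(v)                              # predicted apneic-event count
model = keras.Model(inputs, y_pred)
```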

Figure 8. Attention map of the attention scores derived from the attention layer for a 20-min signal segment input.

Model training and optimization

For training the ResNet and CNN-BiGRU-Attention models, we employed distinct initialization methods. The ResNet model was initialized using the He-normal method43, while the CNN-BiGRU-Attention model started with random weights. Data were fed into both models in batches and shuffled at each training epoch to enhance convergence. Since the number of non-apneic SpO2 segments exceeded that of apneic segments, we balanced the data by oversampling the apneic segments, repeating them before the start of each training process. To facilitate efficient weight updates, we employed the adaptive moment estimation (Adam) optimizer44 with an initial learning rate selected during hyperparameter tuning (see "Results"). For the loss function, we chose the Huber loss45 due to its robustness, as evidenced by its strong performance in previous AHI estimation studies28,29. The Huber loss strikes a balance between quadratic and linear loss behaviors, as expressed by its formula:

$$ L(\hat{y}_{n} , y_{n} , \delta) = \begin{cases} \frac{1}{2} \left( \hat{y}_{n} - y_{n} \right)^{2} , & \left| \hat{y}_{n} - y_{n} \right| \le \delta \\ \delta \left( \left| \hat{y}_{n} - y_{n} \right| - \frac{\delta}{2} \right) , & \text{otherwise} \end{cases} $$
(6)

Here, \({\widehat{y}}_{n}\) and \({y}_{n}\) denote the label and model output for segment n, respectively. The parameter δ acts as a threshold and serves as a tunable hyperparameter crucial for effectively handling data with outliers and noise during model optimization. The Huber loss is particularly effective for datasets with outliers, providing a quadratic loss for smaller errors (inliers) and a linear loss for larger errors (outliers). Two key techniques were implemented to optimize the training process. Firstly, a dynamic learning rate reduction strategy was employed, reducing the learning rate by 50% every 10 epochs. This strategy promotes training stability, facilitating smoother convergence. Secondly, early stopping was introduced to halt the training process if the validation set loss did not improve for 30 consecutive epochs, ensuring the generalization capability of the models. We employed the Keras deep learning framework with a TensorFlow backend for model training in the Google Colab environment, leveraging the availability of NVIDIA Tesla T4 GPUs.
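The training setup can be sketched with standard Keras callbacks as follows; the hyperparameter values (learning rate 0.001, δ = 1.5, 120 epochs, batch size 32) are those reported in "Results", while `model`, `x_train`, and the validation arrays are assumed to come from the preceding steps.

```python
from tensorflow import keras

def halve_every_10_epochs(epoch, lr):
    # Reduce the learning rate by 50% every 10 epochs.
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

callbacks = [
    keras.callbacks.LearningRateScheduler(halve_every_10_epochs),
    # Stop if validation loss fails to improve for 30 consecutive epochs.
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=30),
]

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss=keras.losses.Huber(delta=1.5),
)
model.fit(
    x_train, y_train,                  # oversampled, shuffled segment data
    validation_data=(x_val, y_val),
    epochs=120, batch_size=32, shuffle=True,
    callbacks=callbacks,
)
```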

AHI estimation

After obtaining the estimated number of apneic events in each 20-min segment, the AHI is calculated as the sum of the detected apneic events divided by the total recording duration. Utilizing the DL model's output \({y}_{n}\) for each 20-min SpO2 segment (n = 1, 2, 3, …, N), the AHI for each patient is calculated using the formula:

$$ {\text{AHI - Model}} = \frac{{\sum\limits_{n = 1}^{N} {y_{n} } }}{{{\text{SpO}}_{{2}} {\text{ signal recording time (h)}}}} \, $$
(7)

where N is the total number of 20-min segments in the SpO2 signal. Calculating the AHI properly requires the total sleep duration for each patient, which is normally obtained through sleep staging analysis. In the absence of this information, we assumed that the total length of the SpO2 signal approximates the sleep duration. To refine this estimation, a regression model was employed to map the AHI calculated from the total SpO2 recording time to the actual AHI, which accounts for sleep duration. The regression model uses an optimization method to determine the optimal coefficients of the linear equation \(ax+b\) that minimize an error function. In this study, we trained a Huber regressor (using the Huber loss function) on the validation set; it adjusts the parameters of the linear equation during training to minimize the loss across all predictions. The choice of the delta value significantly influences the behavior of the Huber loss: a higher delta enlarges the quadratic region, increasing sensitivity to outliers, while a lower delta enlarges the linear region, increasing robustness to them. After experimenting with different delta values, we found that a delta of 6 minimized the root mean squared error on the validation set, striking an optimal balance between sensitivity to outliers and overall model performance in the presence of data variability. This approach effectively mitigates the lack of explicit sleep duration information, enhancing the accuracy of AHI estimation; the final estimated AHI is obtained through the Huber regression model. Figure 9 displays a scatter plot of total record time-based AHI versus actual AHI in the validation set, along with the fitted regression line.
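A minimal scikit-learn sketch of this mapping step is shown below; the arrays are hypothetical placeholders, and scikit-learn's `epsilon` parameter plays the role of the delta threshold described above.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor

# Hypothetical validation-set values: AHI from Eq. (7) (events per recorded hour)
# and the reference AHI from PSG (events per hour of sleep).
ahi_record_time = np.array([0.8, 2.4, 5.1, 11.7, 3.3])
ahi_reference   = np.array([1.0, 2.9, 6.0, 13.2, 3.8])

# delta = 6 from the text; scikit-learn names this threshold `epsilon`.
reg = HuberRegressor(epsilon=6.0)
reg.fit(ahi_record_time.reshape(-1, 1), ahi_reference)
a, b = reg.coef_[0], reg.intercept_      # coefficients of the linear map ax + b

# Map a new recording's record-time AHI to the corrected estimate.
ahi_corrected = reg.predict(np.array([[4.2]]))
```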

Figure 9. Scatter plot of total record time-based AHI and actual AHI in the validation set.

Performance evaluation

To assess the proficiency of the models in estimation, we employed fundamental regression metrics. These metrics include mean absolute error (MAE), root mean squared error (RMSE) and R-squared (R2) for per-recording AHI estimation. The formulas for each metric are specified accordingly:

(1) MAE.

$$ MAE = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {AHI_{{{\text{actual}}, \, i}} - AHI_{{{\text{estimated}}, \, i}} } \right|} $$
(8)

(2) RMSE.

$$ RMSE = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {\left( {AHI_{{{\text{actual}}, \, i}} - AHI_{{{\text{estimated}}, \, i}} } \right)^{2} } } $$
(9)

(3) R2.

$$ R^{2} = 1 - \frac{\sum\limits_{i=1}^{n} \left( AHI_{\text{actual},\,i} - AHI_{\text{estimated},\,i} \right)^{2}}{\sum\limits_{i=1}^{n} \left( AHI_{\text{actual},\,i} - \overline{AHI}_{\text{actual}} \right)^{2}} $$
(10)

We utilized scatter and Bland–Altman46 plots to compare the AHI predicted by the models with the actual AHI from the PSG test, ensuring proper alignment for a comprehensive evaluation. For apnea severity classification, the overall agreement of the predicted AHI in estimating SAH severity was evaluated using confusion matrices, the four-class Cohen's kappa coefficient (kappa)47, and four-class accuracy. To ensure a thorough evaluation, we categorized patients based on common AHI thresholds of 1, 5, and 10 events per hour (e/h), facilitating binary classification into those below and above each specified threshold. For each threshold, we assessed the model's diagnostic performance, covering sensitivity (Se), specificity (Sp), positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (LR+), and negative likelihood ratio (LR−):

  • Sensitivity (Se): Percentage of SAH positive patients correctly classified.

  • Specificity (Sp): Percentage of SAH negative patients correctly classified.

  • Positive Predictive Value (PPV): Proportion of actual positive cases among instances predicted as positive by the model.

  • Negative Predictive Value (NPV): Proportion of actual negative cases among instances predicted as negative by the model.

  • Positive Likelihood Ratio (LR+): Se / (1 − Sp).

  • Negative Likelihood Ratio (LR−): (1 − Se) / Sp.

This comprehensive approach provided a robust evaluation of our model’s accuracy in effectively categorizing patients according to the severity of SAH.
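The binary diagnostic metrics at each threshold can be computed as in the sketch below; the function name and array inputs are ours, and zero-division guards are omitted for brevity.

```python
import numpy as np

def threshold_diagnostics(ahi_true, ahi_pred, threshold):
    """Se, Sp, PPV, NPV, LR+ and LR- for one AHI cutoff (1, 5, or 10 e/h)."""
    actual = np.asarray(ahi_true) >= threshold
    pred = np.asarray(ahi_pred) >= threshold
    tp = np.sum(pred & actual)
    tn = np.sum(~pred & ~actual)
    fp = np.sum(pred & ~actual)
    fn = np.sum(~pred & actual)
    se, sp = tp / (tp + fn), tn / (tn + fp)
    return {
        "Se": se, "Sp": sp,
        "PPV": tp / (tp + fp), "NPV": tn / (tn + fn),
        "LR+": se / (1 - sp), "LR-": (1 - se) / sp,
    }

# Example: diagnostics at the 5 e/h cutoff for hypothetical predictions.
print(threshold_diagnostics([0.5, 3.2, 7.8, 12.1], [0.9, 4.6, 6.5, 9.8], 5))
```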

Results

The study employed a threefold cross-validation approach on the dataset, reserving a fixed 10% for the validation set and selecting a test set in each iteration, leaving the remaining data for model training. Before initiating the cross-validation, an extensive exploration of diverse hyperparameter combinations was conducted to optimize the models' performance. The final hyperparameters were chosen based on achieving the highest four-class kappa value on the validation set: a learning rate of 0.001, 120 epochs, a Huber loss delta (δ) of 1.5, and a batch size of 32. Following the threefold cross-validation, Figs. 10 and 11 present scatter plots for the ResNet and CNN-BiGRU-Attention models, respectively, on the test set of each fold. These scatter plots display the AHI estimated by each model against the actual AHI for each of the three test sets, along with R2 and RMSE. Notably, in both figures, the scatter plot points of the fold-2 test set exhibit a higher concentration near the diagonal line, indicating superior agreement between the actual and estimated AHI compared to the other fold test sets. Figure 12 presents a Bland–Altman plot for the AHI estimated by the ResNet model compared to the reference AHI. Across all folds, the majority of data points fall within the limits of agreement (LoA), indicating acceptable agreement. In fold-1, the negative mean value suggests a slight underestimation by the proposed model in the test subset. The fold-2 plot demonstrates a positive mean error close to zero (0.09), with narrower LoA (−3.26 to 3.45), reflecting a more accurate estimation with reduced variability and superior agreement compared to folds 1 and 3. In fold-3, the mean bias is essentially negligible (0.03); however, the LoA is the broadest (−5.89 to 5.83), indicating increased variability in the estimates despite an unbiased mean. Figure 13 shows a Bland–Altman plot for the AHI estimated by the CNN-BiGRU-Attention model compared to the reference AHI. Across all folds, the negative mean value indicates a slight tendency of the CNN-BiGRU-Attention model to underestimate the AHI. The fold-2 plot shows narrower LoA (−3.5 to 2.8), reflecting a more accurate estimation with reduced variability and superior agreement compared to the other folds, while fold-3 has the broadest LoA, suggesting increased variability in the estimates.

Figure 10. Scatter plots between actual AHI and estimated AHI by the ResNet model in the fold-1 (a), fold-2 (b), and fold-3 (c) test sets.

Figure 11. Scatter plots between actual AHI and estimated AHI by the CNN-BiGRU-Attention model in the fold-1 (a), fold-2 (b), and fold-3 (c) test sets.

Figure 12. Bland–Altman plots of actual AHI and estimated AHI by the ResNet model in the fold-1 (a), fold-2 (b), and fold-3 (c) test sets.

Figure 13. Bland–Altman plots of actual AHI and estimated AHI by the CNN-BiGRU-Attention model in the fold-1 (a), fold-2 (b), and fold-3 (c) test sets.

The confusion matrix for each model in every fold test set is meticulously outlined in Fig. 14. Additionally, Table 2 provides a comprehensive overview of the regression and classification metrics for each model's performance in each fold test set. The ResNet model exhibited the highest four-class accuracy and four-class kappa value in the fold-3 test set, despite having the highest RMSE value. In contrast, the CNN-BiGRU-Attention model showcased the highest four-class accuracy and four-class kappa value on the fold-2 test set with the minimum RMSE across all folds. For additional insights into the diagnostic capabilities of each model at commonly used AHI thresholds of 1, 5, and 10 e/h, Table 3 presents detailed results for each fold test set.

Figure 14. Confusion matrices of the predicted SAH severity group by the ResNet model (a) and CNN-BiGRU-Attention model (b) in each fold test set.

Table 2 Models' performance on each fold test set.
Table 3 The diagnostic ability of each model for AHI thresholds of 1, 5, and 10 e/h.

It is worth noting that, based on the table results, at the AHI = 1 e/h threshold both models tend to misclassify healthy subjects (normal group) as SAH-positive, while at higher thresholds (AHI = 5 e/h and AHI = 10 e/h) they tend to misclassify subjects from the higher-severity groups (moderate and severe) into lower-severity groups (mild and normal). It is therefore important that specificity remains adequate alongside accuracy at AHI = 1 e/h, and that sensitivity remains adequate alongside accuracy at the higher thresholds. As seen in the table, despite a reduction in sensitivity at higher AHI thresholds, both models maintained remarkable sensitivity and accuracy, indicating their proficiency in accurately detecting patients with high-severity SAH, which holds pivotal clinical implications. The ResNet model exhibited overall high accuracy, sensitivity, and specificity on the fold-3 test set, with particularly high sensitivity and accuracy at AHI thresholds of 5 e/h and 10 e/h on fold-2. The CNN-BiGRU-Attention model demonstrated overall high accuracy, sensitivity, and specificity on the fold-2 test set, with particularly high sensitivity and accuracy at the AHI threshold of 5 e/h on fold-1.

Model comparison

To assess the overall performance of the models, we analyzed the average metrics across all folds, including four-class accuracy, four-class kappa, and RMSE. The results in Table 4 show the CNN-BiGRU-Attention model outperforming the ResNet: it exhibited higher average four-class accuracy and kappa values, as well as lower RMSE, across all fold test sets. Moreover, it featured significantly fewer parameters and a smaller size than the ResNet model, resulting in reduced training time. The findings in Table 2 consistently reinforce the superior performance of the CNN-BiGRU-Attention model on each fold test set. The ResNet model is equipped with residual connections that help mitigate the vanishing gradient problem, thus facilitating the training of deeper networks. However, its primary focus on spatial feature extraction may not adequately capture the temporal dependencies present in sequential SpO2 data. In contrast, the CNN-BiGRU-Attention model integrates a CNN for spatial feature extraction with a BiGRU for temporal feature learning. The inclusion of an attention layer further refines the model's focus on the most pertinent aspects of the signal, thereby enhancing overall performance. This hybrid approach enables the CNN-BiGRU-Attention model to capture both spatial and temporal features more effectively than ResNet, which is particularly beneficial for time-series data like SpO2 signals that contain critical temporal dependencies for accurate apnea–hypopnea event detection. Despite ResNet's advanced deep learning capabilities, our limited dataset posed overfitting challenges when training deeper layers. This experience highlighted that more streamlined architectures like CNN-BiGRU-Attention can offer better generalization across different folds in cross-validation, whereas ResNet may be more susceptible to overfitting, leading to less consistent performance on validation and test sets.

To assess the importance of each component in the proposed CNN-BiGRU-Attention model, we conducted an ablation study examining four configurations: without the attention layer, without the CNN layer, without the BiGRU layer, and without both the BiGRU and attention layers. The average metrics across all threefold test sets for each architecture are presented in Table 5. The results indicate that the omission of any component leads to a significant decline in the model's performance. This emphasizes the importance of the CNN layers for extracting spatial features, the BiGRU layers for capturing temporal dependencies, and the attention mechanism for highlighting relevant segments of the signal. These findings validate the effectiveness of the chosen architecture in estimating the AHI from SpO2 signals.

Table 4 Comparative analysis of models' performance across all folds.
Table 5 Ablation study results for CNN-BiGRU-Attention model layers.

Discussion

In this study, we delved into the efficacy of leveraging a residual architecture and a combination of CNN and RNN architectures, augmented by an attention mechanism, to process SpO2 signals as 1D raw data for the estimation of AHI and the assessment of pediatric SAH severity. To the best of our knowledge, the application of these architectures to evaluating the severity of pediatric SAH using SpO2 signals constitutes a novel contribution. We employed a threefold cross-validation method to demonstrate the models' generalizability on the dataset. Remarkably, both models demonstrated commendable performance in terms of four-class accuracy and kappa values on each fold test set. Furthermore, the CNN-BiGRU-Attention model achieved a significant average four-class accuracy of 75.95% and an average four-class kappa of 0.63 across all folds, indicating its ability to accurately classify patients across various levels of SAH severity. As evidenced by the results in Table 4, this model achieved high accuracies of 89.24%, 91.24%, and 96.41% for AHI thresholds of 1, 5, and 10 e/h on the fold-2 test set, underscoring its high diagnostic capability to detect SAH patients, especially those in need of urgent treatment or at high health risk. Despite the promising results, both models tended to underestimate the AHI for subjects with severe SAH, leading to low sensitivities at AHI thresholds of 5 and 10 e/h. This observation might be attributed to the imbalance of subjects across SAH severity groups in the dataset; we addressed this by oversampling apneic SpO2 signal segments during training. Additionally, it is crucial to acknowledge that the AHI estimated by both models uses the length of the SpO2 signal as the total recording time, whereas the AHI calculated in PSG tests is based on total sleep time. Although we employed an additional linear regression model to mitigate this error by mapping record-time AHI to sleep-time AHI via the validation set, limitations may still remain in accurately estimating the AHI.

Table 6 provides an overview of previous studies dedicated to the analysis of pediatric SAH and OSA severity assessment. Hornero et al. employed a Multi-Layer Perceptron (MLP) network to estimate AHI from 3602 SpO2 recordings, categorizing subjects into four OSA severity classes and achieving an overall accuracy of 54.7%15. Jiménez-García et al. (2020) utilized the AdaBoost algorithm for a 4-class classification of pediatric OSA, using features from both AF and SpO2 signals across a dataset of 974 pediatric subjects, attaining a 4-class accuracy of 57.95%17. These studies, while significant in their use of ML algorithms, are contrasted by more recent research showing enhanced performance with DL algorithms. Recent studies have used CNN-based models in combination with the CHAT dataset to estimate AHI from ECG signals30, SpO2 signals27,28, and a combination of SpO2 and AF signals29,31. In 2020, Vaquerizo-Villar et al. employed a 1D CNN model for assessing SAH severity from 746 SpO2 signals in the CHAT dataset's Baseline and Follow-up parts, reaching a four-class accuracy of 67.15% and a kappa value of 0.31 in a test set of 246 subjects27. Following this, in 2021, the same team applied a similar model to the CHAT dataset's Baseline, Follow-up, and Non-randomized parts, as well as the University of Chicago Medicine (UofC) and Burgos University Hospital (BUH) datasets, achieving a four-class accuracy of 72.8% and a kappa of 0.51 in a test set of 312 subjects from the CHAT dataset28. In a different approach, Jiménez-García et al., in 2022, utilized a 2D CNN architecture to estimate pediatric OSA severity from SpO2 and AF signals as raw 2D data, applying it to all parts of the CHAT dataset and the UofC dataset; they achieved a four-class accuracy of 72.55% and a kappa of 0.60 on the CHAT test set29. In subsequent work, they improved their results to a four-class accuracy of 74.51% and a kappa of 0.62 on the CHAT test set by utilizing a 2D CNN layer followed by a BiGRU layer31. García-Vicente et al., in 2023, focused on a 1D CNN model using ECG signals from the CHAT dataset for a similar purpose, attaining a four-class accuracy of 57.86% and a kappa of 0.37 with 299 test subjects30. Our research, distinct from these studies, considered all types of SAH events, including CSA. As previously stated, the main reason for this consideration is that CSA is always associated with a lack of respiratory effort48, posing a challenge in classifying CSA and OSA events based solely on the oxygen desaturation of SpO2 signals without chest and abdominal movement signals. Furthermore, we opted not to use the Non-randomized part of the CHAT dataset due to its lack of crucial clinical information necessary for accurate labeling. Unfortunately, we were unable to obtain permission to access other private datasets. Despite these limitations, our results can be most directly compared with Vaquerizo-Villar et al. (2020)27, as we used the same dataset and the same proportions of data for training, validation, and testing. Moreover, we implemented threefold cross-validation to demonstrate the generalizability of our models, and it is evident that our models, especially the CNN-BiGRU-Attention model, exhibited higher four-class accuracy and kappa values across all folds. In contrast to Jiménez-García et al.29,31, our research focuses solely on the SpO2 signal, a single-channel source, aligning with practical scenarios and emphasizing cost-effectiveness.
Unlike García-Vicente et al.30, our model uses only SpO2 signals for pediatric SAH severity assessment, specifically because they can be recorded by pulse oximeters, which are more comfortable for patients than electrocardiographs. Overall, although previous studies utilizing CNN-based models have demonstrated high performance in both SAH and OSA severity assessment, our approach highlights the potential of hybrid CNN-BiGRU models, incorporating residual and attention-based mechanisms, specifically for pediatric SAH severity assessment. While achieving high diagnostic accuracy across various AHI thresholds, our models underscore the difficulty in accurately estimating AHI for severe SAH cases, primarily due to imbalances in subjects from different severity groups. We also stress the significance of considering sleep duration in AHI estimation, suggesting avenues for further improvement, such as integrating contextual information, including sleep stage analysis.

Table 6 Performance comparison with previous studies.

Despite these challenges, our study offers valuable insights into the use of residual architecture and attention-based hybrid CNN-RNN architecture for pediatric SAH assessment, setting a precedent for future developments in this area. In terms of data augmentation, we replicated apneic signal segments. For future research, additional techniques such as the overlapping segmentation of signals, as utilized by García-Vicente et al.30 and Jiménez-García et al.29,31, could be adopted for data augmentation and to balance the dataset. Such strategies may also help mitigate overfitting and enhance the generalization capabilities of sophisticated models like ResNet. For future work, we recommend a two-stage approach using two DL models in parallel: one for sleep staging classification and the other for estimating the number of apneic events. PPG signals, as used in previous studies for sleep staging and estimating total sleep time21,49, could be suitable for this purpose. Concurrently, SpO2 signals derived from PPG can be used to estimate the number of apneic events through a regression model. This dual-model strategy should also encompass the refinement of AHI estimation by incorporating the quantification of apnea events from SpO2 signals. Moreover, we suggest conducting experiments to ascertain whether demographic features such as age and gender correlate with signal characteristics, which could then inform feature embedding during model training. We utilized a moving average filter for noise reduction and signal smoothing, following common practices in prior studies; however, exploring alternative filtering methods could yield signals of higher quality. Additionally, our algorithm for zero-level artifact removal, inspired by previous research9,35, could be enhanced through further investigation to more effectively address artifacts. Future research may explore diverse filtering techniques to further improve signal fidelity.