1 Introduction

A common sleep disorder known as sleep apnea and hypopnea syndrome (SAHS) is characterized by abnormal reductions or pauses in breathing during sleep [1]. It is estimated to affect 2% of middle-aged women and 4% of middle-aged men. People suffering from severe sleep apnea are prone to developing coronary artery disease, congestive heart failure, and stroke [2]. Obstructive sleep apnea (OSA) is a common problem that affects at least 5% of adults [3, 4] and is linked to a higher chance of hypertension [5], coronary heart disease [6], atrial and ventricular arrhythmias [7], and mortality [8]. The physiologic linkages between obstructive airway events and cardiac pathology are multifactorial and carefully documented in a consensus document from the American College of Cardiology [9] and the American Heart Association [10]. The association between obstructive sleep apnea and heart disease is supported by evidence that treating sleep apnea lowers systolic blood pressure, improves left ventricular systolic function, and lowers platelet activation [11].

Sleep apnea can be divided into two categories: central sleep apnea (CSA), caused by a disruption of the normal communication between the brain and the respiratory muscles, and obstructive sleep apnea, caused by upper airway obstruction. Sleep apnea and heart failure (HF) frequently co-occur, with studies reporting prevalence estimates ranging from 11 to 38% for CSA in patients with HF [12]. These data give rise to speculation about how best to identify and treat the co-occurring disorders. Patients with sleep apnea are more likely to be hospitalized for an exacerbation of HF [13]. Treatment studies have shown positive results only for obstructive apnea and no clear benefit for central apnea, and treatment options for both continue to be explored [14].

Obstructive sleep apnea (OSA) is a worldwide health crisis that accompanies the global obesity epidemic. In the US, OSA affects 17% of adult women and 34% of adult men. Recent trends show that the number of people with OSA is on the rise in the US and the rest of the world [15]. Sleep apnea has been associated with metabolic syndrome features such as insulin resistance, dyslipidemia, hypertension, and central obesity [16]. The significant association between OSA and cardiovascular disease may be attributed to the metabolic syndrome and its negative effects on inflammation, oxidative stress, and endothelial dysfunction [17].

People with heart disease are at higher risk for sleep apnea. Large prospective patient registries have shown that sleep apnea, particularly OSA, is relatively common in outpatient and inpatient cardiology settings [18, 19]. OSA has been associated with several health conditions, including hypertension [20], coronary artery disease [21], congestive heart failure [22], stroke [23], and cardiac arrhythmias, especially atrial fibrillation (AF) [24].

Neurobehavioral disorders are associated with obstructive sleep apnea syndrome (OSAS) [25, 26], heart disease [27, 28], poor quality of life [28, 29], and more physician visits [30], demonstrating the importance of detecting and treating this condition. Therefore, the American Academy of Pediatrics (AAP) recommends screening for OSAS during regular doctor visits. Children with typical symptoms (such as snoring, restless sleep, and daytime hyperactivity) or risk factors (such as craniofacial, neurological, or genetic disorders) should be considered for diagnosis. Overnight polysomnography may confirm the diagnosis.

Over 34 million people worldwide and at least 3 million Americans have AF [31]. Atrial fibrillation is thought to be caused by abnormal atrial tissue substrates and triggers of abnormal impulses, which often originate in the pulmonary venous ostia [32]. However, the mechanisms by which AF arises are not fully understood. Because of its immediate effects on intrathoracic pressure and autonomic tone, and its ability to drive long-term changes in the underlying atrial tissue substrate, OSA may favor the development of AF [33]. OSA of moderate or greater severity is more common than expected in people with AF [34, 35], and AF is correspondingly prevalent in patients with moderate or severe OSA [36]. Machine learning can predict the likelihood of sleep apnea based on several factors, including age, gender, BMI, and other medical conditions.

This research work makes the following significant contributions and offers an updated overview of the topic:

  1. To the best of our knowledge, this is the first study to show that an XGBoost_BiLSTM model can successfully predict sleep apnea from EHRs.

  2. To avoid the problem of model overfitting, the XGBoost model is deployed for the first time to select significant features from the dataset for the prediction of sleep apnea.

  3. The XGBoost_BiLSTM approach has lower time complexity because it uses only six features.

  4. The newly developed XGBoost_BiLSTM model also identifies the sleep apnea risk factors, which ultimately helps lower the likelihood of developing sleep apnea.

  5. According to the experimental findings, the proposed XGBoost_BiLSTM model outperforms other state-of-the-art ML models and the conventional LSTM in terms of accuracy.

This paper discusses apnea detection using machine learning with cross-domain features. Section 2 explains the concept of deep learning and the techniques used in this work. Previous ML-based work on sleep apnea is described in Section 3 (literature review). Section 4 provides the details of the materials and methods of this study. Section 5 describes the validation and evaluation scheme, Section 6 presents the results of the proposed work, and Section 7 discusses them. Section 8 concludes the study with an overview of future research.

2 Deep Learning

Deep Learning (DL) is a subfield of artificial intelligence (AI) that simulates how the human brain processes data and creates patterns to enable rational decision-making. It is a branch of machine learning (ML) that provides more sophisticated tools for building models and applies layered artificial neural networks (ANNs) to run ML methods. Deep learning structures, known as deep neural learning (DNL), are built from multiple interconnected layers. DL can learn from incoming data and transform it into various degrees of abstraction [39, 40]. Examples of DL architectures include recurrent neural networks (RNNs), deep neural networks (DNNs), convolutional neural networks (CNNs), and deep belief networks (DBNs). These methods provide admirable results, comparable or superior to human reasoning, making them useful for various problems in numerous fields of study, including intrusion detection.

2.1 Recurrent Neural Networks

Due to their recurrent (cyclic) connection pattern, recurrent neural networks (RNNs), a subset of Deep Learning (DL), are well suited to processing sequential inputs. This class of neural networks maintains hidden states while using past outputs as inputs [41, 42]. RNNs can process inputs of arbitrary length and keep the size of the model constant as the input grows. Unlike traditional feed-forward networks, RNNs can remember what they have learned and base their judgments on that information [43]. In other words, RNNs can recall information in addition to what they learned during training while producing output. RNNs can handle a variety of research problems but also suffer from issues such as vanishing gradients [43, 44]. Because of this flaw, they are unable to acquire long-term dependencies. Hochreiter and Schmidhuber [45] introduced long short-term memory (LSTM) to solve this problem.

2.2 Long Short-Term Memory (LSTM)

The LSTM is a recurrent neural network that uses a gating mechanism to learn long-term dependencies, fixing the vanishing gradient problem that occurs in traditional RNN training. To retain information across longer time steps, LSTM models use several switching gates [46]. The LSTM design has a memory of cells that accept input from the current input and the previous state. These cells decide what to keep in and delete from memory before merging the previous state with the current input to produce the next state. In this way, they can record long-term dependencies [47]. Due to their advantages over traditional RNNs, LSTMs have attracted much interest recently. Network security researchers are addressing the most pressing security challenges, such as intrusion detection, with LSTMs [48, 49].

3 Literature Review

This section reviews the previous studies that presented ML methods to classify and detect sleep apnea [50]. Sleep apnea studies have investigated different ML strategies with diverse types of input data [51]. The aim of the systematic review and meta-analysis in [52] was to investigate the association between obstructive sleep apnea (OSA) and erectile dysfunction (ED). Masa et al. performed a prespecified secondary analysis of the largest multicenter randomized controlled trial of OHS (Pickwick Project, n = 221 patients with OHS and coexisting severe obstructive sleep apnea) to compare the efficacy of 3 years of NIV and CPAP on structural and functional echocardiographic changes [53]. More importantly, [54] proposes a classifier combination that further improves classification performance by exploiting the complementary information provided by each classifier. The study in [55] addresses automatic screening for obstructive sleep apnea using a single-lead electrocardiogram (ECG); the proposed algorithm detects OSA events from the single-lead ECG.

Furthermore, the goal of [56] is to determine sleep and wakefulness with a practical and applicable method. In [57], the authors investigated an expert system for automatically detecting obstructive sleep apnea from a single-lead ECG using random undersampling boosting, addressing the problem of automatic sleep apnea detection from single-lead ECG signals.

Fig. 1 Description of dataset samples

In addition, [58] shows that machine learning can automate obstructive sleep apnea (OSA) detection. With tenfold cross-validation, the detection approach in [59] achieves 88.3% accuracy for four-group classification and 92.5% for binary classification. The goal of [60] is to analyze the research published over the past decade and answer research questions such as how different deep networks can be implemented. In [61], an effective, efficient, and sustainable system for automatic sleep apnea detection using pulse oximetry (SpO2) signals, which indicate the percentage of oxygen in the blood, is presented.

In addition, the study in [62] identified obstructive sleep apnea based on sleep architecture: the patient's sleep stages and their transition relationships are used as features in a proposed machine learning-based OSA detection method. The proposed method could serve as a low-cost and reliable wearable device for monitoring sleep apnea at home and in the community [63].

Furthermore, in [64], the authors propose an efficient method to discriminate between patients with obstructive sleep apnea (OSA) and normal control subjects using EEG signals and machine learning algorithms. The delta, theta, alpha, beta, and gamma subbands of the EEG signals were separated. Energy and variance were extracted as descriptive features from each frequency band. Four machine learning algorithms were used to detect OSA: Support Vector Machines (SVM), artificial neural networks (ANN), linear discriminant analysis (LDA), and Naive Bayes (NB). The results showed that SVM achieved the best classification accuracy of 97.14% compared to the other classifiers.

In addition, the paper [37] discusses using supervised machine learning methods to predict obstructive sleep apnea (OSA). The authors used a noninvasive feature dataset of 231 records and applied common machine learning algorithms to develop the prediction models. After reviewing the dataset for missing values, these were imputed with the average and the most frequent records. Standard machine learning algorithms were used for modeling, and the overall performance of the models was evaluated using 10-fold cross-validation. The results showed that the Naive Bayes classifier and logistic regression achieved the best predictive models, with overall AUCs of 0.768 and 0.761, respectively. The SVM, with a sensitivity of 93.42%, and Naive Bayes, with a specificity of 59.49%, may be suitable for screening high-risk individuals with OSA.

Finally, [38] discusses obstructive sleep apnea syndrome, an airway sleep disorder characterized by intermittent nocturnal episodes of partial or complete upper airway obstruction. The article highlights the high prevalence of this disorder in the elderly population, with an estimated frequency between 20% and 60% in those over 65 years of age. It emphasizes the importance of diagnosing and treating this disorder in older patients as the average age of the world’s population increases. The paper does not include specific results of studies or experiments.

4 Material and Methods

4.1 Dataset

The Swedish National Study on Aging and Care (SNAC) served as the data source for this study. SNAC is a long-term consortium that collects multimodal data from Sweden’s aging population to develop reliable, comparable, and durable datasets for aging research [65]. SNAC was established as a multipurpose program to study the quality of health care in the aging population. It includes several databases with information on various topics, such as medical records, social variables, lifestyle factors, metacognitive data, and physical assessments. SNAC collected data on Swedish seniors in Blekinge, Skåne, Nordanstig, and Kungsholmen. Figure 1 provides an overview of the positive and negative samples in the collected dataset, which consists of 75 features with a total sample size of 10,765. Table 1 displays the nature of the features as feature groups, feature names, and the total number of features in each feature group.

Table 1 Feature Description

Based on previously published research, variables were selected from the SNAC database (Blekinge), and variables from eight categories were considered for this study: demographics, social factors, lifestyle, medical history, physical examination, biochemical tests, psychological examination, and assessment with various health instruments [66, 67]. We obtained 10,765 data samples, of which 3461 were from SNAC-Kungsholmen and 7304 from SNAC-Blekinge. The collected dataset consists of 6816 females and 3949 males; only 229 of the 3949 males and 287 of the 6816 females suffer from sleep apnea. Table 2 shows the statistical information for the sample population.

Table 2 The summary of samples in collected dataset

4.2 Proposed Model

In this study, we present an ML model that predicts sleep apnea from EHRs. The proposed model is based on two components that are hybridized into a single system. The first component employs the XGBoost technique to select the most significant variables from the dataset: XGBoost ranks the variables, and the highly ranked variables are fed into the second component for the prediction of sleep apnea. In the second component, we employed conventional LSTM and BiLSTM models and assessed their performance on the highly ranked features from XGBoost. The experimental results clearly favor XGBoost_BiLSTM over XGBoost_LSTM; hence, we named the newly designed model XGBoost_BiLSTM. Figure 2 presents an overview of the developed XGBoost_BiLSTM model for the prediction of sleep apnea. Data preprocessing is the first step of the proposed model because BiLSTM deals only with numeric values; therefore, all non-numeric features in the dataset are converted into numeric form. After the non-numeric features are converted to numeric representations, feature scaling is the next step. Feature scaling guarantees that the dataset is normalized. Because the values of several features in the dataset have an uneven distribution, we use Min-Max scaling to scale each feature’s values between 0 and 1, which ensures that our classifier does not produce biased results.

The Min-Max feature scaling equation is as follows:

$$\begin{aligned} { S^{'}= \frac{O-O_{min}}{O_{max} - O_{min}} } \end{aligned}$$
(1)

where \(S^{'}\) denotes the new scaled value and O represents the original value.
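As an illustration, the sketch below applies both preprocessing steps, numeric encoding followed by Min-Max scaling as in Eq. (1), to a toy data frame; the column names and values are hypothetical stand-ins, not the SNAC schema:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy EHR frame; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [67, 72, 81, 75],
    "smoker": ["yes", "no", "no", "yes"],  # non-numeric feature
    "bmi": [31.2, 24.8, 27.5, 29.9],
})

# Step 1: convert non-numeric features into numeric form.
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})

# Step 2: Min-Max scaling, S' = (O - O_min) / (O_max - O_min), cf. Eq. (1).
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(df)
print(scaled)  # every feature now lies in [0, 1]
```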

Fig. 2 Overview of Proposed Model

The architecture of XGBoost for feature ranking, along with the intuition behind a conventional LSTM architecture and the details of a bidirectional LSTM (BiLSTM) architecture, is given in the sections below.

4.3 Extreme Gradient Boosting (XGBoost) for Feature Selection

One of the variants of the gradient boosting engine is XGBoost, which is considered one of the best supervised learning algorithms available. The fast out-of-core execution speed of XGBoost makes it a favorite among data scientists. In addition to regression and classification problems, XGBoost can be used to rank the features of a dataset. XGBoost is an ensemble learning algorithm that employs a cacheable block structure for tree learning and regularized learning. Let Z denote the loss function, \(\tau \) the \(t^{th}\) tree, and \(\eta (\tau )\) the regularization term. The second-order Taylor expansion of Z at the \(t^{th}\) iteration is:

$$\begin{aligned} { Z^{t} \simeq \sum _{j = 1 }^{n} \left[ z \left( x_{j}, x_{j}^{(t-1)} \right) + g_{j} \tau (y_{j}) + \frac{1}{2} \beta _{j} \tau ^{2}(y_{j}) \right] + \eta (\tau ) } \end{aligned}$$
(2)

where \(g_{j}\), \(\beta _{j}\) stand for \(1^{st}\) and \(2^{nd}\) order gradients. Gain is utilized to select the ideal split node throughout XGBoost training.

$$\begin{aligned} { Gain = \frac{1}{2}\left[ \frac{\left( \sum _{j \in z_{Z}} g_{j}\right) ^{2}}{\sum _{j \in z_{Z}} \beta _{j} + \alpha } + \frac{\left( \sum _{j \in z_{\mathbb {R}}} g_{j}\right) ^{2}}{\sum _{j \in z_{\mathbb {R}}} \beta _{j} + \alpha } - \frac{\left( \sum _{j \in z} g_{j}\right) ^{2}}{\sum _{j \in z} \beta _{j} + \alpha } \right] - \psi } \end{aligned}$$
(3)

where \(z_{Z}\) and \(z_{\mathbb {R}}\) represent the left and right nodes, respectively, after segmentation, with \(z= z_{Z} \cup z_{\mathbb {R}}\), and \(\alpha \), \(\psi \) are penalty parameters. The average gain is used to determine the final importance value of a feature, reflecting the gain of each tree split; it is computed by dividing the total cumulative gain by the total cumulative number of splits for each feature. The more significant and useful a feature is, the higher its value on the XGBoost importance scale, and the top features are then selected in descending order of importance. In bioinformatics, for example, XGBoost has been used in this way for feature selection to describe protein-protein interactions (PPIs) [68]. In our setting, the loss function is binary:logistic, there are 500 boosting trees, the maximum depth is 15, and all other parameters are set to their default values.
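A minimal sketch of this ranking step might look as follows. It uses the hyperparameters stated above, with a synthetic stand-in for the preprocessed feature matrix, since the SNAC data cannot be reproduced here:

```python
import numpy as np
from xgboost import XGBClassifier

# Synthetic stand-in for the preprocessed EHR matrix (75 features).
rng = np.random.default_rng(0)
X = rng.random((500, 75))
y = rng.integers(0, 2, 500)

model = XGBClassifier(
    objective="binary:logistic",  # loss function stated in the text
    n_estimators=500,             # 500 boosting trees
    max_depth=15,                 # maximum depth of 15
    importance_type="gain",       # rank features by average split gain
)
model.fit(X, y)

# Rank all features in descending order of gain-based importance.
ranking = np.argsort(model.feature_importances_)[::-1]
print(ranking[:6])  # indices of the six highest-ranked features
```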

4.4 Conventional LSTM

A typical LSTM has the same control flow as a typical RNN, analyzing data and recording information as it propagates. The variations are a consequence of the LSTM’s cellular operations, which allow the LSTM to ignore or store information. The different gates and the cell state form the core of an LSTM. The cell state is a channel for transmitting relevant data throughout the processing and can be considered the memory of the network. Small neural networks acting as gates control the information that is applied to the cell state; during training, the gates learn which information to store and which to forget. Three different gates control the flow of information within an LSTM cell: the input gate, the output gate, and the forget gate. The input gate determines what information is added from the current state. The output gate determines the next hidden state. The forget gate determines what must be kept from the previous state. Figure 3 represents a typical LSTM architecture. The following equations provide a mathematical description of the connection between the inputs and outputs at times \(\tau \) and \(\tau -1\).

Fig. 3 Memory cell of LSTM

Fig. 4 Overview of BiLSTM architecture

$$\begin{aligned}{} & {} { \alpha _{\tau } = \varphi \left[ \left( \omega _{\rho \alpha }* \rho _{\tau } \right) + \left( \omega _{\upsilon \alpha }* \upsilon _{\tau -1 } \right) + \left( \omega _{\mu \alpha }* \mu _{\tau -1 } \right) +\kappa _{\alpha } \right] } \end{aligned}$$
(4)
$$\begin{aligned}{} & {} { \beta _{\tau } = \varphi \left[ \left( \omega _{\rho \beta }* \rho _{\tau } \right) + \left( \omega _{\upsilon \beta }* \upsilon _{\tau -1 } \right) + \left( \omega _{\mu \beta }* \mu _{\tau -1 } \right) +\kappa _{\beta } \right] } \end{aligned}$$
(5)
$$\begin{aligned}{} & {} { \gamma _{\tau } = (\beta _{\tau } * \gamma _{\tau - 1}) + \alpha _{\tau } tanh\left[ \left( \omega _{\upsilon \gamma }* \upsilon _{\tau -1 } \right) + \left( \omega _{\mu \gamma }* \mu _{\tau -1 } \right) +\kappa _{\gamma } \right] }\nonumber \\ \end{aligned}$$
(6)
$$\begin{aligned}{} & {} { \phi _{\tau } = \varphi \left[ \left( \omega _{\rho \phi }* \rho _{\tau } \right) + \left( \omega _{\upsilon \phi }* \upsilon _{\tau -1 } \right) + \left( \omega _{\mu \phi }* \mu _{\tau -1 } \right) +\kappa _{\phi } \right] } \end{aligned}$$
(7)
$$\begin{aligned}{} & {} { \lambda _{\tau } = \phi _{\tau } tanh(\gamma _{\tau }) } \end{aligned}$$
(8)

where \(\alpha _{\tau }\) represents the input gate, \(\rho _{\tau }\) denotes the input vector, \(\phi _{\tau }\) is the output gate, \(\beta _{\tau }\) represents the forget gate, and \(\lambda _{\tau }\) denotes the output; \(\upsilon _{\tau -1}\) and \(\mu _{\tau -1}\) are the recurrent states carried over from the previous step. The cell state is given by \(\gamma _{\tau }\), and \(\omega \) and \(\kappa \) are the weight and bias parameters, respectively.
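For concreteness, the following NumPy sketch implements one step of an LSTM cell in the common textbook formulation; the notation is simplified relative to Eqs. (4)-(8), and the input and hidden sizes are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell (gates analogous to Eqs. 4-8)."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # candidate state
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    c_t = f * c_prev + i * g   # cell state update (cf. Eq. 6)
    h_t = o * np.tanh(c_t)     # hidden output (cf. Eq. 8)
    return h_t, c_t

# Tiny usage example with random parameters (input size 3, hidden size 4).
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((4, 3)) for k in "ifgo"}
U = {k: rng.standard_normal((4, 4)) for k in "ifgo"}
b = {k: np.zeros(4) for k in "ifgo"}
h, c = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), W, U, b)
```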

4.5 Bidirectional LSTM (BiLSTM)

The bidirectional LSTM complements the standard LSTM to improve the classification performance of a model. Two LSTMs are trained on the input data: the first is applied to the original input sequence, and the second to a reversed copy. This increases the expressiveness of the network and can lead to faster learning. The concept underlying the BiLSTM is quite simple: the first recurrent layer of the network is duplicated, the input data are passed to the first layer in their original form, and a reversed copy is passed to the duplicated layer. Combined with the LSTM cells themselves, this design avoids the vanishing gradient problem of conventional RNNs.

The BiLSTM is trained with all available past and current input data within a specified time period. The BiLSTM uses a forward and a backward layer to process the input data in two directions (i.e., left-to-right and right-to-left) [69]. The Keras library in Python implements BiLSTMs via a bidirectional layer wrapper around an initial LSTM layer; the user can specify the fusion mode, which determines how the forward and backward outputs are combined before being passed to the subsequent layer (a minimal Keras sketch is given after Eq. (11)). As shown in Fig. 4, the output based on the forward hidden layer \(\overrightarrow{\lambda _{\tau }}\) and the backward hidden layer \(\overleftarrow{\lambda _{\tau }}\) is given as [70]:

$$\begin{aligned}{} & {} { \overrightarrow{\lambda _{\tau }} = L(\omega _{\rho \overrightarrow{l}}\rho _{\tau } + \omega _{\overrightarrow{l}\overrightarrow{l}} \overrightarrow{l}_{\tau -1} + \kappa _{\overrightarrow{l}} ) } \end{aligned}$$
(9)
$$\begin{aligned}{} & {} { \overleftarrow{\lambda _{\tau }} = L(\omega _{\rho \overleftarrow{l}}\rho _{\tau } + \omega _{\overleftarrow{l}\overleftarrow{l}} \overleftarrow{l}_{\tau -1} + \kappa _{\overleftarrow{l}} ) } \end{aligned}$$
(10)
$$\begin{aligned}{} & {} { Z_{\tau } = \omega _{\overrightarrow{l}z} \overrightarrow{l_{\tau }} + \omega _{\overleftarrow{l}z}\overleftarrow{l}_{\tau } + \kappa _{z} } \end{aligned}$$
(11)

where l represents the hidden layer, \(\omega \) denotes the weight matrices (input and hidden weights for both the forward and backward directions), and the bias vectors for the two directions are given by \(\kappa _{\overrightarrow{l}}\) and \(\kappa _{\overleftarrow{l}}\).
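As mentioned above, Keras exposes BiLSTMs through its bidirectional layer wrapper. The sketch below is a minimal, hypothetical configuration for this task: it assumes the six selected features are presented as a length-6 sequence with one value per step, and the hidden size of 64 is an arbitrary choice, not a parameter reported by this study:

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense

model = Sequential([
    Input(shape=(6, 1)),  # six selected features, one value per time step
    # merge_mode is the "fusion mode": how the forward and backward
    # outputs are combined ("concat", "sum", "mul", or "ave").
    Bidirectional(LSTM(64), merge_mode="concat"),
    Dense(1, activation="sigmoid"),  # binary sleep-apnea output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```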

5 Validation and Evaluation

To determine the efficacy of ML-based diagnostic systems, the holdout validation approach has often been used as a standard in the literature [71, 72]. However, the holdout validation scheme is inappropriate when the dataset contains imbalanced classes. Since ML models favor the majority class, we used a stratified k-fold cross-validation scheme to avoid biases caused by the unbalanced classes in the collected dataset [73]. The stratified k-fold validation scheme extends the cross-validation technique by maintaining in each of the k folds the same class ratio as in the original dataset. To test the efficacy of the proposed model, we used stratified k-fold validation with k = 5. Specificity, sensitivity, and accuracy are the evaluation measures used to assess the performance of the proposed model. In addition, the Matthews correlation coefficient (MCC) is computed, and the area under the curve (AUC) is derived from the receiver operating characteristic (ROC) curve. These evaluation metrics are mathematically specified as:

$$\begin{aligned}{} & {} { Sensitivity = \frac{TP}{TP + FN} } \end{aligned}$$
(12)
$$\begin{aligned}{} & {} { Specificity = \frac{TN}{TN + FP} } \end{aligned}$$
(13)
$$\begin{aligned}{} & {} { Accuracy = \frac{TP + TN}{TP + TN + FP +FN} } \end{aligned}$$
(14)
Table 3 Performance of LSTM and BiLSTM using all features
Fig. 5 Performance comparison of BiLSTM with conventional LSTM model using all 75 features

where TP stands for the number of true positives, FP for the number of false positives, TN for the number of true negatives, and FN for the number of false negatives.

$$\begin{aligned} { F\_score = \frac{2TP}{2TP + FP+ FN} } \end{aligned}$$
(15)
$$\begin{aligned} { MCC = \frac{ TP \times TN - FP \times FN }{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} } \end{aligned}$$
(16)

Furthermore, the F-measure is used for the statistical analysis of the binary classification problem. The F-measure ranges from 0 to 1, where 1 represents perfect predictions and 0 the worst. The overall quality of a test is evaluated using the MCC, which ranges from –1 to 1, where 1 represents a perfect prediction and –1 the worst.
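Putting the validation scheme and the metrics together, a sketch of the stratified 5-fold evaluation might look as follows; the data and the placeholder classifier are synthetic stand-ins, and any model exposing fit/predict could be plugged in:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef
from sklearn.linear_model import LogisticRegression

def evaluate_stratified(make_model, X, y, k=5, seed=0):
    """Stratified k-fold evaluation reporting the metrics of Eqs. (12)-(16).
    make_model() must return a fresh classifier with fit/predict."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    folds = []
    for tr, te in skf.split(X, y):
        clf = make_model().fit(X[tr], y[tr])
        y_hat = clf.predict(X[te])
        tn, fp, fn, tp = confusion_matrix(y[te], y_hat).ravel()
        folds.append({
            "sensitivity": tp / (tp + fn),                # Eq. (12)
            "specificity": tn / (tn + fp),                # Eq. (13)
            "accuracy": (tp + tn) / (tp + tn + fp + fn),  # Eq. (14)
            "f1": f1_score(y[te], y_hat),                 # Eq. (15)
            "mcc": matthews_corrcoef(y[te], y_hat),       # Eq. (16)
        })
    return {m: float(np.mean([f[m] for f in folds])) for m in folds[0]}

# Demo on synthetic data with a placeholder classifier.
rng = np.random.default_rng(0)
X, y = rng.random((500, 75)), rng.integers(0, 2, 500)
print(evaluate_stratified(lambda: LogisticRegression(max_iter=1000), X, y))
```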

6 Results

6.1 Comparison of LSTM and BiLSTM Using All Features

In the first phase of experiments, we examined the performance of the conventional LSTM and BiLSTM models using all 75 features of the dataset. The performance of both models is evaluated using the stratified k-fold cross-validation method, with k set to 5, and validated using various evaluation metrics, i.e., accuracy, sensitivity, specificity, F1 score, and MCC, which are given in Table 3. BiLSTM achieves the highest test accuracy of 95.12%, compared with 94.56% for the conventional LSTM.

We also evaluate the performance of BiLSTM and conventional LSTM models while using all features from the dataset based on the ROC curve. The graph with a larger area under the curve (AUC) is considered more accurate. From Fig. 5, it can be seen that the BiLSTM has a larger AUC in comparison to the conventional LSTM. Hence, BiLSTM is more efficient than conventional LSTM.
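A ROC comparison of this kind can be produced with a few lines of scikit-learn and matplotlib; the labels and scores below are random stand-ins for the actual model outputs:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 300)  # stand-in test labels
scores = {"BiLSTM": rng.random(300), "LSTM": rng.random(300)}  # stand-in probabilities

for name, s in scores.items():
    fpr, tpr, _ = roc_curve(y_true, s)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
plt.plot([0, 1], [0, 1], "k--")  # chance diagonal
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```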

6.2 Bidirectional LSTM Using Xgboost Feature Selection Module

In this experiment, we hybridized the XGBoost model with the BiLSTM model, using the XGBoost model to rank the features in the dataset. All 75 features in the dataset are ranked according to their importance, as shown in Fig. 6.

Fig. 6 Features ranking based on their importance using XGBoost

Table 4 Performance of BiLSTM Model based on Xgboost feature ranking
Fig. 7 Performance comparison of BiLSTM with conventional LSTM model using top 6 ranked features

After ranking the features, we set a threshold to select the best features, which are then fed into the BiLSTM model for classification. The proposed model (XGBoost_BiLSTM) was evaluated using metrics such as training accuracy, test accuracy, sensitivity, specificity, and MCC. The results of the proposed model are given in Table 4, along with the number of features selected using the XGBoost module. Table 4 shows that the proposed model (XGBoost_BiLSTM) achieves the highest test accuracy of 97.00% while using the six best features of the dataset; in contrast, the conventional LSTM achieved a lower test accuracy using the same six features.
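Continuing the earlier XGBoost sketch, a hypothetical way to apply such a threshold and hand the surviving features to the BiLSTM is shown below; the threshold value itself is an assumption, since the paper reports only that six features were retained:

```python
import numpy as np

# 'model' is the fitted XGBoost ranker from Sect. 4.3 and X the scaled
# (n_samples, 75) feature matrix from Sect. 4.2.
importances = model.feature_importances_
threshold = np.sort(importances)[-6]  # value of the 6th-largest gain
selected = np.where(importances >= threshold)[0]  # top six (ties would admit more)
X_top = X[:, selected]  # shape: (n_samples, 6)

# Reshape for the BiLSTM of Sect. 4.5 (length-6 sequence, one value per step).
X_seq = X_top.reshape(-1, X_top.shape[1], 1)
```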

In addition, we used a ROC curve to validate the performance of the proposed model (XGBoost_BiLSTM); the ML model with the larger area under the curve (AUC) is considered more efficient. We therefore tested the performance of the proposed model based on the ROC curve using the six best features from the dataset, and the conventional LSTM model was tested in the same way. From Fig. 7, it can be seen that the proposed XGBoost_BiLSTM model has a larger area under the curve than the XGBoost_Conventional LSTM model.

6.3 Performance of ML Models Using All Features

In this experiment, all 75 features of the dataset were used to evaluate the performance of several modern ML models. Their performance was evaluated using the following metrics: training accuracy (Acc.Train), test accuracy (Acc.Test), sensitivity (Sens.), specificity (Spec.), F1 score, and Matthews correlation coefficient (MCC), based on a holdout validation scheme with 70% of the data used for training and 30% for testing the ML models. From Table 5, it can be seen that the RF model achieves the highest test accuracy of 83.40% compared to the other ML models. We also used ROC curves to test the effectiveness of the ML models; a model with a larger area under the curve (AUC) is considered more accurate and reliable. From Fig. 8, it can be seen that the performance of the RF model, with an AUC of 83.40%, is much better than that of the other ML models.

Table 5 ML models performance with all features used
Fig. 8 ROC curve analysis of ML models

In this study, we developed a hybrid machine learning model combining XGBoost and BiLSTM to detect risk factors and diagnose sleep apnea. The proposed model consists of two modules: the first selects the most important features from the dataset, and the second performs the sleep apnea classification. To evaluate the performance of our model, we used a k-fold cross-validation scheme (k = 5) and compared it with eight other state-of-the-art machine learning models, including a conventional LSTM model. Our proposed model achieved 97% accuracy using only the top six features in the dataset. These six features, namely type 2 diabetes, external injuries, mental and behavioral disorders, psychological stress and emotions, a comprehensive psychopathology rating scale, and respiratory system diseases, are the main risk factors for sleep apnea in older adults. Overall, our proposed model outperformed the conventional LSTM and other modern machine learning models, demonstrating its potential for early detection and diagnosis of sleep apnea in older adults.

7 Discussion

Using machine learning and deep learning techniques, we identified the six major factors contributing to sleep apnea in older adults. The proposed sleep apnea prediction model consists of two modules, the first of which ranks features by importance; for this, we used a machine learning model (XGBoost). After ranking the features, we applied deep learning techniques, namely LSTM and BiLSTM, and tested their performance using various evaluation metrics. According to the experimental results, BiLSTM outperformed the LSTM model and several other state-of-the-art machine learning models, achieving the highest accuracy of 97% while using the same six main features. These six features (feature codes 38, 40, 44, 69, 72, and 73) are critical for the development of sleep apnea in older adults. Table 6 lists the feature code (F_Code) and feature label (F_Label) selected by the proposed model (XGBoost_BiLSTM) and describes the six main risk factors for sleep apnea in older adults. Avoiding these risk factors can improve the health of older adults and reduce the risk of sleep apnea.

Table 6 Description of top six risk factors of sleep apnea

It is also important to mention the limitations of this study so that future researchers can benefit from them. The proposed study used EHRs for experimental purposes, and the sample size of the dataset was modest (10,765), whereas deep learning methods work best on larger datasets; in further research, datasets with a larger sample size should therefore be collected. Because sleep apnea is relatively rare, the number of positive cases is far smaller than the number of healthy subjects, and machine learning models tend to be biased toward the majority class; a more balanced dataset should therefore be collected in the future. To avoid the problem of bias in the proposed model, we used a cross-validation scheme with several evaluation metrics to properly validate the performance of the developed algorithm. Moreover, instead of a single modality, a multimodal dataset should be used for the prediction of sleep apnea. In clinical settings, a huge amount of medical data is generated these days, and this data can be utilized to improve the health conditions of older adults. In this scenario, the proposed model uses the EHRs of patients to predict sleep apnea and also identifies each patient's risk factors; by identifying the risk factors, medical practitioners can advise patients to alter their lifestyle so that the development of sleep apnea can be avoided.

8 Conclusion

In this study, we developed a hybrid ML model (XGBoost_BiLSTM) for the diagnosis of sleep apnea. The proposed model consists of two modules: the first ranks the most important features from the feature space, and the second performs the classification of sleep apnea. To evaluate the performance of the XGBoost_BiLSTM model, we used a k-fold cross-validation scheme (k = 5) and compared it with other state-of-the-art ML models, including a conventional LSTM model. The newly constructed XGBoost_BiLSTM model achieved the highest accuracy of 97% while using only the top six features from the feature space. These six features, namely type 2 diabetes, external injuries, mental and behavioral disorders, psychological stress and emotions, a comprehensive psychopathology rating scale, and respiratory system diseases, are the main risk factors for sleep apnea in older adults. Overall, the proposed XGBoost_BiLSTM model outperformed the conventional LSTM and other ML models, demonstrating its potential for early detection and diagnosis of sleep apnea. The dataset used for the experiments was based on EHRs; in further research, a multimodal dataset could be employed. Furthermore, since deep learning algorithms perform well when the number of samples is large, a dataset with a larger sample size should be collected.

Supplementary information: A detailed list of the variables used in this study is provided with the uploaded material.