Introduction

Obstructive sleep apnea (OSA) is a significant risk factor for several major health conditions, such as cardiovascular1,2,3,4, neurovascular5,6, and metabolic diseases7,8. OSA is diagnosed on the basis of certain crucial parameters, including the apnea–hypopnea index (AHI) measured by polysomnography (PSG). PSG is the standard diagnostic sleep test for OSA, and its results hold significant clinical implications for various major diseases. For example, severe OSA with AHI > 30 is known to be correlated with the development of strokes and incident hypertension5,9. Although large-scale prospective cohort studies can be used to empirically prove such important clinical observations, they suffer from the limitations of long follow-up periods and high costs. On the other hand, retrospective studies can only establish statistical associations between the risks of major conditions and PSG results rather than a definitive causal relationship. Furthermore, PSG is a whole-night test, and the capacity for PSG tests per sleep center is thus limited. However, multi-center collaborative studies can be used to conduct more PSGs, and well-defined large-volume PSG databases have the potential to corroborate the validity of conjectured correlations. The analysis of a wide range of electronic health record (EHR) data, including medical conditions, drug exposures, procedures, and measurements, in conjunction with PSG data, and their rapid verification across multiple institutions may provide robust scientific evidence through enhanced analytic power.

However, as the primary goal of EHR is medical application, rather than research, the reuse of EHR data for academic purposes necessitates the mapping of clinical observations to standard vocabularies10. To this end, Observational Health Data Sciences and Informatics (OHDSI), an international collaborative initiative, has created and applied an open-source standard data format and analytic solutions to diverse health and medical databases across the world11. The Observational Medical Outcome Partnership's (OMOP) common data model (CDM), which is utilized by OHDSI as a standard data format, serves as a guide for the standardization of heterogeneous representations of healthcare data obtained from disparate sources. Conversion of health and medical databases into the CDM format is expected to enable interdisciplinary collaborative large-scale analyses. Such large-scale analyses using open-source analytic tools based on standardized datasets are, in turn, expected to improve the speed and efficiency of population-level estimation and patient-level prediction, thereby enhancing the reliability of clinical decision-making11,12.

To the best of our knowledge, measurements obtained via PSG are yet to be transformed into the CDM format. Linking the diverse data obtained from PSG with the extensive EHR database in a structured CDM format is expected to facilitate multi-center studies and strengthen general analytic power. In this study, we aimed to convert EHR PSG data into the standardized OMOP CDM data format and conduct a pilot feasibility test. Through a pilot feasibility study, we attempted to confirm the possibility of developing a predictive model using existing CDM data and additional PSG data, and to verify the usefulness of the integrated data.

Methods

Study population for CDM conversion

This study included patients who visited the Sleep Center at Seoul National University Bundang Hospital (SNUBH), located in the metropolitan area of Seoul in South Korea, and had undergone PSG between February 2004 and June 2019.

Data source

OMOP CDM data obtained from SNUBH were used in this study. In particular, the data comprised de-identified EHR data based on OMOP CDM version 5.3.1 and accumulated over a period of 16 years, from the opening of SNUBH with the full EHR system in May 2003 until June 2019. The EHR data of more than 2 million patients, including patient demographics, diagnoses, chief complaints, drug exposures, test orders/results, vital signs, surgeries, family histories, and past medical histories, were converted to CDM.

This study was performed in accordance with the relevant guidelines and regulations of the SNUBH Institutional Review Board (IRB) and was approved by the SNUBH IRB. As it is an observational study and the data source was de-identified, this study was approved based on waivers of informed consent or exemptions by the SNUBH IRB (IRB No: X-2002–592-904).

Polysomnographic parameters

We considered all PSGs performed at the Sleep Center of SNUBH as target data to be converted into OMOP CDM, including full-night PSGs, split-night PSGs, PSGs for continuous positive airway pressure (CPAP) titration, and multiple sleep latency tests (MSLTs). In the case of split-night PSGs, the values of the parameters represented only the diagnostic portions in this study. No home sleep apnea tests were included because they are not popular in South Korea. The PSG parameters to be transformed into OMOP CDM included information related to sleep architecture, respiratory activity, positions during sleep, blood oxygen saturation, and limb movement.

We conducted PSGs using an Embla N 7000 (Embla, Reykjavik, Iceland) recording system equipped with standard electrodes and sensors, in the presence of a sleep technician. The entire PSG montage consisted of electroencephalography, electrooculography, electrocardiography, submental and limb electromyography, chest and abdominal plethysmography, nasal pressure manometry, oronasal thermistor, pulse oximetry, and a snoring sensor. Apnea was defined as a pause in the respiratory airflow lasting at least 10 s, and hypopnea was defined as a reduction in the airflow by 50% or more lasting at least 10 s, or the accompaniment of airflow reduction by arousal or an oxygen desaturation by 4% or more13. The PSG data were reviewed and scored by sleep experts using the Embla RemLogic PSG Software (Embla, ON, Canada). The study report from the Embla RemLogic PSG Software has the following parameter (variable) categories: patient information; sleep summary; summary graph; sleep information; arousal statistics; autonomic arousal (plethysmogram) statistics; apnea/hypopnea statistics; apnea-desaturation relation; Cheyne Stokes breathing statistics; breath statistics; snoring statistics; flattening statistics; respiratory mechanic instability statistics; SpO2 statistics; desaturation statistics; heart rate statistics; cardiac events; bruxism; rapid eye movement sleep behavior disorder information; rhythmic movement disorder information; periodic limb movement statistics; and position statistics. Among them, the sleep experts at our sleep center selected the PSG parameters that are commonly employed in the literature to be made available in the PSG summary report of our EHR. The selected parameters were automatically exported and imported into our EHR in a structured format.

Strategy to convert PSG data into OMOP CDM

We designed and implemented the following extract–transform–load (ETL) process to transform the PSG data into the OMOP CDM format.

Despite being reported in a structured form, the EHR PSG results considered in this study had been revised approximately 11 times. Hence, we extracted the data corresponding to each revised form and integrated them within the CDM format via standardization. The procedural information for PSG order itself had already been converted into the CDM format. Thus, in this study, we linked the extracted PSG results and the corresponding existing orders in the CDM to connect the PSG procedures with their corresponding results.

The PSG parameters were manually mapped by sleep domain experts (J.-W. Kim and S.-W. Cho) to standard concepts within the Logical Observation Identifiers Names and Codes (LOINC) or Systematized Nomenclature of Medicine–Clinical Terms (SNOMED CT) vocabularies corresponding to the measurement and observation domains. Non-mapped parameters were added to the concept, concept_ancestor, and concept_relationship tables to be used as new custom standard concepts (please see Supplementary Table S1 for the concept mapping information in the case of PSG and Supplementary Table S2 for the concept definitions). The new custom concepts were assigned concept_id values greater than 2 billion. In the concept_ancestor table, the newly added concepts served as their own ancestors and descendants. In the concept_relationship table, the mapping information between source and standard concepts was added. Additionally, we described the bidirectional relationship between PSG and its parameters in the table using the ‘Panel contains’ and ‘Contained in panel (LOINC)’ relationships.
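The bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration and not the ETL code used in this study: the function name, the vocabulary label "PSG_custom", the example parameter names, and the relationship_id strings are assumptions made for the sketch, and the starting identifier simply follows the rule that custom concepts receive concept_id values above 2 billion.

```python
from dataclasses import dataclass

# Custom (local) concepts are given identifiers above 2 billion.
CUSTOM_CONCEPT_ID_FLOOR = 2_000_000_000


@dataclass(frozen=True)
class Concept:
    concept_id: int
    concept_name: str
    domain_id: str
    vocabulary_id: str


def build_custom_concepts(parameter_names, panel_concept_id, first_concept_id):
    """Create concept, concept_ancestor and concept_relationship rows.

    parameter_names: unmapped PSG parameter names (hypothetical inputs).
    panel_concept_id: concept_id of the PSG order (the "panel").
    first_concept_id: first identifier to hand out; must exceed 2 billion.
    """
    if first_concept_id <= CUSTOM_CONCEPT_ID_FLOOR:
        raise ValueError("custom concept_id values must exceed 2 billion")

    concepts, ancestors, relationships = [], [], []
    for offset, name in enumerate(parameter_names):
        concept_id = first_concept_id + offset
        # "PSG_custom" is an illustrative vocabulary label, not a real one.
        concepts.append(Concept(concept_id, name, "Measurement", "PSG_custom"))
        # Each new concept is its own ancestor and descendant.
        ancestors.append({
            "ancestor_concept_id": concept_id,
            "descendant_concept_id": concept_id,
            "min_levels_of_separation": 0,
            "max_levels_of_separation": 0,
        })
        # Bidirectional link between the PSG panel and its parameter.
        relationships.append({
            "concept_id_1": panel_concept_id,
            "concept_id_2": concept_id,
            "relationship_id": "Panel contains",
        })
        relationships.append({
            "concept_id_1": concept_id,
            "concept_id_2": panel_concept_id,
            "relationship_id": "Contained in panel",
        })
    return concepts, ancestors, relationships
```

Keeping the three row sets in one function makes it harder to add a concept without its self-referencing ancestor row or its pair of panel relationships.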

The extracted PSG data were transformed and loaded into measurement and observation tables with standard concepts. Observation data were linked to the corresponding PSG procedures via the observation_event_id and obs_event_field_concept_id fields. In order to link measurements with corresponding procedures, we used the new modifier_of_event_id and modifier_of_field_concept_id fields that have been proposed by the OHDSI Oncology Working Group14. The procedure_occurrence, measurement, and observation tables were linked to the person and visit_occurrence tables based on their foreign keys. The CDM tables associated with the PSG data are depicted in Fig. 1.
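As an illustration of this linkage, the following Python sketch turns one structured PSG result into measurement rows that point back to the PSG procedure. It is a simplified sketch under stated assumptions, not the ETL implementation of this study: the input dictionary layout and the function name are hypothetical, the field concept value is a placeholder to be replaced by the concept that identifies the procedure_occurrence_id field at a given site, and parameters missing from the mapping are given concept_id 0 here purely for illustration.

```python
from datetime import date

# Placeholder only: replace with the concept_id that identifies the
# procedure_occurrence.procedure_occurrence_id field in the local vocabulary.
PROCEDURE_FIELD_CONCEPT_ID = 0


def psg_to_measurement_rows(result, procedure_occurrence_id, concept_map,
                            first_measurement_id):
    """Build measurement rows linked to one PSG procedure.

    result: dict with person_id, visit_occurrence_id, test_date and a
            "values" dict of {source parameter name: numeric value}.
    concept_map: {source parameter name: standard concept_id}.
    """
    rows = []
    for offset, (source_name, value) in enumerate(result["values"].items()):
        rows.append({
            "measurement_id": first_measurement_id + offset,
            "person_id": result["person_id"],
            "visit_occurrence_id": result["visit_occurrence_id"],
            "measurement_date": result["test_date"],
            # Unmapped parameters fall back to 0 in this sketch.
            "measurement_concept_id": concept_map.get(source_name, 0),
            "measurement_source_value": source_name,
            "value_as_number": value,
            # Link from the measurement back to the PSG procedure record.
            "modifier_of_event_id": procedure_occurrence_id,
            "modifier_of_field_concept_id": PROCEDURE_FIELD_CONCEPT_ID,
        })
    return rows
```

With rows shaped like this, all parameters of a given sleep study can be retrieved by filtering the measurement table on modifier_of_event_id.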

Figure 1 Conversion of polysomnography into the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) tables.

After completing the ETL, we assessed the PSG data quality via exploratory data analysis and developed data quality check rules for data cleaning (please see Supplementary Table S3 for the detailed cleaning rules and the number of records filtered by the rules). Finally, the cleaned PSG data integrated into the existing CDM were utilized for a feasibility test.

Pilot feasibility test using open-source OHDSI analytic tools

We conducted a pilot feasibility test using only full-night PSG tests of patients 18 years or older. The feasibility test was designed to develop and validate a model to predict cardio-neuro-metabolic disease within a target population during the period from 1 day to 1095 days after the target cohort start date of the PSG test. A cardio-neuro-metabolic disease was defined as any condition involving International Classification of Disease, Tenth Revision (ICD-10) codes corresponding to the comorbidities listed in Supplementary Table S4. We included any occurrence of the defined ICD-10 codes without constraints on the frequency.

In the population setting for the patient-level prediction, varying minimum lookback periods of 30 days, 90 days, and 180 days were utilized for the prior observation periods of patients from the target population. Subjects without time-at-risk of 1094 days were also removed. Patients who had experienced prior outcomes were also not considered in this study.
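The population rules above can be expressed compactly. The sketch below is an illustration written for this description and is not the PatientLevelPrediction implementation: the function names are hypothetical, an outcome recorded on the cohort start date is treated as a prior outcome (an assumption), and the time-at-risk requirement is approximated by requiring observation through day 1095, which the package may handle differently.

```python
from datetime import date, timedelta

RISK_WINDOW_START_DAYS = 1     # time-at-risk begins 1 day after cohort start
RISK_WINDOW_END_DAYS = 1095    # and ends 1095 days after cohort start


def is_eligible(cohort_start, observation_start, observation_end,
                outcome_dates, min_lookback_days):
    """Apply the lookback, prior-outcome and time-at-risk criteria."""
    has_lookback = (cohort_start - observation_start).days >= min_lookback_days
    # Assumption: an outcome on the cohort start date counts as prior.
    has_prior_outcome = any(d <= cohort_start for d in outcome_dates)
    risk_end = cohort_start + timedelta(days=RISK_WINDOW_END_DAYS)
    has_full_time_at_risk = observation_end >= risk_end
    return has_lookback and not has_prior_outcome and has_full_time_at_risk


def outcome_label(cohort_start, outcome_dates):
    """Return 1 if any outcome falls within days 1 to 1095, else 0."""
    window_start = cohort_start + timedelta(days=RISK_WINDOW_START_DAYS)
    window_end = cohort_start + timedelta(days=RISK_WINDOW_END_DAYS)
    return int(any(window_start <= d <= window_end for d in outcome_dates))
```

Passing min_lookback_days as an argument allows the 30-day, 90-day, and 180-day settings to be compared with the same code.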

Among the preexisting CDM data, we utilized multiple covariates, such as gender, 5-year age group, Anatomical Therapeutic Chemical (ATC) drug group, SNOMED CT condition group, procedure, measurement value, observation, visit concept count, the CHA2DS2-VASc (congestive heart failure, arterial hypertension, age > 75 years, diabetes mellitus, stroke/transient ischemic attack, vascular disease, age 65–74 years, sex category) score, diabetes complications severity index (DCSI), and the Charlson comorbidity score. Two different covariate settings were tested to determine which PSG parameters could be selected during the cardio-neuro-metabolic disease prediction. One setting (PSG-only covariates) used only gender, age group, and PSG parameters, and the other (all covariates) used all CDM covariates, including the PSG parameters described above as covariates. The observation time windows of the covariates for short, medium, and long terms were set as prior 7 days, 30 days, and 180 days before the cohort start date, respectively.

Three different machine learning models, namely Lasso Logistic Regression (Lasso), Gradient Boosting Machine (GBM), and Random Forest (RF), were developed using 75% of the total data for training and 25% for testing. Hyper-parameter tuning was performed using five-fold cross-validation on the training set. PatientLevelPrediction R package15 version 4.0.5 was used for this purpose.
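The general shape of this evaluation design (a single hold-out split followed by five-fold cross-validation on the training portion) can be sketched as follows. This is a generic illustration in plain Python and not the internal procedure of the PatientLevelPrediction package: the function names are hypothetical, the split is a simple random one, and the test fraction is passed as an argument rather than fixed in the code.

```python
import random


def split_train_test(ids, test_fraction, seed=0):
    """Randomly split identifiers into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_splits(train_ids, k=5):
    """Yield (fitting, validation) lists for k-fold cross-validation."""
    # Deal the training identifiers into k nearly equal folds.
    folds = [train_ids[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fitting = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield fitting, validation
```

Each candidate hyper-parameter setting would be fitted on the fitting portion of every fold and scored on the matching validation portion, and the held-out test set would be used only once, for the final model.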

To evaluate the models, model discrimination was assessed using the area under the receiver operating characteristic curve (AUC).
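For reference, the AUC equals the probability that a randomly chosen patient with the outcome receives a higher predicted risk than a randomly chosen patient without it, with ties counted as one half. The short Python sketch below computes it directly from that definition; it is an illustrative implementation written for this text, not the routine used by the analysis package, and the function name is hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve from pairwise comparisons.

    labels: iterable of 0/1 outcome indicators.
    scores: iterable of predicted risks, aligned with labels.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0   # outcome case ranked above non-case
            elif p == n:
                concordant += 0.5   # ties count as one half
    return concordant / (len(positives) * len(negatives))
```

The pairwise form is quadratic in the number of patients, which is acceptable for a few thousand tests; a rank-based formulation gives the same value more efficiently on larger datasets.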

Results

Conversion results of PSG parameters into OMOP CDM concepts

We converted data from a total of 11,392 tests corresponding to 11,797 sleep studies into the OMOP CDM format. These included 7,191 full-night PSGs, 2,725 split-night PSGs, 1,474 CPAP titration PSGs, and 407 MSLTs. Among the PSG test results stored in the EHR, the parameters targeted for conversion into the CDM are presented in Table 1. These included 7 pertaining to body measurements, 7 to sleep summaries, 6 to sleep stages, 16 to respiratory events, 4 to apnea or hypopnea duration, 8 to sleep position, 5 to arousals, 2 to limb movement, 5 to snoring, 8 to oxygen statistics, 1 to continuous positive airway pressure, 2 to questionnaires, 11 to MSLT, 1 to apnea level manometry test, and 3 to Friedman staging. A total of 85 PSG parameter concepts were converted to the measurement domain and one to the observation domain (Waist/hip ratio). Of these 86 PSG codes, 20 (23.3%) were mapped to the standard OHDSI vocabulary, including LOINC and SNOMED CT, whereas the remaining 66 (76.7%) could not be mapped and were added as new custom standard concepts.

Table 1 Polysomnographic parameters included in the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) transformation.

Characteristics of PSG data

The overall characteristics of the total sleep studies that were converted into OMOP CDM are presented in Table 2. Out of an aggregate of 11,392 sleep tests, 8,363 (73.4%) tests were conducted on male patients and 3,029 (26.6%) on female patients. There was an average of 1.2 tests per person. Tests of patients aged 40–49 years, 50–59 years, and 60–69 years accounted for approximately 65% of the total number of tests. The number of sleep studies conducted each year increased progressively. The prevalence of AHI < 5, mild OSA (5 ≤ AHI < 15), moderate OSA (15 ≤ AHI < 30) and severe OSA (30 ≤ AHI) was 28.5%, 23.8%, 19.3% and 28.4%, respectively. The basic statistics of the associated PSG parameters are provided in Supplementary Table S5.
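The AHI cut-offs used for these severity categories can be written as a small function. The sketch below is only an illustration of the thresholds stated above (5, 15, and 30 events per hour); the function name and the returned labels are chosen for this example.

```python
def osa_severity(ahi):
    """Classify OSA severity from the apnea-hypopnea index (events/h)."""
    if ahi < 0:
        raise ValueError("AHI cannot be negative")
    if ahi < 5:
        return "AHI < 5"
    if ahi < 15:
        return "mild"       # 5 <= AHI < 15
    if ahi < 30:
        return "moderate"   # 15 <= AHI < 30
    return "severe"         # 30 <= AHI
```

Applying such a function to the AHI measurement of each test reproduces the four-category breakdown, provided that the same AHI definition is used across sites.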

Table 2 Demographic characteristics of total sleep tests that were converted into OMOP CDM. The sleep tests from February 2004 to June 2019 were extracted, transformed, and loaded into the OMOP CDM.

Performance of the prediction models

Corresponding to the best performance setting of each prediction model, the number of people eligible for inclusion in the target population, the outcome count, and the number of people lost at each inclusion step are illustrated in Fig. 2. The target population comprising 5581 full-night PSG tests of patients 18 years or older was reduced to a population comprising 2555 tests of 2542 patients. The outcome rate of cardio-neuro-metabolic disease was observed to be 11.1%.

Figure 2 The attrition for the model development at the best performance setting of prediction.

In the prediction feasibility test based on CDM data, the RF model with all covariates achieved an AUC of 0.751 (0.693–0.810). The performance results corresponding to each set of configurations are listed in Table 3. All three models (RF, GBM, and Lasso) performed better when all parameters, such as condition, drug, measurement, and comorbidity score, were utilized as CDM data along with PSG, rather than only the PSG parameters.

Table 3 Prediction model performance for test data set. All covariates setting used all OMOP CDM variables including polysomnography parameter concepts, and PSG only covariates used only gender, age group, and polysomnography parameter concepts for developing and training the prediction model.

The top 20 covariates selected from the RF are presented in Table 4. Among them, 11 were PSG parameters, for example, AHI during right lateral (/h), central apnea index (/h), waking oxygen saturation (%), and snoring time (min). The top 20 covariates selected from the other models are included in Supplementary Table S6.

Table 4 Top 20 predictors selected from random forest model. The polysomnography parameters are indicated in bold.

Discussion

To the best of our knowledge, this study represents the first attempt to convert EHR PSG data into OHDSI OMOP CDM, a standard format for health and medical data. Through this study, we successfully converted more than 11,000 PSGs stored in a tertiary hospital EHR into the OMOP CDM version 5.3.1 format. However, we were able to map only approximately 23% of the 86 parameters present within the PSG data to the existing OMOP CDM standard vocabulary, and new custom standard concepts had to be created for the remaining 77% of the parameters. The method used to create the new custom standard concepts can be employed when other sites add unmapped PSG parameters that are not reported in this study.

The most significant advantage of the standardization of EHR data into the CDM format is the speed and efficiency of large-scale analysis afforded to researchers and clinicians using the open-source analysis tools provided by OHDSI10,12. Furthermore, because OMOP CDM has not been applied to PSG parameters to date, CDM studies using PSG and MSLT test results, which are the most important tests in sleep medicine, are yet to be conducted. In this context, conversion of PSG results into the CDM format also enables utilization of OHDSI's open-source analytical solutions in clinical studies involving PSG results. In addition, the OMOP CDM format has already been used to standardize a comprehensive collection of EHR data, including diagnostic information, specimen test results, imaging test information, procedure and intervention information, drug exposures, past medical histories, and family histories. Therefore, the standardization procedure attempted in this study enables researchers to conduct robust and scalable analyses involving PSG results in conjunction with large-scale EHR data that have already been converted to the CDM. Collaborative research across a growing number of sites participating in the standardized CDM network is expected to lead to higher performance in population-level estimation and patient-level prediction models that leverage sleep study parameters.

In this study, the performance of the pilot feasibility test in terms of patient-level prediction for cardio-neuro-metabolic disease exhibited a significant improvement when the entire EHR data along with PSG was used, rather than solely the PSG data. This suggests the feasibility of utilizing all EHR data in the OMOP CDM format via CDM conversion of PSG data.

OSA is a broad-spectrum disease with several different subgroups or phenotypes, and each OSA phenotype is likely to be manifested with different levels of severity, both clinically and objectively16. Previous one-size-fits-all approaches based on the apnea–hypopnea index suffered from insufficient consideration of these diverse phenotypic subtypes of OSA due to the imperfection of the apnea–hypopnea index as a diagnostic metric with respect to OSA-related symptoms and outcomes17. Several studies have demonstrated that each OSA phenotype exhibits different characteristics and varying risks of disease outcomes16,18. The most important data included in these studies were various metrics of PSG, including all the PSG results, which enabled the classification of OSA into various phenotypes via the phenotyping technique. One study that attempted a structured, data-driven approach based on multiple PSG features of approximately 2,000 OSA patients was able to identify seven subgroups (phenotypes). The aforementioned study also revealed that a unique phenotype that may have been missed during conventional OSA severity classification based on a single metric, the apnea–hypopnea index, could account for the risk of cardiovascular outcome more effectively19. In our previous study, we also identified four clusters based on various PSG features; there was a significant difference in disease outcome among the clusters, and such a difference could not be found in the standard classification of OSA based only on AHI severity20. Moreover, these characteristic phenotypes may exhibit different patterns depending on race, country, or individual. Therefore, to improve the ability to predict adverse OSA outcomes for a population or an individual, simply having a large number of PSGs is not sufficient; it is necessary to acquire PSGs across various data sources. It is thus advantageous to use standardized data such as OMOP CDM to increase the reproducibility and statistical significance of the analyses. The conversion of data into the OMOP CDM format enables ATLAS, OHDSI's open-source analytic solution, to generate queries that can set the aforementioned OSA phenotypes as target cohorts and queries that can set OSA complications to be predicted as the outcome cohort. This enables verification of the reproducibility of outcome predictions of OSA phenotyping through analysis of the dataset including PSG with the same queries in multiple sleep centers where PSG-CDM standardization has been completed. In addition to the analysis of large-scale PSG data, the clinical relevance of the OSA phenotypes across various populations by region and race can also be verified.

With the increase in CDM conversion of EHR data across medical institutions, research based on CDM-format datasets is expected to be pursued in various fields. However, unlike the CDM conversion of data such as clinical diagnosis results, laboratory sample test results, and drug exposure data, the CDM conversion of medical data based on patient-generated signals, including PSG, is still insufficient. Therefore, to date, CDM-based research has been actively conducted in fields where conversion to the pre-existing standard vocabulary is feasible. Domains where CDM research is most active include pharmacovigilance21,22,23 and pharmacoepidemiology24. For example, a study assessing anti-seizure drug-related adverse reactions in 1344 target epilepsy cohorts determined that the detection rate of the adverse drug reaction based on CDM-format data was comparable to previously published results obtained using traditional data analysis techniques21. In addition, it is possible to implement various designs of research by constructing a target cohort corresponding to a study entry population and an outcome cohort corresponding to a disease outcome population25,26. Examples include a prognostic model validation study predicting hemorrhagic transformation of acute ischemic stroke within a CDM dataset of more than 600,000 patients via the OHDSI international network25, and a survival analysis study using 115 variables in 346 patients diagnosed with intrahepatic cholangiocarcinoma26.

Despite the significant implications, the present study has certain limitations. First, the rate of correspondence between OHDSI's standard OMOP CDM concepts and PSG parameters was as low as approximately 20%. This can be attributed to the fact that the pre-existing OMOP CDM standard vocabulary does not reflect all of the approximately 80 PSG variables considered in this study. The custom standard vocabulary developed to address this limitation in this study is expected to contribute to future studies that utilize PSG parameters in CDM-based EHR studies. When creating the custom concepts, we made it easy to find all PSG parameters by defining the relationship to the PSG order. For concepts that may have varying definitions, the definition of the concept is provided as metadata. For concepts in which multiple criteria can exist (e.g., %Time of saturation < 60%, %Time of saturation < 70%), a separate concept with its own concept_id was created for each criterion. Since the MEASUREMENT table does not have a modifier attribute, creating individual concepts for them would be the best practice. By doing this, the meaning of new concepts can be clarified. As the basic PSG parameters of the PSG recording systems of the various vendors are similar, we think other institutions will also be able to apply the new concepts proposed in this study. In addition, we look forward to adding the new concepts to OHDSI's standard vocabulary. Second, in South Korea, insurance coverage for CPAP began in July 2018; before then, CPAP had been recorded in a different form in the EHR rather than as an order. Thus, in this study, only CPAP orders after July 2018 were converted to CDM and can be used as predictors for the pilot prediction models. Information on orders for CPAP, which may be an important variable in predicting cardio-neuro-metabolic disease, may therefore be incomplete. However, as the purpose of this study was only to demonstrate the pilot feasibility of the prediction model using CDM including PSG data, predictors should be considered more elaborately when developing a prediction model in the future. Third, different sleep centers represent PSG databases in EHRs in different ways. Many centers store PSG results in EHR as an image file, or simply record OSA severity in a report format. Therefore, significant implementation effort and time is required to extract, transform, and load the PSG results into the CDM format. Furthermore, different levels of digitization of PSG data in different hospitals may cause concerns regarding the different levels of CDM conversion from PSG parameters. However, with the increase in CDM studies including PSG parameters, the electronic representation of PSG data in the EHR system is expected to be facilitated across hospitals. Finally, conversion of data into the CDM format is time-consuming, requiring a substantial amount of resources, in addition to the fundamental requirement of collecting native source data. The need to code subsets of data manually may limit conversion efforts. However, once the native data are converted to the CDM format, EHR systems in the network will be able to use the same queries to identify cohorts. Thus, conversion to CDM is expected to minimize the effort required to develop cohorts and analyze results across multiple sites.

The harmonization across different sites requires collaborative efforts from multidisciplinary experts, including clinical domain experts, terminology experts, and engineers from various sites. When other sites try to map their own PSG data, efforts should be made to use and propose the same vocabulary and the same concept as much as possible by using the mapping result proposed in this study or by participating in the OHDSI community. As the standard terminology for PSG data has not yet been established internationally, if a specific ontology for sleep study can be proposed as OHDSI vocabulary by reviewing previous efforts, such as the Sleep Domain Ontology and the National Sleep Research Resource, it is expected to be helpful in the conversion and expansion of CDM by other sites.

Conclusions

The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is a standard data format and has been applied to various EHR databases. However, its application to PSG data has not been attempted to date. To the best of our knowledge, this study represents the first attempt to transform PSG data into the OMOP CDM format. Well-defined large-volume OMOP CDM databases of PSG data can potentially enable the identification of clinically relevant OSA phenotypes, estimation of disease outcomes at the population level, and prediction of outcomes at the patient level. We expect the CDM mapping and CDM custom vocabulary of the PSG proposed in this study to contribute to the CDM conversion of PSG databases and future studies leveraging such databases.