Introduction

Chronic obstructive pulmonary disease (COPD) is one of the leading causes of death worldwide (7.5–10%)1,2. In South Korea, COPD affects 13.6% and 30.5% of adults aged > 40 and > 65 years, respectively3; these values are higher than in other regions. COPD is a chronic inflammatory airway disease characterized by fixed airflow limitation and chronic respiratory symptoms (e.g., cough, sputum, and progressive dyspnea). Acute exacerbation can occur during the natural course of disease, and increasing disease progression. Acute exacerbation is an acute deterioration of health status, which leads patients to visit clinics earlier than scheduled and may be hospitalized. Exacerbation degrades the quality of life, decreases lung function, increases the risks of future exacerbation and mortality, and imposes socioeconomic burdens in terms of medical expenses and resources4,5,6,7,8. Severe exacerbations requiring hospitalization consume 70% of COPD-related healthcare expenditures. A claims-based analysis of 37,089 COPD patients in the U.S.A. found that the healthcare costs ranged from $2,003 to $43,461 per patient; they were substantially higher in patients who visited emergency departments and were hospitalized, particularly in intensive care units (ICUs)9,10. Although pharmacotherapies and interventional approaches improve disease control and outcomes, many patients continue to experience exacerbations.

If an acute exacerbation could be predicted and preparations could be made in advance, the individual and social burdens, as well as adverse outcomes, would be reduced. Several prediction models have been proposed for COPD exacerbation. In many cases, there models have not reached a sufficient level of high-quality statistical approach11. Additionally, there has been interest in single biomarkers for exacerbation prediction, but the significance of excluding clinical parameters has not been established12. Exacerbations are known to be triggered by a variety of factors including clinical and environmental variables such as respiratory viruses, pollution, and comorbidities13,14,15,16. In our previous study, we integrated individual medical information, weather and air pollution data, and the detection rates of respiratory viruses that affect acute COPD exacerbations17. Recently, a model predicting exacerbation on an annual basis has been developed18, but since exacerbation events do not occur at regular intervals and are influenced by changing environmental factors, it is essential to consider these factors.

Here, we developed prediction model for subsequent exacerbation event by integrating several information regarding the risk of exacerbation, then sequentially performed externally validation to identify its predictive power and adequacy.

Methods

Data source

We used six sources of data: the Korean COPD subgroup study (KOCOSS) cohort, the Korean Health Insurance Review and Assessment Service (HIRA), weather data, air pollution data collected by Air Korea19, trends in virus detection nationwide from the Korean Centers for Disease Control, and the Korean National Health and Nutrition Examination Survey (KNHANES) database.

KOCOSS cohort data

We used data from the KOCOSS cohort (registered on ClinicalTrials.gov with identifier CT02800499), an ongoing, prospective observational cohort of COPD, which has been recruiting patients from 48 referral hospitals in Korea since 2012.

Patients with COPD in the KOCOSS cohort were eligible if they were aged ≥ 40 years and exhibited a fixed airflow obstruction, as defined by a post-bronchodilator forced expiratory volume in 1 s (FEV1)/forced vital capacity (FVC) ratio < 0.7; detailed information concerning the KOCOSS cohort has been published20. The KOCOSS database contains information regarding lung function, smoking history, and disease-specific quality of life measured using the COPD assessment test (CAT).

HIRA data

HIRA receives medical claims data from all hospitals in South Korea, evaluates the appropriateness of the claims, and approves reimbursements from the National Health Insurance. All healthcare services rendered under national health insurance are claimed for reimbursement on a weekly or monthly basis, and are reported to HIRA and stored in HIRA’s in-house data warehouse. We collected information regarding age, sex, medications, comorbid conditions [classified using the International Classification of Disease, Tenth Revision (ICD-10) codes], and exacerbations.

Definition of exacerbation in HIRA database

Moderate exacerbation was defined by an outpatient visit with an COPD ICD-10 code (J43.x-44.x, except J430) with prescription of systemic steroids with or without antibiotics. Severe exacerbation was defined by a visit to an emergency room and/or hospital admission with a COPD (J43.x–J44.x, except for J43.0) or COPD-related diseases such as pneumonia (J12.x–J17.x), pulmonary thromboembolism (I26, I26.0 and I26.9), dyspnea (R06.0), or acute respiratory distress syndrome (J80) with prescriptions of steroid and/or antibiotics. This method has been utilized in many previous our studies21,22,23,24,25. COPD ICD-10 code, combined with the prescription of systemic steroids with or without antibiotics. All exacerbation was defined as an event for either moderate or severe exacerbation.

Meteorological and air pollution data

The Korean Ministry of the Environment provides daily local meteorological weather data including the levels of particulate matter (PM) of aerodynamic diameter < 10 μg/m3, minimum ambient temperatures (°C), and precipitation at local weather stations. These data are publicly available on the Ministry website. Due to the single, government-established health insurance system and the assignment of national identification (ID) number at birth, assessment results can be produced linking the ID number with healthcare data under the national insurance system. It enables us to match all weather information to patient addresses at the provincial level. Both meteorological and air pollution data were expressed as the number of days prior to the date of acute exacerbation.

Data on epidemic viruses

The Korean Centers for Disease Control and Prevention Agency (KDCA) has been conducting national surveillance for respiratory virus through the Korea Influenza and Respiratory Viruses Surveillance System (KINRRESS) since 2000. This system comprises two surveillance components: specimen-based and clinical surveillance systems. The former collects data from patients who visited outpatient clinics with acute respiratory illness at 52 designated sentinel sites. The latter gathers data from patients with severe disease who are admitted to hospitals. This nationwide surveillance system reports the weekly number of detected cases and positive rates of eight respiratory viruses including influenza virus, adenovirus, parainfluenza virus, respiratory syncytial virus, human coronavirus, human rhinovirus, human bocavirus, and human enterovirus using the multiplex reverse transcription-polymerase chain reaction (RT-PCR).These data are accessible to everyone on the KDCA website (http://www.kdca.go.kr/npt/).

Model development and validation

All moderate to severe exacerbation were used as outcome variable, and the number of claims were used as individual event to analyze exacerbations. Based on each number of claims, participants of KOCOSS cohort were divided into training and internal validation dataset. We chose demographic data including age and sex, spirometry data, COPD related medications including inhaler treatment and systemic steroids and hospital visit for AE, air pollution data and meteorological data, and influenza virus data as relevant risk factors for the final model for prediction of moderate and/or severe exacerbation. Then, six machine learning and logistic regression tools were used to evaluate the performance of the model. Formula of prediction equation generated by GEE analysis was detailed in the supplementary methods.

The model will be reliable only if validated using data from other dataset. We validated our model using the KNHANES cohort; KNHANES conducts an annual, nationally representative study regarding the health and nutritional status of the South Korean population. We used data concerning 2151 individuals with FEV1/FVC ratios < 0.7; we employed only data from 2008 to 2012 (the years for which virus data were available). The KNHANES database does not include CAT scores; we calculated the scores using the EQ-5D values and patient ages26.

Statistical analysis

The parameters used to develop the final predictive model are listed in Table 1. Seven statistical methods were used: random forest, extreme gradient boosting (XGBoost), light gradient boosted machine (LGBM), mixed-effect random forest (MERF), support vector machine (SVM), logistic regression, and generalized estimating equation (GEE). All models were taught using data resampled via adaptive synthetic sampling; this solved the binary class imbalance problem associated with acute exacerbation. We evaluated model performance in terms of accuracy and the area under the curve (AUC) of the receiving operating characteristic curve (see the statistical analysis of supplementary methods). Moreover, F1 score, sensitivity, positive predictive value, specificity and negative predictive value were also used as indicators for the possibility of low predictive power of model with very few events of interest and unbalanced dataset. Additional details are available in the Additional file: Supplementary Methods and Materials.

Table 1 Variables used in the prediction model.

We divided the 590 KOCOSS patients into training and internal validation test sets by splitting 18,945 individual claims, rather than splitting them by each patient's identification number. The final model selection was based on an internal validation set that closely resembled the modeling training dataset. Additionally, the performance evaluation of the final model was conducted on an external validation dataset, the KNHANES. The KNHANES database was then used to explore predictive performance. We conducted an analysis of the performance of each prediction model through 50 replicates of bootstrapping for each, presenting results for each metric in terms of mean value and the lower and upper value of the 95% confidence interval (CI). All models were implemented in Python 3.7 and run in the Windows 10 environment.

Ethics approval and consent to participate

All hospitals involved in the KOCOSS cohort obtained approval from their respective institutional review board committees and informed consent from their patients. The study protocol was approved by the Institutional Review Board of the KONKUK University Medical Center (IRB No. KHH1010338). This study used data from KNHANES as validation cohort and KNHANES was approved by the Institutional Review Board of the Korea Centers for Disease Control (IRB No. 1401–047-547). The patients/participants provided their written informed consent to participate in this study.

Results

Characteristics of participants

The 590 COPD patients in the KOCOSS cohort were randomly divided into a training subset (n = 581) and an internal validation subset (n = 569) based on their number of claims (Fig. 1). Both sets were similar in terms of sociodemographic features, severity of airflow limitation, and disease-related quality of life (Table 2). Over 90% of all patients were men; the mean age was approximately 65 years, and over 80% of all patients were current or ex-smokers. The mean FEV1 was approximately 52% of the predicted value in both groups. The mean numbers of all exacerbations were 1.05 ± 2.07/year in the training subset and 2.53 ± 2.45/year in the validation subset. The mean numbers of severe exacerbations were 0.08 ± 0.24/year and 0.17 ± 0.43/year, respectively.

Figure 1
figure 1

Scheme of study design. COPD chronic obstructive pulmonary disease, FEV1 forced expiratory volume in 1 s, FVC forced vital capacity, HIRA the Korean Health Insurance Review and Assessment Service, KNHANES Korean National Health and Nutrition Examination Survey, KOCOSS Korean COPD Subgroup Study.

Table 2 Comparison of baseline characteristics between training, internal validation and external validation dataset.

The external validation cohort included 2151 individuals in the KNHANES dataset with FEV1/FVC ratios < 0.7, 70.7% of whom were men; the mean age was 64 years. Compared with the KOCOSS cohort, there were more non-smokers (30.9% vs. 1%), the FEV1 was higher (78.5% vs. 52.1% of the predicted value), and the CAT score was lower (10.3 vs. 16.1) in the KNHANES cohort (Table 2).

Development of an exacerbation prediction model

A scheme of the study design is shown in Fig. 1. Environmental data including information of weather, air pollution and epidemic respiratory virus, individual’s personal data from KOCOSS cohort and data of HIRA dataset of each participant. Each item was assumed affecting AE and represented by odds ratio (OR). The ORs for the personal factors including demographic data, comorbid conditions and hospital visits for COPD and/or AE were found Table S1 and ORs for the respiratory medications were presented in Table S2 in the additional file. Effects of air pollution, weather and virus trends on AE was presented in Table S3 in the additional file. Only significant variables with a p value of 0.05 or less in these univariable results were selected and the multivariable model was applied to confirm the ORs of each variable, and through this approach, final prediction model was generated (Table 3). We chose age, sex, smoking status, comorbidities, FEV1 value, disease specific medication usage and hospital visit, frequency of AE, air pollution data (e.g., NO2 and PM2.5), humidity and diurnal temperature data, and epidemic influenza virus data in final model.

Table 3 Factors associated with COPD AE estimated by a generalized estimation equation model through multivariate logistic regression analysis.

In Fig. 2, we showed the feature importance for the (a) random forest, (b) XGBoost, (c) LGBM, and (d) MERF model, which was made by impurity-based method for feature selection that best splits the dataset. Features that tend to split dataset more precisely will result in a large importance value in that order. Past exacerbation events were most important feature in all machine learning model.

Figure 2
figure 2

Impurity based feature importance of tree based machine learning methods. AE acute exacerbation, CAT chronic obstructive pulmonary disease (COPD) assessment test, DLCO diffusing capacity for carbon monoxide, FEV1 forced expiratory volume in 1 s, FVC forced vital capacity, ICS inhaled corticosteroid, IFV influenza virus, LABA long-acting beta2 receptor agonist, LAMA long-acting muscarinic receptor antagonist, NO2 nitrogen dioxide, OCS oral corticosteroids, PM particulate matter, SABA short-acting beta2 receptor agonist.

Internal validation of the predictive model

We fitted seven models and prediction performance in internal validation cohort is presented in Table S4 in the additional file. The cutoffs of the perceived exacerbation risks were values that maximized model performance. An SVM classifies data by finding the optimal hyperplane that separates all data in a binary manner (i.e., in one group or the other); no cutoff is given. The performance metrics were the AUC, accuracy, F1 score, recall, precision, specificity and negative predictive value. All bootstrapped data of each metric were presented in terms of mean value and upper and lower value of the 95% CI. The LGBM model was optimal in terms of AUC, accuracy, F1 score and sensitivity, specificity and negative predictive value. The random Forest, XGBoost, and MERF models also excellently predicted the risk of future exacerbation.

External validation of the predictive model

The results of external validation are shown in Table 4. The random forest, XGBoost, LGBM, and MERF models exhibited similarly high predictive capacities, according to all parameters. All four models had AUCs and F1 scores > 0.7; furthermore, the random forest and MERF models yielded F1 scores > 0.8.

Table 4 Prediction performance in validation cohort (KNHANES).

Examples of application of the prediction model

Various personal and environmental factors can be fed into machine learning models to predict daily exacerbation risk. To demonstrate the performance of our predictive model, we tested two hypothetical scenarios. We presented two representative cases in which the risk of exacerbation was assessed differently at each independent date (Fig. S1). Patient in case A has low personal risk factors and the probability of exacerbation in a favorable environment is 11%. While, B has high risk factors and the probability in unfavorable environment is 97%.

Discussion

We developed a model for predicting COPD exacerbation on a daily basis by merging clinical cohort data, 5 years of follow-up claims data, and information regarding air pollution and respiratory virus infection rates. Those with spirometry confirmed airflow limitation in the nationally representative database that contains 5 years of data regarding claims and external factors were used as validation cohort. We found similar overall favorable predictive performance especially the random forest, XGBoost, LGBM, and MERF models exhibited good predictive power. Our personalized model to predict exacerbation was created using data from patients with symptomatic COPD who were treated and followed-up in referral hospitals, but also has high predictive ability in patients with less symptomatic, mild, and undiagnosed COPD. This suggestive the possibility of universal application of our tool.

In our previous study of 594 COPD patients in the KOCOSS cohort, the HIRA data from 2007 to 2012 were merged and the relationships of future exacerbation with various clinical and environmental factors were analyzed17. We showed female sex (adjusted odds ratio [aOR], 1.5596; 95% confidence interval [CI], 1.1742–2.0715), number of exacerbations during the previous year (aOR, 1.2745; 95% CI 1.2095–1.3430), COPD grade (aOR, 1.5693; 95% CI 1.1727–2.1001), influenza virus detection rate 2 weeks before exacerbation (aOR, 1.0075; 95% CI 1.0017–1.0133), the lowest temperature in the 5 days before exacerbation (aOR, 0.9572; 95% CI 0.9219–0.9938), and the lowest temperature 1 day before exacerbation (aOR, 0.9587; 95% CI 0.9196–0.9994) were significantly associated with exacerbation. Based on the previous study, this study represented an effort to assess the risk of exacerbation prediction by incorporating not only demographic and clinical variables commonly used in previous researches but also environmental factors known to influence exacerbations. This aspect constitutes a significant strength of this study and we developed a predictive model of exacerbation and verified predictive performance through machine learning tools. Moreover, among various clinical and environmental factors, we identified the relative importance of each factor on exacerbation on each machine learning method. This is the first attempt and one of novel and interesting finding in this study.

To our knowledge, this is the first daily prediction model for exacerbation of COPD patients worldwide; our enhancement of predictive power via machine learning is novel. We used daily exacerbation data collected over several years; we also used daily air pollution, climate, and epidemic respiratory information. Korea’s unique health insurance system and readily accessible big data enable data collection. Few studies have sought to predict the risk of acute exacerbation, particularly on a daily basis, for several reasons. First, sufficient data have been lacking. We used information from the KOCOSS multicenter cohort; we validated the model using information from a nationwide health survey (KNHANES) and claims data. Therefore, our model is applicable to mild, asymptomatic, underdiagnosed, and symptomatic COPD patients. Importantly, it was possible to identify hospital visits caused by exacerbation. In 1998, South Korea implemented a National Health Insurance system that covers almost all of the population27. All healthcare institutions in South Korea claim medical expenses through HIRA, which evaluates the appropriateness of claims and approves National Health Insurance reimbursements. HIRA collects patient information provided by physicians; data concerning almost all COPD patients and their healthcare requirements can be found in the HIRA database. Because of this accurate database, it was possible to develop a predictive formula. The model includes severity of COPD itself, which is represented by lung function. The claims data lack pulmonary function test results. However, because we merged claims, cohort, and KNHANES data, we integrated lung function into our model. Second, because COPD is chronic in nature, most cohort data were derived during long-term follow-up, rendering it difficult to engage in statistical modeling because data independence may have been violated by the collection of repeated measurements. To overcome this correlation problem with respect to longitudinal data, the GEE and two machine learning methods (random forest and deep neural network) based on decision trees (in which independence is not assumed) were used to develop the predictive model. Third, event imbalance before and after resampling must be considered (see Table S5). The incidence of acute COPD exacerbation was low; only 3% of hospital visits by COPD patients were triggered by such exacerbation. Most COPD patients do not experience acute exacerbations on most days. The adaptive synthetic sampling method was applied to all machine learning models; this reduced the problems caused by event imbalances.

The airflow limitation group (FEV1/FVC ratios < 0.7) from the KNHANES dataset with an average FEV1 of 79% of predicted value was used as an external validation cohort. In general, approximately 75% of COPD patients remain undiagnosed because most of them are less symptomatic28. Likewise, only less than 5% of individuals are reported to visit and treated for COPD in South Korea29. Most of these less symptomatic patients that hardly voluntarily visit hospital, thus it is difficult to find these group of individuals. However, a general population-based studies showed that although these undiagnosed COPD patients are a- or less symptomatic, they are at higher risk of exacerbation and even mortality30,31. Our predictive model was derived from overt COPD cohort and validated in less symptomatic individuals from general population cohort. This suggests the generalizability of the model to patients with mild to severe COPD.

In this study, we created a predictive model for exacerbation that considered both external and clinical factors. The use of machine learning to improve the predictive power is both novel and encouraging. An exacerbation episode is not occurred with clear time interval with sharp start and end. But rather than it is a gradual process with crescendo and decrescendo process. Our prediction model could be applied regardless of any distinct time interval in predicting the risk of near future. If patients can easily assess their own daily risk of exacerbation, it may enable preventive interventions such as educating patients to wear masks or hand hygiene during respiratory virus outbreaks. Additionally, when there is an overall increase in exacerbation risk, patients could reschedule their outpatient visits, allowing for timely preventive interventions. However, our work had some limitations. First, because exacerbations were identified using claims data, over- or underestimation was possible. However, we combined the ICD-10 code with the prescribed drug; this may have rendered the data robust. Second, the PM level was based on the patient address in the HIRA database; this address was presumably not always the location where the exacerbation occurred. An association between PM level and COPD exacerbation has been found in many epidemiological studies. However, South Korea is a small country (total area 100,210 km2). Thus, the extent of air pollution may not greatly differ among regions. Third, we have taken into account weather variables prior to exacerbation, however, we were unable to precisely apply for the clear time lag. Fourth, high-quality data are available regarding weather conditions, PM levels, and epidemic respiratory virus status in Korea; it was thus possible to predict exacerbation considering all of these factors. However, it may be difficult to generalize the application of our predictive tool to other countries where such data are not readily available to the public. Thus, it seems necessary to develop an exacerbation model using universally available and applicable information. Fourth, we analyzed the influence of each drug type but did not incorporate levels of potency of each drug.

In conclusion, because COPD exacerbations negatively impact health and disease progression including further exacerbation, early detection of at-risk patients is helpful to response timely and appropriately. Exacerbation is triggered by combinations of various clinical and environmental factors. To our knowledge, we are the first group worldwide to develop a personalized, daily, predictive, COPD exacerbation model; we integrated various relevant internal and external factors including multicenter cohort and claims data, as well as big data concerning climate, air pollution, and epidemic respiratory virus status. We improved the model’s predictive power by using several machine learning methods. Our model will encourage patients to visit the hospital early when they aware an increased risk of exacerbation; timely and appropriate intervention will follow.