Background

Health spending as a share of gross domestic product (GDP) has gradually increased over the last 15 years among the Organisation for Economic Co-operation and Development (OECD) countries, from 7.8% in 2005 to 8.8% in 2020 [1]. Projections put the share at 10.2% of GDP in 2030, far higher than the current proportion [2]. The rise in healthcare expenditures affects affordability for both individual patients and payers: the share of GDP spent on health correlates positively with catastrophic payments, which are tied to affordability [3]. In addition, a continuous increase in health spending can inhibit the achievement of universal health coverage, which is a target under the United Nations Sustainable Development Goal 3 (Ensure healthy lives and promote well-being for all at all ages) [4]. In South Korea, the annual health expenditure covered by the National Health Insurance Service has risen over the last two decades (from $8 billion in 2000 to $65 billion in 2021) [5]. Worldwide health expenditure is expected to accelerate due to aging societies and technological advancement. In particular, the sustainability of health financing can deteriorate further in countries with a fee-for-service payment system, such as South Korea, because of an oversupply of services [6].

Rising health spending has increased interest in efficiency in the quality of care. Efficiency measurement has progressed from measuring the amount of service provided (e.g., length of stay or physician visits) to calculating the ratio between observed and predicted costs [7]. Predicted costs allow comparability in efficiency measurement only when they adjust for risk factors that contribute to differences in the outcome of interest, such as sociodemographic factors or comorbidities. Comorbidity risk adjustment methods developed for mortality, such as the Charlson Comorbidity Index (CCI), have been widely used not only in clinical research but also in health expenditure research [8,9,10]. However, the choice of risk adjustment method should be based on the outcome of interest, which is closely related to the model’s construction and the statistical techniques selected [11]. The United States Centers for Medicare and Medicaid Services (CMS) introduced Hierarchical Condition Categories (HCC, CMS-HCC) for cost estimation. The CMS-HCC has recently been used in value-based payment programs such as the Merit-based Incentive Payment System and Hospital Value-Based Purchasing [12, 13]. In addition, the US health insurance system has started using another version of HCC, the Department of Health and Human Services-HCC (HHS-HCC), which addresses risk selection in premiums under the Affordable Care Act [14].

In South Korea, there have been efforts to use a cost-specific risk adjustment method by adopting the National Health Insurance Service-HCC (NHIS-HCC), a modified version of CMS-HCC based on annual cost estimation [15, 16]. In a recent study, the NHIS-HCC was used to estimate episode-based costs in the process of efficiency measurement [17]. However, studies have yet to evaluate the feasibility of the NHIS-HCC for episode-based costs by comparing it with currently available risk adjustment methods. In addition, the disease groups in the NHIS-HCC are limited to those relevant to the elderly because the CMS-HCC was developed for Medicare, which targets people aged 65 or older [16, 18]. In contrast, the HHS-HCC covers a wider range of disease groups, including pregnancy, delivery, and neonate-related diseases [19].

Therefore, this study aimed to compare the diagnosis-based risk adjustment methods, including the mortality adjustment tool (i.e., CCI), risk-adjusted Diagnosis-Related Group (DRG), and HCCs, based on episode-based costs in the context of efficiency measurement.

Methods

Data sources

We used the Health Insurance Review and Assessment Service-National Patient Sample (HIRA-NPS), which is the representative claims database that randomly samples 3% of the annual beneficiaries in South Korea [20]. We used the 2018 HIRA-NPS for model evaluation, which was the latest available dataset at the design of the study. For external validity, we used the 2017 dataset considering the cross-sectional feature of HIRA-NPS and the sample size for regression [21].

Episode construction specifications

We adopted the episode definition used in the National Health Insurance Service Spending Per Episode (NSPE) index, an episode-based efficiency measure for hospitals (Fig. 1) [17]. An NSPE episode includes the actual hospitalization (i.e., index admission) and the related outpatient services during the episode window (before and after the admission), reflecting the shift of services from inpatient to outpatient settings [22]. First, we created index admission datasets using annual claims data (i.e., 2017 and 2018 HIRA-NPS) from April to November, considering the definition of the NSPE episode and the lookback period needed to obtain comorbidity information. Exclusion criteria for index admission were as follows: (1) length of stay ≤ 1 day, (2) cost for index admission ≤ $0, and (3) error DRGs.

Fig. 1

NSPE episode framework. (A) Index admission, (B) Identical primary diagnostic code (3 digits) and institution compared to the index admission, (C) Non-identical primary diagnostic code (3 digits) but the same institution compared to the index admission, (D) Identical primary diagnostic code (3 digits) but non-identical institution compared to the index admission, (E) Non-identical primary diagnostic code (3 digits) and institution compared to the index admission. NSPE, National Health Insurance Service Spending Per Episode

The NSPE window starts 30 days before the admission date and ends 30 days after the discharge date. We assigned only related outpatient services within the episode window to the NSPE episodes. Related outpatient services are defined as those with the same primary diagnostic code (3 digits) and the same institution as the index admission. Because episode windows can overlap, we handled overlapping episodes according to the type of overlap: (1) a single episode (no adjustment), (2) multiple episodes, no overlap (no adjustment), (3) multiple episodes, overlapping but distinct periods (no adjustment), (4) multiple episodes, overlapping and non-distinct periods (adjusted by assigning half of the overlapped period to the preceding episode and half to the following episode) (Additional file 1) [17]. The lookback period for comorbidities comprised the episode window and the two months preceding it.
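The window and the "related outpatient service" rule described above can be illustrated with a minimal Python sketch. It is not the authors' SAS implementation: the class names, field names, and sample claims are hypothetical, the window boundaries are assumed to be inclusive, and the adjustment for overlapping episodes is not covered.

```python
from dataclasses import dataclass
from datetime import date, timedelta

WINDOW_DAYS = 30  # from the text: 30 days before admission, 30 days after discharge


@dataclass
class IndexAdmission:
    admit: date
    discharge: date
    dx3: str          # first 3 characters of the primary diagnostic code
    institution: str  # hypothetical institution identifier


@dataclass
class OutpatientClaim:
    visit: date
    dx3: str
    institution: str
    cost: float


def episode_window(adm):
    """Return (start, end) of the episode window; both ends assumed inclusive."""
    start = adm.admit - timedelta(days=WINDOW_DAYS)
    end = adm.discharge + timedelta(days=WINDOW_DAYS)
    return start, end


def is_related(adm, claim):
    """Same 3-digit primary diagnosis and same institution, inside the window."""
    start, end = episode_window(adm)
    return (
        start <= claim.visit <= end
        and claim.dx3 == adm.dx3
        and claim.institution == adm.institution
    )


# Hypothetical data for illustration only.
adm = IndexAdmission(date(2018, 6, 10), date(2018, 6, 15), "I21", "H001")
claims = [
    OutpatientClaim(date(2018, 5, 20), "I21", "H001", 40.0),  # related, before admission
    OutpatientClaim(date(2018, 7, 10), "I21", "H001", 25.0),  # related, after discharge
    OutpatientClaim(date(2018, 6, 20), "J18", "H001", 30.0),  # different diagnosis
    OutpatientClaim(date(2018, 6, 20), "I21", "H002", 30.0),  # different institution
    OutpatientClaim(date(2018, 7, 20), "I21", "H001", 15.0),  # outside the window
]
related_cost = sum(c.cost for c in claims if is_related(adm, c))
```

Only the first two claims satisfy all three conditions, so only their costs would be added to the episode.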

Model estimation and performance evaluation

We estimated the current episode costs (i.e., concurrent model) using a linear regression by the Major Diagnostic Categories (MDCs) [23, 24]. We used the ordinary least squares (OLS) regression, practically used to estimate episode-based costs [25,26,27]. Considering the requirement of 10 observations for each additional explanatory variable for the regression, as a rule of thumb, we screened the number of episodes according to the MDC groups [28]. As for MDCs not satisfying the minimum number of observations for the regression, several MDC groups were merged based on similarities; otherwise, we inevitably excluded those MDCs from the analysis due to a lack of observation for the estimation. We merged MDCs as follows: MDC ST (Infectious and Parasitic Diseases), MDC S (Infectious and Parasitic Diseases: HIV) and MDC T (Infectious and Parasitic Diseases); MDC UV (Mental Diseases and Disorders), MDC U (Mental Diseases and Disorders) and MDC V (Alcohol/Drug Use and Alcohol/Drug Induced Organic Mental Disorders); MDC WXY (Trauma, Injuries, Poisoning and Burns), MDC W (Multiple Trauma), MDC X (Injuries, Poisoning and Toxic Effects of Drugs), MDC Y (Burns) (Additional file 2). We excluded MDC A (PreMDC, transplants and tracheostomy DRGs), MDC Q (Disease and Disorders of the Blood-Forming Organs and Immunological Disorders), and MDC Z (Factors Influencing Health Status and Other Contacts with Health Services) from the analysis due to the insufficient number of observations within the MDC.
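The screening rule of 10 observations per explanatory variable, and the effect of merging small MDCs, can be sketched as follows. The episode counts and the number of explanatory variables are hypothetical values chosen for illustration; they are not taken from the study data.

```python
MIN_OBS_PER_VARIABLE = 10  # rule of thumb cited in the text


def meets_rule(n_episodes, n_variables):
    """True if the group has at least 10 observations per explanatory variable."""
    return n_episodes >= MIN_OBS_PER_VARIABLE * n_variables


# Hypothetical episode counts per MDC and a hypothetical number of regressors.
episode_counts = {"S": 40, "T": 2600}
n_variables = 120

# Screen each MDC on its own, then screen the merged group (MDC ST = S + T).
separate = {mdc: meets_rule(n, n_variables) for mdc, n in episode_counts.items()}
merged_ok = meets_rule(sum(episode_counts.values()), n_variables)
```

In this illustration MDC S fails the rule on its own but the merged group passes, which is the situation that motivates merging similar MDCs rather than excluding them.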

The dependent variable in the regression analysis was the total expenditure for inpatient and outpatient services during the individual NSPE episode window, obtained by the National Health Insurance Service (NHIS). Considering skewed distribution, we used winsorized NSPE episode costs as the dependent variable for the regression analysis. We obtained NSPE episode costs from the claims by the NHIS, a single insurer providing health insurance in South Korea. Therefore, the NSPE episode costs included the amount paid by the NHIS and a portion of the out-of-pocket costs (only statutory payment but not non-payment items).

Winsorizing was adopted to treat outliers at the 0.5 percentile (upper and lower bounds), considering the average cost per day ($180) in 2018 from claims statistics and the average NSPE episode cost by MDCs ($30–$202) [29] (Additional file 3). We used costs in South Korean Won (KRW) in the model estimation and then converted them to United States Dollars (USD) for presentation, using the annual average exchange rates at the time of the datasets (2017, 1 USD = 1,130.48 KRW; 2018, 1 USD = 1,100.58 KRW) [30].
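A minimal sketch of two-sided winsorizing at the 0.5 percentile, followed by the KRW-to-USD conversion, is given below. The linearly interpolated percentile definition is an assumption (statistical packages differ in how they define percentiles), the function names are hypothetical, and the sample costs are synthetic; only the exchange rates come from the text.

```python
KRW_PER_USD = {2017: 1130.48, 2018: 1100.58}  # annual average rates stated in the text


def percentile(sorted_values, p):
    """Linearly interpolated percentile (p in [0, 100]) of an already sorted list."""
    if not sorted_values:
        raise ValueError("percentile of empty data")
    rank = (len(sorted_values) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = rank - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def winsorize(costs, tail_pct=0.5):
    """Clamp values below the tail_pct and above the (100 - tail_pct) percentile."""
    ordered = sorted(costs)
    lower = percentile(ordered, tail_pct)
    upper = percentile(ordered, 100.0 - tail_pct)
    return [min(max(c, lower), upper) for c in costs]


def krw_to_usd(amount_krw, year):
    """Convert KRW to USD with the annual average rate for the dataset year."""
    return amount_krw / KRW_PER_USD[year]


# Synthetic costs 1..1000 (in KRW) for illustration only.
costs = [float(v) for v in range(1, 1001)]
winsorized = winsorize(costs)
```

Unlike trimming, winsorizing keeps every episode in the dataset and only pulls the extreme values in to the percentile bounds.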

The explanatory variables included age groups (age 0–2, age 3–19, age 20–39, age 40–59, age 60 and over), sex, insurance type (National Health Insurance, Medical Aid), type of institution (tertiary hospital, general hospital, and hospital), Adjacent Diagnosis Related Group (ADRG), and diagnosis-based risk adjustment. Due to the limitation of the categorical age variable in the HIRA-NPS and the observation for explanatory variables, we collapsed age groups as follows: (1) age 0–2, infants and toddlers, (2) age 3–19, child and teenage, (3) age 20–39, young adults, (4) 40–59, middle-aged adults, (5) age 60 and over, older adults. Depending on the risk adjustment for comorbidities, we constructed five separate models: (1) No risk adjustment (Model 0), (2) Refined Diagnosis Related Group (RDRG, Model 1) [23], (3) CCI (Model 2) [31, 32], (4) NHIS-HCC (Model 3) [15,16,17], (5) HHS-HCC (Model 4) [14, 33].

The model performance at the episode level was evaluated using R-squared (R2) and adjusted R2 (adj. R2) statistics according to the MDC groups [34]. We also measured the Mean Absolute Errors (MAEs) to compare the average magnitude of the errors between observed and predicted values [24]. The predictive ratio (PR) was used to compare the accuracy within subgroups (age group, sex, types of institutions, insurance types, and the highest and lowest decile of the observed costs) [14, 24]. We verified our performance comparison using HIRA-NPS 2017, the dataset separately sampled compared to the dataset used for estimation (HIRA-NPS 2018). The HIRA-NPS are cross-section data selecting different patients every year in the pursuit of privacy protection [21]. Considering the insufficient sample size to split for external validation from the annual dataset, we used another year's dataset differently selected representatively from the whole claims data.
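The performance statistics named above can be written out in a short sketch. The definitions used are the conventional ones and are assumptions about the exact implementation: R2 as 1 minus the ratio of residual to total sum of squares, adjusted R2 with k explanatory variables, MAE as the mean absolute difference, and PR as total predicted cost divided by total observed cost within a subgroup. The observed and predicted values are synthetic.

```python
def r_squared(observed, predicted):
    """R2 = 1 - SSE / SST."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


def adjusted_r_squared(r2, n, k):
    """Adjusted R2 for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def mean_absolute_error(observed, predicted):
    """Average magnitude of the errors between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def predictive_ratio(observed, predicted):
    """PR > 1 means overestimation in the subgroup, PR < 1 underestimation."""
    return sum(predicted) / sum(observed)


# Synthetic episode costs for illustration only.
observed = [100.0, 200.0, 300.0, 400.0, 1000.0]
predicted = [150.0, 220.0, 310.0, 380.0, 940.0]

r2 = r_squared(observed, predicted)
adj_r2 = adjusted_r_squared(r2, n=len(observed), k=2)
mae = mean_absolute_error(observed, predicted)
pr_all = predictive_ratio(observed, predicted)
pr_lowest = predictive_ratio(observed[:1], predicted[:1])    # cheapest episode
pr_highest = predictive_ratio(observed[-1:], predicted[-1:])  # costliest episode
```

In this toy example the overall PR is exactly 1 while the cheapest episode is overestimated and the costliest is underestimated, which is the kind of pattern a PR computed on the extreme cost deciles is designed to reveal.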

Additionally, we conducted several sensitivity analyses to explore models dealing with the right-skewed distribution of residuals and the potential clustering effect of medical institutions. First, we used log-transformed costs in the model using HHS-HCC for comorbidity (Model 5) [11, 35]. Second, we trimmed individual datasets by MDCs using the interquartile range (IQR) to deal with outliers (Model 6) [36]. Then, we compared these two additional models with Model 4 using winsorized NSPE episode costs. Third, we examined the clustering effect using the Intracluster Correlation Coefficient (ICC) based on the Model 4 [37, 38]. Then, we conducted a multilevel analysis considering nested within institutional types (Model 7) and presented model fits (Akaike Information Criterion, AIC; Schwarz’s Bayesian Information Criterion, BIC; Pseudo-R2) [39].
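The two outlier treatments used in the first and second sensitivity analyses can be sketched as below. The text states only that trimming used the IQR; the 1.5 × IQR fences, the quartile method, and the function names here are assumptions made for illustration, and the costs are synthetic.

```python
import math
import statistics

IQR_MULTIPLIER = 1.5  # assumed fence width; not specified in the text


def trim_by_iqr(costs):
    """Drop values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _median, q3 = statistics.quantiles(costs, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - IQR_MULTIPLIER * iqr
    upper = q3 + IQR_MULTIPLIER * iqr
    return [c for c in costs if lower <= c <= upper]


def log_transform(costs):
    """Natural log of strictly positive costs (episodes with cost <= 0 are excluded upstream)."""
    return [math.log(c) for c in costs]


# Synthetic, right-skewed costs for illustration only.
costs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]
trimmed = trim_by_iqr(costs)
logged = log_transform(costs)
```

Trimming removes the extreme episode entirely, whereas the log transformation keeps it and compresses its distance from the rest, which is one reason the two treatments can affect explanatory power differently.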

Efficiency measurement

Considering the purpose of cost estimation for efficiency measurement in this study, we compared the descriptive statistics and the distribution of the NSPE indexes, a modified version of the Medicare Spending Per Beneficiary measure [13], using estimates from individual models. The steps to calculate the NSPE indexes were as follows: (1) calculating observed and predicted costs of individual NSPE episodes, (2) treatment of outliers, (3) calculating average observed and predicted NSPE costs of the individual institution, (4) calculating the NSPE ratio as observed mean to predicted mean of costs, (5) calculating NSPE amount by multiplying the average observed costs and NSPE ratios, (6) deriving NSPE indexes of individual institutions as a ratio with weighted median NSPE amounts [17].
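Steps (3) to (6) can be sketched as follows, taking winsorized observed and predicted episode costs as given. Two points are assumptions rather than statements from the text: the "average observed costs" in step (5) is taken to be the average over all episodes, and the weighted median in step (6) is weighted by each institution's episode count. Function names and data are hypothetical.

```python
from collections import defaultdict


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("empty input")


def nspe_indexes(episodes):
    """episodes: iterable of (institution, observed_cost, predicted_cost)."""
    obs_sum = defaultdict(float)
    pred_sum = defaultdict(float)
    count = defaultdict(int)
    for inst, observed, predicted in episodes:
        obs_sum[inst] += observed
        pred_sum[inst] += predicted
        count[inst] += 1

    # Assumed: overall mean observed cost across all episodes.
    overall_mean_obs = sum(obs_sum.values()) / sum(count.values())

    # Steps 3-5: institution means, observed/predicted ratio, NSPE amount.
    amounts = {}
    for inst in count:
        mean_obs = obs_sum[inst] / count[inst]
        mean_pred = pred_sum[inst] / count[inst]
        ratio = mean_obs / mean_pred
        amounts[inst] = ratio * overall_mean_obs

    # Step 6: index relative to the episode-weighted median amount (assumed weighting).
    insts = list(amounts)
    median_amount = weighted_median(
        [amounts[i] for i in insts], [count[i] for i in insts]
    )
    return {inst: amounts[inst] / median_amount for inst in insts}


# Hypothetical episodes for illustration only.
episodes = [
    ("A", 100.0, 100.0), ("A", 300.0, 200.0),
    ("B", 200.0, 200.0), ("B", 200.0, 200.0), ("B", 200.0, 200.0),
    ("C", 100.0, 200.0),
]
index = nspe_indexes(episodes)
```

Here institution B sits at the weighted median and receives an index of 1, institution A spends more than predicted and lands above 1, and institution C lands below 1.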

This research using administrative data was deemed exempt from review by the Asan Medical Center Institutional Review Board (#2021–0093). All analyses were conducted using SAS 9.4 (SAS Institute, Cary, NC, USA).

Results

Episode description

The original dataset consisted of 147,493 episodes for the estimation (HIRA-NPS 2018) and 144,877 for the external validation (HIRA-NPS 2017) (Table 1). After excluding the MDCs not satisfying an appropriate number of observations for regression analysis, episode counts were 145,792 and 143,158 in 2018 and 2017, respectively. The 2018 dataset included 106,876 beneficiaries and 1,772 institutions. The mean (standard deviation, SD) inpatient days was 8.2 (10.0). In the 2017 dataset, the number of beneficiaries and institutions was 104,736 and 1,763, respectively; the mean (SD) of inpatient days was 8.3 (10.2).

Table 1 Episode distribution according to MDC

The characteristics of NSPE episodes in each MDC are presented in Table 2. MDC UV had the longest mean length of stay (20.6 days), whereas MDC C had the shortest (3.9 days). Overall, Emergency Room (ER) episodes accounted for 19.7% of episodes; the proportion of ER episodes was the highest in MDC WXY (42.6%) and the lowest in MDC P (6.2%). The total numbers of ADRG and RDRG types were 1,164 and 2,933, respectively. While MDC I had the most types of ADRGs (n = 145) and RDRGs (n = 387), MDC UV and MDC P had the fewest types of ADRGs (n = 15) and RDRGs (n = 26), respectively. In particular, the number of ADRGs and RDRGs was the same in MDC P, implying no risk adjustment for comorbidities. The average cost of an NSPE episode was $2,422, with an average of $2,308 for inpatient care and $115 for outpatient care. While MDC F showed the highest mean costs for inpatient care ($4,807) and NSPE episodes ($4,857), outpatient costs were the highest in MDC J ($374). On the other hand, MDC D had the lowest mean costs for inpatient care ($1,019) and NSPE episodes ($1,104); outpatient costs were the lowest in MDC P ($9). The average number of diagnostic codes for comorbidities per episode was 16.9. The mean number of comorbidity codes was the largest in MDC P (48.4) and the smallest in MDC O (8.2).

Table 2 General characteristics of NSPE episodes

Model fit

The overall mean of R2 (41.6%) and adjusted R2 (adj. R2 40.8%) from MDC groups were the lowest in Model 0, which was non-risk-adjusted for comorbidities (Table 3). While using risk adjustment methods for comorbidities improved the performance compared to Model 0 in all models, the amount of improvement differed depending on the risk adjustment methods used. Model 2 using CCI (adj. R2 42.7%) showed a minor improvement over Model 0 (△1.9%), but it was inferior to other risk-adjusted models (Model 1, Model 3, Model 4). Although Model 1, including RDRG (adj. R2 45.8%), was superior to both Model 0 and Model 2, models using HCCs showed better performance than Model 1 (Model 3 adj. R2 46.3%, Model 4 adj. R2 45.9%). Model 3, risk-adjusted with NHIS-HCC, had the highest explanatory power among the five models. The trends mentioned above of model performance did not significantly change in the weighted means considering episode counts, as Model 3 and Model 4 (using HCCs) showed superiority in the explanatory power (Model 3 weighted adj. R2 51.0%, Model 4 weighted adj. R2 50.7%).

Table 3 R2 (%) and adjusted R2 (%) of models

In general, model performance according to MDC groups showed similar trends among the models (Fig. 2). First, the model without risk adjustment for comorbidities had the lowest explanatory power in all MDC groups. Second, Model 2 mostly had the second lowest adj. R2. Third, MDC P, MDC F, and MDC I showed relatively higher performance. The explanatory powers of MDC P ranged from 77.1% to 80.8%, which are the highest among the MDC groups. MDC F (adj. R2 60.2%–63.3%) and MDC I (adj. R2 54.1%–61.1%) ranked second and third adj. R2. Lastly, the figures of explanatory power in MDC P were comparable between Model 0 (adj. R2 77.1%) and Model 1 (adj. R2 77.1%), implying that RDRG does not adjust for comorbidities.

Fig. 2

Adjusted R2 (%) of models according to the MDC. ADRG, Adjacent Diagnosis Related Group; CCI, Charlson Comorbidity Index; HHS-HCC, Department of Health and Human Service Hierarchical Condition Category; MDC, Major Diagnostic Category; NHIS-HCC, National Health Insurance Service Hierarchical Condition Category; RDRG, Refined Diagnosis Related Group; R2, R-squared

Overall, MAE was superior in Model 1 using RDRG ($1,099) and inferior in Model 0 ($1,168), which was not risk-adjusted for comorbidities (Fig. 3). MAEs in individual MDC groups were also similar to the overall observation; while the values of MAE of Model 0 were the largest, they were the smallest in Model 1 in most MDCs except for MDC P, MDC ST, MDC UV, and MDC WXY. In MDC P, Model 4 using HHS-HCC ($1,238) was superior to other models; Model 0 and Model 1 had equal MAEs ($1,300), suggesting that there is no difference between the use of ADRG and RDRG. In MDC ST, Model 3 using NHIS-HCC ($1,170) had a smaller MAE than Model 1 using RDRG. In MDC UV, the MAE was the largest in Model 0 ($2,008) and the lowest in Model 4 ($1,928). While Model 4 ($1,363) presented the smallest MAE between models in MDC WXY, Model 2 ($1,433) showed the largest value. Model performance according to subgroups (sex, age group, type of medical institution, insurance type, and extreme actual costs) is shown in Table 4. In the subgroups of sex, medical institution, and insurance type, all PRs were 1.000, implying that the mean predicted costs were equal to the observed costs. In the subgroup analyses depending on the age group, the PRs were also 1.000 except for Model 1; the difference may suggest that the RDRG code embedded its unique age classification. Model 1 underestimated the group aged 60 years or older (PR 0.976) but overestimated other age groups (PR 1.011–1.105). In the actual cost groups, including both extreme values, the lower 10th percentile was overestimated (PR 3.341–3.601), and the upper 10th percentile was underestimated (PR 0.620–0.656). Additionally, estimates and values to test collinearity (Variance Inflation Factor, VIF, and Tolerance) were presented in Additional file 4.

Fig. 3

MAE of models according to the MDC. Unit: United States Dollar (USD), converted from South Korean Won (KRW) (1 USD = 1,100.58 KRW, 2018). ADRG, Adjacent Diagnosis Related Group; CCI, Charlson Comorbidity Index; HHS-HCC, Department of Health and Human Service Hierarchical Condition Category; MAE, Mean Absolute Error; MDC, Major Diagnostic Category; NHIS-HCC, National Health Insurance Service Hierarchical Condition Category; RDRG, Refined Diagnosis Related Group

Table 4 Predictive ratios of the models

In the sensitivity analyses to improve the residual distribution, the distributions were close to normal after log transformation or trimming outliers (Additional file 5). The models' explanatory power (adj. R2) using log-transformed cost (Model 5) or trimming costs (Model 6) improved in most MDC groups, except MDC P, MDC R, MDC ST, MDC UV, and MDC WXY (Fig. 4). In MDC P, treatment for skewed distribution dropped adj. R2 8.9% (log-transformed) and 48.8% (trimmed), respectively. While log transformation improved performance (△0.8%–△7.2%), trimming decreased explanatory power (△3.2%–△11.0%) in MDC R, MDC ST, MDC UV, and MDC WXY. The results of mixed-effect models are presented in Additional file 6. The ICCs ranged between 0.018 and 0.500 in individual MDC groups. In the multilevel analysis, MDC I showed the largest AIC and BIC, whereas the lowest values were observed in MDC M.

Fig. 4

Adjusted R2 (%) difference depending on outlier treatment compared to winsorized costs. IQR, Interquartile Range; MDC, Major Diagnostic Category; R2, R-squared

External validity

As in the 2018 dataset, the overall mean adj. R2 was the lowest in Model 0 in the 2017 dataset (Model 0 adj. R2 42.3%, Model 1 adj. R2 47.5%, Model 3 adj. R2 47.6%, Model 4 adj. R2 47.7%). Model 3 using NHIS-HCC showed the highest R2 in the 2018 dataset, whereas Model 4 using HHS-HCC had the highest explanatory power in the 2017 dataset. The weighted mean of adj. R2 showed a similar tendency (Model 0 adj. R2 47.5%, Model 1 adj. R2 53.1%, Model 3 adj. R2 52.5%, Model 4 adj. R2 52.5%). In each MDC group, the adj. R2 of Model 0 was inferior to those of the other models (Fig. 5). The explanatory powers of MDC P (adj. R2 81.0%–82.5%), MDC I (adj. R2 56.6%–63.3%), and MDC F (adj. R2 56.5%–60.0%) ranked the highest among the MDCs. MDC P also showed the same tendency as in the 2018 dataset, with no difference in explanatory power between Model 0 (adj. R2 81.0%) and Model 1 (adj. R2 81.0%). MDC UV had the lowest explanatory power (adj. R2 7.6%–8.9%), as in the 2018 dataset.

Fig. 5

External validity, adjusted R2 (%) of models according to the MDC. ADRG, Adjacent Diagnosis Related Group; HHS-HCC, Department of Health and Human Service Hierarchical Condition Category; MDC, Major Diagnostic Category; NHIS-HCC, National Health Insurance Service Hierarchical Condition Category; RDRG, Refined Diagnosis Related Group; R2, R-squared

In the validity results, overall MAEs ($954–$1,017) slightly decreased compared with the 2018 dataset ($1,099–$1,168) (Fig. 6). Model 1 showed superiority to other models in overall MAEs ($954) and MDC-specific MAEs ($271–$2,232). In MDC M, Model 4 using HHS-HCC had the smallest amount of MAE ($847) compared with other models ($872–$916). In MDC P, although Model 0 using ADRG and Model 1 using RDRG showed the lowest MAEs, RDRG did not seem to have been adjusted for comorbidities, considering the same values of adj. R2 between the two models. In MDC UV, Model 0 had the highest MAE ($1,704), whereas the values were lowest in Model 3 ($1,668) and Model 4 ($1,673).

Fig. 6

External validity, MAE of models according to the MDC. Unit: United States Dollar (USD), converted from South Korean Won (KRW) (1 USD = 1130.48 KRW, 2017). ADRG, Adjacent Diagnosis Related Group; HHS-HCC, Department of Health and Human Service Hierarchical Condition Category; MAE, Mean Absolute Error; MDC, Major Diagnostic Category; NHIS-HCC, National Health Insurance Service Hierarchical Condition Category; RDRG, Refined Diagnosis Related Group

Simulation of efficiency measures

Using the predicted values from the individual models, we calculated the NSPE indexes and presented them according to institution type (Table 5, Fig. 7). The average NSPE indexes were above 1 in all models, suggesting that average efficiency is worse than that of the benchmark institution representing the median value. Among the three types of institution, efficiency was best in general hospitals and worst in hospitals in all models. The average NSPE index was the highest (1.024) in Model 1 using RDRG and the lowest (1.007) in Model 2 using CCI (Table 5). Regarding the distribution of NSPE indexes, Model 2 showed the narrowest distribution (SD, 0.350), whereas Model 0 had the widest (SD, 0.370). The range of NSPE indexes was wider in Model 3 (5.177) than in the other models.

Table 5 Comparison of NSPE index between models

Fig. 7

NSPE index according to institution type. ADRG, Adjacent Diagnosis Related Group; CCI, Charlson Comorbidity Index; HHS-HCC, Department of Health and Human Service Hierarchical Condition Category; NHIS-HCC, National Health Insurance Service Hierarchical Condition Category; NSPE, National Health Insurance Service Spending Per Episode; RDRG, Refined Diagnosis Related Group

Discussion

Our study provides meaningful evidence on the risk adjustment of episode-based costs, reflecting recent interest in cost containment and efficiency measurement. First, our results support a fundamental principle in risk adjustment: the choice of risk adjustment method should be based on the outcome of interest [11]. The model using CCI (developed for mortality adjustment) was not superior to any of the risk adjustment methods specific to cost estimation, though it showed a slight improvement over the model not adjusted for comorbidities (not adjusted adj. R2 40.8%, CCI adj. R2 42.7%, methods specific to cost estimation adj. R2 45.8%–46.3%; Table 3). Second, HCCs were preferable to RDRG for efficiency measurement. Overall explanatory powers were higher in the HCC models (CCI adj. R2 42.7%, RDRG adj. R2 45.8%, NHIS-HCC adj. R2 46.3%, HHS-HCC adj. R2 45.9%; Table 3). Although the MAE was the smallest in the RDRG model (CCI MAE $1,158, RDRG MAE $1,099, NHIS-HCC MAE $1,126, HHS-HCC MAE $1,129; Fig. 3), RDRG does not differentiate complications from comorbidities for risk adjustment in the current Korean Diagnosis Related Group (KDRG) system [23]. In addition, the good model fit of RDRG is more likely due to the use of RDRG to determine payment for seven diseases within the KDRG-based payment system [40]. Third, we introduced HHS-HCC in the context of South Korea because NHIS-HCC is limited by its focus on the older population [18, 33]. Adjustment methods should be comprehensive, given that the purpose of risk adjustment here is hospital efficiency measurement. Although NHIS-HCC has shown its validity in several studies in South Korea [15,16,17], it does not precisely fit the quality evaluation of hospitals because of its limited coverage of diseases. Hospitals providing a large volume of obstetric or pediatric services can be disadvantaged in the evaluation. Fourth, our research design takes a pragmatic approach. Although various studies have shown the superiority of HCCs, they evaluated model performance based on annual costs. Depending on the reimbursement system, costs can be estimated per year, per episode, or on another unit, and the factors contributing to rising costs can differ depending on the cost unit. Therefore, a strength of our study is that the models are based on episode-unit costs, reflecting how they are actually used.

Across MDC groups, we observed performance patterns in each model similar to those in previous research using DRGs (Centers for Medicare and Medicaid Services Diagnosis Related Groups, CMS-DRG; Consolidated Severity-Adjusted DRGs, Con-APR DRG; Medicare Severity Diagnosis Related Groups, MS-DRG; RDRG). As in prior studies [41, 42], all models showed higher explanatory powers in MDC F (Diseases and Disorders of the Circulatory System, adj. R2 60.2%–63.3%) and MDC I (Diseases and Disorders of the Musculoskeletal System and Connective Tissue, adj. R2 54.1%–61.1%) than in the other MDC groups (Fig. 2). MDC UV (Mental Diseases and Disorders, adj. R2 7.7%–12.1%) also followed previous research outcomes, with the lowest explanatory power. In MDC P, even the unadjusted model including only ADRGs performed relatively well, with an adj. R2 above 70% (77.1%). However, the RDRG model (adj. R2 77.1%) did not improve model fit compared with the unadjusted model. The identical number of code types for ADRG (n = 26) and RDRG (n = 26) implies that the KDRG system does not perform risk adjustment in MDC P.

There are several limitations in our study. First, because the HIRA-NPS is a cross-sectional dataset, we could not secure a sufficiently long period to define the index admission and the lookback period for identifying comorbidities [21]. Because index admissions were confined to April through November, seasonal variation in the epidemiological data could not be considered [43]. A longitudinal dataset might be a fundamental solution to the issues in defining the time period. Additionally, present-on-admission (POA) indicators can be a strategy for using claims data efficiently. Although the current Korean health insurance system does not provide POA indicators for research, they differentiate comorbidities and complications in the claims data [44]. Therefore, the use of POA indicators could reduce the lookback period. Second, we used HCCs based on the 7th Korean modification of the ICD-10 (KCD-7), which were transformed from versions developed for the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) and the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM). Therefore, information loss is inevitable during the transformation process because of the limited transferability of ICD codes between countries. In particular, the ICD-9-CM or ICD-10-CM coding systems are more fragmented due to the inclusion of procedure codes [45]. Third, the Korean claims system collects only the payer’s amount and a portion of the out-of-pocket cost (i.e., statutory payment by the patient) and does not include items not covered by the payer (non-payment items). According to the benefit coverage rate survey, non-payment items comprised 15.6% of the total annual expenditure in 2018 [46]. In addition, the proportion of non-payment items varied depending on institutional types and disease groups. For example, while non-payment items accounted for 33.0% in hospitals, they accounted for 11.4% and 11.6% in tertiary and general hospitals, respectively [46]. Furthermore, depending on disease groups, non-payment items ranged from 0.4% in human immunodeficiency virus disease to 22.9% in malignant neoplasms of female genital organs [46]. These differences suggest that total costs might differ between MDC groups after non-payment items are included.

There are still opportunities to improve the models by introducing more sophisticated statistical methods in further studies. Our study tried to tackle the skewed distribution in the sensitivity analyses. After observing an improved distribution when winsorizing the cost at the 0.5 percentile (Additional file 5), we used the winsorized costs in our basic models. We also explored log transformation and trimming. With log transformation, performance improved in all MDC groups except MDC P (Fig. 4). On the other hand, the reduction in explanatory power in several MDCs (MDC P, MDC R, MDC ST, MDC UV, and MDC WXY) might imply significant information loss from trimming at the IQR (Fig. 4). We reached tentative conclusions, such as the benefits of using winsorized costs and the inappropriateness of trimming. Nevertheless, further studies should consider more rigorous statistical techniques for skewed cost data, such as weighted least squares, the Generalized Linear Model (GLM) with gamma distribution, and constrained regression [14, 47, 48]. Additionally, we explored clustering effects by type of medical institution. The ICCs (0.018–0.500) suggest that costs from different institutional types were more discrepant from one another than costs within the same type of hospital (Additional file 6). Our multilevel analysis results suggest that clustering effects warrant further investigation. The inferior model performance in MDC I (the largest AIC and BIC) differs from the results of our basic model using linear regression and from previous research comparing performance between MDC groups. The basic OLS regression models included institution types as independent variables considering the Korean Resource-Based Relative Value Scale (RBRVS) weighting scheme. Within the Korean RBRVS scheme, services in upper-level hospitals are reimbursed at higher rates than those in lower-level institutions [49]. There might be little difference between types of hospitals in a single-insurer system like South Korea, except for service types and comorbidities. More studies are needed to investigate clustering effects on cost estimation within the context of the insurance system.

Conclusions

Our results suggest using risk adjustment methods specific to costs, such as HCCs, rather than CCI or risk-adjusted DRG in episode-based efficiency measurements. However, the subtle difference between the two HCCs suggests that more studies are needed to evaluate and further tailor them. Nevertheless, with recent increasing attention to efficiency, our methods and results can contribute to adopting and scaling up efficiency measures in the value-based payment system.