Introduction

Total knee arthroplasty (TKA) is a safe and effective treatment for end-stage osteoarthritis and is among the most common surgical procedures performed in the USA. National projections anticipate a substantial increase in TKA utilization and its associated economic burden well into the foreseeable future [1]. As healthcare systems shift toward value and patient satisfaction, increased emphasis has been placed on risk stratification, perioperative optimization, and improving value in TKA care delivery [2,3,4]. As such, considerable effort has been put forth to develop models to predict clinical outcomes after TKA [5]. More recently, artificial intelligence and machine learning have been heavily explored as potential tools to improve the predictive capacity of these models [6,7,8].

Machine learning (ML) is a subset of artificial intelligence (AI) that automates analytical model building by employing algorithms that progressively “learn” and improve from data [9, 10]. As technological shifts in the healthcare system have allowed for the accumulation and organization of large amounts of data, ML has shown immense promise for numerous applications within healthcare. Recent studies within the orthopedic literature have applied ML to develop models to predict mortality, readmission rates, complication rates, length of stay (LOS), and patient-reported postoperative outcomes [8, 11,12,13,14,15]. Such predictive models have numerous potential benefits, including identifying patients at risk for worse outcomes, which allows for improved patient selection, targeted perioperative optimization, and stratification for risk-based alternative payment models. While these promising studies demonstrate the potential of ML to predict outcomes and improve value within orthopedics, they are typically limited in their choice of training variables and often employ a single elementary algorithm, without justification for the selection of either the algorithm or the variables. As a whole, there remains a critical need to develop and comparatively analyze the predictive capacity of various ML algorithms and to identify and select the relevant input variables used to train these models.

In that context, the purpose of this study was to comparatively evaluate the performance of ten different machine learning models in predicting LOS, mortality, and discharge disposition following TKA and to compare the performance of the best performing models developed with patient-specific vs. situational variables.

Methods

Data source and study sample

The National Inpatient Sample (NIS), an expansive public database containing data on more than 7 million hospital stays in the US for the years 2016 and 2017, was utilized for this retrospective analysis and ML model development. Given the use of the International Classification of Diseases, Tenth Revision (ICD-10) coding system in the database during the study period, the ICD-10-Procedure Coding System (ICD-10-PCS) codes for TKA were utilized to identify the study population (Additional file 1). Patients undergoing a conversion or revision TKA, younger than 18 years of age, or missing age information were excluded from the study population. This strategy resulted in a total of 305,577 discharges that were included in the current study.

Predictive and outcome variables selection

All available variables in the NIS database were considered and assessed for inclusion in this study. For the initial step of the study, 15 predictive variables were included in building and assessing ten different ML models. In the second step of the study, these were divided into 8 patient-specific variables (Age, Sex, Race, Total number of diagnoses, All Patient Refined Diagnosis Related Groups (APRDRG) Severity of illness, APRDRG Mortality risk, Income zip quartile, and Primary payer) and 7 situational variables (Patient Location, Month of the procedure, Hospital Division, Hospital Region, Hospital Teaching status, Hospital Bed size, and Hospital Control). These features were manually selected by the authors from all available variables in the NIS database. The outcome variables were in-hospital mortality (binary yes/no), discharge disposition (home vs. facility), and length of stay (≤ 2 vs. > 2 days) among primary TKA recipients. The LOS cutoff was determined by calculating the average LOS for the entire cohort and taking the closest lower integer to create the binary outcome. Patient discharge destination was coded as either home (discharge to home or home health care) or facility (all other dispositions to a facility, such as skilled nursing facilities or inpatient rehabilitation centers). Records missing information on these variables were removed from the study sample.
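The outcome recoding described above can be sketched as follows. This is a minimal illustration only: the function names, field values, and disposition codes here are hypothetical, not the NIS coding scheme.

```python
# Hypothetical sketch of the binary outcome recoding; codes are illustrative.
HOME_CODES = {"home", "home health care"}

def code_los(los_days):
    # Cutoff of 2 days: the closest integer below the cohort's average LOS.
    return "<=2" if los_days <= 2 else ">2"

def code_disposition(disposition):
    # Home vs. any facility disposition (e.g., skilled nursing, rehab).
    return "home" if disposition in HOME_CODES else "facility"

print(code_los(1), code_los(3))                       # <=2 >2
print(code_disposition("home health care"))           # home
print(code_disposition("skilled nursing facility"))   # facility
```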

Data handling and machine learning models development

SPSS Modeler (IBM, Armonk, NY, USA), a data mining and predictive analytics software, was utilized to develop the models based on commonly used ML techniques. The algorithmic methods implemented included Random Forest (RF), Neural Network (NN), Extreme Gradient Boost Tree (XGBoost Tree), Extreme Gradient Boost Linear (XGBoost Linear), Linear Support Vector Machine (LSVM), Chi-square Automatic Interaction Detection (CHAID), Decision List, Linear Discriminant Analysis (Discriminant), Logistic Regression, and Bayesian Networks. These methods were selected as they are well-studied, commonly used ML methods in the medical literature and are distinct in their pattern recognition approaches (Table 1) [8,9,10, 12].

Table 1 Description of machine learning models

For each technique and for each outcome variable, a new algorithm was developed. The overall data set was split using random sampling into three separate groups: a training, testing, and validation cohort. A total of 80% of the data were used to train and test the models, while the remaining 20% was employed to validate the model parameters. The training–testing subset was subsequently divided into 80% training and 20% testing, yielding a final distribution of 64% for training, 16% for testing, and 20% for model validation. There was no leakage between these phases, as mutually exclusive sets were used to train, test, and then validate each predictive algorithm.
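The nested 80/20 split described above can be sketched in a few lines. The study itself used SPSS Modeler's partitioning; the record list and seed below are hypothetical stand-ins to show how the 64%/16%/20% proportions arise and how mutual exclusivity is preserved.

```python
import random

# Hypothetical stand-in for the 305,577 NIS discharges.
records = list(range(1000))
random.seed(0)
random.shuffle(records)

# First split: 80% train-test pool, 20% held-out validation.
cut = int(0.80 * len(records))
pool, validation = records[:cut], records[cut:]

# Second split: 80/20 within the pool -> 64% / 16% of the full sample.
cut2 = int(0.80 * len(pool))
train, test = pool[:cut2], pool[cut2:]

print(len(train), len(test), len(validation))  # 640 160 200
```

Because the three lists are slices of one shuffled sequence, no record can appear in more than one phase.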

When predicting outcomes with a low incidence rate, models tend to be biased against the minority outcome, yielding an imbalanced predictive capacity [16]. To avoid this, when imbalanced outcome frequencies were encountered, the Synthetic Minority Oversampling Technique (SMOTE) was deployed to resample the training set before training the ML classifiers [17, 18]. Although SMOTE has been validated as a measure to minimize the impact of this bias and improves the classifier’s predictive ability for minority outcomes, the correction remains imperfect.
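The core idea of SMOTE can be illustrated with a toy example. This sketch is simplified (it interpolates between a random pair of minority samples, whereas standard SMOTE interpolates toward one of the k nearest minority neighbours), and all data values are hypothetical.

```python
import random

random.seed(1)

# Toy imbalanced training set: the second class is the rare outcome,
# analogous to in-hospital mortality (0.1% in this cohort).
majority = [(random.random(), random.random()) for _ in range(95)]
minority = [(2 + random.random(), 2 + random.random()) for _ in range(5)]

def smote(samples, n_new):
    """Synthesize minority points by interpolating between two minority
    samples (standard SMOTE restricts the pair to k-nearest neighbours)."""
    synthetic = []
    for _ in range(n_new):
        a, b = random.sample(samples, 2)
        t = random.random()  # random position along the segment a -> b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

resampled_minority = minority + smote(minority, len(majority) - len(minority))
print(len(majority), len(resampled_minority))  # 95 95
```

Because the synthetic points are convex combinations of real minority samples, they stay inside the minority region of feature space rather than simply duplicating existing records.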

Statistical analysis

The comparative analysis of the different ML models consisted of assessment of responsiveness and reliability of the predictions for all models. Responsiveness is a measure of successful prediction of variable outcomes and was quantified with the area under the curve (AUC) for the receiver operating characteristic (ROC) curve. AUCROC measurements were generated by assessing true positive rates vs. false positive rates during the training, testing, and validation phases of each model. For this study, responsiveness was considered excellent for an AUCROC of 0.90–1.00, good for 0.80–0.90, fair for 0.70–0.80, poor for 0.60–0.70, and fail for 0.50–0.60. Reliability of the ML models was measured by the overall performance accuracy, quantified as the percentage of correct predictions achieved by the model.

All ten ML models were trained, tested, and validated to assess responsiveness and reliability. The first step of the study aimed at analyzing and comparing the predictive performance of these ML models in identifying the outcome variables after primary TKA: in-hospital mortality, discharge disposition, and LOS. The validation phase, utilizing 20% of the sample, served as the main assessment and was quantified with responsiveness and reliability. Once the development and comparative assessment of the different ML models were completed, the three algorithmic methodologies with the highest accuracy for each outcome variable were identified. The second step of the study consisted of developing and comparing the predictive performance of these top three ML methodologies for the same set of outcome measures while using patient-specific or situational predictive variables. All statistical analyses were performed with SPSS Modeler version 18.2.2 (IBM, Armonk, NY, USA).

Results

This study included a total of 305,577 discharges for primary TKA, with an average patient age of 66.51 years. Descriptive statistics for the distributions of the aforementioned predictive variables are included in Table 2. The study population had an in-hospital mortality rate of 0.1%, a home discharge rate of 79.6%, and an average LOS of 2.41 days.

Table 2 Demographic variables of the study population

For models developed using all 15 variables, the three most responsive models for LOS were LSVM, Neural Network, and Bayesian Network, with poor results measuring 0.684, 0.668, and 0.664, respectively (Table 3). The three most reliable models for LOS were Decision List, LSVM, and CHAID. Decision List had a good reliability of 85.44%, while LSVM and CHAID had a poor reliability of 66.55% and 65.63%, respectively. Figure 1 provides the ROC curves for the training, testing, and validation phases for the LSVM model predicting LOS.

The three most responsive models for discharge disposition were LSVM, XGBoost Tree, and XGBoost Linear, with fair performance and respective values of 0.747, 0.747, and 0.722 (Table 4). The two most reliable models for discharge disposition, both with good reliability, were Decision List and LSVM, measuring 89.81% and 80.26%, respectively; the third most reliable, with fair results, was CHAID at 79.80%. Figure 2 provides the ROC curves for the training, testing, and validation phases for the LSVM model predicting discharge disposition.

The top four models that yielded excellent responsiveness for in-hospital mortality were LSVM, XGBoost Linear, Neural Network, and Logistic Regression, with values of 0.997, 0.997, and 0.996, respectively (Table 5). The most reliable models for in-hospital mortality, all with excellent reliability, were XGBoost Tree, Decision List, LSVM, and CHAID, with values of 99.98%, 99.91%, 99.89%, and 99.89%, respectively. Figure 3 provides the ROC curves for the training, testing, and validation phases for the LSVM model predicting in-hospital mortality.

Table 3 Responsiveness and reliability in predicting length of stay for the 10 models developed using all 15 variables
Fig. 1
figure 1

ROC curves for the training, testing, and validation phases for the LSVM model predicting LOS

Table 4 Responsiveness and reliability in predicting discharge disposition for the 10 models developed using all 15 variables
Fig. 2
figure 2

ROC curves for the training, testing, and validation phases for the LSVM model predicting discharge disposition

Table 5 Responsiveness and reliability in predicting mortality for the 10 models developed using all 15 variables
Fig. 3
figure 3

ROC curves for the training, testing, and validation phases for the LSVM model predicting in-hospital mortality

Separate models were then developed using the three most reliable algorithms for each outcome, and their predictive performance was compared when trained using either patient-specific or situational variables. Tables 6 and 7 describe the performance of models developed with patient-specific and situational variables, respectively. For nearly all outcomes, responsiveness was higher for each algorithm when trained with patient-specific rather than situational variables; the only exception was CHAID, which performed marginally better at predicting LOS when developed with situational variables. Similarly, reliability was higher for most algorithms when models were developed using patient-specific as opposed to situational variables. The exceptions were higher reliability for CHAID in predicting LOS and for Decision List in predicting discharge disposition when developed using situational variables, and equivalent reliability of XGBoost Tree and LSVM in predicting mortality when developed using either variable set.

Table 6 Responsiveness and reliability in predicting length of stay, discharge disposition, and mortality for the best performing three models when trained with patient-specific variables only
Table 7 Responsiveness and reliability in predicting length of stay, discharge disposition, and mortality for the best performing three models when trained with situational variables only

Discussion

TKA is one of the most common procedures performed in the United States, with a considerable associated economic burden. As healthcare systems continue to aim to optimize the value of care delivery, there has been a growing focus on standardizing outcomes and establishing accurate risk assessment prior to TKA [5, 19]. More recently, ML has been applied to develop models to predict outcomes after TKA [8, 13, 14, 20]. As such, the aim of this study was to develop and compare the performance of multiple ML models to predict in-hospital mortality, LOS, and discharge disposition after TKA and to compare the performance of models trained using patient-specific versus situational variables.

Selecting an appropriate algorithm for training is critical in developing a predictive ML model. As the number of ML algorithms grows, there has been a concerted effort within the medical literature to compare ML algorithms and identify which are optimal for a given set of data and diseases [21, 22]. However, within the nascent orthopedic ML literature, different ML algorithms have seldom been compared when developing predictive models. Therefore, this study aimed to assess the performance of ten different ML models for prediction of LOS, mortality, and discharge disposition after TKA. When comparing the different ML models using fifteen independent variables available in the NIS database, the LSVM methodology was consistently the most responsive and reliable, ranking within the top three best-performing ML models in predicting all tested outcomes. This result is not surprising, as support vector machine algorithms have consistently been among the most widely used ML predictive algorithms [22]. Still, other studies in the general medical literature have shown superior performance of other algorithms for the prediction of other outcomes [21,22,23]. As such, it should be noted that if different variables or outcomes are tested in a different study, a different ML algorithm may prove more effective and accurate in its predictive capacity. As clinical application of ML continues to evolve, it should be stressed that various ML methodologies should be tested prior to developing and deploying a model for clinical use.

The selection of the optimal independent variables or features to train models is a cornerstone of supervised ML. Redundant variables can complicate models without increasing predictive accuracy, while a deficiency of variables can oversimplify models without capturing the true complexity of a given use case. In the nascent TKA-related ML literature, there has typically been little justification for the variables selected to train models. Therefore, the predictive capacity of various models trained with either patient-specific or situational variables was compared. As both patient-specific factors, such as age, and situational variables, such as hospital volume, have been shown to correlate with outcomes after TKA, this distinction would be useful for the development of future models for use in clinical practice. Our analysis demonstrated consistently better performance of models developed with the 8 patient-specific variables when compared to models developed using the 7 situational variables. These results, while stressing the importance of patient-specific variables, also highlight the potential of a smaller number of variables to yield equivalent predictive models. A similar concept was demonstrated recently in a study on heart failure patients that reported equivalent performance of an ML model using only 8 variables compared to one using a full set of 47 variables [24]. Continued research within the orthopedic literature on variable engineering and selection is critical, and identifying the most predictive variables will prove useful for the development of models that will be deployed in clinical practice.

There were several limitations to this study. The strength of ML models is dependent on the quality of the data used to train, test, and validate the algorithms, and administrative databases may be prone to incompleteness and errors [25]. However, the NIS has been demonstrated to be an appropriate database for predictive large population-based studies, and administratively coded comorbidity data have been previously validated as accurate [26]. Another limitation is that the LOS outcome was adjusted to be binary to simplify outcomes and provide more accurate analysis; this adjustment is useful in the setting of predictive ML at the expense of more precise predictions. However, despite the continuous nature of LOS as a variable, when quality-improvement efforts are implemented in the clinical setting, the target for improvement in LOS is generally a binary cutoff, and so a binary predictive model has practical use. Another limitation is that the findings of this study were not externally validated. Although external validation was not within the scope of the study, efforts were made to internally validate the results, as the dataset was split into 64% training, 16% testing, and 20% validation groups. Results were consistent across these phases for all models, supporting the internal validity of the findings. Still, comparison with another data source would be useful to assess the generalizability of each ML model and the replicability of the findings in this study.

There were several strengths to this study. This study represents a novel attempt in the orthopedic literature to analyze a large variety of ML algorithms to develop the best-performing model. Our analysis of multiple ML algorithms generates insights into the performance of these various algorithms for multiple outcomes, which has seldom been reported in the orthopedic literature. Additionally, by demonstrating the generally superior performance of models trained on patient-specific variables over situational variables, this study highlights the role that patient-specific factors play in determining critical quality outcome metrics within the available dataset. These insights should empower efforts aimed at influencing both clinical practice and reimbursement models, which typically do not consider patient factors despite their demonstrably substantial impact on various quality metrics.

Conclusion

In summary, this study compared ten ML models developed using different algorithms to predict three important quality metrics: mortality, LOS, and discharge disposition. Models developed using patient-specific variables performed better than models developed using situational variables. As efforts continue to develop ML models and to identify which ML algorithms are optimal for a given set of conditions and outcomes, these results may prove useful in the development of predictive ML models for accurate risk assessment and stratification in TKA.