Introduction

Artificial intelligence (AI) and Machine learning (ML) tools in knee arthroplasty (KA) have the potential to improve patient-centered decision-making and outcome prediction in orthopedics. The application of ML in KA has been useful for predicting implant size, reconstructing data, and assisting with component positioning and alignment. ML implementation enhances surgical precision and can help predict parameters such as length of hospitalization, healthcare costs, and discharge disposition [1,2,3]. 

Additionally, more recent studies have shown that ML algorithms can support drug selection for the treatment of prosthetic joint infection (PJI), enabling a more patient-specific approach to medicine; this was made possible by a Random Forest (RF) model that accounts for several risk variables, such as patient characteristics and comorbidities, and uses them for the selection [4]. In data science, both the quantity and the quality of input parameters are crucial; the variables mentioned above, although beneficial in theory, may therefore hinder the full potential of ML algorithms for KA if they are not selected for their relevance to the topic of each study. With a large number of inputs, a model that analyzes all underlying relations between variables may pick up irrelevant patterns, increasing the risk of overfitting, in which the algorithm performs significantly better on the training data than on newly presented data [4, 5].

Moreover, patient satisfaction following primary KA is one of many outcome measures currently used to assess the efficacy of this procedure. Patient satisfaction depends on many factors, such as age, gender, and the presence of comorbidities. Therefore, it is essential to understand the relationship between the variables underlying satisfaction in order to provide optimized postoperative care for KA patients. ML algorithms, capable of generating patient-specific risk models, appear to be a very effective means of achieving this goal [6].

Overall, the application of ML and AI in orthopedics is beneficial not only in the situations mentioned above, but also for identifying patients at high risk of severe walking limitations after total knee arthroplasty [7] and for selecting high-risk patients who will require a blood transfusion after KA [8].

This review investigates which predictions are achievable using AI and ML models in knee arthroplasty and identifies prerequisites for the effective use of this new approach. The second aim is to highlight the latest findings on these technologies in predicting outcomes after KA.

Materials and methods

Study selection

The research question was defined using a PIO approach: Population (P), Intervention (I), and Outcome (O). The objective of this systematic review was to investigate which outcomes can be assessed by using AI or ML models (I) in patients with knee osteoarthritis who underwent total (TKA) or unicompartmental (UKA) knee replacement (P). The following outcomes were considered: complications, costs, functional outcomes, revision rate, and postoperative satisfaction (O).

Inclusion criteria

Only articles that evaluated AI/ML-based applications in clinical decision-making in knee arthroplasty were considered. Only original clinical studies written in English, Spanish, or Italian were screened.

Exclusion criteria

The following were excluded: studies that did not evaluate AI/ML applications in KA; studies with nonhuman subjects; medical imaging analysis studies without explicit reference or application to KA; and inaccessible articles, conference abstracts, reviews, and editorials. No limits were placed on the level of evidence or publication date of the study.

Search

Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, a thorough literature search was conducted using the following string: ((((total) OR (unicompartmental or unicondylar)) AND (knee replacement)) AND (((artificial intelligence) OR (machine learning)) OR (algorithm))) AND ((((((((complications) OR ((blood) AND ((transfusion) OR (loss)))) OR (functional outcomes)) OR (revision)) OR (satisfaction)) OR (surgical technique)) OR ((length of stay) OR (hospitalization))) OR ((costs) OR (economic analysis))). Keywords were used both in combination and individually. The following databases were used: MEDLINE (Medical Literature Analysis and Retrieval System Online), Scopus, CINAHL, Google Scholar, PubMed, and EMBASE (Excerpta Medica Database). The reference lists of selected systematic reviews [2, 5] were searched to identify further studies. Two authors (F.V. and M.V.C.) conducted the search between June 2022 and January 2024, screening each database from its inception to January 2024.

Data collection process

Two independent reviewers (F.V. and M.V.C.) collected the data, and differences were resolved by mutual agreement. A third reviewer (S.D.S.) was consulted in case of any disagreement. Title and abstract screening was the first step, followed by the full-text evaluation of the selected articles. The inclusion and exclusion of the reviewed studies are shown in the PRISMA flowchart (Fig. 1).

Fig. 1 PRISMA flowchart

Data items

A database was developed by collecting and categorizing the general study characteristics from the selected articles, which comprised: primary author, year of publication, study design, level of evidence, study duration, AI/ML methods, data source, input variables, output variables, sample size, average patient age, percentage of female patients, Area Under the Receiver Operating Characteristic Curve (AUC-ROC), accuracy, sensitivity, and specificity.
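The four performance metrics collected above can be illustrated with a minimal Python sketch. The labels and scores below are invented toy values, not data from any reviewed study, and the 0.5 decision threshold is an assumption made only for illustration. The AUC-ROC is computed here through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```python
from typing import List, Tuple


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(tp: int, fn: int) -> float:
    """Share of true positives among all actual positives."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of true negatives among all actual negatives."""
    return tn / (tn + fp)


def auc_roc(y_true: List[int], scores: List[float]) -> float:
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = outcome occurred, 0 = outcome did not occur.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas the AUC-ROC summarizes discrimination across all thresholds, which is one reason it is convenient for comparing models across studies.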

Risk of bias assessment

For the quality assessment, a modified eight-item Methodological Index for Non-Randomized Studies (MINORS) checklist was employed to evaluate the selected articles. The eight-item checklist included: disclosure, study aim, input feature, output feature, validation method, dataset distribution, performance metric, and AI model. Each item was scored using the following binary scale: 0 (not reported or unclear) and 1 (reported and adequate). The following criteria were used as a guide when assessing the quality of each publication: 

Disclosure: scored 1 if possible conflicts of interest, funding, or ethical considerations were clearly reported, scored 0 if not reported or unclear. Study aim: scored 1 if the research question and/or objective were clearly reported, scored 0 if unclear or not reported. Input feature: scored 1 if variables were clearly reported, scored 0 if unclear or not reported. Output feature: scored 1 if clearly reported, scored 0 if unclear or not reported. Validation method (the evaluation of the AI/ML model’s performance by specific methods): scored 1 if external validation, cross-validation, and/or bootstrapping were used and clearly reported, scored 0 if neither used nor reported. Dataset distribution: scored 1 if the phases of training, testing, and validation for the AI/ML methods were clearly reported, scored 0 if unclear or not reported. Performance metric: scored 1 if the study clearly reported the metrics accuracy, sensitivity, specificity, and/or AUC-ROC for assessing the AI/ML model performance, scored 0 if unclear or not reported. AI model: scored 1 if the specific AI/ML algorithm used by the study was clearly stated, scored 0 if not clearly stated.

Compared to the original MINORS checklist, this modified version, proposed by [9], provides a more accurate grading tool for studies that apply AI/ML methods in medical research and diagnostics. Two independent reviewers (F.V. and M.V.C.) evaluated each publication individually.
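The binary scoring scheme described above can be sketched in a few lines of Python. The item identifiers and the example ratings are hypothetical names and values chosen for illustration; they are not taken from the checklist publication or from any reviewed article.

```python
# The eight items of the modified MINORS checklist, as listed in the text.
# The snake_case identifiers are an assumption made for this sketch.
MINORS_ITEMS = (
    "disclosure",
    "study_aim",
    "input_feature",
    "output_feature",
    "validation_method",
    "dataset_distribution",
    "performance_metric",
    "ai_model",
)


def minors_score(ratings: dict) -> int:
    """Sum the binary item ratings (0 = not reported or unclear, 1 = reported and adequate)."""
    missing = [item for item in MINORS_ITEMS if item not in ratings]
    if missing:
        raise ValueError(f"missing items: {missing}")
    for item in MINORS_ITEMS:
        if ratings[item] not in (0, 1):
            raise ValueError(f"item {item!r} must be rated 0 or 1")
    return sum(ratings[item] for item in MINORS_ITEMS)


# Hypothetical study that reports everything except a performance metric.
example_study = {item: 1 for item in MINORS_ITEMS}
example_study["performance_metric"] = 0
total = minors_score(example_study)  # maximum possible score is 8
```

Validating that every item is present and rated 0 or 1 keeps the total comparable across studies, since an omitted item would otherwise silently lower the score.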

Results

Study selection

The initial search identified 654 studies. After duplicate removal, 479 studies were screened, of which 402 were excluded after title/abstract examination, leaving 77 records for full-text evaluation. After the full-text assessment, 49 studies were included in the data analysis (Fig. 1). Of the 28 articles excluded at this stage, 9 did not evaluate AI/ML applications in knee arthroplasty, 8 were medical imaging analysis studies without explicit reference or application to knee arthroplasty, 2 used nonhuman subjects, and 9 were inaccessible articles or systematic reviews.
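The selection counts reported above can be checked with a short consistency sketch. All input numbers are those stated in the text; the number of duplicates removed is not stated and is derived here by subtraction, and the dictionary keys are shorthand labels introduced for this example.

```python
# Counts stated in the text.
identified = 654
screened_after_duplicates = 479
excluded_title_abstract = 402
included = 49

# Reasons for exclusion at the full-text stage, as stated in the text.
full_text_exclusions = {
    "no AI/ML application in KA": 9,
    "imaging study without KA application": 8,
    "nonhuman subjects": 2,
    "inaccessible article or systematic review": 9,
}

# Derived values (not stated explicitly in the text).
duplicates_removed = identified - screened_after_duplicates
full_text_records = screened_after_duplicates - excluded_title_abstract
excluded_full_text = full_text_records - included

# The itemized reasons should account for every full-text exclusion.
reasons_total = sum(full_text_exclusions.values())
flow_is_consistent = reasons_total == excluded_full_text
```

Under these figures, the itemized exclusion reasons add up to the difference between the records assessed in full text and the studies included.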

Study characteristics

A total of 2,595,780 patients were identified from 48 of the 49 included studies, with one study [10] not providing the sample size. Thirty-seven of the 49 studies stated the percentage of female patients, adding up to 1,435,218 female patients, or 55.29% of the total patients. The overall average age of the patients was 70.2 ± 7.9 years, with 33 out of 49 articles providing an average age of the study population. The study with the highest number of patients was Hyer et al., 2020 [11], with 1,049,160 patients (40.41% of all the patients included in the studies). All the study characteristics are reported in Table 1.

Table 1 Study characteristics

The five most common AI/ML models used were: RF, used in 19 articles; Gradient Boosting Machine (GBM), used in 18 articles (including more specific variants such as Extreme Gradient Boosting (XGBoost) and Stochastic Gradient Boosting (SGB)); Artificial Neural Network (ANN), used in 17 articles; Logistic regression (LR), used in 16 articles (including more specific variants such as Elastic-net penalized logistic regression (EPLR)); and Support Vector Machine (SVM), used in 13 articles.

Regarding the variables reported, the most common input variables were: age [38, 41, 45, 47, 49, 50, 52] (44 articles), sex (33 articles), comorbidities (29 articles), BMI (27 articles), race/ethnicity (26 articles), and ASA classification (10 articles). The most common output variables were: post-surgical complications (11 articles), probability of TKA (7 articles), and length of stay (4 articles).

This review included studies with level of evidence II-IV. Level II studies consist of randomized controlled trials (RCTs) and are considered one of the strongest study designs, second only to reviews and meta-analyses, which are considered level of evidence I; level III studies are composed of non-randomized controlled trials; the last category of evidence included in the review is level IV, case–control studies assessing associations between exposure and outcome.

The selected articles had the following levels of evidence: 37 level III retrospective cohort studies [6, 8, 10,11,12,13,14,15,16,17,18,19, 23,24,25,26,27,28,29,30,31,32,33,34,35,36,37, 40, 48, 51]; three level III diagnostic studies [20,21,22, 54]; three level II prospective cohort studies [4, 39, 53]; one level II comparative study [46]; three level IV cohort pilot studies [42, 44, 55]; and one level III multi-center retrospective study [47]. One study [43] did not present the level of evidence. All the characteristics are reported in Tables 1, 2 and 3.

Table 2 AI/ML methods
Table 3 Input and output variables

AI and ML methods

The following section reports the AI and ML methods identified in the reviewed articles. Each subsection includes the number of articles that used each AI or ML method, its corresponding AUC value, and the evaluated output variable. Table 4 classifies each article according to the output variable studied and presents the highest AUC score for the respective article.

Table 4 Output variables

Random forest

RF is a decision tree-based algorithm introduced in the 2000s that can handle a variety of data types; its adoption in many medical fields is supported by its high performance with large datasets and its ability to integrate both clinical and imaging data to achieve more accurate predictions than older models such as LR. This ML method operates by constructing a multitude of decision trees, a simpler ML method, and averaging their outputs. Because each tree analyzes a randomly selected subset of the original data, the model can analyze large and complex datasets, is more resistant to overfitting, and adds diversity to the analysis. It was the most common AI method, applied in 38.77% of the reviewed articles. It was mainly used to evaluate outcomes, one of them being a technical outcome: TKA component size prediction (femoral and tibial) [35]. Eight publications implemented RF for the evaluation of clinical outcomes, including achievement of Minimal Clinically Important Differences (MCIDs), prediction of Patient Reported Outcomes (PROs), prolonged postoperative opioid prescription, improvement of the Knee injury and Osteoarthritis Outcome Score (KOOS) at one year, dissatisfaction, and assessment of sensitization in patients with chronic pain after TKA [4, 6, 20, 26, 32, 33, 45, 53]. Only one article evaluated post-TKA walking limitation with RF, under the functional outcome category [7, 56].
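The mechanism described above (many simple trees, each built on a random subset of the data, combined into one prediction) can be sketched in pure Python. This is a deliberately simplified illustration and not the implementation used by any reviewed study: each "tree" is a one-split decision stump, the trees are combined by majority vote (the classification counterpart of averaging), and the patient features (age, BMI) and labels are invented toy values.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_ids):
    """Find the single (feature, threshold) split with the fewest misclassifications."""
    best = None
    for f in feature_ids:
        for thr in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= thr]
            right = [y for r, y in zip(rows, labels) if r[f] > thr]
            if not left or not right:
                continue  # a split must send cases to both sides
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, thr, l_lab, r_lab)
    if best is None:  # no valid split: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    f, thr, l_lab, r_lab = stump
    return l_lab if row[f] <= thr else r_lab


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train each stump on a bootstrap sample and a random subset of features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (age, BMI) with label 1 = outcome occurred.
rows = [(55, 24), (60, 26), (62, 25), (58, 27), (72, 33), (75, 35), (78, 31), (70, 34)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
```

Because every stump sees a different bootstrap sample and feature subset, individual stumps may disagree, and the vote smooths out their idiosyncrasies. Production implementations grow full trees and resample features at every split, which this sketch omits for brevity.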

RF was also used to analyze the surgical technique in two articles [15, 49], which considered the following outputs, respectively: characterization of anatomical tissues and surgical corrections, the latter presenting the highest AUC (0.89) for this ML method. Postoperative length of stay (LOS) was predicted using RF by only one article [57], which reported an AUC of 0.71.

RF was also applied to the prediction of possible complications, such as major complications after primary TKA, blood transfusion, surgical site infection, and disposition of patients at discharge [15, 25, 38]. Lastly, two reviewed articles implemented RF for predicting TKA risk depending on knee OA, evaluating both risk and time [16, 27].

Gradient boosting machine

The ML model GBM gained popularity in the 2000s owing to its high predictive accuracy even in settings with mixed data types and missing values. GBM works by building decision trees sequentially, rather than in parallel like RF, with each tree correcting the prediction errors made by the previous ones. As a result, the model can analyze complex relationships in data and produce accurate predictions, even though it lacks the randomized selection and diversity of the RF model. Because each new tree corrects the errors of the previous predictions, GBM can be used for both classification and regression and can achieve higher accuracy than widely used models such as SVM.
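The sequential error-correcting idea described above can be sketched for a regression task in pure Python. This is a simplified illustration under stated assumptions, not the method of any reviewed study: the weak learners are one-split regression stumps, the loss is squared error (so each stump is fitted to the current residuals), and the single feature, the target values, the number of rounds, and the learning rate are all invented for the example.

```python
def fit_regression_stump(xs, residuals):
    """Find the threshold whose two-sided means best fit the residuals (squared error).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for thr in sorted(set(xs))[:-1]:  # dropping the largest value keeps the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        l_mean = sum(left) / len(left)
        r_mean = sum(right) / len(right)
        sse = sum((r - l_mean) ** 2 for r in left) + sum((r - r_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, l_mean, r_mean)
    return best[1:]


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the remaining errors."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        thr, l_mean, r_mean = fit_regression_stump(xs, residuals)
        stumps.append((thr, l_mean, r_mean))
        preds = [p + learning_rate * (l_mean if x <= thr else r_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (l if x <= thr else r) for thr, l, r in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical toy data: one numeric feature and a continuous target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.0, 2.1, 2.4, 3.0, 3.9, 4.5, 5.2, 6.0]

model = fit_gbm(xs, ys)
stumps = model[2]
mse_baseline = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_boosted = mse(ys, [predict_gbm(model, x) for x in xs])
```

Each round reduces the training error a little, which is also why boosting can overfit if the number of rounds is not limited or validated on held-out data.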

It was used by 18 studies, one of which employed it to predict TKA component size [35]. The highest AUC value (0.89) was reported by an article that evaluated the development of acute kidney injury (AKI) after TKA [34]. Other studies that evaluated complications with GBM considered the following outputs: major complications after primary TKA, blood transfusion after TKA, surgical site infection, and disposition of patients at discharge [8, 17, 38]. One study used GBM for the prediction of LOS after TKA [37], and a different study employed this method to evaluate a functional outcome: post-TKA walking limitations [7].

In addition, GBM was used by 7 articles to evaluate clinical outcomes: prediction of patient satisfaction, achievement of MCIDs in KOOS 1 year after TKA, prediction of PROs, extended prescription of postoperative opioids, and MCIDs attainment 2 years after TKA [6, 18, 19, 22, 26, 32, 33, 53]. Only one study evaluated the use of SGB to predict the risk of TKA in comparison with other ML models; together with RF, it achieved the highest performance among the algorithms observed, with an AUC of 0.83 [16].

Artificial neural network (ANN)/multilayer perceptron

Although it originated in the 1940s, the ANN model gained prominence in the 2010s owing to the application of deep learning to modeling complex relationships, making it suitable for a wide range of applications. An ANN is a computational algorithm consisting of interconnected nodes organized in sequential layers, each of which processes the data and passes its output to the next, mimicking the functioning of human neural networks. This model was applied by 17 studies, one of which used it to predict LOS, inpatient charges, and discharge disposition before primary TKA [43]. Five articles analyzed clinical outcomes; the one with the highest AUC for this method (0.86) concerned the prediction of PROs [26], while other outputs in this category were prolonged postoperative opioid prescription, dissatisfaction after TKA, and prediction of same-day discharge in patients undergoing TKA [6, 32, 33, 50]. One article applied ANN for TKA component size prediction (femoral and tibial) [44], and another study applied it for procedural cost prediction for TKA [31, 58].
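The layered structure described above can be illustrated with a minimal forward pass in pure Python. The weights here are set by hand rather than learned, and the network size (two inputs, two hidden nodes, one output) and the exclusive-or style target pattern are assumptions chosen only because they show, in a few lines, how a hidden layer lets the model capture a relationship that no single linear combination of the inputs can represent.

```python
def relu(z):
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return max(0.0, z)


def identity(z):
    return z


def dense(inputs, weights, biases, activation):
    """One layer: each node takes a weighted sum of all inputs, adds a bias, applies the activation."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


# Hand-set (not learned) parameters for a 2-2-1 network; purely illustrative.
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [0.0, -1.0]
OUTPUT_W = [[1.0, -2.0]]
OUTPUT_B = [0.0]


def forward(x):
    """Pass the input through the hidden layer, then the output layer."""
    hidden = dense(x, HIDDEN_W, HIDDEN_B, relu)
    return dense(hidden, OUTPUT_W, OUTPUT_B, identity)[0]


# The output is high only when exactly one of the two inputs is present.
outputs = [forward(x) for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
```

In practice the weights are learned from data by gradient-based training, and clinical models use many more inputs and nodes; the sketch only shows how information flows from one layer to the next.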

Regarding complications, ANN was applied to evaluate the disposition of patients at discharge, post-surgical complications such as surgical site infection, and blood transfusion [38]. Additionally, two articles used this ML method to characterize tissues and surgical corrections based on patient-specific intra-operative assessment [15, 49]. Another application of ANN, by four other articles, was related to future clinical intervention outputs: effect of opioid use in risk of knee revision and manipulation in the first year after primary TKA [59]; identification of influential factors before surgery, and prediction of the risk of TKA surgery [23, 60].

Logistic regression

LR is an easily interpretable model for binary classification developed in the early twentieth century; as one of the oldest predictive models, it has a well-established role in the medical setting for estimating the probability of occurrence of different events. However, the advent of newer algorithms able to form wider and more complex associations between inputs and outputs means that this model is increasingly relegated to a comparator role. The algorithm was used by 16 out of 49 articles. Four articles evaluated complications, with the following outputs: disposition of patients at discharge, predictors of Allogenic Blood Transfusion (ALBT), and post-surgical complications [17, 25, 38]. Future clinical intervention was studied by three articles, specifically regarding the risk and time for a TKA in a patient presenting with knee OA [27]. One article used this machine learning method for TKA component size prediction [35], and a different publication used it to evaluate post-TKA walking limitations, a type of functional outcome [7].
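How LR turns input variables into an event probability can be sketched in pure Python. This is a minimal illustration under stated assumptions: a single standardized predictor, invented toy labels, and plain gradient descent with an arbitrary learning rate and number of epochs; statistical packages typically fit the same model with more robust optimizers.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight w and intercept b by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data: a standardized risk factor and a binary outcome.
# The classes overlap slightly, so the fitted weight stays finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)


def predict_proba(x):
    """Estimated probability of the outcome for predictor value x."""
    return sigmoid(w * x + b)
```

The fitted weight has a direct interpretation (its exponential is the odds ratio per unit increase of the predictor), which is the main reason LR remains a common comparator for less transparent models.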

Regarding clinical outcome, LR was applied by 7 articles to study: achievement of MCIDs in KOOS 1 year after TKA, extended opioid prescription post-surgery, dissatisfaction after TKA, assessment of sensitization in patients with chronic pain after TKA, prediction of same-day discharge in patients undergoing TKA, and prediction of PROs [6, 22, 26, 32, 33, 45, 50]. The article that presented the highest AUC (0.88) evaluated the probability of TKA within 5 years [47].

Support vector machine

SVM is an effective model that can be used for both classification and regression; developed in the 1960s, it is still one of the most popular algorithms used to classify disease progression based on imaging data. However, because of its low accuracy on noisy datasets, alternative algorithms such as K-Nearest Neighbors (kNN) are gaining prominence in this role. SVM is particularly effective when the number of features exceeds the number of samples in the data, and it can handle both linear and non-linear relationships. It was used by 13 articles, one of which evaluated the prediction of LOS and complications after TKA [29, 51]. It was mainly used to assess clinical outcomes such as prolonged postoperative opioid prescription [32]; improvement of KOOS one year after TKA [33]; dissatisfaction after TKA [6]; and attainment of MCIDs 2 years after TKA [20, 53]. SVM was also employed to analyze subtask segmentation of the TUG test for perioperative TKA [24]; risk and time of TKA in patients with knee OA [16, 27]; and surgical corrections based on patient-specific intra-operative evaluation [49]. Additionally, one article used the algorithm to evaluate the characterization of tissues [15, 60], while another applied SVM in component sizing for TKA [35].

Other AI models

Two AI models were employed to evaluate major complications after primary TKA [17]: AutoPrognosis (AP) and AdaBoost. The ML method Decision tree was utilized in two studies for the analysis of the following outputs: gait comparison between UKA and TKA patients [30], and subtask segmentation of the TUG test for perioperative TKA, the latter also being assessed by the methods AdaBoost, kNN, and Naïve Bayes Classifier (NB) [24].

Regarding the analysis of post-TKA walking limitation, the model SuperLearner was used [7]. Both the Cox-PH model and DeepSurv model were used to predict the risk and time of TKA in patients with knee osteoarthritis [27]; an Ensemble Deep Learning (DL) model based on the use of MRI and radiograph was also compared with traditional ML algorithms to predict the risk of TKA, obtaining promising results [40]. The prediction of PROs was assessed by the models: NB, kNN, and Multi-Step Adaptive Elastic-Net (MSAENET) [26].

The models Quadratic Discriminant Analysis (QDA) and LASSO regression were employed to evaluate MCIDs attainment after TKA in different periods. One of the studies made the assessment 1 year after TKA [22], while two other articles made the evaluation 2 years after TKA [20, 53]. LASSO regression was also used to analyze mortality and complications after TKA, such as respiratory, cardiovascular, nervous system, and renal complications [21]. Regarding the prediction of clinical outcomes, the new Skeletal Oncology Research Group Machine Learning Algorithm (SORG-MLA) was validated for the identification of patients at risk of prolonged postoperative opioid use after TKA, obtaining an AUC of 0.75 [48].

Moreover, the models Linear Discriminant Analysis (LDA), Recursive Partitioning (RP), and NB were employed for the assessment of sensitization in patients with chronic pain after TKA [1]. For the prediction of procedural cost after TKA, DenseNet was used, presenting an AUC score of 0.813 [31].

Natural Language Processing Method (NLPM) was utilized to assess surgical technique, using the following outputs: category of surgery, implant model, presence of patellar resurfacing, constraint type, and laterality of surgery [46]. NLPM was also used to estimate ITS data [4] and to analyze the effect that opioid use can have on the risk of knee revision and manipulation in the first year after primary TKA [12].

Lastly, the Stochastic Hill Climbing Complexity score was used for the prediction of surgical 90-day morbidity, mortality, and complications [11]. NB was employed to analyze inpatient cost and LOS after TKA [32, 45].

Quality assessment by modified MINORS

All 49 of the reviewed articles were evaluated with the modified MINORS checklist to assess quality and risk of bias. All 49 articles clearly reported the study aim; however, 11 studies failed to report the performance metric. Two publications did not report the output feature, while 46 of the studies clearly stated the input feature, and 45 of the articles indicated disclosure. Regarding the item AI model, 45 of the reviewed articles fulfilled this criterion. These findings indicate a relatively high degree of quality and a low likelihood of bias: only two of the reviewed articles received a score of 5/8, five articles scored 6/8, and the majority, 42 out of 49 publications, scored 7/8 or higher (Table 5).

Table 5 Quality assessment by modified MINORS

Discussion

This systematic review evaluated the possible uses of AI/ML models in TKA, highlighting their potential in improving decision-making, component sizing, inpatient costs, perioperative planning, and streamlining the surgical workflow. Implementing these prediction models in TKA can ultimately lead to more accurate predictions, less time-consuming data processing, and higher precision in identifying patterns, all while minimizing user input bias to provide risk-based patient-specific care.

A key finding was the benefit of RF in aiding surgical decision-making when applied to intraoperatively collected surface models and patient-specific intraoperative assessments. RF outperformed both ANN and SVM not only when categorizing various types of anatomical tissue [15], but also when identifying populations at risk for TKA [16] and when assessing balance and alignment during TKA surgery, helping the surgeon choose the most suitable bone recut or soft tissue adjustment [49, 61]. This review highlights how the application of RF in all the steps leading to TKA, as well as in perioperative and postoperative care, can lead to optimal clinical and surgical outcomes, while reducing complications thanks to patient-specific planning. Moreover, by streamlining the surgical workflow and helping to select surgical corrections, this AI model can overcome the risk of data overload and the challenge of data interpretation, while being fast, cost-efficient, and accurate.

The SGB model presented promising results in the Kunze et al. (2021) study, outranking RF, SVM, and EPLR in predicting the component sizes of the implant used in TKA. This model demonstrated the best overall performance in minimizing prediction error and maximizing accuracy for both femoral and tibial implant component size prediction. A potential benefit is the ability to predict the final component sizes of the prosthesis without relying on digital or manual templating, making it faster than traditional methods. The model also showed good performance across different TKA component manufacturers, which could streamline component selection, improve inventory control, and reduce shipping costs [35, 62].

Regarding prediction models for allogenic blood transfusion, the highest AUC score was reported by the RF- and SVM-based models [25]. With an AUC only 0.038 lower, the ANN-based model still performed significantly better than the classic prediction models [38]. Overall, these results show how implementing various ML-based models can improve the prediction of peri-operative complications, ensuring that the population identified as at risk of blood transfusion receives proper care, while also optimizing the operative process and reducing the risk of prolonged LOS caused by complications such as blood transfusion during TKA.

A further finding is the already established importance of LR models in healthcare settings, where they can support the development of patient-specific care and peri-operative planning. The most successful result of LR (AUC 0.88) was achieved by its implementation, together with DenseNet, in identifying a population at higher risk of TKA within 5 years, particularly at less advanced stages of OA [47]; however, in the more recent study published by Crawford et al. in 2023, EPLR performed worse than other models such as SGB and RF (AUC: 0.83) in identifying the population at risk of TKA [16]. Additionally, implementing LR with other models, like the ML-based remote patient monitoring system, can reduce the need for TKA revision, while acquiring continuous data on mobility and rehabilitation compliance for patients undergoing TKA. This patient monitoring system proved to be a reliable, low-maintenance, and well-received platform for patients recovering from TKA [42]. Implementing LR models would result in higher objectivity, cost-effectiveness, and the ability to acquire continuous data, together with higher accuracy in identifying at-risk populations, overall increasing the success rate for TKA.

Financial aspects are to be considered when proposing a treatment plan to patients, as complications can arise during the surgery and recovery, drastically changing the cost expected beforehand. Although it was shown to be an important element to consider when planning peri-operative care during TKA, the cost-prediction outcome was only analysed in one article. Demonstrating high accuracy when used in clinical medicine, the DenseNet model [31, 63] can optimize and provide a cost-efficient organization of resources that can benefit the medical staff by reducing their workload and improving the quality of the arrangement of resources. Simultaneously, this method can identify populations at risk for complications, a benefit that would help reduce the higher cost of the procedure after TKA, making it possible to implement patient-specific payment plans benefitting both patients and healthcare providers.

Reviewing the performance of the GBM model across different articles, we observe that this algorithm is simple and efficient, and that it has been validated to improve both short- and long-term prognoses of TKA patients. Ko et al. successfully used this AI model to predict the development of postoperative AKI after TKA, which can not only increase LOS but also be life-threatening [34], while TreeNet GBM proved to be the most successful method for identifying predictors of patient satisfaction [18]. Additionally, GBM showed great results when predicting the disposition of patients at discharge [38]; the model’s implementation could therefore improve overall patient satisfaction and recovery rate post-TKA, while also ensuring that patient-specific peri-operative care is applied to prevent and manage possible complications.

Turning to more novel models that until recently were less widely implemented in healthcare settings, the AI/ML models DL-TL-MT, SVM, DeepSurv, and Cox-PH proved to be of great use in identifying the population of patients at risk and developing patient-specific care. The DL-TL-MT model successfully predicted the risk of OA progression based on knee radiographs in patients that previously underwent TKA [36]. With the same AUC (0.87), the methods SVM, DeepSurv, and Cox-PH were successfully employed to predict the risk and time of TKA of an OA knee [27]. These methods prove to be indispensable in predicting the progression of OA, even at an early stage. Such ML-based models have great potential as diagnostic tools for physicians when determining the prognosis for patients at all stages of OA, allowing for early intervention through TKA where needed and thereby reducing the risk of complications and of TKA revision.

The SVM predictor model also showed very promising results when applied in different settings, especially for the segmentation of the TUG test and the extraction of information from each subtask perioperative to TKA, addressing problems of subjectiveness and other biases [24, 64]. The benefits of this AI model would be more precise segmentation and therefore data extraction, leading to a better understanding and classification of improvements in patients and to the use of patient-specific treatments and rehabilitation models.

Looking at the results of the different articles included in the review, the emergence of ML models in the medical setting is evident: most data corroborate the idea that novel AI models achieve better results and predictive power than traditional models when identifying predictors of TKA and analyzing multiple outcomes simultaneously. In the prediction of complications after primary TKA, Devana et al. demonstrated the superiority of AP over traditional models in discriminative ability and in the capability to suggest nonlinear relationships between variables in the outcomes of TKA. Consequently, AP can be a versatile tool for identifying crucial patient characteristics when predicting outcomes across a variety of datasets, thereby improving patient outcomes [17]. Additionally, Harris et al. demonstrated how AI can produce preoperative prediction models for one-year improvement in pain and functioning after TKA, and how the GBM model, which performs well in the presence of important interactions, and the QDA model, which performs better with nonlinear associations, can be applied to produce an easy-to-use predictive model able to achieve similar or better accuracy with far fewer inputs than traditional predictive models [22].

Lastly, the NLPM model shows great potential as a newly emerging algorithm, particularly when applied in clinical settings for the interpretation of text. It has been applied in different studies for the classification of patient satisfaction [14], knee revisions after TKA due to preoperative opioid use [12], and the processing of clinical free text from electronic health records [46]. The strength of this ML-based model lies in its ability to automate the extraction of information embedded in perioperative notes and patient-centered surveys, decreasing the need for costly manual chart review and improving data quality while being less time-consuming. The use of this model would improve the handling of patient feedback and perioperative notes, supporting better patient-specific, risk-based care and resulting in higher patient satisfaction and a reduction in costs for the healthcare system, both those due to possible lawsuits [65] and those due to manual chart reviews [46].

Like the reviews by Hinterwimmer et al. (2021) and Lee et al. (2022), this systematic review confirms the great potential of AI/ML methods and their application in orthopedics for cost predictions, diagnostic applications, and identification of risk factors, while also dispelling doubts regarding the inaccuracy and lack of sufficient evaluation of these models. In comparison, this review analyzed 49 articles, including the publications already examined in previous reviews. This more extensive research concluded not only that these models can be implemented in the prediction of TKA perioperative care, disease progression of OA, and distinct outcomes using specific data, but also that the prediction of more complex outcomes is now feasible through the application of more novel AI/ML algorithms [13, 17, 21, 22, 27, 30]. Nevertheless, as mentioned in several studies, further research may enhance the reliability of AI/ML models and allow for their use in patient preoperative and perioperative care [8, 11, 19, 21, 43, 50].

Limitations

The main limitation of this review derives from possible bias in the information on the performance of the different AI models. As highlighted by the MINORS table, this is the parameter most at risk, because several articles omitted either the AUC score or the accuracy score for the predictive models examined. Moreover, many of the studies included in this review are retrospective and obtained the patient data used to test the AI/ML prediction models from national databases and electronic health records. Such studies are limited by the lack of detailed clinical information, potential misclassification of data, and, in many cases, a small cohort of patients with limited characteristics from which to derive inputs and compare outputs, which may prevent the results from being generalizable to all patient populations [11, 19, 21]. Validation of the analyzed predictive models on larger patient populations is needed. Lastly, due to the heterogeneity of the data, it was not possible to perform a meta-analysis.
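To make the two performance measures named above concrete, the following is a minimal Python sketch of how AUC and accuracy can be computed for a binary predictive model. The labels, predicted scores, and the 0.5 decision threshold are hypothetical toy values chosen for illustration and are not taken from any of the reviewed studies; the function names are likewise assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of cases classified correctly at a fixed decision threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, p in zip(labels, predicted) if y == p)
    return correct / len(labels)


# Hypothetical toy data: 1 = outcome occurred, 0 = outcome did not occur.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

print("AUC:", roc_auc(labels, scores))        # 15 of 16 pairs ranked correctly
print("Accuracy:", accuracy(labels, scores))  # 6 of 8 cases classified correctly
```

In this toy example the AUC is 0.9375 while the accuracy at the 0.5 threshold is 0.75, which shows that the two measures capture different aspects of performance: AUC reflects ranking quality across all thresholds, whereas accuracy depends on a single chosen cut-off. A study reporting only one of them therefore cannot be compared directly with a study reporting only the other.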

Conclusion

Regarding the implementation of AI/ML models in TKA, the articles in this review mostly consider these predictive models to be helpful and suggest that their application in medical settings for perioperative TKA clinical decision-making and for predicting the progression of OA to TKA may improve patient satisfaction, risk management, and cost efficiency. Among the qualities for which the AI/ML models outperform traditional prediction models, the most frequently reported are higher accuracy, cost efficiency, simple application, lack of subjectivity, and an overall reduction in time consumption thanks to the automation of tasks. Therefore, it is possible to conclude that, although the results of the reviewed articles should be further validated on larger cohorts of patients, the findings of these articles highlight the great potential of including AI/ML predictive models in a further branch of medicine.