Background

Total hip arthroplasty (THA) and total knee arthroplasty (TKA) are high-volume orthopaedic procedures whose annual volumes are expected to grow by 71% and 85%, respectively, by 2030 [1,2,3,4]. This growing population of arthroplasty patients is paired with an increasing volume of total joint arthroplasty (TJA) reoperations [5,6,7,8]. Radiographic assessment is the most prevalent method to verify implant positioning, monitor implant wear, exclude complications, and identify implant design before revision surgery [9, 10]. However, these analyses place a significant burden upon arthroplasty surgeons. For instance, inconsistent implant records can complicate implant identification before revision surgery, increasing perioperative morbidity and cost of care [11]. In a 2012 survey of arthroplasty surgeons, 88% of respondents reported that identifying the components of a failed implant takes a significant amount of time [12].

Artificial intelligence (AI) presents an alternative to this time-consuming process, and the accompanying reduction in human error could further optimize preoperative planning. AI algorithms extract rules and patterns from large amounts of data to predict outcomes on similar, unseen data [13]. Machine learning (ML) is a subset of AI, and deep learning models such as convolutional neural networks (CNNs) are ML architectures loosely modeled after the human brain that identify rules and patterns in images [14,15,16,17]. AI algorithms have been used to detect mammographic lesions [18] and skin cancer [19], and they have a growing presence in orthopaedic surgery [14, 15, 17]. AI has shown particular promise in preoperative planning for revision TJA, where multiple aspects of the implant need to be analyzed [20, 21].

As the rate of revision TJA rises for a multitude of reasons, AI-based implant recognition may reduce surgeon workload, save resources, and reduce inaccuracies that could necessitate another revision. Given the plethora of different AI algorithms, a systematic review of current studies is critical to understanding their efficacy and potential use cases. Therefore, we asked: (1) What are the currently established use cases for AI in TJA? (2) What is the performance of these algorithms? (3) What are the current limitations of these AI algorithms?

Methods

This review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The study protocol was registered with PROSPERO (CRD42023403497, 27 February 2023).

Search strategy

The PubMed, EBSCOhost, Medline, and Google Scholar electronic databases were searched on 27 February 2023 to identify all studies published between 1 January 2000 and 27 February 2023 that evaluated AI-mediated implant analysis in hip and knee arthroplasty. The following keywords and Medical Subject Headings were used in combination with the “AND” and “OR” Boolean operators: (“Total Joint Arthroplasty [Mesh]” OR “Total Knee Arthroplasty [Mesh]” OR “Total Hip Arthroplasty [Mesh]” OR “THA” OR “TKA” OR “TJA”) AND (“Artificial Intelligence” OR “AI” OR “Machine Learning” OR “ML”) AND (“Implant”).

Eligibility criteria

Articles were included if (1) full-text manuscripts in English were available and (2) the study investigated the use of artificial intelligence algorithms in TJA implant analysis. Additionally, the following studies were excluded from our analysis: (1) case reports, (2) systematic reviews, (3) duplicate studies among databases, (4) gray literature such as abstracts and articles on pre-print servers, and (5) publications in languages other than English.

Study selection

Two independent reviewers assessed the eligibility of each included article. Disagreements were discussed with a third independent reviewer to achieve consensus. After removing duplicates, the initial query yielded 257 articles, which were then screened against the purpose of this review. Thirty-six studies were selected for further consideration after title and abstract screening. The full text of each of these articles was reviewed, and 20 fulfilled our inclusion criteria. Reasons for full-text exclusion were that the study did not directly address implant analysis in TJA (n = 13) or did not assess the efficacy of an AI model (n = 3). A review of each study’s reference list yielded no additional articles (Fig. 1).

Fig. 1 PRISMA diagram depicting the study selection process

Study characteristics

A total of 20 studies evaluating 66,190 radiographs were included in the final analysis (Table 1). The efficacy of AI-mediated implant recognition was reported for TKA in 10 studies, for THA in 8 studies, and for both in 2. The included studies were conducted between 2020 and 2023, and all 20 reviewed radiographs retrospectively. While 13 studies were conducted with data from single institutions, 7 studies utilized data from multiple institutions. All studies were diagnostic trials exploring the efficacy of AI algorithms in TJA.

Table 1 Characteristics of studies included in the final analysis

Risk of bias in individual studies

Two independent reviewers assessed the risk of bias using the Methodological Index for Nonrandomized Studies (MINORS) tool. This is a validated assessment tool that grades comparative studies from 0 to 24 based on 12 criteria related to study design, outcomes assessed, and follow-up, with higher scores reflecting better study quality. Each item was scored 0 if not reported, 1 if reported but inadequate, and 2 if reported and adequate (Supplemental Fig. 1). Discrepancies in grading were resolved by achieving consensus through consulting a third reviewer. The mean MINORS score was 20.4 ± 0.6.

Primary and secondary outcomes

First, we identified the currently established use cases for AI in TJA implant analysis: implant identification, implant failure detection, and implant measurement. The primary goal of this study was to present the efficacy of current AI algorithms in implant recognition following total joint arthroplasty. To achieve this, we analyzed the accuracy, area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, specificity, and positive predictive value (PPV) for each use case. The median and interquartile range (IQR) were calculated in Excel (Microsoft Corporation, Redmond, Washington, USA) for the highest-scoring AI algorithm in each study. As a secondary goal, we synthesized the key limitations that the authors of each study had noted.
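For readers who wish to reproduce this aggregation, the computation is simple. The sketch below uses Python with NumPy (whose default percentile interpolation matches Excel's QUARTILE.INC) and purely illustrative accuracy values, not data from the included studies.

```python
import numpy as np

# Illustrative per-study accuracies (%) for the highest-scoring
# algorithm in each study -- hypothetical values, not study data.
accuracies = np.array([98.9, 96.9, 99.8, 100.0, 95.2])

median = np.median(accuracies)
q1, q3 = np.percentile(accuracies, [25, 75])  # IQR bounds
print(f"median {median:.1f}% (IQR {q1:.1f}% to {q3:.1f}%)")
```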

Results

Implant identification

Most studies (n = 16) included in this review explored the efficacy of AI algorithms in identifying implant shape, model, and manufacturer. Seven of these studies were TKA implant-specific, seven were THA implant-specific, and two included implants for both surgeries. For TKA algorithms, the AUC ranged from 0.9857 to 1; accuracy, sensitivity, and PPV each ranged from 22.2% to 100%; and specificity ranged from 97.8% to 100% (Table 2). The median (IQR) for each of these domains was AUC: 0.996 (0.990 to 1), accuracy: 98.9% (96.9% to 99.8%), sensitivity: 98.1% (94.8% to 99.7%), PPV: 99.6% (99.0% to 100%), and specificity: 99.4% (98.1% to 100%). Of note, one study developed an algorithm with perfect scores across all reported domains [35]. For THA algorithms, the AUC ranged from 0.99 to 0.999, accuracy from 83.7% to 100%, sensitivity from 75.4% to 98.9%, PPV from 83.7% to 99.0%, and specificity from 98.0% to 99.8% (Table 3). The median (IQR) for each of these domains was AUC: 0.999 (0.995 to 0.999), accuracy: 98.2% (91.7% to 99.6%), sensitivity: 94.6% (94.3% to 95.7%), PPV: 96.3% (93.1% to 99.0%), and specificity: 99.2% (98.5% to 99.8%).

Table 2 Performance of artificial intelligence algorithms in identifying implants for total knee arthroplasty
Table 3 Performance of artificial intelligence algorithms in identifying implants for total hip arthroplasty

Additionally, three studies compared the identification capabilities of AI with those of human experts [20, 25, 31]. Of these, two showed that at least one AI architecture outperformed arthroplasty clinicians [20, 31], whereas one study found that its AI architecture performed worse than the experts [25]. Three studies also reported the average time their algorithm spent per radiograph, which was less than one second in each case [21, 25, 36]. In comparison, one study reported that a surgeon needed more than eight minutes to analyze a radiograph of an implant with which they had no experience [36].

The most common limitation noted by authors was the limited dataset upon which the algorithms were trained [7, 8, 22, 24,25,26,27, 29, 35]. In addition to a limited number of radiographs, authors also faced challenges in developing generalizable algorithms due to a limited library of implants [7, 8, 20,21,22, 26, 27, 30, 31, 35, 36]. Authors further noted a lack of high-quality radiographs of implants from various imaging positions and modalities [7, 29, 30, 35], which further hampers generalizability. Lastly, authors advocated for validating these algorithms against the judgment of surgeons of varying experience [23, 27, 30].

Implant failure detection

Two studies aimed to detect implant failure through the utilization of AI algorithms [32, 37] (Table 4). One study sought to assess implant loosening in TKA [37]. When compared to the baselines set by two orthopaedic specialists, the image-based algorithm attained an accuracy of 96.3%, with no improvement upon adding clinical information. Additionally, class activation maps (CAMs) showed signals over the loosened bone-implant interface, the region used as the criterion for detecting implant loosening. The other study developed a deep learning tool to quantify femoral component subsidence between serial AP radiographs of the hip [32]. Parameters included the distance from the tip of the stem to the most superior point on the greater trochanter, the angle of the femoral axis, and the distance between magnification markers. The model achieved an accuracy of 97% for detecting the femur, 98% for detecting the implant, and 94% for detecting the magnification markers. When compared to the manual measurements of two orthopaedic surgeon reviewers, the automatic measurements had a mean absolute error of 0.6 ± 0.7 mm (21%), and the two sets of measurements were strongly correlated (r = 0.96, P < 0.001); a simplified sketch of this measurement arithmetic follows Table 4. The median (IQR, if applicable) for implant failure detection algorithms was AUC: 0.935, accuracy: 97.2% (96.7% to 97.6%), sensitivity: 96.1%, PPV: 92.4%, and specificity: 90.9%.

Table 4 Performance of artificial intelligence algorithms detecting implant failure in total joint arthroplasty
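Once the landmarks are localized, the subsidence arithmetic reduces to scaled distances between points. The following is a minimal sketch under stated assumptions: the helper names, the 100 mm marker spacing, and the use of simple Euclidean distances are illustrative choices of ours, not the published method.

```python
import math

def point_distance(p, q):
    """Euclidean distance between two (x, y) pixel coordinates."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def subsidence_mm(stem_tip_pre, troch_pre, markers_pre,
                  stem_tip_post, troch_post, markers_post,
                  marker_spacing_mm=100.0):
    """Change in the stem-tip-to-greater-trochanter distance between
    serial radiographs, corrected for magnification via markers of
    known physical spacing. Landmarks are (x, y) pixel tuples;
    markers_* are pairs of such tuples."""
    scale_pre = marker_spacing_mm / point_distance(*markers_pre)    # mm/pixel
    scale_post = marker_spacing_mm / point_distance(*markers_post)  # mm/pixel
    d_pre = point_distance(stem_tip_pre, troch_pre) * scale_pre
    d_post = point_distance(stem_tip_post, troch_post) * scale_post
    return d_post - d_pre  # a positive value indicates subsidence
```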

Both studies acknowledged similar limitations: small datasets; reliance on cemented implants, which limits external validity as cementless implants become more common; and alterations in the radiographic appearance of bone due to heterotopic ossification, bisphosphonate administration, and magnesium coatings over implants [32, 37].

Implant measurement

Two studies assessed the measurement capabilities of AI in total joint arthroplasty [28, 33] (Table 5). In one study, the authors built an algorithm to delineate the epiphyseal, metaphyseal, and diaphyseal fixation zones and cone placements following revision TKA [28]. To accomplish this, the widest condylar width, the most inferior points of the femoral implant, the widest tibial width, and the most proximal points of the tibial implant were used as parameters to construct bounding squares on the femur and tibia. The algorithm delineated 98% of zones and, when compared to a fellowship-trained orthopaedic surgeon, achieved 90% zonal mapping accuracy, with 97.8% tibial and 100% femoral cone identification. Runtime was 8 ± 0.3 s per radiograph [28]. In another study, an algorithm was trained on long leg radiographs (LLR) following TKA to assess implant alignment by measuring the hip-knee-ankle (HKA), femoral component (FCA), and tibial component (TCA) angles [33]. This study used the commercially available AI software IB Lab LAMA (Leg Angle Measurement Assistant, version 1.03, IB Lab GmbH, Vienna, Austria), which localizes anatomical features of the femur, tibia, and calibration ball to measure leg angles; a simplified sketch of such an angle computation follows Table 5. When compared to two orthopaedic surgeons who regularly perform LLR measurements, the algorithm achieved an accuracy of 99% for HKA, 99% for FCA, and 97% for TCA. For these measurement studies, the median (IQR) of the highest accuracy achieved was 97.3% (94.5% to 99.3%). Noted limitations included the limited number of knee systems available for algorithm training and limited cohorts for external validation, especially cohorts with varying degrees of image quality [28, 33].

Table 5 Performance of artificial intelligence algorithms measuring implants in total joint arthroplasty
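To illustrate the kind of landmark-based angle measurement such software performs, here is a minimal sketch; the landmark inputs, function name, and the convention of reporting the angle between the two mechanical axes are our own illustrative assumptions, not LAMA's internal method.

```python
import numpy as np

def hka_angle_deg(hip_center, knee_center, ankle_center):
    """Angle (degrees) between the femoral mechanical axis
    (hip center -> knee center) and the tibial mechanical axis
    (knee center -> ankle center). Inputs are (x, y) landmarks."""
    femoral = np.asarray(hip_center, float) - np.asarray(knee_center, float)
    tibial = np.asarray(ankle_center, float) - np.asarray(knee_center, float)
    cos_a = np.dot(femoral, tibial) / (np.linalg.norm(femoral) * np.linalg.norm(tibial))
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))

# A perfectly aligned leg gives an angle near 180 degrees;
# deviation from 180 reflects varus/valgus alignment.
```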

Discussion

AI algorithms for TJA implant analysis have shown promising preliminary results for identification, failure detection, and measurement. Across these use cases, algorithms have demonstrated high accuracy, PPV, sensitivity, and specificity, and some studies showed that these algorithms could outperform human experts. Still, a major limitation noted by almost all studies was small radiographic dataset size, which constrains generalizability, as an algorithm must be trained on the full range of inputs it is expected to encounter. Overall, AI algorithms show promise in implant identification, failure detection, and measurement, with the potential to improve orthopaedic workflow as prior integrations of AI into clinical workflows have done [38,39,40,41]. For wider implementation and validation, future algorithms need to be trained on robust, high-quality datasets, externally validated, and accompanied by published explainability methods.

The lack of robust and high-quality datasets has been identified as a significant limitation in multiple studies, adversely affecting the performance of AI algorithms. Consequently, some of these studies failed to meet the desired thresholds for excellent algorithm performance, namely an AUC of 0.90 and an accuracy of 90%. The performance of algorithms that did not have access to a large dataset of high-quality images will most likely worsen when externally validated [42,43,44]. Nonetheless, the approximate volume of imaging samples needed for high sensitivity and specificity can be relatively low (< 500). All but one of the studies reporting these metrics [27] achieved high sensitivity and specificity for implant identification even though total image sample sizes ranged from 274 to 11,204. Even when considering individual implant designs, a very low number of images per design is required. Many studies used augmentation techniques to increase the number of training images through contrast editing, flipping, and rotation of the raw image data (a minimal example of such a pipeline is sketched below). Through this technique, Kang et al. created 3606 augmented images from 179 images of 29 hip implant designs, some represented by fewer than 5 radiographs, and still achieved an AUC of 0.99 [29]. However, even algorithms that demonstrated excellent performance are limited by the catalog of implants and radiographs presented to them. To improve the AUC and accuracy of future studies, high-quantity and high-quality datasets need to be made publicly available [45]. Datasets including training images in formats ranging from DICOM to standard JPEG would allow AI training across multiple image formats. Few well-curated imaging datasets are currently available due to a lack of image organization, anonymization, annotation, and linkage to a ground-truth diagnosis [45].
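The augmentation operations named above are standard image transforms. A minimal sketch using torchvision follows; the specific probabilities and magnitudes are illustrative, not those used in the cited studies.

```python
import torchvision.transforms as T

# Augmentation pipeline of the kind described above: contrast
# editing, flipping, and rotation. Parameter values are illustrative.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=10),    # small random rotations
    T.ColorJitter(contrast=0.3),     # random contrast editing
    T.ToTensor(),
])

# Applying the pipeline repeatedly to a single PIL radiograph yields
# many distinct training samples from one source image:
# samples = [augment(radiograph) for _ in range(20)]
```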

Single-institution datasets also limit the ability to perform external validation. Fourteen of the twenty included studies did not test their algorithms against an external dataset, making it difficult to know whether their results are reproducible in a different environment [46]. Park and Han stress the importance of testing all algorithms against a well-defined clinical cohort to eliminate potential overestimation of performance due to overfitting or overparameterization [47]. Failing to test against external datasets is not uncommon: only 6% of prior radiological AI papers used an external test set [48]. To improve reproducibility, future studies ought to test against external datasets [38, 42, 43, 47, 49, 50], and developers should minimize the risk of overestimation by combining external testing with a strongly defined clinical cohort. Developers will have greater success at the institutional level than at the global scale because of the vast library of implants that joint reconstructive surgeons use. A long-term solution would be an implant library that any reconstructive surgeon at any institution could use to create new algorithms. While external validity remains a concern, the internal validity of these algorithms is very high, so developers can create institution-specific algorithms based on the catalog of implants their reconstructive surgeons routinely use. With training on high-quality, publicly available datasets and external testing, the clinical feasibility of these algorithms may be better assessed.

Lastly, AI models suffer from the “black box” phenomenon: most users cannot understand how an algorithm reaches its decision. This opacity has drawn criticism over whether to trust AI when one cannot trace its logic [51]. Saliency mapping and CAMs are methods that highlight the regions of an image most relevant to the algorithm’s decision [52]. For example, a saliency map for identifying THA implants disclosed that the region around the tip of the femoral component was of utmost importance, a feature not commonly used by clinicians to distinguish between models [24]. However, these maps may not be enough, as several studies included in this review [22, 24, 26, 31, 34] demonstrated that AI-based implant measurement and failure detection rely on various other parameters. Therefore, all future studies should report these parameters as well as the saliency maps associated with decision-making to improve the transparency of AI algorithms for potential clinician adopters.
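As an illustration of how such maps are produced, below is a minimal sketch of vanilla gradient saliency in PyTorch; the included studies used CAMs and related techniques, which differ in detail, so this is a generic example rather than any study's method.

```python
import torch

def saliency_map(model, image):
    """Vanilla gradient saliency: the magnitude of the gradient of the
    top-class score with respect to the input pixels. `image` is a
    (1, C, H, W) tensor; `model` is any differentiable classifier."""
    model.eval()
    image = image.clone().requires_grad_(True)
    scores = model(image)                  # shape (1, num_classes)
    scores[0, scores.argmax()].backward()  # backprop the top-class score
    # Max over channels yields a per-pixel relevance heatmap, (H, W).
    return image.grad.abs().max(dim=1)[0].squeeze(0)
```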

Limitations

This study has its limitations. First, not all studies reported values for AUC, accuracy, sensitivity, PPV, and specificity, and this variation in performance reporting limits the accuracy of generalizations about these algorithms. Along the same lines, each algorithm was trained on its own library of implants, so overarching comparisons between studies are difficult, as the algorithms were tested on different images and implants. Additionally, very few studies reported the demographic information corresponding to their radiographic datasets; this will be crucial in the future, as biased clinical data negatively affect model performance [53]. Nonetheless, the included studies report promising results for AI-based implant analysis.

Conclusion

AI models hold great potential as a disruptive tool in the field of adult reconstructive surgery, specifically in the analysis of implants. This is particularly important considering the rising demand for revision TJA. AI-based implant analysis can reduce the workload of surgeons, save resources, and minimize inaccuracies that might necessitate further revisions. These findings highlight the promising role of AI in recognizing implants in TJA. Initial studies have demonstrated impressive performance in implant classification, analysis of implant failures, and measurements derived from radiographs. However, to develop more robust models, it is essential to have access to larger datasets of radiographs. Future research should adhere to standardized guidelines for model development and training while emphasizing the importance of transparency in presenting the results.