Characterized by functional disability and chronic pain, knee osteoarthritis (OA) accounts for approximately one-fifth of the OA of all joints [1]. OA may be diagnosed clinically or radiologically, though symptoms may be present years prior to the first appearance of X-ray signs indicative of OA [2, 3]. The most frequently used classification for knee OA is the Kellgren Lawrence (KL) scale, which differentiates five stages (0–4) of OA severity [4]. However, the KL scale is criticized for its assumption of linear OA progression [5] as well as its differing interpretations leading to aberrant classification of especially low-grade knee OA [6]. Therefore, Osteoarthritis Research Society International (OARSI) has developed an OA classification system based on an atlas with exemplary X-rays of distinct features [7].

While magnetic resonance imaging (MRI) has gained importance in the diagnosis of musculoskeletal pathologies, the advantages of plain X-ray over MRI include their prevalent availability and cost efficiency. However, early-stage OA signs are invisible on plain X-rays, as cartilage degeneration cannot be directly assessed, and OA constitutes a three-dimensional problem [8]. This is reflected by fair to moderate interobserver reliability for knee OA assessment using X-rays alone, with measured quadratic kappa values between 0.56 and 0.67 [9,10,11]. To overcome these issues, different solutions, including novel quantitative grading methods and automatic knee X-ray assessment tools, have been proposed [6, 12,13,14]. Currently, artificial intelligence (AI) and deep learning are used in medical image classification related to the musculoskeletal system [15,16,17].

In this study, the authors aimed to characterize two aspects of the impact of a novel, AI-based image annotation tool with regard to changes in the radiological judgement of knee OA [18]. First, we analyzed the intra- and interobserver reliability of board-certified orthopaedic surgeons (herein termed senior readers) regarding knee OA grade assessment using either AI-annotated or plain X-rays. Second, we compared the outcome of senior readers to that of senior residents (termed junior readers) with aided analysis in terms of agreement rate and overall performance.


Three board-certified orthopaedic surgeons (= senior readers) from a single hospital rated X-ray images, and readings with and without AI aid were compared to a gold standard (OAI consensus). The findings of a similar previous study [19] involving senior residents (= junior readers) were used as a comparator for senior reader performance.


In the present study, plain knee X-rays were acquired from a publicly available dataset by Osteoarthritis Initiative (OAI) [20]. From this dataset, 124 knee X-rays (size comparable to previous study in this field [21]) were semirandomly selected using a selection probability proportional to the frequency of KL grades across the visits baseline, as previously described [19]. Thereby, a uniform distribution of KL grades was ensured in the sample set. The images used in the present study and training data of the AI were drawn from OAI but segregated by the patient level to avoid biasing AI performance due to overfitting. Overfitting implies that an AI model has been trained in a way that the learned methodology is only applicable to the training set but not to another independent dataset [22]. A few additional images from the OAI dataset, outside of the study set, were randomly chosen for training of the readers on the user interface of the study’s annotation tool (see below).

Table 1 depicts the distribution of KL and OARSI grades of the final cohort (n = 115; 9 images with incomplete annotation by readers excluded), as reported by consensus readings of the OAI study (i.e. ground truth). Table 2 contains the patient demographics of the final cohort stratified by sex. All knee X-rays from the OAI study followed a “fixed flexion” protocol, with standing X-rays in posterior-anterior (PA) projection with feet externally rotated by 10° and knees flexed to 20–30° (until the knees and thighs touch the vertical X-ray table anteriorly) [23,24,25].

Table 1 Distribution of KL and OARSI grades in the study population, as reported by the gold standard consensus readings of Osteoarthritis Initiative [20] (OAI; n = 115)
Table 2 Population demographics of the individuals analyzed in this study (n = 115)

Knee osteoarthritis labeling assistant

KOALA (Knee OsteoArthritis Labeling Assistant) is a software providing both metric assessments of anterior–posterior (AP) or PA knee X-rays and proposals for clinical OA grade. These proposals should aid clinicians in assessing the degree of knee OA in adult patients or their risk of developing it. This is enabled by standardized quantitative measurements of morphological features (i.e. joint space width [JSW] and joint space area [JSA]) on AP or PA knee X-rays. The AI software subsequently provides numerical results together with graphical overlays on X-rays showing measurement points. Furthermore, OA severity is assessed with the AI software by proposing the following grades (the higher, the more severe) to the clinician/rater: maximum OARSI grade for sclerosis, joint space narrowing (JSN), osteophytes (each between 0 and 3), and KL grade (between 0 and 4). Grading proposals as well as metric assessments are summarized in a report that can be viewed on any DICOM viewer workstation approved by the FDA.

The AI software applied in this study is based on several CNNs trained on large datasets of over 20,000 individual knee X-rays. It combines several low- and high-level modules, with the low-level modules being responsible for knee joint detection and landmarking. Subsequently, information is transferred to the high-level modules responsible for joint segmentation and measurement of KL, JSW, and OARSI grades, as previously described [19].

Labelling process

The labelling process was divided into three steps (Fig. 1). First, three senior readers were trained on the structure of the AI software report, OARSI grading system [7], labelling process and platform used (i.e. Labelbox, a data labelling tool designed for machine learning procedures [26]. Second, readers assessed—unaided (i.e. without AI annotations)—124 plain knee X-rays and defined KL grade (0–4), osteophytes (0–3), sclerosis (0–3), and JSN (0–3) by completing a list. The readers were able to work remotely at their preferred time and allowed to interrupt and resume labelling at any time, without time restrictions for labelling individual images or the entire dataset. However, the time it took readers to label each image, as well as the time of the entire labelling process, was ascertained. Third, after a minimum of 2 weeks after the second step had been completed, the same 124 knee X-rays were relabelled by the readers, with images provided at random order (to avoid creating observer bias; Fig. 1). At this point, however, each image was supplemented with the AI software’s report together with a binary score of whether OA was present on the X-ray.

Fig. 1
figure 1

Flow chart of the study’s labelling process

As mentioned above, complete data by annotators for all modalities were not available for nine images; therefore, these were excluded from further analysis, resulting in a total subject count of 115 and a dropout rate of 7.3%.

Statistical analysis

Agreement rates

A two-way random, single score model intraclass correlation coefficient (ICC) was used to assess agreement rates between readers for items evaluated (i.e. presence of OA, osteophytes, sclerosis, KL grade, JSN) [27] when compared to the OAI consensus. As proposed by Shrout and Fleiss, 95% confidence intervals (CIs) were calculated [27]. Standard errors of the mean were estimated for ICCs by resampling observations with 1000 bootstraps. Via the z score method, the statistical significance of the difference between the unaided and aided labelling was determined. A p value of < 0.01 was considered statistically significant.

Accuracy measures

The performance of the readers was assessed by accuracy, sensitivity and specificity. True positives (TP), true negatives (TN), false-positives (FP) and false negatives (FN) were calculated for each measure of the readers against the readings from the OAI study. In particular, the ability to detect any abnormality (KL grade > 0) or OA (KL grade > 1), JSN (> 0), sclerosis (> 0) or severe sclerosis (> 1) and the presence of osteophytes (> 0) was assessed. Normal approximation to binomial proportional intervals was used to estimate standard errors as well as CIs for sensitivity, specificity, and accuracy.

Receiver operating characteristic curves

As the AI software used not only provides recommendations regarding the presence and severity of OA but also a confidence score on the recommendation made, a receiver operating characteristic (ROC) curve was plotted to visualize the effect of the AI software’s application on reader performance with regard to changes in TP rates (TPR) and FP rates (FPR).


Agreement between readers in unaided and aided labelling

Agreement rates for different measures (i.e. KL, presence of OA, JSN, sclerosis, osteophytes) between readers were calculated separately for unaided and aided labelling. Agreement rates for senior readers improved with aided labelling for all scores assessed (Table 3, Fig. 2). In detail, agreement rates increased twofold, 1.37-fold, 1.42-fold, 1.59-fold and 3.33-fold for KL grade, JSN, sclerosis, osteophytes and OA diagnosis, respectively (Table 3). When using the agreement classification proposed by Cicchetti [28], improvements from unaided to aided labelling were observed for all measures, except for sclerosis (Table 3).

Table 3 Agreement rates (ICCs) for junior and senior readers, by modality, together with the Cicchetti agreement classification [28]
Fig. 2
figure 2

Agreement rates between junior readers (blue) and senior readers (red) for the unaided (lighter) and aided (darker) modalities. Error bars denote standard errors of the ICC. Stars indicate statistically significant differences between the unaided and aided modalities, with a p value of < 0.01 considered statistically significant. JSN junior was the only nonsignificant result, with a p value of 0.125. Horizontal lines denote the thresholds separating poor, fair, good and excellent agreement, as defined by Cicchetti et al. [28]. The values for junior readers already published by Nehrer et al. [19]. OA was defined as KL > 1. KL Kellgren–Lawrence, JSN joint space narrowing, SC sclerosis, OS osteophyte, OA osteoarthritis)

In general, agreement rates for KL grade, JSN, and OA diagnosis with unaided labelling were higher between junior readers than between senior readers (Table 3, Fig. 2). Notably, when using aided labelling, ICCs for the junior and senior readers were comparable. Consequently, less pronounced improvements in ICC from unaided to aided labelling were found with junior readers (Table 3).

Senior reader performance in unaided and aided labelling

The accuracy of senior readers significantly improved with aided labelling for all measures (Fig. 3). Sensitivity only increased for OA diagnosis (i.e. KL grade > 1) while decreasing—without statistical significance—for all other scores. On the other hand, all measures showed a significant increase in specificity, indicating a decrease in overdiagnosis upon AI-aided labelling compared to the OAI ground truth (Fig. 3, Table 4).

Fig. 3
figure 3

Mean differences between senior reader AI software-aided and unaided labelling in sensitivity, specificity and accuracy for KL > 0, KL > 1, JSN > 0, sclerosis OARSI grade > 0, sclerosis OARSI grade > 1 and osteophyte OARSI grade > 0. Values to the right of the vertical line at 0 are improvements by the use of AI software. Error bars signify 95% confidence intervals. KL Kellgren–Lawrence, JSN joint space narrowing, SC sclerosis, OS osteophyte, OA osteoarthritis

Table 4 Accuracy, sensitivity and specificity of senior readers with unaided and aided modalities. The increase from unaided to aided labelling is highlighted in green, and the decrease is highlighted in red

Individual reader performance

The effect of the AI software on senior reader performance was comparable to that of our previous findings for junior readers only [19]. A reduction in FPR was observed, with no or little effect on TPR. Notably, two readers showed simultaneously increased TPR and reduced FPR regarding the feature “presence of OA” (i.e. KL > 1). For the “presence of osteophytes”, increased TPR and decreased FPR was found for another reader (Fig. 4).

Fig. 4
figure 4

Changes to the true positive rate (y-axis, TPR) and false-positive rate (x-axis, FPR) for each individual senior reader for KL > 0, KL > 1, JSN > 0, sclerosis OARSI grade > 0, sclerosis OARSI grade > 1 and osteophyte OARSI grade > 0. The black line denotes the ROC curve for the AI software within the dataset. Arrows point from the unaided to aided modalities. Arrows pointing upward and left indicate absolute improvements in detection ability. Note that even though some arrows point downward and left, the improvement in FPR was greater than the loss in TPR, representing a net increase in accuracy. KL Kellgren–Lawrence, JSN joint space narrowing, SC sclerosis, OS osteophyte


The main finding of the study was improvement in the senior reader agreement rate with aided analysis for KL grade, diagnosis of OA, JSN, osteophytes and sclerosis assessed on knee X-rays. Furthermore, the specificity and accuracy for all features mentioned increased with the AI-aided modality. Notably, the agreement and accuracy rates achieved when using aided analysis were comparable between senior readers and junior readers (from a different study).

Although over 100 commercially available AI applications similar to the tools investigated herein using CNNs are currently on the market, peer-reviewed literature on the impact of these systems on clinicians is scarce [18].

Nevertheless, reliable and homogenous evaluation of radiological images is necessary to improve the care of knee OA patients by means of timely treatment planning [29]. This applies particularly to the early stages of knee OA, in which well-established radiographic assessment scores such as the KL scale are prone to imprecision, as the incipient tissue damage characteristic of early OA is barely visible on X-rays [30, 31]. However, in settings with both junior and senior readers [32], as well as senior readers alone [29], assessment differences regarding the severity of radiographic knee OA features are found. Therefore, an AI-based tool supporting readers in decision-making may aid in the standardization of image evaluation.

In the current study, agreement rates for senior readers increased between 1.37-fold (for JSN) and 3.33-fold (for OA diagnosis) when utilizing AI software-aided labelling compared with unaided labelling. These observations are in line with a previous study on junior readers alone [19]. Comparable to previous results from our group on junior readers [19], use of AI-enhanced X-rays appears to standardize knee OA assessment among senior readers. This is of particular importance considering that initial agreement rates between senior readers and the OAI ground truth were evidently lower than those found between junior readers and the OAI ground truth, especially regarding OA diagnosis and KL grade. This discrepancy may be explained by the fact that orthopaedic specialists rely on their long-term experience when evaluating images rather than adhering to given scoring system definitions, as enforced during the truthing of the OAI. In the field of musculoskeletal radiology, comparable observations have been made by Peterlein et al. regarding developmental dysplasia of the hip assessment by ultrasonography [33], with similar performance found between medical students and paediatric orthopaedic surgeons [33]. Notably, in the present study, the junior and senior readers achieved similar agreement rates with AI software-aided labelling. This implies that orthopaedic specialists may benefit to a greater extent from AI software than senior residents. One may argue that any improvement in agreement with aided labelling may be related to some kind of cognitive bias—or “anchoring effect” [34], a phenomenon first observed in psychophysics [35]. It describes the situation in which a person’s decision is influenced to a considerable degree by a single, potentially irrelevant piece of information, i.e. the “anchor” [36]. As already outlined in our previous study [19], some facts eventually contradict this assumption. On the one hand, the readers had been specifically trained in assessment of X-ray features, and objective decision-making can be expected. On the other hand, improvements with aided labelling against the OAI consensus as the ground truth were mainly caused by an increase in specificity, reducing FPR (and thus overdiagnosis) with respect to the OAI consensus. Furthermore, the senior readers achieved better overall results regarding TPR and FPR with aided labelling compared to AI software alone. This implies a rather subordinate role of the “anchoring effect”, as the senior readers should have achieved results similar to the AI software unless they eminently relied on provided annotations.

As the pool of readers in the current study consisted of three board-certified orthopaedic surgeons, one may argue that generalizability was impaired. To overcome this issue, a specific ICC was calculated including the readers as random effects. Consequently, the pool of readers was treated as a sample of a larger pool, enabling better generalization of the results obtained.

Limitations of the study include the drop-out rate of 7.3% and the limited sample size of the 115 images ultimately analyzed. Furthermore, the number of readers involved might have biased the results obtained. Another potential source of bias was the 2-week interval chosen between the two assessments, which might have led to “memory bias” [37, 38]. Nevertheless, the effect of this bias is controversial in the literature, with studies on the presence of both strong [38] and weak [37] “memory bias” in imaging studies at two time points.

Moreover, AI might have been overfitted to the OAI dataset, eventually leading to a less pronounced difference between the senior and junior readers for varying datasets due to lower performance of the AI. Additionally, it cannot be ruled out that AI software itself sometimes classifies X-rays incorrectly, and aided image analysis would present readers with inaccurate information. Furthermore, the discrepant finding of better performance for senior residents compared to board-certified orthopaedic surgeons can only be explained by the hypothesis that experienced readers tend to analyze images in a less-structured and faster manner, relying on their long-term experience, but that less experienced readers are more likely to adhere to presented image classification systems.

In conclusion, use of AI-based KOALA software leads to improvement in the radiological judgement of senior orthopaedic surgeons with regard to X-ray features indicative of knee OA and KL grade, as measured by the agreement rate and overall accuracy in comparison to the ground truth. Moreover, the agreement and accuracy rates of senior readers were comparable to those of junior readers with aided analysis. Consequently, standard of care may be improved by the additional application of AI-based software in the radiological evaluation of knee OA.