History

Osteoarthritis (OA) ranks globally among the 50 most common sequelae of diseases and injuries, affecting over 250 million people, or 4% of the world’s population [39]. Knee OA accounts for 83% of the global disease burden of OA [39]. A detailed analysis of Medicare beneficiaries reported annual utilization of total knee arthroplasty (TKA) ranging from 287,006 procedures in 2006 to 301,956 in 2010 [26]. Demand for TKA is expected to grow substantially over the coming decades, with epidemiological data suggesting a 673% increase in the United States by 2030, representing 3.48 million procedures annually [22].

As a polymorphic disease with a variety of clinical presentations, OA is challenging to define rigorously. A commonly encountered definition describes “…a heterogeneous group of conditions that leads to joint symptoms and signs which are associated with defective integrity of articular cartilage, in addition to related changes in the underlying bone and at the joint margins” [1]. The pathogenesis of OA is poorly understood but is thought to involve a complex interplay among mechanical, biochemical, cellular, genetic, and immunologic phenomena [7]. Several attempts to develop diagnostic criteria for OA have been undertaken; these generally combine patient-reported joint pain with consistent radiographic findings [1, 6, 24]. OA can generally be subcategorized into primary (idiopathic) and secondary forms [1, 25]. Common causes of secondary OA include posttraumatic, dysplastic, infectious, inflammatory, and biochemical etiologies, which are relatively well understood. Although the etiology of primary OA remains largely undefined, genetic factors, age-related physiological changes, ethnicity, and biomechanical factors likely play an important role [16].

Plain radiography remains a mainstay in the diagnosis of OA. The first formalized attempt at establishing a radiographic classification scheme for OA was described by Kellgren and Lawrence (KL) in 1957 [19]. After studying rheumatism in coal miners at the Bedford Colliery in North West England [18], Kellgren investigated the inter- and intraobserver reliability of radiographic changes of rheumatism observed in the hand [17]. After concluding that there was wide disagreement among different observers, KL endeavored to establish a classification scheme with an associated set of standardized radiographs for OA of diarthrodial joints. They proposed a five-grade classification scheme and, to calculate the inter- and intraobserver reliability of each joint, examined plain radiographs of eight joints: the distal interphalangeal joint (DIP), metacarpophalangeal joint (MCP), first carpometacarpal joint (CMC), wrist, cervical spine, lumbar spine, hips, and knees [19]. They found that the tibiofemoral joint of the knee had the highest interobserver correlation coefficient (r = 0.83; range of all joints studied, 0.10–0.83) as well as the second highest intraobserver correlation coefficient (r = 0.83; range of all joints studied, 0.42–0.88) among the diarthrodial joints they examined [19]. These early results presaged the eventual application of their classification scheme to the knee specifically. Currently, the KL classification is the most widely used clinical tool for the radiographic diagnosis of OA [5].

Purpose

The KL classification has been commonly used as a research tool in epidemiological studies of OA, including landmark articles by Felson et al. [10] from the Framingham Osteoarthritis Study and by Bagge et al. [3] assessing OA in European populations. The KL classification was also used in the development of atlases of radiographic features of OA, including the work done by Scott et al. [33].

The KL classification may also assist healthcare providers with a treatment algorithm to guide clinical decision-making, specifically defining which patients may benefit most from surgical management. Furthermore, some insurers currently require providers to include documentation of the KL classification to receive approval for a TKA [37, 38].

Despite its common use, all research and clinical efforts using the KL classification depend critically on rigorous validation and continuous reevaluation of the schema’s relevance to patient-centered outcomes.

Description of the Kellgren-Lawrence Classification System

Consistent with the data presented in their original work, the KL classification is typically applied specifically to knee OA. The classification was originally described using AP knee radiographs. Each radiograph was assigned a grade from 0 to 4, corresponding to increasing severity of OA, with Grade 0 signifying no OA and Grade 4 signifying severe OA [19] (Fig. 1). Additionally, KL provided detailed radiographic descriptions of OA (Table 1). Although it is unclear from the original paper whether these radiographic descriptions were presented with the intent of demonstrating a linear disease progression of OA that begins with the formation of osteophytes and culminates in the altered shape of bone ends, other authors have criticized the KL system on the basis of this assumption [5, 27].

Fig. 1A–D

AP radiographs of the knee presented in the original Kellgren-Lawrence article [19]. (A) Representative knee radiograph of KL classification Grade 1, which demonstrates doubtful narrowing of the joint space with possible osteophyte formation. (B) Representative knee radiograph of KL classification Grade 2, which demonstrates possible narrowing of the joint space with definite osteophyte formation. (C) Representative knee radiograph of KL classification Grade 3, which demonstrates definite narrowing of joint space, moderate osteophyte formation, some sclerosis, and possible deformity of bony ends. (D) Representative knee radiograph of KL classification Grade 4, which demonstrates large osteophyte formation, severe narrowing of the joint space with marked sclerosis, and definite deformity of bone ends. Reprinted with permission from Kellgren JH, Lawrence JS. Radiological assessment of osteo-arthrosis. Ann Rheum Dis. 1957;16:494–502.

Table 1 Radiologic features of osteoarthritis as described by Kellgren and Lawrence [19]
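
For investigators who record KL grades programmatically (eg, in research databases), the ordinal 0 to 4 scheme maps naturally onto a small data structure. The following Python sketch is illustrative only; the comments paraphrase the grade labels and the descriptions quoted in the figure legend above, and the enum name, helper function, and Grade ≥ 2 threshold are assumptions made for illustration rather than elements of the original publication.

```python
# Illustrative sketch only: an ordinal representation of the KL grades,
# with comments paraphrasing the grade labels and the Fig. 1 descriptions.
# The enum name and the Grade >= 2 threshold below are conventions assumed
# for illustration, not part of the original publication.
from enum import IntEnum


class KLGrade(IntEnum):
    NONE = 0      # no radiographic features of OA
    DOUBTFUL = 1  # doubtful joint space narrowing, possible osteophytes
    MINIMAL = 2   # definite osteophytes, possible joint space narrowing
    MODERATE = 3  # moderate osteophytes, definite narrowing, some sclerosis,
                  # possible deformity of bone ends
    SEVERE = 4    # large osteophytes, marked narrowing and sclerosis,
                  # definite deformity of bone ends


def has_definite_radiographic_oa(grade: KLGrade) -> bool:
    """Common research threshold: Grade >= 2 is often read as definite radiographic OA."""
    return grade >= KLGrade.MINIMAL


print(has_definite_radiographic_oa(KLGrade(3)))  # True
```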

Validation

In their original paper, KL acknowledged variability in the radiographic evaluation of OA and described the inter- and intraobserver reliability of their system [19] (Table 2). In particular, the authors commented that estimates of OA prevalence in all joints examined varied considerably between observers (± 31% deviation from the mean number of diagnoses of OA) and less so within the same observer (± 5% deviation from the mean number of diagnoses of OA). The authors reported the inter- and intraobserver reliability correlation coefficients for the knee to be 0.83 in both cases and commented that, “A significant correlation between the two observers was obtained for all joints except the wrist” and “Two readings by the same observer gave only a slightly better correlation on the reading of individual x rays…” [19]. It is unlikely, however, that the authors used the term “significant” in the statistical sense the word carries in the literature today, because no p values or confidence intervals were reported for their data. To determine the inter- and intraobserver correlation coefficients, KL used two observers to evaluate a series of 510 radiographs of eight joints (the DIP, MCP, first CMC, wrist, cervical spine, lumbar spine, hips, and knees) from 85 patients aged 55 to 64 years selected randomly from an urban population. The authors did not explicitly report how many radiographs of each joint were used in their calculations, nor did they mention the qualifications or training of the two observers who graded the radiographs.
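
To make these reliability figures concrete, the following sketch shows how agreement between two observers grading the same radiographs might be computed with present-day tools. The data are hypothetical, and KL’s exact 1957 calculation is not reproduced here; a simple Pearson correlation is shown as a loose analogue of the r values they reported, alongside a quadratic-weighted Cohen’s kappa, a more common modern choice for ordinal scales.

```python
# Minimal sketch, using hypothetical data: agreement between two observers
# assigning KL grades (0-4) to the same set of knee radiographs.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

observer_a = [0, 1, 2, 2, 3, 4, 1, 0, 3, 2]  # hypothetical KL grades, reader A
observer_b = [0, 1, 2, 3, 3, 4, 2, 0, 2, 2]  # hypothetical KL grades, reader B

# Simple correlation coefficient, loosely analogous to the r values KL reported.
r, _ = pearsonr(observer_a, observer_b)

# Quadratic-weighted Cohen's kappa, a common modern choice for ordinal scales
# such as KL grades; it penalizes large disagreements more heavily than small ones.
kappa = cohen_kappa_score(observer_a, observer_b, weights="quadratic")

print(f"Pearson r = {r:.2f}, weighted kappa = {kappa:.2f}")
```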

Table 2 Inter- and intraobserver correlation coefficients of the Kellgren and Lawrence classification in various joints [19]

More recently, Wright [40] reevaluated the interobserver reliability of the KL system along with five other radiographic classification schemes used to grade knee OA (Table 3). The group used radiographs from 632 patients enrolled in the Multicenter ACL Revision Study (MARS) consortium, which consisted of cohorts of patients treated by 83 surgeons at 52 sites. The investigators used three independent and blinded observers (specific qualifications and training not explicitly stated by the authors) to grade weightbearing AP and/or Rosenberg radiographs (depending on availability) with the six radiographic classification schemes. In addition, the radiographic findings were compared with arthroscopic evidence of tibiofemoral chondral disease. They reported that the KL system was the most studied of the classification systems, with interobserver reliability intraclass correlation coefficients of 0.51 to 0.89 (considered “moderate” to “very good” based on the authors’ definitions) in studies published since the original KL article [40]. The investigators reported their own interobserver reliability intraclass correlation coefficients of 0.54 (95% confidence interval [CI], 0.48–0.59) for the Rosenberg radiograph and 0.38 (95% CI, 0.33–0.43) for the AP radiograph, which they characterized as moderate and poor, respectively [40]. The authors suggested that differences in technique, population age group, and degree of OA likely contributed to the wide range of results across studies. It is also likely that variable interpretations of the KL classification by observers in different studies had a considerable effect on the reliability data they reported.

Table 3 Grading scales for the radiographic osteoarthritis classification systems [40]

In the 1987 study of the Framingham population by Felson et al. [10], two academically based bone and joint radiologists examined standing AP radiographs of the knees of 1424 elderly patients (aged 63 to 94 years; mean, 73 years) and reported an interrater intraclass correlation coefficient of 0.85 (“very good”) using the KL classification. Although the study provided unique population-based data using highly trained observers, the authors acknowledged that their population lacked ethnic diversity, with no black, Hispanic, or other non-European ethnic groups represented. Given the growing ethnic diversity of the US population, it may be difficult to extrapolate these results to the broader population. In later work, Scott et al. [33] used two skeletal radiologists and two rheumatologists to examine 30 standing AP knee radiographs from the Baltimore Longitudinal Study of Aging, randomly selected by an investigator not involved in the readings, and reported an interreader intraclass correlation coefficient of 0.68 (“good”) and an intrareader intraclass correlation coefficient of 0.87 (“very good”). The authors did not provide detailed ethnic demographic information about their study population but reported that their subjects consisted of 25 men and five women (aged 42 to 84 years; mean age, 67 years for men and 71 years for women). Both studies acknowledged selecting a similar number of radiographs from each KL grade for calculating their intraclass correlation coefficients, which may yield more stable reliability estimates but may not be as readily applicable to larger populations. Another study, by Gossec et al. [12], used two trained rheumatologists to evaluate the interreader and intrareader reliability of standing, extended knee radiographs in 50 radiographs selected from 1759 radiographs in five databases of trials or cohort studies; the KL classification had an interreader intraclass correlation coefficient of 0.72 (95% CI, 0.38–0.86) (“good”) and an intrareader intraclass correlation coefficient of 0.72 (95% CI, 0.55–0.83) (“good”). The authors did not explicitly state the criteria for selecting the 50 radiographs from the collection of 1759, and it is not clear whether the selection included relatively equal numbers of radiographs from each KL grade. Given the differing demographic makeup of each study population (ethnicity, proportion of men and women, and age), the differing selection criteria for the radiographs examined, and observers who varied in number, qualifications, and training, it is understandable that the range of interobserver intraclass correlation coefficients for the KL classification cited by Wright [40] (0.51–0.89) is as wide as it is.

Although the reliability of the KL classification has been examined by multiple authors, less is known about its diagnostic accuracy, that is, the degree to which the radiographic findings actually reflect the physical state of the joint. The best work on this question suggests that using the Rosenberg view (the 45° posteroanterior flexion weightbearing radiograph; Fig. 2 [28]) results in higher interrater reliability (0.54; 95% CI, 0.48–0.59) and better correlation with arthroscopic evidence of OA (Spearman rho, 0.42; 95% CI, 0.33–0.49) than does the AP radiograph (interrater reliability, 0.38; 95% CI, 0.33–0.43; Spearman rho, 0.30; 95% CI, 0.23–0.38) [40]. Those authors argue that this may be because the Rosenberg radiograph provides better visualization of the femoral condyles in midflexion, a common site of articular surface degeneration.
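
The correlation with arthroscopy reported by Wright [40] is, in essence, a rank correlation between two ordinal scales. A minimal sketch of that kind of comparison, using entirely hypothetical paired gradings, might look like the following.

```python
# Minimal sketch, using hypothetical paired gradings: rank correlation between
# radiographic KL grades and an ordinal arthroscopic chondral-damage grade.
from scipy.stats import spearmanr

kl_grades = [1, 2, 2, 3, 4, 0, 3, 2, 1, 4]           # hypothetical KL grades
arthroscopy_grades = [1, 1, 3, 3, 4, 0, 2, 2, 2, 3]   # hypothetical chondral grades

rho, p_value = spearmanr(kl_grades, arthroscopy_grades)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```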

Fig. 2

Diagram of how the Rosenberg radiograph would be set up and performed [28]. Reprinted with permission from Rosenberg TD, Paulos LE, Parker RD, Coward DB, Scott SM. The forty-five-degree posteroanterior flexion weight-bearing radiograph of the knee. J Bone Joint Surg Am. 1988;70:1479–1483.

Limitations

Despite the wide application of the KL classification, the system has several noted limitations. Perhaps the most frequently cited criticism of the KL system concerns its application to disease progression and its insensitivity to change. Spector and Cooper [35] criticized the KL system for assuming a linear radiographic progression of OA, starting with osteophyte formation, proceeding to joint space narrowing (JSN), and terminating in deformation of the articular surfaces. However, as those authors point out, JSN in the absence of osteophyte formation cannot be graded in the KL system, which becomes problematic in patients with knee pain and radiographically evident loss of cartilage but no osteophytes on their knee radiographs. Although this presentation is thought to be less common, marginal osteophytes likely represent a hypertrophic response to mechanical stress in conjunction with enchondral ossification, which may occur along a spectrum depending on individual physiology [21]. Other classification systems have attempted to separate the radiologic findings by evaluating JSN and osteophyte formation individually in the medial and lateral compartments [2, 5, 27, 34]. However, these systems have not been widely adopted, likely because of challenges in standardizing definitions and an inherent difficulty in discriminating JSN from osteophyte formation. In addition, Günther and Sun [13], using three observers (one experienced orthopaedic surgeon and two orthopaedic residents), described inferior interobserver intraclass correlation coefficients for medial compartment (0.62) and lateral compartment (0.47) JSN, as well as for medial femorotibial, lateral femorotibial, and tibial spine osteophyte formation (0.75, 0.74, and 0.63, respectively), compared with the overall KL classification score (0.81). As discussed by Emrani et al. [8], classifications based on JSN may be preferable to the KL system for monitoring progression of OA, whereas the KL system may be better for assessing the severity of osteoarthritic disease. A criticism of multiple classification systems (including the KL system) is the lack of recognition of patellofemoral arthritis as a distinct or contributory radiographic factor.

Another criticism of the KL system is the inconsistency of its original description and the variable application of the classification in subsequent studies. In their original paper, KL provided simple descriptors for each grade (“none, doubtful, minimal, moderate, severe”) along with radiographic features considered evidence of OA [19] (Table 1); however, they never explicitly specified which radiographic features correspond to which grade. Years after the original article, Grade 2 was redefined as “the presence of definite osteophytes with minimal joint space narrowing” [20]. Later still, Grade 2 was changed again to “definite osteophytes but the joint space is unimpaired” [23]. These conflicting descriptions, although they were not present in the initial paper or made by its original authors, have nonetheless led to substantial confusion among investigators and inconsistent application of the system.

Schiphof et al. [30] examined epidemiological cohort studies published between 1966 and 2006 that incorporated the original KL criteria and found five different descriptions of the KL grading system; interestingly, some cohort studies contained inconsistencies within the same article. They later examined the impact of these differing descriptions of the KL criteria [31] by having two trained readers apply the five descriptions to the weightbearing, extended, AP knee radiographs of 3071 people. The authors calculated reproducibility, agreement using the κ statistic, and association with patient-reported knee complaints. They determined that the reproducibility of three of the descriptions was “good” (weighted κ = 0.66, 0.69, and 0.63), whereas the reproducibility of the original description and of a fifth description was “moderate” and “poor,” respectively (weighted κ = 0.41 and 0.35). The agreement between the original KL criteria and the four alternatives was “moderate” (weighted κ approximately 0.50). The association with patient-reported knee complaints was strongest with the original description (odds ratio [OR], 2.2 [95% CI, 1.9–2.7], 4.3 [95% CI, 3.2–5.7], and 18.3 [95% CI, 7.1–47.2] at Grade > 1, Grade > 2, and Grade > 3, respectively); however, the differences in OR between the original and alternative descriptions were small and sometimes not significant [31]. Additionally, the authors acknowledged that patient-reported knee complaints could be influenced by the presence of patellofemoral OA, which the KL system does not take into account and which must be considered when interpreting these data. The authors concluded with recommendations to use the original description to differentiate patients categorized as Grade ≥ 2 (definite/mild OA) from those categorized as Grade < 2 (none/possible OA) and to use several of the alternative descriptions to differentiate patients categorized as Grade 0 (no OA) from those categorized as Grade 1 (possible OA). Given these results, it is clear that discrepancies in the description and application of the KL system are present in the literature; however, the exact impact of these differences on patient care is unclear and requires further investigation.

Conclusions and Uses

Although the KL system has limitations, it remains widely used in clinical practice and research. Like any radiographic classification tool, the KL system is ideally used in conjunction with a thorough clinical assessment. Altman et al. [1] proposed criteria for diagnosing knee OA that combine the medical history, physical examination, and laboratory and radiographic tests, an approach that provides a more comprehensive assessment of a patient’s disease state than the KL system alone.

Radiographic classification systems such as the KL system seek to standardize the interpretation of studies that many clinicians order during the initial assessment of a patient presenting with clinical findings suggestive of knee OA. In their original paper, Kellgren and Lawrence intended to create a standard reference for the radiographic evaluation of OA for the purposes of field surveys and clinical trials. Although the KL system has been validated with respect to inter- and intraobserver reliability, and recent work has begun to examine its diagnostic accuracy [40], future research should focus on the development of treatment algorithms based on classification grade. Such algorithms may better guide clinical decision-making through an evidence-based approach.