Background

For hip joint pathologies, two major operative treatments exist: hip preservation and hip replacement. The presence of osteoarthritis is a critical factor in a surgeon’s decision between the two options [1]. Efforts to preserve the hip joint are hindered by the presence of osteoarthritis [2]. Therefore, a reliable evaluation of the degree of osteoarthritis is necessary for optimizing patient outcomes. Radiographic assessment provides essential information for the diagnosis and treatment of osteoarthritis [3]. The traditional Tönnis Classification System is commonly used to classify the severity of osteoarthritis. The literature generally supports hip preservation for hips graded as Tönnis 0 and 1, and replacement for hips graded as Tönnis 2 and 3 [2, 4]. However, despite its extensive use in clinical practice and the medical literature, the traditional Tönnis Classification System has some drawbacks [5]. First, several studies have reported questionable inter-observer and intra-observer reliability [3, 6, 7]. Second, a cardinal drawback of the traditional Tönnis Classification System is its subjectivity: it has been criticized for being unclear and having overlapping parameters. A further difficulty may arise when parameters from different grades are found in a single radiograph, e.g. moderate loss of head sphericity and slight narrowing of the joint space, which pertain to grades 2 and 1, respectively [5]. The pitfalls associated with the traditional Tönnis Classification System reach beyond the boundaries of orthopedics and may have multidisciplinary manifestations that impair communication among radiologists, general practitioners, and rheumatologists. Similar to the traditional Tönnis Classification System, the Garden Classification for femoral neck fractures has also demonstrated poor reliability, owing to the challenging radiographic distinctions between its grades.
Based on the clinical relevance of the Garden Classification, a simplified binary classification was developed that demonstrated higher reliability than the original classification [8,9,10,11]. Given the binary nature of the available surgical interventions (i.e. hip preservation versus reconstruction) that follow from the traditional Tönnis Classification System, a two-level classification could be more reliable and reproducible without compromising clinical relevance. Taking into consideration Occam’s Razor [12], which states that the simplest answer is typically the correct one, a two-level classification for surgical treatment options seems most appropriate. The goal of this study is to validate a simplified Binary Tönnis Classification System that reduces unnecessary complexity and better captures the clinical meaning of the assigned grade. Specifically, this study (1) compares the inter-observer and intra-observer reliability of the traditional Tönnis Classification System and a new simplified Binary Tönnis Classification System for hip osteoarthritis and (2) evaluates the clinical applicability of both systems, notably their agreement with the clinician’s decision for either preservation or replacement. Our hypothesis is that a binary system will have better reliability and better agreement with surgical decision-making.

Methods

Patient selection and data acquisition

Forty consecutive patients who presented to the clinic for hip pain between February 2018 and March 2018 were selected to participate in this study. Patients were included in the study if they were between 35 and 60 years of age. Patients were excluded if they had prior ipsilateral or contralateral surgeries or had prior hip conditions such as Legg-Calvé-Perthes disease, slipped capital femoral epiphysis, pigmented villonodular synovitis, or ankylosing spondylitis. All patients underwent operative management due to radiographic femoroacetabular impingement (FAI), osteoarthritis, and/or symptoms of hip pain that were unresponsive to conservative treatment and significantly limited activities. Demographic data, such as patient sex, laterality, and age at surgery, were collected for all patients.

All patients underwent routine radiographic imaging at their preoperative clinic visit. A standard anteroposterior supine radiograph, obtained according to the protocol detailed by Clohisy et al. [13], was used in this study to grade the severity of osteoarthritis.

This study was approved by the Institutional Review Board and did not receive any funding. All patients provided written consent to participate in the American Hip Institute Hip Preservation Registry. While the present study represents a unique analysis, data on some patients in this study may have been reported in other studies.

Classification systems

The traditional Tönnis Classification System (Table 1) and the simplified Binary Tönnis Classification System (Table 2) were used in this study. The simplified Binary Tönnis Classification System was designed to reflect the two primary indications for which our institution uses the traditional Tönnis Classification System: hip preservation or reconstruction.

Table 1 Traditional Tönnis Classification
Table 2 Simplified Binary Tönnis Classification

Inter-rater reliability and agreement with surgical treatment

Five fellowship-trained hip surgeons from a single center were the observers for this study. Three observers were hip preservation and reconstruction fellows and two observers were attendings who had trained in both hip preservation and reconstruction. Radiographic grading of hip osteoarthritis is part of the observers’ daily practice. However, to minimize inter-observer discrepancies, both the traditional Tönnis Classification System and the Binary Tönnis Classification System were provided on each individual Excel sheet that was used to grade the radiographs. This was a fully crossed study in which all observers read the same set of radiographs. All images were uploaded to the digital imaging system and retrieved by a non-observer who randomized and blinded the films (Fig. 1).

Fig. 1

Image of a blinded hip radiograph. The medical reference number and the side under consideration are shown in red.

The five observers independently assessed the series of radiographs. Observers first classified the radiographs using the traditional Tönnis Classification System and then, after at least 1 week had elapsed, rated a newly randomized set of radiographs with the Binary Tönnis Classification System. The images were randomized again, and observers repeated their respective assessments at least 3 weeks later.

Statistical analysis

Statistical analysis was conducted in R (version 3.6.0; R Foundation for Statistical Computing) and Microsoft Excel (Redmond, WA). Demographic data were separated and analyzed for patients who underwent arthroscopy or THA. The Chi-squared and Fisher’s Exact tests were used to evaluate differences in the proportions of categorical data between the arthroscopy and THA groups. For continuous variables, the F-test was performed to evaluate equality of variance, and the Shapiro-Wilk test was used to evaluate normality of the distribution. A p value > 0.05 indicated equal variance and normal distribution, respectively. The independent-samples t-test was performed for unpaired comparisons between the two groups. The significance level was set at 0.05.
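As an illustration of the categorical comparison described above, the following is a minimal sketch of a two-sided Fisher’s Exact test for a 2 × 2 table, written in Python with the standard library only. It is not the R code used in this study; the function name and the example counts are hypothetical and are not study data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact test p value for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of observing x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts (NOT study data): rows = treatment group, columns = sex.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"two-sided p = {p_value:.4f}")
```

The p value would then be compared with the 0.05 significance level stated above.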

Intra-observer and inter-observer reliability were calculated using Cohen’s κ coefficient for the traditional Tönnis Classification System and the simplified Binary Tönnis Classification System. Further, the traditional Tönnis Classification System was dichotomized (grades 0 and 1 vs. grades 2 and 3) and Cohen’s κ coefficient was calculated. The multi-rater κ was calculated using the weighted Fleiss method. The degree of agreement based on the κ coefficient was interpreted using the ranges recommended by Landis and Koch: a κ value of 0–0.2 indicated slight agreement, 0.2–0.4 fair, 0.4–0.6 moderate, 0.6–0.8 substantial, and greater than 0.8 near perfect [14].
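The reliability statistics named above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the R code used in this study: the function names are hypothetical, the weighted Cohen’s κ assumes linear weights (the weighting scheme is not specified in the text), and the multi-rater function is the standard unweighted Fleiss κ rather than the weighted variant mentioned above.

```python
def cohen_kappa(rater_a, rater_b, categories, weighted=False):
    """Cohen's kappa for two sets of ratings of the same radiographs.

    With weighted=True, linear disagreement weights |i - j| / (k - 1)
    are used (an assumption; the scheme is not stated in the text).
    """
    n, k = len(rater_a), len(categories)
    index = {c: i for i, c in enumerate(categories)}
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        counts[index[a]][index[b]] += 1
    row = [sum(counts[i]) for i in range(k)]
    col = [sum(counts[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        if weighted:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(weight(i, j) * counts[i][j] for i, j in cells) / n
    expected = sum(weight(i, j) * row[i] * col[j] for i, j in cells) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when only one category is used")
    return 1.0 - observed / expected


def fleiss_kappa(ratings, categories):
    """Unweighted Fleiss' kappa; ratings[i] lists every rater's grade for subject i."""
    n_subjects, n_raters = len(ratings), len(ratings[0])
    counts = [[subject.count(c) for c in categories] for subject in ratings]
    # Mean pairwise agreement across subjects.
    p_bar = sum(
        (sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects
    # Chance agreement from the overall category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / (n_subjects * n_raters)) ** 2
        for j in range(len(categories))
    )
    return (p_bar - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Map kappa to the Landis and Koch labels quoted in the text."""
    if kappa < 0:
        return "poor"  # Landis and Koch label for values below zero
    for upper, label in ((0.2, "slight"), (0.4, "fair"),
                         (0.6, "moderate"), (0.8, "substantial")):
        if kappa <= upper:
            return label
    return "near perfect"


# Hypothetical grades (NOT study data): two reads of four radiographs.
kappa = cohen_kappa([0, 1, 2, 3], [1, 1, 2, 3], [0, 1, 2, 3], weighted=True)
print(round(kappa, 3), landis_koch(kappa))
```

Expressing κ through disagreement weights lets the same function cover the unweighted case (all off-diagonal weights equal to 1) and the weighted case used for the four-grade scale, where a one-grade disagreement is penalized less than a three-grade disagreement.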

The traditional Tönnis Classification System and the simplified Binary Tönnis Classification System were assessed for agreement with the surgical treatment received by the patient (either hip preservation or hip replacement).
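A minimal sketch of this agreement assessment is given below, assuming that agreement is expressed as the proportion of radiographs whose grade-implied treatment matches the treatment received. The function names and the example grades and treatments are hypothetical and are not study data.

```python
LABELS = ("preservation", "replacement")


def dichotomize(tonnis_grade):
    """Map a traditional Tönnis grade to the treatment it implies (0-1 vs. 2-3)."""
    if tonnis_grade not in (0, 1, 2, 3):
        raise ValueError(f"invalid Tönnis grade: {tonnis_grade}")
    return "preservation" if tonnis_grade <= 1 else "replacement"


def confusion_matrix(predicted, actual):
    """Count (grade-implied treatment, treatment received) pairs."""
    matrix = {(p, a): 0 for p in LABELS for a in LABELS}
    for p, a in zip(predicted, actual):
        matrix[(p, a)] += 1
    return matrix


def capture_rate(matrix):
    """Proportion of cases in which the implied and received treatments match."""
    total = sum(matrix.values())
    correct = sum(n for (p, a), n in matrix.items() if p == a)
    return correct / total


# Hypothetical example (NOT study data): one observer's grades for 8 hips.
grades = [0, 1, 1, 2, 3, 2, 1, 3]
received = ["preservation", "preservation", "replacement", "replacement",
            "replacement", "preservation", "preservation", "replacement"]

matrix = confusion_matrix([dichotomize(g) for g in grades], received)
print(matrix)
print(f"capture rate = {capture_rate(matrix):.2f}")
```

For the Binary Tönnis Classification System the dichotomization step is unnecessary, because each grade already corresponds to one of the two treatments.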

Results

The study sample comprised 40 anteroposterior hip radiographs, 19 from patients who underwent hip preservation and 21 from patients who underwent hip replacement. There were 15 males and 25 females (age range 35.05–59.25 years). The demographics of the overall group and subgroups are presented in Table 3.

Table 3 Demographics

The traditional Tönnis Classification System showed moderate inter-observer reliability (κ = 0.474) and excellent intra-observer reliability (mean κ = 0.866, range = 0.780–0.907), as calculated by weighted κ.

Both inter-observer and intra-observer reliability improved with the simplified Binary Tönnis Classification System. The inter-observer reliability was κ = 0.858 and the intra-observer reliability was mean κ = 0.928 (range = 0.892–0.948). Both were deemed excellent (Table 4).

Table 4 Intra-observer reliability of the classification systems

The Tönnis grades assigned under both systems and their agreement with the ultimate surgical management are presented in Table 5. On average, the simplified Binary Tönnis Classification System correctly captured 87% of cases. When the traditional Tönnis Classification System was dichotomized (grades 0 and 1 as hip preservation and grades 2 and 3 as hip replacement), the capture rate was 84%. The confusion matrices for the capture rates are shown in Tables 6 and 7.

Table 5 Agreement between assessed parameters on plain x-ray
Table 6 Confusion Matrix for Dichotomized traditional Tönnis Classification System
Table 7 Confusion Matrix for Binary Tönnis Classification System

Discussion

The aim of this study was to validate a simplified Binary Tönnis Classification System. In this study, 40 radiographs of consecutive patients were analyzed by five fellowship-trained orthopedic surgeons. Overall, the Binary Tönnis Classification System showed better inter-observer and intra-observer reliability and a higher rate of agreement with the ultimate surgical treatment, as recommended by the treating surgeon, than the traditional Tönnis Classification System.

In their study, Clohisy et al. [3] (Table 8) evaluated the ability of hip specialists to reliably reach the correct diagnosis based on plain radiographs alone. Five hip specialists and one fellow performed a blinded radiographic review of 25 hips with developmental dysplasia, 27 hips with femoroacetabular impingement, and 25 control hips. The readers assessed a variety of radiographic parameters, including osteoarthritis using the traditional Tönnis Classification System. The combined κ values for intra- and inter-observer reliability of the traditional Tönnis Classification System were 0.60 (95% CI: 0.54–0.66) and 0.59, respectively. Furthermore, Steppacher et al. [6] had two readers assess the Tönnis grade for a set of 50 radiographs of dysplastic hips. The reported intra-observer κ ranged from 0.73 to 0.74, and the Fleiss κ for inter-observer reliability was 0.74. Clohisy et al. attributed the difference between their results and those of Steppacher et al. to their inclusion of a non-dysplastic cohort, in contrast to the dysplastic-only cohort studied by Steppacher et al. The higher radiographic variability may have contributed to a decrease in reliability, especially in cases with no or mild arthritis. Troelsen et al. [7] investigated the variability of diagnostic assessment of the hip joint. In their study, four observers independently assessed the level of osteoarthritis in 25 radiographs. Troelsen et al. also dichotomized the Tönnis grades and assessed the inter-observer reliability of both the four-grade classification and its dichotomized version, as well as their agreement with CT. The κ for inter-observer agreement was 0.54 for the Tönnis classification and 0.66 for the dichotomized version. Furthermore, the observed agreement with CT was 70% for the traditional Tönnis Classification System and 88% for the dichotomized alternative. In the present study, the κ values for intra- and inter-observer reliability of the traditional Tönnis Classification System were 0.87 and 0.47, respectively.
In contrast to the evidence reported for the traditional Tönnis Classification System, the simplified Binary Tönnis Classification System demonstrated excellent inter- and intra-observer κ values (0.86 and 0.93, respectively). Additionally, this study supports the findings of Troelsen et al. on dichotomizing the traditional Tönnis Classification System. Adopting a true binary classification will better serve the clinician, as it eliminates the need for a preliminary, less reliable four-level classification that requires further dichotomization to determine treatment.

Table 8 Previously published X-ray Reliability Studies for the Traditional Tönnis Classification System

The second step in validating the simplified Binary Tönnis Classification System was to assess its reliability in indicating the surgeon-recommended treatment. Valera et al. [1] evaluated the reliability of the traditional Tönnis Classification System as a reference for hip preservation. Three orthopedic surgeons examined 117 hip radiographs for hip joint osteoarthritis according to the traditional Tönnis Classification System. The κ values for inter-observer reliability were slight or fair (0.173–0.379) and the κ values for intra-observer reliability were fair (0.364–0.379). Variance in classifying low-grade osteoarthritis was the major cause of disagreement between observers. In contrast, experience did not play a significant role in grading reproducibility. The authors concluded that the traditional Tönnis Classification System is a poor method of assessing early hip osteoarthritis and that its routine use in clinical decision-making for hip preservation surgery should be reconsidered. In this study, the traditional Tönnis Classification System agreed with the actual surgical treatment in 85.2% of cases. The simplified Binary Tönnis Classification System had a higher level of agreement, correctly capturing 86.5% of the cases. While the binary classification did show slightly better agreement with the indicated treatment, it should be emphasized that radiographic evaluation is only part of the overall patient assessment, and thus some discrepancy between either classification system and the treatment actually performed should be expected.

In summary, the simplified Binary Tönnis Classification System addresses the drawbacks of the traditional Tönnis Classification System without compromising clinical relevance. Adoption of a binary system would allow for more consistent data collection, thus improving the quality of studies. For the clinician, a two-grade classification is a more practical match for a two-option treatment decision.

Limitations

The major limitation of this study stems from its retrospective nature. We minimized this effect by blinding the investigators to any identifier, including name and treatment. In addition, we excluded patients who had undergone contralateral surgery, which could have biased grading. Also, the readers in this study were all surgeons. Better generalizability might have been achieved by including multidisciplinary readers (e.g. radiologists). Furthermore, whereas the actual procedure performed by the senior surgeon, either arthroscopy or arthroplasty, was indicated based on the overall patient assessment, the procedures assigned in this study were based exclusively on the radiographic classifications. This single-blinded design may have introduced bias into the study. Despite the effort to minimize selection bias by choosing a consecutive series of patients, the resulting cohort was fairly homogeneous in terms of demographic characteristics, which may itself limit the generalizability of the results. Last, patients without osteoarthritis, who traditionally were classified as Tönnis 0, no longer have a distinct grade under the binary classification. This may potentially lead to a lower threshold for indicating surgery. However, hip arthroscopy is generally indicated based on intra-articular mechanical impairments such as FAI and labral tears, and osteoarthritis is normally considered a contraindication for such preservation procedures.

Conclusion

A simplified Binary Tönnis Classification System demonstrates better reliability and clinical applicability than the traditional Tönnis Classification System.