Inter- and intraobserver reliabilities and critical analysis of the osteoporotic fracture classification of osteoporotic vertebral body fractures

The Osteoporotic Fracture Working Group (Spine Division of the German Orthopaedic and Trauma Society) has developed a classification system for osteoporotic thoracolumbar fractures, namely the osteoporotic fracture (OF) classification system. The purpose of this study was to determine the inter- and intraobserver reliabilities of the OF classification system for osteoporotic vertebral body fractures (VFs) at a level-one trauma centre. Conventional radiography, magnetic resonance imaging (MRI), and computed tomography (CT) scans of 54 consecutive women who sustained an osteoporotic VF were analysed by six orthopaedic traumatologists with varying levels of experience. The inter- and intraobserver reliabilities of the OF classification system were determined using intraclass correlation coefficients (ICCs) and Cohen’s kappa. The overall interobserver reliability of the OF classification system was good (ICC, 0.62 [0.51, 0.72]). The intraobserver reliability was found to be substantial (overall weighted Cohen’s kappa estimate [95% confidence interval {CI}] = 0.74 [0.67, 0.80]) and better when the radiography, MRI, and CT scans were assessed together than when only the radiography and MRI scans were evaluated, although the difference was not significant. The OF classification system is easy to use. It shows good interobserver reliability and substantial intraobserver reliability if diagnostic prerequisites (conventional radiography, MRI, and CT scans) are met.


Introduction
Osteoporotic vertebral body fractures (VFs) are common and are an essential part of everyday trauma surgery. Compared to a clearly traumatic VF, an osteoporotic VF is a separate entity. For traumatic fractures, there are valid classifications with reference to the surgical findings [1][2][3]; for osteoporotic VFs, this has not been the case for a long time. Osteoporotic VFs are inadequately reflected in the AO classification of traumatic fractures. With regard to risk assessment and medicinal osteological therapy recommendation, the German Osteological Association follows the morphological classification of osteoporotic VFs according to the Genant classification [4,5]. However, from a spine surgery perspective, this classification is not suitable for deriving a conservative versus surgical therapy recommendation. This was the main reason for the development of a new classification system for osteoporotic fractures (OF) of the spine. This was created by the Osteoporotic Fracture Working Group (Spine Division of the German Orthopaedic and Trauma Society [DGOU]) in 2013. For this purpose, data from 707 patients from 16 hospitals were collected and evaluated [6].
The established OF classification consists of five groups and is based on all available radiological examinations (radiography, computed tomography [CT], and magnetic resonance imaging [MRI]) (Fig. 1) [7]. According to the authors, standard diagnostic requirements should be consistently followed in all cases to assess the fracture type and to obtain reproducible treatment recommendations: radiography (whenever possible in the standing position), MRI (including a short-tau inversion recovery [STIR] sequence), and CT [8,9]. The additional OF score allows a link between classification and treatment recommendations. In addition to the fracture morphology (OF types 1-5 [OF1-5]), the OF score considers parameters of bone density, dynamics of sintering, pain (under analgesia), neurology (fracture-related), mobilisation (under analgesia), and health status [8].
According to the analysis of the Osteoporotic Fracture Working Group of the Spine Section of the DGOU, OF2, OF3, and OF4 accounted for 95% of all fractures [10]. The interobserver reliability of the OF classification has previously been evaluated by six raters from the above-mentioned working group in 146 fractures [10]. In this study, there was substantial agreement (Fleiss' kappa = 0.63; confidence interval [CI] not stated) according to the criteria by Landis and Koch [11]. Published data on intraobserver reliability are not yet available. Currently, the OF classification and OF score have yet to be proven feasible for use in clinical practice; however, some existing factors hinder their establishment.
In this study, the inter- and intraobserver reliabilities of the OF classification were assessed at a level-one trauma centre. Further, the value of MRI alone compared to the combination of CT and MRI was evaluated with regard to cross-sectional imaging after conventional radiography.

Materials and methods
Data from a consecutive series of 54 fractures in 54 female patients (mean age = 80.9 ± 8.6 years) with osteoporotic VF were evaluated. The inclusion criteria were female sex, age of ≥ 50 years, complete image diagnosis (radiography, CT, and MRI including a STIR sequence), no or low-energy trauma (fall from standing height), and osteoporosis according to Hounsfield units (HUs) on CT. The exclusion criteria were male sex, age of < 50 years, incomplete diagnosis according to the criteria listed above, high-energy trauma, and pathological fractures associated with tumours. Osteoporosis was confirmed using CT HU of the trabecular bone in L1. When a fracture or sclerosis of L1 (n = 30) prevented reliable measurement, a suitable neighbouring vertebra (T11, n = 1; T12, n = 8; L2, n = 19; L4, n = 1) was used to determine attenuation values. The scan parameters included a tube voltage of 120 kV with modulation of tube current (Aquilion, Canon Medical Systems, Neuss, Germany), and the HU was assessed in axial 0.5- or 1-mm slices at three different vertebra levels (Infinitt, Infinitt Healthcare Europe, Frankfurt am Main, Germany). Although no widely accepted cut-off value exists for the HU compared to dual-energy x-ray absorptiometry, high sensitivity and specificity were reported for a value of 110 HU, matching a T-score of − 2.5 [12,13]. Digital radiographs and cross-sectional images were analysed retrospectively as part of the study, with initial diagnostic radiography, CT, and MRI scans submitted. The patients' personal data were not identifiable on the imaging material. For patients with multiple new VFs, the study director determined in advance which vertebral body should be evaluated by all raters. The corresponding image material was presented in random order. To determine interobserver reliability, the six raters rated the images separately. The raters had different levels of training (residents [AA], n = 3; specialists in orthopaedics and trauma surgery [OA], n = 3). All raters were from the same university trauma surgery hospital. The residents were in advanced training (from the 3rd year of training) to become specialists in orthopaedics and trauma surgery. The specialists were three senior attending physicians at the hospital. The OF classification had been used in the clinic for approximately 1.5 years before the start of this study. As an aid, the OF classification system was available to the raters as a schematic illustration with explanatory text. Two months later, the same image material was assessed again in a different order by the same raters to determine the intraobserver reliability. All six raters processed all 54 fractures at both time points.
Fig. 1 OF1: This type is rare. The stable injury is clearly visible on the MRI-STIR sequence only. The radiography and CT scans do not show vertebral deformation. OF2: Deformation with no or only minor involvement of the posterior wall (< 1/5). This type of fracture affects one endplate only (impression fracture). The posterior wall can be involved, but only to a minor extent. OF2 is a stable injury. OF3: Deformation with distinct involvement of the posterior wall (> 1/5). This type of fracture affects one endplate only but shows distinct involvement of the anterior and posterior walls (incomplete burst fracture). The fracture can be unstable and may collapse further over time. OF4: Loss of integrity of the vertebral frame structure, vertebral body collapse, or pincer-type fracture. This subgroup consists of three fracture types. In case of a loss of integrity of the vertebral frame structure, both endplates and the posterior wall are involved (complete burst fracture). Vertebral body collapse is typically seen as a final consequence of a failed conservative treatment and can present as a flat vertebral body. Pincer-type fractures involve both endplates and may lead to severe deformity of the vertebral body. OF4 is an unstable fracture, and intravertebral vacuum clefts are often visible. OF5: Injuries with distraction or rotation. This group is rare but shows substantial instability. The injury involves not only the anterior column but also the posterior bony and ligamentous complexes. OF5 injuries can be caused either by a trauma directly or by ongoing sintering and collapsing of an OF4. CT, computed tomography; MRI, magnetic resonance imaging; OF, osteoporotic fracture; STIR, short-tau inversion recovery
Subsequently, we investigated whether the combination of conventional radiography and MRI provides sufficient information to assign the fracture to one of the five types of classification and whether CT is, therefore, dispensable in clinical practice. For this purpose, the raters were presented with the images in the following sequence: conventional radiography-MRI-CT, and were asked to make a diagnostic statement after viewing the radiography and MRI scans, as well as after viewing the radiography, MRI, and subsequent additional CT scans. The diagnostic statements were then compared (radiography/MRI versus radiography/MRI/CT) using point estimates with CIs and the descriptive categorisation according to the criteria by Landis and Koch [11].
Intraclass correlation coefficients (ICCs) were used to determine the interobserver reliability among the six raters [14]. The null hypothesis herein assumes that there is no difference between the raters' assessments. In our study, an ICC of 0.75 should be achieved to demonstrate good interobserver reliability according to the criteria by Cicchetti [15] and Koo and Li [14]. With a significance level of 5% and a power of 80%, this study then required a case number of 50 patients with six raters (SAS 9.4; SAS Institute Inc., Cary, NC, USA). The intraobserver reliability for the two measurement points was determined using Cohen's kappa [11,16]. A kappa value of 0.8 was initially assumed to demonstrate substantial intraobserver reliability. With a significance level of 5%, power of 80%, and proportion of positive ratings of 70%, 48 patients were required according to the criteria by Sim and Wright [16]. Since an ordinal five-level rating scale is available, Cohen's weighted kappa using Fleiss/Cohen quadratic weights was used [17]. Cohen's kappa was descriptively assessed according to the criteria by Landis and Koch [11]: poor: < 0.00, slight: 0.00-0.20, fair: 0.21-0.40, moderate: 0.41-0.60, substantial: 0.61-0.80, and almost perfect: 0.81-1.00.
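To make the intraobserver statistic concrete, the following is a minimal Python sketch of Cohen's weighted kappa with quadratic weights for one rater at two time points, together with the Landis and Koch categories listed above. It is an illustration only and not the SAS procedure used in the study; the function names and the example ratings are hypothetical, and OF types are assumed to be coded as integers 1 to 5.

```python
def quadratic_weighted_kappa(first, second, n_categories=5):
    """Cohen's weighted kappa with quadratic disagreement weights.

    `first` and `second` are equally long lists of ordinal ratings
    coded 1..n_categories (assumption: OF1..OF5 coded as 1..5).
    """
    if len(first) != len(second) or not first:
        raise ValueError("both rating lists must be non-empty and equally long")
    n = len(first)
    k = n_categories
    # Observed joint counts: rows = first rating, columns = second rating.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(first, second):
        observed[a - 1][b - 1] += 1
    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            weighted_observed += weight * observed[i][j] / n
            weighted_expected += weight * row_totals[i] * col_totals[j] / n ** 2
    if weighted_expected == 0.0:
        return float("nan")  # undefined when only one category was ever used
    return 1.0 - weighted_observed / weighted_expected


def landis_koch(kappa):
    """Descriptive category according to the cut-offs quoted in the text."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical ratings of four fractures by one rater at two time points.
time_point_1 = [2, 2, 3, 3]
time_point_2 = [2, 3, 3, 3]
kappa = quadratic_weighted_kappa(time_point_1, time_point_2)
print(kappa, landis_koch(kappa))  # 0.5 moderate
```

With quadratic weights, a disagreement between neighbouring types (for example OF2 versus OF3) is penalised far less than a disagreement between distant types, which suits an ordinal scale.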
The study design was approved by our Ethics Committee. The study was registered in the German Clinical Trials Register.

Results
The interobserver reliability was fair according to the criteria by Cicchetti when the radiography and MRI scans were analysed together without the CT scans (ICC = 0.52 [0.41, 0.64], Table 1). It was better when the radiography, MRI, and CT scans were analysed together (ICC = 0.62 [0.51, 0.72], Table 1) and could then be defined as good according to the criteria by Cicchetti. The overlapping CIs indicate that the difference was not significant.
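For readers who want to reproduce this kind of estimate, the sketch below computes a single-rater ICC from a fractures-by-raters table. The ICC form is not stated in the text, so the two-way random-effects, absolute-agreement form ICC(2,1) used here is an assumption, as are the function names and the illustrative ratings. The category cut-offs are those commonly attributed to Cicchetti (below 0.40 poor, 0.40-0.59 fair, 0.60-0.74 good, 0.75 and above excellent), which match the labels applied to the two estimates above.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per fracture, one column per rater.
    Assumption: this ICC form is chosen for illustration only.
    """
    n = len(ratings)      # number of rated subjects (fractures)
    k = len(ratings[0])   # number of raters
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)                 # between-subject mean square
    ms_cols = ss_cols / (k - 1)                 # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def cicchetti_category(icc):
    """Descriptive category using commonly cited Cicchetti cut-offs."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Illustrative ratings: six subjects rated by four raters (not study data).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(example), 2))
print(cicchetti_category(0.52), cicchetti_category(0.62))
```

Absolute agreement penalises systematic offsets between raters, so a rater who consistently assigns a higher OF type lowers this ICC even when the ranking of fractures is identical.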
The intraobserver reliability was substantial when only the radiography and MRI scans were evaluated (overall estimate of weighted Cohen's kappa = 0.64 [0.57, 0.71], Fig. 2). It was also substantial (overall estimate of weighted Cohen's kappa = 0.74 [0.67, 0.80], Fig. 3) and slightly better when the radiography, MRI, and CT scans were assessed together than when only the radiography and MRI scans were evaluated, although the difference was not significant. When the raters had complete image material available, they rated it with a high agreement after 2 months. When they referred only to the conventional radiography and MRI scans, the assessment after 2 months was noticeably less congruent in four of the six raters. Only OA3 (weighted Cohen's kappa = 0.60 and 0.59, respectively) and AA3 (weighted Cohen's kappa = 0.65 and 0.63, respectively) showed almost identical kappa values (Figs. 2 and 3). The values from OA1 (weighted Cohen's kappa = 0.65 and 0.76, respectively), OA2 (weighted Cohen's kappa = 0.70 and 0.80, respectively), AA1 (weighted Cohen's kappa = 0.62 and 0.76, respectively), and AA2 (weighted Cohen's kappa = 0.54 and 0.71, respectively) improved noticeably when CT was added, with all four values then falling in the substantial range.
In the clinical routine before and during evaluation, the raters noticed a differentiation problem between OF2 and OF3 and partly between OF3 and OF4. We used a heatmap (Fig. 4) to display and evaluate the deviations between the two time points. In the best case, all values should lie on the main diagonal, and deviations from it illustrate the differentiation problem. Five of the six raters (AA1, AA2, AA3, OA2, and OA3) showed a clear deviation with respect to OF2 and OF3. OA2 classified eight fractures as OF2 at the first evaluation and OF3 after 2 months. AA1 classified 16 fractures as OF2 at the first evaluation and OF3 at the second evaluation. Otherwise, both raters showed very small deviations from the main axis.
There was also a differentiation problem between OF3 and OF4 in four of the six raters (AA2, AA3, OA1, and OA3). For OA1, there were overall minor deviations from the main axis; however, three cases were classified as OF3 at the first evaluation and OF4 at the second evaluation and four cases as OF4 at the first evaluation and OF3 at the second evaluation. Generally, a somewhat larger spread was noticeable among the AAs than among the OAs.
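The counts behind such a heatmap can be obtained by cross-tabulating each rater's first and second classification. The sketch below is a hypothetical illustration with invented ratings, not the study data; the function names are assumptions, OF types are coded 1 to 5, rows correspond to time point 1, and columns correspond to time point 2.

```python
def cross_tabulate(first, second, n_types=5):
    """Count how often type i at time point 1 became type j at time point 2."""
    table = [[0] * n_types for _ in range(n_types)]
    for a, b in zip(first, second):
        table[a - 1][b - 1] += 1
    return table


def off_diagonal_shifts(table):
    """Return {(type_at_t1, type_at_t2): count} for all reclassified fractures."""
    return {
        (i + 1, j + 1): count
        for i, row in enumerate(table)
        for j, count in enumerate(row)
        if i != j and count > 0
    }


# Invented example: one rater, five fractures, two time points.
time_point_1 = [2, 2, 3, 3, 4]
time_point_2 = [2, 3, 3, 4, 4]
table = cross_tabulate(time_point_1, time_point_2)
print(off_diagonal_shifts(table))  # {(2, 3): 1, (3, 4): 1}
```

Entries on the main diagonal are fractures classified identically at both time points; an entry such as (2, 3) counts fractures rated OF2 first and OF3 later, which is the pattern described for OA2 and AA1.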

Discussion
Osteoporotic VFs represent a distinct entity compared to clearly traumatic fractures and require independent treatment modalities. A classification-based approach is a general trauma surgery standard. However, the AO classification according to the study by Magerl et al. [18] and the new AOSpine classification of spinal injuries presented by Vaccaro et al. [1] in 2013 are at best of limited use for both the classification and derivation of therapeutic measures for OFs. Various working groups have developed their own classifications for osteoporotic VFs to be better able to derive necessary therapeutic measures based on the classification [4,19-23].
The OF classification of the Spine Section of the DGOU, which was evaluated herein, considers the specific morphological and radiological features of trauma in the ageing spine [24]. The additionally developed OF score links the classification to a treatment recommendation and supports basic decision-making between nonoperative and operative therapies [6]. In the score, 2 points (OF1) to 10 points (OF5) are assigned depending on the OF type. A valid classification of morphology is therefore relevant to any therapy decision based on the scoring scheme. If the OF score is used to determine the indication for surgery, the procedure will differ depending on the morphology [9]. If there are several fractures to be treated, the score of the OF classification should be applied to each fracture to obtain an initial basis for further therapeutic approaches [9].
Fig. 2 Intraobserver reliability: radiography + MRI. Quantification of the intraobserver reliability: radiography + MRI. The intraobserver reliability was found to be substantial when the radiography and MRI scans were evaluated (overall weighted Cohen's kappa = 0.64 [0.57, 0.71]). Six raters (OA = attending, AA = resident) evaluated osteoporotic VFs (n = 54). Forest plot using weighted Cohen's kappa as the estimator with 95% confidence intervals. MRI, magnetic resonance imaging; VF, vertebral body fracture
The Osteoporotic Fracture Working Group of the Spine Section of the DGOU determined a substantial agreement for the interrater reliability of the OF classification (Fleiss' kappa = 0.63, no CI given; 146 fractures, 6 raters) [10]. In our study conducted among residents and specialists at a level-one trauma centre, the interobserver reliability according to the criteria by Cicchetti was also good (ICC = 0.62 [0.51, 0.72]; 54 fractures; 6 raters). Information on the intraobserver reliability of the OF classification has not been found in the relevant literature to date. Herein, we found a substantial agreement (weighted Cohen's kappa = 0.74 [0.67, 0.80]). As women are mainly affected by VF, we conducted this study in women. This inclusion criterion represents a limitation of the study since the OF classification is also applicable to men.
Thereafter, we also evaluated whether the combination of conventional radiography and MRI provided sufficient information to assign the fracture to one of the five types in the OF classification. This was performed purely descriptively based on the ICC and Cohen's kappa owing to the lack of assumptions on the non-inferiority boundary. The OF classification showed good and substantial inter- and intraobserver reliabilities, respectively, when projection radiography, CT, and MRI were used. The reliability was better when these three modalities were used than when radiography and MRI were used alone, although the difference did not appear relevant, since both estimates for weighted kappa fall into the same category, that is, substantial reliability. A disadvantage of CT is that painful microfractures and bone marrow oedema cannot be detected. Furthermore, additional examinations always involve higher costs in the daily routine in clinics. This is in contrast to MRI, which also helps in the basic differentiation between acute and chronic changes. MRI can provide additional information regarding the integrity of disco-ligamentous structures. Furthermore, it is necessary for the classification of OF1. Thus, conventional radiography and CT alone are not sufficient for the assessment.
Fig. 3 Intraobserver reliability: radiography + MRI + CT. Quantification of the intraobserver reliability: radiography + MRI + CT. The intraobserver reliability was found to be substantial when the radiography, MRI, and CT scans were evaluated (overall weighted Cohen's kappa = 0.74 [0.67, 0.80]). Six raters (OA = attending, AA = resident) evaluated osteoporotic VFs (n = 54). Forest plot using weighted Cohen's kappa as the estimator with 95% confidence intervals. CT, computed tomography; MRI, magnetic resonance imaging; VF, vertebral body fracture
Although no relevant differences in reliability were seen in our study after radiography and MRI alone compared to radiography, MRI, and CT, we still recommend consistent adherence to the standard diagnostic requirements recommended by the Spine Section of the DGOU (conventional radiography, MRI, and CT) for the overall assessment [8,9]. CT ultimately allows a more accurate assessment of morphology than does MRI and is also described by raters as subjectively helpful.
During evaluation and in clinical practice, a differentiation problem between the individual OF types became apparent. In particular, the raters found it difficult to differentiate between OF2 and OF3. A certain problem between OF3 and OF4 could also be observed. If the classification is used more widely, it may be necessary for it to be readjusted by the Osteoporotic Fracture Working Group of the Spine Section of the DGOU. The OF classification is currently being applied in orthopaedic trauma surgery centres but has yet to be proven feasible for use in daily practice. With appropriate consistency in its application, including the OF score, reproducible treatment recommendations can be derived.
Fig. 4 Representation of the differentiation of the OF types. Differentiation abnormalities between OF2 and OF3 and between OF3 and OF4. Six raters (OA = senior physician, AA = resident) evaluated osteoporotic VFs (n = 54) at two time points. X-axis time point 1, y-axis time point 2, each depicting OF1-OF5. Heatmap using intraobserver data. Higher colour intensity corresponds to higher counts. OF, osteoporotic fracture; VF, vertebral body fracture
Authors' contributions MS and SP conceptualised the study, collected and interpreted the clinical data, and wrote the manuscript. AS-Z designed this work. SP revised the manuscript critically for important intellectual content. MS and VL collected the data and revised the manuscript. RD confirmed the osteoporosis diagnoses. RO and DA conducted the statistical analyses. All authors have read and approved the final version of the manuscript.
Funding Open Access funding enabled and organized by Projekt DEAL. No funding was received in support of this study. The authors have no relevant financial activities outside the submitted work.

Availability of data and materials
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Code availability Not applicable.

Consent for publication: Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.