Abbreviations

PACS: Picture archiving and communication system (e.g. Agfa Impax)

REDCap: Research Electronic Data Capture program

Background

Imaging has been used in the diagnostic workup of people seeking care for low back pain (LBP) for more than a hundred years. Consequently, conventional radiographs, computerized tomography (CT) and magnetic resonance imaging (MRI) are accepted, if not expected, by many patients and doctors [1,2,3,4]. For any diagnostic procedure, including imaging, to be useful, it must first demonstrate adequate reliability [5, 6].

Most people understand and appreciate the meaning of agreement and disagreement, and we all deal with the consequences of both every day. In health care this is certainly the case and can have grave consequences for doctors and patients, since the quality of care and procedures delivered in health care systems around the world depends on this seemingly simple notion. Patients are increasingly aware that different doctors hold different opinions regarding their health issues. This variability may be due to variations in nomenclature [7, 8], and it is critical for radiologists and other interpreters of diagnostic imaging, such as chiropractors, to reduce interpreter variability [6]. Inter-rater reliability is an important parameter to measure, as data sets should reflect variation in the study participants and not variation in the raters. For valid results, it is important that raters are updated and trained in the use of standardized protocols prior to the evaluation of MRI findings.

Diagnostic classification systems for MRI have been proposed to address interpreter variability. A systematic review compared the reliability of different nomenclature and grading systems for lumbar disc herniation and nerve root compression and reported κ = 0.39–0.81 [9], quite a range in reliability despite being limited to two degenerative conditions: disc herniation and nerve root compromise. It is uncertain what variability and reliability raters would produce if evaluating multiple degenerative pathologies simultaneously. Previous reliability studies of MRI findings of the lumbar spine have mainly reported on single findings or a specific grading scale of lumbar spine degenerative pathology, such as disc herniation [10, 11], spinal stenosis [7, 12] and end-plate changes [13]. However, a few studies have examined a handful of spinal degenerative pathologies simultaneously using various classification systems or scales [5, 8, 14, 15].

To our knowledge, few reliability studies of lumbar spine MRI findings have included both raters of different professions and multiple degenerative findings. One study compared medical radiologists, chiropractic radiologists and chiropractors [16]. Another investigated the reliability between a radiologist, a chiropractor and a second-year rheumatology resident in classifying degenerative MRI findings of the cervical spine [17]. Thus, there is a need for studies investigating reliability for a wider range of spinal degenerative pathologies, based on standardized formats and involving more than one profession in MRI readings in radiology departments.

Objectives

The objective of this study was to determine the inter-rater reliability of the evaluation of degenerative findings in lumbar spine MRI.

Methods

Design

Fully crossed inter-rater reliability study.

The Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were followed in this paper [18].

Sample size calculation

In a test for agreement between two raters using the kappa statistic, a sample size of 51 subjects achieves 80% power to detect a true kappa value of 0.70 in a test of H0: κ = κ0 vs. H1: κ ≠ κ0, when there are six categories with frequencies equal to 0.48, 0.28, 0.20, 0.03, 0.01 and 0.00. This power calculation is based on a significance level of 0.05. Furthermore, we assumed the three disc levels per participant to be independent, leaving us with 177 observations [19].
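As a rough illustration of this kind of calculation, the following minimal Python sketch estimates the power by simulation. It is not the calculation used for the study: the null value κ0 (set here to a hypothetical 0.50, which is not stated in the text) and the simple large-sample standard error are our assumptions, so the result will only approximate the reported power.

```python
import numpy as np

rng = np.random.default_rng(0)

# Category frequencies taken from the power calculation in the text
p = np.array([0.48, 0.28, 0.20, 0.03, 0.01, 0.00])
kappa_true = 0.70   # alternative hypothesis (from the text)
kappa_null = 0.50   # hypothetical null value kappa_0 (assumption, not stated)
n, n_sim, z_crit = 51, 5000, 1.96  # subjects, simulations, two-sided 5% cutoff

def cohen_kappa(a, b, k):
    # Observed agreement and chance agreement from the two raters' marginals
    po = np.mean(a == b)
    pa = np.bincount(a, minlength=k) / len(a)
    pb = np.bincount(b, minlength=k) / len(b)
    pe = np.sum(pa * pb)
    return po, pe, (po - pe) / (1 - pe)

rejections = 0
for _ in range(n_sim):
    # Mixture model: rater B copies rater A with probability kappa_true,
    # otherwise draws independently from the same marginal, which yields
    # a true kappa equal to kappa_true.
    a = rng.choice(6, size=n, p=p)
    copy = rng.random(n) < kappa_true
    b = np.where(copy, a, rng.choice(6, size=n, p=p))
    po, pe, k_hat = cohen_kappa(a, b, 6)
    # Crude large-sample standard error of kappa (approximation)
    se = np.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
    if abs(k_hat - kappa_null) / se > z_crit:
        rejections += 1

print(f"simulated power ≈ {rejections / n_sim:.2f}")
```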

Study population and reliability sample

Fifty-nine lumbar spine MRIs from people with or without LBP, who were enrolled in a cross-sectional study, were included from February 26th, 2018 to April 26th, 2018.

People with LBP, or LBP and leg pain, were invited to participate when scheduled for an MRI at the hospital, ordered by their primary care physician. The inclusion/exclusion criteria for people with LBP were: 1) referred to the Department of Radiology, Silkeborg Regional Hospital, Denmark for MRI from primary care (general practitioners or chiropractors) with LBP or LBP and leg pain; 2) symptoms for > 4 weeks; 3) 18–60 years of age; 4) not part of a referral pathway to spinal surgery or another secondary care activity; 5) no suspicion of serious pathology, i.e. cancer, infection or inflammatory arthritis; 6) able to stand for at least 20 min; 7) able to read and write Danish.

The same criteria were used for people without LBP (no-LBP), with the exception of criteria 1) and 2). People in the no-LBP group were mainly recruited from a local school of nursing near the hospital, among hospital employees, and through announcements in workplaces in Silkeborg Municipality, Denmark. Recruitment was carried out by posters and by personal communication.

Informed consent was provided by all participants via REDCap (Research Electronic Data Capture) installed on iPads. Recruitment of all participants was consecutive, on a first-come, first-served basis. A total of 242 people were initially included in the study population. Participants were assigned to subgroups based on self-reported symptoms (or absence of symptoms) in their baseline questionnaires. Six people were excluded because they were over 60 years of age, and six because of technical problems with their baseline questionnaires or failure to complete the MRI procedures. The remaining 230 individuals defined our main study population (see Fig. 1), and the first 59 individuals, giving a fair representation of participants with LBP, LBP and leg pain, and no LBP, defined the reliability study sample.

Fig. 1

Flowchart of selection of main study population and the reliability study sample

MRI-procedures

All patients were scanned in either a Siemens Avanto 1.5 T or a Siemens Skyra 3 T MRI unit, and all no-LBP individuals were scanned in an open MRI unit (Paramed OpenMR 0.5 T). The MRI sequences for all three scanners were sagittal T1 and T2, plus axial T2 at the three lowest lumbar levels. For the 1.5 T and 3 T systems, the sagittal T2-weighted sequence also included T2 fatsat (Dixon).

Raters, training and consensus

The three raters were: Rater 1, a consultant medical radiologist with 30 years of experience in musculoskeletal MRI; Rater 2, a chiropractor and PhD student with 28 years of clinical and radiography experience and 4 years of MRI experience, including over 1000 supervised reports of lumbar MRI in the same radiology department; and Rater 3, a chiropractor and senior researcher with 12 years of clinical research and MRI experience from radiology departments. All raters had, to varying degrees, experience with the reliability of diagnostic classification models as well as clinical experience with spinal diagnostic imaging [13, 17, 20, 21].

To ensure consensus regarding the understanding of the diagnostic classifications, an evaluation manual was prepared based on the existing literature [5, 22,23,24,25,26,27,28,29] (see Additional files 1 and 2). For training purposes, and to identify practical issues in the evaluation process, all three raters independently analyzed and classified 10 MRIs based on the manual. The raters then met to compare ratings and discuss adjustments to the assessment and coding process. The evaluation manual was adjusted, and a second set of 5 MRIs was rated, compared and discussed before the final version of the manual was prepared.

MRI evaluation and classification of findings

The three raters independently evaluated 177 disc levels (the three lowest lumbar levels, L3/L4–L5/S1) for the presence of the following eight degenerative pathologies: spondylolisthesis; scoliosis; annular fissure; disc degeneration; disc contour; nerve root compromise; spinal stenosis; and facet joint degeneration. The imaging findings were classified based on methods described in the literature (see Table 1).

Table 1 Classification of MRI findings

The raters retrieved images in PACS (Picture Archiving and Communication System: Agfa Impax, version 5.2) and filled in the standardized research evaluation form in REDCap. All images were assessed and analyzed on diagnostic Agfa Impax workstations with high-resolution diagnostic monitors (Totoku Monochrome MS33I2 pair, 3 MP; Barco MDNC-2121 color pair, 2 MP; and Barco MDNC-2121 monochrome pair, 2 MP). The raters were blinded to clinical information and previous reports of findings.

Data management and statistical analysis

Data analysis was carried out in Stata ver. 15.1 (StataCorp LLC, 4905 Lakeway Drive, College Station, Texas 77845, USA) and AgreeStat 2015.1 for Excel Windows/Mac (Advanced Analytics, LLC, PO Box 2696, Gaithersburg, MD 20886-2696, USA).

Inter-rater reliability, based on the three lower disc levels of 59 persons (177 levels), was determined for binomial, nominal and ordinal data (see Table 1) by calculating percent agreement and chance-corrected agreement coefficients (Cohen's/Conger's κ, and Gwet's AC1 (unweighted) and AC2 (weighted)) for rater pairs and for the three raters overall [33], reported with 95% confidence intervals.

Gwet's agreement coefficients, AC1 and AC2, were used to address the κ paradoxes [34] and have been shown to be more stable and paradox-resistant than Cohen's κ and other coefficients [33, 35]. The κ paradoxes at very low or very high prevalences are well described in the literature [36, 37]. The first paradox occurs when the percent chance agreement (pe) is large, so that the chance correction converts a relatively high observed agreement (po) into a relatively low κ. The second paradox occurs when unbalanced marginal totals produce higher values of κ than more balanced totals. To deal with these paradoxes, we used Gwet's agreement coefficients (AC1 and AC2). To compare our results with the previous literature, we chose to also present Cohen's/Conger's κ as well as benchmarking against the Landis and Koch scale [38].

The probabilistic benchmarking method suggested by Gwet benchmarks the absolute agreement and chance-corrected agreement coefficients by the cumulative probability (in our case exceeding 95%) of a coefficient falling into one of the following intervals: < 0.00 "Poor"; 0.01–0.20 "Slight"; 0.21–0.40 "Fair"; 0.41–0.60 "Moderate"; 0.61–0.80 "Substantial"; and 0.81–1.00 "Almost perfect" [39]. This method allows a direct and more precise comparison of different agreement coefficients and their representation on the Landis and Koch scale (or any other scale used). Substantial reliability (0.61–0.80) was defined as acceptable for the purposes of this study, and confidence intervals were presented at the 95% level. We recommend that interested readers follow the link in the reference list for more insight into Gwet's agreement coefficients, their comparison with other coefficients, and probabilistic benchmarking [40]. Key characteristics of the study target population and the study sample are presented for age, gender, symptoms and duration (see Table 2).
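To make the κ paradox concrete, the following minimal Python sketch (our own illustration, not the AgreeStat implementation used for the analysis) computes percent agreement, Cohen's κ and Gwet's AC1 for two raters on hypothetical, highly skewed binary ratings:

```python
import numpy as np

def agreement_coefficients(a, b, n_cat):
    """Percent agreement, Cohen's kappa and Gwet's AC1 for two raters."""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                           # observed agreement
    pa = np.bincount(a, minlength=n_cat) / len(a)  # marginal proportions, rater A
    pb = np.bincount(b, minlength=n_cat) / len(b)  # marginal proportions, rater B
    pe_kappa = np.sum(pa * pb)                     # chance agreement (Cohen)
    pi = (pa + pb) / 2
    pe_ac1 = np.sum(pi * (1 - pi)) / (n_cat - 1)   # chance agreement (Gwet)
    return po, (po - pe_kappa) / (1 - pe_kappa), (po - pe_ac1) / (1 - pe_ac1)

# Hypothetical, highly skewed binary ratings: 96 of 100 levels rated negative
# by both raters, and the four positive calls all discordant between them.
a = np.array([0] * 98 + [1] * 2)
b = np.array([0] * 96 + [1] * 2 + [0] * 2)
po, kappa, ac1 = agreement_coefficients(a, b, 2)
print(f"po = {po:.2f}, Cohen's kappa = {kappa:.2f}, Gwet's AC1 = {ac1:.2f}")
```

With 96% observed agreement, Cohen's κ comes out at roughly −0.02 while AC1 is roughly 0.96, showing how a skewed prevalence can collapse κ despite near-perfect agreement.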
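The probabilistic benchmarking can likewise be sketched. Assuming a normal approximation for the coefficient (the exact AgreeStat computation may differ), a label is assigned by accumulating interval membership probabilities from the top of the Landis and Koch scale until the cumulative probability exceeds 95%:

```python
from scipy.stats import norm

# Landis and Koch intervals, ordered from best to worst: (lower bound, label)
BENCHMARKS = [(0.81, "Almost perfect"), (0.61, "Substantial"),
              (0.41, "Moderate"), (0.21, "Fair"),
              (0.01, "Slight"), (float("-inf"), "Poor")]

def probabilistic_benchmark(coef, se, threshold=0.95):
    """Best label whose cumulative membership probability >= threshold."""
    cumulative, upper = 0.0, float("inf")
    for lower, label in BENCHMARKS:
        # probability mass of the true coefficient falling in [lower, upper)
        cumulative += norm.cdf(upper, coef, se) - norm.cdf(lower, coef, se)
        if cumulative >= threshold:
            return label
        upper = lower
    return "Poor"

# A hypothetical AC1 of 0.78 with standard error 0.04: the point estimate is
# "Substantial", and over 95% of the probability mass lies at 0.61 or above,
# so the probabilistic method confirms the "Substantial" label.
print(probabilistic_benchmark(0.78, 0.04))  # -> Substantial
```

This is why a coefficient with a high point estimate but a wide confidence interval can be benchmarked one level lower than the point estimate alone would suggest.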

Table 2 Characteristics of the study target population and study sample for reliability

Results

The mean age was 42 years in the target population and 38 years in the reliability sample. Women represented 51% and 46% of the study target population and the reliability sample, respectively. Patients with LBP or LBP and leg pain accounted for 73% and 59%, respectively, all with symptoms for more than 4 weeks. No-LBP persons accounted for 27% and 41%, respectively.

The prevalence of positive findings for all raters can be found in the contingency tables in the additional files. Generally, rater 1 had a lower average prevalence of diagnostic findings (0.26) than rater 2 (0.40) and rater 3 (0.44), with the total prevalence ranging from 0.05 to 0.80 for the individual MRI findings.

Results for reliability at disc level are presented in Table 3. The overall reliability (raters 1, 2 and 3) for the eight spinal degenerative pathologies ranged from moderate to almost perfect. The overall reliability for individual pathologies was almost perfect for spondylolisthesis, spinal stenosis, scoliosis and annular fissure; substantial for nerve root compromise and disc degeneration; and moderate for facet joint degeneration and disc contour (the prevalence of MRI findings at disc level can be found in Additional files 1 and 2 for ordinal and nominal scales, respectively).

Table 3 Inter-rater reliability coefficients and percent agreement with probabilistic benchmarking to the Landis and Koch scale in classification of MRI-findings at disc level

For the three rater pairs individually, reliability ranged from moderate to almost perfect. For disc contour, there was a difference of one benchmark level between rater pairs, from moderate to substantial. There was no difference in benchmark level between rater pairs for the remaining seven pathologies.

Reliability among all three raters was moderate for facet joint degeneration and disc contour. Reliability for rater pairs 1–3 and 2–3 was also moderate for disc contour.

There was almost perfect reliability, with very little variability between rater pairs, for spondylolisthesis, spinal stenosis, scoliosis and annular fissure, and substantial reliability for disc degeneration and nerve root compromise. Variability between rater pairs was highest for disc contour but low for the other MRI findings.

Discussion

Inter-rater reliability for the three rater pairs was overall acceptable for six of eight degenerative MRI findings of the lumbar spine. In addition, our results indicate that experienced chiropractors can achieve the same level of reliability as medical radiologists in MRI interpretation of spinal degenerative pathologies. Thus, these classifications of findings are sufficiently comprehensible to be applied by experienced health care professionals and can be used for both quality assurance and research purposes.

Prior studies have investigated the reliability of identifying degenerative MRI findings of the lumbar spine [38,39,40,41], but few are directly comparable to our study because most investigated only one spinal degenerative pathology. Zoete et al. compared experienced medical and chiropractic radiologists reviewing MRIs for lumbar spinal degenerative pathology [16]. The findings were dichotomized into "specific findings" or "no specific findings", and higher reliability was found with more experienced raters; specialists, regardless of professional background, obtained the best results (moderate reliability). Moll et al. investigated the reliability between a radiologist, a chiropractor and a second-year rheumatology resident in classifying degenerative MRI findings of the cervical spine and found overall substantial inter-rater reliability (κ ≥ 0.61) [17]. In our study, variability between the three raters was very low, and only one of eight pathologies showed a difference in reliability among raters of one benchmark level, indicating that experienced chiropractors and medical radiologists can achieve acceptable reliability in MRI interpretation, even when evaluating a range of spinal degenerative pathologies in the lumbar spine.

The study by Carrino et al. is one of few that has examined inter-rater agreement across several different spinal pathologies (spondylolisthesis, disc degeneration, end-plate changes, annular fissure and facet degeneration), including 111 cases [5]. Kappa values were generally modest, ranging overall from κ = 0.43 to 0.66 (CI 0.27–0.70), with only disc degeneration reaching acceptable reliability (κ > 0.60). The average overall kappa among raters was κ = 0.53.

Another similar inter-rater agreement study by Fu et al., involving 75 cases, included 10 degenerative spinal pathologies of the lumbar spine and reported absolute agreement and Fleiss' κ. The κ values were modest, with considerable variability across degenerative conditions, ranging from 0.28 to 0.62 (CI 0.27–0.64) [8]. Excluding transitional vertebrae (κ = 0.62), none of the remaining nine degenerative conditions reached acceptable reliability (κ < 0.60). The overall average kappa among all four raters was κ = 0.43.

Absolute agreement and agreement coefficients in our study were similar to or higher than those of Carrino et al., and generally higher than those of Fu et al. We achieved an average overall κ = 0.59 among all three raters. Both studies had interdisciplinary representation of raters and provided training and evaluation manuals, but they also based their reliability measures on the less robust kappa statistic, so their results might have been better if the method proposed by Gwet had been used.

Methodological considerations

When a reliability study is based on a sample smaller than the study population, information may be lost. All persons in this study were consecutively selected and had the same chance of inclusion in the inter-rater analysis. There was fair resemblance between the study population and the study sample, helping to reduce sampling error.

Only a few studies have examined agreement across several different spinal pathologies in the lumbar spine, generally with modest levels of reliability and high variability across pathologies (using Fleiss' κ and other coefficients). We achieved acceptable reliability for most degenerative pathologies, with Gwet's AC ranging from 0.31 to 0.99.

In our study, the raters were not selected randomly, and only three specific raters were part of the analysis. Standard errors and confidence intervals were therefore based on fixed raters; consequently, the inference generalizes, and measures precision, with respect to the universe of study persons only (given our inclusion criteria) and not the universe of raters, meaning that validity is limited to the specific group of raters who participated in the reliability experiment. This study indicates that even when raters have training materials and evaluation manuals available and robust statistics are used, it is challenging to reach acceptable reliability for all degenerative pathologies.

We included no-LBP persons in our reliability sample so that it would resemble the study population characteristics. This exposed us to the kappa paradoxes and made it difficult to compare our results directly to studies in which no-LBP persons were not included. More importantly, it challenged us to test the performance of several agreement coefficients in search of more robust alternatives to Cohen's κ and Conger's extension of it; Gwet's AC1 and AC2 proved to be a good alternative. All no-LBP persons underwent supine MRI in a new 0.5 T open MRI unit. This made it possible for the raters to identify no-LBP persons, since the image quality was lower and certain sequences were specific to the 0.5 T unit (for the 1.5 T and 3 T systems, the sagittal T2-weighted sequence also included T2 fatsat (Dixon)). This is a possible source of bias, as raters may have rated fewer findings in this group.

In this study, the pair-wise comparisons revealed that the experienced chiropractors generally achieved the same levels of reliability as the medical radiologist. It is uncertain whether the lower average prevalence of positive findings reported by the radiologist reflects underestimation by the radiologist or overestimation by the chiropractors. The chiropractors reported similar prevalences of positive findings, possibly because they had collected the evidence and authored most of the evaluation guide. All raters had previous experience with reliability studies, and a high level of agreement among raters on negative findings helped maintain acceptable reliability for most diagnostic findings in this study.

Clinical and research implications

Reliability is of fundamental importance for the quality of health care: raters and doctors must be able to replicate and agree on their findings and conclusions [41]. In all health care disciplines, doctors, researchers and others work, at some level, with the concept of agreement and strive systematically to investigate health care procedures for their reliability and validity. Inter-rater reliability is an important parameter to measure, as data sets should reflect the study participants and not the raters. For valid results, it is important that raters are updated and trained in the use of standardized protocols prior to the evaluation of MRI findings; in this study, two raters reported similar prevalences of positive findings, most likely due to greater familiarity with the evaluation manual. More interdisciplinary reviews are needed to establish internationally recognized standards for degenerative spinal pathologies.

Conclusion and recommendations

Inter-rater reliability for the three rater pairs was overall acceptable for six of eight degenerative MRI findings of the lumbar spine. The two chiropractors in the study achieved levels of reliability similar to the medical radiologist in MRI interpretation of spinal degenerative pathologies. For most degenerative pathologies, the classifications of findings are sufficiently comprehensible to be applied by health care professionals and can be used for both quality assurance and further research. A few adjustments to the rating protocol will be required to bring all pathologies to an acceptable level of reliability.