The Composite Severity Score for Lumbar Spine MRI: a Metric of Cumulative Degenerative Disease Predicts Time Spent on Interpretation and Reporting

Conventional measures of radiologist efficiency, such as the relative value unit, fail to account for variations in the complexity and difficulty of a given study. For lumbar spine MRI (LMRI), an ideal performance metric should account for the global severity of lumbar degenerative disease (LSDD) which may influence reporting time (RT), thereby affecting clinical productivity. This study aims to derive a global LSDD metric and estimate its effect on RT. A 10-year archive of LMRI reports comprising 13,388 exams was reviewed. Objective reporting timestamps were used to calculate RT. A natural language processing (NLP) tool was used to extract radiologist-assigned stenosis severity using a 6-point scale (0 = “normal” to 5 = “severe”) at each lumbar level. The composite severity score (CSS) was calculated as the sum of each of 18 stenosis grades. The predictive values of CSS, sex, age, radiologist identity, and referring service on RT were examined with multiple regression models. The NLP tool accurately classified LSDD in 94.8% of cases in a validation set. The CSS increased with patient age and differed between men and women. In a univariable model, CSS was a significant predictor of mean RT (R2 = 0.38, p < 0.001) and independent predictor of mean RT (p < 0.001) controlling for patient sex, patient age, service location, and interpreting radiologist. The predictive strength of CSS was stronger for the low CSS range (CSS = 0–25, R2 = 0.83, p < 0.001) compared to higher CSS values (CSS > 25, R2 = 0.15, p = 0.05). Individual radiologist study volume was negatively correlated with mean RT (Pearson’s R =  − 0.35, p < 0.001). The composite severity score predicts radiologist reporting efficiency in LMRI, providing a quantitative measure of case complexity which may be useful for workflow planning and performance evaluation. Supplementary Information The online version contains supplementary material available at 10.1007/s10278-021-00462-1.


Introduction
Efficient interpretation and reporting are central principles of delivering value in diagnostic radiology [1]. Common performance metrics such as the relative value unit (RVU) are used to quantify the value of physician work as a function of time [2]. However, the RVU does not account for variability in procedural complexity which may significantly alter the duration of a medical procedure [3]. This discrepancy can generate systemic bias and misaligned incentives for physicians and healthcare organizations [4]. In diagnostic radiology, we hypothesize that studies with more complex pathology would require more cognitive energy, resulting in longer reporting time (RT) that is not captured by the RVU [5][6][7] .
To address this question, we studied degenerative disease of the lumbar spine (LSDD) on MRI (LMRI). Compared with other imaging reports, LMRI reporting is typically structured by anatomic level and each level is assigned a grade of relative severity of stenosis [8]. Prior work has shown that inter-radiologist agreement for spinal canal stenosis (SCS) and neural foraminal stenosis (NFS) is moderate to strong [9][10][11][12][13]. Moreover, the descriptive terminology for LSDD is standardized by an interdisciplinary consortium of national societies allowing comparison between radiologists with different training and levels of experience [14]. These features enable analysis of reporting text by natural language processing (NLP) algorithms, which allow a large corpus of text to be analyzed rapidly.
We hypothesized that a measurable relationship exists between disease severity (complexity) and RT owing to increased cognitive energy required to render a diagnosis and assign a severity value [15,16]. The primary aim of this study was to quantify the relationship of radiologist-assigned LSDD severity and reporting time, leveraging NLP to analyze a large volume of reports from many individual radiologists. To this end, we present and validate an NLP tool and derive a composite "severity score" reflecting the cumulative severity of LSDD at 18 discrete sites in the lumbar spine. We then evaluate the predictive value of this composite severity score on objective reporting times.

Data Selection
We performed an institutional review board-approved review of LMRI reporting data in compliance with the Health Insurance Portability and Accountability Act for which informed consent was waived. The 10-year archive of report text and reporting metadata comprised 43,255 LMRI studies from a single, multi-site institution (BLINDED FOR REVIEW), representing 29 different MR scanners. Studies which were co-interpreted with trainee assistance (n = 23,607) or reported on the weekend (n = 3202) were excluded because the workflow for these studies was presumably different. We also excluded studies in which the calculated RT was over 60 min (n = 3058) because this was felt to be beyond the conventional RT for an attending radiologist without significant interruption (exclusion criteria summarized in supplemental Fig. 1) [17]. Reports were signed by 152 unique attending radiologists including generalists, fellowshiptrained musculoskeletal radiologists, and neuroradiologists. Patient age, sex, and the requesting service (documented as "inpatient", "outpatient", "emergency", or "unknown") were recorded. Reporting documentation timestamps were extracted from the institution's reporting software API (Powerscribe, Nuance Communications, Burlington VT, USA). The timestamps marked the initialization and finalization of the radiology report to the 1/100 of a second; the difference between these time points was recorded as the RT (min).

Natural Language Processing
We applied a customized NLP algorithm to raw radiology report text. Using a rule-based approach employing a dictionary of customized regular expressions (RegEx), the algorithm was designed to extract a severity rating using a 6 point scale (0 = "normal", 1 = "mild", 2 = "mild to moderate", 3 = "moderate", 4 = "moderate to severe", and 5 = "severe"). A severity score was extracted for spinal canal stenosis (SCS) and left and right neural foraminal stenosis (LNFS, RNFS) for each of six spinal levels: T12-L1 through L5-S1. The NLP was designed using empirically, iteratively developed syntactic and semantic rules including common radiology terminology and phraseology to elicit a severity score (0-5) for the 18 locations (e.g., T12-L1 SCS, T12-L1 LNFS), resulting in a 6 × 3 matrix for each study (Fig. 1). When the model failed to assign a score, a default value of 0 ("normal") was applied. To test the accuracy of the model, we randomly selected 100 studies out of the full dataset (n = 43,255) and manually reviewed the radiology reporting text to assess for discrepancy or error. For each case, the reporting text was manually reviewed and assigned a 0-5 value by a radiologist. Scores were considered concordant if NLP and manual review matched exactly and any degree of discordance was considered unsuccessful.

Calculation of Severity Scores
The composite severity score (CSS) was calculated as the sum of the constituent severity scores (n = 18) resulting in scale ranging from 0 ("normal" at all locations) to a theoretical maximum of 90 ("severe" at all 18 locations).
The distribution of CSS was assessed for the study population (n = 13,388) and subdivided by age and sex. Age groups were defined as < 40, 40-49, 50-59, 60-69, 70-79, and ≥ 80. The sex-based distributions and age group distributions were compared to exponential theoretical distributions using quantile-quantile plots. The Kolmogorov-Smirnov (K-S) test was used to test difference in distribution between subgroups.

Quantifying Relationships Among CSS, Age, Sex, and RT
CSS and RT were grouped by sex and age and inter-group differences were calculated using ANOVA and pairwise t tests with the Benjamini-Hochberg method correcting for multiple comparisons. CSS were clustered into deciles and compared to RT using pairwise t tests. The results are visualized in box-and-whisker diagrams which show the median (central bar), interquartile range (box margins).
Simple, univariable linear least squares regression was used to assess the predictive relationship of CSS, age, sex, service location, and interpreting radiologist on RT. The distribution of residuals indicated heteroscedasticity (Supplemental Fig. 2); therefore, the regression was performed using log-log transformation. Significant predictors were then grouped, and a multiple least squares regression was performed to assess for independent effects of each variable on RT. For each test, p value < 0.05 was considered significant. A curvilinear regression was then fitted to model predictive value of CSS on mean RT. This was repeated for subgroups of low-CSS (0-25) and high-CSS (> 25) groups. Unless otherwise stated, values are reported with ± standard error of the mean (SEM). All statistical analysis and data visualizations were performed using R statistical computing software (v 3.6.2, The R Corporation).

Study Characteristics
From the initial dataset of 43,255 studies, we included 13,388 after applying exclusion criteria (Supplemental Fig. 1) representing 12,326 unique patients. The mean patient age was 54.8 ± 0.1 years and the sex distribution was 54.5% women. The mean RT for the study population was 14.41 ± 0.1 min (median = 10.7), with no significant difference in mean RT by patient sex (14.56 ± 0.15 min for men vs. 14.29 ± 0.14 min for women, p = 0.17). These differences are summarized in Table 1.

Natural Language Processing
The NLP accuracy at the level of CSS was 94.8% (93 misclassifications out of 1800 test values in random sample of 100 cases). The NLP was 100% accurate in 5/18 level instances (27.8%) and was least accurate at right L5-S1, correctly classifying the severity in 86% of cases. Two examples of NLP analysis of archived radiology report text (A). Using regular expression methods and the structured or semi-structured nature of LMRI spine reporting, the NLP algorithm extracted severity scores for the lumbar spine, resulting in a 6 × 3 matrix of values ranging 0-5 (normal-severe). The composite severity score is the sum of these values

Relationship of CSS, Age, and Sex
Mean CSS for the full study population was 6.04 ± 0.07. The relationship of CSS and RT to patient age is shown in Fig. 2A and showed a moderate positive correlation with Pearson correlation coefficient R = 0.44 (p < 0.001) and a weak positive correlation between age and RT (Pearson coefficient R = 0.16, p < 0.001). The distribution of CSS was exponential and differed by different age group with increasing flattening of the distribution at higher age groups, indicating a shift toward higher CSS (Fig. 2B). The distribution of CSS differed between men and women (K-S test, p < 0.001, Fig. 3A). Mean CSS was higher for men than women overall (7.0 ± 0.11 vs. 5.23 ± 0.09, p < 0.001) and the relative difference in proportion by sex is shown for each CSS value in Fig. 3B. The relationships of age and sex to CSS are summarized in Table 1.
We next examined the CSS in 10 groups, based on relative sample size distribution deciles. The relationship of CSS groups and RT is shown in Fig. 5. There were significant stepwise increases between groups (ANOVA, p < 0.001) and pairwise analysis showed a significant, stepwise for each sequential group except the transition from CSS 21-30 to CSS 31-40 (p = 0.74) and CSS 31-40 to CSS 41-50 (p = 0.13) ( Table 2).
Because the effect size of individual radiologist identity was large, the mean RT as a function of mean CSS was plotted for individual radiologists, excluding the 5% upper and lower tail outliers (Fig. 6). There was a significant positive relationship between mean CSS and RT by radiologist (Pearson's R = 0.26, p = 0.002). There was also a significant negative correlation between the radiologist's study volume (number of studies interpreted by each radiologist over the 10-year period) and mean RT (Fig. 7A, Pearson's R = − 0.35, p < 0.001). Pairwise t tests between radiologist volume quintiles showed significant decrease in mean RT between the 3rd and 4th quintiles (Fig. 7B) (25.02 ± 5.13 min vs. 20.76 ± 5.30 min, p = 0.02) and the 4th and 5th quintiles (20.76 ± 5.30 min vs. 15.00 ± 3.67 min, p = 0.004). There were no significant pairwise differences in the 1st-3rd quintiles (p > 0.05). The correlation between

Discussion
The CSS is a tool for quantifying radiologic complexity in lumbar spine MRI and is a significant, independent predictor of interpretation and reporting time (RT). The relationship of CSS and RT supports the anecdotal observation that severe LSDD is more challenging to interpret than normal anatomy, measured in terms of time spent generating a final radiology report. The CSS may be useful as a 'correction factor', allowing studies to be weighted based on overall complexity. The CSS could then be applied to quality improvement by defining efficiency benchmarks, ensuring equitable distribution of work, and improving existing measures of radiologist productivity (i.e., RVU).
The principle finding of this study is that the CSS is a reasonable predictor of mean RT (R 2 = 38%). Notably, the CSS performed better at low-range CSS than high-range CSS (R 2 = 83% vs. 15%), suggesting a de-coupling of this Fig. 5 The relationship of CSS group to RT shown as box-andwhisker plots. The box delineates the median and interquartile range. Individual CSS-RT relationship is superimposed as a color-matched dot within each CSS group. Significant pairwise increase in RT is indicated by *p < 0.05, **p < 0.01, ***p < 0.001 relationship for LMRI studies with large burden of LSDD. In multiple regression, CSS was relatively less important than the identity of the interpreting radiologist. Taken together, these findings indicate that LSDD severity is only one of many factors which influence real-time radiologist efficiency. Nonetheless, we can extract meaningful benchmarks from this large dataset which estimates an increase in median RT of 50% for LMRI studies with a CSS of ~ 12 relative to a normal study (CSS = 0). The Current Procedural Terminology (CPT) Code for non-contrast LRMI (72,148) in the Washington, DC area, is assigned a global RVU of 6.32, of which 1.48 RVU is allocated for physician work (estimated payment ranging from $220.87 to $273.47) [4,18,19]. LMRI can therefore account for a considerable portion of revenue for both academic and private groups. Within a radiology practice, a CSS-weighted RVU could result in more equitable recognition of work, accounting for procedural complexity, a process which has been used to improve performance and quality in pediatrics and surgery [20,21]. Moreover, the CSS was a stronger relative predictor of RT than conventional demographic information (sex, age) and referring service (inpatient, outpatient, emergency). The CSS also remained a robust predictor despite significant variation between radiologists with different training backgrounds and experience level. Differences observed in CSS by sex and age group are largely consistent with prior cohort studies of LSDD in cohort populations, supporting the feasibility of the model [22][23][24][25]. The distribution of CSS by sex indicates a higher prevalence of normal-low CSS for women and conversely, higher prevalence of high CSS for men, consistent with prior work showing sex differences in lumbar spine degeneration patterns, attributed to occupational and hormonal differences, gestation, among others [26]. In our population, the difference in CSS was significant (7.00 vs. 5.23 for men and women, respectively, p < 0.001), and the distribution of CSS also differed by sex. However, despite these anatomic differences, there was no sex difference in RT. It is unclear if this asymmetry is due more to sex-specific differences in natural history and disease prevalence, other unaccounted Vertical and horizontal lines indicate median LMRI volume (by radiologist) and median RT, respectively. Between-group differences for quintiles are shown as box and whisker plots B; between-group differences are indicated by *p < 0.05, ***p < 0.001 sex differences in anatomy, or potentially due to implicit bias. While the CSS is not sufficient to directly inform surgical decision-making, the utility of CSS in predicting surgical candidacy as well as its correlation with functional status could be evaluated in future work.
Lastly, while CSS was an important predictor of RT, the effect size attributable to different radiologists was tenfold higher in the multiple regression model (43% vs. 4.3%). In a population of 152 radiologists, significant variation in training, experience, and clinical role could all contribute to this effect. Still, the relationship of CSS and RT showed a significant correlation across radiologists despite substantial differences in individual practice patterns (R = 0.26, p = 0.002). There was a highly significant negative relationship between cumulative study volume and RT indicating that radiologists who read more LMRI over the 10-year period tended to report studies more efficiently. This effect was driven largely by the top quintile of radiologists, who read a disproportionately high number of studies. This may reflect experience; radiologists with higher cumulative volume have likely been in practice longer and may need less time to interpret LMRI; alternatively, radiologists who interpret studies quickly (regardless of experience) are likely to have higher volume overall. The impact of CSS on individual reader practice could be studied in future, prospective work.
The major strength of this study is the large and heterogenous dataset from which the input and outcome variables are derived. By modeling the LSDD pattern of over 12,000 unique patients, the CSS reflects large sample distribution which is likely representative of the general population. The CSS also incorporates the interpretation styles and practice behaviors of over 150 radiologists. The use of timestampderived performance metrics is also a strength because these values represent "real-world" practice and more likely to capture realistic practice patterns than an artificial experimental setting.
The principal limitations of this study lie in the assumptions used to build our measurement variable (CSS) and outcome variable (RT). Our CSS model does not account for NLP model error, which, although uncommon, could bias the CSS from ground truth report text. Moreover, the NLP tool was validated on the complete dataset rather than the study population (n = 13,388). It is plausible that the exclusion criteria for this study may introduce additional unmeasured bias in NLP accuracy. Further, we opted to use reporting text rather than the underlying radiographic images that were not referenced as a standard. Instead, our CSS presumes high fidelity translation of disease severity in the report and does not account for inter-reader variability. Nonetheless, the simplicity of the CSS provides an intuitive understanding of the interpretation process. Lastly, the generalizability of our results outside of this single academic institution is uncertain. Significant variations in reporting style, case mix, and patient demographics all likely contribute to variance in RT in other organizations and may not translate to private practice workflows. Extrinsic factors such as time of day, day of week, and more granular clinical factors such as chief complaint are all likely important predictors of RT not included in this model. Future work should strive to integrate intrinsic (i.e., disease severity, sex, age) and extrinsic (i.e., time, date, scanner) factors to predict RT toward a more comprehensive and equitable model of radiologist efficiency.

Conclusion
The CSS is an NLP-based method for analyzing lumbar MRI reports which allows quantitative characterization of degenerative disease severity in a large population of imaged patients. The CSS summates the cumulative severity of 18 lumbar spine components providing a global marker of LMRI study complexity which is a significant and independent predictor of radiologist reporting time. The CSS may improve existing quality metrics by allowing one to weight metrics such as the RVU by study complexity.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.