Reporting frequency of radiology findings increases after introducing visual rating scales in the primary care diagnostic work up of subjective and mild cognitive impairment

Objectives Study the effect of introducing a template for radiological reporting of non-enhanced computed tomography (NECT) in the primary care diagnostic work up of cognitive impairment using visual rating scales (VRS).
Methods Radiology reports were assessed regarding compliance with a contextual report template and the reporting of the parameters medial temporal lobe atrophy (MTA), white matter changes (WMC), global cortical atrophy (GCA), and width of lateral ventricles (WLV) using established VRS in two age-matched groups examined with NECT before (n = 111) and after (n = 125) the introduction of contextual reporting at our department. True positive rate (TPR) and true negative rate (TNR) before and after were compared.
Results We observed a significant increase in the percentage of radiology reports mentioning MTA from 29 to 76% (p < 0.001), WMC from 69 to 86% (p < 0.01), and GCA from 54 to 82% (p < 0.001). We observed a significant increase in the percentage of reports where all of the parameters were mentioned, from 6 to 29% (p < 0.001). There was a significant increase in TPR from 10 to 55% for MTA.
Conclusion This study suggests that contextual radiological assessment using VRS could increase the reporting frequency of radiology findings in the diagnostic work up of cognitive impairment but compliance with templates may be difficult to endorse.
Key Points
• Introducing visual rating scales in clinical practice increases the reporting frequency of MTA, WMC, and GCA in the diagnostic work up of subjective and mild cognitive impairment.
• Introducing visual rating scales has an effect on the true positive rate of reported MTA.
• Compliance with contextual radiology templates remains low when use of the template is not enforced by the department leadership.


Introduction
The work of the radiologist includes interpretation of images and communicating relevant findings through the radiology report. Traditionally, radiology reports have been free narratives with variations in structure and quality [1,2]. The form of the radiology report has been debated and structured reporting has been suggested as a method to improve quality; however, consensus regarding form and style has not been reached [3-7].
Recently, contextual reporting has been suggested as an intermediate between structured reporting and free narrative reporting [8]. Contextual reports are structured in a disease-specific way with findings reported in a checklist manner, but they are less strict than structured reports. Another method to potentially improve quality is the use of established visual rating scales (VRS). An example from the field of neuroradiology is the VRS developed for the investigation of cognitive impairment, which are endorsed in clinical practice [9-11].
Previous studies have shown that structural findings are underreported in the diagnostic work up of cognitive impairment [12-14]. A recent European survey showed that VRS were used in 75% of responding centers but structured reporting was used in only 28% [15]. To improve the accuracy and clarity of radiology reports, our department introduced contextual reports as an endorsed routine. The purpose of this study is to investigate the effect on reporting of structural radiological findings after introducing a template with VRS in the primary care diagnostic work up of cognitive impairment.

Materials
This is a retrospective, observational, single-center study. Eligible subjects, aged 60 to 80 years, were retrospectively recruited into two age-matched groups with exams performed before and after the introduction of contextual reporting. Non-enhanced computed tomography (NECT) is the preferred modality in our country due to greater availability [16]. Only referrals issued by general physicians as part of a primary care diagnostic work up of cognitive impairment (the routine in our country) were eligible for inclusion.
We searched our picture archiving and communication system (PACS) for referrals containing the words "dementia" and/or "memory" together with the word "investigation" to identify eligible subjects under primary care investigation for subjective or mild cognitive impairment. Since our purpose was to study the effect of introducing VRS in the primary diagnostic work up, subjects whose referrals mentioned known dementia or psychiatric disorder were not eligible for inclusion.
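The keyword rule described above can be sketched as a simple text filter. This is only an illustration under stated assumptions: the function name `is_eligible`, the English keywords, and the exclusion phrases are hypothetical. The text does not say how exclusions were applied, and the actual PACS query syntax is not given.

```python
import re

# Assumed rule, paraphrasing the search described in the text:
# ("dementia" OR "memory") AND "investigation".
INCLUDE_ANY = ("dementia", "memory")
INCLUDE_ALL = ("investigation",)
# Hypothetical phrasings for the exclusion criteria (known dementia or
# psychiatric disorder); real referrals may word these differently.
EXCLUDE_PATTERNS = (r"known dementia", r"psychiatric disorder")


def is_eligible(referral_text: str) -> bool:
    """Return True if a referral matches the keyword rule and no exclusion."""
    text = referral_text.lower()
    has_topic = any(word in text for word in INCLUDE_ANY)
    has_context = all(word in text for word in INCLUDE_ALL)
    excluded = any(re.search(pattern, text) for pattern in EXCLUDE_PATTERNS)
    return has_topic and has_context and not excluded


if __name__ == "__main__":
    examples = [
        "Memory complaints, investigation requested",
        "Known dementia, memory investigation",
        "Headache investigation",
    ]
    for referral in examples:
        print(referral, "->", is_eligible(referral))
```

A filter like this only narrows the candidate list; each referral would still need to be read to confirm eligibility.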
The group "before" was retrospectively recruited from the Swedish BioFINDER study mild cognitive impairment (MCI) cohort (see https://biofinder.se/ for more detail) where eligible subjects had undergone a routine NECT available in our PACS from January 2010 to December 2014. Some subjects (n = 68) in this group have been included in a previous study but all results are new for this study [14]. The group "after" was retrospectively recruited from our PACS from January 2016 to December 2017. To mimic the clinical situation, we only had access to the clinical information given in the referrals.
Our endorsed template (see Fig. 1) states that medial temporal lobe atrophy (MTA), white matter changes (WMC), global cortical atrophy (GCA), and width of lateral ventricles (WLV) must be reported. The use of established VRS is endorsed [11]. All reports must end with an "Impression" where findings should be interpreted and probable diagnosis listed. For this study, any additional findings mentioned in the reports were not included in the evaluation. Full compliance was defined as reports including all evaluated parameters and an "Impression."
Fig. 1 Contextual reporting template for the investigation of cognitive impairment. Free text narrative style can be used but evaluated parameters should always be mentioned regardless of whether they are normal or not. For full compliance, every report must include assessment of medial temporal lobe atrophy (MTA), global cortical atrophy (GCA), white matter changes (WMC), and width of lateral ventricles (WLV) and end with a separate "Impression"
All NECT exams were performed according to our clinical routine with helical scan mode using Z-axis dose modulation on scanners from three different vendors with a tube voltage of 120 kV, exposure from 150 to 320 mAs, collimation from 0.5 to 0.75, and pitch factor from 0.36 to 0.65. Image quality was considered equal between scanners. All readings were done in our PACS IDS7® (Sectra AB) with a window center of 40 HU and window width of 80 HU.

Assessment of clinical reports
All reports were reassessed and graded according to a scale by Torisson et al [12] and applied in accordance with a previous study [14] with respect to quantitative (e.g., "mild," "severe") and qualitative (e.g., "widened," "enlarged") descriptions of the evaluated parameters. Examples of how reports were graded are given in Table 1. Reports were graded as NA = not mentioned, 0 = normal (corresponds to MTA 0, GCA 0, and WMC 0), 1 = mild or reported but not quantified (corresponds to MTA 1-2, GCA 1, and WMC 1), 2 = moderate (corresponds to MTA 3, GCA 2, and WMC 2), and 3 = severe (corresponds to MTA 4, GCA 3, and WMC 3) [12]. For MTA, GCA, and WMC, a grade of ≥ 2 was considered abnormal. For WLV, a grade of ≥ 1 was considered abnormal. The gradings were compared with the second reading for estimation of true positive rate (TPR) and true negative rate (TNR). For this, we assumed that all reports where evaluated parameters had not been mentioned were assessed as normal. Since visual ratings are subjective, an additional rating was performed after 4 weeks for estimation of intra-rater agreement. Additionally, a second rater (D.v.W.) performed an assessment of GCA, MTA, and WMC on 100 randomly selected subjects (50 from each group) for estimation of inter-rater agreement and the rating with the highest inter-rater agreement was chosen as standard for the second reading.
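The dichotomization and the TPR/TNR estimate described above can be sketched as follows. This is a minimal illustration, not the authors' workflow: the function names and the example data are hypothetical. The thresholds and the rule that unmentioned parameters count as normal are taken from the text, and the reference (second reading) is assumed to be already dichotomized, including any age correction.

```python
from typing import Optional, Sequence, Tuple

# Lowest report grade considered abnormal, as stated in the text.
ABNORMAL_FROM = {"MTA": 2, "GCA": 2, "WMC": 2, "WLV": 1}


def report_is_abnormal(parameter: str, grade: Optional[int]) -> bool:
    """Dichotomize a report grade; None stands for NA (not mentioned)."""
    if grade is None:
        return False  # study assumption: not mentioned is treated as normal
    return grade >= ABNORMAL_FROM[parameter]


def tpr_tnr(
    parameter: str,
    report_grades: Sequence[Optional[int]],
    reference_abnormal: Sequence[bool],
) -> Tuple[float, float]:
    """Return (TPR, TNR) of the reports against the dichotomized reference."""
    tp = fn = tn = fp = 0
    for grade, truth in zip(report_grades, reference_abnormal):
        called = report_is_abnormal(parameter, grade)
        if truth and called:
            tp += 1
        elif truth:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


if __name__ == "__main__":
    # Hypothetical data for five reports (not study data).
    grades = [None, 1, 2, 3, 0]
    reference = [True, False, True, True, False]
    print(tpr_tnr("MTA", grades, reference))
```

In this made-up example the unmentioned but truly abnormal case counts as a false negative, which is how the "not mentioned means normal" assumption lowers TPR.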

Statistics
Descriptive statistics (percentages) were estimated to summarize results of the gradings and the second reading. True positive rate and TNR were calculated using the MEDCALC® (MedCalc Software Ltd.) online statistics calculator (available at https://www.medcalc.org/calc/diagnostic_test.php), and 95% confidence intervals (95% CI) were visually compared for statistical significance. Differences between groups were compared using Pearson chi-square analysis and the Mann-Whitney U test where applicable. For estimation of intra- and inter-rater agreement, Cohen's κ was estimated for dichotomized data. The level of agreement was defined according to Landis and Koch [23]. Calculations were done using SPSS® version 26 (IBM Corporation). A p < 0.05 was considered statistically significant.
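As an illustration of the agreement analysis, Cohen's κ for dichotomized ratings and the Landis and Koch labels can be computed as in the sketch below. The study used SPSS; the function names and the example ratings here are hypothetical and are not study data.

```python
from typing import Sequence


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa for two raters with dichotomized (0/1) ratings."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("ratings must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n  # proportion rated abnormal by rater A
    p_b = sum(rater_b) / n  # proportion rated abnormal by rater B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def landis_koch(kappa: float) -> str:
    """Map kappa to the agreement label of Landis and Koch."""
    if kappa < 0:
        return "poor"
    for upper, label in (
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ):
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Hypothetical ratings (1 = abnormal, 0 = normal) for ten subjects.
    first = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
    second = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
    kappa = cohen_kappa(first, second)
    print(round(kappa, 2), landis_koch(kappa))
```

The same function applies to both intra-rater agreement (two readings by one rater) and inter-rater agreement (one reading each by two raters).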

Results
We identified 251 eligible subjects. Of these, ten had cancelled examinations, four had failed to perform the exam, and the referral was unclear for one. These 15 subjects were excluded, and 236 subjects were included, with 111 examined "before" and 125 "after" the introduction of contextual reporting. There were no significant differences between the groups with respect to prevalence of abnormal findings and gender. Subjective cognitive impairment was the reported symptom in 97% of all subjects (see Table 2 for demographic data). Evans' index was only reported for one subject and was included in WLV.
Data on grading of clinical reports and concordance with our second reading are summarized in Table 3. In total, MTA was reported in 54% of the reports. The corresponding number for WMC, GCA, and WLV was 78%, 69%, and 59% respectively. Where MTA was reported as moderate to severe (i.e., Torisson's scale grades 2-3), 88% was correctly reported as abnormal compared with the second reading. Medial temporal lobe atrophy was reported as mild (i.e., grade 1), in 18% of the reports. The corresponding number in the second reading was 45%; when age correction was applied, 36% was normal and 9% abnormal of which 0% (n = 0) was correctly reported as abnormal. Where WMC was reported as moderate to severe, 83% was correctly reported as abnormal compared with the second reading. White matter changes were reported as mild in 31% of the reports; the corresponding number in the second reading was 18%; when age correction was applied, 17% was normal and 1% abnormal of which 0% (n = 0) was correctly reported as abnormal. Where GCA was reported as moderate to severe, 47% was correctly reported as abnormal, and for WLV, the figure was 38% compared with the second reading.
Data on frequencies and compliance, including differences between the groups, are summarized in Table 4. Reporting of MTA, WMC, and GCA increased significantly. There was no significant change in the reporting of WLV. Altogether, the percentage of reports with all parameters mentioned increased from 6% (n = 7) to 29% (n = 36). Full compliance remained low; the percentage of reports in strict full compliance with the template increased from 2% (n = 2) to 8% (n = 10).
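The comparison of proportions can be checked with a Pearson chi-square test on a 2 × 2 table. The sketch below uses the counts given above for reports mentioning all parameters (7 of 111 "before", 36 of 125 "after"). It is an illustration only: the function name is hypothetical, and whether the original SPSS analysis applied a continuity correction is not stated, so none is assumed here.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square without continuity correction for [[a, b], [c, d]].

    Returns (chi2, p) with one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the upper tail of the chi-square
    # distribution equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


if __name__ == "__main__":
    # Rows: "before", "after"; columns: all parameters mentioned, not all.
    chi2, p = chi_square_2x2(7, 111 - 7, 36, 125 - 36)
    print(f"chi2 = {chi2:.2f}, p = {p:.2g}")
```

With these counts the statistic is close to 20 and the p value falls below 0.001, in line with the significance level reported in the abstract.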
Results regarding TPR and TNR are summarized in Table 5. A significant increase in TPR was observed for MTA with an increase from 10 to 55%. There were no significant changes in TPR for the other parameters but an increase from 0 to 33% was observed for GCA and an increase from 37 to 58% was observed for WLV. There was high to almost perfect TNR with no significant changes for any parameter in the two groups.

Discussion
In this retrospective, observational study, we evaluated compliance and compared TPR of radiology reports before and after the introduction of contextual reporting in the diagnostic work up of cognitive impairment. We found an increase in the reporting of MTA, GCA, and WMC and an increased TPR for MTA. Although an increase in the reporting of evaluated parameters was observed, full compliance with the template remained low (8%) and the percentage of reports where all parameters were mentioned only reached 29%. Due to small numbers, it is difficult to draw any definitive conclusions from the reports with full compliance or mentioning of all parameters, which is why we chose to evaluate each parameter separately.
We had anticipated that full compliance would reach at least 50%, so our results may seem disappointing. In a study by Powell et al, 9% compliance was observed when a structured template for assessing maxillofacial trauma was evaluated [24]. This figure is close to our result but differences in methodology make further comparisons difficult. Another study by Larson et al showed that structured reporting could successfully be implemented if enforced by the department leadership [4]. Since visual ratings are subjective, it has also been suggested that differences in structure of radiology reports may be explained by local traditions and, in the case of cognitive impairment, imaging has traditionally been used to exclude secondary causes of cognitive impairment [25,26]. Taking all of this into consideration, we believe our observed low compliance is similar to what has been previously reported and could be explained by adherence to local traditions. Also, the use of our template was not enforced by the department leadership.
Previous studies have shown that abnormal findings, in particular MTA, are underreported in radiology reports even when assessment is warranted [12,14]. Medial temporal lobe atrophy is an important structural finding in Alzheimer's disease (AD) but it can also be found in other dementias [27,28]. Our results showed an increase in reporting and TPR for MTA. Moderate to severe MTA was correctly reported as abnormal in 88% but mild atrophy was underreported; also, when age correction was applied, there remained an underreporting of abnormal mild atrophy (i.e., MTA 2 in subjects < 75 years). The underreporting of mild MTA probably explains the observed low TPR (10% "before" and 55% "after") for MTA and it cannot be excluded that a study population with a higher prevalence of moderate to severe MTA would have resulted in better TPR and compliance. In line with previous studies, excellent intra-rater and substantial inter-rater agreement was observed for MTA [14,29]. With respect to the clinical importance of MTA and previously observed underreporting, we believe our results regarding MTA have an important clinical impact [12,14]. The reporting of GCA increased significantly, although TPR remained low (33%). Where abnormal GCA was reported, it was erroneous in 53% which probably explains our observed low TPR. The GCA scale covers a larger brain region compared with the MTA scale. Although ratings are based on the highest grade of atrophy, the risk of potential underreporting cannot be eliminated since moderate parietal cortical atrophy and mild frontal cortical atrophy still could be interpreted as overall mild GCA (normal) by one rater and moderate GCA (abnormal) by another. This would probably also explain our observed levels of agreement.
Table 2 Data on subjects and prevalence of evaluated parameters for the two groups "before" and "after". Columns: "before", "after", p value.
All values are rounded to the nearest integer and represent percentages of the total (N = 236) study population. * Sum of MTA 1 (27%) and MTA 2 (18%) in the second reading. 1 Corrected for age where MTA 2 is abnormal if age < 75 years. 2 Corrected for age where WMC 1 is abnormal if age < 65 years. 5 Sum of "1" + "2" + "3," enlarged WLV dichotomized according to age-corrected cutoffs suggested by Brix et al [22]. The scale by Torisson et al [12]: NA = not mentioned in report, 0 = reported as normal, 1 = reported as mild or reported but not quantified, 2 = reported as moderate, 3 = reported as severe.
MTA medial temporal lobe atrophy, GCA global cortical atrophy, WMC white matter changes, WLV width of lateral ventricles
There was an increase in the reporting of WMC and WLV, where the increase for WMC was significant, but changes in TPR were not significant. In many reports, the phrase "normal appearing cerebrospinal fluid spaces" (CSF spaces) was used. This resulted in difficulties in our grading of the reports since this could mean normal width of sulci (i.e., normal GCA) and normal WLV combined. We chose to interpret this as normal WLV. White matter changes are preferably assessed on magnetic resonance imaging (MRI), but for abnormal findings, NECT is considered sufficient [9,11]. In other words, when WMC is reported on NECT, it is most likely to be Fazekas grade 2 or 3. Our results showed that moderate to severe WMC was reported in 20% of the reports compared with 22% in the second reading while mild changes were overreported. In most reports, the phrases "white matter changes" or "focal parenchymal changes" were used to describe WMC, which posed no difficulties in our grading, so we do not believe this explains our results. Radiologists have been shown to be keener to report WMC, which we believe is a more probable explanation for our results [14]. Our results would suggest that contextual reporting only had a significant effect on the reporting of MTA but the increased reporting of the other parameters would suggest that discrepancies in reporting styles were reduced to some extent. However, the low compliance with our contextual template and our assumption that reports with no mentioning of the evaluated parameters were normal may hamper such conclusions.
There are limitations to this study: (i) The use of two different cohorts could result in a potential bias from cohort effects. Differences in prevalence of abnormal findings were not significant between the groups but the observed prevalence was probably lower than would be expected in a memory clinic population. Our study population was derived from a population with cognitive impairment where none was diagnosed with dementia since we believe the clinical benefit […] [30,31]. It cannot be excluded that an older study population or a memory clinic population would have yielded a higher prevalence of abnormal findings where a different compliance with our template cannot be excluded. (ii) Visual ratings are subjective and quantitative data such as volume segmentation would be preferable but are difficult to perform with NECT. We chose a gold standard based on high inter-rater agreement but this approach does not exclude a potential rater bias. (iii) The retrospective design hampers the possibility to obtain reliable data on potential effects of training, education, or individual experiences of using VRS among the neuroradiologists at our department. (iv) We have not compared our template with other structured reporting templates and we have not followed up on how the use of VRS affects the final diagnosis. (v) The use of the scale suggested by Torisson et al can be questioned. This is an attempt to grade the qualitative data in the radiology reports to make comparisons possible but the scale has not been tested for rater reliability, although it has been used in previous studies [12,14].
Table 5 True positive rate and true negative rate for the evaluated parameters, expressed as percentages, of original reports in the two groups "before" (n = 111) and "after" (n = 125)
This study adds knowledge to how reporting frequency of radiology findings can be improved in the diagnostic work up of cognitive impairment.
Our results suggest that there is a possibility to increase the overall reporting of structural findings but only the results for MTA were significant. In conclusion, this study suggests that contextual radiological assessment using VRS could increase the reporting frequency of radiology findings in the diagnostic work up of cognitive impairment, but compliance with templates may be difficult to endorse.

Compliance with ethical standards
Guarantor The scientific guarantor of this publication is Danielle van Westen.

Conflict of interest
The authors of this manuscript declare relationships with the following companies: O.H. has acquired research support (for the institution) from Roche, GE Healthcare, Biogen, AVID Radiopharmaceuticals, Fujirebio, and Euroimmun. In the past 2 years, he has received consultancy/speaker fees (paid to the institution) from Lilly, Roche, and Fujirebio. The other authors declare that they have no conflicts of interest.
Statistics and biometry One of the authors has significant statistical expertise. No complex statistical methods were necessary for this paper.
Informed consent Written informed consent was waived by the Institutional Review Board.
Ethical approval Institutional Review Board approval was obtained.
Study subjects or cohorts overlap Some study subjects or cohorts have been previously included in Håkansson et al (2019) Neuroradiology 61:397-404.

Methodology
• retrospective
• observational
• performed at one institution

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.