Overview
A matched-pairs study design was used, and no patient data were included. A power analysis (detailed below) was conducted to plan the sample size, and a corresponding number of model radiology reports (n = 40) were created. Six participating radiologists used speech-recognition software to create dictations based on these model reports. The dictations (n = 480) were compared with the model reports, and errors were manually tallied and classified by type and severity. A statistical model was used to compare error rates for masked vs unmasked dictations. Before the study began, the Institutional Review Board (IRB) for our hospital system was consulted and determined that the project was exempt from full review because no patient data were included.
Power Analysis
To determine the total number of dictations that would be required of our study participants, we conducted a power analysis using G*Power v3.1.9.2 (Heinrich Heine Universität, Düsseldorf, Germany) [14]. The expected mean number of errors per report when dictating without a mask (1.6 ± 1.1) was estimated from the existing literature [10]. The mean number of dictation errors with a mask was hypothesized to be 20% greater (1.9 ± 1.1). With a matched-pairs study design, at most 211 dictations in each group would be required for 80% power at alpha = 0.05.
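As a minimal sketch of this calculation, assuming the standard deviation of the paired differences is approximated by combining the two reported standard deviations and using a normal approximation (the exact G*Power figure, based on the noncentral t distribution, may differ slightly):

```python
# Sketch of the matched-pairs sample size calculation (normal approximation).
# Assumption: SD of the paired differences taken as sqrt(1.1^2 + 1.1^2).
import math
from scipy.stats import norm

mean_diff = 1.9 - 1.6                  # hypothesized increase in errors per report
sd_diff = math.sqrt(1.1**2 + 1.1**2)   # assumed SD of paired differences
dz = mean_diff / sd_diff               # effect size (Cohen's d_z)

alpha, power = 0.05, 0.80
z_a = norm.ppf(1 - alpha / 2)          # two-sided critical value
z_b = norm.ppf(power)

n_pairs = ((z_a + z_b) / dz) ** 2
print(round(n_pairs))                  # ~211 dictations per group
```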
Generation of Model Reports and Dictations
Six radiologists agreed to participate as readers in the project: five attending diagnostic radiologists (four female and one male) each with at least 8 years of experience dictating, and one female diagnostic radiology resident in her fourth year of postgraduate training (PGY-4) with more than 2 years of experience dictating.
To meet the target number of dictations, a total of 40 model radiology reports were fabricated by the radiology resident with oversight from one of the faculty radiologists, then validated by a second faculty radiologist to ensure that the reports approximated the structure and complexity commonly generated during a workday at our tertiary care center. No patient data were used. Because the five participating attending radiologists were all within the division of abdominal imaging and could be expected to have dictation voice models highly tuned to terms and conditions found in abdominal imaging reports, the model reports were limited to examination types that would be reported by an abdominal imaging division. Reports were evenly balanced across modalities, with ten each of computed radiography/radiofluoroscopy (CR/RF) reports, ultrasound (US) reports, computed tomography (CT) reports, and magnetic resonance imaging (MR) reports. Departmental structured templates served as a foundation for these reports, to which features were added including dates and times; indications; factitious comparisons; common, uncommon, and incidental imaging findings; biplanar/multiplanar measurements; and image/series numbers as commonly dictated at our institution. A variety of benign and malignant conditions were included. A summary of study indications and an example report are included in Fig. 1. The total number of words in each model report was counted so that error rates per 1000 words could be evaluated.
Each radiologist was instructed to dictate the contents of the 40 model reports word-for-word twice: once while wearing a mask and once without a mask, for a total of 80 dictations per reader and 480 dictations overall. To control for bias from dictating the same reports twice, readers were randomized into two equal groups: one group dictated first masked and then unmasked, and the other dictated first unmasked and then masked. Masks were provided to each radiologist by the radiology department as personal protective equipment and consisted of a standard disposable surgical mask attached to the face via elastic ear loops. No N95 masks were used. When dictating with a mask, participants were instructed to wear the mask tight to the face, fully covering both the nose and mouth.
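A minimal sketch of this counterbalanced order assignment, with hypothetical reader labels:

```python
# Sketch of the counterbalanced order assignment (reader labels are hypothetical).
import random

readers = ["R1", "R2", "R3", "R4", "R5", "R6"]
random.shuffle(readers)

masked_first = readers[:3]      # dictate all 40 reports masked, then unmasked
unmasked_first = readers[3:]    # dictate all 40 reports unmasked, then masked
print(masked_first, unmasked_first)
```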
To standardize the process of dictation, requirements for reading radiologists included: de-novo dictation of all section headers, words, numbers, dates, and punctuation exactly as written in the model report; no proofreading of reports during or after dictation (excepting an obvious manual error or accidental garbling of words due to something other than the mask itself); dictation at a natural pace, tone, and volume; and dictation of all reports in the same physical location to minimize variation due to microphone, room noise, or other environmental factors. All reports were created using PowerScribe 360 v4.0-SP2 reporting software and a PowerMic III (Nuance Communications, Burlington, MA) and then copied directly from the reporting software into a separate text document. Radiologists used their own user accounts and associated personalized voice models, which had been attuned to their patterns of speech through daily use for more than 2 years in each case. To simulate a real-world setting more closely, the dictation wizard was not run at the beginning of each dictation session, as this is not commonly done on a day-to-day basis.
Dictation Coding
Using the comparison feature in Microsoft Word (Microsoft Corporation, Redmond, WA) to highlight differences, each dictation was compared side-by-side with its model report, and dictation errors were manually tallied and categorized by one attending radiologist and the participating PGY-4 resident. Categories of errors (outcome variables) included: incorrect words, missing words, additional words, missing or incorrect phrases (defined as three or more sequential words), incorrect terms of negation (e.g., errors in “no,” “not,” or “without”), sidedness errors, incorrect image numbers, incorrect measurements, incorrect dates/times, and punctuation errors.
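The comparisons themselves were performed with Word’s compare feature; purely as an illustrative analogue, a word-level difference listing of the kind reviewed here can also be produced programmatically:

```python
# Illustrative analogue of the side-by-side comparison (the study used Microsoft
# Word's compare feature): lists word-level insertions, deletions, and replacements.
from difflib import SequenceMatcher

def word_diffs(model_text: str, dictated_text: str):
    model_words = model_text.split()
    dictated_words = dictated_text.split()
    matcher = SequenceMatcher(None, model_words, dictated_words)
    return [
        (tag, model_words[i1:i2], dictated_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Hypothetical example: a dropped word of negation surfaces as a deletion.
print(word_diffs("There is no evidence of free air.",
                 "There is evidence of free air."))
```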
Every error was counted, with no limit on the number of errors coded per report. Incorrect-word, missing-word, and additional-word errors were subclassified as minor, moderate, or major based on a subjective assessment of their potential to result in a clinically significant misunderstanding for the ordering provider or a future radiologist. Missing or incorrect phrases of three or more words, errors in words of negation, sidedness errors, and incorrect measurements were all subclassified as major errors that could result in a clinically significant misunderstanding. Incorrect image numbers and incorrect dates/times were subclassified as moderate errors that might result in a misunderstanding.
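As a compact summary of the fixed severity assignments above (category keys are hypothetical labels; single-word errors were graded subjectively per error and so are not listed):

```python
# Fixed severity assignments described above; incorrect, missing, and additional
# single-word errors were instead graded per error and are not listed here.
SEVERITY_BY_CATEGORY = {
    "missing_or_incorrect_phrase": "major",   # three or more sequential words
    "negation_error": "major",                # e.g., "no", "not", "without"
    "sidedness_error": "major",
    "incorrect_measurement": "major",
    "incorrect_image_number": "moderate",
    "incorrect_date_or_time": "moderate",
}
```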
All 480 dictations were coded, including 240 in the masked group and 240 in the unmasked group. To validate the coding and address its inherent subjectivity, a selection of these dictations (20%; 96/480) was separately coded by a second attending radiologist and compared with the initial coding. The discrepancy rate was 6.3% (6/96).
Data Analysis
Error rates for each outcome were modeled as a function of the presence or absence of a mask, assuming a Poisson distribution with a log link; graphical evaluation showed no evidence of overdispersion. The number of words in each dictation report was included as an offset, and the model controlled for the nuisance parameter of randomization order. Mixed effects were included to account for radiologist-level correlation and for correlation within a study document. Predicted error rates per 1000 words were computed for the mask and no-mask groups and compared using a t-test. P-values for these comparisons were adjusted using the false discovery rate method to control the type I error rate [15].
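As an illustrative sketch only, and not the exact model used in the analysis, the fixed-effects portion of such a model can be fit as a Poisson regression with a log word-count offset; the data below are simulated placeholders, and the random effects for radiologist and report document are omitted:

```python
# Illustrative sketch of the fixed-effects portion of the analysis model: a
# Poisson regression with a log link and a log word-count offset, adjusted for
# randomization order. The published model additionally included random effects
# for radiologist and report document, which are omitted here. All data below
# are simulated placeholders, not study data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 480
df = pd.DataFrame({
    "mask": np.tile([0, 1], n // 2),        # 0 = unmasked, 1 = masked
    "order": rng.integers(0, 2, n),         # randomization order (nuisance term)
    "words": rng.integers(150, 400, n),     # words per dictation
})
df["errors"] = rng.poisson(df["words"] / 1000 * (5 + 3 * df["mask"]))

model = smf.glm(
    "errors ~ mask + order",
    data=df,
    family=sm.families.Poisson(),           # log link is the Poisson default
    offset=np.log(df["words"]),             # models the error rate per word
).fit()

# Predicted error rates per 1000 words, without and with a mask.
new = pd.DataFrame({"mask": [0, 1], "order": [0, 0]})
pred = model.get_prediction(new, offset=np.log([1000.0, 1000.0]))
print(pred.summary_frame())
```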
Sensitivity and Subgroup Analyses
On initial data review, one participant was found to have generated approximately four times as many errors as the others (1346 total errors vs. a mean of 308 for the remaining participants). This participant was notable for being the only trainee as well as the only participant with accented speech. Using the model above, predicted error rates per 1000 words were computed and compared for this individual vs. the other five readers for the “all errors,” “major errors,” “moderate errors,” and “minor errors” outcome variables. To reduce the potential for bias of the study outcomes toward this individual’s error patterns, a sensitivity analysis excluding this trainee was performed using the same model described above.
A separate subgroup analysis was conducted to evaluate whether modality was associated with the “all errors” outcome variable. The same model described above was implemented with the addition of a modality indicator variable. Predicted error counts were computed for each modality and compared using a t-test. P-values for these comparisons were adjusted using the false discovery rate method to control the type I error rate.
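For the multiplicity adjustment, assuming the Benjamini-Hochberg procedure was the false discovery rate method applied [15], a minimal sketch with placeholder p-values is:

```python
# Minimal sketch of a false discovery rate adjustment (Benjamini-Hochberg);
# the p-values below are placeholders, not study results.
from statsmodels.stats.multitest import multipletests

raw_pvals = [0.001, 0.02, 0.04, 0.20]   # placeholder per-comparison p-values
reject, adj_pvals, _, _ = multipletests(raw_pvals, alpha=0.05, method="fdr_bh")
print(adj_pvals, reject)
```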
Results are described as model-based error rates per 1000 words with standard errors and associated adjusted p-values [15]. Adjusted p-values < 0.05 were considered statistically significant.