Introduction

A radiology report is the primary means of communication from radiologists to referring physicians [1, 2]. Radiology reports are valuable for individual patient care [3, 4] as well as for quality improvement of healthcare systems [5, 6]. At an aggregated level, anonymized radiology reports can be used to assess diagnostic yield, evaluate guideline adherence, perform epidemiological research, and provide feedback to peers and referring clinicians [7,8,9,10,11]. These applications are not widely implemented, however, because the manual classification of free-text radiology reports is cumbersome [12].

Automated processing of text is the domain of natural language processing (NLP), which has an increasing role in healthcare [13]. In radiology, NLP has been applied to annotate texts or extract information in a variety of applications [14,15,16]. Natural language processing has evolved from handcrafted rule-based algorithms to machine learning-based and deep learning-based methods [17,18,19,20,21,22,23,24]. Deep learning is a subset of machine learning in which features are learned from the data itself through multilayer neural networks [25, 26].

In machine learning, variation in the size of the different classes in a dataset is called class imbalance. Together with dataset size, class imbalance potentially impacts results [27]. The impact of sample size and class imbalance is a recognized problem in machine learning in radiology but has not been fully explored [28, 29]. In healthcare, the equivalent of class imbalance is prevalence, defined as the total number of cases of a disease at a specific point in time or during a period of time. The impact of prevalence on model performance is also recognized for NLP in medical texts [30]. Because prevalence varies among diseases and populations, class imbalance is inherent to radiological datasets. Prevalence also determines how many cases of a particular type of pathology are available for analysis in a given population or timeframe. Therefore, before deep learning-based NLP can be applied in clinical practice, the impact of prevalence and other characteristics of the radiology report dataset on algorithm performance must be known. Different types of radiological examinations, different types of pathology, and different reporting styles among radiologists lead to variation in report length and linguistic complexity. Questions that arise in the context of radiology and NLP include the following: Does variation in prevalence or report complexity limit the application of NLP in radiology? What dataset size is recommended before applying NLP in radiology? Which algorithm is recommended? This study addresses these questions.

Objectives

1. Build a pipeline with four different algorithm types of deep learning NLP to assess the impact of dataset size and prevalence on model performance.
2. Test this pipeline on two datasets of radiology reports with low and high complexity.
3. Formulate a best practice for deep learning NLP in radiology concerning the optimal dataset size, prevalence, and model type.

Methods

Study design

In this retrospective study, we developed a pipeline in Python to perform experiments investigating the impact of data characteristics and NLP model type on binary text classification performance. The pipeline created subsets of data with variable size and prevalence and subsequently used these subsets to train and test four different model types. The code for the pipeline is available in the supplementary material, including a list of all packages used and their version numbers. To ensure the reproducibility of this research, we organized this paper according to the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) [31].

Data

Two anonymized datasets of radiology reports were retrieved from the picture archiving and communication system (PACS) of a general hospital (Treant Healthcare group, the Netherlands). The first dataset (Fracture-data) consisted of the reports (n = 2469) of all radiographs obtained between January 2018 and September 2019 that were requested by general practitioners during evening, night, and weekend shifts for patients with minor injuries to the extremities. The second dataset (Chest-data) consisted of the reports (n = 2255) of all chest radiographs (CR) and chest computed tomography (CT) studies from the first two weeks of March 2020 and the first two weeks of April 2020.

The datasets contained only the report text and annotations. No personal information about patients was included. The institutional review board confirmed that informed consent was not needed.

Ground truth

The annotation was performed in Excel. The Fracture-data was annotated by one radiologist for the presence or absence of a fracture or another type of pathology requiring referral to the emergency department. The annotations were checked for consistency by one of two other radiologists. Discrepancies (3%) were resolved by consensus. The Chest-data was annotated by a single radiologist for the presence or absence of pulmonary infiltrates. For both datasets, various Dutch words (or word combinations) were used to identify positive cases. The rationale for choosing radiologists as annotators was their experience in creating radiology reports and their extensive knowledge of the nuances used in radiology reporting.

Post-hoc intra-rater agreement was assessed on a random sample of 15% of both datasets more than one year after the initial annotation. This resulted in a Cohen’s kappa of 0.98 for the Fracture-data and 0.92 for the Chest-data. Appendix A1 provides examples of the annotation.

Data partitions

Both datasets were split into separate sets of positive and negative cases. All four sets were randomized and split into training (80%) and testing (20%). The positive and negative cases of the training sets were kept separate. For both the Fracture-data and Chest-data testing sets, the positive and negative testing cases were combined.

For the Fracture-data, the training set had 1976 cases (720 positives, 1256 negatives) and the testing set had 494 cases. For the Chest-data, the training set contained 1803 cases (283 positives, 1520 negatives) and the testing set included 452 cases.

The positive and negative cases were kept separate for use as sources for artificially constructed training sets with variable sizes and variable numbers of positive and negative cases. Based on the size of both the Fracture-data and the Chest-data, a list was created with all combinations of positive and negative cases in increments of 100, starting with 100 positive and 100 negative cases. For the positive cases of the Chest-data, besides 100 and 200, the maximum available number of 283 was also used. These lists were used during training to create temporary training sets of a specific size. Figure 1 demonstrates the data and processing workflow.
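
As a minimal illustration of this construction, the lists of (positive, negative) combinations could be built as follows; `training_combinations` is a hypothetical helper name, and the counts follow the numbers reported above and in Fig. 1.

```python
# Illustrative sketch (not the supplementary pipeline code): build the lists
# of (positives, negatives) combinations used for the temporary training sets.
def training_combinations(pos_counts, neg_counts):
    """All pairs of positive and negative case counts."""
    return [(p, n) for p in pos_counts for n in neg_counts]

# Fracture-data: 100-700 positives, 100-1200 negatives -> 7 x 12 = 84 sets
fracture_sets = training_combinations(range(100, 701, 100), range(100, 1201, 100))

# Chest-data: 100, 200, and the maximum of 283 positives, 100-1500 negatives -> 3 x 15 = 45 sets
chest_sets = training_combinations([100, 200, 283], range(100, 1501, 100))
```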

Fig. 1

Flowchart of data processing, training, and testing. + and – refer to cases of the positive and negative classes. The input for the variable training sets consists of all combinations of positive and negative cases with a step size of 100. For the Fracture-data, the positive cases ranged from 100 to 700 and the negative cases from 100 to 1200. For the Chest-data, the positive cases ranged from 100 to 283 and the negative cases from 100 to 1500

Models

The four models used in this study were a fully connected neural network (Dense), a bidirectional long short-term memory recurrent neural network (LSTM), a convolutional neural network (CNN), and a Bidirectional Encoder Representations from Transformers network (BERT) (Table 1).

Table 1 Model characteristics

The Dense, LSTM, and CNN models were created using the Keras framework on top of TensorFlow 2.1.

The first layer for these three models was an embedding layer. For the Dense network, this was followed by a layer that flattened the input matrix and four fully connected layers. The LSTM network consisted of two bidirectional LSTM layers followed by two fully connected layers. The CNN network consisted of a convolutional layer, an average pooling layer, a convolutional layer, a global average pooling layer, and two fully connected layers. An overview of the three networks is provided in Appendix A2.
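
The following minimal Keras sketch illustrates these three architectures; the vocabulary size, sequence length, embedding dimension, and layer widths are placeholder values, not the settings listed in Table 2 and Appendix A2.

```python
# Minimal sketch of the three non-transformer architectures (illustrative
# hyperparameters; the actual values are given in Table 2 and Appendix A2).
from tensorflow.keras import layers, models

VOCAB_SIZE, SEQ_LEN, EMB_DIM = 10000, 200, 64

def build_dense():
    # Embedding, flatten, four fully connected layers
    return models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMB_DIM, input_length=SEQ_LEN),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

def build_lstm():
    # Embedding, two bidirectional LSTM layers, two fully connected layers
    return models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMB_DIM, input_length=SEQ_LEN),
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(32)),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

def build_cnn():
    # Embedding, convolution, average pooling, convolution,
    # global average pooling, two fully connected layers
    return models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMB_DIM, input_length=SEQ_LEN),
        layers.Conv1D(64, 5, activation="relu"),
        layers.AveragePooling1D(pool_size=2),
        layers.Conv1D(64, 5, activation="relu"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
```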

The number of layers and the number of epochs were empirically determined on a single training set with the original distribution of positive and negative cases for both the Fracture-data and the Chest-data.

The BERT network was built using the Simple Transformers library in Python. BERT makes use of transfer learning, in which a model pre-trained on a large text corpus in a particular language can be fine-tuned for specific tasks [32]. In this project, the pre-trained Dutch language model 'wietsedv/bert-base-dutch-cased' from the Hugging Face repository was used [33, 34]. Table 2 presents the model hyperparameters and the hardware used.
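
A minimal fine-tuning sketch with Simple Transformers is shown below; the four epochs follow the Training section, while the remaining arguments and the toy example reports are illustrative stand-ins for the annotated datasets.

```python
# Sketch of fine-tuning the pre-trained Dutch BERT model with Simple
# Transformers; the two toy Dutch reports stand in for the study data.
import pandas as pd
from simpletransformers.classification import ClassificationModel

train_df = pd.DataFrame({
    "text": ["geen fractuur zichtbaar", "fractuur van de distale radius"],
    "labels": [0, 1],
})

model = ClassificationModel(
    "bert",
    "wietsedv/bert-base-dutch-cased",
    num_labels=2,
    args={"num_train_epochs": 4, "overwrite_output_dir": True},
    use_cuda=True,  # set to False when no GPU is available
)
model.train_model(train_df)
predictions, raw_outputs = model.predict(["geen afwijkingen aan het skelet"])
```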

Table 2 Model hyperparameters for the ANN/Dense, CNN, and LSTM models implemented with sequential layers in Keras (a) and for the BERT model implemented with Simple Transformers (b), and the hardware used for training (c)

Training

The training was performed in four steps using the following combinations of data and models:

1. Fracture-data, Dense/LSTM/CNN
2. Fracture-data, BERT
3. Chest-data, Dense/LSTM/CNN
4. Chest-data, BERT

For each step, the models were trained multiple times using the above-mentioned temporary training sets with different sizes and prevalences. The number of epochs was empirically determined by a test run, resulting in 12 epochs for the Dense/LSTM/CNN models and four epochs for the BERT model. For the Fracture-data, 84 experiments were performed for each model; for the Chest-data, 45.

Evaluation

The class imbalance of the training sets was expressed as the imbalance ratio, defined as the ratio of the majority class size to the minority class size.

Model performance was evaluated by assessing sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), area under the curve (AUC), and F score on the fixed holdout test sets from the Fracture-data (prevalence 0.36) and Chest-data (prevalence 0.16). No testing on an external dataset was performed.

The performance metrics were compared between models using t-tests. A value of p < 0.05 was considered statistically significant. Pearson correlation coefficients were calculated between training set size, training prevalence, and all performance metrics for the Fracture-data and Chest-data.
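
A minimal sketch of the per-experiment evaluation and the imbalance ratio is shown below; the metric definitions follow the text, the helper names are illustrative, and the example predictions are invented toy values.

```python
# Illustrative evaluation helpers (toy example values, not study data).
from sklearn.metrics import confusion_matrix, roc_auc_score, f1_score

def imbalance_ratio(n_pos, n_neg):
    """Size ratio of the majority class to the minority class."""
    return max(n_pos, n_neg) / min(n_pos, n_neg)

def evaluate(y_true, y_pred, y_score):
    """Binary classification metrics on a fixed holdout test set."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "auc": roc_auc_score(y_true, y_score),
        "f_score": f1_score(y_true, y_pred),
    }

print(imbalance_ratio(n_pos=283, n_neg=1520))  # ~5.4 for the full Chest-data training set
print(evaluate(y_true=[1, 0, 0, 1, 0],
               y_pred=[1, 0, 1, 1, 0],
               y_score=[0.9, 0.2, 0.6, 0.8, 0.1]))
```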

Results

Data

Figure 2 shows the distribution of report word count for both the Fracture-data and the Chest-data. The Chest-data is more complex because of the larger variation in report size and the lower prevalence of positive cases. The prevalence of the training sets ranges from 0.08 to 0.88 for the Fracture-data and from 0.06 to 0.74 for the Chest-data. The imbalance ratios of the Fracture-data and Chest-data training sets range from 7.3 to 11.5 and from 2.9 to 15.7, respectively.

Fig. 2

Stacked histogram demonstrating report size and binary distribution of (a) Fracture-data (1 = fracture present, 0 = fracture absent) and (b) Chest-data (1 = infiltrate present, 0 = infiltrate absent)

Model performance

Figures 3 and 4 show scatterplots of model performance metrics against training dataset size and prevalence, respectively. Model performance metrics on the test sets ranged from 0.56 to 1.00 for the Fracture-data and from 0.04 to 1.00 for the Chest-data. Table 3 presents the Pearson correlation coefficients between the performance metrics and training set size and prevalence, respectively. For both datasets, there is a strong negative correlation between prevalence and both specificity and positive predictive value. The positive correlations between prevalence and sensitivity and between prevalence and negative predictive value were strong in the Fracture-data and moderate to strong in the Chest-data. Size had a strong positive correlation only with specificity and PPV in the Chest-data.

Fig. 3

Scatterplot of model performance metrics (vertical axis) and training dataset size (horizontal axis) for (a) Fracture-data and (b) Chest-data. The size of the dots corresponds to the training dataset prevalence

Fig. 4

Scatterplot of model performance metrics (vertical axis) and prevalence (horizontal axis) for (a) Fracture-data and (b) Chest-data. The size of the dots corresponds to the training dataset size

Table 3 Pearson correlation coefficients for training set size, prevalence, and model performance metrics

In Fig. 5, performance metrics are summarized in boxplots.

Fig. 5

Boxplot of performance metrics per model for (a) Fracture-data and (b) Chest-data

In Table 4 (Fracture-data) and Table 5 (Chest-data), all pairs of models are compared on all performance metrics. For the Fracture-data, the BERT model outperforms the other models on most metrics, except for sensitivity and negative predictive value when compared with LSTM and CNN. For the Chest-data, BERT outperforms the other models for sensitivity, NPV, AUC, and F score. Specificity and PPV demonstrated no significant differences among the models. Table 6 highlights the most important findings.

Table 4 Comparison and t-test statistics of all performance metrics for all combinations of models trained on Fracture-data. The bold and underlined models have significantly better performance in the particular comparisons
Table 5 Comparison and t-test statistics of all performance metrics for all combinations of models trained on Chest-data. The bold and underlined models have significantly better performance in the particular comparisons
Table 6 Results summary of model performance

Discussion

In this study, we systematically evaluated the impact of training dataset size and prevalence, model type, and data complexity on the performance of four deep learning NLP models applied to radiology reports. The semi-automated pipeline allowed us to construct training sets of different sizes and different levels of class imbalance. This setup was chosen to discover the lower limit of usable dataset size and prevalence, as well as the limit above which adding more data had no added value. The results demonstrated that report complexity has a major impact on performance, as illustrated by the substantially lower performance on the more complex dataset. For both datasets, the impact of training size and training prevalence showed an identical pattern. Specificity and positive predictive value increased until there were about 800–1000 training samples and plateaued after that. Sensitivity and negative predictive value did not benefit substantially from an increase in the amount of training data. Prevalence correlated positively with sensitivity and negative predictive value and negatively with specificity and positive predictive value. This aligns with the theoretically expected direction of effect, as explained in Appendix A3: in the case of class imbalance, the model tends to predict the majority class; predicting one class more often increases the number of false predictions for that class and lowers its predictive value, while reducing the false predictions for the other class and raising the corresponding predictive value.
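
A toy calculation with invented confusion-matrix counts on a fixed test set of 100 positives and 400 negatives illustrates this direction of effect; the numbers are purely illustrative and not taken from the study data.

```python
# Invented toy counts on a fixed test set of 100 positives and 400 negatives.
def metrics(tp, fn, tn, fp):
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp), "npv": tn / (tn + fn)}

# Trained on low prevalence: the model leans toward the negative (majority) class.
print(metrics(tp=70, fn=30, tn=390, fp=10))  # lower sensitivity/NPV, higher specificity/PPV
# Trained on high prevalence: the model leans toward the positive class.
print(metrics(tp=95, fn=5, tn=350, fp=50))   # higher sensitivity/NPV, lower specificity/PPV
```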

The BERT model was the most stable algorithm in this study and demonstrated a limited impact of variation in data complexity, prevalence, and dataset size. BERT outperformed the other models on most metrics. The drawback of using BERT was the substantially longer training time of 30–60 min compared with less than 1 min for the other three algorithm types.

The conventional fully connected neural network (Dense) demonstrated the worst performance. The major difference from BERT, CNN, and LSTM is that the Dense network does not take the relationships between individual words into account. It is therefore not surprising that the nuances radiologists embed in their reports are better extracted by the more advanced algorithm types.

To our knowledge, this is the first systematic, multifactorial comparative analysis of deep learning NLP in the field of radiology reporting. While no other study includes all the factors investigated here, several authors have described one or more of these factors in their studies on natural language processing.

Comparison of CNN and traditional NLP

Weikert et al. [35] compared two conventional NLP methods and a deep learning NLP (CNN) model and analyzed the impact of the amount of training data. The CNN outperformed the other models. Even though the authors also investigated chest radiology reports, the main difference is that they used only the impression section of CT pulmonary angiography reports, which resulted in a more focused classification task. This could also explain the difference in the number of training reports needed to reach a performance plateau (500 in their study compared with 800–1000 in ours).

Krsnik et al. [36] compared several traditional NLP methods with a CNN for classifying knee MRI reports. The CNN outperformed the other techniques and performed well (F1 score 0.89–0.96) for the most represented conditions, with a prevalence of 0.57–0.63. Conditions with lower prevalence were better detected with the conventional NLP methods. This illustrates the relationship between NLP model performance, prevalence, and model type. In contrast to our study, prevalence was not varied.

Barash et al. [24] compared five different NLP algorithms, including four LSTM deep learning-based methods, and applied them to classify Hebrew-language radiology reports in a general task (normal vs. abnormal, prevalence 46%) and a specific task (hemorrhage present or absent, prevalence 7%). Their results were in the same range as ours, with lower sensitivity (66%–79%) and PPV (70%) and higher specificity and NPV in the low-prevalence task, compared with equal sensitivity/specificity (88%) and PPV/NPV (90%) in the high-prevalence task.

Comparison of LSTM and BERT

Datta et al. [37] applied several deep learning NLP methods to chest radiology reports, where BERT outperformed LSTM. Instead of annotations at the report level (as in our study), they used annotations at the sentence level with words or word combinations, indicating not only diagnoses but also the spatial relation between a finding and its associated location. Because of this difference, the results are not directly comparable. The authors used under-sampling to deal with the substantial class imbalance at the sentence level in their data. In under-sampling, cases from the over-represented class are ignored in the training dataset. In fact, this is a variant of our approach, in which the fraction of positive and negative cases is varied to optimize model performance. Within the sentences, the authors described higher performance for the words with spatial information (5–6 times more frequent; F score 91.9–96.4) than for the less frequent words describing diagnoses (F score 75.2–82.8). Our study supports these results, with a greater imbalance ratio and a greater performance difference.

Comparison of different BERT models

Bressem et al. [38] compared four different BERT models, including RAD-BERT, which was specifically pre-trained on a large corpus of radiology texts. One of their other models was a pre-trained native-language (German) model, just as we used a pre-trained Dutch model. Their analysis of the impact of training set size for fine-tuning on model performance demonstrates a curve with a steep increase between 200 and 1000 cases, a gradual increase between 1000 and 2000, and a plateau in the 3000–4000 range. This is confirmed in our study. The investigated items in the radiology reports differed in prevalence and also in model performance metrics. This suggests a relation between performance and prevalence, but the authors did not vary the prevalence within the dataset, as we did. Their best-performing model demonstrated a best pooled AUC of 0.98, compared with a best AUC of 0.94 (Chest-data) and 0.96 (Fracture-data) for the BERT model in our study.

Our study and the referenced literature demonstrate the surprisingly high performance of deep learning NLP in radiology reporting. Information from both simple and more complex unstructured radiology reports can be extracted and used for downstream tasks such as epidemiological research, identification of incidental findings, assessment of diagnostic yield and imaging appropriateness, and labeling of images for training of computer vision algorithms [39,40,41,42].

Limitations

The absence of an inter-rater agreement assessment of the ground-truth annotations is a limitation. However, an unblinded assessment of the consistency of the annotations of the Fracture-data by two radiologists and a blinded intra-rater agreement assessment of both datasets demonstrated excellent results.

Even though we constructed training sets with considerable variation in size and prevalence, the possible combinations were dependent on the original datasets' characteristics. The impact of variation in size and prevalence beyond these limits should be explored in further research.

Another limitation is that we investigated two levels of report complexity but did not consider variation in report size within the datasets. Further research should elucidate to what extent NLP model performance depends on the size of radiology reports in both the training and test sets. This is relevant because, for clinical texts considerably larger than those in our datasets, research has demonstrated reduced performance of BERT compared with simpler architectures [43].

The results of our study are not directly generalizable to radiology reports from other institutions or in other languages, and external validation of the models should be performed to assess this. Because BERT models are pre-trained on large datasets, and because our BERT model delivered more stable results than the other models, we expect BERT to perform best under external validation.

Conclusion

For NLP of radiology reports, all four model architectures demonstrated high performance.

The BERT algorithm outperformed the CNN, LSTM, and Dense models, owing to its stable results despite variation in training set size and prevalence.

Awareness of variation in prevalence is warranted because it impacts sensitivity and specificity in opposite directions.