Using the Textual Content of Radiological Reports to Detect Emerging Diseases: A Proof-of-Concept Study of COVID-19

Changes in the content of radiological reports at population level could detect emerging diseases. Herein, we developed a method to quantify similarities in consecutive temporal groupings of radiological reports using natural language processing, and we investigated whether the appearance of dissimilarities between consecutive periods correlated with the beginning of the COVID-19 pandemic in France. CT reports from 67,368 consecutive adults across 62 emergency departments throughout France between October 2019 and March 2020 were collected. Reports were vectorized using term frequency-inverse document frequency (TF-IDF) analysis on one-grams. For each successive 2-week period, we performed unsupervised clustering of the reports based on TF-IDF values and partitioning around medoids (PAM). Next, we assessed the similarities between this clustering and a clustering from two weeks before according to the average adjusted Rand index (AARI). Statistical analyses included (1) cross-correlation functions (CCFs) with the number of positive SARS-CoV-2 tests and the advanced sanitary index for flu syndromes (ASI-flu, from an open-source dataset), and (2) linear regressions of time series at different lags to understand the variations of AARI over time. Overall, 13,235 chest CT reports were analyzed. AARI was correlated with ASI-flu at lag = +1, +5, and +6 weeks (P = 0.0454, 0.0121, and 0.0042, respectively) and with SARS-CoV-2-positive tests at lag = −1 and 0 weeks (P = 0.0057 and 0.0001, respectively). In the best fit, AARI correlated with the ASI-flu with a lag of 2 weeks (P = 0.0026), SARS-CoV-2-positive tests in the same week (P < 0.0001), and their interaction (P < 0.0001) (adjusted R² = 0.921). Thus, our method enables the automatic monitoring of changes in radiological reports and could help capture disease emergence. Supplementary Information: The online version contains supplementary material available at 10.1007/s10278-023-00949-z.


Introduction
Radiological reports represent a colossal amount of information with several applications. Using natural language processing (NLP) to label reports could help generate large cohorts, plan human and technical resources, assess compliance with guidelines, and detect discrepancies between results and conclusions [1-3]. It has recently been shown that the structure and content of reports written by emergency radiologists depend on their personal background, examination characteristics, and workload [4]. From a clinical perspective, one could hypothesize that a newly emerging disease with a significant impact on health would lead to new patterns in radiological descriptions that could be captured with NLP before the semiology of the disease has been deciphered, a step that is inherently delayed by several weeks because of the time needed to understand patterns, collect databases, and statistically verify associations between features and diseases. Thus, such NLP-based detection methods applied to radiological reports could complement other efforts to detect emerging diseases, notably wastewater-based surveillance and clinical surveillance [5,6].
Regarding the coronavirus disease 2019 (COVID-19) outbreak due to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the first patients were clinically reported in China in December 2019 [7]. The first radiological series involving the initial strain was published online in February 2020 and highlighted a peculiar semiology on chest CT with bilateral peripheral ground glass opacities (GGOs), consolidations, and interstitial thickening [8,9]. In France, the first three patients were identified on January 24, 2020, followed by progressive spread across the French territory until the first French lockdown on March 17, 2020 (with n = 1097 patients newly diagnosed with positive SARS-CoV-2 by reverse transcriptase polymerase chain reaction (RT-PCR)) [10]. The French Society of Radiology and the French Society of Thoracic Imaging (SFR-SIT) actively provided templates for standardized chest CT reports in the setting of suspected SARS-CoV-2 infection across the radiologist community on April 1, 2020 [11]. Between the first COVID-19 diagnosis in France and the availability of these templates, French radiologists wrote their reports according to their own experience in thoracic imaging and the objective abnormalities on chest CT. So far, most studies using artificial intelligence have applied a supervised methodology to medical images in order to triage patients, distinguish common pneumonitis from COVID-19 lung disease, assess the severity of COVID-19 lung disease, or anticipate oxygen requirements using classical machine-learning or deep-learning algorithms [12-16]. Regarding NLP applications, Li et al. trained supervised machine-learning models to automatically identify CT reports with the diagnosis of acute appendicitis, diverticulitis, and bowel obstruction and secondarily applied those models to a large population to investigate the impact of the COVID-19 pandemic on their detection in emergency departments [17].
Consequently, our aims were (i) to develop an original unsupervised NLP method to detect variations in the content of chest CT reports at a population (or macroscopic) scale, without a priori knowledge of the possible occurrence of a new disease and its typical radiological presentation and before the availability of biological diagnostic tests for the whole population, and (ii) to test the ability of this method to detect the start of the COVID-19 pandemic in France.

Study Design and Population
This observational retrospective multicenter study was approved by the French national radiological review board (CRM-2303-337). The need for written informed consent was waived due to its retrospective nature and to the fact that data were anonymized.
Three cohorts from IMADIS Teleradiology were investigated: Cohort-1 (covering the 4 months before the first official COVID-19 case in France to 2 weeks after the 1st French lockdown) and two reference cohorts named Cohort-R1 (covering the first 2 weeks of September 2019, i.e., distant from any potential event) and Cohort-R2 (covering the first 4 weeks of November 2020, i.e., during the peak of the 2nd COVID-19 wave in France). IMADIS Teleradiology is a medical company dedicated to the remote interpretation of imaging from emergency departments in French public and private hospitals.
In Cohort-1, we included all consecutive patients between October 6, 2019, and March 28, 2020, who fulfilled the following criteria: (i) had a request for a CT of at least the chest by an emergency physician from one of the 62 partner centers of IMADIS Teleradiology at that time and (ii) had an available radiological report made in real time by one of the 171 emergency radiologists working at IMADIS Teleradiology during this study period.
In Cohort-R1, the same inclusion criteria were applied to patients between September 1, 2019, and September 14, 2019.
In Cohort-R2, we included all consecutive patients between November 1, 2020, and November 28, 2020, who fulfilled the following criteria: (i) had a request for a CT of only the chest by an emergency physician from one of the 76 partner centers of IMADIS Teleradiology at the time, and (ii) had an available radiological report made in real time by one of the 173 emergency radiologists working at IMADIS Teleradiology at that time. The rationale for excluding examinations not specifically covering the chest in Cohort-R2 was to obtain a representative cohort of examinations that were more likely to be specifically requested for COVID-19 during a period of high prevalence of positive SARS-CoV-2 tests.
For all cohorts, we excluded patients with denied requests, MRIs, secondary opinions from outside center examinations, radiological reports not containing a clearly defined "Results" section, CTs involving body areas other than the chest, or no clearly defined paragraph for the chest analysis within the "Results" section (for instance, starting with a heading such as "Thorax," "Chest," or "Thoracic analysis," and finishing with a line break).
Figure 1 shows the flowchart.

Text Preprocessing
The radiological reports were written in French. Radiologists completed free-text areas by typing or using speech recognition software (Dragon Medical Direct, Nuance Healthcare, Burlington, MA, USA). Spelling mistakes were highlighted in real time to reduce manual corrections. Templates for normal examinations were available and editable. For Cohort-R2, structured reports for the analysis of chest CT for suspected COVID-19 were also available, based on the template provided by the SFR-SIT in April 2020.
Text preprocessing was performed with R (v4.1.0, The R Foundation for Statistical Computing, Vienna, Austria) using the "tidytext" and "stringr" packages [18] and focused on the paragraph related to chest analysis in the "Results" section, as these results were the most meaningful and the most likely to be modified by new radiological findings. Supplemental Data S1 details the preprocessing.

Iterative Unsupervised Clustering
Our aim was to automatically perform unsupervised clustering of the preprocessed reports over consecutive biweekly periods (T) and to compare the resulting clusters with those obtained from the clustering of a reference period two weeks before (T-2). It must be emphasized that the accuracy of the descriptions in the chest CT reports was not specifically verified in this pipeline and that there was no supervised analysis with an outcome to predict. In other words, our goal was to group the texts without any a priori assumption, based solely on the words they contain.
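As an illustration of how these periods can be laid out, the following Python sketch enumerates 2-week windows over the Cohort-1 inclusion dates (October 6, 2019, to March 28, 2020) and pairs each period T with its reference period T-2. It assumes the 1-week increment used in the pipeline; the function names `biweekly_windows` and `window_pairs` are hypothetical, and the study itself was carried out in R, so this is not the authors' code.

```python
from datetime import date, timedelta


def biweekly_windows(start, end, length_days=14, step_days=7):
    """Return (first_day, last_day) tuples for sliding windows lying fully inside [start, end]."""
    windows = []
    first = start
    while first + timedelta(days=length_days - 1) <= end:
        windows.append((first, first + timedelta(days=length_days - 1)))
        first += timedelta(days=step_days)
    return windows


def window_pairs(windows, lag_steps=2):
    """Pair each period T with the period that starts two steps (2 weeks) earlier, i.e. T-2."""
    return [(windows[i - lag_steps], windows[i]) for i in range(lag_steps, len(windows))]


# Cohort-1 inclusion dates as stated in the study design.
windows = biweekly_windows(date(2019, 10, 6), date(2020, 3, 28))
pairs = window_pairs(windows)

for reference, current in pairs[:3]:
    print("T-2:", reference[0], "to", reference[1], "| T:", current[0], "to", current[1])
```

With a step of one week and a reference offset of two steps, T and T-2 never share a report, which keeps the two clusterings independent.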
The principle of the analysis was as follows (Fig. 2):
- For each time period T of 2 weeks (with an increment of one week), we filtered the $N_T$ observations from T and the $N_{T-2}$ observations from the reference period T-2.
- We performed a term frequency-inverse document frequency (TF-IDF) analysis on all stemmed nonstop words identified during T and T-2 ($n = n_{\text{words}(T+T-2)}$), which enabled the conversion of text to $n_{\text{words}(T+T-2)}$ numeric variables (methodology in Supplemental Data S2) [19].
We repeated this process for each pair of consecutive time periods (T-2, T) from Cohort-1, with 1-week increments.
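To make the vectorization step concrete, here is a minimal Python sketch of TF-IDF computed on the pooled reports of T-2 and T. It assumes one common TF-IDF variant (term frequency normalized by report length, IDF equal to the natural log of N divided by the document frequency); the exact weighting used in the study is described in Supplemental Data S2 and may differ. The function name `tfidf_matrix` and the toy English stems are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Convert tokenized reports (lists of stemmed tokens) into TF-IDF rows.

    Returns (vocabulary, rows), where each row has one weight per vocabulary term.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of reports containing each term at least once.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / doc_freq[tok]) for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for an empty report
        rows.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, rows


# Toy, already-stemmed reports (illustrative only, not study data).
reports_t_minus_2 = [["nodul", "right", "lobe"], ["pleural", "effus"]]
reports_t = [["ground", "glass", "opac"], ["ground", "glass", "opac", "bilater"]]

# Pool both periods so that they share a single vocabulary, as in the step above.
vocab, rows = tfidf_matrix(reports_t_minus_2 + reports_t)
print(len(vocab), "numeric variables for", len(rows), "reports")
```

Pooling T-2 and T before vectorizing means a term that appears only in the newer period still gets its own column, which is what allows new vocabulary to influence the clustering.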
As a confirmatory analysis, we repeated the same analysis using Cohort-R1 and the last 2 weeks of Cohort-R2 (Cohort-R2') as references.

Additional Data Collection
Clinical and Radiological Annotations
For all cohorts, we extracted the following information: patient age and sex and CT protocol (i.e., contrast medium injection, body areas covered by CT scans, CT pulmonary angiogram (CTPA)). The nature of the conclusion of the CT reports was prospectively encoded by the emergency radiologists when validating the CT report (categorized as "nonpathological," "pathological, related to symptoms," and "pathological, unrelated to symptoms" (i.e., fortuitous)). Of note, "pathological, related to symptoms" did not necessarily imply COVID-19 lung disease and did not reflect the severity of the pathological findings.
Epidemiological Datasets
Epidemiological datasets were retrieved from data.gouv.fr, an open-source platform storing public datasets [10]. We used the daily time series of the Advanced Sanitary Index of flu syndromes (ASI-flu, highly correlated with the incidence of flu syndromes) and the number of positive tests for SARS-CoV-2 across the French territory. We then filtered the observations over the same time periods as Cohort-1, Cohort-R1, and Cohort-R2. It must be noted that the epidemiological datasets and the radiological datasets were not directly matched by patient.

Converting to Time Series
For all time periods, we counted the number of stemmed nonstop words related to the main pathological radiological features, namely (1) consolidation, (2) fibrosis, (3) effusion, (4) nodule, (5) ground glass opacities, (6) lymphadenopathies, (7) crazy paving, and (8) reticulation, and divided it by the number of observations from the time period of interest to obtain their frequency and to understand the iterative unsupervised clusters obtained over time. The raw images corresponding to the CT reports were not reviewed to verify the actual presence of the features. We also calculated the percentage of CTPAs, the percentage of pathological examinations, the number of newly confirmed SARS-CoV-2 infections, and the average ASI-flu value.
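The per-period feature frequency described above (number of feature-related words divided by the number of reports in the period) can be sketched as follows in Python. The stem lists are hypothetical placeholders, multi-word features are assumed to have been collapsed into single tokens during preprocessing, and the function name `feature_frequencies` is ours.

```python
def feature_frequencies(reports, feature_stems):
    """Frequency of each radiological feature in one time period.

    reports: list of tokenized, stemmed reports from the period.
    feature_stems: dict mapping a feature name to the set of stems that denote it.
    Returns occurrences of the feature's stems divided by the number of reports.
    """
    n_reports = len(reports)
    freqs = {}
    for feature, stems in feature_stems.items():
        occurrences = sum(1 for doc in reports for tok in doc if tok in stems)
        freqs[feature] = occurrences / n_reports if n_reports else 0.0
    return freqs


# Hypothetical stems, for illustration only.
feature_stems = {
    "nodule": {"nodul"},
    "effusion": {"effus"},
    "ground glass opacities": {"ggo"},
}
period_reports = [["nodul", "right"], ["effus", "nodul"], ["normal"], ["nodul"]]

freqs = feature_frequencies(period_reports, feature_stems)
print(freqs)
```

Because the count is of word occurrences rather than of reports, a feature mentioned twice in one report counts twice, and a negated mention counts like an affirmed one.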

Statistical Analyses
Statistical analyses were also performed with R (v4.1.0). All tests were two-tailed. A P value < 0.05 was deemed significant. Associations between categorical variables were tested with chi-square tests.

Comparing the Similarities of Clusters
For each pair of time periods (T-2, T), the similarities between the $K_T$ and $K'_{T-2}$ clusters (in T) and between the $K_{T-2}$ and $K'_T$ clusters (in T-2) were calculated using the adjusted Rand index (ARI) (methodology in Supplemental Data S4) [21], and confidence intervals (CIs) were estimated by bootstrapping with 1000 replicates using the "pdfCluster" and "boot" packages.
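For readers unfamiliar with the ARI, the following self-contained Python sketch computes it from two cluster assignments of the same reports, using the standard pair-counting formula. It is an illustration only: the study used the R "pdfCluster" package, and the way the two ARI values of a pair of periods are combined into the AARI is described in Supplemental Data S4 and is not reproduced here. The function name `adjusted_rand_index` is ours.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same observations."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same observations")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pair of observations to disagree on
    # Pairs of observations grouped together in both partitions.
    together_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (together_in_both - expected) / (maximum - expected)


# Same grouping with swapped label names: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Groupings that cut across each other: agreement below chance level.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index equals 1 for identical partitions regardless of how clusters are named and is close to 0 when agreement is no better than chance, which is why a drop in its value signals a change in how reports group together.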

Explaining Clustering Dissimilarity
Correlations between time series were investigated with the cross-correlation function (CCF) (methodology in Supplemental Data S5). Moreover, time series linear regressions between the number of SARS-CoV-2-positive tests, ASI-flu values, and AARI values were performed for different lags. In this comprehensive analysis of AARI values, the explanatory variables were the number of SARS-CoV-2-positive tests and the ASI-flu value (both provided in the epidemiological datasets). Goodness of fit was evaluated with the adjusted R-squared value (adj-R², or coefficient of determination; methodology in Supplemental Data S6) [22].
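The lagged correlation underlying a CCF can be illustrated with the short Python sketch below. It follows the convention stated in the footnote of Table 2 (at lag k, the AARI at t-k is compared with the other series at t) and, as a simplifying assumption, computes a plain Pearson correlation on the overlapping part of the two series; the estimator used in the study (Supplemental Data S5) may normalize differently. The function names and toy values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(aari, series, k):
    """Correlation between aari[t - k] and series[t] over the overlapping time points."""
    if len(aari) != len(series):
        raise ValueError("both series must cover the same time periods")
    n = len(aari)
    if k >= 0:
        return pearson(aari[: n - k], series[k:])
    return pearson(aari[-k:], series[: n + k])


# Toy series in which `series` reproduces `aari` one period later.
aari = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
series = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0]
for k in (-1, 0, 1):
    print(k, round(lagged_correlation(aari, series, k), 3))
```

Each extra lag removes one overlapping pair, so with short biweekly series only small lags leave enough points for a stable estimate.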
The list of CT devices used across all the partner centers is given in Supplementary Data S7.

Analyzing Words from Dissimilar Periods
We investigated which words were increasingly mentioned by analyzing the strongest variations (top 10) in the quantile of the number of occurrences during the most dissimilar periods, i.e., from 2020/03/08 to 2020/03/21 and from 2020/03/15 to 2020/03/28 (see Table in Supplemental Data S7).

Against Epidemiological Data
The biweekly time series related to the two SPF datasets, the rates of CTPAs, and the rates of pathological examinations are shown in Fig. 4, with their CCFs against AARI values based on the iterative approach. Table 2 shows the time lags with significant cross-correlations. The highest significant CCFs were found at lag = 0 for the rates of CTPAs (CCF = +0.805, P = 0.0003), the rates of pathological examinations (CCF = −0.493, P = 0.0211), and the number of positive SARS-CoV-2 tests (CCF = −0.854, P = 0.0001) and at lag = +6 for the ASI-flu value (CCF = −0.648, P = 0.0042, i.e., significant correlations with the AARI values six weeks later).

Discussion
Herein, we proposed an innovative method based on text cleaning, TF-IDF vectorization, unsupervised clustering, and time series analysis to investigate whether the content of radiological reports changed at the beginning of an outbreak of a new emerging disease, before the availability of standardized reports specific to this disease and the spread of medical knowledge across the radiological community.
Based on the example of the beginning of the COVID-19 pandemic, our results showed that this method was feasible and provided a similarity measure, which was negatively correlated with the incidence of new cases of SARS-CoV-2.
Our method takes advantage of common information and technology tools in teleradiology. As the examinations were performed in several centers scattered across France, these data sampled emergency activity and provided an overview of what was occurring in emergency departments. A prior study highlighted that teleradiological monitoring of the SFR-SIT diagnostic score could approximate the course of the COVID-19 pandemic in France [23]. However, developing such a workflow relying on the SFR-SIT score implies that we already know that a new disease has emerged and what its semiology is. Herein, our goal was to identify breaks in the content of reports automatically and in an unsupervised manner without a priori information.
The similarity between consecutive clusters shrank in early March 2020 (from an AARI value of 1 to 0.15), which corresponds to the inflexion of positive SARS-CoV-2 tests (i.e., 49 patients across France for the 2020/02/16 to 2020/02/29 period, 4376 for the 2020/02/23 to 2020/03/07 period, 13,510 for the 2020/03/01 to 2020/03/14 period, and 33,075 for the 2020/03/15 to 2020/03/28 period) [10]. To confirm these results, we replicated the same unsupervised method but with different reference periods. Using Cohort-R1 as a reference, we observed similar variations in AARI values (T, R1), that is, a strong decrease in March 2020. Using Cohort-R2' as a reference, we observed an increase in the AARI value (T, R2) in March 2020, which means that reports in March 2020 were increasingly similar to reports from the 2nd wave peak, when SFR-SIT-based standardized reports were widely used.
To understand these temporal variations, we investigated associations with simpler textual data (i.e., the frequency of words related to chest CT semiology), the number of CTPAs and pathological examinations, and   because the relationships among COVID-19 infection, the prothrombotic state and pulmonary embolism were not already known but were described in late April 2020 [24].
To date, only non-contrast-enhanced chest CT scans have been performed for acute respiratory symptoms. Regarding cross-correlations with the main radiological features, the AARI value (T, T-2) was positively correlated with the words "nodules" (CCF coefficient = 0.851) at lag = 0 and "consolidation" (CCF coefficient = 0.462) at lag = 3. Indeed, these features were rarely encountered during COVID-19 lung infection and were generally due to superinfection [25], whereas nodules and consolidation were routinely found in the common bacterial pneumonitis and bronchiolitis seen before the COVID-19 outbreak. Conversely, "GGOs," "crazy paving," "reticulations," and "fibrosis" showed strongly negative CCF coefficients (< −0.800 for all) at lag = 0, which makes sense considering that these features are typical of COVID-19 infection. However, effusion (either pleural or pericardial) and lymphadenopathies were also negatively cross-correlated at lag = 0, although they are not specific to COVID-19 infection (found in 3 to 17.8% of patients with proven SARS-CoV-2 infection) [25-28]. A possible explanation is that radiologists mentioned these features in their reports but in negated statements (i.e., to report their absence). Finally, linear regression analyses emphasized the strong relationships between AARI and ASI-flu values (taking the value 2 weeks before) and the number of positive SARS-CoV-2 tests (at lag = 0). Indeed, an adj-R² value of 0.921 corresponds to an excellent fit. Regarding ASI-flu values, the best fit obtained with this lag can be explained by the entanglement with the end of the flu epidemic in France and the confusion with flu-like symptoms due to COVID-19 exposure occurring 1-2 weeks before the clinical worsening requiring a visit to the emergency department.
Future research could investigate whether this method could prospectively detect the appearance of new SARS-CoV-2 variants or new infectious diseases that would be responsible for pathological radiological features (for instance, infectious colitis or meningitis). In case of breaks in the content of radiological reports (as measured with AARI) at a given time period, the reports and their corresponding images from this time period could be reviewed in detail to explain the dissimilarity, and secondarily correlated with geographical, clinical, and biological data of those patients with the help of public health agencies. Furthermore, we believe that correlating radiological time series (such as the raw numbers of normal and pathological examinations per imaging modality per time unit) with economic data could provide relevant information to better anticipate the economic impact of emerging or resurging diseases on hospitals and to better plan human and technical resources [29].
Our study has limitations. First, other NLP methods could have been used. The bag-of-words approach and TF-IDF vectorization are classically used in NLP but do not distinguish affirmed findings from negated ones. We used PAM and the Pearson distance, as they are robust and usually effective, but other clustering algorithms (such as k-means and HDBSCAN) and distance metrics are available. It is also possible to perform unsupervised clustering on latent layers of autoencoder neural networks or to use latent Dirichlet allocation, which may be more sensitive for detecting new trends earlier and in smaller groups of patients [30,31]. Second, we performed our proof-of-concept demonstration at the beginning of the COVID-19 pandemic, but this method should be confirmed prospectively. Third, the CT reports were not retrospectively reviewed, as we used the chest CT reports consecutively produced by radiologists during their on-call duty, in a real-life setting, and provided to emergency physicians. Consequently, it is possible that radiologists missed some pathological findings on chest CT (such as small areas of subpleural GGO), especially at the beginning of the COVID-19 outbreak. Indeed, it would hardly be feasible to retrospectively review and annotate thousands of CT images and CT reports, and we believe that this is an inherent limitation of macroscopic studies performed at the population level. Fourth, various CT devices were used for the CT acquisitions across the partner centers and the study periods, which could have influenced the image quality and the reports.

Conclusion
In conclusion, we proposed a method to exploit large databases of radiological reports routinely collected in practice. Iteratively and automatically assessing the dissimilarities between radiological reports from consecutive periods could help detect variations in the observations made by radiologists, which could have several applications, such as monitoring emerging diseases or any public health issue.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 2 Principle of the text clustering. The full study period from Cohort-1 was split into several time periods of 2 weeks with an increment of 1 week. Abbreviations: T, a given time period; T-2, a

Fig. 3 Average adjusted Rand index (AARI) as a function of time in A the main, iterative (T, T-2) approach (i.e., evaluating similarities between reports from a given time period T with the reports from 2 weeks before); B the (T, R1) approach (i.e., evaluating similarities between reports from a given time period T with the reports from a reference period R1 far before any wave of infection); and C the (T,

Fig. 4 Cross-correlation between AARI(T-2,T) and clinical and epidemiological biweekly time series for A the number of CT pulmonary angiograms (CTPAs), B the number of pathologic examinations related to symptoms, C the number of SARS-CoV-2-positive RT-PCRs, and D the advanced sanitary index (ASI) for flu syndromes.

Table 1
Characteristics of the main cohort (Cohort-1) and the two reference cohorts (Cohort-R1 and Cohort-R2). Data are numbers of patients with percentages in parentheses, except for age, the number (no.) of radiologists, and the no. of examinations per radiologist. Abbreviations: CTPA, CT pulmonary angiogram; IQR, interquartile range; SD, standard deviation.

Table 2
Significant lags and correlations obtained between the average adjusted Rand index (AARI) and clinical and epidemiological biweekly time series. P values are adjusted using the Benjamini-Hochberg procedure. Values in bold correspond to the highest absolute value of the CCF coefficient and its corresponding lag, for each clinical and epidemiological feature of interest. Abbreviations: ASI, advanced sanitary index; CCF, cross-correlation function; CTPA, CT pulmonary angiogram; GGO, ground glass opacities; no., number. *P < 0.05; **P < 0.005; ***P < 0.001. a Lag: corresponds to the shift between the two time series. If significance is found at Lag = k between AARI and one of the time series, the result should be interpreted as the correlation between the AARI at t-k and the time series at t.

Table 3
Significant lags and correlations obtained between the average adjusted Rand index (AARI) and textual biweekly time series related to the main radiological features described on chest CT. P values are adjusted using the Benjamini-Hochberg procedure. Values in bold correspond to the highest absolute value of the CCF coefficient and its corresponding lag, for each clinical and epidemiological feature of interest. Abbreviations: ASI, advanced sanitary index; CTPA, CT pulmonary angiogram; GGO, ground glass opacities. *P < 0.05; **P < 0.005; ***P < 0.001.