Key messages

  • Brain imaging is expensive to perform at scale for research purposes, and automated reading of base images is yet to be developed for most important disease phenotypes. Therefore reading of brain imaging text reports at scale would be useful for research and clinical purposes.

  • We developed a natural language processing (NLP) algorithm to identify 24 brain imaging phenotypes in two areas of NHS Scotland which had excellent positive predictive value for cerebrovascular and neurodegenerative phenotypes.

  • Use of radiologists’ reports of brain imaging in clinical practice can be useful for cohort development and outcome ascertainment of neurological phenotypes.


Brain imaging with computerized tomography (CT) and magnetic resonance imaging (MRI) can identify biomarkers of brain pathology that are important for the accurate diagnosis and phenotyping of many neurological diseases. However, brain imaging is expensive, and there are practical constraints to its use for research purposes, particularly in elderly and frail populations. Brain imaging reported by expert radiologists is nevertheless performed very frequently in clinical practice: in NHS England, for example, ~ 700,000 brain MRIs were performed between 2016 and 2017 [1]. Therefore, the reports of brain imaging could aid phenotype definition at lower costs, and be used for large-scale epidemiological studies of volunteers (e.g. UK Biobank) [2], cohorts of patients with disease, cataloguing clinical images, and for system-wide health care quality improvement.

In clinical practice, a radiologist reads a brain image and produces a text report of the findings. However, it is difficult to use these text reports in large scale research studies, because manually coding many thousands of reports is time consuming, and subject to inter- and intra-annotator variation [3]. Many radiology reports are unstructured, despite initiatives to improve this by the Radiological Society of America and other organizations, and therefore are difficult to use with the simplest automated methods of text searching.

One solution is to use natural language processing (NLP) methods to extract information from unstructured text in a radiology report. NLP algorithms can be constructed to identify clinically relevant phenotypes within text and to determine the grammatical relationship between different phrases. Rule-based NLP algorithms can have a high sensitivity (i.e. identify a high proportion of true cases) and high positive predictive value (i.e. a high proportion of those identified are true cases) in clinical records, for example identifying appendicitis, acute lung injury and cancer for use in cohort building, query based case retrieval and quality assessment of neurological practice [4]. For example a US-based study from Partner’s Healthcare identified ~ 6000 cases of cerebral aneurysms and ~ 6000 matched controls with a penalized logistic regression NLP model using radiology reports and other text and routine coding, giving a positive predictive value of 91% for the presence of aneurysms [5]. For the identification of stroke phenotypes, a study of 400 reports from the Mayo Clinic and Tufts Medical Center demonstrated that a rule-based NLP system had an excellent positive predictive value (1.0) for the identification of ‘silent brain infarcts’ and a convolutional neural network had an excellent positive predictive value (0.99) for the identification of white matter disease [6].

We aimed to develop and test an NLP algorithm to extract brain phenotypes from CT and MR brain radiology reports in NHS Scotland. We developed a list of brain phenotypes, primarily related to cerebrovascular disease that could be extracted from reports; determined ground truth in each report by expert review of the text; and developed and validated an NLP algorithm in two different datasets from different regions of NHS Scotland.


The datasets used to create the algorithm are available, subject to potential users obtaining the necessary ethical, research and data governance approvals, from Edinburgh Stroke Study ( and Health Informatics Centre Services, Dundee (


We used two sources of radiology reports to develop and test our automated reading and labelling algorithm: (i) all the brain imaging reports between 2002 and 2014 of participants in the Edinburgh Stroke Study (ESS), a hospital based register of 2160 stroke and transient ischaemic attack (TIA) patients [7] (of whom 1168 could be linked to local radiology reports) and (ii) MR and CT brain reports from NHS Tayside (a different NHS health board within Scotland) performed in unselected in- and out-patients between December 1994 and January 2015 (n = 156,619). We received reports stripped of identifiers. We excluded reports of non-brain imaging that were of mixed brain and other body areas, or did not contain a complete radiologist’s report.

We divided each set of reports into datasets for algorithm development (dev) and algorithm validation (test). The ESS data as we received it appeared to be randomized. We reserved the first 500 reports as development data, of which 364 were manually annotated. The remaining 668 reports were further randomized and, of these, 266 were manually annotated to make a test dataset. The Tayside data contained 156,619 reports and we first split this into four equal parts. We used the first part to create manually annotated development data (362 reports) and a randomised version of the fourth part to create manually annotated test data (700 reports). The Tayside data contained a high proportion of ‘normal’ reports, so to enrich it for pathological findings, we used a regular expression search (“blood|bleed|haemor|hemor”) to select reports for the development set. We did not do this for the Tayside test data in order to ensure that it was truly random (of the 700 test reports, only 295 matched the above regular expression).

Phenotypes of interest, ground truth and agreement between expert readers

Two clinically trained readers (a neuroradiologist and a neurologist, both with specialist expertise in stroke) read 1692 reports and marked up and coded each report with open access annotation software ( [8] using the following simple clinically meaningful disease entity and modifier entity annotations in the reports: stroke (haemorrhagic vs ischaemic vs underspecified, deep vs cortical, recent vs old); atrophy (present vs absent); changes of small vessel disease (present vs absent); brain tumours (meningioma, glioma, metastasis, other); subdural hematoma (present vs absent); subarachnoid haemorrhage (aneurysmal or other); microbleeds (deep vs lobar vs unspecified); haemorrhagic transformation of infarct (present vs absent. We combined these annotations into 24 clinically meaningful phenotypes (e.g. old deep ischaemic stroke, recent cortical stroke etc., see Table 1), which the expert readers also identified for each report as annotations on the document level. Stroke types were defined as ‘underspecified’ when it was not possible to assign a location, age or stroke type. Table 1 lists the number of reports, sentences, tokens and annotations for each of the datasets and partitions and also provides detailed counts for each phenotype per partition. An example of an annotated and synthetic but realistic brain imaging report with entity and label (phenotype) annotation displayed via the Brat annotation tool is shown in Fig. 1. A report can be labelled with zero or more phenotypes (min = 0, max = 7, average = 1.4). In the chosen example the report is labelled with two phenotypes (Ischaemic stroke, cortical, old and small vessel disease).

Table 1 Dataset statistics: number of reports, sentences, entity, modifier and phenotype (label) annotation per data set (ESS dev/test vs Tayside dev/test) for annotator 1
Fig. 1
figure 1

Annotated example report displayed in the Brat annotation tool

We calculated inter-annotator agreement (IAA) between the two clinician annotators for each of the phenotypes in the ESS test dataset (n = 266) and a subset of the NHS Tayside test dataset (n = 100). We selected a subset of Tayside data for double annotation because we did not have the resources to annotate all test reports twice. The double annotated sample was the final 100 of the 700 randomly selected test reports. We use precision, recall and F1-score for entity annotation agreement because Cohen’s Kappa has been found to be an inappropriate metric for measuring IAA for named entities [9].. We use Cohen’s Kappa for the label (phenotype) annotations [9].

Natural language processing

We iteratively developed an NLP system to identify the 24 phenotypes in radiology reports. The NLP system, Edinburgh Information Extraction for Radiology reports (EdIE-R), is a staged pipeline process (see Fig. 2), with XML rule-based text mining software at its core [10]. Scan reports are first converted from text format into XML. Each report is then zoned into relevant sections (request, body of report, conclusions) using regular expressions. The text of the body of each report is then split into paragraphs, sentences and word tokens by a tokenization component. This is followed by part-of-speech (POS) tagging where words are labeled with their syntactic categories using the C&C POS tagger [11] in combination with two models, one trained on newspaper text and one on the Genia biomedical corpus [12]. This is followed by a lemmatization step using morpha [13] to analyze inflected nouns and verbs and determine their canonical form (e.g. bleed for bleed, bled, bleeding and bleeds). All information computed up to this point is the basis for named entity recognition (NER), negation detection and relation extraction. These processes are rule-based and also involve look-up from two manually created domain lexicons (i.e. dictionaries), developed by expert readers’ mark-up of text. These lexicons total around 400 entries though many of these are near duplicates arising from hyphenation and spelling variants (e.g. ‘intracranial’, ‘intra-cranial’, ‘intra cranial’; ‘haemorrhage’, ‘haemorrhage’). The negation detection and relation extraction also rely on an additional chunking step which determines noun and verb phrases in the text. Finally, document-level labels on the patient’s type of stroke or other diseases discussed in the report (phenotypes) are assigned based on the entities and relations present in the text. The rules for this step are a simple mapping from the previous layers of mark-up to the labels. For example, to choose a ‘small vessel disease’ label, the rules need only to check that there is a non-negative small vessel disease entity in either the report or conclusions part of the report. To choose the label ‘Ischaemic stroke, cortical, recent’ there needs to be a non-negative ischaemic stroke entity which is in a location relation with a loc:cortical entity as well as in a time relation with a time:recent entity.

Fig. 2
figure 2

EdIE-R system architecture

Assessment of performance

We developed EdIE-R first on the ESS development data set and validated the system on separate, novel reports from the ESS validation data set. We then validated the EdIE-R system in the NHS Tayside dataset, and further developed and improved its performance, before validating it again on unselected unseen NHS Tayside and ESS data. The main further developments were additions to the domain lexicons for named entities that had not previously been encountered, extensions of the rules to recognize negative contexts (e.g. ‘Exclude subdural bleed.’) and fine-tuning of the relation extraction rules.

We report the performance of EdIE-R phenotyping of reports against the reference phenotyping standard of clinical expert reading of reports. For each phenotype, we report sensitivity (proportion of true positive reports identified by EdIE-R), specificity (proportion of true negative reports identified by EdIE-R) and positive predictive value (proportion of EdIE-R positive reports that are true positive). For each measure and phenotype, we calculated 95% confidence intervals using the Wilson method, which generates asymmetrical confidence intervals suitable for values very close to either 100% or 0% [14]. We also report F1-scores which is the harmonic mean of precision (positive predicted value) and recall (sensitivity) and a standard metric used in NLP research.

Sample size

We based the sample size, n = 700, of the validation of the final version of EdIE-R in the Tayside dataset on a sensitivity of 95% for a phenotype of particular interest, old deep ischaemic stroke, with an estimated prevalence of 12% with a 95% Wilson interval width of 10%. The 700 reports were selected at random from the final quarter, n = 39,154, of the original Tayside dataset. The development data was selected from the first quarter.


We first developed the EdIE-R algorithm in 364 reports from the ESS dataset, and internally validated it on 266 further reports from the ESS dataset. We externally validated the algorithm on 362 reports from the NHS Tayside dataset, then further developed the algorithm with different data before a final external validation in 700 reports from NHS Tayside (Fig. 3).

Fig. 3
figure 3

Data flow of patient scan reports from the Edinburgh Stroke Study and NHS Tayside

IAA for the entity annotations was high for both subsets. In the ESS subset, we found a precision of 0.96, a recall of 0.98 and an F1-score of 0.97. For the Tayside subset, precision was 0.95, recall was 0.96 and F1 was 0.96 [15]. The agreement between the two expert annotators for all phenotypes was generally excellent in ESS (all Cohen’s κ > 0.95), less so in NHS Tayside (Cohen’s κ 0.39–1.00, see Table 2).

Table 2 Inter-annotator agreement measured in Cohen’s Κ between two human annotators in ESS (266 doubly annotated reports) & NHS Tayside (100 doubly annotated reports) for different phenotypes (document-level annotations)

We developed the NLP algorithm using the ESS dataset, which is enriched for cerebrovascular phenotypes. In unseen ESS validation data, the algorithm had an excellent specificity (≥99%) for all phenotypes and excellent sensitivity for stroke phenotypes, atrophy and small vessel disease (≥95%). However, we identified few cases of haemorrhagic stroke, subdural or subarachnoid haemorrhage, or brain tumours.

We further developed our model in 362 expert-annotated reports in NHS Tayside, and then tested the final EdIE-R model in 700 unselected expert-annotated NHS Tayside reports. The final EdIE-R model had excellent sensitivity, specificity and positive predictive value (all ≥95%) for the following phenotypes: cerebral atrophy, cerebral small vessel disease, and old deep infarcts. The algorithm identified any ischaemic stroke (n = 88) with a sensitivity of 89% (95% CI:81–94), positive predictive value of 85% (76–90) and specificity of 100% (0.99–1.00); haemorrhagic stroke (n = 23) with a sensitivity of 96% (95% CI: 80–99), positive predictive value of 72% (55–84) and specificity of 100% (0.99–1.00); and any brain tumour with a sensitivity of 96% (95% CI: 87–99), positive predictive value of 84% (73–91) and specificity of 100% (0.99–1.00). For individual stroke and tumour types, the number of patients with any one type was small, and therefore point estimates had wide confidence intervals (see Table 3).

Table 3 EdIE-R performance on the NHS Tayside test set (n = 700 reports). Small numbers suppressed due to data governance requirements

We tested the potential of the final EdIE-R algorithm to identify patients with particular brain phenotypes in the routinely acquired brain imaging reports from the Tayside region of Scotland. Of 98,036 patients, there was a preponderance of women (54.7%), particularly in the youngest (0–50, 54.9%) and oldest (> 75 yrs. 61.7%) age groups. A minority of patients had been admitted or died with stroke within 30 days of the date of scan (overall 6, 1.5% < 50 yrs., 9.5% > 75 years) (see Table 4). In the 110,695 scan reports of these patients, the most frequent phenotypes were cerebral atrophy (26%), cerebral small vessel disease (13.6%), and deep old cerebral infarcts (9.6%) (see Table 5).

Table 4 Demographics of NHS Tayside patients providing reports
Table 5 Proportion of reports with a brain imaging phenotype in NHS Tayside (110,695 reports). Small numbers suppressed due to data governance requirements


We have developed an NLP algorithm for brain imaging reports in a stroke cohort study in one NHS hospital and validated it with reports from general clinical practice in a second NHS hospital. We have demonstrated excellent diagnostic performance for more common cerebrovascular phenotypes. Although the identification of phenotypes was not perfect, it would have been practically impossible to manually code > 100,000 radiology reports. The ability to code these reports using an NLP algorithm opens the door to using radiology reports to better identify stroke subtype when combined with ICD-10 coded information for outcome ascertainment in large studies such as UK Biobank, for the creation of new in silico cohort studies, or for health care quality improvement [2].

In most research using electronic health records, phenotypes are identified from administrative coding with multiple or single codes (e.g. ICD-10). These codes and combinations have modest to good positive predictive value (> 80%) for all stroke [16]. The addition of NLP summaries of brain imaging report data to administratively coded information, or to NLP processing of medical text, could improve the positive or negative predictive value of stroke identified in EHR. It would also increase the number of stroke that are unspecified (up to 40%) or where stroke type is specified, to allow subtyping of ischaemic stroke types where this is not available (for example in our center, codes for lacunar stroke are rarely used). In addition, some asymptomatic findings that are not routinely coded consistently (e.g. changes of cerebral small vessel disease) could be identified. This could be particularly useful for deriving neurological phenotype at scale from health records in large scale cohorts such as UK Biobank (N > 500,000), [2] and the NIH-funded All of Us study (planned N = 1,000,000

The performance of our algorithm differed to a modest degree in the ESS dataset which was enriched for cerebrovascular phenotypes and NHS Tayside from general radiology practice. This is probably accounted for by the differences in language used across different radiology departments; and the different prevalence of findings in different datasets with higher prevalence leading to greater positive predictive value.

The high IAA scores indicate that the annotation tasks were well-defined. In previous work we have demonstrated the impact annotation has on NLP performance [17]. The same is true for this task, the better and more defined the annotation the easier it is to extract the same information automatically. Before doing the annotation, the experts carried out some pilot annotation on paper and we decided on a set of rules for what to annotate. In some cases, the high IAA can also be attributed to the consistent use of medical terms in this domain. While there is some variation, certain diseases and symptoms are described with widely used and well-known expressions and it is straightforward for experts to identify them in text.

The high accuracy of the rules in our system results largely from the topic-focused nature of the radiology reports and the fact that the language is restricted and conventionalized, with only a limited number of ways in which a phenomenon tends to be described. Several of the system errors arise from unexpected ways of phrasing concepts, for example, the entity subarachnoid_haemorrhage is most frequently expressed as ‘subarachnoid haemorrhage’, ‘subarachnoid blood’ or ‘SAH’, and the system failed to recognize it in a report where it was described as ‘blood in the subarachnoid spaces’. This is the kind of problem that unavoidably occurs when test or run-time data contains unseen examples, i.e. ways of expressing concepts that have not been seen in the training/development data. This is true of both supervised machine-learning and rule-based systems. At the final stage of assigning labels to documents, labels will be missed if the relevant named-entities or relations have been missed (false negatives). Errors at this stage also arise from false positives from NER, relation detection and negation detection. Clear cases of time and location relations (e.g. ‘Right frontal chronic haemorrhage’) are straightforward to detect but the system can make errors in linking a time or location to an entity because it does not take sentence-structure into account. For example, in the sentence ‘I suspect this reflects redistribution of the original haematoma rather than new blood.’, the time entity ‘new’ was wrongly linked to the haemorrhagic stroke entity ‘haematoma’. With regard to negation, clearly stated absence of a phenomenon (e.g. ‘No metastases’) are reliably detected but in cases where the annotator has marked an entity as negative in an unclear context (e.g. ‘diffusely sclerotic metastases are much less likely’), negation detection can fail and this can lead to a false positive in the labelling.

The phenotypes we have chosen are those that are relevant for epidemiological and clinical researchers. There are limits to the detail in reports in clinical practice, hence we chose to identify phenotypes that we thought would be possible to code. In different settings (for example hyperacute stroke services), there may be more detail in individual reports to identify other phenotypes such as vessel occlusion. Further work is needed to enrich our samples for less common phenotypes (for example individual types of haemorrhagic stroke or tumours); to determine the diagnostic contribution of NLP alongside structured, coded data (e.g., ICD-coded hospital admissions or Read-coded primary care consultations); to compare the performance of NLP coding of reports against research-grade reads of images; and to implement these algorithms within NHS systems.

In terms of portability and generalisability of our NLP system, we have shown that EdIE-R is robust in phenotype labelling performance when porting it from one dataset to another (ESS to Tayside). The same holds true for named entity recognition (NER) on the same data. While we spent some effort on fine-tuning the system on the new development data, this did not take a substantial amount of time [15]. We would expect a high level of performance when running the EdIE-R system over new brain imaging reports. To port the system to a new type of medical text, e.g. radiology reports for a different disease or body part, or to pathology reports, we would require a new lexicon and would need to adapt some of the rules. This is not an insignificant amount of effort and requires input from domain experts. Instead we could use machine learning methods but would then require more training data (annotated by domain experts), as well as time to fine-tune parameters or features, in order to reach or exceed the same level of performance as the rule-based method.


In summary, we have demonstrated that an NLP algorithm can be developed with neuroradiology reports from the UK NHS radiology records, allowing identification of cohorts of patients with important cerebrovascular phenotypes at a scale that would otherwise not be possible.