1 Introduction

Medical imaging is concerned with forming visual representations of the anatomy or function of the human body using a variety of imaging modalities (e.g., X-rays, CT, MRI) [1, 89]. It is estimated that approximately one billion medical imaging examinations are performed worldwide annually [56], and the number of examinations per year continues to rise in developed countries [9]. More sophisticated medical imaging systems also lead to more images per examination and per radiologist [18]. The total workload increased by 26% from 1998 to 2010 [18], and radiologists must now interpret more images in the same working time than they did for similar examinations 10–20 years ago. Radiologists need to consider the examination’s medical images, the patient’s history and previous examinations, consult the recent literature, and prepare a medical report. An increased workload in this demanding task increases the likelihood of medical errors (e.g., when radiologists are tired or pressured). Such errors are not rare [6], and they carry over to the medical report, which is what referring physicians (who ordered the examination) mostly rely on. Consequently, tools that help radiologists produce higher-quality reports (e.g., without missing important findings or reporting wrong findings when they are inexperienced) in less time (e.g., by providing them with a draft report) could have a significant impact.

Fig. 1

a Caption produced by the generic captioning system of [100]. b Two images from an X-ray examination along with: the corresponding human-authored ‘FINDINGS’ section from the IU X-Ray dataset (Sect. 2.1); silver labels (tags) automatically extracted from the human-authored findings (‘system tags’); and gold labels provided by physicians (‘human tags’). The diagnostic caption refers to ‘no change’ with respect to a previous examination, but there is no link to the information and images of the previous examination in this particular dataset. The ‘XXXX’ is due to a (presumably automatic) de-identification process

Diagnostic Captioning (DC) systems encode the medical images of a patient’s examination (or ‘study’) and generate a full or partial draft of the report. Figure 1b, for example, shows the ‘FINDINGS’ section of the report. Although we are not aware of any clearly articulated description of the exact goals of research on DC systems, the main goals seem to be to (a) increase the throughput of medical imaging departments [38], (b) reduce medical errors [26], and (c) reduce the cost of medical imaging examinations [63]. Advances in image-to-text generation [7], especially deep learning methods [14, 33] for generic image captioning (Fig. 1a) [23, 39, 68, 69, 81, 84, 101], have recently led to increased research interest in DC [45, 55, 61, 67]. However, despite its importance and recent popularity, DC still suffers from shortcomings in methods, datasets, and evaluation measures. This article attempts to review all the published work on DC, outlining the major problems, and proposing future research directions.

DC methods usually employ the encoder–decoder architecture [17], largely ignoring retrieval-based approaches (Figs. 2 and 3). In a similar manner, [72] recently surveyed deep learning-based DC methods, but only considered systems where the diagnostic text is generated from scratch, using a recurrent neural network (RNN) decoder. However, recent studies show that retrieval-based approaches, where text from similar previous exams is reused, are very competitive despite their simplicity, often surpassing much more complex methods [55]. Even the simple nearest neighbor approach, where the diagnosis of the visually most similar study is retrieved and reused, has been reported to outperform all other approaches in clinical recall [67]. This may be because retrieval helps capture factual knowledge [49] and negation, the latter being particularly important in clinical text [48], or because medical reports tend to use template-like language, with reports of the same findings being almost identical across many different patients. We provide a clustering of current DC systems by the kinds of methods they use, also reporting evaluation scores for each system with all available measures.

Fig. 2

Steps performed in a basic encoder–decoder architecture for report generation

Datasets that have been used in previous DC work [21, 29, 45, 46, 60, 102] are not all publicly available. Of the four publicly available ones, PEIR Gross (footnote 1) and ImageCLEF [20] suffer from severe shortcomings [55], which are also discussed briefly below. Hence, in this work we focus mostly on studying and discussing the characteristics of the remaining two datasets, namely IU X-Ray [21] and MIMIC-CXR [46]. Interestingly, previous research does not always use the same parts of the medical reports of these datasets (some use the ‘FINDINGS’ section only, others also include the ‘IMPRESSION’), and most previous articles do not use a common training/development/test split of the data. For IU X-Ray, which is the most commonly used dataset, we use the split of [60], who recently used it to evaluate multiple DC systems. We also release (in supplementary material) instructions on how to obtain and use this split.

Evaluation measures employed by previous DC research mainly assess lexical overlap between machine-generated and human-authored captions [55], without directly assessing clinical correctness. This can lead to cases where a clinically wrong generated report is scored higher than a clinically correct one [113]. Current methods for automatically measuring clinical correctness produce results of poor quality, because: (a) they only consider the presence (or absence) of particular medical terms in the reports [61]; for example, ‘pneumothorax’ (collapsed lung) would be considered a positive finding in ‘no pneumothorax is observed’ [113]; or (b) they rely on the responses of rule-based automatic annotators [67], for example to obtain the ‘system tags’ in Fig. 1b, whose accuracy cannot be guaranteed; or (c) they use crowd workers [60], who are not necessarily medical experts or trained in medical informatics. We discuss these issues further below.

Excluding an earlier version of this survey [55], the only other DC survey we are aware of is that of [72] (footnote 2). As already pointed out, the latter considers only DC methods that generate diagnostic text from scratch using an RNN decoder, whereas we also consider retrieval-based methods, which are often very competitive in DC. We also explore and investigate more thoroughly the datasets we consider (Sect. 2). For instance, we study how often the diagnostic reports are very similar across patients, the class imbalance between reports with no abnormal findings and reports that describe abnormalities, and to what extent relevant information is missing (e.g., reports referring to images that have been removed during anonymization, or sections that require access to unavailable previous examinations of the same patient). Furthermore, we provide a more extensive discussion of evaluation measures; for example, we also cover clinical correctness measures, apart from the word overlap measures that [72] mostly focus on, and we demonstrate the shortcomings of current evaluation measures using concrete DC examples. A final difference from the survey of [72] is that we assume the reader is familiar with commonly used machine learning algorithms, including currently widely used deep learning (DL) models such as convolutional and recurrent neural networks (CNNs, RNNs). This allows us to present and compare DC methods more succinctly. Readers who lack this background and wish to comprehend the full details of this article should consult machine learning and DL textbooks first [14, 32, 33, 74]. However, the gist of the discussion throughout this article should also be accessible to readers with only an introductory knowledge of machine learning.

Fig. 3

A simple pipeline of a retrieval method for DC based on image similarity [55]

In the rest of this article, Sects. 2 and 3 discuss DC datasets and evaluation measures, respectively. DC methods are discussed in Sect. 4, followed by our literature search methodology and a higher-level discussion of the state of the art in Sect. 5. Conclusions of this work and directions for future research are presented in Sect. 6. We hope that this article will be useful to (a) computer scientists, especially those with expertise in machine learning, computer vision, and/or natural language processing, who may wish to identify research and development opportunities and start working on DC, (b) medical researchers who already apply machine learning methods to other types of clinical data and may wish to expand their research to DC and possibly contribute to the development and assessment of real-life DC systems, (c) readers already active in DC, who may be interested in a consolidated critical view of DC work so far and recommendations for future directions, and (d) physicians (e.g., radiologists, nuclear medicine physicians) as well as managers of diagnostic units and health care systems, who may wish to understand the state of the art in DC and the potential benefits and risks that forthcoming DC systems may bring.

2 Diagnostic captioning datasets

Datasets for DC comprise medical images and associated diagnostic reports. In previous work [55], we reported that three publicly available datasets can be used for DC research, namely PEIR Gross, ICLEFcaption [20], and IU X-Ray [21]. We concluded, however, that the first two suffer from severe shortcomings. Most importantly, they contain photographs and captions from the figures of scientific articles, instead of real diagnostic medical images and reports; hence, they are inappropriate for realistic DC research. The third dataset, IU X-Ray, which contains X-ray images and medical reports, is appropriate if we ignore its small size. In our previous work [55], we did not consider MIMIC-CXR [46], a fourth and, to date, the largest publicly available DC dataset, which was released later, and we only partially explored IU X-Ray. In this work, we focus on these two quality datasets, MIMIC-CXR and IU X-Ray, referring readers interested in ImageCLEF and PEIR Gross to our previous work [55]. Datasets that do not comprise both medical images and diagnostic reports, or that are not publicly available, are not considered further in this study. Such datasets are BCIDR [116], consisting of 1000 pathological bladder cancer images, each with five reports, which is not publicly available; the frontal pelvic X-rays of [29], comprising 50,363 images, each accompanied by a radiology report simplified to follow a standard template, which is also not publicly available; and ChestX-ray14 [102], which is publicly available, but does not include any medical reports in its public version. Results on these datasets are included, however, in Table 4 for completeness.

Radiologists usually document their findings in titled sections, following standardized document structure templates. However, the sections found in the reports of IU X-Ray are not always the same as those found in the reports of MIMIC-CXR. ‘FINDINGS,’ ‘COMPARISON,’ ‘INDICATION,’ and ‘IMPRESSION’ are sections found in both datasets, among which ‘FINDINGS’ and ‘IMPRESSION’ are of primary interest [46]. The ‘FINDINGS’ section, which is usually the lengthier one (Fig. 4), describes the imaging characteristics of a body structure or function that can have a clinical impact. ‘COMPARISON’ contains previous information about the patient, often from preceding medical exams, but never the whole report or the medical images of those exams. This means that it is almost impossible, even for a radiologist, to generate this section without access to the referenced previous exams. The same applies to the ‘INDICATION’ section, which conveys the medical reason for the patient to be subjected to the examination (e.g., symptoms). Hence, the ‘COMPARISON’ and ‘INDICATION’ sections cannot be generated by DC methods. Instead, they could be treated as given and they could, at least in principle, assist the process of generating the ‘FINDINGS,’ although current DC methods attempt to generate the ‘FINDINGS’ directly from the images, without consulting the ‘COMPARISON’ and ‘INDICATION’ sections. All the aforementioned sections could in turn assist the process of generating the final section, ‘IMPRESSION,’ although again current DC methods try to generate the ‘IMPRESSION’ directly from the images. The ‘IMPRESSION’ usually summarizes the most important findings and interprets their clinical value, giving the referring physician a direction for the management of the disease or a final diagnosis. However, sometimes the ‘IMPRESSION’ (or ‘FINDINGS’) includes a conclusion that does not follow from the previous sections and the images of the current exam. For example, a conclusion may be the result of comparing the current exam with a previous one. Unfortunately, the dataset may omit the previous exam(s), as in IU X-Ray; or it may hide the dates and times of the exams of each patient, as in MIMIC-CXR, making it impossible to identify the previous exams.

Table 1 Frequent ‘FINDINGS’ sections with no abnormality reported in IU X-Ray
Table 2 Number of reports per dataset (1st row) and number of reports whose diagnostic text includes a section with findings, impression, indication or comparison (2nd to 5th row)

The majority of the reports in both IU X-Ray and MIMIC-CXR concern cases where there is no disease or abnormality. In these cases, the diagnostic text is often very similar or identical across different exams (Table 1). The section that exists most often in both datasets is ‘IMPRESSION’ (Table 2), but that section is shorter than ‘FINDINGS’ on average (Fig. 4) and often includes conclusions drawn from information not included in the datasets, as already noted. Some previous work used only the ‘IMPRESSION’ section as the target text to be generated [85], but most previous work either uses the ‘FINDINGS’ as the target [60, 67] or aims to generate the concatenation of the two sections [45, 85].

Fig. 4

Boxplots of section lengths in characters, in a IU X-Ray and b MIMIC-CXR

Another publicly available dataset is PadChest [11], which comprises 160,868 chest X-rays from 69,882 patients. However, its diagnostic texts (in Spanish) are not complete reports or complete sections of reports, but text snippets (not necessarily well-formed sentences or paragraphs) that were extracted from the reports with regular expressions. Therefore, we exclude this dataset from our study, since it does not contain texts of the kinds the DC methods we consider aim to generate (entire reports or particular sections of reports).

2.1 IU X-Ray

Demner-Fushman et al. [21] created a dataset of radiology examinations, comprising X-ray images and reports authored by radiologists. They publicly released an anonymized version of the dataset through the Open Access Biomedical Image Search Engine (OpenI) (footnote 3). The dataset consists of 3955 reports, one per patient, all in English, and 7470 images in the standard medical format DICOM, which includes metadata (footnote 4). We found that 3851 reports (97.4%) are linked to at least one image and are thus valid for our study. Among the 3851 reports, 599 (15.6%) do not include a ‘COMPARISON’ section, 86 (2.2%) do not include the ‘INDICATION’ section, 31 (0.8%) do not include ‘IMPRESSION,’ and 514 (13.3%) do not include ‘FINDINGS.’ Among the 3337 reports that comprise both a ‘FINDINGS’ section and at least one image, only 2553 (76.5%) have a unique ‘FINDINGS’ section, i.e., the text of their ‘FINDINGS’ is not identical to that of any other report. The ‘FINDINGS’ text of the remaining 23.5% of the reports is shared by two or more reports, and in these cases the reports describe mainly normal findings. The 10 most frequent ‘FINDINGS’ sections, all describing normal findings, occur in 344 reports in total, which is 10.3% of all the reports and 43.9% of the non-unique ‘FINDINGS.’ This gives an advantage to retrieval-based approaches, which have been reported to achieve surprisingly high performance [8, 55, 67].
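Statistics such as the above are straightforward to reproduce. The following is a minimal sketch, assuming the ‘FINDINGS’ texts have already been extracted into a list of strings (the parsing of the OpenI XML files is omitted and the function name is ours):

import collections
from typing import Dict, List

def findings_duplication_stats(findings: List[str]) -> Dict[str, float]:
    """Measure how often 'FINDINGS' texts repeat across reports."""
    normalized = [" ".join(f.split()).lower() for f in findings]   # collapse whitespace, ignore case
    counts = collections.Counter(normalized)
    unique = sum(1 for f in normalized if counts[f] == 1)
    top10 = sum(c for _, c in counts.most_common(10))              # reports covered by the 10 most frequent texts
    return {
        "reports": len(normalized),
        "unique_ratio": unique / len(normalized),
        "non_unique_ratio": 1 - unique / len(normalized),
        "top10_ratio": top10 / len(normalized),
    }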

Demner-Fushman et al. [21] initially collected and de-identified 4k examinations from two hospitals, 2k from each one. They used software to de-identify the texts [28], and the Clinical Trial Processor to de-identify the images (footnote 5). They reported excellent de-identification results (no sensitive information found by human annotators), but we found some cases where the software had damaged the diagnostic text. For example, “Cardiomediastinal silhouette is XXXX” is not informative, and we cannot accurately infer the finding. Demner-Fushman et al. then discarded four exams that either did not comprise both a lateral and a posteroanterior chest image, or had diagnostic text that did not include clearly separated sections for the findings and the impression (or the diagnosis). They also discarded 41 exams containing information that could reveal the patients’ identities. Two of them revealed information such as addresses and dates, which the Health Insurance Portability and Accountability Act of 1996 (HIPAA) considers identifiers that may reveal personal information. The remaining 39 were exams with images that showed teeth, a partial jaw, jewelry, or a partial skull. Although not explicitly flagged as sensitive by HIPAA, such images might identify the patients. These 39 exams had all of their images in this category and were completely removed from the dataset. However, 432 (10.9%) other exams had at least one image removed for the same reasons, but were apparently retained in the dataset (without the removed images). These 432 cases are problematic for training and evaluation purposes, because the gold (target) diagnostic text that the systems are required to generate may refer to an absent image.

The ‘IMPRESSION’ and ‘FINDINGS’ sections were used by [21] to manually associate each report with a number of tags, shown as ‘human tags’ in Fig. 1b. Two human annotators were used, both trained in medical informatics. The tags were MeSH terms, supplemented with Radiology Lexicon (RadLex) terms (footnote 6). Each annotation label (term) referred to a pathology, a foreign body or transplant, anatomy (human body parts), signs (imaging observations), or attributes (object or disease characteristics). The annotators were instructed not to assign labels for negated terms (e.g., ‘no signs of tuberculosis’) or inconclusive findings (e.g., introduced by ‘possibly’ or ‘maybe,’ but not ‘probably’ or ‘likely’) of the ‘IMPRESSION’ and ‘FINDINGS’ (footnote 7). In addition to the human tags, each report was associated with tags automatically extracted from the ‘IMPRESSION’ and ‘FINDINGS’ by the Medical Text Indexer (MTI) [73]; the resulting tags are called ‘MTI encoding’ and are shown as ‘system tags’ in Fig. 1b. As shown in the example of Fig. 1b, the system tags are single words or terms (e.g., ‘Hiatus’), while the human tags follow a different pattern, which may combine anatomical site and type (e.g., ‘Hiatal/large’). Surprisingly, although the dataset comprises both human and system tags, only the latter have been used to evaluate (or train) DC systems so far; we discuss evaluation measures that use tags in Sect. 3.2.

2.2 MIMIC-CXR

MIMIC-CXR comprises 377,110 chest X-rays associated with 227,835 medical reports, from 64,588 patients of the Beth Israel Deaconess Medical Center examined between 2011 and 2016 (footnote 8). The reports were de-identified to satisfy HIPAA requirements (Sect. 2.1). The images are chest radiographs, obtained from the hospital’s Picture Archiving and Communication System (PACS) in DICOM format. To remove all Protected Health Information (PHI), the images were processed to remove annotations imprinted in them (e.g., image orientation, anatomical position of the subject, timestamp of image capture). A custom algorithm was used for this purpose, based on image preprocessing and optical character recognition to detect text; all pixels within a bounding box containing such information were set to black. After de-identification, two independent reviewers examined 6900 radiographs for PHI and found none.
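As a rough illustration of this kind of image de-identification (not the custom algorithm used for MIMIC-CXR), burnt-in text can be detected with an off-the-shelf OCR engine and its bounding boxes blacked out. The sketch below assumes the pytesseract and Pillow libraries and images already exported to a standard raster format:

import pytesseract
from PIL import Image, ImageDraw

def black_out_burned_in_text(path_in: str, path_out: str, min_conf: float = 60.0) -> None:
    """Detect text imprinted on a radiograph with OCR and black out its bounding boxes."""
    image = Image.open(path_in).convert("RGB")
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    draw = ImageDraw.Draw(image)
    boxes = zip(data["text"], data["conf"], data["left"], data["top"], data["width"], data["height"])
    for text, conf, x, y, w, h in boxes:
        if text.strip() and float(conf) >= min_conf:            # keep only confident text detections
            draw.rectangle([x, y, x + w, y + h], fill="black")  # set the pixels in the box to black
    image.save(path_out)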

The reports of MIMIC-CXR are written in English and their text is separated into sections, following document structure templates. Unlike IU X-Ray, where the section boundaries are made explicit by the XML markup, the section boundaries of MIMIC-CXR reports are not explicitly marked up. However, the section headings of MIMIC-CXR are written in upper case and followed by a colon (e.g., “FINDINGS:”). Apart from the sections described in the discussion of IU X-Ray (Sect. 2.1), some reports of MIMIC-CXR include other sections, such as ‘HISTORY,’ ‘EXAMINATION,’ or ‘TECHNIQUE,’ but not in a consistent manner, because the structure of the reports and the section names were not enforced by the hospital’s user interface [46].
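Consequently, MIMIC-CXR sections are typically recovered with simple heuristics. The following is a minimal sketch (the regular expression is illustrative and will not handle every report):

import re
from typing import Dict

SECTION_RE = re.compile(r"^([A-Z][A-Z ]+):", re.MULTILINE)  # e.g., "FINDINGS:", "IMPRESSION:"

def split_report_sections(report: str) -> Dict[str, str]:
    """Split a free-text report into sections keyed by their upper-case headings."""
    sections: Dict[str, str] = {}
    matches = list(SECTION_RE.finditer(report))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections[m.group(1).strip()] = report[start:end].strip()
    return sections

# Example:
# split_report_sections("INDICATION: Cough.\nFINDINGS: The lungs are clear.\nIMPRESSION: No acute process.")
# -> {'INDICATION': 'Cough.', 'FINDINGS': 'The lungs are clear.', 'IMPRESSION': 'No acute process.'}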

3 Evaluation measures for diagnostic captioning

Text generated by DC systems has so far been assessed mostly via automatic evaluation measures originating from machine translation and text summarization [3, 64, 76, 99], which, roughly speaking, count how many words or n-grams (phrases of n consecutive words) are shared between the generated text and reference gold texts (typically human-authored). Such measures have been reported to correlate well with human judgments of information content (e.g., the degree to which the most important information is preserved in summaries) when the goal is to rank systems and when there are multiple gold references per generated text. However, as can be seen in Table 3, measures of this kind do not necessarily capture clinical correctness. \(H^{a}_{ci}\), for example, which incorrectly reports that there is no indication of a collapsed lung (‘No pneumothorax’), receives higher scores than \(H^{a}_{cc}\), which correctly reports pneumothorax. When the gold and the system-generated captions are automatically labeled for mentions of diseases or abnormalities [41, 67, 108], standard classification evaluation measures, such as Accuracy, Precision, and Recall, can be applied. Automated labeling, however, can be inaccurate, leading to misleading results. In the second (b) and third (c) cases of Table 3, for example, the automatically assigned tags are the same for the correct (\(H_{cc}\)) and incorrect (\(H_{ci}\)) diagnoses.

Table 3 Using the BLEU (B1, B2, B3, B4), METEOR (Met), and ROUGE (Rou) automatic evaluation measures to score clinically correct (\(H_{cc}\)) and clinically incorrect (\(H_{ci}\)) hypothetical diagnoses that paraphrase three reference (gold, human) diagnoses (\(R^a, R^b, R^c\))

In an interesting evaluation approach, [61] employed crowd-sourcing for 100 randomly selected studies (examinations). For each study, the annotators were shown three reports, produced by a physician, a baseline DC system, and a more elaborate DC system, respectively. Each annotator had to consult the report of the physician and choose the best system-generated report, based on the criteria of clinical correctness of the reported abnormalities, fluency, and content coverage compared to the ground-truth report. Although this approach is interesting in general, because it employs manual evaluation, the particular experiment of [61] raises doubts for three reasons. First, the medical background of the annotators was not reported and may have been inadequate. Second, cases were excluded from the evaluation when neither system-generated report was better than the other, but such cases could be very frequent (e.g., if no system was good enough). Third, only 100 test studies were used and no statistical significance test was reported. Consequently, we do not discuss the details and scores of this evaluation further.

3.1 Word overlap measures

The most common word overlap measures are BLEU [76], ROUGE [64], and METEOR [5], which originate from machine translation and summarization, as already noted. The more recent CIDEr measure [99], which was designed for generic (not medical) image captioning [50], has been used in only two DC studies so far [45, 116]. SPICE [3], also designed for generic captioning [50], has not been used at all in DC so far. We note again that all of these are word overlap measures, which do not always capture clinical correctness [60], as already discussed. This was also demonstrated by [113], who used ROUGE to compare two medical statements, a clinically correct and a clinically incorrect one. Since the latter had more words in common with the gold statement, it obtained a higher score. We include the example of Zhang et al. in Table 3, where we also use BLEU, METEOR, and ROUGE, along with Precision and Recall computed on CheXpert labels.

BLEU [76] is the most common and the oldest among the word overlap measures that have been used in DC. It measures the word n-gram overlap between a generated and a ground-truth caption. As it is a precision-based measure, a brevity penalty is added to penalize short generated captions. BLEU-1 considers unigrams (i.e., single words), while BLEU-2, -3, -4 consider bigrams, trigrams, and 4-grams, respectively. The average of the four variants was used as the official measure in ICLEFcaption [20, 24]. METEOR [5] extended BLEU-1 by employing the harmonic mean of Precision and Recall, i.e., the \(F_{\beta }\) score, biased toward Recall (\(\beta >1\)). METEOR also employs Porter’s stemmer and WordNet [27], the latter to take synonyms into account (footnote 9). The \(F_{\beta }\) score is then penalized by up to 50% when no common n-grams exist between the machine-generated description and the reference human description. ROUGE-L [65] is the ratio of the length of the longest common subsequence (LCS) of the machine-generated description and the reference human description, to the length of the reference description (ROUGE-L Recall), or to the length of the generated description (ROUGE-L Precision), or a combination of the two (ROUGE-L F-measure). We note that many ROUGE variants exist [34], based on different n-gram lengths, stemming, stopword removal, etc., but ROUGE-L is the most commonly used variant in DC so far. CIDEr [99] measures the cosine similarity between n-gram TF-IDF [70] representations of the two captions; words are also stemmed. Cosine similarities are calculated for unigrams up to 4-grams, and their average is returned as the final evaluation score. The intuition behind using TF-IDF is to reward terms that are frequent in the particular caption being evaluated, while penalizing terms that are common across captions (e.g., stopwords). However, DC datasets have constrained vocabularies, and a common disease name may thus be mistakenly penalized. More importantly, CIDEr scores exceeding 100% have been reported [67], which contradicts the measure’s theoretical design. Using the official evaluation server implementation (CIDEr-D) [15], we found cases where scores exceeding 100% were indeed produced, which means that further investigation is required to check the correctness of that particular implementation of the measure and allow a fair comparison among systems. SPICE [3] extracts sets of tuples from the two captions (human and machine-generated), containing objects, attributes, and/or relations, e.g., {(patient), (has, pain), (male, patient)}. Precision and Recall are computed between the two sets of tuples, also taking WordNet synonyms into account, and the \(F_1\) score is returned. The authors of SPICE report improved results over both METEOR and CIDEr, but it has been noted that results depend on the quality of syntactic parsing [50]. When experimenting with an implementation of this measure (footnote 10), we noticed that long texts were sometimes not parsed and thus not evaluated properly.
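To make these definitions concrete, the following is a minimal sketch of single-reference BLEU-1 (with the brevity penalty) and of the ROUGE-L F-measure; in practice, DC papers rely on existing toolkit implementations of these measures, and the β value below is only one common choice:

from collections import Counter
import math

def bleu1(candidate: list, reference: list) -> float:
    """Clipped unigram precision multiplied by the brevity penalty (single-reference BLEU-1)."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    precision = overlap / max(len(candidate), 1)
    brevity = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / max(len(candidate), 1))
    return brevity * precision

def rouge_l_f(candidate: list, reference: list, beta: float = 1.2) -> float:
    """ROUGE-L F-measure based on the longest common subsequence (LCS) of the two word sequences."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]          # LCS dynamic-programming table
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if candidate[i] == reference[j] else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n                # beta > 1 weighs recall more than precision
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

# Example: bleu1("no acute pneumothorax".split(), "no pneumothorax is observed".split())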

3.2 Clinical correctness measures

The word overlap measures discussed above do not always capture clinical correctness, as already demonstrated. To overcome this problem, recent work has proposed new evaluation approaches based on classification evaluation measures, an approach we have already mentioned and now discuss further. The clinical correctness of a generated caption is measured through a set of medical terms extracted from that caption (see Table 3). These terms are then compared to the ones extracted from the gold caption, which may have been produced by humans (gold labels, as in IU X-Ray) or by a system (silver labels), like the Medical Text Indexer [73] or the CheXpert labeler [43]. In Table 3, for example, CheXpert was used to annotate the three reference diagnoses (\(R^a, R^b, R^c\)) along with their alternative correct and incorrect hypothetical diagnoses (\(H_{cc}\), \(H_{ci}\)). In the topmost example, Pleural Effusion (excess fluid around the lung) and Pneumothorax (collapsed lung) were correctly extracted by CheXpert (footnote 11) for the reference diagnosis and the correct hypothetical diagnosis \(H^{a}_{cc}\). For the incorrect \(H^{a}_{ci}\), only Pleural Effusion was extracted, leading to a perfect Precision (number of correctly assigned tags over the total number of assigned tags) and a 50% Recall (number of correctly assigned tags over the number of gold tags), despite the clinical error. In the next example, CheXpert does not detect 2 out of 3 tags for \(H^{b}_{cc}\), leading to a low 33.3% Recall. In the lowermost example, where the reference was labeled with lung opacity (pulmonary area with pathologically increased density), no tags were detected by CheXpert for the correct hypothetical diagnosis \(H^{c}_{cc}\). This leads to zero Recall and undefined Precision (though Precision is often taken to be zero in such cases as well). Interestingly, however, the incorrect \(H^{c}_{ci}\), which has the same (equally bad) Precision and Recall as \(H^{c}_{cc}\), obtained high scores in many word overlap measures in Table 3, showing a weakness of such measures with respect to clinical accuracy assessment.

Xue et al. [108] were the first to use an evaluation measure that considers medical tags extracted from system-generated and human-authored reports. The authors called the measure Keyword Accuracy, but it should not be confused with conventional classification Accuracy, since it only measures Recall. The authors, who used the IU X-Ray dataset for their study, compiled a list of tags per examination and used it as the ground truth; the list consisted of the system-generated (MTI) tags and some of the human tags available in IU X-Ray. However, Xue et al. did not provide any further details (e.g., about the human tag selection criteria or how system and human tags were merged). Huang et al. [41] followed the same approach, but used only the MTI tags as their ground truth. In both of these studies, however, where gold tags were compared with predicted tags, it is unclear how the predicted tags were extracted from the system-generated reports. Liu et al. [67] used the CheXpert medical abnormality mention detection system [43], which assigns one of four labels (positive, negative, uncertain, no mention) to each of 14 thoracic diseases. For any given report, any disease for which the assigned label was ‘positive’ was considered to be mentioned in that report. When the assigned label for a disease was ‘uncertain,’ the authors considered the disease to be mentioned in the report with 0.5 probability. This process was applied to both system-generated and gold reports, and then micro- and macro-averaged Precision and Recall were computed (along with macro-averaged Accuracy). A disadvantage of this approach is that it uses only 14 diseases, which is a very small number compared to the hundreds of abnormality tags of other studies and the much wider variety of medical conditions physicians need to consider in practice. However, the work of [67] can be used to highlight the limitations of Accuracy compared to Precision and Recall, when used to assess DC systems. A majority classifier (a system that always reported no findings, the majority prediction) obtained a higher Accuracy than more elaborate methods in the experiments of Liu et al. More generally, it is well known in machine learning that a large class imbalance may lead to misleadingly high Accuracy, which is why Precision, Recall, and F1 are used instead in such cases. In Table 4, we computed and report the harmonic mean (F1) of Precision and Recall using the results of [67]. We note that receiver operating characteristic (ROC) curves, which plot the true positive rate against the false positive rate for different classification probability cutoffs, can also be used to get a better view of the performance of systems and baselines [92]. Precision–Recall (PR) curves can be used in a similar manner. In both cases, the area under each curve (AUC) can serve as a single evaluation score that aggregates results over different cutoffs. However, previous DC work does not provide enough information to reconstruct ROC or PR curves, and does not report AUC scores.
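Concretely, tag-based clinical Precision, Recall, and F1 for a single report reduce to set comparisons. The following minimal sketch (not tied to any particular labeler) treats undefined Precision as zero, as discussed above:

from typing import Iterable, Set

def clinical_prf(gold_tags: Iterable[str], predicted_tags: Iterable[str]):
    """Precision, Recall, and F1 between tag sets extracted from a gold and a generated report."""
    gold: Set[str] = {t.lower() for t in gold_tags}
    pred: Set[str] = {t.lower() for t in predicted_tags}
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0   # undefined Precision treated as 0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Topmost example of Table 3: gold = {"pleural effusion", "pneumothorax"},
# predicted (incorrect hypothesis) = {"pleural effusion"} -> Precision 1.0, Recall 0.5, F1 0.67.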

4 Diagnostic captioning methods

We now discuss the main types of DC methods, including their relation to generic image captioning. We also briefly cover early approaches that did not process medical images directly, but were fed with findings manually extracted from medical images, or that did not generate text, but were intended to help in the manual preparation of diagnostic reports.

Early approaches Varges et al. [97] followed an ontology-based natural language generation approach to assist medical professionals in turning cardiological findings (from diagnostic images) into readable and informative textual descriptions. The input to the text generator, however, was not directly a medical image, but triplets of descriptive words or phrases like <right atrium, size, normal>. Schlegl et al. [82] used medical images and their diagnostic reports as input to a convolutional neural network (CNN), in order to classify voxels (3D pixels) as intraretinal cystoid fluid, subretinal fluid, or normal retinal tissue (intuitively, to decide which kind of fluid or tissue each voxel depicts), with the help of concepts automatically extracted from the text of the corresponding report; in this case, the report was part of the system’s input, not the system’s output. Kisilev et al. [53, 54] performed semi-automatic lesion detection and contour extraction from medical images. Structured Support Vector Machines [94] were then used to generate semantic tags, originating from a radiology lexicon, for each lesion. In later work, [52] used a CNN to detect Regions of Interest (ROIs) in the images, followed by fully connected layers to assign predefined features describing abnormalities to each ROI. The assigned features were finally filled into sentence templates to generate captions. We discuss template-based text generation below.

Generic image captioning versus diagnostic captioning Deep learning approaches are currently dominant in both generic and diagnostic image captioning (footnote 12). The authors of [39] compiled a taxonomy of aspects of generic image captioning methods that are based on deep learning. Figure 5 depicts that taxonomy. We highlight (in red italics) aspects that have not also been used in diagnostic captioning work. The fact that most of the aspects have also been used in DC indicates that generic image captioning methods are also applicable (or have at least been considered) in DC. The best generic image captioning methods, however, are not necessarily the best ones for DC, mostly because of two factors. First, DC methods do not aim to simply describe what is present in an image, unlike generic image captioning methods. DC aims to report clinically important information that is relevant for diagnostic purposes. Simply reporting, for example, which organs are shown in a medical image is undesirable, if there is nothing clinically important to be reported about them. Second, as we have already discussed in previous sections, diagnostic reports are often very similar across examinations of different patients. This allows retrieval-based approaches to perform surprisingly well, often challenging encoder–decoder approaches that are currently the state of the art in generic image captioning. We present both approaches and other alternatives below.

Fig. 5

Aspects of deep learning generic image captioning methods, using the terminology of [39]. Aspects that have not been used in DC are shown in red and italics (color figure online)

Regarding the type of input (‘feature mapping’ box of Fig. 5) that is used to generate diagnostic captions, both images [29, 41, 60, 67, 102, 108] and images combined with text [45, 86, 109, 112] have been explored. Both supervised and reinforcement learning have been employed in DC [45, 60, 67]; the latter falls in the ‘other’ category of learning type in the taxonomy of Fig. 5. A caption in current DC work typically refers to an entire medical image or even to a set of medical images, for example multiple X-rays from a single examination [60, 112], not to a particular region of the image(s), unlike some generic image captioning work [39]. The most common system architecture in DC is the encoder–decoder approach. However, other approaches (dubbed ‘compositional’ by [39]) have also been tried in DC, for example the knowledge graph approach of [61]. Usually, LSTMs (viewed as language models in the taxonomy of Fig. 5) are employed to generate text in DC, but Transformer-based models [22, 79] can also be employed, a direction with potential that has only recently begun to be explored in DC [16]. Regarding the ‘other’ box of Fig. 5, where miscellaneous other aspects are listed, concepts [45] and attention-based methods are common in DC [29, 41, 45, 60, 67, 102, 108, 109, 112]. Gaze mechanisms, which have not been explored in generic image captioning [39], could also be useful in DC; for example, attention mechanisms might aim to mimic how a physician focuses in turn on different parts of a medical image. By contrast, the emotional aspects that have been studied for generic image captioning [75] are not related to DC. ‘Novel objects’ in generic image captioning are objects present only in the test dataset [2]. Novel object captioning has not been investigated in DC yet. However, it would certainly be interesting to assess system-generated text for patients with rare (or new) conditions. ‘Stylised captioning,’ which aims to generate descriptions written, for example, in the style of a particular specified author, is of little importance in DC, where informativeness and clinical accuracy matter most and the writing style of the reports is more standardized (at least within a particular diagnostic center or health unit) and can be learned directly from the training examples.

Diagnostic captioning architectures Most often, the encoder–decoder architecture [17] is employed in DC [4, 68], with or without visual attention and reinforcement learning. Systems that adopt this architecture first encode the medical images as dense vectors, typically using CNN-based image encoders. They then generate the diagnostic text from the image encoding, typically using recurrent neural network (RNN) decoders. However, retrieval-based methods have also been proposed for DC [60], and even their simplest forms (e.g., reusing the report of the visually nearest training instance) have been found to outperform all other systems in clinical (tag-based) Recall [67]. As shown in Table 4, more elaborate retrieval-based systems can outperform state-of-the-art encoder–decoders in DC. More specifically, the retrieval-based system of [61] achieved overall better results than their earlier hybrid encoder–decoder and retrieval-based approach [60]. The reader is warned that not all of the results of Table 4 are directly comparable, since some of them are obtained from different datasets, or different training/development/test splits. However, the very high manual evaluation score of the retrieval-based method of [61] is an indication that the encoder–decoder approach may be worse than retrieval-based approaches to DC, and that the latter should be explored more in DC. We also note that retrieval has been recently found to improve language models [48, 49]. The benefits may be greater when modeling the language of diagnostic reports, where large parts of text are often very similar or exactly the same across different patients, as already discussed.

Having provided a brief overview of the most common DC approaches, we now discuss them in more detail, starting from the encoder–decoder architecture.

Table 4 Evaluation scores of DC methods, using BLEU-1/-2/-3/-4 (B1, B2, B3, B4), METEOR (M), ROUGE-L (R), CIDEr (C), manual evaluation (ME), clinical F1 (CF1)

Encoder–Decoder (ED) The encoder–decoder deep learning architecture was originally introduced for machine translation [17], but was then also adopted in generic image captioning [4, 39, 68]. In machine translation and other text-to-text generation tasks (e.g., summarization), an encoder network, often an RNN such as an LSTM [37], reads the input text and converts it to a single vector or a sequence of vectors. A decoder network, often another RNN, then produces the target text, in the simplest case word by word, using as its input the encoding of the input text. An attention mechanism [107] allows the decoder to focus on particular vectors of the input text encoding, if the latter is an entire sequence of vectors. In generic image captioning, the encoder is typically a CNN [58], which converts the image into a single vector or multiple vectors (e.g., corresponding to patches of the image). The decoder again produces the target text (caption), using the image encoding as its input. An attention mechanism may again allow the decoder to focus on particular vectors of the image encoding when generating each word; we call mechanisms of this kind ‘visual attention’ and discuss them separately below. An example of an encoder–decoder model without visual attention for generic image captioning is the model of [23]. This model comprises a CNN to encode the image and an LSTM to decode to text. The CNN was CaffeNet [44] or the better performing VGG [87]. The decoder was an LSTM, which used the representation of the previously generated word along with the image encoding to generate the next word at each timestep. The authors also experimented with a stacked, two-layer LSTM decoder. Another example is the well-known Show & Tell (S&T) system of [100], which was also introduced for generic image captioning; see Fig. 6a. It employs the Inception-v3 CNN [93] to encode the image and uses the image encoding to initialize the LSTM decoder.
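The following is a minimal PyTorch sketch of an S&T-style model, in which the image encoding initializes the hidden state of the LSTM decoder; the choice of ResNet-50 as the encoder, the dimensions, and the teacher-forcing setup are placeholders, not the exact configuration of any cited system:

import torch
import torch.nn as nn
import torchvision.models as models

class ShowTellStyleCaptioner(nn.Module):
    """CNN encoder whose image embedding initializes an LSTM decoder (S&T-style sketch)."""

    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        cnn = models.resnet50()                               # randomly initialized here; pre-trained in practice
        cnn.fc = nn.Linear(cnn.fc.in_features, hidden_dim)    # project to the decoder's hidden size
        self.encoder = cnn
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        img = self.encoder(images)            # (batch, hidden_dim)
        h0 = img.unsqueeze(0)                 # image encoding as the initial hidden state
        c0 = torch.zeros_like(h0)
        states, _ = self.decoder(self.embed(captions), (h0, c0))
        return self.out(states)               # word logits at every timestep

During training, the word logits would be compared to the gold report with a cross-entropy loss (teacher forcing); at test time, words are generated one at a time, feeding each predicted word back into the decoder.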

Fig. 6

Left: the Show & Tell (S&T) model by [100]. Right: the Show, Attend & Tell (SA&T) model by [107]. S&T uses the image encoding of the CNN to initialize the LSTM decoder. SA&T also comprises a visual attention mechanism

ED + Visual attention (VA) We place in this category encoder–decoders that also employ visual attention mechanisms (VA), as in Fig. 6b [107]. Such mechanisms can also be used to highlight on the image the findings described in the report, making the diagnosis more easily interpretable [45, 102, 109, 112, 116]. Zhang et al. [116] were the first to employ visual attention in DC, with the MDNet model (footnote 13). They used the BCIDR dataset (not publicly available, Sect. 2), which contains pathological bladder cancer images and diagnostic reports, aiming to generate paragraphs conveying findings. MDNet used a form of ResNet [36] to encode images. The image encoding acts as the initial hidden state of an LSTM decoder, which also uses visual attention. The decoder was cloned to generate multiple sentences. However, in most evaluation measures the model performed only slightly better than the generic image captioning model of [47] applied to DC.

The system of [69], which was designed for generic image captioning and uses visual attention too, was also applied to DC [60]. Its CNN encoder is a ResNet [36], and its decoder is an LSTM. At each timestep, the spatial image encodings (one per image region) and the LSTM hidden state are used as input to a multilayer perceptron (MLP) with a single hidden layer and a softmax output activation function, acting as a visual attention mechanism [107]. This mechanism generates one weight per image region, and the weights are used to form an overall weighted image representation. This image representation is then used along with the hidden state of the LSTM decoder to predict the next word. The authors also extended the decoder with a binary gate, which allows deactivating the visual attention when visual information is redundant (e.g., generating stopwords may require no visual attention).
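A minimal sketch of this kind of additive (MLP-based) visual attention is shown below; the dimensions are placeholders and the binary gate of [69] is omitted:

import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    """Additive (MLP-based) attention over spatial image features, as used in SA&T-style decoders."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions: torch.Tensor, state: torch.Tensor):
        # regions: (batch, num_regions, feat_dim); state: (batch, hidden_dim)
        scores = self.score(torch.tanh(self.feat_proj(regions) + self.state_proj(state).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)        # one weight per image region
        context = (weights * regions).sum(dim=1)      # weighted overall image representation
        return context, weights.squeeze(-1)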

More recently, encoder–decoder models employing Transformers [98], both in the encoder and the decoder, have been used in DC [16] (footnote 14). The Transformer encoder operated on top of features extracted from a pre-trained CNN image encoder. The Transformer decoder was extended with a mechanism to help it ‘remember’ text patterns (e.g., “the lungs are clear”) that appear in the reports of similar images. The mechanism is called ‘relational memory’ and is based on an input and a forget gate, similar to those of an LSTM cell. This method was the first to use Transformers for DC and achieved promising results.
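For illustration, a bare-bones Transformer decoder over CNN patch features might look as follows; the relational memory of [16] is omitted and all names and dimensions are placeholders:

import torch
import torch.nn as nn

class TransformerCaptionDecoder(nn.Module):
    """Transformer decoder over CNN patch features (relational memory omitted)."""

    def __init__(self, vocab_size: int, d_model: int = 512, nhead: int = 8, num_layers: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, d_model) from a pre-trained CNN, already projected to d_model
        # tokens: (batch, seq_len) previously generated (or gold, during teacher forcing) word ids
        seq_len = tokens.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=tokens.device), diagonal=1)
        states = self.decoder(self.embed(tokens), patch_feats, tgt_mask=causal)  # mask future positions
        return self.out(states)        # next-word logits at every position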

Fig. 7

Jing et al. [45] proposed a DC model that first encodes the image, then extracts ‘visual’ features from the image encoding and ‘semantic’ features from terms predicted from the image encoding. Attention mechanisms are used to produce an overall image representation from the visual and semantic features, which is different at each timestep of the sentence LSTM decoder. At each timestep, the sentence decoder selects the topic of the corresponding sentence. A word LSTM decoder then generates the words of the sentence

Jing et al. [45] created an encoder–decoder model with visual attention especially for DC, illustrated in Fig. 7a. They used VGG-19 [87] to encode each image and extract equally sized patches. Each patch encoding is treated as a ‘visual’ feature vector. An MLP, called MLC in their article and in Fig. 7a, is then fed with the visual feature vectors and predicts terms from a predetermined term vocabulary. The word embeddings (dense vector representations) of the predicted terms of each image are treated as ‘semantic’ feature vectors representing the image. The decoder, which produces the text, is a hierarchical RNN, consisting of a sentence-level LSTM and a word-level LSTM. The sentence-level LSTM produces a sequence of sentence embeddings (vectors), each intuitively specifying the information to be expressed by a sentence of the image description (acting as a sentence topic). For each sentence embedding, the word-level LSTM then produces the words of the corresponding sentence, word by word. More precisely, at each one of its timesteps, the sentence-level LSTM examines both the visual and the semantic feature vectors of the image. An attention mechanism (an MLP fed with the current state of the sentence-level LSTM and each one of the visual feature vectors of the image) assigns attention scores to the visual feature vectors, and the weighted sum of the visual feature vectors (weighted by their attention scores) becomes a visual ‘context’ vector, intuitively specifying which visual features to express by the next sentence. Another attention mechanism (another MLP) assigns attention scores to the semantic feature vectors (representing the image terms), and the weighted sum of the semantic feature vectors (weighted by attention) becomes the semantic context vector, specifying which terms of the image to express by the next sentence. At each timestep, the sentence-level LSTM considers the visual and semantic context vectors, produces a sentence embedding (topic), and updates its state, until a stop control instructs it to stop. Given a sentence embedding, the word-level LSTM produces the words of the corresponding sentence, until a special ‘stop’ token is generated. Jing et al. showed that their model outperforms generic image captioning models with visual attention [23, 100, 107, 111] in DC. Wang et al. [102] adopted an approach to DC similar to that of [45], using a ResNet-based image encoder and an LSTM decoder, but their LSTM is flat, as opposed to the hierarchical LSTM of [45]. Wang et al. [102] also extract additional image features from the states of the LSTM.
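A much-simplified sketch of such a hierarchical decoder is shown below; the attention mechanisms that produce the per-sentence context vectors are assumed to be external, and all names and dimensions are illustrative rather than the exact configuration of [45]:

import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    """A sentence-level LSTM emits one topic per sentence; a word-level LSTM expands each topic into words."""

    def __init__(self, vocab_size: int, ctx_dim: int, hidden_dim: int = 512, embed_dim: int = 256):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.sent_lstm = nn.LSTMCell(ctx_dim, hidden_dim)   # one step per sentence
        self.topic = nn.Linear(hidden_dim, hidden_dim)      # sentence embedding ("topic")
        self.stop = nn.Linear(hidden_dim, 1)                # stop-control logit
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.word_lstm = nn.LSTM(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context: torch.Tensor, sentences: torch.Tensor):
        # context:   (batch, num_sents, ctx_dim) visual/semantic context vectors, one per sentence
        # sentences: (batch, num_sents, sent_len) gold word ids, used here for teacher forcing
        batch, num_sents, sent_len = sentences.shape
        h = context.new_zeros(batch, self.hidden_dim)
        c = torch.zeros_like(h)
        word_logits, stop_logits = [], []
        for s in range(num_sents):
            h, c = self.sent_lstm(context[:, s], (h, c))    # update the sentence-level state
            topic = torch.tanh(self.topic(h))               # topic of the s-th sentence
            stop_logits.append(self.stop(h))
            words = self.embed(sentences[:, s])             # (batch, sent_len, embed_dim)
            inp = torch.cat([words, topic.unsqueeze(1).expand(-1, sent_len, -1)], dim=-1)
            states, _ = self.word_lstm(inp)
            word_logits.append(self.out(states))            # word predictions for the s-th sentence
        return torch.stack(word_logits, dim=1), torch.stack(stop_logits, dim=1)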

Yin et al. [109] created an encoder–decoder DC model similar to that of [45]. Again, a hierarchical LSTM attends over image features and over representations of abnormality labels predicted from the image encoding. The image encoder is DenseNet [40], but Yin et al. remove its last global pooling layer (arguing it could lose important spatial information) and the last fully connected layer (which serves as a classifier). Instead, they add a convolutional layer that operates on the image region representations and outputs a probability distribution over the image for each particular label, intuitively a heatmap per label. A global max pooling over each label’s heatmap then produces a single probability per label for the entire image. The DenseNet encoder was pre-trained on ImageNet and then fine-tuned on IU X-Ray (Sect. 2). The hierarchical LSTM and the attention mechanisms are very similar to those of [45] discussed above. Yin et al. also add a ‘topic matching’ loss that, roughly speaking, penalizes topic representations (sentence embeddings) produced by the sentence-level LSTM decoder when they deviate from the representations of the corresponding ground-truth sentences.
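A minimal sketch of such a per-label heatmap head follows; the 1×1 convolution and sigmoid are illustrative assumptions, not necessarily the exact layer configuration of [109]:

import torch
import torch.nn as nn

class LabelHeatmapHead(nn.Module):
    """1x1 convolution producing one heatmap per abnormality label, followed by global max pooling."""

    def __init__(self, feat_channels: int, num_labels: int):
        super().__init__()
        self.label_conv = nn.Conv2d(feat_channels, num_labels, kernel_size=1)

    def forward(self, feature_map: torch.Tensor):
        # feature_map: (batch, feat_channels, H, W) spatial features from the CNN encoder
        heatmaps = self.label_conv(feature_map)               # (batch, num_labels, H, W)
        probs = torch.sigmoid(heatmaps.amax(dim=(-2, -1)))    # one probability per label per image
        return probs, heatmaps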

Another encoder–decoder model with visual attention for DC was proposed by [112]. It uses the ResNet-152 image encoder [36], which Yuan et al. pre-trained on the medical image dataset CheXpert [43] to perform multi-label classification with 14 labels (12 disease labels, “Support Devices,” and “No Finding”) (footnote 15). The image encoder was fine-tuned by training it for the same task on IU X-Ray (Sect. 2) and then used with a new dense layer added on top for classification with 69 gold labels (medical concepts) extracted from the ground-truth reports by SemRep (footnote 16). Yuan et al. show experimentally that an image encoder pre-trained on a large dataset of medical images (224,316 chest X-rays) performs better than encoders pre-trained on ImageNet. The decoder is again a hierarchical LSTM, in which the sentence LSTM attends over the image and the word LSTM attends over the medical concepts produced by the encoder. Yuan et al. report state-of-the-art results, outperforming [45, 61] among others. Yuan et al. also allow their model to be fed with multiple images from which to generate a single report. This is important, because many imaging examinations comprise multiple images, for example a frontal and a lateral projection. Systems trained to generate a report from a single medical image at a time cannot handle such cases well. Similar provisions are made in the system of [60], which also uses Reinforcement Learning, discussed below.

ED + Reinforcement learning (RL) These methods use the encoder–decoder architecture, but also employ Reinforcement Learning (RL) [91]. For example, [81] employed the REINFORCE algorithm [105] with a reward based on CIDEr (Sect. 3.1), but in the context of generic image captioning. An advantage of RL is that non-differentiable evaluation measures can be used directly during training, so that systems do not optimize loss functions like cross-entropy during training while being assessed with measures such as BLEU, ROUGE, or clinical F1 (Sect. 3) at test time. For readers not familiar with these issues, we note that when training with backpropagation the loss function must be differentiable, which is not the case for most current DC evaluation measures. By contrast, with RL the reward does not need to be differentiable. It can also be given at the end of a sequence of system decisions, in cases where a loss is not available for each individual decision. In the DC system of [60], for example, RL is used to decide whether a sentence will be generated from scratch, or whether it will be retrieved from a database of frequently occurring sentences. The image encoding, produced by a DenseNet-121 [42] or a VGG-19 [87] CNN, is fed to a hierarchical RNN decoder similar to that of [45], illustrated in Fig. 7b. However, for each sentence embedding (topic) produced by the sentence-level RNN, an agent trained using RL (again using REINFORCE and CIDEr) decides whether the sentence will be generated by the word-level RNN or retrieved from a database of frequent sentences. Li et al. [60] applied their system to IU X-Ray (Sect. 2), but their experimental results were close to those of a baseline. In more recent work, [67] used DenseNet-121 [42] for image encoding and a hierarchical LSTM decoder. Similarly to [60, 81], REINFORCE with a CIDEr-based reward was employed. However, this time RL was used to optimize readability. Liu et al. also included a reward based on comparing labels, like the ones of Fig. 1b, extracted by CheXpert [43] from the system-generated text and from the human-authored report, in order to optimize clinical accuracy.
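The core of such training is a policy-gradient loss in which the (non-differentiable) reward simply scales the log-probabilities of the sampled words. A minimal sketch follows; the reward function (CIDEr, clinical F1, etc.) is plugged in externally, and the baseline term is one common variance-reduction choice rather than the exact setup of the cited systems:

import torch

def reinforce_loss(log_probs: torch.Tensor, rewards: torch.Tensor, baseline: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style loss for sequence generation.

    log_probs: (batch, seq_len) log-probabilities of the sampled words.
    rewards:   (batch,) sequence-level reward of each sampled report (e.g., CIDEr or clinical F1);
               it is computed by any evaluation function and does not need to be differentiable.
    baseline:  (batch,) reward of a baseline report (e.g., greedy decoding) to reduce variance.
    """
    advantage = (rewards - baseline).detach()     # the reward is treated as a constant
    return -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()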

ED + Language templates (LT) Template-based generation has a long history in natural language generation [31, 80], where templates of many different forms have been used, ranging from surface-form sentence templates, to sentence templates at the level of syntax trees, to document structure templates [95]. In the context of DC, language templates (LT) have recently been combined with encoder–decoder approaches, attempting to produce more satisfactory diagnostic reports. Gale et al. [29] focused on classifying hip fractures in frontal pelvic X-rays, and argued that generating reports for such narrow medical tasks can be simplified to using only two sentence templates: one for positive cases, including five placeholders (slots) to be filled in by descriptive terms, and a fixed negative template with no slots. They used DenseNet [40] to encode the image and (presumably) classify it as a positive or negative case, and a two-layer LSTM with attention over the image encoding to fill in the slots of the positive template. Their scores are very high (Table 4), but this is expected due to the extremely simplified and standardized ground-truth reports; for example, the vocabulary of the latter contains only 30 words, including special tokens. The DC system of [60] also uses retrieved sentence ‘templates,’ but these are complete sentences with no empty slots to fill in. Templates, however, may change over time, which puts reliability at stake, as noted by [66], who proposed the alternative of retrieving whole reports.
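To illustrate the idea, template-based generation reduces report writing to a classification step plus slot filling; the template wording and slot names below are invented for illustration and are not the actual templates of [29]:

from typing import Dict

# Hypothetical templates: a positive template with five slots and a fixed negative template.
POSITIVE_TEMPLATE = ("There is a {displacement} fracture of the {location} neck of femur, "
                     "with {comminution} comminution, {impaction} impaction and {angulation} angulation.")
NEGATIVE_TEMPLATE = "No fracture of the neck of femur is identified."

def fill_report(is_positive: bool, slots: Dict[str, str]) -> str:
    """Fill a fixed sentence template with slot values predicted from the image."""
    return POSITIVE_TEMPLATE.format(**slots) if is_positive else NEGATIVE_TEMPLATE

# fill_report(True, {"displacement": "displaced", "location": "subcapital",
#                    "comminution": "no", "impaction": "no", "angulation": "mild"})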

Retrieval-based approaches to DC can be as simple as reusing the diagnostic text of the visually nearest (in terms of image encoding similarity) medical exam of the training set [67]. Even this 1-nearest neighbor approach achieves surprisingly good results; see the second row of Table 4. It should then be no surprise that the more advanced retrieval-based DC approach of [61] outperforms the ED [23, 100], ED+VA [45], and ED+RL [60] methods in Table 4 (considering only comparable results from the table). We also note that methods that retrieve sentences [60, 61], discussed above, can be seen as belonging to the category of retrieval-based systems. Retrieval-based systems were also the top performing submissions of the ImageCLEF Caption Prediction subtask, a task that ran for two consecutive years [20, 24] (footnote 17). The top participating systems of the competition in both years relied on (or included) image retrieval [62, 114]. Zhang et al. [114], who obtained the best results in 2018, used the Lucene Image Retrieval system (LIRE) to retrieve similar images from the training set and then simply concatenated the captions of the top three retrieved images to obtain the new caption (footnote 18). Liang et al. [62], who had the best results in 2017, combined an ED approach with image-based retrieval. They reused a pre-trained VGG encoder and an LSTM decoder, similar to those of [47]. They trained three such models on different caption lengths and used an SVM classifier to choose the most suitable decoder for the given image. They also used a 1-nearest neighbor method to retrieve the caption of the most similar training image and concatenated it with the generated caption.

Baselines are included in the first two lines of Table 4. BlindRNN is an RNN that simply generates word sequences, having been trained as a language model on medical captions, without considering the image(s); a single-layer LSTM was used in the BlindRNN of Table 4. The 1-NN baseline retrieves the diagnostic text of the visually most similar image of the training set [67]. These simplistic baselines were intended to be easy to beat but, as can be seen in Table 4, the scores of 1-NN are very high, and it even outperforms some much more elaborate approaches, such as the system of [67], in clinical recall.
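
For concreteness, the BlindRNN baseline amounts to an unconditional language model over report text; the following is a minimal PyTorch sketch of such a model (our own, with hypothetical vocabulary handling), which never sees the images.

```python
import torch.nn as nn

class BlindRNN(nn.Module):
    """Image-blind baseline: a single-layer LSTM language model trained on
    report text only (illustrative sketch)."""

    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # Predict each next token from the previous ones; no image encoding is used.
        hidden_states, _ = self.lstm(self.embed(token_ids))
        return self.out(hidden_states)
```

At test time, tokens are sampled (or decoded greedily) from the output distribution, so the generated report depends only on the language statistics of the training captions.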

5 Literature search methodology and discussion

As discussed already, only three datasets are available for DC (Sect. 2), and DC systems are assessed mostly via measures originating from machine translation and summarization (Sect. 3). Table 5 lists all the scientific articles presenting DC methods that were reviewed in this work (it does not list papers discussing datasets, evaluation measures, etc.).

For articles that present DC methods (Table 5), we considered only peer-reviewed publications, in English, up to 2021. We initially queried Google Scholar with “radiology report generation” (568,000 results) and examined the top 100 results returned; beyond that point, the results were irrelevant. We repeated the same step for the queries “medical image report generation” (473,000 results) and “medical image captioning” (10,200 results). The other queries we tried returned irrelevant results. We also used Google search to collect articles citing the most highly cited articles we had already retrieved. For example, we explored the 222 articles citing [102] (see Table 5).

Most of the articles of Table 5 were selected due to their high citation count, but two [40, 109] were selected due to their claimed state-of-the-art performance. We excluded articles with fewer than 8 citations, noting that this also excludes very recent work, such as that of [30, 66, 88, 103], which had 1, 0, 4, and 0 citations, respectively. The selected articles mainly focus on chest radiographs, and only two study other types of images. In particular, [115] study blood cancer images and their diagnostic reports, and [109] study non-medical natural images and their descriptions, along with chest radiographs and their reports.

All the papers we reviewed that reported the software library they used employed Torch [102, 115], PyTorch [60, 61, 67], or TensorFlow [109]. However, only one of the articles of Table 5 makes the code of its experiments publicly available, which makes it harder to replicate the reported experiments.

Generating diagnostic text from an encoded image can benefit the medical field by potentially assisting physicians to produce higher-quality reports in less time. As shown in Table 5, the main medical focus is the chest, probably because the available resources mainly comprise chest radiographs. Moreover, we note that diagnostic captioning could potentially be applied to any diagnostic task that involves images, even without a medical focus. For instance, the image could show a malfunctioning system or engine, and the caption could describe the fault diagnosis.

The encoder–decoder architecture is the most common approach employed (see Table 4). However, retrieval-based methods have also been explored (Table 4). They are simpler, in the sense that they do not generate text from scratch but instead reuse previous text. They are also more easily deployed in medical facilities, as they require less training (e.g., no decoders need to be trained), yet they perform competitively with encoder–decoder approaches (Table 4). The benefits of retrieval approaches have also been noted in the natural language generation literature [59].

Table 5 Scientific articles presenting DC methods that have been reviewed and motivation for reviewing them

Evaluation measures for diagnostic captioning methods are still under-explored in the literature. Evaluation of clinical accuracy by human experts is possible, but expensive; suitable evaluators are hard to find and, even when they are found, it is hard for them to allocate much time to evaluation. Alternatively, clinical accuracy can be measured automatically by employing tagging systems. However, any errors of such systems propagate to the evaluation of the system-produced caption. A possible solution would be to improve the accuracy of those tagging tools (e.g., by better capturing negation). Measures that originate from non-medical fields may also apply generic stemming; e.g., METEOR uses the Snowball stemmer. A generic stemmer, not tuned for the medical domain, may introduce errors by conflating distinct medical terms that happen to share a stem. To conclude, more and better evaluation measures are needed for DC to grow.
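
As a quick way to audit this risk, one could run the same stemmer over the report vocabulary and flag distinct terms that collapse to the same stem, leaving a domain expert to judge whether each conflation is harmless; the sketch below uses NLTK's Snowball stemmer, and the example word list is purely illustrative.

```python
from collections import defaultdict

from nltk.stem.snowball import SnowballStemmer

def find_conflated_terms(vocabulary):
    """Group distinct vocabulary terms that the Snowball stemmer maps to the
    same stem, so that potentially unsafe conflations can be inspected."""
    stemmer = SnowballStemmer("english")
    groups = defaultdict(set)
    for term in vocabulary:
        groups[stemmer.stem(term.lower())].add(term.lower())
    return {stem: terms for stem, terms in groups.items() if len(terms) > 1}

# Hypothetical usage over a small sample of report vocabulary:
# find_conflated_terms(["effusion", "effusions", "calcified", "calcification"])
```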

6 Conclusions and directions for future research

We provided an extensive overview of DC methods, publicly available datasets, and evaluation measures.

Methods In terms of methods, we found that most current DC work uses encoder–decoder deep learning approaches, largely because of their success in generic (non-medical) image captioning. We have pointed out, however, that DC aims to report only information that helps in a medical diagnosis. Prominent objects shown (e.g., body organs) do not need to be mentioned if there is nothing clinically important to report about them, unlike generic image captioning, where salient objects (and actions taking place) typically have to be reported. Another major difference from generic image captioning is that medical images vary much less and, consequently, the corresponding diagnostic text is often very similar or even identical across different patients. These two factors allow retrieval-based methods, which reuse diagnostic text from training examples with similar images, to perform surprisingly well in DC. Frequent sentences or sentence templates can also be reused, instead of generating sentences from scratch.

Evaluation For evaluation purposes, we reported that DC work has so far relied mostly on word overlap measures, originating from machine translation and summarization, which often fail to capture clinical correctness, as we have also demonstrated using examples (Table 3). Measures that compare tags (also viewed as labels or classes, corresponding to medical terms or concepts) that are manually or, more often, automatically extracted from system-generated and human-authored diagnostic reports have also been employed, as a means to better capture clinical correctness. They may also fail, as we demonstrated (Table 3), when the tools that automatically extract the tags are inaccurate, when human annotation guidelines are unclear on exactly which tags should or should not be assigned, and when tags cannot fully capture the information to be included in the diagnostic text. Manual evaluation is rare in DC, presumably because of the difficulty and cost of employing evaluators with sufficient medical expertise.

Datasets In terms of datasets, we focused on the only two publicly available datasets that are truly representative of the task (IU X-Ray, MIMIC-CXR), having first discussed severe shortcomings of the other publicly available datasets (e.g., they may not contain medical images from real examinations). We also collected and reported evaluation results from previous published work for all the DC datasets, methods, and evaluation measures we considered. Although these results are often not directly comparable, because of different datasets or splits used, they provide an overall indication of how well different types of DC methods perform. The results we collected may also help other researchers produce results that will be more directly comparable to previously reported ones.

Our main findings, summarized above, also guide our proposals for future work on DC. Our proposals, however, also take into account developments and concerns in the broader area of Artificial Intelligence (AI) applications in medicine, which we summarize first, before discussing our proposals.

The role of AI in medicine First, it should be made clear that the role of AI in medicine, including DC, is not to replace but rather to assist accredited physicians (e.g., radiologists, nuclear medicine physicians in our case), who remain responsible for any medical decision and treatment. In that sense, we should aim to position AI systems as tools to be used by physicians for a better medical outcome [19, 35, 71]. The recent trend of comparing AI predictive models to physicians, rather than assessing how AI may complement the skills and expertise of physicians and improve their performance, reflects a misunderstanding of AI’s potential clinical role [57].

Reliable trials and validation Second, to guarantee safe clinical adoption, medical AI applications, including DC systems, should be supported by robust scientific trials demonstrating their effectiveness to the standard of clinical practice. Despite the plethora of research on medical AI, only 6% of recent articles on diagnostic deep learning models incorporated independent external validation, and very few studies deployed and assessed AI in a clinical context [51, 110]. Moreover, even though there are currently about 100 CE-marked AI products on the market, only 36% are backed by scientifically solid evidence, with the majority of the research demonstrating lower efficacy, and just 22% concern diagnosis [96]. DC is still in its infancy, hence concerns of this kind should be carefully considered to ensure DC research and development is on the right track.

Electronic medical records and structured reporting Third, when planning for future medical AI applications, including DC systems, one should take into account efforts in national health systems to move toward personal electronic medical records. Such records will provide physicians with a wealth of information about the medical histories of their patients (including past examinations and images) and will require tools to help physicians focus on information relevant to a particular diagnosis. Furthermore, there is a trend toward structured reporting in clinical routine [12, 13, 25], which could result in more accessible and standardized data from past medical reports, but may also favor semi-automatic report generation. On the other hand, health professionals should become familiar with medical AI applications and methods, and curricula to help medical trainees acquire the necessary knowledge and skills are being explored [104].

Having our main findings and the concerns above in mind, we now propose directions for future work on DC.

Hybrid encoder–decoder and retrieval-based DC-specific methods We believe that hybrid methods, which combine encoder–decoder approaches that generate diagnostic text from scratch with retrieval-based methods that reuse text from similar past cases, are more likely to succeed. As already discussed, retrieval-based methods often work surprisingly well in DC, a fact that DC research has not fully exploited yet; still, some editing (or filling in) of recycled previous text (or templates) will presumably be necessary in many cases, especially when reporting abnormalities. Hence, decoders that tailor (or fill in) previous diagnostic text (or templates) may still be needed. Reinforcement learning can be used to decide when to switch from recycling previous text to editing it or to generating new text.

Multimodal DC methods Ideally, future work will also take into account that physicians do not consider only medical images when diagnosing. They also consider the medical history and profile of the patient (e.g., previous examinations, previous medications, age, sex, occupation). Hence, information of this kind, increasingly available via electronic medical records as already discussed, may need to be encoded along with the images of the current examination, of which there may be more than one, as also discussed.

Support tools for physicians We also believe that DC methods need to involve more closely and support the physicians who are responsible for the diagnoses. Current DC work seems to assume that systems should generate a complete diagnostic text on their own, which the responsible physician may then edit. In practice, however, it may be more desirable to allow the physicians to see and correct regions of possible interest highlighted by the system on the medical images; then allow the physicians to inspect and correct medical terms assigned by the system to the images; then let the physicians start authoring the diagnostic text, with the system suggesting sentence completions, re-phrasings, missing sentences, in effect acting as an intelligent image-aware authoring tool. This would allow the physicians to monitor and control more closely the system’s predictions and decision making, especially if mechanisms to explain each system prediction or suggestion are available (e.g., highlighting regions on the images that justify each predicted term or suggested sentence completion).

Better intrinsic and extrinsic evaluation We discussed the shortcomings of current automatic intrinsic DC evaluation measures, such as word overlap measures, and we showed that measures of this kind do not necessarily capture clinical correctness (Table 3). Improving these measures to capture desirable properties of diagnostic text, especially clinical correctness, is hence an obvious area where further research is needed. Advances in evaluation measures for machine translation [90], summarization [106] or, more generally, text generation [83] also need to be monitored and ported to DC evaluation when appropriate. Despite the high cost, more manual evaluations of system-generated (or system–physician co-authored) diagnostic reports by qualified medical experts in realistic trials are also needed, to obtain a better view of the real-life value and reliability of current DC methods and of the improvements required. More extrinsic evaluations are also necessary, e.g., to check whether DC methods can indeed shorten the time needed by a physician to author a diagnostic report, whether they indeed help inexperienced physicians avoid medical errors, whether they reduce the pressure physicians feel, etc. Extrinsic evaluations of this kind may also help shift DC methods toward becoming support tools for physicians, as suggested above.
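
To illustrate why word-overlap measures alone are risky (in the spirit of Table 3), the minimal example below, which assumes NLTK and uses hypothetical sentences, shows how BLEU can still reward a generated report that drops a negation and thus reverses the clinical meaning of the reference.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference and system output, differing only in a dropped negation.
reference = "no acute cardiopulmonary abnormality is seen".split()
hypothesis = "acute cardiopulmonary abnormality is seen".split()

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)
# All n-grams of the hypothesis occur in the reference (only a brevity penalty
# applies), so the score is high even though the clinical meaning is reversed.
print(f"BLEU: {score:.2f}")
```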

More, larger, realistic public datasets We pointed out that there are currently only two datasets that are truly representative of the DC task. The first one, IU X-Ray, is rather small (approx. 4k instances) by today’s standards. The second one, MIMIC-CXR, is much larger (approx. 228k instances), but still small compared to the approx. 1 billion imaging examinations performed annually worldwide (Sect. 1), and it contains only English reports. Hospitals worldwide routinely save diagnostic medical images and the corresponding reports in their systems using established standards, at least for the images and their metadata (e.g., DICOM). Regulations and guidelines to protect sensitive information (e.g., HIPAA) are also available, and automatically removing sensitive information from both images and diagnostic reports is feasible to a large extent (Sect. 2). Hence, it should be possible to construct many more, much larger publicly available DC datasets in many more languages; ideally these datasets would also include medical records and other information that physicians consult for diagnostic purposes, not just the medical images, as already discussed. What seems to be missing is a set of established, possibly standardized, procedures to construct publicly available and appropriately anonymized DC datasets. In turn, this requires concrete evidence (e.g., from extrinsic evaluations) of the possible benefits that DC may bring to public health systems, and well-documented best practices.