1 Introduction

Deep learning approaches have exhibited impressive performance in medical imaging applications in recent years [2, 7, 19]. For instance, convolutional neural networks (CNNs) have had some success in detecting and classifying radiological abnormalities on chest x-rays, a particularly complex task [2, 12, 15, 21]. The majority of these studies have been designed for cross-sectional analyses, viewing a single image in isolation, and disregard the fact that a patient may have had previous medical imaging examinations for which the radiological reports are also available. It is standard practice for radiologists to take clinical history into account and to add context to their report by comparing with previous imaging. Some abnormalities will be long-standing, but others may change over time, with varying clinical relevance. Often in elderly patients or those with a history of smoking, the baseline x-ray appearances, i.e. when that patient is “well”, can still be abnormal. If individual films are viewed in isolation, it can be challenging to tell with certainty if there are acute findings. If previous imaging is available, it is possible to determine if there has been interval change, for example, acute consolidation (indicating infection). As with humans, it is expected that a neural network can learn from previous patient-specific information, in this case all prior chest radiographs for that patient and their corresponding reports.

The motivation for this work is to assess the potential of recurrent neural networks (RNNs) for the real-time detection of radiological abnormalities when modelling the entire series of past exams that are available for any given patient. In particular, we set out to explore the performance of Long Short-Term Memory (LSTM) networks [8, 10], which have lately become the method of choice in sequential modelling, especially when used in combination with CNNs for visual feature extraction [6, 20]. The technical challenge faced in our context is that sequential medical exams are event-based observations. As such, they are collected at times of clinical need, i.e. they are not equally spaced, and the number of historical exams available for each patient can vary greatly. Figure 1 shows four longitudinal chest x-rays acquired on the same patient over a certain period of time. This figure also illustrates other challenges faced when modelling this type of longitudinal data: the images may be acquired using different x-ray devices (resulting in different image quality, e.g. resolution, brightness, etc.), there may be differences in patient positioning (e.g. supine, erect, rotated, degree of inspiration), differences in projection (postero-anterior and antero-posterior), and not all images are equally centred (i.e. there can be rotations, translations, etc.).

As LSTMs are typically applied on regularly-sampled data [9, 16, 17], they are ill-suited to work with irregular time gaps between consecutive observations, as previously noted [3, 13]. This is a particularly important limitation in our context as certain radiological abnormalities tend to be observed for longer periods of time whereas others are short-lived. In this article we demonstrate that an architecture combining a CNN with a simple modification of the standard LSTM is able to handle irregularly-sampled data and learn the temporal dynamics of certain visual features resulting in improved pattern detection. Using both simulated and real x-ray datasets, we demonstrate that this capability yields improved image classification performance over an LSTM baseline.

Fig. 1. Example of longitudinal x-rays for a given patient.

2 Motivating Dataset and Problem Formulation

The dataset used in this study was collected from the historical archives of the PACS (Picture Archiving and Communication System) at Guy’s and St. Thomas’ NHS Foundation Trust, in London, during the period from January 2005 to March 2016. The dataset has been previously used for the detection of lung nodules [14] and for multi-label metric learning [1]. It consists of \(745\,480\) chest radiographs representative of an adult population and acquired using 40 different x-ray systems. Each associated radiological report was parsed using a natural language processing system for the automated extraction of radiological labels [5, 14]. For this study, we extracted a subset of \(80\,737\) patients having a history of at least two exams, which resulted in \(337\,575\) images (with \(232\,610\) used for training and \(104\,965\) for testing). Each image was scaled to a standard format of \(299 \times 299\) pixels. The resulting dataset has an average of 4.18 examinations per patient with an average of 180.29 days between consecutive exams per patient.

In what follows, each individual sequence of longitudinal chest x-rays along with its associated vector of radiological labels is denoted as \(\{X_{i}^t, l_{i}^t\}\), where \(i=1,\ldots ,N\) is the patient index and \(t=1, \ldots ,T_i\) is the time index. Typical chest x-ray datasets are characterised by relatively few examinations per patient (e.g. \(T_i\) is around 4–5) and highly-irregular sampling rates. Our task is to predict the vector of image labels \(l_{i}^{T_i}\) given the entire history of exams up to time \(T_i-1\) plus the current image, i.e. \(X_i^{T_i}\).
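The sequence notation above can be made concrete with a small sketch that pairs each exam of one patient with the labels of the preceding exam and the time elapsed since it. This is only an illustration: the function name `build_inputs`, the use of days as the time unit, and the example dates and label vectors are assumptions, not details taken from the study. The all-zero label vector and zero time gap for the first exam follow the initialisation described in Sect. 3.

```python
from datetime import date


def build_inputs(exams):
    """Pair each exam with the previous exam's labels and the gap in days.

    exams: list of (exam_date, label_vector) for one patient, sorted by date.
    Returns one (previous_labels, gap_in_days) pair per exam.
    """
    n_labels = len(exams[0][1])
    inputs = []
    prev_date = None
    prev_labels = [0] * n_labels  # no history before the first exam
    for exam_date, labels in exams:
        # Gap to the previous exam; zero for the first exam (assumed unit: days).
        gap = 0 if prev_date is None else (exam_date - prev_date).days
        inputs.append((list(prev_labels), gap))
        prev_date, prev_labels = exam_date, labels
    return inputs


# Hypothetical patient with three exams and four binary labels per exam.
exams = [
    (date(2010, 1, 5), [0, 0, 1, 0]),
    (date(2010, 3, 6), [0, 1, 1, 0]),
    (date(2011, 3, 6), [0, 0, 1, 0]),
]
inputs = build_inputs(exams)
```

The labels of the final exam are never used as an input; they are the prediction target.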

3 Time-Modulated LSTM

LSTMs are a particular type of RNN able to classify, process and predict time series [8, 10]. The internal state of an LSTM (a.k.a. the cell state or memory) gives the architecture its ability to “remember”. A standard LSTM contains memory blocks, and blocks contain memory cells. A typical memory block is made of three main components: an input gate controlling the flow of input activations into the memory cell, an output gate controlling the output flow of cell activations, and a forget gate for scaling the internal state of the cell. The forget gate modulates how much information is used from the internal state of the previous time-step. However, standard LSTMs are ill-suited for our task where the time between consecutive exams is variable, because they have no mechanism for explicitly modelling the arrival time of each observation. In fact, it has been shown that LSTMs, and more generally RNNs, underperform with irregularly sampled data or time series with missing values [4, 13]. Previous attempts to adapt LSTMs for use with irregularly sampled datapoints have mostly focused on speeding up the convergence of the algorithm in settings with high-resolution sampled data [13] or on discounting short-term memory [3].

To address these issues, we introduce two simple modifications of the standard LSTM architecture, called time-modulated LSTM (tLSTM), both making explicit use of the time indexes associated with the inputs. In the proposed architecture, all the images for a given patient are initially processed by a CNN architecture, which extracts a set of imaging features, denoted by \(\widehat{X}_i^t\), at each time step. The LSTM takes as inputs \(l_i^{t-1}\), i.e. the radiological labels describing the image acquired at the previous time-step, the current image features, \(\widehat{X}_i^t\), and the time lapse between \(X_i^{t-1}\) and \(X_i^{t}\), which we denote as \(\delta _i^t\). For the last image in the sequence, the LSTM outputs a prediction of the image labels \(l_i^t\), which we denote as \(y_i^t\). Figure 2 provides a high-level overview of this model and the equations below define the first variant of the tLSTM unit (tLSTMv1):

$$\begin{aligned} \begin{aligned} f_t&= \sigma (W_{fl}*l^{t-1} + W_{fx}*\widehat{X}^t + W_{fj}*\delta ^t + b_f) ,\\ i_t&= \sigma (W_{il}*l^{t-1} + W_{ix}*\widehat{X}^t + W_{ij}*\delta ^t + b_i) ,\\ o_t&= \sigma (W_{ol}*l^{t-1} + W_{ox}*\widehat{X}^t + W_{oj}*\delta ^t + b_o) ,\\ c_t&= \tanh (W_{cl}*l^{t-1} + W_{cx}*\widehat{X}^t + W_{cj}*\delta ^t + b_c) ,\\ h_t&= f_t * h_{t-1} + i_t * c_t ,\\ y^t&= o_t * \tanh (h_t) \end{aligned} \end{aligned}$$
(1)

Here, \(h_t\) defines the internal state at time-step t, while \(f_t\), \(i_t\) and \(o_t\) refer to the forget, input and output gates at time-step t, respectively. The gates and the term \(c_t\) are all computed as linear combinations of the vectors \(l^{t-1}\) and \(\widehat{X}^t\) and the scalar \(\delta ^t\); the gates are then transformed by a sigmoid function, \(\sigma (\cdot )\), and \(c_t\) by a hyperbolic tangent. The matrices denoted by W contain learnable weights indexed by two letters (e.g. \(W_{fl}\) contains the weights of the forget gate f for labels l, and so on). At time \(t = 1\), we initialise \(l_i^{t-1} = \langle 0, \dots , 0 \rangle \) (an array of zeros) and \(\delta _i^t=0\). The time lapses, \(\delta _i^t\), enter linearly into the internal cell state update as well as the output, forget and input gates.
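To make Eq. (1) concrete, the following NumPy sketch implements a single tLSTM step. It is an illustration under stated assumptions rather than the authors' implementation: the products between weights and inputs are read as matrix-vector products, the products between gates and states as element-wise, the state size is set equal to the number of labels so that \(y^t\) can be read as a label vector, and the random initialisation, the helper names `init_params` and `tlstm_step`, and the small feature size are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_labels, n_features, n_hidden, seed=0):
    """Random weights for the four blocks f, i, o, c (hypothetical initialisation)."""
    rng = np.random.default_rng(seed)
    params = {}
    for g in "fioc":
        params[f"W_{g}l"] = rng.normal(0.0, 0.1, (n_hidden, n_labels))
        params[f"W_{g}x"] = rng.normal(0.0, 0.1, (n_hidden, n_features))
        params[f"W_{g}j"] = rng.normal(0.0, 0.1, n_hidden)  # weights on the scalar delta
        params[f"b_{g}"] = np.zeros(n_hidden)
    return params


def tlstm_step(p, l_prev, x_hat, delta, h_prev):
    """One step of Eq. (1): returns the new internal state h_t and the output y^t."""

    def pre(g):
        # Linear combination of previous labels, image features and time lapse.
        return p[f"W_{g}l"] @ l_prev + p[f"W_{g}x"] @ x_hat + p[f"W_{g}j"] * delta + p[f"b_{g}"]

    f = sigmoid(pre("f"))  # forget gate
    i = sigmoid(pre("i"))  # input gate
    o = sigmoid(pre("o"))  # output gate
    c = np.tanh(pre("c"))
    h = f * h_prev + i * c
    y = o * np.tanh(h)
    return h, y


# Toy sizes: 4 labels, 8 image features (the real features have 2048 elements).
params = init_params(n_labels=4, n_features=8, n_hidden=4)
x_hat = np.random.default_rng(1).normal(size=8)
l_prev, h0 = np.zeros(4), np.zeros(4)
h_short, y_short = tlstm_step(params, l_prev, x_hat, 0.0, h0)
h_long, y_long = tlstm_step(params, l_prev, x_hat, 2.0, h0)
```

With identical image features and label history, the two calls differ only in the time lapse, so any difference between `y_short` and `y_long` comes from the \(W_{\cdot j}\) weights.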

Fig. 2. An overview of the proposed architecture for image label prediction leveraging all historical exams.

A different variation of the previous model (tLSTMv2) uses the time lapse only to modulate the internal state, \(h_t\). In this case, each \(\delta _i^t\) actively contributes to updating \(h_t\) directly and, implicitly, to estimating the label vector \(y^t\), i.e.

$$\begin{aligned} \begin{aligned} h_t&= f_t * h_{t-1} + i_t * c_t + W_{tj}*\delta ^t\\ y^t&= o_t * \tanh (h_t) . \end{aligned} \end{aligned}$$
(2)

The form of the other updating equations, i.e. \(f_t, i_t, o_t\) and \(c_t\), is the same as in Eq. (1), but without the \(W_{\cdot j} * \delta ^t\) terms.
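The state update of Eq. (2) can be sketched in isolation, taking the gate activations as given. As before, this is a hedged illustration: the function name `tlstmv2_update`, the element-wise reading of the products, and the numerical values are assumptions chosen only to show how the time lapse shifts the internal state additively.

```python
import numpy as np


def tlstmv2_update(f, i, o, c, h_prev, w_tj, delta):
    """State and output update of Eq. (2), given precomputed f_t, i_t, o_t, c_t."""
    h = f * h_prev + i * c + w_tj * delta  # the time lapse acts on the state directly
    y = o * np.tanh(h)
    return h, y


# Hypothetical activations for a two-dimensional state.
f = np.array([0.5, 0.5])
i = np.array([1.0, 0.0])
o = np.array([1.0, 1.0])
c = np.array([0.2, -0.4])
h_prev = np.array([1.0, 2.0])
w_tj = np.array([0.1, -0.1])

h_gap, y_gap = tlstmv2_update(f, i, o, c, h_prev, w_tj, delta=3.0)
h_nogap, y_nogap = tlstmv2_update(f, i, o, c, h_prev, w_tj, delta=0.0)
```

Here `h_gap` and `h_nogap` differ by exactly `w_tj * 3.0`, which is the only route by which the time lapse influences the prediction in this variant.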

4 Simulated Data

In order to better assess the potential advantages introduced by the time-modulated LSTM in settings where observations are event-driven and the underlying patterns to be detected are time-varying, we generated simulated data as an alternative to the real chest x-ray dataset of Sect. 2. Simulating images enables us to precisely control the sampling frequency at which the relevant visual patterns appear and disappear over time as well as the signal-to-noise ratio. For this study, we simulated a population of image sequences of varying lengths. Within a sequence, each image consisted of a noisy background image containing one or more randomly placed digits drawn from the set \(\{0, 3, 6, 8, 9\}\). We simulated three kinds of patterns inspired by the radiological patterns seen in real medical images: (i) rare patterns, consisting of digits appearing with low probability; (ii) common patterns, consisting of rapidly appearing and resolving digits; (iii) persistent patterns, consisting of digits observed for extended periods of time. In analogy to medical images, each digit in our simulation represents a radiological abnormality to be detected, hence multiple (and possibly overlapping) digits are allowed to coexist within an image. The time lapse \(\delta ^t\) was modelled as a uniform random variable taking values in the interval [1, 10]. An example of simulated images can be found in the Supplementary Material.
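The label side of such a simulation might look as follows. This sketch generates only the time lapses and the per-exam digit labels, not the noisy images. The onset probabilities, episode durations, the assignment of digits to pattern types, and the zero time lapse for the first exam are all hypothetical choices made for illustration; only the digit set, the three pattern types and the uniform time lapse on [1, 10] come from the description above.

```python
import random

DIGITS = [0, 3, 6, 8, 9]

# Hypothetical parameters: (onset probability per exam, episode duration).
PATTERNS = {
    "rare": (0.05, 5.0),
    "common": (0.5, 2.0),
    "persistent": (0.3, 30.0),
}

# Hypothetical assignment of digits to pattern types.
DIGIT_PATTERN = {0: "rare", 3: "common", 6: "common", 8: "persistent", 9: "persistent"}


def simulate_sequence(n_exams, seed=0):
    """Return a list of (time_lapse, labels) pairs, one per simulated exam."""
    rng = random.Random(seed)
    t = 0.0
    active_until = {d: float("-inf") for d in DIGITS}
    sequence = []
    for k in range(n_exams):
        # Irregular sampling: uniform time lapse on [1, 10] after the first exam.
        delta = 0.0 if k == 0 else rng.uniform(1.0, 10.0)
        t += delta
        labels = []
        for d in DIGITS:
            p_onset, duration = PATTERNS[DIGIT_PATTERN[d]]
            # A digit that is not currently visible may start a new episode.
            if t >= active_until[d] and rng.random() < p_onset:
                active_until[d] = t + duration
            labels.append(1 if t < active_until[d] else 0)
        sequence.append((delta, labels))
    return sequence


sequence = simulate_sequence(n_exams=6, seed=1)
```

Because episode durations are expressed in time units rather than in numbers of exams, whether a digit is still visible at the next exam depends on the time lapse, which is the kind of dependence a time-aware model can exploit.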

5 Experimental Results

In our experiments with the real x-ray dataset, the CNN component in our architecture consists of a pre-trained Inception v3 [18] without the classification layer. The imaging features \(\hat{X}_i^t\) (an array of 2048 elements) from the CNN are used as inputs for the LSTM component along with the image labels. We considered four possible radiological labels: cardiomegaly, consolidation, pleural effusion and hiatus hernia. The performance of the time-modulated LSTM models is assessed by the PPV (Positive Predictive Value) and NPV (Negative Predictive Value) along with the F-score, i.e. the harmonic mean of precision and recall.
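For reference, the three metrics can be computed for a single binary label as in the sketch below. The function name `binary_metrics` and the toy labels are illustrative only, and the convention of returning 0.0 when a denominator is zero is an assumption; how the per-label values were aggregated across the four labels is not specified here.

```python
def binary_metrics(y_true, y_pred):
    """PPV, NPV and F-score for one binary label (inputs are lists of 0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    ppv = tp / (tp + fp) if tp + fp else 0.0  # precision
    npv = tn / (tn + fn) if tn + fn else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-score: harmonic mean of precision (PPV) and recall.
    f_score = 2 * ppv * recall / (ppv + recall) if ppv + recall else 0.0
    return ppv, npv, f_score


# Toy example: 3 true positives, 1 false positive, 3 true negatives, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
ppv, npv, f_score = binary_metrics(y_true, y_pred)
```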

Table 1. Results on real data\(^{*}\)

We compared the performance of four models: the baseline CNN classifier (Inception v3), which only uses the current image to predict the labels and does not exploit the historical exams for a given patient, and three variations of the architecture illustrated in Fig. 2: one using the standard LSTM and the two versions of the time-modulated LSTM model introduced in Sect. 3. Both tLSTM versions introduced noticeable performance improvements; see Table 1. In particular, tLSTMv1 yields an increase of \(\sim \)7% in F-score over the baseline and \(\sim \)8% over a standard LSTM. Moreover, tLSTMv1 achieves a \(\sim \)9% improvement in PPV over the baseline. Overall, tLSTM achieves improved performance over the standard LSTM due to its ability to handle irregularly sampled data.

For the simulated dataset, we used a pre-trained AlexNet [11] as a feature extractor in combination with three versions of the LSTM for modelling sequences of images. A full table with results can be found in the Supplementary Material. We purposely introduced a sufficiently high level of noise in the visual patterns so as to make the classification problem with individual images particularly difficult; accordingly, the single-image classifier did not achieve acceptable classification results. Likewise, the architecture using a standard LSTM did not introduce significant improvements, owing to the irregularly sampled observations. On the other hand, larger classification improvements were achieved using the time-modulated LSTM units, as those were able to decode the sequential patterns by explicitly taking into account the time gaps between consecutive observations.

6 Conclusions

Our experimental results suggest that the modified LSTM architectures, combined with CNNs, are suitable for modelling sequences of event-based imaging observations. By explicitly modelling the individual time lapses between consecutive events, these architectures are able to better capture the evolution of visual patterns over time, which improves classification performance. The full potential of these models is best demonstrated using simulated datasets, where we have control over the exact nature of the temporal patterns and the image labels are perfectly known. In real radiological datasets, there are often errors in some of the image labels due to typographical errors, interpretive errors, ambiguous language and, in some cases, long-standing findings not being mentioned. This can cause problems both in CNN training and testing. Despite these challenges, we have demonstrated that improved classification results can also be achieved by the time-modulated LSTM components on a large chest x-ray dataset. We have thus shown empirically that a patient’s imaging history can be used to improve automated radiological reporting. In future work, we plan more extensive testing of a system trained end-to-end on a much larger number of radiological classes. The code with the networks used for our experiments can be found online: https://github.com/WMGDataScience/tLSTM.