Introduction

Prediction tasks defined on structured EHR data are a key focus for applications of machine learning in healthcare, with the potential to improve patient outcomes through faster and more accurate diagnoses. With the rapidly increasing quantity and availability of EHR data, deep learning methods are increasingly being used to model the complex interactions in a range of healthcare-related predictive tasks. Because EHR data is sequential in nature, RNNs have emerged as a key component in many recent state-of-the-art methods. This paper introduces signature methods as a theoretically well-grounded way of extracting features from sequential structured EHR data. We provide an empirical evaluation of signature methods as a novel alternative to RNNs for disease prediction using data collected during routine healthcare encounters.

The signature transform maps a path (for example, a time series) onto an infinite sequence of summary statistics. These terms are known to completely characterise the path (up to translation), and any function on the path can be modelled arbitrarily well by a linear function on the signature [21, 34]. In a machine learning context, this makes the signature a useful feature set to learn from. The signature has been successful across a range of predictive tasks involving time series data [37], in particular in the medical domain [38]. However, the signature method has not yet been applied to structured EHR data, most likely because its high dimensionality poses computational challenges. We therefore follow recent work that enables the signature to be used as a differentiable layer within a neural architecture, enabling its application in high-dimensional domains where calculation would previously have been intractable [7].

In this paper we perform an empirical evaluation of signature methods as a novel alternative to RNNs for disease prediction using EHR data. We create a 90 day HF prediction task with data from the UK Biobank [9] to compare neural-signature models with various augmentations against RNN and bag of words baselines. The key results can be summarised as follows:

  • Neural-signature methods produce predictive performance competitive with RNN models, with results across two separate corpora and both metrics remaining within one standard deviation

  • Log-signature and lead-lag variants improve results from a level similar to basic bag-of-words models to one comparable with RNNs

  • Adding time augmentations does not significantly affect model performance for either neural-signature or RNN models

Related work

The methods previously used to address temporality in EHR can be roughly separated into three main areas:

Discretization This consists of splitting the continuous-time variables into discrete bins. Features are then calculated from the sub-sequences within each time period. For categorical data, the most common approach is to count the number of events.
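As a toy sketch (ours, with made-up event offsets and token ids), counting categorical events per 30-day bin:

```python
import numpy as np

days = np.array([3, 12, 45, 47, 80])   # event offsets in days (illustrative)
codes = np.array([0, 2, 2, 1, 0])      # integer token ids (illustrative)

bins = days // 30                      # discretise continuous time into bins
counts = np.zeros((bins.max() + 1, codes.max() + 1))
np.add.at(counts, (bins, codes), 1)    # event counts per (bin, token)
print(counts)                          # one flat count vector per time period
```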

Neural approaches Neural network approaches attempt to automatically learn a feature set that best describes the underlying data for a specific prediction task. [41] and [12] applied RNN variants and reported improved performance over existing state-of-the-art methods.

RNN variants continue to play a role in more recent papers [1, 13, 14, 20, 35, 42, 47, 51]. Modifications include bidirectional RNNs to reduce the number of steps between dependencies, attention mechanisms to improve interpretability and to facilitate combination with Convolutional Neural Network (CNN) models, and improved embeddings for visits with graph-based attention models. In all such papers, RNNs are used to handle the sequential aspect of structured EHR.

While it is clear that RNNs perform comparatively well in deep learning applications, an alternative set of methods is also worth discussing.

Sequential feature extraction This encompasses methods that extract flat features from sequential data while retaining information about ordering. Despite being more popular in higher-frequency data modalities such as streamed data, previous works using structured EHR have explored shapelets [52] and symbolic aggregate approximation [4] for adverse drug reaction prediction. More broadly, this category includes methods such as the discrete Fourier transform [43].

It is to this category that signature methods belong. A key advantage of signature methods is a strong theoretical groundwork showing the signature's usefulness in non-parametric hypothesis testing [11] and algebraic geometry [40]. Machine learning applications have also been demonstrated in a growing variety of domains [10], including healthcare [3, 27, 28, 38], finance [24, 39], action recognition [32, 49] and handwriting recognition [50].

Data preprocessing and cohort

The UK Biobank [9] is a national population-based study comprising 502,629 individuals. We extract a retrospective heart failure (HF) cohort using the same methodology as [19], which uses the previously validated phenotyping algorithm in the CALIBER resource [18].

To form the sequential input data required for our predictive model, we extract primary and secondary diagnosis terms (ICD10), procedure terms (OPCS4), and timestamps (“epistart”) from the UK Biobank inpatient dataset. Patient events are extracted with a buffer period of 90 days before HF diagnosis (for controls, the HF diagnosis date of the matched case) to exclude highly correlated events such as end-of-life care [29]. Events that occur at the same time are randomly ordered.

We create two separate corpora for each patient: PRIMDX contains only primary diagnosis terms, while PRIMDX-SECDX-PROC also includes secondary diagnoses and procedure terms. Since the number of events in each sequence is greater for the PRIMDX-SECDX-PROC cohort, this allows us to compare each method's ability to handle longer sequences with more complex and redundant information.

In Table 1 we provide a breakdown of the demographics of the matched cohort used in this study. In Appendix A we provide further details on the HF cohort extracted and the tokenization process of healthcare terms.

Table 1 A summary of the cohort used for HF prediction experiments

Methods

Let each patient record be denoted by the path \({\mathbf {x}} = (x^1_t, \ldots , x^d_t)\), where each value \(x^i_t\) is real-valued and parameterised by \(t \in [0,T]\).

Our objective is to classify each sequence with a binary variable indicating whether the patient will develop heart failure within 90 days. The dimension of the path, d, is determined by the number of unique tokens (the vocabulary size), as we represent each token with a one-hot vector, such that only the dimension corresponding to the token's index in the vocabulary is one, with zeros everywhere else.
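As an illustrative sketch (ours, with a toy vocabulary), each patient sequence becomes a path of one-hot vectors:

```python
import numpy as np

vocab = {"I50.0": 0, "E11.9": 1, "K40.9": 2}    # toy vocabulary (illustrative)
events = ["E11.9", "I50.0", "E11.9"]            # one patient's token sequence

d = len(vocab)                                   # path dimension
path = np.zeros((len(events), d))
path[np.arange(len(events)), [vocab[e] for e in events]] = 1.0
# path[t] is the one-hot vector for the t-th event in the sequence
```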

Signature methods

The definition of the signature transform is as follows.

Let \(T > 0\) and \(0< t_1< t_2< \cdots< t_{n-1} < t_n = T\). Let \(f_x = (f_x^1, \ldots , f_x^d): [0,T] \rightarrow {\mathbb {R}}^d\) be the unique continuous function such that \(f_x(t_i)=x_i\) and is affine on the intervals between them. The signature is the infinite collection of iterated integrals

$$\begin{aligned} \mathrm {Sig}({\mathbf {x}}) = \left( \left( \,\underset{0< t_1< \cdots< t_k < T}{\int \cdots \int } \prod _{j = 1}^k \frac{ d f^{i_j}}{ dt}(t_j)\, dt_j \right) _{1 \le i_1, \ldots , i_k \le d}\right) _{k \ge 1}. \end{aligned}$$
(1)

The form of the signature in Eq. 1 can be broken down to build intuition, as done in [10]. We start by simplifying to a single index \(i \in \{1, \ldots, d\}\). This reduces Eq. 1 to

$$\begin{aligned} \mathrm {Sig}({\mathbf {x}})^i = \int _{0<t<T}\frac{ d f^{i}}{dt}(t) dt. \end{aligned}$$
(2)

As this is a single integral and f is affine, the expression simply resolves to the increment of the i-th coordinate of the path

$$\begin{aligned} \mathrm {Sig}({\mathbf {x}})^i = x^i_T - x^i_0. \end{aligned}$$
(3)

The double-iterated integral considers a pair of coordinates \(i,j \in \{1,\ldots , d\}\), such that

$$\begin{aligned} \mathrm {Sig}({\mathbf {x}})^{i,j}&= \int _{0<t_2<T}\mathrm {Sig}({\mathbf {x}})^i_{0,t_2}\,\frac{ d f^{j}}{dt_2}(t_2)\, dt_2\end{aligned}$$
(4)
$$\begin{aligned}&= \int _{0<t_1<t_2<T}\frac{ d f^{i}}{ dt_1}(t_1)\, \frac{d f^{j}}{dt_2}(t_2)\, dt_1\, dt_2 \end{aligned}$$
(5)

where we have used Eq. 2 and decomposed the integration limits as

$$\begin{aligned} 0< t_1< t_2< T = {\left\{ \begin{array}{ll} 0< t_1< t_2 \\ 0< t_2 < T \end{array}\right. } \end{aligned}$$
(6)

Notice that the integration limits in Eq. 6 correspond to integration over a triangle. This process can continue recursively and be interpreted as integrating over an increasingly high-dimensional simplex. The resulting real number is the k-fold iterated integral seen in Eq. 1

$$\begin{aligned} \mathrm {Sig}({\mathbf {x}})^{i_1,\ldots ,i_k}&= \int _{0<t<T}\mathrm {Sig}({\mathbf {x}})^{i_1,\ldots ,i_{k-1}}_{0,t}\,\frac{ d f^{i_k}}{dt}(t)\, dt \end{aligned}$$
(7)
$$\begin{aligned}&= \,\underset{0< t_1< \cdots< t_k < T}{\int \cdots \int } \prod _{j = 1}^k \frac{ d f^{i_j}}{ dt}(t_j) dt_j \end{aligned}$$
(8)

where the superscript indices are members of the set

$$\begin{aligned} i_1,\ldots ,i_k\in \{1,\ldots ,d\}. \end{aligned}$$
(9)

We can further simplify this form and remove the need for the integral when we consider the path as a series of linear segments, i.e., a piecewise linear path. For a single segment, the signature can be expressed as a product of its increments

$$\begin{aligned} \mathrm {Sig}({\mathbf {x}})^{i_1,\ldots ,i_k}_{t,t+1} = \frac{1}{k!} \prod _{j = 1}^k (x^{i_j}_{t+1} - x^{i_j}_{t}). \end{aligned}$$
(10)

To calculate each signature term of the full path, we can use Chen's identity, which states that the signature of the entire path can be calculated from the signatures of its segments [30]

$$\begin{aligned} \mathrm {Sig}({\mathbf {x}})^{i_1,\ldots ,i_k}_{0,T} = \sum ^k_{j=0}\mathrm {Sig}({\mathbf {x}})^{i_1,\ldots ,i_j}_{0,t}\,\mathrm {Sig}({\mathbf {x}})^{i_{j+1},\ldots ,i_k}_{t,T}. \end{aligned}$$
(11)
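To make Eqs. 10 and 11 concrete, the following minimal NumPy sketch (our own, truncated at depth 2) computes each segment's signature and combines them with Chen's identity:

```python
import numpy as np

def segment_sig(dx):
    # Eq. 10 for one linear segment, truncated at depth 2:
    # level k is the k-fold product of increments divided by k!
    return [dx, np.outer(dx, dx) / 2.0]

def chen(S, T):
    # Eq. 11 truncated at depth 2: level 1 adds, level 2 adds
    # plus the cross term from splitting the multi-index.
    return [S[0] + T[0], S[1] + T[1] + np.outer(S[0], T[0])]

def signature(path):
    # path: (n, d) array of points on a piecewise linear path
    S = segment_sig(path[1] - path[0])
    for t in range(1, len(path) - 1):
        S = chen(S, segment_sig(path[t + 1] - path[t]))
    return S

S1, S2 = signature(np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]]))
print(S1)   # level 1: the total increment x_T - x_0
print(S2)   # level 2: the d*d iterated integrals Sig^{i,j}
```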

Using the signature as an infinite series in a machine learning pipeline would not be tractable. Instead, it is common to truncate the series at level N, also known as the depth of the signature. This results in the finite collection of terms \(\mathrm {Sig}({\mathbf {x}})^{i_1,\ldots ,i_k}\) where the multi-index is restricted to length \(k \le N\). For example, a signature of depth 1 is the collection of d real numbers \(\mathrm {Sig}({\mathbf {x}})^1, \ldots , \mathrm {Sig}({\mathbf {x}})^d\), and a signature of depth 2 is the collection of \(d+d^2\) real numbers \(\mathrm {Sig}({\mathbf {x}})^1, \ldots , \mathrm {Sig}({\mathbf {x}})^d, \mathrm {Sig}({\mathbf {x}})^{1,1}, \ldots , \mathrm {Sig}({\mathbf {x}})^{d,d}\).

The number of terms \(\tau\), for a signature of a d-dimensional path truncated at depth N, where \(d \ge 2\), is the geometric series:

$$\begin{aligned} \tau = \sum _{k=1}^N d^k = \frac{d(d^{N}-1)}{d-1} \end{aligned}$$
(12)
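For example, a modest vocabulary of \(d = 1000\) tokens truncated at depth \(N = 2\) already gives \(\tau = 1000 + 1000^2 = 1{,}001{,}000\) terms.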

For structured EHR data with hundreds or thousands of unique terms, this poses a significant computational issue. In the next section, we highlight a number of variations that can be used to encourage information into lower order signature terms.

In Appendix B, we provide a further breakdown of the definitions given here and explore an example on toy data to show how the signature terms describe sequential data. Theoretically, the signature terms are proven to uniquely describe any path up to translation (Proposition 1) and to act as a universal non-linearity (Proposition 2). This latter property is shared with neural networks and allows us to reduce potentially complicated non-linear relationships between variables into linear ones.

Signature variations

A number of variations on the standard signature transform have been developed, each tailoring the properties of the signature to a given task. [37] provides an overview of possible variations of the signature together with an empirical evaluation on streamed data. Given the substantially greater dimensionality of structured EHR data, we restrict our investigation to the augmentations summarised in Table 2 and the log-signature (the resulting models are listed in Table 3).

Augmentations

An augmentation transforms our sequence of patient events \({\mathbf {x}} \in {\mathbb {R}}^d\) into p new sequences, whilst potentially changing the dimensionality of each path from d to a. In general, this can be described by the map

$$\begin{aligned} \phi : {\mathcal {S}}({\mathbb {R}}^{d})\rightarrow {\mathcal {S}}({\mathbb {R}}^a)^p. \end{aligned}$$
(13)
Table 2 Summary of augmentations

The time augmentation consists of concatenating an extra dimension. As shown in Proposition 1, this can be used in the absence of any actual timestamps by simply using the index of the event in the sequence. In both cases, this removes the time-parameterisation invariance of the signature [31]. We also investigate applying actual time differences from the prediction date to account for the irregularly sampled nature of the data. We follow [1] in applying the parameterised scaling function \(f(\Delta T)=T_{scale} \log (\Delta T)\), capped at a maximum \(T_{max}\). \(T_{scale}\) and \(T_{max}\) control extreme time deltas and are optimised as hyperparameters.
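A minimal sketch of both variants (our own; function and parameter names are illustrative, and we assume time deltas of at least one day so the logarithm is well defined):

```python
import numpy as np

def add_time_index(path):
    # Concatenate the event index as an extra channel,
    # removing time-parameterisation invariance.
    t = np.arange(len(path), dtype=float)
    return np.concatenate([t[:, None], path], axis=1)

def add_time_delta(path, deltas, t_scale, t_max):
    # Scaled, capped time differences from the prediction date:
    # f(dT) = t_scale * log(dT), capped at t_max.
    t = np.minimum(t_scale * np.log(deltas), t_max)
    return np.concatenate([t[:, None], path], axis=1)
```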

The basepoint [25] is used to remove translation invariance, the property that two paths separated by a constant translation have the same signature. The basepoint also has a significant advantage for our pipeline, as \(\sim 20\%\) of pathways in the dataset used in this study have only a single event. The basepoint introduces an origin point at the start of each path, ensuring each path has at least two points, which is a requirement for calculating the signature.
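Concretely, a one-line sketch of ours:

```python
import numpy as np

def basepoint(path):
    # Prepend the origin so single-event paths gain a second point
    # and translation invariance is removed.
    return np.vstack([np.zeros((1, path.shape[1])), path])
```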

The lead-lag augmentation [10, 22] adds shifted copies of the path as new coordinates. This augmentation explicitly captures the quadratic variation of the underlying process, which matters for our data because the covariance between medical concepts is known to be central to the underlying pathology of disease [16, 42]. A lag of a single timestep is described by the following augmentation

$$\begin{aligned} \phi ({\mathbf {x}}) = ((x_1,x_1),(x_2,x_1),(x_2,x_2), \dots ,(x_t,x_t)) \in {\mathcal {S}}({\mathbb {R}}^{2d}). \end{aligned}$$
(14)
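A sketch of ours implementing Eq. 14, interleaving a lead copy with a one-step-behind lag copy:

```python
import numpy as np

def lead_lag(path):
    # path: (n, d) -> (2n - 1, 2d); matches Eq. 14's pattern
    # ((x_1,x_1), (x_2,x_1), (x_2,x_2), ..., (x_t,x_t)).
    n, d = path.shape
    out = np.empty((2 * n - 1, 2 * d))
    out[::2, :d] = path         # lead channel at whole steps
    out[1::2, :d] = path[1:]    # lead advances between points
    out[::2, d:] = path         # lag channel at whole steps
    out[1::2, d:] = path[:-1]   # lag lingers at the previous point
    return out
```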

The learnt projection can be described by an affine transformation or embedding, \(\theta _A \in {\mathbb {R}}^{a \times d}\), such that

$$\begin{aligned} \phi ({\mathbf {x}}) = (\theta _A x_1, \theta _A x_2, \ldots , \theta _A x_n) \in {\mathcal {S}}({\mathbb {R}}^{a}). \end{aligned}$$
(15)

This reduces the dimensionality of the path to make the calculation of the signature tractable.
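As a sketch (ours, with illustrative sizes), the projection is a bias-free linear layer applied at each step of the one-hot path:

```python
import torch

d, a = 2000, 32                    # vocabulary size -> embedding size (illustrative)
x = torch.zeros(1, 10, d)          # one path of 10 one-hot events
x[0, torch.arange(10), torch.randint(d, (10,))] = 1.0

theta_A = torch.nn.Linear(d, a, bias=False)   # the affine map in Eq. 15
z = theta_A(x)                                # (1, 10, d) -> (1, 10, a)
```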

The log-signature

The log-signature corresponds to taking the formal logarithm of the signature in the algebra of formal power series [10]. Both the signature and its logarithm uniquely define a path (Proposition 1), but the log-signature does not hold the same universality property (Proposition 2) [32]. The log-signature maps to a smaller number of terms at each truncation level, determined by Witt's formula, which is shown in Appendix B.3.2 along with an example.
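As an illustration of the saving, a self-contained sketch of ours evaluating Witt's formula per level:

```python
def mobius(n):
    # Möbius function via trial-division factorisation
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0          # squared prime factor
            result = -result
        p += 1
    return -result if n > 1 else result

def witt(d, k):
    # Number of log-signature terms at level k of a d-dimensional path
    divisors = [j for j in range(1, k + 1) if k % j == 0]
    return sum(mobius(j) * d ** (k // j) for j in divisors) // k

d, N = 10, 3
print(sum(witt(d, k) for k in range(1, N + 1)))   # 385 log-signature terms
print(sum(d ** k for k in range(1, N + 1)))       # 1110 signature terms
```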

Deep signatures

For the affine transformation in Eq. 15, we briefly described a learning process. As detailed in recent work [7], it is possible to train the affine transformation together with the signature transform in an end-to-end neural network architecture. Here, the signature acts as a non-parametric pooling function able to extract provably useful information from sequential data.

It is possible to calculate the gradients needed for this method, as the signature can be formulated as a computation graph of differentiable operations [25, 45].

The generalised function of the neural-signature model used in this work can be written as

$$\begin{aligned} f({\mathbf {x}};\Theta ) = \sigma \left( (g^{\theta _{fc}} \circ \mathrm {Sig}^N \circ \phi ^{\theta _A})({\mathbf {x}}) \right) \end{aligned}$$
(16)

where the learnt parameters are the weights of a fully connected neural network classifier, \(\theta _{fc}\), and the elements of the affine transformation augmentation, \(\theta _A\). The sigmoid function \(\sigma\) maps the output activation to a [0, 1] score.
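A minimal PyTorch sketch of Eq. 16 (our own, truncated at depth 2 and written out explicitly rather than relying on a signature library):

```python
import torch

def sig2(x):
    # Differentiable depth-2 signature of a batch of paths.
    # x: (batch, n, a) -> (batch, a + a*a)
    dx = x[:, 1:] - x[:, :-1]                       # segment increments
    s1 = dx.sum(dim=1)                              # level 1: x_T - x_0
    csum = dx.cumsum(dim=1) - dx                    # C_t = sum_{s<t} dx_s
    cross = torch.einsum('bti,btj->bij', csum, dx)  # cross terms of Eq. 11
    diag = 0.5 * torch.einsum('bti,btj->bij', dx, dx)
    return torch.cat([s1, (cross + diag).flatten(1)], dim=1)

class NeuralSig(torch.nn.Module):
    # Eq. 16: sigmoid(g(Sig^N(phi(x)))), with N = 2 for brevity.
    def __init__(self, d, a):
        super().__init__()
        self.theta_A = torch.nn.Linear(d, a, bias=False)  # learnt projection
        self.theta_fc = torch.nn.Linear(a + a * a, 1)     # classifier g
    def forward(self, x):                                 # x: (batch, n, d)
        return torch.sigmoid(self.theta_fc(sig2(self.theta_A(x))))

model = NeuralSig(d=2000, a=32)   # illustrative sizes
# probs = model(x)                # x: one-hot paths -> HF risk in [0, 1]
```

In practice, [7] implements the signature as a differentiable layer of arbitrary depth; the explicit depth-2 version above only illustrates the shape of the computation.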

Experiments

As baselines, we consider a bag-of-words model with logistic regression, a simple and commonly used model, along with a GRU model, which is comparable with the state-of-the-art RETAIN [14, 48]. We also include a GRU variant that incorporates the time delta augmentation.

Additionally, we consider the following signature models: the standard signature (S) provides the baseline for further variations; the log-signature (LS) removes the universality property (the fully connected neural network classifier still guarantees this overall) but greatly reduces the number of signature terms; the lead-lag (LL) augmentation encourages information about the quadratic variation into lower-order signature terms; the add-time-index (ATI) augmentation provides sensitivity to parameterisation; and the add-time-delta (ATD) version goes further to account for non-uniform sampling rates. We limited our exploration of augmentations to the above after initial testing on validation data found the lead-lag augmentation to be the most influential.

Fig. 1 Overall experiment pipeline

Table 3 Sequential models compared

We use two metrics for evaluation: area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC).

Previous studies have shown that the AUROC can provide misleading results when there is considerable data imbalance, particularly when the number of negative examples is high and there is a preference for identifying true positive examples [46]. This issue exists in our task due to the 1:9 case-control split and the greater benefit of correctly identifying HF cases over correctly identifying controls. This class imbalance can inflate the AUROC through a high number of true negative cases. AUPRC is an alternative metric that captures the trade-off between precision and recall. Crucially, it ignores the number of true negatives, allowing changes in performance to be seen without being diluted as in AUROC.
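The effect is easy to reproduce on synthetic scores (a sketch of ours, not the study data), using scikit-learn's standard metric implementations:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.1).astype(int)      # ~1:9 case-control split
scores = y + rng.normal(0.0, 1.2, size=y.size)  # weakly informative scores

print(roc_auc_score(y, scores))            # buoyed by abundant true negatives
print(average_precision_score(y, scores))  # AUPRC: far less flattering
```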

The signature variations explored are summarised together with the baselines in Table 3. Common to each model is the architecture shown in Fig. 1. Further details on implementation, including initialisation, activation functions, optimisation, hyperparameters, regularisation, and other such related details are found in Appendix C.

Results

Table 4 Best performing models maximised for AUROC during hyperparameter optimisation

From Table 4, we observe similar predictive performance between signature models using lead-lag augmentations and GRU models over both corpora, with all metrics from the two sets of models remaining within one standard deviation of the performance seen on the validation data. All models perform the same or better on the larger PRIMDX-SECDX-PROC cohort, but more complex models benefit more from the added data.

The addition of time augmentations does not show a consistent improvement in performance over applying the lead-lag augmentation alone, and there is no consistent difference between adding a time index and a time delta. Increasing the depth of the signature to three also shows no significant increase in performance. Without the lead-lag transform, signature models perform similarly to the bag-of-words baselines.

Data ablation study

Our final set of experiments evaluates how the models perform as the volume of data is reduced. For the data ablation study, each trial randomly samples a proportion of the training and validation dataset. For each proportion, a new set of hyperparameters is found for each model.

The model parameters and hyperparameters are trained using 5-fold cross-validation on the sub-sample, while the remaining data is unused. The test data for the ablation study remains the same as in the main experiment.

Again, results are broadly similar for both models, except at three points where the two models produce performance outside one standard deviation of validation performance. Most notably, at the 20% data proportion, the signature model achieves a higher AUPRC of 0.283 versus 0.237 for the GRU. Overall, both models' performance begins to saturate at \(\sim 20\%\) of the data for both metrics, and the results show no conclusive trend as to which model performs best as the amount of training data is reduced.

Discussion

Given the properties that the two methods share, these results may not come as a surprise. However, without the lead-lag augmentation, the performance of the signature models drops significantly. This supports the prior belief that the quadratic variation of the path plays an important role in structured EHR HF prediction, which could correspond to encouraging features that describe changing comorbidities into the lower-order terms of the signature.

For the data ablation study, we expected the signature model to outperform the GRU baseline; however, results for both methods are similar. Our prior hypothesis was partly motivated by the success of signature methods in previous shallow machine learning tasks [38]. A key difference in our task could be the high dimensionality and the reliance on embeddings to make the signature tractable. Training these embeddings is likely data-intensive but could benefit from initialisation with pretrained word2vec embeddings, as has been shown for RNNs [12].

Comparison to previous literature

Comparing the results in this paper directly to previous work is challenging due to the use of different underlying study designs, populations, and incomplete definitions of cohorts and outcomes [17]. We note that previous works investigating sequential models for predicting HF on structured EHR data have reported greater performance [14, 44, 48]. In particular, these works also show a more significant performance difference between bag-of-words baselines and RNN-based architectures. Again, differences in the features and data sources used make comparisons difficult. For example, we could compare against Solares et al. [48], which also uses data from a multi-centre UK EHR data source and achieves 0.951 AUROC using the RNN-based model RETAIN. However, we must consider that the authors also include primary care and demographics data, which could influence prediction performance independently of model choice. The large US multi-centre study by [44] shows RETAIN achieving a more comparable AUROC of 0.769 on US healthcare data with a balanced cohort of 14,500 cases and with only diagnosis codes provided as prediction input. The same model achieved an AUROC of 0.822 when trained and tested on the full cohort of 152,790 cases and 1,152,517 controls with diagnosis, demographic, medication, and surgery data.

In this work, we have restricted our focus to prediction on high-dimensional structured EHR data. Signature methods have shown success in related health prediction applications, but in lower-dimensional, high-frequency data domains including: mood ratings for bipolar and borderline personality disorder [3], brain imaging data for Alzheimer's disease [36] and physiological data for sepsis prediction [38]. Future work could look to expand signature method applications within similar domains such as ECG signal diagnosis [6] and prediction systems for biogas production [8].

Conclusion

Given the prevalence of RNNs in current structured EHR architectures, any improvement to this fundamental component is likely to influence future work significantly. A substantial body of theory motivates the use of signature transforms to represent sequential data, and previous works have shown them to have strong empirical performance. In particular, recent work on neural-signature architectures has enabled their application to high-dimensional data.

This work is the first to show that neural-signature methods, with dimensionality reduction before the transform, are competitive on high-dimensional structured EHR data. Using an HF prediction task, we evaluated signature transforms as an alternative to RNNs that provides a predictive and compact representation of sequential structured EHR data. We show that the signature achieves comparable performance to RNNs and that the performance of both models saturates with a similar number of training examples. While the signature originates from abstract theory, it can empirically compete with current state-of-the-art architectures.