
1 Introduction

Chronic progressive diseases are a major drain on social and economic resources. Many of these diseases have no treatments and no cure. In particular, age-related chronic diseases such as neurodegenerative diseases of the brain are a global healthcare pandemic-in-waiting as most of the world’s population is living ever longer. A key example is Alzheimer’s disease—the leading cause of dementia—but there are numerous other conditions that cause abnormal deterioration of brain tissue, leading to loss of cognitive performance, bodily function, independence, and ultimately death. Despite the increasing socioeconomic burden, neurodegenerative disease research has made impressive progress in the past decade, driven largely by the availability of large observational datasets and the computational analyses they enable.

Understanding neurodegenerative diseases is vital if they are to be managed, or even cured, but our understanding remains poor despite impressive progress in recent years. This poor understanding can be attributed to the many challenges posed by neurodegenerative diseases: there is no well-defined time axis, owing in part to heterogeneity in onset, speed, and presentation, and data suffer from censoring and attrition, especially in later stages as patients deteriorate. These challenges, coupled with intense debate in the neurology community (hypothetical models [1, 2]) and the increasing availability of data, piqued the interest of computational researchers aiming to provide quantitative answers to the mysteries of neurodegenerative diseases. Approaches have ranged from vanilla off-the-shelf machine learning through to more holistic statistical modeling, the most advanced of which is data-driven disease progression modeling (D3PM).

D3PMs are defined by two key features: (1) they simultaneously reconstruct the disease timeline and estimate the quantitative disease signature/trajectory along this timeline; and (2) they are directly informed by observed data. D3PMs strike a balance between pure unsupervised learning, which requires truly big data, and traditional longitudinal modeling, which relies on a well-defined temporal axis—neither of which is available in neurodegenerative diseases. For a review of the history and development of D3PM, see ref. 3.

The goal of this chapter is to highlight selected key D3PMs in a practical manner. The focus is on model capabilities and data requirements, aiming to inform the reader’s D3PM analysis strategy based on the desired disease insight(s) and the data available. Figure 1 places selected D3PMs on a capability×data quadrant matrix: single timeline estimation vs subtyping, and cross-sectional vs longitudinal data availability. Table A.1 lists more methodological papers relevant to D3PM, with model innovations grouped by the original paper for that method.

Fig. 1
[Figure: 2 × 2 quadrant matrix of capability (single timeline vs subtypes) against data requirement (cross-sectional vs longitudinal), with EBM, DEBM, KDE-EBM, DPS, LTJMM, time-warping, GPPM, course maps, SuStaIn, SubLign, and course maps+ marked inside.]

Quadrant matrix. D3PMs all estimate a disease timeline, with some capable of estimating multiple subtype timelines, using either cross-sectional data (pseudo-timeline) or longitudinal data (time-shift). Abbreviations: EBM, event-based model; DEBM, discriminative EBM; KDE-EBM, kernel density estimation EBM; DPS, disease progression score; LTJMM, latent-time joint mixed model; GPPM, Gaussian process progression model; SuStaIn, subtype and stage inference; SubLign, subtyping alignment

Table A.1 A taxonomy and pedigree of D3PM papers. *Asterisks denote models for cross-sectional data

The chapter is organized as follows. It starts with a brief discussion of data preprocessing considerations in Subheading 2—an important step in medical data analysis. The treatment of D3PMs is separated into models for cross-sectional data (Subheading 3) and models for longitudinal data (Subheading 4), each split into approaches that estimate a single timeline of disease progression and those capable of estimating multiple timelines within a dataset (subtyping). Subheading 5 concludes.

For a detailed timeline of D3PM development including taxonomy and pedigree of key models, see Appendix.

2 Data Preprocessing

This section briefly touches on two common preprocessing steps performed before fitting a D3PM to data from a progressive condition such as an irreversible chronic disease: controlling for confounding variables and handling missing data. We refer to input features as biomarkers and use “covariate” and “confounder” interchangeably. Missing data can refer either to irregular/variable visits across individuals or to missing biomarker values when one or more measurements were not performed at a given visit. This section deals with the latter, since longitudinal models can typically handle irregular visits.

Controlling for confounding variables is an important element of any D3PM analysis. It helps to prevent the D3PM from learning non-disease-related patterns driven by confounding covariates. Confounders can be included as covariates in certain models, to account for that source of variation alongside other variables of interest. Another approach, often used for continuous-valued confounders, is to “regress out” this source of variation prior to fitting a model, removing non-disease-related signal from the data. This involves training regression models on data from control participants (who are not expected to develop the disease being studied) and then removing the corresponding trends from all data. The method can also be applied to categorical risk factors (discrete variables). The canonical example of a potentially confounding variable in neurodegenerative diseases of the brain is age, a key risk factor in many chronic diseases. Removing the normal aging signal is often phrased as “adjusting for” or “controlling for” age.
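For concreteness, the “regress out” approach can be sketched as follows. This is a minimal illustration, assuming a linear confounder trend; variable names are illustrative and not taken from any particular package.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def regress_out(X, confounder, is_control):
    """Remove a confounder's trend, learned from control participants only.

    X: (n_subjects, n_biomarkers) biomarker array
    confounder: (n_subjects,) confounder values, e.g., age
    is_control: (n_subjects,) boolean mask of control participants
    """
    c = np.asarray(confounder, dtype=float).reshape(-1, 1)
    X_adj = X.astype(float).copy()
    for j in range(X.shape[1]):
        # Fit the normal (control) trend for biomarker j...
        model = LinearRegression().fit(c[is_control], X[is_control, j])
        # ...and remove it from everyone, leaving disease-related residuals
        X_adj[:, j] = X[:, j] - model.predict(c)
    return X_adj
```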

Handling missing data is an active area of research with a considerable body of literature. Broadly speaking, there are two strategies. The easier is to exclude participants having any missing biomarker (or covariate) data, but this can considerably reduce the sample size available for D3PM analysis. The second is to impute the missing data, e.g., using group mean values. Imputation can be explicit or implicit. An example of implicit imputation arises in Bayesian models that map data to probabilities and then deal with missing data probabilistically, as in the event-based model [4], where P(event|x) = 0.5 represents maximal uncertainty, e.g., when a measurement x is missing.
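A minimal sketch of explicit group-mean imputation follows (illustrative only; real analyses may prefer multiple imputation or model-based handling):

```python
import numpy as np

def impute_group_mean(X, group):
    """Explicit imputation: replace missing values (NaNs) with the mean of
    the participant's group (e.g., diagnostic group). A simple sketch; in
    probabilistic models such as the EBM, a missing measurement x can
    instead be handled implicitly via P(event|x) = 0.5."""
    X = X.astype(float).copy()
    for g in np.unique(group):
        rows = np.where(group == g)[0]
        mu = np.nanmean(X[rows], axis=0)        # per-biomarker group means
        nan_r, nan_c = np.where(np.isnan(X[rows]))
        X[rows[nan_r], nan_c] = mu[nan_c]       # fill NaNs with group mean
    return X
```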

3 Models for Cross-Sectional Data

Box 1: Models for Cross-Sectional Data

  • Pro: Data-economical.

    Require cross-sectional data only.

  • Con: Limited forecasting utility.

    Forecasting requires augmentation with longitudinal data.

  • Key application(s): assessing disease severity from a single visit, e.g., economical stratification for clinical research/trials.

3.1 Single Timeline Estimation Using Cross-Sectional Data

There is only one framework for estimating disease timelines from cross-sectional data: event-based modeling.

3.1.1 Event-Based Model

The event-based model (EBM) emerged in 2011 [5, 6]. The concept is simple: in a progressive disease, biomarker measurements only ever get worse, i.e., become increasingly and irreversibly abnormal. Thus, among a cohort of individuals at different stages of a single progressive disease, the cumulative sequence of biomarker abnormality events can be inferred from only a single visit per individual. This requires making a few assumptions: measurements from individuals are independent and represent samples from a single sequence of cumulative abnormality, i.e., a single timeline of disease progression. Such assumptions are commonplace in many statistical analyses of disease progression and are reasonable approximations to make when analyzing data from research studies that typically have strict inclusion and exclusion criteria to focus on a single condition of interest. Unsurprisingly, the event-based model has proven to be extremely powerful, producing insight into many neurodegenerative diseases: sporadic Alzheimer’s disease [7,8,9,10], familial Alzheimer’s disease [6, 11], Huntington’s disease [6, 12], Parkinson’s disease [13], and others [14, 15].

3.1.1.1 EBM Fitting

The first step in fitting an event-based model maps biomarker values to abnormality values, similar to the hypothetical curves of biomarker abnormality proposed in 2010 [1, 2]. The EBM does this probabilistically, using two-component mixture modeling in which individuals can be labeled either as pre-event/normal or post-event/abnormal. This allows for later events that are yet to occur in patients and, conversely, for earlier events that may already have occurred in asymptomatic individuals. Various distributions have been proposed for this mixture modeling: combinations of uniform [5, 6], Gaussian [5,6,7], and kernel density estimate (KDE) distributions [9]. This is visualized in Fig. 2.

Fig. 2
[Figure: three histograms showing components one and two of a KDE mixture model at mixing proportions 0.50, 0.66, and 0.75.]

Event-based models fit a mixture model to map biomarker values to abnormality probabilities. Left to right shows the convergence of a kernel density estimate (KDE) mixture model. From Firth et al. [9] (CC BY 4.0)
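To make the mapping concrete, here is a minimal sketch using scikit-learn’s unconstrained two-component Gaussian mixture as a stand-in for the constrained Gaussian/KDE mixtures used in the published models:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def event_probabilities(x):
    """Map one biomarker's values to P(event | x) via a 2-component mixture.

    Assumes abnormality increases the measurement, so the component with
    the larger mean is taken as the post-event (abnormal) distribution.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, n_init=5).fit(x)
    posteriors = gmm.predict_proba(x)          # (n, 2) component posteriors
    abnormal = int(np.argmax(gmm.means_))      # index of larger-mean component
    return posteriors[:, abnormal]
```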

The second step in fitting an EBM over N events is to search the space of N! possible sequences S to reveal the most likely sequence (see refs. 6, 7, 9 for mathematical details). For small N ≲ 10, it can be computationally feasible to perform an exhaustive search over all N! possible sequences to find the maximum likelihood/a posteriori solution. More generally, the EBM uses multiply-initialized gradient ascent followed by MCMC sampling to estimate uncertainty in the sequence. The result is a model posterior consisting of samples from the posterior probability density for each biomarker as a function of sequence position. This is presented as a positional variance diagram [6], such as in Fig. 3.

Fig. 3
[Figure: positional variance diagram (heat map) of biomarkers, including smell, RBDSQ, MoCA, DTI ROIs 1–6, regional MRI (parietal, temporal, cingulate), and category/letter fluency, versus sequence position.]

The event-based model posterior is a positional variance diagram showing uncertainty (left-to-right) in the maximum likelihood sequence (top-to-bottom). Parkinson’s disease model from Oxtoby et al. [13] (CC BY 4.0)
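For small N, the exhaustive search can be written out directly. Below is a sketch of the standard EBM likelihood (see refs. 6, 7, 9) with a uniform prior over the N + 1 stages, assuming p_E and p_notE are precomputed arrays of per-biomarker likelihoods P(x|event) and P(x|no event) from the mixture modeling step:

```python
import itertools
import numpy as np

def log_likelihood(S, p_E, p_notE):
    """EBM log-likelihood of sequence S (a permutation of range(N)).

    p_E, p_notE: (n_subjects, N) likelihoods of each measurement given the
    event has / has not occurred. Each subject sits at one of N + 1 unknown
    stages, marginalized here with a uniform prior.
    """
    n, N = p_E.shape
    pe, pn = p_E[:, S], p_notE[:, S]    # reorder biomarkers by sequence S
    stage_lik = np.ones((n, N + 1))
    stage_lik[:, 0] = np.prod(pn, axis=1)
    for k in range(1, N + 1):
        # at stage k, events S[0..k-1] have occurred, the rest have not
        stage_lik[:, k] = np.prod(pe[:, :k], axis=1) * np.prod(pn[:, k:], axis=1)
    return np.sum(np.log(np.mean(stage_lik, axis=1)))

def exhaustive_search(p_E, p_notE):
    """Brute-force maximum likelihood sequence; feasible only for small N."""
    N = p_E.shape[1]
    return max(itertools.permutations(range(N)),
               key=lambda S: log_likelihood(np.array(S), p_E, p_notE))
```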

For further information and hands-on EBM tutorials, the reader is directed to the open-source kde_ebm package (github.com/ucl-pond/kde_ebm) and to disease-progression-modelling.github.io.

3.1.2 Discriminative Event-Based Model

The discriminative event-based model (DEBM) was proposed in 2017 by Venkatraghavan et al. [16]. Whereas the EBM treats data from individuals as observations of a single group-level disease cascade (sequence), the DEBM estimates individual-level sequences and combines them into a group-level description of disease progression. This is done using a Mallows model, which is the ranking/sequencing equivalent of a univariate Gaussian distribution, including estimation of a mean sequence and of the variance about this mean. Both EBM and DEBM estimate group-level biomarker abnormality using mixture modeling, and both approaches directly estimate uncertainty in the sequence.

Additionally, Venkatraghavan et al. [16, 17] also introduced a pseudo-temporal “disease time” that converts the DEBM posterior into a continuous measure of disease severity.

3.1.2.1 DEBM Fitting

As with the EBM, DEBM model fitting starts with mixture modeling (see Subheading 3.1.1). Next, a sequence is estimated for each individual by ranking the abnormality probabilities in descending order. A group-level mean sequence (with variance) is then estimated by fitting the individual sequences to a Mallows model. For details, see refs. 16, 17 and subsequent innovations to the DEBM. Notably, DEBM is often quicker to fit than EBM, which makes it appealing for high-dimensional extensions, e.g., aiming to estimate voxel-wise atrophy signatures from cross-sectional brain imaging data.
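Schematically, the individual-ranking step looks as follows, with a simple mean-rank (Borda-style) aggregation standing in for the full Mallows model estimation implemented in pyebm:

```python
import numpy as np

def debm_central_ordering(P):
    """P: (n_subjects, N) matrix of P(event | x) from mixture modeling.

    Each subject's sequence ranks biomarkers by descending abnormality
    probability; the central ordering sorts biomarkers by mean rank.
    This is a Borda-style approximation, not a full Mallows model fit.
    """
    ranks = np.argsort(np.argsort(-P, axis=1), axis=1)  # rank 0 = most abnormal
    return np.argsort(ranks.mean(axis=0))               # group-level sequence
```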

For further information and to try it out, the reader is directed to the open-source pyebm package (https://github.com/88vikram/pyebm).

3.2 Subtyping Using Cross-Sectional Data

Box 2: Subtyping Models

  • Pro: Uncovering heterogeneity without conflating severity with subtype.

    Evidence suggests that disease subtypes exist.

  • Con: Overly simplistic.

    Current models ignore comorbidity.

Augmenting the event-based model concept with unsupervised machine learning, subtype and stage inference (SuStaIn), was introduced by Young et al. [18]. This marriage of clustering to disease progression modeling has proven very powerful and popular, with high-impact results appearing in prominent journals for multiple brain diseases [19,20,21], chronic lung disease [22], and knee osteoarthritis [23]. SuStaIn’s popularity is perhaps unsurprising given that it was the first method capable of disentangling spatiotemporal heterogeneity (pathological severity across an organ) from phenotypic heterogeneity (disease subtypes) in progressive conditions using only cross-sectional data.

Figure 4 (adapted from [18]) shows the concept behind SuStaIn. SuStaIn iteratively solves the clustering problem from 1 to \( {N}_{\mathrm{S}}^{\mathrm{max}} \) subtypes. The \( {N}_{\mathrm{S}} \)-subtype model is fitted by splitting each of the \( {N}_{\mathrm{S}}-1 \) subtypes of the previous model into two clusters and then solving the \( {N}_{\mathrm{S}} \)-cluster problem, which produces \( {N}_{\mathrm{S}}-1 \) candidate \( {N}_{\mathrm{S}} \)-cluster models, from which the maximum likelihood model is chosen; the algorithm then continues to \( {N}_{\mathrm{S}}+1 \) subtypes, and so on.

Fig. 4
[Figure: schematics of SuStaIn: (a) underlying model of two subtypes evolving over time; (b) input data; (c) reconstructed disease subtypes and stages; (d) application, with per-individual probability bar graphs by subtype and stage.]

The concept of subtype and stage inference (SuStaIn). Reproduced from Young et al. [18] (CC BY 4.0)
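In outline, the model-selection loop described above can be sketched as follows; fit_one, split_and_refit, and log_lik are hypothetical placeholders for pySuStaIn’s actual machinery (sequence/z-score model fitting plus MCMC):

```python
def fit_sustain(data, n_max, fit_one, split_and_refit, log_lik):
    """Schematic of SuStaIn's iterative subtype splitting (Young et al. [18]).

    fit_one(data) fits a single-timeline model; split_and_refit(model, c, data)
    splits subtype c in two and refits all clusters jointly; log_lik scores a
    candidate model. All three are placeholders, not a real API.
    """
    model = fit_one(data)                  # 1-subtype (single timeline) model
    for n_s in range(2, n_max + 1):
        # try splitting each of the n_s - 1 current subtypes into two...
        candidates = [split_and_refit(model, c, data) for c in range(n_s - 1)]
        # ...and keep the maximum likelihood n_s-subtype model
        model = max(candidates, key=log_lik)
    return model
```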

Young et al. [18] also introduced the z-score event progression model that breaks down individual biomarker events into piecewise linear transitions between z-scores of interest. This removes the need for mixture modeling (such as in event-based modeling) and enables inference to be performed at subthreshold biomarker values.
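A minimal sketch of such a piecewise linear trajectory, assuming a biomarker passes through its z-score thresholds at known stages:

```python
import numpy as np

def zscore_trajectory(stage, event_stages, z_vals):
    """Piecewise linear biomarker trajectory for the z-score event model.

    event_stages: stages at which the biomarker reaches each z-score in
    z_vals (both ascending). Linear interpolation between (stage, z) knots;
    flat beyond the final knot (the z_max plateau).
    """
    knots_t = np.concatenate([[0.0], np.asarray(event_stages, float)])
    knots_z = np.concatenate([[0.0], np.asarray(z_vals, float)])
    return np.interp(stage, knots_t, knots_z)

# e.g., a biomarker reaching z = 1, 2, 3 at stages 2, 5, and 9:
# zscore_trajectory(np.linspace(0, 10, 6), [2, 5, 9], [1, 2, 3])
```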

3.2.1 SuStaIn Fitting

For the user, a SuStaIn analysis is very similar to an event-based model analysis. For further information, the reader is directed to the open-source pySuStaIn package [24] (https://github.com/ucl-pond/pySuStaIn), which includes tutorials. As well as the z-score progression model, pySuStaIn includes the various event-based models (see Subheading 3.1) and the more recent scored-events model for ordinal data [25], such as visual ratings of medical images.

4 Models for Longitudinal Data

Box 3: Models for Longitudinal Data

  • Pro: Good forecasting utility.

    High temporal precision allows individualized forecasting.

  • Con: Data-heavy.

    Require longitudinal data (multiple visits, years). Can be slow to fit.

  • Key application(s): assessing speed of disease progression and assessing individual variability.

The availability of longitudinal data has fueled development of more sophisticated D3PMs, inspired by mixed models. Mixed (effect) modeling is the workhorse of longitudinal statistical analysis against a known timeline, e.g., age. Mixed models provide a hierarchical description of individual-level variation (random effects) about group-level trends (fixed effects), hence the common parlance “mixed-effects” models. Many of the D3PMs for longitudinal data discussed below are in fact mixed models with an additional latent-time parameter that characterizes the disease timeline. Similar approaches in various fields are known as “self-modeling regression” or “latent-time” models. We focus on parametric models, but also mention nonparametric models, and an emerging hybrid discrete-continuous model.
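Schematically, such a model augments a standard mixed model with a per-individual latent time shift. In its simplest univariate form (notation ours; details vary between the models below):

\[
y_{ij} = f\!\left(t_{ij} + \delta_i\right) + b_i + \varepsilon_{ij},
\qquad b_i \sim \mathcal{N}\!\left(0, \sigma_b^2\right),
\quad \varepsilon_{ij} \sim \mathcal{N}\!\left(0, \sigma^2\right),
\]

where \(y_{ij}\) is a biomarker measured at visit \(j\) of individual \(i\), \(f\) is the group-level trajectory (fixed effects), \(b_i\) is a random effect, and the latent time shift \(\delta_i\) aligns individual \(i\) to the common disease timeline.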

4.1 Single Timeline Estimation Using Longitudinal Data

There are both parametric and nonparametric approaches to estimating disease timelines from longitudinal data. The common goal is to “stitch together” a full disease timeline (decades long) out of relatively short samples from individuals (a few years each) covering a range of severity in symptoms and biomarker abnormality. Some of the earliest work emerged from the medical image registration community, where “warping” images to a common template is one of the first steps in group analyses [26].

Broadly speaking, there are two categories of D3PMs for longitudinal data: time-shifting models and differential equation models. Time-shifting models translate/deform the individual data, metaphorically stitching them together into a quantitative template of disease progression. Differential equation models estimate a statistical model of biomarker dynamics in phase-plane space (position vs velocity), which is subsequently inverted to produce biomarker trajectories.

4.1.1 Explicit Models for Longitudinal Data: Latent-Time Models

Jedynak et al. [27] introduced the disease progression score (DPS) model in 2012, which aligns biomarker data from individuals to a group template model using a linear transformation of age into a disease progression score \( s_i = \alpha_i \, \mathrm{age} + \beta_i \). Each individual has their own rate of progression \( \alpha_i \) (assumed constant over the short observation period) and disease onset \( \beta_i \). Group-level biomarker dynamics are modeled as sigmoid (“S”) curves. A Bayesian extension of the DPS approach (BPS) appeared in 2019 [28]. Code for both the DPS and BPS was released publicly: https://www.nitrc.org/projects/progscore; https://hub.docker.com/r/bilgelm/bayesian-ps-adni/.
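To illustrate the idea (not the published fitting algorithm), one can alternate between fitting the group-level sigmoid and the per-individual time parameters; a minimal single-biomarker sketch:

```python
import numpy as np
from scipy.optimize import curve_fit, minimize

def sigmoid(s, lo, hi, mid, slope):
    """Group-level sigmoid trajectory as a function of progression score s."""
    return lo + (hi - lo) / (1.0 + np.exp(-slope * (s - mid)))

def fit_dps(ages, values, subject_ids, n_iter=20):
    """Alternate between the group sigmoid and per-individual (alpha_i, beta_i)
    in s_i = alpha_i * age + beta_i. Illustrative coordinate-wise fit only."""
    subjects = sorted(set(subject_ids))
    a = {i: 1.0 for i in subjects}   # per-individual progression rates
    b = {i: 0.0 for i in subjects}   # per-individual onsets/offsets
    theta = [values.min(), values.max(), np.median(ages), 0.1]
    for _ in range(n_iter):
        # update group-level sigmoid given current progression scores
        s = np.array([a[i] * t + b[i] for i, t in zip(subject_ids, ages)])
        theta, _ = curve_fit(sigmoid, s, values, p0=theta, maxfev=10000)
        # update each individual's (alpha_i, beta_i) given the sigmoid
        for i in subjects:
            m = np.array([sid == i for sid in subject_ids])
            obj = lambda p, m=m: np.sum(
                (values[m] - sigmoid(p[0] * ages[m] + p[1], *theta)) ** 2)
            a[i], b[i] = minimize(obj, x0=[a[i], b[i]]).x
    return a, b, theta
```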

In 2014, Donohue et al. [29] introduced a self-modeling regression approach similar to the DPS model. It was later generalized into the more flexible latent-time joint mixed (effects) model (LTJMM) [30], which can include covariates as fixed effects and is a flexible Bayesian framework for inference. The LTJMM software was released publicly: https://bitbucket.org/mdonohue/ltjmm.

A nonparametric latent-time mixed model appeared in 2017: the Gaussian process progression model (GPPM) of Lorenzi et al. [31]. This is a flexible Bayesian approach akin to (parametric) self-modeling regression that does not impose a parametric form on biomarker trajectories. More recent work supplemented the GPPM with a dynamical systems model of molecular pathology spread through the brain [32], which can regularize the GPPM fit to produce a more accurate disease timeline reconstruction while also providing insight into neurodegenerative disease mechanisms (a topic that could fill a standalone chapter of this book). The GPPM and GPPM-DS model source code was released publicly via gitlab.inria.fr/epione, and tutorials are available at disease-progression-modelling.github.io.

In 2015, Schiratti et al. [33,34,35] introduced a general framework for estimating spatiotemporal trajectories for any type of manifold-valued data. The framework is based on Riemannian geometry and a mixed-effects model with time reparametrization. It was subsequently extended by Koval et al. [36] to form the disease course mapping approach (available in the leaspy software package). Disease course mapping combines time warping (of age) and inter-biomarker spacing translation. Time warping changes disease progression dynamics—time shift/onset and acceleration/progression speed—but not the trajectory. Inter-biomarker spacings shift an individual’s trajectory to account for individual differences in the timing and ordering of biomarker trajectories.
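Schematically, the time warp at the heart of this framework is an affine reparametrization of age (notation simplified from refs. 33–36):

\[
\psi_i(t) = \alpha_i \left(t - t_0 - \tau_i\right) + t_0,
\qquad \alpha_i = e^{\xi_i},
\]

where \(t_0\) is a reference time, the time shift \(\tau_i\) captures individual \(i\)'s earlier or later onset, and the acceleration factor \(\alpha_i > 0\) captures faster or slower progression. Individual trajectories are the group-average trajectory evaluated at \(\psi_i(t)\), with additional inter-biomarker spacing parameters shifting its components.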

Figures 5 and 6 show example outputs of these models when trained on data from older people at risk of Alzheimer’s disease, including those with diagnosed mild cognitive impairment and dementia due to probable Alzheimer’s disease.

Fig. 5
[Figure: (a) normalized biomarker dynamics versus Alzheimer’s disease progression score (ADPS), with a companion plot of the timing of biomarker dynamics; (b) nine scatter plots of biomarker severity versus progression time.]

Two examples of D3PMs fit to longitudinal data: disease progression score [27] and Gaussian process progression model [31]. (a) Alzheimer’s disease progression score (2012) [27]. Reprinted from NeuroImage, Vol 63, Jedynak et al., A computational neurodegenerative disease progression score: Method and results with the Alzheimer’s Disease Neuroimaging Initiative cohort, 1478–1486, © (2012), with permission from Elsevier. (b) Gaussian process progression model (2017) [31]. Reprinted from NeuroImage, Vol 190, Lorenzi et al., Probabilistic disease progression modeling to characterize diagnostic uncertainty: Application to staging and prediction in Alzheimer’s disease, 56–68, © (2019), with permission from Elsevier

Fig. 6
[Figure: (a) population-level predicted severity versus age; (b) Alzheimer’s disease course map of neuropsychological assessments, cortical thickness, hippocampus volume, and metabolism at ages 62, 70, 78, and 86.]

Two additional examples of D3PMs fit to longitudinal data: latent-time joint mixed model [30] and disease course mapping [36]. (a) Latent-time joint mixed model (2017) [30]. From [37] (CC BY 4.0). (b) Alzheimer’s disease course map (2021) [36] (CC BY 4.0)

4.1.1.1 Fitting Longitudinal Latent-Time Models

Fitting D3PMs to longitudinal data is more complex than to cross-sectional data, and the software packages discussed above each expect the data in slightly different formats. One thing they have in common is that renormalization (e.g., min-max or z-score) and reorientation (e.g., so that all biomarkers increase with disease progression) are required to put biomarkers on a common scale and direction. In some cases, such preprocessing is necessary to ensure or accelerate model convergence. For example, the LTJMM used a quantile transformation followed by the inverse Gaussian quantile function to put all biomarkers on a Gaussian scale. For further detailed discussion, including model identifiability, we refer the reader to the original publications cited above and the didactic resources at disease-progression-modelling.github.io.
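For instance, a quantile-to-Gaussian transformation in the spirit of the LTJMM preprocessing can be sketched as follows (the original implementation details may differ):

```python
import numpy as np
from scipy.stats import norm, rankdata

def to_gaussian_scale(x):
    """Rank-based quantile transform followed by the inverse Gaussian CDF,
    putting a biomarker on an approximately standard normal scale.
    Multiply x by -1 first if the biomarker decreases with disease."""
    x = np.asarray(x, dtype=float)
    q = rankdata(x) / (len(x) + 1.0)   # empirical quantiles strictly in (0, 1)
    return norm.ppf(q)
```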

4.1.2 Implicit Models for Longitudinal Data: Differential Equation Models

Parametric differential equation D3PMs emerged between 2011 and 2014 [38,39,40,41], receiving a more formal treatment in 2017 [42]. In a hat-tip to physics, these have also been dubbed “phase-plane” models, which aids their understanding as models of velocity (biomarker progression rate) as a function of position (biomarker value). Fitting proceeds by estimating a phase-plane model from the observed data and then integrating it to recover the long-time biomarker trajectory.

A nonparametric differential equation D3PM using Gaussian processes (GP-DEM) was introduced in 2018 [11]. This added flexibility to the preceding parametric approaches and produced state-of-the-art results in predicting symptom onset in familial Alzheimer’s disease.

4.1.2.1 Fitting Differential Equation Models

The concept is shown in Fig. 7: differential equation model fitting is a three-step process. First, estimate a single “position” and “velocity” value per individual; for example, linear regression of an individual’s observations yields a position (e.g., the intercept or mean value) and a velocity (the gradient). Second, fit a group-level differential equation model of velocity y as a function of position x. Third, integrate/invert this model to produce a biomarker trajectory x(t). Differential equation models can be univariate or multivariate and can include covariates explicitly.

Fig. 7
[Figure: differential equation model fitting: (a) data; (b) phase-plane fit of y versus x; (c) estimated trajectory x(t); (d) stochastic model.]

Differential equation models, or phase-plane models, for biomarker dynamics involve a three-step process: estimate individual-level position and velocity; fit a group-level model of velocity y vs position x; and integrate to produce a trajectory x(t). Reprinted by permission from Springer Nature: Oxtoby, N.P. et al., Learning Imaging Biomarker Trajectories from Noisy Alzheimer’s Disease Data Using a Bayesian Multilevel Model. In: Cardoso, M.J., Simpson, I., Arbel, T., Precup, D., Ribbens, A. (eds) Bayesian and grAphical Models for Biomedical Imaging. Lecture Notes in Computer Science, vol 8677, pp. 85–94 © (2014) [41]. (a) Data. (b) Differential fit. (c) Est. trajectory. (d) Stochastic model
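A minimal sketch of the three steps for a single biomarker, using a quadratic phase-plane model for illustration:

```python
import numpy as np
from scipy.integrate import solve_ivp

def fit_phase_plane(t_by_subject, x_by_subject):
    """Steps 1-2: per-subject linear regression gives one (position, velocity)
    pair each; then fit velocity as a quadratic function of position."""
    pos, vel = [], []
    for t, x in zip(t_by_subject, x_by_subject):
        slope, intercept = np.polyfit(t, x, 1)
        pos.append(np.mean(x))      # position: mean biomarker value
        vel.append(slope)           # velocity: rate of change
    coeffs = np.polyfit(pos, vel, 2)        # dx/dt = f(x), quadratic f
    return np.poly1d(coeffs)

def integrate_trajectory(f, x0, t_span=(0.0, 20.0)):
    """Step 3: integrate dx/dt = f(x) to recover the long-time trajectory."""
    return solve_ivp(lambda t, x: f(x), t_span, [x0], dense_output=True)

# usage sketch: traj = integrate_trajectory(fit_phase_plane(ts, xs), x0=0.1)
```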

4.1.3 Hybrid Discrete-Continuous Models

Recent work introduced the temporal EBM (TEBM) [43, 44], which augments event-based modeling with hidden Markov modeling to produce a hybrid discrete-continuous D3PM. This is a halfway house between discrete models (great for medical decision making) and continuous models (great for detailed understanding of disease progression). Trained on data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), the TEBM revealed the full timeline of the pathophysiological cascade of Alzheimer’s disease, as shown in Fig. 8.

Fig. 8
[Figure: timeline of disease time in years marking events for ABETA, ventricles, TAU, PTAU, ADAS13, RAVLT, MMSE, hippocampus, entorhinal, fusiform, mid-temporal, and whole brain, with two staged example patients: patient Y (AD, age 88.5) and patient X (MCI, age 69.5).]

Alzheimer’s disease sequence and timeline estimated by a hybrid discrete-continuous D3PM: the temporal event-based model [43, 44]. Permission to reuse was kindly granted by the authors of [43]

4.2 Subtyping Using Longitudinal Data

Clustering longitudinal data without a well-defined time axis can be extremely difficult. Jointly estimating latent time for multiple trajectories is an identifiability challenge, i.e., multiple parameter combinations can explain the same data. This is particularly challenging when observations span a relatively small fraction of the full disease timeline, as in age-related neurodegenerative diseases.

Chen et al. [45] introduced SubLign for subtyping and aligning longitudinal disease data. The authors frame the challenge eloquently as having misaligned, interval-censored data: left censoring from patients being observed only after disease onset and right censoring from patient dropout in more severe disease. SubLign combines a deep generative model (based on a recurrent neural network [46]) for learning individual latent time-shifts and parametric biomarker trajectories using a variational approach, followed by k-means clustering. It was applied to data from a Parkinson’s disease cohort to recover some known clinical phenotypes in new detail.

Poulet and Durrleman [47] recently added mixture-model clustering to the nonlinear mixed model approach of disease course mapping [36]. The framework jointly estimates model parameters and subtypes using a modification of the expectation-maximization algorithm. In simulated data experiments, their approach outperforms a naive baseline. Experiments on real data in Alzheimer’s disease distinguished rapid from slow clinical progression, with minimal differences in biomarker trajectories.

5 Conclusion

Twenty-first century medicine faces many challenges due to aging populations worldwide, including increasing socioeconomic burden from age-related brain disorders like Alzheimer’s disease. Many failed clinical trials fueled intense debate in neurology in the first decade of this century, culminating in the prominent hypothesis of Alzheimer’s disease progression as a pathophysiological cascade of dynamic biomarker events. This inspired the emergence of data-driven disease progression modeling (D3PM) from the computer science community during the second decade of the twenty-first century—an explosion of quantitative models for neurodegenerative disease progression enabling numerous high-impact insights across multiple brain disorders. The community continues to build and share open-source code (see Box 4) and run machine learning challenges [48,49,50]. What will the third decade of the twenty-first century bring for this exciting subset of machine learning for brain disorders?