
1 Introduction

There is growing clinical evidence that structural and functional brain development and aging take heterogeneous paths within different subsets of the human population [1,2,3]. This heterogeneity has been largely overlooked in case-control study analyses, yielding a limited understanding of the diversity of underlying biological processes that might give rise to similar clinical phenotypes. The advent of high-throughput neuroimaging technologies and concentrated efforts to collect large-scale datasets [4, 5] provide a unique opportunity to dissect the structural and functional heterogeneity of brain disorders in finer detail and in an unbiased, data-driven manner. A developing body of work that leverages machine learning (ML) and neuroimaging seeks disease subtypes of neuropsychiatric and neurodegenerative disorders, including Alzheimer’s disease (AD) [6,7,8,9,10,11], schizophrenia [12, 13], and late-life depression [14].

Subtyping brain diseases is a clustering problem: the goal is to partition the set of patients into distinct and relatively homogeneous subgroups (i.e., subtypes). While clustering has been actively investigated in the computer science community, subtyping neuroimaging data faces a unique set of obstacles, such as the “curse of dimensionality” and confounding nuisance effects, including global demographics and scanner differences. Furthermore, brain development and pathologies often progress along a continuum, e.g., from a healthy state to preclinical stages to full-fledged disease [15], so that modeling directly in the patient domain may lead to a biased clustering solution. To tackle these problems, recent efforts have focused on developing semi-supervised [6, 8, 9, 16] and unsupervised clustering methods [10, 11]. Early studies mainly relied on unsupervised clustering methods, such as K-means [17] or hierarchical clustering [18], to derive data-driven subtypes from imaging data. However, such approaches directly partition the patients based on similarities/dissimilarities and are potentially biased by confounding factors, such as demographics, or by heterogeneity caused by unrelated pathological processes. More recently, semi-supervised clustering methods [6, 8, 9, 16] have been proposed to tackle this problem from a novel angle. To seek a pathology-oriented clustering solution, semi-supervised approaches dissect disease heterogeneity via the “1-to-k” mapping between a reference group (i.e., healthy controls (CN)) and the subgroups of the patient group (i.e., the k subtypes). This approach presumably zooms into the heterogeneity of pathological processes rather than unwanted heterogeneity in general. Furthermore, confounding variations, such as demographics, are often explicitly ruled out in these approaches.

Aiming to provide the reader in the imaging and machine learning community with a broad guideline in terms of methodology and clinical applications, we organize the remainder of this chapter as follows. In Subheading 2, we provide a brief overview of clustering methods, including unsupervised and semi-supervised approaches. Subheading 3 discusses their applications in various neurological and neuropsychiatric disorders and diseases. Subheading 4 concludes the paper by discussing our main observations, methodological limitations, and future directions.

2 Methodological Development Using Machine Learning and Neuroimaging

Machine learning and neuroimaging have brought unprecedented opportunities to elucidate disease heterogeneity in various brain disorders and diseases [19]. Several trailblazing methodological papers have recently been published [9,10,11], challenging the conventional approach to patient stratification that puts all patients into the same bucket. Among these, unsupervised [10, 11] and semi-supervised clustering methods [9] sought to derive biologically meaningful, data-driven disease subtypes, but they anchor the modeling from distinct perspectives. For conciseness, let us note that our imaging dataset contains q healthy control (CN) samples \( {\boldsymbol{X}}_r=\left[{\boldsymbol{x}}_1,\dots, {\boldsymbol{x}}_q\right],{\boldsymbol{X}}_r\in {\mathbb{R}}^{p\times q} \), representing our reference group, and m patient (PT) samples \( {\boldsymbol{X}}_t=\left[{\boldsymbol{x}}_1,\dots, {\boldsymbol{x}}_m\right],{\boldsymbol{X}}_t\in {\mathbb{R}}^{p\times m} \), representing the target subtype population. We denote the whole population as a matrix X organized by arranging each image as a vector per column, \( \boldsymbol{X}=\left[{\boldsymbol{x}}_1,\dots, {\boldsymbol{x}}_{q+m}\right],\boldsymbol{X}\in {\mathbb{R}}^{p\times \left(q+m\right)} \), where p is the number of features per image. We use binary labels to distinguish the patient and control groups, where 1 represents PT and −1 represents CN. Disease subtyping seeks to identify k clusters in the patient group that are neuroanatomically distinct and clinically relevant.

2.1 Unsupervised Clustering

Many recent efforts to discover the heterogeneous nature of brain diseases have investigated different unsupervised clustering algorithms [10, 11, 20,21,22,23,24,25,26,27,28,29,30,31,32]. Among these approaches, the key clustering methods are often K-means, hierarchical clustering, and nonnegative matrix factorization (NMF) (Fig. 1). In this subsection, we first briefly go through these methods. Subsequently, we focus on two representative models building on these unsupervised methods, i.e., Sustain [10] and latent Dirichlet allocation [11].

Fig. 1
Schematic diagram of representative unsupervised clustering methods: (1) K-means, with clusters and their centroids; (2) NMF, factorizing X into the component matrix C and the loading matrix L; and (3) hierarchical clustering

2.1.1 K-Means Clustering

K-means clustering aims to directly partition the m patients into k clusters. Each patient belongs to the cluster with the nearest mean (i.e., cluster centroid), as quantified by a distance metric of choice (e.g., Euclidean distance). Since finding the global minimum of the clustering objective is computationally difficult (NP-hard), the K-means algorithm searches for a local minimum via iterative refinement. Given an initial set of k centroids, this usually involves two steps: (i) the assignment step, which assigns each data point to the cluster whose centroid has the smallest squared Euclidean distance, and (ii) the update step, which recalculates the mean (centroid) of the data points assigned to each cluster. The two steps alternate until convergence, i.e., until the assignments no longer change. More details regarding the K-means algorithm are provided in Chap. 2, Subheading 12.1. Please refer to [33,34,35] for representative studies using K-means for disease subtyping.
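
To make the two-step procedure concrete, the following minimal sketch clusters a toy patient feature matrix with scikit-learn; the variable names (X_patients, k) and the synthetic data are purely illustrative and not tied to any specific study.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_patients = rng.normal(size=(200, 50))   # toy data: m=200 patients, p=50 imaging features

k = 3                                     # hypothesized number of subtypes
kmeans = KMeans(n_clusters=k, n_init=20, random_state=0)
labels = kmeans.fit_predict(X_patients)   # assignment step + update step, iterated to convergence

centroids = kmeans.cluster_centers_       # one centroid (mean imaging pattern) per subtype
print(np.bincount(labels))                # subtype sizes

In practice, the number of clusters k is usually selected via internal criteria such as cluster stability or silhouette scores, since no ground-truth subtype labels are available.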

2.1.2 NMF Clustering

Nonnegative matrix factorization (NMF) is a method that implicitly performs clustering by taking advantage of the fact that complex patterns can be construed as a sum of simple parts. In essence, the input data Xt is factorized into two nonnegative matrices \( \boldsymbol{C}\in {\mathbb{R}}^{p\times k} \) and \( \boldsymbol{L}\in {\mathbb{R}}^{k\times m} \), which we refer to as the component matrix and the loading coefficient matrix, respectively. This method has been widely used as an effective dimensionality reduction technique in signal processing and image analysis [36]. By its nature, the L matrix can be directly used for clustering purposes, which is analogous to K-means if we impose an orthogonality constraint on the L matrix. Specifically, if Lkj > Lij for all i ≠ k, the data point xj is assigned to the k-th cluster. The column vectors of the C matrix indicate the cluster centroids. Please refer to [32] for a representative study using NMF for disease subtyping.
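
As an illustration, the sketch below (assuming scikit-learn and a toy nonnegative feature matrix; all variable names are ours) factorizes Xt into C and L following the chapter’s convention of one subject per column and assigns each patient to the component with the largest loading. Note that scikit-learn’s generic NMF is used here as a stand-in, not the specific NMF variants cited above.

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X_t = rng.random((100, 200))        # toy data: p=100 nonnegative features x m=200 patients

k = 3
nmf = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
C = nmf.fit_transform(X_t)          # component matrix, shape (p, k): cluster "centroid" patterns
L = nmf.components_                 # loading coefficient matrix, shape (k, m)

labels = L.argmax(axis=0)           # assign patient j to the component with the largest loading
print(np.bincount(labels))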

2.1.3 Hierarchical Clustering

Hierarchical clustering aims to build a hierarchy of clusters and includes two types of approaches: agglomerative and divisive [18]. In general, the merges and splits are determined greedily and presented in a dendrogram. In both cases, a measure of dissimilarity between sets of observations is required. Most commonly, this is achieved by using an appropriate metric (e.g., Euclidean distance) together with a linkage criterion that specifies the dissimilarity of sets as a function of the pairwise distances of observations. Please refer to [24, 25, 30, 37, 38] for representative studies using hierarchical clustering for disease subtyping.
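
A minimal agglomerative example using SciPy is sketched below; the Ward linkage and the cut into three clusters are arbitrary illustrative choices, and the toy data are ours.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X_patients = rng.normal(size=(200, 50))                      # toy patient feature matrix

Z = linkage(X_patients, method="ward", metric="euclidean")   # greedy agglomerative merges (dendrogram encoding)
labels = fcluster(Z, t=3, criterion="maxclust")              # cut the tree into 3 clusters
print(np.bincount(labels))                                   # note: fcluster labels are 1-based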

2.1.4 Representative Unsupervised Clustering Methods

Sustain [10] is an unsupervised clustering method for subtype and stage inference. Specifically, Sustain defines a subtype as a group of subjects with a particular biomarker progression pattern. The biomarker evolution of each subtype is modeled as a linear z-score model, a continuous generalization of the original event-based model [39], in which each biomarker follows a piecewise linear trajectory over a common timeframe. The key advantage of this model is that it can work with purely cross-sectional data and derive imaging signatures of subtype and stage simultaneously.

A Bayesian latent Dirichlet allocation model [11] was proposed to extract latent AD-related atrophy factors. This probabilistic approach hypothesizes that each patient expresses one or more latent factors, and that each factor is associated with distinct but possibly overlapping atrophy patterns. However, due to the nature of latent Dirichlet allocation, the input images have to be discretized. Moreover, this method exclusively models brain atrophy and ignores brain enlargement; for example, larger basal ganglia volumes have been associated with one subtype of schizophrenia [12].

2.2 Semi-supervised Clustering

Semi-supervised clustering methods dissect the subtle heterogeneity of interest under the principle of deriving data-driven yet neurobiologically plausible subtypes (Fig. 2). In essence, these methods seek the “1-to-k” mapping between the reference CN group and the PT group, thereby teasing out clusters that are likely driven by distinct pathological trajectories rather than by global similarity/dissimilarity in the data, which is the driving principle of conventional unsupervised clustering methods.

Fig. 2
Schematic diagram of semi-supervised clustering methods, illustrating the “1-to-k” mapping from the reference group to subtypes 1, 2, and 3. Figure adapted from [14]

In the following subsections, we briefly discuss four semi-supervised clustering methods. These methods employ different techniques to seek this “1-to-k” mapping. In particular, CHIMERA [16] and Smile-GAN [9] utilize generative models to achieve this mapping, while HYDRA [6] and MAGIC [8] are built on top of discriminative models.

Box 1: Representative Semi-supervised Clustering Methods

The central principle of semi-supervised clustering methods is to seek the “1-to-k” mapping from the reference domain to the patient domain.

  • CHIMERA: a generative approach that leverages the coherent point drift algorithm to map the data distribution of the CN group to that of the PT group, thereby enabling subtyping via the k distinct regularized transformations.

  • Smile-GAN: a generative approach based on GANs that learns multiple distinct mappings by generating PT data from CN data. Simultaneously, a clustering model is trained interactively with the mapping functions to assign PT samples to the corresponding subtype memberships.

  • HYDRA: a discriminative approach that leverages multiple linear support vector machines to construct a polytope that clusters the patients according to the patterns of differences between the CN group and the PT group.

  • MAGIC: a generalization of HYDRA that aims to dissect disease heterogeneity at multiple imaging scales for a scale-consistent solution.

2.2.1 CHIMERA

CHIMERA employs a generative probabilistic approach, considers all samples as points in the imaging space, and infers the clusters from the transformations between the CN and PT distributions. It hypothesizes that the PT distribution can be generated from the CN distribution under k sets of transformations, each reflecting a distinct disease process.

Mathematically, the transformation T is a convex combination of k linear transformations that map a CN subject from the reference space to the target space: \( {\boldsymbol{x}}_i^r\in {\mathbb{R}}^p\to {\boldsymbol{x}}_i^t=\boldsymbol{T}\left({\boldsymbol{x}}_i^r\right)={\sum}_{j=1}^k{\xi}_j{\boldsymbol{T}}_j\left({\boldsymbol{x}}_i^r\right) \), where ξj is the probability that a patient belongs to the j-th subtype. Ideally, if the disease subtypes were perfectly distinct, ξj would take the value 1 for the transformation corresponding to the patient’s subtype and 0 otherwise. At its core, the coherent point drift algorithm [40], a generative probabilistic approach, is used to estimate the transformation T. Specifically, each CN sample is mapped into the PT domain and regarded as the centroid of a spherical Gaussian cluster, whereas the patient points are treated as independent and identically distributed data generated by a Gaussian mixture model (GMM) with equal weights for each cluster. The goal is to maximize the data likelihood during distribution matching while also accounting for covariate confounds (e.g., age and gender). The expectation-maximization algorithm is adopted to optimize the resulting energy objective. Once the optimized transformations Tj are obtained, clustering inference is straightforward: a patient is assigned the subtype membership corresponding to the largest likelihood.
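
To make the inference step concrete, the following toy sketch assigns patients to subtypes given k already-estimated linear transformations (the EM estimation itself is not shown). The function name assign_subtypes, the spherical unit covariance, and the synthetic data are our own illustrative assumptions, not part of the CHIMERA implementation.

import numpy as np

def assign_subtypes(X_cn, X_pt, transforms, sigma2=1.0):
    # X_cn: (q, p) controls; X_pt: (m, p) patients; transforms: list of k (A, b) linear maps.
    m, k = X_pt.shape[0], len(transforms)
    mix_lik = np.zeros((m, k))
    for j, (A, b) in enumerate(transforms):
        centers = X_cn @ A.T + b                                       # CN points mapped by T_j into the PT domain
        d2 = ((X_pt[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (m, q) squared distances
        mix_lik[:, j] = np.exp(-d2 / (2.0 * sigma2)).mean(axis=1)      # spherical-Gaussian mixture (up to a constant)
    return mix_lik.argmax(axis=1)                                      # subtype with the highest likelihood

# Toy usage with two "subtypes" generated by shifting the controls in different directions.
rng = np.random.default_rng(0)
p, q = 10, 50
X_cn = rng.normal(size=(q, p))
transforms = [(np.eye(p), 2.0 * np.eye(p)[0]), (np.eye(p), 2.0 * np.eye(p)[1])]  # pretend these were estimated by EM
X_pt = np.vstack([X_cn[:25] + transforms[0][1], X_cn[25:] + transforms[1][1]])
print(np.bincount(assign_subtypes(X_cn, X_pt, transforms)))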

2.2.2 Smile-GAN

Smile-GAN is a novel generative deep learning approach based on generative adversarial networks (GANs). The reader may refer to Chap. 5 for generic information about GANs. Smile-GAN aims to learn a mapping function f from the joint CN domain \( \mathcal{X} \) and subtype domain \( \mathcal{Z} \) to the PT domain \( \mathcal{Y} \) by transforming CN data x into different synthesized PT data y′ = f(x, z) that are indistinguishable from real PT data y by the discriminator D. The mapping function f is regularized for inverse consistency, with a clustering function \( g:\mathcal{Y}\to \mathcal{Z} \) trained interactively to reconstruct z from the synthesized PT data y′. After training, the clustering function g can be directly used to cluster both training and unseen test data.

More specifically, the three data distributions are denoted as x ∼ pCN (for controls), y ∼ pPT (for patients), and z ∼ pSub (for subtypes), where z is sampled from a discrete uniform distribution and encoded as a one-hot vector of dimension K (the number of clusters). The mapping function \( f:\mathcal{X}\times \mathcal{Z}\to \mathcal{Y} \) and the clustering function \( g:\mathcal{Y}\to \mathcal{Z} \) are learned through the following training procedure (lc denotes the cross-entropy loss):

$$ f,g=\arg \underset{f,g}{\min}\underset{D}{\max }{L}_{\mathrm{GAN}}\left(D,f\right)+\mu {L}_{\mathrm{change}}(f)+\lambda {L}_{\mathrm{cluster}}\left(f,g\right) $$
(1)

where

$$ {\displaystyle \begin{array}{rl}{L}_{\mathrm{GAN}}\left(D,f\right)& ={\mathbbm{E}}_{\mathbf{y}\sim {p}_{\mathrm{PT}}}\left[\log \left(D\left(\boldsymbol{y}\right)\right)\right]\\ {}& \kern1em +{\mathbbm{E}}_{\mathbf{z}\sim {p}_{\mathrm{Sub}},\mathbf{x}\sim {p}_{\mathrm{CN}}}\left[\log \left(1-D\left(f\left(\boldsymbol{x},\boldsymbol{z}\right)\right)\right)\right]\end{array}} $$
(2)
$$ {L}_{\mathrm{change}}(f)\kern0.5em ={\mathbbm{E}}_{\mathbf{x}\sim {p}_{\mathrm{CN}},\mathbf{z}\sim {p}_{\mathrm{Sub}}}\left[{\left\Vert f\Big(\boldsymbol{x},\boldsymbol{z}\Big)-\boldsymbol{x}\right\Vert}_1\right]\kern0.5em $$
(3)
$$ {L}_{\mathrm{cluster}}\left(f,g\right)\kern0.5em ={\mathbbm{E}}_{\mathbf{x}\sim {p}_{\mathrm{CN}},\mathbf{z}\sim {p}_{\mathrm{Sub}}}\left[{l}_c\left(\boldsymbol{z},g\left(f\left(\boldsymbol{x},\boldsymbol{z}\right)\right)\right)\right]\kern0.5em $$
(4)

The objective consists of the adversarial loss LGAN and two regularization terms, Lchange and Lcluster. The adversarial loss forces the synthesized PT data to follow a distribution similar to that of the real PT data. The discriminator D, which tries to distinguish synthesized from real PT data, attempts to maximize this loss, while the mapping function f attempts to minimize it. The two regularization terms constrain the function class from which the mapping function f is drawn, so that f remains meaningful while matching the distributions. Minimizing Lchange encourages sparsity of the regions changed by f, under the assumption that only some regions are affected by the disease. Optimizing Lcluster ensures that the subtype variable z can be reconstructed from the synthesized PT data y′, so that the mutual information between z and y′ is maximized and distinct imaging patterns are synthesized when z takes different values. Further regularization is imposed by forcing the mapping function f and the clustering function g to be Lipschitz continuous. More importantly, thanks to the inverse consistency induced by Lcluster, the function g can directly output cluster probabilities and cluster labels for unseen test PT data.
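
The sketch below illustrates how the three loss terms of Eq. 1 could be computed for a single toy batch in PyTorch; the tiny multilayer perceptrons, hyperparameter values, and variable names are our own simplifications and do not reproduce the authors’ architecture or training schedule (e.g., the Lipschitz regularization is omitted).

import torch
import torch.nn as nn
import torch.nn.functional as F

p, K = 50, 3                                      # toy feature dimension and number of subtypes
f = nn.Sequential(nn.Linear(p + K, 64), nn.ReLU(), nn.Linear(64, p))            # mapping f(x, z) -> synthesized PT
g = nn.Sequential(nn.Linear(p, 64), nn.ReLU(), nn.Linear(64, K))                # clustering g(y) -> logits over z
D = nn.Sequential(nn.Linear(p, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())  # discriminator

x = torch.randn(32, p)                            # a batch of CN data (toy)
y = torch.randn(32, p) + 1.0                      # a batch of real PT data (toy)
z_idx = torch.randint(0, K, (32,))                # z ~ discrete uniform
z = F.one_hot(z_idx, K).float()                   # one-hot subtype variable

y_fake = f(torch.cat([x, z], dim=1))              # y' = f(x, z)

L_gan = torch.log(D(y)).mean() + torch.log(1 - D(y_fake)).mean()  # adversarial term (Eq. 2)
L_change = (y_fake - x).abs().sum(dim=1).mean()                   # L1 change penalty (Eq. 3)
L_cluster = F.cross_entropy(g(y_fake), z_idx)                     # reconstruct z from y' (Eq. 4)

mu, lam = 5.0, 1.0                                  # illustrative weights
gen_loss = L_gan + mu * L_change + lam * L_cluster  # minimized w.r.t. f and g (D held fixed)
disc_loss = -L_gan                                  # D maximizes L_gan, i.e., minimizes its negative
# In practice, separate optimizers alternate between (f, g) and D; after training,
# applying g to real PT data yields subtype probabilities for unseen patients.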

2.2.3 HYDRA

In contrast to the generative approaches used in CHIMERA and Smile-GAN, HYDRA leverages a widely used discriminative method, the support vector machine (SVM), to seek the “1-to-k” mapping. The novelty is that HYDRA extends multiple linear SVMs to the nonlinear case in a piecewise fashion, thereby simultaneously serving classification and clustering. Specifically, it constructs a convex polytope by combining the hyperplanes of k linear SVMs, separating the CN group from the k subpopulations of the PT group. Intuitively, each face of the convex polytope can be regarded as encoding one subtype, capturing a distinct disease effect.

The convex polytope is estimated by sequentially solving each linear SVM as a subproblem under the principle of sample-weighted SVM [41]. The optimization stops when the sample weights become stable, i.e., when the polytope is stably established. The objective of maximizing the polytope’s margin can be summarized as

$$ {\displaystyle \begin{array}{rl}\underset{{\left\{{\boldsymbol{w}}_j,{b}_j\right\}}_{j=1}^k}{\min }& \sum \limits_{j=1}^k\frac{{\left\Vert {\boldsymbol{w}}_j\right\Vert}_2^2}{2}+\mu \sum \limits_{i\mid {y}_i=+1}\sum \limits_{j=1}^k{s}_{i,j}\max \left\{0,1-{\boldsymbol{w}}_j^T{\boldsymbol{x}}_i-{b}_j\right\}\\ {}& \kern1em +\mu \sum \limits_{i\mid {y}_i=-1}\sum \limits_{j=1}^k\frac{1}{k}\max \left\{0,1+{\boldsymbol{w}}_j^T{\boldsymbol{x}}_i+{b}_j\right\}\end{array}} $$
(5)

where wj and bj are the weight and bias of each hyperplane, respectively, μ is a penalty parameter on the training error, and S is the subtype membership matrix of dimension m × k, indicating whether patient sample i belongs to subtype j. The cluster membership is inferred as follows:

$$ {\boldsymbol{S}}_{i,j}=\left\{{\displaystyle \begin{array}{ll}1,\kern1em & j=\arg \kern0.2em {\max}_j\left({\boldsymbol{w}}_j^T{\boldsymbol{x}}_i+{b}_j\right)\\ {}0,\kern1em & \mathrm{otherwise}\end{array}}\right. $$
(6)
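
The alternating optimization can be sketched as follows using sample-weighted linear SVMs from scikit-learn. This is a simplified illustration of Eqs. 5 and 6 under our own assumptions (the function name hydra_like, the weight floor, and the stopping rule are ours), not the reference HYDRA implementation.

import numpy as np
from sklearn.svm import LinearSVC

def hydra_like(X_cn, X_pt, k=3, C=1.0, n_iter=20, seed=0):
    """Simplified sketch of HYDRA-style alternating optimization (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = np.vstack([X_cn, X_pt])
    y = np.hstack([-np.ones(len(X_cn)), np.ones(len(X_pt))])                # CN = -1, PT = +1
    S = rng.multinomial(1, np.ones(k) / k, size=len(X_pt)).astype(float)    # random hard initialization of memberships

    for _ in range(n_iter):
        scores = np.zeros((len(X_pt), k))
        for j in range(k):
            w = np.hstack([np.full(len(X_cn), 1.0 / k),          # each control counts 1/k for every face
                           np.maximum(S[:, j], 1e-3)])           # patients weighted by membership in face j
            svm = LinearSVC(C=C, max_iter=5000).fit(X, y, sample_weight=w)
            scores[:, j] = svm.decision_function(X_pt)           # w_j^T x + b_j for each patient
        S_new = np.eye(k)[scores.argmax(axis=1)]                 # Eq. 6: assign to the face with the largest score
        if np.array_equal(S_new, S):                             # stop when memberships are stable
            break
        S = S_new
    return S.argmax(axis=1)

# Toy usage on synthetic data with two patient subgroups.
rng = np.random.default_rng(1)
X_cn = rng.normal(size=(100, 20))
X_pt = np.vstack([rng.normal(size=(50, 20)) + 2, rng.normal(size=(50, 20)) - 2])
print(np.bincount(hydra_like(X_cn, X_pt, k=2)))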

2.2.4 MAGIC

MAGIC was proposed to overcome one of the main limitations that HYDRA faces: a single-scale set of features (e.g., atlas-based regions of interest) may not be sufficient to capture the subtle differences, relative to global demographic effects, that underlie disease heterogeneity, since ample evidence has shown that the brain is fundamentally composed of multi-scale structural and functional entities. To this end, MAGIC extracts multi-scale features in a coarse-to-fine granular fashion via stochastic orthogonal projective nonnegative matrix factorization (opNMF) [42], an effective, unbiased, and data-driven method for extracting biologically interpretable and reproducible feature representations. Together with these multi-scale features, HYDRA is embedded into a double-cyclic optimization procedure to yield robust and scale-consistent clustering solutions.

MAGIC encapsulates the two previously proposed methods (i.e., opNMF and HYDRA) and optimizes the clustering objective for each single-scale feature set as a sub-optimization problem. To fuse the multi-scale clustering information and enforce scale-consistent clusters, it adopts a double-cyclic procedure that transfers and fine-tunes the clustering polytope. First, in the inner cyclic procedure, recall that HYDRA determines the clusters through the subtype membership matrix S: MAGIC initializes the S matrix with a specific single-scale feature set Li, and the S matrix is then transferred to the next feature set Li+1 until a predefined stopping criterion is reached (i.e., the clustering solution is stable across scales). Second, in the outer cyclic procedure, the inner cyclic procedure is repeated, initializing with each single-scale feature set in turn. Finally, to determine the final subtype assignment, a consensus clustering is performed by computing a co-occurrence matrix from all the clustering results and then applying spectral clustering [43].
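
As an illustration of the final consensus step only (the double-cyclic transfer of S across scales is not shown), the sketch below builds a co-occurrence matrix from several clustering results and applies spectral clustering; the function name and the toy label sets are ours.

import numpy as np
from sklearn.cluster import SpectralClustering

def consensus_labels(label_sets, k):
    """Fuse clustering results from multiple scales via a co-occurrence matrix (illustrative sketch)."""
    label_sets = np.asarray(label_sets)              # shape (n_scales, m): one label vector per scale
    m = label_sets.shape[1]
    co = np.zeros((m, m))
    for labels in label_sets:
        co += (labels[:, None] == labels[None, :])   # +1 each time two patients land in the same cluster
    co /= len(label_sets)                            # fraction of scales on which they co-cluster
    sc = SpectralClustering(n_clusters=k, affinity="precomputed", random_state=0)
    return sc.fit_predict(co)

# Toy usage: three "scales" that largely agree on two subtypes of eight patients.
runs = [[0, 0, 0, 0, 1, 1, 1, 1],
        [0, 0, 0, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 0, 0, 0]]                    # label permutations across scales are handled naturally
print(consensus_labels(runs, k=2))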

3 Application to Brain Disorders

Brain disorders and diseases affect the human brain across a wide age range. Neurodevelopmental disorders, such as autism spectrum disorders (ASD), are usually present from early childhood and affect daily functioning [44]. Psychotic disorders, such as schizophrenia, involve psychosis that is typically diagnosed for the first time in late adolescence or early adulthood [45]. Dementia and mild cognitive impairment (MCI) prevail both in late mid-life for early-onset AD (usually 30–60 years of age) and most frequently in late-life for late-onset AD (usually over 65 years of age) [46]. Brain cancers in children and adults are heterogeneous and encompass over 100 different histological types of tumors, based on cells of origin and other histopathological features, and have substantial morbidity and mortality [47]. Ample clinical evidence encourages the stratification of the patients in these brain disorders and cancers, potentially paving the road toward individualized precision medicine.

This section collectively overviews previous work aiming to unravel imaging-derived heterogeneity in ASD, psychosis, major depressive disorders (MDD), MCI and AD, and brain cancer.

3.1 Autism Spectrum Disorder

ASD encompasses a broad spectrum of social deficits and atypical behaviors [48]. Heterogeneity of its clinical presentation has sparked massive research efforts to find subtypes to better delineate its diagnosis [49, 50]. Recent initiatives to aggregate neuroimaging data of ASD, such as the ABIDE [51] and the EU-AIMS [52], also have motivated large-scale subtyping projects using imaging signatures [53].

Different clustering methods have been applied to reveal structural brain-based subtypes, but these were primarily traditional techniques such as K-means [54] or hierarchical clustering [37]. Besides structural MRI, functional MRI [55] and EEG [56] have also been popular modalities. For the reasons discussed earlier, normative clustering and dimensional analyses are better suited to parse a patient population that is highly heterogeneous [57]. However, efforts in this direction remain preliminary, with only a few recent publications using cortical thickness [58]. Taken together, although more validation and replication efforts are necessary to define reliable neuroanatomical subtypes of ASD, some convergence in findings has been noted [53]. First, most sets of ASD neuroimaging subtypes indicate a combination of both increases and decreases in imaging features compared to the CN group, instead of pointing in a uniform direction. Second, most subtypes are characterized by spatially distributed imaging patterns rather than isolated or focal patterns. Both findings emphasize the significant heterogeneity of ASD brains and the need for better stratification.

The search for subtypes in the ASD population has unique challenges. First, the early onset of ASD implies that it is heavily influenced by neurodevelopmental processes; depending on the selected age range, the results may differ significantly. Second, ASD is more prevalent in males, with three to four male cases for every female case [59], which adds a layer of potential bias. Third, individuals with ASD often have psychiatric comorbidities, such as ADHD, anxiety disorders, and obsessive-compulsive disorder, among many others [60], which, if not screened carefully, can dilute or alter the true signal.

3.2 Psychosis

Psychosis is a medical syndrome characterized by unusual beliefs called delusions and sometimes hallucinations of visions, sounds, smells, or body sensations that are not present in reality. Symptoms, functioning, and outcomes are highly heterogeneous across individuals, leading to long-standing hypotheses of underlying brain subgroups. However, objective brain biomarkers have largely not been discovered for any psychosis diagnosis, stage, or clinically defined subgroup [61, 62]. Neuroimaging studies are also affected by brain heterogeneity [63, 64]. Recent research has thus focused on finding structural brain subtypes using unbiased statistical techniques [12, 13, 65].

Psychosis studies have mainly focused on determining subtypes by clustering brain structural data within the chronic schizophrenia population that has had the illness for years, with results demonstrating two [12, 13], three [26], and six [31] subgroups. Various clustering techniques have been used to achieve these outcomes, including conventional approaches, such as k-means, in addition to more advanced machine learning methods, such as semi-supervised learning. A limitation of the work so far has been the lack of internal or external validation. Still, in studies with robust internal validation methods using metrics that choose the optimal cluster number based on the stability of the solution (e.g., consensus clustering), subtypes cluster along the lines of the severity of brain differences.

In a recent study with the largest sample to date (n = 671), individuals with chronic schizophrenia were clustered using HYDRA, and multiple internal validation procedures were applied (i.e., cross-validation resampling, split-half reproducibility, and leave-site-out validation) [12]. A two-subtype solution was found, with one subtype demonstrating widespread reductions and the other showing a localized larger volume of the striatum that was not associated with antipsychotic use. Interestingly, there were limited associations with current psychosis symptoms in this work, but there were indications of associations with education and illness duration in specific subtypes.

Functional imaging has also been used to define psychosis subgroups, using functional connectivity at rest [66] and effective connectivity during task performance [67]. This research commonly involves relatively small sample sizes and little internal or external validation. Still, preliminary results from these works demonstrate that clusters can follow diagnostic divisions between individuals with psychosis [67] and that specific networks (e.g., the frontoparietal network) are associated with specific psychotic symptoms [66, 67]. A recent advanced deep learning approach has also revealed clinical separations along the lines of symptom severity [68]. Taken together with the brain structural results, it is possible that functional imaging maps onto symptom states rather than the underlying illness traits captured by structural imaging. Further internal and external validation work is required to investigate this hypothesis by characterizing, comparing, and ultimately combining clustering solutions. A critical future direction will also be to conduct longitudinal studies that track individuals over time. Such research could lead the way toward clinical translation.

3.3 Major Depressive Disorder

MDD is a common, severe, and recurrent disorder, with over 300 million people affected worldwide, and is characterized by low mood, apathy, and social withdrawal, with symptoms spanning multiple domains [69]. Its vast heterogeneity is exemplified by the fact that according to DSM-5 criteria, at least 227 and up to 16,400 unique symptom presentations exist [70, 71]. The potential causes for this heterogeneity vary from divergent clinical symptom profiles to genetic etiologies and individual differences in treatment outcomes.

Despite neurobiological findings in MDD spanning cortical thickness, gray matter volume (GMV), and fractional anisotropy (FA) measures, objective brain biomarkers that can be used to diagnose and predict disease course and outcome remain elusive [71,72,73]. Recently, there have been efforts to identify neurobiologically based subtypes of depression using a bottom-up approach, mainly using data from resting-state fMRI [71]. Several studies [33,34,35] employed k-means clustering or group iterative multiple model estimation to identify two functional connectivity subtypes, while Tokuda et al. [74] and Drysdale et al. [75] identified three and four subtypes, respectively, using nonparametric Bayesian mixture models and hierarchical clustering. These subtypes are characterized by reduced connectivity in different networks, including the default mode network (DMN) and the ventral attention network, as well as by frontostriatal and limbic dysfunction. Regarding structural neuroimaging, one study used k-means clustering on FA data to identify two depression subtypes: the first was characterized by decreased FA in the right temporal lobe and right middle frontal areas and was associated with an older age at onset, whereas the second was characterized by increased FA in the left occipital lobe and was associated with a younger age at onset [76].

Current research on the identification of brain subtypes in MDD has produced promising results, but these are confounded by methodological and design limitations. While some studies have shown clinical promise, such as predicting higher depressive symptomatology and lower sustenance of positive mood [34, 35], depression duration [33], and TMS therapy response [75], they are limited by relatively small sample sizes; nuisance variance due to age, gender, and common ancestry; lack of external validation; and lack of statistical significance testing of the identified clusters. Furthermore, there has been a lack of ambition in the use of novel clustering techniques. Clustering based on structural neuroimaging is limited compared to other disease entities and is an avenue that future research should consider. Future studies should also aim to perform longitudinal clustering to elucidate the stability of identified brain subtypes over time and to examine their utility in predicting disease outcomes.

3.4 MCI and AD

AD, along with its prodromal stage, MCI, is the most common neurodegenerative disease, affecting millions across the globe. Although a plethora of imaging studies have derived AD-related imaging signatures, most have ignored the heterogeneity within AD. Recently, there has been a developing body of effort to derive imaging signatures of AD that are heterogeneity-aware (i.e., subtypes) [7,8,9,10,11].

Most previous studies leveraged unsupervised clustering methods such as Sustain [10], NMF [32], latent Dirichlet allocation [11], and hierarchical clustering [24, 25, 30, 38]. Other papers [6, 9, 20, 77, 78] utilized semi-supervised clustering methods. Due to variability in the choice of databases and methodologies, and the lack of ground truth in the context of clustering, the reported numbers of clusters and the subtypes’ neuroanatomical patterns differ and cannot be directly compared. The targeted heterogeneous population also varies across papers. For instance, [6] focused on dissecting the neuroanatomical heterogeneity of AD patients, while [77] included AD plus MCI and [20] studied MCI only. Nevertheless, some common subtypes were found across studies. First, a subtype showing a typical diffuse atrophy pattern over the entire brain was observed in several studies [6, 8,9,10, 22, 27, 29, 30, 32, 38, 77]. Another subtype demonstrating nearly normal brain anatomy was robustly identified [8, 9, 16, 20, 22, 24, 25, 29, 30]. Moreover, several studies [8, 9, 29, 30, 77] also reported a subtype showing an atypical AD pattern (i.e., with the hippocampus or medial temporal lobe relatively spared from atrophy).

Though these methods have enabled a better understanding of heterogeneity in AD, limitations and challenges remain. First, due to demographic variations and the existence of comorbidities, it is not guaranteed that models cluster the data based on variations in the pathology of interest. Semi-supervised methods may tackle this problem to some extent, but more careful sample selection and further study with longitudinal data may be needed to ensure disease specificity. Second, spatial differences and temporal changes may simultaneously contribute to the subtypes derived through clustering methods. Third, subtypes captured from neuroimaging data alone bring limited insight into disease treatment; therefore, a joint study of neuroimaging and genetic heterogeneity may provide greater clinical value [14, 79].

3.5 Brain Cancer

Brain tumors, such as glioblastoma (GBM), exhibit extensive inter- and intra-tumor heterogeneity, diffuse infiltration, and invasiveness of various immune and stromal cell populations, which pose diagnostic and prognostic challenges and render standard therapies futile [80]. Deciphering the underlying heterogeneity of brain tumors, which arises from their genomic instability, plays a key role in understanding and predicting the course of tumor progression and its response to standard therapies, and thereby in designing effective therapies targeted at aberrant genetic alterations [81, 82]. Medical imaging noninvasively portrays, on a macroscopic scale, the phenotypic differences of brain tumors and their microenvironment caused by the molecular activities of the tumors [83, 84]. It has the potential to provide readily accessible surrogate biomarkers of particular genomic alterations, predict response to therapy, avoid the risks of tumor biopsy or inaccurate diagnosis due to sampling errors, and ultimately support the development of personalized therapies to improve patient outcomes. An imaging subtype of brain tumors may provide a wealth of information about the tumor, including distinct molecular pathways [85, 86].

Recent studies on radiomic analysis of multiparametric MRI (mpMRI) scans provide evidence of distinct phenotypic presentations of brain tumors associated with specific molecular characteristics. These studies propose that quantification of tumor morphology, texture, regional microvasculature, cellular density, or microstructural properties can map to different imaging subtypes. In particular, one study [87] discovered three distinct imaging subtypes of GBM through unsupervised clustering of these features, with significant differences in survival probabilities and associations with specific molecular signaling pathways. These imaging subtypes, namely solid, irregular, and rim-enhancing, were significantly linked to different clinical outcomes and molecular characteristics, including isocitrate dehydrogenase-1, O6-methylguanine-DNA methyltransferase, epidermal growth factor receptor variant III, and transcriptomic molecular subtype composition.

These studies have offered new insights into the characterization of tumor heterogeneity at both the microscopic (i.e., histological and molecular) and macroscopic (i.e., imaging) levels, consequently providing a more comprehensive understanding of tumor aggressiveness and patient prognosis and, ultimately, supporting the development of personalized treatments.

4 Conclusion

Taken together, these novel clustering algorithms, tailored for high-resolution yet highly variable neuroimaging datasets, have demonstrated broad utility in disease subtyping across many neurological and psychiatric conditions. At the same time, caution is needed in order not to overclaim the biological importance of subtypes, since all clustering methods find patterns in data, even if such patterns do not have a meaningful underlying biological correlate [88]. External validation is necessary. For instance, evidence from post hoc evaluations, e.g., differences in clinical variables or genetic architectures, can support the biological relevance of identified neuroimaging-based subtypes [14]. Moreover, good practices such as split-sample analysis, permutation tests [12], and comparisons against semi-simulated experiments [8] help discern the robustness of the subtypes. As dataset sizes and imaging resolution improve over time, unique computational challenges are expected to appear, along with unique opportunities to further refine our methodologies for deciphering the diversity of brain diseases.