Abstract
This chapter focuses on the joint modeling of heterogeneous information, such as imaging, clinical, and biological data. This kind of problem requires generalizing classical univariate and multivariate association models to account for complex data structure and interactions, as well as high data dimensionality.
Typical approaches are essentially based on the identification of latent modes of maximal statistical association between different sets of features. They ultimately make it possible to identify joint patterns of variation between different data modalities, as well as to predict a target modality conditioned on the available ones. This rationale can be extended to account for several data modalities jointly, to define multiview, or multichannel, representations of multiple modalities. This chapter covers both classical approaches, such as partial least squares (PLS) and canonical correlation analysis (CCA), and more recent advances based on multichannel variational autoencoders. Specific attention is devoted to the problem of interpretability and generalization of such high-dimensional models. These methods are illustrated in different medical imaging applications, and in the joint analysis of imaging and non-imaging information, such as omics or clinical data.
1 Introduction
The goal of multimodal data analysis is to reveal novel insights into complex biological conditions. Through the combined analysis of multiple types of data, and the complementary views on pathophysiological processes they provide, we have the potential to improve our understanding of the underlying processes leading to complex and multifactorial disorders [1]. In medical imaging applications, multiple imaging modalities, such as structural magnetic resonance imaging (sMRI), functional MRI (fMRI), diffusion tensor imaging (DTI), or positron emission tomography (PET), can be jointly analyzed to better characterize pathological conditions affecting individuals [2]. Other typical multimodal analysis problems involve the joint analysis of heterogeneous data types, such as imaging and genetics data, where medical imaging is associated with the patient’s genotype information, represented by genetic variants such as single-nucleotide polymorphisms (SNPs) [3]. This kind of application, termed imaging-genetics, is of central importance for the identification of genetic risk factors underlying complex diseases including age-related macular degeneration, obesity, schizophrenia, and Alzheimer’s disease [4].
Despite the great potential of multimodal data analysis, the complexity of multiple data types and clinical questions poses several challenges to the researchers, involving scalability, interpretability, and generalization of complex association models.
1.1 Challenges of Multimodal Data Assimilation
Due to the complementary nature of multimodal information, there is great interest in combining different data types to better characterize the anatomy and physiology of patients and individuals. Multimodal data is generally acquired using heterogeneous protocols highlighting different anatomical, physiological, clinical, and biological information for a given individual [5].
Typical multimodal data integration challenges are:

Non-commensurability. Since each data modality quantifies different physical and biological phenomena, multimodal data is represented by heterogeneous physical units associated with different aspects of the studied biological process (e.g., brain structure, activity, clinical scores, gene expression levels).

Spatial heterogeneity. Multimodal medical images are characterized by specific spatial resolution, which is independent of the spatial coordinate system on which they are standardized.

Heterogeneous dimensions. The data type and dimensions of medical data can vary according to the modality, ranging from scalars and time series typical of fMRI and PET data to structured tensors of diffusion-weighted imaging.

Heterogeneous noise. Medical data modalities are characterized by specific and heterogeneous artifacts and measurement uncertainty, resulting from heterogeneous acquisition and processing routines.

Missing data. Multimodal medical datasets are often incomplete, since patients may not undergo the same protocol, and some modalities may be more expensive to acquire than others.

Interpretability. A major challenge of multimodal data integration is the interpretability of the analysis results. This aspect is impacted by the complexity of the analysis methods and generally requires important expertise in data acquisition, processing, and analysis.
Multimodal data analysis methods proposed in the literature have focused on different levels of data complexity and integration, depending on the application of interest. Visual inspection is the typical initial step of multimodal studies, where single modalities are compared on a qualitative basis. For example, different medical imaging modalities can be jointly visualized for a given individual to identify common spatial patterns of signal changes. Data integration can be subsequently performed by jointly exploring unimodal features and unimodal analysis results. To this end, we may stratify the cohort of a clinical study based on some biomarkers extracted from different medical imaging modalities exceeding predefined thresholds. Finally, multivariate statistical and machine learning techniques can be applied for data-driven analysis of the joint relationship between information encoded in different modalities. Such approaches attempt to maximize the advantages of combining cross-modality information, dimensions, and resolution of the multimodal signal. The ultimate goal of such analysis methods is to identify the “mechanisms” underlying the generation of the observed medical data, to provide a joint representation of the common variation of heterogeneous data types.
The literature on multimodal analysis approaches is extensive, depending on the kind of applications and related data types. In this chapter we focus on general data integration methods, which can be classically related to the fields of multivariate statistical analysis and latent variable modeling. The importance of these approaches lies in the generality of their formulation, which makes them an ideal baseline for the analysis of heterogeneous data types. Furthermore, this chapter illustrates current extensions of these basic approaches to deep probabilistic models, which allow great modeling flexibility for current state-of-the-art applications.
In Subheading 1.2 we provide an overview of typical multimodal analyses in neuroimaging applications, while in Subheading 2 we introduce the statistical foundations of multivariate latent variable modeling, with emphasis on the standard approaches of partial least squares (PLS) and canonical correlation analysis (CCA). In Subheading 3, these classical methods are reformulated under the Bayesian lens, to define linear counterparts of latent variable models (Subheading 3.2) and their extension to multichannel and deep multivariate analysis (Subheadings 3.3 and 3.4). In Subheading 4 we finally address the problem of group-wise regularization to improve the interpretability of multivariate association models, with specific focus on imaging-genetics applications.
Box 1: Online Tutorial
The material covered in this chapter is available at the following online tutorial:
1.2 Motivation from Neuroimaging Applications
Multimodal analysis methods have been explored for their potential in automatic patient diagnosis and stratification, as well as for their ability to identify interpretable data patterns characterizing clinical conditions. In this section, we summarize state-of-the-art contributions to the field, along with the remaining challenges to improve our understanding and applications to complex brain disorders.

Structural-structural combination. Methods combining sMRI and dMRI imaging modalities are predominant in the field. Such combined analysis has been proposed, for example, for the detection of brain lesions (e.g., strokes [6, 7]) and to study and improve the management of patients with brain disorders [8].

Functional-functional combination. Due to the complementary nature of EEG and fMRI, research in brain connectivity analysis has focused on the fusion of these modalities, to optimally integrate the high temporal resolution of EEG with the high spatial resolution of the fMRI signal. As a result, EEG-fMRI can provide simultaneous cortical and subcortical recording of brain activity with high spatiotemporal resolution. For example, this combination is increasingly used to provide clinical support for the diagnosis and treatment of epilepsy, to accurately localize seizure onset areas, as well as to map the surrounding functional cortex in order to avoid disability [9, 10, 11].

Structural-functional combination. The combined analysis of sMRI, dMRI, and fMRI has been frequently proposed in neuropsychiatric research due to the high clinical availability of these imaging modalities and due to their potential to link brain function, structure, and connectivity. A typical application is in the study of autism spectrum disorder and attention-deficit hyperactivity disorder (ADHD). The combined analysis of such modalities has been proposed, for example, for the identification of altered white matter connectivity patterns in children with ADHD [12], highlighting association patterns between regional brain structural and functional abnormalities [13].

Imaging-genetics. The combination of imaging and genetics data has been increasingly studied to identify genetic risk factors (genetic variations) associated with functional or structural abnormalities (quantitative traits, QTs) in complex brain disorders [3]. Such multimodal analyses are key to identifying the underlying mechanisms (from genotype to phenotype) leading to neurodegenerative diseases, such as Alzheimer’s disease [14] or Parkinson’s disease [15]. This analysis paradigm paves the way to novel data integration scenarios, including imaging and transcriptomics, or multi-omic data [16].
Overall, multimodal data integration in the study of brain disorders has shown promising results and is an actively evolving field. The potential of neuroimaging information is continuously improving, with increasing resolution and improved image contrast. Moreover, multiple imaging modalities are increasingly available in large collections of multimodal brain data, allowing for the application of complex modeling approaches on representative cohorts.
2 Methodological Background
2.1 From Multivariate Regression to Latent Variable Models
The use of multivariate analysis methods for biomedical data analysis is widespread, for example, in neuroscience [17], genetics [18], and imaging-genetics studies [19, 20]. These approaches come with the potential of explicitly highlighting the underlying relationship between data modalities, by identifying sets of relevant features that are jointly associated to explain the observed data.
In what follows, we represent the multimodal information available for a given subject k as a collection of arrays \( {\boldsymbol{x}}_i^k \), i = 1, …, M, where M is the number of available modalities. Each array has dimension \( \mathit{\dim}\left({\boldsymbol{x}}_i^k\right)={D}_i \). A multimodal data matrix for N individuals is therefore represented by the collection of matrices X_{i}, with dim(X_{i}) = N × D_{i}. For the sake of simplicity, we assume that \( {\boldsymbol{x}}_i^k\in {\mathbb{R}}^{D_i} \).
A first assumption that can be made for defining a multivariate analysis method is that a target modality, say X_{j}, is generated by the combination of a set of given modalities {X_{i}}_{i ≠ j}. A typical example of this application concerns the prediction of certain clinical variables from the combination of imaging features. In this case, the underlying forward generative model for an observation \( {\boldsymbol{x}}_j^k \) can be expressed as:
where we assume that there exists an ideal mapping g(⋅) that transforms the ensemble of observed modalities for the individual k, to generate the target one \( {\boldsymbol{x}}_j^k \). Note that we generally assume that the observations are corrupted by a certain noise \( {\boldsymbol{\varepsilon}}_j^k \), whose nature depends on the data type. The standard choice for the noise is Gaussian, \( {\boldsymbol{\varepsilon}}_j^k\sim \mathcal{N}\left(\mathbf{0},{\sigma}^2\boldsymbol{Id}\right) \).
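Written out with the symbols defined above, a forward model consistent with this description (the relation the text refers to as Eq. 1) would read as follows. This is a reconstruction from the surrounding definitions, in which the argument of g is taken to be the set of all modalities other than the target:

\[ {\boldsymbol{x}}_j^k=g\left({\left\{{\boldsymbol{x}}_i^k\right\}}_{i\ne j}\right)+{\boldsymbol{\varepsilon}}_j^k,\qquad {\boldsymbol{\varepsilon}}_j^k\sim \mathcal{N}\left(\mathbf{0},{\sigma}^2\boldsymbol{Id}\right). \]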
Within this setting, a multimodal model is represented by a function \( f\left({\left\{{\boldsymbol{X}}_i\right\}}_{i=1}^M,\boldsymbol{\theta} \right) \), with parameters θ, taking as input the ensemble of modalities across subjects. The model f is optimized with respect to θ to solve a specific task. In our case, the set of input modalities can be used to predict a target modality j; in this case we have \( f:{\otimes}_{i\ne j}{\mathbb{R}}^{D_i}\mapsto {\mathbb{R}}^{D_j} \).
In its basic form, this kind of formulation includes standard multivariate linear regression, where the relationship between two modalities X_{1} and X_{2} is modeled through a set of linear parameters \( \boldsymbol{\theta} =\boldsymbol{W}\in {\mathbb{R}}^{D_2\times {D}_1} \) and f(X_{2}) = X_{2} ⋅ W. Under the Gaussian noise assumption, the typical optimization task is formulated as the least squares problem:
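In the notation above, a least squares objective consistent with this description is \( {\min}_{\boldsymbol{W}}{\left\Vert {\boldsymbol{X}}_1-{\boldsymbol{X}}_2\cdot \boldsymbol{W}\right\Vert}^2 \). As a minimal sketch, assuming synthetic data and NumPy (the variable names and dimensions are illustrative and not taken from the chapter), the estimate can be computed as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D1, D2 = 200, 3, 5                       # illustrative sizes (assumption)

X2 = rng.normal(size=(N, D2))               # input modality
W_true = rng.normal(size=(D2, D1))          # ground-truth linear parameters
X1 = X2 @ W_true + 0.1 * rng.normal(size=(N, D1))   # target with Gaussian noise

# Least squares estimate of W minimizing ||X1 - X2 W||^2
W_hat, *_ = np.linalg.lstsq(X2, X1, rcond=None)

# Same estimate from the normal equations (X2^T X2) W = X2^T X1
W_normal = np.linalg.solve(X2.T @ X2, X2.T @ X1)
```

With N much larger than D_2, both estimates coincide and recover the generating parameters up to the noise level.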
When jointly modeling multiple modalities, the forward generative model of Eq. 1 may be suboptimal, as it implies the explicit dependence of the target modality upon the other ones. This assumption may be too restrictive, as often an explicit assumption of dependency cannot be made, and we are rather interested in modeling the joint variation between data modalities. This is the rationale of latent variable models.
In the latent variable setting, we assume that the multiple modalities are jointly dependent on a common latent representation z (Fig. 1) belonging to an ideal low-dimensional space of dimension D ≤ min{D_{i}, i = 1, …, M}. In this case, Eq. 1 can be extended to the generative process:
Equation 3 is the forward process governing the data generation. The goal of latent variable modeling is to make inference on the latent space and on the generative process from the observed data modalities, based on specific assumptions on the transformations from the latent to the data space, and on the kind of noise process affecting the observations (Box 2). In particular, the inference problem can be tackled by estimating inverse mappings, \( {f}_j\left({\boldsymbol{x}}_j^k\right) \), from the data space of the observed modalities to the latent space.
Based on this framework, in the following sections, we illustrate the standard approaches for solving the inference problem of Eq. 1.
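For reference, a latent variable generative process consistent with the description above (referred to in the text as Eq. 3) can be written as follows. The symbols \( {g}_i \) for the modality-specific mappings and \( p\left({\boldsymbol{z}}^k\right) \) for the latent prior are notational assumptions of this reconstruction:

\[ {\boldsymbol{z}}^k\sim p\left({\boldsymbol{z}}^k\right),\qquad {\boldsymbol{x}}_i^k={g}_i\left({\boldsymbol{z}}^k\right)+{\boldsymbol{\varepsilon}}_i^k,\qquad i=1,\dots, M. \]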
Box 2: Online Tutorial—Generative Models
The forward model of Eq. 3 for multimodal data generation can be easily coded in Python to generate a synthetic multimodal dataset:
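A minimal sketch of such a generator is given below. It assumes linear mappings from the latent space to each modality, a standard Gaussian latent variable, and isotropic Gaussian noise. The function name `generate_multimodal_data` and all dimensions are hypothetical choices made for illustration, and the tutorial's own code may differ.

```python
import numpy as np

def generate_multimodal_data(n_subjects, dims, latent_dim, noise_std=0.5, seed=0):
    """Sample M modalities from a shared latent variable (linear-Gaussian sketch)."""
    rng = np.random.default_rng(seed)
    # Shared latent representation z^k ~ N(0, I), one row per subject
    Z = rng.normal(size=(n_subjects, latent_dim))
    X, W = [], []
    for d in dims:
        W_i = rng.normal(size=(latent_dim, d))               # modality-specific linear map
        eps = noise_std * rng.normal(size=(n_subjects, d))   # Gaussian observation noise
        X.append(Z @ W_i + eps)
        W.append(W_i)
    return X, Z, W

# Two modalities of dimension 5 and 4 generated from a 2-dimensional latent space
X, Z, W = generate_multimodal_data(n_subjects=500, dims=[5, 4], latent_dim=2)
X1, X2 = X
```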
2.2 Classical Latent Variable Models: PLS and CCA
Classical latent variable models extend standard linear regression to analyze the joint variability of different modalities. Typical formulations of latent variable models include partial least squares (PLS) and canonical correlation analysis (CCA) [24], which have been successfully applied in biomedical research [25], along with multimodal [26, 27] and nonlinear [28, 29] variants.
Box 3: Online Tutorial—PLS and CCA with sklearn
The basic principle of these multivariate analysis techniques relies on the identification of linear transformations of modalities X_{i} and X_{j} into a lower dimensional subspace of dimension D ≤ min{D_{i}, D_{j}}, where the projected data exhibits the desired statistical properties of similarity. For example, PLS aims at maximizing the covariance between these combinations (or projections on the modes’ directions), while CCA maximizes their statistical correlation (Box 3). For simplicity, in what follows we focus on the joint analysis of two modalities X_{1} and X_{2}, and the multimodal model can be written as
where θ = {u_{1}, u_{2}} are linear projection operators for the modalities, \( {\boldsymbol{u}}_i\in {\mathbb{R}}^{D_i} \), while \( {\boldsymbol{z}}_i={\boldsymbol{X}}_i\cdot {\boldsymbol{u}}_i\in {\mathbb{R}}^N \) are the latent projections for each modality i = 1, 2. The optimization problem can thus be formulated as:
where Sim is a suitable measure of statistical similarity, depending on the envisaged method (e.g., covariance for PLS, or correlation for CCA) (Fig. 2).
2.3 Latent Variable Models Through EigenDecomposition
2.3.1 Partial Least Squares
For PLS, the problem of Eq. 6 requires the estimation of projections u_{1} and u_{2} maximizing the covariance between the latent representation of the two modalities X_{1} and X_{2}:
where
and \( \boldsymbol{S}={\boldsymbol{X}}_1^T{\boldsymbol{X}}_2 \) is the sample covariance between modalities.
Without loss of generality, the maximization of Eq. 9 can be considered under the unit-norm constraint \( \sqrt{{\boldsymbol{u}}_1^T{\boldsymbol{u}}_1}=\sqrt{{\boldsymbol{u}}_2^T{\boldsymbol{u}}_2}=1 \). This constrained optimization problem can be expressed in the Lagrangian form:
whose solution can be written as:
Equation 11 corresponds to the primal formulation of PLS and shows that the PLS projections maximizing the latent covariance are the left and right singular vectors (eigenmodes) of the sample covariance matrix across modalities. This solution is known as PLS-SVD and has been widely adopted in the field of neuroimaging [30, 31], for the study of common patterns of variability between multimodal imaging data, such as PET and fMRI.
It is worth noting that classical principal component analysis (PCA) is a special case of PLS when X_{1} = X_{2}. In this case the latent projections maximize the data variance and correspond to the eigenmodes of the sample covariance matrix \( \boldsymbol{S}={\boldsymbol{X}}_1^T{\boldsymbol{X}}_1 \).
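The PLS solution described in this section can be checked numerically. The sketch below assumes column-centered synthetic data and uses the singular value decomposition of \( \boldsymbol{S}={\boldsymbol{X}}_1^T{\boldsymbol{X}}_2 \): the first left and right singular vectors give unit-norm projections u_{1} and u_{2}, and the first singular value is the attained value of \( {\boldsymbol{u}}_1^T\boldsymbol{S}{\boldsymbol{u}}_2 \). Variable names and data dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
z = rng.normal(size=(N, 1))                          # shared latent factor
X1 = z @ rng.normal(size=(1, 6)) + 0.5 * rng.normal(size=(N, 6))
X2 = z @ rng.normal(size=(1, 5)) + 0.5 * rng.normal(size=(N, 5))
X1 -= X1.mean(axis=0)                                # center each modality
X2 -= X2.mean(axis=0)

S = X1.T @ X2                                        # cross-covariance (up to a 1/N factor)
U, s, Vt = np.linalg.svd(S, full_matrices=False)
u1, u2 = U[:, 0], Vt[0]                              # first pair of PLS projections
z1, z2 = X1 @ u1, X2 @ u2                            # latent projections

pls_objective = z1 @ z2                              # equals u1^T S u2

# Any other pair of unit-norm vectors attains a smaller or equal value
a = rng.normal(size=6); a /= np.linalg.norm(a)
b = rng.normal(size=5); b /= np.linalg.norm(b)
random_objective = (X1 @ a) @ (X2 @ b)
```

Setting `X2 = X1` in this sketch reduces the decomposition to the eigenmodes of \( {\boldsymbol{X}}_1^T{\boldsymbol{X}}_1 \), that is, to PCA.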
2.3.2 Canonical Correlation Analysis
In canonical correlation analysis (CCA), the problem of Eq. 6 is formulated by optimizing linear transformations such that X_{1}u_{1} and X_{2}u_{2} are maximally correlated:
where
where \( {\boldsymbol{S}}_1={\boldsymbol{X}}_1^T{\boldsymbol{X}}_1 \) and \( {\boldsymbol{S}}_2={\boldsymbol{X}}_2^T{\boldsymbol{X}}_2 \) are the sample covariances of modality 1 and 2, respectively.
Proceeding in a similar way as for the derivation of PLS, it can be shown that CCA is associated with the generalized eigendecomposition problem [32]:
It is common practice to reformulate the CCA problem of Eq. 14 with a regularized version aimed at avoiding numerical instabilities due to the estimation of the sample covariances S_{1} and S_{2}:
In this latter formulation, the right-hand side of Eq. 14 is regularized by introducing a constant diagonal term δ, proportional to the regularization strength (with δ = 0 we obtain Eq. 14). Interestingly, for large values of δ, the diagonal term dominates the sample covariance matrices of the right-hand side, and we retrieve the standard eigenvalue problem of Eq. 11. This shows that PLS can be interpreted as an infinitely regularized formulation of CCA.
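The relationship between CCA, its regularized version, and PLS can be illustrated numerically. The sketch below assumes that the regularized problem adds δ times the identity to S_{1} and S_{2}, as described above, and solves it by whitening followed by an SVD, which is one standard way of solving this kind of generalized eigenvalue problem. The helper names `inv_sqrt` and `regularized_cca` are hypothetical, and the data is synthetic.

```python
import numpy as np

def inv_sqrt(M):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V / np.sqrt(w)) @ V.T

def regularized_cca(X1, X2, delta=0.0):
    """First pair of unit-norm projections for CCA with diagonal regularization delta."""
    S, S1, S2 = X1.T @ X2, X1.T @ X1, X2.T @ X2
    R1 = inv_sqrt(S1 + delta * np.eye(S1.shape[0]))
    R2 = inv_sqrt(S2 + delta * np.eye(S2.shape[0]))
    U, s, Vt = np.linalg.svd(R1 @ S @ R2)            # SVD of the whitened cross-covariance
    u1, u2 = R1 @ U[:, 0], R2 @ Vt[0]                # map back to the original coordinates
    return u1 / np.linalg.norm(u1), u2 / np.linalg.norm(u2)

rng = np.random.default_rng(0)
N = 500
z = rng.normal(size=(N, 1))
# Heterogeneous noise levels make the CCA and PLS solutions differ
X1 = z @ rng.normal(size=(1, 6)) + rng.normal(size=(N, 6)) * np.linspace(0.5, 2.0, 6)
X2 = z @ rng.normal(size=(1, 5)) + rng.normal(size=(N, 5)) * np.linspace(0.5, 2.0, 5)
X1 -= X1.mean(axis=0)
X2 -= X2.mean(axis=0)

u1_cca, u2_cca = regularized_cca(X1, X2, delta=0.0)  # plain CCA
u1_big, u2_big = regularized_cca(X1, X2, delta=1e9)  # very strong regularization
U, s, Vt = np.linalg.svd(X1.T @ X2)
u1_pls, u2_pls = U[:, 0], Vt[0]                      # PLS-SVD projections

corr_cca = abs(np.corrcoef(X1 @ u1_cca, X2 @ u2_cca)[0, 1])
corr_pls = abs(np.corrcoef(X1 @ u1_pls, X2 @ u2_pls)[0, 1])
```

In this sketch the strongly regularized projections align with the PLS ones (up to sign), while the unregularized solution attains a correlation at least as large as the PLS solution.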
2.4 Kernel Methods for Latent Variable Models
In order to capture nonlinear relationships, we may wish to project our input features into a highdimensional space prior to performing CCA (or PLS):
where ϕ is a nonlinear feature map. As derived by Bach et al. [33], the data matrices X_{1} and X_{2} can be replaced by the Gram matrices K_{1} and K_{2} such that we can achieve a nonlinear feature mapping via the kernel trick [34]:
where \( {\mathbf{K}}_1={\left[{K}_1\left({\boldsymbol{x}}_1^i,{\boldsymbol{x}}_1^j\right)\right]}_{N\times N} \) and \( {\mathbf{K}}_2={\left[{K}_2\left({\boldsymbol{x}}_2^i,{\boldsymbol{x}}_2^j\right)\right]}_{N\times N} \). In this case, kernel CCA canonical directions correspond to the solutions of the updated generalized eigenvalue problem:
Similarly to the primal formulation of CCA, we can apply an ℓ_{2}-norm regularization penalty on the weights α_{1} and α_{2} of Eq. 18, giving rise to regularized kernel CCA:
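As an illustration, a regularized kernel CCA can be sketched with NumPy as follows. The sketch assumes a radial basis function kernel, centered Gram matrices, and a regularization in which K_{i} is replaced by K_{i} + δ·Id in the normalization term; this is one common regularization choice and may differ in detail from the formulation referred to above. The function names `centered_rbf_gram` and `regularized_kcca` are hypothetical.

```python
import numpy as np

def centered_rbf_gram(X, gamma=1.0):
    """Centered Gram matrix of the RBF kernel exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    H = np.eye(len(X)) - 1.0 / len(X)                # centering matrix Id - (1/N) 1 1^T
    return H @ K @ H

def regularized_kcca(K1, K2, delta=1.0):
    """First pair of dual weights (alpha_1, alpha_2) and the latent projections."""
    N = K1.shape[0]
    R1 = np.linalg.inv(K1 + delta * np.eye(N))
    R2 = np.linalg.inv(K2 + delta * np.eye(N))
    U, s, Vt = np.linalg.svd(R1 @ K1 @ K2 @ R2)
    alpha1, alpha2 = R1 @ U[:, 0], R2 @ Vt[0]
    return alpha1, alpha2, K1 @ alpha1, K2 @ alpha2

# Two one-dimensional views linked by a purely nonlinear (quadratic) relationship
rng = np.random.default_rng(0)
x1 = np.linspace(-2.0, 2.0, 200)[:, None]
x2 = x1 ** 2 + 0.1 * rng.normal(size=x1.shape)

K1, K2 = centered_rbf_gram(x1), centered_rbf_gram(x2)
alpha1, alpha2, z1, z2 = regularized_kcca(K1, K2, delta=1.0)

linear_corr = np.corrcoef(x1[:, 0], x2[:, 0])[0, 1]  # close to zero by symmetry
kernel_corr = np.corrcoef(z1, z2)[0, 1]              # correlation of kernel projections
```

The linear correlation between the two views is negligible here, whereas the kernel projections are strongly correlated, which is the motivation for the nonlinear feature mapping.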
2.5 Optimization of Latent Variable Models
The nonlinear iterative partial least squares (NIPALS) is a classical scheme proposed by H. Wold [35] for the optimization of latent variable models through the iterative computation of PLS and CCA projections. Within this method, the projections associated with the modalities X_{1} and X_{2} are obtained through the iterative solution of simple least squares problems.
The principle of NIPALS is to identify projection vectors \( {\boldsymbol{u}}_1\in {\mathbb{R}}^{D_1} \) and \( {\boldsymbol{u}}_2\in {\mathbb{R}}^{D_2} \), together with the corresponding latent representations z_{1} and z_{2}, that minimize the functionals
subject to the constraint of maximal similarity between representations z_{1} and z_{2} (Fig. 3).
Following [37], the NIPALS method is optimized as follows (Algorithm 1). The latent projection for modality 1 is first initialized as \( {\boldsymbol{z}}_1^{(0)} \) from randomly chosen columns of the data matrix X_{1}. Subsequently, the linear regression function
is optimized with respect to u_{2}, to obtain the projection \( {\boldsymbol{u}}_2^{(0)} \). After unit scaling of the projection coefficients, the new latent representation is computed for modality 2 as \( {\boldsymbol{z}}_2^{(0)}={\boldsymbol{X}}_2\cdot {\boldsymbol{u}}_2^{(0)} \). At this point, the latent projection is used for a new optimization step of the linear regression problem
this time with respect to u_{1}, to obtain the projection parameters \( {\boldsymbol{u}}_1^{(0)} \) relative to modality 1. After unit scaling of the coefficients, the new latent representation is computed for modality 1 as \( {\boldsymbol{z}}_1^{(1)}={\boldsymbol{X}}_1\cdot {\boldsymbol{u}}_1^{(0)} \). The whole procedure is then iterated.
It can be shown that the NIPALS method of Algorithm 1 converges to a stable solution for projections and latent parameters, and that the resulting projection vectors correspond to the first left and right eigenmodes associated with the covariance matrix \( \boldsymbol{S}={\boldsymbol{X}}_1^T\cdot {\boldsymbol{X}}_2 \).
Algorithm 1 NIPALS iterative computation for PLS components [37]
After the first eigenmodes are computed through Algorithm 1, the higher-order components can be subsequently computed by deflating the data matrices X_{1} and X_{2}. This can be done by regressing out the current projections in the latent space:
NIPALS can be seamlessly used to optimize the CCA problem. Indeed, it can be shown that the CCA projections and latent representations can be obtained by estimating the linear projections u_{2} and u_{1} in steps 1 and 4 of Algorithm 1 via the linear regression problems
and
Box 4: Online Tutorial—NIPALS Implementation
The online tutorial provides an implementation of the NIPALS algorithm for both CCA and PLS, corresponding to Algorithm 1. It can be verified that the numerical solution is equivalent to the one provided by sklearn and to the one obtained through the solution of the eigenvalue problem.
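The tutorial implementation is not reproduced in this text. As a minimal sketch of the iterative scheme described in this section, the following code alternates the two regression steps with unit scaling, and provides a deflation helper. For the CCA variant it assumes that each projection is obtained by regressing the current latent representation on the corresponding data matrix. The function names `nipals` and `deflate` are hypothetical.

```python
import numpy as np

def nipals(X1, X2, mode="pls", n_iter=500, tol=1e-12, seed=0):
    """First pair of projections and latent representations ('pls' or 'cca')."""
    rng = np.random.default_rng(seed)
    z1 = X1[:, rng.integers(X1.shape[1])].copy()     # init from a random column of X1
    u1 = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        u1_old = u1
        if mode == "pls":
            u2 = X2.T @ z1 / (z1 @ z1)               # regress the columns of X2 on z1
        else:
            u2 = np.linalg.lstsq(X2, z1, rcond=None)[0]   # regress z1 on X2
        u2 = u2 / np.linalg.norm(u2)                 # unit scaling
        z2 = X2 @ u2
        if mode == "pls":
            u1 = X1.T @ z2 / (z2 @ z2)               # regress the columns of X1 on z2
        else:
            u1 = np.linalg.lstsq(X1, z2, rcond=None)[0]   # regress z2 on X1
        u1 = u1 / np.linalg.norm(u1)
        z1 = X1 @ u1
        if np.linalg.norm(u1 - u1_old) < tol:        # stop when the projection is stable
            break
    return u1, u2, z1, z2

def deflate(X, z):
    """Regress the latent projection z out of the data matrix X."""
    return X - np.outer(z, z @ X) / (z @ z)

rng = np.random.default_rng(1)
N = 500
z = rng.normal(size=(N, 1))
X1 = z @ rng.normal(size=(1, 6)) + rng.normal(size=(N, 6)) * np.linspace(0.5, 2.0, 6)
X2 = z @ rng.normal(size=(1, 5)) + rng.normal(size=(N, 5)) * np.linspace(0.5, 2.0, 5)
X1 -= X1.mean(axis=0)
X2 -= X2.mean(axis=0)

u1_pls, u2_pls, z1_pls, z2_pls = nipals(X1, X2, mode="pls")
u1_cca, u2_cca, z1_cca, z2_cca = nipals(X1, X2, mode="cca")

# Reference solutions: SVD of S for PLS, QR-based canonical correlation for CCA
U, s, Vt = np.linalg.svd(X1.T @ X2)
Q1, _ = np.linalg.qr(X1)
Q2, _ = np.linalg.qr(X2)
rho = np.linalg.svd(Q1.T @ Q2, compute_uv=False)[0]
nipals_corr = np.corrcoef(z1_cca, z2_cca)[0, 1]

X1_deflated = deflate(X1, z1_pls)                    # input for the next component
```

Higher-order components can then be obtained by calling `nipals` again on the deflated matrices.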
3 Bayesian Frameworks for Latent Variable Models
Bayesian formulations for latent variable models have been developed in the past, including for PLS [38] and CCA [39]. The advantage of employing a Bayesian framework to solve the original inference problem is that it provides a natural setting to quantify the parameters’ variability in an interpretable manner, through their estimated distribution. In addition, these methods are particularly attractive for their ability to integrate prior knowledge on the model’s parameters.
3.1 Multiview PPCA
Recently, the seminal work of Tipping and Bishop on probabilistic PCA (PPCA) [40] has been extended to allow the joint integration of multimodal data [41] (multiview PPCA), under the assumption of a common latent space able to explain and generate all modalities.
Recalling the notation of Subheading 2.1, let \( \boldsymbol{x}={\left\{{\boldsymbol{x}}_i^k\right\}}_{i=1}^M \) be an observation of M modalities for subject k, where each \( {\boldsymbol{x}}_i^k \) is a vector of dimension D_{i}. We denote by z^{k} the D-dimensional latent variable commonly shared by each \( {\boldsymbol{x}}_i^k \). In this context, the forward process underlying the data generation of Eq. 1 is linear, and for each subject k and modality i, we write (see Fig. 4a):
where W_{i} represents the linear mapping from the latent space to the i-th modality, while μ_{i} and ε_{i} denote the common intercept and error for modality i. Note that the modality index i does not appear in the latent variable z^{k}, allowing a compact formulation of the generative model of the whole dataset (i.e., including all modalities) by simple concatenation:
Further hypotheses are needed to define the probability distributions of each element appearing in Eq. 22, such as z^{k} ∼ p(z^{k}), the standard Gaussian prior distribution for the latent variables, and ε_{i} ∼ p(ε_{i}), a centered Gaussian distribution. From these assumptions, one can finally derive the likelihood of the data given latent variables and model parameters, \( p\left({\boldsymbol{x}}_i^k\mid {\boldsymbol{z}}^k,{\boldsymbol{\theta}}_i\right) \), θ_{i} = {W_{i}, μ_{i}, ε_{i}} and, by using Bayes’ theorem, also the posterior distribution of the latent variables, \( p\left({\boldsymbol{z}}^k\mid {\boldsymbol{x}}_i^k\right) \).
Box 5: Online Tutorial—Multiview PPCA
3.1.1 Optimization
In order to solve the inference problem and estimate the model’s parameters in θ, the classical expectation-maximization (EM) scheme can be deployed. EM optimization consists of an iterative process where each iteration is composed of two steps:

Expectation step (E): Given the parameters previously optimized, the expectation of the log-likelihood of the joint distribution of x_{i} and z^{k} with respect to the posterior distribution of the latent variables is evaluated.

Maximization step (M): The functional of the E step is maximized with respect to the model’s parameters.
It is worth noting that prior knowledge on the distribution of the model’s parameters can be easily integrated in this Bayesian framework (Fig. 4b), with minimal modification of the optimization scheme. This modification consists of a penalization of the functional to be maximized in the M-step, which forces the optimized parameters to remain close to their priors. In this case we talk about maximum a posteriori (MAP) optimization.
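The E and M steps can be sketched for a simplified version of the model. The code below applies the classical EM updates of probabilistic PCA to the concatenation of two modalities, under the simplifying assumption of a single isotropic noise variance shared by all modalities and without priors on the parameters; the multiview model discussed in this section may use modality-specific noise. The function name `multiview_ppca_em` is hypothetical.

```python
import numpy as np

def multiview_ppca_em(X, latent_dim, n_iter=200, seed=0):
    """EM for PPCA on concatenated modalities (shared isotropic noise, a simplification)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X.mean(axis=0)                              # maximum likelihood intercept
    Xc = X - mu
    W = rng.normal(size=(D, latent_dim))             # random initialization
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables given the current parameters
        M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(latent_dim))
        Ez = Xc @ W @ M_inv                          # posterior means, one row per subject
        Ezz = N * sigma2 * M_inv + Ez.T @ Ez         # sum over subjects of E[z z^T]
        # M-step: maximize the expected complete-data log-likelihood
        W_new = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum((Xc @ W_new) * Ez)
                  + np.trace(Ezz @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, mu, sigma2

# Synthetic example: two modalities generated from a 2-dimensional latent variable
rng = np.random.default_rng(0)
N, d = 2000, 2
Z = rng.normal(size=(N, d))
X1 = Z @ rng.normal(size=(d, 5)) + 0.5 * rng.normal(size=(N, 5))
X2 = Z @ rng.normal(size=(d, 4)) + 0.5 * rng.normal(size=(N, 4))

X = np.hstack([X1, X2])                              # concatenation of the modalities
W, mu, sigma2 = multiview_ppca_em(X, latent_dim=d)
W1, W2 = W[:5], W[5:]                                # modality-specific blocks of W
```

In this example the estimated noise variance should be close to the generating value of 0.25. A MAP variant would add the log-prior of the parameters to the functional maximized in the M-step.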
3.2 Bayesian Latent Variable Models via Autoencoding
Autoencoders and variational autoencoders have become very popular approaches for the estimation of latent representation of complex data, which allow powerful extensions of the Bayesian models presented in Subheading 3.1 to account for nonlinear and deep data representations.
Autoencoders (AEs) extend classical latent variable models to account for complex, potentially highly nonlinear, projections from the data space to the latent space (encoding), along with reconstruction functions (decoding) mapping the latent representation back to the data space. Since typical encoding (f_{e}) and decoding (f_{d}) functions of AEs are parameterized by feedforward neural networks, inference can be efficiently performed by means of stochastic gradient descent through backpropagation. In this sense, AEs can be seen as a powerful extension of classical PCA, where encoding into the latent representations and decoding are jointly optimized to minimize the reconstruction error of the data:
The variational autoencoder (VAE) [42, 43] introduces a Bayesian formulation of AEs, akin to PPCA, where the latent variables are inferred by estimating the associated posterior distributions. In this case, the optimization problem can be efficiently solved by stochastic variational inference [44], where the moments of the variational posterior over the latent variables are parameterized by neural networks.
In the same way PLS and CCA extend PCA for multimodal analysis, research has been devoted to define equivalent extensions for the VAEs to identify common latent representations of multiple data modalities, such as the multichannel VAE [23], or deep CCA [29]. These approaches are based on a similar formulation, which is provided in the following section.
3.3 Multichannel Variational Autoencoder
The multichannel variational autoencoder (mcVAE) assumes the following generative process for the observation set:
where p(z^{k}) is a prior distribution for the latent variable. In this case, \( p\left({\boldsymbol{x}}_i^k\mid \boldsymbol{z},{\theta}_i\right) \) is the likelihood of the observed modality i for subject k, conditioned on the latent variable and on the generative parameters θ_{i} parameterizing the decoding from the latent space to the data space of modality i.
Solving this inference problem requires the estimation of the posterior for the latent distribution p(z ∣ X_{1}, …, X_{M}), which is generally an intractable problem. Following the VAE scheme, variational inference can be applied to compute an approximate posterior [45].
3.3.1 Optimization
The inference problem of mcVAE is solved by identifying variational posterior distributions specific to each data modality \( q\left({\boldsymbol{z}}^k\mid {\boldsymbol{x}}_i^k,{\varphi}_i\right) \), by conditioning them on the observed modality x_{i} and on the corresponding variational parameters φ_{i} parameterizing the encoding of the observed modality to the latent space.
In this way, since each modality provides a different approximation, a similarity constraint is imposed in the latent space to force each modality-specific distribution \( q\left({\boldsymbol{z}}^k\mid {\boldsymbol{x}}_i^k,{\varphi}_i\right) \) to be as close as possible to the common target posterior distribution. The measure of “proximity” between distributions is the Kullback-Leibler (KL) divergence. This constraint defines the following functional:
where the approximate posteriors q(z ∣ x_{i}, φ_{i}) represent the view on the latent space that can be inferred from the modality x_{i}. In [23] it was shown that the optimization of Eq. 27 is equivalent to the optimization of the following evidence lower bound (ELBO):
where \( R={\sum}_i\mathrm{KL}\left[q\left({\boldsymbol{z}}^k\mid {\boldsymbol{x}}_i^k,{\varphi}_i\right)\parallel p\left(\boldsymbol{z}\right)\right] \), and D = ∑_{i} L_{i}, with
is the expected log-likelihood of each data channel x_{j} quantifying the reconstruction obtained by decoding from the latent representation of the remaining channels x_{i}. Therefore, optimizing the term D in Eq. 28 with respect to encoding and decoding parameters \( {\left\{{\theta}_i,{\varphi}_i\right\}}_{i=1}^M \) identifies the optimal representation of each modality in the latent space which can, on average, jointly reconstruct all the other channels. This term thus enforces a coherent latent representation across different modalities and is balanced by the regularization term R, which constrains the latent representation of each modality to the common prior p(z). As for standard VAEs, encoding and decoding functions can be arbitrarily chosen to parameterize respectively latent distributions and data likelihoods. Typical choices for such functions are neural networks, which can provide extremely flexible and powerful data representation (Box 6). For example, leveraging the modeling capabilities of deep convolutional networks, mcVAE has been used in a recent cardiovascular study for the prediction of cardiac MRI data from retinal fundus images [46].
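To make the structure of the two terms concrete, the following NumPy sketch evaluates a one-sample Monte Carlo estimate of D − R for fixed parameters. It is only an illustration under strong assumptions: linear Gaussian encoders and decoders, a constant encoder log-variance, a standard Gaussian prior, and a reconstruction sum that runs over all channels j, including the encoding channel itself. No optimization is performed (in practice this would rely on automatic differentiation), and the function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, Id)] for each row (subject)."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)

def gaussian_log_lik(x, mean, noise_var):
    """Log-density of x under N(mean, noise_var * Id) for each row."""
    return -0.5 * np.sum((x - mean) ** 2 / noise_var
                         + np.log(2.0 * np.pi * noise_var), axis=1)

def mcvae_elbo(X, encoders, decoders, enc_log_var, noise_var, rng):
    """One-sample Monte Carlo estimate of D - R, averaged over subjects."""
    D_term, R_term = 0.0, 0.0
    for X_i, A_i in zip(X, encoders):
        mu = X_i @ A_i                               # mean of q(z | x_i)
        log_var = np.full_like(mu, enc_log_var)      # constant encoder log-variance
        R_term = R_term + kl_to_standard_normal(mu, log_var)
        z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
        for X_j, W_j in zip(X, decoders):            # decode every channel from z
            D_term = D_term + gaussian_log_lik(X_j, z @ W_j, noise_var)
    return float(np.mean(D_term - R_term))

# Synthetic two-channel data generated from a shared 2-dimensional latent variable
rng = np.random.default_rng(0)
N, d = 300, 2
Z = rng.normal(size=(N, d))
W_true = [rng.normal(size=(d, 5)), rng.normal(size=(d, 4))]
X = [Z @ W + 0.1 * rng.normal(size=(N, W.shape[1])) for W in W_true]

encoders = [np.linalg.pinv(W) for W in W_true]       # linear encoders recovering Z
zero_decoders = [np.zeros_like(W) for W in W_true]   # uninformative decoders

elbo_good = mcvae_elbo(X, encoders, W_true, -2.0, 0.25, np.random.default_rng(1))
elbo_bad = mcvae_elbo(X, encoders, zero_decoders, -2.0, 0.25, np.random.default_rng(1))
```

With decoders that match the generating process, the cross-reconstruction term is larger than with uninformative decoders, while the regularization term is identical in both cases.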
Box 6 Online Tutorial —mcVAE with PyTorch
3.4 Deep CCA
The mcVAE uses neural network layers to learn nonlinear representations of multimodal data. Similarly, Deep CCA [29] provides an alternative to kernel CCA to learn nonlinear mappings of multimodal information. Deep CCA computes representations by passing two views through functions f_{1} and f_{2} with parameters θ_{1} and θ_{2}, respectively, which can be learnt by multilayer neural networks. The parameters are optimized by maximizing the correlation between the learned representations f_{1}(X_{1};θ_{1}) and f_{2}(X_{2};θ_{2}):
In its classical formulation, the correlation objective given in Eq. 29 is a function of the full training set, and as such, mini-batch optimization can lead to suboptimal results. Therefore, optimization of classical deep CCA must be performed with full-batch optimization, for example, through the L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno) scheme [47]. For this reason, with this vanilla implementation, deep CCA is not computationally viable for large datasets. Furthermore, this approach does not provide a model for generating samples from the latent space. To address these issues, Wang et al. [48] introduced deep variational CCA (VCCA), which extends the probabilistic CCA framework introduced in Subheading 3 to a nonlinear generative model. In a similar approach to VAEs and mcVAE, deep VCCA uses variational inference to approximate the posterior distribution and derives the following ELBO:
\[ \mathcal{L}=-\mathrm{KL}\left[{q}_{\phi}\left(\boldsymbol{z}\mid {\boldsymbol{x}}_1\right)\parallel p\left(\boldsymbol{z}\right)\right]+{\mathbb{E}}_{q_{\phi}\left(\boldsymbol{z}\mid {\boldsymbol{x}}_1\right)}\left[\ln {p}_{{\boldsymbol{\theta}}_1}\left({\boldsymbol{x}}_1\mid \boldsymbol{z}\right)+\ln {p}_{{\boldsymbol{\theta}}_2}\left({\boldsymbol{x}}_2\mid \boldsymbol{z}\right)\right], \]
where the approximate posterior, q_{ϕ}(z∣x_{1}), and likelihood distributions, \( {p}_{{\boldsymbol{\theta}}_1}\left({\boldsymbol{x}}_1\mid \boldsymbol{z}\right) \) and \( {p}_{{\boldsymbol{\theta}}_2}\left({\boldsymbol{x}}_2\mid \boldsymbol{z}\right) \), are parameterized by neural networks with parameters ϕ, θ_{1}, and θ_{2}.
We note that, in contrast to mcVAE, deep VCCA is based on the estimation of a single latent posterior distribution. The resulting representation therefore depends on the reference modality from which the joint latent representation is encoded, which may bias the estimation of the latent representation. Finally, Wang et al. [48] introduce a variant of deep VCCA, VCCA-private, which extracts the private, in addition to shared, latent information. Here, private latent variables hold view-specific information which is not shared across modalities.
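To make the correlation objective of Eq. 29 more concrete, the sketch below computes the total canonical correlation between two already-computed representations H_1 = f_1(X_1) and H_2 = f_2(X_2), as the sum of the singular values of the whitened cross-covariance matrix, which is the quantity maximized in [29]. The function names and the small ridge term `reg` are assumptions made for illustration, and the networks f_1 and f_2 themselves are not modeled. Because the covariances are estimated from whatever samples are passed in, the value depends on the batch, which is why small mini-batches can give unreliable estimates.

```python
import numpy as np

def inv_sqrt(S):
    # inverse square root of a symmetric positive-definite matrix
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T

def total_correlation(H1, H2, reg=1e-6):
    """Sum of canonical correlations between the columns of H1 and H2."""
    n = H1.shape[0]
    H1c = H1 - H1.mean(axis=0)
    H2c = H2 - H2.mean(axis=0)
    S11 = H1c.T @ H1c / (n - 1) + reg * np.eye(H1.shape[1])
    S22 = H2c.T @ H2c / (n - 1) + reg * np.eye(H2.shape[1])
    S12 = H1c.T @ H2c / (n - 1)
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()

rng = np.random.default_rng(0)
z = rng.standard_normal((500, 1))  # signal shared by the two views
H1 = np.hstack([z + 0.1 * rng.standard_normal((500, 1)),
                rng.standard_normal((500, 1))])
H2 = np.hstack([-z + 0.1 * rng.standard_normal((500, 1)),
                rng.standard_normal((500, 1))])
corr_shared = total_correlation(H1, H2)
corr_noise = total_correlation(rng.standard_normal((500, 2)),
                               rng.standard_normal((500, 2)))
```

Here `corr_shared` is dominated by the single shared direction, while `corr_noise` only reflects sampling fluctuations between independent views.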
4 Biologically Inspired Data Integration Strategies
Medical imaging and omics data are characterized by nontrivial relationships across features, which represent specific mechanisms underlying the pathophysiological processes.
For example, the pattern of brain atrophy and functional impairment may involve brain regions according to the brain connectivity structure [49]. Similarly, biological processes such as gene expression are the result of the joint contribution of several SNPs acting according to biological pathways. According to these processes, it is possible to establish relationships between genetic features in the form of relation networks, represented by ontologies such as the KEGG pathways^{Footnote 2} and the Gene Ontology Consortium.^{Footnote 3}
When applying data-driven multivariate analysis methods to this kind of data, it is therefore relevant to promote interpretability and plausibility of the model, by enforcing the solution to follow the structural constraints underlying the data. This kind of model behavior can be achieved through regularization of the model parameters.
In particular, group-wise regularization [50] is an effective approach to enforce structural patterns during model optimization, where related features are jointly penalized with respect to a common parameter. For example, group-wise constraints may be introduced to account for biological pathways in models of gene association, or for known brain networks and regional interactions in neuroimaging studies. More specifically, we assume that the D_{i} features of a modality \( {\boldsymbol{x}}_i=\left({x_i}_1,\dots, {x_i}_{D_i}\right) \) are grouped in subsets \( {\left\{{\mathcal{S}}_l\right\}}_{l=1}^L \), according to the indices \( {\mathcal{S}}_l=\left({s}_1,\dots, {s}_{N_l}\right) \). The regularization of the general multivariate model of Eq. 2 according to the group-wise constraint can be expressed as:
\[ {\boldsymbol{W}}^{\ast }=\underset{\boldsymbol{W}}{\mathrm{argmin}}\ {\left\Vert {\boldsymbol{X}}_1-{\boldsymbol{X}}_2\boldsymbol{W}\right\Vert}^2+\lambda {\sum}_{l=1}^LR\left({\boldsymbol{W}}_l\right), \]
where λ ≥ 0 sets the strength of the regularization and \( R\left({\boldsymbol{W}}_l\right)={\sum}_{j=1}^{D_1}\sqrt{\sum_{s\in {\mathcal{S}}_l}\boldsymbol{W}{\left[s,j\right]}^2} \) is the penalization of the entries of W associated with the features of X_{2} indexed by \( {\mathcal{S}}_l \). The total penalty is obtained by summing across the D_{1} columns.
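The group-wise penalty R(W_l) defined above can be computed directly from its definition. The following sketch assumes that W has one row per feature of X_2 and D_1 columns, and that each group is given as a list of row indices; the function names and the toy matrix are hypothetical.

```python
import numpy as np

def group_penalty(W, group):
    # R(W_l) = sum_j sqrt( sum_{s in S_l} W[s, j]^2 )
    return np.sqrt((W[group, :] ** 2).sum(axis=0)).sum()

def total_group_penalty(W, groups):
    # sum of R(W_l) over all groups S_1, ..., S_L
    return sum(group_penalty(W, g) for g in groups)

# toy example: 4 features of X_2 in two groups, D_1 = 2 columns
W = np.array([[3.0, 0.0],
              [4.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.0]])
groups = [[0, 1], [2, 3]]

penalty_first_group = group_penalty(W, groups[0])   # sqrt(3^2 + 4^2) + 0
penalty_total = total_group_penalty(W, groups)
```

Since the square root is taken over a whole group, the penalty of a group vanishes only when all of its entries in a column are zero, which is what encourages entire groups of features to be discarded together.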
Group-wise regularization is particularly effective in the following situations:

To compensate for large data dimensionality, by reducing the number of “free parameters” to be optimized by aggregating the available features [51].

To account for the small effect size of each individual feature, by combining features in order to increase the detection power. For example, in genetic analysis, each SNP accounts for less than 1% of the variance in brain imaging quantitative traits when considered individually [52, 53].

To meaningfully integrate complementary information and introduce biologically inspired constraints into the model.
In the context of group-wise regularization in neural networks, several optimization/regularization strategies have been proposed to allow the identification of compressed representations of multimodal data in the bottleneck layers, such as by imposing sparsity of the model parameters or by introducing grouping constraints motivated by prior knowledge [54].
For instance, the Bayesian Genome-to-Phenome Sparse Regression (G2PSR) method proposed in [55] associates genomic data to phenotypic features, such as multimodal neuroimaging and clinical data, by constraining the transformation to optimize relevant group-wise SNP-gene associations. The resulting architecture groups the input SNP layer into corresponding genes represented in the intermediate layer L of the network (Fig. 6). Sparsity at the gene level is introduced through variational dropout [56], to estimate the relevance of each gene (and related SNPs) in reconstructing the output phenotypic features.
In more detail, to incorporate biological constraints in the G2PSR framework, a group-wise penalization is imposed, with nonzero weights W^{g} mapping the input SNPs to their common gene g. The idea is that during optimization the model is forced to jointly discard all the SNPs mapping to genes which are not relevant to the predictive task. Following [56], the variational approximation is parametrized as q(W^{g}), such that each element of the input layer is defined as \( {W}_i^g\sim \mathcal{N}\left({\mu}_i^g,\ {\alpha}_g{\left({\mu}_i^g\right)}^2\right) \) [57], where the parameter α_{g} is optimized to quantify the common uncertainty associated with the ensemble of SNPs contributing to the gene g.
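The group-wise variational approximation can be illustrated with a short sampling sketch. Each weight attached to gene g is drawn as μ_i^g (1 + sqrt(α_g) ε) with ε a standard normal variable, which has mean μ_i^g and variance α_g (μ_i^g)^2, so that a single α_g controls the noise level of all SNPs of that gene. The array names, the SNP-to-gene assignment, and the numerical values are hypothetical, and neither the training loop nor the pruning rule of [55] is reproduced here.

```python
import numpy as np

def sample_group_weights(mu, gene_of_snp, alpha, rng):
    """Draw W_i^g ~ N(mu_i, alpha_g * mu_i^2) with the reparameterization trick.

    mu: (n_snps,) weight means
    gene_of_snp: (n_snps,) index of the gene each SNP maps to
    alpha: (n_genes,) one dropout parameter per gene
    """
    eps = rng.standard_normal(mu.shape)
    return mu * (1.0 + np.sqrt(alpha[gene_of_snp]) * eps)

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.3, 0.8, 0.2])
gene_of_snp = np.array([0, 0, 1, 1])   # SNPs 0-1 map to gene 0, SNPs 2-3 to gene 1
alpha = np.array([0.01, 10.0])         # gene 0: low noise; gene 1: high noise

samples = np.stack([sample_group_weights(mu, gene_of_snp, alpha, rng)
                    for _ in range(20000)])
empirical_mean = samples.mean(axis=0)
empirical_var = samples.var(axis=0)
expected_var = alpha[gene_of_snp] * mu**2
```

In variational dropout [56], weights with a large α are dominated by noise and can be pruned. With a shared α_g, this decision is taken jointly for all the SNPs of a gene, as for gene 1 in this toy example.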
5 Conclusions
This chapter presented an overview of basic notions and tools for multimodal analysis. The set of frameworks introduced here represents an ideal starting point for more complex analyses, either based on linear multivariate methods [58, 59] or on neural network architectures, extending the modeling capabilities to account for highly heterogeneous information, such as multi-organ data [46], text information, and data from electronic health records [60, 61].
Notes
 1.
Note that we could also consider an overcomplete basis for the latent space such that D > min{dim(D_{i}), i = 1, …, M}. This choice may be motivated by the need to account for modalities with particularly low dimension. The study of overcomplete latent data representations is the focus of active research [21,22,23].
 2.
 3.
References
Civelek M, Lusis AJ (2014) Systems genetics approaches to understand complex traits. Nat Rev Genet 15(1):34–48. https://doi.org/10.1038/nrg3575
Liu S, Cai W, Liu S, Zhang F, Fulham M, Feng D, Pujol S, Kikinis R (2015) Multimodal neuroimaging computing: a review of the applications in neuropsychiatric disorders. Brain Inform 2(3):167–180. https://doi.org/10.1007/s40708-015-0019-x
Shen L, Thompson PM (2020) Brain imaging genomics: Integrated analysis and machine learning. Proc IEEE Inst Electr Electron Eng 108(1):125–162. https://doi.org/10.1109/JPROC.2019.2947272
Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, Yang J (2017) 10 years of GWAS discovery: biology, function, and translation. Am J Hum Genet 101(1):5–22. https://doi.org/10.1016/j.ajhg.2017.06.005
Lahat D, Adali T, Jutten C (2014) Challenges in multimodal data fusion. In: EUSIPCO 2014, 22nd European signal processing conference, Lisbon, Portugal, pp 101–105. https://hal.archives-ouvertes.fr/hal-01062366
Menon BK, Campbell BC, Levi C, Goyal M (2015) Role of imaging in current acute ischemic stroke workflow for endovascular therapy. Stroke 46(6):1453–1461. https://doi.org/10.1161/STROKEAHA.115.009160
Zameer S, Siddiqui AS, Riaz R (2021) Multimodality imaging in acute ischemic stroke. Curr Med Imaging 17(5):567–577
Liu X, Lai Y, Wang X, Hao C, Chen L, Zhou Z, Yu X, Hong N (2013) A combined DTI and structural MRI study in medicated-naïve chronic schizophrenia. Magn Reson Imaging 32(1):1–8
Rastogi S, Lee C, Salamon N (2008) Neuroimaging in pediatric epilepsy: a multimodality approach. Radiographics 28(4):1079–1095
Abela E, Rummel C, Hauf M, Weisstanner C, Schindler K, Wiest R (2014) Neuroimaging of epilepsy: lesions, networks, oscillations. Clin Neuroradiol 24(1):5–15
Fernández S, Donaire A, Serès E, Setoain X, Bargalló N, Falcón C, Sanmartí F, Maestro I, Rumià J, Pintor L, Boget T, Aparicio J, Carreño M (2015) PET/MRI and PET/MRI/SISCOM coregistration in the presurgical evaluation of refractory focal epilepsy. Epilepsy Res 111:1–9. https://doi.org/10.1016/j.eplepsyres.2014.12.011
Hong SB, Zalesky A, Fornito A, Park S, Yang YH, Park MH, Song IC, Sohn CH, Shin MS, Kim BN, Cho SC, Han DH, Cheong JH, Kim JW (2014) Connectomic disturbances in attention-deficit/hyperactivity disorder: a whole-brain tractography analysis. Biol Psychiatry 76(8):656–663
Mueller S, Keeser D, Samson AC, Kirsch V, Blautzik J, Grothe M, Erat O, Hegenloh M, Coates U, Reiser MF, Hennig-Fast K, Meindl T (2013) Convergent findings of altered functional and structural brain connectivity in individuals with high functioning autism: a multimodal MRI study. PLoS ONE 8(6):1–11. https://doi.org/10.1371/journal.pone.0067329
Lorenzi M, Altmann A, Gutman B, Wray S, Arber C, Hibar DP, Jahanshad N, Schott JM, Alexander DC, Thompson PM, Ourselin S (2018) Susceptibility of brain atrophy to TRIB3 in Alzheimer’s disease, evidence from functional prioritization in imaging genetics. Proc Natl Acad Sci 115(12):3162–3167. https://doi.org/10.1073/pnas.1706100115
Kim M, Kim J, Lee SH, Park H (2017) Imaging genetics approach to Parkinson’s disease and its correlation with clinical score. Sci Rep 7(1):46700. https://doi.org/10.1038/srep46700
Martins D, Giacomel A, Williams SC, Turkheimer F, Dipasquale O, Veronese M, Group PTW, et al. (2021) Imaging transcriptomics: convergent cellular, transcriptomic, and molecular neuroimaging signatures in the healthy adult human brain. Cell Rep 37(13):110173
Schrouff J, Rosa MJ, Rondina JM, Marquand AF, Chu C, Ashburner J, Phillips C, Richiardi J, Mourão-Miranda J (2013) PRoNTo: pattern recognition for neuroimaging toolbox. Neuroinformatics 11(3):319–337
Szymczak S, Biernacka JM, Cordell HJ, González-Recio O, König IR, Zhang H, Sun YV (2009) Machine learning in genome-wide association studies. Genet Epidemiol 33(S1):S51–S57
Liu J, Calhoun VD (2014) A review of multivariate analyses in imaging genetics. Front Neuroinform 8:29
Lorenzi M, Altmann A, Gutman B, Wray S, Arber C, Hibar DP, Jahanshad N, Schott JM, Alexander DC, Thompson PM, Ourselin S (2018) Susceptibility of brain atrophy to TRIB3 in Alzheimer’s disease, evidence from functional prioritization in imaging genetics. Proc Natl Acad Sci 115(12):3162–3167. https://doi.org/10.1073/pnas.1706100115
Shashanka M, Raj B, Smaragdis P (2007) Sparse overcomplete latent variable decomposition of counts data. In: Advances in neural information processing systems, vol 20
Anandkumar A, Ge R, Janzamin M (2015) Learning overcomplete latent variable models through tensor methods. In: Conference on learning theory, PMLR, pp 36–112
Antelmi L, Ayache N, Robert P, Lorenzi M (2019) Sparse multi-channel variational autoencoder for the joint analysis of heterogeneous data. In: International conference on machine learning, PMLR, pp 302–311
Hotelling H (1936) Relations between two sets of variates. Biometrika 28(3/4):321
Liu J, Calhoun V (2014) A review of multivariate analyses in imaging genetics. Front Neuroinform 8:29. https://doi.org/10.3389/fninf.2014.00029
Kettenring JR (1971) Canonical analysis of several sets of variables. Biometrika 58(3):433–451. https://doi.org/10.1093/biomet/58.3.433
Luo Y, Tao D, Ramamohanarao K, Xu C, Wen Y (2015) Tensor canonical correlation analysis for multi-view dimension reduction. IEEE Trans Knowl Data Eng 27(11):3111–3124. https://doi.org/10.1109/TKDE.2015.2445757
Huang SY, Lee MH, Hsiao CK (2009) Nonlinear measures of association with kernel canonical correlation analysis and applications. J Stat Plan Inference 139(7):2162–2174. https://doi.org/10.1016/j.jspi.2008.10.011
Andrew G, Arora R, Bilmes J, Livescu K (2013) Deep canonical correlation analysis. In: Dasgupta S, McAllester D (eds) Proceedings of the 30th international conference on machine learning, PMLR, Atlanta, Georgia, USA, Proceedings of Machine Learning Research, vol 28, pp 1247–1255. https://proceedings.mlr.press/v28/andrew13.html
McIntosh A, Bookstein F, Haxby JV, Grady C (1996) Spatial pattern analysis of functional brain images using partial least squares. Neuroimage 3(3):143–157
Worsley KJ (1997) An overview and some new developments in the statistical analysis of PET and fMRI data. Hum Brain Mapp 5(4):254–258
De Bie T, Cristianini N, Rosipal R (2005) Eigenproblems in pattern recognition. In: Handbook of geometric computing, pp 129–167
Bach F, Jordan M (2003) Kernel independent component analysis. J Mach Learn Res 3:1–48. https://doi.org/10.1162/153244303768966085
Theodoridis S, Koutroumbas K (2008) Pattern recognition, 4th edn. Academic Press, New York
Wold H (1975) Path models with latent variables: the NIPALS approach. In: Quantitative sociology. Elsevier, Amsterdam, pp 307–357
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Tenenhaus M (1999) L’approche PLS. Revue de statistique appliquée 47(2):5–40
Vidaurre D, van Gerven MA, Bielza C, Larrañaga P, Heskes T (2013) Bayesian sparse partial least squares. Neural Comput 25(12):3318–3339
Klami A, Virtanen S, Kaski S (2013) Bayesian canonical correlation analysis. J Mach Learn Res 14(4):965–1003
Tipping ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc Series B (Statistical Methodology) 61(3):611–622
Balelli I, Silva S, Lorenzi M (2021) A probabilistic framework for modeling the variability across federated datasets of heterogeneous multiview observations. In: Information processing in medical imaging: proceedings of the…conference.
Kingma DP, Welling M (2014) Auto-encoding variational Bayes. In: Proc. 2nd Int. Conf. Learn. Represent. (ICLR 2014). arXiv:1312.6114
Rezende DJ, Mohamed S, Wierstra D (2014) Stochastic backpropagation and approximate inference in deep generative models. In: International conference on machine learning. PMLR, pp 1278–1286
Kim Y, Wiseman S, Miller A, Sontag D, Rush A (2018) Semi-amortized variational autoencoders. In: International conference on machine learning. PMLR, pp 2678–2687
Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: a review for statisticians. J Am Stat Assoc 112(518):859–877
Diaz-Pinto A, Ravikumar N, Attar R, Suinesiaputra A, Zhao Y, Levelt E, Dall’Armellina E, Lorenzi M, Chen Q, Keenan TD et al (2022) Predicting myocardial infarction through retinal scans and minimal personal information. Nat Mach Intell 4:55–61
Nocedal J, Wright S (2006) Numerical optimization. Springer nature, pp 1–664. Springer series in operations research and financial engineering
Wang W, Lee H, Livescu K (2016) Deep variational canonical correlation analysis. http://arxiv.org/abs/1610.03454
Hafkemeijer A, Altmann-Schneider I, Oleksik AM, van de Wiel L, Middelkoop HA, van Buchem MA, van der Grond J, Rombouts SA (2013) Increased functional connectivity and brain atrophy in elderly with subjective memory complaints. Brain Connect 3(4):353–362
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Series B (Statistical Methodology) 68(1):49–67
Zhang Y, Xu Z, Shen X, Pan W, Initiative ADN (2014) Testing for association with multiple traits in generalized estimation equations, with application to neuroimaging data. NeuroImage 96:309–325. https://doi.org/10.1016/j.neuroimage.2014.03.061
Hibar DP, Stein JL, Kohannim O, Jahanshad N, Saykin AJ, Shen L, Kim S, Pankratz N, Foroud T, Huentelman MJ, Potkin SG, Jack Jr CR, Weiner MW, Toga AW, Thompson PM, Initiative ADN (2011) Voxelwise gene-wide association study (vGeneWAS): multivariate gene-based association testing in 731 elderly subjects. NeuroImage 56(4):1875–1891. https://doi.org/10.1016/j.neuroimage.2011.03.077
Ge T, Feng J, Hibar DP, Thompson PM, Nichols TE (2012) Increasing power for voxel-wise genome-wide association studies: the random field theory, least square kernel machines and fast permutation procedures. NeuroImage 63:858–873
Schmidt W, Kraaijveld M, Duin R (1992) Feedforward neural networks with random weights. In: Proceedings of the 11th IAPR international conference on pattern recognition. Vol. II. Conference B: pattern recognition methodology and systems, pp 1–4. https://doi.org/10.1109/ICPR.1992.201708
Deprez M, Moreira J, Sermesant M, Lorenzi M (2022) Decoding genetic markers of multiple phenotypic layers through biologically constrained genome-to-phenome Bayesian sparse regression. Front Mol Med. https://doi.org/10.3389/fmmed.2022.830956
Molchanov D, Ashukha A, Vetrov D (2017) Variational dropout sparsifies deep neural networks. arXiv 1701.05369
Kingma DP, Welling M (2014) Auto-encoding variational Bayes. CoRR abs/1312.6114
Pearlson GD, Liu J, Calhoun VD (2015) An introductory review of parallel independent component analysis (pICA) and a guide to applying pICA to genetic data and imaging phenotypes to identify disease-associated biological pathways and systems in common complex disorders. Front Genet 6:276
Le Floch É, Guillemot V, Frouin V, Pinel P, Lalanne C, Trinchera L, Tenenhaus A, Moreno A, Zilbovicius M, Bourgeron T et al (2012) Significant correlation between a set of genetic polymorphisms and a functional brain network revealed by feature selection and sparse partial least squares. Neuroimage 63(1):11–24
Rodin I, Fedulova I, Shelmanov A, Dylov DV (2019) Multitask and multimodal neural network model for interpretable analysis of X-ray images. In: 2019 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 1601–1604
Huang SC, Pareek A, Zamanian R, Banerjee I, Lungren MP (2020) Multimodal fusion with deep neural networks for leveraging CT imaging and electronic health record: a case-study in pulmonary embolism detection. Sci Rep 10(1):1–9
Acknowledgements
This work was supported by the French government, through the 3IA Côte d’Azur Investments in the Future project managed by the National Research Agency (ANR) (ANR-19-P3IA-0002).
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this protocol
Cite this protocol
Lorenzi, M., Deprez, M., Balelli, I., Aguila, A.L., Altmann, A. (2023). Integration of Multimodal Data. In: Colliot, O. (eds) Machine Learning for Brain Disorders. Neuromethods, vol 197. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-3195-9_19
DOI: https://doi.org/10.1007/978-1-0716-3195-9_19
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-3194-2
Online ISBN: 978-1-0716-3195-9
eBook Packages: Springer Protocols