1 Introduction

The goal of multimodal data analysis is to reveal novel insights into complex biological conditions. Through the combined analysis of multiple types of data, and the complementary views on pathophysiological processes they provide, we have the potential to improve our understanding of the underlying processes leading to complex and multifactorial disorders [1]. In medical imaging applications, multiple imaging modalities, such as structural magnetic resonance imaging (sMRI), functional MRI (fMRI), diffusion tensor imaging (DTI), or positron emission tomography (PET), can be jointly analyzed to better characterize pathological conditions affecting individuals [2]. Other typical multimodal analysis problems involve the joint analysis of heterogeneous data types, such as imaging and genetics data, where medical imaging is associated with the patient’s genotype information, represented by genetic variants such as single-nucleotide polymorphisms (SNPs) [3]. This kind of application, termed imaging-genetics, is of central importance for the identification of genetic risk factors underlying complex diseases including age-related macular degeneration, obesity, schizophrenia, and Alzheimer’s disease [4].

Despite the great potential of multimodal data analysis, the complexity of multiple data types and clinical questions poses several challenges to researchers, involving the scalability, interpretability, and generalization of complex association models.

1.1 Challenges of Multimodal Data Assimilation

Due to the complementary nature of multimodal information, there is great interest in combining different data types to better characterize the anatomy and physiology of patients and individuals. Multimodal data is generally acquired using heterogeneous protocols highlighting different anatomical, physiological, clinical, and biological information for a given individual [5].

Typical multimodal data integration challenges are:

  • Non-commensurability. Since each data modality quantifies different physical and biological phenomena, multimodal data is represented by heterogeneous physical units associated to different aspects of the studied biological process (e.g., brain structure, activity, clinical scores, gene expression levels).

  • Spatial heterogeneity. Multimodal medical images are characterized by a specific spatial resolution, which is independent of the spatial coordinate system to which they are standardized.

  • Heterogeneous dimensions. The data type and dimensions of medical data can vary according to the modality, ranging from scalars and time series, typical of fMRI and PET data, to the structured tensors of diffusion-weighted imaging.

  • Heterogeneous noise. Medical data modalities are characterized by specific and heterogeneous artifacts and measurement uncertainty, resulting from heterogeneous acquisition and processing routines.

  • Missing data. Multimodal medical datasets are often incomplete, since patients may not undergo the same protocol, and some modalities may be more expensive to acquire than others.

  • Interpretability. A major challenge of multimodal data integration is the interpretability of the analysis results. This aspect is impacted by the complexity of the analysis methods and generally requires important expertise in data acquisition, processing, and analysis.

Multimodal data analysis methods proposed in the literature have addressed different levels of data complexity and integration, depending on the application of interest. Visual inspection is the typical initial step of multimodal studies, where single modalities are compared on a qualitative basis. For example, different medical imaging modalities can be jointly visualized for a given individual to identify common spatial patterns of signal changes. Data integration can be subsequently performed by jointly exploring unimodal features and unimodal analysis results. To this end, we may stratify the cohort of a clinical study based on some biomarkers extracted from different medical imaging modalities exceeding predefined thresholds. Finally, multivariate statistical and machine learning techniques can be applied for data-driven analysis of the joint relationship between information encoded in different modalities. Such approaches attempt to maximize the advantages of combining cross-modality information, dimensions, and resolution of the multimodal signal. The ultimate goal of such analysis methods is to identify the “mechanisms” underlying the generation of the observed medical data, to provide a joint representation of the common variation of heterogeneous data types.

The literature on multimodal analysis approaches is extensive, depending on the kind of applications and related data types. In this chapter we focus on general data integration methods, which can be classically related to the fields of multivariate statistical analysis and latent variable modeling. The importance of these approaches lies in the generality of their formulation, which makes them an ideal baseline for the analysis of heterogeneous data types. Furthermore, this chapter illustrates current extensions of these basic approaches to deep probabilistic models, which allow great modeling flexibility for current state-of-the-art applications.

In Subheading 1.2 we provide an overview of typical multimodal analyses in neuroimaging applications, while in Subheading 2 we introduce the statistical foundations of multivariate latent variable modeling, with emphasis on the standard approaches of partial least squares (PLS) and canonical correlation analysis (CCA). In Subheading 3, these classical methods are reformulated under the Bayesian lens, to define linear counterparts of latent variable models (Subheading 3.2) and their extension to multi-channel and deep multivariate analysis (Subheadings 3.3 and 3.4). In Subheading 4 we finally address the problem of group-wise regularization to improve the interpretability of multivariate association models, with specific focus on imaging-genetics applications.

Box 1: Online Tutorial

The material covered in this chapter is available at the following online tutorial:


1.2 Motivation from Neuroimaging Applications

Multimodal analysis methods have been explored for their potential in automatic patient diagnosis and stratification, as well as for their ability to identify interpretable data patterns characterizing clinical conditions. In this section, we summarize state-of-the-art contributions to the field, along with the remaining challenges to improve our understanding and applications to complex brain disorders.

  • Structural-structural combination. Methods combining sMRI and dMRI imaging modalities are predominant in the field. Such combined analysis has been proposed, for example, for the detection of brain lesions (e.g., strokes [6, 7]) and to study and improve the management of patients with brain disorders [8].

  • Functional-functional combination. Due to the complementary nature of EEG and fMRI, research in brain connectivity analysis has focused on the fusion of these modalities, to optimally integrate the high temporal resolution of EEG with the high spatial resolution of the fMRI signal. As a result, EEG-fMRI can provide simultaneous cortical and subcortical recording of brain activity with high spatiotemporal resolution. For example, this combination is increasingly used to provide clinical support for the diagnosis and treatment of epilepsy, to accurately localize seizure onset areas, as well as to map the surrounding functional cortex in order to avoid disability [9,10,11].

  • Structural-functional combination. The combined analysis of sMRI, dMRI, and fMRI has been frequently proposed in neuropsychiatric research due to the high clinical availability of these imaging modalities and due to their potential to link brain function, structure, and connectivity. A typical application is in the study of autism spectrum disorder and attention-deficit hyperactivity disorder (ADHD). The combined analysis of such modalities has been proposed, for example, for the identification of altered white matter connectivity patterns in children with ADHD [12], highlighting association patterns between regional brain structural and functional abnormalities [13].

  • Imaging-genetics. The combination of imaging and genetics data has been increasingly studied to identify genetic risk factors (genetic variations) associated with functional or structural abnormalities (quantitative traits, QTs) in complex brain disorders [3]. Such multimodal analyses are key to identify the underlying mechanisms (from genotype to phenotype) leading to neurodegenerative diseases, such as Alzheimer’s disease [14] or Parkinson’s disease [15]. This analysis paradigm paves the way to novel data integration scenarios, including imaging and transcriptomics, or multi-omic data [16].

Overall, multimodal data integration in the study of brain disorders has shown promising results and is an actively evolving field. The potential of neuroimaging information is continuously improving, with increasing resolution and improved image contrast. Moreover, multiple imaging modalities are increasingly available in large collections of multimodal brain data, allowing for the application of complex modeling approaches on representative cohorts.

2 Methodological Background

2.1 From Multivariate Regression to Latent Variable Models

The use of multivariate analysis methods for biomedical data analysis is widespread, for example, in neuroscience [17], genetics [18], and imaging-genetics studies [19, 20]. These approaches come with the potential of explicitly highlighting the underlying relationship between data modalities, by identifying sets of relevant features that are jointly associated to explain the observed data.

In what follows, we represent the multimodal information available for a given subject k as a collection of arrays \( {\boldsymbol{x}}_i^k \), i = 1, …, M, where M is the number of available modalities. Each array has dimension \( \mathit{\dim}\left({\boldsymbol{x}}_i^k\right)={D}_i \). A multimodal data matrix for N individuals is therefore represented by the collection of matrices Xi, with dim(Xi) = N × Di. For the sake of simplicity, we assume that \( {\boldsymbol{x}}_i^k\in {\mathbb{R}}^{D_i} \).

A first assumption that can be made for defining a multivariate analysis method is that a target modality, say Xj, is generated by the combination of a set of given modalities {Xi}i≠j. A typical example of this application concerns the prediction of certain clinical variables from the combination of imaging features. In this case, the underlying forward generative model for an observation \( {\boldsymbol{x}}_j^k \) can be expressed as:

$$ {\boldsymbol{x}}_j^k\kern0.5em =g\left({\left\{{\boldsymbol{x}}_i^k\right\}}_{i\ne j}\right)+{\boldsymbol{\varepsilon}}_j^k,\kern0.5em $$

where we assume that there exists an ideal mapping g(⋅) that transforms the ensemble of observed modalities for the individual k, to generate the target one \( {\boldsymbol{x}}_j^k \). Note that we generally assume that the observations are corrupted by a certain noise \( {\boldsymbol{\varepsilon}}_j^k \), whose nature depends on the data type. The standard choice for the noise is Gaussian, \( {\boldsymbol{\varepsilon}}_j^k\sim \mathcal{N}\left(\mathbf{0},{\sigma}^2\boldsymbol{Id}\right) \).

Within this setting, a multimodal model is represented by a function \( f\left({\left\{{\boldsymbol{X}}_i\right\}}_{i=1}^M,\boldsymbol{\theta} \right) \), with parameters θ, taking as input the ensemble of modalities across subjects. The model f is optimized with respect to θ to solve a specific task. In our case, the set of input modalities can be used to predict a target modality j, in which case \( f:{\otimes}_{i\ne j}{\mathbb{R}}^{D_i}\mapsto {\mathbb{R}}^{D_j} \).

In its basic form, this kind of formulation includes standard multivariate linear regression, where the relationship between two modalities X1 and X2 is modeled through a set of linear parameters \( \boldsymbol{\theta} =\boldsymbol{W}\in {\mathbb{R}}^{D_2\times {D}_1} \) and f(X2) = X2 ⋅W. Under the Gaussian noise assumption, the typical optimization task is formulated as the least squares problem:

$$ {\boldsymbol{W}}^{\ast }=\underset{\boldsymbol{W}}{\mathrm{argmin}}\kern0.3em \parallel {\boldsymbol{X}}_1-{\boldsymbol{X}}_2\cdot \boldsymbol{W}\parallel {}^2. $$

When jointly modeling multiple modalities, the forward generative model of Eq. 1 may be suboptimal, as it implies the explicit dependence of the target modality upon the other ones. This assumption may be too restrictive, as often an explicit assumption of dependency cannot be made, and we are rather interested in modeling the joint variation between data modalities. This is the rationale of latent variable models.

In the latent variable setting, we assume that the multiple modalities are jointly dependent on a common latent representation z (Fig. 1) belonging to an ideal low-dimensional space of dimension D ≤min{Di, i = 1, …, M}. In this case, Eq. 1 can be extended to the generative process:

$$ {\boldsymbol{x}}_i^k\kern0.5em ={g}_i\left({\boldsymbol{z}}_k\right)+{\boldsymbol{\varepsilon}}_i^k,\kern2em i=1,\dots, M.\kern0.5em $$
Fig. 1

Illustration of a generative process for the modeling of imaging and genetics data

Equation 3 is the forward process governing the data generation. The goal of latent variable modeling is to make inference on the latent space and on the generative process from the observed data modalities, based on specific assumptions on the transformations from the latent to the data space, and on the kind of noise process affecting the observations (Box 2). In particular, the inference problem can be tackled by estimating inverse mappings, \( {f}_j\left({\boldsymbol{x}}_j^k\right) \), from the data space of the observed modalities to the latent space.

Based on this framework, in the following sections, we illustrate the standard approaches for solving the inference problem of Eq. 3.

Box 2: Online Tutorial—Generative Models

The forward model of Eq. 3 for multimodal data generation can be easily coded in Python to generate a synthetic multimodal dataset:

A code snippet generates synthetic data by defining two Gaussian latent variables, transforming them using random transformation matrices, and adding random Gaussian noise to the resulting data. The resulting datasets represent the observed variables in the latent variable model.
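A minimal sketch of such a snippet is given below; the shapes, random seed, and noise level are illustrative choices and not necessarily those of the online tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D_latent = 500, 2                 # subjects, latent dimension
D1, D2 = 6, 24                       # dimensions of the two modalities

# Gaussian latent variables z^k ~ N(0, I)
Z = rng.standard_normal((N, D_latent))

# Random transformation matrices from the latent to the data space
W1 = rng.standard_normal((D1, D_latent))
W2 = rng.standard_normal((D2, D_latent))

# Observed modalities: x_i^k = g_i(z^k) + eps_i^k, with Gaussian noise
sigma = 0.1
X1 = Z @ W1.T + sigma * rng.standard_normal((N, D1))
X2 = Z @ W2.T + sigma * rng.standard_normal((N, D2))
```

The matrices X1 and X2 play the role of the observed variables in the latent variable model.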

2.2 Classical Latent Variable Models: PLS and CCA

Classical latent variable models extend standard linear regression to analyze the joint variability of different modalities. Typical formulations of latent variable models include partial least squares (PLS) and canonical correlation analysis (CCA) [24], which have been successfully applied in biomedical research [25], along with multimodal [26, 27] and nonlinear [28, 29] variants.

Box 3: Online Tutorial—PLS and CCA with sklearn

A code snippet fits the PLS and CCA models to the training data and then projects the data onto the corresponding latent dimensions using the transform function. This allows for dimensionality reduction and captures the relationships between the variables in the reduced space.

The basic principle of these multivariate analysis techniques relies on the identification of linear transformations of modalities Xi and Xj into a lower-dimensional subspace of dimension D ≤min{Di, Dj}, where the projected data exhibits the desired statistical properties of similarity. For example, PLS aims at maximizing the covariance between these combinations (or projections on the modes’ directions), while CCA maximizes their statistical correlation (Box 3). For simplicity, in what follows we focus on the joint analysis of two modalities X1 and X2, and the multimodal model can be written as

$$ f\left({\boldsymbol{X}}_1,{\boldsymbol{X}}_2,\boldsymbol{\theta} \right)\kern0.5em =\left[{f}_1\left({\boldsymbol{X}}_1,{\boldsymbol{u}}_1\right),{f}_2\left({\boldsymbol{X}}_2,{\boldsymbol{u}}_2\right)\right]\kern0.5em $$
$$ \kern0.5em =\left[{\boldsymbol{z}}_1,{\boldsymbol{z}}_2\right],\kern0.5em $$

where θ = {u1, u2} are linear projection operators for the modalities, \( {\boldsymbol{u}}_i\in {\mathbb{R}}^{D_i} \), while \( {\boldsymbol{z}}_i={\boldsymbol{X}}_i\cdot {\boldsymbol{u}}_i\in {\mathbb{R}}^N \) are the latent projections for each modality i = 1, 2. The optimization problem can thus be formulated as:

$$ {\boldsymbol{u}}_1^{\ast },{\boldsymbol{u}}_2^{\ast}\kern0.5em =\underset{\boldsymbol{\theta}}{\mathrm{argmax}}\kern1em Sim\left({\boldsymbol{z}}_1,{\boldsymbol{z}}_2\right)\kern0.5em $$
$$ \kern8.50em =\underset{{\boldsymbol{u}}_1,{\boldsymbol{u}}_2}{\mathrm{argmax}}\kern1em Sim\left({\boldsymbol{X}}_1\cdot {\boldsymbol{u}}_1,{\boldsymbol{X}}_2\cdot {\boldsymbol{u}}_2\right),\kern0.5em $$

where Sim is a suitable measure of statistical similarity, depending on the envisaged method (e.g., covariance for PLS, or correlation for CCA) (Fig. 2).

Fig. 2

Illustration of latent variable modeling for an idealized application to the modeling of genetics and imaging data

2.3 Latent Variable Models Through Eigen-Decomposition

2.3.1 Partial Least Squares

For PLS, the problem of Eq. 6 requires the estimation of projections u1 and u2 maximizing the covariance between the latent representation of the two modalities X1 and X2:

$$ {\boldsymbol{u}}_1^{\ast },{\boldsymbol{u}}_2^{\ast}\kern0.5em =\underset{{\boldsymbol{u}}_1,{\boldsymbol{u}}_2}{\mathrm{argmax}}\kern1em \mathrm{Cov}\left({\boldsymbol{X}}_1\cdot {\boldsymbol{u}}_1,{\boldsymbol{X}}_2\cdot {\boldsymbol{u}}_2\right),\kern0.5em $$


where

$$ \mathrm{Cov}\left({\boldsymbol{X}}_1\cdot {\boldsymbol{u}}_1,{\boldsymbol{X}}_2\cdot {\boldsymbol{u}}_2\right)=\frac{{\boldsymbol{u}}_1^T\boldsymbol{S}{\boldsymbol{u}}_2}{\sqrt{{\boldsymbol{u}}_1^T{\boldsymbol{u}}_1}\sqrt{{\boldsymbol{u}}_2^T{\boldsymbol{u}}_2}}, $$

and \( \boldsymbol{S}={\boldsymbol{X}}_1^T{\boldsymbol{X}}_2 \) is the sample covariance between modalities.

Without loss of generality, the maximization of Eq. 9 can be considered under the unit-norm constraint \( \sqrt{{\boldsymbol{u}}_1^T{\boldsymbol{u}}_1}=\sqrt{{\boldsymbol{u}}_2^T{\boldsymbol{u}}_2}=1 \). This constrained optimization problem can be expressed in the Lagrangian form:

$$ \mathcal{L} \left({\boldsymbol{u}}_1,{\boldsymbol{u}}_2,{\lambda}_x,{\lambda}_y\right)={\boldsymbol{u}}_1^T\boldsymbol{S}{\boldsymbol{u}}_2-{\lambda}_x\left({\boldsymbol{u}}_1^T{\boldsymbol{u}}_1-1\right)-{\lambda}_y\left({\boldsymbol{u}}_2^T{\boldsymbol{u}}_2-1\right), $$

whose solution can be written as:

$$ \left[\begin{array}{ll}\hfill \mathbf{0}\hfill & \hfill \boldsymbol{S}\hfill \\ {}\hfill {\boldsymbol{S}}^T\hfill & \hfill \mathbf{0}\hfill \\ {}\hfill \hfill \end{array}\right]\left[\begin{array}{l}\hfill {\boldsymbol{u}}_1\hfill \\ {}\hfill {\boldsymbol{u}}_2\hfill \end{array}\right]=\lambda \left[\begin{array}{l}\hfill {\boldsymbol{u}}_1\hfill \\ {}\hfill {\boldsymbol{u}}_2\hfill \end{array}\right]. $$

Equation 11 corresponds to the primal formulation of PLS and shows that the PLS projections maximizing the latent covariance are the left and right eigen-vectors of the sample covariance matrix across modalities. This solution is known as PLS-SVD and has been widely adopted in the field of neuroimaging [30, 31], for the study of common patterns of variability between multimodal imaging data, such as PET and fMRI.

It is worth noticing that classical principal component analysis (PCA) is a special case of PLS when X1 = X2. In this case the latent projections maximize the data variance and correspond to the eigen-modes of the sample covariance matrix \( \boldsymbol{S}={\boldsymbol{X}}_1^T{\boldsymbol{X}}_1 \).
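The PLS-SVD solution can be sketched in a few lines of Python, using centered synthetic data with a shared latent variable (all shapes and scales below are illustrative assumptions):

```python
import numpy as np

# Synthetic centered data sharing a one-dimensional latent variable
rng = np.random.default_rng(0)
Z = rng.standard_normal((300, 1))
X1 = Z @ rng.standard_normal((1, 5)) + 0.1 * rng.standard_normal((300, 5))
X2 = Z @ rng.standard_normal((1, 8)) + 0.1 * rng.standard_normal((300, 8))
X1 -= X1.mean(axis=0)
X2 -= X2.mean(axis=0)

# PLS-SVD: left/right singular vectors of the cross-covariance matrix
S = X1.T @ X2
U, svals, Vt = np.linalg.svd(S)
u1, u2 = U[:, 0], Vt[0]                # first pair of projections

z1, z2 = X1 @ u1, X2 @ u2              # latent projections per modality
```

Since both modalities are driven by the same latent variable, the projected scores z1 and z2 are strongly associated.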

2.3.2 Canonical Correlation Analysis

In canonical correlation analysis (CCA), the problem of Eq. 6 is formulated by optimizing linear transformations such that X1u1 and X2u2 are maximally correlated:

$$ {\boldsymbol{u}}_1^{\ast },{\boldsymbol{u}}_2^{\ast}\kern0.5em =\underset{{\boldsymbol{u}}_1,{\boldsymbol{u}}_2}{\mathrm{argmax}}\kern0.3em \mathrm{Corr}\left({\boldsymbol{X}}_1{\boldsymbol{u}}_1,{\boldsymbol{X}}_2{\boldsymbol{u}}_2\right),\kern0.5em $$


$$ \mathrm{Corr}\left({\boldsymbol{X}}_1{\boldsymbol{u}}_1,{\boldsymbol{X}}_2{\boldsymbol{u}}_2\right)\kern0.5em =\frac{{\boldsymbol{u}}_1^T\boldsymbol{S}{\boldsymbol{u}}_2}{\sqrt{{\boldsymbol{u}}_1^T{\boldsymbol{S}}_1{\boldsymbol{u}}_1}\sqrt{{\boldsymbol{u}}_2^T{\boldsymbol{S}}_2{\boldsymbol{u}}_2}},\kern0.5em $$

where \( {\boldsymbol{S}}_1={\boldsymbol{X}}_1^T{\boldsymbol{X}}_1 \) and \( {\boldsymbol{S}}_2={\boldsymbol{X}}_2^T{\boldsymbol{X}}_2 \) are the sample covariances of modality 1 and 2, respectively.

Proceeding in a similar way as for the derivation of PLS, it can be shown that CCA is associated with the generalized eigen-decomposition problem [32]:

$$ \left[\begin{array}{ll}\hfill \mathbf{0}\hfill & \hfill \boldsymbol{S}\hfill \\ {}\hfill {\boldsymbol{S}}^T\hfill & \hfill \mathbf{0}\hfill \\ {}\hfill \hfill \end{array}\right]\left[\begin{array}{l}\hfill {\boldsymbol{u}}_1\hfill \\ {}\hfill {\boldsymbol{u}}_2\hfill \end{array}\right]=\lambda \left[\begin{array}{ll}\hfill {\boldsymbol{S}}_1\hfill & \hfill \mathbf{0}\hfill \\ {}\hfill \mathbf{0}\hfill & \hfill {\boldsymbol{S}}_2\hfill \\ {}\hfill \hfill \end{array}\right]\left[\begin{array}{l}\hfill {\boldsymbol{u}}_1\hfill \\ {}\hfill {\boldsymbol{u}}_2\hfill \end{array}\right]. $$

It is common practice to reformulate the CCA problem of Eq. 14 with a regularized version aimed at avoiding numerical instabilities due to the estimation of the sample covariances S1 and S2:

$$ \left[\begin{array}{ll}\hfill \mathbf{0}\hfill & \hfill \boldsymbol{S}\hfill \\ {}\hfill {\boldsymbol{S}}^T\hfill & \hfill \mathbf{0}\hfill \\ {}\hfill \hfill \end{array}\right]\left[\begin{array}{l}\hfill {\boldsymbol{u}}_1\hfill \\ {}\hfill {\boldsymbol{u}}_2\hfill \end{array}\right]=\lambda \left[\begin{array}{ll}\hfill {\boldsymbol{S}}_1+\delta \boldsymbol{I}\hfill & \hfill \mathbf{0}\hfill \\ {}\hfill \mathbf{0}\hfill & \hfill {\boldsymbol{S}}_2+\delta \boldsymbol{I}\hfill \\ {}\hfill \hfill \end{array}\right]\left[\begin{array}{l}\hfill {\boldsymbol{u}}_1\hfill \\ {}\hfill {\boldsymbol{u}}_2\hfill \end{array}\right]. $$

In this latter formulation, the right-hand side of Eq. 14 is regularized by introducing a constant diagonal term δ, proportional to the regularization strength (with δ = 0 we obtain Eq. 14). Interestingly, for large values of δ, the diagonal term dominates the sample covariance matrices of the right-hand side, and we retrieve the standard eigen-value problem of Eq. 11. This shows that PLS can be interpreted as an infinitely regularized formulation of CCA.
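The regularized CCA problem of Eq. 15 can be sketched as a generalized eigen-value problem; the synthetic data, dimensions, and value of δ below are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import eigh

# Synthetic centered data sharing a one-dimensional latent variable
rng = np.random.default_rng(0)
Z = rng.standard_normal((300, 1))
X1 = Z @ rng.standard_normal((1, 5)) + 0.1 * rng.standard_normal((300, 5))
X2 = Z @ rng.standard_normal((1, 8)) + 0.1 * rng.standard_normal((300, 8))
X1 -= X1.mean(axis=0)
X2 -= X2.mean(axis=0)

S = X1.T @ X2                          # cross-covariance between modalities
S1, S2 = X1.T @ X1, X2.T @ X2          # within-modality covariances
d1, d2 = S1.shape[0], S2.shape[0]
delta = 1e-3                           # regularization strength

# Block matrices of the generalized eigen-problem A v = lambda B v
A = np.block([[np.zeros((d1, d1)), S],
              [S.T, np.zeros((d2, d2))]])
B = np.block([[S1 + delta * np.eye(d1), np.zeros((d1, d2))],
              [np.zeros((d2, d1)), S2 + delta * np.eye(d2)]])

evals, evecs = eigh(A, B)              # symmetric generalized eigen-solver
v = evecs[:, -1]                       # eigen-vector of the top eigen-value
u1, u2 = v[:d1], v[d1:]                # canonical directions
```

Setting δ to a large value drives the solution toward the PLS projections, consistent with the interpretation of PLS as infinitely regularized CCA.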

2.4 Kernel Methods for Latent Variable Models

In order to capture nonlinear relationships, we may wish to project our input features into a high-dimensional space prior to performing CCA (or PLS):

$$ \phi :\boldsymbol{X}=\left({\boldsymbol{x}}^1,\dots, {\boldsymbol{x}}^N\right)\mapsto \left[\phi \left({\boldsymbol{x}}^1\right),\dots, \phi \left({\boldsymbol{x}}^N\right)\right] $$

where ϕ is a nonlinear feature map. As derived by Bach et al. [33], the data matrices X1 and X2 can be replaced by the Gram matrices K1 and K2 such that we can achieve a nonlinear feature mapping via the kernel trick [34]:

$$ {\boldsymbol{K}}_1\left({\boldsymbol{x}}_1^i,{\boldsymbol{x}}_1^j\right)=\left\langle \phi \left({\boldsymbol{x}}_1^i\right),\phi \left({\boldsymbol{x}}_1^j\right)\right\rangle \kern0.3em \mathrm{and}\kern0.3em {\boldsymbol{K}}_2\left({\boldsymbol{x}}_2^i,{\boldsymbol{x}}_2^j\right)=\left\langle \phi \left({\boldsymbol{x}}_2^i\right),\phi \left({\boldsymbol{x}}_2^j\right)\right\rangle $$

where \( {\mathbf{K}}_1={\left[{K}_1\left({\boldsymbol{x}}_1^i,{\boldsymbol{x}}_1^j\right)\right]}_{N\times N} \) and \( {\mathbf{K}}_2={\left[{K}_2\left({\boldsymbol{x}}_2^i,{\boldsymbol{x}}_2^j\right)\right]}_{N\times N} \). In this case, kernel CCA canonical directions correspond to the solutions of the updated generalized eigen-value problem:

$$ \left[\begin{array}{ll}\hfill \mathbf{0}\hfill & \hfill {\boldsymbol{K}}_1{\boldsymbol{K}}_2\hfill \\ {}\hfill {\boldsymbol{K}}_2{\boldsymbol{K}}_1\hfill & \hfill \mathbf{0}\hfill \\ {}\hfill \hfill \end{array}\right]\left[\begin{array}{l}\hfill {\upalpha}_1\hfill \\ {}\hfill {\upalpha}_2\hfill \end{array}\right]=\lambda \left[\begin{array}{ll}\hfill {\boldsymbol{K}}_1^2\hfill & \hfill \mathbf{0}\hfill \\ {}\hfill \mathbf{0}\hfill & \hfill {\boldsymbol{K}}_2^2\hfill \\ {}\hfill \hfill \end{array}\right]\left[\begin{array}{l}\hfill {\upalpha}_1\hfill \\ {}\hfill {\upalpha}_2\hfill \end{array}\right]. $$

Similarly to the primal formulation of CCA, we can apply an ℓ2-norm regularization penalty on the weights α1 and α2 of Eq. 18, giving rise to regularized kernel CCA:

$$ \left[\begin{array}{ll}\hfill \mathbf{0}\hfill & \hfill {\boldsymbol{K}}_1{\boldsymbol{K}}_2\hfill \\ {}\hfill {\boldsymbol{K}}_2{\boldsymbol{K}}_1\hfill & \hfill \mathbf{0}\hfill \\ {}\hfill \hfill \end{array}\right]\left[\begin{array}{l}\hfill {\upalpha}_1\hfill \\ {}\hfill {\upalpha}_2\hfill \end{array}\right]=\lambda \left[\begin{array}{ll}\hfill {\boldsymbol{K}}_1^2+\delta \boldsymbol{I}\hfill & \hfill \mathbf{0}\hfill \\ {}\hfill \mathbf{0}\hfill & \hfill {\boldsymbol{K}}_2^2+\delta \boldsymbol{I}\hfill \\ {}\hfill \hfill \end{array}\right]\left[\begin{array}{l}\hfill {\upalpha}_1\hfill \\ {}\hfill {\upalpha}_2\hfill \end{array}\right]. $$
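A sketch of regularized kernel CCA (Eq. 19) is given below; a linear kernel is used for simplicity, and the data, dimensions, and value of δ are illustrative assumptions (a nonlinear kernel, e.g., an RBF, could be substituted when computing the Gram matrices):

```python
import numpy as np
from scipy.linalg import eigh

# Synthetic data sharing a one-dimensional latent variable
rng = np.random.default_rng(0)
N = 100
Z = rng.standard_normal((N, 1))
X1 = Z @ rng.standard_normal((1, 5)) + 0.1 * rng.standard_normal((N, 5))
X2 = Z @ rng.standard_normal((1, 8)) + 0.1 * rng.standard_normal((N, 8))

# Gram matrices (linear kernel here)
K1, K2 = X1 @ X1.T, X2 @ X2.T
delta = 1e-2                           # regularization strength

A = np.block([[np.zeros((N, N)), K1 @ K2],
              [K2 @ K1, np.zeros((N, N))]])
B = np.block([[K1 @ K1 + delta * np.eye(N), np.zeros((N, N))],
              [np.zeros((N, N)), K2 @ K2 + delta * np.eye(N)]])

evals, evecs = eigh(A, B)
alpha1, alpha2 = evecs[:N, -1], evecs[N:, -1]   # dual weights

z1, z2 = K1 @ alpha1, K2 @ alpha2      # projections in the kernel space
```

Note that the dual weights have one coefficient per subject rather than per feature, which is what allows the kernel trick to capture nonlinear feature maps implicitly.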

2.5 Optimization of Latent Variable Models

The nonlinear iterative partial least squares (NIPALS) algorithm is a classical scheme proposed by H. Wold [35] for the optimization of latent variable models through the iterative computation of PLS and CCA projections. Within this method, the projections associated with the modalities X1 and X2 are obtained through the iterative solution of simple least squares problems.

The principle of NIPALS is to identify projection vectors \( {\boldsymbol{u}}_i\in {\mathbb{R}}^{D_i} \) and corresponding latent representations z1 and z2 to minimize the functionals

$$ {\mathcal{L}}_i\kern0.5em =\parallel {X}_i-{\boldsymbol{z}}_i{\boldsymbol{u}}_i^T\parallel {}^2,\kern0.5em $$

subject to the constraint of maximal similarity between representations z1 and z2 (Fig. 3).

Fig. 3

Schematic of the NIPALS algorithm (Algorithm 1). This implementation can be found in standard machine learning packages such as scikit-learn [36]

Following [37], the NIPALS method is optimized as follows (Algorithm 1). The latent projection for modality 1 is first initialized as \( {\boldsymbol{z}}_1^{(0)} \) from randomly chosen columns of the data matrix X1. Subsequently, the linear regression function

$$ {\mathcal{L}}_2^{(0)}=\parallel {X}_2-{\boldsymbol{z}}_1^{(0)}{\boldsymbol{u}}_2^T\parallel {}^2 $$

is optimized with respect to u2, to obtain the projection \( {\boldsymbol{u}}_2^{(0)} \). After unit scaling of the projection coefficients, the new latent representation is computed for modality 2 as \( {\boldsymbol{z}}_2^{(0)}={\boldsymbol{X}}_2\cdot {\boldsymbol{u}}_2^{(0)} \). At this point, the latent projection is used for a new optimization step of the linear regression problem

$$ {\mathcal{L}}_1^{(0)}=\parallel {X}_1-{\boldsymbol{z}}_2^{(0)}{\boldsymbol{u}}_1^T\parallel {}^2, $$

this time with respect to u1, to obtain the projection parameters \( {\boldsymbol{u}}_1^{(0)} \) relative to modality 1. After unit scaling of the coefficients, the new latent representation is computed for modality 1 as \( {\boldsymbol{z}}_1^{(1)}={\boldsymbol{X}}_1\cdot {\boldsymbol{u}}_1^{(0)} \). The whole procedure is then iterated.

It can be shown that the NIPALS method of Algorithm 1 converges to a stable solution for projections and latent parameters, and the resulting projection vectors correspond to the first left and right eigen-modes associated with the covariance matrix \( \boldsymbol{S}={\boldsymbol{X}}_1^T\cdot {\boldsymbol{X}}_2 \).

Algorithm 1 NIPALS iterative computation for PLS components [37]

After the first eigen-modes are computed through Algorithm 1, the higher-order components can be subsequently computed by deflating the data matrices X1 and X2. This can be done by regressing out the current projections in the latent space:

$$ {\boldsymbol{X}}_i\kern0.5em \leftarrow {\boldsymbol{X}}_i-{\boldsymbol{z}}_i\frac{{\boldsymbol{z}}_i^T{\boldsymbol{X}}_i}{{\boldsymbol{z}}_i^T{\boldsymbol{z}}_i}\kern0.5em $$

NIPALS can be seamlessly used to optimize the CCA problem. Indeed, it can be shown that the CCA projections and latent representations can be obtained by estimating the linear projections u2 and u1 in steps 1 and 4 of Algorithm 1 via the linear regression problems

$$ {\mathcal{L}}_2^{(i)}=\parallel {X}_2{\boldsymbol{u}}_2-{\boldsymbol{z}}_1^{(i)}\parallel {}^2\kern2em \left(\mathrm{step}\ 1\ \mathrm{for}\ \mathrm{CCA}\right), $$


$$ {\mathcal{L}}_1^{(i)}=\parallel {X}_1{\boldsymbol{u}}_1-{\boldsymbol{z}}_2^{(i)}\parallel {}^2\kern2em \left(\mathrm{step}\ 4\ \mathrm{for}\ \mathrm{CCA}\right). $$
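The PLS-mode iteration described above can be sketched as follows; this is an illustrative implementation (the convergence tolerance and iteration cap are arbitrary choices), not the tutorial's reference code:

```python
import numpy as np

def nipals_pls(X1, X2, n_iter=500, tol=1e-10):
    """Compute one pair of PLS projections via the NIPALS iteration."""
    z1 = X1[:, [0]].copy()                     # initialize from a column of X1
    for _ in range(n_iter):
        u2 = X2.T @ z1 / (z1.T @ z1)           # step 1: regress X2 on z1
        u2 /= np.linalg.norm(u2)               # step 2: unit scaling
        z2 = X2 @ u2                           # step 3: latent for modality 2
        u1 = X1.T @ z2 / (z2.T @ z2)           # step 4: regress X1 on z2
        u1 /= np.linalg.norm(u1)               # step 5: unit scaling
        z1_new = X1 @ u1                       # step 6: latent for modality 1
        if np.linalg.norm(z1_new - z1) < tol:  # stop at convergence
            z1 = z1_new
            break
        z1 = z1_new
    return u1.ravel(), u2.ravel()
```

As a check, the returned u1 and u2 should match, up to sign, the first left and right singular vectors of the cross-covariance X1ᵀX2 computed via SVD.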

Box 4: Online Tutorial—NIPALS Implementation

The online tutorial provides an implementation of the NIPALS algorithm for both CCA and PLS, corresponding to Algorithm 1. It can be verified that the numerical solution is equivalent to the one provided by sklearn and to the one obtained through the solution of the eigen-value problem.

3 Bayesian Frameworks for Latent Variable Models

Bayesian formulations for latent variable models have been developed in the past, including for PLS [38] and CCA [39]. The advantage of employing a Bayesian framework to solve the original inference problem is that it provides a natural setting to quantify the parameters’ variability in an interpretable manner, since parameters come with their estimated distribution. In addition, these methods are particularly attractive for their ability to integrate prior knowledge on the model’s parameters.

3.1 Multi-view PPCA

Recently, the seminal work of Tipping and Bishop on probabilistic PCA (PPCA) [40] has been extended to allow the joint integration of multimodal data [41] (multi-view PPCA), under the assumption of a common latent space able to explain and generate all modalities.

Recalling the notation of Subheading 2.1, let \( \boldsymbol{x}={\left\{{\boldsymbol{x}}_i^k\right\}}_{i=1}^M \) be an observation of M modalities for subject k, where each \( {\boldsymbol{x}}_i^k \) is a vector of dimension Di. We denote by zk the D-dimensional latent variable commonly shared by each \( {\boldsymbol{x}}_i^k \). In this context, the forward process underlying the data generation of Eq. 1 is linear, and for each subject k and modality i, we write (see Fig. 4a):

$$ {\boldsymbol{x}}_i^k\kern0.5em ={W}_i\left({\boldsymbol{z}}^k\right)+{\boldsymbol{\mu}}_i+{\boldsymbol{\varepsilon}}_i,\kern0.5em $$
$$ i=1,\dots, M;\kern1em k=1,\dots, N;\kern1em \mathit{\dim}\left({\boldsymbol{z}}^k\right)<\min \left({D}_i\right),\kern0.5em $$

where Wi represents the linear mapping from the latent space to the ith modality, while μi and εi denote the common intercept and error for modality i. Note that the modality index i does not appear in the latent variable zk, allowing a compact formulation of the generative model of the whole dataset (i.e., including all modalities) by simple concatenation:

$$ {\boldsymbol{x}}^k := \begin{bmatrix} {\boldsymbol{x}}_1^k \\ \vdots \\ {\boldsymbol{x}}_M^k \end{bmatrix} = \begin{bmatrix} {W}_1 \\ \vdots \\ {W}_M \end{bmatrix} {\boldsymbol{z}}^k + \begin{bmatrix} {\boldsymbol{\mu}}_1 \\ \vdots \\ {\boldsymbol{\mu}}_M \end{bmatrix} + \begin{bmatrix} {\boldsymbol{\varepsilon}}_1 \\ \vdots \\ {\boldsymbol{\varepsilon}}_M \end{bmatrix} =: W{\boldsymbol{z}}^k+\boldsymbol{\mu} +\boldsymbol{\varepsilon} . $$
Fig. 4

(a) Graphical model of multi-view PPCA. The green node represents the latent variable able to jointly describe all observed data explaining the patient status. Gray nodes denote original multimodal data, and blue nodes the view-specific parameters. (b) Hierarchical structure of multi-view PPCA: prior knowledge on model’s parameters can be integrated in a natural way when the model is embedded in a Bayesian framework

Further hypotheses are needed to define the probability distributions of each element appearing in Eq. 22: zk ∼ p(zk), a standard Gaussian prior distribution for the latent variables, and εi ∼ p(εi), a centered Gaussian noise distribution. From these assumptions, one can derive the likelihood of the data given the latent variables and the model parameters, \( p\left({\boldsymbol{x}}_i^k|{\boldsymbol{z}}^k,{\boldsymbol{\theta}}_{\boldsymbol{i}}\right) \), with θi = {Wi, μi, εi}, and, by Bayes' theorem, the posterior distribution of the latent variables, \( p\left({\boldsymbol{z}}^k|{\boldsymbol{x}}_i^k\right) \).
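The generative process of Eq. 22 can be simulated directly. The following NumPy sketch (toy dimensions, two modalities, all values hypothetical) draws a shared latent variable per subject and generates each modality through its own linear map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: N subjects, latent dimension D, two modalities
N, D = 500, 2
dims = [10, 6]                                   # D_i for each modality

z = rng.standard_normal((N, D))                  # z^k ~ N(0, I)

views = []
for D_i in dims:
    W_i = rng.standard_normal((D_i, D))          # linear mapping W_i
    mu_i = rng.standard_normal(D_i)              # intercept mu_i
    eps_i = 0.1 * rng.standard_normal((N, D_i))  # centered Gaussian noise
    views.append(z @ W_i.T + mu_i + eps_i)       # x_i^k = W_i z^k + mu_i + eps_i

# Concatenating the modalities yields the compact form x^k = W z^k + mu + eps
x = np.hstack(views)
```

Each row of `x` is one subject's concatenated multimodal observation, generated from a single shared latent code.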

Box 5: Online Tutorial—Multi-view PPCA

A code snippet fits the MVPPCA model to multi-view data, performing model training for 200 iterations. The results, including the optimized parameters, are stored in a DataFrame.

3.1.1 Optimization

In order to solve the inference problem and estimate the model's parameters in θ, the classical expectation-maximization (EM) scheme can be deployed. EM optimization consists of an iterative process where each iteration is composed of two steps:

  • Expectation step (E): Given the parameters previously optimized, the expectation of the log-likelihood of the joint distribution of xi and zk with respect to the posterior distribution of the latent variables is evaluated.

  • Maximization step (M): The functional of the E step is maximized with respect to the model’s parameters.

It is worth noting that prior knowledge on the distribution of the model's parameters can be easily integrated in this Bayesian framework (Fig. 4b), with minimal modification of the optimization scheme: the functional maximized in the M-step is penalized so that the optimized parameters remain close to their priors. In this case, we speak of maximum a posteriori (MAP) optimization.
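As an illustration of the EM scheme, the sketch below implements the closed-form updates for single-view PPCA (Tipping and Bishop); the multi-view case applies the same scheme to the concatenated model of Eq. 22. The data and dimensions are toy choices, not the tutorial implementation:

```python
import numpy as np

def ppca_em(X, q, n_iter=100, seed=0):
    """EM for single-view probabilistic PCA. Returns W (D, q) and sigma2."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Xc = X - X.mean(axis=0)                # center with the ML estimate of mu
    W = rng.standard_normal((D, q))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables z_n
        M = W.T @ W + sigma2 * np.eye(q)
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                 # (N, q) posterior means
        Ezz = N * sigma2 * Minv + Ez.T @ Ez  # sum_n E[z_n z_n^T]
        # M-step: closed-form maximization over W and the noise variance
        W = Xc.T @ Ez @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum((Xc @ W) * Ez)
                  + np.trace(Ezz @ W.T @ W)) / (N * D)
    return W, sigma2

# Low-rank toy data with noise std 0.1: the estimated sigma2 should shrink
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 2)) @ rng.standard_normal((2, 10))
X += 0.1 * rng.standard_normal(X.shape)
W, sigma2 = ppca_em(X, q=2)
```

MAP optimization would simply add the log-prior of the parameters to the functional maximized in the M-step.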

3.2 Bayesian Latent Variable Models via Autoencoding

Autoencoders and variational autoencoders have become very popular approaches for the estimation of latent representations of complex data, and they allow powerful extensions of the Bayesian models presented in Subheading 3.1 to account for nonlinear and deep data representations.

Autoencoders (AEs) extend classical latent variable models to account for complex, potentially highly nonlinear, projections from the data space to the latent space (encoding), along with reconstruction functions (decoding) mapping the latent representation back to the data space. Since typical encoding (fe) and decoding (fd) functions of AEs are parameterized by feedforward neural networks, inference can be efficiently performed by means of stochastic gradient descent through backpropagation. In this sense, AEs can be seen as a powerful extension of classical PCA, where encoding into the latent representations and decoding are jointly optimized to minimize the reconstruction error of the data:

$$ {\displaystyle \begin{array}{r}\mathcal{L} ={\parallel \boldsymbol{X}-{f}_d\left({f}_e\left(\boldsymbol{X}\right)\right)\parallel}_2^2\end{array}} $$
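As a minimal illustration of this objective, the sketch below trains a purely linear AE by gradient descent on the reconstruction error; with linear encoding and decoding, the optimum spans the same subspace as PCA. All sizes and learning settings are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, q = 200, 8, 2
X = rng.standard_normal((N, q)) @ rng.standard_normal((q, D))  # rank-q data

# Linear encoder f_e(X) = X W_e and decoder f_d(Z) = Z W_d
W_e = 0.01 * rng.standard_normal((D, q))
W_d = 0.01 * rng.standard_normal((q, D))
lr = 0.01
loss0 = np.mean((X - X @ W_e @ W_d) ** 2)   # initial reconstruction error
for _ in range(3000):
    Z = X @ W_e                              # encode
    R = X @ W_e @ W_d - X                    # reconstruction residual
    # Gradient steps on ||X - f_d(f_e(X))||^2 (constant factors absorbed in lr)
    W_d -= lr * (Z.T @ R) / N
    W_e -= lr * (X.T @ (R @ W_d.T)) / N
loss = np.mean((X - X @ W_e @ W_d) ** 2)
```

Since the data has rank q and the bottleneck has q dimensions, the reconstruction error can be driven close to zero; nonlinear AEs replace the two matrix products with feedforward networks trained by backpropagation.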

The variational autoencoder (VAE) [42, 43] introduces a Bayesian formulation of AEs, akin to PPCA, where the latent variables are inferred by estimating the associated posterior distributions. In this case, the optimization problem can be efficiently performed by stochastic variational inference [44], where the posterior moments of the variational posterior of the latent distribution are parameterized by neural networks.

In the same way PLS and CCA extend PCA for multimodal analysis, research has been devoted to defining analogous extensions of the VAE to identify common latent representations of multiple data modalities, such as the multi-channel VAE [23] or deep CCA [29]. These approaches are based on a similar formulation, which is presented in the following section.

3.3 Multi-channel Variational Autoencoder

The multi-channel variational autoencoder (mcVAE) assumes the following generative process for the observation set:

$$ {\displaystyle \begin{array}{rl} {\boldsymbol{z}}^k &\sim p\left({\boldsymbol{z}}^k\right) \\ {\boldsymbol{x}}_i^k &\sim p\left({\boldsymbol{x}}_i^k\mid {\boldsymbol{z}}^k,{\theta}_i\right), \qquad i=1,\dots, M, \end{array}} $$

where p(zk) is a prior distribution for the latent variable. In this case, \( p\left({\boldsymbol{x}}_i^k|\boldsymbol{z},{\theta}_i\right) \) is the likelihood of the observed modality i for subject k, conditioned on the latent variable and on the generative parameters θi parameterizing the decoding from the latent space to the data space of modality i.

Solving this inference problem requires the estimation of the posterior for the latent distribution p(z|X1, …, XM), which is generally an intractable problem. Following the VAE scheme, variational inference can be applied to compute an approximate posterior [45].

Fig. 5

The multi-channel VAE (mcVAE) for the joint modeling of multimodal medical imaging, clinical, and biological information. The mcVAE approximates the latent posterior p(z|X1, X2, X3, X4) to maximize the likelihood of the data reconstruction p(X1, X2, X3, X4|z) (plus a regularization term)

3.3.1 Optimization

The inference problem of mcVAE is solved by identifying variational posterior distributions specific to each data modality, \( q\left({\boldsymbol{z}}^k|{\boldsymbol{x}}_i^k,{\varphi}_i\right) \), conditioned on the observed modality xi and on the corresponding variational parameters φi, which parameterize the encoding of the observed modality to the latent space.

In this way, since each modality provides a different approximation, a similarity constraint is imposed in the latent space to enforce each modality-specific distribution \( q\left({\boldsymbol{z}}^k|{\boldsymbol{x}}_i^k,{\varphi}_i\right) \) to be as close as possible to the common target posterior distribution. The measure of “proximity” between distributions is the Kullback-Leibler (KL) divergence. This constraint defines the following functional:

$$ \underset{q}{\operatorname{argmin}}\kern0.60em \sum \limits_i\kern0.3em {D}_{\mathrm{KL}}\left[q\left({\boldsymbol{z}}^k\mid {\boldsymbol{x}}_i^k,{\varphi}_i\right)\parallel p\left({\boldsymbol{z}}^k\mid {\boldsymbol{x}}_1^k,\dots, {\boldsymbol{x}}_M^k\right)\right], $$

where the approximate posteriors q(z|xi, φi) represent the view on the latent space that can be inferred from the modality xi. In [23] it was shown that the optimization of Eq. 27 is equivalent to the optimization of the following evidence lower bound (ELBO):

$$ \mathcal{L} =D-R $$

where \( R={\sum}_i\mathrm{KL}\left[q\left({\boldsymbol{z}}^k|{\boldsymbol{x}}_i^k,{\varphi}_i\right)\parallel p\left(\boldsymbol{z}\right)\right] \), and D =∑iLi, with

$$ {L}_i=\underset{q\left({\boldsymbol{z}}^k|{\boldsymbol{x}}_i^k,{\varphi}_i\right)}{\mathbbm{E}}\sum \limits_{j=1}^M\ln p\left({\boldsymbol{x}}_j|\boldsymbol{z},{\theta}_j\right) $$

is the expected log-likelihood of each data channel xj, quantifying the reconstruction obtained by decoding from the latent representation of channel xi. Therefore, optimizing the term D in Eq. 28 with respect to the encoding and decoding parameters \( {\left\{{\theta}_i,{\varphi}_i\right\}}_{i=1}^M \) identifies the optimal representation of each modality in the latent space which can, on average, jointly reconstruct all the other channels. This term thus enforces a coherent latent representation across different modalities and is balanced by the regularization term R, which constrains the latent representation of each modality to the common prior p(z). As for standard VAEs, the encoding and decoding functions can be arbitrarily chosen to parameterize, respectively, the latent distributions and the data likelihoods. Typical choices are neural networks, which provide extremely flexible and powerful data representations (Box 6). For example, leveraging the modeling capabilities of deep convolutional networks, mcVAE has been used in a recent cardiovascular study for the prediction of cardiac MRI data from retinal fundus images [46].
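The two terms of the bound can be sketched numerically. The snippet below computes a one-sample Monte Carlo estimate of L = D − R for a single subject, using illustrative linear encoders/decoders and a Gaussian likelihood; it is a didactic sketch, not the tutorial or published implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def mcvae_bound(x, enc, dec, sigma2=1.0):
    """One-sample Monte Carlo estimate of L = D - R for one subject.

    x:   list of M modality vectors
    enc: functions x_i -> (mu, logvar), the modality-specific encoders
    dec: functions z -> reconstruction of x_j, the modality decoders
    """
    R, D = 0.0, 0.0
    for i in range(len(x)):
        mu, logvar = enc[i](x[i])
        R += gaussian_kl(mu, logvar)             # regularization term R
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        for j in range(len(x)):                  # cross-reconstruct all channels
            r = x[j] - dec[j](z)
            D += -0.5 * np.sum(r ** 2) / sigma2  # Gaussian log-lik, up to constants
    return D - R

# Two toy modalities sharing a 2-D latent space, with linear maps
A = [rng.standard_normal((4, 2)), rng.standard_normal((3, 2))]
enc = [lambda v, A=A[i]: (np.linalg.pinv(A) @ v, np.full(2, -2.0))
       for i in range(2)]
dec = [lambda z, A=A[i]: A @ z for i in range(2)]
z_true = rng.standard_normal(2)
x = [A[0] @ z_true, A[1] @ z_true]
bound = mcvae_bound(x, enc, dec)
```

Note the double loop: each modality's encoding is asked to reconstruct every channel, which is precisely what enforces the coherent latent representation described above.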

Box 6: Online Tutorial—mcVAE with PyTorch

A code snippet sets up and fits an mcVAE model to the provided data. It handles model initialization, data preparation, optimization setup, and the training process. The load_or_fit function allows loading pre-trained models or performing new model training.

3.4 Deep CCA

The mcVAE uses neural network layers to learn nonlinear representations of multimodal data. Similarly, Deep CCA [29] provides an alternative to kernel CCA to learn nonlinear mappings of multimodal information. Deep CCA computes representations by passing two views through functions f1 and f2 with parameters θ1 and θ2, respectively, which can be learnt by multilayer neural networks. The parameters are optimized by maximizing the correlation between the learned representations f1(X1;θ1) and f2(X2;θ2):

$$ \left({\boldsymbol{\theta}}_{1,\mathrm{opt}},{\boldsymbol{\theta}}_{2,\mathrm{opt}}\right)=\underset{\left({\boldsymbol{\theta}}_1,{\boldsymbol{\theta}}_2\right)}{\operatorname{argmax}}\ \mathrm{Corr}\left({f}_1\left({\boldsymbol{X}}_1;{\boldsymbol{\theta}}_1\right),{f}_2\left({\boldsymbol{X}}_2;{\boldsymbol{\theta}}_2\right)\right) $$
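For intuition, the correlation objective can be written down directly in the one-dimensional case, where it reduces to the Pearson correlation between the two network outputs (for multidimensional outputs, deep CCA maximizes the sum of canonical correlations of the whitened cross-covariance). The function below is an illustrative sketch:

```python
import numpy as np

def corr_objective(h1, h2, eps=1e-8):
    """Correlation objective for 1-D network outputs h1 = f1(X1; theta1)
    and h2 = f2(X2; theta2), computed over a batch of N samples."""
    c1 = h1 - h1.mean()                 # center each view's outputs
    c2 = h2 - h2.mean()
    return float(c1 @ c2 / (np.linalg.norm(c1) * np.linalg.norm(c2) + eps))
```

Training deep CCA amounts to backpropagating through this objective into the parameters of f1 and f2; note that the batch means and norms couple all samples, which is why the objective is a function of the full training set.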

In its classical formulation, the correlation objective given in Eq. 29 is a function of the full training set, and as such, mini-batch optimization can lead to suboptimal results. Optimization of classical deep CCA must therefore be performed in full batch, for example, through the L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno) scheme [47]. For this reason, this vanilla implementation of deep CCA is not computationally viable for large datasets. Furthermore, the approach does not provide a model for generating samples from the latent space. To address these issues, Wang et al. [48] introduced deep variational CCA (VCCA), which extends the probabilistic CCA framework introduced in Subheading 3 to a nonlinear generative model. In an approach similar to VAEs and mcVAE, deep VCCA uses variational inference to approximate the posterior distribution and derives the following ELBO:

$$ \mathcal{L} =-{D}_{\mathrm{KL}}\left({q}_{\phi}\left(\boldsymbol{z}\mid {\boldsymbol{x}}_1\right)\parallel p\left(\boldsymbol{z}\right)\right)+{\mathbbm{E}}_{q_{\phi}\left(\boldsymbol{z}\mid {\boldsymbol{x}}_1\right)}\left[\log {p}_{{\boldsymbol{\theta}}_1}\left({\boldsymbol{x}}_1\mid \boldsymbol{z}\right)+\log {p}_{{\boldsymbol{\theta}}_2}\left({\boldsymbol{x}}_2\mid \boldsymbol{z}\right)\right] $$

where the approximate posterior, \( {q}_{\phi}\left(\boldsymbol{z}\mid {\boldsymbol{x}}_1\right) \), and likelihood distributions, \( {p}_{{\boldsymbol{\theta}}_1}\left({\boldsymbol{x}}_1\mid \boldsymbol{z}\right) \) and \( {p}_{{\boldsymbol{\theta}}_2}\left({\boldsymbol{x}}_2\mid \boldsymbol{z}\right) \), are parameterized by neural networks with parameters ϕ, θ1, and θ2.

We note that, in contrast to mcVAE, deep VCCA is based on the estimation of a single latent posterior distribution. The resulting representation thus depends on the reference modality from which the joint latent representation is encoded, which may bias the estimation of the latent space. Finally, Wang et al. [48] introduce a variant of deep VCCA, VCCA-private, which extracts the private, in addition to the shared, latent information. Here, private latent variables hold view-specific information which is not shared across modalities.

4 Biologically Inspired Data Integration Strategies

Medical imaging and -omics data are characterized by nontrivial relationships across features, which represent specific mechanisms underlying the pathophysiological processes.

For example, the pattern of brain atrophy and functional impairment may involve brain regions according to the brain connectivity structure [49]. Similarly, biological processes such as gene expression result from the joint contribution of several SNPs acting according to biological pathways. Based on these processes, it is possible to establish relationships between genetic features in the form of relation networks, represented by ontologies such as the KEGG pathways and the Gene Ontology Consortium.

When applying data-driven multivariate analysis methods to this kind of data, it is therefore relevant to promote interpretability and plausibility of the model, by enforcing the solution to follow the structural constraints underlying the data. This kind of model behavior can be achieved through regularization of the model parameters.

In particular, group-wise regularization [50] is an effective approach to enforce structural patterns during model optimization, where related features are jointly penalized with respect to a common parameter. For example, group-wise constraints may be introduced to account for biological pathways in models of gene association, or for known brain networks and regional interactions in neuroimaging studies. More specifically, we assume that the Di features of a modality \( {\boldsymbol{x}}_i=\left({x_i}_1,\dots, {x_i}_{D_i}\right) \) are grouped in subsets \( {\left\{{\mathcal{S}}_l\right\}}_{l=1}^L \), according to the indices \( {\mathcal{S}}_l=\left({s}_1,\dots, {s}_{N_l}\right) \). The regularization of the general multivariate model of Eq. 2 according to the group-wise constraint can be expressed as:

$$ {\boldsymbol{W}}^{\ast }=\underset{\boldsymbol{W}}{\mathrm{argmin}}\kern0.3em \parallel {\boldsymbol{X}}_1-{\boldsymbol{X}}_2\cdot \boldsymbol{W}\parallel {}^2+\lambda \sum \limits_{l=1}^L{\beta}_lR\left({\boldsymbol{W}}_l\right), $$

where \( R\left({\boldsymbol{W}}_l\right)={\sum}_{j=1}^{D_1}\sqrt{\sum_{s\in {\mathcal{S}}_l}\boldsymbol{W}{\left[s,j\right]}^2} \) is the penalization of the entries of W associated with the features of X2 indexed by \( {\mathcal{S}}_l \). The total penalty is obtained by summing across the D1 columns of W.
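The group-wise penalty can be evaluated directly. The sketch below computes R(Wl) for a toy weight matrix, with hypothetical feature groupings standing in for, e.g., SNPs grouped by pathway:

```python
import numpy as np

def group_penalty(W, groups):
    """Group-wise (l2,1) penalty: for each group S_l and each column j,
    add the l2 norm of the entries W[S_l, j], as in R(W_l) above."""
    return sum(
        np.sqrt(np.sum(W[np.asarray(S), j] ** 2))
        for S in groups
        for j in range(W.shape[1])
    )

# Hypothetical example: 5 input features grouped into two pathways
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
groups = [[0, 1], [2, 3, 4]]
penalty = group_penalty(W, groups)
```

Because the square root is taken over each group's entries jointly, the gradient drives whole groups of weights to zero together, rather than individual entries as in a plain lasso.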

Group-wise regularization is particularly effective in the following situations:

  • To compensate for large data dimensionality, by reducing the number of “free parameters” to be optimized by aggregating the available features [51].

  • To account for the small effect size of individual features, by combining them to increase detection power. For example, in genetic analyses, each SNP, when considered individually, accounts for less than 1% of the variance of brain imaging quantitative traits [52, 53].

  • To meaningfully integrate complementary information, by introducing biologically inspired constraints into the model.

In the context of group-wise regularization in neural networks, several optimization/regularization strategies have been proposed to allow the identification of compressed representations of multimodal data in the bottleneck layers, for example, by imposing sparsity of the model parameters or by introducing grouping constraints motivated by prior knowledge [54].

For instance, the Bayesian Genome-to-Phenome Sparse Regression (G2PSR) method proposed in [55] associates genomic data to phenotypic features, such as multimodal neuroimaging and clinical data, by constraining the transformation to optimize relevant group-wise SNPs-gene associations. The resulting architecture groups the input SNP layer into corresponding genes represented in the intermediate layer L of the network (Fig. 6). Sparsity at the gene level is introduced through variational dropout [56], to estimate the relevance of each gene (and related SNPs) in reconstructing the output phenotypic features.

Fig. 6

Illustration of G2PSR SNP-gene grouping constraint and overall neural network architecture

In more detail, to incorporate biological constraints in the G2PSR framework, a group-wise penalization is imposed on the weights Wg mapping the input SNPs to their common gene g. The idea is that during optimization the model is forced to jointly discard all the SNPs mapping to genes that are not relevant to the predictive task. Following [56], the variational approximation is parameterized as q(Wg), such that each element of the input layer is defined as \( {W}_i^g\sim \mathcal{N}\left({\mu}_i^g,\ {\alpha}_g\,{\left({\mu}_i^g\right)}^2\right) \) [57], where the parameter αg is optimized to quantify the common uncertainty associated with the ensemble of SNPs contributing to gene g.
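The weight distribution above can be sketched as follows; the function name and toy values are illustrative, not the published G2PSR implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_group_weights(mu_g, alpha_g):
    """Sample SNP-to-gene weights W_i^g ~ N(mu_i^g, alpha_g * (mu_i^g)^2).

    A large optimized alpha_g makes the weights of the whole gene group
    unreliable, signaling that the gene (and its SNPs) can be dropped,
    in the spirit of variational dropout."""
    std = np.sqrt(alpha_g) * np.abs(mu_g)
    return mu_g + std * rng.standard_normal(mu_g.shape)

# Toy means for three SNPs mapping to a single gene g
mu_g = np.array([0.5, -1.2, 0.8])
w = sample_group_weights(mu_g, alpha_g=0.1)
```

Note that the variance of each weight scales with its own squared mean through the shared αg, so relevance is estimated at the gene level rather than per SNP.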

5 Conclusions

This chapter presented an overview of basic notions and tools for multimodal analysis. The set of frameworks introduced here represents an ideal starting ground for more complex analyses, either based on linear multivariate methods [58, 59] or on neural network architectures, extending the modeling capabilities to account for highly heterogeneous information, such as multi-organ data [46], text information, and data from electronic health records [60, 61].