Abstract
A novel multitask Gaussian process (GP) framework is proposed, by using a common mean process for sharing information across tasks. In particular, we investigate the problem of time series forecasting, with the objective of improving multiple-step-ahead predictions. The common mean process is defined as a GP for which the hyperposterior distribution is tractable. Therefore, an EM algorithm is derived for handling both hyperparameter optimisation and hyperposterior computation. Unlike previous approaches in the literature, the model fully accounts for uncertainty and can handle irregular grids of observations while maintaining explicit formulations, by modelling the mean process in a unified GP framework. Predictive analytical equations are provided, integrating information shared across tasks through a relevant prior mean. This approach greatly improves the predictive performances, even far from observations, and may significantly reduce the computational complexity compared to traditional multitask GP models. Our overall algorithm is called Magma (standing for Multi tAsk GPs with common MeAn). The quality of the mean process estimation, predictive performances, and comparisons to alternatives are assessed in various simulated scenarios and on real datasets.
1 Introduction
Gaussian processes (GPs) are a powerful tool, widely used in machine learning (Bishop, 2006; Rasmussen & Williams, 2006). The classic regression context aims at inferring the underlying mapping function associating input to output data. In a probabilistic framework, a typical strategy is to assume that this function is drawn from a prior GP. By doing so, we may enforce some properties of the function solely by characterising the mean and covariance functions of the process, the latter often being associated with a specific kernel. This covariance function plays a central role, and GPs are an example of kernel methods. We refer to Álvarez et al. (2012) for a comprehensive review. On the other hand, the mean function is generally set to 0 for all entries, assuming that the covariance structure already integrates the desired relationship between observed data and prediction targets. In this paper, we consider a novel multitask learning framework where a series of GPs share a common mean, expressed as a GP as well. We demonstrate that modelling the mean function in this way can be key to obtaining more relevant predictions.
Related work
The multitask framework consists in using data from several tasks (or individuals) to improve learning or predictive capacities compared to an isolated model. It was introduced by Caruana (1997) and then adapted in many fields of machine learning. GP versions of such models were introduced in Schwaighofer et al. (2004), which proposed an Expectation-Maximisation (EM) algorithm for learning. Similar techniques can be found in Shi et al. (2005). Meanwhile, Yu et al. (2005) offered an extensive study of the relationships between the linear model and GPs to develop a multitask GP formulation. However, since the introduction in Bonilla et al. (2008) of the idea of two matrices, modelling covariance between inputs and tasks respectively, the term multitask Gaussian process has mostly referred to the choice made regarding the covariance structure. Some further developments were discussed by Hayashi et al. (2012), Rakitsch et al. (2013) and Zhu & Sun (2014). In particular, an interesting approach in Nguyen & Bonilla (2014) proposed a sparse approximation for multitask GP inference. More generally, these approaches are known as examples of linear models of coregionalisation (LMC) in the geostatistics literature, and Álvarez & Lawrence (2011) provides a unified view on the topic as well as a strategy for constructing computationally efficient approximations. Let us emphasise that the present paper is not based on the same assumptions and principles, and aims at defining a different multitask paradigm for GPs, focusing on sharing information through the mean function rather than the covariance structure. Besides, the work of Swersky et al. (2013) on Bayesian hyperparameter optimisation in such LMC models is also worth mentioning. Real applications were tackled by similar models in Williams et al. (2009) and Alaa & van der Schaar (2017), while Clingerman & Eaton (2017) and Moreno-Muñoz et al. (2019) developed continual learning methods for multitask GPs.
As we focus on multitask time series forecasting, a connection can be drawn to the study of multiple curves, or functional data analysis (FDA). As initially proposed in Rice & Silverman (1991), it is possible to model and learn mean and covariance structures simultaneously in this context. We refer to the monographs Ramsay & Silverman (2005) and Ferraty & Vieu (2006) for a comprehensive introduction to FDA. In particular, these books introduce several usual ways of modelling a set of functional objects in frequentist frameworks, for example by using a decomposition in a basis of functions (such as B-splines, wavelets, or Fourier bases). This kind of B-spline decomposition was used in Shi et al. (2007) for modelling the mean function in a generative model that somewhat resembles ours. Subsequently, some Bayesian alternatives were developed in Thompson & Rosen (2008) and Crainiceanu & Goldsmith (2010).
Our contributions
A multitask GP framework with a common mean process is introduced, allowing reliable probabilistic forecasts even in multiple-step-ahead problems, or for sparsely observed individuals. For this purpose, (i) we introduce a GP model where the specific covariance structure of each task is defined through a separate kernel and its associated set of hyperparameters, whereas the common mean function \(\mu _0\) allows sharing information across tasks and overcomes the weaknesses of classic GPs in making predictions far from observed data. To account for uncertainty, we propose a hierarchical formulation defining the common mean process \(\mu _0\) as a GP as well. (ii) We derive an algorithm called Magma (available as an R package at https://github.com/ArthurLeroy/MagmaClustR) to compute \(\mu _0\)’s hyperposterior distribution together with the estimation of hyperparameters in an EM fashion, and discuss its computational complexity. (iii) We enrich Magma with explicit formulas to make predictions for any new, partially observed, task. The hyperposterior distribution of \(\mu _0\) provides a prior belief on what we would expect to observe before seeing any new data, acting as an already-informed mean process that integrates both trend and uncertainty coming from other tasks. (iv) We illustrate the performance of our method on synthetic data and two real-life datasets and obtain state-of-the-art results compared to alternative approaches.
Outline
The paper is organised as follows. We introduce our multitask Gaussian process model in Sect. 2, along with notation. Section 3 is devoted to the inference procedure, with an Expectation-Maximisation (EM) algorithm to estimate the Gaussian process hyperparameters and \(\mu _0\)’s hyperposterior. We leverage this strategy in Sect. 4 and derive a prediction algorithm. In Sect. 5, we analyse and discuss the computational complexity of both the inference and prediction procedures. Our methodology is illustrated in Sect. 6, with a series of experiments on both synthetic and real-life datasets, and a comparison to competing state-of-the-art algorithms. On those tasks, we provide empirical evidence that our algorithm outperforms other approaches. Section 7 draws perspectives for future work, and we defer proofs of original results claimed in the paper to Sect. 8.
2 The model
2.1 Notation
While GPs can handle many types of data, their continuous nature makes them particularly well suited to studying temporal phenomena. Throughout, the term individual is used as a synonym of task or batch, and we adopt the notation and vocabulary of time series to remain consistent with the real-data application provided in Sect. 6.5, which addresses the forecasting of young swimmers’ performances.
We are provided with functional data coming from \(M \in {\mathcal{I}}\) different individuals, where \({\mathcal{I}} \subset {\mathbb{N}}\). For each individual i, we observe a set of inputs \(\{ t_i^1, \dots , t_i^{N_i} \}\) and associated outputs \(\{ y_i(t_i^1), \dots , y_i(t_i^{N_i}) \}\), where \(N_i\) is the number of data points for the ith individual. Since many objects are defined for all individuals, we shorten our notation as follows: for any object x existing for all i, we denote \(\left\{ x_i \right\} _i = \left\{ x_1, \dots , x_M \right\} \). Moreover, as we work in a temporal context, the inputs are referred to as timestamps. In the specific case where all individuals are observed at the same timestamps, we call the grid of observations common. On the contrary, a grid of observations is uncommon if the timestamps are different in number and/or location among the individuals. Some convenient notation follows:

\(\mathbf{t }_i= \{ t_i^1,\dots ,t_i^{N_i} \}\), the set of timestamps for the ith individual,

\(\mathbf{y }_i= y_i(\mathbf{t }_i)\), the vector of outputs for the ith individual,

\(\mathbf{t }= \bigcup \limits _{i = 1}^M \mathbf{t }_i\), the pooled set of timestamps among individuals,

\(N = {\text{card}}(\mathbf{t })\), the total number of observed timestamps.
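To fix ideas, this data layout can be sketched as follows in R (a toy example with hypothetical values; the list structure and variable names are illustrative and are not those of the MagmaClustR package):

```r
# Minimal sketch of the data layout: one list element per individual, each with
# its own timestamps t_i and outputs y_i, plus the pooled grid t.
set.seed(42)
M <- 3
individuals <- lapply(1:M, function(i) {
  t_i <- sort(runif(sample(5:8, 1), 0, 10))   # uncommon grid: N_i and locations differ
  list(t = t_i, y = sin(t_i) + rnorm(length(t_i), sd = 0.1))
})
t_pooled <- sort(unique(unlist(lapply(individuals, `[[`, "t"))))  # pooled grid t
N <- length(t_pooled)                                             # N = card(t)
```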
2.2 Model and hypotheses
Suppose that the functional data come from the sum of a mean process, common to all individuals, and individual-specific centred processes. To clarify relationships in the generative model, we illustrate our graphical model in Fig. 1. Let \({\mathcal{T}}\) be the input space; our model is
$$\begin{aligned} y_i(t) = \mu _0(t) + f_i(t) + \epsilon _i(t), \quad \forall t \in {\mathcal{T}}, \ \forall i \in {\mathcal{I}}, \end{aligned}$$
where \(\mu _0(\cdot ) \sim {\mathcal{GP}} (m_0(\cdot ), k_{\theta _0}(\cdot ,\cdot ))\) and \(f_i(\cdot ) \sim {\mathcal{GP}} \left( 0,c_{\theta _i}(\cdot ,\cdot ) \right) \) are respectively the common mean and individual-specific processes. Moreover, the error term is assumed to be \(\epsilon _i(\cdot ) \sim {\mathcal{N}} (0,\sigma _i^2 I)\). The following notation is used for the parameters:

\(m_0(\cdot )\), an arbitrary prior mean function,

\(k_{\theta _0}(\cdot , \cdot )\), a covariance kernel of hyperparameters \(\theta _0\),

\(\forall i \in {\mathcal{I}}, \ c_{\theta _i}(\cdot , \cdot )\), a covariance kernel with hyperparameters \(\theta _i\),

\(\sigma _i^2 \in {\mathbb{R}}^{+}\), the noise variance associated with the ith individual,

\(\forall i \in {\mathcal{I}},\) we define the shorthand \(\psi _{\theta _i, \sigma _i^2}(\cdot ,\cdot ) = c_{\theta _i}(\cdot ,\cdot ) + \sigma _i^2 I\),

\(\varTheta = \{\theta _0, \left\{ \theta _i \right\} _i, \left\{ \sigma _i^2 \right\} _i \}\), the set of all hyperparameters to learn in the model.
We also assume that:

\(\{ f_i \}_{i}\) are independent,

\(\{ \epsilon _i \}_{i}\) are independent,

\(\forall i \in {\mathcal{I}}, \ \mu _0\), \(f_i\) and \(\epsilon _i\) are independent.
It follows that \(\{ y_i \mid \mu _0 \}_{i = 1,\dots ,M}\) are independent of one another, and for all \(i \in {\mathcal{I}}\):
$$\begin{aligned} y_i(\cdot ) \mid \mu _0(\cdot ) \sim {\mathcal{GP}} \left( \mu _0(\cdot ), \psi _{\theta _i, \sigma _i^2}(\cdot ,\cdot ) \right) . \end{aligned}$$
Let us emphasise that this property only holds conditionally on \(\mu _0\). Otherwise, once \(\mu _0\) is integrated out, the \(y_i\) are no longer independent. Here, we do not assume any specific covariance structure between individuals, contrary to standard LMC approaches. As we shall see in the next sections, the process \(\mu _0\) will be key to handling the dependencies and sharing information across the individuals.
Although this model is based on infinite-dimensional GPs, the inference will be conducted on a finite grid of observations. According to the aforementioned notation, we observe \(\{ (\mathbf{t }_i, \mathbf{y }_i) \}_{i}\), and the corresponding likelihoods are Gaussian:
$$\begin{aligned} p(\mathbf{y }_i\mid \mu _0(\mathbf{t }_i), \theta _i, \sigma _i^2) = {\mathcal{N}} \left( \mathbf{y }_i; \mu _0(\mathbf{t }_i), \varvec{\varPsi }_{\theta _i, \sigma _i^2}^{\mathbf{t }_i} \right) , \end{aligned}$$
where \(\varvec{\varPsi }_{\theta _i, \sigma _i^2}^{\mathbf{t }_i} = \psi _{\theta _i, \sigma _i^2}(\mathbf{t }_i, \mathbf{t }_i) = \left[ \psi _{\theta _i, \sigma _i^2}(k, \ell ) \right] _{k, \ell \in \mathbf{t }_i}\) is an \(N_i \times N_i\) covariance matrix. Since \(\mathbf{t }_i\) might differ among individuals, we also need to evaluate \(\mu _0\) on the pooled grid of timestamps \(\mathbf{t }\):
$$\begin{aligned} p(\mu _0(\mathbf{t }) \mid \theta _0) = {\mathcal{N}} \left( \mu _0(\mathbf{t }); m_0(\mathbf{t }), \mathbf{K }_{\theta _0}^{\mathbf{t }} \right) , \end{aligned}$$
where \(\mathbf{K }_{\theta _0}^{\mathbf{t }} = k_{\theta _0}(\mathbf{t }, \mathbf{t }) = \left[ k_{\theta _0}(k, \ell ) \right] _{k,\ell \in \mathbf{t }}\) is an \(N \times N\) covariance matrix.
An alternative hypothesis consists in considering the hyperparameters \(\left\{ \theta _i \right\} _i\) and \(\left\{ \sigma _i^2 \right\} _i\) equal for all individuals. We call this hypothesis Common HP (where HP stands for hyperparameters) in Sect. 6. This particular case represents a context where individuals correspond to different trajectories of the same process, whereas different hyperparameters indicate different covariance structures and thus a more flexible model. For the sake of generality, the remainder of the paper is written with the \(\theta _i\) and \(\sigma _i^2\) notation whenever the procedure is identical in both cases. Moreover, the model above and the subsequent algorithm may use any form of covariance function, often parametrised by a finite (usually small) set of hyperparameters. For example, a common kernel in the GP literature is known as the Exponentiated Quadratic kernel (also sometimes called Squared Exponential or Radial Basis Function kernel). It solely depends on two hyperparameters \(\theta = \left\{ v, \ell \right\} \) and is defined as:
$$\begin{aligned} k_{EQ}(x, x') = v^2 \exp \left( - \dfrac{(x - x')^2}{2 \ell ^2} \right) . \end{aligned}$$ (1)
The Exponentiated Quadratic kernel is simple and enjoys useful smoothness properties. This is the kernel used in the current version of our implementation (see Sect. 6 for details). Note that there is a rich literature on kernel choice, their construction and properties, which is beyond the scope of the present work: we refer to Rasmussen and Williams (2006) or Duvenaud (2014) for comprehensive studies.
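For illustration, a minimal R implementation of this kernel and of the associated noisy covariance \(\psi _{\theta _i, \sigma _i^2}\) might read as follows (a sketch assuming the standard parametrisation of Eq. (1); this is not the MagmaClustR implementation):

```r
# Exponentiated Quadratic kernel evaluated on two vectors of timestamps.
# Assumes k(x, x') = v^2 * exp(-(x - x')^2 / (2 * l^2)), as in Eq. (1).
kernel_eq <- function(x1, x2, v, l) {
  d <- outer(x1, x2, "-")            # matrix of pairwise differences
  v^2 * exp(-d^2 / (2 * l^2))
}

# psi = c_{theta_i} + sigma_i^2 * I, the noisy covariance used for each individual.
psi_matrix <- function(x, v, l, sigma2) {
  kernel_eq(x, x, v, l) + diag(sigma2, length(x))
}
```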
3 Inference
3.1 Learning
Several approaches to learning hyperparameters for Gaussian processes have been proposed in the literature; we refer to Rasmussen and Williams (2006) for a comprehensive study. One classical approach, called empirical Bayes (Casella 1985), is based on the maximisation of an explicit likelihood to estimate hyperparameters. This procedure avoids sampling from intractable distributions, which usually induces additional computational costs and complicates practical use for moderate to large sample sizes. As previously stated, once \(\mu _0\) is marginalised out, the log-likelihood can no longer be written as a sum of Gaussian log-likelihoods. Therefore, we propose an EM algorithm (see the pseudocode in Algorithm 1) to learn the hyperparameters \(\varTheta \) in this context. The procedure alternately computes the hyperposterior distribution \(p(\mu _0 \mid (\mathbf{y }_i)_i, {\widehat{\varTheta }})\) with the current hyperparameters, and then optimises \(\varTheta \) according to this hyperposterior distribution. This EM algorithm converges to a local maximum of the likelihood (Dempster et al. 1977), typically in a handful of iterations.
E step
For the sake of simplicity, we assume in this section that \( \forall i,j \in {\mathcal{I}}, \ \mathbf{t }_i= \mathbf{t }_j = \mathbf{t }\), i.e. the individuals are observed on a common grid of timestamps. We provide a generalisation of the following proposition in Sect. 4 (Proposition 4), where the result holds for uncommon grids. The E step then consists in computing the hyperposterior distribution of \(\mu _0(\mathbf{t })\).
Proposition 1
Assume the hyperparameters \({\widehat{\varTheta }}\) are known, either from initialisation or estimated in a previous M step. The hyperposterior distribution of \(\mu _0\) remains Gaussian:
$$\begin{aligned} p(\mu _0(\mathbf{t }) \mid \left\{ \mathbf{y }_i \right\} _i, {\widehat{\varTheta }}) = {\mathcal{N}} \left( \mu _0(\mathbf{t }); {\widehat{m}}_0(\mathbf{t }), \widehat{\mathbf{K }}^{\mathbf{t }} \right) , \end{aligned}$$
with

\(\widehat{\mathbf{K }}^{\mathbf{t }} = \left( { \mathbf{K }_{{\widehat{\theta }}_0}^{\mathbf{t }}}^{-1} + \sum \limits _{i = 1}^{M}{\varvec{\varPsi }_{{\widehat{\theta }}_i, {\widehat{\sigma }}_i^2}^{\mathbf{t }}}^{-1} \right) ^{-1},\)

\({\widehat{m}}_0(\mathbf{t }) = \widehat{\mathbf{K }}^{\mathbf{t }} \left( { \mathbf{K }_{{\widehat{\theta }}_0}^{\mathbf{t }}}^{-1} m_0\left( \mathbf{t } \right) + \sum \limits _{i = 1}^{M}{ \varvec{\varPsi }_{{\widehat{\theta }}_i, {\widehat{\sigma }}_i^2}^{\mathbf{t }}}^{-1} \mathbf{y }_i \right) .\)
Proof
We omit specifying timestamps in what follows since each process is evaluated on \(\mathbf{t }\). Therefore, we can write:
The term \({\mathcal{L}}_1 = -(1/2) \log p ( \mu _0 \mid \left\{ \mathbf{y }_i \right\} _i, {\widehat{\varTheta }})\) may then be written as
where the constant terms are gathered into \(C_1, C_2 \in {\mathbb{R}}\). Identifying terms in the quadratic form with the Gaussian likelihood, we get the desired result. \(\square \)
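For concreteness, Proposition 1 translates directly into code; below is a sketch of the E step on a common grid, reusing the kernel helpers introduced in Sect. 2.2 (kernel_eq and psi_matrix are our illustrative names, and m0_t denotes the prior mean evaluated on the grid):

```r
# E step on a common grid (Proposition 1): hyperposterior mean and covariance of mu_0.
# 'y_list' is a list of output vectors, all observed on the same grid 't'.
e_step_common <- function(t, y_list, m0_t, theta0, theta_i, sigma2_i) {
  K0_inv  <- solve(kernel_eq(t, t, theta0$v, theta0$l))
  Psi_inv <- lapply(seq_along(y_list), function(i)
    solve(psi_matrix(t, theta_i[[i]]$v, theta_i[[i]]$l, sigma2_i[[i]])))
  K_hat <- solve(K0_inv + Reduce(`+`, Psi_inv))                   # posterior covariance
  m_hat <- K_hat %*% (K0_inv %*% m0_t +
                        Reduce(`+`, Map(function(P, y) P %*% y, Psi_inv, y_list)))
  list(mean = as.vector(m_hat), cov = K_hat)                      # m_0 hat and K hat
}
```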
The maximisation step depends on the assumptions of the generative model, resulting in two versions of the EM algorithm (the E step is common to both; the branching point is here).
M step: different hyperparameters
Assuming each individual has its own set of hyperparameters \(\{ \theta _i, \sigma _i^2 \}\), the M step is given by the following procedure.
Proposition 2
Assume \(p(\mu _0 \mid \left\{ \mathbf{y }_i \right\} _i) = {\mathcal{N}} \left( \mu _0(\mathbf{t }); {\widehat{m}}_0(\mathbf{t }), \widehat{\mathbf{K }}^{\mathbf{t }} \right) \) computed in a previous E step. For a set of hyperparameters \(\varTheta = \{ \theta _0, \left\{ \theta _i \right\} _i, \left\{ \sigma _i^2 \right\} _i\}\), optimal values are given by
inducing \(M +1\) independent maximisation problems:
where
Proof
One simply has to distribute the conditional expectation in order to get the right likelihood to maximise, and then notice that the function can be written as a sum of \(M+1\) terms that are independent with respect to the hyperparameters. Moreover, by rearranging, one can observe that each independent term is the sum of a Gaussian log-likelihood and a trace correction term. See Sect. 8.2 for details. \(\square \)
M step: common hyperparameters
Alternatively, assuming all individuals share the same set of hyperparameters \(\{ \theta , \sigma ^2 \}\), the M step is given by the following procedure.
Proposition 3
Assume \(p(\mu _0 \mid \left\{ \mathbf{y }_i \right\} _i) = {\mathcal{N}} \left( \mu _0(\mathbf{t }); {\widehat{m}}_0(\mathbf{t }), \widehat{\mathbf{K }}^{\mathbf{t }} \right) \) computed in a previous E step. For a set of hyperparameters \(\varTheta = \{ \theta _0, \theta , \sigma ^2 \}\), optimal values are given by
inducing two independent maximisation problems:
where
Proof
We use the same strategy as for Proposition 2, see Sect. 8.2 for details. \(\square \)
In both cases, explicit gradients associated with the likelihoods to maximise are available, facilitating the optimisation with gradient-based methods.
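As an illustration of these maximisation problems, here is a hedged sketch of one per-individual objective, following our reading of the proof of Proposition 2 (a Gaussian log-likelihood around the current hyperposterior mean plus a trace correction term); it is optimised with L-BFGS-B as in the experiments of Sect. 6, but it is not the authors' exact implementation:

```r
# Hedged sketch of an M-step sub-problem for individual i: Gaussian log-likelihood
# of y_i around the current hyperposterior mean m_hat, plus the trace correction
# induced by the hyperposterior covariance K_hat. Additive constants are dropped
# and hyperparameters are optimised on the log scale to stay positive.
neg_objective_i <- function(log_par, grid, y_i, m_hat, K_hat) {
  v <- exp(log_par[1]); l <- exp(log_par[2]); sigma2 <- exp(log_par[3])
  Psi     <- psi_matrix(grid, v, l, sigma2)
  Psi_inv <- solve(Psi)
  quad    <- drop(t(y_i - m_hat) %*% Psi_inv %*% (y_i - m_hat))
  logdet  <- as.numeric(determinant(Psi, logarithm = TRUE)$modulus)
  loglik  <- -0.5 * (logdet + quad) - 0.5 * sum(diag(Psi_inv %*% K_hat))
  -loglik   # optim() minimises, so return the negative objective
}

# Optimisation with L-BFGS-B, as in Sect. 6 (initial values echo the defaults there):
# fit_i <- optim(log(c(1, 1, 0.4)), neg_objective_i, method = "L-BFGS-B",
#                grid = grid, y_i = y_i, m_hat = m_hat, K_hat = K_hat)
```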
3.2 Initialisation
To implement the EM algorithm described above, several constants must be (appropriately) initialised:

\(m_0(\cdot )\), the mean parameter of the hyperprior distribution of the process \(\mu _0(\cdot )\). A classical choice in GP modelling is to set it to a constant function, typically 0 in the absence of external knowledge. Notice that, in our multitask framework, the influence of \(m_0(\cdot )\) in the hyperposterior computation decreases as M grows anyway (see Proposition 1).

Initial values for the kernel parameters \(\theta _0\) and \(\left\{ \theta _i \right\} _i\). Those strongly depend on the chosen kernel and its properties. We advise initialising \(\theta _0\) and \(\left\{ \theta _i \right\} _i\) with close values, as too large a difference might induce nearly singular covariance matrices and result in numerical instability (typical in GP applications). In such a pathological regime, the influence of a specific individual tends to overtake the others in the computation of \(\mu _0\)’s hyperposterior distribution.

Initial values for the variances of the error terms \(\left\{ \sigma _i^2 \right\} _i\). This choice mostly depends on the context and properties of the dataset. We suggest avoiding initial values differing by more than an order of magnitude from the variability of the data. In particular, too high a value might result in a model mostly capturing noise.
As a final note, let us stress that the EM algorithm depends on the initialisation and is only guaranteed to converge to local maxima of the likelihood function (McLachlan & Krishnan, 2007). Several strategies have been considered in the literature to tackle this issue such as simulated annealing (Ueda & Nakano, 1998) or repeated short runs (Biernacki et al., 2003). In this work, we chose the latter option.
3.3 Pseudocode
We wrap up this section with the pseudocode of the EM component of our complete algorithm, which we call Magma (standing for Multi tAsk Gaussian processes with common MeAn). The corresponding code is available at https://github.com/ArthurLeroy/MAGMA.
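A minimal skeleton of this EM loop might look as follows, assuming e_step(), m_step() and loglik() are helpers implementing Propositions 1–3 and the associated model log-likelihood (illustrative names, not the package API), with the convergence criterion on the log-likelihood used in Sect. 6:

```r
# Skeleton of the Magma EM loop (sketch only; e_step(), m_step() and loglik()
# are assumed helpers implementing Propositions 1-3 and the model log-likelihood).
magma_em <- function(data, theta_init, max_iter = 50, tol = 1e-2) {
  theta  <- theta_init
  ll_old <- -Inf
  for (iter in 1:max_iter) {
    hyperpost <- e_step(data, theta)             # p(mu_0 | {y_i}, current theta)
    theta     <- m_step(data, hyperpost, theta)  # maximise the expected log-likelihood
    ll_new    <- loglik(data, hyperpost, theta)
    if (abs(ll_new - ll_old) < tol) break        # convergence criterion of Sect. 6
    ll_old <- ll_new
  }
  list(hyperpost = hyperpost, theta = theta, iterations = iter)
}
```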
3.4 Discussion of EM algorithms and alternatives
Let us stress that even though we focus on prediction in this paper, the output of the EM algorithm already provides results on related FDA problems. The generative model in Yang et al. (2016) describes a Bayesian framework that resembles ours for smoothing multiple curves simultaneously. However, modelling the variance structure with an Inverse-Wishart process forces the use of an MCMC algorithm for inference or the introduction of a more tractable approximation in Yang et al. (2017). One can think of learning through Magma and then applying a single-task GP regression to each individual as an empirical Bayes counterpart of their approach. Meanwhile, \(\mu _0\)’s hyperposterior distribution also provides a probabilistic estimation of the mean curve from a set of functional data. The closest method to our approach can be found in Shi et al. (2007) and the subsequent book Shi & Choi (2011). The authors also work in the context of a multitask GP model, and one can retrieve the idea of defining a mean function \(\mu _0\) to overcome the weaknesses of classic GPs in making predictions far from observed data. However, since their model uses B-splines to estimate this mean function, the method only works if all individuals share the same grid of observations, and does not account for uncertainty over \(\mu _0\).
4 Prediction
Once the hyperparameters of the model have been learned, we can focus on our main goal: prediction for new individuals at unobserved timestamps. Since \({\widehat{\varTheta }}\) is known, and for the sake of concision, we omit conditioning on \({\widehat{\varTheta }}\) in the sequel. Note that there are two cases for prediction (referred to as Type I and Type II in Shi & Cheng 2014, Section 3.2.1), depending on whether or not we observe data for the new individual we wish to predict on. We denote by the index \(*\) a new individual for whom we want to make a prediction, say at timestamps \(\mathbf{t }^{p}\). If there are no available data for this individual, we have no \(*\)-specific information, and the prediction is merely given by \(p(\mu _0(\mathbf{t }^{p}) \mid \left\{ \mathbf{y }_i \right\} _i)\). This quantity may be considered as the ’generic’ (or Type II) prediction according to the trained model, and only informs us through the mean process. Computing \(p(\mu _0(\mathbf{t }^{p}) \mid \left\{ \mathbf{y }_i \right\} _i)\) is also one of the steps leading to the prediction for a partially observed new individual (Type I). The latter being the most compelling case, we consider Type II prediction as a particular case of the full Type I procedure, described below.
If we observe \(\left\{ \mathbf{t }_{*}, y_{*}(\mathbf{t }_{*}) \right\} \) for the new individual, the multitask GP prediction is obtained in our model by computing the posterior distribution \(p(y_{*}(\mathbf{t }^{p}) \mid y_{*}(\mathbf{t }_{*}), \left\{ \mathbf{y }_i \right\} _i)\). Note that the conditioning is taken over \(y_{*}(\mathbf{t }_{*})\), as for any GP regression, but also on \(\left\{ \mathbf{y }_i \right\} _i\), which is specific to our multitask setting. Computing this distribution requires successively completing the following steps:

1.
choose a grid of prediction \(\mathbf{t }^{p}\) and define the pooled vector of timestamps \(\mathbf{t }^{p}_{*}\),

2.
compute the hyperposterior distribution of \(\mu _0\) at \(\mathbf{t }^{p}_{*}\): \(p(\mu _0(\mathbf{t }^{p}_{*}) \mid \left\{ \mathbf{y }_i \right\} _i)\),

3.
compute the multitask prior distribution \(p(y_{*}(\mathbf{t }^{p}_{*}) \mid \left\{ \mathbf{y }_i \right\} _i)\),

4.
compute hyperparameters \(\theta _*\) associated with the new individual (optional),

5.
compute the multitask posterior distribution: \(p(y_{*}(\mathbf{t }^{p}) \mid y_{*}(\mathbf{t }_{*}), \left\{ \mathbf{y }_i \right\} _i)\).
4.1 Posterior inference on the mean process
As mentioned above, we observe a new individual at timestamps \(\mathbf{t }_{*}\). The GP regression consists in arbitrarily choosing a vector \(\mathbf{t }^{p}\) of timestamps for which we aim at making predictions. Then, we define new notation for the pooled vector of timestamps \(\mathbf{t }^{p}_{*}= \begin{bmatrix} \mathbf{t }^{p}\\ \mathbf{t }_{*}\end{bmatrix}\), which will serve as a working grid to define the prior and posterior distributions involved in the prediction process. One can note that, although not mandatory in theory, it is often a good idea to include the observed timestamps of the training individuals, \(\mathbf{t }\), within \(\mathbf{t }^{p}_{*}\) since they match locations that contain information for the mean process to ’help’ the prediction. In particular, if \(\mathbf{t }^{p}_{*}= \mathbf{t }\), the computation of \(\mu _0\)’s hyperposterior distribution is not necessary since \(p(\mu _0(\mathbf{t }) \mid \left\{ \mathbf{y }_i \right\} _i)\) has previously been obtained from the EM algorithm. However, in general, it is necessary to compute the hyperposterior \(p(\mu _0(\mathbf{t }^{p}_{*}) \mid \left\{ \mathbf{y }_i \right\} _i)\) at the new timestamps. The idea remains similar to the aforementioned E step, and we obtain the following result.
Proposition 4
Let \(\mathbf{t }^{p}_{*}\) be a vector of timestamps of size \({\tilde{N}}\). The hyperposterior distribution of \(\mu _0\) remains Gaussian:
$$\begin{aligned} p(\mu _0(\mathbf{t }^{p}_{*}) \mid \left\{ \mathbf{y }_i \right\} _i, {\widehat{\varTheta }}) = {\mathcal{N}} \left( \mu _0(\mathbf{t }^{p}_{*}); {\widehat{m}}_0(\mathbf{t }^{p}_{*}), \widehat{\mathbf{K }}_{*}^{p} \right) , \end{aligned}$$
with:

\(\widehat{\mathbf{K }}_{*}^{p} = \left( \tilde{\mathbf{K }}^{-1} + \sum \limits _{i = 1}^{M}{\tilde{\varvec{\varPsi }}_i}^{-1} \right) ^{-1}\),

\({\widehat{m}}_0(\mathbf{t }^{p}_{*}) = \widehat{\mathbf{K }}_{*}^{p} \left( \tilde{\mathbf{K }}^{-1} m_0\left( \mathbf{t }^{p}_{*} \right) + \sum \limits _{i = 1}^{M}{\tilde{\varvec{\varPsi }}_i^{-1} \tilde{\mathbf{y }}_i } \right) \),
where we use the following shorthand notation:

\(\tilde{\mathbf{K }} = k_{{\widehat{\theta }}_0} \left( \mathbf{t }^{p}_{*}, \mathbf{t }^{p}_{*} \right) \) (\({\tilde{N}}\times {\tilde{N}}\) matrix),

\({\tilde{\mathbf{y}}}_{i} = \left( {\mathbbm{1}}_{[t \in {\mathbf{t}}_{i}]} \times y_i(t)\right)_{t \in {\mathbf{t}}^{p}_{*}}\) (\({\tilde{N}}\)size vector),

\(\tilde{\varvec{\varPsi }}_i = \left[ \, {\mathbbm{1}}_{ {[}t, t' \in \mathbf{t }_i{]}} \times \psi _{{\widehat{\theta }}_i, {\widehat{\sigma }}_i^2}\left( t, t' \right) \,\right] _{t, t' \in \mathbf{t }^{p}_{*}}\) (\({\tilde{N}}\times {\tilde{N}}\) matrix).
Proof
The sketch of the proof is similar to that of Proposition 1 in the E step. The only technicality consists in dealing carefully with the dimensions of the vectors and matrices involved and, whenever relevant, in defining augmented versions of \(\mathbf{y }_i\) and \(\varvec{\varPsi }_{{\widehat{\theta }}_i, {\widehat{\sigma }}_i^2}\) with 0 elements at the positions of unobserved timestamps for the ith individual. Note that if we pick a vector \(\mathbf{t }^{p}_{*}\) including only some of the timestamps from \(\mathbf{t }_i\), information coming from \(y_i\) at the remaining timestamps is ignored. We defer details to Sect. 8.1. \(\square \)
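In code, the augmented quantities of Proposition 4 amount to embedding each individual's contribution into the working grid; below is a sketch in which the padded inverse is taken as the inverse of the observed block placed at the corresponding positions (our reading of the augmented matrices in the proof), assuming \(\mathbf{t }_i\subset \mathbf{t }^{p}_{*}\) and reusing the kernel helpers above:

```r
# E step on an arbitrary working grid (Proposition 4, sketch). Each individual's
# observations are embedded into 'grid' (= t^p_*), with zero contributions at
# unobserved positions. Assumes every t_i is a subset of 'grid'.
e_step_working_grid <- function(grid, individuals, m0_grid, theta0, theta_i, sigma2_i) {
  N_tilde <- length(grid)
  K_inv <- solve(kernel_eq(grid, grid, theta0$v, theta0$l))
  sum_Psi_inv   <- matrix(0, N_tilde, N_tilde)
  sum_Psi_inv_y <- rep(0, N_tilde)
  for (i in seq_along(individuals)) {
    t_i <- individuals[[i]]$t
    idx <- match(t_i, grid)                       # positions of t_i within the working grid
    Psi_inv_i <- solve(psi_matrix(t_i, theta_i[[i]]$v, theta_i[[i]]$l, sigma2_i[[i]]))
    # Padded contributions: non-zero only at the observed timestamps of individual i.
    sum_Psi_inv[idx, idx] <- sum_Psi_inv[idx, idx] + Psi_inv_i
    sum_Psi_inv_y[idx]    <- sum_Psi_inv_y[idx] + drop(Psi_inv_i %*% individuals[[i]]$y)
  }
  K_hat <- solve(K_inv + sum_Psi_inv)
  m_hat <- drop(K_hat %*% (K_inv %*% m0_grid + sum_Psi_inv_y))
  list(mean = m_hat, cov = K_hat)
}
```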
4.2 Computing the multitask prior distribution
According to our generative model, given the mean process, any new individual \(*\) is modelled as:
$$\begin{aligned} y_*(\cdot ) \mid \mu _0(\cdot ) \sim {\mathcal{GP}} \left( \mu _0(\cdot ), \psi _{\theta _*, \sigma _*^2}(\cdot ,\cdot ) \right) . \end{aligned}$$
Therefore, for any finite-dimensional vector of timestamps, and in particular for \(\mathbf{t }^{p}_{*}\), \(p(y_*(\mathbf{t }^{p}_{*}) \mid \mu _0(\mathbf{t }^{p}_{*}))\) is a multivariate Gaussian. Moreover, from this distribution and \(\mu _0\)’s hyperposterior, we can figure out the multitask prior distribution over \(y_*(\mathbf{t }^{p}_{*})\), defined below.
Proposition 5
For any set of timestamps \(\mathbf{t }^{p}_{*}\), the multitask prior distribution of \(y_*\) is given by
$$\begin{aligned} p(y_*(\mathbf{t }^{p}_{*}) \mid \left\{ \mathbf{y }_i \right\} _i) = {\mathcal{N}} \left( y_*(\mathbf{t }^{p}_{*}); {\widehat{m}}_0(\mathbf{t }^{p}_{*}), \widehat{\mathbf{K }}_{*}^{p} + \varvec{\varPsi }_{\theta _*, \sigma _*^2}^{\mathbf{t }^{p}_{*}} \right) . \end{aligned}$$ (3)
Proof
To compute this prior, we need to integrate out the mean process \(\mu _0\) in \(p(y_* \mid \mu _0, \left\{ \mathbf{y }_i \right\} _i)\), whereas the multitask aspect remains through the conditioning over \(\left\{ \mathbf{y }_i \right\} _i\). We omit the writing of timestamps, by using the simplified notation \(\mu _0\) and \(y_*\) instead of \(\mu _0(\mathbf{t }^{p}_{*})\) and \(y_*(\mathbf{t }^{p}_{*})\), respectively. We first use the assumption that \(\{ y_i \mid \mu _0 \}_{i \in \{ 1,\dots ,M \}} \perp \!\!\! \perp y_* \mid \mu _0\), i.e., the individuals are independent conditionally on \(\mu _0\). Then, one can notice that the two distributions involved within the integral are Gaussian, which leads to an explicit Gaussian target distribution after integration.
This convolution of two Gaussians remains Gaussian (Bishop, 2006, Chapter 2.3.3). The mean parameter is then given by
Following the same idea, the second-order moment is given by
hence
\(\square \)
Note that the process \(y_*(\cdot ) \mid \left\{ \mathbf{y }_i \right\} _i\) is not strictly a GP, although its finite-dimensional evaluation (3) remains Gaussian. The covariance structure cannot be expressed as a kernel that could be directly evaluated at any timestamps: the process is known as a degenerate GP. In practice, however, this does not bear much consequence, as any arbitrary vector of timestamps \(\tau \) can be chosen at first, and computing the hyperposterior \(p(\mu _0(\tau ) \mid \left\{ \mathbf{y }_i \right\} _i)\) still yields the Gaussian distribution \(p(y_*(\tau ) \mid \left\{ \mathbf{y }_i \right\} _i)\) as above. For the sake of simplicity, we now rename the covariance matrix of the multitask prior distribution:
$$\begin{aligned} \varvec{\varGamma }= \widehat{\mathbf{K }}_{*}^{p} + \varvec{\varPsi }_{\theta _*, \sigma _*^2}^{\mathbf{t }^{p}_{*}} = \begin{pmatrix} \varvec{\varGamma }_{pp} & \varvec{\varGamma }_{p*} \\ \varvec{\varGamma }_{*p} & \varvec{\varGamma }_{**} \end{pmatrix}, \end{aligned}$$
where the indices in the blocks of the matrix correspond to the associated timestamps \(\mathbf{t }^{p}\) and \(\mathbf{t }_{*}\).
4.3 Learning the new hyperparameters
When we collect data points for a new individual, as in the single-task GP setting, we need to learn the hyperparameters of its covariance kernel before making predictions. A salient fact in our multitask approach is that we consider this step as part of the prediction process, for two main reasons. First, the model is already trained for individuals \(i = 1,\dots , M\), and this training is independent of the future individual \(*\) and of the choice of prediction timestamps. Since learning these new hyperparameters requires knowledge of \(\mu _0(\mathbf{t }^{p}_{*})\) and thus of the prediction timestamps, we cannot compute them beforehand. Second, learning these hyperparameters with the empirical Bayes approach only requires the maximisation of a Gaussian likelihood, which is negligible in computing time compared to the previous EM algorithm. As for single-task GPs, we have the following estimates for the hyperparameters:
$$\begin{aligned} ({\widehat{\theta }}_*, {\widehat{\sigma }}_*^2) = \mathop {\mathrm {arg\,max}}\limits _{\theta _*, \sigma _*^2} \ {\mathcal{N}} \left( y_*(\mathbf{t }_{*}); {\widehat{m}}_0(\mathbf{t }_{*}), \varvec{\varGamma }_{**} \right) . \end{aligned}$$
Note that this step is optional depending on the modelling assumption: in the common hyperparameters model (i.e. \((\theta , \sigma ^2) = (\theta _i, \sigma _i^2), \forall i \in {\mathcal{I}}\)), any new individual will also share the same hyperparameters and we already have \({\widehat{\varTheta }}_* = ({\widehat{\theta }}_*, {\widehat{\sigma }}_*^2) = ({\widehat{\theta }}, {\widehat{\sigma }}^2)\) from the EM algorithm.
4.4 Prediction
We can rewrite the multitask prior distribution, by separating observed and prediction timestamps, as:
$$\begin{aligned} p \left( \begin{bmatrix} y_*(\mathbf{t }^{p}) \\ y_*(\mathbf{t }_{*}) \end{bmatrix} \ \Bigg | \ \left\{ \mathbf{y }_i \right\} _i \right) = {\mathcal{N}} \left( \begin{bmatrix} y_*(\mathbf{t }^{p}) \\ y_*(\mathbf{t }_{*}) \end{bmatrix}; \begin{bmatrix} {\widehat{m}}_0(\mathbf{t }^{p}) \\ {\widehat{m}}_0(\mathbf{t }_{*}) \end{bmatrix}, \begin{pmatrix} \varvec{\varGamma }_{pp} & \varvec{\varGamma }_{p*} \\ \varvec{\varGamma }_{*p} & \varvec{\varGamma }_{**} \end{pmatrix} \right) . \end{aligned}$$
As usual, the conditional distribution remains Gaussian, and the multitask posterior distribution is given by:
$$\begin{aligned} p(y_{*}(\mathbf{t }^{p}) \mid y_{*}(\mathbf{t }_{*}), \left\{ \mathbf{y }_i \right\} _i) = {\mathcal{N}} \left( y_{*}(\mathbf{t }^{p}); {\widehat{\mu }}_0^p, \widehat{\varvec{\varGamma }}^p \right) , \end{aligned}$$
where:

\({\widehat{\mu }}_0^p = {\widehat{m}}_0(\mathbf{t }^{p}) + \varvec{\varGamma }_{p*} \varvec{\varGamma }_{**}^{-1} \left( y_*(\mathbf{t }_{*}) - {\widehat{m}}_0(\mathbf{t }_{*}) \right) ,\)

\(\widehat{\varvec{\varGamma }}^p = \varvec{\varGamma }_{pp} - \varvec{\varGamma }_{p*} \varvec{\varGamma }_{**}^{-1} \varvec{\varGamma }_{*p}.\)
Although this predictive distribution is nicely analogous to standard GP formulations, let us emphasise the terms \({\widehat{m}}_0(\mathbf{t }^{p}_{*})\) and \(\widehat{\varvec{\varGamma }}^p\), which embed crucial information from the training individuals, making the mean prediction more relevant even far from the observed points \(y_*(\mathbf{t }_{*})\).
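The posterior above translates into a few matrix operations; below is a sketch, assuming the hyperposterior mean and covariance of \(\mu _0\) have been computed on the working grid \(\mathbf{t }^{p}_{*}\) (prediction timestamps first) and that the new individual's hyperparameters are available:

```r
# Multitask prediction step (Sect. 4.4, sketch). 'grid' is t^p_* with the n_p
# prediction timestamps first and the observed timestamps t_* last; 'm_hat' and
# 'K_hat' are mu_0's hyperposterior mean and covariance on that grid.
magma_predict <- function(grid, n_p, y_obs, m_hat, K_hat, theta_star, sigma2_star) {
  Gamma <- K_hat + psi_matrix(grid, theta_star$v, theta_star$l, sigma2_star)
  ip <- seq_len(n_p)                          # indices of prediction timestamps t^p
  io <- (n_p + 1):length(grid)                # indices of observed timestamps t_*
  G_po_G_oo_inv <- Gamma[ip, io] %*% solve(Gamma[io, io])
  mu_pred  <- m_hat[ip] + G_po_G_oo_inv %*% (y_obs - m_hat[io])
  cov_pred <- Gamma[ip, ip] - G_po_G_oo_inv %*% Gamma[io, ip]
  list(mean = as.vector(mu_pred), cov = cov_pred,
       sd = sqrt(pmax(diag(cov_pred), 0)))    # standard deviations for credible intervals
}
```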
5 Complexity analysis for training and prediction
Computational complexity is of paramount importance for GPs as it quickly becomes prohibitive for large datasets. The classical cost to train a GP is \({\mathcal{O}}(N^3)\), and \({\mathcal{O}}(N^2)\) for prediction (Rasmussen & Williams, 2006), where N is the number of data points (although there exist various sparse approximations, see Sect. 7 for references). Moreover, multitask GP models relying on LMC approaches typically present a complexity of \({\mathcal{O}}(M^3 N^3)\) in training, which can be reduced by using sparse approximations (Álvarez and Lawrence 2011). As detailed below, our model reaches a reduced training complexity of \({\mathcal{O}}((M + 1) N^3)\) in a similar context (common grid of timestamps for all individuals), without using any sparse approximation.
More specifically, since Magma uses information from M individuals, each of them providing \(N_i\) observations, these quantities determine the overall complexity of the algorithm. If we recall that N is the number of distinct timestamps (i.e. \(N \le \sum \nolimits _{i = 1}^{M}N_i\)), the training complexity is \({\mathcal{O}} \left( M \times N_i^3 + N^3 \right) \) per EM iteration. As usual with GPs, the cubic costs come from the inversion of the corresponding matrices, and here, the constant is proportional to the number of iterations of the EM algorithm. The dominating term in this expression depends on the value of M relative to N. For a large number of individuals with many common timestamps (\(MN_i \gtrsim N\)), the first term dominates. For diverse timestamps among individuals (\(MN_i \lesssim N\)), the second term becomes the primary burden, as in any GP problem. During the prediction step, the recomputation of \(\mu _0\)’s hyperposterior implies the inversion of an \({\tilde{N}} \times {\tilde{N}}\) matrix (the dimension of \(\mathbf{t }^{p}_{*}\)), which has a \({\mathcal{O}}({\tilde{N}}^3)\) complexity, while the estimation of the new hyperparameters costs \({\mathcal{O}}(N_*^3)\). In practice, the most computationally expensive steps can be performed in advance to allow for quick on-the-fly prediction when collecting new data. If we observe the training dataset once and precompute the hyperposterior of \(\mu _0\) on a fine grid on which to predict later, the immediate computational cost for each new individual is identical to that of single-task GP regression.
6 Experimental results
We evaluate our Magma algorithm on synthetic data and two real datasets. Classical GP regression, applied to each task separately, is used as the baseline alternative for predictions. While it is not expected to perform well on the datasets used, the comparison highlights the interest of multitask approaches. To our knowledge, the only alternative to Magma is the GPFDA algorithm from Shi et al. (2007), Shi & Choi (2011), described in Sect. 3.4, and the associated R package GPFDA, which is applied during the experiments. Throughout the section, the standard Exponentiated Quadratic kernel (see Eq. (1)) is used both for simulating the data and for modelling the covariance structures in the three algorithms. Hence, each kernel is associated with \(\theta = \{ v, \ell \}, \ v, \ell \in {\mathbb{R}}^{+}\), a set of variance and lengthscale hyperparameters, respectively. Each simulated dataset was drawn from the sampling scheme below:

1.
Draw a random working grid \(\mathbf{t }\subset \left[ \, 0,10 \,\right] \) of \(N = 200\) timestamps, and a number M of individuals.

2.
Define a prior mean function: \(m_0(t) = at + b, \ \forall t \in \mathbf{t }\), where \(a \in \left[ \, -2, 2 \,\right] \) and \(b \in \left[ \, 0, 10 \,\right] \) are drawn uniformly.

3.
Draw hyperparameters uniformly for \(\mu _0\)’s kernel: \(\theta _0 = \{ v_0, \ell _0 \}\), where \(v_0 \in \left[ \, 1, \exp (5) \,\right] \) and \(\ell _0 \in \left[ \, 1, \exp (2) \,\right] \).

4.
Draw \(\mu _0 (\mathbf{t }) \sim {\mathcal{N}} \left( m_0(\mathbf{t }), \mathbf{K }_{\theta _0}^{\mathbf{t }} \right) \).

5.
\(\forall i \in {\mathcal{I}}\), draw \(v_i \in \left[ \, 1, \exp (5) \,\right] \), \(\ell _i \in \left[ \, 1, \exp (2) \,\right] \), and \(\sigma _i^2 \in \left[ \, 0, 1 \,\right] \) uniformly.

6.
\(\forall i \in {\mathcal{I}}\), draw a subset \(\mathbf{t }_i\subset \mathbf{t }\) of \(N_i = 30\) timestamps uniformly, and draw \(\mathbf{y }_i\sim {\mathcal{N}} \left( \mu _0(\mathbf{t }_i), \varvec{\varPsi }_{\theta _i, \sigma _i^2}^{\mathbf{t }_i} \right) \).
This procedure provides a synthetic dataset \(\left\{ \mathbf{t }_i, \mathbf{y }_i \right\} _i\) and its associated mean process \(\mu _0(\mathbf{t })\). Those quantities are used to train the models, make predictions with each algorithm, and then compute errors in \(\mu _0\) estimation and in forecasting. We recall that the Magma algorithm enables two different settings depending on the model’s assumption over hyperparameters (HP), and we refer to them as Common HP and Different HP in the following. In order to test these two contexts, differentiated datasets have been generated by drawing either Common HP data or Different HP data for each individual at step 5. We previously presented the idea of the model used in GPFDA, and, although the algorithm has many features (in particular regarding the type and number of input variables), it is not yet usable when timestamps differ among individuals. Therefore, two frameworks are considered, Common grid and Uncommon grid, to take this specification into account. Thus, the comparison between the different methods can only be performed on data generated under the Common HP and Common grid settings, and the effect of the other settings on Magma is analysed separately. Moreover, the initialisation of the prior mean function, \(m_0(\cdot )\), is set to a constant function equal to 0 for each algorithm. Except in the experiments where the influence of the number of individuals is analysed, the generic value is \(M = 20\). In the case of prediction at unobserved timestamps for a new individual, the first 20 data points are used as observations, and the remaining 10 are taken as test values. Optimisation of the hyperparameters is performed by likelihood maximisation, using the L-BFGS-B algorithm (Morales & Nocedal, 2011; Nocedal, 1980) in all methods. The convergence criterion for all algorithms is reached if the difference in log-likelihood between two iterations is lower than \(10^{-2}\). In general, the EM algorithm in Magma converges in a few iterations, typically fewer than 5 with the Common HP setting, and rarely more than 15 even with the Different HP setting.
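For concreteness, the sampling scheme of steps 1–6 might be sketched as follows, reusing the kernel helpers of Sect. 2.2 (here the drawn v and \(\ell \) are simply passed to the kernel as its variance and length-scale hyperparameters, and MASS::mvrnorm draws the multivariate Gaussians; this is an illustrative sketch, not the exact simulation code):

```r
# Sketch of the simulation scheme (steps 1-6).
simulate_dataset <- function(M = 20, N = 200, N_i = 30) {
  grid <- sort(runif(N, 0, 10))                                   # step 1: working grid
  a <- runif(1, -2, 2); b <- runif(1, 0, 10)
  m0 <- a * grid + b                                              # step 2: prior mean
  v0 <- runif(1, 1, exp(5)); l0 <- runif(1, 1, exp(2))            # step 3: mu_0's kernel HPs
  K0 <- kernel_eq(grid, grid, v0, l0) + diag(1e-8, N)             # small jitter for stability
  mu0 <- MASS::mvrnorm(1, m0, K0)                                 # step 4: draw mu_0
  individuals <- lapply(1:M, function(i) {                        # steps 5 and 6
    v_i <- runif(1, 1, exp(5)); l_i <- runif(1, 1, exp(2)); s2_i <- runif(1, 0, 1)
    idx <- sort(sample(N, N_i))                                   # subset t_i of the grid
    y_i <- MASS::mvrnorm(1, mu0[idx], psi_matrix(grid[idx], v_i, l_i, s2_i))
    list(t = grid[idx], y = y_i)
  })
  list(individuals = individuals, grid = grid, mu0 = mu0)
}
```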
6.1 Illustration on a simple example
To illustrate the multitask approach of Magma, Fig. 2 displays a comparison between standard GP regression and Magma on a simple example, from a dataset simulated according to the scheme above and using the Uncommon grid/Common HP setting. Given the observed data (in black), values on a fine grid of unobserved timestamps are predicted and compared, in particular, with the true test values (in red). As expected, the GP regression provides a good fit close to the data points and then dives rapidly towards the prior 0 with increasing uncertainty. Conversely, although the prior mean is also initialised to 0 in Magma, the hyperposterior distribution of \(\mu _0\) (dashed line) is estimated thanks to all individuals in the training dataset. This process acts as an informed prior helping the GP prediction for the new individual, even far from its own observations. More precisely, three phases can be distinguished according to the level of information coming from the data: in the first one, close to the observed data (\(t \in \left[ \, 1,7 \,\right] \)), the two processes behave similarly, except for a slight increase in the variance for Magma, which is logical since the prediction also takes uncertainty over \(\mu _0\) into account (see Eq. (3)). In the second one, on intervals of unobserved timestamps containing data points from the training dataset (\(t \in \left[ \, 0,1 \,\right] \cup \left[ \, 7,10 \,\right] \)), the prediction is guided by the information coming from other individuals through \(\mu _0\). In this context, the mean trajectory remains coherent and the uncertainty increases only slightly. In the third phase, where no observations are available, neither from the new individual nor from the training dataset (\(t \in \left[ \, 10,12 \,\right] \)), the prediction behaves as expected, slowly drifting back to the prior mean 0 with sharply increasing variance. Overall, the multitask framework provides reliable probabilistic predictions on a wider range of timestamps, potentially outside of the usual scope of GPs.
6.2 Performance comparison on simulated datasets
We compare the performance of Magma with alternatives in several situations and for different datasets. First, classical GP regression (GP), GPFDA and Magma are compared through their performance in prediction and in estimation of the true mean process \(\mu _0\). In the prediction context, the performances are evaluated according to the following indicators:

the mean squared error (MSE), which compares the predicted values to the true test values at the last 10 timestamps:
$$\begin{aligned} \dfrac{1}{10} \sum \limits _{k = 21}^{30} \left( y_*^{{\text{pred}}} (t_*^k) - y_*^{{\text{true}}} (t_*^k) \right) ^2 , \end{aligned}$$ 
the \(CI_{95}\) coverage (\(CIC_{95}\)), i.e. the percentage of unobserved data points effectively lying within the 95% credible interval defined from the predictive posterior distribution \(p(y_*(\mathbf{t }^{p}) \mid y_*(\mathbf{t }_{*}), \left\{ \mathbf{y }_i \right\} _i)\):
$$\begin{aligned} 100 \times \dfrac{1}{10} \sum \limits _{k = 21}^{30} \mathbbm {1}_{ \{ y_*^{{\text{true}}}(t_*^k) \in \ CI_{95} \} }. \end{aligned}$$
The \(CIC_{95}\) provides insights into the reliability of the predictive variance and should be as close to 95% as possible. Other values would indicate a tendency to underestimate or overestimate the uncertainty. Let us recall that GPFDA uses B-splines to estimate the mean process and does not account for uncertainty, contrary to a probabilistic framework such as Magma. However, a measure of uncertainty based on an empirical variance estimated from the training curves is proposed (see Shi & Cheng, 2014, Section 3.2.1). In practice, this measure consistently overestimates the true variance, and their 95% empirical interval coverage is generally equal to or close to 100%.
In the estimation context, the performances are evaluated thanks to another MSE, which compares the estimations to the true values of \(\mu _0\) at all timestamps:
$$\begin{aligned} \dfrac{1}{N} \sum \limits _{k = 1}^{N} \left( {\widehat{m}}_0 (t^k) - \mu _0 (t^k) \right) ^2 . \end{aligned}$$
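The two prediction indicators above are straightforward to compute; a sketch, assuming the output of the magma_predict helper introduced earlier:

```r
# Evaluation metrics on held-out timestamps (sketch): MSE and CI95 coverage.
evaluate_prediction <- function(pred, y_true) {
  mse   <- mean((pred$mean - y_true)^2)
  lower <- pred$mean - 1.96 * pred$sd
  upper <- pred$mean + 1.96 * pred$sd
  cic95 <- 100 * mean(y_true >= lower & y_true <= upper)   # % of points in the CI95
  list(MSE = mse, CIC95 = cic95)
}
```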
Table 1 presents the results obtained over 100 datasets, where the models are trained on \(M = 20\) individuals, each of them observed on \(N = 30\) common timestamps. As expected, both multitask methods lead to better results than GP. However, Magma outperforms GPFDA, both in the estimation of \(\mu _0\) and in predictive performance. In terms of error as well as uncertainty quantification, Magma provides more accurate results, in particular with a \(CI_{95}\) coverage close to the expected 95% value. Each method presents a rather high standard deviation for the MSE in prediction, which is due to some datasets with particularly difficult values to predict, although most cases lead to small errors. This behaviour is reasonably expected since such 10-timestamps-ahead forecasts might sometimes be tricky. It can also be noticed in Fig. 3 that Magma consistently provides lower errors as well as fewer pathological behaviours than may sometimes occur with the B-splines modelling used in GPFDA.
To highlight the effect of the number of individuals M on the performance, Fig. 3 reports the same 100-run trial as previously, for different values of M. The boxplots exhibit, for each method, the behaviour of the prediction and estimation MSE as information is added to the training dataset. Let us mention the absence of discernible changes as soon as \(M > 200\). As expected, we notice on the right panel that adding information from new individuals improves the estimation of \(\mu _0\), leading to very small errors for high values of M, in particular for Magma. Meanwhile, the left panel exhibits reasonably unchanged prediction performance with respect to the values of M, except for some random fluctuations. This property is expected for GP regression since no external information is used from the training dataset in this context. For both multitask algorithms though, the estimation of \(\mu _0\) improves the prediction, bringing errors one order of magnitude below the typical ones, even with only a few training individuals. Furthermore, since a new individual behaves independently through \(f_*\), it is natural for a 10-points-ahead forecast to present intrinsic variations, despite an adequate estimation of the shared mean process.
To illustrate the advantage of multitask methods, even for \(M = 20\), we display in Fig. 4 the evolution of the MSE according to the number of timestamps N that are assumed to be observed for the new individual on which we make predictions. These predictions are still computed at the last 10 timestamps, although in this experiment we only observe the first 5, 10, 15, or 20 timestamps, in order to vary the volume of information and the distance from observations to targets. We observe in Fig. 4 that, as expected in a GP framework, the closer the observations are to the targets, the better the results. However, for multitask approaches, and in particular for Magma, the prediction remains consistently adequate even with few observations. Once more, sharing information across individuals significantly helps the prediction, even for small values of M or few observed data.
6.3 Magma’s specific settings
As previously discussed, different settings are available for Magma according to the nature of the data and the model hypotheses. First, the Common grid setting corresponds to cases where all individuals share the same timestamps, whereas Uncommon grid is used otherwise. Moreover, Magma makes it possible to consider either identical hyperparameters for all individuals or individual-specific ones, as previously discussed in Sect. 2.2. To evaluate the effect of the different settings, performances in prediction and in \(\mu _0\)’s estimation are evaluated in the following cases in Table 2:

Common HP, when data are simulated with a common set of hyperparameters for all individuals, and Proposition 3 is used for inference in Magma,

Different HP, when data are simulated with an individual-specific set of hyperparameters for each individual, and Proposition 2 is used for inference in Magma,

Common HP on different HP data, when data are simulated with an individual-specific set of hyperparameters for each individual, and Proposition 3 is used for inference in Magma.
Note that the first line (Common grid/Common HP) of Table 2 is identical to the corresponding results in Table 1, providing reference values that are significantly better than those of the other methods. The results obtained in Table 2 indicate that the performance of Magma is not significantly altered by the settings used or the nature of the simulated data. To confirm the robustness of the method, the Common HP setting was applied to data generated by drawing different values of the hyperparameters for each individual (Different HP data). In this case, performances in prediction and in estimation of \(\mu _0\) are slightly deteriorated, although Magma still provides quite reliable forecasts. This experiment also highlights a particularity of the Different HP setting: looking at the \(\mu _0\) estimation performance, we observe a significant decrease in the \(CI_{95}\) coverage, due to numerical instability in some pathological cases. Numerical issues, in particular during matrix inversions, are classical problems in the GP literature and, because of the potentially large number of different hyperparameters to train, the probability that at least one of them leads to a nearly singular matrix increases. In this case, one individual might overwhelm the others in the computation of \(\mu _0\)’s hyperposterior (see Proposition 4), and thus lead to an underestimated posterior variance. This problem does not occur in the Common HP setting, since sharing the same hyperparameters prevents the associated covariance matrices from overwhelming one another. Thus, unless one specifically wants to smooth multiple curves presenting very different behaviours, keeping Common HP as the default setting appears to be a reasonable choice. Let us notice that the estimation of \(\mu _0\) is slightly better for a common than for an uncommon grid, since the estimation problem on the union of different timestamps is generally more difficult. However, this feature only depends on the nature of the data.
6.4 Running times comparisons
The counterpart of the more accurate and general results provided by Magma is a natural increase in running time. Table 3 exhibits the raw and relative training times for GPFDA and Magma (prediction times are negligible and comparable in both cases), on data coming from the simulation scheme with varying values of M on a Common grid of \(N = 30\) timestamps. The algorithms were run under R version 3.6.1, on a laptop with a dual-core processor clocked at 2.90 GHz and 8 GB of RAM. The reported computing times are in seconds, and for small to moderate datasets (\(N \simeq 10^3\), \(M \simeq 10^4\)) the procedures ran in a few minutes to a few hours. The difference between the two algorithms is due to GPFDA modelling \(\mu _0\) as a deterministic function through B-splines smoothing, whereas Magma accounts for uncertainty. The ratio of computing times between the two methods tends to decrease as M increases, and stabilises around 2 for higher numbers of training individuals. This behaviour comes from the E step in Magma, which is incompressible and quite insensitive to the value of M. Roughly speaking, one needs to pay twice the computing price of GPFDA for Magma to provide (significantly) more accurate predictions and uncertainty over \(\mu _0\). Table 4 provides running times of Magma according to its different settings, with \(M=20\). Because the complexity is linear in M in each case, the ratio in running times would remain roughly similar no matter the value of M. Prediction time appears negligible compared to training time, and generally takes less than one second. Besides, the Different HP setting increases the running time since, in this context, M maximisations (instead of one for Common HP) are required at each EM iteration. In this case, the prediction also takes slightly longer because of the necessity to optimise the hyperparameters of the new individual. Although the nature of the grid of timestamps does not matter in itself, a key limitation lies in the dimension N of the pooled set of timestamps, which tends to grow when individuals have different timestamps from one another.
6.5 Application of Magma on swimmers’ progression curves
Data and problematic
We consider the problem of performance prediction in competition for French swimmers. The French Swimming Federation provided us with an anonymised dataset, compiling the age and results of its members between 2000 and 2016. For each competitor, the race times are registered for competitions of 100 m freestyle (50 m swimming pool). The database contains results from 1731 women and 7876 men, each of them compiling an average of 22.2 data points (min = 15, max = 61) and 12 data points (min = 5, max = 57), respectively. In the following, the age of the ith swimmer is considered as the input variable (timestamp t) and the performance (in seconds) on a 100 m freestyle as the output (\(y_i(t)\)). For reasons of confidentiality and property, the raw dataset cannot be published. The analysis focuses on the youth period, from 10 to 20 years, where progression is the most noticeable. In order to get relevant time series, we retained only individuals having a sufficient number of data points (\(N_i \ge 5\)) over the considered time period. For a young swimmer, observed during their first years of competition, we aim at modelling their progression curve and making predictions on their future performances in the subsequent years. Since we consider a decision-making problem involving irregular time series, the GP probabilistic framework is a natural choice to work with. Thereby, assuming that each swimmer in the database is a realisation \(y_i\) defined as previously, we expect Magma to provide multitask predictions for a new young swimmer, benefiting from the information of other swimmers already observed at older ages. To study such a modelling approach, and validate its efficiency in practice, we split the individuals into training and testing datasets with respective sizes:

\(M_{{\text{train}}}^F = 1039\), for the female training set,

\(M_{{\text{test}}}^F = 692\), for the female testing set,

\(M_{{\text{train}}}^M = 4726\), for the male training set,

\(M_{{\text{test}}}^M = 3150\), for the male testing set.
Inference on the hyperparameters is performed using the training dataset in both cases. Considering the different timestamps and the relative monotonicity of the progression curves, the Uncommon grid/Common HP setting has been used for Magma. The overall training lasted around 2 h with the same hardware configuration as for the simulations. To compute the MSE and the \(CI_{95}\) coverage, the data points of each individual in the testing set have been split into observed and testing timestamps. Since each individual has a different number of data points, the first 80% of timestamps are taken as observed, while the remaining 20% are considered as testing timestamps. Magma’s predictions are compared with the true values of \(y_i\) at the testing timestamps. As previously, both GP and Magma have been initialised with a constant 0 mean function. Initial values for the hyperparameters are also identical for all i, \(\theta _0^{{\text{ini}}} = \theta _i^{{\text{ini}}} = (\exp (1), \exp (1))\) and \(\sigma _i^{{\text{ini}}} = 0.4\). Those values are the defaults in Magma and remain adequate in the context of these datasets.
Results and interpretation
The overall performance and comparison are summarised in Table 5.
We observe that Magma still provides excellent results in this context, and naturally outperforms predictions provided by standard GP regression. As the progression curves present relatively monotonic variations and thus avoid pathological behaviours that could occur with synthetic data, the MSE in prediction remains very low. The \(CI_{95}\) coverage stays close to the expected 95% value for Magma, indicating an adequate quantification of uncertainty. To illustrate these results, an example is displayed in Fig. 5 for both men and women. For a randomly chosen testing individual, we plot the predicted progression curve (in blue), where the first 15 data points are used as observations (in black), while the remaining true data points (in red) are displayed for comparison purposes. As previously observed in the simulation study, the simple GP quickly drifts to the prior 0 mean as soon as data are lacking. However, for both men and women, the Magma predictions remain close to the true data, which also lie within the 95% credible interval. Even for long-term forecasts, where the mean prediction curve tends to overlap the mean process (dashed line), the true data remain within our range of uncertainty, as the credible interval widens far from observations. For clarity, we display only a few individuals from the training dataset (coloured points) in the background. The mean process (dashed line) seems to represent the main trend of progression among swimmers correctly, even though we cannot numerically compare \(\mu _0\) to any real-life analogous quantity. From a more sport-related perspective, we can note that both genders present similar patterns of progression. However, while performances are roughly similar in mean trend before the age of 14, they start to differentiate afterwards and then converge to average times with approximately a 5 s gap. Interestingly, the difference between men and women in terms of world records in swimming competitions for the 100 m freestyle is currently 4.8 s (46.91 versus 51.71). These results, obtained under reasonable hypotheses on several hundred swimmers, seem to indicate that Magma would give quite reliable predictions for a new young swimmer. Furthermore, the uncertainty provided through the predictive posterior distribution offers an adequate degree of caution in a decision-making process.
7 Discussion
We have introduced a unified multitask framework integrating a mean Gaussian process prior in the context of GP regression. While we believe that this mean process is an interesting object in itself, it also allows individuals to borrow information from each other and provides more accurate predictions, even far from data points. Furthermore, our method accounts for uncertainty in the mean process and remains applicable regardless of possibly irregular grids of observed timestamps. The proposed algorithm, Magma, also presents a reduced computational complexity compared to previous multitask GP frameworks. On both simulated and real-life datasets, we exhibited the efficiency of such an approach and studied some of its properties and possible settings. Magma outperforms the alternatives in estimation of the mean process as well as in prediction, and leads to a reliable quantification of uncertainty. We also provided evidence of its predictive efficiency for real-life problems and offered some insights on practical interpretations of the mean process.
Despite the extensive literature on these aspects of GPs, our model does not yet include sparse approximations. While these remain beyond the scope of the present paper, we might aim at adapting existing approaches (Snelson & Ghahramani, 2006; Quiñonero-Candela et al., 2007; Titsias, 2009) to our model to widen its applicability. Another possible avenue is an adaptation to the classification context, as presented in Rasmussen and Williams (2006, Chapter 3). Besides, this work leaves the door open to improvement, as we tackled here the problem of unidimensional regression: enabling either multidimensional or mixed types of inputs, as in Shi & Choi (2011), would be of interest. To conclude, the hypothesis of a unique underlying mean process might be considered too restrictive for some datasets, and enabling cluster-specific mean processes would be a relevant extension.
8 Proofs
Note that the proof of Proposition 1 is a particular case of the proof below, where \(\varvec{\tau }= \mathbf{t }\) exactly (\(\varvec{\tau }\) being the set of timestamps on which the hyperposterior is to be computed). Moreover, in order to keep an analytical expression for \(\mu _0\)’s hyperposterior distribution, we discard the superfluous information contained in \(\left\{ \mathbf{y }_i \right\} _i\) at timestamps on which the hyperposterior is not to be computed. Hence, the proof below assumes that the remaining data points are observed on subsets \(\left\{ \varvec{\tau }_i \right\} _i\) of \(\varvec{\tau }\).
8.1 Proof of Proposition 4
Let \(\varvec{\tau }\) be a finite vector of timestamps, and let \(\left\{ \varvec{\tau }_i \right\} _i\) be such that \(\forall i \in {\mathcal{I}}, \ \ \varvec{\tau }_i \subset \varvec{\tau }\). We define the following convenient notation:

\(\varvec{\mu }_0^{\varvec{\tau }}= \mu _0(\varvec{\tau })\),

\(\mathbf{m }_0^{\varvec{\tau }}= m_0(\varvec{\tau })\),

\(\varvec{\mu }_0^{\varvec{\tau }_i}= \mu _0(\varvec{\tau }_i), \ \forall i \in {\mathcal{I}}\),

\(\mathbf{y }_i^{\varvec{\tau }_i}= y_i(\varvec{\tau }_i), \ \forall i \in {\mathcal{I}}\),

\(\varvec{\varPsi }_i = \psi _{\theta _i, \sigma _i^2}(\varvec{\tau }_i, \varvec{\tau }_i), \forall i \in {\mathcal{I}}\),

\(\mathbf{K } = k_{\theta _0}(\varvec{\tau }, \varvec{\tau })\).
Moreover, for a covariance matrix C and \(u, v \in \varvec{\tau }\), we write \(\left[ \, C \,\right] _{uv}^{-1}\) for the element of the inverse matrix at the row associated with timestamp u and the column associated with timestamp v. We also omit the conditioning on \({\widehat{\varTheta }}\), \(\varvec{\tau }_i\) and \(\varvec{\tau }\) to keep the expressions simple. By construction of the models, we have:
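\[ p\left( \varvec{\mu}_0^{\varvec{\tau}} \mid \left\{ \mathbf{y}_i^{\varvec{\tau}_i} \right\}_i \right) \propto p\left( \varvec{\mu}_0^{\varvec{\tau}} \right) \prod_{i = 1}^{M} p\left( \mathbf{y}_i^{\varvec{\tau}_i} \mid \varvec{\mu}_0^{\varvec{\tau}_i} \right) = {\mathcal{N}} \left( \varvec{\mu}_0^{\varvec{\tau}}; \, \mathbf{m}_0^{\varvec{\tau}}, \, \mathbf{K} \right) \prod_{i = 1}^{M} {\mathcal{N}} \left( \mathbf{y}_i^{\varvec{\tau}_i}; \, \varvec{\mu}_0^{\varvec{\tau}_i}, \, \varvec{\varPsi}_i \right), \]
by Bayes' theorem and the conditional independence of the individuals given \(\mu _0\).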
The term \({\mathcal{L}}_1 = \log p(\varvec{\mu }_0^{\varvec{\tau }}\mid \left\{ \mathbf{y }_i^{\varvec{\tau }_i} \right\} _i) \) associated with the hyperposterior remains quadratic in \(\varvec{\mu }_0^{\varvec{\tau }}\), and we may find the corresponding Gaussian parameters by identification:
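\[ \begin{aligned} {\mathcal{L}}_1 &= -\frac{1}{2} \left( \varvec{\mu}_0^{\varvec{\tau}} - \mathbf{m}_0^{\varvec{\tau}} \right)^{\top} \mathbf{K}^{-1} \left( \varvec{\mu}_0^{\varvec{\tau}} - \mathbf{m}_0^{\varvec{\tau}} \right) - \frac{1}{2} \sum_{i = 1}^{M} \left( \mathbf{y}_i^{\varvec{\tau}_i} - \varvec{\mu}_0^{\varvec{\tau}_i} \right)^{\top} \varvec{\varPsi}_i^{-1} \left( \mathbf{y}_i^{\varvec{\tau}_i} - \varvec{\mu}_0^{\varvec{\tau}_i} \right) + C_1 \\ &= -\frac{1}{2} \left[ \sum_{u,v \in \varvec{\tau}} \mu_0(u) \left[ \, \mathbf{K} \, \right]_{uv}^{-1} \big ( \mu_0(v) - 2\, m_0(v) \big ) + \sum_{i = 1}^{M} \sum_{u,v \in \varvec{\tau}_i} \mu_0(u) \left[ \, \varvec{\varPsi}_i \, \right]_{uv}^{-1} \big ( \mu_0(v) - 2\, y_i(v) \big ) \right] + C_2, \end{aligned} \]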
where \(C_1\) and \(C_2\) denote terms that do not depend on \(\varvec{\mu }_0^{\varvec{\tau }}\), and where we entirely decomposed the vector-matrix products. We now factorise the expression according to the common timestamps between \(\varvec{\tau }_i\) and \(\varvec{\tau }\). Since \(\varvec{\tau }_i \subset \varvec{\tau }\) for all \(i\), let us introduce a dummy indicator function \(\mathbbm {1}_{\varvec{\tau }_i} = \mathbbm {1}_{\{u,v \in \varvec{\tau }_i\}}\) to write:
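\[ {\mathcal{L}}_1 = -\frac{1}{2} \left[ \sum_{u,v \in \varvec{\tau}} \mu_0(u) \left[ \, \mathbf{K} \, \right]_{uv}^{-1} \big ( \mu_0(v) - 2\, m_0(v) \big ) + \sum_{i = 1}^{M} \sum_{u,v \in \varvec{\tau}} \mathbbm{1}_{\varvec{\tau}_i}\, \mu_0(u) \left[ \, \varvec{\varPsi}_i \, \right]_{uv}^{-1} \big ( \mu_0(v) - 2\, y_i(v) \big ) \right] + C_2. \]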
Subsequently, we can gather the sums as:
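\[ {\mathcal{L}}_1 = -\frac{1}{2} \left[ \sum_{u,v \in \varvec{\tau}} \mu_0(u) \left( \left[ \, \mathbf{K} \, \right]_{uv}^{-1} + \sum_{i = 1}^{M} \left[ \, \tilde{\varvec{\varPsi}}_i \, \right]_{uv}^{-1} \right) \mu_0(v) - 2 \sum_{u,v \in \varvec{\tau}} \mu_0(u) \left( \left[ \, \mathbf{K} \, \right]_{uv}^{-1} m_0(v) + \sum_{i = 1}^{M} \left[ \, \tilde{\varvec{\varPsi}}_i \, \right]_{uv}^{-1} \tilde{y}_i(v) \right) \right] + C_2, \]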
where the vectors \(\tilde{\mathbf{y }}_i^{\varvec{\tau }}\) and matrices \(\tilde{\varvec{\varPsi }}_i\) correspond to \(\mathbf{y }_i\) and \(\varvec{\varPsi }_i\) padded with zeros outside of \(\varvec{\tau }_i\):

\(\tilde{\mathbf{y }}_i^{\varvec{\tau }} = \mathbbm {1}_{\varvec{\tau }_i} y_i( \varvec{\tau }) \),

\(\left[ \, \tilde{\varvec{\varPsi }}_i \,\right] _{uv}^{-1} = \mathbbm {1}_{\varvec{\tau }_i} \left[ \, \varvec{\varPsi }_i \,\right] _{uv}^{-1}, \ \forall u,v \in \varvec{\tau }\).
By identification of the quadratic form, we reach:
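\[ p\left( \varvec{\mu}_0^{\varvec{\tau}} \mid \left\{ \mathbf{y}_i^{\varvec{\tau}_i} \right\}_i \right) = {\mathcal{N}} \left( \varvec{\mu}_0^{\varvec{\tau}}; \, {\widehat{m}}_0(\varvec{\tau}), \, \widehat{\mathbf{K}} \right), \]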
with,

\(\widehat{\mathbf{K }}= \left( \mathbf{K }^{-1} + \sum \limits _{i = 1}^{M}\tilde{\varvec{\varPsi }}_i^{-1} \right) ^{-1}\),

\({\widehat{m}}_0(\varvec{\tau }) = \widehat{\mathbf{K }}\left( \mathbf{K }^{-1} \mathbf{m }_0^{\varvec{\tau }}+ \sum \limits _{i = 1}^{M}\tilde{\varvec{\varPsi }}_i^{-1} \tilde{\mathbf{y }}_i^{\varvec{\tau }} \right) \).
\(\square \)
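As a complement to Proposition 4, these two formulas translate directly into code. The following minimal R sketch is independent of the MagmaClustR implementation and assumes that the prior quantities K (the matrix \(\mathbf{K}\)), m0 (the vector \(\mathbf{m }_0^{\varvec{\tau }}\)), and the lists Psi_tilde_inv (the \(\tilde{\varvec{\varPsi }}_i^{-1}\)) and y_tilde (the \(\tilde{\mathbf{y }}_i^{\varvec{\tau }}\)) have already been evaluated on the working grid \(\varvec{\tau }\).

  # K_hat = ( K^{-1} + sum_i Psi_tilde_i^{-1} )^{-1}
  K_inv <- solve(K)
  K_hat <- solve(K_inv + Reduce(`+`, Psi_tilde_inv))

  # m0_hat = K_hat ( K^{-1} m0 + sum_i Psi_tilde_i^{-1} y_tilde_i )
  m0_hat <- K_hat %*% (K_inv %*% m0 + Reduce(`+`, Map(`%*%`, Psi_tilde_inv, y_tilde)))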
8.2 Proof of Propositions 2 and 3
Since the central part of the proofs is similar for both propositions, we detail the calculations by denoting \(\varTheta = \{ \theta _0, \left\{ \theta _i \right\} _i, \left\{ \sigma _i^2 \right\} _i\}\) for generality, and we distinguish the two cases only when necessary. Before considering the maximisation, we notice that the joint density can be developed as:
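\[ p\left( \left\{ \mathbf{y}_i \right\}_i, \mu_0(\mathbf{t}) \mid \varTheta \right) = p\left( \mu_0(\mathbf{t}) \mid \theta_0 \right) \prod_{i = 1}^{M} p\left( \mathbf{y}_i \mid \mu_0(\mathbf{t}_i), \theta_i, \sigma_i^2 \right). \]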
The expectation is taken over \(p(\mu _0(\mathbf{t }) \mid \left\{ \mathbf{y }_i \right\} _i)\) though we write it \({\mathbb{E}}\) for simplicity. We have:
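\[ f(\varTheta ) = {\mathbb{E}}\left[ \log p\left( \left\{ \mathbf{y}_i \right\}_i, \mu_0(\mathbf{t}) \mid \varTheta \right) \right] = {\mathbb{E}}\left[ \log p\left( \mu_0(\mathbf{t}) \mid \theta_0 \right) \right] + \sum_{i = 1}^{M} {\mathbb{E}}\left[ \log p\left( \mathbf{y}_i \mid \mu_0(\mathbf{t}_i), \theta_i, \sigma_i^2 \right) \right], \]
where \(\mu _0\) is the only random quantity under \({\mathbb{E}}\), appearing either as the argument or as the mean of a Gaussian density.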
Lemma 1
Let \(X \in {\mathbb{R}}^N\) be a Gaussian random vector \(X \sim {\mathcal{N}} \left( m , \mathbf{K } \right) \), let \(b \in {\mathbb{R}}^N\), and let \(\mathbf{S }\) be an \(N \times N\) covariance matrix. Then:
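\[ {\mathbb{E}} \left[ \left( X - b \right)^{\top} \mathbf{S}^{-1} \left( X - b \right) \right] = \left( m - b \right)^{\top} \mathbf{S}^{-1} \left( m - b \right) + {\text{Tr}} \left( \mathbf{S}^{-1} \mathbf{K} \right). \]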
Proof
(Lemma 1)
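\[ \begin{aligned} {\mathbb{E}} \left[ \left( X - b \right)^{\top} \mathbf{S}^{-1} \left( X - b \right) \right] &= {\mathbb{E}} \left[ {\text{Tr}} \left( \mathbf{S}^{-1} \left( X - b \right) \left( X - b \right)^{\top} \right) \right] \\ &= {\text{Tr}} \left( \mathbf{S}^{-1} \, {\mathbb{E}} \left[ \left( X - b \right) \left( X - b \right)^{\top} \right] \right) \\ &= {\text{Tr}} \left( \mathbf{S}^{-1} \left( \mathbf{K} + \left( m - b \right) \left( m - b \right)^{\top} \right) \right) \\ &= \left( m - b \right)^{\top} \mathbf{S}^{-1} \left( m - b \right) + {\text{Tr}} \left( \mathbf{S}^{-1} \mathbf{K} \right). \end{aligned} \]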
\(\square \)
Since X and b play symmetric roles in the computation of the conditional expectation, we can apply the lemma regardless of the position of \(\mu _0\) in each of the \(M+1\) expectations involved. Applying Lemma 1 to our previous expression of \(f(\varTheta )\), we obtain:
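\[ f(\varTheta ) = \Big [ \log {\mathcal{N}} \left( {\widehat{m}}_0(\mathbf{t }); m_0(\mathbf{t }), \mathbf{K }_{\theta _0}^{\mathbf{t }} \right) - \frac{1}{2} {\text{Tr}} \left( \widehat{\mathbf{K }}^{\mathbf{t }} {\mathbf{K }_{\theta _0}^{\mathbf{t }}}^{-1} \right) \Big ] + \sum _{i = 1}^{M} \Big [ \log {\mathcal{N}} \big ( \mathbf{y }_i; {\widehat{m}}_0(\mathbf{t }_i), \varvec{\varPsi }_{\theta _i, \sigma _i^2}^{\mathbf{t }_i} \big ) - \frac{1}{2} {\text{Tr}} \big ( \widehat{\mathbf{K }}^{\mathbf{t }_i} {\varvec{\varPsi }_{\theta _i, \sigma _i^2}^{\mathbf{t }_i}}^{-1} \big ) \Big ], \]
where \(\widehat{\mathbf{K }}^{\mathbf{t }_i}\) denotes the sub-matrix of \(\widehat{\mathbf{K }}^{\mathbf{t }}\) restricted to the timestamps \(\mathbf{t }_i\).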
We recall that, at the M step, \({\widehat{m}}_0(\mathbf{t })\) is a known constant, computed at the previous E step. Thus, we identify here the characteristic expression of several Gaussian log-likelihoods along with associated trace correction terms. Moreover, each set of hyperparameters only appears in independent terms of the whole function to maximise. Hence, the global maximisation problem can be separated into several maximisations of sub-functions according to the hyperparameters being optimised. Regardless of additional assumptions, the hyperparameters \(\theta _0\), controlling the covariance matrix of the mean process, appear in a function which is exactly a Gaussian log-likelihood, \(\log {\mathcal{N}} \left( {\widehat{m}}_0(\mathbf{t }); m_0(\mathbf{t }) , \mathbf{K }_{\theta _0}^{\mathbf{t }} \right) \), added to the corresponding trace term, \( -\dfrac{1}{2} {\text{Tr}} \left( \widehat{\mathbf{K }}^{\mathbf{t }} {\mathbf{K }_{\theta _0}^{\mathbf{t }}}^{-1} \right) \). This function can be maximised independently from the other parameters, giving the first part of the results in Propositions 2 and 3.
Although the idea is analogous for the remaining hyperparameters, we have to distinguish two cases regarding the assumptions of the model. If each individual is supposed to have its own set \(\left\{ \theta _i, \sigma _i^2 \right\} \), which can thus be optimised independently from the observations and hyperparameters of the other individuals, we identify a sum of M Gaussian log-likelihoods, \(\log {\mathcal{N}} \big ( \mathbf{y }_i; {\widehat{m}}_0(\mathbf{t }_i) , \varvec{\varPsi }_{\theta _i, \sigma _i^2}^{\mathbf{t }_i} \big )\), and the corresponding trace terms, \( -\dfrac{1}{2} {\text{Tr}}\big ( \widehat{\mathbf{K }}^{\mathbf{t }_i} {\varvec{\varPsi }_{\theta _i, \sigma _i^2}^{\mathbf{t }_i}}^{-1} \big )\). This results in M independent maximisation problems on the corresponding functions, proving Proposition 2. Conversely, if we assume that all individuals share the same hyperparameters (i.e. \(\left\{ \theta , \sigma ^2 \right\} = \left\{ \theta _i, \sigma _i^2 \right\} , \forall i \in {\mathcal{I}}\)), we can no longer divide the problem into M sub-maximisations, and the whole sum over individuals must be optimised using the observations of all individuals. This case corresponds to the second part of Proposition 3. \(\square \)
Data availability
The synthetic data and table of results are available at https://github.com/ArthurLeroy/MAGMA/tree/master/Simulations.
Code availability
The R code associated with the present work is available at https://github.com/ArthurLeroy/MAGMA. The current version of the R package implementing an extended version of Magma is available at https://github.com/ArthurLeroy/MagmaClustR.
References
Alaa, A. M., & van der Schaar, M. (2017). Bayesian inference of individualized treatment effects using multitask Gaussian processes. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., & Garnett, R. (Eds.) Advances in neural information processing systems 30, Curran Associates, Inc., pp. 3424–3432.
Álvarez, M. A., & Lawrence, N. D. (2011). Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12(41), 1459–1500.
Álvarez, M. A., Rosasco, L., & Lawrence, N. D. (2012). Kernels for vector-valued functions: A review. Foundations and Trends® in Machine Learning, 4(3), 195–266. https://doi.org/10.1561/2200000036
Biernacki, C., Celeux, G., & Govaert, G. (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41(3), 561–575. https://doi.org/10.1016/S0167-9473(02)00163-9
Bishop, C. M. (2006). Pattern recognition and machine learning, information science and statistics. Springer.
Bonilla, E. V., Chai, K. M., & Williams, C. (2008). Multitask Gaussian process prediction. In Platt, J. C., Koller, D., Singer, Y., Roweis, S. T. (Eds.) Advances in neural information processing systems 20, Curran Associates, Inc., pp. 153–160.
Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75. https://doi.org/10.1023/A:1007379606734
Casella, G. (1985). An introduction to empirical Bayes data analysis. The American Statistician, 39(2), 83–87. https://doi.org/10.2307/2682801
Clingerman, C., & Eaton, E. (2017). Lifelong learning with Gaussian processes. In: Ceci, M., Hollmén, J., Todorovski, L., Vens, C., Džeroski, S. (Eds.) Machine learning and knowledge discovery in databases (Vol. 10535, pp. 690–704). Springer. https://doi.org/10.1007/978-3-319-71246-8_42
Crainiceanu, C. M., & Goldsmith, A. J. (2010). Bayesian functional data analysis using WinBUGS. Journal of Statistical Software, 32(11).
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B (Methodological), 39(1), 1–38.
Duvenaud, D. (2014). Automatic model construction with Gaussian processes. Thesis, University of Cambridge, https://doi.org/10.17863/CAM.14087
Ferraty, F., & Vieu, P. (2006). Nonparametric functional data analysis: Theory and practice. Springer.
Hayashi, K., Takenouchi, T., Tomioka, R., & Kashima, H. (2012). Self-measuring similarity for multitask Gaussian process. Transactions of the Japanese Society for Artificial Intelligence, 27(3), 103–110. https://doi.org/10.1527/tjsai.27.103
McLachlan, G. J., & Krishnan, T. (2007). The EM algorithm and extensions. Wiley.
Morales, J. L., & Nocedal, J. (2011). Remark on algorithm L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization. ACM Transactions on Mathematical Software, 38(1), 7:1–7:4. https://doi.org/10.1145/2049662.2049669
Moreno-Muñoz, P., Artés-Rodríguez, A., & Álvarez, M. A. (2019). Continual multitask Gaussian processes. arXiv:1911.00002 [cs, stat].
Nguyen, T. V., & Bonilla, E. V. (2014). Collaborative multi-output Gaussian processes. In Proceedings of the thirtieth conference on uncertainty in artificial intelligence, AUAI Press, UAI’14, pp. 643–652.
Nocedal, J. (1980). Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151), 773–782. https://doi.org/10.1090/S0025-5718-1980-0572855-7
Quiñonero-Candela, J., Rasmussen, C. E., & Williams, C. K. I. (2007). Approximation methods for Gaussian process regression. MIT Press.
Rakitsch, B., Lippert, C., Borgwardt, K., & Stegle, O. (2013). It is all in the noise: Efficient multitask Gaussian process inference with structured residuals. In Advances in neural information processing systems 26, Curran Associates, Inc., pp. 1466–1474
Ramsay, J. O., & Silverman, B. W. (2005). Functional data analysis. Springer.
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning, adaptive computation and machine learning. MIT Press.
Rice, J. A., & Silverman, B. W. (1991). Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of the Royal Statistical Society Series B (Methodological), 53(1), 233–243.
Schwaighofer, A., Tresp, V., & Yu, K. (2004). Learning Gaussian process kernels via hierarchical Bayes. Advances in Neural Information Processing Systems, 17, 8.
Shi, J. Q., & Cheng, Y. (2014). Gaussian process function data analysis R package ‘GPFDA’. https://cran.r-project.org/web/packages/GPFDA/GPFDA.pdf
Shi, J. Q., & Choi, T. (2011). Gaussian process regression analysis for functional data. CRC Press.
Shi, J., Murray-Smith, R., & Titterington, D. (2005). Hierarchical Gaussian process mixtures for regression. Statistics and Computing, 15(1), 31–41. https://doi.org/10.1007/s11222-005-4787-7
Shi, J. Q., Wang, B., Murray-Smith, R., & Titterington, D. M. (2007). Gaussian process functional regression modeling for batch data. Biometrics, 63(3), 714–723. https://doi.org/10.1111/j.1541-0420.2007.00758.x
Snelson, E., & Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In Advances in neural information processing systems (Vol. 18), MIT Press.
Swersky, K., Snoek, J., & Adams, R. P. (2013). Multitask Bayesian optimization. Advances in Neural Information Processing Systems, 26, 2004–2012.
Thompson, W. K., & Rosen, O. (2008). A Bayesian model for sparse functional data. Biometrics, 64(1), 54–63. https://doi.org/10.1111/j.1541-0420.2007.00829.x
Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the twelfth international conference on artificial intelligence and statistics, PMLR, pp. 567–574.
Ueda, N., & Nakano, R. (1998). Deterministic annealing EM algorithm. Neural Networks, 11(2), 271–282. https://doi.org/10.1016/S0893-6080(97)00133-0
Williams, C., Klanke, S., Vijayakumar, S., & Chai, K. M. (2009). Multitask Gaussian process learning of robot inverse dynamics. Advances in Neural Information Processing Systems, 21, 265–272.
Yang, J., Zhu, H., Choi, T., & Cox, D. D. (2016). Smoothing and mean-covariance estimation of functional data with a Bayesian hierarchical model. Bayesian Analysis, 11(3), 649–670. https://doi.org/10.1214/15-BA967
Yang, J., Cox, D. D., Lee, J. S., Ren, P., & Choi, T. (2017). Efficient Bayesian hierarchical functional data analysis with basis function approximations using Gaussian-Wishart processes. Biometrics, 73(4), 1082–1091. https://doi.org/10.1111/biom.12705
Yu, K., Tresp, V., & Schwaighofer, A. (2005). Learning Gaussian processes from multiple tasks. In Proceedings of the 22nd international conference on machine learning, ACM, ICML ’05, pp. 1012–1019. https://doi.org/10.1145/1102351.1102479
Zhu, J., & Sun, S. (2014). Multitask sparse Gaussian processes with improved multitask sparsity regularization. In Pattern recognition, Springer, pp. 54–62. https://doi.org/10.1007/978-3-662-45646-0_6
Acknowledgements
The authors warmly thank Andy Marc, Olivier Dupas, Richard Martinez and the French Swimming Federation for providing data and helping in the analysis of the results. Benjamin Guedj acknowledges partial support by the U.S. Army Research Laboratory and the U.S. Army Research Office, and by the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/R013616/1. Benjamin Guedj acknowledges partial support from the French National Agency for Research, grants ANR-18-CE40-0016-01 and ANR-18-CE23-0015-02.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Editor: Ulf Brefeld.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Leroy, A., Latouche, P., Guedj, B. et al. MAGMA: inference and prediction using multitask Gaussian processes with common mean. Mach Learn 111, 1821–1849 (2022). https://doi.org/10.1007/s10994-022-06172-1