We evaluate our Magma algorithm on synthetic data and two real datasets. Classical GP regression, performed on each task separately, is used as the baseline for predictions. While it is not expected to perform well on the datasets considered, the comparison highlights the benefit of multi-task approaches. To our knowledge, the only alternative to Magma is the GPFDA algorithm from Shi et al. (2007), Shi & Choi (2011), described in Sect. 3.4, and the associated R package GPFDA, which is used in the experiments. Throughout the section, the standard Exponentiated Quadratic kernel (see Eq. (1)) is used both for simulating the data and for modelling the covariance structures in the three algorithms. Hence, each kernel is associated with \(\theta = \{ v, \ell \}, \ v, \ell \in {\mathbb{R}}^{+}\), a set of variance and length-scale hyper-parameters, respectively. Each simulated dataset has been drawn from the sampling scheme below:
1. Draw a random working grid \(\mathbf{t }\subset \left[ \, 0,10 \,\right] \) of \(N = 200\) timestamps, and a number M of individuals.
2. Define a prior mean function: \(m_0(t) = at + b, \ \forall t \in \mathbf{t }\), where \(a \in \left[ \, -2, 2 \,\right] \) and \(b \in \left[ \, 0, 10 \,\right] \) are drawn uniformly.
3. Draw hyper-parameters uniformly for \(\mu _0\)'s kernel: \(\theta _0 = \{ v_0, \ell _0 \}\), where \(v_0 \in \left[ \, 1, \exp (5) \,\right] \) and \(\ell _0 \in \left[ \, 1, \exp (2) \,\right] \).
4. Draw \(\mu _0 (\mathbf{t }) \sim {\mathcal{N}} \left( m_0(\mathbf{t }), \mathbf{K }_{\theta _0}^{\mathbf{t }} \right) \).
5. \(\forall i \in {\mathcal{I}}\), draw \(v_i \in \left[ \, 1, \exp (5) \,\right] \), \(\ell _i \in \left[ \, 1, \exp (2) \,\right] \), and \(\sigma _i^2 \in \left[ \, 0, 1 \,\right] \) uniformly.
6. \(\forall i \in {\mathcal{I}}\), draw a subset \(\mathbf{t }_i\subset \mathbf{t }\) of \(N_i = 30\) timestamps uniformly, and draw \(\mathbf{y }_i\sim {\mathcal{N}} \left( \mu _0(\mathbf{t }_i), \varvec{\varPsi }_{\theta _i, \sigma _i^2}^{\mathbf{t }_i} \right) \).
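For concreteness, this sampling scheme can be sketched in Python with NumPy (an illustrative reimplementation, not the authors' code; the function names and the diagonal jitter term are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def eq_kernel(t1, t2, v, ell):
    """Exponentiated Quadratic kernel with variance v and length-scale ell."""
    d2 = (t1[:, None] - t2[None, :]) ** 2
    return v * np.exp(-d2 / (2 * ell ** 2))

def simulate_dataset(M=20, N=200, N_i=30):
    # Step 1: random working grid of N timestamps in [0, 10]
    t = np.sort(rng.uniform(0, 10, N))
    # Step 2: linear prior mean m_0(t) = a t + b, a and b drawn uniformly
    a, b = rng.uniform(-2, 2), rng.uniform(0, 10)
    m0 = a * t + b
    # Step 3: hyper-parameters for mu_0's kernel
    v0, l0 = rng.uniform(1, np.exp(5)), rng.uniform(1, np.exp(2))
    # Step 4: draw the mean process mu_0 on the grid (jitter added for stability)
    K0 = eq_kernel(t, t, v0, l0) + 1e-6 * np.eye(N)
    mu0 = rng.multivariate_normal(m0, K0)
    # Steps 5-6: per-individual hyper-parameters, timestamp subset and observations
    data = []
    for _ in range(M):
        v_i, l_i = rng.uniform(1, np.exp(5)), rng.uniform(1, np.exp(2))
        sigma2_i = rng.uniform(0, 1)
        idx = np.sort(rng.choice(N, N_i, replace=False))
        Psi = eq_kernel(t[idx], t[idx], v_i, l_i) + sigma2_i * np.eye(N_i)
        y_i = rng.multivariate_normal(mu0[idx], Psi)
        data.append((t[idx], y_i))
    return t, mu0, data
```

For the Common HP variant, steps 5 would simply draw \(\{v_i, \ell_i, \sigma_i^2\}\) once and reuse it for every individual.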
This procedure provides a synthetic dataset \(\left\{ \mathbf{t }_i, \mathbf{y }_i \right\} _i\) and its associated mean process \(\mu _0(\mathbf{t })\). These quantities are used to train the model, make predictions with each algorithm, and then compute errors in \(\mu _0\) estimation and in forecasts. We recall that the Magma algorithm enables two different settings depending on the model's assumption over hyper-parameters (HP), referred to as Common HP and Different HP in the following. To test these two contexts, differentiated datasets have been generated by drawing Common HP data or Different HP data for each individual at step 5. We previously presented the idea of the model used in GPFDA and, although the algorithm has many features (in particular regarding the type and number of input variables), it cannot yet handle timestamps that differ among individuals. Therefore, two frameworks are considered, Common grid and Uncommon grid, to take this specification into account. Thus, the comparison between the different methods can only be performed on data generated under the settings Common HP and Common grid, and the effect of the other settings on Magma is analysed separately. Moreover, the initialisation of the prior mean function, \(m_0(\cdot )\), is set to the constant 0 for each algorithm. Except in the experiments where the influence of the number of individuals is analysed, the generic value is \(M = 20\). In the case of prediction on unobserved timestamps for a new individual, the first 20 data points are used as observations, and the remaining 10 are taken as test values. Optimisation of the hyper-parameters is performed by likelihood maximisation, using the L-BFGS-B algorithm (Morales & Nocedal, 2011; Nocedal, 1980) in all methods. The convergence criterion for all algorithms is considered reached when the difference in log-likelihood between two iterations falls below \(10^{-2}\).
In general, the EM algorithm in Magma converges in a few iterations, typically fewer than 5 with the Common HP setting, and rarely more than 15 even with the Different HP setting.
Illustration on a simple example
To illustrate the multi-task approach of Magma, Fig. 2 displays a comparison between standard GP regression and Magma on a simple example, from a dataset simulated according to the scheme above under the Uncommon grid/Common HP setting. Given the observed data (in black), values on a thin grid of unobserved timestamps are predicted and compared, in particular, with the true test values (in red). As expected, the GP regression provides a good fit close to the data points and then quickly drifts back to the prior mean 0 with increasing uncertainty. Conversely, although the prior mean is also initialised at 0 in Magma, the hyper-posterior distribution of \(\mu _0\) (dashed line) is estimated from all individuals in the training dataset. This process acts as an informed prior helping GP prediction for the new individual, even far from its own observations. More precisely, three phases can be distinguished according to the level of information coming from the data. In the first, close to the observed data (\(t \in \left[ \, 1,7 \,\right] \)), the two processes behave similarly, except for a slight increase in the variance for Magma, which is expected since the prediction also takes uncertainty over \(\mu _0\) into account (see Eq. (3)). In the second, on intervals of unobserved timestamps containing data points from the training dataset (\(t \in \left[ \, 0,1 \,\right] \cup \left[ \, 7,10 \,\right] \)), the prediction is guided by the information coming from other individuals through \(\mu _0\); the mean trajectory remains coherent and the uncertainty increases only slightly. In the third phase, where no observations are available, either from the new individual or from the training dataset (\(t \in \left[ \, 10,12 \,\right] \)), the prediction behaves as expected, slowly drifting back to the prior mean 0 with rapidly increasing variance.
Overall, the multi-task framework provides reliable probabilistic predictions on a wider range of timestamps, potentially outside of the usual scope of GPs.
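The drift back to the prior mean described above follows directly from the standard GP predictive equations: far from the observations, the cross-covariance vector vanishes, so the posterior mean falls back to the prior mean and the variance to the kernel variance. A minimal sketch of this mechanism (illustrative only; Magma additionally propagates uncertainty over \(\mu _0\), which this sketch omits) — passing the hyper-posterior mean of \(\mu _0\) as `mean_fn` would play the role of the informed prior:

```python
import numpy as np

def eq_kernel(t1, t2, v=1.0, ell=1.0):
    """Exponentiated Quadratic kernel."""
    return v * np.exp(-(t1[:, None] - t2[None, :]) ** 2 / (2 * ell ** 2))

def gp_predict(t_obs, y_obs, t_new, mean_fn=lambda t: np.zeros_like(t),
               v=1.0, ell=1.0, sigma2=0.1):
    """Posterior mean and variance of standard GP regression with prior mean mean_fn."""
    K = eq_kernel(t_obs, t_obs, v, ell) + sigma2 * np.eye(len(t_obs))
    Ks = eq_kernel(t_new, t_obs, v, ell)
    Kss = eq_kernel(t_new, t_new, v, ell)
    alpha = np.linalg.solve(K, y_obs - mean_fn(t_obs))
    mean = mean_fn(t_new) + Ks @ alpha       # -> mean_fn(t_new) far from data
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)  # -> prior variance v far from data
    return mean, np.diag(cov)
```

With the default zero prior mean, a prediction far from the observations returns mean 0 and variance v, which is exactly the behaviour of the standard GP in Fig. 2.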
Performance comparison on simulated datasets
Table 1 Average MSE (standard deviation) and average \(CIC_{95}\) (standard deviation) on 100 runs for GP, GPFDA and Magma
We compare the performance of Magma with alternatives in several situations and on different datasets. First, classical GP regression (GP), GPFDA and Magma are compared through their performance in prediction and in the estimation of the true mean process \(\mu _0\). In the prediction context, performance is evaluated according to the following indicators:
- the mean squared error (MSE), which compares the predicted values to the true test values at the 10 last timestamps:
$$\begin{aligned} \dfrac{1}{10} \sum \limits _{k = 21}^{30} \left( y_*^{{\text{pred}}} (t_*^k) - y_*^{{\text{true}}} (t_*^k) \right) ^2 , \end{aligned}$$
- the \(CI_{95}\) coverage (\(CIC_{95}\)), i.e. the percentage of unobserved data points effectively lying within the 95% credible interval defined from the predictive posterior distribution \(p(y_*(\mathbf{t }^{p}) \mid y_*(\mathbf{t }_{*}), \left\{ \mathbf{y }_i \right\} _i)\):
$$\begin{aligned} 100 \times \dfrac{1}{10} \sum \limits _{k = 21}^{30} \mathbbm {1}_{ \{ y_*^{{\text{true}}}(t_*^k) \in \ CI_{95} \} }. \end{aligned}$$
The \(CIC_{95}\) provides insight into the reliability of the predictive variance and should be as close to 95% as possible. Other values would indicate a tendency to underestimate or overestimate the uncertainty. Recall that GPFDA uses B-splines to estimate the mean process and does not account for uncertainty, contrary to a probabilistic framework such as Magma. However, a measure of uncertainty based on an empirical variance estimated from training curves is proposed (see Shi & Cheng, 2014, Section 3.2.1). In practice, this measure systematically overestimates the true variance, and the resulting 95% empirical interval coverage is generally equal or close to 100%.
In the estimation context, the performances are evaluated thanks to another MSE, which compares the estimations to the true values of \(\mu _0\) at all timestamps:
$$\begin{aligned} \dfrac{1}{M} \sum \limits _{i = 1}^{M}\dfrac{1}{N_i} \sum \limits _{k = 1}^{N_i} \left( \mu _0^{{\text{pred}}} (t_i^k) - \mu _0^{{\text{true}}} (t_i^k) \right) ^2 . \end{aligned}$$
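The indicators above are straightforward to implement. A sketch in Python with NumPy (function names are ours), where the 95% credible interval is taken as the predictive mean ± 1.96 standard deviations under the Gaussian predictive distribution:

```python
import numpy as np

def mse(pred, truth):
    """Mean squared error between predicted and true values."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return np.mean((pred - truth) ** 2)

def cic95(pred_mean, pred_var, truth):
    """CI_95 coverage: percentage of true points inside the 95% credible interval."""
    pred_mean = np.asarray(pred_mean)
    sd = np.sqrt(np.asarray(pred_var))
    lower, upper = pred_mean - 1.96 * sd, pred_mean + 1.96 * sd
    inside = (np.asarray(truth) >= lower) & (np.asarray(truth) <= upper)
    return 100 * np.mean(inside)
```

The \(\mu _0\) estimation MSE is the same `mse` computation, averaged over the individuals' timestamp grids.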
Table 1 presents the results obtained over 100 datasets, where the models are trained on \(M = 20\) individuals, each observed on \(N = 30\) common timestamps. As expected, both multi-task methods lead to better results than GP. However, Magma outperforms GPFDA, both in the estimation of \(\mu _0\) and in predictive performance. In terms of error as well as uncertainty quantification, Magma provides more accurate results, in particular with a \(CI_{95}\) coverage close to the expected 95% value. Each method presents a rather high standard deviation for the MSE in prediction, which is due to a few datasets with particularly difficult values to predict, although most cases lead to small errors. This behaviour is expected, since such 10-timestamp-ahead forecasts are sometimes genuinely difficult. Figure 3 also shows that Magma consistently provides lower errors as well as less of the pathological behaviour that may sometimes occur with the B-splines modelling used in GPFDA.
To highlight the effect of the number of individuals M on performance, Fig. 3 reports the same 100-run trial as previously, for different values of M. The boxplots exhibit, for each method, the behaviour of the prediction and estimation MSE as information is added to the training dataset. No discernible changes appear beyond \(M > 200\). As expected, the right panel shows that adding information from new individuals improves the estimation of \(\mu _0\), leading to very low errors for high values of M, in particular for Magma. Meanwhile, the left panel exhibits prediction performance that is essentially unchanged with respect to the values of M, except for some random fluctuations. This property is expected for GP regression, since no external information from the training dataset is used in this context. For both multi-task algorithms though, the estimation of \(\mu _0\) reduces the prediction error by an order of magnitude, even with only a few training individuals. Furthermore, since a new individual behaves independently through \(f_*\), it is natural for a 10-point-ahead forecast to present intrinsic variations, despite an adequate estimation of the shared mean process.
To illustrate the advantage of multi-task methods even for \(M = 20\), Fig. 4 displays the evolution of the MSE according to the number of timestamps N assumed to be observed for the new individual on which we make predictions. These predictions are still computed on the last 10 timestamps, but in this experiment we only observe the first 5, 10, 15, or 20 timestamps, in order to vary the volume of information and the distance from training observations to targets. We observe on Fig. 4 that, as expected in a GP framework, the closer the observations are to the targets, the better the results. However, for multi-task approaches and in particular for Magma, the prediction remains consistently adequate even with few observations. Once more, sharing information across individuals significantly helps the prediction, even for small values of M or few observed data points.
Magma’s specific settings
As previously discussed, different settings are available for Magma according to the nature of the data and the model hypotheses. First, the Common grid setting corresponds to cases where all individuals share the same timestamps, whereas Uncommon grid is used otherwise. Moreover, Magma allows either identical hyper-parameters for all individuals or individual-specific ones, as discussed in Sect. 2.2. To evaluate the effect of the different settings, performance in prediction and in \(\mu _0\)'s estimation is evaluated in the following cases in Table 2:
- Common HP, where data are simulated with a common set of hyper-parameters for all individuals, and Proposition 3 is used for inference in Magma,
- Different HP, where data are simulated with its own set of hyper-parameters for each individual, and Proposition 2 is used for inference in Magma,
- Common HP on different HP data, where data are simulated with its own set of hyper-parameters for each individual, and Proposition 3 is used for inference in Magma.
Note that the first line (Common grid/Common HP) of Table 2 is identical to the corresponding results in Table 1, providing reference values that are significantly better than for the other methods. The results in Table 2 indicate that Magma's performance is not significantly altered by the settings used or the nature of the simulated data. To confirm the robustness of the method, the Common HP setting was also applied to data generated by drawing different values of hyper-parameters for each individual (Different HP data). In this case, performance in prediction and in the estimation of \(\mu _0\) deteriorates slightly, although Magma still provides quite reliable forecasts. This experiment also highlights a particularity of the Different HP setting: looking at the \(\mu _0\) estimation performance, we observe a significant decrease in the \(CI_{95}\) coverage, due to numerical instability in some pathological cases. Numerical issues, in particular during matrix inversions, are classical problems in the GP literature and, because of the potentially large number of different hyper-parameters to train, the probability that at least one of them leads to a nearly singular matrix increases. In that case, one individual may overwhelm the others in the computation of \(\mu _0\)'s hyper-posterior (see Proposition 4), and thus lead to an underestimated posterior variance. This problem does not occur in the Common HP setting, since sharing the same hyper-parameters prevents any individual's covariance matrix from dominating the others. Thus, unless one specifically wants to smooth multiple curves presenting markedly different behaviours, keeping Common HP as the default setting appears to be a reasonable choice. Note also that the estimation of \(\mu _0\) is slightly better for the common than for the uncommon grid, since the estimation problem on the union of different timestamps is generally harder. However, this feature only depends on the nature of the data.
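A standard remedy for such near-singular covariance matrices, common throughout the GP literature (and not specific to Magma), is to add a small jitter term to the diagonal before factorisation, increasing it until the Cholesky decomposition succeeds. A minimal sketch:

```python
import numpy as np

def safe_cholesky(K, max_jitter=1e-2):
    """Cholesky factorisation with progressively increased diagonal jitter,
    a common remedy for nearly singular covariance matrices."""
    jitter = 1e-10
    n = K.shape[0]
    while jitter <= max_jitter:
        try:
            return np.linalg.cholesky(K + jitter * np.eye(n))
        except np.linalg.LinAlgError:
            jitter *= 10  # retry with a larger diagonal perturbation
    raise np.linalg.LinAlgError("matrix not positive definite even with jitter")
```

Working with the Cholesky factor (rather than explicit matrix inverses) is itself the numerically preferred way to solve the linear systems appearing in GP computations.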
Table 2 Average MSE (standard deviation) and average \(CIC_{95}\) (standard deviation) on 100 runs for the different settings of Magma
Running time comparisons
The counterpart of the more accurate and general results provided by Magma is a natural increase in running time. Table 3 exhibits the raw and relative training times for GPFDA and Magma (prediction times are negligible and comparable in both cases), on data coming from the simulation scheme with varying values of M on a Common grid of \(N = 30\) timestamps. The algorithms were run under R version 3.6.1, on a laptop with a dual-core processor clocked at 2.90 GHz and 8 GB of RAM. The reported computing times are in seconds, and for small to moderate datasets (\(N \simeq 10^3\), \(M \simeq 10^4\)) the procedures ran in a few minutes to a few hours. The difference between the two algorithms is due to GPFDA modelling \(\mu _0\) as a deterministic function through B-splines smoothing, whereas Magma accounts for uncertainty. The ratio of computing times between the two methods tends to decrease as M increases, and stabilises around 2 for higher numbers of training individuals. This behaviour comes from the E step in Magma, which is incompressible and quite insensitive to the value of M. Roughly speaking, one needs to pay twice the computing price of GPFDA for Magma to provide (significantly) more accurate predictions and uncertainty over \(\mu _0\). Table 4 provides running times of Magma according to its different settings, with \(M=20\). Because the complexity is linear in M in each case, the ratio in running times would remain roughly similar regardless of the value of M. Prediction time appears negligible compared to training time, and generally takes less than one second. Besides, the Different HP setting increases the running time since, in this context, M maximisations (instead of one for Common HP) are required at each EM iteration. In this case, the prediction also takes slightly longer because of the need to optimise hyper-parameters for the new individual.
Although the nature of the grid of timestamps does not matter in itself, a key limitation lies in the dimension N of the pooled set of timestamps, which tends to get bigger when individuals have different timestamps from one another.
Table 3 Average (standard deviation) training time (in s) for Magma and GPFDA on 100 runs for different numbers M of individuals in the training dataset
Table 4 Average (standard deviation) training and prediction time (in s) on 100 runs for different settings of Magma
Application of Magma on swimmers' progression curves
Data and problem statement
We consider the problem of performance prediction in competition for French swimmers. The French Swimming Federation provided us with an anonymised dataset compiling the age and results of its members between 2000 and 2016. For each competitor, the race times are registered for 100m freestyle competitions (50m swimming pool). The database contains results from 1731 women and 7876 men, each of them compiling an average of 22.2 data points (min = 15, max = 61) and 12 data points (min = 5, max = 57), respectively. In the following, the age of the \(i\)th swimmer is considered as the input variable (timestamp t) and the performance (in s) on a 100m freestyle as the output (\(y_i(t)\)). For reasons of confidentiality and ownership, the raw dataset cannot be published. The analysis focuses on the youth period, from 10 to 20 years, where the progression is the most noticeable. To obtain relevant time series, we retained only individuals having a sufficient number of data points (\(N_i \ge 5\)) over the considered time period. For a young swimmer observed during their first years of competition, we aim at modelling their progression curve and predicting their future performances in the subsequent years. Since we consider a decision-making problem involving irregular time series, the GP probabilistic framework is a natural choice. Thereby, assuming that each swimmer in the database is a realisation \(y_i\) defined as previously, we expect Magma to provide multi-task predictions for a new young swimmer, benefiting from the information of other swimmers already observed at older ages. To study such a modelling and validate its efficiency in practice, we split the individuals into training and testing datasets of respective sizes:
- \(M_{{\text{train}}}^F = 1039\), for the female training set,
- \(M_{{\text{test}}}^F = 692\), for the female testing set,
- \(M_{{\text{train}}}^M = 4726\), for the male training set,
- \(M_{{\text{test}}}^M = 3150\), for the male testing set.
Inference on the hyper-parameters is performed on the training dataset in both cases. Considering the different timestamps and the relative monotony of the progression curves, the Uncommon grid/Common HP setting was used for Magma. The overall training lasted around 2 h with the same hardware configuration as for the simulations. To compute the MSE and the \(CI_{95}\) coverage, the data points of each individual in the testing set have been split into observed and testing timestamps. Since each individual has a different number of data points, the first 80% of timestamps are taken as observed, while the remaining 20% are considered as testing timestamps. Magma's predictions are compared with the true values of \(y_i\) at the testing timestamps. As previously, both GP and Magma have been initialised with a constant 0 mean function. Initial values for the hyper-parameters are also identical for all i, \(\theta _0^{{\text{ini}}} = \theta _i^{{\text{ini}}} = (\exp (1), \exp (1))\) and \(\sigma _i^{{\text{ini}}} = 0.4\). These values are the defaults in Magma and remain adequate in the context of these datasets.
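The per-individual 80%/20% split described above can be sketched as follows (illustrative Python, assuming each swimmer's series is stored as a pair of NumPy arrays; the function name is ours):

```python
import numpy as np

def split_observed_testing(t_i, y_i, frac_obs=0.8):
    """Split one swimmer's series: the first 80% of timestamps (in
    chronological order) are observed, the remaining 20% are for testing."""
    order = np.argsort(t_i)
    n_obs = int(np.floor(frac_obs * len(t_i)))
    obs, test = order[:n_obs], order[n_obs:]
    return (t_i[obs], y_i[obs]), (t_i[test], y_i[test])
```

Since the split is chronological, the testing points always lie in each swimmer's future, which matches the forecasting use case.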
Results and interpretation
The overall performance and comparison are summarised in Table 5.
Table 5 Average MSE (standard deviation) and average \(CIC_{95}\) (standard deviation) for prediction on french swimmer testing datasets
We observe that Magma still provides excellent results in this context, and naturally outperforms the predictions of standard GP regression. As the progression curves present relatively monotonic variations, avoiding the pathological behaviours that could occur with synthetic data, the MSE in prediction remains very low. The \(CI_{95}\) coverage sticks close to the expected 95% value for Magma, indicating an adequate quantification of uncertainty. To illustrate these results, an example is displayed on Fig. 5 for both men and women. For a randomly chosen testing individual, we plot the predicted progression curve (in blue), where the first 15 data points are used as observations (in black), while the remaining true data points (in red) are displayed for comparison purposes. As previously observed in the simulation study, the simple GP quickly drifts back to the prior 0 mean as soon as observations are lacking. However, for both men and women, the Magma predictions remain close to the true data, which also lie within the 95% credible interval. Even for long-term forecasts, where the mean prediction curve tends to overlap the mean process (dashed line), the true data remain within our range of uncertainty, as the credible interval widens far from observations. For clarity, we display only a few individuals from the training dataset (colourful points) in the background. The mean process (dashed line) seems to represent the main trend of progression among swimmers correctly, even though we cannot numerically compare \(\mu _0\) to any real-life analogous quantity. From a more sport-related perspective, we can note that both genders present similar patterns of progression. However, while performances are roughly similar in mean trend before the age of 14, they start to differentiate afterwards and then converge to average times with approximately a 5 s gap.
Interestingly, the difference between men and women in terms of world records in swimming competitions for the 100m freestyle is currently 4.8 s (46.91 versus 51.71). These results, obtained under reasonable hypotheses on several hundred swimmers, seem to indicate that Magma would give quite reliable predictions for a new young swimmer. Furthermore, the uncertainty provided through the predictive posterior distribution offers an adequate degree of caution in a decision-making process.