Multivariate Gaussian and Student-t process regression for multi-output prediction

Chen, Zexun; Wang, Bo; Gorban, Alexander N.

doi:10.1007/s00521-019-04687-8

Original Article
Open access
Published: 31 December 2019

Volume 32, pages 3005–3028, (2020)

Neural Computing and Applications

A Correction to this article was published on 27 February 2020

This article has been updated

Abstract

The Gaussian process model for vector-valued functions has been shown to be useful for multi-output prediction. The existing method for this model reformulates the matrix-variate Gaussian distribution as a multivariate normal distribution. Although effective in many cases, this reformulation is not always workable and is difficult to apply to other distributions, because not all matrix-variate distributions can be transformed into corresponding multivariate distributions; the matrix-variate Student-t distribution is one such case. In this paper, we propose a unified framework which is used not only to introduce a novel multivariate Student-t process regression model (MV-TPR) for multi-output prediction, but also to reformulate the multivariate Gaussian process regression (MV-GPR) in a way that overcomes some limitations of the existing methods. Both MV-GPR and MV-TPR have closed-form expressions for the marginal likelihoods and predictive distributions under this unified framework and thus can adopt the same optimization approaches as used in the conventional GPR. The usefulness of the proposed methods is illustrated through several simulated and real-data examples. In particular, we verify empirically that MV-TPR performs better on the datasets considered, including air quality prediction and bike rent prediction. Finally, the proposed methods are shown to produce profitable investment strategies in the stock markets.

1 Introduction

Over the last few decades, Gaussian process regression (GPR) has proven to be a powerful and effective method for nonlinear regression problems due to many favorable properties, such as a simple structure for obtaining and expressing uncertainty in predictions, the capability of capturing a wide variety of behavior through its parameters, and a natural Bayesian interpretation [5, 21]. In 1996, Neal [20] revealed that many Bayesian regression models based on neural networks converge to Gaussian processes (GP) in the limit of an infinite number of hidden units [26]. GP has been suggested as a replacement for supervised neural networks in nonlinear regression [17, 28] and classification [17]. Furthermore, GP has excellent capability for forecasting time series [6, 7].

Despite the popularity of GPR in various modeling tasks, there still exists a conspicuous imperfection: the majority of GPR models are implemented for a single response variable, or treat multiple response variables independently without considering their correlation [5, 25]. In order to resolve the multi-output prediction problem, Gaussian process regression for vector-valued functions has been proposed and is regarded as a pragmatic and straightforward method. The core of this method is to vectorize the multi-response variables and construct a “big” covariance, which describes the correlations between the inputs as well as between the outputs [2, 5, 8, 25]. This modeling strategy is feasible because matrix-variate Gaussian distributions can be reformulated as multivariate Gaussian distributions [8, 15]. Intrinsically, Gaussian process regression for vector-valued functions is still a conventional Gaussian process regression model since it merely vectorizes multi-response variables, which are assumed to follow a developed case of GP with a reproduced kernel. As an extension, it is natural to consider more general elliptical process models for multi-output prediction. However, the vectorization method cannot be used to extend multi-output GPR because the equivalence between vectorized matrix-variate and multivariate distributions only exists in the Gaussian case [15].

To overcome this drawback, in this paper we propose a unified framework which: (1) is used to introduce a novel multivariate Student-t process regression model (MV-TPR) for multi-output prediction, (2) is used to reformulate the multivariate Gaussian process regression (MV-GPR) that overcomes some limitations of the existing methods and (3) can be used to derive regression models of general elliptical processes. Both MV-GPR and MV-TPR have closed-form expressions for the marginal likelihoods and predictive distributions under this unified framework and thus can adopt the same optimization approaches as used in the conventional GPR. The usefulness of the proposed methods is illustrated through several simulated examples. Furthermore, we also verify empirically that MV-TPR has superiority in the prediction based on some widely used datasets, including air quality prediction and bike rent prediction. The proposed methods are then applied to stock market modeling which shows that the profitable stock investment strategies can be obtained.

The rest of the paper is organized as follows. Section 2 introduces some preliminaries of matrix-variate Gaussian and Student-t distributions with their useful properties. Section 3 presents the unified framework to reformulate the multivariate Gaussian process regression and to derive the new multivariate Student-t process regression models. Some numerical experiments by the simulated data and real data and the applications to stock market investment are presented in Sect. 4. Conclusion and discussion are given in Sect. 5.

2 Backgrounds and notations

Matrix-variate Gaussian and Student-t distributions have many useful properties, as discussed in the studies [10, 15, 31]. For completeness and easy referencing, below we list some of them which will be used in this paper.

2.1 Matrix-variate Gaussian distribution

Definition 1

The random matrix $X \in \mathbb {R}^{n \times d}$ is said to have a matrix-variate Gaussian distribution with mean matrix $M \in \mathbb {R}^{n \times d}$ and covariance matrices $\Sigma \in \mathbb {R}^{n \times n},\Omega \in \mathbb {R}^{d \times d}$ if and only if its probability density function is given by

$$\begin{aligned}&p(X|M,\Sigma ,\Omega ) = (2\pi )^{-\frac{dn}{2}}\det (\Sigma )^{-\frac{d}{2}}\det (\Omega )^{-\frac{n}{2}}\nonumber \\&\quad \times \,{\mathrm{etr}}\left( -\frac{1}{2}\Omega ^{-1}(X-M)^{\mathrm {T}}\Sigma ^{-1}(X -M)\right) , \end{aligned}$$

(1)

where ${\mathrm{etr}}(\cdot )$ is the exponential of the matrix trace, and $\Omega$ and $\Sigma$ are positive semi-definite. This is denoted $X \sim \mathcal {MN}_{n,d}(M, \Sigma , \Omega )$ or, when the dimensions are clear from context, $X \sim \mathcal {MN}(M, \Sigma , \Omega )$.
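As a minimal sketch of Definition 1, the log of density (1) can be evaluated with NumPy as follows. The function name `matrix_normal_logpdf` and the example matrices are illustrative assumptions rather than anything from the paper, and the code assumes $\Sigma$ and $\Omega$ are strictly positive definite so that the linear solves are well defined.

```python
import numpy as np


def matrix_normal_logpdf(X, M, Sigma, Omega):
    """Log of density (1) for an n x d matrix X.

    Sigma is n x n, Omega is d x d; both are assumed positive definite here.
    """
    n, d = X.shape
    R = X - M
    # slogdet returns (sign, log|det|) and avoids overflow in det().
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    _, logdet_omega = np.linalg.slogdet(Omega)
    # tr(Omega^{-1} R^T Sigma^{-1} R), using linear solves instead of inverses.
    quad = np.trace(np.linalg.solve(Omega, R.T) @ np.linalg.solve(Sigma, R))
    return -0.5 * (n * d * np.log(2.0 * np.pi)
                   + d * logdet_sigma + n * logdet_omega + quad)


# Illustrative 3 x 2 example (values chosen arbitrarily).
X_demo = np.array([[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]])
Sigma_demo = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.5]])
Omega_demo = np.array([[1.0, 0.4], [0.4, 2.0]])
logp_demo = matrix_normal_logpdf(X_demo, np.zeros((3, 2)), Sigma_demo, Omega_demo)
```

With $d = 1$ and $\Omega = 1$ the function reduces to the ordinary multivariate normal log-density, which is a convenient sanity check.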

Like multivariate Gaussian distribution, matrix-variate Gaussian distribution also possesses several important properties as follows.

Theorem 1

(Transposable) If $X \sim \mathcal {MN}_{n,d}(M, \Sigma , \Omega )$, then $X^{\mathrm {T}} \sim \mathcal {MN}_{d,n}(M^{\mathrm {T}}, \Omega , \Sigma )$.

The matrix-variate Gaussian is related to the multivariate Gaussian in the following way.

Theorem 2

(Vectorizable) $X \sim \mathcal {MN}_{n,d}(M, \Sigma , \Omega )$ if and only if

$$\begin{aligned} \mathrm {vec}(X^{\mathrm {T}}) \sim {\mathcal {N}}_{nd}(\mathrm {vec}(M^{\mathrm {T}}), \Sigma \otimes \Omega ), \end{aligned}$$

where $\mathrm {vec}(\cdot )$ is the vectorization operator and $\otimes$ is the Kronecker product (also called the tensor product).
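The identity behind Theorem 2 can be checked numerically. The sketch below assumes a draw is built as $X = A Z B^{\mathrm{T}}$ with $\Sigma = AA^{\mathrm{T}}$, $\Omega = BB^{\mathrm{T}}$ and $Z$ having i.i.d. standard normal entries; this construction and all variable names are illustrative choices, not taken from the paper. Since $\mathrm{vec}(X^{\mathrm{T}})$ stacks the rows of $X$, it corresponds to a row-major flatten in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3

# Lower-triangular factors with a dominant diagonal (arbitrary illustrative choice).
A = np.tril(rng.normal(size=(n, n))) + n * np.eye(n)   # Sigma = A A^T
B = np.tril(rng.normal(size=(d, d))) + d * np.eye(d)   # Omega = B B^T
Sigma, Omega = A @ A.T, B @ B.T

Z = rng.normal(size=(n, d))   # i.i.d. standard normal entries
X = A @ Z @ B.T               # one draw with row covariance Sigma, column covariance Omega

# vec(X^T) stacks the rows of X, i.e. a C-order (row-major) flatten.
vec_Xt = X.reshape(-1)
lin_map = np.kron(A, B)       # vec(X^T) = (A kron B) vec(Z^T)

same_vector = np.allclose(vec_Xt, lin_map @ Z.reshape(-1))
# Covariance of vec(X^T) is (A kron B)(A kron B)^T = Sigma kron Omega.
same_cov = np.allclose(lin_map @ lin_map.T, np.kron(Sigma, Omega))
```

Both flags evaluate to `True`, which is the linear-algebra content of the theorem: the vectorized draw is a linear map of standard normals whose covariance is $\Sigma \otimes \Omega$.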

Furthermore, the matrix-variate Gaussian distribution is consistent under the marginalization and conditional distribution.

Theorem 3

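(Marginalization and conditional distribution) Let $X \sim \mathcal {MN}_{n,d}(M, \Sigma , \Omega )$ and partition $X, M, \Sigma$ and $\Omega$ as

$$\begin{aligned} X = \begin{bmatrix} X_{1r} \\ X_{2r} \end{bmatrix} = \begin{bmatrix} X_{1c}&X_{2c} \end{bmatrix}, \quad M = \begin{bmatrix} M_{1r} \\ M_{2r} \end{bmatrix} = \begin{bmatrix} M_{1c}&M_{2c} \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma _{11}&\Sigma _{12} \\ \Sigma _{21}&\Sigma _{22} \end{bmatrix}, \quad \Omega = \begin{bmatrix} \Omega _{11}&\Omega _{12} \\ \Omega _{21}&\Omega _{22} \end{bmatrix}, \end{aligned}$$

where $X_{1r}$ and $X_{2r}$ have $n_1$ and $n_2$ rows, $X_{1c}$ and $X_{2c}$ have $d_1$ and $d_2$ columns, and the blocks of $M$, $\Sigma$ and $\Omega$ are sized conformably. Then,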

1.
$X_{1r} \sim \mathcal {MN}_{n_1,d}\left( M_{1r},\Sigma _{11},\Omega \right)$,
$$\begin{aligned}&X_{2r}|X_{1r} \sim \mathcal {MN}_{n_2,d} \\&\quad \Big( M_{2r} + \Sigma _{21}\Sigma _{11}^{-1}(X_{1r}-M_{1r}),\Sigma _{22\cdot 1},\Omega \Big) ;\end{aligned}$$
2.
$X_{1c} \sim \mathcal {MN}_{n,d_1}\left( M_{1c},\Sigma ,\Omega _{11}\right)$,
$$\begin{aligned}&X_{2c}|X_{1c} \sim \mathcal {MN}_{n,d_2} \\&\quad \Big( M_{2c} + (X_{1c}-M_{1c})\Omega _{11}^{-1}\Omega _{12}, \Sigma, \Omega _{22\cdot 1} \Big) , \end{aligned}$$

where $\Sigma _{22\cdot 1}$ and $\Omega _{22\cdot 1}$ are the Schur complements [30] of $\Sigma _{11}$ and $\Omega _{11}$, respectively,

$$\begin{aligned} \Sigma _{22\cdot 1} = \Sigma _{22} - \Sigma _{21}\Sigma _{11}^{-1}\Sigma _{12} , \quad \Omega _{22\cdot 1} = \Omega _{22} - \Omega _{21}\Omega _{11}^{-1}\Omega _{12}. \end{aligned}$$

2.2 Matrix-variate Student-t distribution

Definition 2

The random matrix $X \in \mathbb {R}^{n \times d}$ is said to have a matrix-variate Student-t distribution with mean matrix $M\in \mathbb {R}^{n \times d}$, covariance matrices $\Sigma \in \mathbb {R}^{n \times n},\Omega \in \mathbb {R}^{d \times d}$ and degrees of freedom $\nu$ if and only if its probability density function is given by

$$\begin{aligned}&p(X|\nu , M, \Sigma , \Omega )=\frac{\Gamma _n \left[ \frac{1}{2}(\nu + d + n -1)\right] }{\pi ^{\frac{1}{2}dn}\Gamma _n \left[ \frac{1}{2}(\nu + n -1)\right] }\nonumber \\&\quad \times \det (\Sigma )^{-\frac{d}{2}} \det (\Omega )^{-\frac{n}{2}} \nonumber \\&\quad \times \det ({\mathbf {I}}_n + \Sigma ^{-1}(X-M)\Omega ^{-1}(X-M)^{\mathrm {T}})^{-\frac{1}{2}(\nu + d + n -1)}, \end{aligned}$$

(2)

where $\Omega$ and $\Sigma$ are positive semi-definite, and

$$\begin{aligned} \Gamma _n(\lambda ) = \pi ^{n(n-1)/4}\prod _{i=1}^{n}\Gamma \left( \lambda + \frac{1}{2} - \frac{i}{2}\right) . \end{aligned}$$

We denote this by $X \sim \mathcal {MT}_{n,d}(\nu , M, \Sigma , \Omega )$ or, when the dimensions are clear from context, $X \sim \mathcal {MT}(\nu , M, \Sigma , \Omega )$.
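The density (2) and the multivariate gamma function $\Gamma_n$ can be sketched in a few lines. The helper names `log_multigamma` and `matrix_t_logpdf` and the example matrices are hypothetical, and the code assumes $\Sigma$ and $\Omega$ are strictly positive definite. Only NumPy and the standard-library `math.lgamma` are used.

```python
import math

import numpy as np


def log_multigamma(lam, n):
    """ln Gamma_n(lam) for the multivariate gamma function in Definition 2."""
    return (n * (n - 1) / 4.0) * math.log(math.pi) + sum(
        math.lgamma(lam + 0.5 - 0.5 * i) for i in range(1, n + 1))


def matrix_t_logpdf(X, nu, M, Sigma, Omega):
    """Log of density (2) for an n x d matrix X."""
    n, d = X.shape
    R = X - M
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    _, logdet_omega = np.linalg.slogdet(Omega)
    # I_n + Sigma^{-1} R Omega^{-1} R^T
    inner = np.eye(n) + np.linalg.solve(Sigma, R) @ np.linalg.solve(Omega, R.T)
    _, logdet_inner = np.linalg.slogdet(inner)
    a = 0.5 * (nu + d + n - 1)
    return (log_multigamma(a, n) - log_multigamma(0.5 * (nu + n - 1), n)
            - 0.5 * d * n * math.log(math.pi)
            - 0.5 * d * logdet_sigma - 0.5 * n * logdet_omega
            - a * logdet_inner)


# Illustrative 3 x 2 example (values chosen arbitrarily).
X_demo = np.array([[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]])
M_demo = np.zeros((3, 2))
Sigma_demo = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.5]])
Omega_demo = np.array([[1.0, 0.4], [0.4, 2.0]])
logp_demo = matrix_t_logpdf(X_demo, 4.0, M_demo, Sigma_demo, Omega_demo)
```

Evaluating the same function on $X^{\mathrm{T}}$ with $\Sigma$ and $\Omega$ swapped returns the same value, which gives a quick numerical check of the transposition property of this density.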

Theorem 4

(Expectation and covariance) Let $X \sim \mathcal {MT}(\nu , M, \Sigma , \Omega )$, then

$$\begin{aligned} {\mathbb {E}}(X) = M,\quad \mathrm {cov}(\mathrm {vec}(X^{\mathrm {T}})) = \frac{1}{\nu -2}\Sigma \otimes \Omega , \nu > 2. \end{aligned}$$

Theorem 5

(Transposable) If $X \sim \mathcal {MT}_{n,d}(\nu ,M, \Sigma , \Omega )$, then $X^{\mathrm {T}} \sim \mathcal {MT}_{d,n}(\nu ,M^{\mathrm {T}}, \Omega ,\Sigma ).$

Theorem 6

(Asymptotics) Let $X \sim \mathcal {MT}_{n,d}(\nu ,M, \Sigma , \Omega )$, then $X \overset{d}{\rightarrow } \mathcal {MN}_{n,d}(M, \Sigma , \Omega )$ as $\nu \rightarrow \infty$, where “$\overset{d}{\rightarrow }$” denotes convergence in distribution.

Theorem 7

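(Marginalization and conditional distribution) Let $X \sim \mathcal {MT}_{n,d}(\nu ,M, \Sigma , \Omega )$ and partition $X, M, \Sigma$ and $\Omega$ as

$$\begin{aligned} X = \begin{bmatrix} X_{1r} \\ X_{2r} \end{bmatrix} = \begin{bmatrix} X_{1c}&X_{2c} \end{bmatrix}, \quad M = \begin{bmatrix} M_{1r} \\ M_{2r} \end{bmatrix} = \begin{bmatrix} M_{1c}&M_{2c} \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma _{11}&\Sigma _{12} \\ \Sigma _{21}&\Sigma _{22} \end{bmatrix}, \quad \Omega = \begin{bmatrix} \Omega _{11}&\Omega _{12} \\ \Omega _{21}&\Omega _{22} \end{bmatrix}, \end{aligned}$$

where $X_{1r}$ and $X_{2r}$ have $n_1$ and $n_2$ rows, $X_{1c}$ and $X_{2c}$ have $d_1$ and $d_2$ columns, and the blocks of $M$, $\Sigma$ and $\Omega$ are sized conformably. Then,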

1.
$X_{1r} \sim \mathcal {MT}_{n_1,d}\left( \nu , M_{1r},\Sigma _{11},\Omega \right)$,
$$\begin{aligned}&X_{2r}|X_{1r} \sim \mathcal {MT}_{n_2,d} \Big ( \nu + n_1, M_{2r} \\&\quad + \Sigma _{21}\Sigma _{11}^{-1}(X_{1r}-M_{1r}),\Sigma _{22\cdot 1}, \\&\quad \Omega + (X_{1r}-M_{1r})^{\mathrm {T}}\Sigma _{11}^{-1}(X_{1r}-M_{1r}) \Big ); \end{aligned}$$
2.
$X_{1c} \sim \mathcal {MT}_{n,d_1}\left( \nu , M_{1c},\Sigma ,\Omega _{11}\right)$,
$$\begin{aligned}&X_{2c}|X_{1c} \sim \mathcal {MT}_{n,d_2} \Big (\nu + d_1, M_{2c} + (X_{1c}-M_{1c})\Omega _{11}^{-1}\Omega _{12}, \\&\quad \Sigma + (X_{1c}-M_{1c})\Omega _{11}^{-1}(X_{1c}-M_{1c})^{\mathrm {T}},\Omega _{22\cdot 1} \Big ), \end{aligned}$$

where $\Sigma _{22\cdot 1}$ and $\Omega _{22\cdot 1}$ are the Schur complements of $\Sigma _{11}$ and $\Omega _{11}$, respectively,

$$\begin{aligned} \Sigma _{22\cdot 1} = \Sigma _{22} - \Sigma _{21}\Sigma _{11}^{-1}\Sigma _{12} , \quad \Omega _{22\cdot 1} = \Omega _{22} - \Omega _{21}\Omega _{11}^{-1}\Omega _{12}. \end{aligned}$$

Remark 1

It can be seen that the matrix-variate Student-$t$ distribution has many properties similar to those of the matrix-variate Gaussian distribution, and it converges to the matrix-variate Gaussian distribution as its degrees of freedom tend to infinity. However, the matrix-variate Student-$t$ distribution lacks the property of vectorizability (Theorem 2) [15]. As a consequence, Student-$t$ process regression for multiple outputs cannot be derived by vectorizing the multi-response variables. In the next section, we propose a new framework to introduce the multivariate Student-$t$ process regression model.

3 Multivariate Gaussian and Student-t process regression models

3.1 Multivariate Gaussian process regression (MV-GPR)

If $\varvec{f}$ is a multivariate Gaussian process on ${\mathcal {X}}$ with vector-valued mean function $\varvec{u} : {\mathcal {X}}\mapsto \mathbb {R}^d$, covariance function (also called kernel) $k: {\mathcal {X}}\times {\mathcal {X}} \mapsto \mathbb {R}$ and positive semi-definite parameter matrix $\Omega \in \mathbb {R}^{d \times d}$, then any finite collection of vector-valued variables have a joint matrix-variate Gaussian distribution:

$$\begin{aligned}{}[\varvec{f}(x_1)^{\mathrm {T}},\ldots ,\varvec{f}(x_n)^{\mathrm {T}}]^{\mathrm {T}} \sim \mathcal {MN}(M, \Sigma , \Omega ),n \in \mathbb {N}, \end{aligned}$$

where $\varvec{f}, \varvec{u} \in \mathbb {R}^d$ are row vectors whose components are the functions $\{f_i\}_{i=1}^d$ and $\{\mu _i\}_{i=1}^d,$ respectively. Furthermore, $M \in \mathbb {R}^{n \times d}$ with $M_{ij} = \mu _{j}(x_i)$, and $\Sigma \in \mathbb {R}^{n \times n}$ with $\Sigma _{ij} = k(x_i,x_j)$. Sometimes $\Sigma$ is called column covariance matrix, while $\Omega$ is row covariance matrix. We denote $\varvec{f} \sim \mathcal {MGP}(\varvec{u}, k, \Omega )$.

In conventional GPR methods, the noisy model $y=f(x)+\varepsilon$ is usually considered. However, for Student-$t$ process regression such a model is analytically intractable [24]. Therefore, we adopt the method used in [24] and consider the noise-free regression model where the noise term is incorporated into the kernel function.

Given n pairs of observations $\{(x_i,\varvec{y}_i)\}_{i=1}^n, x_i \in \mathbb {R}^p, \varvec{y}_i \in \mathbb {R}^{1\times d}$, we assume the following model:

$$\begin{aligned} \varvec{f}&\sim \mathcal {MGP}(\varvec{u},k',\Omega ), \\ \varvec{y}_i&= \varvec{f}(x_i), \text{ for } \ i = 1,\ldots ,n, \end{aligned}$$

where

$$\begin{aligned} k'(x_i,x_j) = k(x_i,x_j) + \delta _{ij}\sigma _n^2, \end{aligned}$$

(3)

and $\delta _{ij}=1$ if $i=j$, otherwise $\delta _{ij}=0$. Note that the second term in (3) represents the random noises.

We assume $\varvec{u} = \varvec{0}$ as commonly done in GPR. By the definition of multivariate Gaussian process, it yields that the collection of functions $[\varvec{f}(x_1),\ldots ,\varvec{f}(x_n)]$ follow a matrix-variate Gaussian distribution:

$$\begin{aligned} {[}\varvec{f}(x_1)^{\mathrm {T}},\ldots ,\varvec{f}(x_n)^{\mathrm {T}}]^{\mathrm {T}} \sim \mathcal {MN}(\varvec{0},K',\Omega ), \end{aligned}$$

where $K'$ is the $n \times n$ covariance matrix of which the (i, j)-th element $[K']_{ij} = k'(x_i,x_j)$.

To predict a new variable $\varvec{f}_* = [f_{*1},\ldots ,f_{*m}]^\mathrm {T}$ at the test locations $X_* = [x_{n+1},\ldots ,x_{n+m}]^\mathrm {T}$, the joint distribution of the training observations $Y = [\varvec{y}_1^{\mathrm {T}},\ldots ,\varvec{y}_n^{\mathrm {T}}]^{\mathrm {T}}$ and the predictive targets $\varvec{f}_*$ is given by

$$\begin{aligned} \begin{bmatrix} Y \\ \varvec{f}_* \end{bmatrix} \sim \mathcal {MN} \left( \varvec{0}, \begin{bmatrix} K'(X,X) \quad K'(X_*,X)^{\mathrm {T}} \\ K'(X_*,X) \ \ K'(X_*,X_*) \end{bmatrix}, \Omega \right) , \end{aligned}$$

(4)

where $K'(X,X)$ is an $n \times n$ matrix of which the (i, j)-th element $[K'(X,X)]_{ij} = k'(x_{i},x_j)$, $K'(X_*,X)$ is an $m \times n$ matrix of which the (i, j)-th element $[K'(X_*,X)]_{ij} = k'(x_{n+i},x_j)$, and $K'(X_*,X_*)$ is an $m \times m$ matrix with the (i, j)-th element $[K'(X_*,X_*)]_{ij} = k'(x_{n+i},x_{n+j})$. Thus, taking advantage of conditional distribution of multivariate Gaussian process, the predictive distribution is

$$\begin{aligned} p(\varvec{f}_*|X,Y,X_*) = \mathcal {MN}({\hat{M}},{\hat{\Sigma }},{\hat{\Omega }}), \end{aligned}$$

(5)

where

$$\begin{aligned} {\hat{M}}&= K'(X_*,X)K'(X,X)^{-1}Y, \end{aligned}$$

(6)

$$\begin{aligned} {\hat{\Sigma }}&= K'(X_*,X_*) - K'(X_*,X)K'(X,X)^{-1}K'(X_*,X)^{\mathrm {T}},\end{aligned}$$

(7)

$$\begin{aligned} {\hat{\Omega }}&= \Omega . \end{aligned}$$

(8)

Additionally, the expectation and the covariance are obtained:

$$\begin{aligned}&{\mathbb {E}}[\varvec{f}_*] = {\hat{M}}=K'(X_*,X)K'(X,X)^{-1}Y, \end{aligned}$$

(9)

$$\begin{aligned}&\mathrm {cov}(\mathrm {vec}(\varvec{f}^{\mathrm {T}}_*)) = {\hat{\Sigma }}\otimes {\hat{\Omega }} = [K'(X_*,X_*) \nonumber \\&\quad - K'(X_*,X)K'(X,X)^{-1}K'(X_*,X)^{\mathrm {T}}]\otimes \Omega . \end{aligned}$$

(10)
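The predictive quantities of MV-GPR can be sketched as below, assuming an SE kernel and storing $K'(X_*,X)$ as an $m \times n$ array, so that the products are arranged as $K'(X_*,X)K'(X,X)^{-1}Y$ for the shapes to agree. The function names, default hyperparameters and toy data are illustrative assumptions. Following the definition of $k'$, the noise variance is added on the diagonals of $K'(X,X)$ and $K'(X_*,X_*)$ only, since test and training points carry different indices.

```python
import numpy as np


def se_kernel(X1, X2, s_f=1.0, ell=0.2):
    """Squared exponential kernel between rows of X1 (a x p) and X2 (b x p)."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return s_f**2 * np.exp(-0.5 * sq / ell**2)


def mvgpr_predict(X, Y, Xs, Omega, sigma_n2, s_f=1.0, ell=0.2):
    """Return (M_hat, Sigma_hat, cov of vec(f_*^T)) for MV-GPR."""
    n, m = X.shape[0], Xs.shape[0]
    K = se_kernel(X, X, s_f, ell) + sigma_n2 * np.eye(n)        # K'(X, X), n x n
    K_s = se_kernel(Xs, X, s_f, ell)                            # K'(X_*, X), m x n
    K_ss = se_kernel(Xs, Xs, s_f, ell) + sigma_n2 * np.eye(m)   # K'(X_*, X_*), m x m
    alpha = np.linalg.solve(K, Y)                               # K'(X, X)^{-1} Y
    M_hat = K_s @ alpha                                         # m x d predictive mean
    Sigma_hat = K_ss - K_s @ np.linalg.solve(K, K_s.T)          # m x m
    cov_vec = np.kron(Sigma_hat, Omega)                         # (m d) x (m d)
    return M_hat, Sigma_hat, cov_vec


# Toy data: two correlated outputs observed at six inputs (illustrative only).
X_train = np.linspace(0.0, 1.0, 6)[:, None]
Y_train = np.hstack([np.sin(2 * np.pi * X_train), np.cos(2 * np.pi * X_train)])
X_test = np.array([[0.25], [0.75]])
Omega_demo = np.array([[1.0, 0.5], [0.5, 1.0]])

M_hat, Sigma_hat, cov_vec = mvgpr_predict(X_train, Y_train, X_test, Omega_demo, 0.01)
```

Note that the predictive mean does not depend on $\Omega$; the row covariance only enters the predictive covariance through the Kronecker product.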

3.1.1 Kernel

Although there are two covariance matrices in the above regression model, the column covariance and the row covariance, only the column covariance depends on the inputs and is considered as the kernel, since it contains our presumptions about the function we wish to learn and defines the closeness and similarity between data points [22]. As in conventional GPR, the choice of kernel has a profound impact on the performance of multivariate Gaussian process regression (as well as multivariate Student-$t$ process regression introduced later). A wide range of useful kernels have been proposed in the literature, such as linear, rational quadratic and Matérn [22]. The squared exponential (SE) kernel is nevertheless the most commonly used due to its simple form and many desirable properties such as smoothness and integrability with other functions, although it could oversmooth the data, especially financial data.

The squared exponential (SE) kernel is defined as:

$$\begin{aligned} k_{\mathrm{SE}}(x,x') = s_f^2 \exp \left( -\frac{\Vert x-x'\Vert ^2}{2\ell ^2}\right) , \end{aligned}$$

where $s_f^2$ is the signal variance and can also be considered as an output-scale amplitude and the parameter $\ell$ is the input (length or time) scale [23]. The kernel can also be defined by automatic relevance determination (ARD):

$$\begin{aligned} k_{SEard}(\varvec{x},\varvec{x}') = s_f^2 \exp \left( - \frac{(\varvec{x}-\varvec{x}')^{\mathrm {T}}\Theta ^{-1}(\varvec{x}-\varvec{x}')}{2}\right) , \end{aligned}$$

where $\Theta$ is a diagonal matrix with the element components $\{\ell _i^2\}_{i=1}^p$, which represents the length scales for each corresponding input dimension.

For convenience and for the purpose of demonstration, the SE kernel is used in all our experiments where there is only one input variable, while SEard is used in those with multiple input variables. It should be noted that there is no technical difficulty in using other kernels in our models.
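The two kernels and the noisy Gram matrix $K' = K + \sigma_n^2 \mathbf{I}$ can be sketched as follows. The function names (`se_kernel`, `se_ard_kernel`, `noisy_gram`) and the example inputs are illustrative assumptions; the ARD version uses the fact that $\Theta$ is diagonal with entries $\ell_i^2$, so the quadratic form reduces to a sum of squared, rescaled differences.

```python
import numpy as np


def se_kernel(X1, X2, s_f, ell):
    """SE kernel matrix between rows of X1 (a x p) and X2 (b x p)."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return s_f**2 * np.exp(-0.5 * sq / ell**2)


def se_ard_kernel(X1, X2, s_f, ells):
    """SE-ARD kernel; `ells` holds one length scale per input dimension."""
    ells = np.asarray(ells, dtype=float)
    # Dividing by ells implements (x - x')^T Theta^{-1} (x - x') for diagonal Theta.
    diff = (X1[:, None, :] - X2[None, :, :]) / ells
    return s_f**2 * np.exp(-0.5 * (diff**2).sum(axis=-1))


def noisy_gram(K, sigma_n2):
    """K' = K + sigma_n^2 I on the training inputs."""
    return K + sigma_n2 * np.eye(K.shape[0])


# Three 2-D inputs (arbitrary illustrative values).
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
K_se = se_kernel(X_demo, X_demo, s_f=1.5, ell=0.7)
K_ard = se_ard_kernel(X_demo, X_demo, s_f=1.5, ells=[0.7, 0.7])
K_noisy = noisy_gram(K_se, sigma_n2=0.1)
```

When all ARD length scales are equal, the two kernels coincide, so `K_se` and `K_ard` agree in this example.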

3.1.2 Parameter estimation

The hyperparameters involved in the kernel and the row covariance matrix of MV-GPR need to be estimated from the training data. Many approaches used in the conventional GP models [27], such as maximum likelihood estimation (MLE), maximum a posteriori (MAP) and Markov chain Monte Carlo (MCMC), can be used for our proposed models. Although Monte Carlo methods can perform GPR without the need of estimating hyperparameters [6, 18, 19, 28], the common approach is to estimate them by means of MLE due to the high computational cost of Monte Carlo methods. Therefore, as an example we consider parameter estimation using MLE. Compared to the conventional GPR model, $\Omega$ is an extra parameter; hence, the unknown parameters include the hyperparameters in the kernel, the noise variance $\sigma _n^2$ and the row covariance parameter matrix $\Omega$.

Because $\Omega$ is positive semi-definite, it can be denoted as $\Omega = \Phi \Phi ^{\mathrm {T}}$, where

$$\begin{aligned} \Phi = \left[ \begin{matrix} \phi _{11} &{}\quad 0 &{}\quad \cdots &{}\quad 0 \\ \phi _{21} &{}\quad \phi _{22} &{}\quad \cdots &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ \phi _{d1} &{}\quad \phi _{d2} &{} \quad \cdots &{}\quad \phi _{dd} \\ \end{matrix} \right] . \end{aligned}$$

To guarantee the uniqueness of $\Phi$, the diagonal elements are restricted to be positive and denote $\varphi _{ii} = \ln (\phi _{ii})$ for $i = 1,2,\ldots ,d$.
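This parameterization maps $d(d+1)/2$ unconstrained real numbers to a valid $\Omega$. The sketch below uses hypothetical helper names (`omega_from_params`, `params_from_omega`) and stores the parameters in the row-major order of the lower triangle returned by `np.tril_indices`; that ordering is a choice made here, not something the paper specifies.

```python
import numpy as np


def omega_from_params(params, d):
    """Map d(d+1)/2 unconstrained reals to (Omega, Phi) with Omega = Phi Phi^T."""
    Phi = np.zeros((d, d))
    Phi[np.tril_indices(d)] = params
    # Diagonal entries are stored as varphi_ii = ln(phi_ii); exponentiate them.
    diag = np.diag_indices(d)
    Phi[diag] = np.exp(Phi[diag])
    return Phi @ Phi.T, Phi


def params_from_omega(Omega):
    """Inverse map: Cholesky factor with log-transformed diagonal."""
    d = Omega.shape[0]
    Phi = np.linalg.cholesky(Omega)
    diag = np.diag_indices(d)
    Phi[diag] = np.log(Phi[diag])
    return Phi[np.tril_indices(d)]


# Any real vector yields a symmetric positive definite Omega (illustrative values).
params_demo = np.array([0.1, -0.5, 0.2, 0.3, 0.7, -0.1])
Omega_demo, Phi_demo = omega_from_params(params_demo, 3)
roundtrip = params_from_omega(Omega_demo)
```

Because the diagonal of $\Phi$ is positive by construction, an unconstrained optimizer can update `params` freely without leaving the set of valid row covariance matrices.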

In the MV-GPR model, the observations follow a matrix-variate Gaussian distribution $Y \sim \mathcal {MN}_{n,d}(\varvec{0},K',\Omega )$, where $K'$ is the noisy column covariance matrix with elements $[K']_{ij} = k'(x_i,x_j)$, so that $K' = K + \sigma _n^2 {\mathbf {I}}$ where $K$ is the noise-free column covariance matrix with elements $[K]_{ij} = k(x_i,x_j)$. Since the kernel $k$ contains hyperparameters, we write $K = K_{\theta }$ and denote the hyperparameter set by $\Theta = \{\theta _1, \theta _2, \ldots \}$; thus

$$\begin{aligned} \frac{\partial K'}{\partial \sigma _n^2} = {\mathbf {I}}_n, \quad \frac{\partial K' }{\partial \theta _i} = \frac{\partial K_{\theta }}{\partial \theta _i} \end{aligned}$$

According to the matrix-variate distribution, the negative log marginal likelihood of observations is

$$\begin{aligned} {\mathcal {L}}= & {} \frac{nd}{2}\ln (2\pi ) + \frac{d}{2}\ln \det (K') + \frac{n}{2}\ln \det (\Omega )\nonumber \\&+\, \frac{1}{2}{\mathrm{tr}}((K')^{-1}Y\Omega ^{-1}Y^{\mathrm {T}}). \end{aligned}$$

(11)
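Equation (11) can be evaluated stably with two Cholesky factorizations. This is a minimal sketch in which the function name `mvgpr_nll` and the toy inputs are illustrative assumptions; it relies on the identity $\mathrm{tr}((K')^{-1}Y\Omega^{-1}Y^{\mathrm{T}}) = \Vert L_K^{-1} Y L_\Omega^{-\mathrm{T}}\Vert_F^2$, where $L_K$ and $L_\Omega$ are the lower Cholesky factors of $K'$ and $\Omega$.

```python
import numpy as np


def mvgpr_nll(Y, K_noisy, Omega):
    """Negative log marginal likelihood (11) for Y ~ MN(0, K', Omega)."""
    n, d = Y.shape
    L_K = np.linalg.cholesky(K_noisy)
    L_O = np.linalg.cholesky(Omega)
    logdet_K = 2.0 * np.log(np.diag(L_K)).sum()
    logdet_O = 2.0 * np.log(np.diag(L_O)).sum()
    # W = L_K^{-1} Y L_O^{-T}; the trace term equals the squared Frobenius norm of W.
    W = np.linalg.solve(L_K, Y)
    W = np.linalg.solve(L_O, W.T).T
    quad = (W**2).sum()
    return 0.5 * (n * d * np.log(2.0 * np.pi)
                  + d * logdet_K + n * logdet_O + quad)


# Toy inputs (illustrative only): SE Gram matrix on five points plus noise.
x = np.linspace(0.0, 1.0, 5)
K_demo = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3**2) + 0.1 * np.eye(5)
Omega_demo = np.array([[1.0, 0.4], [0.4, 2.0]])
Y_demo = np.random.default_rng(1).normal(size=(5, 2))
nll_demo = mvgpr_nll(Y_demo, K_demo, Omega_demo)
```

The same value is obtained from the ordinary multivariate normal likelihood of the row-major flattened `Y_demo` with covariance `np.kron(K_demo, Omega_demo)`, but the matrix form never builds the $nd \times nd$ covariance.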

The derivatives of the negative log marginal likelihood with respect to parameter $\sigma _n^2$, $\theta _i$, $\phi _{ij}$ and $\varphi _{ii}$ are as follows:

$$\begin{aligned} \frac{\partial {\mathcal {L}}}{\partial \sigma _n^2} =&\frac{d}{2}{\mathrm{tr}}((K')^{-1}) - \frac{1}{2}{\mathrm{tr}}(\alpha _{K'}\Omega ^{-1}\alpha _{K'}^{\mathrm {T}}),\\ \frac{\partial {\mathcal {L}}}{\partial \theta _i} =&\frac{d}{2}{\mathrm{tr}}\left( (K')^{-1}\frac{\partial K_{\theta }}{\partial \theta _i}\right) - \frac{1}{2}{\mathrm{tr}}\left( \alpha _{K'}\Omega ^{-1}\alpha _{K'}^{\mathrm {T}}\frac{\partial K_{\theta }}{\partial \theta _i}\right) , \\ \frac{\partial {\mathcal {L}}}{\partial \phi _{ij}} =&\frac{n}{2}{\mathrm{tr}}[\Omega ^{-1}({\mathbf {E}}_{ij}\Phi ^{\mathrm {T}} + \Phi {\mathbf {E}}_{ji})] \\&- \frac{1}{2}{\mathrm{tr}}[\alpha _{\Omega }(K')^{-1}\alpha _{\Omega }^{\mathrm {T}}({\mathbf {E}}_{ij}\Phi ^{\mathrm {T}} + \Phi {\mathbf {E}}_{ji})], \\ \frac{\partial {\mathcal {L}}}{\partial \varphi _{ii}} =&\frac{n}{2}{\mathrm{tr}}[\Omega ^{-1}({\mathbf {J}}_{ii}\Phi ^{\mathrm {T}} + \Phi {\mathbf {J}}_{ii})] \\&- \frac{1}{2}{\mathrm{tr}}[\alpha _{\Omega }(K')^{-1}\alpha _{\Omega }^{\mathrm {T}}({\mathbf {J}}_{ii}\Phi ^{\mathrm {T}} + \Phi {\mathbf {J}}_{ii})], \end{aligned}$$

where $\alpha _{K'} = (K')^{-1}Y$, $\alpha _{\Omega } = \Omega ^{-1}Y^{\mathrm {T}}$, ${\mathbf {E}}_{ij}$ is the $d \times d$ elementary matrix having unity in the (i, j)th element and zeros elsewhere, and ${\mathbf {J}}_{ii}$ is the same as ${\mathbf {E}}_{ii}$ but with the unity being replaced by $e^{\varphi _{ii}}$. The details are provided in “Multivariate Gaussian process regression” in the Appendix.

Hence, standard gradient-based numerical optimization techniques, such as conjugate gradient method, can be used to minimize the negative log marginal likelihood function to obtain the estimates of the parameters. Note that since the random noise is incorporated into the kernel function, the noise variance is estimated alongside the other hyperparameters.
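Before handing analytic gradients to an optimizer, it is common to compare them with finite differences. The sketch below does this for the noise-variance derivative $\partial \mathcal{L}/\partial \sigma_n^2 = \frac{d}{2}\mathrm{tr}((K')^{-1}) - \frac{1}{2}\mathrm{tr}(\alpha_{K'}\Omega^{-1}\alpha_{K'}^{\mathrm{T}})$ of the MV-GPR likelihood; the function names, step size and toy data are assumptions made for illustration.

```python
import numpy as np


def nll(Y, K, sigma_n2, Omega):
    """MV-GPR negative log marginal likelihood with K' = K + sigma_n^2 I."""
    n, d = Y.shape
    Kp = K + sigma_n2 * np.eye(n)
    quad = np.trace(np.linalg.solve(Kp, Y) @ np.linalg.solve(Omega, Y.T))
    return 0.5 * (n * d * np.log(2.0 * np.pi)
                  + d * np.linalg.slogdet(Kp)[1]
                  + n * np.linalg.slogdet(Omega)[1] + quad)


def dnll_dsigma_n2(Y, K, sigma_n2, Omega):
    """Analytic derivative of the NLL with respect to sigma_n^2."""
    n, d = Y.shape
    Kp_inv = np.linalg.inv(K + sigma_n2 * np.eye(n))
    alpha = Kp_inv @ Y                                   # alpha_{K'} = (K')^{-1} Y
    return (0.5 * d * np.trace(Kp_inv)
            - 0.5 * np.trace(alpha @ np.linalg.solve(Omega, alpha.T)))


# Toy problem (illustrative values only).
x = np.linspace(0.0, 1.0, 5)
K_demo = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3**2)
Omega_demo = np.array([[1.0, 0.4], [0.4, 2.0]])
Y_demo = np.random.default_rng(2).normal(size=(5, 2))

s2, h = 0.1, 1e-5
analytic = dnll_dsigma_n2(Y_demo, K_demo, s2, Omega_demo)
# Central finite difference as an independent check.
numeric = (nll(Y_demo, K_demo, s2 + h, Omega_demo)
           - nll(Y_demo, K_demo, s2 - h, Omega_demo)) / (2.0 * h)
```

The two numbers should agree to several digits; a larger gap usually points to a sign or transpose error in the hand-derived gradient.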

3.1.3 Comparison with the existing methods

Compared with the existing multi-output GPR methods [2, 5, 25], our proposed method possesses several advantages.

Firstly, the existing methods have to vectorize the multi-output matrix in order to utilize the GPR models. This is complicated and not always workable if the numbers of outputs and observations are large. In contrast, our proposed MV-GPR has a more straightforward form in which the model settings, derivations and computations are all performed directly in matrix form. In particular, we use the column covariance (kernel) and the row covariance to capture all the correlations together in the multivariate outputs, rather than assuming a separate kernel for each output and constructing a “big” covariance matrix by Kronecker product as done in [5].

Secondly, the existing methods rely on the equivalence between vectorized matrix-variate Gaussian distribution and multivariate Gaussian distribution. However, this equivalence does not exist for other elliptical distributions such as matrix-variate Student-$t$ distribution [15]. Therefore, the existing methods for multi-output Gaussian process regression cannot be applied to Student-$t$ process regression. On the other hand, our proposed MV-GPR is based on matrix forms directly and does not require vectorization, so it can naturally be extended to MV-TPR as we will do in the next subsection.

Therefore, our proposed MV-GPR provides not only a new derivation of the multi-output Gaussian process regression model, but also a unified framework to derive more general elliptical processes models.

3.2 Multivariate Student-t process regression (MV-TPR)

In this subsection, we propose a new nonlinear regression model for multivariate response, namely multivariate Student-t process regression model (MV-TPR), using the framework discussed in the previous subsections. MV-TPR is an extension to multi-output GPR, as well as an extension to the univariate Student-t process regression proposed in [24].

By definition, if $\varvec{f}$ is a multivariate Student-$t$ process on ${\mathcal {X}}$ with parameter $\nu >2$, vector-valued mean function $\varvec{u} : {\mathcal {X}}\mapsto \mathbb {R}^d$, covariance function (also called kernel) $k: {\mathcal {X}}\times {\mathcal {X}} \mapsto \mathbb {R}$ and positive semi-definite parameter matrix $\Omega \in \mathbb {R}^{d \times d}$, then any finite collection of vector-valued variables have a joint matrix-variate Student-t distribution:

$$\begin{aligned}{}[\varvec{f}(x_1)^{\mathrm {T}},\ldots ,\varvec{f}(x_n)^{\mathrm {T}}]^{\mathrm {T}} \sim \mathcal {MT}(\nu , M, \Sigma , \Omega ),n \in \mathbb {N}, \end{aligned}$$

where $\varvec{f}, \varvec{u} \in \mathbb {R}^d$ are row vectors whose components are the functions $\{f_i\}_{i=1}^d$ and $\{\mu _i\}_{i=1}^d,$ respectively. Furthermore, $M \in \mathbb {R}^{n \times d}$ with $M_{ij} = \mu _{j}(x_i)$, and $\Sigma \in \mathbb {R}^{n \times n}$ with $\Sigma _{ij} = k(x_i,x_j)$. We denote $\varvec{f} \sim \mathcal {MTP}(\nu , \varvec{u}, k, \Omega )$.

Therefore, MV-TPR model can be formulated along the same line as MV-GPR based on the definition of multivariate Student-t process. We present the model briefly below.

Given n pairs of observations $\{(x_i,\varvec{y}_i)\}_{i=1}^n, x_i \in \mathbb {R}^p, \varvec{y}_i \in \mathbb {R}^{1\times d}$, we assume

$$\begin{aligned} \varvec{f}&\sim \mathcal {MTP}(\nu ,\varvec{u},k',\Omega ),\nu >2, \\ \varvec{y}_i&= \varvec{f}(x_i), \text{ for } \ i = 1,\ldots ,n, \end{aligned}$$

where $\nu$ is the degrees of freedom of the Student-t process and the remaining parameters have the same meaning as in the MV-GPR model. Consequently, the predictive distribution is obtained as:

$$\begin{aligned} p(\varvec{f}_*|X,Y,X_*) = \mathcal {MT}({\hat{\nu }},{\hat{M}},{\hat{\Sigma }},{\hat{\Omega }}), \end{aligned}$$

(12)

where

$$\begin{aligned} {\hat{\nu }}&= \nu + n, \end{aligned}$$

(13)

$$\begin{aligned} {\hat{M}}&= K'(X_*,X)K'(X,X)^{-1}Y, \end{aligned}$$

(14)

$$\begin{aligned} {\hat{\Sigma }}&= K'(X_*,X_*) - K'(X_*,X)K'(X,X)^{-1}K'(X_*,X)^{\mathrm {T}},\end{aligned}$$

(15)

$$\begin{aligned} {\hat{\Omega }}&= \Omega + Y^{\mathrm {T}}K'(X,X)^{-1}Y. \end{aligned}$$

(16)

According to the expectation and the covariance of matrix-variate Student-t distribution, the predictive mean and covariance are given by

$$\begin{aligned} {\mathbb {E}}[\varvec{f}_*]&= {\hat{M}}=K'(X_*,X)K'(X,X)^{-1}Y, \end{aligned}$$

(17)

$$\begin{aligned} \mathrm {cov}(\mathrm {vec}(\varvec{f}^{\mathrm {T}}_*))&= \frac{1}{\nu +n-2}{\hat{\Sigma }}\otimes {\hat{\Omega }} \nonumber \\&= \frac{1}{\nu + n-2}[K'(X_*,X_*) \nonumber \\&\quad - K'(X_*,X)K'(X,X)^{-1}K'(X_*,X)^{\mathrm {T}}] \nonumber \\&\quad \otimes (\Omega + Y^{\mathrm {T}}K'(X,X)^{-1}Y). \end{aligned}$$

(18)
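The MV-TPR predictive parameters differ from the Gaussian case only in $\hat{\nu}$, $\hat{\Omega}$ and the scaling of the covariance. A minimal sketch follows, assuming an SE kernel and an $m \times n$ array for $K'(X_*,X)$, so that the mean is computed as $K'(X_*,X)K'(X,X)^{-1}Y$; the function names, default hyperparameters and toy data are illustrative and not taken from the paper.

```python
import numpy as np


def se_kernel(X1, X2, s_f=1.0, ell=0.2):
    """Squared exponential kernel between rows of X1 and X2."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return s_f**2 * np.exp(-0.5 * sq / ell**2)


def mvtpr_predict(X, Y, Xs, nu, Omega, sigma_n2, s_f=1.0, ell=0.2):
    """Return (nu_hat, M_hat, Sigma_hat, Omega_hat, cov of vec(f_*^T))."""
    n, m = X.shape[0], Xs.shape[0]
    K = se_kernel(X, X, s_f, ell) + sigma_n2 * np.eye(n)        # K'(X, X)
    K_s = se_kernel(Xs, X, s_f, ell)                            # K'(X_*, X), m x n
    K_ss = se_kernel(Xs, Xs, s_f, ell) + sigma_n2 * np.eye(m)   # K'(X_*, X_*)
    alpha = np.linalg.solve(K, Y)                               # K'(X, X)^{-1} Y
    nu_hat = nu + n
    M_hat = K_s @ alpha
    Sigma_hat = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    Omega_hat = Omega + Y.T @ alpha                             # data-dependent row scale
    cov_vec = np.kron(Sigma_hat, Omega_hat) / (nu + n - 2)
    return nu_hat, M_hat, Sigma_hat, Omega_hat, cov_vec


# Toy data (illustrative only).
X_train = np.linspace(0.0, 1.0, 6)[:, None]
Y_train = np.hstack([np.sin(2 * np.pi * X_train), np.cos(2 * np.pi * X_train)])
X_test = np.array([[0.25], [0.75]])
Omega_demo = np.array([[1.0, 0.5], [0.5, 1.0]])

nu_hat, M_hat, Sigma_hat, Omega_hat, cov_vec = mvtpr_predict(
    X_train, Y_train, X_test, nu=3.0, Omega=Omega_demo, sigma_n2=0.01)
```

Unlike MV-GPR, the predictive covariance here depends on the observed outputs through $\hat{\Omega}$, while the predictive mean is the same expression in both models.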

In the MV-TPR model, the observations follow a matrix-variate Student-t distribution $Y \sim \mathcal {MT}_{n,d}(\nu ,\varvec{0},K',\Omega )$. The negative log marginal likelihood is

$$\begin{aligned} {\mathcal {L}}= & {} \frac{1}{2}(\nu + d+n -1) \ln \det ({\mathbf {I}}_n + (K')^{-1}Y\Omega ^{-1}Y^{\mathrm {T}}) \\&+ \frac{d}{2}\ln \det (K') + \frac{n}{2}\ln \det (\Omega ) \\&+ \ln \Gamma _n \left( \frac{1}{2}(\nu + n -1)\right) - \ln \Gamma _n \left( \frac{1}{2}(\nu + d + n -1)\right) \\&+ \frac{1}{2}dn\ln \pi \\= & {} \frac{1}{2}(\nu + d+n -1) \ln \det (K' +Y\Omega ^{-1}Y^{\mathrm {T}}) \\&- \frac{\nu + n -1}{2}\ln \det (K') \\&+ \ln \Gamma _n \left( \frac{1}{2}(\nu + n -1)\right) - \ln \Gamma _n \left( \frac{1}{2}(\nu + d + n -1)\right) \\&+ \frac{n}{2}\ln \det (\Omega )+ \frac{1}{2}dn\ln \pi . \end{aligned}$$

Therefore, the parameters of MV-TPR contain all the parameters in MV-GPR and one more parameter: the degree of freedom $\nu$. The derivatives of the negative log marginal likelihood with respect to parameter $\nu$,$\sigma _n^2$, $\theta _i$, $\phi _{ij}$ and $\varphi _{ii}$ are as follows:

$$\begin{aligned} \frac{\partial {\mathcal {L}}}{\partial \nu }&= \frac{1}{2}\ln \det (U) - \frac{1}{2}\ln \det (K') + \frac{1}{2}\psi _n\left( \frac{1}{2}\tau \right) \\&\quad - \frac{1}{2}\psi _n\left( \frac{1}{2}(\tau + d)\right) , \\ \frac{\partial {\mathcal {L}}}{\partial \sigma ^2_n}&= \frac{(\tau +d)}{2}{\mathrm{tr}}(U^{-1}) - \frac{\tau }{2}{\mathrm{tr}}((K')^{-1}), \\ \frac{\partial {\mathcal {L}}}{\partial \theta _i}&= \frac{(\tau +d)}{2}{\mathrm{tr}}\left( U^{-1} \frac{\partial K_{\theta }}{\partial \theta _i}\right) - \frac{\tau }{2}{\mathrm{tr}}\left( (K')^{-1} \frac{\partial K_{\theta }}{\partial \theta _i}\right) ,\\ \frac{\partial {\mathcal {L}}}{\partial \phi _{ij}}&= - \frac{(\tau +d)}{2}{\mathrm{tr}}[U^{-1}\alpha _{\Omega }^{\mathrm {T}}({\mathbf {E}}_{ij}\Phi ^{\mathrm {T}} + \Phi {\mathbf {E}}_{ji})\alpha _{\Omega }] \\&\quad + \frac{n}{2}{\mathrm{tr}}[\Omega ^{-1}({\mathbf {E}}_{ij}\Phi ^{\mathrm {T}} + \Phi {\mathbf {E}}_{ji})], \\ \frac{\partial {\mathcal {L}}}{\partial \varphi _{ii}}&= -\frac{(\tau +d)}{2}{\mathrm{tr}}[U^{-1}\alpha _{\Omega }^{\mathrm {T}}({\mathbf {J}}_{ii}\Phi ^{\mathrm {T}} + \Phi {\mathbf {J}}_{ii})\alpha _{\Omega }] \\&\quad + \frac{n}{2}{\mathrm{tr}}[\Omega ^{-1}({\mathbf {J}}_{ii}\Phi ^{\mathrm {T}} + \Phi {\mathbf {J}}_{ii})], \end{aligned}$$

where $U = K' + Y\Omega ^{-1}Y^{\mathrm {T}}$, $\tau = \nu + n -1$ and $\psi _n(\cdot )$ is the derivative of the function $\ln \Gamma _n(\cdot )$ with respect to its argument; the factors of $\frac{1}{2}$ in $\partial {\mathcal {L}}/\partial \nu$ come from the chain rule. The details are provided in “Multivariate Student-t process regression” in the Appendix.
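The MV-TPR likelihood and one of its derivatives can be checked numerically. The sketch below implements the $U$-based form of the negative log marginal likelihood, the equivalent form with $\det({\mathbf{I}}_n + (K')^{-1}Y\Omega^{-1}Y^{\mathrm{T}})$, and the analytic derivative with respect to $\sigma_n^2$, then compares that derivative with a central finite difference. All function names, the step size and the toy data are assumptions made for illustration.

```python
import math

import numpy as np


def log_multigamma(lam, n):
    """ln Gamma_n(lam), the multivariate gamma function from Definition 2."""
    return (n * (n - 1) / 4.0) * math.log(math.pi) + sum(
        math.lgamma(lam + 0.5 - 0.5 * i) for i in range(1, n + 1))


def mvtpr_nll(Y, K, sigma_n2, Omega, nu):
    """Negative log marginal likelihood, U-based form."""
    n, d = Y.shape
    Kp = K + sigma_n2 * np.eye(n)                    # K' = K + sigma_n^2 I
    U = Kp + Y @ np.linalg.solve(Omega, Y.T)         # U = K' + Y Omega^{-1} Y^T
    tau = nu + n - 1
    return (0.5 * (tau + d) * np.linalg.slogdet(U)[1]
            - 0.5 * tau * np.linalg.slogdet(Kp)[1]
            + 0.5 * n * np.linalg.slogdet(Omega)[1]
            + log_multigamma(0.5 * tau, n) - log_multigamma(0.5 * (tau + d), n)
            + 0.5 * d * n * math.log(math.pi))


def mvtpr_nll_first_form(Y, K, sigma_n2, Omega, nu):
    """Same quantity written with det(I_n + K'^{-1} Y Omega^{-1} Y^T)."""
    n, d = Y.shape
    Kp = K + sigma_n2 * np.eye(n)
    inner = np.eye(n) + np.linalg.solve(Kp, Y) @ np.linalg.solve(Omega, Y.T)
    a = 0.5 * (nu + d + n - 1)
    return (a * np.linalg.slogdet(inner)[1]
            + 0.5 * d * np.linalg.slogdet(Kp)[1]
            + 0.5 * n * np.linalg.slogdet(Omega)[1]
            + log_multigamma(0.5 * (nu + n - 1), n) - log_multigamma(a, n)
            + 0.5 * d * n * math.log(math.pi))


def dnll_dsigma_n2(Y, K, sigma_n2, Omega, nu):
    """Analytic derivative: (tau+d)/2 tr(U^{-1}) - tau/2 tr(K'^{-1})."""
    n, d = Y.shape
    Kp = K + sigma_n2 * np.eye(n)
    U = Kp + Y @ np.linalg.solve(Omega, Y.T)
    tau = nu + n - 1
    return (0.5 * (tau + d) * np.trace(np.linalg.inv(U))
            - 0.5 * tau * np.trace(np.linalg.inv(Kp)))


# Toy problem (illustrative values only).
x = np.linspace(0.0, 1.0, 5)
K_demo = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3**2)
Omega_demo = np.array([[1.0, 0.4], [0.4, 2.0]])
Y_demo = np.random.default_rng(3).normal(size=(5, 2))
nu_demo, s2, h = 3.0, 0.1, 1e-5

form_a = mvtpr_nll(Y_demo, K_demo, s2, Omega_demo, nu_demo)
form_b = mvtpr_nll_first_form(Y_demo, K_demo, s2, Omega_demo, nu_demo)
analytic = dnll_dsigma_n2(Y_demo, K_demo, s2, Omega_demo, nu_demo)
numeric = (mvtpr_nll(Y_demo, K_demo, s2 + h, Omega_demo, nu_demo)
           - mvtpr_nll(Y_demo, K_demo, s2 - h, Omega_demo, nu_demo)) / (2.0 * h)
```

The agreement of `form_a` with `form_b` follows from $\det({\mathbf{I}}_n + (K')^{-1}A) = \det(K' + A)/\det(K')$, and the agreement of `analytic` with `numeric` checks the $\sigma_n^2$ derivative.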

Remark 2

It is known that the marginal likelihood function in GPR models is not usually convex with respect to the hyperparameters; therefore, the optimization algorithm may converge to a local optimum, whereas the global one provides better result [12]. As a result, the optimized hyperparameters obtained by maximum likelihood estimation and the performance of GPR models may depend on the initial values of the optimization algorithm [6, 18, 28, 29]. A common strategy adopted by most GPR practitioners is a heuristic method. That is, the optimization is repeated using several initial values generated randomly from a prior distribution based on their expert opinions and experiences, for example, using ten initial values randomly selected from a uniform distribution. The final estimates of the hyperparameters are the ones with the largest likelihood values after convergence [6, 28, 29]. Further discussion on how to select suitable initial hyperparameters can be found in [9, 29]. In our numerical experiments, the same heuristic method is used for both MV-GPR and MV-TPR.
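The multi-start heuristic described in this remark can be sketched generically. In the example below, a one-dimensional non-convex toy function stands in for the negative log marginal likelihood, and plain gradient descent with numerical gradients stands in for the conjugate gradient optimizer; both substitutions, as well as every function name, are assumptions made purely for illustration.

```python
import numpy as np


def gradient_descent(objective, theta0, step=0.01, n_iter=1000, h=1e-6):
    """Very simple local optimizer using central-difference gradients."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        grad = np.array([
            (objective(theta + h * e) - objective(theta - h * e)) / (2.0 * h)
            for e in np.eye(theta.size)])
        theta = theta - step * grad
    return theta


def multistart_minimize(objective, local_minimize, sample_init, n_starts=10, seed=0):
    """Run a local optimizer from several random starts and keep the best result."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_starts):
        theta = local_minimize(objective, sample_init(rng))
        runs.append((objective(theta), theta))
    # Select by objective value only, so arrays are never compared.
    best_val, best_theta = min(runs, key=lambda r: r[0])
    return best_theta, best_val, runs


def toy_objective(theta):
    """Non-convex stand-in for a negative log likelihood with several local minima."""
    return float(np.sin(3.0 * theta[0]) + 0.1 * theta[0] ** 2)


def uniform_init(rng):
    """Draw an initial value from a uniform prior on [-3, 3]."""
    return rng.uniform(-3.0, 3.0, size=1)


best_theta, best_val, all_runs = multistart_minimize(
    toy_objective, gradient_descent, uniform_init, n_starts=10)
```

In practice `toy_objective` would be replaced by the MV-GPR or MV-TPR negative log marginal likelihood as a function of the stacked hyperparameters, and the prior used in `uniform_init` would reflect plausible ranges for each of them.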

Remark 3

Another issue related to the Gaussian process and Student-t process models is the existence of the maximum likelihood estimators (MLEs). To guarantee the existence of the MLE, one needs to show that the system of equations obtained by setting the derivatives of the marginal likelihood to zero has a solution, and to prove that the Hessian at that solution is negative definite. However, due to the complex structure of these models and the non-concavity of their likelihood functions, this issue has not been studied theoretically to the best of our knowledge, even for the conventional univariate Gaussian process regression models. Numerical examples and applications in the literature nevertheless show that the likelihood functions in GPR models often have many local optima, so the heuristic method discussed in Remark 2 needs to be used in order to find an estimate that is as close to optimal as possible. For the MV-GPR and MV-TPR models, in practice we can impose a reasonable range for the parameters based on the data so that (local) optima of the marginal likelihoods always exist. In fact, our numerical examples demonstrate that the likelihood functions are not concave and that multiple local optima often exist; hence, several random initial values are used in order to find optimal estimates of the parameters. Of course, this procedure is not guaranteed to find the global optimum, but our experiments provide considerable empirical evidence that, for the proposed MV-GPR and MV-TPR, (local) optima can be found and that the models using these local optima as hyperparameter estimates perform better than many existing models.
As the focus of this paper is to propose multi-output prediction methods using Gaussian and Student-t processes and to demonstrate their usefulness through numerical examples, and maximum likelihood is just one example of a method for estimating the model parameters, the issue of the existence of the MLE is not studied further here and will be investigated in our future work.

4 Experiments

In this section, we demonstrate the usefulness of MV-GPR and our proposed MV-TPR using some numerical examples, including simulated data and real data.

4.1 Simulated example

We first use simulation examples to evaluate the quality of the parameter estimation and the prediction performance of the proposed models.

4.1.1 Evaluation of parameter estimation

We generate random samples from a bivariate Gaussian process $\varvec{y} \sim \mathcal {MGP}(0,k',\Omega )$, where $k'$ is defined as in (3) with the kernel $k_{\mathrm{SE}}$. The hyperparameters in $k_{\mathrm{SE}}$ are set as $[\ell ,s_f^2] = [0.5, 2.5]$ and $\Omega = \left( {\begin{matrix} 1 &{} 0.8 \\ 0.8 &{} 1 \end{matrix}} \right)$. The variance of the random noise in $k'$ takes values $\sigma _n^2=0.1$, 0.05 and 0.01. As explained in Sect. 3.1, in our models the random noises are included in the kernel; therefore, no additional random errors can be added when the random samples are generated; otherwise, two random error terms will result in identifiability issues in the parameter estimation. The covariate x has 100 equally spaced values in [0, 1]. We utilize the heuristic method discussed in Remark 2 to estimate the parameters. The experiment is repeated 50 times, and we use the parameter median relative error (pMRE) as a measure of the quality of the estimates [4]:

$$\begin{aligned} \text{ pMRE } = \text{ median } \left\{ \frac{|{\hat{\theta }}_i - \theta |}{\theta }, i = 1,2,\ldots ,50 \right\} , \end{aligned}$$

where $|\cdot |$ is the absolute value, ${\hat{\theta }}_i$ is the parameter estimate in repetition i and $\theta$ is the true parameter. The results are shown in Table 1.
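To make the simulation setup and the pMRE concrete, the sketch below draws bivariate samples with the settings stated above ($\ell = 0.5$, $s_f^2 = 2.5$, $\sigma _n^2 = 0.1$, 100 equally spaced inputs, 50 repetitions) and computes the pMRE of one parameter. Two points are assumptions rather than the paper's procedure: the squared-exponential kernel is taken in its standard form, and the maximum likelihood estimation is replaced by a simple moment estimator of $\Omega$ that treats $K'$ as known, using $E[Y^{\mathrm {T}}(K')^{-1}Y] = n\Omega$. The resulting number is therefore not comparable with Table 1.

```python
import numpy as np

def se_kernel(x, ell, sf2):
    """Squared-exponential kernel matrix on 1-D inputs (assumed standard form)."""
    diff = x[:, None] - x[None, :]
    return sf2 * np.exp(-0.5 * diff**2 / ell**2)

def sample_mgp(K_prime, Omega, rng):
    """Draw Y ~ MN(0, K', Omega) as L_K Z L_Omega^T with Z i.i.d. standard normal."""
    L_K = np.linalg.cholesky(K_prime)
    L_O = np.linalg.cholesky(Omega)
    Z = rng.standard_normal((K_prime.shape[0], Omega.shape[0]))
    return L_K @ Z @ L_O.T

def pmre(estimates, truth):
    """Median over repetitions of |estimate - truth| / truth."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.median(np.abs(estimates - truth) / truth))

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)
Omega = np.array([[1.0, 0.8], [0.8, 1.0]])
# The noise variance is part of the kernel, so no extra error term is added later.
K_prime = se_kernel(x, ell=0.5, sf2=2.5) + 0.1 * np.eye(100)

omega12_hat = []
for _ in range(50):
    Y = sample_mgp(K_prime, Omega, rng)
    # Stand-in moment estimator with K' treated as known (not the paper's MLE).
    Omega_hat = Y.T @ np.linalg.solve(K_prime, Y) / len(x)
    omega12_hat.append(Omega_hat[0, 1])

print(pmre(omega12_hat, truth=0.8))
```

Replacing the stand-in estimator with the multi-start maximum likelihood fit, and repeating for each hyperparameter and noise level, would give the kind of quantity the pMRE is meant to summarize.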

Table 1 The pMREs of MV-GP samples with different noise levels estimated by MV-GPR
