Remarks on multivariate Gaussian processes

Gaussian processes occupy one of the leading places in modern statistics and probability theory due to their importance and a wealth of strong results. They are commonly used in connection with problems of estimation and detection, and in many statistical and machine learning models. With the fast development of Gaussian process applications, it is necessary to consolidate the fundamentals of vector-valued stochastic processes, in particular multivariate Gaussian processes, which form the essential theory for many applied problems with multiple correlated responses. In this paper, we propose a precise definition of multivariate Gaussian processes based on Gaussian measures on vector-valued function spaces, and provide an existence proof. In addition, several fundamental properties of multivariate Gaussian processes, such as strict stationarity and independence, are introduced. We further derive multivariate Brownian motion, including the Itô lemma, as a special case of a multivariate Gaussian process, and present a brief introduction to multivariate Gaussian process regression as a useful statistical learning method for multi-output prediction problems.


Introduction
In the theory of stochastic processes, some general results on Gaussian processes play an essential role in the construction of Brownian motion, as both arise naturally from the requirement of independent increments. Furthermore, an understanding of Gaussian processes also gives a better understanding of many fundamentals of stochastic analysis. These factors, together with the simplicity and wealth of important results in the field, have led Gaussian processes to be considered one of the outstanding sub-fields of modern statistics and probability theory.
Nowadays Gaussian processes (GPs) are also often considered in the context of supervised machine learning, where lazy learning and a measure of the similarity between points (the kernel function) are used to predict the value at an unseen point from training data. Rather than inferring a distribution over the parameters of an undetermined parametric function, a GP can be used as a non-parametric model to infer a distribution over functions directly. A GP defines a prior over functions; given some observed function values, it yields a posterior over functions. GPs have proven to be an effective method for nonlinear problems thanks to many desirable properties, such as a clear structure with a Bayesian interpretation, a simple integrated way of obtaining and expressing uncertainty in predictions, and the capability of capturing a wide variety of data features through hyper-parameters [13,2]. Since Neal [11] revealed that many Bayesian neural networks converge to Gaussian processes in the limit of an infinite number of hidden units [16], GPs have been widely used as an alternative to neural networks to solve complicated regression and classification problems in many areas, e.g., Bayesian optimisation [5], time series forecasting [3,10], feature selection [14], and so on.
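To make the prior-to-posterior view concrete, the following is a minimal Python sketch of conventional single-output GP regression; the squared-exponential kernel, its hyper-parameter values, the noise level, and the toy sine data are illustrative assumptions rather than choices made in this paper.

```python
import numpy as np

def sq_exp_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Squared-exponential (RBF) kernel matrix between two sets of 1-D inputs."""
    d = X1[:, None] - X2[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(X_train, y_train, X_test, noise_var=1e-2):
    """Posterior mean and covariance of a zero-mean GP at the test inputs."""
    K = sq_exp_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = sq_exp_kernel(X_train, X_test)
    K_ss = sq_exp_kernel(X_test, X_test)
    L = np.linalg.cholesky(K)                                  # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y
    mean = K_s.T @ alpha                                       # posterior mean
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                                       # posterior covariance
    return mean, cov

# usage: observe a noisy sine and predict on a grid
X = np.linspace(0, 5, 8)
y = np.sin(X) + 0.1 * np.random.randn(8)
Xs = np.linspace(0, 5, 100)
mu, cov = gp_posterior(X, y, Xs)
```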
With the development of Gaussian processes related to machine learning algorithms, the application of Gaussian processes has faced a conspicuous limitation. The classical GP model can only be used to deal with a single-output or single-response problem, because the process itself is defined on R, and as a result the correlations between multiple tasks or responses cannot be taken into consideration [2,15]. In order to overcome this drawback, many advanced Gaussian process models have been proposed, including the dependent Gaussian process [2], Gaussian process regression with multiple response variables [15], and Gaussian process regression for vector-valued functions [1]. The general idea of these methods is to vectorise the multi-response variables and construct a "big" covariance, which describes the correlations between the inputs as well as between the outputs. Intrinsically, these approaches depend on the fact that matrix-variate Gaussian distributions can be reformulated as multivariate Gaussian distributions, and they are still conventional Gaussian process regression models, since the reformulation merely vectorises the multi-response variables, which are assumed to follow a generalised GP with a reproducing kernel [6].
In another development, Chen et al. [4] defined multivariate Gaussian processes (MV-GPs) and proposed a unified framework to perform multi-output prediction using Gaussian processes. This framework does not rely on the equivalence between the vectorised matrix-variate Gaussian distribution and the multivariate Gaussian distribution, and it can easily be used to produce a general elliptical process model, for example the multivariate Student-t process (MV-TP), for multi-output prediction. Both multivariate Gaussian process regression (MV-GPR) and multivariate Student-t process regression (MV-TPR) have closed-form expressions for the marginal likelihoods and predictive distributions under this unified framework, and thus can adopt the same optimisation approaches as used in conventional GP regression. Although Chen et al. [4] showed the usefulness of the proposed methods via data-driven examples, some theoretical issues of multivariate Gaussian processes remain unclear, e.g., the existence of the MV-GP.
When it comes to the theoretical fundamentals of stochastic processes, a close look at measure theory is indispensable. Briefly speaking, (multivariate) Gaussian distributions are Gaussian measures on R^n, and Gaussian processes are Gaussian measures on the function space (R^T, F) (for details refer to Definition 2.3, Definition 2.5, and Theorem 2.2 below). Based on the relationship between Gaussian measures and Gaussian processes, we properly define multivariate Gaussian processes by extending Gaussian measures on function spaces to vector-valued function spaces.
The paper is organised as follows. Section 2 introduces some preliminaries of Gaussian processes, including some useful properties and the proof of existence. Section 3 presents the theoretical definitions of multivariate Gaussian processes together with the proof of existence. Examples and an application of multivariate Gaussian processes, which show their usefulness, are presented in Section 4 and Section 5. Conclusions and a discussion are given in Section 6.

Stochastic process
A stochastic (or random) process is defined by a collection of random variables on a common probability space (Ω, F, P), where Ω is a sample space, F is a σ-algebra and P is a probability measure; the random variables, indexed by some set T, all take values in the same mathematical space S, which must be measurable with respect to some σ-algebra Σ [8]. In other words, for a given probability space (Ω, F, P) and a measurable space (S, Σ), a stochastic process is a collection of S-valued random variables, which can be written as {X_t : t ∈ T}. A stochastic process can also be interpreted or defined as an S^T-valued random variable, where S^T is the space of all possible S-valued functions of t ∈ T that map from the set T into the space S [7]. The set T is usually one of R, R^n, R_+ = [0, +∞), Z = {…, −1, 0, 1, …}, or Z_+ = {0, 1, …}. If T = Z or Z_+, the process is usually called a random sequence, and if T = R^n with n > 1, it is often considered a random field. The set S is called the state space. Indeed, the random variables of the process are not required to take values in one of the sets listed above, but they must share the same measurable space (S, Σ). For example, if S = R^d or D^d with d > 1, the process is called a vector-valued process.

Gaussian measure and distribution
Definition 2.1 (Gaussian measure on R). A Borel probability measure γ on R is called a Gaussian measure with mean µ and variance σ² > 0 if
γ(A) = ∫_A (2πσ²)^{−1/2} exp(−(x − µ)²/(2σ²)) dx
for any measurable set A ∈ B(R).

A random variable X on a probability space (Ω, F, P) is Gaussian with mean µ and variance σ² if its distribution measure is Gaussian, i.e. P(X ∈ A) = γ(A). From the viewpoint of random vectors, we have

Definition 2.2. An n-dimensional random vector X = (X_1, …, X_n) is Gaussian if and only if ⟨a, X⟩ := a^T X = ∑_i a_i X_i is a Gaussian random variable for all a = (a_1, …, a_n) ∈ R^n.
In terms of measure, we have

Definition 2.3 (Gaussian measure on R^n). Let γ be a Borel probability measure on R^n. For each a ∈ R^n, define a random variable Y : R^n → R by the mapping x ↦ ⟨a, x⟩ on the probability space (R^n, B(R^n), γ). The Borel probability measure γ is a Gaussian measure on R^n if and only if the random variable Y is Gaussian for each a.
The matrix Gaussian distribution in statistics is a probability distribution that generalises the multivariate normal distribution to matrix-valued random variables, and it can be defined via the multivariate Gaussian distribution.

Definition 2.4 (Matrix Gaussian distribution). A random matrix X ∈ R^{n×d} is said to be Gaussian with mean matrix M ∈ R^{n×d}, column covariance matrix Σ ∈ R^{n×n} and row covariance matrix Λ ∈ R^{d×d}, denoted X ∼ MN_{n,d}(M, Σ, Λ), if and only if [6]
vec(X^T) ∼ N(vec(M^T), Σ ⊗ Λ),
where ⊗ denotes the Kronecker product and vec(X) denotes the vectorisation of X.
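The Kronecker identity in Definition 2.4 also gives a direct sampling recipe: if L_Σ and L_Λ are Cholesky factors of Σ and Λ, then M + L_Σ Z L_Λ^T, with Z a matrix of i.i.d. standard normals, has exactly this distribution. Below is a minimal Python sketch of that recipe; the function name and the example matrices are our own illustrative choices.

```python
import numpy as np

def sample_matrix_gaussian(M, Sigma, Lambda, rng=None):
    """Draw X ~ MN_{n,d}(M, Sigma, Lambda) via vec(X^T) ~ N(vec(M^T), Sigma ⊗ Lambda)."""
    rng = np.random.default_rng(rng)
    n, d = M.shape
    L_sigma = np.linalg.cholesky(Sigma)    # n x n column-covariance factor
    L_lambda = np.linalg.cholesky(Lambda)  # d x d row-covariance factor
    Z = rng.standard_normal((n, d))
    # Cholesky of Sigma ⊗ Lambda is L_sigma ⊗ L_lambda, so X = M + L_sigma Z L_lambda^T
    return M + L_sigma @ Z @ L_lambda.T

# usage: 4 x 2 matrix Gaussian with independent columns and correlated rows
M = np.zeros((4, 2))
Sigma = np.eye(4)
Lambda = np.array([[1.0, 0.6], [0.6, 1.0]])
X = sample_matrix_gaussian(M, Sigma, Lambda, rng=0)
```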

Gaussian process
Consider the space R^T of all R-valued functions on T. A subset of the form {f : f(t_i) ∈ A_i, 1 ≤ i ≤ n} for some n ≥ 1, t_i ∈ T and some Borel sets A_i ⊆ R is called a cylinder set. Let F be the σ-algebra generated by all cylinder sets. Alternatively, we may consider the product topology on R^T, defined as the smallest topology that makes the projection maps Π_{t_1,…,t_n}(f) = [f(t_1), …, f(t_n)] from R^T to R^n measurable, and define F as the Borel σ-algebra of this topology. We then obtain:

Definition 2.5 (Gaussian measure on (R^T, F)). A measure γ on (R^T, F) is called a Gaussian measure if for any n ≥ 1 and t_1, …, t_n ∈ T, the push-forward measure γ ∘ Π^{−1}_{t_1,…,t_n} on R^n is a Gaussian measure.
Recall that a stochastic process X = (X_t)_{t∈T} is Gaussian if every finite-dimensional marginal [X_{t_1}, …, X_{t_n}] is a Gaussian random vector.

Theorem 2.2 (Relationship between Gaussian process and Gaussian measure). If X = (X_t)_{t∈T} is a Gaussian process, then the push-forward measure γ = P ∘ X^{−1} with X : Ω → R^T is Gaussian on R^T, namely, γ is a Gaussian measure on (R^T, F). Conversely, if γ is a Gaussian measure on (R^T, F), then on the probability space (R^T, F, γ), the coordinate process Π = (Π_t)_{t∈T} is a Gaussian process.
The proof of the relationship between Gaussian process and Gaussian measure can be found in [12].

Theorem 2.3 (Existence of Gaussian process).
For any index set T, any mean function µ : T → R and any covariance function (i.e., any symmetric positive semi-definite function) k : T × T → R, there exists a probability space (Ω, F, P) and a Gaussian process GP(µ, k) on this space whose mean function is µ and covariance function is k. It is denoted as X ∼ GP(µ, k).
Proof. Thanks to Theorem 2.2, we only need to prove the existence of a Gaussian measure with the mean vector generated by the mean function and the covariance matrix generated by the covariance function. Given n > 1, for every t_1, …, t_n ∈ T, the Gaussian measure γ_{t_1,…,t_n} on R^n satisfies the assumptions of the Daniell–Kolmogorov theorem, because the projection of the Gaussian distribution on R^n with n-dimensional mean vector [µ(t_1), …, µ(t_n)] ∈ R^n and n × n covariance matrix K = (k(t_i, t_j)) ∈ R^{n×n} onto the first n − 1 coordinates is precisely the Gaussian distribution with (n − 1)-dimensional mean vector [µ(t_1), …, µ(t_{n−1})] ∈ R^{n−1} and (n − 1) × (n − 1) covariance matrix K = (k(t_i, t_j)) ∈ R^{(n−1)×(n−1)}. By the Daniell–Kolmogorov theorem, there exists a probability space (Ω, F, P) as well as a Gaussian process X = (X_t)_{t∈T} ∼ GP(µ, k) defined on this space such that any finite-dimensional distribution of [X_{t_1}, …, X_{t_n}] is given by the measure γ_{t_1,…,t_n}.
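The finite-dimensional measures γ_{t_1,…,t_n} appearing in this proof are easy to realise numerically: fixing index points t_1, …, t_n, one draws from N([µ(t_i)]_i, (k(t_i, t_j))_{i,j}). The Python sketch below does exactly this; the zero mean and the squared-exponential covariance in the usage example are illustrative assumptions.

```python
import numpy as np

def gp_finite_dim_sample(ts, mean_fn, cov_fn, n_samples=1, jitter=1e-9, rng=None):
    """Sample [X_{t_1}, ..., X_{t_n}] from GP(mean_fn, cov_fn) at the index points ts."""
    rng = np.random.default_rng(rng)
    ts = np.asarray(ts, dtype=float)
    mu = np.array([mean_fn(t) for t in ts])
    K = np.array([[cov_fn(s, t) for t in ts] for s in ts])
    K += jitter * np.eye(len(ts))     # numerical stabiliser for the covariance
    return rng.multivariate_normal(mu, K, size=n_samples)

# usage: five paths of a zero-mean GP with squared-exponential covariance
paths = gp_finite_dim_sample(
    np.linspace(0, 1, 200),
    mean_fn=lambda t: 0.0,
    cov_fn=lambda s, t: np.exp(-0.5 * (s - t) ** 2),
    n_samples=5, rng=0)
```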

Multivariate Gaussian process
Following the classical theory of Gaussian measures and Gaussian processes, we can introduce the Gaussian measure on R^{n×d} and the Gaussian measure on ((R^d)^T, G), and finally define the multivariate Gaussian process. According to Definition 2.3 and Definition 2.4, we have a definition of the Gaussian measure on R^{n×d}.

Definition 3.1 (Gaussian measure on R^{n×d}). Let γ be a Borel probability measure on R^{n×d}. For each a ∈ R^{nd}, define a random variable Y : R^{n×d} → R by the mapping x ↦ ⟨a, vec(x)⟩ on the probability space (R^{n×d}, B(R^{n×d}), γ). The Borel probability measure γ is a Gaussian measure on R^{n×d} if and only if the random variable Y is Gaussian for each a.
In parallel with the construction for Gaussian processes, we now consider the space (R^d)^T of all R^d-valued functions on T. Let G be the σ-algebra generated by all cylinder sets, where each cylinder set is a subset of the form {f : f(t_i) ∈ B_i, 1 ≤ i ≤ n} for some n ≥ 1, t_i ∈ T and some Borel sets B_i ⊆ R^d. Alternatively, we can define the smallest topology on (R^d)^T that makes the projection mappings Ξ_{t_1,…,t_n}(f) = [f(t_1), …, f(t_n)] from (R^d)^T to R^{n×d} measurable, and define G as the Borel σ-algebra of this topology. Thus we have a definition of the Gaussian measure on ((R^d)^T, G).
Definition 3.2 (Gaussian measure on ((R^d)^T, G)). A measure γ on ((R^d)^T, G) is called a Gaussian measure if for any n ≥ 1 and t_1, …, t_n ∈ T, the push-forward measure γ ∘ Ξ^{−1}_{t_1,…,t_n} on R^{n×d} is a Gaussian measure.
Given the relationship between Gaussian processes and Gaussian measures in Theorem 2.2, we can now properly define the multivariate Gaussian process (MV-GP).

Definition 3.3 (d-variate Gaussian process). Given a Gaussian measure γ on ((R^d)^T, G), d ≥ 1, the coordinate random vector Ξ = (Ξ_t)_{t∈T} on the probability space ((R^d)^T, G, γ) is said to be from a d-variate Gaussian process.

Theorem 3.1 (Existence of d-variate Gaussian process). For any index set T, any vector-valued mean function u : T → R^d, any covariance function k : T × T → R and any positive semi-definite parameter matrix Λ ∈ R^{d×d}, there exists a probability space (Ω, G, P) and a d-variate Gaussian process f on this space whose mean function is u, covariance function is k and parameter matrix is Λ, such that, for any n ≥ 1 and t_1, …, t_n ∈ T,
[f(t_1)^T, …, f(t_n)^T]^T ∼ MN_{n,d}([u(t_1)^T, …, u(t_n)^T]^T, K, Λ), where K = (k(t_i, t_j)) ∈ R^{n×n}.

Proof. Given n > 1, for every t_1, …, t_n ∈ T, the Gaussian measure γ_{t_1,…,t_n} on R^{n×d} satisfies the assumptions of the Daniell–Kolmogorov theorem, because the projection of a matrix Gaussian distribution on R^{n×d} with mean matrix [u(t_1)^T, …, u(t_n)^T]^T ∈ R^{n×d}, n × n column covariance matrix K = (k(t_i, t_j)) ∈ R^{n×n} and d × d row covariance matrix Λ ∈ R^{d×d} onto the first n − 1 coordinates is precisely the matrix Gaussian distribution with mean matrix [u(t_1)^T, …, u(t_{n−1})^T]^T ∈ R^{(n−1)×d}, (n − 1) × (n − 1) column covariance matrix K = (k(t_i, t_j)) ∈ R^{(n−1)×(n−1)} and row covariance matrix Λ ∈ R^{d×d}. This is due to the marginal property of the matrix Gaussian distribution shown in Theorem 2.1. By the Daniell–Kolmogorov theorem, there exists a probability space (Ω, G, P) as well as a d-variate Gaussian process X = (X_t)_{t∈T} ∼ MGP_d(u, k, Λ) defined on this space such that any finite-dimensional distribution of [X_{t_1}, …, X_{t_n}] is given by the measure γ_{t_1,…,t_n}.
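Combining Theorem 3.1 with the sampling identity from Definition 2.4, the finite-dimensional law MN_{n,d}(M, K, Λ) of a d-variate GP can be simulated directly. Below is a Python sketch; the kernel, the parameter matrix Λ and the function name are illustrative assumptions.

```python
import numpy as np

def mvgp_sample(ts, mean_fn, cov_fn, Lambda, jitter=1e-9, rng=None):
    """Draw F = [f(t_1)^T; ...; f(t_n)^T] ~ MN_{n,d}(M, K, Lambda) for a d-variate GP."""
    rng = np.random.default_rng(rng)
    n, d = len(ts), Lambda.shape[0]
    M = np.array([mean_fn(t) for t in ts])                   # n x d mean matrix
    K = np.array([[cov_fn(s, t) for t in ts] for s in ts])   # n x n column covariance
    L_k = np.linalg.cholesky(K + jitter * np.eye(n))
    L_lam = np.linalg.cholesky(Lambda + jitter * np.eye(d))
    Z = rng.standard_normal((n, d))
    return M + L_k @ Z @ L_lam.T   # column covariance K, row covariance Lambda

# usage: 2-variate GP with squared-exponential k and strongly correlated outputs
ts = np.linspace(0, 1, 100)
Lambda = np.array([[1.0, 0.8], [0.8, 1.0]])
F = mvgp_sample(ts, lambda t: np.zeros(2),
                lambda s, t: np.exp(-0.5 * (s - t) ** 2), Lambda, rng=0)
```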
Following the existence of the d-variate Gaussian process, we can also establish some of its properties, as follows.

Example: special cases
Naturally, a special case is the centred multivariate Gaussian process, where the vector-valued mean function is u = 0. Fifty realisation samples generated from a centred multivariate Gaussian process are shown in Figure 1 (left). Furthermore, we can derive the multivariate Gaussian white noise and the multivariate Brownian motion.
Remark 4.1. As observed in the proof, d-variate Gaussian white noise has the independence property of white noise along the index set T, but it is correlated across the d variates. Therefore, d-variate Gaussian white noise is also called variate-dependent Gaussian white noise or variate-correlated Gaussian white noise, which is distinct from the traditional d-dimensional independent Gaussian white noise. Fifty realisation samples generated from multivariate Gaussian white noise are shown in Figure 1 (right).
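Such variate-correlated white noise is straightforward to simulate: draw i.i.d. rows of standard normals and correlate the columns with the Cholesky factor of Λ. The Python sketch below is a minimal illustration; the function name and the example Λ are our own.

```python
import numpy as np

def mv_white_noise(n_steps, Lambda, rng=None):
    """d-variate Gaussian white noise: rows independent over t, columns correlated by Lambda."""
    rng = np.random.default_rng(rng)
    L = np.linalg.cholesky(Lambda)
    Z = rng.standard_normal((n_steps, Lambda.shape[0]))
    return Z @ L.T   # each row ~ N(0, Lambda), rows mutually independent

# usage: 500 steps of 2-variate white noise with cross-correlation 0.9
W = mv_white_noise(500, np.array([[1.0, 0.9], [0.9, 1.0]]), rng=0)
```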

Multivariate Brownian motion
According to Chapter 2 of the book by Le Gall [9], Brownian motion can be defined via a Gaussian white noise whose intensity is the Lebesgue measure. Since Brownian motion is a special case of a Gaussian process with continuous sample paths, mean function u = 0 and covariance function k(s, t) = min(s, t), we propose an example, the d-variate Brownian motion, as a special case of the d-variate Gaussian process with vector-valued mean function u = 0, covariance function k(s, t) = min(s, t) and parameter matrix Λ. Based on Theorem 3.1, we extend some properties of the traditional Brownian motion to this more general vector-valued case. Let B_t be a d-variate Brownian motion, which means that for all 0 ≤ t_1 ≤ … ≤ t_n the random matrix Z = (B^T_{t_1}, …, B^T_{t_n})^T ∈ R^{n×d} has a matrix Gaussian distribution on the probability space (Ω, G, P) mentioned in Theorem 3.1. There exist a matrix M ∈ R^{n×d} and two non-negative definite matrices C = [c_{jm}] ∈ R^{n×n} and Λ = [λ_{ab}] ∈ R^{d×d} such that
E[exp(i tr(W^T Z))] = exp(i tr(W^T M) − (1/2) tr(Λ W^T C W)),
where W = [w_{ja}] ∈ R^{n×d} and i is the imaginary unit. Moreover, we also have the mean value M = E[Z] and the two covariance matrices determined by
E[(Z − M)(Z − M)^T] = C tr(Λ), E[(Z − M)^T (Z − M)] = Λ tr(C).
Assume that the mean matrix M here is a zero matrix. Hence, E[B_t] = 0 for all t ≥ 0 and E[B_s B_t^T] = min(s, t) Λ. Moreover, for 0 ≤ s ≤ t, the increment B_t − B_s ∼ N(0, (t − s)Λ) is independent of (B_r)_{r≤s}.
Remark 4.2. Similar to d-variate Gaussian white noise, d-variate Brownian motion also has the independence property along the index set T (independent increments), but it is correlated across the d variates. Therefore, d-variate Brownian motion is also called variate-dependent Brownian motion or variate-correlated Brownian motion, which is distinct from the "traditional" d-dimensional Brownian motion. In fact, the "traditional" d-dimensional Brownian motion is a special case of the d-variate Brownian motion with a diagonal matrix Λ.
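The increment characterisation above suggests a simple simulation: successive increments are independent N(0, dt·Λ) vectors accumulated over a time grid. Here is a minimal Python sketch; the function name, the grid and the example Λ are our own illustrative choices.

```python
import numpy as np

def mv_brownian_motion(Lambda, T=1.0, n_steps=1000, rng=None):
    """Simulate d-variate Brownian motion on [0, T]: increments are i.i.d. N(0, dt * Lambda)."""
    rng = np.random.default_rng(rng)
    d = Lambda.shape[0]
    dt = T / n_steps
    L = np.linalg.cholesky(Lambda)                       # correlate the d variates
    dB = rng.standard_normal((n_steps, d)) @ L.T * np.sqrt(dt)
    B = np.vstack([np.zeros(d), np.cumsum(dB, axis=0)])  # start from B_0 = 0
    return np.linspace(0.0, T, n_steps + 1), B

# usage: 2-variate Brownian motion with cross-correlation 0.5
t, B = mv_brownian_motion(np.array([[1.0, 0.5], [0.5, 1.0]]), rng=0)
```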
As for standard Brownian motion, we can state an Itô lemma for the d-variate Brownian motion. Let B_t = [B_1(t), …, B_d(t)] be the d-variate Brownian motion derived in Section 4.2. Then we have the following lemma.
Lemma 4.1 (Itô lemma for the d-variate Brownian motion). Let F be a twice continuously differentiable real function on R^{d+1} and let Λ = [λ_{i,j}] ∈ R^{d×d} be the covariance matrix for the d-variate dimension. Then,
F(t, B_t) = F(0, B_0) + ∫_0^t (∂F/∂s)(s, B_s) ds + ∑_{i=1}^d ∫_0^t (∂F/∂x_i)(s, B_s) dB_i(s) + (1/2) ∑_{i,j=1}^d λ_{i,j} ∫_0^t (∂²F/∂x_i∂x_j)(s, B_s) ds.
Proof. By the Itô lemma for continuous semimartingales and the definition of the d-variate Brownian motion, we obtain
F(t, B_t) = F(0, B_0) + ∫_0^t (∂F/∂s)(s, B_s) ds + ∑_{i=1}^d ∫_0^t (∂F/∂x_i)(s, B_s) dB_i(s) + (1/2) ∑_{i,j=1}^d ∫_0^t (∂²F/∂x_i∂x_j)(s, B_s) d⟨B_i, B_j⟩(s).
The proof is complete by d⟨B_i, B_j⟩(s) = λ_{i,j} ds.
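The key identity d⟨B_i, B_j⟩(s) = λ_{i,j} ds can be checked numerically: the realised quadratic covariation of simulated paths over [0, T] should approach Λ·T as the step size shrinks. The following is a self-contained Python sanity check under the simulation assumptions sketched above.

```python
import numpy as np

# Monte Carlo check that the realised quadratic covariation of a d-variate
# Brownian motion over [0, T] approaches Lambda * T, i.e. d<B_i, B_j> = lambda_ij ds.
rng = np.random.default_rng(1)
Lambda = np.array([[1.0, 0.5], [0.5, 1.0]])
T, n_steps = 1.0, 200_000
dt = T / n_steps
L = np.linalg.cholesky(Lambda)
dB = rng.standard_normal((n_steps, 2)) @ L.T * np.sqrt(dt)  # correlated increments
qcov = dB.T @ dB             # realised quadratic covariation matrix
print(qcov)                  # ≈ Lambda * T = Lambda for T = 1
```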

Application: multivariate Gaussian process regression
A useful application of the multivariate Gaussian process is multi-output prediction.
Multivariate Gaussian processes provide a solid and unified framework for making predictions with multiple responses by taking advantage of their correlations. As a regression method, multivariate Gaussian process regression (MV-GPR) has closed-form expressions for the marginal likelihoods and predictive distributions, and thus parameter estimation can adopt the same optimisation approaches as used in conventional Gaussian process regression [4]. As a summary of MV-GPR in [4], the noise-free multi-output regression model is considered and the noise term is incorporated into the kernel function. Given n pairs of observations {(x_i, y_i)}_{i=1}^n, x_i ∈ R^p, y_i ∈ R^d, we assume the model
f ∼ MGP_d(0, k′, Λ), y_i = f(x_i), i = 1, …, n,
where Λ is an undetermined covariance (correlation) matrix capturing the relationship between the different outputs, k′(x_i, x_j) = k(x_i, x_j) + δ_{ij} σ_n², and δ_{ij} is the Kronecker delta. According to the multivariate Gaussian process, the collection [f(x_1), …, f(x_n)] follows a matrix-variate Gaussian distribution with column covariance matrix K′, the n × n matrix whose (i, j)-th element is [K′]_{ij} = k′(x_i, x_j), and row covariance matrix Λ. Therefore, the predictive distribution of the targets f_* = [f_{*1}, …, f_{*m}]^T at the test locations X_* = [x_{n+1}, …, x_{n+m}]^T is given by p(f_* | X, Y, X_*) = MN(M̂, Σ̂, Λ̂), where M̂ = K′(X_*, X) K′(X, X)^{−1} Y, Σ̂ = K′(X_*, X_*) − K′(X_*, X) K′(X, X)^{−1} K′(X_*, X)^T, and Λ̂ = Λ.
Here K′(X, X) is the n × n matrix whose (i, j)-th element is [K′(X, X)]_{ij} = k′(x_i, x_j), K′(X_*, X) is the m × n matrix whose (i, j)-th element is [K′(X_*, X)]_{ij} = k′(x_{n+i}, x_j), and K′(X_*, X_*) is the m × m matrix whose (i, j)-th element is [K′(X_*, X_*)]_{ij} = k′(x_{n+i}, x_{n+j}). In addition, the predictive expectation and covariance are obtained as
E[f_*] = M̂, cov(vec(f_*^T)) = Σ̂ ⊗ Λ̂ = [K′(X_*, X_*) − K′(X_*, X) K′(X, X)^{−1} K′(X_*, X)^T] ⊗ Λ.
From the viewpoint of data science, the hyper-parameters involved in the covariance function (kernel) k′(·,·) and the row covariance matrix Λ of MV-GPR need to be estimated from the training data [17], using approaches such as maximum likelihood estimation, maximum a posteriori estimation and Markov chain Monte Carlo [18].
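A minimal Python sketch of these predictive equations follows; the RBF kernel, the noise level and the toy data are illustrative assumptions, and hyper-parameter estimation is omitted. Note that Λ enters the prediction only through the predictive row covariance Λ̂ = Λ: the mean M̂ and column covariance Σ̂ coincide with those of d conventional GP regressions sharing the same kernel.

```python
import numpy as np

def mvgpr_predict(X, Y, X_star, kernel, noise_var=1e-2):
    """MV-GPR predictive moments: M_hat (m x d) and Sigma_hat (m x m); row covariance stays Lambda."""
    n = len(X)
    K = kernel(X, X) + noise_var * np.eye(n)   # k' puts the noise on the diagonal
    K_s = kernel(X_star, X)                    # m x n cross-covariance
    K_ss = kernel(X_star, X_star)
    K_inv_Y = np.linalg.solve(K, Y)
    K_inv_Ks = np.linalg.solve(K, K_s.T)
    M_hat = K_s @ K_inv_Y                      # predictive mean, m x d
    Sigma_hat = K_ss - K_s @ K_inv_Ks          # predictive column covariance
    return M_hat, Sigma_hat

def rbf(A, B, ls=1.0):
    """Squared-exponential kernel matrix for 1-D inputs."""
    d = np.asarray(A)[:, None] - np.asarray(B)[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

# usage: two correlated outputs of a scalar input
X = np.linspace(0, 5, 30)
Y = np.column_stack([np.sin(X), np.cos(X)]) + 0.05 * np.random.randn(30, 2)
M_hat, Sigma_hat = mvgpr_predict(X, Y, np.linspace(0, 5, 80), rbf)
```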

Conclusion
In this paper, we give a proper definition of the multivariate Gaussian process (MV-GP) and present some related properties of this process, such as strict stationarity and independence. We also provide the examples of multivariate Gaussian white noise and multivariate Brownian motion, including the Itô lemma, and present a useful application, multivariate Gaussian process regression, as a statistical learning method under our definition.