1 Introduction

Multi-output Gaussian processes (MOGPs), a generalisation of Gaussian processes (GPs), exploit dependencies among outputs to improve joint prediction performance (Bonilla et al., 2008; Álvarez et al., 2012; Dahl & Bonilla, 2019; Nguyen et al., 2018; Wistuba et al., 2018; Alvarez, 2011). For example, in a sensor network, missing signals from some sensors may be predicted by exploiting dependencies with signals obtained from nearby sensors (Osborne et al., 2008).

Our main purpose in this paper is to use MOGPs to study the problem of multiple outputs where each output is a multi-class classification problem. The setting considered here goes beyond multi-label classification since we allow each output to potentially have its own inputs, moving into the multi-task setting.

MOGPs have mainly been used for multi-output regression to predict continuous variables (Bonilla et al., 2008; Álvarez et al., 2012; Dai et al., 2017). In this setting, the assumption is that each output follows a Gaussian likelihood and the mean of the Gaussian likelihood is given by one output of the MOGP. Due to the properties of the Gaussian distribution, Bayesian inference is tractable in this case.

Beyond the multi-output regression problem, there is some research on other types of outputs in MOGPs. For example, Skolidis and Sanguinetti (2011) use MOGPs to model a setting where each output corresponds to a binary classification problem. Each binary outcome is modelled using a probit likelihood. The MOGP corresponds to the so-called intrinsic coregionalisation model (ICM) (Bonilla et al., 2008). Since Bayesian inference is intractable in this model, the authors approximate posterior distributions using expectation propagation and variational Bayes.

Several research works have addressed the case of multi-class classification using GPs. Previous works have used the softmax likelihood (Williams & Rasmussen, 2006; Kim & Ghahramani, 2006; Galy-Fajou et al., 2020), the multinomial probit likelihood function (Girolami & Rogers, 2006) and the step function (Hernández-Lobato et al., 2011). Recently, Liu et al. (2019) showed that all the above likelihoods can be recovered through additive noise terms. The parameters in these likelihood functions are assumed to follow independent Gaussian processes. Another strand of work generalises this setting by allowing correlated Gaussian processes for the latent parameters of the likelihood functions, typically using MOGPs. Both Dezfouli and Bonilla (2015) and Chai (2012) use an ICM for a single-output multi-class classification problem modelled through a multinomial logistic likelihood, i.e. the softmax likelihood. In terms of Bayesian inference, Chai (2012) proposes a variational sparse approximation for the posterior distribution, and, based on scalable automated variational inference, Dezfouli and Bonilla (2015) approximate the posterior distribution by a mixture of Gaussians. Moreno-Muñoz et al. (2018) build a heterogeneous multi-output Gaussian process, where each output has its own likelihood, through a linear model of coregionalisation (LMC) (Álvarez et al., 2012). Moreno-Muñoz et al. (2018) use a stochastic variational approach for Bayesian inference.

The approaches for single-output multi-class classification described above are restricted to the case where the number of classes is small. They scale poorly when the number of classes goes beyond a few tens. Scalability is also poorly handled by the more general model of Moreno-Muñoz et al. (2018) for the multi-output multi-class classification case, where, once again, problems that go beyond a few tens of classes are out of reach.

Our main contribution in this paper is a new extension of multi-output GPs able to handle large-scale multi-output multi-class classification problems, typically in the range of hundreds and even thousands of classes. We achieve scalability by subsampling both training input data and classes in each output, by using stochastic variational inference (Hensman et al., 2013; Moreno-Muñoz et al., 2018), and by choosing a softmax likelihood function derived via Gumbel noise for all outputs. We refer to this model as Multi-output Gaussian Processes with Augment & Reduce (MOGPs-AR).

We also enable MOGPs-AR to accept downsized images as input data. To efficiently deal with downsized images, we employ convolutional kernels (Van der Wilk et al., 2017), computing the entries of the kernel matrices using kernels over patches of the images and integrating these kernels within a MOGP. Since our model is able to capture both intra- and inter-output dependencies, it also provides a means to perform transfer learning in the multi-task setting. We show an example of this multi-task learning ability of our model on the Omniglot dataset. To the best of our knowledge, this is the first time that a multi-task multi-class Gaussian process model is used on such a dataset.

2 Related work

As mentioned earlier, the multi-class classification problem has mainly been studied using single-output GPs (Williams & Rasmussen, 2006; Kim & Ghahramani, 2006; Hernández-Lobato et al., 2011; Girolami & Rogers, 2006; Liu et al., 2019). The model introduced in this paper, MOGPs-AR, uses the softmax likelihood obtained through additive noise errors, as in Liu et al. (2019). However, MOGPs-AR solves multiple-output problems jointly, while the model in Liu et al. (2019), like all single-output GPs, only solves single-output problems. Even for single-output problems, MOGPs-AR can improve prediction by exploiting the correlation between all latent parameter functions, whereas single-output GPs cannot capture this correlation.

The works most relevant to ours are Chai (2012); Dezfouli and Bonilla (2015); Skolidis and Sanguinetti (2011); Moreno-Muñoz et al. (2018). Both Chai (2012) and Dezfouli and Bonilla (2015) can only handle a single-output multi-class classification problem even though they use MOGPs, whereas our model can tackle multiple outputs where each output is a multi-class classification problem. Skolidis and Sanguinetti (2011) only solve multi-output binary classification problems, which differs from our setting. Compared with Skolidis and Sanguinetti (2011), our inference method is also suited to large-scale data sets. Moreno-Muñoz et al. (2018) can tackle multi-output multi-class classification problems and develop a stochastic variational inference method similar to ours. However, we differ from Moreno-Muñoz et al. (2018) in that we can cope with a large number of classes by subsampling classes and can also deal with downsized images through convolutional kernels (Van der Wilk et al., 2017).

The work by Panos et al. (2021) is closely related to ours since it uses a similar subsampling method. Panos et al. (2021) extend a semiparametric latent factor model, a special case of the LMC, to address the multi-label problem by using a sigmoidal/Bernoulli likelihood for each latent parameter function. Panos et al. (2021) doubly subsample data points and classes to reduce the computational complexity of stochastic variational inference, analogously to our approach. However, our work differs in other aspects. First, we solve multi-class classification problems using the softmax likelihood instead of multi-label problems using a sigmoidal/Bernoulli likelihood. Further, we can apply a convolutional kernel to handle downsized image data. Finally, our model can deal with multi-output problems instead of only tackling single-output problems.

3 Methodology

In this section, we derive the MOGPs-AR model. We first develop the LMC model with a convolutional kernel. We then define the softmax likelihood through augmenting the model with noise variables. Finally, we describe stochastic variational inference and the approximate predictive distribution for our model.

We assume there are D different outputs (Table 1 shows the description of our notation). The vector \(\textbf{y}\in \textbf{R}^{D}\) groups all the D different outputs:

$$\begin{aligned} \textbf{y}(\textbf{x})=\left[ y_{1}(\textbf{x}), y_{2}(\textbf{x}), \cdots , y_{D}(\textbf{x})\right] ^{\top }, \end{aligned}$$
(1)

where \(\textbf{x} \in \textbf{R}^{v}\). Each output \(y_{d}(\textbf{x}) \in \{1,...,C_{d}\} (C_{d} \ge 2 \text { and } d \in \{1,...,D\})\) is a categorical variable and \(C_{d}\) is the number of classes in the d-th output. Like Moreno-Muñoz et al. (2018), we also assume that those outputs are conditionally independent given parameters \(\varvec{\theta }(\textbf{x})=\left[ \theta _{1}(\textbf{x}), \theta _{2}(\textbf{x}), \cdots , \theta _{D}(\textbf{x})\right] ^{\top }\), where \(\varvec{\theta }(\textbf{x})\) is defined by latent parameter functions:

$$\begin{aligned} \textbf{f}(\textbf{x})=\left[ f_{1}^{1}(\textbf{x}), f_{1}^{2}(\textbf{x}), \cdots f_{1}^{C_{1}}(\textbf{x}), f_{2}^{1}(\textbf{x}), f_{2}^{2}(\textbf{x}), \cdots , f_{D}^{C_{D}}(\textbf{x})\right] ^{\top } \in \textbf{R}^{C \times 1}, \end{aligned}$$
(2)

where \(C=\sum _{d=1}^{D} C_{d}\) and \(f_{d}^{c}(\textbf{x})\) is the c-th latent parameter function in the d-th output evaluated at \(\textbf{x}\). We then obtain:

$$\begin{aligned} p(\textbf{y}(\textbf{x}) | \varvec{\theta }(\textbf{x}))&=p(\textbf{y}(\textbf{x}) | \textbf{f}(\textbf{x})) \end{aligned}$$
(3)
$$\begin{aligned}&=\prod _{d=1}^{D} p\left( y_{d}(\textbf{x}) | \varvec{\theta }_{d}(\textbf{x})\right) \end{aligned}$$
(4)
$$\begin{aligned}&=\prod _{d=1}^{D} p\left( y_{d}(\textbf{x}) | \widetilde{\textbf{f}}_{d}(\textbf{x})\right) , \end{aligned}$$
(5)

where \(\widetilde{\textbf{f}}_{d}(\textbf{x})=\left[ f_{d}^{1}(\textbf{x}), \cdots , f_{d}^{C_{d}}(\textbf{x})\right] ^{\top } \in \textbf{R}^{C_{d} \times 1}\) is a group of latent parameter functions defining the parameters in \(\varvec{\theta }_{d}(\textbf{x})\).

Table 1 Nomenclature

3.1 Combining with a convolutional kernel

We use the linear model of coregionalisation (LMC) and combine it with the convolutional kernel. The LMC is a popular model in MOGPs, where each output is expressed as a linear combination of a collection of Gaussian processes (Álvarez et al., 2012). The convolutional kernel (Van der Wilk et al., 2017) can effectively exploit features in images.

We construct a convolutional structure for mutually independent latent functions \(\mathcal {U}=\left\{ u_{q}(\textbf{x})\right\} _{q=1}^{Q}\), where each \(u_{q}\) follows a Gaussian process, Q is the number of latent functions and each latent parameter function \(f_{d}^{c}(\textbf{x})\) is a linear combination of the latent functions \(\mathcal {U}\). Here, we assume \(\textbf{x} \in \textbf{R}^{W \times H}\) is an image data point of size \(v=W \times H\), where W and H are the width and height of the image, respectively. We also assume \(\textbf{x}^{[p]}\) is the \(p^{\text {th}}\) patch of \(\textbf{x}\), with patches of size \(E=w \times h\), where w and h are the width and height of each patch, respectively.
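As an illustration of this patch decomposition, the following minimal NumPy sketch (purely illustrative, not part of our implementation; the helper name extract_patches is our own) extracts all patches of an image with stride one.

```python
import numpy as np

def extract_patches(x, w, h):
    """Return all w x h patches of a W x H image (stride 1, no padding).

    The result has shape (P, w * h), so each row is one flattened patch x^[p].
    """
    W, H = x.shape
    patches = [
        x[i:i + w, j:j + h].ravel()
        for i in range(W - w + 1)
        for j in range(H - h + 1)
    ]
    return np.stack(patches)

# Example: a 20 x 20 image with 5 x 5 patches gives 16 * 16 = 256 patches.
x = np.random.rand(20, 20)
print(extract_patches(x, 5, 5).shape)  # (256, 25)
```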

After dividing an image into patches, we get a total of \(P=(W-w+1) \times (H-h+1)\) patches. We begin with a patch response function \(u_q \left( \textbf{x}^{[p]}\right) : \textbf{R}^{w \times h} \rightarrow \textbf{R}\), which maps a patch of size \(E=w \times h\) to a real number in \(\textbf{R}\). Then we add a weight for each patch response function and get a latent function \(u_{q}(\textbf{x}): \textbf{R}^{W \times H} \rightarrow \textbf{R}\), where \(u_{q}(\textbf{x})\) is the sum of all patch responses with weights: \(u_{q}(\textbf{x})=\sum _{p} w_{p} u_q\left( \textbf{x}^{[p]}\right)\). Each function \(u_q\) is drawn from an independent GP prior: \(u_{q}(\cdot ) \sim \mathcal {G} \mathcal {P}\left( 0, k_{q}(\cdot , \cdot )\right)\), where \(k_q(\cdot ,\cdot )\) can be any kernel function. In this paper, we use the radial basis function kernel with automatic relevance determination (RBF-ARD) (Williams & Rasmussen, 2006):

$$\begin{aligned} k_{{\text {ard}}}\left( \textbf{x}^{[p]}, \textbf{x}^{[p^{\prime }]}\right) =\sigma _{\textrm{ard}}^{2} \exp \left( -\frac{1}{2} \sum _{j=1}^{E} \frac{\left( x^{[p]}_{j} - x^{[p^{\prime }]}_{j}\right) ^{2}}{l_{j}^{2}}\right) , \end{aligned}$$
(6)

where \(x^{[p]}_{j}\) is the j-th dimension of \(\textbf{x}^{[p]}\), \(\sigma _{\textrm{ard}}^{2}\) is a variance parameter and \(l_{j}\) is the length scale for the j-th input dimension. Therefore, \(k_q(\cdot ,\cdot )\) is RBF-ARD. When all length scales are equal, the kernel is known as the radial basis function (RBF) kernel (Lawrence & Hyvärinen, 2005). Hence, each \(f_{d}^{c}(\textbf{x})\) is defined as

$$\begin{aligned} f_{d}^{c}(\textbf{x})=\sum _{q=1}^{Q} \sum _{i=1}^{R_{q}} a_{d, c, q}^{i} u_{q}^{i}(\textbf{x})=\sum _{q=1}^{Q} \sum _{i=1}^{R_{q}} a_{d, c, q}^{i} \bigg (\sum _{p=1}^{P} w_{p} u_{q}^{i}(\textbf{x}^{[p]}) \bigg ), \end{aligned}$$
(7)

where \(a_{d, c, q}^{i} \in \textbf{R}\) can be considered as a weight on \(\mathcal {U}\) and we assume \(\{\phi _q\}_{q=1}^{Q}\) are the hyperparameters for \(\{k_{q}(\cdot , \cdot )\}_{q=1}^{Q}\), with \(\phi _q\) being the hyperparameters for the kernel \(k_{q}(\cdot , \cdot )\). \(R_{q}\) represents the number of latent functions \(u_{q}^{i}(\textbf{x})\) that are sampled independently and identically from the Gaussian process \(u_{q}(\cdot ) \sim \mathcal {G} \mathcal {P}\left( 0, k_{q}(\cdot , \cdot )\right)\). The difference between the convolutional kernel model and a more classic kernel, e.g., RBF, is that we use the convolutional structure term \(\sum _{p=1}^{P} w_{p} u_{q}^{i}(\textbf{x}^{[p]})\) instead of solely \(u_{q}^{i}(\textbf{x})\); Fig. 1 shows an example of two images and how they are handled through the convolutional kernel. With \(q=1,...,Q\) and \(i=1,...,R_{q}\), each function \(u_{q}^{i}(\textbf{x})\) has zero mean and covariance \({\text {cov}}\left[ u_{q}^{i}(\textbf{x}), u_{q^{\prime }}^{i^{\prime }}(\textbf{x}^{\prime })\right]\) = \(\sum _{p=1}^{P} \sum _{p^{\prime }=1}^{P} w_{p} w_{p^{\prime }} k_{q}\left( \textbf{x}^{[p]}, \textbf{x}^{\prime \left[ p^{\prime }\right] }\right)\) if \(i=i^\prime\) and \(q=q^\prime\), and zero otherwise. Let the mean function of \(f_{d}^{c}(\textbf{x})\) be zero and the cross-covariance function of \(f_{d}^{c}(\textbf{x})\) be

$$\begin{aligned} k_{f_{d}^{c} f_{d^{\prime }}^{c^{\prime }}}\left( \textbf{x}, \textbf{x}^{\prime }\right) =\;&{\text {cov}}\left[ f_{d}^{c}(\textbf{x}), f_{d^{\prime }}^{c^{\prime }}\left( \textbf{x}^{\prime }\right) \right] \end{aligned}$$
(8)
$$\begin{aligned} =\;&{\text {cov}}\left[ \sum _{q=1}^{Q} \sum _{i=1}^{R_{q}} a_{d, c, q}^{i} u_{q}^{i}(\textbf{x}), \sum _{q^\prime =1}^{Q} \sum _{i^{\prime }=1}^{R_{q^{\prime }}} a_{d^\prime , c^{\prime }, q^\prime }^{i^\prime } u_{q^\prime }^{i^\prime }(\textbf{x}^\prime )\right] \end{aligned}$$
(9)
$$\begin{aligned} =&\sum _{q=1}^{Q}\sum _{q^{\prime }=1}^{Q}\sum _{i=1}^{R_{q}} \sum _{i^{\prime }=1}^{R_{q^{\prime }}} a_{d, c, q}^{i}a_{d^\prime , c^{\prime }, q^\prime }^{i^\prime } {\text {cov}}\left[ u_{q}^{i}(\textbf{x}), u_{q^\prime }^{i^\prime }(\textbf{x}^\prime )\right] . \end{aligned}$$
(10)

Because \(u_{q}^{i}(\cdot )\) is independently and identically drawn from \(u_{q}(\cdot )\) and the latent functions \(\mathcal {U}\) are mutually independent, we obtain

$$\begin{aligned} k_{f_{d}^{c} f_{d^{\prime }}^{c^{\prime }}}\left( \textbf{x}, \textbf{x}^{\prime }\right) = \sum _{q=1}^{Q} b_{(d, c),\left( d^{\prime }, c^{\prime }\right) }^{q} \left[ \sum _{p=1}^{P} \sum _{p^{\prime }=1}^{P} w_{p} w_{p^{\prime }} k_{q}\left( \textbf{x}^{[p]}, \textbf{x}^{\prime \left[ p^{\prime }\right] }\right) \right] , \end{aligned}$$
(11)

where \(b_{(d, c),\left( d^{\prime }, c^{\prime }\right) }^{q}=\sum _{i=1}^{R_{q}} a_{d, c, q}^{i} a_{d^{\prime }, c^{\prime }, q}^{i}\). For simplicity of presentation, we assume that all outputs \(y_{d}(\textbf{x})\) share the same collection of input vectors \(\textbf{X}=\left\{ \textbf{x}_{n}\right\} _{n=1}^{N} \in \textbf{R}^{N \times v}\); our model also works when each output has a different input data set. For notational simplicity, we define

$$\begin{aligned} \textbf{f}_{d}^{c}&=\left[ f_{d}^c\left( \textbf{x}_{1}\right) , \cdots , f_{d}^{c}\left( \textbf{x}_{N}\right) \right] ^{\top } \in \textbf{R}^{N \times 1} \end{aligned}$$
(12)
$$\begin{aligned} \widetilde{\textbf{f}}_{d}&=\left[ \left( \textbf{f}_{d}^1\right) ^{\top }, \cdots , \left( \textbf{f}_{d}^{C_{d}}\right) ^{\top }\right] ^{\top } \in \textbf{R}^{C_{d} N \times 1} \end{aligned}$$
(13)
$$\begin{aligned} \textbf{f}&=\left[ \tilde{\textbf{f}}_{1}^{\top }, \cdots , \widetilde{\textbf{f}}_{D}^{\top }\right] ^{\top } \in \textbf{R}^{C N \times 1} \end{aligned}$$
(14)

The prior distribution of \(\textbf{f}\) is given by \(\textbf{f} \sim \mathcal {N}(\textbf{0}, \textbf{K})\), where \(\textbf{K}\) is a block-wise matrix based on \(\left\{ \textbf{K}_{\textbf{f}_{d}^{c} \textbf{f}_{d^{\prime }}^{c^{\prime }}}\right\} _{d=1, d^{\prime }=1, c=1, c^{\prime }=1}^{D, D, C_{d}, C_{d^{\prime }}}\) as each block and \(\textbf{K}_{\textbf{f}_{d}^{c} \textbf{f}_{d^{\prime }}^{c^{\prime }}}\) has entries computed using \(k_{f_{d}^{c} f_{d^{\prime }}^{c^{\prime }}}\left( \textbf{x}_{n}, \textbf{x}_{m}\right)\) with \(\textbf{x}_{n}, {\textbf {x}}_{m} \in \textbf{X}\). \(\textbf{K}\) can be formulated as a sum of Kronecker products \(\textbf{K}=\sum _{q=1}^{Q} \textbf{A}_{q} \textbf{A}_{q}^{\top } \otimes \textbf{K}_{q}=\sum _{q=1}^{Q} \textbf{B}_{q} \otimes \textbf{K}_{q}\) as well, where \(\textbf{A}_{q} \in \textbf{R}^{C \times R_{q}}\) and \(\textbf{B}_{q}\) have entries \(\left\{ a_{d, c, q}^{i}\right\} _{d=1, c=1, i=1}^{D, C_{d},R_{q}}\) and \(\left\{ b_{(d, c),\left( d^{\prime }, c^{\prime }\right) }^{q}\right\} _{d=1, d^{\prime }=1, c=1, c^{\prime }=1}^{D, D, C_{d}, C_{d^{\prime }}}\), respectively. \(\textbf{K}_{q} \in \textbf{R}^{N \times N}\) has entries computed using \(\sum _{p=1}^{P} \sum _{p^{\prime }=1}^{P} w_{p} w_{p^{\prime }} k_{q}\left( \textbf{x}_{n}^{[p]}, \textbf{x}_{m}^{\left[ p^{\prime }\right] }\right)\) for \(\textbf{x}_{n}, \textbf{x}_{m} \in \textbf{X}\). Each matrix \(\textbf{B}_{q} \in \textbf{R}^{C \times C}\) is known as a coregionalisation matrix and it controls the correlation between each latent parameter function.
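The following minimal NumPy sketch (illustrative only) assembles such a kernel matrix: each \(\textbf{K}_{q}\) sums a base kernel over pairs of weighted patches as in Eq. (11), and \(\textbf{K}\) is the sum of Kronecker products \(\sum _{q=1}^{Q} \textbf{B}_{q} \otimes \textbf{K}_{q}\). For readability the sketch uses an RBF kernel with a single shared length scale instead of RBF-ARD, and all helper names and toy sizes are our own choices.

```python
import numpy as np

def extract_patches(x, w, h):
    # All w x h patches of a W x H image, flattened into rows.
    W, H = x.shape
    return np.stack([x[i:i + w, j:j + h].ravel()
                     for i in range(W - w + 1)
                     for j in range(H - h + 1)])

def rbf(A, B, variance=1.0, lengthscale=1.0):
    # Squared-exponential kernel between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def conv_kernel_matrix(images, weights, w, h, **kern_args):
    # K_q with entries sum_p sum_p' w_p w_p' k_q(x_n^[p], x_m^[p'])  (Eq. 11).
    patches = [extract_patches(x, w, h) for x in images]
    N = len(images)
    K = np.zeros((N, N))
    for n in range(N):
        for m in range(N):
            Kpp = rbf(patches[n], patches[m], **kern_args)  # (P, P) patch kernel
            K[n, m] = weights @ Kpp @ weights
    return K

# Toy setup: N images, Q latent functions, C latent parameter functions in total.
rng = np.random.default_rng(0)
N, W, H, w, h = 6, 10, 10, 3, 3
Q, C, R = 2, 5, 1
images = [rng.random((W, H)) for _ in range(N)]
P = (W - w + 1) * (H - h + 1)
weights = np.full(P, 1.0 / P)                      # patch weights w_p

K = np.zeros((C * N, C * N))
for q in range(Q):
    A_q = rng.standard_normal((C, R))              # coregionalisation weights a_{d,c,q}^i
    B_q = A_q @ A_q.T                              # coregionalisation matrix B_q
    K_q = conv_kernel_matrix(images, weights, w, h, lengthscale=1.0 + q)
    K += np.kron(B_q, K_q)                         # K = sum_q B_q (Kronecker) K_q
print(K.shape)                                     # (C*N, C*N)
```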

Fig. 1 An example of two images used as inputs to the convolutional kernel. The two images are two characters of the Ojibwe alphabet (see Sect. 4.4.1 for more detail). We consider the two characters as two classes, and each image is one data point of its class. Left: the whole image is considered as an input data point \(\textbf{x}\) and the blue grid represents the p-th patch \(\textbf{x}^{[p]}\). Right: the whole image is considered as an input data point \(\textbf{x}^{\prime }\) and the blue grid represents the \(p^{\prime }\)-th patch \(\textbf{x}^{\prime \left[ p^{\prime }\right] }\) (Color figure online)

3.2 Augmenting the model with noise

In this section, we generalise the model in the previous subsection to cope with the multi-output multi-class classification problem using the softmax likelihood. We derive the softmax likelihood function through a Gumbel noise error for a generic output \(y_d\).

We take the d-th output \(y_{d}(\textbf{x})\) with the latent parameter functions \(\widetilde{\textbf{f}}_{d}(\textbf{x}) =\left[ f_{d}^{1}(\textbf{x}), \cdots , f_{d}^{C_{d}}(\textbf{x})\right] ^{\top }\). We first add a Gumbel noise error to each of the latent parameter functions in \(\widetilde{\textbf{f}}_{d}(\textbf{x})\) to get a new vector function \(\textbf{h}_{d}(\textbf{x})=\left\{ h_{d}^{c}(\textbf{x})\right\} _{c=1}^{C_d}\), with one component for each of the classes in the d-th output. We thus obtain:

$$\begin{aligned} h_{d}^{c}(\textbf{x})&=f_{d}^{c}(\textbf{x})+\epsilon _{d,i}^{c}, \end{aligned}$$
(15)
$$\begin{aligned} y_d(\textbf{x})&=\underset{c}{{\text {argmax}}} \, h_{d}^{c}(\textbf{x}), \end{aligned}$$
(16)

where \(\epsilon _{d,i}^{c}\) is the i-th i.i.d. Gumbel noise error for class c in the d-th output. We define \(\textbf{h}_{d}(\textbf{x}_{i})=\left( h_{d}^{1}\left( \textbf{x}_{i}\right) , \cdots , h_{d}^{C_{d}}\left( \textbf{x}_{i}\right) \right) ^{\top } \in \textbf{R}^{C_{d} \times 1}\). We then employ the internal step likelihood (Liu et al., 2019):

$$\begin{aligned} p\left( y_{d}\left( \textbf{x}_{i}\right) | \textbf{h}_{d}(\textbf{x}_{i})\right)&=\prod _{c \ne y_{d}\left( \textbf{x}_{i}\right) } H\left( h_{d}^{y_{d} \left( \textbf{x}_{i}\right) }\left( \textbf{x}_{i}\right) -h_{d}^{c}\left( \textbf{x}_{i}\right) \right) \end{aligned}$$
(17)
$$\begin{aligned}&=\prod _{c \ne y_{d}\left( \textbf{x}_{i}\right) } H\left( f_{d}^{y_{d}\left( \textbf{x}_{i}\right) }\left( \textbf{x}_{i}\right) +\epsilon _{d,i}^{y_{d}\left( \textbf{x}_{i}\right) }-f_{d}^{c}\left( \textbf{x}_{i}\right) -\epsilon _{d,i}^{c}\right) , \end{aligned}$$
(18)

where \(H(z) = 1\) when \(z > 0\) and \(H(z) = 0\) otherwise, and \(y_{d}\left( \textbf{x}_{i}\right)\) is the i-th point in the d-th output. By integrating \(\textbf{h}_{d}(\textbf{x}_{i})\) out, we get

$$\begin{aligned} p\left( y_{d}\left( \textbf{x}_{i}\right) | \widetilde{\textbf{f}}_{d}(\textbf{x}_{i})\right) =&\int p\left( y_{d}\left( \textbf{x}_{i}\right) | \textbf{h}_{d}(\textbf{x}_{i})\right) p\left( \textbf{h}_{d}(\textbf{x}_{i}) | \widetilde{\textbf{f}}_{d}(\textbf{x}_{i})\right) d \textbf{h}_{d}(\textbf{x}_{i}) \end{aligned}$$
(19)
$$\begin{aligned} =&\int \phi _{\mathcal {G}}\left( \epsilon _{d,i}\right) \prod _{c \ne y_{d}\left( \textbf{x}_{i}\right) } \Phi _{\mathcal {G}}\left( \epsilon _{d,i}+f_{d}^{y_{d}\left( \textbf{x}_{i}\right) }\left( \textbf{x}_{i}\right) - f_{d}^{c}\left( \textbf{x}_{i}\right) \right) d \epsilon _{d, i}, \end{aligned}$$
(20)

where \(\phi _{\mathcal {G}}\) and \(\Phi _{\mathcal {G}}\) are the probability density function and the cumulative distribution function of the Gumbel distribution, respectively. (NB: we drop the superscript c in \(\epsilon _{d,i}^c\) for convenience since all the \(\epsilon _{d,i}^c\) come from the same Gumbel error distribution.) Now, we assume the Gumbel error \(\epsilon _{d,i} \sim \phi _{\mathcal {G}}(\epsilon _{d,i} | 0,1)\), so we obtain \(\phi _{\mathcal {G}}(\epsilon _{d, i})=\) \(\exp \left( -\epsilon _{d,i}-e^{-\epsilon _{d,i}}\right)\) and \(\Phi _{\mathcal {G}}(\epsilon _{d, i})=\exp \left( -e^{-\epsilon _{d, i}}\right)\). From expression (20), we recover a softmax likelihood (Liu et al., 2019; Ruiz et al., 2018):

$$\begin{aligned} p\left( y_{d}\left( \textbf{x}_{i}\right) | \widetilde{\textbf{f}}_{d}(\textbf{x}_{i})\right) =\frac{\exp \left( f_{d}^{y_{d}\left( \textbf{x}_{i}\right) }\left( \textbf{x}_{i}\right) \right) }{\sum _{c=1}^{C_{d}} \exp \left( f_{d}^{c}\left( \textbf{x}_{i}\right) \right) }. \end{aligned}$$
(21)

The softmax likelihood is a common likelihood used in multi-class classification with Gaussian processes (Williams & Rasmussen, 2006). As mentioned in expression (5), all outputs are conditionally independent given the corresponding latent parameter functions, so each output has its own likelihood of the form of expression (20).
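As a quick numerical sanity check of the argmax-plus-Gumbel construction in Eqs. (15), (16) and (21), the following illustrative sketch (ours) draws Gumbel noise, takes the argmax and compares the empirical class frequencies against the softmax probabilities; the latent values f are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.array([1.2, -0.3, 0.5, 2.0])          # latent parameter function values f_d^c(x)

# Softmax likelihood of Eq. (21).
softmax = np.exp(f - f.max())
softmax /= softmax.sum()

# Monte Carlo: y = argmax_c (f_d^c(x) + Gumbel noise), Eqs. (15)-(16).
S = 200_000
gumbel = rng.gumbel(loc=0.0, scale=1.0, size=(S, f.size))
y = np.argmax(f + gumbel, axis=1)
empirical = np.bincount(y, minlength=f.size) / S

print(np.round(softmax, 3))
print(np.round(empirical, 3))  # should closely match the softmax probabilities
```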

3.3 Scalable variational inference

We have derived the LMC model with a convolutional kernel and used the softmax likelihood. However, a computational challenge remains when there is a very large number of classes and training instances in each output. We thus use scalable variational inference to reduce the computational complexity through the techniques of inducing patches and subsampling, and we refer to the resulting model as Multi-output Gaussian Processes with Augment & Reduce (MOGPs-AR). Inducing patches ease the computational complexity of the inversion of a kernel matrix from \(\mathcal {O}\left( N^{3}\right)\) to \(\mathcal {O}\left( N M^{2}\right)\), where N is the number of data points per output and M is the number of inducing patches (\(M \ll N\)). Subsampling reduces the computational complexity of our model by using a subset of both training data and classes for each output during the optimisation of parameters and hyperparameters.

3.3.1 Inducing patches for MOGPs-AR

We assume in this section that we use image data sets. We define the inducing patches (Van der Wilk et al., 2017) at the level of the latent functions \(\mathcal {U}\). If our input data sets are not image data sets, we use inducing points (Hensman et al., 2013). The difference between inducing points and inducing patches is their dimensionality: the dimensions of the inducing points match the input data, whereas the dimensions of the inducing patches match the patches of the images.

We first define a group of M inducing patches (Van der Wilk et al., 2017) \(\textbf{Z}=\left\{ \textbf{z}_{m}\right\} _{m=1}^{M} \in \textbf{R}^{M \times E}\) for each latent function \(u_{q}\). We then obtain \(\textbf{u}_{q}=\left[ u_{q}\left( \textbf{z}_{1}\right) , \cdots , u_{q}\left( \textbf{z}_{M}\right) \right] ^{\top }\) evaluated at \(\textbf{Z}\). All latent functions \(\mathcal {U}=\left\{ u_{q}(\textbf{x})\right\} _{q=1}^{Q}\) have different inducing patches and \(\textbf{u}=\left[ \textbf{u}_{1}^{\top }, \cdots , \textbf{u}_{Q}^{\top }\right] ^{\top } \in \textbf{R}^{Q M \times 1}\). Since the latent functions \(\mathcal {U}\) are mutually independent, the distribution \(p(\textbf{u})\) factorises as \(p(\textbf{u})=\prod _{q=1}^{Q} p\left( \textbf{u}_{q}\right)\), with \(\textbf{u}_{q} \sim \mathcal {N}\left( \textbf{0}, \textbf{K}_{q}\right)\), where \(\textbf{K}_{q} \in \textbf{R}^{M \times M}\) has entries given by \(k_{q}\left( \textbf{z}_{i}, \textbf{z}_{j}\right)\) with \(\textbf{z}_{i}, \textbf{z}_{j} \in \textbf{Z}\). The latent parameter functions \(\textbf{f}_{d}^{c}\) are conditionally independent given \(\textbf{u}\). We therefore obtain the conditional distribution \(p(\textbf{f} | \textbf{u})\):

$$\begin{aligned} p(\textbf{f} | \textbf{u})&= \prod _{d=1}^{D} \prod _{c=1}^{C_{d}} p\left( \textbf{f}_{d}^{c} | \textbf{u}\right) \end{aligned}$$
(22)
$$\begin{aligned}&= \prod _{d=1}^{D} \prod _{c=1}^{C_{d}} \mathcal {N}\left( \textbf{f}_{d}^{c} \Big | \textbf{K}_{\textbf{f}_{d}^{c} \textbf{u}} \textbf{K}_{\textbf{u} \textbf{u}}^{-1} \textbf{u}, \textbf{K}_{\textbf{f}_{d}^{c} \textbf{f}_{d}^{c}}-\textbf{K}_{\textbf{f}_{d}^{c} \textbf{u}} \textbf{K}_{\textbf{u} \textbf{u}}^{-1} \textbf{K}_{\textbf{f}_{d}^{c} \textbf{u}}^{\top }\right) , \end{aligned}$$
(23)

where \(\textbf{K}_{\textbf{u} \textbf{u}} \in \textbf{R}^{Q M \times Q M}\) is a block-diagonal matrix based on \(\textbf{K}_{q}\) as each block.
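For illustration, a minimal sketch of the conditional mean and covariance in Eq. (23), assuming the required kernel matrices are already available as NumPy arrays (the function name and the jitter term are our own choices):

```python
import numpy as np

def conditional_f_given_u(K_fu, K_uu, K_ff, u, jitter=1e-6):
    """Mean and covariance of p(f_d^c | u) as in Eq. (23).

    K_fu: (N, QM) cross-covariance between f_d^c and u.
    K_uu: (QM, QM) block-diagonal covariance of u.
    K_ff: (N, N) prior covariance of f_d^c.
    u:    (QM,) values of the inducing variables.
    """
    L = np.linalg.cholesky(K_uu + jitter * np.eye(K_uu.shape[0]))
    A = np.linalg.solve(L, K_fu.T)          # A = L^{-1} K_uf, so A.T @ A = K_fu K_uu^{-1} K_uf
    mean = A.T @ np.linalg.solve(L, u)      # K_fu K_uu^{-1} u
    cov = K_ff - A.T @ A                    # K_ff - K_fu K_uu^{-1} K_uf
    return mean, cov

# Toy check with kernel matrices built from random inputs.
rng = np.random.default_rng(0)
N, QM = 4, 3
Z = rng.standard_normal((QM, 5)); X = rng.standard_normal((N, 5))
k = lambda A, B: np.exp(-0.5 * ((A[:, None] - B[None]) ** 2).sum(-1))
mean, cov = conditional_f_given_u(k(X, Z), k(Z, Z), k(X, X), rng.standard_normal(QM))
print(mean.shape, cov.shape)  # (4,) (4, 4)
```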

Following Moreno-Muñoz et al. (2018) and Liu et al. (2019), we obtain the lower bound \(\mathcal {L}\) for \(\text {log } p\left( \textbf{y}\right)\):

$$\begin{aligned} \text {log } p\left( \textbf{y}\right)&= \text {log} \int \int \int p\left( \textbf{y}, \epsilon , \textbf{f}, \textbf{u}\right) d \textbf{f} d \epsilon d \textbf{u} \end{aligned}$$
(24)
$$\begin{aligned}& =\text {log} \int \int \int \frac{p\left( \textbf{y}, \epsilon , \textbf{f}, \textbf{u}\right) q(\textbf{f}, \textbf{u}, \epsilon )}{q(\textbf{f}, \textbf{u}, \epsilon )} d \textbf{f} d \epsilon d \textbf{u} \end{aligned}$$
(25)
$$\begin{aligned}& \ge \int \int \int q(\textbf{f}, \textbf{u}, \epsilon ) \text {log} \frac{p\left( \textbf{y}, \epsilon , \textbf{f}, \textbf{u}\right) }{q(\textbf{f}, \textbf{u}, \epsilon )} d \textbf{f} d \epsilon d \textbf{u} \end{aligned}$$
(26)
$$\begin{aligned}& = \int \int \int q(\epsilon | \textbf{f}, \textbf{u}) q(\textbf{f}, \textbf{u}) \text {log} \frac{p\left( \textbf{y}, \epsilon , \textbf{f}, \textbf{u}\right) }{q(\epsilon | \textbf{f}, \textbf{u}) q(\textbf{f}, \textbf{u})} d \textbf{f} d \epsilon d \textbf{u}\end{aligned}$$
(27)
$$\begin{aligned}& =\mathcal {L}, \end{aligned}$$
(28)

where

$$\begin{aligned} q(\textbf{f}, \textbf{u})=p(\textbf{f} \mid \textbf{u}) q(\textbf{u})= \prod _{d=1}^{D} \prod _{c=1}^{C_{d}} p\left( \textbf{f}_{d}^{c} \mid \textbf{u}\right) \prod _{q=1}^{Q} q\left( \textbf{u}_{q}\right) , \end{aligned}$$

where \(q\left( \textbf{u}_{q}\right) =\mathcal {N}\left( \textbf{u}_{q} \mid \varvec{\mu }_{\textbf{u}_{q}}, \textbf{S}_{\textbf{u}_{q}}\right)\) and we refer to \(\left\{ \varvec{\mu }_{\textbf{u}_{q}}, \textbf{S}_{\textbf{u}_{q}}\right\} _{q=1}^{Q}\) as the variational parameters that need to be optimised. Further, we get (see Appendix A for details):

$$\begin{aligned} \mathcal {L}=&\sum _{d}^{D}\sum _{i}^{N}\left\langle \log p\left( y_{d}\left( \textbf{x}_{i}\right) | \widetilde{\textbf{f}}_{d}(\textbf{x}_{i}), \epsilon _{d,i}\right) \right\rangle _{q\left( \widetilde{\textbf{f}}_{d}(\textbf{x}_{i})\right) q\left( \epsilon _{d,i} | \widetilde{\textbf{f}}_{d}(\textbf{x}_{i})\right) } \end{aligned}$$
(29)
$$\begin{aligned}&\quad\quad-\sum _{q=1}^{Q} \textrm{KL}\left( q\left( \textbf{u}_{q}\right) \Vert p\left( \textbf{u}_{q}\right) \right) -\sum _{d=1}^{D}\sum _{i=1}^{N} \textrm{KL}\left( q\left( \epsilon _{d,i} | \widetilde{\textbf{f}}_{d}(\textbf{x}_{i})\right) \Vert p\left( \epsilon _{d,i}\right) \right) , \end{aligned}$$
(30)

where \(q\left( \epsilon _{d,i} | \widetilde{\textbf{f}}_{d}(\textbf{x}_{i})\right)\) approximates the posterior \(p\left( \epsilon _{d,i} | y_{d}\left( \textbf{x}_{i}\right) , \widetilde{\textbf{f}}_{d}(\textbf{x}_{i})\right)\):

$$\begin{aligned} p\left( \epsilon _{d,i} | y_{d}\left( \textbf{x}_{i}\right) , \widetilde{\textbf{f}}_{d}(\textbf{x}_{i})\right)&\propto p\left( y_{d}\left( \textbf{x}_{i}\right) | \widetilde{\textbf{f}}_{d}(\textbf{x}_{i}), \epsilon _{d,i}\right) p\left( \epsilon _{d,i}\right) , \end{aligned}$$
(31)
$$\begin{aligned} p\left( y_{d}\left( \textbf{x}_{i}\right) | \widetilde{\textbf{f}}_{d}(\textbf{x}_{i}), \epsilon _{d,i}\right)&=\prod _{c \ne y_{d}\left( \textbf{x}_{i}\right) } \Phi _{\mathcal {G}}\left( \epsilon _{d,i}+f_{d}^{y_{d}\left( \textbf{x}_{i}\right) }\left( \textbf{x}_{i}\right) -f_{d}^{c}\left( \textbf{x}_{i}\right) \right) , \end{aligned}$$
(32)
$$\begin{aligned}&= \prod _{c \ne y_{d}\left( \textbf{x}_{i}\right) } \exp \left( -e^{-\epsilon _{d,i}-f_{d}^{y_{d}\left( \textbf{x}_{i}\right) }\left( \textbf{x}_{i}\right) +f_{d}^{c}\left( \textbf{x}_{i}\right) }\right) . \end{aligned}$$
(33)

\(p\left( \epsilon _{d,i}\right)\) is a probability density function of the standard Gumbel distribution. \(q\left( \widetilde{\textbf{f}}_{d}(\textbf{x}_{i})\right)\) approximates the marginal posterior for \(\widetilde{\textbf{f}}_{d}(\textbf{x}_{i})\):

$$\begin{aligned} q\left( \widetilde{\textbf{f}}_{d}(\textbf{x}_{i})\right) =&\int p\left( \widetilde{\textbf{f}}_{d}(\textbf{x}_{i}) | \textbf{u}\right) q(\textbf{u}) d \textbf{u} \end{aligned}$$
(34)
$$\begin{aligned}\quad\quad =&\mathcal {N}\Big (\widetilde{\textbf{f}}_{d}(\textbf{x}_{i}) | \textbf{K}_{\widetilde{\textbf{f}}_{d}(\textbf{x}_{i}) \textbf{u}} \textbf{K}_{\textbf{u u}}^{-1} \varvec{\mu }_{\textbf{u}}, \end{aligned}$$
(35)
$$\begin{aligned}&\textbf{K}_{\widetilde{\textbf{f}}_{d}(\textbf{x}_{i}) \widetilde{\textbf{f}}_{d}(\textbf{x}_{i})}+\textbf{K}_{\widetilde{\textbf{f}}_{d}(\textbf{x}_{i}) \textbf{u}} \textbf{K}_{\textbf{u u}}^{-1}\left( \textbf{S}_{\textbf{u}}-\textbf{K}_{\textbf{u u}}\right) \textbf{K}_{\textbf{u u}}^{-1} \textbf{K}_{\widetilde{\textbf{f}}_{d}(\textbf{x}_{i}) \textbf{u}}^{\top }\Big ), \end{aligned}$$
(36)

where \(\varvec{\mu }_{\textbf{u}}=\left[ \varvec{\mu }_{\textbf{u}_{1}}^{\top }, \cdots , \varvec{\mu }_{\textbf{u}_{Q}}^{\top }\right] ^{\top }\) and \(\textbf{S}_{\textbf{u}}\) is a block-diagonal matrix with blocks given by \(\textbf{S}_{\textbf{u}_{q}}\). After some calculation (see Appendix A for details), we obtain:

$$\begin{aligned} \mathcal {L} =-\sum _{d=1}^{D}\sum _{i=1}^{N} \log \left( \mathcal {P}_{d,i}+1\right) -\sum _{q=1}^{Q} \textrm{KL}\left( q\left( \textbf{u}_{q}\right) \Vert p\left( \textbf{u}_{q}\right) \right) , \end{aligned}$$
(37)

where

$$\begin{aligned} \mathcal {P}_{d,i} =\exp \left( \frac{\nu _{f_{d}^{y_{d}\left( \textbf{x}_{i}\right) }}(\textbf{x}_{i})}{2}-\mu _{f_{d}^{y_{d}\left( \textbf{x}_{i}\right) }}(\textbf{x}_{i})\right) \sum _{c \ne y_{d}\left( \textbf{x}_{i}\right) } \exp \left( \frac{\nu _{f_{d}^{c}}(\textbf{x}_{i})}{2}+\mu _{f_{d}^{c}}(\textbf{x}_{i})\right) , \end{aligned}$$
(38)

where \(\mu _{f_{d}^{c}}(\textbf{x}_{i})\) and \(\nu _{f_{d}^{c}}(\textbf{x}_{i})\) are the mean and variance of \(q\left( f_{d}^{c}\left( \textbf{x}_{i}\right) \right)\), respectively.
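The bound in Eqs. (37) and (38) is cheap to evaluate once the marginal means and variances of the variational posterior are available. The following illustrative sketch (ours) computes the per-point data term and the Gaussian KL term explicitly; the toy numbers are arbitrary.

```python
import numpy as np

def expected_loglik_term(mu, nu, y):
    """-log(P_{d,i} + 1) of Eqs. (37)-(38) for one data point.

    mu, nu: (C_d,) marginal means and variances of q(f_d^c(x_i)).
    y:      index of the observed class y_d(x_i).
    """
    psi = np.exp(0.5 * nu[y] - mu[y])
    gamma = np.exp(0.5 * nu + mu)
    P = psi * (gamma.sum() - gamma[y])          # sum over c != y_d(x_i), Eq. (38)
    return -np.log(P + 1.0)

def kl_gaussian(mu_q, S_q, K_p):
    """KL(N(mu_q, S_q) || N(0, K_p)) between the variational q(u_q) and the prior p(u_q)."""
    M = K_p.shape[0]
    trace = np.trace(np.linalg.solve(K_p, S_q))
    quad = mu_q @ np.linalg.solve(K_p, mu_q)
    logdet = np.linalg.slogdet(K_p)[1] - np.linalg.slogdet(S_q)[1]
    return 0.5 * (trace + quad - M + logdet)

# Toy evaluation for a single output with C_d = 4 classes and one data point.
mu = np.array([0.3, -0.1, 1.2, 0.0])
nu = np.array([0.2, 0.5, 0.1, 0.3])
print(expected_loglik_term(mu, nu, y=2))
print(kl_gaussian(np.zeros(3), 0.5 * np.eye(3), np.eye(3)))
```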

3.3.2 Reducing computational complexity by subsampling

To reduce the computational complexity of our model, we use only a subset of the data observations and a subset of the classes to estimate the parameters and hyperparameters. The optimal parameters and hyperparameters are chosen by maximising an unbiased estimator of \(\mathcal {L}\) (37), where the unbiased estimator is obtained through a subset of both training data points and classes in each output.

In our model, the hyperparameters are \(\textbf{Z}\), \(\left\{ a_{c,q}\right\} _{c=1, q=1}^{C, Q}\), \(\{\phi _q\}_{q=1}^{Q}\) and \(\{w_p\}_{p=1}^{P}\), and the variational parameters are \(\left\{ \varvec{\mu }_{\textbf{u}_{q}}, \textbf{S}_{\textbf{u}_{q}}\right\} _{q=1}^{Q}\) for \(\{q\left( \textbf{u}_{q} \right) \}_{q=1}^{Q}\). We refer to all these parameters collectively as \(\Theta\). To obtain the optimal values of \(\Theta\), we use gradient-based optimisation to maximise \(\mathcal {L}\) with respect to \(\Theta\):

$$\begin{aligned} \nabla _{\Theta } \mathcal {L} =&- \nabla _{\Theta } \sum _{d=1}^{D}\sum _{i=1}^{N} \log \left( \mathcal {P}_{d,i}+1\right) -\nabla _{\Theta } \sum _{q=1}^{Q} \textrm{KL}\left( q\left( \textbf{u}_{q}\right) \Vert p\left( \textbf{u}_{q}\right) \right) \end{aligned}$$
(39)
$$\begin{aligned} =&-\nabla _{\Theta } \sum _{d=1}^{D}\sum _{i=1}^{N} \log \left( \left( \psi _{y_{d}\left( \textbf{x}_{i}\right) } \sum _{c \ne y_{d}\left( \textbf{x}_{i}\right) } \gamma _{{d,c}(\textbf{x}_{i})} \right) +1\right) \end{aligned}$$
(40)
$$\begin{aligned}&-\nabla _{\Theta } \sum _{q=1}^{Q}\textrm{KL} \left( q\left( \textbf{u}_{q} \right) \Vert p\left( \textbf{u}_{q}\right) \right) , \end{aligned}$$
(41)

where, for notation simplicity, we define

$$\begin{aligned} \psi _{y_{d}\left( \textbf{x}_{i}\right) }&=\exp \left( \frac{\nu _{f_{d}^{y_{d}\left( \textbf{x}_{i}\right) }}(\textbf{x}_{i})}{2}-\mu _{f_{d}^{y_{d}\left( \textbf{x}_{i}\right) }}(\textbf{x}_{i})\right) , \end{aligned}$$
(42)
$$\begin{aligned} \gamma _{{d,c}(\textbf{x}_{i})}&=\exp \left( \frac{\nu _{f_{d}^{c}}(\textbf{x}_{i})}{2}+\mu _{f_{d}^{c}}(\textbf{x}_{i})\right) . \end{aligned}$$
(43)

We then estimate \(\nabla _{\Theta } \mathcal {L}\) by randomly sampling a subset of data points \(\mathcal {B}=\{\textbf{b}_{1},...,\textbf{b}_{|\mathcal {B}|}\} \subseteq \{\textbf{x}_{1}, \ldots , \textbf{x}_{N}\}\) of size \(|\mathcal {B}|\) and a subset of classes \(\mathcal {S} = \{s_{1},..., s_{|\mathcal {S}|}\}\) \(\subseteq \{1, \ldots , C_{d}\} \backslash \{y_{d}(\textbf{x})\}\) with size \(|\mathcal {S}|\) (“\” means \(\{y_{d}(\textbf{x})\}\) is excluded from \(\{1, \ldots , C_{d}\}\)) for each multi-class classification output:

$$\begin{aligned} \nabla _{\Theta } \widetilde{\mathcal {L}}=&- \nabla _{\Theta } \sum _{d=1}^{D} \sum _{\textbf{x}_{i} \in \mathcal {B}} \frac{N}{|\mathcal {B}|} \log \left( \frac{C_{d}-1}{|\mathcal {S}|}\left( \psi _{y_{d}\left( \textbf{x}_{i}\right) } \sum _{c \in \mathcal {S}} \gamma _{{d,c}(\textbf{x}_{i})}\right) +1\right) \end{aligned}$$
(44)
$$\begin{aligned}&-\nabla _{\Theta } \sum _{q=1}^{Q}\textrm{KL}\left( q\left( \textbf{u}_{q} \right) \Vert p\left( \textbf{u}_{q}\right) \right) . \end{aligned}$$
(45)

\(\nabla _{\Theta } \widetilde{\mathcal {L}}\) is an unbiased estimator of \(\nabla _{\Theta } \mathcal {L}\). The computational complexity of MOGPs-AR is dominated by the optimisation of the parameters through maximising \(\mathcal {L}\).

Our sampling strategy is summarised in Algorithm 1. The computational complexity of MOGPs-AR mainly depends on the inversion of \(\textbf{K}_{\textbf{u u}}\), with complexity \(\mathcal {O}(Q M^{3})\), and on products like \(\textbf{K}_{\widetilde{\textbf{f}} \textbf{u}}\), with complexity \(\mathcal {O}\left( D|\mathcal {S}| |\mathcal {B}| Q M^{2}\right)\). If we do not subsample classes, we have to compute products like \(\textbf{K}_{\overline{\textbf{f}} \textbf{u}}\) with a cost of \(\mathcal {O}\left( C|\mathcal {B}| Q M^{2}\right)\), where the notation is defined as follows:

$$\begin{aligned} \widetilde{\textbf{f}} =&\left[ \tilde{\textbf{f}}_{1,\mathcal {B}}^{\top } ,\cdots , \widetilde{\textbf{f}}_{D,\mathcal {B}}^{\top } \right] ^{\top } \in \textbf{R}^{D|\mathcal {S}| |\mathcal {B}| \times 1} \end{aligned}$$
(46)
$$\begin{aligned} \overline{\textbf{f}} =&\left[ \overline{\textbf{f}}_{1,\mathcal {B}}^{\top } , \cdots , \overline{\textbf{f}}_{D,\mathcal {B}}^{\top } \right] ^{\top } \in \textbf{R}^{C |\mathcal {B}| \times 1} \end{aligned}$$
(47)
$$\begin{aligned} \widetilde{\textbf{f}}_{d,\mathcal {B}} =&\left[ \left( \textbf{f}_{d,\mathcal {B}}^{\mathcal {S}_{1}}\right) ^{\top } , \cdots , \left( \textbf{f}_{d,\mathcal {B}}^ {\mathcal {S}_{|\mathcal {S}|}}\right) ^{\top }\right] ^{\top } \in \textbf{R}^{|\mathcal {S}||\mathcal {B}| \times 1} \end{aligned}$$
(48)
$$\begin{aligned} \overline{\textbf{f}}_{d,\mathcal {B}} =&\left[ \left( \textbf{f}_{d,\mathcal {B}}^1\right) ^{\top } , \cdots , \left( \textbf{f}_{d,\mathcal {B}}^ {C_{d}}\right) ^{\top }\right] ^{\top } \in \textbf{R}^{C_{d} |\mathcal {B}| \times 1} \end{aligned}$$
(49)
$$\begin{aligned} \textbf{f}_{d,\mathcal {B}}^{c} =&\left[ f_{d}^c\left( \textbf{b}_{1}\right) , \cdots , f_{d}^{c}\left( \textbf{b}_{|\mathcal {B}|}\right) \right] ^{\top } \in \textbf{R}^{|\mathcal {B}| \times 1} \end{aligned}$$
(50)

We note that \(D|\mathcal {S}| \ll C\) (with \(C=\sum _{d=1}^{D} C_{d}\)), so MOGPs-AR alleviates the computational complexity of the product \(\textbf{K}_{\widetilde{\textbf{f}} \textbf{u}}\) from \(\mathcal {O}\left( C|\mathcal {B}| Q M^{2}\right)\) to \(\mathcal {O}\left( D|\mathcal {S}| |\mathcal {B}| Q M^{2}\right)\).

Algorithm 1 Sampling strategy for MOGPs-AR
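A minimal illustrative sketch (ours) of the doubly subsampled objective estimate of Eq. (44) for one output, following the sampling strategy of Algorithm 1; the function and variable names are our own, and the marginal means and variances are taken as given.

```python
import numpy as np

def subsampled_objective(mu, nu, y, batch_size, n_classes_sampled, rng):
    """Estimate of the data term of Eq. (44) for one output.

    mu, nu: (N, C_d) marginal means and variances of q(f_d^c(x_i)).
    y:      (N,) observed classes.
    """
    N, C_d = mu.shape
    batch = rng.choice(N, size=batch_size, replace=False)        # mini-batch B
    total = 0.0
    for i in batch:
        others = np.delete(np.arange(C_d), y[i])                 # {1,...,C_d} \ {y_d(x_i)}
        S = rng.choice(others, size=n_classes_sampled, replace=False)
        psi = np.exp(0.5 * nu[i, y[i]] - mu[i, y[i]])
        gamma = np.exp(0.5 * nu[i, S] + mu[i, S]).sum()
        # Rescale the class sum by (C_d - 1)/|S| and the batch sum by N/|B|, Eq. (44).
        total += (N / batch_size) * np.log((C_d - 1) / len(S) * psi * gamma + 1.0)
    return -total  # the KL terms of Eq. (45) are added separately

rng = np.random.default_rng(0)
N, C_d = 500, 100
mu, nu = rng.standard_normal((N, C_d)), rng.random((N, C_d))
y = rng.integers(0, C_d, size=N)
print(subsampled_objective(mu, nu, y, batch_size=32, n_classes_sampled=10, rng=rng))
```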

3.4 Prediction

In this subsection, we derive the predictive distribution of MOGPs-AR. Considering a new test input \(\textbf{x}_{*}\) in the d-th output, we assume \(p(\textbf{u} | \textbf{y}) \approx q(\textbf{u})\) and approximate the predictive distribution \(p\left( y_{d}\left( \textbf{x}_{*}\right) \right)\) by

$$\begin{aligned} p\left( y_{d}\left( \textbf{x}_{*}\right) | \textbf{y}\right) \approx \int p\left( y_{d}\left( \textbf{x}_{*}\right) \mid \widetilde{\textbf{f}}_{d}\left( \textbf{x}_{*}\right) \right) q\left( \widetilde{\textbf{f}}_{d}\left( \textbf{x}_{*}\right) \right) d \widetilde{\textbf{f}}_{d}\left( \textbf{x}_{*}\right) , \end{aligned}$$
(51)

where \(q\left( \widetilde{\textbf{f}}_{d}\left( \textbf{x}_{*}\right) \right) =\int p\left( \widetilde{\textbf{f}}_{d}\left( \textbf{x}_{*}\right) \mid \textbf{u}\right) q(\textbf{u}) d \textbf{u}=\prod _{c=1}^{C_{d}} q\left( f_{d}^{c}\left( \textbf{x}_{*}\right) \right)\). The approximated latent parameter functions \(q\left( \widetilde{\textbf{f}}_{d}\left( \textbf{x}_{*}\right) \right)\) are mutually independent, so we obtain

$$\begin{aligned} p\left( y_{d}\left( \textbf{x}_{*}\right) | \textbf{y}\right)&\approx \int p\left( y_{d}\left( \textbf{x}_{*}\right) \mid \widetilde{\textbf{f}}_{d}\left( \textbf{x}_{*}\right) \right) q\left( \widetilde{\textbf{f}}_{d}\left( \textbf{x}_{*}\right) \right) d \widetilde{\textbf{f}}_{d}\left( \textbf{x}_{*}\right) \end{aligned}$$
(52)
$$\begin{aligned} =\int&\frac{\exp \left( f_{d}^{y_{d}\left( \textbf{x}_{*}\right) }\left( \textbf{x}_{*}\right) \right) }{\sum _{c=1}^{C_{d}} \exp \left( f_{d}^{c}\left( \textbf{x}_{*}\right) \right) } \prod _{c=1}^{C_{d}} \mathcal {N}\left( f_{d}^{c}\left( \textbf{x}_{*}\right) \mid \mu _{f_{d}^{c}}\left( \textbf{x}_{*}\right) , \nu _{f_{d}^{c}}\left( \textbf{x}_{*}\right) \right) d \widetilde{\textbf{f}}_{d}\left( \textbf{x}_{*}\right) . \end{aligned}$$
(53)

We can use Monte Carlo to approximate the integral in the same way as Liu et al. (2019).
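For illustration, a minimal sketch (ours) of this Monte Carlo approximation of Eq. (53), assuming the predictive means and variances of \(q\left( f_{d}^{c}\left( \textbf{x}_{*}\right) \right)\) are available; the function name and toy values are our own.

```python
import numpy as np

def predict_class_probs(mu_star, nu_star, n_samples=1000, rng=None):
    """Monte Carlo estimate of p(y_d(x_*) = c | y) in Eq. (53).

    mu_star, nu_star: (C_d,) predictive means and variances of q(f_d^c(x_*)).
    """
    rng = rng or np.random.default_rng()
    # Sample from the independent Gaussian marginals of q(f_d(x_*)).
    f = mu_star + np.sqrt(nu_star) * rng.standard_normal((n_samples, mu_star.size))
    f = f - f.max(axis=1, keepdims=True)            # for numerical stability
    p = np.exp(f) / np.exp(f).sum(axis=1, keepdims=True)
    return p.mean(axis=0)                           # average the softmax over samples

probs = predict_class_probs(np.array([0.1, 1.5, -0.4]), np.array([0.3, 0.2, 0.5]))
print(probs, probs.sum())                           # probabilities sum to 1
```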

4 Experiments

In this section, we evaluate MOGPs-AR on various data sets. We apply MOGPs-AR to a synthetic data set to show its scalability in the number of classes compared to standard multi-output Gaussian processes. We also compare MOGPs-AR to other models on different real data sets. Further, to test the capacity of MOGPs-AR to deal with image data, we compare the performance of the convolutional kernel and RBF-ARD in our model.

Baselines. We compare MOGPs-AR with the following two single-output and one multi-output Gaussian process models: 1) a Gaussian process for multi-class classification (G-M), an independent Gaussian process using the softmax likelihood; 2) a Gaussian process for multi-class classification with additive noise (G-A), an independent Gaussian process using the softmax likelihood via Gumbel noise; 3) a multi-output Gaussian process model for multi-class classification problems (MG-M), a standard linear model of coregionalisation for MOGPs using the softmax likelihood. For all the models in this paper, we use RBF-ARD unless otherwise stated. For all models, we use traditional inducing points (Hensman et al., 2013) unless mentioned otherwise. All models are trained using the Adam optimiser with a learning rate of 0.01 for 4000 iterations (Kingma & Ba, 2014). We use the same 80% training and 20% validation split to choose the optimal number Q of latent functions \(\mathcal {U}\) for MOGPs-AR and MG-M, re-optimising all hyperparameters during cross-validation.

Evaluation Metrics. There are three different evaluation metrics in this paper:

$$\begin{aligned} \text {Precision-Weighted = }&\frac{1}{\sum _{l \in \mathbb {L}}\left| \mathbb {O}_{true}^{l}\right| } \sum _{l \in \mathbb {L}}\left| \mathbb {O}_{true}^{l}\right| P\left( \mathbb {O}_{prediction}^{l}, \mathbb {O}_{true}^{l}\right) , \end{aligned}$$
(54)
$$\begin{aligned} \text {Recall-Weighted = }&\frac{1}{\sum _{l \in \mathbb {L}}\left| \mathbb {O}_{true}^{l}\right| } \sum _{l \in \mathbb {L}}\left| \mathbb {O}_{true}^{l}\right| R\left( \mathbb {O}_{prediction}^{l}, \mathbb {O}_{true}^{l}\right) , \end{aligned}$$
(55)
$$\begin{aligned} \text {F1-Weighted = }&\frac{1}{\sum _{l \in \mathbb {L}}\left| \mathbb {O}_{true}^{l}\right| } \sum _{l \in \mathbb {L}}\left| \mathbb {O}_{true}^{l}\right| F_{1}\left( \mathbb {O}_{prediction}^{l}, \mathbb {O}_{true}^{l}\right) , \end{aligned}$$
(56)
$$\begin{aligned} P(\mathbb {O}_{prediction}^{l}, \mathbb {O}_{true}^{l})&= \frac{|\mathbb {O}_{prediction}^{l} \cap \mathbb {O}_{true}^{l}|}{|\mathbb {O}_{prediction}^{l}|}, \end{aligned}$$
(57)
$$\begin{aligned} R(\mathbb {O}_{prediction}^{l}, \mathbb {O}_{true}^{l})&=\frac{|\mathbb {O}_{prediction}^{l} \cap \mathbb {O}_{true}^{l}|}{|\mathbb {O}_{true}^{l}|}, \end{aligned}$$
(58)
$$\begin{aligned} F_{1}(\mathbb {O}_{prediction}^{l}, \mathbb {O}_{true}^{l})&=2 \frac{P(\mathbb {O}_{prediction}^{l}, \mathbb {O}_{true}^{l}) \times R(\mathbb {O}_{prediction}^{l}, \mathbb {O}_{true}^{l})}{P(\mathbb {O}_{prediction}^{l}, \mathbb {O}_{true}^{l})+R(\mathbb {O}_{prediction}^{l}, \mathbb {O}_{true}^{l})}, \end{aligned}$$
(59)

where \(\mathbb {O}_{true}\) and \(\mathbb {O}_{prediction}\) are the sets of true and predicted pairs (input data point, class), respectively (e.g., \(\mathbb {O}_{true,n} = \left( \textbf{x}_{n},y_{true,n}\right)\), where \(\textbf{x}_{n}\) is the n-th input data point and \(y_{true,n}\) is the corresponding class for \(\textbf{x}_{n}\)). \(\mathbb {O}_{true}^{l}\) and \(\mathbb {O}_{prediction}^{l}\) are subsets of \(\mathbb {O}_{true}\) and \(\mathbb {O}_{prediction}\), respectively (e.g., \(\mathbb {O}_{true}^{l} =\) \(\{\left( \textbf{x}_{n}, y_{true,n}\right)\) \(\in \mathbb {O}_{true} \mid y_{true,n}=l, n \in \mathbb {N} \}\)). \(\mathbb {L}\) and \(\mathbb {N}\) are the sets of classes and input data points, respectively. The formulas use \(P(\mathbb {O}_{prediction}^{l}, \mathbb {O}_{true}^{l}) =0\) or \(R(\mathbb {O}_{prediction}^{l}, \mathbb {O}_{true}^{l}) =0\) whenever \(\mathbb {O}_{prediction}^{l}=\emptyset\) or \(\mathbb {O}_{true}^{l}=\emptyset\), respectively.
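These weighted metrics can be computed directly from the predicted and true labels. The following illustrative sketch (ours) follows the definitions in Eqs. (54)-(59), including the zero conventions for empty sets.

```python
import numpy as np

def weighted_metrics(y_true, y_pred):
    """Precision-, Recall- and F1-Weighted as in Eqs. (54)-(59)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    precision = recall = f1 = 0.0
    for l in classes:
        support = (y_true == l).sum()                       # |O_true^l|
        tp = ((y_true == l) & (y_pred == l)).sum()          # |O_prediction^l intersect O_true^l|
        p = tp / (y_pred == l).sum() if (y_pred == l).any() else 0.0
        r = tp / support                                    # support > 0 by construction
        f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        precision += support * p                            # weight each class by |O_true^l|
        recall += support * r
        f1 += support * f
    n = len(y_true)
    return precision / n, recall / n, f1 / n

print(weighted_metrics([0, 0, 1, 2, 2], [0, 1, 1, 2, 0]))
```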

The synthetic data experiment was performed on a Dell PowerEdge C6320 with an Intel Xeon E5-2630 v3 at 2.40 GHz and 64 GB of RAM. All real data experiments were performed on a PowerEdge R740XD server with an NVIDIA Tesla V100 32 GB GDDR.

4.1 Synthetic data

In this subsection, we compare the performance of MOGPs-AR with MG-M on synthetic data, where we generate a single-output classification data set. We create a 20-class data set by assigning to each class a cluster of 100 normally distributed points, where each data point has five features. In total, there are 2000 samples. Since the synthetic data set has 20 classes, we refer to it as S-20. We use 20 classes to compare MOGPs-AR with MG-M in terms of scalability. MOGPs-AR and MG-M use the same parameter setting (see Table 3), except that MOGPs-AR uses different numbers of subsampled classes.
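For reproducibility of the general setup, the following sketch shows one way such a data set can be generated: 20 clusters of 100 normally distributed five-dimensional points, one cluster per class. The cluster centres and spread used here are our own illustrative choices, not necessarily those used for S-20.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_per_class, n_features = 20, 100, 5

# One cluster centre per class (illustrative range), unit-variance Gaussian clusters.
centres = rng.uniform(-10.0, 10.0, size=(n_classes, n_features))
X = np.concatenate([c + rng.standard_normal((n_per_class, n_features))
                    for c in centres])                              # (2000, 5) inputs
y = np.repeat(np.arange(n_classes), n_per_class)                    # class labels

print(X.shape, y.shape)  # (2000, 5) (2000,)
```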

Fig. 2 Left: training time of MG-M and MOGPs-AR on S-20 (MOGPs-AR(5) denotes MOGPs-AR with a subset of classes \(\mathcal {S}_{d} \subseteq \{1, \ldots , C_{d}\} \backslash \{y_{d}(\textbf{x})\}\) of size \(|\mathcal {S}_{d}|=5\) \((|\mathcal {S}|=|\mathcal {S}_{d}|+1)\); we use \(y_{d}(\textbf{x})\) and \(C_{d}\) for notational consistency although there is only one output). Right: Recall-Weighted for MG-M and MOGPs-AR with different numbers of sampled classes (e.g., 5 means \(|\mathcal {S}_{d}|=5\))

We compare MOGPs-AR with MG-M in terms of training time and Recall-Weighted performance. Figure 2 shows that the mean training time of MOGPs-AR is lower than that of MG-M in five-fold cross-validation. This is because the computational complexity of MOGPs-AR is lower than that of MG-M. As mentioned in Sect. 3.3.2, compared to MG-M, MOGPs-AR reduces the computational complexity of the product \(\textbf{K}_{\widetilde{\textbf{f}} \textbf{u}}\) from \(\mathcal {O}\left( C|\mathcal {B}| Q M^{2}\right)\) to \(\mathcal {O}\left( D|\mathcal {S}| |\mathcal {B}| Q M^{2}\right)\), where \(D|\mathcal {S}| \ll C\). Figure 2 empirically shows that the mean training time of MOGPs-AR(1), at 596 s, is nearly one-sixth of that of MG-M, at 3641 s. The mean training time of MOGPs-AR increases as \(|\mathcal {S}_{d}|\) increases but remains lower than that of MG-M. While MOGPs-AR trains faster than MG-M, it achieves a Recall-Weighted performance similar to that of MG-M on S-20. Even with a small subset of classes, e.g., five classes, MOGPs-AR performs close to MG-M (see Fig. 2, right panel). The Recall-Weighted of MOGPs-AR increases slightly as the number of sampled classes increases. Further, we notice that MOGPs-AR(17) performs better than MG-M. In theory, MOGPs-AR should have the same performance as MG-M; however, since the optimisation problem is non-convex for both MG-M and MOGPs-AR, MOGPs-AR may outperform MG-M on various performance metrics in practice.

4.2 Single-output GP classification: four real data sets

Fig. 3 Performance in cross-validation (mean ± standard deviation) on the Balance, CANE9, Mediamill and Bibtex data sets. a Balance data results; b CANE9 data results; c Mediamill data results; d Bibtex data results

We use the following four real data sets to test the performance of the different GP classifiers: 1) Balance (Dua & Graff, 2017) is a data set of psychology experiment results. There are 625 data points with four discrete variables: Left-Weight, Left-Distance, Right-Weight and Right-Distance. The value of each discrete variable ranges from one to five. The data set consists of three classes: the balance scale tips to the right (R), tips to the left (L) or is balanced (B). 2) CANE9 (Dua & Graff, 2017) contains 1080 documents of free-text business descriptions of Brazilian companies, divided into nine different categories. Each document has 856 integer variables (word frequencies). 3) Mediamill (Snoek et al., 2006) is a multi-label data set for generic video indexing. To apply multi-class classification, we only retain one label, the first label to appear, for each data point. Further, we only use part of this data set since the original data set is highly imbalanced. The number of data points per class then ranges from 31 to 545. In total, we have 6689 data points with 120 numeric features and 35 classes. 4) The Bibtex data set (Katakis et al., 2008) is also a multi-label data set that contains 7395 Bibtex entries with 1836 variables. Similarly, we only retain one label, the first label to appear, obtaining 148 classes.

In all three performance measures, MOGPs-AR outperforms G-A and G-M on all four data sets (see Fig. 3). This is because MOGPs-AR can use each of the latent parameter functions \(\textbf{f}\), which is a linear combination of the latent functions \(\mathcal {U}\), to predict each class. The shared latent functions \(\mathcal {U}\) and the coregionalisation matrices \(\textbf{B}_{q}\) can transfer knowledge between the classes in the same output. However, G-A and G-M only have independent Gaussian processes that cannot capture the similarity between classes. Further, Fig. 3 indicates that, using a small subset of classes (e.g., MOGPs-AR(1) or MOGPs-AR(5)), MOGPs-AR obtains results similar to MG-M on the Balance, CANE9 and Mediamill data sets, as discussed in Sect. 4.1.

Compared with single-output Gaussian processes, MOGPs-AR achieves around 10% improvement in terms of the three performance metrics on the Balance and CANE9 data sets (Fig. 3, upper panel). The optimal number Q of latent functions \(\mathcal {U}\) is two and nine for the Balance and CANE9 data sets, respectively. Those latent functions share knowledge between the classes and help to improve the performance. There is also a connection between single-output and multi-output Gaussian processes: considering an extreme case where there is only one class, \(Q=1\) and \(\textbf{B}_{q}=1\), MOGPs-AR and MG-M have the same structure as G-A and G-M in theory, respectively.

Regarding both Mediamill (35 classes) and Bibtex (148 classes), MOGPs-AR has excellent performance compared to the single-output Gaussian processes and MG-M. For the Mediamill data set, by capturing the dependencies between classes, MOGPs-AR is nearly 6 times better than G-A and 4 times better than G-M in terms of F1-Weighted, where the mean F1-Weighted is 0.04 for G-A, 0.08 for G-M and 0.25 for MOGPs-AR. Further, we cannot apply MG-M to the Bibtex data set since it is not able to compute \(\textbf{K}_{\overline{\textbf{f}} \textbf{u}}\) (out of memory). However, MOGPs-AR scales well since it only uses a subset of classes (MOGPs-AR(20)) for prediction.

4.3 Multi-output GP classification: UJIIndoorLoc

Fig. 4 Performance in cross-validation (mean ± standard deviation)

To assess the performance of MOGPs-AR on multi-output multi-class classification problems, we apply MOGPs-AR to the UJIIndoorLoc data set (Torres-Sospedra et al., 2014). It contains 21048 instances of WiFi fingerprints for three buildings of Universitat Jaume I, where Buildings I and II have four floors each and Building III has five floors. Each instance has 520 features based on signal strength intensity. We randomly sample 200 data points from each floor, so there are 800 data points each for Buildings I and II and 1000 data points for Building III. Further, we standardise the data set for each building. We predict the floor from the 520 features. Since the three buildings belong to the same university, we assume there is a strong correlation between them. We regard each building as an output and the different floors as different classes in our model, so UJIIndoorLoc is treated as a multi-output multi-class classification problem. In this experiment and the following ones, we do not apply the MG-M model due to its computational complexity; MOGPs-AR serves as an alternative to MG-M, so we only consider MOGPs-AR and the two single-output GP models.

Figure 4 shows that MOGPs-AR outperforms the single-output Gaussian processes on Buildings I, II and III in all three performance measures. For example, MOGPs-AR achieves around 50% improvement in terms of Recall-Weighted on Building I compared with the single-output Gaussian processes. The reason is that MOGPs-AR can capture intra- and inter-output dependencies across the three buildings, and these dependencies help improve the prediction for all buildings. The single-output Gaussian processes cannot exploit these dependencies and therefore do not perform well on UJIIndoorLoc.

To investigate the intra- and inter-output correlations, we create a global absolute coregionalisation matrix. First, we create absolute coregionalisation matrices \(\left\{ \textbf{B}^{abs}_{q}\right\} _{q=1}^{Q}\), where \(\textbf{B}^{abs}_{q} \in \textbf{R}^{C \times C}\), by taking the absolute value of each entry in \(\textbf{B}_{q}\). Second, we compute the mean of those absolute coregionalisation matrices: \(\varvec{\overline{B}} = \frac{1}{Q}\sum _{q=1}^{Q}\textbf{B}^{abs}_{q}\), with \(\varvec{\overline{B}} \in \textbf{R}^{C \times C}\). Since we perform K-fold cross-validation, we have K different mean absolute coregionalisation matrices \(\left\{ \varvec{\overline{B}}^{i}\right\} _{i=1}^{K}\), where \(\varvec{\overline{B}}^{i} \in \textbf{R}^{C \times C}\) refers to the mean absolute coregionalisation matrix of the i-th fold. Further, we average \(\left\{ \varvec{\overline{B}}^{i}\right\} _{i=1}^{K}\) over the K folds to obtain \(\tilde{\textbf{B}} =\frac{1}{K}\sum _{i=1}^{K}\varvec{\overline{B}}^{i} \in \textbf{R}^{C \times C}\):

$$\begin{aligned} \tilde{\textbf{B}}= \begin{bmatrix} \left( \tilde{\textbf{B}}\right) _{1,1}&{} \dots &{} \left( \tilde{\textbf{B}}\right) _{1,D} \\ \left( \tilde{\textbf{B}}\right) _{2,1} &{} \dots &{} \left( \tilde{\textbf{B}}\right) _{2,D} \\ \vdots &{} \ddots &{} \vdots \\ \left( \tilde{\textbf{B}}\right) _{D,1} &{} \dots &{} \left( \tilde{\textbf{B}}\right) _{D,D} \end{bmatrix}, \end{aligned}$$
(60)

where \(\left( \tilde{\textbf{B}}\right) _{i,j} \in \textbf{R}^{C_{i} \times C_{j}}\) indicates the correlations of all latent parameter functions between the i-th and j-th outputs. Finally, to summarise the correlation at the level of outputs, we compute the scalar \(\tilde{B}_{i,j} = \frac{1}{C_{i}C_{j}}\sum _{m}\sum _{n} \left\{ \left( \tilde{\textbf{B}}\right) _{i,j}\right\} _{m,n}\), which represents the dependence between the i-th and j-th outputs. We therefore define a global absolute coregionalisation matrix (GACM \(\in \textbf{R}^{D \times D}\)) as follows:

$$\begin{aligned} \text {GACM}= \begin{bmatrix} \tilde{B}_{1,1}&{} \dots &{} \tilde{B}_{1,D} \\ \tilde{B}_{2,1} &{} \dots &{} \tilde{B}_{2,D} \\ \vdots &{} \ddots &{} \vdots \\ \tilde{B}_{D,1} &{} \dots &{} \tilde{B}_{D,D} \end{bmatrix}. \end{aligned}$$
(61)
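The construction of the GACM reduces to averaging absolute coregionalisation matrices and then averaging each output-pair block. The following illustrative sketch (ours) computes the GACM given per-fold, per-latent-function matrices \(\textbf{B}_{q}\) and the class counts \(C_{d}\):

```python
import numpy as np

def gacm(B_folds, class_counts):
    """Global absolute coregionalisation matrix (Eq. 61).

    B_folds: list over K folds, each a list over Q of (C, C) coregionalisation
             matrices B_q; class_counts: [C_1, ..., C_D].
    """
    # Mean of |B_q| over latent functions, then over folds -> B_tilde (C x C).
    B_tilde = np.mean([np.mean([np.abs(Bq) for Bq in fold], axis=0)
                       for fold in B_folds], axis=0)
    # Average each (C_i x C_j) block to a single scalar B_tilde_{i,j}.
    edges = np.concatenate([[0], np.cumsum(class_counts)])
    D = len(class_counts)
    G = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            G[i, j] = B_tilde[edges[i]:edges[i + 1], edges[j]:edges[j + 1]].mean()
    return G

# Toy example: D = 3 outputs with 4, 4 and 5 classes, Q = 2, K = 5 folds.
rng = np.random.default_rng(0)
C = 4 + 4 + 5
B_folds = []
for _ in range(5):                                  # K = 5 folds
    fold = []
    for _ in range(2):                              # Q = 2 latent functions
        A = rng.standard_normal((C, 2))
        fold.append(A @ A.T)                        # B_q = A_q A_q^T
    B_folds.append(fold)
print(gacm(B_folds, [4, 4, 5]))
```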
Fig. 5 Global absolute coregionalisation matrix. a The global absolute coregionalisation matrix of the UJIIndoorLoc data set. b The global absolute coregionalisation matrix of the Ojibwe and Blackfoot alphabets

Figure 5a shows the correlations between the buildings captured by our model. We can see that there is a strong correlation between the different buildings. Buildings I and II have a relatively strong correlation compared to Buildings I and III, and Buildings II and III. Building II has the strongest intra-output correlation while Building III has the weakest intra-output correlation among the three buildings.

4.4 Multi-output GP classification: Omniglot data set

We apply MOGPs-AR to the Omniglot image data set (Lake et al., 2015). The Omniglot data set includes 1623 different handwritten characters from 50 distinct alphabets. Each of the 1623 characters was drawn by 20 different people (the total number of images is 32460). Although traditional MOGPs are not specifically designed to deal with image data, MOGPs-AR can handle image data by incorporating a convolutional kernel (Van der Wilk et al., 2017). The size of each image is 105 \(\times\) 105 pixels. To speed up computation and reduce the computational complexity of the convolutional kernel, we resize the images from 105 \(\times\) 105 to 20 \(\times\) 20, as done by Santoro et al. (2016). We regard each alphabet as an output in our model. Each alphabet has different characters, which are considered as different classes. Therefore, we treat the Omniglot data set as a multi-output multi-class classification problem.

Fig. 6 Both the Ojibwe and the Blackfoot alphabets have 14 characters each. We show two data points for each character

4.4.1 Ojibwe and Blackfoot alphabets

To evaluate the performance of MOGPs-AR in multi-output multi-class classification problems with image inputs, we first consider the Ojibwe and Blackfoot alphabets as two different multi-class classification problems (see Fig. 6). Since both alphabets belong to the Canadian Aboriginal Syllabics, we assume there is a strong correlation between them. Our model can capture this correlation through joint modelling of the two alphabets to improve the predictive performance for each multi-class classification problem. Each output has 14 different characters, i.e. 14 classes, and each class has 20 data points. We compare both the RBF-ARD kernel and the convolutional kernel. Table 4 shows the parameter settings for the Omniglot data set.
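For reference, the two kernels can be instantiated roughly as follows with GPflow 2.x; this is a sketch under the assumption of 20 \(\times\) 20 flattened inputs, the 3 \(\times\) 3 patch size is illustrative rather than the setting of Table 4, and the exact constructor arguments may differ between GPflow versions.

```python
import numpy as np
import gpflow

# RBF-ARD kernel on the 400-dimensional flattened 20 x 20 images:
# one lengthscale per input dimension (pixel).
rbf_ard = gpflow.kernels.SquaredExponential(lengthscales=np.ones(400))

# Convolutional kernel (Van der Wilk et al., 2017): a base RBF kernel applied
# to image patches and summed over all patch locations.
conv = gpflow.kernels.Convolutional(
    gpflow.kernels.SquaredExponential(),  # base kernel acting on patches
    [20, 20],                             # image shape
    [3, 3],                               # patch shape (illustrative)
)
```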

In Fig. 7 we show that MOGPs-AR outperforms single-output Gaussian processes on both alphabets, with either the convolutional kernel or RBF-ARD. The reason is that MOGPs-AR can capture the dependency between the two alphabets, which helps improve the predictions for both. The single-output Gaussian processes cannot exploit this dependency, so the single-output Gaussian process with either the convolutional kernel or RBF-ARD performs poorly on both Ojibwe and Blackfoot. The small mini-batch size also has a negative influence on the single-output Gaussian processes (Fig. 8). In particular, the values of the three performance metrics are close to 0.05 for G-A with the convolutional kernel on Ojibwe.

Since both alphabets come from the Canadian Aboriginal Syllabics, we expect them to be strongly correlated. Figure 5b indeed shows similar global intra- and inter-output correlations for both alphabets, which indicates that our model has the capacity to capture the underlying correlation among these related data sets.

MOGPs-AR with the convolutional kernel outperforms MOGPs-AR with RBF-ARD on both alphabets in terms of the three performance metrics (see Fig. 7). For example, on the Blackfoot alphabet, switching from RBF-ARD to the convolutional kernel improves the Recall-Weighted of MOGPs-AR from 0.468 to 0.714. Moreover, we also combine G-M and G-A with the convolutional kernel, and they likewise perform better than with RBF-ARD. In particular, G-M with the convolutional kernel obtains a Recall-Weighted of 0.5858 on the Blackfoot alphabet, compared with 0.0857 using RBF-ARD. The performance of G-M with the convolutional kernel (0.5858) is even better than that of MOGPs-AR with RBF-ARD (0.468) on the Blackfoot alphabet. The reason is that the convolutional kernel captures image-level features more effectively than the RBF-ARD kernel.

Fig. 7 Image: performance in cross-validation (mean ± standard deviation). We compare both RBF-ARD and the convolutional kernel for all the models

To investigate the effect of the mini-batch size, we set up another experiment. We train the exact same models, with the parameters initialised in the same way as in the experiment above, but using different mini-batch sizes (50, 70 and 90). Since the convolutional kernel provided better results in the previous experiments, we only show the results using the convolutional kernel and the Recall-Weighted performance measure for both alphabets. Figure 8 shows that the mini-batch size has more influence on the single-output Gaussian processes than on MOGPs-AR. A small mini-batch size, e.g. 50, has a negative impact on G-M and G-A, whereas MOGPs-AR improves slightly or stays at a similar level as the mini-batch size increases. G-A and G-M improve their performance as the mini-batch size grows from 50 to 90. When the mini-batch size is 90, G-M performs similarly to MOGPs-AR. However, with a mini-batch size of 50, MOGPs-AR still achieves good performance compared to the single-output GPs.

4.4.2 All alphabets

In our final experiment, we apply MOGPs-AR to the 50 alphabets of the original data set. There are 50 outputs with a different number of classes in each output (for more details on the number of classes per output see Table 2). The total number of classes across the 50 outputs is 1623. We follow Lake et al. (2015) and split the 50 alphabets into two sets: a background set and an evaluation set, where the background set has 30 alphabets (with a total of 964 classes) and the evaluation set has 20 alphabets (with a total of 659 classes). To apply MOGPs-AR to all 50 alphabets, we use a mini-batch size of nine data points per output to train our model. Since such a small mini-batch size has a negative impact on G-M and G-A, we only apply MOGPs-AR in this experiment. We apply MOGPs-AR to three different sets of alphabets: all alphabets, the Background alphabets and the Evaluation alphabets.

In Fig. 9, we show empirically (on 50 different outputs and a total of 1623 classes of image data) that MOGPs-AR scales better than traditional multi-output Gaussian processes. MOGPs-AR reduces the computational complexity by subsampling both the training data points and the classes in each output. Figure 9 also indicates that MOGPs-AR obtains good performance even when we choose a small mini-batch size (nine) and only a small number of classes (one) per output, since it captures both intra- and inter-output correlations. A rough sketch of this kind of subsampling step is given below.
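The following sketch illustrates the idea of subsampling both data points and classes per output at each stochastic optimisation step; the data layout (per-output arrays X[d], y[d]) and the uniform sampling are illustrative assumptions, not the exact scheme implemented in MOGPs-AR.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_step(X, y, num_classes, batch_size=9, classes_per_output=1):
    """For each output d, draw a mini-batch of data points and a subset of
    classes to be used in one stochastic optimisation step."""
    batches = []
    for d in range(len(X)):
        # Mini-batch of inputs/labels for output d.
        idx = rng.choice(len(X[d]), size=batch_size, replace=False)
        # Subset of classes considered for output d at this step.
        cls = rng.choice(num_classes[d], size=classes_per_output, replace=False)
        batches.append((X[d][idx], y[d][idx], cls))
    return batches
```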

Fig. 8 Recall-Weighted performance in cross-validation (mean ± standard deviation)

For most predictions, our model trained with the data of all alphabets outperforms one trained with the data of only part of the alphabets. For example, our model trained on all alphabets improves the Recall-Weighted for the Aurek alphabet from 0.6096 to 0.6692, compared with one trained on the evaluation alphabets alone. The extra alphabets help our model improve its performance.

Fig. 9 Performance in cross-validation on the Evaluation alphabets (left) and the Background alphabets (right). In both figures, each circle is the mean of the Recall-Weighted; the error bar is the standard deviation of the Recall-Weighted

However, there are exceptions to the scenario described in the last paragraph. For example, for the Syriac (Estrangelo) alphabet, the Recall-Weighted value of 0.5174 is lower than the 0.5283 obtained when only the background alphabets are used for training our model. One likely reason is that our model assumes a correlation with all alphabets; however, the correlation with some of those alphabets may not exist, or the correlation may hinder the predictive performance.

5 Conclusion

In this paper, we have introduced MOGPs-AR, a novel framework that allows the use of multi-output Gaussian processes for multi-output multi-class classification. MOGPs-AR can tackle large-scale data sets and a large number of classes in each output. Further, when combined with the convolutional kernel, it is well suited to downsized image data.

We experimentally show that MOGPs-AR achieves results similar to MG-M, a linear model of coregionalisation that uses a stochastic variational inference method similar to ours. However, the training time of MOGPs-AR is shorter than that of MG-M. Experimental results on various data sets also indicate that MOGPs-AR significantly improves the performance compared to single-output Gaussian processes.

MOGPs-AR performs well in extreme classification using a softmax function, which is only suited to settings where each instance is associated with a single class. Because of the softmax function, MOGPs-AR cannot deal with multi-label problems, where each data point belongs to multiple classes. Generalising MOGPs-AR to the multi-label problem, which is closely related to extreme classification, would be interesting future work.

A practical application of Gaussian process models to realistic image recognition tasks is still an open research problem. For example, on the realistic RGB data set CIFAR-10 (Krizhevsky & Hinton, 2009), the accuracy of Gaussian processes (Van der Wilk et al., 2017; Blomqvist et al., 2019) is not as high as that of state-of-the-art approaches such as deep learning. A potential extension would be to integrate the structural properties of deep learning architectures into our model by using deep kernel learning (Wilson et al., 2016).