1 Introduction

Multi-label classification is a supervised learning problem where data instances are associated with multiple classes (Tsoumakas and Katakis 2007; Read et al. 2011; Zhang and Zhou 2013; Gibaja and Ventura 2014, 2015). It can be viewed as a generalization to the more traditional multi-class classification problem, where each data point can belong only to a single class. Multi-label learning has attracted a lot of attention in the recent literature, due to its numerous applications ranging from text and image classification to computational advertising and recommender systems (Gibaja and Ventura 2014, 2015; Prabhu and Varma 2014; Jain et al. 2016). Two main challenges in multi-label learning are: (i) the modelling challenge associated with introducing suitable models to capture the correlation across different labels, and (ii) the computational or scalability challenge associated with dealing with datasets having very large number of labels, training instances and input dimensions.

In this paper, we tackle the problem of multi-label learning using a probabilistic framework based on Gaussian processes (GPs) (Rasmussen and Williams 2005). From a GP perspective multi-label learning shares similarities with the standard approaches for multi-task or multi-output Gaussian regression suitable for real-valued output data (Teh et al. 2005; Alvarez et al. 2012; Bonilla et al. 2008; Moreno-Muñoz et al. 2018). The difference is that in multi-label learning the output data are binary, thus requiring Bernoulli or binary regression type of likelihoods. Based on this, we introduce a multi-label extension of the semi-parametric latent factor model (Teh et al. 2005) that allows to capture the correlation of multiple labels using a small set of shared latent GP functions. Our work bear many similarities with Dai et al. (2017) and it can be seen as a special case of the more general model presented in Moreno-Muñoz et al. (2018) where both continuous and discrete outputs are allowed and a variant number of latent GP functions can be deployed.

Furthermore, to address the computational challenges when training the model, we make use of sparse GP approximations (Csato and Opper 2002; Lawrence et al. 2002; Seeger et al. 2003; Quiñonero-Candela and Rasmussen 2005; Snelson and Ghahramani 2006; Titsias 2009; Hensman et al. 2013; Bui et al. 2017) together with stochastic variational inference (Hoffman et al. 2013). Specifically, by using the sparse GP variational inference framework which employs inducing variables (Titsias 2009) as well as its stochastic and non-Gaussian likelihood variants (Hensman et al. 2013; Lloyd et al. 2015; Hensman et al. 2015; Dezfouli and Bonilla 2015; Sheth et al. 2015), we derive an algorithm that can scale to arbitrarily large numbers of data instances. More precisely, the main techniques we introduce regarding scalable GPs are: (i) stochastic variational inference that allows us to train the GP multi-label model by sub-sampling both data instances and labels, and (ii) optimization of the inducing inputs, using gradient-based methods, in extremely high dimensional input spaces involving possibly hundreds of thousands of dimensions; we show that such optimization can significantly improve predictive performance.

It is also noted that the main goal of the paper is to present a Bayesian non-parametric model that is able to scale well to extreme dimensions while at the same time achieves performance similar to other state-of-the-art methods that make use of non-probabilistic models. The remainder of the paper has as follows: Sect. 2 provides a brief discussion about related work. Section 3 describes the form of the multi-label GP model while Sect. 4 discusses scalable variational inference for sparse GP models. Section 5 demonstrates the method using a series of large scale multi-label datasets. Finally, the paper gives some performance characteristics of our method in Sect. 6 concludes with a discussion in Sect. 7.

2 Related work

In the last decade, a multitude of methods have been developed to tackle multi-label classification problems (Tsoumakas and Katakis 2007; Zhang and Zhou 2013). Algorithms such as multi-label random forest (Kocev et al. 2007) or multi-label k-nearest neighbours (Zhang and Zhou 2007) have achieved superior performances than other methods. Nevertheless, all the aforementioned methods fail to scale with the large dimensions describing the eXtreme Multi-label Learning (XML). The last few years, all the proposed XML algorithms fall mainly into two main categories: Label-embedding methods, Tree-based methods.

Label-embedding methods are based on the assumption that the label space can be efficiently approximated by a low-rank structure. The first methods that tried to apply this concept were WSABIE (Weston et al. 2011) and LEML (Yu et al. 2014). Nonetheless, the low-rank representation of the output space failed to lead to high performance due to the information loss followed by the long tail distribution of the labels. This hindrance was later circumvented by SLEEC (Bhatia et al. 2015) and AnnexML (Tagami 2017) where a number of local low-rank embeddings are trained separately based on a partition of the feature space.

Tree-based methods (Prabhu and Varma 2014; Jain et al. 2016; Jasinska et al. 2016; Niculescu-Mizil and Abbasnejad 2017; Si et al. 2017; Siblini et al. 2018; Prabhu et al. 2018; Wydmuch et al. 2018; Khandagale et al. 2020) can be seen as transformation methods that aim to divide the initial large-scale problem into a multiple small-scale sub-problems by recursively partitioning the feature or label space. Those subsets are connected with the nodes of the trees. These methods are usually known for their fast training and prediction time at the expense of predictive performance.

Despite these two major categories, there are recently developed methods (Yen et al. 2016, 2017; Babbar and Schölkopf 2017) which are based on one-vs-rest strategies and/or linear models. For instance, DiSMEC (Babbar and Schölkopf 2017) achieves state-of-the art performance by minimizing Hamming loss with \(\ell _2\) regularization while an imposed threshold help to reduce model size. These methods are heavily relied on distributed hardware to reduce training and test time. Deep learning methods have also emerged in the XML field (Nam et al. 2017) which make extensive use of GPU resources and more sophisticated text preprocessing techniques (Liu et al. 2017; You et al. 2019).

A few Bayesian methods have also been proposed (He et al. 2012; Kapoor et al. 2012; Jain et al. 2017; Gaure et al. 2017; Papanikolaou and Tsoumakas 2018), however they cannot be trained on XML datasets and/or they attain poor performance comparing to state-of-the art methods. Our method bears similarities with most of those works. Specifically, it is a latent factor model that probabilistically finds a low rank representation of the label matrix Y conditioned on the inputs X, thus, it is also connected to the label-embedding approaches, however, most of these approaches are not probabilistic, so they cannot easily be incorporated to a larger model (possible handling additional types of observed data) or make use of stochastic or on-line statistical learning algorithms. A probabilistic method that is mostly related to ours is the approach by Jain et al. (2017) who constructed a bilinear latent factor model of Y conditioned on X. Some aspects of our method can be thought of as kernelized GP-based extensions of certain linear latent variables in Jain et al. (2017), with the additional difference that we base inference on variational methods while Jain et al. (2017) use Gibbs sampling or on-line EM.

Finally, a previous GP-based method for multi-label learning was proposed by He et al. (2012) and it was based on the multi-task GP model of Bonilla et al. (2008). This method makes use of basic Nyström-type methods to approximate the full multi-task kernel matrix of size \(K N \times K N\) with a matrix of size \(K M \times K M\) (N is the number of training data points and \(M \ll N\); see next section) which leads to complexity \({\mathscr {O}}(K^3 M^3)\), i.e. cubic with respect to the number of labels K , rendering this method impractical for XML problems. Instead, our method utilizes variational sparse GP methods and stochastic optimization to obtain a fully scalable algorithm having a sub-linear complexity over the class labels and data points.

3 The multi-label GP factor model

Suppose a training dataset \({\mathscr {D}} = ({\mathbf {x}}^{(i)}, {\mathbf {y}}^{(i)})_{i=1}^N\) where each \({\mathbf {x}}^{(i)} \in {\mathbb {R}}^D\) is the input vector and \({\mathbf {y}}^{(i)} \in \{-1,1\}^K\) is the binary vector that indicates the class labels assigned to \({\mathbf {x}}^{(i)}\), so that \(y_k=1\) indicates presence of the k-th label while \(y_k=-1\) indicates absence. We will collectively denote all input vectors by \(X \in {\mathbb {R}}^{N \times D}\) and the binary labels by \(Y \in \{ -1, 1 \}^{N \times K}\) so that rows of these matrices store respective data points. We wish to model these data using a flexible probabilistic model that captures the correlation between different labels. We consider a multi-label extension of the semiparametric latent factor model (SLFM) of Teh et al. (2005) that combines a linear latent variable model with GPs. Specifically, SLFM is a general-purpose multi-output GP model (Teh et al. 2005; Álvarez and Lawrence 2011; Alvarez et al. 2012) that uses a small number of P latent GPs (factors) to generate the K outputs through a linear mapping. The full hierarchical model for generating the training examples is,

$$\begin{aligned} h_p&\sim {{\mathscr {G}}}{{\mathscr {P}}}(0, k({\mathbf {x}}^{(i)}, {\mathbf {x}}^{(j)})), \ p=1,\ldots ,P, \end{aligned}$$
(1)
$$\begin{aligned} {\mathbf {f}}^{(i)}&= \Phi {\mathbf {h}}^{(i)} + {\mathbf {b}}, \ i=1,\ldots ,N, \end{aligned}$$
(2)
$$\begin{aligned} {\mathbf {y}}^{(i)}&\sim p({\mathbf {y}}^{(i)}|{\mathbf {h}}^{(i)}) = \prod _{k=1}^K \sigma (y^{(i)}_k f_k^{(i)} ), \ i=1,\ldots ,N, \end{aligned}$$
(3)

where \(h_p\) denotes a latent function drawn from a GP with zero-mean and kernel function \(k({\mathbf {x}}^{(i)}, {\mathbf {x}}^{(j)})\) that depends on kernel hyperparameters \(\varvec{\theta }\) (although \(\varvec{\theta }\) is suppressed throughout to keep the notation uncluttered). Further, \({\mathbf {h}}^{(i)} = [h_1^{(i)} \ldots , h_P^{(i)}]^\top \in {\mathbb {R}}^P\) denotes the vector of all function values evaluated at input \({\mathbf {x}}^{(i)}\), i.e. \(h_p^{(i)} \equiv h_p( {\mathbf {x}}^{(i)})\), while the parameters \(\Phi \in {\mathbb {R}}^{K \times P}\) and \({\mathbf {b}}\in {\mathbb {R}}^K\) correspond to the factor loadings matrix and the bias vector of the linear mapping. By using these parameters the latent vector \({\mathbf {h}}^{(i)}\) is deterministically mapped into \({\mathbf {f}}^{(i)} = [f_1^{(i)},\ldots ,f_K^{(i)}]^\top \in {\mathbb {R}}^K\), such that each

$$\begin{aligned} f_k^{(i)} = \sum _{p=1}^P \phi _{k p} h_p^{(i)} + b_k, \end{aligned}$$
(4)

defines the so-called utility score that finally generates the k-th binary label through a sigmoidal/Bernoulli likelihood. Notice that while the labels are conditionally independent given \({\mathbf {h}}^{(i)}\), they become fully coupled once these variables are integrated out. The full joint distribution is given by

$$\begin{aligned} \prod _{i=1}^N p({\mathbf {y}}^{(i)}|{\mathbf {h}}^{(i)}) \prod _{p=1}^P p({\mathbf {h}}_p), \end{aligned}$$
(5)

where \(p({\mathbf {h}}_p) = {\mathscr {N}}({\mathbf {h}}_p | {\mathbf {0}}, K_X)\) is an N-dimensional Gaussian distribution induced by evaluating the GP prior at the training inputs X with \(K_X\) denoting the kernel or covariance matrix, \([K_X]_{ij} = k({\mathbf {x}}^{(i)}, {\mathbf {x}}^{(j)})\). An equivalent way to write the above model is by using the concept of kernels for multi-task or vector-valued functions (Bonilla et al. 2008; Álvarez and Lawrence 2011; Alvarez et al. 2012). More precisely, observe that the utility scores \(f_k^{(i)}\) that directly interact with the data in (3) follow a GP prior with mean given by the bias \(b_k\) (that depends on the label but not on the input) and covariance function

$$\begin{aligned} \text {Cov}(f_l^{(i)}, f_k^{(j)}) = k({\mathbf {x}}^{(i)}, {\mathbf {x}}^{(j)}) \sum _{p=1}^P \phi _{l p} \phi _{k p}. \end{aligned}$$
(6)

For regression problems with Gaussian likelihoods the above multi-task GP is known as the intrinsic correlation model (Stoyan 1996; Bonilla et al. 2008), a specific case of co-kriging in geostatistics; see Alvarez et al. (2012) for a full review. Here, we use this model for multi-label learning where the tasks correspond to different class labels.

Inference in the above model is very challenging since real applications in multi-label classification involve both very large number of training instances N and very large number of class labels K (Zhang and Zhou 2013; Gibaja and Ventura 2014, 2015). In the next section we propose an efficient variational inference algorithm that combines sparse GPs with stochastic variational inference (Hoffman et al. 2013) and scales as \(O(P M^3)\) where \(M \ll N\).

4 Scalable variational inference

The approximate inference procedures derived in this section are mainly based on the representation that uses the latent GP vectors \({\mathbf {u}}_p\) rather than the multi-task kernel representation in (6). The utility scores \(f_k^{(i)}\) will only be used to simplify the computations of some final Gaussian integrals. Section 4.1 discusses the variational sparse GP formulation, while Sect. 4.2 shows how doubly stochastic optimization can allow to deal with arbitrarily large N and K.

4.1 Sparse approximation

To deal with large number of training data we consider the variational sparse GP inference framework based on inducing variables introduced by Titsias (2009); see also Matthews et al. (2016) for a measure-theoretic derivation of this method and Bauer et al. (2016) for a useful discussion about its properties. For each latent function \(h_p\) we introduce a vector \({\mathbf {u}}_p \in {\mathbb {R}}^M\) of function values of \(h_p\) evaluated at inputs Z, where for simplicity we take the inputs Z to be shared by all latent GPs. The vector \({\mathbf {u}}_p\) is often referred to as inducing variables and Z as the inducing or pseudo inputs (Quiñonero-Candela and Rasmussen 2005; Snelson and Ghahramani 2006). In the variational sparse GP method Z plays the role of a variational parameter that can be optimized to improve the approximation. By following Titsias (2009) we augment the joint distribution in (5) with the inducing variables to obtain

$$\begin{aligned} \prod _{i=1}^n p({\mathbf {y}}^{(i)}|{\mathbf {h}}^{(i)}) \prod _{p=1}^P p({\mathbf {h}}_p | {\mathbf {u}}_p) p({\mathbf {u}}_p). \end{aligned}$$
(7)

Here, \(p({\mathbf {u}}_p) = {\mathscr {N}} ({\mathbf {u}}_p | {\mathbf {0}}, K_Z)\) is the marginal GP prior over \({\mathbf {u}}_p\) and \(K_Z\) is the \(M \times M\) kernel matrix obtained by evaluating the kernel function at Z while \(p({\mathbf {h}}_p | {\mathbf {u}}_p)\) is the conditional GP prior

$$\begin{aligned} p({\mathbf {h}}_p | {\mathbf {u}}_p) = {\mathscr {N}} \left( {\mathbf {h}}_p | K_{X Z} K_{Z}^{-1} {\mathbf {u}}_p, K_X - K_{X Z} K_{Z}^{-1} K_{Z X} \right) , \end{aligned}$$

where \(K_{X Z}\) is the cross-covariance matrix between the training inputs X and the inducing inputs Z while \(K_{Z X} = K_{X Z}^{\top }\). For any value of Z this augmentation does not change the model (e.g. the exact marginal likelihood is invariant to the value of Z), however by applying a certain variational approximation in the space of \(({\mathbf {h}}_p,{\mathbf {u}}_p)\) we can both reduce the time complexity and also treat Z as a variational parameter. This is achieved by choosing the approximate distribution to be

$$\begin{aligned} {\mathscr {Q}} = \prod _{p=1}^P p({\mathbf {h}}_p | {\mathbf {u}}_p) q({\mathbf {u}}_p), \end{aligned}$$
(8)

where \(p({\mathbf {h}}_p | {\mathbf {u}}_p)\) is the conditional GP prior that appears also in the joint (7), while \(q({\mathbf {u}}_p) = {\mathscr {N}}({\mathbf {u}}_p | \varvec{\mu }_p, \Sigma _p)\) is a Gaussian variational distribution over the inducing variables for the p-th latent GP. Hence, \(\varvec{\mu }_p \in {\mathbb {R}}^M\) is a real-valued vector of tunable variational parameters, while \(\Sigma _p \in {\mathbb {R}}^{M \times M}\) is the covariance matrix of the variational distribution which is parametrized by \(\frac{M^2 + M}{2}\) variational parameters since we only need a \(M \times M\) lower triangular matrix \(L_p\) to express \(\Sigma _p\). The corresponding \(q({\mathbf {h}}_p)\) obtained by the previous choice of \(q({\mathbf {u}}_p)\) (derived by marginalizing out up from the variational distribution in (8)) is \(q({\mathbf {h}}_p) = {\mathscr {N}}({\mathbf {h}}_p | \varvec{\mu }^h_p, \Sigma ^h_p)\) where

$$\begin{aligned} \varvec{\mu }^h_p&= Q_{X Z} \varvec{\mu }_p, \end{aligned}$$
(9)
$$\begin{aligned} \Sigma ^h_p&= K_X - Q_{X Z} K_{Z X} + R^p_{X Z} R^{p^{\top }}_{X Z}, \end{aligned}$$
(10)

with \(Q_{X Z} = K_{X Z} K_Z^{-1}\) and \(R^p_{X Z} = Q_{X Z} L_p\).

To express the lower bound on the log marginal likelihood \(\log p(Y)\) under the variational distribution in (8) we start the derivation as in Titsias (2009) which leads to cancellation of each conditional GP prior \(p({\mathbf {h}}_p | {\mathbf {u}}_p)\) and then by following the derivation suitable for scalable and/or non-Gaussian likelihoods (Hensman et al. 2013; Lloyd et al. 2015; Hensman et al. 2015; Dezfouli and Bonilla 2015) we derive the final bound. More specifically, we would like to approximate the true posterior \({\mathscr {P}} \equiv p( \{{\mathbf {h}}_p,{\mathbf {u}}_p\}_{p=1}^P |Y)\) with the variational distribution in (8). The minimization of the KL divergence \(\text {KL}[{\mathscr {Q}} || {\mathscr {P}}]\) is equivalently expressed as the maximization of the following lower bound on the log marginal likelihood \(\log p(Y)\),

$$\begin{aligned}&{\mathbb {E}}_{ {\mathscr {Q}} } \left[ \log \frac{\prod _{i=1}^N p({\mathbf {y}}^{(i)}|{\mathbf {h}}^{(i)}) \prod _{p=1}^P p({\mathbf {h}}_p | {\mathbf {u}}_p) p({\mathbf {u}}_p)}{\prod _{p=1}^P p({\mathbf {h}}_p | {\mathbf {u}}_p) q({\mathbf {u}}_p)} \right] \\&\quad = {\mathbb {E}}_{ {\mathscr {Q}} } \left[ \log \frac{\prod _{i=1}^N p({\mathbf {y}}^{(i)}|{\mathbf {h}}^{(i)}) \prod _{p=1}^P p({\mathbf {u}}_p)}{\prod _{p=1}^P q({\mathbf {u}}_p)} \right] \\&\quad = \sum _{i=1}^N {\mathbb {E}}_{ {\mathscr {Q}} }\left[ \log p({\mathbf {y}}^{(i)}|{\mathbf {h}}^{(i)}) \right] - \sum _{p=1}^P {\mathbb {E}}_{ {\mathscr {Q}} }\left[ \log \frac{q({\mathbf {u}}_p)}{ p({\mathbf {u}}_p)} \right] \end{aligned}$$

Since each \({\mathscr {Q}}\) is given by (8), each term in the second sum simplifies to become an expectation over \({\mathbf {u}}_p\),

$$\begin{aligned} {\mathbb {E}}_{ {\mathscr {Q}} }\left[ \log \frac{q({\mathbf {u}}_p)}{ p({\mathbf {u}}_p)} \right] = {\mathbb {E}}_{ q({\mathbf {u}}_p) }\left[ \log \frac{q({\mathbf {u}}_p)}{ p({\mathbf {u}}_p)} \right] \end{aligned}$$

which is precisely the KL divergence \(\text {KL}[q({\mathbf {u}}_p) || p({\mathbf {u}}_p)]\). Therefore, the derived bound can be written as

$$\begin{aligned} {\mathscr {F}}&= - \sum _{i=1}^N \sum _{k=1}^K {\mathbb {E}}_{q(f_k^{(i)})} \left[ \log ( 1 + e^{-y_{k}^{(i)} f_k^{(i)} }) \right] - \sum _{p=1}^P \text {KL}[ q({\mathbf {u}}_p) || p({\mathbf {u}}_p) ]. \end{aligned}$$
(11)

In the first line of this expression we have written the expectation of each log-likelihood term as an integral under the scalar utility \(f_k^{(i)} = \sum _{p=1}^P \phi _{kp} h_p^{(i)} + b_k\), defined in (4), that follows the univariate variational Gaussian distribution

$$\begin{aligned} q( f_k^{(i)} ) = {\mathscr {N}}\left( f_k^{(i)} | \sum _{p=1}^P \phi _{kp} \mu _p^{(i)} + b_k, \sum _{p=1}^P \phi _{kp}^2 s_p^{(i)} \right), \end{aligned}$$
(12)

where \(\mu _p^{(i)}\) is the i-th element of the vector \(\varvec{\mu }^h_p\) defined in (9) and \(s_p^{(i)}\) the i-th diagonal element (i.e. variance) of the covariance matrix \(\Sigma ^h_p\) from (10). Clearly, all expectations over the likelihood terms reduce to performing NK one-dimensional integrals under Gaussian distributions and each such integral can be accurately approximated by Gauss-Hermite quadrature.

Each KL divergence term of the lower bound in the second line of Eq. (11) is given by

$$\begin{aligned} \text {KL} [q({\mathbf {u}}_p) || p({\mathbf {u}}_p)]&= \frac{1}{2} \varvec{\mu }_p^\top K_Z^{-1} \varvec{\mu }_p + \frac{1}{2} {{\,\mathrm{tr}\,}}\left( K_Z^{-1} \Sigma _p \right) + \frac{1}{2} \log |K_Z| - \frac{1}{2} \log |\Sigma _p| - \frac{M}{2}. \end{aligned}$$
(13)

Notice that this term and the overall bound in (11) can be computed efficiently while numerical stability can be achieved by a ”jitter” addition on \(K_Z\). More specifically, we only need to compute once the Cholesky decomposition of \(K_Z\) in order to evaluate the P KL divergences, while the fact that we only keep the lower triangular matrix \(L_p\) of each variational covariance matrix, allows us to efficiently calculate all the remaining terms of each KL divergence.

To compute the bound we need firstly to perform one Cholesky decomposition of the matrix \(K_Z\) that overall scales as \({\mathscr {O}}(M^3)\) and allows us to fully calculate the sum of the KL divergence terms in the second line in (11). Then, with this Cholesky decomposition precomputed, for each i-th data point we need to compute \((\mu _p^{(i)},s_p^{(i)})_{p=1}^P\), an operation that scales as \({\mathscr {O}}(P M^2)\), and subsequently calculate the K variational distributions (i.e. their means and variances) over the utility scores in (12) which requires additional \({\mathscr {O}}(K P)\) time. Therefore, in order to compute the whole data reconstruction term of the bound (first line in Eq. (11)) we need \({\mathscr {O}}(N K P + N P M^2)\) time and for the full bound we need \({\mathscr {O}}(N K P + N P M^2 + M^3)\) time. Given that \(N \gg M\) and \(K \gg P\), the terms that can dominate are either \({\mathscr {O}}(N K P)\) or \({\mathscr {O}}(N P M^2)\) which can make the computations very expensive when the number of data instances and/or labels is very large. Next, we show how to make the optimization of the bound scalable for arbitrarily large numbers of data points and labels.

4.2 Scalable training using stochastic optimization

To ensure that the time complexity \({\mathscr {O}}(N K P + N P M^2 + M^3)\) for very large datasets is reduced to \({\mathscr {O}}(P M^3)\) we shall optimize the bound using stochastic gradient ascent by following a similar procedure used in stochastic variational inference for GPs Hensman et al. (2013). Given that the sum of KL divergences in (11) is already within the desired complexity \({\mathscr {O}}(P M^3)\), we only need to speed up the remaining data reconstruction term. This term involves a double sum over data instances and class labels, a setting suitable for stochastic approximation. Thus, a straightforward procedure is to uniformly sub-sample terms in the double sum in (11) which leads to an unbiased estimate of the bound and its gradients. It turns out that we can further reduce the variance of this basic strategy by applying a more stratified sub-sampling over class labels as discussed next.

Suppose \({\mathscr {B}}\subset \{1,\ldots ,N\}\) denotes the current minibatch at the t-th iteration of stochastic gradient ascent. For each \(i \in {\mathscr {B}}\) the internal sum over class labels can be written as

$$\begin{aligned} - \sum \limits _{k \in {\mathscr {P}}_{i}}{\mathbb {E}}_{q(f_k^{(i)})} \left[ \log ( 1 + e^{- f_k^{(i)} }) \right] - \sum \limits _{\ell \in {\mathscr {N}}_{i}} {\mathbb {E}}_{q(f_{\ell }^{(i)})} \left[ \log ( 1 + e^{f_{\ell }^{(i)} }) \right] , \end{aligned}$$

where \({\mathscr {P}}_i = \{ k | y_{k}^{(i)} = 1 \}\) is the set of present or positive labels of \({\mathbf {x}}^{(i)}\) while \({\mathscr {N}}_i = \{ k | y_{k}^{(i)} = - 1 \}\) is the set of absent or negative labels such that \({\mathscr {P}}_i \cup {\mathscr {N}}_i = \{ 1, \cdots , K \}\). In typical multi-label classification problems (Zhang and Zhou 2013; Gibaja and Ventura 2014, 2015) the size of positive labels \({\mathscr {P}}_i\) is very small, while the negative set can be extremely large. Thus, we can enumerate exactly the first sum and use (if needed) sub-sampling to approximate the second sum over the negative labels. The whole process becomes somehow similar to negative sampling used in large scale classification and for learning word embeddings Mikolov et al. (2013). Overall, we get the following unbiased stochastic estimate of the lower bound,

$$\begin{aligned}&- \frac{N}{ | {\mathscr {B}} | } \ \sum _{i \in {\mathscr {B}}} \left( \sum _{k \in {\mathscr {P}}_i} {\mathbb {E}}_{q(f_k^{(i)})} \left[ \log ( 1 + e^{-f_k^{(i)} }) \right] \right. + \left. \frac{|{\mathscr {N}}_i|}{|{\mathscr {L}}_i|} \sum \limits _{\ell \in {\mathscr {L}}_i} {\mathbb {E}}_{q(f_{\ell }^{(i)})} \left[ \log ( 1 + e^{f_{\ell }^{(i)} }) \right] \right) \\&- \sum _{p=1}^P \ \text {KL}[ q({\mathbf {u}}_p) || p({\mathbf {u}}_p) ], \end{aligned}$$

where \({\mathscr {L}}_i\) is the set of negative classes for the i-th data point. In general, the computation of this stochastic bound scales as \({\mathscr {O}}( |{\mathscr {B}} | ( |{\mathscr {P}}_i| + |{\mathscr {L}}_i|) P + |{\mathscr {B}} | P M^2 + M^3)\) and by choosing \(|{\mathscr {B}} | \sim {\mathscr {O}}(M)\) and \(|{\mathscr {P}}_i| + |{\mathscr {L}}_i| \sim {\mathscr {O}}(M^2)\) we can ensure that the overall time is \({\mathscr {O}}(P M^3)\). Notice that the second condition is not that restrictive and in many cases might not be needed, i.e. in practice we can use very large negative sets \({\mathscr {L}}_i\) which for many datasets could be equal to the full negative set \({\mathscr {N}}_i\).

We implemented the above stochastic bound in Python (where the one-dimensional integrals are obtained by Gauss-Hermite quadrature) in order to jointly optimize using stochastic gradient ascent and automatic differentiation toolsFootnote 1 over the parameters \((\Phi , {\mathbf {b}})\) of the linear mapping, the \(\frac{P M (M + 3)}{2}\) variational parameters \(\{ \varvec{\mu }_p, L_p \}_{p=1} ^P\) of the variational distributions \(q({\mathbf {u}}_p)\), the inducing inputs Z and the kernel hyperparameters \(\varvec{\theta }\).

Table 1 Data sets statistics: N and \(N_{ts}\) are the number of the training and test points respectively, D and K are the number of features and labels respectively, and \({\overline{K}}\) is the average number of positive labels in an instance

4.3 Prediction

Given a novel data point \({\mathbf {x}}^{(*)}\) we would like to make prediction over its unknown label vector \({\mathbf {y}}^{(*)}\). This requires approximating the predictive distribution \(p({\mathbf {y}}^{(*)}|Y)\),

$$\begin{aligned} p({\mathbf {y}}^{(*)}|Y) \approx \int p({\mathbf {y}}^{(*)} | {\mathbf {u}}^{(*)}) q({\mathbf {u}}^{(*)}) d {\mathbf {u}}^{(*)}. \end{aligned}$$
(14)

Here, \(q( {\mathbf {u}}^{(*)})\) is the variational predictive posterior over the latent function values \({\mathbf {u}}^{(*)}\) evaluated at \({\mathbf {x}}^{(*)}\). An interesting aspect of the variational sparse GP method is that to obtain \(q( {\mathbf {u}}^{(*)})\) we need to make no further approximations since everything follows from the GP consistency property, i.e.

$$\begin{aligned} q({\mathbf {u}}^{(*)})&= \prod _{p=1}^P \int p( u^{(*)}_p |{\mathbf {h}}_p, {\mathbf {u}}_p) p({\mathbf {h}}_p | {\mathbf {u}}_p ) q({\mathbf {u}}_p) d {\mathbf {h}}_p d {\mathbf {u}}_p \nonumber \\&= \prod _{p=1}^P \int p( u^{(*)}_p | {\mathbf {u}}_p) q({\mathbf {u}}_p) d {\mathbf {u}}_p = \prod _{p=1}^P q(u^{(*)}_p). \end{aligned}$$
(15)

Here, GP consistency tractably simplifies each integral \(\int p( u^{(*)}_p |{\mathbf {h}}_p, {\mathbf {u}}_p) p({\mathbf {h}}_p | {\mathbf {u}}_p ) d {\mathbf {h}}_p\)    to \(p( u^{(*)}_p | {\mathbf {u}}_p)\) so that the obtained \(p( u^{(*)}_p | {\mathbf {u}}_p)\) is the conditional GP prior of \(u^{(*)}_p\) given the inducing variables. The final form of each univariate Gaussian \(q(u^{(*)}_p)\) has a mean and variance given precisely by Eqs. (9) and (10) with X replaced by \({\mathbf {x}}^{(*)}\). In practice, when we compute several accuracy ranking-based scores that are often used in the literature to report multi-label classification performance Zhang and Zhou (2013), Gibaja and Ventura (2014), Gibaja and Ventura (2015) it suffices to further approximate \(q({\mathbf {u}}^{(*)})\) by a delta mass centred at the MAPFootnote 2. This reduces the whole computation of such scores to only requiring the evaluation of the mean utility vector \({\bar{{\mathbf {f}}}}^{(*)} = [{\bar{f}}_1^{(*)} \ldots {\bar{f}}_K^{(*)}]^\top\) such that \({\bar{f}}_k^{(*)} = \sum _{p=1}^P \phi _{k p} \mu _p^{(*)} + b_k\), where \(\mu _p^{(*)} = {\mathbf {k}}({\mathbf {x}}^{(*)}, Z) K_Z^{-1} \varvec{\mu }_p\) and \({\mathbf {k}}({\mathbf {x}}^{(*)}, Z)\) is the cross covariance row vector between \({\mathbf {x}}^{(*)}\) and the inducing points Z. By using \({\bar{{\mathbf {f}}}}^{(*)}\) we can compute several ranking scores as described in the experiments section after.

Table 2 Parameter settings of the MLGPF model for each dataset used in our experiments

5 Experiments

Experiments are carried out on three small-scale (SS) real-world datasets, Bibtex Katakis et al. (2008), Delicious Tsoumakas et al. (2008), and EUR-Lex Mencia and Fürnkranz (2008) and three large-scale (LS) real-world datasets, Wiki10 Zubiaga (2012), AmazonCat McAuley et al. (2015), and Delicious-L(arge) Wetzker et al. (2008). All the datasets are publicly available; see Table 1 for summary statistics of each dataset. To the best of our knowledge, this is the first time that a GB-based method is applied to datasets of that large dimensions in the context of multi-label learning. Nevertheless, datasets of even larger dimensions such as WikiLSHTC325K (Partalas et al. 2015) or Amazon3M (McAuley et al. 2015) were not able to be trained by our framework due to hardware limitations. The current code is available at https://github.com/aresPanos/mlgpf.

Table 3 Training time (in hours) comparison between the MLGPF model and Parabel/ProXML over the six multi-label datasets

In our experiments, we applied the proposed multi-label GP factor model (MLGPF) using a linear and a squared exponential (se) kernel, where in both cases we freely optimized the inducing inputs matrix Z resulting to methods LR and SE respectively. It is also worth mentioning that for those two kernels, we add D extra tunable hyperparameters – one for each input dimension – in order to improve our model performance in the extremely high dimensional datasets. We also employed a linear combination of those two kernels for providing extra flexibility to our model. Initialization of Z is achieved by running a few iterations of the k-means algorithm while the rest of the variational parameters are initialised by i.i.d. standard Gaussian samples. Additionally, we consider the case where the inducing inputs are randomly chosen from the training instances and are kept fixed throughout the optimization process. This is to demonstrate the crucial role of optimizing inducing inputs regarding the posterior approximation and the predictive performance. For this case, we employ the se kernel and we denote that method by SE-F. Further, it should be mentioned that choosing large size for the negative label set \({\mathscr {L}}_i\) can dramatically reduce variance in the stochastic optimization of the lower bound. Fortunately, the computational complexity analysis of Sect. 4.2 allows us to choose \({\mathscr {L}}_i\) such that \(|{\mathscr {L}}_i | \sim {\mathscr {O}}(M^2)\) which is not that restrictive, and in practice, for all datasets used in the experiments, \({\mathscr {L}}_i\) is selected to be equal to the size of the full negative set \({\mathscr {N}}_i\). For the evaluation of the one-dimensional integrals by Gauss-Hermite method in (11) we use ten quadrature points which allows to have a sufficiently close approximation without any increase on the computational cost of the algorithm.

For the small-scale datasets we run the MLGPF model ten times using ten different random initializations and we compute the corresponding average precision scores (see Table 7) and lower bounds (see Fig. 5) accompanied by their corresponding standard deviations. The high computational cost and training time of the large-scale datasets restricts us to just a single run for the remaining three large datasets.

We evaluate the predictive performance of our method against some of the state-of-the-art algorithms by using the Precision@k score (P@k). For a ground truth test vector \({\mathbf {y}}^{(*)} \in \{-1, 1\}^K\) and a predicted score vector \({\bar{{\mathbf {f}}}}^{(*)} \in {\mathbb {R}}^K\), the P@k is defined as \(k^{-1} \sum _{l \in \text {rank}_k({\bar{{\mathbf {f}}}}^{(*)})} ({\mathbf {y}}_{l}^{(*)} + 1)/2\), where \(\text {rank}_k({\bar{{\mathbf {f}}}}^{(*)})\) returns indices of the k largest values of \({\bar{{\mathbf {f}}}}^{(*)}\) in descending order. Here, \({\bar{{\mathbf {f}}}}^{(*)}\) can be evaluated using the trained MLGPF model as described in Sect. 4.3. Such ranking-based evaluation of multi-label models is very standard in the multi-label literature where we only care to predict the few 1s in the multi-label vectors, typical in many real-world examples, such as recommender systems; for example, see Prabhu and Varma (2014); Jain et al. (2017) and the reported results in the Extreme Classification Repository.Footnote 3 Moreover, following common practice in the extreme label literature we examine and compare the robustness of our method against tail labels, that is labels that appear rarely in the dataset. This is achieved by using the propensity score version of P@k, namely PSP@k, proposed in Jain et al. (2016). Finally, for completeness, we also report results for nDCG performance measure and its propensity version PSnDCG@k, see Jain et al. (2016) (Fig. 1).

Table 4 Performance comparison between the MLGPF model, where a linear combination of SE and LR is used, and other state-of-the-art methods
Table 5 Performance comparison, using the PSP@k metric, between the MLGPF model, where a linear combination of SE and LR is employed, and other state-of-the-art methods
Table 6 Predictive performance of the MLGPF model for the six multi-label datasets with respect to the nDCG@k and PSnDCG@1 scores

The parameter settings for each dataset can be found in Table 2. Parameters are primarily chosen in that way such that the memory footprint is not large enough for the 64Gb RAM Intel Xeon E5-2686 v4 @ 2.30GHz on which we run the experiments. Nevertheless, the aforementioned limitations do not impose much restriction on the performance of our model. A detailed comparison of the effect that different values of P and M have on our model using the Bibtex dataset can be found in Figs. 3 and 4 while Fig. 2 show the importance of increasing the values of P and M for the general performance of the model.

Table 7 Predictive Performance of the MLGPF model for the six multi-label datasets
Fig. 1
figure 1

Evolution of the lower bound for a Bibtex, b Delicious, c EUR-Lex, and d Wiki10. The blue line corresponds to maximization of the lower bound using fixed inducing inputs Z, while the red one to optimized Z. The SE kernel is used for each dataset while all the other parameters are set as described in Table 2 (Color figure online)

Based on both predictive performance (see Table 7) and computational time (see Table 3) of all three methods on small-scale datasets, we choose to solely employ the SE for large-scale datasets (except Wiki10, where SE-F is also used), since it combines high predictive power and low computational cost; attributes very important for the challenging task of training our model on those datasets.

Furthermore, we compare the MLGPF model with a linear combination of SE and LR kernels and optimized inducing inputs with six state-of-the-art-methods from the literature, namely SLEEC Bhatia et al. (2015), PFastreXML Jain et al. (2016), PD-Sparse Yen et al. (2016), DiSMEC Babbar and Schölkopf (2017), Parabel Prabhu et al. (2018), and ProXML Babbar and Schölkopf (2019) as they are reported in the Extreme Classification Repository (ECR) (see footnote 3). Results that do not appear in ECR were generated by using the publicly available codes with the suggested configurations. All those results appear in Tables 4 and 5 for P@k and PSP@k, respectively. Additional results, including nDCG@k and PSnDCG@k are available in Table 6. The choice of the specific methods is based on the high predictive power on both small and large scale datasets. At this point we would like to note that despite the similarities of our work with Moreno-Muñoz et al. (2018) as we mention at the beginning, the code provided by the latter is not capable to deal with the extremely large number of input and label dimensions and cannot be deployed for that kind of problems, rendering our tensorflow-based implementation more suitable. We also acknowledge that deep-learning based methods, see for example You et al. (2019), are known for their state-of-the-art accuracy results but they are heavily relied on sophisticated text pre-processing techniques and advanced computing hardware, and thus, we omit any comparisons with them.

Regarding predictive performance of the MLGPF model, it achieves the best P@k scores in Bibtex and Delicious dataset compared to the other state-of-the-art methods. Additionally, the predictive ability in each of the remaining datasets, even for the large-scale ones, is very close to the best state-of-the-art method, rendering our model highly competitive overall. For instance, in Eurlex the P@1 score of our model exceeds more than 7% the PFastreXML method while in the largest dataset in our experiments, Delicious-L, MLGPF attains a P@5 score very close to the SLEEC, surpassing in that way all the remaining methods. Similar patterns can be seen in Table 5, where our methods exhibits robustness against tail labels with respect to PSP@k measure.

A detailed training time comparison between our method and ProXML/Parabel is given in Table 3. The results indicate that the Bayesian nature of our framework and the large number of parameters to be optimized pay its toll when it comes to computational speed.

Finally, Fig. 1 and Table 7 reveal the positive correlation that the lower bound has with the predictive performance, while at the same time it is stressed the significance of optimizing over the inducing inputs Z (especially when the input dimensionality is increased). For instance, the lower bound of the SE for EUR-Lex dataset is significantly higher than the bound of SE-F, as shown in Fig. 1a, which leads to considerably better P@k scores for SE against SE-F.

Fig. 2
figure 2

Bar plots comparing both the P@1 and the lower bounds for Bibtex dataset using a, c fixed P={80, 159}, and b, d fixed M={50, 100}. For all cases the SE kernel is deployed while we run the MLGPF model for 50 epochs

6 Performance characteristics

Table 8 presents a detailed comparison of the effect that different values of the number of GPs P and number of inducing inputs M have on the optimization of the lower bound and the predictive performance of the MLGPF model. Lower bounds for the small scale datasets along with 95% confidence intervals are also provided; see Table 8.

Table 8 Predictive Performance of the MLGPF model for the six multi-label datasets

6.1 Extra experimental results

Figures 34, and 5 depict various comparisons in terms of predictive performance and optimization of the lower bound for the Bibtex dataset. Figure 3 shows the evolution of the P@1 throughout epochs for different values of P and M while Fig. 4 presents the same information as previously but in terms of the values of the lower bound. In all the above figures, the SE method is used for all cases. The last Figure gives the lower bounds for the small scale datasets in tandem with their corresponding 95% confidence intervals based on the experiments of the main paper.

Fig. 3
figure 3

P@1 of Bibtex dataset using a fixed M = 50, b fixed M = 100, c fixed P = 80, and d fixed P = 159

Fig. 4
figure 4

Lower bounds of Bibtex dataset using a fixed M = 50, b fixed M = 100, c fixed P = 80, and d fixed P = 159

Fig. 5
figure 5

Average lower bounds (black line) with 95% confidence intervals (cyan-coloured area) for each of the small-scale datasets based on ten random initializations. Left colums indicates use of the linear kernel while the right one the SE kernel (Color figure online)

7 Discussion

We have presented a GP factor model for multi-label learning and we have applied it to real-world large scale datasets involving hundreds of thousands input and label dimensions with similar or even higher predictive performance than other state-of-the-art methods. To achieve scalability, we have introduced a doubly stochastic variational inference scheme which could be useful to other non GP-based models for multi-label learning, while the simplified variational approximation could be useful to other GP applications, involving non-Gaussian likelihoods, such as standard multi-class GP classification.

Regarding future research, an important computational challenge is concerned with the development of efficient optimization schemes over inducing inputs in extremely high-dimensional input spaces, which is typical in many applications of multi-label learning Prabhu and Varma (2014). For that, we plan to exploit the significant sparsity patterns of these high dimensional points and optimize inducing inputs so that such sparsity patterns are preserved. Furthermore, we wish to elaborate more on the modelling aspects of our method such as, to modify the likelihood in order to deal with missing labels Jain et al. (2017) and add extra latent variables that can capture non-input dependent correlation between the class labels Gibaja and Ventura (2014). Finally, inspired by the encouraging results of the recent work on natural gradient-based optimization for sparse GP models for non-conjugate likelihoods in Salimbeni et al. (2018), we aim to explore its efficiency to our multi-label framework in those high-dimensional regimes.