1 Introduction

Transposable data describes relationships between pairs of entities. Such data can be organized as a matrix, with one set of entities as the rows and the other set as the columns. In such datasets, both the rows and the columns of the matrix are of interest. Transposable data matrices are often sparse, and of primary interest is the prediction of unobserved matrix entries representing unknown interactions. In the machine learning community, the modeling of transposable data is often encountered as multitask learning (Stegle et al. 2011). In addition to the matrix, transposable datasets often include features describing each row entity and each column entity, or graphs describing relationships between the rows and the columns. These features and graphs can be useful for improving in-matrix prediction performance and for extending model predictions outside of the observed matrix, thus alleviating the cold-start problem. In this work, we combine the matrix-variate Gaussian process model with low rank constraints for the predictive modeling of transposable data.

In recent years, the matrix-variate Gaussian distribution (MV-G) has emerged as a popular model for transposable data (Allen and Tibshirani 2010, 2012) as it compactly decomposes correlations between the matrix entries into correlations between the rows and correlations between the columns. Although the MV-G has been shown to be effective for modeling matrix data with missing entries, model predictions do not extend to rows and columns that are unobserved in the training data. One approach to remedy this deficiency is to replace the MV-G with the nonparametric matrix-variate Gaussian process (MV-GP) (Stegle et al. 2011). This is achieved by replacing the row and column covariance matrices of the MV-G with parameterized row and column covariance functions. Thus, the resulting model can provide predictions for new rows and columns given features. The MV-GP may also be described as an extension of the scalar valued Gaussian process (GP) (Rasmussen and Williams 2005), a popular model for scalar functions, to vector valued responses. The MV-GP has been applied to link analysis, transfer learning, collaborative prediction and other multitask learning problems (Yu and Chu 2008; Bonilla et al. 2008; Yan et al. 2011). Despite its wide use for transposable data and multitask learning, the MV-GP does not capture low rank structure.

Rank constraints have become ubiquitous in matrix prediction tasks (Yu et al. 2007; Zhu et al. 2009; Koyejo and Ghosh 2011; Zhou et al. 2012; Koyejo and Ghosh 2013a). The low rank assumption implies that matrix-valued parameters of interest can be decomposed as the inner product of low dimensional factors. This reduces the degrees of freedom in the matrix model and can improve the parsimony of the results. Recent theoretical (Candès and Recht 2009) and empirical (Koren et al. 2009) results have provided additional motivation for the low rank approach. The low rank assumption is also motivated by computational concerns. Consider the computational requirements of a full matrix regression model such as Gaussian process regression (Rasmussen and Williams 2005). Here, the memory requirements scale quadratically with data size, and naïve inference via a matrix inverse scales cubically with data size (Álvarez et al. 2012). In contrast, training low rank models can scale linearly with the data size and quadratically with the underlying matrix rank (using the factor representation). Further, efficient optimization methods have been proposed (Koren et al. 2009; Dudik et al. 2012).

We propose a novel constrained Bayesian inference approach that combines the flexibility and extensibility of the matrix-variate Gaussian process with the parsimony and empirical performance of low rank models. Constrained Bayesian inference (Koyejo and Ghosh 2013a) is a principled approach for enforcing expectation constraints on the Bayesian inference procedure. It is a useful approach for probabilistic inference when the problem of interest requires constraints that are difficult to capture using standard prior distributions alone. Examples include linear inequality constraints (Gelfand et al. 1992) and margin constraints (Zhu et al. 2012). To enforce these restrictions, constrained Bayesian inference represents the Bayesian inference procedure as a constrained relative entropy minimization problem. The resulting optimization problem can often be reduced to constrained parameter estimation and solved using standard optimization theoretic techniques.

The main contributions of this paper are as follows:

  • We propose a novel approach for capturing the low rank characteristics of transposable data by combining the matrix-variate Gaussian process prior with constrained Bayesian inference subject to nuclear norm constraints.

  • We show that (i) the distribution that solves the constrained Bayesian inference problem is a Gaussian process, (ii) its inference can be reduced to the estimation of a finite set of parameters, and (iii) the resulting optimization problem is strongly convex in these parameters.

  • We evaluate the proposed model empirically and show that it performs as well as (or better than) the state of the art domain specific models for disease-gene association prediction with gene network and disease ontology side information and recommender systems with social network side information.

We begin by discussing relevant background on the matrix-variate Gaussian process and nuclear norm constraints for matrix-variate functions in Sect. 2. We introduce the concept of constrained inference in Sect. 2.4 and apply it to the matrix-variate Gaussian process to compute a low rank prediction (Sect. 4). We present the empirical performance of the proposed model compared to state of the art domain specific models for transposable data in the disease-gene association domain (Sect. 5.1) and the recommender systems domain (Sect. 5.2). Finally, we conclude in Sect. 6.

2 Background

This section describes the problem statement (Sect. 2.2) and the main building blocks of our approach: (i) the matrix-variate Gaussian process (Sect. 2.3) and (ii) constrained Bayesian inference (Sect. 2.4).

2.1 Preliminaries

We denote vectors by bold lower case e.g. \({\mathbf {x}}\) and matrices by bold upper case e.g. \({\mathbf {X}}\). Let \({\mathbf {I}}_D\) represent the \(D \times D\) identity matrix. Given a matrix \({\mathbf {A}}\in {\mathbb {R}}^{P \times Q}\), \({\text {vec}({\mathbf {A}})} \in {\mathbb {R}}^{PQ}\) is the vector obtained by concatenating columns of \({\mathbf {A}}\). Given matrices \({\mathbf {A}}\in {\mathbb {R}}^{P \times Q}\) and \({\mathbf {B}}\in {\mathbb {R}}^{P' \times Q'}\), the Kronecker product of \({\mathbf {A}}\) and \({\mathbf {B}}\) is denoted as \({\mathbf {A}}\otimes {\mathbf {B}}\in {\mathbb {R}}^{PP'\times QQ'}\). A useful property is the Kronecker identity: \({\text {vec}({\mathbf {A}}{\mathbf {X}}{\mathbf {B}})} = ({{\mathbf {B}}^\top } \otimes {\mathbf {A}}) {\text {vec}({\mathbf {X}})}\), where \({\mathbf {X}}\in {\mathbb {R}}^{Q \times P'}\) and \({{\mathbf {B}}^\top }\) represents the transpose of \({\mathbf {B}}\).
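The Kronecker identity can be verified numerically. The following minimal sketch (assuming NumPy; the matrix sizes are arbitrary) uses column-major reshaping so that the flattening matches the column-stacking \({\text {vec}(\cdot )}\) defined above.

```python
# Numerical check of vec(A X B) = (B^T kron A) vec(X).
import numpy as np

rng = np.random.default_rng(0)
P, Q, Pp, Qp = 3, 4, 5, 2
A = rng.standard_normal((P, Q))
X = rng.standard_normal((Q, Pp))
B = rng.standard_normal((Pp, Qp))

vec = lambda M: M.reshape(-1, order="F")   # column-stacking vec(.)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)
```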

Let \(\mathrm {E}_{ }\left[ \, \cdot \, \right] \) be the expectation operator with \(\mathrm {E}_{ p}\left[ \, f(z) \, \right] = \int _z p(z)f(z) dz\). The Kullback-Leibler (KL) divergence between densities \(q(z)\) and \(p(z)\) is given by:

$$\begin{aligned} {\mathrm {KL}\!\left( {q(z)}\Vert {p(z)}\right) } = \mathrm {E}_{ q}\left[ \, \log q(z) - \log p(z) \, \right] . \end{aligned}$$

Let \({\mathbf {x}}\in {\mathbb {R}}^{P}\) be drawn from a multivariate Gaussian distribution. The density is given as:

$$\begin{aligned} \fancyscript{N}\left( {\mathbf {m}}, {\varvec{\varSigma }}\right) = \frac{ \exp \left( -\frac{1}{2} {({\mathbf {x}}- {\mathbf {m}})^\top }{{\varvec{\varSigma }}^{-1}}({\mathbf {x}}- {\mathbf {m}}) \right) }{(2\pi )^{P/2}|{\varvec{\varSigma }}|^{1/2}}, \end{aligned}$$

where \({\mathbf {m}}\in {\mathbb {R}}^{P}\) is the mean vector and \({\varvec{\varSigma }}\in {\mathbb {R}}^{P \times P}\) is the covariance matrix. \(|\cdot |\) denotes the matrix determinant and \({\mathrm {tr}}\!\left( \cdot \right) \) denotes the matrix trace.

2.2 Transposable data notation and problem statement

Let \({\mathbb {M}}\ni m\) be the index set of rows and \({\mathbb {N}}\ni n\) be the index set of columns. The index set of observed matrix entries is represented by \(\mathsf {L}=\{(m, n)\} \subset {\mathbb {M}}\times {\mathbb {N}}\), with elements denoted \(l=(m,n)\in \mathsf {L}\). We define the subset of observed rows as the set \(\mathsf {M}= \{m\,|\, (m,n) \in \mathsf {L}\} \subset {\mathbb {M}}\) with size \(|\mathsf {M}|=M\), and the subset of observed columns as the set \(\mathsf {N}= \{n\,|\, (m,n) \in \mathsf {L}\} \subset {\mathbb {N}}\) with size \(|\mathsf {N}|=N\), so \(L=|\mathsf {L}|\le M\times N\). Let each entry in the matrix be represented by \(y_l\). The observed subset of the transposable matrix is represented by \({\mathbf {y}}= {\left[ y_{l_1}\,\ldots \,y_{l_{L}} \right] ^\top }\). Our goal is to estimate a predictive model for any unobserved entries \(\{y_{l'}\,|\, l' \notin \mathsf {L}\}\), including entries not observed within the bounds of the training matrix, i.e., \(\{y_{l'}\,|\, l' \notin \mathsf {M}\times \mathsf {N}\}\).

2.3 Matrix-variate Gaussian process for transposable data

The matrix-variate Gaussian process is a doubly indexed stochastic process \(\{ Z_{m,n}\}_{m \in {\mathbb {M}}, n \in {\mathbb {N}}}\) where finitely indexed samples follow a multivariate Gaussian distribution. As with the scalar Gaussian process (Rasmussen and Williams 2005), the MV-GP is completely specified by its mean and covariance functions. We use the notation \(\fancyscript{M}\fancyscript{G}\fancyscript{P}\left( \phi , \fancyscript{C}_{\mathrm{M}}, \fancyscript{C}_{\mathrm{N}}\right) \) to denote the MV-GP with mean function \(\phi : {\mathbb {M}}\times {\mathbb {N}}\mapsto {\mathbb {R}}\), row covariance function \(\fancyscript{C}_{\mathrm{M}}: {\mathbb {M}}\times {\mathbb {M}}\mapsto {\mathbb {R}}\) and column covariance function \(\fancyscript{C}_{\mathrm{N}}: {\mathbb {N}}\times {\mathbb {N}}\mapsto {\mathbb {R}}\). The covariance function of the prior MV-GP has a Kronecker product structure (Álvarez et al. 2012). This form assumes that the prior covariance between matrix entries can be decomposed as the product of the row and column covariances. The joint covariance function of the MV-GP decomposes into product form as \(\fancyscript{C}\left( (m,n),(m',n')\right) = \fancyscript{C}_{\mathrm{M}}(m,m')\fancyscript{C}_{\mathrm{N}}(n,n')\), or equivalently, \(\fancyscript{C}= \fancyscript{C}_{\mathrm{N}}\otimes \fancyscript{C}_{\mathrm{M}}\). We use the notation \(\fancyscript{G}\fancyscript{P}\left( \psi , \fancyscript{C}\right) \) to denote the scalar valued Gaussian process with mean function \(\psi : \mathsf {L}\mapsto {\mathbb {R}}\) and covariance function \(\fancyscript{C}: \mathsf {L}\times \mathsf {L}\mapsto {\mathbb {R}}\).

Let \(Z \sim \fancyscript{M}\fancyscript{G}\fancyscript{P}\left( \phi , \fancyscript{C}_{\mathrm{M}}, \fancyscript{C}_{\mathrm{N}}\right) \), and define the matrix \({\mathbf {Z}}\in {\mathbb {R}}^{M\times N}\) with entries \(z_{m,n}= Z(m,n)\) for \((m,n)\in \mathsf {M}\times \mathsf {N}\). Then \({\text {vec}({\mathbf {Z}})}\) is distributed as a multivariate Gaussian with mean \({\text {vec}({\varvec{\varPhi }})}\) and covariance matrix \({\mathbf {C}}_{\mathrm{N}}\otimes {\mathbf {C}}_{\mathrm{M}}\), i.e., \({\text {vec}({\mathbf {Z}})}\sim \fancyscript{N}\left( {\text {vec}({\varvec{\varPhi }})}, {\mathbf {C}}_{\mathrm{N}}\otimes {\mathbf {C}}_{\mathrm{M}}\right) \), where \(\phi _{m,n}= \phi (m,n)\), \({\varvec{\varPhi }}\in {\mathbb {R}}^{M\times N}\) is the mean matrix, \({\mathbf {C}}_{\mathrm{M}}\in {\mathbb {R}}^{M \times M}\) is the row covariance matrix and \({\mathbf {C}}_{\mathrm{N}}\in {\mathbb {R}}^{N \times N}\) is the column covariance matrix. This definition extends to finite subsets \(\mathsf {L}\subset {\mathbb {M}}\times {\mathbb {N}}\) that are not complete matrices. For any subset \(\mathsf {L}\), the vector \({\mathbf {z}}= {\left[ z_{l_1}\,\ldots \,z_{l_{L}} \right] ^\top }\) is distributed as \({\mathbf {z}}\sim \fancyscript{N}\left( {\varvec{\varPhi }}_{\mathsf {L}} , {\mathbf {C}}\right) \), where the vector \({\varvec{\varPhi }}_{\mathsf {L}} =[\phi (l_1) \ldots \phi (l_L)] \in {\mathbb {R}}^L\) collects the entries of the mean matrix corresponding to the indices \(l \in \mathsf {L}\), and \({\mathbf {C}}\) is the covariance matrix evaluated only on pairs \(l,l' \in \mathsf {L}\times \mathsf {L}\).
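The Kronecker structure also makes it easy to draw finite marginals of the MV-GP without forming the \(MN \times MN\) covariance matrix. The following is a minimal sketch assuming NumPy; the toy covariance matrices are illustrative assumptions.

```python
# Draw Z with vec(Z) ~ N(0, C_N kron C_M) as Z = G_M E G_N^T,
# where C_M = G_M G_M^T, C_N = G_N G_N^T and E has i.i.d. standard normal entries.
import numpy as np

def sample_mvgp(C_M, C_N, rng):
    G_M = np.linalg.cholesky(C_M)                 # row covariance factor
    G_N = np.linalg.cholesky(C_N)                 # column covariance factor
    E = rng.standard_normal((C_M.shape[0], C_N.shape[0]))
    return G_M @ E @ G_N.T                        # M x N draw

rng = np.random.default_rng(0)
M, N = 4, 3
C_M = np.eye(M) + 0.5 * np.ones((M, M))           # toy row covariance
C_N = np.eye(N) + 0.3 * np.ones((N, N))           # toy column covariance
Z = sample_mvgp(C_M, C_N, rng)
```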

The MV-GP is a popular prior distribution for transposable matrix data. Here we combine it with a Gaussian observation noise model as follows (see Fig. 1):

  1. Draw the function \(Z\) from a zero mean MV-GP as \(Z \sim \fancyscript{M}\fancyscript{G}\fancyscript{P}\left( {0} , \fancyscript{C}_{\mathrm{M}}, \fancyscript{C}_{\mathrm{N}}\right) \).

  2. Draw each observed response independently as \(y_{m,n}\sim \fancyscript{N}\left( z_{m,n}, \sigma ^2\right) \) given \(z_{m,n}= Z({m,n})\).

The hidden matrix \({\mathbf {Z}}\in {\mathbb {R}}^{M \times N}\) with entries \(z_{m,n}= Z({m,n})\) may be interpreted as the latent noise-free matrix. The inference task is to estimate the posterior distribution \(Z|\fancyscript{D}\), where \(\fancyscript{D}=\{{\mathbf {y}}, \mathsf {L}\}\). It follows that the posterior distribution is a Gaussian process (Rasmussen and Williams 2005) given by \(Z|\fancyscript{D}\sim \fancyscript{G}\fancyscript{P}\left( \phi , \varSigma \right) \), with mean and covariance functions:

$$\begin{aligned}&\phi (m, n) = {\mathbf {C}}_{\mathsf {L}}(m,n){[{\mathbf {C}}+ \sigma ^2{\mathbf {I}}]^{-1}}{\mathbf {y}}\end{aligned}$$
(1a)
$$\begin{aligned}&\varSigma \left( (m, n), (m',n')\right) = \fancyscript{C}((m,n),(m',n')) - {\mathbf {C}}_{\mathsf {L}}(m,n){[{\mathbf {C}}+ \sigma ^2{\mathbf {I}}]^{-1}} {{\mathbf {C}}_{\mathsf {L}}(m',n')^\top }. \end{aligned}$$
(1b)

The row vector \({\mathbf {C}}_{\mathsf {L}}(m,n)\) contains the prior covariances between the index \((m,n)\) and all training data indices \((m',n')\in \mathsf {L}\), \({\mathbf {C}}\) is the covariance matrix between all pairs \((m,n),(m', n') \in \mathsf {L}\times \mathsf {L}\), and \({\mathbf {I}}\) is the \(L \times L\) identity matrix. The closed form follows directly from the definition of the MV-GP as a scalar GP (Rasmussen and Williams 2005) with appropriately vectorized variables. The computational complexity of applying the GP model scales with the number of observed samples \(L\). Storage of the covariance matrix requires \(\fancyscript{O}(L^2)\) memory, and naïve inference requires \(\fancyscript{O}(L^3)\) computation.
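For concreteness, a minimal sketch of Eqs. (1a) and (1b) for a single query index is given below (assuming NumPy). The names c_star (the covariance vector \({\mathbf {C}}_{\mathsf {L}}(m,n)\)) and c_ss (the prior variance at the query index) are local to this example, and the single \(L \times L\) solve reflects the \(\fancyscript{O}(L^3)\) cost noted above.

```python
# Posterior mean and variance of the MV-GP at one query index, given L observations.
import numpy as np

def gp_posterior(C, y, c_star, c_ss, sigma2):
    K = C + sigma2 * np.eye(C.shape[0])
    alpha = np.linalg.solve(K, y)                       # [C + sigma^2 I]^{-1} y
    mean = c_star @ alpha                               # eq. (1a)
    var = c_ss - c_star @ np.linalg.solve(K, c_star)    # eq. (1b) at the query index
    return mean, var
```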

2.4 Constrained Bayesian inference

Probabilistic inference involves estimating the distribution of latent variables given new information such as observed data and constraints. This is often achieved via Bayes rule. Given the prior distribution of the latent variables, Bayes rule is a simple formula for computing the latent variable distribution conditioned on the observed data. However, Bayes rule may be inadequate when the constraints one seeks to impose on a latent variable distribution are computationally intractable to enforce by careful selection of the prior distribution alone. An alternative approach is to enforce these constraints as part of the inference procedure. While this can be achieved via rejection sampling and related techniques (Gelfand et al. 1992), such methods are computationally intractable for high dimensional variables as a large proportion of the samples will be discarded. Constrained Bayesian inference via variational optimization is a useful alternative in such cases. Constrained Bayesian inference converts the probabilistic inference into an optimization problem, thus allowing the application of standard optimization techniques.

Fig. 1 Plate diagram of the hierarchical matrix-variate Gaussian process model with i.i.d. Gaussian observation noise. \(Z({m,n})\) is the hidden noise-free matrix entry

Let \(z\) represent the latent variables and \(y\) represent the observations. Bayes rule can be used to compute the posterior density \(p(z|y)\) as:

$$\begin{aligned} p(z|y) = \frac{ p(y|z) p(z)}{p(y)} \end{aligned}$$

where the conditional density \(p(y|z)\) is known as the likelihood, \(p(z)\) is the prior density and \(p(y)\) is the evidence. An alternative approach was proposed by Zellner (1988), who showed that the Bayesian posterior can be computed as the solution of the variational optimization problem:

$$\begin{aligned} p(z|y) = \underset{q \in \fancyscript{P}}{\arg \min }\; {\mathrm {KL}\!\left( {q(z)}\Vert {p(z)}\right) } - \mathrm {E}_{ q}\left[ \, \log p(y|z) \, \right] . \end{aligned}$$
(2)

where \(\fancyscript{P}= \{q \, | \, \int _z q(z)dz = 1\}\).

Constrained Bayesian inference (Koyejo and Ghosh 2013a) can be used to enforce additional structure on the posterior distribution. It does so by imposing additional constraints on the variational optimization posed in (2). This paper will focus on expectation constraints applied to feature functions of the latent variables. Given a vector of feature functions \({\varvec{\gamma }}(z)\) and a constraint set \(\mathsf {C}\), let \(\fancyscript{R}_{\mathsf {C}} = \{q \in \fancyscript{P}\, \vert \, \mathrm {E}_{ q}\left[ \, {\varvec{\gamma }}(z) \, \right] \in \mathsf {C}\}\) represent the set of densities that satisfy the constraint \(\mathrm {E}_{ q}\left[ \, {\varvec{\gamma }}(z) \, \right] \in \mathsf {C}\). Constrained Bayesian inference requires solving one of the following equivalent variational optimization problems (Ganchev et al. 2010; Zhu et al. 2012; Koyejo and Ghosh 2013a):

$$\begin{aligned} q_*(z)&= \underset{q \in \fancyscript{R}_{\mathsf {C}}}{\arg \min }\; {\mathrm {KL}\!\left( {q(z)}\Vert {p(z)}\right) } - \mathrm {E}_{ q}\left[ \, \log p(y|z) \, \right] .\end{aligned}$$
(3a)
$$\begin{aligned} q_*(z)&= \underset{q \in \fancyscript{R}_{\mathsf {C}}}{\arg \min }\; {\mathrm {KL}\!\left( {q(z)}\Vert {p(z|y)}\right) }. \end{aligned}$$
(3b)

Thus, the solution is an information projection of the Bayesian posterior distribution onto the constraint set \(\fancyscript{R}_{\mathsf {C}}\). Following Zellner, we call \(q_*\) the postdata density to distinguish it from the unconstrained Bayesian posterior density. Further discussion of constrained Bayesian inference is provided in Appendix 7.
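As a toy illustration of the information projection (a one-dimensional sketch under our own simplifying assumptions, not the algorithm used later in the paper), consider projecting a Gaussian posterior onto the set of densities whose mean lies in an interval. With the linear feature \(\gamma (z) = z\), the postdata density remains Gaussian with the same variance, and its mean is the feasible point closest to the posterior mean.

```python
# Information projection of N(phi, sigma2) onto {q : |E_q[z]| <= eta}.
# The mean constraint does not involve the variance, so the variance is unchanged;
# in one dimension the constrained mean is simply the clipped posterior mean.
import numpy as np

def constrained_postdata_1d(phi, sigma2, eta):
    psi = float(np.clip(phi, -eta, eta))   # closest feasible mean
    return psi, sigma2                     # variance unchanged

print(constrained_postdata_1d(phi=2.5, sigma2=0.4, eta=1.0))   # (1.0, 0.4)
```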

3 Related work

Constrained Bayesian inference is a special case of constrained relative entropy minimization where some of the constraints are generated from observed data (Koyejo and Ghosh 2013). Constrained relative entropy minimization and constrained entropy maximization have been studied in several application domains including natural language processing (Berger et al. 1996) and ecology (Dudík et al. 2007). Applications in the machine learning literature include maximum entropy discrimination (MED) (Jaakkola et al. 1999), and other models inspired by MED have been proposed for combining nonparametric topic models with large margin constraints for document classification (Zhu et al. 2009) and multitask classification (Zhu et al. 2011). Constrained relative entropy models have also been applied to collaborative filtering (Xu et al. 2012) and link prediction (Zhu et al. 2012). Other work using nonparametric priors (Zhu et al. 2009, 2011) has resulted in intractable inference, requiring the application of variational approximations with tractable assumptions made for the independence structure and parametric families of the solution. Our work appears to be the first that uses nonparametric prior distributions without requiring such simplifying assumptions. In addition, we consider constraints on the function space of the Gaussian process, which generalize the evaluation based constraints proposed in prior work, i.e., constraints on the entire mean function as opposed to constraints on the mean of a set of matrix entries.

Factor models such as principal component analysis (PCA) (Bishop 2006) and its variants are popular methods for extracting information from matrix data. The standard PCA model can be extended to handle missing data using a Bayesian approach (Bishop 2006) that marginalizes over the missing data. The Gaussian process latent variable model (GP-LVM) (Lawrence and Hyvärinen 2005) was proposed to extend PCA to model non-linear relationships by replacing the covariance matrix with a non-linear kernel. This kernel approach has been applied to non-linear matrix factorization (Lawrence and Urtasun 2009). The GP-LVM integrates out one of the factors and estimates the other. The rank of the factor model must be pre-specified in such models, and is often fixed via expensive cross-validation. Implementations of Kernel PCA typically capture prior correlations over the rows or the columns, but not both (Footnote 1). Our proposed model is designed to capture prior correlations simultaneously over the rows and columns via the matrix-variate Gaussian process prior. Further, the nuclear norm provides an avenue for automatic (implicit) rank selection.

The most common approach for low rank matrix data modeling in the Gaussian process literature is the hierarchical low rank factor model. In particular, the hierarchical low rank factor Gaussian process (factor GP) has been proposed to capture low rank structure (Yu et al. 2007; Zhu et al. 2009; Zhou et al. 2012). We discuss this approach in some detail as it is used as our main baseline. Here, Gaussian processes are used as the priors for the low dimensional factors. With a fixed model rank \(R\), the generative model for the factor GP is as follows (see Fig. 2):

  1. For each \(r\in \{1\ldots R\}\), draw row functions: \(U^r \sim \fancyscript{G}\fancyscript{P}\left( {0} , \fancyscript{C}_{\mathrm{M}}\right) \). Let \({\mathbf {u}}_m \in {\mathbb {R}}^R\) with entries \(u^r_m = U^r(m)\).

  2. For each \(r\in \{1\ldots R\}\), draw column functions: \(V^r \sim \fancyscript{G}\fancyscript{P}\left( {0} , \fancyscript{C}_{\mathrm{N}}\right) \). Let \({\mathbf {v}}_n \in {\mathbb {R}}^R\) with \(v^r_n = V^r(n)\).

  3. Draw each matrix entry independently: \(y_{m,n}\sim \fancyscript{N}\left( {{{\mathbf {u}}_m^\top }{\mathbf {v}}_n}, \sigma ^2\right) \; \forall \,(m,n)\in \mathsf {L}\).

where \({\mathbf {u}}_m\) is the \(m{^{th}}\) row of \({\mathbf {U}}=[{\mathbf {u}}^1 \ldots {\mathbf {u}}^R]\in {\mathbb {R}}^{M\times R}\), and \({\mathbf {v}}_n\) is the \(n{^{th}}\) row of \({\mathbf {V}}=[{\mathbf {v}}^1 \ldots {\mathbf {v}}^R]\in {\mathbb {R}}^{N\times R}\). The maximum-a-posteriori (MAP) estimates of \({\mathbf {U}}\) and \({\mathbf {V}}\) can be computed as the solution of the following optimization problem:

$$\begin{aligned} \underset{{\mathbf {U}}, {\mathbf {V}}}{\arg \min } \frac{1}{\sigma ^2}\sum _{({m,n})\in \mathsf {L}} (y_{m,n}-{{{\mathbf {u}}_m^\top }{\mathbf {v}}_n})^2 + {\mathrm {tr}}\!\left( {{\mathbf {U}}^\top }{{\mathbf {C}}_{\mathrm{M}}^{-1}}{\mathbf {U}} \right) + {\mathrm {tr}}\!\left( {{\mathbf {V}}^\top }{{\mathbf {C}}_{\mathrm{N}}^{-1}}{\mathbf {V}} \right) \end{aligned}$$
(4)

where \({\mathrm {tr}}\!\left( {\mathbf {X}} \right) \) is the trace of the matrix \({\mathbf {X}}\). Statistically, the factor GP may be interpreted as the sum of rank-one factor matrices. Hence the law of large numbers can be used to show that the distribution of \(Z\) converges to \(\fancyscript{G}\fancyscript{P}\left( 0, \fancyscript{C}_{\mathrm{N}}\otimes \fancyscript{C}_{\mathrm{M}}\right) \) as the rank \(R {\,\rightarrow \,}\infty \) (Yu et al. 2007).
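As a reference point for the comparisons below, a minimal sketch of the MAP objective in Eq. (4) is given here (assuming NumPy). The names U, V, C_M, C_N and obs (a list of observed \((m, n, y)\) triples) are local to this sketch; in practice the objective is minimized with gradient-based methods and is nonconvex in the factors.

```python
# MAP objective of the factor GP baseline, eq. (4).
import numpy as np

def factor_gp_map_objective(U, V, C_M, C_N, obs, sigma2):
    fit = sum((y - U[m] @ V[n]) ** 2 for m, n, y in obs) / sigma2
    reg_U = np.trace(U.T @ np.linalg.solve(C_M, U))   # tr(U^T C_M^{-1} U)
    reg_V = np.trace(V.T @ np.linalg.solve(C_N, V))   # tr(V^T C_N^{-1} V)
    return fit + reg_U + reg_V
```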

Fig. 2 Hierarchical low rank factor Gaussian process

Despite its success, the factor Gaussian process approach has some deficiencies when applied for probabilistic inference. First, posterior distributions of interest are generally intractable. Specifically, neither the joint posterior distribution of \(\{{\mathbf {U}}, {\mathbf {V}}\}\) nor the distribution of \({\mathbf {Z}}={\mathbf {U}}{{\mathbf {V}}^\top }\) is Gaussian, and their posterior distributions are quite challenging to characterize. As a result, the posterior mean is challenging to compute without sampling, and practitioners often apply the MAP approach. Second order statistics such as the posterior covariance are also computationally intractable. Instead, various approximate inference techniques have been applied: a Laplace approximation was proposed by Yu et al. (2007), and Zhu et al. (2009) utilized sampling techniques. Further, in most cases, the rank must be fixed a priori. More recently, Bayesian models for matrix factorization that include a nonparametric prior for the number of latent factors have been proposed based on the Indian buffet process (Zhu 2012; Xu et al. 2012) and the multiplicative gamma process (Zhang and Carin 2012). Inference with these models is generally intractable, and requires approximations or sampling, which may result in slow or inaccurate inference for large datasets. Further, many of these approaches have focused on in-matrix prediction, and have not been applied to out-of-matrix predictions.

Other related literature includes Li and Yeung (2009), where the authors proposed a regularized matrix factorization model exploiting relational information. The proposed model is identical to the Gaussian process factor model (Zhou et al. 2012; see Footnote 2) with an appropriate choice of kernel. Li et al. (2009b) proposed an approach for learning a kernel based on network links that can then be applied to predictive modeling tasks. Li et al. (2009a) proposed a Bayesian probabilistic PCA model for full matrix prediction exploiting relational data information by constructing a covariance matrix that accounted for the relational data. An alternative approach focusing on learning additive Gaussian process kernels was proposed by Xu et al. (2009), and an approach for nonparametric relational data modeling using co-clustering (instead of matrix factorization) was proposed by Xu et al. (2006). Several works have focused on the matrix prediction task alone without the use of side information. For example, Sutskever et al. (2009) utilized the clustering of factors to model the latent relationships as an alternative to designing covariance matrices.

4 Proposed approach: the nuclear norm constrained MV-GP

We propose nuclear norm constrained Bayesian inference for modeling low rank transposable data as an alternative to the low rank factor approach. The proposed approach constrains the model by directly regularizing the rank of the expected prediction via a constraint on its nuclear norm. Optimization with the rank constraint is computationally intractable, and the popular factor representation results in a nonconvex optimization problem that is susceptible to local minima (Dudik et al. 2012). The nuclear norm constraint has been proposed as a tractable surrogate regularization for the low rank constraint, which is in turn motivated by the parsimony of the low rank representation and the superior empirical performance of low rank models in many application domains. The nuclear norm of a matrix variate function is given by the sum of its singular values (Abernethy et al. 2009), and is the tightest convex lower bound (the convex envelope) of the rank. Under certain conditions, it can be shown that nuclear norm regularization recovers the true low rank matrix (Pong et al. 2010). Further details on the nuclear norm of matrix functions are provided in Appendix 8.

Without loss of generality, we assume a set of rows \(\mathsf {M}\) and a set of columns \(\mathsf {N}\) of interest so that \(\mathsf {L}\subset \mathsf {M}\times \mathsf {N}\). Let \({\mathbf {Z}}\in {\mathbb {R}}^{M \times N}\) be the matrix of hidden variables, with \({\mathbf {z}}= {\text {vec}({\mathbf {Z}})} \in {\mathbb {R}}^{MN}\). Given any finite index set of observations at indices \(l \in \mathsf {L}\), the finite dimensional prior distribution is a Gaussian distribution given by \(\fancyscript{N}\left( 0, {\mathbf {C}}\right) \) where \({\mathbf {C}}\in {\mathbb {R}}^{MN \times MN}\). We seek a postdata density \(q(Z|\fancyscript{D})\) that optimizes (3a) subject to the constraint \({\left| \left| \left| \mathrm {E}_{ q}\left[ \, Z \, \right] \right| \right| \right| }_1 \le \eta \), where \({\left| \left| \left| \cdot \right| \right| \right| }_1\) is the nuclear norm. For any finite index set, the unconstrained Bayesian posterior distribution is Gaussian (Sect. 2.3). Following the steps of Sect. 2.4 (see also Appendix 7), it is straightforward to show that since the feature function \(\gamma (Z) = Z\) is linear, the constrained Bayes solution must also take a Gaussian form. All that remains is to solve for the mean and covariance. We may apply either the prior form (3a) or the equivalent posterior form (3b) for constrained inference. We discuss both approaches for illustrative purposes.

Let the Bayesian posterior be given by \(\fancyscript{N}\left( {\varvec{\phi }}, {\varvec{\varSigma }}\right) \) as described in (1) where \( {\varvec{\phi }}= {\text {vec}({\varvec{\varPhi }})} \in {\mathbb {R}}^{MN}\), and \({\varvec{\varSigma }}\in {\mathbb {R}}^{MN \times MN}\). Let the postdata distribution be given by \(\fancyscript{N}\left( {\varvec{\psi }}, {\mathbf {S}}\right) \), where \( {\varvec{\psi }}= {\text {vec}({\varvec{\varPsi }})} \in {\mathbb {R}}^{MN}\), and \({\mathbf {S}}\in {\mathbb {R}}^{MN \times MN}\). Using the posterior form (3b), the postdata distribution is found by minimizing the KL divergence between the Gaussian distribution \(\fancyscript{N}\left( {\varvec{\psi }}, {\mathbf {S}}\right) \) and the Bayesian posterior distribution \(\fancyscript{N}\left( {\varvec{\phi }}, {\varvec{\varSigma }}\right) \). This is given by:

$$\begin{aligned} \underset{{\varvec{\psi }}, {\mathbf {S}}}{\min } \; {\mathrm {tr}}\!\left( {{\varvec{\varSigma }}^{-1}}{\mathbf {S}} \right) + {({\varvec{\phi }}- {\varvec{\psi }})^\top }{{\varvec{\varSigma }}^{-1}}({\varvec{\phi }}-{\varvec{\psi }}) - \log |{\mathbf {S}}| + \log |{\varvec{\varSigma }}| \;\; \text {s.t.}\, {\left| \left| \left| \mathrm {E}_{ q}\left[ \, Z \, \right] \right| \right| \right| }_1 \le \eta \end{aligned}$$

where \({\varvec{\psi }}= {\text {vec}({\varvec{\varPsi }})}\). The optimization decouples between the mean term \({\varvec{\psi }}\) and the covariance term \({\mathbf {S}}\) as:

$$\begin{aligned}&\underset{{\varvec{\psi }}}{\min } \; {({\varvec{\phi }}- {\varvec{\psi }})^\top }{{\varvec{\varSigma }}^{-1}}({\varvec{\phi }}-{\varvec{\psi }}) \;\;\text {s.t.}\, {\left| \left| \left| \mathrm {E}_{ q}\left[ \, Z \, \right] \right| \right| \right| }_1 \le \eta \end{aligned}$$
(5a)
$$\begin{aligned}&\underset{{\mathbf {S}}}{\min } \; {\mathrm {tr}}\!\left( {{\varvec{\varSigma }}^{-1}}{\mathbf {S}} \right) - \log |{\mathbf {S}}| + \log |{\varvec{\varSigma }}| \end{aligned}$$
(5b)

The minimum in terms of the covariance is achieved for \({\mathbf {S}}= {\varvec{\varSigma }}\) and the mean optimization is given by the solution of a constrained quadratic optimization.

Direct optimization of (5a) requires the computation, storage and inversion of the covariance matrix \({\varvec{\varSigma }}\). This may become computationally infeasible for high dimensional data. In such situations, estimation of the postdata mean using the prior form (3a) is a more computationally feasible approach. The result is the optimization problem:

$$\begin{aligned} \mathfrak {L}({\varvec{\varPsi }}, {\mathbf {S}}) = \underset{{\varvec{\psi }}, {\mathbf {S}}}{\min } \bigg [ {\mathbb {E}}_{{\mathbf {Z}}}[\ln q({\mathbf {Z}})] - {\mathbb {E}}_{{\mathbf {Z}}}[\ln p({\mathbf {y}},{\mathbf {Z}})] \; \text {s.t.}\, {\left| \left| \left| \mathrm {E}_{ q}\left[ \, Z \, \right] \right| \right| \right| }_1 \le \eta \bigg ]. \end{aligned}$$
(6)

Let \({\mathbf {P}}\in {\mathbb {R}}^{L \times MN}\) be a selection matrix such that \({\mathbf {S}}_{L} = {\mathbf {P}}{\mathbf {S}}{{\mathbf {P}}^\top }\) is the postdata covariance matrix of the subset of observed entries \(l \in \mathsf {L}\), and \({\mathbf {C}}_{L} = {\mathbf {P}}{\mathbf {C}}{{\mathbf {P}}^\top }\) is the prior covariance of the corresponding subset of entries. Evaluating expectations, the cost function (6) results in the following inference cost function (omitting terms independent of \({\varvec{\psi }}\) and \({\mathbf {S}}\)):

$$\begin{aligned} \mathfrak {L}({\varvec{\varPsi }}, {\mathbf {S}}) = \underset{\{ {\varvec{\psi }}\, |\, {\left| \left| \left| \mathrm {E}_{ q}\left[ \, Z \, \right] \right| \right| \right| }_1 \le \eta \}, {\mathbf {S}}}{\min } \; \left[ \begin{array}{c} \frac{1}{2\sigma ^2} \sum _{{m,n}\in \mathsf {L}}(y_{m,n}-\psi _{m,n})^2 +\frac{1}{2}{{\varvec{\psi }}^\top }{{\mathbf {C}}^{-1}}{\varvec{\psi }}\\ - \frac{1}{2}\ln |{\mathbf {S}}| +\frac{1}{2\sigma ^2} {\mathrm {tr}}\!\left( {\mathbf {S}}_{L} \right) +\frac{1}{2}{\mathrm {tr}}\!\left( {{\mathbf {C}}^{-1}}{\mathbf {S}} \right) \end{array} \right] . \end{aligned}$$

First, we compute gradients with respect to \({\mathbf {S}}\). After setting the gradients to zero, we compute:

$$\begin{aligned} {\mathbf {S}}_* = {\left( { {{\mathbf {C}}^{-1}} + \frac{1}{\sigma ^2} {{\mathbf {P}}^\top }{\mathbf {P}}}\right) ^{-1}} = {\mathbf {C}}- {\mathbf {C}}{{\mathbf {P}}^\top }{\left( {{\mathbf {C}}_L + \sigma ^2 {\mathbf {I}}_L }\right) ^{-1}} {\mathbf {P}}{\mathbf {C}}\end{aligned}$$
(7)

The second equality is a consequence of the matrix inversion lemma. We note that this is exactly the same result as that obtained using the posterior form (5b). Next, collecting the terms involving the mean results in the optimization problem:

$$\begin{aligned} {\varvec{\psi }}_*= \underset{{\varvec{\psi }}}{\arg \min } \;\frac{1}{2\sigma ^2} \sum _{m,n \in \mathsf {L}}(y_{m,n}-\psi _{m,n})^2 +\frac{1}{2}{{\varvec{\psi }}^\top }{{\mathbf {C}}^{-1}}{\varvec{\psi }}\quad \text {s.t.}\; {\left| \left| \left| \mathrm {E}_{ q}\left[ \, Z \, \right] \right| \right| \right| }_1 \le \eta \end{aligned}$$
(8)

This is a convex regularized least squares problem with a convex constraint set. Hence, (8) is convex, and \({\varvec{\psi }}_*\) is unique. Using the Kronecker identity, we can re-write the cost function in parameter matrix form. We can also replace the nuclear norm constraint with the equivalent regularizer weighted by \(\lambda \). This leads to the equivalent optimization problem:

$$\begin{aligned} {\varvec{\varPsi }}_* =\underset{\varPsi }{\arg \min } \; \frac{1}{2\sigma ^2} \sum _{m,n \in \mathsf {L}}(y_{m,n}-\psi _{m,n})^2 +\frac{1}{2}{\mathrm {tr}}\!\left( {{\varvec{\varPsi }}^\top }{{\mathbf {C}}_{\mathrm{M}}^{-1}}{\varvec{\varPsi }}{{\mathbf {C}}_{\mathrm{N}}^{-1}} \right) +\lambda {\left| \left| \left| \mathrm {E}_{ q}\left[ \, Z \, \right] \right| \right| \right| }_1. \end{aligned}$$
(9)

The final step is to define the term \({\left| \left| \left| \mathrm {E}_{ q}\left[ \, Z \, \right] \right| \right| \right| }_1\). We note that since the prior distribution is a Gaussian process, a valid postdata distribution must extend to arbitrary index sets. Hence the postdata mean is a matrix-variate function. The parametric representation of the postdata mean can be defined using the posterior distribution of the Gaussian process outlined by Csató (2002) and applying the representation theorem (18). Thus, we recover the parametric form of the mean function as \({\varvec{\varPsi }}= {\mathbf {C}}_{\mathrm{M}}{\varvec{A}}{\mathbf {C}}_{\mathrm{N}}\) where \({\varvec{A}}\in {\mathbb {R}}^{M \times N}\). We may now solve for \({\varvec{A}}\) directly:

$$\begin{aligned} {\varvec{A}}_* = \underset{{\varvec{A}}}{\arg \min } \; \frac{1}{2\sigma ^2} \sum _{m,n \in \mathsf {L}}\left( y_{m,n}-({\mathbf {C}}_{\mathrm{M}}{\varvec{A}}{\mathbf {C}}_{\mathrm{N}})_{m,n}\right) ^2 +\frac{1}{2}{\mathrm {tr}}\!\left( {{\varvec{A}}^\top }{\mathbf {C}}_{\mathrm{M}}{\varvec{A}}{\mathbf {C}}_{\mathrm{N}} \right) + \lambda {\left| \left| \left| \psi _{\varvec{A}} \right| \right| \right| }_1 \end{aligned}$$
(10)

where \(\psi _{\varvec{A}}\) is the mean function corresponding to the parameter \({\varvec{A}}\) (see (18)), and \({\left| \left| \left| \psi _{\varvec{A}} \right| \right| \right| }_1\) represents the nuclear norm in the Hilbert space \(\fancyscript{H}_{\fancyscript{C}}\) (defined in Appendix 8). We also note that the optimization problem (10) is strongly convex.
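The postdata covariance in Eq. (7) can be computed with a single \(L \times L\) solve via the matrix-inversion-lemma form. The sketch below (assuming NumPy, with C the \(MN \times MN\) prior covariance and P the \(L \times MN\) selection matrix as defined above) is illustrative; for large matrices one would avoid forming C explicitly.

```python
# Postdata covariance S_* = C - C P^T (C_L + sigma^2 I_L)^{-1} P C, eq. (7).
import numpy as np

def postdata_covariance(C, P, sigma2):
    C_L = P @ C @ P.T                              # prior covariance of observed entries
    K = C_L + sigma2 * np.eye(C_L.shape[0])
    CP = C @ P.T
    return C - CP @ np.linalg.solve(K, CP.T)
```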

We now seek to extend the solution from the finite observed index set to the nonparametric domain. Our approach will rely on Kolmogorov’s Extension theorem (Bauer 1996) which provides a mechanism for describing infinite dimensional random processes via their finite dimensional marginals (Orbanz and Teh 2010). We will apply the theorem to extend the solution estimated by (6) using a finite index to a corresponding nonparametric Gaussian process. This will be achieved by showing that the solution can be extended to an arbitrary index set with a consistent functional form for the mean and the covariance.

Theorem 1

The postdata distribution \(\fancyscript{N}\left( {\varvec{\psi }}, {\mathbf {S}}\right) \) is a finite dimensional representation of the Gaussian process \(\fancyscript{G}\fancyscript{P}\left( \psi , S\right) \) sampled at indices \(\mathsf {L}\) where the mean function \(\psi \) is given by (10) and the covariance function \(S\) is given by (1b).

Sketch of proof:

The requirements of Kolmogorov’s extension theorem can be reduced to a proof that for a fixed training set \(\fancyscript{D}\), the postdata distribution of the superset \((\mathsf {M}\times \mathsf {N})\cup (m',n')\) has a consistent function representation (Footnote 3). The mean and covariance of the postdata density are decoupled in the optimization and the postdata covariance function can be computed in closed form. Thus, for the covariance, this follows trivially from the functional form of (1b). The functional form of the mean follows from the finite representation (18) that solves the optimization problem (10). Note that the solution does not change with the addition of indices \(l'=(m',n') \notin \mathsf {L}\) without corresponding observations \(y_{l'}\). Uniqueness of the solution follows from the strong convexity of (10). We refer the reader to the dissertation (Koyejo 2013) for further details. \(\square \)

4.1 Alternative representation of the nuclear norm constrained inference

The mean function optimization (10) may also be represented in terms of matrix parameters that are amenable to direct optimization. With the index set fixed, compute bases \({\mathbf {G}}_{\mathrm{M}}\in {\mathbb {R}}^{M \times D_{\mathrm{M}}}\) and \({\mathbf {G}}_{\mathrm{N}}\in {\mathbb {R}}^{N \times D_{\mathrm{N}}}\) such that \({\mathbf {C}}_{\mathrm{M}}= {\mathbf {G}}_{\mathrm{M}}{{\mathbf {G}}_{\mathrm{M}}^\top }\) and \({\mathbf {C}}_{\mathrm{N}}= {\mathbf {G}}_{\mathrm{N}}{{\mathbf {G}}_{\mathrm{N}}^\top }\). The mean function can be re-parameterized as \(\psi (m,n) = {\mathbf {G}}_{\mathrm{M}}(m){\mathbf {B}}{{\mathbf {G}}_{\mathrm{N}}(n)^\top }\), where \({\mathbf {B}}\in {\mathbb {R}}^{D_{\mathrm{M}}\times D_{\mathrm{N}}}\). The nuclear norm of \(\psi \) can now be computed directly as the nuclear norm of the parameter matrix (Abernethy et al. 2009, Theorem 3). The resulting optimization problem is:

$$\begin{aligned} {\mathbf {B}}_*=\underset{ {\mathbf {B}}}{\arg \min } \; \frac{1}{2\sigma ^2} \sum _{{m,n}\in \mathsf {L}}\left( y_{m,n}-({\mathbf {G}}_{\mathrm{M}}{\mathbf {B}}{{\mathbf {G}}_{\mathrm{N}}^\top })_{m,n}\right) ^2 +\frac{1}{2}{\left| \left| \left| {\mathbf {B}} \right| \right| \right| }_2^2+ \lambda {\left| \left| \left| {\mathbf {B}} \right| \right| \right| }_1. \end{aligned}$$
(11)

where \({\mathbf {B}}\) is the estimated parameter matrix, and \({\left| \left| \left| \cdot \right| \right| \right| }_2^2\) and \({\left| \left| \left| \cdot \right| \right| \right| }_1\) represent the matrix squared Frobenius norm and the matrix nuclear norm respectively. In this form, the mean function can be estimated directly using standard solvers for large scale nuclear norm constrained optimization (e.g. Dudik et al. 2012; Laue 2012).
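The following is a minimal proximal-gradient sketch (assuming NumPy) of how a problem of the form (11) can be solved: a gradient step on the smooth part (the squared loss plus the Frobenius term) is followed by singular value soft-thresholding, the proximal operator of the nuclear norm. The step size and iteration count are illustrative assumptions rather than tuned values, and this is not the large scale solver used in the experiments.

```python
# Proximal gradient for eq. (11): squared loss + (1/2)||B||_F^2 + lambda * ||B||_*.
import numpy as np

def svt(B, tau):
    """Singular value soft-thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def solve_B(G_M, G_N, obs, sigma2, lam, step=1e-3, n_iter=500):
    B = np.zeros((G_M.shape[1], G_N.shape[1]))
    for _ in range(n_iter):
        grad = B.copy()                                   # gradient of (1/2)||B||_F^2
        for m, n, y in obs:                               # gradient of the squared loss
            r = G_M[m] @ B @ G_N[n] - y
            grad += (r / sigma2) * np.outer(G_M[m], G_N[n])
        B = svt(B - step * grad, step * lam)              # nuclear norm proximal step
    return B
```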

To improve scalability, large scale nuclear norm regularized solvers generally represent the parameter matrix in low rank form, avoiding storage of the full matrix. Further, the rank of the parameter matrix is automatically estimated during the optimization. We provide a short summary of the approaches in Dudik et al. (2012) and Laue (2012); interested readers are referred to the relevant papers for further details. The parameter matrix is estimated starting from a rank one solution, and the rank is increased until additional factors do not improve the cost any further. The first step consists of determining a good descent direction, and the second step consists of optimizing the factors given the initial direction. In the first step, a descent direction is determined by computing the singular vectors associated with the maximum singular value of the sparse gradient matrix. This step does not need to be accurate and is usually achieved using a few iterations of the power method. The factor optimization in the second step is analogous to the standard matrix factorization optimization, so the large scale nuclear norm solvers mainly differ from standard matrix factorization in the determination of an initial descent direction (matrix factorization is generally randomly initialized), and in the automatic determination of the number of required factors, i.e., the rank. Thus the computational requirements of large scale nuclear norm regularized regression are comparable to standard matrix factorization methods.
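The first step described above (finding a descent direction from the top singular vectors of the sparse gradient) can be sketched with a few power iterations, as below (assuming NumPy). Here grad_entries holds \((m, n, g)\) triples for the nonzero gradient entries; the iteration count is an illustrative assumption.

```python
# Approximate top singular vector pair of a sparse M x N gradient matrix.
import numpy as np

def top_singular_vectors(grad_entries, M, N, n_iter=10, seed=0):
    v = np.random.default_rng(seed).standard_normal(N)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = np.zeros(M)
        for m, n, g in grad_entries:      # u <- G v, using only nonzero entries
            u[m] += g * v[n]
        u /= np.linalg.norm(u)
        v = np.zeros(N)
        for m, n, g in grad_entries:      # v <- G^T u
            v[n] += g * u[m]
        v /= np.linalg.norm(v)
    return u, v
```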

5 Experiments

We completed experiments with transposable datasets from the disease-gene association domain and the recommender system domain.

Prior covariances: All the datasets studied consist of transposable data matrices with corresponding row and/or column graphs. We experimented with the identity prior covariance \({\mathbf {C}}= {\mathbf {I}}\), where \({\mathbf {I}}\) is the identity matrix, and the diffusion prior covariance (Smola and Kondor 2003) given as \({\mathbf {C}}= \exp {(-a{\mathbf {L}})} + b{\mathbf {I}}\), where \({\mathbf {L}}\) is the normalized graph Laplacian matrix. Let \({\mathbf {A}}\) be the adjacency matrix for the graph and \({\mathbf {D}}\) be a diagonal matrix with entries \({\mathbf {D}}_{i,i}= ({\mathbf {A}}\mathbf {1})_i\). The normalized Laplacian matrix is computed as \({\mathbf {L}}= {\mathbf {I}}- {\mathbf {D}}^{-\frac{1}{2}}{\mathbf {A}}{\mathbf {D}}^{-\frac{1}{2}}\). We set \(a=b=1\). No further optimization was performed, and more detailed experimental validation of covariance parameter selection is left for future work.
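The diffusion prior covariance described above can be computed directly from the graph adjacency matrix; a minimal sketch assuming NumPy/SciPy (and a graph with no isolated nodes) follows.

```python
# Diffusion prior covariance C = expm(-a L) + b I with the normalized Laplacian L.
import numpy as np
from scipy.linalg import expm

def diffusion_covariance(A, a=1.0, b=1.0):
    d = A.sum(axis=1)                                     # node degrees (assumed > 0)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt  # normalized graph Laplacian
    return expm(-a * L) + b * np.eye(A.shape[0])
```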

Models: We present results for the proposed constrained MV-GP approach (Con. MV-GP), and the special cases using only the nuclear norm (Trace GP; see Footnote 4) and using only the Hilbert norm (MV-GP), i.e., the standard MV-GP regression. To the best of our knowledge, the special case of Trace GP is a novel contribution. As baselines, we implemented kernelized probabilistic matrix factorization (KPMF) (Zhou et al. 2012) and probabilistic matrix factorization (PMF) (Mnih and Salakhutdinov 2007) using rank 5 and rank 20 factors. PMF is identical to KPMF using an identity covariance. KPMF has been shown to outperform PMF and other baseline models in various domains. We note that the rank constraint ensures that all of the proposed models except for MV-GP can be used for in-matrix predictions even with the identity prior covariance. Out-of-matrix predictions require the use of other covariance matrices.

We implemented Con. MV-GP using the representation outlined in Sect. 4.1. The Cholesky decomposition of the covariance matrices was used as the basis representation. The model hyperparameter \(\lambda \) was selected from \(5\) values logarithmically spaced between \(10^{-3}\) and \(10^{3}\), and the noise hyperparameter \(\sigma ^2\) was selected from 20 values logarithmically spaced between \(10^{-3}\) and \(10^{3}\) for all the models. We experimented with learning the data noise variance term \(\sigma ^2\), but found the results worse than using parameter selection. In particular, the estimated noise variance often approached zero, indicating overfitting. A possible solution we plan to explore is to introduce a prior distribution for \(\sigma ^2\) (see, e.g., Bayesian linear regression in Bishop (2006, Chapter 3.3)), which may help to regularize the noise term away from zero.

The standard MV-GP is often implemented as a scalar GP with the row and column prior covariance matrices multiplied as shown in (1). We found this “direct” approach computationally intractable as the memory requirements scale quadratically with the size of the observed transposable data matrix. Instead, we implemented the MV-GP in matrix form as a special case of (11) with \(\lambda = 0\). This allowed us to scale the model to the larger datasets at the expense of more computation. The nuclear norm regularized optimization in (11) was solved using the large scale approach of Laue (2012). All numerical optimization was implemented using the limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm.

Experiment design and cross validation: We performed two kinds of experiments. In the rest of this discussion, "rows" will refer to either the diseases (disease-gene prediction) or the users (recommender system). The known rows experiment was designed to evaluate the performance of the model for entries selected randomly over the observed values in the matrix. In contrast, the new rows experiment was designed to evaluate the generalization ability of the model for new rows not observed in the training set. We partitioned each dataset into five-fold cross-validation sets. The model was trained on \(4\) of the \(5\) sets and tested on the held-out set. The results presented are the averaged five-fold cross-validation performance. For the known rows experiments, the cross-validation sets were randomly selected over the matrix. For the new rows experiments, the cross-validation was performed row-wise, i.e., we selected training set rows and test set rows. Note that the identity prior covariance cannot be used for new row prediction, but due to the low rank constraint, it can be used for known row prediction.
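The two cross-validation designs can be sketched as follows (assuming NumPy). Here obs is an array of observed (row, column) index pairs; details such as the shuffling seed are illustrative assumptions.

```python
# Five-fold splits: entry-wise ("known rows") versus row-wise ("new rows").
import numpy as np

def known_rows_folds(obs, n_folds=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(obs))
    return np.array_split(idx, n_folds)                       # folds of entry indices

def new_rows_folds(obs, n_folds=5, seed=0):
    rows = np.random.default_rng(seed).permutation(np.unique(obs[:, 0]))
    row_folds = np.array_split(rows, n_folds)
    return [np.where(np.isin(obs[:, 0], rf))[0] for rf in row_folds]
```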

5.1 Disease-gene prediction

Genes are segments of DNA that determine specific characteristics; over 20,000 genes have been identified in humans, and these genes interact to regulate various functions in the body. Researchers have identified thousands of diseases, including various cancers and respiratory diseases such as asthma (NCBI 1998), caused by mutations in these genes. Genetic association studies (McCarthy et al. 2008) are the standard approach for discovering disease-causing genes. However, these studies are often tedious and expensive to conduct. Hence, computational methods that can reduce the search space by predicting the list of candidate genes associated with a given disease are of significant scientific interest. The disease-gene prediction task has been the subject of a significant amount of study in recent years (Vanunu et al. 2010; Li and Toh 2010; Mordelet and Vert 2011; Singh-Blom et al. 2013). The task is challenging because all the observed responses correspond to known associations, and there are no reliable negative examples. Disease-gene association shares the binary matrix representation of the one class (also known as implicit feedback) matrix prediction problem studied in the collaborative filtering literature (Pan et al. 2008; Hu et al. 2008).

Additional baseline: In addition to the matrix factorization baseline models, we compared our proposed approach to ProDiGe (Mordelet and Vert 2011), a state-of-the-art approach that has been shown to be superior to previous top-performing approaches, including distance-based learning methods like Endeavour (Aerts et al. 2006) and label propagation methods like PRINCE (Vanunu et al. 2010). ProDiGe estimates the prioritization function using a multitask support vector machine (SVM) trained with the gene prior covariance and disease prior covariance as kernels. Of the models implemented, ProDiGe is most similar to the MV-GP. In fact, MV-GP and ProDiGe mainly differ in their loss functions (squared loss and hinge loss respectively). The SVM regularization parameter for ProDiGe was selected from \(\{10^{-3}, 10^{-2},\ldots , 10^3\}\). We also note that PMF represents the matrix factorization baseline often applied to similar implicit feedback datasets in the recommendation system literature (Pan et al. 2008).

Sampling “negative” entries: Following Mordelet and Vert (2011), we sampled the unknown entries as “negative” observations randomly over the disease-gene association matrix. We sampled \(5\) different negatively labeled item sets. All models were trained with the positive set combined with one of the negative labeled sets. The model scores were computed by averaging the scores over the 5 trained models. All models were trained using the same samples.
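A sketch of this sampling scheme is given below (assuming NumPy). The choice of drawing each negative set to match the size of the positive set, as well as the seed, are assumptions made for illustration.

```python
# Sample several sets of "negative" entries uniformly from the unobserved cells.
import numpy as np

def sample_negative_sets(pos, M, N, n_sets=5, seed=0):
    rng = np.random.default_rng(seed)
    pos_set, negatives = set(map(tuple, pos)), []
    for _ in range(n_sets):
        neg = set()
        while len(neg) < len(pos):
            m, n = int(rng.integers(M)), int(rng.integers(N))
            if (m, n) not in pos_set:
                neg.add((m, n))
        negatives.append(np.array(sorted(neg)))
    return negatives
```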

Metrics: Experimental validation of disease-gene associations in a laboratory can be time consuming and costly, so only a small set of the top ranked predictions are of practical interest. Hence, we focus on metrics that capture the ranking behavior of the model at the top of the ranked list. All the ranking metrics were computed on the test set after removing all genes that had been observed in the training set. We computed precision (\(\hbox {P}_{@k}\)) and recall (\(\hbox {R}_{@k}\)) where \(k = 1,\ldots ,20\). Let \({\mathbf {g}}_l\) denote the label of the gene at rank \(l\) when genes are sorted by the predicted scores of the trained regression model, and let \(G_m = \sum _{l} {\mathsf {1}}_{\left[ {{\mathbf {g}}_l=1} \right] }\) be the total number of relevant genes for disease \(m\) in the test data after removing relevant genes observed in the training data. The precision at \(k\) computes the fraction of relevant genes retrieved out of all retrieved genes at position \(k\). The recall at \(k\) computes the fraction of relevant genes retrieved out of all relevant genes that can be retrieved with a list of length \(k\). These are computed as:

$$\begin{aligned} \hbox {P}_{@k} = \frac{\sum \nolimits _{l=1}^k {\mathsf {1}}_{\left[ {{\mathbf {g}}_l=1} \right] } }{k}, \quad \hbox {R}_{@k} = \frac{\sum \nolimits _{l=1}^k {\mathsf {1}}_{\left[ {{\mathbf {g}}_l=1} \right] } }{G_m}. \end{aligned}$$

All metrics were computed per disease and then averaged over all the diseases in the test set. Model selection was performed separately for each metric. Higher values reflect better performance for the \(\hbox {P}_{@k}\) and \(\hbox {R}_{@k}\) metrics and their maximum value is \(1.0\).
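The per-disease metrics above reduce to a few lines; the following minimal sketch (assuming NumPy) takes the predicted scores and binary test labels of the candidate genes for one disease, with training genes already removed.

```python
# Precision and recall at k for a single disease.
import numpy as np

def precision_recall_at_k(scores, labels, k):
    order = np.argsort(-scores)               # rank genes by predicted score
    hits = labels[order][:k].sum()            # relevant genes in the top k
    return hits / k, hits / labels.sum()      # (P@k, R@k)
```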

Datasets: We trained and evaluated our models using two sets of gene-disease association data curated from the literature. The first, which we call the OMIM data set, is based on the Online Mendelian Inheritance in Man (OMIM) database and is representative of the candidate gene prediction task for monogenic or near monogenic diseases, i.e., diseases caused by only one or at most a few genes. The data matrix contains a total of \(M = \text {3,210}\) diseases, \(N = \text {13,614}\) genes, and \(T = \text {3,636}\) known associations (data density of 0.0083 %). We note that the extreme sparsity of this data set makes the prediction problem extremely difficult. The second dataset, which we call the Medline data set, is a much larger data set and is representative of predicting candidate genes for both monogenic as well as polygenic diseases, i.e., diseases caused by the interactions of tens or even hundreds of genes. The set of genes in this data set is defined using the NCBI ENTREZ Gene database (Maglott et al. 2011), and the set of diseases is defined using the “Disease” branch of the NIH Medical Subject Headings (MeSH) ontology (National Library of Medicine 2012). We extracted co-citations of these genes and diseases from the PubMed/Medline database (National Library of Medicine 2012) to identify positive gene-disease associations. The resulting data set contains a total of \(M = \text {4,496}\) diseases, \(N = \text {21,243}\) genes, and \(T = \text {250,190}\) known associations (data density of 0.36 %).

Information about biological interactions among genes and known relationships among diseases was used to improve the accuracy of our model, since similar diseases very often have similar genetic causes. We derive gene networks from the HumanNet database (Lee et al. 2011), a genome-wide functional network of human genes constructed using multiple lines of evidence, including gene co-expression, protein-protein interaction data, and networks from other species. For both the OMIM and Medline data sets, our gene-gene interaction network contains a total of \(\text {433,224}\) links. Our disease network is derived from the term hierarchy established in the 2011 release of the MeSH ontology. The disease network for the Medline data set contains a total of \(\text {13,922}\) links. However, because we do not have a direct mapping of OMIM diseases to MeSH terms, we do not use a disease network for the OMIM data set. As a result, we are unable to test our model’s ability to produce predictions for “new” diseases, i.e., diseases with no associated genes in the training set, on the OMIM data set.

The OMIM dataset contains an average of \(1.2\) test genes (positive items) per disease, and the model is required to rank more than \(13,000\) genes per disease. Hence, the gene prediction task is particularly challenging. This difficulty is reflected in the low precision values observed in Table 1 and Fig. 3a. Despite this extreme sparsity, we found that the proposed approaches (Con. MV-GP and Trace GP) performed as well as or better than the matrix factorization baselines (KPMF, PMF), and significantly outperformed the domain specific baseline (ProDiGe). In fact, both full rank models (MV-GP and ProDiGe) performed poorly, suggesting the importance of the low rank / nuclear norm constraint. The results in Fig. 3a, b further highlight the performance of the proposed models at the very top of the list.

Table 1 OMIM disease-gene dataset
Fig. 3 Disease-gene prediction. Precision (left) and Recall (right) \(@k=1, 2, \ldots , 20\). (I): Identity prior covariance, (D): Diffusion prior covariance. Low precision values in OMIM are due to the high class imbalance in the test data (avg of 1.2 genes per disease). The identity prior covariance does not generalize to new diseases. Constrained MV-GP outperforms ProDiGe (domain specific baseline), KPMF and PMF. ProDiGe was unable to scale to the full curated dataset (see text). a OMIM precision\(@k\), b OMIM recall\(@k\), c Medline (known diseases) precision\(@k\), d Medline (known diseases) recall\(@k\), e Medline (new diseases) precision\(@k\), f Medline (new diseases) recall\(@k\)

We were unable to run ProDiGe on the Medline dataset due to computational issues. In particular, the implementation of ProDiGe requires the full kernel matrix as an input. The memory required to store the full kernel is quadratic in the transposable data size. We did not pursue an alternative implementation with reduced memory requirements as experiments with OMIM and initial experiments with subsampled data indicated inferior performance. The Medline dataset contained an average of \(59.2\) positive items per disease. Correspondingly, the tested models achieved a higher precision than in the OMIM dataset. Our experimental results (Table 2) show that the proposed models (Con. MV-GP, Trace GP) significantly outperformed the matrix factorization baselines (PMF, KPMF) on the known diseases, and performed at least as well as KPMF on the new diseases. The results in Fig. 3c, d show that the proposed models outperform the baselines for known disease prediction at all levels of precision and recall we measured. The results for new disease prediction in Fig. 3e, f show similar performance for both approaches on the new diseases.

Table 2 Medline disease-gene dataset

In summary, the presented results suggest that the low rank constraint is useful for describing the structure of disease-gene associations. We also found that in all the datasets, the constrained Bayesian models (Con. MV-GP and Trace GP) performed as well as or better than the Bayesian factor models (KPMF and PMF) and the unconstrained Bayesian model (MV-GP). This shows the utility of the constrained Bayesian inference approach as compared to the Bayesian factor model approach. Constrained MV-GP with the identity kernel was the best single performing method, matching results in the literature suggesting that the network information is not always helpful for in-matrix predictions (Koyejo and Ghosh 2011; Zhou et al. 2012), though it remains essential for generalization beyond the training matrix. Future work will include further examination of these issues.

5.2 Recommender systems

The goal of a recommender system is to suggest items to users based on past feedback and other user and item information. Recommender systems may also be used for targeted advertising and other personalized services. The low rank matrix factorization approach has proven to be a popular and effective model for recommender systems data (Mnih and Salakhutdinov 2007; Koren et al. 2009; Koyejo and Ghosh 2011). Several authors (Yu et al. 2007; Zhu et al. 2009; Zhou et al. 2012) have studied the factor GP approach for recommender systems, and have shown that prior covariances extracted from the social network can improve prediction accuracy and may be used to provide predictions for users with no training ratings (Koyejo and Ghosh 2011; Zhou et al. 2012). Kernelized probabilistic matrix factorization (KPMF) (Zhou et al. 2012) is of particular interest, as it has been shown to outperform PMF (Mnih and Salakhutdinov 2007) and SoRec (Ma et al. 2008), strong baseline methods for predicting user-item preferences with social network side information.

Metrics The model performance was measured using a combination of regression and ranking metrics. Recommender systems are typically most concerned with presenting the few items that the user is most likely to be interested in, and accurately predicting the scores of the remaining items is less important. Several authors (Steck 2010; Steck and Zemel 2010) have shown that measuring the recall (\(\hbox {R}_{@k}\)) of the top relevant items compared to all available items can provide an unbiased estimate of the predicted ranking. As suggested by Steck (2010), we measure the ability of the model to rank relevant items (ratings greater than 4) ahead of other entries (both missing entries and observed entries with rating less than or equal to 4) using recall at 20 (\(\hbox {R}_{@20}\)). Recall per user was computed on the test set after removing all items observed in the training set, and then averaged over all users. For regression, we used the root mean square error (\(\text {RMSE}\)) metric (Koren et al. 2009), given by \(\sqrt{\frac{1}{L}\sum _{l=1}^L (y_l - \hat{y}_l)^2}\), where \(\hat{y}_l\) is the prediction for index \(l\). Lower RMSE values reflect better performance.
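As a concrete illustration of these two metrics, the sketch below computes the RMSE over observed test entries and the per-user \(\hbox {R}_{@20}\) described above; the variable names and the handling of users without relevant test items are illustrative assumptions, not details of the evaluation code used for the reported numbers.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over the observed test entries."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def recall_at_k(scores, relevant_items, train_items, k=20):
    """Recall@k for one user: fraction of relevant test items (rating > 4)
    ranked in the top k, after removing items seen during training."""
    train = set(train_items)
    candidates = [i for i in range(len(scores)) if i not in train]
    ranked = sorted(candidates, key=lambda i: -scores[i])
    relevant = set(relevant_items) - train
    if not relevant:
        return None                      # user contributes nothing to the average
    hits = sum(1 for i in ranked[:k] if i in relevant)
    return hits / len(relevant)

# The reported R@20 is the mean of recall_at_k over all users that have at
# least one relevant item in the test set (None values are skipped).
```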

Datasets We trained and evaluated our models using two publicly available recommender systems datasets with social network side information: the Flixster and Epinions datasets. Flixster (Footnote 5) is a website where users share film reviews and ratings. Users can also declare social connections. We utilized the dataset described by Jamali and Ester (2010), which contains a ratings matrix and the social network. We selected the \(M=5,000\) users with the most friends in the network and the \(N=5,000\) movies with the most ratings. This resulted in a matrix with \(L=33,182\) ratings (density \(\approx 0.001\)) and \(211,702\) undirected user social connections. The identity prior covariance was used for the movies. Ratings in Flixster take one of 10 values in the set \(\{0.5, 1, 1.5, \ldots , 5.0 \}\). Epinions (Footnote 6) is an item review site where users can also specify directed associations via trust links. We utilized the extended Epinions dataset (Massa and Avesani 2006) and converted all directed trust links into undirected links. We selected the \(M=5,000\) users with the most trust links in the network and the \(N=5,000\) movies with the most ratings. This resulted in a matrix with \(L=187,163\) ratings (density \(\approx 0.007\)) and \(550,298\) user social connections. The identity prior covariance was used for the items. Ratings in the Epinions dataset take one of five values in the set \(\{1.0, 2.0, \ldots , 5.0 \}\).
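The diffusion prior covariance used for the users (labelled (D) in Figs. 3 and 4) is derived from the social network. As a rough illustration only, one standard construction is the graph diffusion (heat) kernel on the network's Laplacian; the bandwidth \(\beta\) and the use of the combinatorial Laplacian below are assumptions for this sketch and may differ from the exact construction defined earlier in the paper.

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(adj, beta=0.01):
    """Graph diffusion (heat) kernel K = expm(-beta * L) from an undirected
    social network adjacency matrix; L is the combinatorial graph Laplacian.

    adj  : (n, n) symmetric 0/1 adjacency matrix
    beta : diffusion bandwidth (illustrative value only)
    """
    adj = np.asarray(adj, float)
    lap = np.diag(adj.sum(axis=1)) - adj   # combinatorial Laplacian
    return expm(-beta * lap)               # dense (n, n) prior covariance

# Note: the dense matrix exponential is only practical for moderate n.
# For the item side, the text specifies the identity prior covariance:
# K_items = np.eye(n_items)
```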

We present five-fold cross-validation performance for in-matrix and new user predictions on both the Flixster and Epinions datasets. We found that the model selected using \(\text {RMSE}\) as the validation metric did not always perform best in terms of recall (and vice versa), matching the results of other researchers (Steck 2010; Steck and Zemel 2010). Hence, we performed cross validation separately for \(\text {RMSE}\) and \(\hbox {R}_{@20}\). The results on the Flixster dataset are shown in Table 3. For known users, we found that the tested models performed similarly in terms of RMSE, but the proposed models (Con. MV-GP, Trace GP) significantly outperformed the matrix factorization baselines (KPMF, PMF) in terms of recall. These results are further highlighted by the \(\hbox {R}_{@k}\) performance shown in Fig. 4a. The results were often equivalent for new user predictions (Fig. 4b). Thus our experimental results suggest that the proposed models are more accurate in terms of ranking while retaining competitive regression performance.
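The per-metric validation procedure simply selects hyperparameters independently for each reported metric. A minimal sketch follows; the helper name and dictionary keys are illustrative, not part of the experimental code.

```python
def select_by_metric(cv_results, metric, higher_is_better):
    """cv_results: list of (hyperparams, {'rmse': ..., 'r20': ...}) pairs
    averaged over the cross-validation folds."""
    best = max if higher_is_better else min
    return best(cv_results, key=lambda item: item[1][metric])[0]

# best_for_rmse = select_by_metric(results, 'rmse', higher_is_better=False)
# best_for_r20  = select_by_metric(results, 'r20',  higher_is_better=True)
```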

Table 3 Flixster dataset
Fig. 4 Performance results on recommender systems datasets. Recall \(@k=1, 2, \ldots , 20\) for known users (left) and new users (right). The prior covariances and constraints have the largest effect for very sparse data. (I): identity prior covariance; (D): diffusion prior covariance. Con. MV-GP outperforms KPMF (Zhou et al. 2012), which has been shown to outperform PMF (Mnih and Salakhutdinov 2007) and SoRec (Ma et al. 2008)

The RMSE and \(\hbox {R}_{@20}\) performance on the Epinions dataset is shown in Table 4. Our results here mirror those on the Flixster dataset: all models achieved similar RMSE, while the proposed constrained approach achieved a significant gain in \(\hbox {R}_{@20}\) for known users. A similar trend is highlighted in Fig. 4c. Con. MV-GP, Trace GP and KPMF perform similarly when tested on new users, as shown in Fig. 4d, with a slight improvement for Con. MV-GP. Comparing the Bayesian MV-GP to its constrained variant clearly shows the utility of the nuclear norm constraint on both recommender systems datasets. Overall, our results suggest that the nuclear norm constrained MV-GP is effective for regression and for ranking in recommender systems.

Table 4 Epinions dataset

6 Conclusion

This paper introduces a novel approach for the predictive modeling of low rank transposable data with the matrix-variate Gaussian process. Low rank structure is achieved via nuclear norm constrained inference, which recovers a mean function of low rank. We showed that inference for the Gaussian process with the nuclear norm constraint is convex. The proposed approach was applied to the disease-gene association task and to the recommender system task. The proposed model was effective for regression and for ranking with highly imbalanced data, and performed at least as well as (and often significantly better than) state-of-the-art domain-specific baseline models.

Recent work (Yu et al. 2013) characterizing necessary and sufficient conditions for the existence of a representer theorem points to the potential scope of the constrained inference approach combined with nonparametric processes. Thus, we plan to explore other constraint sets in addition to the nuclear norm constraint studied here. We are also interested in exploring covariance constraints, as outlined for Gaussian distributions in Koyejo and Ghosh (2013a), applied to nonparametric processes, and in applications of nonparametric constrained Bayesian inference to more complicated models beyond Gaussian distributions. Finally, we intend to explore the biological implications of the constrained disease-gene association results in collaboration with domain experts.