A constrained matrix-variate Gaussian process for transposable data
Abstract
Transposable data represent interactions among two sets of entities, and are typically represented as a matrix containing the known interaction values. Additional side information may consist of feature vectors specific to entities corresponding to the rows and/or columns of such a matrix. Further information may also be available in the form of interactions or hierarchies among entities along the same mode (axis). We propose a novel approach for modeling transposable data with missing interactions given additional side information. The interactions are modeled as noisy observations from a latent noise-free matrix generated from a matrix-variate Gaussian process. The construction of row and column covariances using side information provides a flexible mechanism for specifying a priori knowledge of the row and column correlations in the data. Further, the use of such a prior combined with the side information enables predictions for new rows and columns not observed in the training data. In this work, we combine the matrix-variate Gaussian process model with low rank constraints. The constrained Gaussian process approach is applied to the prediction of hidden associations between genes and diseases using a small set of observed associations as well as prior covariances induced by gene-gene interaction networks and disease ontologies. The proposed approach is also applied to recommender systems data, which involves predicting the item ratings of users using known associations as well as prior covariances induced by social networks. We present experimental results that highlight the performance of the constrained matrix-variate Gaussian process as compared to state-of-the-art approaches in each domain.
Keywords
Constrained Bayesian inference · Gaussian process · Transposable data · Nuclear norm · Low rank
1 Introduction
Transposable data describes relationships between pairs of entities. Such data can be organized as a matrix, with one set of entities as the rows and the other set of entities as the columns. In such datasets, both the rows and columns of the matrix are of interest. Transposable data matrices are often sparse, and of primary interest is the prediction of unobserved matrix entries representing unknown interactions. In the machine learning community, the modeling of transposable data is often encountered as multitask learning (Stegle et al. 2011). In addition to the matrix, transposable datasets often include features describing each row entity and each column entity, or graphs describing relationships between the rows and the columns. These features and graphs can be useful for improving in-matrix prediction performance and for extending model predictions outside of the observed matrix, thus alleviating the cold-start problem. In this work, we combine the matrix-variate Gaussian process model with low rank constraints for the predictive modeling of transposable data.
In recent years, the matrix-variate Gaussian distribution (MVG) has emerged as a popular model for transposable data (Allen and Tibshirani 2010, 2012), as it compactly decomposes correlations between the matrix entries into correlations between the rows and correlations between the columns. Although the MVG has been shown to be effective for modeling matrix data with missing entries, model predictions do not extend to rows and columns that are unobserved in the training data. One approach to remedy this deficiency is to replace the MVG with the nonparametric matrix-variate Gaussian process (MVGP) (Stegle et al. 2011). This is achieved by replacing the row and column covariance matrices of the MVG with parameterized row and column covariance functions. Thus, the resulting model can provide predictions for new rows and columns given features. The MVGP may also be described as an extension of the scalar-valued Gaussian process (GP) (Rasmussen and Williams 2005), a popular model for scalar functions, to vector-valued responses. The MVGP has been applied to link analysis, transfer learning, collaborative prediction and other multitask learning problems (Yu and Chu 2008; Bonilla et al. 2008; Yan et al. 2011). Despite its wide use for transposable data and multitask learning, the MVGP does not capture low rank structure.
Rank constraints have become ubiquitous in matrix prediction tasks (Yu et al. 2007; Zhu et al. 2009; Koyejo and Ghosh 2011; Zhou et al. 2012; Koyejo and Ghosh 2013a). The low rank assumption implies that matrix-valued parameters of interest can be decomposed as the inner product of low dimensional factors. This reduces the degrees of freedom in the matrix model and can improve the parsimony of the results. Recent theoretical (Candès and Recht 2009) and empirical (Koren et al. 2009) results have provided additional motivation for the low rank approach. The low rank assumption is also motivated by computational concerns. Consider the computational requirements of a full matrix regression model such as Gaussian process regression (Rasmussen and Williams 2005). Here, the memory requirements scale quadratically with data size, and naïve inference via a matrix inverse scales cubically with data size (Álvarez et al. 2012). In contrast, training low rank models can scale linearly with the data size and quadratically with the underlying matrix rank (using the factor representation). Further, efficient optimization methods have been proposed (Koren et al. 2009; Dudik et al. 2012).
We propose a novel constrained Bayesian inference approach that combines the flexibility and extensibility of the matrix-variate Gaussian process with the parsimony and empirical performance of low rank models. Constrained Bayesian inference (Koyejo and Ghosh 2013a) is a principled approach for enforcing expectation constraints on the Bayesian inference procedure. It is a useful approach for probabilistic inference when the problem of interest requires constraints that are difficult to capture using standard prior distributions alone. Examples include linear inequality constraints (Gelfand et al. 1992) and margin constraints (Zhu et al. 2012). To enforce these restrictions, constrained Bayesian inference represents the Bayesian inference procedure as a constrained relative entropy minimization problem. The resulting optimization problem can often be reduced to constrained parameter estimation and solved using standard optimization theoretic techniques.

We propose a novel approach for capturing the low rank characteristics of transposable data by combining the matrix-variate Gaussian process prior with constrained Bayesian inference subject to nuclear norm constraints.

We show that (i) the distribution that solves the constrained Bayesian inference problem is a Gaussian process, (ii) its inference can be reduced to the estimation of a finite set of parameters, and (iii) the resulting optimization problem is strongly convex in these parameters.

We evaluate the proposed model empirically and show that it performs as well as (or better than) state-of-the-art domain-specific models for disease-gene association prediction with gene network and disease ontology side information, and for recommender systems with social network side information.
2 Background
This section describes the problem statement (Sect. 2.2) and the main building blocks of our approach: (i) the matrix-variate Gaussian process (Sect. 2.3) and (ii) constrained Bayesian inference (Sect. 2.4).
2.1 Preliminaries
We denote vectors by bold lower case e.g. \({\mathbf {x}}\) and matrices by bold upper case e.g. \({\mathbf {X}}\). Let \({\mathbf {I}}_D\) represent the \(D \times D\) identity matrix. Given a matrix \({\mathbf {A}}\in {\mathbb {R}}^{P \times Q}\), \({\text {vec}({\mathbf {A}})} \in {\mathbb {R}}^{PQ}\) is the vector obtained by concatenating columns of \({\mathbf {A}}\). Given matrices \({\mathbf {A}}\in {\mathbb {R}}^{P \times Q}\) and \({\mathbf {B}}\in {\mathbb {R}}^{P' \times Q'}\), the Kronecker product of \({\mathbf {A}}\) and \({\mathbf {B}}\) is denoted as \({\mathbf {A}}\otimes {\mathbf {B}}\in {\mathbb {R}}^{PP'\times QQ'}\). A useful property is the Kronecker identity: \({\text {vec}({\mathbf {A}}{\mathbf {X}}{\mathbf {B}})} = ({{\mathbf {B}}^\top } \otimes {\mathbf {A}}) {\text {vec}({\mathbf {X}})}\), where \({\mathbf {X}}\in {\mathbb {R}}^{Q \times P'}\) and \({{\mathbf {B}}^\top }\) represents the transpose of \({\mathbf {B}}\).
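The Kronecker identity above can be checked numerically. The following sketch (using NumPy, with illustrative dimensions) verifies \({\text {vec}({\mathbf {A}}{\mathbf {X}}{\mathbf {B}})} = ({{\mathbf {B}}^\top } \otimes {\mathbf {A}}) {\text {vec}({\mathbf {X}})}\) under column-stacking vectorization:

```python
import numpy as np

# Illustrative shapes: A is P x Q, X is Q x P', B is P' x Q'.
rng = np.random.default_rng(0)
P, Q, Pp, Qp = 3, 4, 5, 2
A = rng.standard_normal((P, Q))
X = rng.standard_normal((Q, Pp))
B = rng.standard_normal((Pp, Qp))

def vec(M):
    # Column-stacking vectorization, as defined in the text.
    return M.reshape(-1, order="F")

lhs = vec(A @ X @ B)            # vec(AXB)
rhs = np.kron(B.T, A) @ vec(X)  # (B^T ⊗ A) vec(X)
assert np.allclose(lhs, rhs)
```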
2.2 Transposable data notation and problem statement
Let \({\mathbb {M}}\ni m\) be the index set of rows and \({\mathbb {N}}\ni n\) be the index set of columns. The index set of observed matrix entries is represented by \(\mathsf {L}=\{(m, n)\} \subset {\mathbb {M}}\times {\mathbb {N}}\), with entries indexed by \(l=(m,n)\in \mathsf {L}\). We define the subset of observed rows as the set \(\mathsf {M}= \{m \mid (m,n) \in \mathsf {L}\} \subset {\mathbb {M}}\) with size \(|\mathsf {M}|=M\), and the subset of observed columns as the set \(\mathsf {N}= \{n \mid (m,n) \in \mathsf {L}\} \subset {\mathbb {N}}\) with size \(|\mathsf {N}|=N\), so \(L=|\mathsf {L}|\le M\times N\). Let each entry in the matrix be represented by \(y_l\). The observed subset of the transposable matrix is represented by \({\mathbf {y}}= {\left[ y_{l_1}\,\ldots \,y_{l_{L}} \right] ^\top }\). Our goal is to estimate a predictive model for any unobserved entries \(\{y_{l'} \mid l' \notin \mathsf {L}\}\), including entries not observed within the bounds of the training matrix, i.e., \(\{y_{l'} \mid l' \notin \mathsf {M}\times \mathsf {N}\}\).
2.3 Matrix-variate Gaussian process for transposable data
The matrix-variate Gaussian process is a doubly indexed stochastic process \(\{ Z_{m,n}\}_{m \in {\mathbb {M}}, n \in {\mathbb {N}}}\) where finitely indexed samples follow a multivariate Gaussian distribution. As with the scalar Gaussian process (Rasmussen and Williams 2005), the MVGP is completely specified by its mean and covariance functions. We use the notation \(\fancyscript{M}\fancyscript{G}\fancyscript{P}\left( \phi , \fancyscript{C}_{\mathrm{M}}, \fancyscript{C}_{\mathrm{N}}\right) \) to denote the MVGP with mean function \(\phi : {\mathbb {M}}\times {\mathbb {N}}\mapsto {\mathbb {R}}\), row covariance function \(\fancyscript{C}_{\mathrm{M}}: {\mathbb {M}}\times {\mathbb {M}}\mapsto {\mathbb {R}}\) and column covariance function \(\fancyscript{C}_{\mathrm{N}}: {\mathbb {N}}\times {\mathbb {N}}\mapsto {\mathbb {R}}\). The covariance function of the prior MVGP has a Kronecker product structure (Álvarez et al. 2012). This form assumes that the prior covariance between matrix entries can be decomposed as the product of the row and column covariances. The joint covariance function of the MVGP decomposes into product form as \(\fancyscript{C}\left( (m,n),(m',n')\right) = \fancyscript{C}_{\mathrm{M}}(m,m')\fancyscript{C}_{\mathrm{N}}(n,n')\), or equivalently, \(\fancyscript{C}= \fancyscript{C}_{\mathrm{N}}\otimes \fancyscript{C}_{\mathrm{M}}\). We use the notation \(\fancyscript{G}\fancyscript{P}\left( \psi , \fancyscript{C}\right) \) to denote the scalar-valued Gaussian process with mean function \(\psi : \mathsf {L}\mapsto {\mathbb {R}}\) and covariance function \(\fancyscript{C}: \mathsf {L}\times \mathsf {L}\mapsto {\mathbb {R}}\).
Let \(Z \sim \fancyscript{M}\fancyscript{G}\fancyscript{P}\left( \phi , \fancyscript{C}_{\mathrm{M}}, \fancyscript{C}_{\mathrm{N}}\right) \), and define the matrix \({\mathbf {Z}}\in {\mathbb {R}}^{M\times N}\) with entries \(z_{m,n}= Z(m,n)\) for \((m,n)\in \mathsf {M}\times \mathsf {N}\). Then \({\text {vec}({\mathbf {Z}})}\) is distributed as a multivariate Gaussian with mean \({\text {vec}({\varvec{\varPhi }})}\) and covariance matrix \({\mathbf {C}}_{\mathrm{N}}\otimes {\mathbf {C}}_{\mathrm{M}}\), i.e., \({\text {vec}({\mathbf {Z}})}\sim \fancyscript{N}\left( {\text {vec}({\varvec{\varPhi }})}, {\mathbf {C}}_{\mathrm{N}}\otimes {\mathbf {C}}_{\mathrm{M}}\right) \), where \(\phi _{m,n}= \phi (m,n)\), \({\varvec{\varPhi }}\in {\mathbb {R}}^{M\times N}\) is the mean matrix, \({\mathbf {C}}_{\mathrm{M}}\in {\mathbb {R}}^{M \times M}\) is the row covariance matrix and \({\mathbf {C}}_{\mathrm{N}}\in {\mathbb {R}}^{N \times N}\) is the column covariance matrix. This definition extends to finite subsets \(\mathsf {L}\subset {\mathbb {M}}\times {\mathbb {N}}\) that are not complete matrices. For any subset \(\mathsf {L}\), the vector \({\mathbf {z}}= \left[ z_{l_1}\,\ldots \,z_{l_{L}} \right] \) is distributed as \({\mathbf {z}}\sim \fancyscript{N}\left( {\varvec{\varPhi }}_{\mathsf {L}} , {\mathbf {C}}\right) \), where the vector \({\varvec{\varPhi }}_{\mathsf {L}} =[\phi (l_1) \ldots \phi (l_{L})]^\top \in {\mathbb {R}}^L\) collects the entries of the mean matrix corresponding to \(l \in \mathsf {L}\), and \({\mathbf {C}}\) is the covariance matrix evaluated only on pairs \(l,l' \in \mathsf {L}\times \mathsf {L}\).
The generative model for the observed data is as follows:
 1.
Draw the function \(Z\) from a zero mean MVGP as \(Z \sim \fancyscript{M}\fancyscript{G}\fancyscript{P}\left( {0} , \fancyscript{C}_{\mathrm{M}}, \fancyscript{C}_{\mathrm{N}}\right) \).
 2.
Draw each observed response independently as \(y_{m,n}\sim \fancyscript{N}\left( z_{m,n}, \sigma ^2\right) \) given \(z_{m,n}= Z({m,n})\).
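The two-step generative process above can be simulated directly. The sketch below (NumPy; the random PSD covariances stand in for the side-information kernels used in the paper) draws a complete \(M \times N\) matrix from the finite-dimensional MVGP marginal and adds observation noise:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, sigma2 = 4, 3, 0.1  # illustrative sizes and noise variance

def random_psd(d):
    # Toy PSD covariance; in the paper these come from side information.
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

C_M, C_N = random_psd(M), random_psd(N)

# Step 1: draw vec(Z) ~ N(0, C_N ⊗ C_M).
C = np.kron(C_N, C_M)
z = rng.multivariate_normal(np.zeros(M * N), C)
Z = z.reshape((M, N), order="F")  # undo column-stacking vectorization

# Step 2: draw y_{m,n} ~ N(z_{m,n}, sigma^2) independently.
Y = Z + np.sqrt(sigma2) * rng.standard_normal((M, N))
```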
2.4 Constrained Bayesian inference
3 Related work
Constrained Bayesian inference is a special case of constrained relative entropy minimization where some of the constraints are generated from observed data (Koyejo and Ghosh 2013a). Constrained relative entropy minimization and constrained entropy maximization have been studied in several application domains, including natural language processing (Berger et al. 1996) and ecology (Dudík et al. 2007). Applications in the machine learning literature include maximum entropy discrimination (MED) (Jaakkola et al. 1999), and other models inspired by MED have been proposed for combining nonparametric topic models with large margin constraints for document classification (Zhu et al. 2009) and multitask classification (Zhu et al. 2011). Constrained relative entropy models have also been applied to collaborative filtering (Xu et al. 2012) and link prediction (Zhu et al. 2012). Other work using nonparametric priors (Zhu et al. 2009, 2011) has resulted in intractable inference, requiring the application of variational approximations with tractable assumptions made for the independence structure and parametric families of the solution. Our work appears to be the first that uses nonparametric prior distributions without requiring such simplifying assumptions. In addition, we consider constraints on the function space of the Gaussian process, which generalize the evaluation based constraints proposed in prior work, i.e., constraints on the entire mean function as opposed to constraints on the mean of a set of matrix entries.
Factor models such as principal component analysis (PCA) (Bishop 2006) and its variants are popular methods for extracting information from matrix data. The standard PCA model can be extended to handle missing data using a Bayesian approach (Bishop 2006) that marginalizes over the missing data. The Gaussian process latent variable model (GPLVM) (Lawrence and Hyvärinen 2005) was proposed to extend PCA to model nonlinear relationships by replacing the covariance matrix with a nonlinear kernel. This kernel approach has been applied to nonlinear matrix factorization (Lawrence and Urtasun 2009). The GPLVM integrates out one of the factors and estimates the other. The rank of the factor model must be pre-specified in such models, and is often fixed via expensive cross-validation. Implementations of kernel PCA typically capture prior correlations over the rows or the columns, but not both^{1}. Our proposed model is designed to capture prior correlations simultaneously over the rows and columns via the matrix-variate Gaussian process prior. Further, the nuclear norm provides an avenue for automatic (implicit) rank selection.
The Gaussian process factor model of rank \(R\) corresponds to the following generative process:
 1.
For each \(r\in \{1\ldots R\}\), draw row functions: \(U^r \sim \fancyscript{G}\fancyscript{P}\left( {0} , \fancyscript{C}_{\mathrm{M}}\right) \). Let \({\mathbf {u}}_m \in {\mathbb {R}}^R\) with entries \(u^r_m = U^r(m)\).
 2.
For each \(r\in \{1\ldots R\}\), draw column functions: \(V^r \sim \fancyscript{G}\fancyscript{P}\left( {0} , \fancyscript{C}_{\mathrm{N}}\right) \). Let \({\mathbf {v}}_n \in {\mathbb {R}}^R\) with \(v^r_n = V^r(n)\).
 3.
Draw each matrix entry independently: \(y_{m,n}\sim \fancyscript{N}\left( {{{\mathbf {u}}_m^\top }{\mathbf {v}}_n}, \sigma ^2\right) \; \forall \,(m,n)\in \mathsf {L}\).
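The rank-\(R\) factor generative process above can likewise be simulated; in this NumPy sketch the covariances are illustrative placeholders for the row and column kernels:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, R, sigma2 = 5, 4, 2, 0.01  # illustrative sizes, rank R

def random_psd(d):
    # Toy PSD covariance standing in for C_M or C_N.
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

C_M, C_N = random_psd(M), random_psd(N)

# Steps 1-2: draw R independent row functions and column functions.
U = rng.multivariate_normal(np.zeros(M), C_M, size=R).T  # M x R
V = rng.multivariate_normal(np.zeros(N), C_N, size=R).T  # N x R

# Step 3: noisy observations of the rank-R matrix U V^T.
Y = U @ V.T + np.sqrt(sigma2) * rng.standard_normal((M, N))
assert np.linalg.matrix_rank(U @ V.T) <= R
```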
Despite its success, the factor Gaussian process approach has some deficiencies when applied to probabilistic inference. First, posterior distributions of interest are generally intractable. Specifically, neither the joint posterior distribution of \(\{{\mathbf {U}}, {\mathbf {V}}\}\) nor the distribution of \({\mathbf {Z}}={\mathbf {U}}{{\mathbf {V}}^\top }\) is Gaussian, and their posterior distributions are quite challenging to characterize. As a result, the posterior mean is challenging to compute without sampling, and practitioners often apply the MAP approach. Second order statistics such as the posterior covariance are also computationally intractable. Instead, various approximate inference techniques have been applied: a Laplace approximation was proposed by Yu et al. (2007), and Zhu et al. (2009) utilized sampling techniques. Further, in most cases, the rank must be fixed a priori. More recently, Bayesian models for matrix factorization that include a nonparametric prior for the number of latent factors have been proposed based on the Indian buffet process (Zhu 2012; Xu et al. 2012) and the multiplicative gamma process (Zhang and Carin 2012). Inference with these models is generally intractable, and requires approximations or sampling, which may result in slow or inaccurate inference for large datasets. Further, many of these approaches have focused on in-matrix prediction, and have not been applied to out-of-matrix predictions.
Other related literature includes Li and Yeung (2009), where the authors proposed a regularized matrix factorization model exploiting relational information. The proposed model is identical to the Gaussian process factor model^{2} (Zhou et al. 2012) with an appropriate choice of kernel. Li et al. (2009b) proposed an approach for learning a kernel based on network links that can then be applied to predictive modeling tasks. Li et al. (2009a) proposed a Bayesian probabilistic PCA model for full matrix prediction exploiting relational data information by constructing a covariance matrix that accounts for the relational data. An alternative approach focusing on learning additive Gaussian process kernels was proposed by Xu et al. (2009), and an approach for nonparametric relational data modeling using co-clustering (instead of matrix factorization) was proposed by Xu et al. (2006). Several works have focused on the matrix prediction task alone, without the use of side information. For example, Sutskever et al. (2009) utilized the clustering of factors to model the latent relationships as an alternative to designing covariance matrices.
4 Proposed approach: the nuclear norm constrained MVGP
We propose nuclear norm constrained Bayesian inference for modeling low rank transposable data as an alternative to the low rank factor approach. The proposed approach constrains the model by directly regularizing the rank of the expected prediction via a constraint on its nuclear norm. Optimization with the rank constraint is computationally intractable, and the popular factor representation results in a non-convex optimization problem that is susceptible to local minima (Dudik et al. 2012). The nuclear norm constraint has been proposed as a tractable surrogate for the low rank constraint, which is in turn motivated by the parsimony of the low rank representation and the superior empirical performance of low rank models in many application domains. The nuclear norm of a matrix-variate function is given by the sum of its singular values (Abernethy et al. 2009), and is the tightest convex envelope of the rank. Under certain conditions, it can be shown that nuclear norm regularization recovers the true low rank matrix (Pong et al. 2010). Further details on the nuclear norm of matrix functions are provided in Appendix 8.
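The definition of the nuclear norm as the sum of singular values, and its relationship to rank, can be checked numerically. A small sketch (NumPy; the rank-2 matrix here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
# A rank-2 matrix built from low dimensional factors (illustrative).
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))

# Nuclear norm = sum of singular values.
sv = np.linalg.svd(A, compute_uv=False)
nuc = sv.sum()
assert np.isclose(nuc, np.linalg.norm(A, "nuc"))

# Only rank-many singular values are (numerically) nonzero.
assert np.sum(sv > 1e-8) == 2
```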
Without loss of generality, we assume a set of rows \(\mathsf {M}\) and a set of columns \(\mathsf {N}\) of interest so \(\mathsf {L}\subset \mathsf {M}\times \mathsf {N}\). Let \({\mathbf {Z}}\in {\mathbb {R}}^{M \times N}\) be the matrix of hidden variables, with \({\mathbf {z}}= {\text {vec}({\mathbf {Z}})} \in {\mathbb {R}}^{MN}\). Given any finite index set of observations at indices \(l \in \mathsf {L}\), the finite dimensional prior distribution is a Gaussian distribution given by \(\fancyscript{N}\left( 0, {\mathbf {C}}\right) \) where \({\mathbf {C}}\in {\mathbb {R}}^{MN \times MN}\). We seek a post-data density \(q(Z \mid \fancyscript{D})\) that optimizes (3a) subject to the constraint \(\left\Vert \mathrm {E}_{q}\left[ \, Z \, \right] \right\Vert _1 \le \eta \), where \(\Vert \cdot \Vert _1\) denotes the nuclear norm. For any finite index set, the unconstrained Bayesian posterior distribution is Gaussian (Sect. 2.3). Following the steps of Sect. 2.4 (see also Appendix 7), it is straightforward to show that since the feature function \(\gamma (Z) = Z\) is linear, the constrained Bayes solution must also take a Gaussian form. All that remains is to solve for the mean and covariance. We may apply either the prior form (3a) or the equivalent posterior form (3b) for constrained inference. We discuss both approaches for illustrative purposes.
We now seek to extend the solution from the finite observed index set to the nonparametric domain. Our approach relies on Kolmogorov’s extension theorem (Bauer 1996), which provides a mechanism for describing infinite dimensional random processes via their finite dimensional marginals (Orbanz and Teh 2010). We will apply the theorem to extend the solution estimated by (6) using a finite index set to a corresponding nonparametric Gaussian process. This will be achieved by showing that the solution can be extended to an arbitrary index set with a consistent functional form for the mean and the covariance.
Theorem 1
The post-data distribution \(\fancyscript{N}\left( {\varvec{\psi }}, {\mathbf {S}}\right) \) is a finite dimensional representation of the Gaussian process \(\fancyscript{G}\fancyscript{P}\left( \psi , S\right) \) sampled at the indices \(\mathsf {L}\), where the mean function \(\psi \) is given by (10) and the covariance function \(S\) is given by (1b).
Sketch of proof:
The requirements of Kolmogorov’s extension theorem can be reduced to a proof that, for a fixed training set \(\fancyscript{D}\), the post-data distribution of the superset \((\mathsf {M}\times \mathsf {N})\cup (m',n')\) has a consistent functional representation^{3}. The mean and covariance of the post-data density are decoupled in the optimization, and the post-data covariance function can be computed in closed form. Thus, for the covariance, the claim follows directly from the functional form of (1b). The functional form of the mean follows from the finite representation (18) that solves the optimization problem (10). Note that the solution does not change with the addition of indices \(l'=(m',n') \notin \mathsf {L}\) without corresponding observations \(y_{l'}\). Uniqueness of the solution follows from the strong convexity of (10). We refer the reader to the dissertation (Koyejo 2013) for further details. \(\square \)
4.1 Alternative representation of the nuclear norm constrained inference
To improve scalability, large scale nuclear norm regularized solvers generally represent the parameter matrix in low rank form, avoiding storage of the full matrix. Further, the rank of the parameter matrix is automatically estimated during the optimization. We provide a short summary of the approaches in Dudik et al. (2012) and Laue (2012); interested readers are referred to the relevant papers for further details. The parameter matrix is estimated starting from a rank one solution, and the rank is increased until additional factors do not improve the cost any further. The first step consists of determining a good descent direction, and the second step consists of optimizing the factors given the initial direction. In the first step, a descent direction is determined by computing the singular vectors associated with the maximum singular value of the sparse gradient matrix. This step does not need to be accurate and is usually achieved using a few iterations of the power method. The factor optimization in the second step is analogous to the standard matrix factorization optimization, so the large scale nuclear norm solvers mainly differ from standard matrix factorization in the determination of an initial descent direction (matrix factorization is generally randomly initialized), and in the automatic determination of the number of required factors, i.e., the rank. Thus the computational requirements of large scale nuclear norm regularized regression are comparable to standard matrix factorization methods.
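The first step described above, finding a descent direction from the leading singular pair of the gradient, is typically implemented with a few power iterations. A minimal sketch (dense NumPy for clarity; real solvers exploit the sparsity of the gradient, and the helper name is illustrative):

```python
import numpy as np

def top_singular_pair(G, iters=100, seed=0):
    """Approximate the leading singular vectors of G via the power
    method, as used to pick a descent direction in large scale
    nuclear norm solvers (a sketch, not a production implementation)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(G.shape[1])
    for _ in range(iters):
        u = G @ v
        u /= np.linalg.norm(u)
        v = G.T @ u
        v /= np.linalg.norm(v)
    return u, v

rng = np.random.default_rng(4)
G = rng.standard_normal((8, 6))  # stand-in for a gradient matrix
u, v = top_singular_pair(G)
s = np.linalg.svd(G, compute_uv=False)
# u^T G v approaches the largest singular value of G.
assert np.isclose(abs(u @ G @ v), s[0], rtol=1e-3)
```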
5 Experiments
We completed experiments with transposable datasets from the disease-gene association domain and the recommender system domain. Prior covariances: All the datasets studied consist of transposable data matrices with corresponding row and/or column graphs. We experimented with the identity prior covariance \({\mathbf {C}}= {\mathbf {I}}\), where \({\mathbf {I}}\) is the identity matrix, and the diffusion prior covariance (Smola and Kondor 2003) given as \({\mathbf {C}}= \exp {(-a{\mathbf {L}})} + b{\mathbf {I}}\), where \({\mathbf {L}}\) is the normalized graph Laplacian matrix. Let \({\mathbf {A}}\) be the adjacency matrix for the graph and \({\mathbf {D}}\) be a diagonal matrix with entries \({\mathbf {D}}_{i,i}= ({\mathbf {A}}\mathbf {1})_i\). The normalized Laplacian matrix is computed as \({\mathbf {L}}= {\mathbf {I}}- {\mathbf {D}}^{-\frac{1}{2}}{\mathbf {A}}{\mathbf {D}}^{-\frac{1}{2}}\). We set \(a=b=1\). No further optimization was performed, and more detailed experimental validation of covariance parameter selection is left for future work.
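The diffusion prior covariance can be computed from a graph adjacency matrix as follows. This NumPy sketch assumes the sign convention of Smola and Kondor (2003) and evaluates the matrix exponential via an eigendecomposition of the symmetric normalized Laplacian:

```python
import numpy as np

def diffusion_covariance(A, a=1.0, b=1.0):
    """C = exp(-a L) + b I, with L the normalized graph Laplacian
    L = I - D^{-1/2} A D^{-1/2} (a sketch; assumes every node has
    at least one edge so that D is invertible)."""
    d = A.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - Dinv_sqrt @ A @ Dinv_sqrt
    # Matrix exponential of the symmetric matrix -a L via eigh.
    w, V = np.linalg.eigh(L)
    return (V * np.exp(-a * w)) @ V.T + b * np.eye(len(A))

# Tiny 4-node path graph as an example adjacency matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
C = diffusion_covariance(A)
# A valid covariance: symmetric positive definite.
assert np.allclose(C, C.T)
assert np.all(np.linalg.eigvalsh(C) > 0)
```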
Models: We present results for the proposed constrained MVGP approach (Con. MVGP), and the special cases using only the nuclear norm (Trace GP)^{4} and using only the Hilbert norm (MVGP), i.e., the standard MVGP regression. To the best of our knowledge, the special case of Trace GP is a novel contribution. As baselines, we implemented kernelized probabilistic matrix factorization (KPMF) (Zhou et al. 2012) and probabilistic matrix factorization (PMF) (Mnih and Salakhutdinov 2007) using rank 5 and rank 20 factors. PMF is identical to KPMF using an identity covariance. KPMF has been shown to outperform PMF and other baseline models in various domains. We note that the rank constraint ensures that all of the proposed models except for MVGP can be used for in-matrix predictions even with the identity prior covariance. Out-of-matrix predictions require the use of other covariance matrices.
We implemented Con. MVGP using the representation outlined in Sect. 4.1. The Cholesky decomposition of the covariance matrices was used as the basis representation. The model hyperparameter \(\lambda \) was selected using \(5\) values logarithmically spaced between \(10^{-3}\) and \(10^{3}\), and the noise hyperparameter \(\sigma ^2\) was selected using 20 values logarithmically spaced between \(10^{-3}\) and \(10^{3}\) for all the models. We experimented with learning the data noise variance term \(\sigma ^2\), but found the results worse than using parameter selection. In particular, the estimated noise variance often approached zero, indicating overfitting. A possible solution we plan to explore is to introduce a prior distribution for \(\sigma ^2\) (see, e.g., Bayesian linear regression in Bishop 2006, Chapter 3.3) that may help to regularize the noise term away from zero.
The standard MVGP is often implemented as a scalar GP with the row and column prior covariance matrices multiplied as shown in (1). We found this “direct” approach computationally intractable, as the memory requirements scale quadratically with the size of the observed transposable data matrix. Instead, we implemented the MVGP in matrix form as a special case of (11) with \(\lambda = 0\). This allowed us to scale the model to the larger datasets at the expense of more computation. The nuclear norm regularized optimization in (11) was solved using the large scale approach of Laue (2012). All numerical optimization was implemented using the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm.
Experiment design and cross-validation: We performed two kinds of experiments. In the rest of this discussion, “rows” will refer to either the diseases (disease-gene prediction) or the users (recommender system). The known rows experiment was designed to evaluate the performance of the model for entries selected randomly over the observed values in the matrix. In contrast, the new rows experiment was designed to evaluate the generalization ability of the model for new rows not observed in the training set. We partitioned each dataset into five-fold cross-validation sets. The model was trained on \(4\) of the \(5\) sets and tested on the held out set. The results presented are the averaged five-fold cross-validation performance. For the known rows experiments, the cross-validation sets were randomly selected over the matrix. For the new rows experiments, the cross-validation was performed row-wise, i.e., we selected disjoint sets of training rows and test rows. Note that the identity prior covariance cannot be used for new row prediction, but due to the low rank constraint, it can be used for known row prediction.
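The two cross-validation schemes can be sketched as follows (a NumPy sketch with illustrative helper names: entry-wise folds for the known rows experiment, row-wise folds for the new rows experiment):

```python
import numpy as np

def known_entry_folds(obs_indices, k=5, seed=0):
    """Random folds over observed (row, col) pairs: entries are
    held out uniformly over the matrix (known rows experiment)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(obs_indices))
    return [[obs_indices[i] for i in fold]
            for fold in np.array_split(perm, k)]

def new_row_folds(rows, k=5, seed=0):
    """Row-wise folds: all entries of a held-out row are tested
    together (new rows experiment)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(list(rows))
    return [set(f.tolist()) for f in np.array_split(shuffled, k)]

# Toy fully observed 6 x 4 matrix of index pairs.
obs = [(m, n) for m in range(6) for n in range(4)]
entry_folds = known_entry_folds(obs, k=5)
row_folds = new_row_folds(range(6), k=5)
```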
5.1 Diseasegene prediction
Genes are segments of DNA that determine specific characteristics; over 20,000 genes have been identified in humans, which interact to regulate various functions in the body. Researchers have identified thousands of diseases, including various cancers and respiratory diseases such as asthma (NCBI 1998), caused by mutations in these genes. Genetic association studies (McCarthy et al. 2008) are the standard approach for discovering disease-causing genes. However, these studies are often tedious and expensive to conduct. Hence, computational methods that can reduce the search space by predicting the list of candidate genes associated with a given disease are of significant scientific interest. The disease-gene prediction task has been the subject of a significant amount of study in recent years (Vanunu et al. 2010; Li and Toh 2010; Mordelet and Vert 2011; Singh-Blom et al. 2013). The task is challenging because all the observed responses correspond to known associations, and there are no reliable negative examples. Disease-gene association shares the binary matrix representation of the one-class (also known as implicit feedback) matrix prediction problem studied in the collaborative filtering literature (Pan et al. 2008; Hu et al. 2008).
Additional baseline: In addition to the matrix factorization baseline models, we compared our proposed approach to ProDiGe (Mordelet and Vert 2011), a state-of-the-art approach that has been shown to be superior to previous top-performing approaches, including distance-based learning methods like Endeavour (Aerts et al. 2006) and label propagation methods like PRINCE (Vanunu et al. 2010). ProDiGe estimates the prioritization function using a multitask support vector machine (SVM) trained with the gene prior covariance and disease prior covariance as kernels. Of the models implemented, ProDiGe is most similar to the MVGP. In fact, MVGP and ProDiGe mainly differ in their loss functions (squared loss and hinge loss, respectively). The SVM regularization parameter for ProDiGe was selected from \(\{10^{-3}, 10^{-2},\ldots , 10^{3}\}\). We note also that PMF represents the matrix factorization baseline often applied to similar implicit feedback datasets in the recommendation system literature (Pan et al. 2008).
Sampling “negative” entries: Following Mordelet and Vert (2011), we sampled unknown entries of the disease-gene association matrix uniformly at random as “negative” observations. We sampled \(5\) different negatively labeled sets. Each model was trained on the positive set combined with one of the negatively labeled sets, and the model scores were computed by averaging the scores over the 5 trained models. All models were trained using the same samples.
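The protocol above can be sketched as follows. This is an illustrative outline only: `train_and_score` is a hypothetical stub standing in for fitting any of the compared models (Con. MVGP, PMF, etc.) on the positives plus one negative set.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_negatives(pos_mask, n_neg, rng):
    """Draw n_neg unknown (non-positive) matrix entries uniformly at random."""
    unknown = np.argwhere(~pos_mask)
    idx = rng.choice(len(unknown), size=n_neg, replace=False)
    return unknown[idx]

def train_and_score(pos_mask, negatives):
    """Hypothetical stand-in for model training; returns a score matrix."""
    scores = np.zeros(pos_mask.shape)
    scores[pos_mask] = 1.0
    return scores

# Toy 4 x 6 association matrix with 3 known (positive) associations.
pos_mask = np.zeros((4, 6), dtype=bool)
pos_mask[[0, 1, 3], [2, 5, 0]] = True

# Train one model per negative set and average the 5 score matrices.
all_scores = [
    train_and_score(pos_mask, sample_negatives(pos_mask, pos_mask.sum(), rng))
    for _ in range(5)
]
avg_scores = np.mean(all_scores, axis=0)
```

Averaging over independently drawn negative sets reduces the variance introduced by treating unknown entries as negatives.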
Datasets: We trained and evaluated our models using two sets of gene-disease association data curated from the literature. The first, which we call the OMIM data set, is based on the Online Mendelian Inheritance in Man (OMIM) database and is representative of the candidate gene prediction task for monogenic or near-monogenic diseases, i.e., diseases caused by only one or at most a few genes. The data matrix contains a total of \(M = \text {3,210}\) diseases, \(N = \text {13,614}\) genes, and \(T = \text {3,636}\) known associations (data density of 0.0083 %). We note that the extreme sparsity of this data set makes the prediction problem extremely difficult. The second dataset, which we call the Medline data set, is a much larger data set and is representative of predicting candidate genes for both monogenic as well as polygenic diseases, i.e., diseases caused by the interactions of tens or even hundreds of genes. The set of genes in this data set is defined using the NCBI ENTREZ Gene database (Maglott et al. 2011), and the set of diseases is defined using the “Disease” branch of the NIH Medical Subject Headings (MeSH) ontology (National Library of Medicine 2012). We extracted co-citations of these genes and diseases from the PubMed/Medline database (National Library of Medicine 2012) to identify positive gene-disease associations. The resulting data set contains a total of \(M = \text {4,496}\) diseases, \(N = \text {21,243}\) genes, and \(T = \text {250,190}\) known associations (data density of 0.36 %).
Information about biological interactions among genes and known relationships among diseases was used to improve the accuracy of our model, since similar diseases very often have similar genetic causes. We derive gene networks from the HumanNet database (Lee et al. 2011), a genome-wide functional network of human genes constructed using multiple lines of evidence, including gene co-expression, protein-protein interaction data, and networks from other species. For both the OMIM and Medline data sets, our gene-gene interaction network contains a total of \(\text {433,224}\) links. Our disease network is derived from the term hierarchy established in the 2011 release of the MeSH ontology. The disease network for the Medline data set contains a total of \(\text {13,922}\) links. However, because we do not have a direct mapping of OMIM diseases to MeSH terms, we do not use a disease network for the OMIM data set. As a result, we are unable to test our model’s ability to produce predictions for “new” diseases, i.e., diseases with no associated genes in the training set.
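One standard way to turn such an interaction network into a prior covariance is a graph kernel; the regularized Laplacian kernel of Smola and Kondor (2003), cited in the references, is one common choice. The sketch below is illustrative and does not claim to be the exact kernel used in the experiments; `beta` is an assumed smoothing parameter.

```python
import numpy as np

def laplacian_kernel(adj, beta=1.0):
    """Prior covariance K = (I + beta * L)^{-1} from the graph Laplacian
    L = D - A (regularized Laplacian kernel, cf. Smola & Kondor 2003)."""
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.linalg.inv(np.eye(adj.shape[0]) + beta * lap)

# Toy undirected interaction network on 4 genes.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 0., 1.],
              [1., 0., 0., 0.],
              [0., 1., 0., 0.]])
K = laplacian_kernel(A)
# K is symmetric positive definite, hence a valid GP prior covariance.
```

Since the Laplacian is positive semidefinite, the eigenvalues of \(I + \beta L\) are at least 1, so the kernel is always well defined and positive definite.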
OMIM disease-gene dataset
Model  \(\hbox {P}_{@20}\)  \(\hbox {R}_{@20}\)
MVGP (D)  0.000 (0.000)  0.003 (0.002) 
Con. MVGP (D)  0.008 (0.001)  0.146 (0.031) 
Con. MVGP (I)  0.010 (0.001)  0.175 (0.025) 
Trace GP (D)  0.006 (0.001)  0.117 (0.021) 
Trace GP (I)  0.009 (0.001)  0.157 (0.023) 
KPMF5 (D)  0.010 (0.001)  0.167 (0.028) 
PMF5 (I)  0.002 (0.000)  0.034 (0.004) 
KPMF20 (D)  0.009 (0.002)  0.161 (0.040) 
PMF20 (I)  0.002 (0.000)  0.039 (0.008) 
ProDiGe  0.000 (0.000)  0.001 (0.003) 
Medline disease-gene dataset
Known diseases  New diseases
Model  \(\hbox {P}_{@20}\)  \(\hbox {R}_{@20}\)  \(\hbox {P}_{@20}\)  \(\hbox {R}_{@20}\)
MVGP (D)  0.022 (0.000)  0.049 (0.002)  0.069 (0.020)  0.091 (0.022) 
Con. MVGP (D)  0.078 (0.001)  0.131 (0.004)  0.137 (0.029)  0.181 (0.026) 
Con. MVGP (I)  0.126 (0.001)  0.216 (0.002)  –  – 
Trace GP (D)  0.078 (0.001)  0.131 (0.004)  0.137 (0.029)  0.181 (0.026) 
Trace GP (I)  0.091 (0.001)  0.152 (0.004)  –  – 
KPMF5 (D)  0.085 (0.001)  0.142 (0.004)  0.136 (0.032)  0.179 (0.032) 
PMF5 (I)  0.079 (0.002)  0.133 (0.003)  –  – 
KPMF20 (D)  0.091 (0.001)  0.151 (0.004)  0.136 (0.032)  0.179 (0.032) 
PMF20 (I)  0.078 (0.001)  0.131 (0.002)  –  – 
In summary, the presented results suggest that the low rank constraint is useful for describing the structure of disease-gene association. We also found that in all the datasets, the constrained Bayesian models (Con. MVGP and Trace GP) performed as well as or better than the Bayesian factor models (KPMF and PMF) and the unconstrained Bayesian model (MVGP). This shows the utility of the constrained Bayesian inference approach as compared to the Bayesian factor model approach. Constrained MVGP with the identity kernel was the best single performing method, matching results in the literature suggesting that the network information is not always helpful for in-matrix predictions (Koyejo and Ghosh 2011; Zhou et al. 2012), though it remains essential for generalization beyond the training matrix. Future work will include further examination of these issues.
5.2 Recommender systems
The goal of a recommender system is to suggest items to users based on past feedback and other user and item information. Recommender systems may also be used for targeted advertising and other personalized services. The low rank matrix factorization approach has proven to be a popular and effective model for recommender systems data (Mnih and Salakhutdinov 2007; Koren et al. 2009; Koyejo and Ghosh 2011). Several authors (Yu et al. 2007; Zhu et al. 2009; Zhou et al. 2012) have studied the factor GP approach for recommender systems, and have shown that prior covariances extracted from the social network can improve the prediction accuracy and may be used to provide predictions for users with no training ratings (Koyejo and Ghosh 2011; Zhou et al. 2012). Kernelized probabilistic matrix factorization (KPMF) is of particular interest, as it has been shown to outperform PMF (Mnih and Salakhutdinov 2007) and SoRec (Ma et al. 2008), strong baseline methods for predicting user-item preferences with social network side information.
Metrics: The model performance was measured using a combination of regression and ranking metrics. Recommender systems are typically most concerned with presenting the few items that the user is very likely to be interested in, and accurately predicting the scores of the other items is less important. Several authors (Steck 2010; Steck and Zemel 2010) have shown that measuring the recall (\(\hbox {R}_{@k}\)) of the top relevant items compared to all available items can provide an unbiased estimate of the predicted ranking. As suggested by Steck (2010), we measure the ability of the model to predict relevant items (ratings greater than 4) ahead of other entries (both missing and observed entries with rating less than or equal to 4) using recall at 20 (\(\hbox {R}_{@20}\)). Recall per user was computed on the test set after removing all items that had been observed in the training set, and averaged over all users. For regression, we used the root mean square error (\(\text {RMSE}\)) metric (Koren et al. 2009) given by \(\sqrt{\frac{1}{L}\sum _{l=1}^L (y_l - \hat{y}_l)^2}\) where \(\hat{y}_l\) is the prediction for index \(l\). Lower values reflect better performance for the \(\text {RMSE}\).
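The two metrics above are straightforward to implement; a minimal per-user sketch, consistent with the definitions in the text:

```python
import numpy as np

def recall_at_k(scores, relevant, train_mask, k=20):
    """R@k for one user: fraction of relevant test items ranked in the top k,
    after removing items observed in the training set from the candidates."""
    scores = np.where(train_mask, -np.inf, scores)  # exclude training items
    top_k = np.argsort(-scores)[:k]
    n_rel = relevant.sum()
    if n_rel == 0:
        return np.nan
    return np.isin(top_k, np.flatnonzero(relevant)).sum() / n_rel

def rmse(y, y_hat):
    """Root mean square error over observed test ratings."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Tiny example: 5 candidate items, items 0 and 3 are relevant.
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.2])
relevant = np.array([True, False, False, True, False])
train_mask = np.zeros(5, dtype=bool)
r = recall_at_k(scores, relevant, train_mask, k=2)  # top-2 = items 0 and 2
```

Per the text, the per-user recall values are then averaged over all users to give the reported \(\hbox {R}_{@20}\).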
Datasets: We trained and evaluated our models using two publicly available recommender systems datasets with social network side information: the Flixster and Epinions datasets. Flixster ^{5} is a website where users share film reviews and ratings. The users can also form social connections. We utilized the dataset described by Jamali and Ester (2010), which contains a ratings matrix and the social network. We selected the \(M=5,000\) users with the most friends in the network and \(N=5,000\) movies with the most ratings. This resulted in a matrix with \(L=33,182\) (density \(=0.001\,\%\)) ratings and \(211,702\) undirected user social connections. The identity prior covariance was used for the movies. Ratings in Flixster take one of 10 values in the set \(\{0.5, 1, 1.5, \ldots , 5.0 \}\). Epinions ^{6} is an item review site where users can also specify a directed association by signifying a trust link. We utilized the extended Epinions dataset (Massa and Avesani 2006) and converted all the directed trust links into undirected links. We selected the \(M=5,000\) users with the most trust links in the network and \(N=5,000\) movies with the most ratings. This resulted in a matrix with \(L=187,163\) (density \(0.007\,\%\)) ratings and \(550,298\) user social connections. The identity prior covariance was used for the items. Ratings in the Epinions dataset take one of five values in the set \(\{1.0, 2.0, \ldots , 5.0 \}\).
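The two preprocessing steps described above (symmetrizing directed trust links, and keeping the users with the most links) can be sketched as follows; the function names are illustrative, not from the paper.

```python
import numpy as np

def symmetrize(trust_pairs, n):
    """Convert directed trust links (i trusts j) into an undirected adjacency."""
    A = np.zeros((n, n))
    for i, j in trust_pairs:
        A[i, j] = A[j, i] = 1.0
    np.fill_diagonal(A, 0.0)  # no self-links
    return A

def top_by_degree(A, m):
    """Indices of the m nodes with the most links, used to select users."""
    return np.argsort(-A.sum(axis=1))[:m]

# Toy directed trust network on 4 users.
pairs = [(0, 1), (2, 0), (1, 0), (3, 0)]
A = symmetrize(pairs, 4)
keep = top_by_degree(A, 2)  # user 0 has the highest degree (3)
```

The resulting undirected adjacency can then feed a graph kernel to build the user prior covariance, with the identity covariance retained on the item side as in the text.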
Flixster dataset
Known users  New users
Model  RMSE  \(\hbox {R}_{@20}\)  RMSE  \(\hbox {R}_{@20}\)
MVGP (D)  1.066 (0.006)  0.067 (0.008)  1.066 (0.088)  0.075 (0.017) 
Con. MVGP (D)  0.989 (0.002)  0.092 (0.012)  1.066 (0.088)  0.075 (0.017) 
Con. MVGP (I)  0.982 (0.001)  0.104 (0.004)  –  – 
Trace GP (D)  0.989 (0.002)  0.088 (0.008)  1.066 (0.088)  0.069 (0.015) 
Trace GP (I)  0.982 (0.001)  0.093 (0.003)  –  – 
KPMF5 (D)  0.993 (0.003)  0.064 (0.012)  1.066 (0.088)  0.062 (0.014) 
PMF5 (I)  0.995 (0.003)  0.052 (0.006)  –  – 
KPMF20 (D)  0.986 (0.001)  0.069 (0.007)  1.066 (0.088)  0.069 (0.015) 
PMF20 (I)  0.989 (0.002)  0.070 (0.003)  –  – 
Epinions dataset
Known users  New users
Model  RMSE  \(\hbox {R}_{@20}\)  RMSE  \(\hbox {R}_{@20}\)
MVGP (D)  0.323 (0.007)  0.016 (0.000)  0.329 (0.020)  0.029 (0.002) 
Con. MVGP (D)  0.273 (0.005)  0.023 (0.001)  0.307 (0.022)  0.036 (0.009) 
Con. MVGP (I)  0.274 (0.006)  0.046 (0.002)  –  – 
Trace GP (D)  0.273 (0.005)  0.022 (0.001)  0.307 (0.022)  0.035 (0.009) 
Trace GP (I)  0.274 (0.006)  0.041 (0.003)  –  – 
KPMF5 (D)  0.274 (0.004)  0.021 (0.002)  0.305 (0.022)  0.036 (0.009) 
PMF5 (I)  0.272 (0.004)  0.023 (0.001)  –  – 
KPMF20 (D)  0.275 (0.005)  0.031 (0.003)  0.306 (0.022)  0.035 (0.007) 
PMF20 (I)  0.273 (0.005)  0.023 (0.001)  –  – 
6 Conclusion
This paper introduces a novel approach for the predictive modeling of low rank transposable data with the matrix-variate Gaussian process. The low rank structure is achieved using nuclear norm constrained inference, recovering a mean function of low rank. We showed that inference for the Gaussian process with the nuclear norm constraint is convex. The proposed approach was applied to the disease-gene association task and to the recommender system task. The proposed model was effective for regression and for ranking with highly imbalanced data, and performed at least as well as (and often significantly better than) state-of-the-art domain-specific baseline models.
Recent work (Yu et al. 2013) characterizing necessary and sufficient conditions for the existence of a representer theorem points to the potential scope of the constrained inference approach combined with nonparametric processes. Thus, we plan to explore other constraint sets in addition to the nuclear norm constraint explored here. We are also interested in exploring covariance constraints, as outlined for Gaussian distributions in Koyejo and Ghosh (2013a), applied to nonparametric processes, and in applications of nonparametric constrained Bayesian inference to more complicated models beyond Gaussian distributions. Finally, we intend to explore the biological implications of these constrained disease-gene association results in collaboration with domain experts.
Footnotes
 1.
The choice to capture either row or column covariances in PCA and GPLVM is not fundamental to these models; i.e., it is primarily a modeling choice.
 2.
See experiments (Sect. 5) for further discussion.
 3.
See (Rasmussen and Williams 2005, Section 2.2) for an analogous proof applied to Gaussian process regression.
 4.
The nuclear norm is also known as the trace norm.
 5.
 6.
 7.
Note that the model evidence term must be added back in order to use the posterior form of the constrained Bayesian inference cost function (3b).
Acknowledgments
The authors acknowledge support from NSF Grant IIS-1016614. We also thank U. Martin Blom and Edward Marcotte for providing the OMIM data set. The authors thank the anonymous reviewers for insightful comments that helped to improve this manuscript.
References
 Abernethy, J., Bach, F., Evgeniou, T., & Vert, J. P. (2009). A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research, 10, 803–826.
 Aerts, S., Lambrechts, D., Maity, S., Van Loo, P., Coessens, B., De Smet, F., et al. (2006). Gene prioritization through genomic data fusion. Nature Biotechnology, 24(5), 537–544.
 Allen, G. I., & Tibshirani, R. (2010). Transposable regularized covariance models with an application to missing data imputation. The Annals of Applied Statistics, 4(2), 764–790.
 Allen, G. I., & Tibshirani, R. (2012). Inference with transposable data: Modelling the effects of row and column correlations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(4), 721–743.
 Altun, Y., & Smola, A. J. (2006). Unifying divergence minimization and statistical inference via convex duality. In: COLT.
 Álvarez, M. A., Rosasco, L., & Lawrence, N. D. (2012). Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3), 195–266.
 Bauer, H. (1996). Probability theory. De Gruyter Studies in Mathematics. Berlin: De Gruyter.
 Berger, A. L., Pietra, V. J. D., & Pietra, S. A. D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 39–71.
 Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. Boston: Kluwer Academic Publishers.
 Bishop, C. M. (2006). Pattern recognition and machine learning (information science and statistics). Secaucus, NJ: Springer.
 Bonilla, E., Chai, K. M., & Williams, C. (2008). Multi-task Gaussian process prediction. In: NIPS, 20, 153–160.
 Borwein, J., & Zhu, Q. (2005). Techniques of variational analysis. CMS Books in Mathematics. Berlin: Springer.
 Candès, E. J., & Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717–772.
 Csató, L. (2002). Gaussian processes: Iterative sparse approximations. PhD thesis, Aston University.
 Dudík, M., Phillips, S. J., & Schapire, R. E. (2007). Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8, 1217–1260.
 Dudík, M., Harchaoui, Z., Malick, J., et al. (2012). Lifted coordinate descent for learning with trace-norm regularization. In: AISTATS 2012, Vol. 22.
 Ganchev, K., & Graça, J. (2010). Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11, 2001–2049.
 Gelfand, A. E., Smith, A. F. M., & Lee, T. M. (1992). Bayesian analysis of constrained parameter and truncated data problems using Gibbs sampling. Journal of the American Statistical Association, 87(418), 523–532.
 Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. In: ICDM '08: Eighth IEEE International Conference on Data Mining, pp. 263–272.
 Jaakkola, T., Meila, M., & Jebara, T. (1999). Maximum entropy discrimination. In: NIPS. MIT Press.
 Jamali, M., & Ester, M. (2010). A matrix factorization technique with trust propagation for recommendation in social networks. In: Proceedings of the fourth ACM conference on recommender systems, pp. 135–142.
 Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42, 30–37.
 Koyejo, O. (2013). Constrained relative entropy minimization with applications to multitask learning. PhD thesis, The University of Texas at Austin.
 Koyejo, O., & Ghosh, J. (2011). A kernel-based approach to exploiting interaction networks in heterogeneous information sources for improved recommender systems. In: Proceedings of the 2nd international workshop on information heterogeneity and fusion in recommender systems, pp. 9–16.
 Koyejo, O., & Ghosh, J. (2013). Constrained Bayesian inference for low rank multitask learning. In: Proceedings of the 29th conference on uncertainty in artificial intelligence (UAI).
 Koyejo, O., & Ghosh, J. (2013). A representation approach for relative entropy minimization with expectation constraints. In: ICML workshop on divergences and divergence learning (WDDL).
 Laue, S. (2012). A hybrid algorithm for convex semidefinite optimization. In: Proceedings of the 29th international conference on machine learning (ICML-12), pp. 177–184.
 Lawrence, N., & Hyvärinen, A. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6, 1783–1816.
 Lawrence, N. D., & Urtasun, R. (2009). Non-linear matrix factorization with Gaussian processes. In: Proceedings of the 26th annual international conference on machine learning, pp. 601–608.
 Lee, I., Blom, U. M., Wang, P. I., Shim, J. E., & Marcotte, E. M. (2011). Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Research, 21(7), 1109–1121.
 Li, L., & Toh, K. C. (2010). An inexact interior point method for l1-regularized sparse covariance selection. Mathematical Programming Computation, 2(3–4), 291–315.
 Li, W. J., & Yeung, D. Y. (2009). Relation regularized matrix factorization. In: Proceedings of the 21st international joint conference on artificial intelligence (IJCAI '09), pp. 1126–1131.
 Li, W. J., Yeung, D. Y., & Zhang, Z. (2009). Probabilistic relational PCA. In: Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, & A. Culotta (Eds.), Advances in neural information processing systems (Vol. 22, pp. 1123–1131).
 Li, W. J., Zhang, Z., & Yeung, D. Y. (2009). Latent Wishart processes for relational kernel learning. In: D. A. V. Dyk & M. Welling (Eds.), AISTATS, pp. 336–343.
 Ma, H., Yang, H., Lyu, M. R., & King, I. (2008). SoRec: Social recommendation using probabilistic matrix factorization. In: Proceedings of the 17th ACM conference on information and knowledge management (CIKM '08), pp. 931–940.
 Maglott, D. R., Ostell, J., Pruitt, K. D., & Tatusova, T. A. (2011). Entrez Gene: Gene-centered information at NCBI. Nucleic Acids Research, 39(Database issue), 52–57.
 Massa, P., & Avesani, P. (2006). Trust-aware bootstrapping of recommender systems. In: ECAI 2006 workshop on recommender systems, pp. 29–33.
 McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J. P., et al. (2008). Genome-wide association studies for complex traits: Consensus, uncertainty and challenges. Nature Reviews Genetics, 9(5), 356–369.
 Mnih, A., & Salakhutdinov, R. (2007). Probabilistic matrix factorization. In: J. C. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.), Advances in neural information processing systems (Vol. 20, pp. 1257–1264).
 Mordelet, F., & Vert, J. P. (2011). ProDiGe: Prioritization of disease genes with multitask machine learning from positive and unlabeled examples. BMC Bioinformatics, 12, 389.
 National Library of Medicine. (2012). Medical subject headings. http://www.nlm.nih.gov/mesh/. Retrieved March 2012.
 National Library of Medicine. (2012). PubMed. http://www.ncbi.nlm.nih.gov/pubmed/. Retrieved March 2012.
 NCBI. (1998). Genes and disease. http://www.ncbi.nlm.nih.gov/books/NBK22183/. Retrieved January 10, 2011.
 Orbanz, P., & Teh, Y. W. (2010). Bayesian nonparametric models. In: C. Sammut & G. I. Webb (Eds.), Encyclopedia of machine learning. Berlin: Springer.
 Pan, R., Zhou, Y., Cao, B., Liu, N. N., Lukose, R., Scholz, M., & Yang, Q. (2008). One-class collaborative filtering. In: ICDM '08: Eighth IEEE International Conference on Data Mining, pp. 502–511.
 Pong, T. K., Tseng, P., Ji, S., & Ye, J. (2010). Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6), 3465–3489.
 Rasmussen, C. E., & Williams, C. K. I. (2005). Gaussian processes for machine learning (adaptive computation and machine learning series). Cambridge, MA: The MIT Press.
 Singh-Blom, U. M., Natarajan, N., Tewari, A., Woods, J. O., Dhillon, I. S., & Marcotte, E. M. (2013). Prediction and validation of gene-disease associations using methods inspired by social network analyses. PLoS One, 8(5), e58977.
 Smola, A. J., & Kondor, R. (2003). Kernels and regularization on graphs. In: B. Schölkopf & M. K. Warmuth (Eds.), Learning theory and kernel machines (pp. 144–158). Berlin: Springer.
 Steck, H. (2010). Training and testing of recommender systems on data missing not at random. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 713–722.
 Steck, H., & Zemel, R. S. (2010). A generalized probabilistic framework and its variants for training top-k recommender systems. In: PRSAT.
 Stegle, O., Lippert, C., Mooij, J. M., Lawrence, N. D., & Borgwardt, K. M. (2011). Efficient inference in matrix-variate Gaussian models with iid observation noise. In: Advances in neural information processing systems (pp. 630–638).
 Sutskever, I., Tenenbaum, J. B., & Salakhutdinov, R. (2009). Modelling relational data using Bayesian clustered tensor factorization. In: Advances in neural information processing systems (pp. 1821–1828).
 Vanunu, O., Magger, O., Ruppin, E., Shlomi, T., & Sharan, R. (2010). Associating genes and protein complexes with disease via network propagation. PLoS Computational Biology, 6(1), e1000641.
 Xu, M., Zhu, J., & Zhang, B. (2012). Nonparametric max-margin matrix factorization for collaborative prediction. Advances in Neural Information Processing Systems, 25, 64–72.
 Xu, Z., Tresp, V., Yu, K., & Kriegel, H. P. (2006). Learning infinite hidden relational models. In: Uncertainty in Artificial Intelligence (UAI 2006).
 Xu, Z., Kersting, K., & Tresp, V. (2009). Multi-relational learning with Gaussian processes. In: Proceedings of the 21st international joint conference on artificial intelligence (IJCAI '09), pp. 1309–1314.
 Yan, F., Xu, Z., & Qi, Y. A. (2011). Sparse matrix-variate Gaussian process blockmodels for network modeling. In: UAI.
 Yu, K., & Chu, W. (2008). Gaussian process models for link analysis and transfer learning. In: NIPS, pp. 1657–1664.
 Yu, K., Chu, W., Yu, S., Tresp, V., & Xu, Z. (2007). Stochastic relational models for discriminative link prediction. In: Advances in neural information processing systems 19 (pp. 1553–1560). Cambridge, MA: MIT Press.
 Yu, Y., Cheng, H., Schuurmans, D., & Szepesvári, C. (2013). Characterizing the representer theorem. In: ICML.
 Zellner, A. (1988). Optimal information processing and Bayes's theorem. The American Statistician, 42(4), 278–280.
 Zhang, X., & Carin, L. (2012). Joint modeling of a matrix with associated text via latent binary features. Advances in Neural Information Processing Systems, 25, 1565–1573.
 Zhou, T., Shan, H., Banerjee, A., & Sapiro, G. (2012). Kernelized probabilistic matrix factorization: Exploiting graphs and side information. In: SDM, pp. 403–414.
 Zhu, J. (2012). Max-margin nonparametric latent feature models for link prediction. In: Proceedings of the 29th international conference on machine learning (ICML-12), pp. 719–726.
 Zhu, J., Ahmed, A., & Xing, E. P. (2009). MedLDA: Maximum margin supervised topic models for regression and classification. In: Proceedings of the 26th annual international conference on machine learning, pp. 1257–1264.
 Zhu, J., Chen, N., & Xing, E. P. (2011). Infinite latent SVM for classification and multi-task learning. In: J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (Vol. 24, pp. 1620–1628).
 Zhu, J., Chen, N., & Xing, E. P. (2012). Bayesian inference with posterior regularization and infinite latent support vector machines. CoRR abs/1210.1766.