
Machine Learning, Volume 107, Issue 3, pp 579–603

1-Bit matrix completion: PAC-Bayesian analysis of a variational approximation

  • Vincent Cottet
  • Pierre Alquier

Abstract

We focus on the completion of a (possibly) low-rank matrix with binary entries, the so-called 1-bit matrix completion problem. Our approach relies on tools from machine learning theory: empirical risk minimization and its convex relaxations. We propose an algorithm to compute a variational approximation of the pseudo-posterior. Thanks to the convex relaxation, the corresponding minimization problem is bi-convex, and thus the method works well in practice. We study the performance of this variational approximation through PAC-Bayesian learning bounds. Contrary to previous works that focused on upper bounds on the estimation error of M with various matrix norms, we are able to derive from this analysis a PAC bound on the prediction error of our algorithm. We focus essentially on convex relaxation through the hinge loss, for which we present a complete analysis, a complete simulation study and a test on the MovieLens data set. We also discuss a variational approximation to deal with the logistic loss.

Keywords

Matrix completion · PAC-Bayesian bounds · Variational Bayes · Supervised classification · Risk convexification · Oracle inequalities

1 Introduction

Motivated by modern applications like recommendation systems and collaborative filtering, video analysis or quantum statistics, the matrix completion problem has been widely studied in recent years. Recovering a matrix is, without any additional information, an impossible task. However, under some assumptions on the structure of the matrix to be recovered, it might become feasible, as shown by Candès and Tao (2010) and Candès and Recht (2012), where the assumption is that the matrix has a small rank. This assumption is natural in many applications. For example, in recommendation systems, it is equivalent to the existence of a small number of hidden features that explain the users' preferences. While Candès and Tao (2010) and Candès and Recht (2012) focused on matrix completion without noise, many authors extended these techniques to the case of noisy observations, see Candès and Plan (2010) and Chatterjee (2015) among others. The main idea in Candès and Plan (2010) is to minimize the least squares criterion, penalized by the rank. This penalization is then relaxed by the nuclear norm, which is the sum of the singular values of the matrix at hand. An efficient algorithm is described in Recht and Ré (2013).

All the aforementioned papers focused on real-valued matrices. However, in many applications, the matrix entries are binary, that is in the set \(\{0,1\}\). For example, in collaborative filtering, we have often only access to a binary choice: the (ij)th entry being 1 means that user i is satisfied by object j while this entry being 0 means that he/she is not satisfied by it. The problem of recovering a binary matrix from partial observations is usually referred as 1-bit matrix completion. To deal with binary observations requires specific estimation methods. Most works on this problem usually assume a generalized linear model (GLM): the observations \(Y_{ij}\) for \(1\le i\le m_1\), \(1\le j\le m_2\), are Bernoulli distributed with parameter \(f(M_{ij})\), where f is a link function which maps from \(\mathbb {R}\) to [0, 1], for example the logistic function \(f(x) = \exp (x)/[1+\exp (x)]\), and M is a \(m_1 \times m_2\) real matrix, see Cai and Zhou (2013), Davenport et al. (2014) and Klopp et al. (2015). In these works, the goal is to recover the matrix M and a convergence rate is then derived. For example, Klopp et al. (2015) provides an estimate \(\widehat{M}\) for which, under suitable assumptions and when the data are generated according to the true model with \(M=M_0\),
$$\begin{aligned} \frac{1}{m_1 m_2} \Vert \widehat{M}-M_0\Vert _{F}^2 \le C \max \left( \sqrt{\frac{\log (m_1+m_2)}{n}},\frac{\max (m_1,m_2) {\mathrm{rank}(M_0)}\log (m_1+m_2)}{n}\right) \end{aligned}$$
for some constant C that depends on the assumptions and the sampling scheme, and where \(\Vert .\Vert _F\) stands for the Frobenius norm [we refer the reader to Corollary 2 page 2955 in Klopp et al. (2015) for the exact statement]. While this result ensures the consistency of \(\widehat{M}\) when \(M_0\) is low-rank, it does not provide any guarantee on the prediction error. Moreover, the results rely on the assumption that the model (in particular the function f) is well specified. In practice, this assumption is unrealistic, and it is important to provide generalization error bounds that hold even in case of misspecification.

Here, we adopt a machine learning point of view: in machine learning, dealing with binary outputs is called a classification problem, for which methods are known that do not assume any model on the observations. That is, instead of focusing on a parametric model for \(Y_{i,j}\), we will only define a set of prediction matrices M and seek the one that leads to the best prediction error. Using the zero-one loss function, we could actually directly use Vapnik's (1998) theory to propose a classifier \(\widehat{M}\) whose risk would be controlled by a PAC inequality. However, this approach is usually computationally intractable. A popular approach is to replace the zero-one loss by a convex surrogate (Zhang 2004), namely, the hinge loss. Our approach is as follows: we propose a pseudo-Bayesian approach, where we define a pseudo-posterior distribution on a set of matrices M. This pseudo-posterior distribution does not have a simple form; however, thanks to a variational approximation, we manage to approximate it by a tractable distribution. Thanks to the PAC-Bayesian theory (McAllester 1998; Herbrich and Graepel 2002; Shawe-Taylor and Langford 2003; Catoni 2004, 2007; Seldin et al. 2012; Dalalyan and Tsybakov 2008), we are able to provide a PAC bound on the prediction risk of this variational approximation. We then show that, due to the convex relaxation of the zero-one loss, the computation of this variational approximation is actually a bi-convex minimization problem. As a consequence, efficient algorithms are available.

Other settings for 1-bit matrix completion have also been studied. For example, in some real-life applications, only positive instances are available. This setting is studied in detail in Hsieh et al. (2015). It requires a different approach. Here, we stick to the classification approach where positive and negative instances are observed. We refer the reader to Hsieh et al. (2015) and the references therein for the positive-only case.

The rest of the paper is organized as follows. In Sect. 2 we provide the notations used in the paper, the definition of the pseudo-posterior and of its variational approximation. In Sect. 3 we give the PAC analysis of the variational approximation. This yields an empirical and a theoretical upper bound on the prediction risk of our method. Sect. 4 provides details on the implementation of our method. Note that in the aforementioned sections, the convex surrogate of the zero-one loss used is the hinge loss. An extension to the logistic loss is briefly discussed in Sect. 5, together with an algorithm to compute the variational approximation. Finally, Sect. 6 is devoted to an empirical study and Sect. 6.3 to an application to the MovieLens data set. The proofs of the theorems of Sect. 3 are provided in the Appendix.

2 Estimation procedure

For any integer m we define \([m]=\{1,\dots ,m\}\); for two real numbers a and b we write \(\max (a,b)= a\vee b\) and \(\min (a,b) = a\wedge b\). We define, for any integers \(m_1\) and \(m_2\) and any matrix \(M \in \mathbb {R}^{m_1\times m_2}\), \(\Vert M \Vert _\text {max} = \max _{(i,j)\in [m_1]\times [m_2]} M_{ij}\). Let \(\mathbb {R}^+\) stand for the set of non-negative real numbers, and \(\mathbb {R}^{+*}\) for the positive real numbers. For any real number a, \((a)_+\) is the positive part of a and is equal to \(\max (0,a)\).

For a pair of matrices (A, B), we write \(\ell (A,B)= \Vert A \Vert _\text {max} \vee \Vert B \Vert _\text {max}\). Finally, when an \(m_1\times m_2\) matrix M has \(\mathrm{rank}(M)=r\), it can be written as \(M=LR^T\) where L is \(m_1\times r\) and R is \(m_2 \times r\). This decomposition is obviously not unique; we put \(\ell (M) = \inf _{(L,R)}\ell (L,R) \) where the infimum is taken over all pairs (L, R) such that \(LR^\top = M\). In frequentist approaches like Klopp et al. (2015), it is common that the upper bound depends on the sup-norm of the entries. This quantity is replaced in our analysis by \(\ell (M)\).

2.1 1-Bit matrix completion as a classification problem

We formally describe the 1-bit matrix completion problem as a classification problem: we observe \((X_k,Y_k)_{k\in [n]}\), n i.i.d. pairs from a distribution \(\mathbf {P}\). The \(X_k\)’s take values in \(\mathscr {X}=[m_1]\times [m_2]\) and the \(Y_k\)’s take values in \(\mathscr {Y}=\{-1,+1\}\). Hence, the k-th observation of an entry of the matrix is \(Y_k\) and the corresponding position in the matrix is provided by \(X_k=(i_k,j_k)\). In this setting, a predictor is a function \([m_1]\times [m_2]\rightarrow \mathbb {R}\) and thus can be represented by a matrix M; for any X, \(M_X\) is the entry of M at location X. It is natural to use M in the following way: when \((X,Y)\sim \mathbf {P}\), M predicts Y by \(\mathrm{sign}(M_X)\). The ability of this predictor to predict a new entry of the matrix is then assessed by the risk
$$\begin{aligned} \mathbf {R}(M) = \mathbb {E}_\mathbf {P}\left[ \mathbb {1}(Y M_X<0)\right] , \end{aligned}$$
and its empirical counterpart is:
$$\begin{aligned} r_n(M) = \frac{1}{n}\sum _{k=1}^n \mathbb {1}(Y_{k} M_{X_k}<0) = \frac{1}{n}\sum _{k=1}^n \mathbb {1}(Y_{k} M_{i_k,j_k}<0) . \end{aligned}$$
It is then possible to use the standard approach in classification theory (Vapnik 1998). For example, the best possible classifier is the Bayes classifier and it relies on the regression function:
$$\begin{aligned} \eta (x) = \mathbb {E}(Y|X=x) \quad \text {or equivalently} \quad \eta (i,j) = \mathbb {E}[Y|X=(i,j)], \end{aligned}$$
and therefore we have an optimal matrix
$$\begin{aligned} M^B_{ij}=\text {sign}[\eta (i,j)] =\text {sign}\Bigl \{\mathbb {E}[Y|X=(i,j)]\Bigr \}. \end{aligned}$$
We define \(\overline{\mathbf {R}} = \inf _M \mathbf {R}(M)= \mathbf {R}(M^B)\), and \(\overline{r_n}=r_n(M^B)\). Note that, clearly, if two matrices \(M^1\) and \(M^2\) are such that, for every (i, j), \(\text {sign}(M^1_{ij})=\text {sign}(M^2_{ij})\), then \(\mathbf {R}(M^1)=\mathbf {R}(M^2)\), and obviously,
$$\begin{aligned} \forall M, \forall (i,j)\in [m_1]\times [m_2], \quad \text {sign}(M_{ij})=M^B_{ij} \quad \Rightarrow \quad r_n(M)=\overline{r_n}. \end{aligned}$$
While the risk \(\mathbf {R}(M)\) has a clear interpretation, its empirical counterpart \(r_n(M)\) usually leads to intractable problems, as it is non-smooth and non-convex. Hence, it is standard to replace the empirical risk by a convex surrogate (Zhang 2004). In this paper, we will mainly deal with the hinge loss, which leads to the following so-called hinge risk and hinge empirical risk:
$$\begin{aligned} R^h(M)&= \mathbb {E}_\mathbf {P}\left[ (1-Y M_X)_+\right] ,\\ r_n^h(M)&= \frac{1}{n}\sum _{k=1}^n (1-Y_k M_{X_k})_+. \end{aligned}$$
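To make these definitions concrete, here is a minimal sketch (in Python/NumPy; the function and variable names are ours, not from the paper) of how the zero-one and hinge empirical risks can be evaluated from the observed triplets \((i_k, j_k, Y_k)\).

```python
import numpy as np

def empirical_risks(M, rows, cols, y):
    """Zero-one and hinge empirical risks of a prediction matrix M.

    rows, cols, y encode the observed triplets (i_k, j_k, Y_k), with y in {-1, +1}.
    """
    scores = y * M[rows, cols]                            # Y_k * M_{X_k}
    r_n = np.mean(scores < 0)                             # r_n(M), zero-one loss
    r_n_hinge = np.mean(np.maximum(0.0, 1.0 - scores))    # r_n^h(M), hinge loss
    return r_n, r_n_hinge
```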
The hinge loss was also used by Srebro et al. (2004) and Herbster et al. (2016) in the 1-bit matrix completion problem, with a different approach leading to different algorithms. Moreover, here, we provide an analysis of the rate of convergence of our method, which is not provided in Srebro et al. (2004). Note that our analysis can be extended to any Lipschitz and convex surrogate, and indeed, we also briefly study the logistic loss in Sect. 5. Still, we prefer to focus only on the hinge loss in the main part of the paper, for its good algorithmic and theoretical properties, cf. Zhang (2004).

Contrary to many recent papers on matrix completion, our approach leads to distribution-free bounds. The marginal distribution of X is not an issue and we do not have to assume a uniform sampling scheme. Following standard notations in matrix completion, we define \(\Omega \) as the set of indices of observed entries: \(\Omega =\{X_1,\dots ,X_n\}\). We will use in the following the sub-sample of \(\left\{ 1,\dots ,n\right\} \) corresponding to a specified row i: \(\Omega _{i,\cdot }=\left\{ l \in [n]:(i,j_l) \in \Omega \right\} \), and the counterpart for a specified column j: \(\Omega _{\cdot , j}=\left\{ l \in [n]:(i_l,j) \in \Omega \right\} \).

2.2 Pseudo-Bayesian estimation

The Bayesian framework has been used several times for matrix completion [see Salakhutdinov and Mnih (2008) and Lim and Teh (2007) and the references therein]. The PAC-Bayesian approach has also been applied successfully to various models [see Mai and Alquier (2015) and Sect. 6 in Seldin and Tishby (2010)]. A common idea in all of these papers is to factorize the matrix into two parts in order to define a prior on low-rank matrices. Rank-adaptivity requires an additional parameter and a hierarchical model; we explain the idea here. Every matrix of rank r can be factorized as
$$\begin{aligned} M=LR^\top , L\in \mathbb {R}^{m_1 \times r}, \quad R \in \mathbb {R}^{m_2 \times r}. \end{aligned}$$
As mentioned in the introduction, the Bayes matrix \(M^B\) is expected to be low-rank, or at least well approximated by a low-rank matrix. However, in practice, we do not know what the rank of this matrix would be. So, we actually write \(M=LR^\top \) with \(L\in \mathbb {R}^{m_1 \times K}\), \(R \in \mathbb {R}^{m_2 \times K}\) for some large enough K. Adaptation with respect to \(r\in [K]\) is obtained by shrinking some columns of L and R to 0. In order to do so, we introduce scale parameters \(\gamma _k\) for the columns of L and R, and let \(\gamma := (\gamma _1,\dots ,\gamma _K)\). We then define the following hierarchical probability distribution:
$$\begin{aligned}&\forall k \in [K], \quad \gamma _k \mathop {\sim }\limits ^{iid}\pi ^\gamma , \end{aligned}$$
(1)
$$\begin{aligned}&\forall (i,j,k) \in [m_1] \times [m_2] \times [K], \quad L_{i,k},R_{j,k}|\gamma \mathop {\sim }\limits ^{indep.}\mathscr {N}(0,\gamma _k), \end{aligned}$$
(2)
$$\begin{aligned}&\text { and } M =LR^\top , \end{aligned}$$
(3)
where the prior distribution on the variances \(\pi ^\gamma \) is yet to be specified. This means that the entries of L and R are normally distributed, but the variance depends on the column index: a large \(\gamma _k\) leads to spread-out values while a small \(\gamma _k\) leads to almost null entries in column k. In most papers \(\pi ^\gamma \) is chosen as an inverse-Gamma distribution because it is conjugate in this model. This kind of hierarchical prior distribution is also very similar to the Bayesian Lasso developed in Park and Casella (2008) and in particular to the Bayesian group Lasso developed in Kyung et al. (2010), in which the variance term is Gamma distributed. We will show that the Gamma distribution is a possible alternative in matrix completion, both for theoretical results and practical considerations. Thus all the results in this paper are stated under the assumption that \(\pi ^\gamma \) is either the Gamma or the inverse-Gamma distribution: \(\pi ^\gamma =\Gamma (\alpha ,\beta )\), or \(\pi ^\gamma =\Gamma ^{-1}(\alpha ,\beta )\).
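For intuition, the hierarchical prior (1)–(3) is straightforward to simulate; the sketch below (Python/NumPy, names ours) treats \(\beta \) as a rate parameter, consistent with the densities used later in Sect. 4.2.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior(m1, m2, K, alpha, beta, prior="inverse-gamma"):
    """Draw (L, R, M = L R^T) from the hierarchical prior (1)-(3); illustrative only."""
    if prior == "gamma":
        # gamma_k ~ Gamma(alpha, rate beta)
        gamma = rng.gamma(shape=alpha, scale=1.0 / beta, size=K)
    else:
        # gamma_k ~ Inverse-Gamma(alpha, beta): reciprocal of a Gamma(alpha, rate beta)
        gamma = 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta, size=K)
    L = rng.normal(0.0, np.sqrt(gamma), size=(m1, K))  # column k has variance gamma_k
    R = rng.normal(0.0, np.sqrt(gamma), size=(m2, K))
    return L, R, L @ R.T
```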
Let \(\theta \) denote the parameter \(\theta =(L,R,\gamma )\) and \(\pi \) denote the prior distribution defined in (1). Following the aforementioned papers in PAC-Bayesian theory, we define the pseudo-posterior as follows:
$$\begin{aligned} \widehat{\rho }_\lambda (d\theta ) = \frac{\exp \left[ -\lambda r_n^h(LR^\top )\right] }{\int \exp \left[ -\lambda r_n^h\right] d\pi }\pi (d\theta ) \end{aligned}$$
where \(\lambda >0\) is a parameter to be fixed by the statistician. The calibration of \(\lambda \) is discussed below. This distribution is close to a classic posterior distribution but the likelihood has been replaced by the pseudo-likelihood \(\exp [ -\lambda r_n^h(LR^\top )]\) based on the hinge empirical risk.

2.3 Variational Bayes approximations

Unfortunately, the pseudo-posterior is intractable and MCMC methods may be too expensive because of the dimension of the parameter. We therefore use a Variational Bayes approximation, that is, we seek an approximation of the pseudo-posterior that can be computed by efficient optimization algorithms (Bishop 2006). First, we fix a subset \(\mathscr {F}\) of the set of all distributions on the parameter space. The class \(\mathscr {F}\) should be large enough to contain a good approximation of \(\widehat{\rho }_\lambda \), but not too large, in order to keep optimization in \(\mathscr {F}\) feasible. The VB approximation is then defined by
$$\begin{aligned} \arg \min _{\rho \in \mathscr {F}} \mathscr {K}(\rho ,\widehat{\rho }_\lambda ), \text { where } \mathscr {K}(\rho ,\widehat{\rho }_\lambda ) = \int \log \left( \frac{\mathrm{d} \rho }{\mathrm{d}\widehat{\rho }_\lambda } \right) \mathrm{d} \rho \end{aligned}$$
is the Kullback–Leibler divergence between \(\rho \) and \(\widehat{\rho }_\lambda \). When \(\mathscr {K}(\rho ,\widehat{\rho }_\lambda )\) is not available in closed form, it is usual to replace it by an upper bound. Following the classical approach with matrix factorization priors, as in Lim and Teh (2007), we define here the class \(\mathscr {F}\) as follows:
$$\begin{aligned} \mathscr {F} =&\Biggl \lbrace \rho (\mathrm{d} L,\mathrm{d}R,\mathrm{d}\gamma ) = \prod _{k=1}^{K} \left[ \prod _{i=1}^{m_1} \varphi (L_{i,k};L_{i,k}^0,v_{i,k}^L) \mathrm{d} L_{i,k} \prod _{j=1}^{m_2} \varphi \left( R_{j,k};R_{j,k}^0,v_{j,k}^R\right) \mathrm{d} R_{j,k} \rho ^{\gamma _k}(d\gamma _k)\right] , \\&L^0\in \mathbb {R}^{m_1\times K}, R^0\in \mathbb {R}^{m_2\times K}, v^L \in \mathbb {R}_+^{m_1\times K},v^R \in \mathbb {R}_+^{m_2\times K} \Biggr \rbrace , \end{aligned}$$
where \(\varphi (.;\mu ,v)\) is the density of the Gaussian distribution with parameters \((\mu ,v)\) and \(\rho ^{\gamma _k}\) ranges over all possible probability distributions for \(\gamma _k\in \mathbb {R}^+\). VB approximations are referred to as parametric when \(\mathscr {F}\) is finite dimensional and as mean-field otherwise. Here we actually use a mixed approach. Informally, under \(\rho \in \mathscr {F}\), all the coordinates are independent and the variational distribution of the entries of L and R is specified. The free variational parameters to be optimized are the means and the variances. We will show below that the optimization with respect to \(\rho ^{\gamma _k}\) is available in closed form. Also, note that any probability distribution \(\rho \in \mathscr {F}\) is uniquely determined by \(L^0\), \(R^0\), \(v^L\), \(v^R\) and \(\rho ^{\gamma _1},\dots ,\rho ^{\gamma _K}\). We could actually use the notation \(\rho = \rho _{L^0,R^0,v^L,v^R,\rho ^{\gamma _1}, \dots ,\rho ^{\gamma _K}}\), but it would be too cumbersome, so we will avoid it as much as possible. Conversely, once \(\rho \) is given in \(\mathscr {F}\), we can define \(L^0 = \mathbb {E}_\rho [L]\), \(R^0 = \mathbb {E}_\rho [R]\) and so on.
It is well-known that the Kullback divergence can be decomposed as
$$\begin{aligned} \mathscr {K}(\rho ,\widehat{\rho }_\lambda ) = \lambda \int r_n^h d\rho + \mathscr {K}(\rho ,\pi ) + \log \int \exp [-\lambda r_n^h] d\pi \end{aligned}$$
(4)
but the first term in the right-hand side is not tractable here. We then use the Lipschitz property of the loss and derive an upper bound of the Kullback divergence, for any \(\rho \in \mathscr {F}\), which is explicit in the parameters of \(\rho \). It is this quantity that we will optimize in the algorithm, and we will see in the next section that this estimate enjoys good properties. We remind the reader that all the proofs are postponed to the Appendix.

Proposition 1

For any \(\rho =\rho _{L^0,R^0,v^L,v^R,\rho ^{\gamma _1}, \dots ,\rho ^{\gamma _K}}\in \mathscr {F}\),
$$\begin{aligned} \int r_n^h d\rho + \frac{1}{\lambda }\mathscr {K}(\rho ,\pi ) \le r_n^h\left( \mathbb {E}_\rho [L] \mathbb {E}_\rho [R]^\top \right) + \mathscr {R}(\rho ,\lambda ) \end{aligned}$$
(5)
where
$$\begin{aligned} \mathscr {R}(\rho ,\lambda ) =&\frac{1}{n} \sum _{h=1}^n \sum _{k=1}^K \left[ \sqrt{v^L_{i_h,k}\frac{2}{\pi }}\sqrt{v^R_{j_h,k}\frac{2}{\pi }} + |R_{j_h,k}^0|\sqrt{v^L_{i_h,k}\frac{2}{\pi }}+ |L_{i_h,k}^0| \sqrt{v^R_{j_h,k}\frac{2}{\pi }} \right] \\&+ \frac{1}{\lambda } \left\{ \frac{1}{2}\sum _{k=1}^K \mathbb {E}_{\rho }\left[ \frac{1}{\gamma _k} \right] \left( \sum _{i=1}^{m_1} \left( v^L_{i,k}+\left( L^0_{i,k}\right) ^2\right) + \sum _{j=1}^{m_2} \left( v^R_{j,k}+\left( R^0_{j,k}\right) ^2\right) \right) \right. \\&\left. - \frac{1}{2}\sum _{k=1}^K \left( \sum _{i=1}^{m_1} \log v^L_{i,k} + \sum _{j=1}^{m_2} \log v^R_{j,k}\right) + \sum _{k=1}^K \left[ \mathscr {K}(\rho ^{\gamma _k},\pi ^\gamma )\right. \right. \\&\left. \left. + \frac{m_1+m_2}{2}\left( \mathbb {E}_\rho \left[ \log \gamma _k \right] -1\right) \right] \right\} . \end{aligned}$$
Note that the explicit expression of our upper bound \(\mathscr {R}(\rho ,\lambda )\) is very cumbersome, to say the least. A few comments are in order. First, this upper bound is explicit and can be computed easily. Hence, instead of minimizing the Kullback divergence, it is this term that we minimize in practice. Then, our theoretical analysis will show that the upper bound is acceptable in the sense that its minimization will lead to a small generalization error. But it is actually possible to understand at first sight why the minimization of \(r_n^h\left( \mathbb {E}_\rho [L] \mathbb {E}_\rho [R]^\top \right) + \mathscr {R}(\rho ,\lambda )\) works well. Indeed, assume that a matrix M with \(r_n^h(M)=0\) satisfies \(\mathrm{rank}(M) = r \ll K\). Then, it is possible to decompose M as a product \(M = L^0 (R^0)^\top \) with \(L^0_{i,k}=R^0_{j,k}=0\) when \(r<k\le K\). So, the sum
$$\begin{aligned} \frac{1}{2 \lambda } \sum _{k=1}^K \mathbb {E}_{\rho }\left[ \frac{1}{\gamma _k} \right] \left( \sum _{i=1}^{m_1} \left( L^0_{i,k}\right) ^2 + \sum _{j=1}^{m_2} \left( R_{j,k}^0\right) ^2 \right) \end{aligned}$$
has actually only \(r(m_1+m_2)= \mathrm{rank}(M) (m_1 + m_2)\) non-null terms. To minimize \( r_n^h\left( \mathbb {E}_\rho [L] \mathbb {E}_\rho [R]^\top \right) + \mathscr {R}(\rho ,\lambda )\) is thus related to penalized risk minimization with a penalty proportional to the rank, as in most frequentist approaches (Klopp et al. 2015).

The quantity \(r_n^h\left( \mathbb {E}_\rho [L] \mathbb {E}_\rho [R]^\top \right) + \mathscr {R}(\rho ,\lambda )\) will be referred to as the Approximate Variational Bound (AVB) of \(\rho \) in the following. We are now able to define our estimate.

Definition 1

For a fixed \(\lambda >0\) we put
$$\begin{aligned} AVB(\rho ,\lambda )&= r_n^h\left( \mathbb {E}_\rho [L] \mathbb {E}_\rho [R]^\top \right) + \mathscr {R}(\rho ,\lambda ), \nonumber \\ \widetilde{\rho }_\lambda&= \arg \min _{\rho \in \mathscr {F}} AVB(\rho ,\lambda ). \end{aligned}$$
(6)
Also, when explicit notations involving \(L^0,R^0,v^L,v^R,\rho ^{\gamma _1}, \dots ,\rho ^{\gamma _K}\) are necessary we will use the notation
$$\begin{aligned} AVB(L^0,R^0,v^L,v^R,\rho ^{\gamma _1}, \dots ,\rho ^{\gamma _K},\lambda )=AVB(\rho _{L^0,R^0,v^L,v^R,\rho ^{\gamma _1}, \dots ,\rho ^{\gamma _K}},\lambda ). \end{aligned}$$

In the next section, we study the theoretical properties of our estimate. The main result is that the minimizer \(\widetilde{\rho }_\lambda \) of the \(AVB(\rho ,\lambda )\) has a small prediction risk for a well chosen \(\lambda \). We also provide an algorithm that computes \(\widetilde{\rho }_\lambda \) and show on simulations that it behaves well in practice.

3 PAC analysis of the variational approximation

Alquier et al. (2015) propose a general framework for analyzing the prediction properties of VB approximations of pseudo-posteriors based on PAC-Bayesian bounds. In this section, we apply this method to derive a control of the out-of-sample prediction risk \(\mathbf {R}\) for our approximation \(\widetilde{\rho }_\lambda \).

3.1 Empirical bound

The first result is a so-called empirical bound: it provides an upper bound on the prediction risk of the variational approximation \(\widetilde{\rho }_\lambda \) that depends only on the data and on quantities defined by the statistician.

Lemma 1

For any \(\epsilon \in (0,1)\), with probability at least \(1-\epsilon \) on the drawing of the sample, for any \(\rho \in \mathscr {F}\),
$$\begin{aligned} \int \mathbf {R} d\rho \le r_n^h\left( \mathbb {E}_\rho [L] \mathbb {E}_\rho [R]^\top \right) + \mathscr {R}(\rho ,\lambda )+ \frac{\lambda }{2n} + \frac{\log \frac{1}{\epsilon }}{\lambda } = AVB(\rho ,\lambda ) + \frac{\lambda }{2n} + \frac{\log \frac{1}{\epsilon }}{\lambda }. \end{aligned}$$

This shows that our strategy to minimize \(AVB(\rho ,\lambda )\) is indeed the minimization of an empirical upper bound on the prediction risk, a standard approach in PAC-Bayesian theory. An immediate consequence of Lemma 1 and of the definition of \(\widetilde{\rho }_\lambda \) is the following theorem.

Theorem 1

For any \(\epsilon \in (0,1)\), with probability at least \(1-\epsilon \) on the drawing of the sample,
$$\begin{aligned} \int \mathbf {R} d\widetilde{\rho }_\lambda \le \inf _{\rho \in \mathscr {F}} AVB(\rho ,\lambda ) + \frac{\lambda }{2n} + \frac{\log \frac{1}{\epsilon }}{\lambda } \end{aligned}$$

Even though the bound in the right-hand side may be evaluated in practice, and thus may provide a numerical guarantee on the out-of-sample prediction risk, it is not very clear how it depends on the parameters. The following corollary of Theorem 1 will clarify things. It is obtained by deriving upper bounds of \(AVB(\rho ,\lambda )\) (once again, the proof is provided explicitly in the Appendix).

Corollary 1

Assume that \(\lambda \le n\). For any \(\epsilon \in (0,1)\), with probability at least \(1-\epsilon \):
$$\begin{aligned} \int \mathbf {R} d\widetilde{\rho }_\lambda \le \inf _{M} \left[ r_n^h\left( M \right) + \mathscr {C}_{\pi ^\gamma } \frac{\mathrm{rank}(M) (m_1+m_2)[\log n+\ell ^2 (M) ]}{\lambda }\right] + \frac{\lambda }{2n} + \frac{\log \frac{1}{\epsilon }}{\lambda } \end{aligned}$$
where the constant \(\mathscr {C}_{\pi ^\gamma }\) is explicitly known, and depends only on the form of the prior \(\pi ^\gamma \) (Gamma or Inverse-Gamma) and on its hyperparameters.

An exact value for \(\mathscr {C}_{\pi ^\gamma }\) can be deduced from the proof. It is thus clear that the algorithm performs a trade-off between the fit to the data, through the term \(r_n^h(M)\), and the rank of M.

In addition to empirical bounds, it is necessary to provide so-called theoretical bounds, proving that the risk of \(\widetilde{\rho }_\lambda \) will indeed converge to the Bayes risk when the sample size grows. This is the goal of the next subsection.

3.2 Theoretical bound

For this type of theoretical analysis, it is common in classification to make an additional assumption on \(\mathbf {P}\) which leads to an easier task and therefore to better rates of convergence. We propose a definition adapted from Mammen and Tsybakov (1999).

Definition 2

The Mammen and Tsybakov margin assumption is satisfied when there is a constant C such that, for any matrix M:
$$\begin{aligned} \mathbb {E}\left[ \left( \mathbb {1}_{Y M_X\le 0} - \mathbb {1}_{Y M^B_X\le 0} \right) ^2\right] \le C[\mathbf {R}(M)-\overline{\mathbf {R}}]. \end{aligned}$$
It is known that if there is a constant \(t>0\) such that \(\mathbb {P}(0<|\eta (X)|< t) =0 \) then the margin assumption is satisfied with some C that depends on t. For example, in the noiseless case where \(Y=M^B_X\) almost surely, which corresponds to \(t=1\), then
$$\begin{aligned} \mathbb {E}\left[ \left( \mathbb {1}_{Y M_X\le 0} - \mathbb {1}_{Y M^B_X\le 0} \right) ^2\right] = \mathbb {E}\left[ \mathbb {1}_{Y M_X\le 0}^2\right] = \mathbb {E}\left[ \mathbb {1}_{Y M_X\le 0}\right] = \mathbf {R}(M) = \mathbf {R}(M)-\overline{\mathbf {R}}, \end{aligned}$$
so the margin assumption is satisfied with \(C=1\).

We are now ready to state our theoretical bound. It makes a link between the integrated risk of the estimator and the lowest possible risk, which is reached by the Bayes classifier \(M^B\). In contrast to the empirical bound, it involves non-observable quantities, depending on \(M^B\), in the right-hand side.

Theorem 2

Assume that the Mammen and Tsybakov margin assumption is satisfied for a given constant \(C>0\). Then, for any \(\epsilon \in (0,1)\) and for \(\lambda = s n/C\), \(s\in (0,1)\), with probability at least \(1-2\epsilon \),
$$\begin{aligned} \int \mathbf {R} d\widetilde{\rho }_\lambda \le 2(1+3s)\overline{\mathbf {R}} + \mathscr {C}_{C,s,\pi ^\gamma } \left( \frac{\mathrm{rank}(M^B) (m_1+m_2)[\log n+\ell ^2(M^B)]+\log \left( \frac{1}{\epsilon }\right) }{n} \right) \end{aligned}$$
where \(\mathscr {C}_{C,s,\pi ^\gamma }\) is known and depends only on s, C and \(\pi ^\gamma \).

Note the adaptive nature of this result, in the sense that the estimator does not depend on \(\mathrm{rank}(M^B)\). Clearly, when \(\mathrm{rank}(M^B)\) is small, the prediction error will be close to the Bayes error \(\overline{\mathbf {R}}\) even for small sample sizes. This type of inequality is often referred to as an ‘oracle inequality’, in the sense that our estimator behaves as well as if we knew the rank of \(M^B\) through an oracle.

Corollary 2

In the noiseless case \(Y=\mathrm{sign}(M^B_X)\) a.s., for any \(\epsilon >0\) and for \(\lambda = 2 n\), with probability at least \(1-2\epsilon \),
$$\begin{aligned} \int \mathbf {R} d\widetilde{\rho }_\lambda \le \mathscr {C}'_{\pi ^\gamma }\left[ \frac{\mathrm{rank}(M^B)(m_1+m_2)[\log n+\ell ^2(M^B)] + \log \frac{1}{\epsilon }}{n}\right] \end{aligned}$$
(7)
where \(\mathscr {C}'_{\pi ^\gamma }=\mathscr {C}_{1,\frac{1}{4},1,\pi ^\gamma }\).

Remark 1

Note that an empirical inequality comparable to Corollary 1 appears in Srebro et al. (2004). In both cases, the dependence of the bounds with respect to n is \(1/\sqrt{n}\) (take \(\lambda = \sqrt{n}\) in Corollary 1). One notable difference is that our bound also provides an explicit dependence on the rank, which is not the case in Srebro et al. (2004).
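For instance, taking \(\lambda =\sqrt{n}\) (which satisfies \(\lambda \le n\)) in Corollary 1 gives, with probability at least \(1-\epsilon \),
$$\begin{aligned} \int \mathbf {R} d\widetilde{\rho }_\lambda \le \inf _{M} \left[ r_n^h\left( M \right) + \mathscr {C}_{\pi ^\gamma } \frac{\mathrm{rank}(M) (m_1+m_2)[\log n+\ell ^2 (M) ]}{\sqrt{n}}\right] + \frac{1}{2\sqrt{n}} + \frac{\log \frac{1}{\epsilon }}{\sqrt{n}}, \end{aligned}$$
which makes the \(1/\sqrt{n}\) dependence explicit.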

In addition to this, theoretical inequalities like Theorem 2 and Corollary 2 are completely new results. They allow us to compare the out-of-sample error of our predictor to the optimal one. They show that the rate is \(\mathrm{rank}(M^B) (m_1+m_2)/n\) up to log terms. This cannot be improved, as this rate is known to be minimax optimal (Alquier et al. 2017).

Remark 2

Determining the tuning parameter \(\lambda \) is not an easy task in practice: even though there are values that lead to the theoretical bounds above, it is more efficient in practice to use cross-validation. We use this technique in the empirical results section.

4 Algorithm

4.1 General algorithm

The minimization problem (6) that defines our VB approximation is not straightforward:
$$\begin{aligned} \min _{L^0,R^0,v^L,v^R,\rho ^{\gamma _1}, \dots ,\rho ^{\gamma _K}} AVB(L^0,R^0,v^L,v^R,\rho ^{\gamma _1}, \dots ,\rho ^{\gamma _K},\lambda ). \end{aligned}$$
When \(v^L\), \(v^R\) and all the \(\rho ^{\gamma _k}\)’s are fixed, this is actually the canonical example of so-called biconvex problems: it is convex with respect to \(L^0\), and with respect to \(R^0\), but not with respect to the pair \((L^0,R^0)\). Such problems are notoriously difficult. In this case, alternating blockwise optimization seems to be an acceptable strategy. While there is no guarantee that the algorithm will not get stuck in a local minimum (or even in a singular point that is actually not a minimum), it seems to give very good results in practice, and no efficient alternative is available. We refer the reader to the discussion in Subsection 9.2 page 76 of Boyd et al. (2011) for more details on this problem.
Our strategy is as follows. We update \(L^0,R^0,v^L,v^R,\rho ^{\gamma _1}, \dots ,\rho ^{\gamma _K}\) iteratively: for \(L^0\) and \(R^0\) we use a gradient step, while for \(v^L,v^R,\rho ^{\gamma _1},\dots ,\rho ^{\gamma _K}\) an explicit minimization is available. The details for the mean-field optimization (that is, w.r.t. \(\rho ^{\gamma _1},\dots ,\rho ^{\gamma _K}\)) are given in Sect. 4.2. See Algorithm 1 for the general version of the algorithm.
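To fix ideas, here is a heavily simplified sketch (Python/NumPy, names ours) of the alternating update of the means \(L^0\) and \(R^0\): a subgradient step on the hinge term of the AVB plus the quadratic prior term, with the variances and the \(\rho ^{\gamma _k}\)'s (hence \(\mathbb {E}_\rho [1/\gamma _k]\)) held fixed. The terms of \(\mathscr {R}(\rho ,\lambda )\) that couple means and variances are omitted for readability, so this is only an illustration of the bi-convex structure, not the full Algorithm 1.

```python
import numpy as np

def hinge_subgrad_step(L0, R0, rows, cols, y, inv_gamma, lam, step):
    """One subgradient step on L0, with R0, the variances and gamma fixed."""
    n = len(y)
    margins = y * np.einsum("nk,nk->n", L0[rows], R0[cols])   # y_l (L0 R0^T)_{X_l}
    active = margins < 1.0                                     # entries where the hinge is active
    grad = np.zeros_like(L0)
    # subgradient of (1/n) sum_l (1 - y_l (L0 R0^T)_{X_l})_+ with respect to L0
    np.add.at(grad, rows[active], -(y[active, None] * R0[cols[active]]) / n)
    # gradient of (1/(2*lam)) sum_k E[1/gamma_k] * ||L0[:, k]||^2
    grad += (inv_gamma[None, :] / lam) * L0
    return L0 - step * grad

def alternate(L0, R0, rows, cols, y, inv_gamma, lam, step, n_iter=200):
    """Alternating blockwise scheme: the objective is convex in L0 and in R0 separately."""
    for _ in range(n_iter):
        L0 = hinge_subgrad_step(L0, R0, rows, cols, y, inv_gamma, lam, step)
        R0 = hinge_subgrad_step(R0, L0, cols, rows, y, inv_gamma, lam, step)
    return L0, R0
```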

4.2 Mean field optimization

As the pseudo-likelihood does not involve the parameters \((\gamma _1,\dots ,\gamma _K)\), the variational distribution can be optimized in the same way as in Lim and Teh (2007) where the noise is Gaussian. The general update formula is:
$$\begin{aligned} \rho ^{\gamma _k}(\gamma _k)&\propto \exp \mathbb {E}_{\rho ^{-\gamma _k}} \left( \log \pi (L,R,\gamma ) \right) \\&\propto \exp \mathbb {E}_{\rho ^{-\gamma _k}} \left( \log [\pi (L|\gamma )\pi (R|\gamma )\pi ^\gamma (\gamma )] \right) \\&\propto \exp \left\{ \sum _{i=1}^{m_1} \mathbb {E}_{\rho ^L} \left[ \log \pi (L_{i,k}|\gamma _k) \right] + \sum _{j=1}^{m_2}\mathbb {E}_{\rho ^R} \left[ \log \pi (R_{j,k}|\gamma _k) \right] + \log \pi ^\gamma \left( \gamma _k \right) \right\} \\&\propto \exp \left\{ -\frac{m_1+m_2}{2} \log \gamma _k - \frac{1}{\gamma _k} \mathbb {E}_\rho \left[ \frac{ \sum _{i=1}^{m_1} L_{i,k}^2 + \sum _{j=1}^{m_2} R_{j,k}^2}{2}\right] + \log \pi ^\gamma \left( \gamma _k \right) \right\} \end{aligned}$$
where \(\rho ^{-\gamma _k}\) stands for the marginal distribution of \((\gamma _{k'})_{k'\ne k}\) under \(\rho \). The solution then depends on \(\pi ^\gamma \). In what follows we derive explicit formulas for \(\rho ^{\gamma _k}\) according to the choice of \(\pi ^\gamma \): we remind the reader that \(\pi ^\gamma \) could be either a Gamma distribution, or an Inverse-Gamma distribution.

4.2.1 Inverse-gamma prior

The conjugate prior for this part of the model is the inverse-Gamma distribution. The prior of \(\gamma _k\) is \(\pi ^{\gamma }= \Gamma ^{-1}(\alpha ,\beta )\) and its density is:
$$\begin{aligned} \pi ^\gamma (\gamma _k;\alpha ,\beta )=\frac{\beta ^\alpha }{\Gamma (\alpha )}\gamma _k^{ -\alpha -1}\exp \left( -\frac{\beta }{\gamma _k} \right) \mathbb {1}_{\mathbb {R}^+}(\gamma _k). \end{aligned}$$
The moments we need to develop the algorithm and to compute the empirical bound are:
$$\begin{aligned} \mathbb {E}_{\pi ^\gamma }(\log \gamma _k) = \log \beta - \psi (\alpha )\text {, and } \mathbb {E}_{\pi ^\gamma }(1/\gamma _k) = \frac{\alpha }{\beta }, \end{aligned}$$
where \(\psi \) is the digamma function. Therefore, we get:
$$\begin{aligned} \rho ^{\gamma _k}(\gamma _k)&\propto \exp \left\{ -\left( \frac{m_1+m_2}{2}+\alpha +1\right) \log \gamma _k- \frac{1}{\gamma _k}\left( \mathbb {E}_\rho \left[ \frac{ \sum _{i=1}^{m_1} L_{i,k}^2 + \sum _{j=1}^{m_2} R_{j,k}^2}{2}\right] +\beta \right) \right\} , \\ \end{aligned}$$
so we can conclude that:
$$\begin{aligned} \rho ^{\gamma _k}=\Gamma ^{-1}\left( \frac{m_1+m_2}{2}+ \alpha ,\mathbb {E}_\rho \left[ \frac{ \sum _{i=1}^{m_1} L_{i,k}^2 + \sum _{j=1}^{m_2} R_{j,k}^2}{2}\right] +\beta \right) . \end{aligned}$$
(8)
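The update (8) is straightforward to implement. The sketch below (names ours) also returns the moment \(\mathbb {E}_{\rho ^{\gamma _k}}[1/\gamma _k]\), which for an Inverse-Gamma distribution with shape a and scale b equals a / b and is the quantity needed in the other updates and in the bound of Proposition 1.

```python
import numpy as np

def update_gamma_invgamma(L0, vL, R0, vR, alpha, beta):
    """Closed-form update (8) of rho^{gamma_k} for the Inverse-Gamma prior."""
    m1, K = L0.shape
    m2, _ = R0.shape
    # E_rho[ (sum_i L_{i,k}^2 + sum_j R_{j,k}^2) / 2 ] under the Gaussian factors
    s = 0.5 * ((L0**2 + vL).sum(axis=0) + (R0**2 + vR).sum(axis=0))
    a_post = (m1 + m2) / 2.0 + alpha      # shape, identical for every k
    b_post = s + beta                     # scale, one value per column k
    e_inv_gamma = a_post / b_post         # E[1/gamma_k] under Gamma^{-1}(a_post, b_post)
    return a_post, b_post, e_inv_gamma
```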

4.2.2 Gamma prior

Even though it seems that this fact was not used in prior works on Bayesian matrix estimation, it is also possible to derive explicit formulas when the prior \(\pi ^\gamma \) on \(\gamma _k\)’s is a \(\Gamma (\alpha ,\beta )\) distribution. In this case, \(\rho ^{\gamma _k}\) is given by
$$\begin{aligned} \rho ^{\gamma _k}(\gamma _k)&\propto \exp \left\{ \left( \alpha -\frac{m_1+m_2}{2}-1\right) \log \gamma _k - \beta \gamma _k -\frac{1}{\gamma _k}\mathbb {E}_\rho \left[ \frac{ \sum _{i=1}^{m_1} L_{i,k}^2 + \sum _{j=1}^{m_2} R_{j,k}^2}{2}\right] \right\} . \end{aligned}$$
We remind the reader that the Generalized Inverse Gaussian distribution is a three-parameter family of distributions over \(\mathbb {R}^{+*}\), written \(GIG(a, b,\eta )\). Its density is given by:
$$\begin{aligned} f(x;a,b,\eta )&= \frac{(a / b)^{\eta /2}}{2K_\eta (\sqrt{a b})} x^{\eta -1}\exp \left( -\frac{1}{2}(a x + b x^{-1}) \right) , \end{aligned}$$
where \(K_\eta \) is the modified Bessel function of the second kind.
The variational distribution \(\rho ^{\gamma _k}\) is in consequence \(GIG(a_k,b_k,\eta _k)\) with:
$$\begin{aligned} a_k&= 2\beta , \quad b_k = \mathbb {E}_\rho \left[ \frac{ \sum _{i=1}^{m_1} L_{i,k}^2 + \sum _{j=1}^{m_2} R_{j,k}^2}{2}\right] , \quad \eta _k =\alpha -\frac{m_1+m_2}{2}. \end{aligned}$$
The moment we need in order to compute the variational distribution of (L, R) is:
$$\begin{aligned} \mathbb {E}_{\rho ^{\gamma _k}}\left( \frac{1}{\gamma _k} \right) = \frac{K_{\eta _k-1}(\sqrt{a_k b_k})}{K_{\eta _k}(\sqrt{a_k b_k})} \sqrt{\frac{a_k}{b_k}}. \end{aligned}$$
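This moment is available through standard special-function routines. A possible sketch (names ours) relying on scipy.special.kv, the modified Bessel function of the second kind:

```python
import numpy as np
from scipy.special import kv   # modified Bessel function of the second kind

def gig_inv_moment(a, b, eta):
    """E[1/gamma] under GIG(a, b, eta), as given above."""
    z = np.sqrt(a * b)
    return kv(eta - 1.0, z) / kv(eta, z) * np.sqrt(a / b)

def update_gamma_gammaprior(L0, vL, R0, vR, alpha, beta):
    """GIG variational factors for a Gamma(alpha, beta) prior on the gamma_k's (sketch)."""
    m1, K = L0.shape
    m2, _ = R0.shape
    b_k = 0.5 * ((L0**2 + vL).sum(axis=0) + (R0**2 + vR).sum(axis=0))
    a_k = 2.0 * beta * np.ones(K)
    eta_k = (alpha - (m1 + m2) / 2.0) * np.ones(K)
    return a_k, b_k, eta_k, gig_inv_moment(a_k, b_k, eta_k)
```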

5 Logistic model

As mentioned in Zhang (2004), the hinge loss is not the only possible convex relaxation of the zero-one loss. The logistic loss \(\mathrm{logit}(u)=\log [1+\exp (-u)]\) can also be used [even though it might lead to a loss in the rate of convergence of the risk to the Bayes risk (Zhang 2004)]. This leads to the definitions:
$$\begin{aligned} \mathbf {R}^\ell (M)&= \mathbb {E}_\mathbf {P}\left[ \mathrm{logit}(Y M_X) \right] ,\\ r_n^\ell (M)&= \frac{1}{n}\sum _{k=1}^n \mathrm{logit}(Y_k M_{X_k}). \end{aligned}$$
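As for the hinge loss in Sect. 2, the empirical logistic risk is immediate to evaluate over the observed entries; a small sketch (names ours):

```python
import numpy as np

def logistic_empirical_risk(M, rows, cols, y):
    """Empirical logistic risk r_n^l(M) over the observed triplets (i_k, j_k, Y_k)."""
    scores = y * M[rows, cols]
    return np.mean(np.logaddexp(0.0, -scores))   # log(1 + exp(-u)), numerically stable
```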
In this case, the pseudo-likelihood \(\exp (-\lambda r_n^\ell (M))\), if \(\lambda =n\), is exactly equal to the likelihood of the logistic model:
$$\begin{aligned} Y|X = {\left\{ \begin{array}{ll} 1 &{} \text { with probability } \sigma (M_X) \\ -1 &{} \text { with probability } 1-\sigma (M_X) \end{array}\right. } \end{aligned}$$
where \(\sigma \) is the link function \(\sigma (x) = \frac{\exp (x)}{1+\exp (x)}\). The likelihood is written \(\Lambda (L,R)=\prod _{l=1}^{n}\sigma (Y_l (LR^\top )_{X_l})\). The prior distribution is exactly the same as in the previous sections and the object of interest is the posterior distribution:
$$\begin{aligned} \widehat{\rho _l}(d\theta )=\frac{\Lambda (L,R)\pi (d\theta )}{\int \Lambda (L,R)\pi (d\theta )}. \end{aligned}$$
In order to deal with large matrices, it remains interesting to develop a variational Bayes algorithm. However, it is not as simple as in the quadratic loss model of Lim and Teh (2007), in which the authors develop a mean-field approximation, because the logistic likelihood leads to intractable update formulas. A common way to deal with this model is to maximize another quantity which is very close to the one we are interested in. The principle, coming from Jaakkola and Jordan (2000), is well explained in Bishop (2006) and an extended example can be found in Latouche et al. (2015).
We consider a mean-field approximation, so the approximation is sought among the factorized distributions \(\rho (d\theta ) = \prod _{i=1}^{m_1}\rho ^{L_i}(dL_{i,\cdot }) \prod _{j=1}^{m_2}\rho ^{R_j}(dR_{j,\cdot }) \prod _{k=1}^{K}\rho ^{\gamma _k}(d\gamma _k)\). We have the following decomposition, for any distribution \(\rho \):
$$\begin{aligned} \log \int \Lambda (L,R)\pi (d\theta )&= \mathscr {L}(\rho ) + \mathscr {K}(\rho ,\widehat{\rho _l}) \\ \text {with } \mathscr {L}(\rho )&= \int \log \left( \frac{\Lambda (L,R)\pi (\theta )}{\rho (\theta )} \right) \rho (d\theta ) . \end{aligned}$$
Since the left-hand side (called the log-evidence) is fixed, minimizing the Kullback divergence w.r.t. \(\rho \) is the same as maximizing \(\mathscr {L}(\rho )\). Unfortunately, this quantity is intractable. But a lower bound, which corresponds to a Gaussian approximation, is much easier to optimize. We introduce the additional parameter \(\xi = (\xi _l)_{l \in [n]}\).

Proposition 2

For all \(\xi \in \mathbb {R}^n\) and for all \(\rho \),
$$\begin{aligned} \mathscr {L}(\rho )&\ge \int \log \frac{H(\theta ,\xi )\pi (\theta )}{\rho (\theta )} \rho (d\theta ) := \mathscr {L}(\rho ,\xi )\\ \text {where } \log H(\theta ,\xi )&=\sum _{l \in [n]} \left\{ \log \sigma (\xi _l) + \frac{Y_l (LR^\top )_{X_l} - \xi _l}{2} - \tau (\xi _l)\left[ (LR^\top )_{X_l}^2-\xi _l^2 \right] \right\} \\ \text {and } \tau (x)&= 1/(2x)(\sigma (x)-1/2). \end{aligned}$$
Hence the estimator is (even though \(\xi \) is not the parameter of interest):
$$\begin{aligned} (\widetilde{\rho }, \widetilde{\xi }) = \arg \max _{\rho \in \mathscr {F},\ \xi \in \mathbb {R}^n} \mathscr {L}(\rho ,\xi ). \end{aligned}$$
(9)
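The quantities appearing in Proposition 2 are easy to implement, and the bound can be checked numerically. Below is a small sketch (names ours) of \(\tau \) and of the resulting quadratic lower bound on \(\log \sigma (z)\), where \(z = Y_l (LR^\top )_{X_l}\) (recall that \(z^2 = (LR^\top )_{X_l}^2\) since \(Y_l \in \{-1,+1\}\)).

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def tau(x):
    """tau(x) = (sigma(x) - 1/2) / (2x), extended by its limit 1/8 at x = 0."""
    x = np.asarray(x, dtype=float)
    safe = np.where(np.abs(x) < 1e-8, 1.0, x)   # avoid a 0/0 evaluation
    return np.where(np.abs(x) < 1e-8, 0.125, (sigma(safe) - 0.5) / (2.0 * safe))

def jj_lower_bound(z, xi):
    """Quadratic lower bound on log sigma(z) at variational parameter xi (Proposition 2)."""
    return np.log(sigma(xi)) + (z - xi) / 2.0 - tau(xi) * (z**2 - xi**2)

# numerical sanity check: log sigma(z) >= jj_lower_bound(z, xi), with equality at z = +/- xi
z = np.linspace(-5.0, 5.0, 101)
assert np.all(np.log(sigma(z)) >= jj_lower_bound(z, 1.3) - 1e-12)
```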

5.1 Bayes algorithm

The lower bound \(\mathscr {L}(\rho ,\xi )\) is maximized with respect to \(\rho \) by the mean field algorithm. A direct calculation shows that the optimal distribution of each site (written with a star subscript) is given by:
$$\begin{aligned} \forall i \in [m_1], \log \rho _\star ^{L_{i,\cdot }}(L_{i,\cdot })&= \int \log \left[ H(\theta ,\xi )\pi (\theta ) \right] \rho ^R(dR)\rho ^\gamma (d\gamma )\prod _{i'\ne i} \rho (dL_{i',\cdot }) + \text {const} \\ \forall j \in [m_2], \log \rho _\star ^{R_{j,\cdot }}(R_{j,\cdot })&= \int \log \left[ H(\theta ,\xi )\pi (\theta ) \right] \rho ^L(dL)\rho ^\gamma (d\gamma )\prod _{j'\ne j} \rho (dR_{j',\cdot }) + \text {const} \end{aligned}$$
As \(\log H(\theta ,\xi )\) is a quadratic form in \((L_{i,\cdot })_{i\in [m_1]}\) and \((R_{j,\cdot })_{j\in [m_2]}\), the variational distribution of each parameter is Gaussian and a direct calculation gives:
$$\begin{aligned} \rho _\star ^{L_{i,\cdot }}&= \mathscr {N}\left( \mathscr {M}_i^L, \mathscr {V}_i^L\right) , \quad \rho _\star ^{R_{j,\cdot }} = \mathscr {N}\left( \mathscr {M}_j^R, \mathscr {V}_j^R\right) \quad \text {where} \\ \mathscr {M}_i^L&= \left( \frac{1}{2} \sum _{l \in \Omega _{i,\cdot }}Y_l\mathbb {E}_{\rho } \left[ R_{j_l,\cdot }\right] \right) \mathscr {V}_i^L, \quad \mathscr {V}_i^L = \left( 2\sum _{l \in \Omega _{i,\cdot }} \tau (\xi _l) \mathbb {E}_\rho [R_{j_l,\cdot }^\top R_{j_l,\cdot }]+ \mathbb {E}_\rho \left[ \text {diag}\left( \frac{1}{\gamma }\right) \right] \right) ^{-1} \\ \mathscr {M}_j^R&= \left( \frac{1}{2}\sum _{l \in \Omega _{\cdot ,j}} Y_l\mathbb {E}_{q} \left[ L_{i_l, \cdot } \right] \right) \mathscr {V}_j^R, \quad \mathscr {V}_j^R = \left( 2\sum _{l \in \Omega _{\cdot , j}} \tau (\xi _l) \mathbb {E}_\rho [L_{i_l, \cdot }^\top L_{i_l, \cdot }]+ \mathbb {E}_\rho \left[ \text {diag}\left( \frac{1}{\gamma } \right) \right] \right) ^{-1} \end{aligned}$$
The variational optimization for \(\gamma \) is exactly the same as in the hinge loss setting (with both possible prior distributions \(\Gamma \) and \(\Gamma ^{-1}\)). The optimization of the variational parameters \(\xi \) is given by:
$$\begin{aligned} \forall l \in [n], \quad \widehat{\xi _l}&= \sqrt{\mathbb {E}_\rho \left[ (LR^\top )_{X_l}^2 \right] }. \end{aligned}$$
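For illustration, here is a sketch (names ours) of the update of one factor \(\rho ^{L_{i,\cdot }}\) and of the \(\xi \) update. It assumes diagonal Gaussian factors for the rows of R, so that \(\mathbb {E}[R_{j,\cdot }^\top R_{j,\cdot }] = (R^0_{j,\cdot })^\top R^0_{j,\cdot } + \mathrm{diag}(v^R_{j,\cdot })\) and \(\mathbb {E}[(LR^\top )_{X_l}^2]\) has a simple closed form; with full covariance factors one would add the full covariance matrices instead.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def tau(x):
    x = np.asarray(x, dtype=float)
    safe = np.where(np.abs(x) < 1e-8, 1.0, x)
    return np.where(np.abs(x) < 1e-8, 0.125, (sigma(safe) - 0.5) / (2.0 * safe))

def update_row_L(i, L0, R0, vR, e_inv_gamma, rows, cols, y, xi):
    """Gaussian variational update of rho^{L_{i,.}} (mean M_i^L and covariance V_i^L)."""
    K = L0.shape[1]
    idx = np.where(rows == i)[0]                        # Omega_{i,.}
    A = np.diag(e_inv_gamma).astype(float)              # E[diag(1/gamma)]
    b = np.zeros(K)
    for l in idx:
        j = cols[l]
        ERR = np.outer(R0[j], R0[j]) + np.diag(vR[j])   # E[R_{j,.}^T R_{j,.}]
        A += 2.0 * tau(xi[l]) * ERR
        b += 0.5 * y[l] * R0[j]
    V = np.linalg.inv(A)                                # V_i^L
    M = b @ V                                           # M_i^L
    return M, V

def update_xi(L0, vL, R0, vR, rows, cols):
    """xi_l = sqrt(E[(L R^T)_{X_l}^2]) under the diagonal Gaussian factors."""
    mean = np.einsum("nk,nk->n", L0[rows], R0[cols])
    var = np.einsum("nk,nk->n", vL[rows], R0[cols]**2 + vR[cols]) \
        + np.einsum("nk,nk->n", L0[rows]**2, vR[cols])
    return np.sqrt(mean**2 + var)
```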

6 Empirical results

In this section we compare our methods to other 1-bit matrix completion techniques on simulated and real datasets. It is worth noting that the low rank decomposition does not involve the same matrix: in our model, it concerns the Bayes classifier matrix; in the logistic model, it concerns the parameter matrix. The estimate from our algorithm is \(\widehat{M}=\mathbb {E}_{\widetilde{\rho }_\lambda }(L) \mathbb {E}_{\widetilde{\rho }_\lambda }(R)^\top \) and we focus on the zero-one loss in prediction. We first test the performance on simulated matrices and then on a real data set. We compare the four following models: (a) hinge loss with variational approximation (referred to as HL), (b) Bayesian logistic model with variational approximation (referred to as Logis.), (c) the frequentist logistic model from Davenport et al. (2014) (referred to as freq. Logis.) and (d) the frequentist least squares model from Mazumder et al. (2010) (referred to as SI, for SoftImpute). The first two are tested with both Gamma and Inverse-Gamma prior distributions. The hyperparameters are all tuned by cross-validation. The parameter of the frequentist methods is a regularization parameter that is also tuned by cross-validation.

The choice of K in our methods is more difficult. A large K leads to more parameters to be estimated. This considerably slows down our algorithms. In the end, some (very large) values of K are not feasible in practice. Still, what we observe is that the prior leads to an adaptive estimator, in accordance with the theoretical results: when K is taken too large (but still small enough in order to keep the computations feasible), the additional parameters are shrunk to zero. Having observed this fact, we keep \(K=10\) in many simulations. Still, we added simulations with a larger value, \(K=50\), in order to show that this shrinkage effect indeed takes place.

From a theoretical perspective, the complexity of each step of Algorithm 1 is of order \((m_1+m_2)K\). Each step only involves very simple calculations, no matrix operations. By contrast, the methods that use the nuclear norm are very time-consuming because the complexity of the SVD is of order \(m_1 m_2 \min (m_1,m_2)\). It is possible to use an approximate SVD, but the method is then more difficult to tune.

6.1 Simulated data: small matrices

The goal is to assess the models under different scenarios of data generation. The general scheme of the simulations is as follows: the observations come from a \(200\times 200\) matrix and we randomly pick \(20\%\) of its entries. We set \(K=10\) in our algorithms. The observations are generated as:
$$\begin{aligned} Y_l = \text {sign}\left( M_{i_l,j_l}+Z_l\right) B_l, \quad \text {where } M\in \mathbb {R}^{m\times m}, \quad (B_l,Z_l)_{l\in [n]} \text { are iid}. \end{aligned}$$
The noise term (B, Z) is such that \(\mathbf {R}(M)=\overline{\mathbf {R}}\) and M has low rank, denoted by r in what follows. The predictions are directly compared to M. Two types of matrices M are built: type A corresponds to the case favorable to the hinge loss, where the entries of M lie in \(\{-1,+1\}\).1 Type B corresponds to a more difficult classification problem because many entries of M are around 0: M is a product of two matrices with r columns whose entries are iid \(\mathscr {N}(0,1)\). The noise term is specified in Table 1. Note that the example A3 may also be seen as a switch noise with probability \(\frac{e}{1+e} \approx 0.73\). Each experiment is run once.
Table 1  Type of noise

Type | Name | B | Z | Y
1 | No noise | \(B=1\) a.s. | \(Z=0\) a.s. | \(Y_l=\text {sign}(M_{i_l,j_l})\) a.s.
2 | Switch | \(B\sim 0.9 \delta _1 + 0.1 \delta _{-1}\) | \(Z=0\) a.s. | \(Y_l=\text {sign}(M_{i_l,j_l})\) w.p. 0.9
3 | Logistic | \(B=1\) a.s. | \(Z \sim \text {Logistic}\) | \(Y_l =1\) w.p. \(\sigma (M_{i_l,j_l})\)
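For completeness, here is a sketch (Python/NumPy, helper names ours) of the data-generation scheme used in this section: type A and type B base matrices and the three noise types of Table 1. Entries are sampled with replacement, in line with the i.i.d. sampling model of Sect. 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_type_A(m, r):
    """Type A matrix: entries in {-1, +1}, rank (at most) r, as described in footnote 1."""
    base = rng.choice([-1.0, 1.0], size=(m, r))
    picks = rng.integers(0, r, size=m - r)
    signs = rng.choice([-1.0, 1.0], size=m - r)
    return np.hstack([base, base[:, picks] * signs])

def make_type_B(m, r):
    """Type B matrix: product of two factors with r columns and iid N(0,1) entries."""
    return rng.normal(size=(m, r)) @ rng.normal(size=(r, m))

def observe(M, frac, noise):
    """Sample frac * m1 * m2 entries and apply one of the noise types of Table 1."""
    m1, m2 = M.shape
    n = int(frac * m1 * m2)
    rows = rng.integers(0, m1, size=n)
    cols = rng.integers(0, m2, size=n)
    s = np.sign(M[rows, cols])
    if noise == "none":                       # type 1
        y = s
    elif noise == "switch":                   # type 2: sign kept with probability 0.9
        y = s * rng.choice([1.0, -1.0], size=n, p=[0.9, 0.1])
    else:                                     # type 3: logistic, P(Y = 1) = sigma(M_ij)
        y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-M[rows, cols])), 1.0, -1.0)
    return rows, cols, y
```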

Table 2  Prediction error on simulated observations—rank 3

Type | Logis.-G (%) | Logis.-IG (%) | HL-G (%) | HL-IG (%) | Freq. logis. (%) | SI (%)
A1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
A2 | 0.5 | 0.9 | 0.1 | 0.0 | 0.5 | 0.4
A3 | 16.0 | 15.9 | 8.5 | 8.5 | 17.3 | 17.3
B1 | 4.1 | 4.0 | 5.3 | 5.8 | 5.1 | 5.6
B2 | 10.1 | 10.1 | 10.8 | 10.6 | 10.7 | 10.8
B3 | 16.0 | 16.0 | 22.1 | 21.3 | 19.8 | 19.8

For rank 3 (see Table 2) and rank 5 matrices (see Table 3), the results of the Bayesian algorithms are very similar for both prior distributions and there is no evidence to favor a particular one. The results are better for the hinge loss method on type A observations and the difference of performance between models is very large for A3. On the contrary, the performance of the logistic model is better when the observations are generated from this model and when the parameter matrix is not separable. In comparison with the results of the frequentist approach, the variational approximation seems very good even though we do not have any theoretical guarantee for it. For rank 5 matrices, the performances are worse but the conclusions are the same as in the rank 3 experiment.
Table 3  Prediction error on simulated matrices—rank 5

Type | Logis.-G (%) | Logis.-IG (%) | HL-G (%) | HL-IG (%) | Freq. logis. (%) | SI (%)
A1 | 0.01 | 0.01 | 0.0 | 0.0 | 0.01 | 0.0
A2 | 4.4 | 3.1 | 0.54 | 0.55 | 3.1 | 2.8
A3 | 32.5 | 33.1 | 27.0 | 26.7 | 30.1 | 30.0
B1 | 7.8 | 7.8 | 9.4 | 10.4 | 9.0 | 9.6
B2 | 17.3 | 17.3 | 17.9 | 18.1 | 18.3 | 18.4
B3 | 21.5 | 21.4 | 24.4 | 22.9 | 22.1 | 22.2

The last simulation focuses on the influence of the level of switch noise. On type A2 with rank 3 (Table 2), we see that \(10\%\) of corrupted entries is not enough to prevent an almost perfect recovery of the Bayes classifier matrix. We challenge the frequentist methods as well. The results are clear, see Fig. 1: the hinge loss method is better almost everywhere. For a noise level up to \(25\%\), which means that one fourth of the observed entries are corrupted, it is possible to get a very good predictor with less than \(10\%\) misclassification error. It gets worse when the level of noise increases and the problem becomes almost impossible for noise greater than \(30\%\).
Fig. 1  Results on simulated A2 matrices with different levels of noise—rank 3

6.2 Simulated data: large matrices

The second experiment involves larger matrices in order to assess the efficiency of the Bayesian methods on large datasets. The observations now come from a \(2000 \times 2000\) matrix, of which \(10\%\) of the entries are observed at random. The base matrix now has rank 10. It is worth noting that the matrix to be recovered is 100 times larger than in the first example. For the Bayesian methods, we fix \(K=50\). On the other hand, the frequentist methods need a singular value decomposition, which is very time consuming for such a large matrix. Exactly the same observation generation procedure (Table 1) is used and six experiments are done (Table 4).

The results are in line with the previous section: all the methods give very similar results except for the high level of switch noise (matrix A3). The gap between the proposed methods and the logistic model is quite large there, and the algorithms using the hinge loss perform almost perfectly. On the other hand, the hinge loss models also perform well on logistic data, and it is worth noting that the Bayesian logistic models perform better on logistic data with a low level of noise (B1).
Table 4  Prediction error on simulated matrices—rank 10

Type | Logis.-G (%) | Logis.-IG (%) | HL-G (%) | HL-IG (%) | Freq. logis. (%) | SI (%)
A1 | 0 | 0 | 0 | 0 | 0 | 0
A2 | 0.1 | 0.3 | 0 | 0 | 0.1 | 0.1
A3 | 9.6 | 9.9 | 1.8 | 1.8 | 9.9 | 9.9
B1 | 3.7 | 4.1 | 8.2 | 6.6 | 6.2 | 7.4
B2 | 11 | 11 | 11.3 | 10.6 | 11.8 | 12.0
B3 | 10.4 | 10.2 | 11.6 | 11.3 | 11.0 | 11.2

6.3 Real data set: MovieLens

The last experiment, presented in Table 5, involves the well-known MovieLens data set.2 It has already been used by Davenport et al. (2014) and we follow them for the study. The ratings lie between 1 and 5, so we split them into binary data: good ratings (above the mean, which is 3.5) and bad ones. The low rank assumption is usual in this case because it is expected that the taste of a particular user is related to only a few hidden parameters. The smallest data set contains 100,000 ratings from 943 users and 1682 movies, so we use 95,000 of them as a training set and the 5000 remaining as the test set. The performances are very similar between the frequentist logistic model from Davenport et al. (2014) and the hinge loss model. The performances seem slightly worse for the Bayesian logistic model but it is hard to favor a particular model at this stage (note that the difference between 0.28 and 0.27 is not significant on 5000 observations).
Table 5  Misclassification rate on MovieLens 100k data set

Algorithm | HL-IG | HL-G | Logis.-G | Logis.-IG | Freq. logis.
Misclassif. rate | 0.28 | 0.29 | 0.32 | 0.32 | 0.27

7 Discussion

We tackle the 1-bit matrix completion problem with classification tools and we are able to derive PAC bounds on the risk together with an efficient algorithm to compute the estimator. Previous works focused only on GLM models, which are not suited to establishing distribution-free risk bounds. This work relies on the PAC-Bayesian framework and the pseudo-posterior distribution is approximated by a variational algorithm. In practice, it is able to deal with large matrices. We also derive a variational approximation of the posterior distribution in the Bayesian logistic model, and it works very well in our examples.

Variational approximations look very promising for building algorithms able to deal with large data, and this framework may be extended to more general models and other machine learning tools.

Footnotes

  1. The matrices are built by drawing r independent columns with entries in \(\{-1,1\}\). The remaining columns are randomly set equal to one of the first r columns multiplied by a factor in \(\{-1,1\}\).

  2.


Acknowledgements

We would like to thank Vincent Cottet’s Ph.D. supervisor Professor Nicolas Chopin, for his kind support during the project and the three anonymous referees for their helpful and constructive comments.

References

  1. Alquier, P., Cottet, V., & Lecué, G. (2017). Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. arXiv preprint arXiv:1702.01402.
  2. Alquier, P., Ridgway, J., & Chopin, N. (2015). On the properties of variational approximations of Gibbs posteriors. arXiv e-prints.
  3. Bishop, C. M. (2006). Pattern recognition and machine learning (information science and statistics). New York: Springer.
  4. Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford: OUP.
  5. Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
  6. Cai, T., & Zhou, W.-X. (2013). A max-norm constrained minimization approach to 1-bit matrix completion. Journal of Machine Learning Research, 14, 3619–3647.
  7. Candès, E. J., & Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6), 925–936.
  8. Candès, E. J., & Recht, B. (2012). Exact matrix completion via convex optimization. Communications of the ACM, 55(6), 111–119.
  9. Candès, E. J., & Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5), 2053–2080.
  10. Catoni, O. (2004). Statistical learning theory and stochastic optimization. In J. Picard (Ed.), Saint-Flour Summer School on probability theory 2001, Lecture notes in mathematics. Berlin: Springer.
  11. Catoni, O. (2007). PAC-Bayesian supervised classification: The thermodynamics of statistical learning (Vol. 56), Institute of Mathematical Statistics lecture notes—Monograph series. Beachwood, OH: Institute of Mathematical Statistics.
  12. Chatterjee, S. (2015). Matrix estimation by universal singular value thresholding. Annals of Statistics, 43(1), 177–214.
  13. Dalalyan, A., & Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72(1), 39–61.
  14. Davenport, M. A., Plan, Y., van den Berg, E., & Wootters, M. (2014). 1-Bit matrix completion. Information and Inference, 3(3), 189–223.
  15. Herbrich, R., & Graepel, T. (2002). A PAC-Bayesian margin bound for linear classifiers. IEEE Transactions on Information Theory, 48(12), 3140–3150.
  16. Herbster, M., Pasteris, S., & Pontil, M. (2016). Mistake bounds for binary matrix completion. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Proceedings of the 29th conference on neural information processing systems (NIPS 2016). Barcelona, Spain: NIPS Proceedings.
  17. Hsieh, C.-J., Natarajan, N., & Dhillon, I. S. (2015). PU learning for matrix completion. In Proceedings of the 32nd international conference on machine learning, pp. 2445–2453.
  18. Jaakkola, T. S., & Jordan, M. I. (2000). Bayesian parameter estimation via variational methods. Statistics and Computing, 10(1), 25–37.
  19. Klopp, O., Lafond, J., Moulines, É., & Salmon, J. (2015). Adaptive multinomial matrix completion. Electronic Journal of Statistics, 9(2), 2950–2975.
  20. Kyung, M., Gill, J., Ghosh, M., & Casella, G. (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis, 5(2), 369–412.
  21. Latouche, P., Robin, S., & Ouadah, S. (2015). Goodness of fit of logistic models for random graphs. arXiv preprint arXiv:1508.00286.
  22. Lim, Y. J., & Teh, Y. W. (2007). Variational Bayesian approach to movie rating prediction. In Proceedings of KDD cup and workshop.
  23. Mai, T. T., & Alquier, P. (2015). A Bayesian approach for noisy matrix completion: Optimal rate under general sampling distribution. Electronic Journal of Statistics, 9(1), 823–841.
  24. Mammen, E., & Tsybakov, A. (1999). Smooth discrimination analysis. The Annals of Statistics, 27(6), 1808–1829.
  25. Mazumder, R., Hastie, T., & Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug), 2287–2322.
  26. McAllester, D. A. (1998). Some PAC-Bayesian theorems. In Proceedings of the eleventh annual conference on computational learning theory (pp. 230–234). New York: ACM.
  27. Park, T., & Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103(482), 681–686.
  28. Recht, B., & Ré, C. (2013). Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2), 201–226.
  29. Salakhutdinov, R., & Mnih, A. (2008). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th international conference on machine learning, pp. 880–887.
  30. Seldin, Y., & Tishby, N. (2010). PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11(Dec), 3595–3646.
  31. Seldin, Y., Laviolette, F., Cesa-Bianchi, N., Shawe-Taylor, J., & Auer, P. (2012). PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12), 7086–7093.
  32. Shawe-Taylor, J., & Langford, J. (2003). PAC-Bayes and margins. Advances in Neural Information Processing Systems, 15, 439.
  33. Srebro, N., Rennie, J., & Jaakkola, T. S. (2004). Maximum-margin matrix factorization. In Advances in neural information processing systems, pp. 1329–1336.
  34. Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
  35. Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1), 56–85.

Copyright information

© The Author(s) 2017

Authors and Affiliations

  1. CREST, ENSAE, Université Paris-Saclay, Paris, France
