# 1-Bit matrix completion: PAC-Bayesian analysis of a variational approximation


## Abstract

We focus on the completion of a (possibly) low-rank matrix with binary entries, the so-called 1-bit matrix completion problem. Our approach relies on tools from machine learning theory: empirical risk minimization and its convex relaxations. We propose an algorithm to compute a variational approximation of the pseudo-posterior. Thanks to the convex relaxation, the corresponding minimization problem is bi-convex, and thus the method works well in practice. We study the performance of this variational approximation through PAC-Bayesian learning bounds. Contrary to previous works that focused on upper bounds on the estimation error of *M* with various matrix norms, we are able to derive from this analysis a PAC bound on the prediction error of our algorithm. We focus essentially on convex relaxation through the hinge loss, for which we present a complete analysis, a complete simulation study and a test on the MovieLens data set. We also discuss a variational approximation to deal with the logistic loss.

## Keywords

Matrix completion, PAC-Bayesian bounds, Variational Bayes, Supervised classification, Risk convexification, Oracle inequalities

## 1 Introduction

Motivated by modern applications like recommendation systems and collaborative filtering, video analysis or quantum statistics, the matrix completion problem has been widely studied in recent years. Recovering a matrix is, without any additional information, an impossible task. However, under some assumptions on the structure of the matrix to be recovered, it might become feasible, as shown by Candès and Tao (2010) and Candès and Recht (2012), where the assumption is that the matrix has a small rank. This assumption is natural in many applications. For example, in recommendation systems, it is equivalent to the existence of a small number of hidden features that explain the users' preferences. While Candès and Tao (2010) and Candès and Recht (2012) focused on matrix completion without noise, many authors extended these techniques to the case of noisy observations, see Candès and Plan (2010) and Chatterjee (2015) among others. The main idea in Candès and Plan (2010) is to minimize the least squares criterion, penalized by the rank. This penalization is then relaxed by the nuclear norm, which is the sum of the singular values of the matrix at hand. An efficient algorithm is described in Recht and Ré (2013).

In many applications, the entries of the matrix are binary. For example, in recommendation systems, the (*i*, *j*)th entry being 1 means that user *i* is satisfied by object *j* while this entry being 0 means that he/she is not satisfied by it. The problem of recovering a binary matrix from partial observations is usually referred to as 1-bit matrix completion. Dealing with binary observations requires specific estimation methods. Most works on this problem assume a generalized linear model (GLM): the observations \(Y_{ij}\) for \(1\le i\le m_1\), \(1\le j\le m_2\), are Bernoulli distributed with parameter \(f(M_{ij})\), where *f* is a link function which maps from \(\mathbb {R}\) to [0, 1], for example the logistic function \(f(x) = \exp (x)/[1+\exp (x)]\), and *M* is an \(m_1 \times m_2\) real matrix, see Cai and Zhou (2013), Davenport et al. (2014) and Klopp et al. (2015). In these works, the goal is to recover the matrix *M* and a convergence rate is then derived. For example, Klopp et al. (2015) provide an estimate \(\widehat{M}\) for which, under suitable assumptions and when the data are generated according to the true model with \(M=M_0\), an upper bound on the Frobenius error of \(\widehat{M}\) with respect to \(M_0\) holds with a constant *C* that depends on the assumptions and the sampling scheme, where \(\Vert .\Vert _F\) stands for the Frobenius norm [we refer the reader to Corollary 2 page 2955 in Klopp et al. (2015) for the exact statement]. While this result ensures the consistency of \(\widehat{M}\) when \(M_0\) is low-rank, it does not provide any guarantee on the probability of a prediction error. Moreover, the results rely on the assumption that the model (in particular the function *f*) is well specified. In practice, this assumption is unrealistic, and it is important to provide generalization error bounds that hold even in case of misspecification.
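To make the GLM setting concrete, here is a minimal Python sketch that draws binary observations \(Y_{ij}\sim \text{Bernoulli}(f(M_{ij}))\) with the logistic link from a low-rank parameter matrix. The matrix sizes, the number of observations and the uniform sampling of positions are illustrative assumptions, not choices taken from the cited works.

```python
import numpy as np

def logistic(x):
    """Logistic link f(x) = exp(x) / (1 + exp(x)), in a numerically stable form."""
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def simulate_glm_observations(M, n_obs, rng):
    """Sample n_obs positions uniformly (an assumption) and draw Y ~ Bernoulli(f(M_ij))."""
    m1, m2 = M.shape
    rows = rng.integers(0, m1, size=n_obs)
    cols = rng.integers(0, m2, size=n_obs)
    probs = logistic(M[rows, cols])
    y = rng.binomial(1, probs)  # binary observations in {0, 1}
    return rows, cols, y

rng = np.random.default_rng(0)
m1, m2, r = 30, 20, 2  # hypothetical sizes
M0 = rng.normal(size=(m1, r)) @ rng.normal(size=(m2, r)).T  # parameter matrix of rank r
rows, cols, y = simulate_glm_observations(M0, n_obs=200, rng=rng)
```

Under misspecification, the data would not be generated by `logistic(M0)` at all, which is the situation the bounds of this paper are meant to cover.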

Here, we adopt a machine learning point of view: in machine learning, dealing with binary output is called a classification problem, for which methods are known that do not assume any model on the observations. That is, instead of focusing on a parametric model for \(Y_{i,j}\), we will only define a set of prediction matrices *M* and seek the one that leads to the best prediction error. Using the zero-one loss function, we could actually directly use the theory of Vapnik (1998) to propose a classifier \(\widehat{M}\) whose risk would be controlled by a PAC inequality. However, it is known that this approach is usually computationally intractable. A popular approach is to replace the zero-one loss by a convex surrogate (Zhang 2004), namely, the hinge loss. Our approach is as follows: we propose a pseudo-Bayesian approach, where we define a pseudo-posterior distribution on a set of matrices *M*. This pseudo-posterior distribution does not have a simple form; however, thanks to a variational approximation, we manage to approximate it by a tractable distribution. Thanks to the PAC-Bayesian theory (McAllester 1998; Herbrich and Graepel 2002; Shawe-Taylor and Langford 2003; Catoni 2004, 2007; Seldin et al. 2012; Dalalyan and Tsybakov 2008), we are able to provide a PAC bound on the prediction risk of this variational approximation. We then show that, due to the convex relaxation of the zero-one loss, the computation of this variational approximation is actually a bi-convex minimization problem. As a consequence, efficient algorithms are available.
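The convex relaxation mentioned above can be illustrated with a short Python sketch comparing the zero-one loss with the hinge loss \((1-ym)_+\) for a label \(y\in \{-1,+1\}\) and a real-valued score *m*. Counting a score of exactly 0 as an error is a convention assumed here for illustration.

```python
import numpy as np

def zero_one_loss(y, m):
    """1 when the sign of the score m disagrees with y; a zero score counts as an error (assumed convention)."""
    return (np.asarray(y) * np.asarray(m) <= 0).astype(float)

def hinge_loss(y, m):
    """Hinge loss (1 - y * m)_+, a convex function of m that upper-bounds the zero-one loss."""
    return np.maximum(0.0, 1.0 - np.asarray(y) * np.asarray(m))

scores = np.linspace(-2.0, 2.0, 9)
labels = np.ones_like(scores)
gap = hinge_loss(labels, scores) - zero_one_loss(labels, scores)  # non-negative everywhere
```

The hinge loss vanishes only when the score has the right sign with margin at least 1, which is why matrices with entries in \(\{-1,+1\}\) are a favorable case for it.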

Other settings for 1-bit matrix completion have also been studied. For example, in some real-life applications, only positive instances are available. This setting, which requires a different approach, is studied in detail in Hsieh et al. (2015). Here, we stick to the classification approach where positive and negative instances are observed. We refer the reader to Hsieh et al. (2015) and the references therein for the positive-only case.

The rest of the paper is organized as follows. In Sect. 2 we provide the notations used in the paper, the definition of the pseudo-posterior and of its variational approximation. In Sect. 3 we give the PAC analysis of the variational approximation. This yields an empirical and a theoretical upper bound on the prediction risk of our method. Sect. 4 provides details on the implementation of our method. Note that in the aforementioned sections, the convex surrogate of the zero-one loss used is the hinge loss. An extension to the logistic loss is briefly discussed in Sect. 5, together with an algorithm to compute the variational approximation. Finally, Sect. 6 is devoted to an empirical study and Sect. 6.3 to an application to the MovieLens data set. The proofs of the theorems of Sect. 3 are provided in Sect. 1.

## 2 Estimation procedure

For any integer *m* we define \([m]=\{1,\dots ,m\}\); for two real numbers *a* and *b* we write \(\max (a,b)= a\vee b\) and \(\min (a,b) = a\wedge b\). We define, for any integers \(m_1\) and \(m_2\) and any matrix \(M \in \mathbb {R}^{m_1\times m_2}\), \(\Vert M \Vert _\text {max} = \max _{(i,j)\in [m_1]\times [m_2]} |M_{ij}|\). Let \(\mathbb {R}^+\) stand for the set of non-negative real numbers, and \(\mathbb {R}^{+*}\) for the positive real numbers. For any real number *a*, \((a)_+\) is the positive part of *a* and is equal to \(\max (0,a)\).

For a pair of matrices (*A*, *B*), we write \(\ell (A,B)= \Vert A \Vert _\text {max} \vee \Vert B \Vert _\text {max}\). Finally, when an \(m_1\times m_2\) matrix *M* has \(\mathrm{rank}(M)=r\) then it can be written as \(M=LR^\top \) where *L* is \(m_1\times r\) and *R* is \(m_2 \times r\). This decomposition is obviously not unique; we put \(\ell (M) = \inf _{(L,R)}\ell (L,R) \) where the infimum is taken over all pairs (*L*, *R*) such that \(LR^\top = M\). In frequentist approaches like Klopp et al. (2015), it is common that the upper bound depends on the sup-norm of the entries. This quantity is replaced in our analysis by \(\ell (M)\).
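As an illustration of these notations, the following Python sketch builds one factorization \(M=LR^\top \) from a truncated SVD and evaluates \(\ell (L,R)\) for that pair, reading the max norm as the largest absolute entry. Since \(\ell (M)\) is an infimum over all factorizations, the value computed here is only an upper bound on it; the balanced split of the singular values and the function names are illustrative choices, not part of the paper.

```python
import numpy as np

def max_norm(A):
    """Entrywise max norm: the largest absolute entry of A."""
    return np.max(np.abs(A))

def ell(L, R):
    """ell(L, R) = max(||L||_max, ||R||_max) for one particular pair of factors."""
    return max(max_norm(L), max_norm(R))

def balanced_factorization(M, r):
    """Return (L, R) with M = L @ R.T from a rank-r SVD, splitting each singular value evenly."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(s[:r])
    return U[:, :r] * root, Vt[:r].T * root

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 2)) @ rng.normal(size=(5, 2)).T  # 6 x 5 matrix of rank 2
L, R = balanced_factorization(M, r=2)
upper_bound_on_ell_M = ell(L, R)  # any valid pair (L, R) gives an upper bound on ell(M)
```

Rescaling the pair to \((cL, R/c)\) leaves the product unchanged but changes \(\ell (L,R)\), which is why the definition of \(\ell (M)\) takes an infimum.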

### 2.1 1-Bit matrix completion as a classification problem

We observe \((X_k,Y_k)_{k\in [n]}\), *n* i.i.d. pairs from a distribution \(\mathbf {P}\). The \(X_k\)’s take values in \(\mathscr {X}=[m_1]\times [m_2]\) and the \(Y_k\)’s take values in \(\mathscr {Y}=\{-1,+1\}\). Hence, the *k*-th observation of an entry of the matrix is \(Y_k\) and the corresponding position in the matrix is provided by \(X_k=(i_k,j_k)\). In this setting, a predictor is a function \([m_1]\times [m_2]\rightarrow \mathbb {R}\) and thus can be represented by a matrix *M*; for any *X*, \(M_X\) is the entry of *M* at location *X*. It is natural to use *M* in the following way: when \((X,Y)\sim \mathbf {P}\), *M* predicts *Y* by \(\mathrm{sign}(M_X)\). The ability of this predictor to predict a new entry of the matrix is then assessed by the risk \(\mathbf {R}(M)\), the probability of a prediction error under \(\mathbf {P}\). Note that if two matrices \(M^1\) and \(M^2\) are such that, for every (*i*, *j*), \(\text {sign}(M^1_{ij})=\text {sign}(M^2_{ij})\), then \(\mathbf {R}(M^1)=\mathbf {R}(M^2)\).

Contrary to many recent papers on matrix completion, our approach leads to distribution-free bounds. The marginal distribution of *X* is not an issue and we do not have to assume a uniform sampling scheme. Following standard notations in matrix completion, we define \(\Omega \) as the set of indices of observed entries: \(\Omega =\{X_1,\dots ,X_n\}\). In the following, we will use the sub-sample of \(\left\{ 1,\dots ,n\right\} \) for a specified row *i*: \(\Omega _{i,\cdot }=\left\{ l \in [n]:(i,j_l) \in \Omega \right\} \) and the counterpart for a specified column *j*: \(\Omega _{\cdot , j}=\left\{ l \in [n]:(i_l,j) \in \Omega \right\} \).
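The following Python sketch shows one way to store the observations and to build the index sets \(\Omega _{i,\cdot }\) and \(\Omega _{\cdot ,j}\), read here as the observation indices whose row (respectively column) index equals *i* (respectively *j*). It also computes the fraction of observed entries mispredicted by \(\mathrm{sign}(M_X)\). Indices are 0-based, a zero entry is predicted as +1, and the toy data are hypothetical; all of these are assumptions made for the example.

```python
import numpy as np

def row_index_sets(rows, m1):
    """For each row i, the observation indices l whose row index i_l equals i."""
    return [np.flatnonzero(rows == i) for i in range(m1)]

def col_index_sets(cols, m2):
    """For each column j, the observation indices l whose column index j_l equals j."""
    return [np.flatnonzero(cols == j) for j in range(m2)]

def empirical_zero_one_risk(M, rows, cols, y):
    """Fraction of observations where sign(M_X) differs from Y (sign(0) taken as +1 here)."""
    pred = np.where(M[rows, cols] >= 0, 1, -1)
    return float(np.mean(pred != y))

# Toy sample of n = 5 observations (X_k, Y_k) on a 3 x 3 matrix.
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 1, 1, 0, 2])
y = np.array([1, -1, 1, -1, 1])
M = np.array([[0.5, -1.0, 0.2],
              [0.3, 2.0, -0.1],
              [1.5, 0.0, 0.7]])

omega_rows = row_index_sets(rows, m1=3)
omega_cols = col_index_sets(cols, m2=3)
risk = empirical_zero_one_risk(M, rows, cols, y)  # only the observation at position (2, 0) is mispredicted
```

Storing the sample as three aligned arrays keeps each per-row or per-column update restricted to the relevant observations.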

### 2.2 Pseudo-Bayesian estimation

A matrix of rank *r* can be factorized as \(M=LR^\top \), where *L* and *R* have *r* columns. As the rank is unknown, we let *L* and *R* have *K* columns for some fixed *K*. Adaptation with respect to \(r\in [K]\) is obtained by shrinking some columns of *L* and *R* to 0. In order to do so, we introduce scale parameters \(\gamma _k\) for the columns of *L* and *R*, and let \(\gamma := (\gamma _1,\dots ,\gamma _K)\). We then define a hierarchical prior distribution: each \(\gamma _k\) is drawn from a distribution \(\pi ^\gamma \) and, given \(\gamma \), the entries of *L* and *R* are normally distributed but the variance depends on the column index: a large \(\gamma _k\) leads to spread values and a small \(\gamma _k\) leads to almost null entries of the column *k*. In most papers \(\pi ^\gamma \) is chosen as an inverse-Gamma distribution because it is conjugate in this model. This kind of hierarchical prior distribution is also very similar to the Bayesian Lasso developed in Park and Casella (2008), and especially to the Bayesian Group Lasso developed in Kyung et al. (2010), in which the variance term is Gamma distributed. We will show that the Gamma distribution is a possible alternative in matrix completion, both for theoretical results and practical considerations. Thus all the results in this paper are stated under the assumption that \(\pi ^\gamma \) is either the Gamma or the inverse-Gamma distribution: \(\pi ^\gamma =\Gamma (\alpha ,\beta )\), or \(\pi ^\gamma =\Gamma ^{-1}(\alpha ,\beta )\).
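A hedged Python sketch of sampling from such a hierarchical prior is given below. It assumes centered normal entries with variance \(\gamma _k\) in column *k*, and reads \(\Gamma (\alpha ,\beta )\) with shape \(\alpha \) and rate \(\beta \); both the centering and the parameterization are assumptions made for the example, as is the function name.

```python
import numpy as np

def sample_prior(m1, m2, K, alpha, beta, rng, inverse_gamma=False):
    """Draw (gamma, L, R): column-wise variances gamma_k, then normal factor entries."""
    # Gamma with shape alpha and rate beta (assumed parameterization).
    gamma = rng.gamma(shape=alpha, scale=1.0 / beta, size=K)
    if inverse_gamma:
        # If G ~ Gamma(shape alpha, rate beta), then 1 / G ~ Inverse-Gamma(shape alpha, scale beta).
        gamma = 1.0 / gamma
    std = np.sqrt(gamma)
    # Entries of column k are N(0, gamma_k): a small gamma_k gives an almost null column.
    L = rng.normal(size=(m1, K)) * std
    R = rng.normal(size=(m2, K)) * std
    return gamma, L, R

rng = np.random.default_rng(0)
gamma, L, R = sample_prior(m1=8, m2=6, K=4, alpha=2.0, beta=1.0, rng=rng)
M = L @ R.T  # one draw of the matrix under the prior
```

Columns with a small \(\gamma _k\) contribute little to \(LR^\top \), which is the mechanism behind the adaptation to the rank described above.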

### 2.3 Variational Bayes approximations

A variational Bayes (VB) approximation replaces the pseudo-posterior by a distribution in a tractable family \(\mathscr {F}\). It is referred to as *parametric* when \(\mathscr {F}\) is finite dimensional and as *mean-field* otherwise. Here we actually use a mixed approach. Informally, under \(\rho \in \mathscr {F}\), all the coordinates are independent and the variational distribution of the entries of *L* and *R* is specified. The free variational parameters to be optimized are the means and the variances. We will show below that the optimization with respect to \(\rho ^{\gamma _k}\) is available in closed form. Also, note that any probability distribution \(\rho \in \mathscr {F}\) is uniquely determined by \(L^0\), \(R^0\), \(v^L\), \(v^R\) and \(\rho ^{\gamma _1},\dots ,\rho ^{\gamma _K}\). We could actually use the notation \(\rho = \rho _{L^0,R^0,v^L,v^R,\rho ^{\gamma _1}, \dots ,\rho ^{\gamma _K}}\), but it would be too cumbersome, so we will avoid it as much as possible. Conversely, once \(\rho \) is given in \(\mathscr {F}\), we can define \(L^0 = \mathbb {E}_\rho [L]\), \(R^0 = \mathbb {E}_\rho [R]\) and so on.
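Because all coordinates are independent under \(\rho \in \mathscr {F}\), the mean of \(LR^\top \) under \(\rho \) is \(L^0 (R^0)^\top \). The Python sketch below checks this by Monte Carlo, assuming for illustration that the entries of *L* and *R* are Gaussian with means \(L^0\), \(R^0\) and variances \(v^L\), \(v^R\); the sizes and numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m1, m2, K = 4, 3, 2  # hypothetical sizes

# Variational parameters: entrywise means and variances.
L0, R0 = rng.normal(size=(m1, K)), rng.normal(size=(m2, K))
vL, vR = np.full((m1, K), 0.5), np.full((m2, K), 0.2)

# Closed form: with independent coordinates, E[L R^T] = E[L] E[R]^T.
M_hat = L0 @ R0.T

# Monte Carlo check under the (assumed) Gaussian variational distribution.
n_draws = 100_000
L = L0 + np.sqrt(vL) * rng.normal(size=(n_draws, m1, K))
R = R0 + np.sqrt(vR) * rng.normal(size=(n_draws, m2, K))
M_mc = np.einsum("nik,njk->ij", L, R) / n_draws  # average of L R^T over the draws
```

The variances do not enter the mean of \(LR^\top \); they only affect the spread of the draws around \(L^0 (R^0)^\top \).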

### Proposition 1

Suppose that a matrix *M* with \(r_n^h(M)=0\) satisfies \(\mathrm{rank}(M) = r \ll K\). Then, it is possible to decompose *M* as a product \(M = L^0 (R^0)^\top \) with \(L^0_{i,k}=R^0_{j,k}=0\) when \(r<k\le K\).

The quantity \(r_n^h\left( \mathbb {E}_\rho [L] \mathbb {E}_\rho [R]^\top \right) + \mathscr {R}(\rho ,\lambda )\) will be referred to as the Approximate Variational Bound (AVB) of \(\rho \) in the following. We are now able to define our estimate.

### Definition 1

In the next section, we study the theoretical properties of our estimate. The main result is that the minimizer \(\widetilde{\rho }_\lambda \) of the \(AVB(\rho ,\lambda )\) has a small prediction risk for a well chosen \(\lambda \). We also provide an algorithm that computes \(\widetilde{\rho }_\lambda \) and show on simulations that it behaves well in practice.

## 3 PAC analysis of the variational approximation

Alquier et al. (2015) propose a general framework for analyzing the prediction properties of VB approximations of pseudo-posteriors based on PAC-Bayesian bounds. In this section, we apply this method to derive a control of the out-of-sample prediction risk \(\mathbf {R}\) for our approximation \(\widetilde{\rho }_\lambda \).

### 3.1 Empirical bound

The first result is a so-called empirical bound: it provides an upper bound on the prediction risk of the pseudo-posterior \(\widetilde{\rho }_\lambda \) that depends only on the data and on quantities defined by the statistician.

### Lemma 1

This shows that our strategy to minimize \(AVB(\rho ,\lambda )\) is indeed the minimization of an empirical upper bound on the prediction risk, a standard approach in PAC-Bayesian theory. An immediate consequence of Lemma 1 and of the definition of \(\widetilde{\rho }_\lambda \) is the following theorem.

### Theorem 1

Even though the bound in the right-hand side may be evaluated in practice, and thus may provide a numerical guarantee on the out-of-sample prediction risk, it is not very clear how it depends on the parameters. The following corollary of Theorem 1 will clarify things. It is obtained by deriving upper bounds of \(AVB(\rho ,\lambda )\) (once again, the proof is provided explicitly in Sect. 1).

### Corollary 1

An exact value for \(\mathscr {C}_{\pi ^\gamma }\) can be deduced from the proof. It is thus clear that the algorithm performs a trade-off between the fit to the data, through the term \(r_n^h(M)\), and the rank of *M*.

In addition to empirical bounds, it is necessary to provide so-called theoretical bounds, which prove that the risk of \(\widetilde{\rho }_\lambda \) indeed converges to the Bayes risk when the sample size grows. This is the goal of the next subsection.

### 3.2 Theoretical bound

For this type of theoretical analysis, it is common in classification to make an additional assumption on \(\mathbf {P}\) which leads to an easier task and therefore to better rates of convergence. We propose a definition adapted from Mammen and Tsybakov (1999).

### Definition 2

*C* such that, for any matrix *M*:

*C* that depends on *t*. For example, in the noiseless case where \(Y=M^B_X\) almost surely, which corresponds to \(t=1\), then

We are now ready to state our theoretical bound. It makes a link between the integrated risk of the estimator and the lowest possible risk, which is reached by the Bayes classifier \(M^B\). In contrast to the empirical bound, it involves non-observable quantities, depending on \(M^B\), in the right-hand side.

### Theorem 2

*s*, *C* and \(\pi ^\gamma \).

Note the adaptive nature of this result, in the sense that the estimator does *not* depend on \(\mathrm{rank}(M^B)\). Clearly, when \(\mathrm{rank}(M^B)\) is small, the prediction error will be close to the Bayes error \(\overline{\mathbf {R}}\) even for a small sample size. This type of inequality is often referred to as an ‘oracle inequality’ in the sense that our estimator behaves as well as if we knew the rank of \(M^B\) through an oracle.

### Corollary 2

### Remark 1

Note that an empirical inequality comparable to Corollary 1 appears in Srebro et al. (2004). In both cases, the dependence of the bounds with respect to *n* is \(1/\sqrt{n}\) (take \(\lambda = \sqrt{n}\) in Corollary 1). One notable difference is that our bound also provides an explicit dependence on the rank, which is not the case in Srebro et al. (2004).

In addition to this, theoretical inequalities like Theorem 2 and Corollary 2 are completely new results. They allow one to compare the out-of-sample error of our predictor to the optimal one. They show that the rate is \(\mathrm{rank}(M^B) (m_1+m_2)/n\) up to log terms. This cannot be improved as this rate is known to be minimax optimal (Alquier et al. 2017).

### Remark 2

Determining the tuning parameter \(\lambda \) is not an easy task in practice: even though there are values that lead to the theoretical bounds, it is more efficient in practice to use cross-validation. We used this technique in the empirical results section.

## 4 Algorithm

### 4.1 General algorithm

### 4.2 Mean field optimization

#### 4.2.1 Inverse-gamma prior

#### 4.2.2 Gamma prior

*L*, *R* is:

## 5 Logistic model

### Proposition 2

### 5.1 Bayes algorithm

## 6 Empirical results

In this section we compare our methods to the other 1-bit matrix completion techniques on simulated and real data sets. It is worth noting that the low rank decomposition does not involve the same matrix: in our model, it affects the Bayes classifier matrix; in the logistic model, it concerns the parameter matrix. The estimate from our algorithm is \(\widehat{M}=\mathbb {E}_{\widetilde{\rho }_\lambda }(L) \mathbb {E}_{\widetilde{\rho }_\lambda }(R)^\top \) and we focus on the zero-one loss in prediction. We first test the performances on simulated matrices and then apply the methods to a real data set. We compare the four following models: (a) hinge loss with variational approximation (referred to as *HL*), (b) Bayesian logistic model with variational approximation (referred to as *Logis.*), (c) the frequentist logistic model from Davenport et al. (2014) (referred to as *freq. Logis.*) and (d) the frequentist least squares model from Mazumder et al. (2010) (referred to as *SI* for SoftImpute). The first two are tested with both Gamma and Inverse-Gamma prior distributions. The hyperparameters are all tuned by cross-validation. The parameter of the frequentist methods is a regularization parameter that is also tuned by cross-validation.

The choice of *K* in our methods is more difficult. A large *K* leads to more parameters to be estimated. This considerably slows down our algorithms. In the end, some (very large) values of *K* are not feasible in practice. Still, what we observe is that the prior leads to an adaptive estimator, in accordance with the theoretical results: when *K* is taken too large (but still small enough in order to keep the computations feasible), the additional parameters are shrunk to zero. Having observed this fact, we keep \(K=10\) in many simulations. Still, we added simulations with a larger value, \(K=50\), in order to show that this shrinkage effect indeed takes place.

From a theoretical perspective, the complexity of each step of Algorithm 1 is of order \((m_1+m_2)K\). Each step only involves very simple calculations and no matrix operations. By contrast, the methods that use the nuclear norm are very time-consuming because the complexity of the SVD is of order \(m_1 m_2 \min (m_1,m_2)\). It is possible to use an approximate SVD, but the method is then more difficult to tune.

### 6.1 Simulated data: small matrices

The noise term (*B*, *Z*) is such that \(\mathbf {R}(M)=\overline{\mathbf {R}}\) and *M* has low rank, denoted *r* in the following. The predictions are directly compared to *M*. Two types of matrices *M* are built: *type A* corresponds to the favorable case for the hinge loss; the entries of *M* lie in \(\{-1,+1\}\) (see footnote 1). *Type B* corresponds to a more difficult classification problem because many entries of *M* are around 0: *M* is a product of two matrices with *r* columns where the entries are i.i.d. \(\mathscr {N}(0,1)\). The noise term is specified in Table 1. Note that the example A3 may also be seen as a switch noise where the sign is kept with probability \(\frac{e}{1+e} \approx 0.73\). Each experiment is run once.

Table 1 Type of noise

Type | Name | *B* | *Z* | Observation \(Y_l\) |
---|---|---|---|---|
1 | No noise | \(B=1\) a.s. | \(Z=0\) a.s. | \(Y_l=\text {sign}(M_{i_l,j_l})\) a.s. |
2 | Switch | \(B\sim 0.9 \delta _1 + 0.1 \delta _{-1}\) | \(Z=0\) a.s. | \(Y_l=\text {sign}(M_{i_l,j_l})\) w.p. 0.9 |
3 | Logistic | \(B=1\) a.s. | \(Z \sim \text {Logistic}\) | \(Y_l =1\) w.p. \(\sigma (M_{i_l,j_l})\) |
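The simulation design can be sketched in Python as follows. The generating formula \(Y_l = B_l\,\text {sign}(M_{i_l,j_l}+Z_l)\) is an assumption chosen because it reproduces the three rows of the noise table; the type A and type B constructions follow the description in the text and in footnote 1, and the function names and sizes are hypothetical.

```python
import numpy as np

def make_type_A(m1, m2, r, rng):
    """Entries in {-1, +1}: r random sign columns, the others are signed copies (rank at most r)."""
    base = rng.choice([-1.0, 1.0], size=(m1, r))
    picks = rng.integers(0, r, size=m2 - r)
    signs = rng.choice([-1.0, 1.0], size=m2 - r)
    return np.hstack([base, base[:, picks] * signs])

def make_type_B(m1, m2, r, rng):
    """Product of two Gaussian factors with r columns: many entries are close to 0."""
    return rng.normal(size=(m1, r)) @ rng.normal(size=(m2, r)).T

def observe(M, rows, cols, noise, rng):
    """Y_l = B_l * sign(M_{i_l, j_l} + Z_l), an assumed form consistent with the noise table."""
    n = len(rows)
    B, Z = np.ones(n), np.zeros(n)
    if noise == "switch":
        B = rng.choice([1.0, -1.0], size=n, p=[0.9, 0.1])  # flip the sign w.p. 0.1
    elif noise == "logistic":
        Z = rng.logistic(size=n)  # standard logistic noise, so P(Y = 1) = sigma(M)
    elif noise != "none":
        raise ValueError(f"unknown noise type: {noise}")
    return B * np.sign(M[rows, cols] + Z)

rng = np.random.default_rng(0)
M = make_type_A(50, 40, r=3, rng=rng)
rows = rng.integers(0, 50, size=20_000)
cols = rng.integers(0, 40, size=20_000)
y_clean = observe(M, rows, cols, "none", rng)
y_switch = observe(M, rows, cols, "switch", rng)
y_logistic = observe(M, rows, cols, "logistic", rng)
```

With type A matrices, the logistic noise keeps the sign of each entry with probability \(\sigma (1)=e/(1+e)\), which matches the remark about example A3.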

Table 2 Prediction error on simulated observations, rank 3

Type | Logis.-G (%) | Logis.-IG (%) | HL-G (%) | HL-IG (%) | Freq. logis. (%) | SI (%) |
---|---|---|---|---|---|---|
A1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
A2 | 0.5 | 0.9 | 0.1 | 0.0 | 0.5 | 0.4 |
A3 | 16.0 | 15.9 | 8.5 | 8.5 | 17.3 | 17.3 |
B1 | 4.1 | 4.0 | 5.3 | 5.8 | 5.1 | 5.6 |
B2 | 10.1 | 10.1 | 10.8 | 10.6 | 10.7 | 10.8 |
B3 | 16.0 | 16.0 | 22.1 | 21.3 | 19.8 | 19.8 |

Table 3 Prediction error on simulated matrices, rank 5

Type | Logis.-G (%) | Logis.-IG (%) | HL-G (%) | HL-IG (%) | Freq. logis. (%) | SI (%) |
---|---|---|---|---|---|---|
A1 | 0.01 | 0.01 | 0.0 | 0.0 | 0.01 | 0.0 |
A2 | 4.4 | 3.1 | 0.54 | 0.55 | 3.1 | 2.8 |
A3 | 32.5 | 33.1 | 27.0 | 26.7 | 30.1 | 30.0 |
B1 | 7.8 | 7.8 | 9.4 | 10.4 | 9.0 | 9.6 |
B2 | 17.3 | 17.3 | 17.9 | 18.1 | 18.3 | 18.4 |
B3 | 21.5 | 21.4 | 24.4 | 22.9 | 22.1 | 22.2 |

For the *A*2 type on rank 3 (Table 2), we see that \(10\%\) of corrupted entries is not enough to prevent an almost perfect recovery of the Bayes classifier matrix. We challenge the frequentist program as well. The results are clear, see Fig. 1: the hinge loss method is better almost everywhere. For a noise level up to \(25\%\), which means that one fourth of the observed entries are corrupted, it is possible to get a very good predictor with less than \(10\%\) of misclassification error. The performance degrades when the level of noise increases, and the problem becomes almost impossible for noise greater than \(30\%\).

### 6.2 Simulated data: large matrices

The second experiment involves larger matrices in order to assess the efficiency of the Bayesian methods on large data sets. The observations now come from a \(2000 \times 2000\) matrix, of which \(10\%\) of the entries are observed at random. The base matrix now has rank 10. It is worth noting that the matrix to be recovered is 100 times larger than in the first example. For the Bayesian methods, we fix \(K=50\). On the other hand, the frequentist methods need a singular value decomposition, which is very time-consuming for such a large matrix. Exactly the same observation generation procedure (Table 1) is used and six experiments are done (Table 4).

Table 4 Prediction error on simulated matrices, rank 10

Type | Logis.-G (%) | Logis.-IG (%) | HL-G (%) | HL-IG (%) | Freq. logis. (%) | SI (%) |
---|---|---|---|---|---|---|
A1 | 0 | 0 | 0 | 0 | 0 | 0 |
A2 | 0.1 | 0.3 | 0 | 0 | 0.1 | 0.1 |
A3 | 9.6 | 9.9 | 1.8 | 1.8 | 9.9 | 9.9 |
B1 | 3.7 | 4.1 | 8.2 | 6.6 | 6.2 | 7.4 |
B2 | 11 | 11 | 11.3 | 10.6 | 11.8 | 12.0 |
B3 | 10.4 | 10.2 | 11.6 | 11.3 | 11.0 | 11.2 |

### 6.3 Real data set: MovieLens

We finally test the methods on the MovieLens data set (see footnote 2). It has already been used by Davenport et al. (2014) and we follow them for the study. The ratings lie between 1 and 5, so we split them into binary data between good ratings (above the mean, which is 3.5) and bad ones. The low rank assumption is usual in this case because it is expected that the taste of a particular user is related to only a few hidden parameters. The smallest data set contains 100,000 ratings from 943 users and 1682 movies, so we use 95,000 of them as a training set and the remaining 5000 as the test set. The performances are very similar between the frequentist logistic model from Davenport et al. (2014) and the hinge loss model. The performances seem slightly worse for the Bayesian logistic model but it is hard to favor a particular model at this stage (note that the difference between 0.28 and 0.27 is not significant on 5000 observations).
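The preprocessing described above can be sketched in Python as follows. The ratings are replaced by a synthetic stand-in array, so no real MovieLens data are loaded, and the random choice of the 5000 test ratings is an assumption, since the text does not say how the split was drawn.

```python
import numpy as np

def binarize_ratings(ratings, threshold=3.5):
    """Map ratings in {1, ..., 5} to +1 (above the threshold) or -1 (below it)."""
    return np.where(ratings > threshold, 1, -1)

def train_test_split_indices(n, n_test, rng):
    """Random split (assumed): return (train indices, test indices)."""
    perm = rng.permutation(n)
    return perm[n_test:], perm[:n_test]

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=100_000)  # synthetic stand-in for the 100,000 ratings
y = binarize_ratings(ratings)
train_idx, test_idx = train_test_split_indices(len(y), n_test=5_000, rng=rng)
```

With integer ratings, the threshold 3.5 labels ratings 4 and 5 as good and ratings 1 to 3 as bad.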

Table 5 Misclassification rate on MovieLens 100k data set

Algorithm | HL-IG | HL-G | Logis.-G | Logis.-IG | Freq. logis. |
---|---|---|---|---|---|
Misclassif. rate | 0.28 | 0.29 | 0.32 | 0.32 | 0.27 |
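The remark that 0.28 and 0.27 are not significantly different on 5000 test observations can be checked with a rough binomial calculation, sketched below in Python. Treating the two error rates as independent is a simplification, since both methods are evaluated on the same test set; the function name is hypothetical.

```python
import math

def misclassification_std_error(rate, n_test):
    """Binomial standard error of an error rate estimated on n_test observations."""
    return math.sqrt(rate * (1.0 - rate) / n_test)

n_test = 5000
se_hl = misclassification_std_error(0.28, n_test)    # hinge loss, inverse-Gamma prior
se_freq = misclassification_std_error(0.27, n_test)  # frequentist logistic model
se_diff = math.sqrt(se_hl**2 + se_freq**2)            # independence assumed (simplification)
z_score = (0.28 - 0.27) / se_diff                     # compare with the usual 1.96 threshold
```

Each rate has a standard error of about 0.006, so a gap of 0.01 is of the same order as the sampling noise.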

## 7 Discussion

We tackle the 1-bit matrix completion problem with classification tools and we are able to derive PAC bounds on the risk and an efficient algorithm to compute the estimator. Previous works only focused on GLM models, which is not the right way to establish distribution-free risk bounds. This work relies on the PAC-Bayesian framework and the pseudo-posterior distribution is approximated by a variational algorithm. In practice, it is able to deal with large matrices. We also derive a variational approximation of the posterior distribution in the Bayesian logistic model and it works very well in our examples.

The variational approximations look very promising in order to build algorithms that are able to deal with large data, and this framework may be extended to more general models and other machine learning tools.

## Footnotes

- 1. The matrices are built by drawing *r* independent columns with entries in \(\{-1,1\}\). The remaining columns are randomly equal to one of the first *r* columns multiplied by a factor in \(\{-1,1\}\).
- 2. Available at http://grouplens.org/datasets/movielens/100k/.

## Notes

### Acknowledgements

We would like to thank Vincent Cottet’s Ph.D. supervisor Professor Nicolas Chopin, for his kind support during the project and the three anonymous referees for their helpful and constructive comments.

## References

- Alquier, P., Cottet, V., & Lecué, G. (2017). Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. arXiv preprint arXiv:1702.01402.
- Alquier, P., Ridgway, J., & Chopin, N. (June 2015). On the properties of variational approximations of Gibbs posteriors. arXiv e-prints.
- Bishop, C. M. (2006). *Pattern recognition and machine learning (information science and statistics)*. New York: Springer.
- Boucheron, S., Lugosi, G., & Massart, P. (2013). *Concentration inequalities: A nonasymptotic theory of independence*. Oxford: OUP.
- Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. *Foundations and Trends in Machine Learning*, *3*(1), 1–122.
- Cai, T., & Zhou, W.-X. (2013). A max-norm constrained minimization approach to 1-bit matrix completion. *Journal of Machine Learning Research*, *14*, 3619–3647.
- Candès, E. J., & Plan, Y. (2010). Matrix completion with noise. *Proceedings of the IEEE*, *98*(6), 925–936.
- Candès, E. J., & Recht, B. (2012). Exact matrix completion via convex optimization. *Communications of the ACM*, *55*(6), 111–119.
- Candès, E. J., & Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. *IEEE Transactions on Information Theory*, *56*(5), 2053–2080.
- Catoni, O. (2004). Statistical learning theory and stochastic optimization. In J. Picard (Ed.), *Saint-Flour Summer School on probability theory 2001*. Lecture notes in mathematics. Berlin: Springer.
- Catoni, O. (2007). *PAC-Bayesian supervised classification: The thermodynamics of statistical learning* (Vol. 56). Institute of mathematical statistics lecture notes, Monograph series. Beachwood, OH: Institute of Mathematical Statistics.
- Chatterjee, S. (2015). Matrix estimation by universal singular value thresholding. *Annals of Statistics*, *43*(1), 177–214.
- Dalalyan, A., & Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. *Machine Learning*, *72*(1), 39–61.
- Davenport, M. A., Plan, Y., van den Berg, E., & Wootters, M. (2014). 1-Bit matrix completion. *Information and Inference*, *3*(3), 189–223.
- Herbrich, R., & Graepel, T. (2002). A PAC-Bayesian margin bound for linear classifiers. *IEEE Transactions on Information Theory*, *48*(12), 3140–3150.
- Herbster, M., Pasteris, S., & Pontil, M. (2016). Mistake bounds for binary matrix completion. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), *Proceedings of the 29th conference on neural information processing systems (NIPS 2016)*. Barcelona, Spain: NIPS Proceedings.
- Hsieh, C.-J., Natarajan, N., & Dhillon, I. S. (2015). PU learning for matrix completion. In *Proceedings of the 32nd international conference on machine learning*, pp. 2445–2453.
- Jaakkola, T. S., & Jordan, M. I. (2000). Bayesian parameter estimation via variational methods. *Statistics and Computing*, *10*(1), 25–37.
- Klopp, O., Lafond, J., Moulines, É., & Salmon, J. (2015). Adaptive multinomial matrix completion. *Electronic Journal of Statistics*, *9*(2), 2950–2975.
- Kyung, M., Gill, J., Ghosh, M., & Casella, G. (2010). Penalized regression, standard errors, and Bayesian lassos. *Bayesian Analysis*, *5*(2), 369–412.
- Latouche, P., Robin, S., & Ouadah, S. (2015). Goodness of fit of logistic models for random graphs. arXiv preprint arXiv:1508.00286.
- Lim, Y. J., & Teh, Y. W. (2007). Variational Bayesian approach to movie rating prediction. In *Proceedings of KDD cup and workshop*.
- Mai, T. T., & Alquier, P. (2015). A Bayesian approach for noisy matrix completion: Optimal rate under general sampling distribution. *Electronic Journal of Statistics*, *9*(1), 823–841.
- Mammen, E., & Tsybakov, A. (1999). Smooth discrimination analysis. *The Annals of Statistics*, *27*(6), 1808–1829.
- Mazumder, R., Hastie, T., & Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. *Journal of Machine Learning Research*, *11*(Aug), 2287–2322.
- McAllester, D. A. (1998). Some PAC-Bayesian theorems. In *Proceedings of the eleventh annual conference on computational learning theory* (pp. 230–234). New York: ACM.
- Park, T., & Casella, G. (2008). The Bayesian Lasso. *Journal of the American Statistical Association*, *103*(482), 681–686.
- Recht, B., & Ré, C. (2013). Parallel stochastic gradient algorithms for large-scale matrix completion. *Mathematical Programming Computation*, *5*(2), 201–226.
- Salakhutdinov, R., & Mnih, A. (2008). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In *Proceedings of the 25th international conference on machine learning*, pp. 880–887.
- Seldin, Y., & Tishby, N. (2010). PAC-Bayesian analysis of co-clustering and beyond. *Journal of Machine Learning Research*, *11*(Dec), 3595–3646.
- Seldin, Y., Laviolette, F., Cesa-Bianchi, N., Shawe-Taylor, J., & Auer, P. (2012). PAC-Bayesian inequalities for martingales. *IEEE Transactions on Information Theory*, *58*(12), 7086–7093.
- Shawe-Taylor, J., & Langford, J. (2003). PAC-Bayes and margins. *Advances in Neural Information Processing Systems*, *15*, 439.
- Srebro, N., Rennie, J., & Jaakkola, T. S. (2004). Maximum-margin matrix factorization. In *Advances in neural information processing systems*, pp. 1329–1336.
- Vapnik, V. (1998). *Statistical learning theory*. New York: Wiley.
- Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. *Annals of Statistics*, *32*(1), 56–85.