Simulation comparisons between Bayesian and de-biased estimators in low-rank matrix completion

In this paper, we study the low-rank matrix completion problem, a class of machine learning problems that aims at predicting the missing entries of a partially observed matrix. Such problems appear in several challenging applications such as collaborative filtering, image processing, and genotype imputation. We compare Bayesian approaches and a recently introduced de-biased estimator which provides a useful way to build confidence intervals of interest. From a theoretical viewpoint, the de-biased estimator comes with a sharp minimax-optimal rate of estimation error, whereas the Bayesian approach reaches this rate with an additional logarithmic factor. Interestingly, our simulation studies show that the de-biased estimator is merely on par with the Bayesian estimators. Moreover, the Bayesian approaches are much more stable and can outperform the de-biased estimator in the case of small samples. In addition, we find that the empirical coverage rate of the confidence intervals obtained by the de-biased estimator for an entry is consistently lower than that of the considered credible intervals. These results call for further theoretical studies on the estimation error and the concentration of Bayesian methods, which remain quite limited to date.

From a frequentist point of view, most recent methods are based on penalized optimization. Seminal results can be found in [Recht, 2009, Candès and Tao, 2010] for exact matrix completion (the noiseless case) and were further developed for the noisy case in [Candès and Plan, 2010, Koltchinskii et al., 2011, Negahban and Wainwright, 2012]. Efficient algorithms have also been studied, see for example [Mazumder et al., 2010, Recht and Ré, 2013, Hastie et al., 2015]. In particular, in the notable work [Koltchinskii et al., 2011], the authors studied nuclear-norm penalized estimators and provided reconstruction error rates for their methods. They also showed that these error rates are minimax-optimal (up to a logarithmic factor). Note that the error rate, i.e. the average quadratic error on the entries, of a rank-r matrix of size m × p estimated from n observations cannot be better than r max(m, p)/n [Koltchinskii et al., 2011].
More recently, [Chen et al., 2019] proposed de-biased estimators for the problem of noisy low-rank matrix completion. The estimation accuracy of this estimator is sharp in the sense that it reaches the minimax-optimal rate without any additional logarithmic factor. A sharp bound had also been obtained by a different estimator in [Klopp, 2015]; however, no uncertainty quantification was given there. More importantly, [Chen et al., 2019] also provide confidence intervals for the entries of the underlying matrix based on the de-biased estimator. It is worth noting that uncertainty quantification for matrix completion is not straightforward: in general, the solutions for matrix completion do not admit a closed form, and the distributions of the estimates returned by state-of-the-art algorithms are hard to derive.
On the other hand, uncertainty quantification is straightforward from a Bayesian perspective. More specifically, the unknown matrix is treated as a random variable with a specific prior distribution, and statistical inference is obtained from the posterior distribution, for example via credible intervals. Bayesian methods for low-rank matrix completion have mainly been studied from a computational viewpoint [Lim and Teh, 2007, Salakhutdinov and Mnih, 2008, Zhou et al., 2010, Alquier et al., 2014, Lawrence and Urtasun, 2009, Cottet and Alquier, 2018, Babacan et al., 2012, Yang et al., 2018]. Most Bayesian estimators are based on conjugate priors, which allow the use of Gibbs sampling [Alquier et al., 2014, Salakhutdinov and Mnih, 2008] or Variational Bayes methods [Lim and Teh, 2007]. These algorithms are fast enough to handle, and have actually been tested on, large datasets such as Netflix [Bennett and Lanning, 2007] or MovieLens [Harper and Konstan, 2015]. However, the theoretical understanding of Bayesian estimators is quite limited; to our knowledge, [Mai and Alquier, 2015] and [Alquier and Ridgway, 2020] are the only prominent examples. More specifically, they showed that a Bayesian estimator with a low-rank factorization prior reaches the minimax-optimal rate up to a logarithmic factor, and [Alquier and Ridgway, 2020] further show that the same estimation error rate is attained by a Variational Bayes estimator.
In this paper, to understand the performance of Bayesian approaches relative to the de-biased estimators, we perform numerical comparisons of the estimation accuracy (the estimation error, the normalized squared error and the prediction error, see Section 3) between the de-biased estimator of [Chen et al., 2019] and the Bayesian methods of [Alquier and Ridgway, 2020], whose statistical properties have been well studied. Furthermore, we examine in detail the behaviour of the confidence intervals obtained from the de-biased estimator and of the Bayesian credible intervals. Interestingly, recent works [Rendle et al., 2019, Rendle et al., 2020] show that Bayesian methods are now among the most accurate in practical recommender systems. Although Bayesian methods have become popular for matrix completion, their uncertainty quantification (e.g. credible intervals) has received much less attention in the literature.
The results of our simulation comparisons carry some interesting messages. More specifically, the de-biased estimator is only on par with the Bayesian estimators in terms of estimation accuracy, although it does successfully improve upon the base estimator it de-biases. On the other hand, the Bayesian approaches are much more stable than the de-biased method and, in addition, they outperform the de-biased estimator especially in the case of small samples. Moreover, we find that the coverage rates of the 95% confidence intervals obtained using the de-biased estimator are lower than those of the 89% equal-tailed credible intervals. This evidence suggests that the Bayesian estimators may actually reach the minimax-optimal rate sharply and that the log-term could be an artifact of the proof technique (PAC-Bayesian bounds). Furthermore, the concentration rate of the corresponding Bayesian posterior discussed in [Alquier and Ridgway, 2020], which carries a log-term, might not be tight.
The rest of the paper is structured as follows. In Section 2 we present the low-rank matrix completion problem, then introduce the de-biased estimator and the corresponding confidence interval and provide details on the considered Bayesian estimators. In Section 3, simulation studies comparing the different methods are presented. We discuss our results and give some concluding remarks in the final section.
2 Low-rank matrix completion

Model
In this work, we adopt the statistical model commonly studied in the literature for noisy matrix completion [Chen et al., 2019]. Let M* ∈ R^{m×p} be an unknown rank-r matrix of interest. We partially observe some noisy entries of M* as

Y_ij = M*_ij + E_ij , (i, j) ∈ Ω, (1)

where Ω ⊆ {1, . . . , m} × {1, . . . , p} is a small subset of indexes and the E_ij ∼ N(0, σ²) are independent noise terms at the locations (i, j). The random sampling model assumes that each index (i, j) ∈ Ω is observed independently with probability κ (i.e., data are missing uniformly at random). The problem of estimating M* from the n = |Ω| < mp observations is called the (noisy) low-rank matrix completion problem. Let P_Ω(·) : R^{m×p} → R^{m×p} be the orthogonal projection onto the set of matrices supported on the index set Ω, that is, [P_Ω(A)]_ij = A_ij if (i, j) ∈ Ω and 0 otherwise.

Notations: For a matrix A ∈ R^{m×p}, ‖A‖_F = √(trace(A^⊤A)) denotes its Frobenius norm and ‖A‖_* = trace(√(A^⊤A)) denotes its nuclear norm. [a ± b] denotes the interval [a − b, a + b]. We use I_q to denote the identity matrix of dimension q × q.
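As an illustration, the sampling model above can be simulated as follows. This is a minimal sketch in Python/NumPy (the paper's own code is in R; all variable names here are chosen for illustration): it generates a rank-r truth, a uniformly sampled observation set Ω, and the noisy observed matrix P_Ω(M* + E).

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, r, sigma, kappa = 100, 100, 2, 1.0, 0.5

# Rank-r ground truth M* = U* V*^T
U_star = rng.standard_normal((m, r))
V_star = rng.standard_normal((p, r))
M_star = U_star @ V_star.T

# Each entry is observed independently with probability kappa
mask = rng.random((m, p)) < kappa          # True on the observed set Omega
noise = sigma * rng.standard_normal((m, p))
Y = np.where(mask, M_star + noise, 0.0)    # P_Omega(M* + E)

n = int(mask.sum())                        # number of observed entries
```

With κ = 0.5, roughly half of the mp entries are observed; the unobserved entries of Y are simply stored as zeros, which is how the projection P_Ω is usually represented in code.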

The de-biased estimator
Let M̂ be either the solution of the nuclear-norm regularized problem [Mazumder et al., 2010]

min_{Z ∈ R^{m×p}} (1/2) ‖P_Ω(Y − Z)‖_F² + λ ‖Z‖_* ,

or of the factorized problem [Hastie et al., 2015]

min_{U ∈ R^{m×r}, V ∈ R^{p×r}} (1/2) ‖P_Ω(Y − UV^⊤)‖_F² + (λ/2) (‖U‖_F² + ‖V‖_F²), (2)

where λ > 0 is a tuning parameter. The optimization problem in (2) can be seen as finding the MAP (maximum a posteriori) in a Bayesian model with Gaussian priors on the columns of the factors U and V; a detailed discussion can be found in [Alquier et al., 2014, Fithian and Mazumder, 2018]. Given an estimator M̂ as above, the de-biased estimator [Chen et al., 2019] is defined as

M^db = Pr_{rank-r}( M̂ + κ^{-1} P_Ω(Y − M̂) ),

where Pr_{rank-r}(B) = arg min_{A : rank(A) ≤ r} ‖A − B‖_F is the projection onto the set of rank-r matrices.
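Numerically, the de-biasing step amounts to adding back the rescaled residual on the observed entries and then truncating the SVD. The sketch below is an illustrative reconstruction of that procedure under this assumed form (it is not the exact implementation of [Chen et al., 2019]; the function name and arguments are ours).

```python
import numpy as np

def debias_rank_r(M_hat, Y, mask, kappa, r):
    """De-bias M_hat by adding back the observed residual rescaled by 1/kappa,
    then project onto the set of rank-r matrices via a truncated SVD."""
    residual = np.where(mask, Y - M_hat, 0.0)      # P_Omega(Y - M_hat)
    corrected = M_hat + residual / kappa
    U, s, Vt = np.linalg.svd(corrected, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]             # best rank-r approximation
```

The truncated SVD gives exactly the Frobenius projection Pr_rank-r by the Eckart–Young theorem, so the returned matrix always has rank at most r.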
Remark 1. The estimation accuracy of the de-biased estimator, provided in Theorem 3 of [Chen et al., 2019] under some assumptions, is (1/(mp)) ‖M^db − M*‖_F² ≤ c max(m, p) r σ² / n without any extra log-term, where c is a universal numerical constant.

The Bayesian estimators
The Bayesian estimator studied in [Alquier and Ridgway, 2020] is given by

M̂_B = ∫ M π̂_λ(M | Y) dM, where π̂_λ(M | Y) ∝ L(Y | M)^λ π(M)

is the (tempered) posterior and L(Y | M)^λ is the likelihood raised to the power λ. Here λ ∈ (0, 1) is a tuning parameter and π(M) is the prior distribution.

Priors
A popular choice of priors in Bayesian matrix completion is to assign conditional Gaussian priors to the factors U ∈ R^{m×K} and V ∈ R^{p×K} such that M = UV^⊤, for a fixed integer K ≤ min(m, p). More specifically, for k ∈ {1, . . . , K}, independently,

U_{·,k} | γ_k ∼ N(0, γ_k I_m), V_{·,k} | γ_k ∼ N(0, γ_k I_p), γ_k^{-1} ∼ Gamma(a, b), (5)

where I_q is the identity matrix of dimension q × q and a, b are tuning parameters. This type of prior is conjugate, so the conditional posteriors can be derived explicitly in closed form, which allows the use of the Gibbs sampler; see [Salakhutdinov and Mnih, 2008] for details. Reviews and discussion of low-rank factorization priors can be found in [Alquier, 2013, Alquier et al., 2014].
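A single draw from this factorization prior can be sketched as follows. This is an illustrative Python/NumPy snippet with names mirroring K, a and b above; the exact hyperprior on the column variances (here a Gamma draw on the precisions, i.e. an inverse-Gamma prior on the variances) is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, K, a, b = 100, 100, 10, 1.0, 1 / 100

# Column variances gamma_k: Gamma(a, b) on the precisions (assumption),
# which is the conjugate choice enabling closed-form Gibbs updates.
gamma = 1.0 / rng.gamma(shape=a, scale=b, size=K)

# Conditionally Gaussian columns: U_k ~ N(0, gamma_k I_m), V_k ~ N(0, gamma_k I_p)
U = rng.standard_normal((m, K)) * np.sqrt(gamma)
V = rng.standard_normal((p, K)) * np.sqrt(gamma)
M = U @ V.T   # one draw of the matrix under the prior, rank at most K
```

Because each column pair contributes a rank-one term γ_k-scaled in magnitude, small draws of γ_k effectively switch columns off, which is the mechanism behind the approximate low-rank behaviour discussed in Section 3.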
Remark 2. In the case where the rank r is unknown, it is natural to take K as large as possible, e.g. K = min(m, p), but this may be computationally prohibitive when K is large.
Remark 3. The estimation error of this Bayesian estimator, under some assumptions, given in Corollary 4.2 of [Alquier and Ridgway, 2020], is (1/(mp)) ‖M̂_B − M*‖_F² ≲ max(m, p) r σ² / n with an additional (multiplicative) log-term log(n max(m, p)). The same rate is also reached in [Mai and Alquier, 2015], with an additional (multiplicative) log-term log(min(m, p)), under a general sampling distribution; however, those authors considered truncated priors.
For a given rank r, we propose to consider the following prior, called the fixed-rank prior: independently for k = 1, . . . , r,

U_{·,k} ∼ N(0, γ I_m), V_{·,k} ∼ N(0, γ I_p), (6)

where γ > 0 is a fixed variance parameter. This prior is a simplified version of the prior (5). We note that for K > r the Gibbs sampler for the fixed-rank prior is faster than the one for the prior (5). Interestingly, the simulation results for the Bayesian estimator with this prior are in some cases slightly better than those based on the prior (5).
Remark 4. We remark that the theoretical estimation error of the Bayesian estimator with the fixed-rank prior (6) remains unchanged, following Corollary 4.2 of [Alquier and Ridgway, 2020].

Credible intervals
Using the Bayesian approach, credible intervals for the matrix and for functions of it (e.g. its entries) can easily be constructed using the Markov Chain Monte Carlo (MCMC) technique.
Here, we focus on the equal-tailed credible interval for an entry.
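Given the Gibbs draws of a single entry, the equal-tailed interval simply cuts equal probability from both tails of the posterior sample. A minimal sketch (the function name and the synthetic draws are illustrative):

```python
import numpy as np

def equal_tailed_ci(samples, level=0.89):
    """Equal-tailed credible interval from MCMC draws of one entry:
    cut (1 - level)/2 probability from each tail of the posterior sample."""
    alpha = 1.0 - level
    lo, hi = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# e.g. 500 posterior draws of one entry collected from the Gibbs sampler
draws = np.random.default_rng(0).normal(loc=2.0, scale=0.3, size=500)
lo, hi = equal_tailed_ci(draws, level=0.89)
```

By construction, roughly 89% of the posterior draws fall inside [lo, hi]; only the number of available draws limits how accurately the tails are estimated.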
More precisely, the credible intervals are reported as 89% equal-tailed intervals, as recommended by [Kruschke, 2014, McElreath, 2020] for small posterior samples such as ours with 500 posterior draws. We note that, according to [Salakhutdinov and Mnih, 2008], when the data are too big to draw a sample of reasonable size, drawing only 500 observations from the Gibbs sampler took 90 hours on the Netflix dataset. Thus, we focus on the 89% equal-tailed credible intervals based on 500 posterior samples. It is, however, noted that to obtain 95% intervals, an effective posterior sample size of at least 10,000 is recommended [Kruschke, 2014], which is computationally costly to run on all of our simulations. A few examples with 10,000 posterior samples are examined in Figure 2.

3 Simulation studies

Experimental designs
To assess the performance of the different estimators, a series of experiments was conducted with simulated data. We fix m = 100 and vary the other dimension, taking p = 100 and p = 1000. The rank r is varied between r = 2 and r = 5.
• Setting I: In the first setting, a rank-r matrix M* ∈ R^{m×p} is generated as the product of two rank-r factors, M* = U*V*^⊤, where U* ∈ R^{m×r} and V* ∈ R^{p×r} have i.i.d. N(0, 1) entries. With a missing rate τ of 20%, 50% or 80%, the observed entries are drawn uniformly at random. This sampled set is then corrupted by noise as in (1), where the E_ij are i.i.d. N(0, 1).
• Setting II: The second series of simulations is similar to the first one, except that the matrix M* is no longer exactly of rank r but can be well approximated by a rank-r matrix: it is the sum of a rank-r matrix and a small perturbation built from matrices A and B whose entries are i.i.d. N(0, 1).
• Setting III: This setting is similar to Setting I, but with heavy-tailed noise: the noise terms E_ij are i.i.d. from the Student t_3 distribution with 3 degrees of freedom.
• Setting IV: The setup is again similar to Setting I, but we consider a more extreme case where the entries of U*, V* and the noise terms E_ij are all drawn i.i.d. from the Student t_3 distribution.
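The four data-generating settings above can be sketched as follows. This is an illustrative Python/NumPy reconstruction: the form and the 0.1 scale of the Setting II perturbation are placeholders of ours, not the paper's exact construction.

```python
import numpy as np

m, p, r = 100, 100, 2

def make_truth(setting, rng):
    """Generate M* for Settings I-IV (illustrative sketch)."""
    if setting in ("I", "III"):           # exact rank-r, Gaussian factors
        U = rng.standard_normal((m, r))
        V = rng.standard_normal((p, r))
        return U @ V.T
    if setting == "II":                   # rank-r plus small full-rank part
        U = rng.standard_normal((m, r))
        V = rng.standard_normal((p, r))
        A = rng.standard_normal((m, p))
        return U @ V.T + 0.1 * A          # perturbation scale is an assumption
    if setting == "IV":                   # heavy-tailed factors
        U = rng.standard_t(3, (m, r))
        V = rng.standard_t(3, (p, r))
        return U @ V.T

def make_noise(setting, rng):
    """Gaussian noise for Settings I-II, Student t_3 noise for III-IV."""
    if setting in ("I", "II"):
        return rng.standard_normal((m, p))
    return rng.standard_t(3, (m, p))
```

The only differences across settings are the factor distribution (Gaussian vs. Student t_3) and whether a full-rank perturbation is added, which is why a single pair of helper functions covers all four cases.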
Remark 5. We note that for the second series of simulations, with approximately low-rank matrices, the theory of the de-biased estimator cannot be used, whereas the theoretical guarantees for the Bayesian estimators remain valid, see [Alquier and Ridgway, 2020]. Setting I follows exactly the minimax-optimal regime and thus allows us to assess the accuracy of the considered estimators. The last two settings (III and IV) are model-misspecification setups in which no theoretical guarantee is available for any of the considered estimators.
The behavior of an estimator (say M̂) is evaluated through the average squared error (ase) per entry,

ase := (1/(mp)) ‖M̂ − M*‖_F² ,

the relative squared error (rse),

rse := ‖M̂ − M*‖_F² / ‖M*‖_F² ,

and the error in predicting the missing entries,

Pred := (1/|Ω̄|) Σ_{(i,j)∈Ω̄} (M̂_ij − M*_ij)² ,

where Ω̄ is the set of unobserved entries. For each setup, we generate 100 data sets (simulation replicates) and report the average and the standard deviation of each error measure for each estimator over the replicates. We compare the de-biased estimator (denoted 'd.b'), the Bayesian estimator with the fixed-rank prior (6) (denoted 'f.Bayes') and the Bayesian estimator with the flexible-rank prior (5) (denoted 'Bayes'). As a by-product of computing the de-biased estimator through the Alternating Least Squares estimator (2), we also report the results for this estimator, denoted 'als'.
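The three error measures are straightforward to compute; a short sketch follows, where `mask` marks the observed set Ω so that its complement is Ω̄ (function and variable names are ours).

```python
import numpy as np

def errors(M_hat, M_star, mask):
    """Return (ase, rse, pred): average, relative and out-of-sample
    squared errors of the estimate M_hat against the truth M_star."""
    m, p = M_star.shape
    diff = M_hat - M_star
    ase = (diff ** 2).sum() / (m * p)                 # average squared error
    rse = (diff ** 2).sum() / (M_star ** 2).sum()     # relative squared error
    unobs = ~mask                                     # the unobserved set
    pred = (diff[unobs] ** 2).sum() / unobs.sum()     # error on missing entries
    return ase, rse, pred
```

Note that ase and rse are computed over all mp entries, while pred restricts the comparison to the entries that the estimator never saw.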
The 'als' estimator is available from the R package 'softImpute' [Mazumder et al., 2010] and is used with its default options. The 'd.b' estimator is run with λ = 2.5σ√(mp) as in [Chen et al., 2019]. The 'f.Bayes' and 'Bayes' estimators use the tuning parameter λ = 1/(4σ²); the prior parameters for the 'Bayes' estimator are K = 10, a = 1, b = 1/100. The Gibbs samplers for these two Bayesian estimators are run for 500 steps with 100 burn-in steps.

Results on estimation accuracy
From the results in Tables 1 and 2, it is clear that the de-biased estimator significantly outperforms the base estimator it de-biases. Nevertheless, the de-biased estimator is only on par with the Bayesian methods in some cases. More specifically, in Table 1, the de-biased estimator behaves similarly to the Bayesian estimators when the missing rate is low or moderate (τ = 20% or 50%). In the case of a high missing rate, τ = 80%, the de-biased estimator returns highly unstable results; this may be because its base estimator (here the 'als' estimator) is unstable with few observations. However, as the dimension of the matrix increases, the differences between the de-biased estimator and the Bayesian estimators become smaller. The same pattern is observed in the setting of approximately low-rank matrices, as in Table 2 and in Tables 5 and 6.
The 'f.Bayes' method quite often yields the best results in terms of all considered errors (ase, rse and Pred) in the settings with exactly low-rank matrices. However, in the settings where the true underlying matrix is only approximately low-rank (Tables 2 and 6), the 'Bayes' approach is in some cases slightly better than the 'f.Bayes' approach. This can be explained by the fact that the 'Bayes' approach employs a kind of approximate low-rank prior, through the Gamma-type prior on the variances of the factor matrices, and is thus able to adapt to approximate low-rankness.
Results in the cases of model misspecification with heavy-tailed noise are given in Tables 3 and 4. Although the Bayesian methods, especially 'f.Bayes', yield better results than 'als' or 'd.b', all methods fail in the case of highly missing data, τ = 80%. This could be due to the fact that all the considered methods are designed for Gaussian noise and are therefore not robust to heavy-tailed noise, such as Student noise.

Results on uncertainty quantification
To examine uncertainty quantification across the methods, we simulate a matrix as in Section 3.1 and then repeat the observation process 100 times. More precisely, we obtain 100 data sets by replicating the observation of 20%, 50% and 80% of the entries of M* using uniform sampling; each sampled set is then corrupted by noise as in (1), where the E_ij are i.i.d. N(0, 1). Tables 5 and 6 gather the empirical coverage rates of the confidence intervals and of the credible intervals of all methods over the 100 independent experiments. More precisely, we report the 95% confidence intervals for the de-biased method. The credible intervals are reported as 89% equal-tailed intervals, see [Kruschke, 2014, McElreath, 2020], for small posterior samples such as ours with 500 posterior draws. We note that to obtain 95% intervals, an effective posterior sample size of at least 10,000 is recommended [Kruschke, 2014]. A few examples from Setting I with 10,000 posterior samples are given in Figure 2.
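The empirical coverage rate reported in these tables amounts to counting how often an interval contains the true entry. A minimal sketch (the function name and inputs are illustrative; `lower` and `upper` would hold the per-entry bounds of either the confidence or the credible intervals):

```python
import numpy as np

def coverage_rate(lower, upper, M_star):
    """Fraction of entries whose interval [lower_ij, upper_ij] contains the
    true value M*_ij; averaging this quantity over the 100 replicates gives
    the empirical coverage rate."""
    covered = (lower <= M_star) & (M_star <= upper)
    return float(covered.mean())
```

A well-calibrated 95% confidence interval should give coverage close to 0.95 under this measure; values well below the nominal level signal the under-coverage discussed next.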
A noteworthy conclusion from the results in Tables 5 and 6 is that the coverage rates of the 89% credible intervals are significantly higher than those of the 95% confidence intervals obtained from the de-biased method. The credible intervals of the 'f.Bayes' approach show a slightly better coverage rate than those of the 'Bayes' approach. It is also noted that in the approximately low-rank setting, Table 6, where there is no theoretical guarantee for the de-biased estimator, the coverage rate of the confidence intervals is very low, while the credible intervals still provide reliable coverage. These results further explain why the Bayesian methods yield better accuracy in Tables 1 and 2.
In Figure 2, we compare the limiting Gaussian distribution of the de-biased estimator and the posterior samples of the 'f.Bayes' method against the true entries of interest. The limiting Gaussian distribution of the de-biased estimator has slightly sharper tails than the distribution of the posterior samples. In addition, Figure 1 displays Q-Q (quantile-quantile) plots of 10,000 posterior samples of some entries against standard Gaussian quantiles. It shows that the posterior distributions of these entries match the Gaussian distribution reasonably well.
The empirical coverage rates of the confidence intervals and of the credible intervals for Settings III and IV, with heavy-tailed noise, are gathered in Tables 7 and 8. We can see a slight reduction in the empirical coverage rates of all methods compared with the Gaussian-noise setting in Table 5. As in Settings I and II, the empirical coverage rates of the confidence intervals decrease quickly as the missing rate τ increases, while the empirical coverage rates of the credible intervals remain stable.

Discussion and Conclusion
In this paper, we have provided extensive numerical comparisons between the de-biased estimator and Bayesian estimators for the low-rank matrix completion problem. The numerical simulations draw a systematic picture of the behaviour of these estimators. More specifically, in terms of estimation accuracy, the de-biased estimator is comparable to the Bayesian estimators, whereas the Bayesian estimators are much more stable and in some cases can outperform the de-biased estimator, especially in the small-sample regime. Moreover, the credible intervals cover the underlying entries quite well, slightly better than the confidence intervals, in exact low-rank matrix completion. However, in the case of approximate low-rankness, the confidence intervals produced by the de-biased estimator no longer work well. These results are of interest and can serve as a guideline for researchers and practitioners in the many areas where one only has access to a few observations.
On the other hand, the results in this work suggest that the considered Bayesian estimators may actually reach the minimax-optimal rate of convergence without the additional logarithmic factor. The extra log-terms could be due to the PAC-Bayesian bounding technique used to prove the theoretical properties of the Bayesian estimator. Moreover, as shown in [Alquier and Ridgway, 2020], the same rate with a log-term is proved for the concentration of the corresponding posterior, and in light of the observed coverage of the credible intervals we conjecture that this rate could also be improved. To our knowledge, these important questions remain open.
Last but not least, it would also be important to compare with the Variational Bayes (VB) method of [Lim and Teh, 2007], whose theoretical guarantees are given in [Alquier and Ridgway, 2020], because this method is very popular for matrix completion on large datasets. This will be the objective of future work. We note, however, that in a preprint [Alquier et al., 2014] the authors performed some comparisons between the Bayesian approach and the VB method. The message of their work is that VB can be expected to be more or less as accurate as the full Bayesian approach, perhaps slightly less, but that its credible intervals may be inaccurate (see e.g. Figure 3 in [Alquier et al., 2014]).

Availability of data and codes
The R codes and data used in the numerical experiments are available at: https://github.com/tienmt/UQMC .