Abstract
Matrix factorization is a very common machine learning technique in recommender systems. Bayesian Matrix Factorization (BMF) algorithms would be attractive because of their ability to quantify uncertainty in their predictions and avoid overfitting, combined with high prediction accuracy. However, they have not been widely used on large-scale data because of their prohibitive computational cost. In recent work, efforts have been made to reduce the cost, both by improving the scalability of the BMF algorithm as well as its implementation, but so far mainly separately. In this paper we show that the state-of-the-art of both approaches to scalability can be combined. We combine the recent highly-scalable Posterior Propagation algorithm for BMF, which parallelizes computation of blocks of the matrix, with a distributed BMF implementation that uses asynchronous communication within each block. We show that the combination of the two methods gives substantial improvements in the scalability of BMF on web-scale datasets, when the goal is to reduce the wall-clock time.
1 Introduction
Matrix Factorization (MF) is a core machine learning technique for applications of collaborative filtering, such as recommender systems or drug discovery, where a data matrix \(\mathbf {R}\) is factorized into a product of two matrices, such that \(\mathbf {R}\approx \mathbf {U}\mathbf {V}^\top \). The main task in such applications is to predict unobserved elements of a partially observed data matrix. In recommender systems, the elements of \(\mathbf {R}\) are often ratings given by users to items, while in drug discovery they typically represent bioactivities between chemical compounds and protein targets or cell lines.
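As a minimal illustration (toy sizes and random factors, not the paper's code), predicting an unobserved entry of \(\mathbf {R}\) reduces to a dot product between the corresponding rows of \(\mathbf {U}\) and \(\mathbf {V}\):

```python
import numpy as np

# Illustrative sketch: predict unobserved entries of a partially
# observed rating matrix from already-learned factors U and V.
rng = np.random.default_rng(0)
N, D, K = 6, 4, 2          # users, items, latent dimensions (toy sizes)
U = rng.normal(size=(N, K))
V = rng.normal(size=(D, K))

R_hat = U @ V.T            # R ~= U V^T: a prediction for every entry
# A single prediction for user n, item d is just a dot product:
n, d = 3, 1
assert np.isclose(R_hat[n, d], U[n] @ V[d])
```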
Bayesian Matrix Factorization (BMF) [2, 13], formulates the matrix factorization task as a probabilistic model, with Bayesian inference conducted on the unknown matrices \(\mathbf {U}\) and \(\mathbf {V}\). Advantages often associated with BMF include robustness to overfitting and improved predictive accuracy, as well as flexible utilization of prior knowledge and sidedata. Finally, for application domains such as drug discovery, the ability of the Bayesian approach to quantify uncertainty in predictions is of crucial importance [9].
Despite the appeal and many advantages of BMF, scaling up the posterior inference for industry-scale problems has proven difficult. Scaling up to this level requires both data and computations to be distributed over many workers, and so far only very few distributed implementations of BMF have been presented in the literature. In [16], a high-performance computing implementation of BMF using Gibbs sampling for distributed systems was proposed. The authors considered three different distributed programming models: Message Passing Interface (MPI), Global Address Space Programming Interface (GASPI) and ExaSHARK. In a different line of work, [1] proposed to use a distributed version of the minibatch-based Stochastic Gradient Langevin Dynamics algorithm for posterior inference in BMF models. While a key factor in devising efficient distributed solutions is to be able to minimize communication between worker nodes, both of the above solutions require some degree of communication in between iterations.
A recent promising proposal, aiming at minimizing communication and thus reaching a solution in a faster way, is to use a hierarchical embarrassingly parallel MCMC strategy [12]. This technique, called BMF with Posterior Propagation (BMF-PP), enhances regular embarrassingly parallel MCMC (e.g. [10, 17]), which does not work well for matrix factorization [12] because of identifiability issues. BMF-PP introduces communication at predetermined limited phases in the algorithm to make the problem identifiable, effectively building one model for all the parallelized data subsets, while in previous works multiple independent solutions were found per subset.
The current paper is based on a realization that the approaches of [16] and [12] are compatible, and in fact synergistic. BMF-PP is able to parallelize a massive matrix, but its accuracy improves as the parallelized blocks grow larger. Replacing the earlier serial processing of the blocks with the distributed BMF of [16] allows making the blocks larger, up to the scalability limit of the distributed BMF, and hence decreases the wall-clock time by engaging more processors.
The main contributions of this work are:

We combine both approaches, allowing for parallelization both at the algorithmic and at the implementation level.

We analyze the best way to subdivide the original matrix into subsets, taking into account both compute performance and model quality.

We examine several web-scale datasets and show that datasets with different properties (like the number of non-zero items per row) require different parallelization strategies.
The rest of this paper is organized as follows. In Sect. 2 we present the existing Posterior Propagation algorithm and the distributed BMF implementation, and explain how to combine the two. Section 3 is the main section of this paper, where we document the experimental setup and the datasets used, compare with related MF methods, and present the results both from a machine learning and from a high-performance computing point of view. In Sect. 4 we draw conclusions and propose future work.
2 Distributed Bayesian Matrix Factorization with Posterior Propagation
In this section, we first briefly review the BMF model and then describe the individual aspects of distributed computation and Posterior Propagation and how to combine them.
2.1 Bayesian Matrix Factorization
In matrix factorization, a (typically very sparsely observed) data matrix \(\mathbf {R}\in \mathbb {R}^{N\times D}\) is factorized into a product of two matrices \(\mathbf {U}\in \mathbb {R}^{N\times K}= (\mathbf {u}_1,\ldots ,\mathbf {u}_N)^\top \) and \(\mathbf {V}= (\mathbf {v}_1,\ldots ,\mathbf {v}_D)^\top \in \mathbb {R}^{D\times K}\). In the context of recommender systems, \(\mathbf {R}\) is a rating matrix, with N the number of users and D the number of rated items.
In Bayesian matrix factorization [2, 13], the data are modelled as
\(p\left( \mathbf {R}\mid \mathbf {U},\mathbf {V},\tau \right) = \prod _{n=1}^{N}\prod _{d=1}^{D}\left[ \mathcal {N}\left( r_{nd}\mid \mathbf {u}_n^\top \mathbf {v}_d,\ \tau ^{-1}\right) \right] ^{I_{nd}},\)
where \(I_{nd}\) denotes an indicator which equals 1 if the element \(r_{nd}\) is observed and 0 otherwise, and \(\tau \) denotes the residual noise precision. The two parameter matrices \(\mathbf {U}\) and \(\mathbf {V}\) are assigned Gaussian priors. Our goal is then to compute the joint posterior density \(p(\mathbf {U},\mathbf {V}\mid \mathbf {R}) \propto p(\mathbf {U})p(\mathbf {V})p(\mathbf {R}\mid \mathbf {U},\mathbf {V})\), conditional on the observed data. Posterior inference is typically done using Gibbs sampling; see [13] for details.
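As a sketch of what one Gibbs sweep looks like for this model (a simplified single-machine version with unit-Gaussian priors; `sample_rows` is a hypothetical helper, not the implementation evaluated in this paper):

```python
import numpy as np

# Minimal Gibbs sketch for the BMF model above. Priors: u_n, v_d ~ N(0, I).
def sample_rows(Fixed, R, mask, tau, rng):
    """Sample each row of one factor matrix from its Gaussian conditional."""
    K = Fixed.shape[1]
    out = np.empty((R.shape[0], K))
    for n in range(R.shape[0]):
        Vn = Fixed[mask[n]]                      # factors of observed entries
        prec = np.eye(K) + tau * Vn.T @ Vn       # posterior precision
        mean = np.linalg.solve(prec, tau * Vn.T @ R[n, mask[n]])
        out[n] = rng.multivariate_normal(mean, np.linalg.inv(prec))
    return out

rng = np.random.default_rng(1)
N, D, K, tau = 8, 5, 2, 2.0
R = rng.normal(size=(N, D)); mask = rng.random((N, D)) < 0.6
U = rng.normal(size=(N, K)); V = rng.normal(size=(D, K))
for _ in range(10):                              # a few Gibbs sweeps
    U = sample_rows(V, R, mask, tau, rng)        # rows of U are independent
    V = sample_rows(U, R.T, mask.T, tau, rng)    # given V, and vice versa
```

The independence of the rows of \(\mathbf {U}\) (given \(\mathbf {V}\), and vice versa) visible in the inner loop is exactly what the distributed implementation of Sect. 2.3 exploits.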
2.2 Bayesian Matrix Factorization with Posterior Propagation
In the Posterior Propagation (PP) framework [12], we start by partitioning \(\mathbf {R}\) with respect to both rows and columns into \(I\times J\) subsets \(\mathbf {R}^{(i,j)}\), \(i=1,\ldots ,I\), \(j = 1,\ldots , J\). The parameter matrices \(\mathbf {U}\) and \(\mathbf {V}\) are correspondingly partitioned into I and J submatrices, respectively. The basic idea of PP is to process each subset using a hierarchical embarrassingly parallel MCMC scheme in three phases, where the posteriors from each phase are propagated forwards and used as priors in the following phase, thus introducing dependencies between the subsets. The approach proceeds as follows (for an illustration, see Fig. 1):
Phase (a): Joint inference for submatrices \((\mathbf {U}^{(1)}, \mathbf {V}^{(1)})\), conditional on data subset \(\mathbf {R}^{(1,1)}\):
\(p\left( \mathbf {U}^{(1)},\mathbf {V}^{(1)}\mid \mathbf {R}^{(1,1)}\right) \propto p\left( \mathbf {U}^{(1)}\right) p\left( \mathbf {V}^{(1)}\right) p\left( \mathbf {R}^{(1,1)}\mid \mathbf {U}^{(1)},\mathbf {V}^{(1)}\right) .\)
Phase (b): Joint inference in parallel for submatrices \((\mathbf {U}^{(i)},\mathbf {V}^{(1)})\), \(i=2,\ldots ,I\), and \((\mathbf {U}^{(1)},\mathbf {V}^{(j)})\), \(j=2,\ldots ,J\), conditional on data subsets which share columns or rows with \(\mathbf {R}^{(1,1)}\), and using posterior marginals from phase (a) as priors:
\(p\left( \mathbf {U}^{(i)},\mathbf {V}^{(1)}\mid \mathbf {R}^{(i,1)}\right) \propto p\left( \mathbf {U}^{(i)}\right) p\left( \mathbf {V}^{(1)}\mid \mathbf {R}^{(1,1)}\right) p\left( \mathbf {R}^{(i,1)}\mid \mathbf {U}^{(i)},\mathbf {V}^{(1)}\right) ,\)
and analogously for \((\mathbf {U}^{(1)},\mathbf {V}^{(j)})\) with data subsets \(\mathbf {R}^{(1,j)}\).
Phase (c): Joint inference in parallel for submatrices \((\mathbf {U}^{(i)},\mathbf {V}^{(j)})\), \(i=2,\ldots ,I\), \(j=2,\ldots ,J\), conditional on the remaining data subsets, and using posterior marginals propagated from phase (b) as priors:
\(p\left( \mathbf {U}^{(i)},\mathbf {V}^{(j)}\mid \mathbf {R}^{(i,j)}\right) \propto p\left( \mathbf {U}^{(i)}\mid \mathbf {R}^{(i,1)}\right) p\left( \mathbf {V}^{(j)}\mid \mathbf {R}^{(1,j)}\right) p\left( \mathbf {R}^{(i,j)}\mid \mathbf {U}^{(i)},\mathbf {V}^{(j)}\right) .\)
Finally, the aggregated posterior is obtained by combining the posteriors obtained in phases (a)–(c) and dividing away the multiply-counted propagated posterior marginals; see [12] for details.
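The three-phase schedule can be sketched as follows (illustrative only; each block coordinate stands for a full BMF run on that subset, with posteriors propagated between phases):

```python
# Sketch of the three-phase Posterior Propagation schedule over an
# I x J partitioning of R (0-based block indices).
def pp_schedule(I, J):
    """Return the block coordinates processed in each PP phase."""
    phase_a = [(0, 0)]
    phase_b = [(i, 0) for i in range(1, I)] + [(0, j) for j in range(1, J)]
    phase_c = [(i, j) for i in range(1, I) for j in range(1, J)]
    return phase_a, phase_b, phase_c

a, b, c = pp_schedule(3, 2)
# Blocks within a phase are independent and can run on separate resources;
# priors for phase (b) come from (a), and for (c) from (b).
assert len(a) + len(b) + len(c) == 3 * 2   # every block is covered once
```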
2.3 Distributed Bayesian Matrix Factorization
In [16] a distributed parallel implementation of BMF is proposed: an implementation of the BMF algorithm of [13] that distributes the rows and columns of \(\mathbf {R}\) across different nodes of a supercomputing system.
Since Gibbs sampling is used, rows of \(\mathbf {U}\) and rows of \(\mathbf {V}\) are independent and can be sampled in parallel, on different nodes. However, there is a dependency between samples of \(\mathbf {U}\) and \(\mathbf {V}\). The communication pattern between \(\mathbf {U}\) and \(\mathbf {V}\) is shown in Fig. 2.
The main contribution of this implementation is how to distribute \(\mathbf {U}\) and \(\mathbf {V}\) to make sure the computational load is distributed as equally as possible and the amount of data communication is minimized. The authors of [16] optimize the distributions by analysing the sparsity structure of \(\mathbf {R}\).
As presented in the paper, distributed BMF provides a reasonable speedup on compute systems of up to 128 nodes. Beyond this, speedup in the strong scaling case is limited by the increase in communication, while at the same time the computation per node decreases.
2.4 Distributed Bayesian Matrix Factorization with Posterior Propagation
By combining the distributed BMF implementation with PP, we can exploit the parallelism at both levels. The different subsets of \(\mathbf {R}\) in the parallel phases of PP are independent and can thus be computed in parallel on different resources. Inside each subset we exploit parallelism by using the distributed BMF implementation with MPI communication.
3 Experiments
In this section we evaluate the distributed BMF implementation with Posterior Propagation (DBMF+PP) and its competitors on benchmark datasets. We first have a look at the datasets we will use for the experiments. Next, we compare our proposed DBMF+PP implementation with other related matrix factorization methods in terms of accuracy and runtime, with a special emphasis on the benefits of Bayesian methods. Finally, we look at the strong scaling of DBMF+PP and at a good way to split the matrix into blocks for Posterior Propagation.
3.1 Datasets
Table 1 shows the different web-scale datasets we used for our experiments. The table shows we have a wide diversity of properties: from the relatively small Movielens dataset with 20M ratings, to the 262M-rating Yahoo dataset. Sparsity is expressed as the ratio of the total number of elements in the matrix (#rows \(\times \) #columns) to the number of filled-in ratings (#ratings). This metric also varies: the Amazon dataset especially is very sparsely filled.
The scale of the ratings (either 1 to 5 or 0 to 100) is only important when looking at prediction error values, which we will do in the next section.
The final three lines of the table are not data properties, but rather properties of the matrix factorization methods. K is the number of latent dimensions used, which is different per dataset but chosen to be common to all matrix factorization methods. The last two lines are two compute performance metrics, which will be discussed in Sect. 3.4.
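The sparsity metric defined above can be computed directly; the MovieLens row and column counts below are approximate publicly quoted figures, used here purely as an example:

```python
# Sparsity as defined for Table 1: total cells divided by observed ratings
# (a higher value means a sparser matrix).
def sparsity(rows, cols, ratings):
    return rows * cols / ratings

# MovieLens-20M: roughly 138K users x 27K movies, 20M ratings
print(round(sparsity(138_000, 27_000, 20_000_000)))  # 186
```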
3.2 Related Methods
In this section we compare the proposed BMF+PP method to other matrix factorization (MF) methods, in terms of accuracy (Root-Mean-Square Error, RMSE, on the test set) and in terms of compute performance (wall-clock time).
Stochastic Gradient Descent (SGD) [15], Coordinate Gradient Descent (CGD) [18] and Alternating Least Squares (ALS) [14] are the three most-used algorithms for non-Bayesian MF.
CGD- and ALS-based algorithms update along one dimension at a time while the other dimension of the matrix remains fixed. Many variants of ALS and CGD exist that improve the convergence [6] or the degree of parallelization [11], or are optimized for non-sparse rating matrices [7]. In this paper we limit ourselves to comparing against methods that divide the rating matrix into blocks, which is necessary to support very large matrices. Of the previous methods, this includes the SGD-based ones.
FPSGD [15] is a very efficient SGD-based library for matrix factorization on multi-core machines. While FPSGD is a single-machine implementation, which limits its capability to solve large-scale problems, it has outperformed all other methods on a machine with up to 16 cores. NOMAD [19] extends the idea of block partitioning, adding the capability to release a portion of a block to another thread before its full completion. It performs similarly to FPSGD on a single machine, and can scale out to a 64-node HPC cluster.
Table 2 compares the RMSE values on the test sets of the four selected datasets. For all methods, we used the K provided in Table 1. For competitor methods, we used the default hyperparameters suggested by the authors for the different datasets. As already concluded in [12], BMF with Posterior Propagation results in equally good RMSE compared to the original BMF. We also see from the table that, on average, the Bayesian method produces only slightly better results in terms of RMSE. However, Bayesian methods have significant statistical advantages [9] that stimulate their use.
For many applications, this advantage outweighs the much higher computational cost. Indeed, as can be seen from Table 3, BMF is significantly more expensive than NOMAD and FPSGD, even when taking into account the speedups from using PP. NOMAD is the fastest method, thanks to the aforementioned improvements compared to FPSGD.
3.3 Block Size
The combined method achieves some of the parallelization with the PP algorithm, and some with the distributed BMF within each block. We next compare the performance as the share between the two is varied by varying the block size. We find that blocks should be approximately square, meaning the number of rows and columns inside each block should be more or less equal. This implies that the number of blocks along the rows of \(\mathbf {R}\) will be smaller if \(\mathbf {R}\) has fewer rows, and vice versa.
Figure 3 shows that optimizing the block size is crucial to achieving a good speedup with BMF+PP without compromising the quality of the model (as measured by RMSE). The block size in the figure, listed as \(I\times J\), means the \(\mathbf {R}\) matrix is split into I blocks (with an equal number of rows) in the vertical direction and J blocks (with an equal number of columns) in the horizontal direction. The figure explores different block sizes for the Netflix dataset, which has significantly more rows than columns (\(27\times \)), as can be seen from Table 1. This is reflected by the fact that the data point with the smallest bubble area (\(20 \times 3\)) provides the best tradeoff between wall-clock time and RMSE.
We conjecture that the reason is the underlying tradeoff between the amount of information in a block and the amount of compute per block. Both can be optimized (maximized and minimized, respectively) by making the blocks approximately square, since both are proportional to the ratio of the block's area to its circumference.
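A hypothetical helper (not part of the paper's implementation) illustrating the approximately-square rule: choose I and J so that each block has roughly as many rows as columns.

```python
import math

# Pick an I x J partitioning with I*J close to a target block count such
# that per-block rows ~= per-block columns, i.e. blocks are roughly square.
def split_blocks(n_rows, n_cols, n_blocks):
    best = None
    for I in range(1, n_blocks + 1):
        J = max(1, round(n_blocks / I))
        # |log| of the per-block aspect ratio: 0 means perfectly square
        squareness = abs(math.log((n_rows / I) / (n_cols / J)))
        if best is None or squareness < best[0]:
            best = (squareness, I, J)
    return best[1], best[2]

# A Netflix-like shape, roughly 27x more rows than columns, ~60 blocks:
print(split_blocks(480_000, 17_700, 60))
```

As expected for a tall matrix, the search assigns far more blocks along the rows than along the columns.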
We will come back to this tradeoff when we look at performance of BMF+PP when scaling to multiple nodes for the different datasets in Sect. 3.4.
3.4 Scaling
In this section we look at how the added parallelization of Posterior Propagation improves the strong scaling behavior of BMF+PP. Strong scaling means we look at the speedup obtained by increasing the number of compute nodes while keeping the dataset constant.
The graphs in this section are displayed with a logarithmic scale on both axes. This has the effect that linear scaling (i.e. doubling of the amount of resources results in a halving of the runtime) is shown as a straight line. Additionally, we indicate Pareto optimal solutions with a blue dot. For these solutions one cannot reduce the execution time without increasing the resources.
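The Pareto-optimality criterion used in the plots can be sketched as follows (illustrative data points, not measurements from the paper):

```python
# A (nodes, time) run is Pareto optimal if no other run dominates it:
# a dominator uses <= nodes and <= time, with at least one strictly less.
def pareto_optimal(runs):
    """runs: list of (nodes, wallclock_seconds); return the Pareto set."""
    return [
        (n, t) for (n, t) in runs
        if not any(n2 <= n and t2 <= t and (n2 < n or t2 < t)
                   for (n2, t2) in runs)
    ]

runs = [(1, 100.0), (2, 60.0), (4, 60.0), (8, 35.0), (16, 40.0)]
# (4, 60) and (16, 40) are dominated and drop out:
print(sorted(pareto_optimal(runs)))  # [(1, 100.0), (2, 60.0), (8, 35.0)]
```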
We performed experiments with different block sizes, indicated with different colors in the graphs. For example, the yellow line labeled \(16 \times 8\) in Fig. 4 means we performed PP with 16 blocks for the rows of \(\mathbf {R}\) and 8 blocks for the columns of \(\mathbf {R}\).
We have explored scaling on a system with up to 128K nodes, and we use this multi-node parallelism in two ways:
1. Parallelization inside a block, using the distributed version of BMF [16].

2. Parallelism across blocks. In phase (b) we can exploit parallelism across the blocks in the first row and first column, using up to \(I + J\) nodes, where I and J are the number of blocks in the vertical and horizontal direction of the matrix, respectively. In phase (c) we can use up to \(I \times J\) nodes.
General Trends. When looking at the four graphs (Figs. 4 and 5), we observe that, for the same number of nodes, using more blocks for Posterior Propagation increases the wall-clock time. The main reason is that this solution does significantly more compute, because we take the same number of samples for each subblock. This means that for a \(32 \times 32\) partitioning we take \(1024\times \) more samples than for \(1 \times 1\).
On the other hand, we do get significant speedup, with up to a \(68\times \) improvement for the Netflix dataset. However, the amount of resources needed for this is clearly prohibitively large (16K nodes). To reduce execution time, we plan to investigate reducing the number of Gibbs samples and the effect this has on RMSE.
Netflix and Yahoo. Runs on the Netflix and Yahoo datasets (Fig. 4) have been performed with 100 latent dimensions (\(K = 100\)). Since the computational intensity (the amount of compute per row/column of \(\mathbf {R}\)) is \(O(K^3)\), the ratio of compute to communication in these experiments is high, leading to good scalability for a single block (\(1 \times 1\)) for Netflix, and almost linear scalability up to 16 or even 64 nodes. The Yahoo dataset is too large to run with a \(1 \times 1\) block size, but we see a similar trend for \(2 \times 2\).
Movielens and Amazon. For Movielens and Amazon (Fig. 5), we used \(K=10\) for our experiments. This implies a much smaller amount of compute for the same communication cost inside a block. Hence scaling for \(1 \times 1\) is mostly flat. We do get significant performance improvements with more but smaller blocks: the Amazon dataset running on 2048 nodes with \(32 \times 32\) blocks is \(20\times \) faster than the best block size (\(1 \times 1\)) on a single node.
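The \(O(K^3)\) computational intensity mentioned above comes from factorizing the \(K \times K\) posterior precision when sampling each row; as a back-of-the-envelope comparison (an illustrative estimate, not a benchmark):

```python
# Relative per-row sampling cost under the O(K^3) model: the K = 100 runs
# (Netflix, Yahoo) do ~1000x more work per row than the K = 10 runs
# (Movielens, Amazon), which is why their compute/communication ratio is
# so much more favorable.
def relative_row_cost(K):
    return K ** 3

print(relative_row_cost(100) // relative_row_cost(10))  # 1000
```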
When the number of nodes in the experiment aligns with the parallelism across blocks (either \(I + J\) or \(I \times J\), as explained above), we see a significant drop in the run time, for example for the Amazon dataset with \(32 \times 32\) blocks when going from 1024 to 2048 nodes.
4 Conclusions and Further Work
In this paper we presented a scalable, distributed implementation of Bayesian Probabilistic Matrix Factorization, using asynchronous communication. We evaluated both the machine learning and the compute performance on several web-scale datasets.
While we do get significant speedups, resource requirements for these speedups are extremely large. Hence, in future work, we plan to investigate the effect of the following measures on execution time, resource usage and RMSE:

Reduce the number of samples for subblocks in phase (b) and phase (c).

Use the posterior from phase (b) blocks to make predictions for phase (c) blocks.

Use the GASPI implementation of [16], instead of the MPI implementation.
References
Ahn, S., Korattikara, A., Liu, N., Rajan, S., Welling, M.: Large-scale distributed Bayesian matrix factorization using stochastic gradient MCMC. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 9–18 (2015). https://doi.org/10.1145/2783258.2783373
Bhattacharya, A., Dunson, D.B.: Sparse Bayesian infinite factor models. Biometrika 98, 291–306 (2011)
Dror, G., Koenigstein, N., Koren, Y., Weimer, M.: The Yahoo! music dataset and KDDCup’11. In: Dror, G., Koren, Y., Weimer, M. (eds.) Proceedings of KDD Cup 2011. Proceedings of Machine Learning Research, vol. 18, pp. 3–18. PMLR (2012). http://proceedings.mlr.press/v18/dror12a.html
Gomez-Uribe, C.A., Hunt, N.: The Netflix recommender system: algorithms, business value, and innovation. ACM Trans. Manag. Inf. Syst. 6(4), 13:1–13:19 (2015). https://doi.org/10.1145/2843948
Harper, F.M., Konstan, J.A.: The movielens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5(4), 19:1–19:19 (2015). https://doi.org/10.1145/2827872
Hsieh, C.J., Dhillon, I.S.: Fast coordinate descent methods with variable selection for nonnegative matrix factorization. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2011)
Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009). https://doi.org/10.1109/MC.2009.263
Lab, S.S.: Web data: Amazon reviews. https://snap.stanford.edu/data/web-Amazon.html. Accessed 18 Aug 2018
Labelle, C., Marinier, A., Lemieux, S.: Enhancing the drug discovery process: Bayesian inference for the analysis and comparison of dose-response experiments. Bioinformatics 35(14), i464–i473 (2019). https://doi.org/10.1093/bioinformatics/btz335
Neiswanger, W., Wang, C., Xing, E.P.: Asymptotically exact, embarrassingly parallel MCMC. In: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI 2014, pp. 623–632. AUAI Press, Arlington (2014). http://dl.acm.org/citation.cfm?id=3020751.3020816
Pilászy, I., Zibriczky, D., Tikk, D.: Fast ALSbased matrix factorization for explicit and implicit feedback datasets. In: Proceedings of the Fourth ACM Conference on Recommender Systems, pp. 71–78. ACM (2010)
Qin, X., Blomstedt, P., Leppäaho, E., Parviainen, P., Kaski, S.: Distributed Bayesian matrix factorization with limited communication. Mach. Learn. 108(10), 1805–1830 (2019). https://doi.org/10.1007/s10994-019-05778-2
Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In: Proceedings of the 25th International Conference on Machine Learning, pp. 880–887. ACM (2008). https://doi.org/10.1145/1390156.1390267
Tan, W., Cao, L., Fong, L.: Faster and cheaper: parallelizing large-scale matrix factorization on GPUs. In: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2016), pp. 219–230. ACM (2016). https://doi.org/10.1145/2907294.2907297
Teflioudi, C., Makari, F., Gemulla, R.: Distributed matrix completion. In: Proceedings of the 2012 IEEE 12th International Conference on Data Mining (ICDM 2012), pp. 655–664 (2012). https://doi.org/10.1109/ICDM.2012.120
Vander Aa, T., Chakroun, I., Haber, T.: Distributed Bayesian probabilistic matrix factorization. Procedia Comput. Sci. 108, 1030–1039 (2017). International Conference on Computational Science, ICCS 2017
Wang, X., Guo, F., Heller, K.A., Dunson, D.B.: Parallelizing MCMC with random partition trees. In: Advances in Neural Information Processing Systems, vol. 28, pp. 451–459. Curran Associates, Inc. (2015)
Yu, H.F., Hsieh, C.J., Si, S., Dhillon, I.: Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In: Proceedings of the 2012 IEEE 12th International Conference on Data Mining (ICDM 2012), pp. 765–774. IEEE Computer Society (2012). https://doi.org/10.1109/ICDM.2012.168
Yun, H., Yu, H.F., Hsieh, C.J., Vishwanathan, S.V.N., Dhillon, I.: NOMAD: non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. Proc. VLDB Endow. 7(11), 975–986 (2014). https://doi.org/10.14778/2732967.2732973
Acknowledgments
The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation programme under the EPEEC project, grant agreement No 801051. The work was also supported by the Academy of Finland (Flagship programme: Finnish Center for Artificial Intelligence, FCAI; grants 319264, 292334). We acknowledge PRACE for awarding us access to Hazel Hen at GCS@HLRS, Germany.
© 2020 Springer Nature Switzerland AG
Vander Aa, T., Qin, X., Blomstedt, P., Wuyts, R., Verachtert, W., Kaski, S. (2020). A High-Performance Implementation of Bayesian Matrix Factorization with Limited Communication. In: Krzhizhanovskaya, V., et al. Computational Science – ICCS 2020. ICCS 2020. Lecture Notes in Computer Science, vol 12142. Springer, Cham. https://doi.org/10.1007/978-3-030-50433-5_1
Print ISBN: 978-3-030-50432-8
Online ISBN: 978-3-030-50433-5