DCF: A Dataflow-Based Collaborative Filtering Training Algorithm

Published in: International Journal of Parallel Programming

Abstract

Emerging recommender systems often adopt collaborative filtering techniques to improve recommendation accuracy. Existing collaborative filtering systems are implemented with either the alternating least squares (ALS) algorithm or the gradient descent (GD) algorithm. However, neither algorithm scales well: ALS suffers from high computational complexity, while GD suffers from severe synchronization overhead and tremendous data movement. To solve these problems, we propose a Dataflow-based Collaborative Filtering (DCF) algorithm. More specifically, DCF exploits the fine-grained asynchrony of the dataflow model to minimize synchronization overhead, leverages the mini-batch technique to reduce computation and communication complexity, and uses dummy-edge and multicasting techniques to avoid the fine-grained overhead of dependency checking and to reduce data movement. By combining these techniques, DCF significantly improves the performance of collaborative filtering. Our experiments on a cluster with one master node and ten slave nodes show that DCF achieves a 23× speedup over ALS on Spark and an 18× speedup over GD on GraphLab on public datasets.
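
To make the training kernel concrete, the following is a minimal sketch (in Python/NumPy) of the mini-batch gradient-descent matrix-factorization update that DCF builds on. It is not the authors' DCF implementation; all function names, hyperparameters, and the dense-array layout are illustrative assumptions, and the sketch deliberately omits the distributed dataflow graph, dummy edges, and multicasting described in the paper.

    # Illustrative sketch, NOT the authors' DCF implementation: plain mini-batch
    # gradient-descent matrix factorization, the numerical kernel that DCF
    # schedules as asynchronous dataflow tasks. Names and hyperparameters are
    # assumptions for illustration only.
    import numpy as np

    def train_mf(ratings, n_users, n_items, rank=16, lr=0.01, reg=0.05,
                 batch_size=1024, epochs=10, seed=0):
        """ratings: array of (user_id, item_id, rating) triples."""
        rng = np.random.default_rng(seed)
        U = 0.1 * rng.standard_normal((n_users, rank))   # user factor matrix
        V = 0.1 * rng.standard_normal((n_items, rank))   # item factor matrix
        ratings = np.asarray(ratings, dtype=float)
        for _ in range(epochs):
            rng.shuffle(ratings)                          # reshuffle ratings each epoch
            for start in range(0, len(ratings), batch_size):
                batch = ratings[start:start + batch_size]
                u = batch[:, 0].astype(int)               # user indices in this mini-batch
                i = batch[:, 1].astype(int)               # item indices in this mini-batch
                r = batch[:, 2]                           # observed ratings
                # prediction error for every observed rating in the mini-batch
                err = r - np.einsum('ij,ij->i', U[u], V[i])
                # L2-regularized gradient steps, computed from the pre-update factors
                du = lr * (err[:, None] * V[i] - reg * U[u])
                dv = lr * (err[:, None] * U[u] - reg * V[i])
                # np.add.at accumulates correctly when a user or item repeats in a batch
                np.add.at(U, u, du)
                np.add.at(V, i, dv)
        return U, V

In DCF, each such mini-batch update would correspond to an asynchronously scheduled dataflow task, with dummy edges coarsening dependency checking and multicasting reducing the movement of shared factor rows.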



Acknowledgements

We thank our anonymous reviewers for their feedback and suggestions. This work was partially sponsored by the National Basic Research 973 Program of China under Grant 2015CB352403 and the National Natural Science Foundation of China (NSFC) under Grants 61602301 and 61632017.

Author information


Corresponding author

Correspondence to Xiangyu Ju.


About this article

Cite this article

Ju, X., Chen, Q., Wang, Z. et al. DCF: A Dataflow-Based Collaborative Filtering Training Algorithm. Int J Parallel Prog 46, 686–698 (2018). https://doi.org/10.1007/s10766-017-0525-y

