DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling

Journal of Global Optimization

Abstract

The importance of unsupervised clustering and topic modeling is well recognized, given the ever-increasing volumes of text data available from numerous sources. Nonnegative matrix factorization (NMF) has proven to be a successful method for cluster and topic discovery in unlabeled data sets. In this paper, we propose DC-NMF, a fast algorithm for computing NMF using a divide-and-conquer strategy. Given an input matrix whose columns represent data items, we build a binary tree of the data items using a recently proposed efficient algorithm for computing rank-2 NMF, and then gather information from the tree to initialize the rank-k NMF, which then needs only a few iterations to reach a desired solution. We also investigate various criteria for selecting the node to split when growing the tree. We demonstrate the scalability of our algorithm for computing general rank-k NMF, as well as its effectiveness in clustering and topic modeling for large-scale text data sets, by comparing it to other frequently used state-of-the-art algorithms. The value of the proposed approach lies in its highly efficient and accurate method for initializing rank-k NMF, and in the scalability that follows from the divide-and-conquer design and the properties of rank-2 NMF. In summary, we present efficient tools for analyzing large-scale data sets and techniques that can be generalized to many other data analytics problem domains, along with an open-source software library called SmallK.
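The divide-and-conquer strategy described in the abstract can be illustrated with a minimal NumPy sketch. It is not the SmallK implementation, and it rests on several assumptions that the abstract does not specify: plain multiplicative updates stand in for the fast rank-2 NMF solver, the largest leaf is always the one split, leaf centroids are used to initialize the rank-k factor W, and all function names (mu_nmf, rank2_split, dc_nmf) are hypothetical.

```python
import numpy as np

EPS = 1e-12  # small constant that keeps the multiplicative updates well defined


def mu_nmf(A, W, H, iters):
    """Refine (W, H) for min ||A - W H||_F with multiplicative updates.

    ASSUMPTION: a generic stand-in solver, not the paper's algorithm.
    """
    W, H = W.copy(), H.copy()
    for _ in range(iters):
        H *= (W.T @ A + EPS) / (W.T @ W @ H + EPS)
        W *= (A @ H.T + EPS) / (W @ (H @ H.T) + EPS)
    return W, H


def rank2_split(A, cols, rng, iters=100):
    """Split one tree node (an array of column indices) using rank-2 NMF."""
    sub = A[:, cols]
    W = rng.random((A.shape[0], 2)) + 0.1
    H = rng.random((2, len(cols))) + 0.1
    W, H = mu_nmf(sub, W, H, iters)
    # Rescale so both rows of H are comparable, then assign each column
    # to the side with the larger weight.
    H = H * np.linalg.norm(W, axis=0)[:, None]
    left = H[0] >= H[1]
    if left.all() or not left.any():
        # Degenerate split: fall back to halving the node (illustrative only).
        left = np.arange(len(cols)) < len(cols) // 2
    return cols[left], cols[~left]


def dc_nmf(A, k, refine_iters=20, seed=0):
    """Grow a binary tree with k leaves, then initialize and refine rank-k NMF.

    Requires k <= number of columns of A.
    """
    rng = np.random.default_rng(seed)
    leaves = [np.arange(A.shape[1])]
    while len(leaves) < k:
        # ASSUMPTION: split the largest leaf; other criteria are possible.
        i = max(range(len(leaves)), key=lambda j: len(leaves[j]))
        a, b = rank2_split(A, leaves.pop(i), rng)
        leaves += [a, b]
    # ASSUMPTION: leaf centroids initialize W; H comes from a projected
    # least-squares fit, floored so multiplicative updates can move it.
    W0 = np.column_stack([A[:, c].mean(axis=1) for c in leaves])
    H0 = np.maximum(np.linalg.lstsq(W0, A, rcond=None)[0], 1e-6)
    W, H = mu_nmf(A, W0, H0, refine_iters)
    return W0, H0, W, H, leaves


if __name__ == "__main__":
    A = np.random.default_rng(1).random((30, 40))
    W0, H0, W, H, leaves = dc_nmf(A, k=4)
    print([len(c) for c in leaves])
    print(np.linalg.norm(A - W0 @ H0), np.linalg.norm(A - W @ H))
```

The tree fixes a partition of the columns before any rank-k work is done, so the rank-k refinement starts from factors that already reflect cluster structure; this is the sense in which only a few iterations are needed afterwards.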

Notes

  1. http://www.daviddlewis.com/resources/testcollections/reuters21578/ (retrieved in June 2014).

  2. http://qwone.com/~jason/20Newsgroups/ (retrieved in June 2014).

  3. https://dumps.wikimedia.org/enwiki/.

  4. https://smallk.github.io/.

  5. Besides the listed algorithms, we also experimented with a recent algorithm based on coordinate descent with a greedy rule to select the variable to improve at each step [24]. However, this algorithm became increasingly slow when we increased k and kept the size of A the same. Therefore, we did not include it in our final comparison.

  6. http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview.

  7. http://mallet.cs.umass.edu/.

  8. https://github.com/mimno/anchor.

  9. http://math.ucla.edu/~dakuang/software/kmeans3.html.

  10. The run-time for CLUTO on Wiki-4.5M is absent: on our smaller system with 24 GB memory, it ran out of memory; and on our larger server with sufficient memory, the binary could not open a large data file (\(>6\) GB). The CLUTO software is not open-source and thus we only have access to the binary and are not able to build the program on our server.

References

  1. Arora, S., Ge, R., Halpern, Y., Mimno, D.M., Moitra, A., Sontag, D., Wu, Y., Zhu, M.: A practical algorithm for topic modeling with provable guarantees. In: ICML ’13: Proceedings of the 30th International Conference on Machine Learning (2013)

  2. Arora, S., Ge, R., Kannan, R., Moitra, A.: Computing a nonnegative matrix factorization—provably. In: STOC ’12: Proceedings of the 44th Symposium on Theory of Computing, pp. 145–162 (2012)

  3. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1999)

  4. Bittorf, V., Recht, B., Re, C., Tropp, J.: Factoring nonnegative matrices with linear programs. In: Advances in Neural Information Processing Systems 25, NIPS ’12, pp. 1214–1222 (2012)

  5. Blei, D.M., Griffiths, T.L., Jordan, M.I., Tenenbaum, J.B.: Hierarchical topic models and the nested Chinese restaurant process. In: Advances in Neural Information Processing Systems 16 (2003)

  6. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  7. Beavers, A., Drake, B., Boyd, R., Park, H.: https://smallk.github.io, June (2016)

  8. Cai, D., He, X., Han, J., Huang, T.S.: Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1548–1560 (2011)

  9. Chu, M.T., Lin, M.M.: Low-dimensional polytope approximation and its applications to nonnegative matrix factorization. SIAM J. Sci. Comput. 30, 1131–1155 (2008)

  10. Cichocki, A., Phan, A.-H.: Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E92A, 708–721 (2009)

  11. Cohen, J.E., Rothblum, U.G.: Nonnegative ranks, decompositions, and factorizations of nonnegative matrices. Linear Algebra Appl. 190, 149–168 (1993)

  12. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley, Hoboken (2006)

  13. Drake, B., Kim, J., Mallick, M., Park, H.: Supervised Raman spectra estimation based on nonnegative rank deficient least squares. In: Proceedings 13th International Conference on Information Fusion. Edinburgh, UK (2010)

  14. Gillis, N.: The why and how of nonnegative matrix factorization. In: Suykens, J.A.K., Signoretto, M., Argyriou, A. (eds.) Regularization, Optimization, Kernels, and Support Vector Machines, Ch 12, pp. 257–291. Chapman & Hall/CRC, London (2014)

  15. Gillis, N., Glineur, F.: Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Comput. 24, 1085–1105 (2012)

  16. Gillis, N., Kuang, D., Park, H.: Hierarchical clustering of hyperspectral images using rank-two nonnegative matrix factorization. IEEE Trans. Geosci. Remote Sens. 53, 2066–2078 (2015)

  17. Globerson, A., Chechik, G., Pereira, F., Tishby, N.: Euclidean embedding of co-occurrence data. J. Mach. Learn. Res. 8, 2265–2295 (2007)

  18. Golub, G.H., Van Loan, C.F.: Matrix Computations, 4th edn. The Johns Hopkins University Press, Baltimore (2013)

  19. Grippo, L., Sciandrone, M.: On the convergence of the block nonlinear Gauss–Seidel method under convex constraints. Oper. Res. Lett. 26, 127–136 (2000)

  20. Ho, N.-D.: Non-negative Matrix Factorization. Algorithms and Applications. PhD Thesis, Université catholique de Louvain (2008)

  21. Hofmann, T.: Probabilistic latent semantic indexing. In: SIGIR ’99: Proceedings of the 22nd International ACM Conference on Research and Development in Information Retrieval (1999)

  22. Hofree, M., Shen, J.P., Carter, H., Gross, A., Ideker, T.: Network-based stratification of tumor mutations. Nat. Methods 10, 1108–1115 (2013)

  23. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, New York (1986)

  24. Hsieh, C.-J., Dhillon, I.S.: Fast coordinate descent methods with variable selection for non-negative matrix factorization. In: 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’11), pp. 1064–1072 (2011)

  25. Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recognit. Lett. 31, 651–666 (2010). (Award winning papers from the 19th International Conference on Pattern Recognition (ICPR))

  26. Kim, H., Park, H.: Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23, 1495–1502 (2007)

  27. Kim, H., Park, H.: Nonnegative matrix factorization based on alternating non-negativity-constrained least squares and the active set method. SIAM J. Matrix Anal. Appl. 30, 713–730 (2008)

  28. Kim, J., He, Y., Park, H.: Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J. Glob. Optim. 58, 285–319 (2014)

  29. Kim, J., Park, H.: Sparse nonnegative matrix factorization for clustering. Technical Report, Georgia Institute of Technology (2008)

  30. Kim, J., Park, H.: Toward faster nonnegative matrix factorization: a new algorithm and comparisons. In: ICDM ’08: Proceedings of the 8th IEEE International Conference on Data Mining, pp. 353–362 (2008)

  31. Kim, J., Park, H.: Fast nonnegative matrix factorization: an active-set-like method and comparisons. SIAM J. Sci. Comput. 33, 3261–3281 (2011)

  32. Kuang, D., Park, H.: Fast rank-2 nonnegative matrix factorization for hierarchical document clustering. In: 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’13), pp. 739–747 (2013)

  33. Kuang, D., Yun, S., Park, H.: SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering. J. Glob. Optim. 62, 545–574 (2015)

  34. Kumar, A., Sindhwani, V., Kambadur, P.: Fast conical hull algorithms for near-separable non-negative matrix factorization. In: ICML ’13: Proceedings of the 30th International Conference on Machine Learning (2013)

  35. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)

  36. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems 14, NIPS ’01, pp. 556–562 (2001)

  37. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004)

  38. Li, L., Lebanon, G., Park, H.: Fast Bregman divergence NMF using Taylor expansion and coordinate descent. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pp. 307–315. ACM, New York (2012)

  39. Lin, C.-J.: On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Trans. Neural Netw. 18, 1589–1596 (2007)

  40. Lin, C.-J.: Projected gradient methods for nonnegative matrix factorization. Neural Comput. 19, 2756–2779 (2007)

  41. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)

  42. McCallum, A.K., Nigam, K., Rennie, J., Seymore, K.: Automating the construction of internet portals with machine learning. Inf. Retr. 3, 127–163 (2000)

  43. Ozerov, A., Févotte, C.: Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans. Audio Speech Lang. Process. 18, 550–563 (2010)

  44. Paatero, P., Tapper, U.: Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5, 111–126 (1994)

  45. Van Benthem, M.H., Keenan, M.R.: Fast algorithm for the solution of large-scale non-negativity constrained least squares problems. J. Chemom. 18, 441–450 (2004)

  46. Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix factorization. In: SIGIR ’03: Proceedings of the 26th International ACM Conference on Research and Development in Information Retrieval, pp. 267–273 (2003)

Acknowledgements

We would like to thank Dr. Yunlong He for Theorem 1. The work of the authors was supported in part by the National Science Foundation (NSF) grant IIS-1348152 and the Defense Advanced Research Projects Agency (DARPA) XDATA program grant FA8750-12-2-0309. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or DARPA.

Author information

Corresponding author

Correspondence to Haesun Park.

About this article

Cite this article

Du, R., Kuang, D., Drake, B. et al. DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling. J Glob Optim 68, 777–798 (2017). https://doi.org/10.1007/s10898-017-0515-z

  • DOI: https://doi.org/10.1007/s10898-017-0515-z
