The importance of unsupervised clustering and topic modeling is well recognized with ever-increasing volumes of text data available from numerous sources. Nonnegative matrix factorization (NMF) has proven to be a successful method for cluster and topic discovery in unlabeled data sets. In this paper, we propose a fast algorithm for computing NMF using a divide-and-conquer strategy, called DC-NMF. Given an input matrix where the columns represent data items, we build a binary tree structure of the data items using a recently-proposed efficient algorithm for computing rank-2 NMF, and then gather information from the tree to initialize the rank-k NMF, which needs only a few iterations to reach a desired solution. We also investigate various criteria for selecting the node to split when growing the tree. We demonstrate the scalability of our algorithm for computing general rank-k NMF as well as its effectiveness in clustering and topic modeling for large-scale text data sets, by comparing it to other frequently utilized state-of-the-art algorithms. The value of the proposed approach lies in the highly efficient and accurate method for initializing rank-k NMF and the scalability achieved from the divide-and-conquer approach of the algorithm and properties of rank-2 NMF. In summary, we present efficient tools for analyzing large-scale data sets, and techniques that can be generalized to many other data analytics problem domains along with an open-source software library called SmallK.
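The pipeline described above (grow a binary tree with rank-2 NMF, then use the leaves to initialize a rank-k NMF that needs only a few refinement iterations) can be illustrated with a minimal sketch. This is not the authors' implementation or the SmallK API. The function names, the multiplicative-update solver standing in for the fast rank-2 algorithm, the "split the largest leaf" criterion, and the use of leaf centroids as initial basis vectors are all assumptions made for illustration.

```python
import numpy as np

def nmf_mu(A, W, H, iters):
    """Multiplicative-update NMF; a simple stand-in solver (assumption)."""
    eps = 1e-12
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

def rank2_split(A, cols, rng, iters=50):
    """Split the data items `cols` into two groups using a rank-2 NMF."""
    sub = A[:, cols]
    W = rng.random((A.shape[0], 2))
    H = rng.random((2, len(cols)))
    W, H = nmf_mu(sub, W, H, iters)
    left = H[0] >= H[1]  # assign each item to its dominant component
    return cols[left], cols[~left]

def dc_nmf_sketch(A, k, seed=0, tree_iters=50, refine_iters=10):
    """Hypothetical divide-and-conquer NMF: tree of rank-2 splits, then rank-k."""
    if not 1 <= k <= A.shape[1]:
        raise ValueError("k must be between 1 and the number of columns")
    rng = np.random.default_rng(seed)
    leaves = [np.arange(A.shape[1])]
    while len(leaves) < k:
        # Placeholder split criterion (assumption): split the largest leaf.
        idx = max(range(len(leaves)), key=lambda i: len(leaves[i]))
        cols = leaves.pop(idx)
        a, b = rank2_split(A, cols, rng, tree_iters)
        if len(a) == 0 or len(b) == 0:
            # Degenerate split: fall back to halving the leaf.
            half = len(cols) // 2
            a, b = cols[:half], cols[half:]
        leaves += [a, b]
    # Initialize W with leaf centroids (assumption), kept strictly positive
    # so that multiplicative updates can still change every entry.
    W0 = np.column_stack([A[:, c].mean(axis=1) for c in leaves]) + 1e-6
    # Initialize H by unconstrained least squares, clipped to be positive.
    H0 = np.maximum(np.linalg.lstsq(W0, A, rcond=None)[0], 1e-6)
    # Only a few rank-k iterations, starting from the tree-based initialization.
    W, H = nmf_mu(A, W0, H0, refine_iters)
    return W, H, leaves
```

For a nonnegative matrix `A` whose columns are data items, `W, H, leaves = dc_nmf_sketch(A, k=4)` returns the two factors and the column indices of each leaf. Cluster labels can then be read off as `H.argmax(axis=0)`.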
http://www.daviddlewis.com/resources/testcollections/reuters21578/ (retrieved in June 2014).
http://qwone.com/~jason/20Newsgroups/ (retrieved in June 2014).
Besides the listed algorithms, we also experimented with a recent algorithm based on coordinate descent with a greedy rule to select the variable to improve at each step. However, this algorithm became increasingly slow when we increased k and kept the size of A the same. Therefore, we did not include it in our final comparison.
The run-time for CLUTO on Wiki-4.5M is absent: on our smaller system with 24 GB memory, it ran out of memory; and on our larger server with sufficient memory, the binary could not open a large data file (\(>6\) GB). The CLUTO software is not open-source and thus we only have access to the binary and are not able to build the program on our server.
We would like to thank Dr. Yunlong He for Theorem 1. The work of the authors was supported in part by the National Science Foundation (NSF) grant IIS-1348152 and the Defense Advanced Research Projects Agency (DARPA) XDATA program grant FA8750-12-2-0309. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or DARPA.
Du, R., Kuang, D., Drake, B. et al. DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling. J Glob Optim 68, 777–798 (2017). https://doi.org/10.1007/s10898-017-0515-z