Convex non-negative matrix factorization for massive datasets

  • Regular paper
Knowledge and Information Systems

Abstract

Non-negative matrix factorization (NMF) has become a standard tool in data mining, information retrieval, and signal processing. It factorizes a non-negative data matrix into two non-negative matrix factors that contain basis elements and linear coefficients, respectively. Often, the columns of the first factor are interpreted as “cluster centroids” of the input data, and the columns of the second factor are understood to contain cluster membership indicators. When analyzing data such as collections of gene expressions, documents, or images, it is often beneficial to ensure that the resulting cluster centroids are meaningful, for instance, by restricting them to be convex combinations of data points. However, known approaches to convex-NMF suffer from high computational costs and therefore hardly apply to large-scale data analysis problems. This paper presents a new framework for convex-NMF that allows for an efficient factorization of data matrices containing millions of data points. Motivated by the simple observation that each data point can be expressed as a convex combination of vertices of the data convex hull, we require the basic factors to be vertices of the data convex hull. The benefits of convex-hull NMF are twofold. First, as the number of data points grows, the expected size of the convex hull, i.e., the number of its vertices, grows much more slowly than the dataset. Second, distance-preserving low-dimensional embeddings allow us to efficiently sample the convex hull and hence to quickly determine candidate vertices. Our extensive experimental evaluation on large datasets shows that convex-hull NMF compares favorably to convex-NMF in terms of both speed and reconstruction quality. We demonstrate that our method can easily be applied to large-scale, real-world datasets, in our case consisting of 750,000 DBLP entries, 4,000,000 digital images, and 150,000,000 votes on World of Warcraft® guilds, respectively.
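The idea of sampling the convex hull through low-dimensional projections can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only mentions distance-preserving low-dimensional embeddings, whereas this sketch assumes plain axis-pair projections for simplicity. The function names (`hull_indices_2d`, `candidate_vertices`) are hypothetical. The sketch relies on the fact that, for points in general position, a vertex of the hull of a linear 2-D projection is the projection of a vertex of the full hull, so the union over projections gives candidate vertices without computing the high-dimensional hull.

```python
import random
from itertools import combinations


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_indices_2d(points2d):
    """Indices of the convex-hull vertices of 2-D points (Andrew's monotone chain)."""
    order = sorted(range(len(points2d)), key=lambda i: points2d[i])
    if len(order) < 3:
        return set(order)

    def half(seq):
        chain = []
        for i in seq:
            # Drop points that would create a clockwise or collinear turn.
            while len(chain) >= 2 and cross(
                points2d[chain[-2]], points2d[chain[-1]], points2d[i]
            ) <= 0:
                chain.pop()
            chain.append(i)
        return chain

    # Lower chain from the sorted order, upper chain from the reversed order.
    return set(half(order)) | set(half(reversed(order)))


def candidate_vertices(data):
    """Union of 2-D hull vertices over all axis-pair projections of `data`.

    Assumption for illustration: projections are onto pairs of coordinate
    axes, not the embeddings used in the paper.
    """
    dim = len(data[0])
    candidates = set()
    for a, b in combinations(range(dim), 2):
        projected = [(row[a], row[b]) for row in data]
        candidates |= hull_indices_2d(projected)
    return sorted(candidates)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.random() for _ in range(4)] for _ in range(2000)]
    idx = candidate_vertices(data)
    print(len(idx), "candidate vertices out of", len(data), "points")
```

Each 2-D hull costs O(n log n), so the candidate set is cheap to compute even when the dataset is large. Selecting the final basis vectors from the candidates and solving for the convex coefficients of each data point are separate steps that this sketch does not cover.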

Author information

Corresponding author

Correspondence to Christian Thurau.

About this article

Cite this article

Thurau, C., Kersting, K., Wahabzada, M. et al. Convex non-negative matrix factorization for massive datasets. Knowl Inf Syst 29, 457–478 (2011). https://doi.org/10.1007/s10115-010-0352-6

  • DOI: https://doi.org/10.1007/s10115-010-0352-6
