Convex non-negative matrix factorization for massive datasets

Thurau, Christian; Kersting, Kristian; Wahabzada, Mirwaes; Bauckhage, Christian

doi:10.1007/s10115-010-0352-6

Regular paper
Published: 26 October 2010

Volume 29, pages 457–478, (2011)

Knowledge and Information Systems

Abstract

Non-negative matrix factorization (NMF) has become a standard tool in data mining, information retrieval, and signal processing. It is used to factorize a non-negative data matrix into two non-negative matrix factors that contain basis elements and linear coefficients, respectively. Often, the columns of the first resulting factor are interpreted as “cluster centroids” of the input data, and the columns of the second factor are understood to contain cluster membership indicators. When analyzing data such as collections of gene expressions, documents, or images, it is often beneficial to ensure that the resulting cluster centroids are meaningful, for instance, by restricting them to be convex combinations of data points. However, known approaches to convex-NMF suffer from high computational costs and are therefore hardly applicable to large-scale data analysis problems. This paper presents a new framework for convex-NMF that allows for an efficient factorization of data matrices of millions of data points. Motivated by the simple observation that each data point can be expressed as a convex combination of vertices of the data convex hull, we require the basic factors to be vertices of the data convex hull. The benefits of convex-hull NMF are twofold. First, for a growing number of data points, the expected size of the convex hull, i.e., the number of its vertices, grows much more slowly than the dataset. Second, distance-preserving low-dimensional embeddings allow us to efficiently sample the convex hull and hence to quickly determine candidate vertices. Our extensive experimental evaluation on large datasets shows that convex-hull NMF compares favorably to convex-NMF in terms of both speed and reconstruction quality. We demonstrate that our method can easily be applied to large-scale, real-world datasets, in our case consisting of 750,000 DBLP entries, 4,000,000 digital images, and 150,000,000 votes on World of Warcraft® guilds, respectively.
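The observation that every data point is a convex combination of convex-hull vertices can be illustrated with a small two-dimensional sketch. The code below is not the paper's algorithm: it assumes planar points, computes the hull exactly with Andrew's monotone chain, and recovers the weights with barycentric coordinates over a triangle fan. The function names (`cross`, `convex_hull`, `convex_weights`) and the tolerance `eps` are hypothetical choices made for this illustration only.

```python
import random


def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b); positive if counter-clockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Pop while the last two vertices and p do not make a strict left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def convex_weights(p, hull, eps=1e-9):
    """Weights over the hull vertices (>= -eps, summing to 1) that reproduce p.

    Assumes the hull has at least three vertices and that p lies inside it.
    The hull is split into a triangle fan rooted at hull[0]; the barycentric
    coordinates of p in the triangle that contains it are the weights.
    """
    v0 = hull[0]
    for i in range(1, len(hull) - 1):
        v1, v2 = hull[i], hull[i + 1]
        area = cross(v0, v1, v2)
        if area <= 0:
            continue  # skip degenerate triangles
        w0 = cross(p, v1, v2) / area
        w1 = cross(p, v2, v0) / area
        w2 = cross(p, v0, v1) / area
        if min(w0, w1, w2) >= -eps:
            weights = [0.0] * len(hull)
            weights[0], weights[i], weights[i + 1] = w0, w1, w2
            return weights
    raise ValueError("point lies outside the convex hull")


if __name__ == "__main__":
    random.seed(0)
    data = [(random.random(), random.random()) for _ in range(2000)]
    hull = convex_hull(data)
    w = convex_weights(data[0], hull)
    print("data points:", len(data), "hull vertices:", len(hull))
    print("non-zero weights for the first point:", [x for x in w if x != 0.0])
```

In this toy setting at most three weights per point are non-zero, and the hull typically has far fewer vertices than there are data points, which is the effect the abstract refers to. In higher dimensions an exact hull computation becomes impractical, which is why the abstract describes sampling the hull through low-dimensional embeddings instead.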

Author information

Authors and Affiliations

Fraunhofer IAIS, Schloss Birlinghoven, 53754, Sankt Augustin, Germany
Christian Thurau, Kristian Kersting, Mirwaes Wahabzada & Christian Bauckhage

Corresponding author

Correspondence to Christian Thurau.

About this article

Cite this article

Thurau, C., Kersting, K., Wahabzada, M. et al. Convex non-negative matrix factorization for massive datasets. Knowl Inf Syst 29, 457–478 (2011). https://doi.org/10.1007/s10115-010-0352-6

Received: 05 March 2010
Revised: 28 July 2010
Accepted: 09 October 2010
Published: 26 October 2010
Issue Date: November 2011
DOI: https://doi.org/10.1007/s10115-010-0352-6
