Distributed non-negative matrix factorization with determination of the number of latent features

Chennupati, Gopinath; Vangara, Raviteja; Skau, Erik; Djidjev, Hristo; Alexandrov, Boian

doi:10.1007/s11227-020-03181-6

Published: 08 February 2020

Volume 76, pages 7458–7488, (2020)

The Journal of Supercomputing

Gopinath Chennupati ORCID: orcid.org/0000-0002-6223-8570¹,
Raviteja Vangara²,
Erik Skau¹,
Hristo Djidjev¹ &
…
Boian Alexandrov²

1248 Accesses
15 Citations

Abstract

The holistic analysis and understanding of the latent (that is, not directly observable) variables and patterns buried in large datasets are crucial for data-driven science, decision making and emergency response. Such exploratory analyses require unsupervised learning methods for data mining and extraction of the latent features, and non-negative matrix factorization (NMF) is one of the most prominent of these methods. NMF is based on compute-intensive, non-convex constrained minimization, which, for large datasets, requires fast and distributed algorithms. However, current parallel implementations of NMF fail to estimate the number of latent features. In practice, identifying these features is both difficult and significant for pattern recognition and latent feature analysis, especially for large dense matrices. In this paper, we introduce a distributed NMF algorithm coupled with distributed custom clustering followed by a stability analysis on dense data, which we call DnMFk, to determine the number of latent variables. The results on synthetic data and the classical Swimmer dataset demonstrate the accuracy of model determination while scaling nearly linearly across multiple processors for large data. Further, we employ DnMFk to determine the number of hidden features in a terabyte-sized matrix.
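
As a rough illustration of the idea summarized above, and not the DnMFk implementation itself, the following single-process NumPy sketch runs NMF on randomly perturbed copies of a matrix for several candidate numbers of features k, groups the resulting feature vectors across runs, and scores each k by the minimum silhouette width and the mean relative reconstruction error. Everything in it is an assumption made for illustration: the function names (`nmf`, `align_columns`, `silhouettes`, `stability_scan`), the multiplicative-update solver, the greedy cosine matching used as a stand-in for the custom clustering, and the noise level and number of runs.

```python
import numpy as np


def nmf(X, k, rng, iters=300, eps=1e-9):
    """Frobenius-norm NMF by multiplicative updates: X ~ W @ H, with W, H >= 0."""
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H


def align_columns(ref, W):
    """Reorder the unit-norm columns of W so column i best matches ref[:, i].

    Greedy one-to-one matching on cosine similarity (an illustrative
    stand-in for a custom clustering step).
    """
    sim = ref.T @ W
    k = sim.shape[0]
    order = np.empty(k, dtype=int)
    for _ in range(k):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        order[i] = j
        sim[i, :] = -np.inf  # ref column i is taken
        sim[:, j] = -np.inf  # W column j is taken
    return W[:, order]


def silhouettes(clusters):
    """Silhouette widths under cosine distance.

    clusters has shape (k, n_runs, m): k clusters holding one unit-norm
    vector per run. Requires k >= 2 and n_runs >= 2.
    """
    k, r, m = clusters.shape
    pts = clusters.reshape(k * r, m)
    dist = np.clip(1.0 - pts @ pts.T, 0.0, None)
    labels = np.repeat(np.arange(k), r)
    s = np.empty(k * r)
    for p in range(k * r):
        own = labels[p]
        a = dist[p, labels == own].sum() / (r - 1)  # self-distance is ~0
        b = min(dist[p, labels == c].mean() for c in range(k) if c != own)
        s[p] = (b - a) / max(a, b, 1e-12)
    return s.reshape(k, r)


def stability_scan(X, ks, n_runs=8, noise=0.02, seed=0):
    """Return {k: (min silhouette, mean relative error)} for each k in ks."""
    rng = np.random.default_rng(seed)
    results = {}
    for k in ks:
        runs, errs = [], []
        for _ in range(n_runs):
            # Multiplicative noise keeps the perturbed copy non-negative.
            Xp = X * rng.uniform(1 - noise, 1 + noise, size=X.shape)
            W, H = nmf(Xp, k, rng)
            errs.append(np.linalg.norm(Xp - W @ H) / np.linalg.norm(Xp))
            W = W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-12)
            runs.append(W if not runs else align_columns(runs[0], W))
        clusters = np.stack([W.T for W in runs], axis=1)  # (k, n_runs, m)
        results[k] = (float(silhouettes(clusters).min()), float(np.mean(errs)))
    return results


if __name__ == "__main__":
    demo_rng = np.random.default_rng(42)
    X_demo = demo_rng.random((30, 3)) @ demo_rng.random((3, 40))  # 3 planted features
    for k, (sil, err) in stability_scan(X_demo, ks=range(2, 6)).items():
        print(f"k={k}  min silhouette={sil:.3f}  relative error={err:.4f}")
```

A typical way to read such a scan is to prefer the largest k for which the minimum silhouette stays high while the reconstruction error is already low; the thresholds for "high" and "low" are left to the user here. A distributed variant would additionally partition the matrix across processes and replace the local matrix products with collective operations, which this sketch does not attempt.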

Acknowledgements

This research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, which is supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. 89233218CNA000001.

Funding

This study was funded by U.S. Department of Energy National Nuclear Security Administration under Contract No. DE-AC52-06NA25396 through LANL Laboratory Directed Research and Development (LDRD) Grant 20190020DR.

Author information

Authors and Affiliations

Information Sciences (CCS-3) Group, Los Alamos National Laboratory (LANL), Los Alamos, NM, 87545, USA
Gopinath Chennupati, Erik Skau & Hristo Djidjev
Theoretical Division (T-1) Group, Los Alamos National Laboratory (LANL), Los Alamos, NM, 87545, USA
Raviteja Vangara & Boian Alexandrov

Corresponding author

Correspondence to Gopinath Chennupati.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 3 and 4.

Table 3 Execution times for distributed clustering and Silhouette calculation using DnMFk on two data matrices: Data 3 (\(57{,}600 \times 38{,}400\)) and Data 4 (\(129{,}600 \times 51{,}840\))

Table 4 Execution times for distributed clustering and Silhouette calculation using DnMFk on two data matrices: Data 1 (\(2^{21} \times 10 \times 10\)) and Data 2 (\(2^{20} \times 50 \times 10\))

See Figs. 14, 15 and 16.

About this article

Cite this article

Chennupati, G., Vangara, R., Skau, E. et al. Distributed non-negative matrix factorization with determination of the number of latent features. J Supercomput 76, 7458–7488 (2020). https://doi.org/10.1007/s11227-020-03181-6

Published: 08 February 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s11227-020-03181-6
