Abstract
The holistic analysis and understanding of the latent (that is, not directly observable) variables and patterns buried in large datasets is crucial for data-driven science, decision making, and emergency response. Such exploratory analyses require unsupervised learning methods for data mining and extraction of latent features, and non-negative matrix factorization (NMF) is one of the most prominent such methods. NMF is based on compute-intensive, non-convex constrained minimization, which, for large datasets, requires fast and distributed algorithms. However, current parallel implementations of NMF do not estimate the number of latent features. In practice, identifying these features is both difficult and important for pattern recognition and latent feature analysis, especially for large dense matrices. In this paper, we introduce a distributed NMF algorithm coupled with distributed custom clustering followed by a stability analysis on dense data, which we call DnMFk, to determine the number of latent variables. Results on synthetic data and the classical Swimmer dataset demonstrate accurate model determination together with nearly linear scaling across multiple processors for large data. Further, we employ DnMFk to determine the number of hidden features in a terabyte-sized matrix.
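To make the idea of stability-based model selection concrete, the following is a minimal single-node sketch in Python with NumPy. It is not the distributed DnMFk implementation. It assumes a Frobenius-norm objective solved with the standard Lee-Seung multiplicative updates, and it substitutes a simple best-match cosine similarity between factor columns for the custom clustering and stability analysis described in the abstract. The function names (`nmf`, `relative_error`, `stability`) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate X ~ W @ H with W, H >= 0 (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative because every
        # term in the ratio is non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

def relative_error(X, W, H):
    """Relative Frobenius reconstruction error."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)

def stability(X, k, n_runs=5, noise=0.02, seed=0):
    """Assumed stability proxy: factorize perturbed copies of X and
    measure how well the columns of W from each run match the first run."""
    rng = np.random.default_rng(seed)
    factors = []
    for r in range(n_runs):
        # Small multiplicative perturbation keeps the data non-negative.
        Xp = X * rng.uniform(1 - noise, 1 + noise, size=X.shape)
        W, _ = nmf(Xp, k, seed=seed + r)
        norms = np.linalg.norm(W, axis=0, keepdims=True) + 1e-12
        factors.append(W / norms)
    ref = factors[0]
    scores = []
    for W in factors[1:]:
        sim = ref.T @ W  # k x k cosine similarities between columns
        # Worst-case best match: 1.0 means every reference column recurs.
        scores.append(sim.max(axis=1).min())
    return float(np.mean(scores))

# Synthetic non-negative matrix with three latent features.
rng = np.random.default_rng(42)
X = rng.random((60, 3)) @ rng.random((3, 40))

W, H = nmf(X, 3, n_iter=500)
err = relative_error(X, W, H)

# Stability score for each candidate number of latent features.
scores = {k: stability(X, k) for k in range(2, 6)}
for k, s in scores.items():
    print(f"k={k}: stability={s:.3f}")
```

In a scheme of this kind, a candidate k is typically favored when the factors are stable across perturbations and the reconstruction error is low. The sketch only computes those two quantities and leaves the selection rule open.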
Acknowledgements
This research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, which is supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. 89233218CNA000001.
Funding
This study was funded by U.S. Department of Energy National Nuclear Security Administration under Contract No. DE-AC52-06NA25396 through LANL Laboratory Directed Research and Development (LDRD) Grant 20190020DR.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Chennupati, G., Vangara, R., Skau, E. et al. Distributed non-negative matrix factorization with determination of the number of latent features. J Supercomput 76, 7458–7488 (2020). https://doi.org/10.1007/s11227-020-03181-6
DOI: https://doi.org/10.1007/s11227-020-03181-6