Abstract
We propose an efficient distributed out-of-memory implementation of the non-negative matrix factorization (NMF) algorithm for heterogeneous high-performance-computing systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory problems where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/output latency associated with batch copies between host and device is hidden using CUDA streams to overlap data transfers and compute asynchronously, and latency associated with collective communications (both intranode and internode) is reduced using optimized NVIDIA Collective Communication Library (NCCL) based communicators. Benchmark results show significant improvement, from 32x to 76x speedup, with the new implementation using GPUs over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density \(10^{-6}\).
1 Introduction
NMF is a popular unsupervised learning method that extracts sparse and explainable latent features [1], which are often used to reveal explainable low-dimensional hidden structures that represent and classify the elements of the whole dataset [2]. NMF is used in big data analysis, which plays a crucial role in many problems, including human health, cyber security, economic stability, emergency response, and scientific discovery. With the increased accessibility to data and technology, datasets continue to grow in size and complexity. At the same time, the operational value of the information hidden in patterns in such datasets continues to grow in significance. Extracting explainable hidden features from large datasets, collected experimentally or computer-generated, is vital because the data presumably carries essential (but often previously unknown) information about the investigated phenomenon’s causality, relationships, and mechanisms. Discovering meaningful hidden patterns from data is not a trivial task because the datasets are formed only by directly observable quantities while the underlying processes or features, in general, remain unobserved, latent, or hidden [3].
Analysis of vast amounts of (usually sparse) data via NMF requires novel distributed approaches for reducing computational complexity, speeding up the computation, and dealing with data storage and data movement challenges. Most NMF computations are matrix-matrix multiplications, which GPU accelerators can speed up. The primary performance and scaling limiting factors in NMF implementations on modern heterogeneous HPC systems are high communication costs due to data movement across different system parts (internode and intranode communications). In various cases, these communication delays exceed the time the actual computations take, resulting in poor performance and poor scalability on large distributed systems.
The growth in data volumes is outpacing the improvement in hardware specifications, which causes significant challenges in extracting useful information from large-scale datasets using algorithms like NMF. This motivates the need for out-of-memory implementations of NMF for distributed HPC systems, which allow the decomposition of large datasets that do not fit in memory at once. Enabling out-of-memory factorization is very important because it removes the matrix size constraint imposed by the GPU memory, thus enabling the analysis of datasets up to the cumulative size of all RAM on the cluster. This is mainly required to address the challenges presented by the need to factorize ever-growing datasets. We utilize this unique ability of pyDNMFGPU to demonstrate the decomposition of record-large dense and sparse datasets.
To illustrate how pyDNMFGPU can be used as a building block for more comprehensive workflows, we integrate pyDNMFGPU with our existing model selection algorithm pyDNMFk^{Footnote 1}, which enables automatic determination of the (usually unknown) number of latent features on large-scale datasets [4,5,6,7,8]. The integrated model selection algorithm was previously used to decompose the world’s largest collection of human cancer genomes [9], defining cancer mutational signatures [10], and has been successfully applied to solve real-world problems in various fields [8, 11,12,13,14,15,16,17,18,19].
This integration yields our out-of-memory scalable tool, pyDNMFkGPU, which can estimate the number of latent features in extra-large sparse (tens of EBs) and dense (hundreds of TBs) datasets while operating across CPU-GPU hardware. To the best of our knowledge, our framework is the first to identify hidden features in large-scale dense and sparse datasets of this size.
In experiments on large HPC clusters, we show pyDNMFGPU’s potential: we measure up to 76x improvement on a single GPU over running on a single 18-core CPU. We also demonstrate weak scaling on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density \(10^{-6}\).
Our main contribution is a novel parallel NMF framework, called pyDNMFGPU, that minimizes data movement on GPUs, improving overall running times. Its novelty lies in a new distributed implementation of NMF with low memory complexity that enables the out-of-memory factorization of very large datasets. pyDNMFGPU takes advantage of the following three modern design choices:

pyDNMFGPU reduces the latency associated with local data transfer between the GPU and host (and vice versa) by using CUDA streams.

Latency associated with collective communications (intranode and internode) is reduced by using NCCL primitives.^{Footnote 2}

We incorporate a batching approach for internode communication, which provides a unique ability to perform out-of-memory NMF while using multiple GPUs for the bulk of computations.
The main contributions of the paper include:

Introducing a novel distributed algorithm with out-of-memory support for NMF for sparse and dense matrices operating across CPU-GPU hardware.

Reporting the first NCCL-communicator-accelerated NMF decomposition tool for distributed GPUs.

Demonstrating the framework’s scalability on record-breaking 340 Terabyte (TB) dense and 11 Exabyte (EB) sparse synthetic datasets.
The remainder of the paper is organized as follows: Sect. 2 gives a summary of NMF and the existing parallel NMF implementations. In Sect. 3, we detail the design considerations and choices for a scalable, parallel, and efficient algorithm in different configurations of the data size and available GPU VRAM, as well as the complexity of the new implementation. Sect. 4 shows the efficacy of pyDNMFGPU through different benchmark results and validates these results on a synthetic dataset with a predetermined number of latent features. We conclude with a summary and suggestions for possible future work in Sect. 5.
2 Background and related work
2.1 Nonnegative matrix factorization algorithms
NMF [1] approximates the nonnegative observational matrix \(\varvec{A}\in \mathbb {R}_{+}^{m \times n}\) with a product of two nonnegative factor matrices \(\varvec{W}\in \mathbb {R}_{+}^{m \times k}\) and \(\varvec{H}\in \mathbb {R}_{+}^{k \times n}\), where the columns of \(\varvec{W}\) represent the latent features, the columns of \(\varvec{H}\) are the coordinates/weights of the analyzed samples (the columns of \(\varvec{A}\)) in the reduced latent space, and k is the latent dimension of the data. The NMF minimization alternately updates each of these two factor matrices until the convergence condition \(\left\| \varvec{A} - \varvec{W} \varvec{H} \right\| _F \le \eta \) is reached. Here \(\left\| \cdot \right\| _F\) is the Frobenius norm, \(\left\| \varvec{A} \right\| _F = \sqrt{\sum _i \sum _j a_{ij}^2}\), where \(a_{ij}\) is the element in row i and column j, and \(\eta \) is the desired tolerance. Each iteration consists of a \(\varvec{W}\)-update substep followed by an \(\varvec{H}\)-update substep.
The Frobenius norm (FRO) based multiplicative update (MU) algorithm is presented in Algorithm 1. In addition to the presented Frobenius norm-based MU algorithm (which leads to a Gaussian model of the noise [20]), other similarity measures (e.g., KL-divergence, which corresponds to a Poisson model) can also be used in the NMF minimization. Also, based on the update rules, several variants of NMF algorithms exist, such as hierarchical alternating least squares (HALS) [21], alternating nonnegative least squares with block principal pivoting (ANLS-BPP) [22], and the block coordinate descent algorithm (BCD) [23]. These algorithms have different advantages in terms of convergence rate and computational and memory requirements. MU-based updates are computationally and memory-wise cheap at the cost of slower convergence. In contrast, HALS, BCD, and ANLS-BPP have faster convergence rates at the cost of higher computational and memory requirements and high communication costs for parallel implementations. In our experiments, we use the FRO-based MU algorithm to demonstrate record scalability on large datasets due to its lower computation and communication cost; the implementation can easily be modified to use another update algorithm or similarity metric.
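To make the alternating structure concrete, the following is a minimal single-process sketch of the Frobenius-norm MU iteration in plain Python. It uses the update expressions that appear later in the FLOP analysis (the element-wise ratios involving \(\varvec{A}\varvec{H}^T\), \(\varvec{W}\varvec{H}\varvec{H}^T\), \(\varvec{W}^T\varvec{A}\) and \(\varvec{W}^T\varvec{W}\varvec{H}\)). The function names, the random initialization, and the small constant `eps` are illustrative assumptions, not the pyDNMFGPU API.

```python
import random

def matmul(X, Y):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def fro_error(A, W, H):
    """Frobenius norm of A - W @ H."""
    X = matmul(W, H)
    return sum((a - x) ** 2 for ra, rx in zip(A, X) for a, x in zip(ra, rx)) ** 0.5

def mu_step(A, W, H, eps=1e-12):
    """One MU iteration: a W-update substep followed by an H-update substep."""
    Ht = transpose(H)
    AHt = matmul(A, Ht)                  # m x k
    WHHt = matmul(W, matmul(H, Ht))      # m x k
    W = [[w * x / (y + eps) for w, x, y in zip(rw, rx, ry)]
         for rw, rx, ry in zip(W, AHt, WHHt)]
    Wt = transpose(W)
    WtA = matmul(Wt, A)                  # k x n
    WtWH = matmul(matmul(Wt, W), H)      # k x n
    H = [[h * x / (y + eps) for h, x, y in zip(rh, rx, ry)]
         for rh, rx, ry in zip(H, WtA, WtWH)]
    return W, H

def nmf_mu(A, k, max_iter=200, eta=1e-6, seed=0):
    """Iterate mu_step until ||A - WH||_F <= eta or max_iter is reached."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Strictly positive random start (an assumption of this sketch).
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(max_iter):
        W, H = mu_step(A, W, H)
        if fro_error(A, W, H) <= eta:
            break
    return W, H

A = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 1.0, 1.0], [0.0, 1.0, 2.0]]
W, H = nmf_mu(A, k=2)
print("final Frobenius error:", fro_error(A, W, H))
```

Because each update multiplies the current factor by a nonnegative ratio, the factors stay nonnegative without any explicit projection step.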
2.2 Related work on distributed NMF
Several parallel implementations have been proposed to address the computational needs of NMF for large datasets, which involve multiple and repeated matrix-matrix multiplications of several orders of magnitude. The existing parallel implementations can be grouped into two categories: (i) shared memory and (ii) distributed memory. The majority of existing parallel works utilize shared-memory multiprocessors [24,25,26,27] and shared-memory GPUs [26,27,28,29] via the OpenMP and CUDA libraries, respectively. A majority of distributed-memory implementations rely on MPI primitives for distributed CPU [12, 30] and CUDA-aware MPI primitives for distributed GPU [28, 30] parallelization. Although shared-memory implementations drastically reduce the communication costs incurred by distributed-memory implementations [26], there is a constraint on how much data such frameworks can decompose. Due to this constraint, shared-memory implementations often cannot meet the computational and memory requirements of current large-scale datasets.
Almost all distributed GPU implementations, including NMFmGPU [28] and PLANC [33], rely on significant data communication for the update of the factors. This involves using CUDA-aware MPI primitives for data communication or MPI distributed memory offload through NVBLAS [33] without multi-node GPU communicators. Such implementations lead to high data movement costs due to data onloading/offloading to/from the device, which significantly raises communication costs compared to the computation cost for large data decomposition. This was previously illustrated with distributed BPP in PLANC [30] and distributed MU and BCD [12], where the communication cost is minimized by communicating only the two factor matrices and other partitioned matrices among MPI processes. These works attempt to reduce the bandwidth and data latency using MPI collective communication operations. For distributed CPU implementations, this approach works well as the communication cost is significantly lower than the computation cost. However, for GPU implementations, communication cost is higher due to device/host data transfer; therefore, communication cost is a limiting factor for parallel performance when using many GPUs.
Table 1 illustrates the comparison against the existing parallel NMF implementations. Further, support for the factorization of sparse datasets also adds value to our new pyDNMFGPU framework. Since many extra-large datasets, such as text corpora, knowledge graph embeddings (and, in general, most relational datasets), cyber network activity datasets, and many others, are highly sparse, sparse decomposition support dramatically reduces the memory and computational requirements, which would otherwise be a major bottleneck for the dense implementation. Despite the support for sparse datasets for shared memory in ALONMF and genten [26, 27] and for distributed memory in PLANC [30], there is no specific solution that addresses the bottlenecks due to the extracted dense factors and their communication for large sparse datasets. Even though the largest sparse datasets may be only a few MBs in size due to their extreme sparsity, decomposing such datasets would be challenging for most existing frameworks because the extracted factors are dense and very large. Even for such a small number of nonzero values, the corresponding dense factors could easily explode and require expensive communication of dense intermediate terms. Our batching framework provides a solution by accommodating larger intermediate dense factors, which has not been addressed previously.
2.3 Rationale for an algorithm for the out-of-memory distributed NMF
In pyDNMFGPU, we use a distributed implementation of NMF that aims at efficiently factorizing matrices of all sizes, even those too big to fit in available memory, in out-of-memory scenarios. To this end, pyDNMFGPU accelerates matrix operations using GPUs on modern heterogeneous systems, provides support for sparse matrix operations to deal with practical datasets, which are often sparse, and can partition large problems into smaller problems solved in a distributed manner. Above all, and to the best of our knowledge, our proposed implementation is the first to provide a solution for practical out-of-memory cases that require the factorization of data too big to be stored in the combined available GPU memory.
When performing NMF on GPUs, out-of-memory (OOM) situations can arise in various scenarios with different degrees of complexity. As discussed in [34], we distinguish three main types of OOM scenarios. Scenarios of type 0 (OOM0) concern practical problems where the input data \(\varvec{A}\) and its cofactors \(\varvec{W}\) and \(\varvec{H}\) can easily be stored in GPU memory. However, an explosion of the memory requirement can still occur, either because the unknown rank k becomes large, making \(\varvec{W}\) and \(\varvec{H}\) prohibitively expensive to store, or when computing intermediate results such as \(\varvec{X}=\varvec{W}\varvec{H}\) (line 8 of Algorithm 1) for a large sparse matrix \(\varvec{A}\) of very low density, where the resulting \(\varvec{X}\) is dense and very likely impossible to store on the GPU. For instance, if \(\varvec{A}\in \mathbb {R}^{10^6 \times 10^6}\) is a sparse matrix with density \(\delta \approx 10^{-3}\), the size of \(\varvec{A}\) in dense format, in single precision, is \(S_{\varvec{A}}\approx 4~TB\). Representing \(\varvec{A}\) in CSR sparse format lowers its size to \(S_s \sim 3\times S_A \times \delta \approx 12~GB\) (the factor of 3 accounts for storing the data, indices, and index pointers of the CSR format); consequently \(S_{NMF} \approx 2 \times S_{\varvec{A}}\approx 4~TB\). Assuming a very small k, \(\varvec{A}\) and all cofactors can be stored on the GPU; however, the calculation of the intermediate product \(\varvec{X} = \varvec{W}\varvec{H}\) (line 8 of Algorithm 1) would still require \(\sim 8~TB\) of GPU memory, making this scenario an OOM0 problem.
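The dense and CSR storage figures quoted above can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculator under the text's own assumptions (single precision, CSR overhead factor of 3); the helper names and the value k = 8 used for the factors are hypothetical.

```python
def dense_bytes(m, n, itemsize=4):
    """Bytes needed to store an m x n dense matrix (single precision by default)."""
    return m * n * itemsize

def csr_bytes_estimate(m, n, density, itemsize=4):
    """Rough CSR size following the text: about 3 x dense size x density."""
    return 3 * dense_bytes(m, n, itemsize) * density

def factor_bytes(m, n, k, itemsize=4):
    """Bytes needed for dense factors W (m x k) and H (k x n)."""
    return (m * k + k * n) * itemsize

m = n = 10**6
S_A = dense_bytes(m, n)                    # about 4 TB
S_s = csr_bytes_estimate(m, n, 1e-3)       # about 12 GB
S_WH = factor_bytes(m, n, k=8)             # hypothetical small rank
print(f"dense A: {S_A / 1e12:.1f} TB, CSR A: {S_s / 1e9:.1f} GB, W and H: {S_WH / 1e6:.0f} MB")
```

The gap between the CSR size and the dense size of any \(m \times n\) intermediate is what turns an otherwise comfortable problem into an OOM0 case.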
A more complex OOM scenario, type 1 (OOM1), arises in cases where matrix \(\varvec{A}\) and at most one of its cofactors cannot be cached in GPU memory; this is typically the case when dealing with a large \(\varvec{A}\) that is dense or sparse with high density. Scenarios of type 2 (OOM2) are the most complex and consist of practical cases where neither \(\varvec{A}\) nor its cofactors can be stored in GPU memory. Note that more complexity can arise in cases where data cannot fit in host RAM, but such cases are still of type 2 because the OOM classification here is based on GPU memory utilization. In other words, in OOM0 scenarios, all the data can be cached on the GPU; in OOM1 scenarios, the data can be partially cached on the GPU; and in OOM2 scenarios, none of the data can be cached on the GPU. The treatment of OOM2 scenarios is out of the scope of this study.
OOM0 cases can easily be handled using tiling techniques, and OOM1 cases can be handled with batching techniques. In extreme OOM1 cases, we complement batching with tiling to further reduce the memory footprint.
Both batching and tiling are block-based computational techniques designed to simplify larger, memory-intensive computations into smaller, manageable, and partially solvable tasks. Each technique, however, functions in a distinct setting and serves a different purpose. Batching is a process that operates on the host, necessitating consistent data transfer between the host and the device. The efficacy of batching techniques is heavily reliant on the speed of the interconnecting buses between the host and device, such as PCIe or NVLink. Batching techniques become crucial when dealing with OOM1 problems, as they help in transferring partially computed results. Conversely, tiling happens directly within the device memory, resulting in data transfer between global memory and shared or cache memory. The performance of tiling techniques is primarily governed by the GPU architecture, including features like memory speed and available shared memory. Tiling techniques are especially effective for tackling OOM0 problems, as they handle computational tasks directly on the device. Notably, batching is typically irrelevant for OOM0 problems as these computations are already based on the device. Similarly, tiling techniques alone cannot address OOM1 issues due to the preliminary need to transfer operands to the device. However, an optimized solution for extreme OOM1 problems can be achieved by strategically combining both batching and tiling techniques, thus enhancing the overall performance.
In the section below, we discuss our implementation and design choices.
3 pyDNMFGPU for heterogeneous systems
An efficient implementation of NMF for distributed heterogeneous systems should avoid the high costs associated with communication (data transfer) that result from poor consideration of data locality in the distribution of the computational work. Furthermore, cases where resources such as the combined available GPU memory are limited require additional considerations and various trade-offs. For instance, it is sometimes better to replicate data over the distributed compute grid to reduce communication. Other times, it is acceptable to use batching techniques that can increase communication costs to lower the memory footprint. Below, in subsection 3.1, we first discuss our distributed data partition strategies, which split large problems into smaller problems solvable on cooperative distributed systems. Then, in subsection 3.2, we discuss our tiling and batching approaches, used to handle practical scenarios of type 0 and type 1, respectively.
3.1 Distributed implementation
Our implementation considers two one-dimensional data partition strategies based on the shape of matrix \(\varvec{A}\) (\(m\times n\)). A column (vertical) partition, CNMF, is employed when \(n > m\), and a row (horizontal) partition, RNMF, is used otherwise.
Assume a distributed system with N GPUs where each GPU is indexed by its global rank \(g_{ID}\). In the CNMF approach illustrated in Fig. 1a, the \(j^{th}\) GPU with \(g_{ID}=j\) works on array partitions \(\varvec{A}[:, j_0:j_1]\), \(\varvec{H}[:, j_0:j_1]\) and \(\varvec{W}\), where \(j_0=j \times J\), \(j_1=(j+1) \times J\), and \(J=n/N\) is the partition size. Each GPU gets a full copy of \(\varvec{W}\) (\(\varvec{W}\) is replicated) and a unique partition of \(\varvec{A}\) and \(\varvec{H}\). This translates into a segmentation of arrays \(\varvec{A}\) and \(\varvec{H}\) in global memory, illustrated with solid lines in Fig. 1a. These solid lines indicate boundaries in global memory and consequently help conceptualize where communication is required whenever information is exchanged from one bounded region to another. The \(\varvec{H}\)-update is embarrassingly parallel since \(\varvec{W}^T\varvec{W}\), \((\varvec{W}^T\varvec{W})\varvec{H}\), and \(\varvec{W}^T\varvec{A}\) can all be computed locally on each GPU; the \(\varvec{W}\)-update, on the other hand, requires two separate all-reduce-sum communications to compute \(\varvec{A}\varvec{H}^T\) and \(\varvec{H}\varvec{H}^T\), as indicated in lines 7 and 10 of Algorithm 2.
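The column partition and the all-reduce-sum it implies can be emulated on a single process. In the sketch below, each loop turn plays the role of one GPU that multiplies its local slices \(\varvec{A}[:, j_0:j_1]\) and \(\varvec{H}[:, j_0:j_1]\); summing the partial products stands in for the collective call. The helper names are hypothetical, N is assumed to divide n evenly, and no actual NCCL or MPI call is made.

```python
def matmul(X, Y):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def column_partition(n, N, j):
    """Return (j0, j1): GPU j owns columns j0:j1 of A and H (assumes N divides n)."""
    J = n // N
    return j * J, (j + 1) * J

def cnmf_AHt(A, H, N):
    """Emulate the all-reduce-sum of the per-GPU partial products A[:, J] @ H[:, J]^T."""
    m, k, n = len(A), len(H), len(A[0])
    total = [[0.0] * k for _ in range(m)]
    for j in range(N):                              # one loop turn per emulated GPU
        j0, j1 = column_partition(n, N, j)
        A_loc = [row[j0:j1] for row in A]
        H_loc = [row[j0:j1] for row in H]
        part = matmul(A_loc, transpose(H_loc))      # local m x k contribution
        total = [[t + p for t, p in zip(rt, rp)] for rt, rp in zip(total, part)]
    return total

A = [[1, 2, 3, 4], [5, 6, 7, 8]]
H = [[1, 0, 2, 1], [0, 3, 1, 2]]
print(cnmf_AHt(A, H, N=2) == matmul(A, transpose(H)))
```

The same pattern applies to \(\varvec{H}\varvec{H}^T\); only the size of the reduced buffer changes, from \(m \times k\) to \(k \times k\).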
Analogously, the RNMF approach results in \(\varvec{H}\) being replicated on the different GPUs and \(\varvec{A}\) and \(\varvec{W}\) being distributed across the compute grid. This time the \(\varvec{W}\)-update is embarrassingly parallel since \(\varvec{H}\varvec{H}^T\), \(\varvec{W}(\varvec{H}\varvec{H}^T)\), and \(\varvec{A}\varvec{H}^T\) can all be computed locally on each GPU, but the \(\varvec{H}\)-update requires separate all-reduce-sum communications to compute \(\varvec{W}^T\varvec{W}\) and \(\varvec{W}^T\varvec{A}\), as presented in Algorithm 3.
Communication takes place through various channels with different bandwidths and latencies. We refer to any communication within the same node as intranode communication (yellow, pink, and black lines in Fig. 2) and to communication between different nodes as internode communication (red lines in Fig. 2). The latter often has the lowest bandwidth and highest latency and could easily cause bottlenecks for distributed algorithms such as NMF. For these practical reasons, our implementation avoids all-reduce collective calls as much as possible. When \(n>m\), CNMF is more efficient than RNMF because it costs less to communicate \(\varvec{A}\varvec{H}^T\) of shape \(m \times k\), and RNMF is more efficient when \(m>n\) because it costs less to communicate \(\varvec{W}^T\varvec{A}\) of shape \(k \times n\).
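The shape-based choice between the two partitions can be read as a comparison of all-reduced buffer sizes. The sketch below counts the elements each strategy has to reduce per iteration, using the shapes stated above plus the \(k \times k\) Gram matrix; it deliberately ignores bandwidth, latency, and topology, and the function names are illustrative only.

```python
def allreduce_elements(m, n, k, strategy):
    """Number of elements all-reduced per iteration for a given partition strategy."""
    if strategy == "CNMF":      # reduces A @ H^T (m x k) and H @ H^T (k x k)
        return m * k + k * k
    if strategy == "RNMF":      # reduces W^T @ A (k x n) and W^T @ W (k x k)
        return k * n + k * k
    raise ValueError(f"unknown strategy: {strategy}")

def choose_strategy(m, n):
    """Column partition when n > m, row partition otherwise, as in the text."""
    return "CNMF" if n > m else "RNMF"

m, n, k = 1_000, 1_000_000, 16
best = choose_strategy(m, n)
print(best, allreduce_elements(m, n, k, "CNMF"), allreduce_elements(m, n, k, "RNMF"))
```

For a wide matrix such as this one, the column partition reduces a buffer roughly n/m times smaller than the row partition would.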
The FLOP (floating-point operation) count for the given distributed RNMF (row-wise non-negative matrix factorization) algorithm can be calculated by going through each of the operations performed in the algorithm. Below is a rough estimate of the FLOP count for each line of interest in the algorithm:

Matrix multiplication (Line 3): \(\varvec{WTA} = (\varvec{W}^{(l)})^T @ \varvec{A} \). Here we have a matrix multiplication of size \((k \times I) * (I \times n)\), which results in \(2k*I*n - k*n\) FLOPs.

Matrix multiplication (Line 5): \(\varvec{WTW} = (\varvec{W}^{(l)})^T @ \varvec{W}^{(l)} \). Here we have a matrix multiplication of size \((k \times I) * (I \times k)\), which results in \(2k*I*k - k*k\) FLOPs.

Element-wise multiplication and division (Line 7): \(\varvec{H}^{(l+1)} = (\varvec{H}^{(l)} * \varvec{WTA}) / (\varvec{WTW} @ \varvec{H}^{(l)} + \epsilon ) \). This consists of \(k*n\) FLOPs for element-wise multiplication and \(k*n\) FLOPs for element-wise division, for a total of \(2*k*n\) FLOPs.

Matrix multiplication (Line 8): \(\varvec{HHT} = \varvec{H}^{(l+1)} @ (\varvec{H}^{(l+1)})^T \). Here we have a matrix multiplication of size \((k \times n) * (n \times k)\), which results in \(2k*n*k - k*k\) FLOPs.

Matrix multiplication (Line 9): \(\varvec{WHHT} = \varvec{W}^{(l)} @ \varvec{HHT} \). Here we have a matrix multiplication of size \((I \times k) * (k \times k)\), which results in \(2I*k*k - I*k\) FLOPs.

Matrix multiplication (Line 10): \(\varvec{AHT} = \varvec{A} @ \varvec{H}^T \). Here we have a matrix multiplication of size \((I \times n) * (n \times k)\), which results in \(2I*n*k - I*k\) FLOPs.

Element-wise multiplication and division (Line 11): \(\varvec{W}^{(l+1)} = \varvec{W}^{(l)} * \varvec{AHT}/(\varvec{WHHT}+\epsilon )\). This consists of \(I*k\) FLOPs for element-wise multiplication and \(I*k\) FLOPs for element-wise division, for a total of \(2*I*k\) FLOPs.
Note: The All_Reduce operations (Lines 4 and 6) are communication operations and are not considered in the FLOP count as they do not involve any computation.
So, the total FLOPs for each iteration of the loop = \(2k*I*n + 2k*I*k + 2*k*n + 2k*n*k + 2I*k*k + 2I*n*k + 2*I*k - k*n - k*k - k*k - I*k - I*k\).
For \(max_{iter}\) iterations, the total FLOPs are \(max_{iter}\) times the FLOPs per iteration. To compute GFLOPS, we have GFLOPS = total_FLOPs/(total_time\(\times 1e9)\). Moreover, given the device peak GFLOPS (peakG), we can compute efficiency as GFLOPS/peakG*100%.
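The per-line estimates above translate directly into a small calculator. The sketch below follows the same rough accounting (matrix products at \(2abc - ac\) FLOPs, two FLOPs per element for the element-wise steps, collectives excluded). The problem size and timing in the usage lines are made-up illustrative values; only the 9.3 teraflops single-precision peak is taken from the hardware description of the P100.

```python
def rnmf_flops_per_iter(I, n, k):
    """FLOPs of one RNMF iteration on a local row block of size I x n with rank k."""
    matmuls = (
        (2 * k * I * n - k * n)      # W^T @ A
        + (2 * k * I * k - k * k)    # W^T @ W
        + (2 * k * n * k - k * k)    # H @ H^T
        + (2 * I * k * k - I * k)    # W @ (H H^T)
        + (2 * I * n * k - I * k)    # A @ H^T
    )
    elementwise = 2 * k * n + 2 * I * k   # H-update and W-update ratios
    return matmuls + elementwise

def gflops(total_flops, total_time_s):
    """Achieved GFLOPS for a run that took total_time_s seconds."""
    return total_flops / (total_time_s * 1e9)

def efficiency_percent(achieved_gflops, peak_gflops):
    """Achieved throughput as a percentage of the device peak."""
    return achieved_gflops / peak_gflops * 100.0

# Hypothetical run: the sizes and the 0.5 s timing are illustrative only.
max_iter = 100
total = max_iter * rnmf_flops_per_iter(I=1000, n=1000, k=10)
achieved = gflops(total, total_time_s=0.5)
print(f"{achieved:.2f} GFLOPS, {efficiency_percent(achieved, 9300.0):.3f}% of peak")
```

Since the estimate omits the \(\varvec{WTW} @ \varvec{H}\) product and the additions of \(\epsilon\), it slightly undercounts the work actually performed, so efficiencies computed this way are conservative.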
The total VRAM required to factorize \(\varvec{A}\) of size \(size(\varvec{A}) = S_A\) (in bytes) is typically of the order of \(S_{NMF} \sim 4 \times S_A\): one fold of \(S_A\) to store \(\varvec{A}\) in memory, another fold to store the perturbed \(\varvec{A}\) [7], an additional fold to compute the intermediate product \(\varvec{X}=\varvec{W}@\varvec{H}\) when checking the convergence condition \(\Arrowvert \varvec{A} - \varvec{W} \varvec{H} \Arrowvert _F \le \eta \), and almost one full fold to store the cofactors \(\varvec{W}\), \(\varvec{H}\), and heavy intermediate products such as \(\varvec{W}^{T}\varvec{A}\) or \(\varvec{A}\varvec{H}^{T}\). When the total available combined GPU VRAM, \(S_{GV}\), is lower than \(S_{NMF}\), as in practical big data applications, batching techniques are imperative. Batching, in most cases, increases intranode and internode communication overheads. Although this can significantly affect the algorithm’s performance, proper use of asynchronous data copies and CUDA streams can reduce the performance loss by overlapping compute and data transfers, as discussed in our out-of-memory implementation below.
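The rule of thumb \(S_{NMF} \sim 4 \times S_A\) gives a quick feasibility check before a run. The following sketch compares that estimate with the combined VRAM \(S_{GV}\); the function names are hypothetical, and the example uses a made-up 20 GB matrix on one node with four 16 GB GPUs.

```python
def nmf_vram_estimate(S_A, folds=4):
    """Estimated VRAM for NMF on a matrix of S_A bytes (about 4 folds, per the text)."""
    return folds * S_A

def needs_batching(S_A, S_GV, folds=4):
    """True when the estimate exceeds the combined available GPU VRAM S_GV."""
    return nmf_vram_estimate(S_A, folds) > S_GV

GB = 10**9
S_A = 20 * GB            # illustrative input size
S_GV = 4 * 16 * GB       # four GPUs with 16 GB each
print("batching required:", needs_batching(S_A, S_GV))
```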
3.2 Out-of-memory implementation and memory complexity analysis
In pyDNMFGPU, OOM0 problems are handled using a tiling approach where temporary results like \(\varvec{A}\varvec{H}^T\), \(\varvec{W}^T\varvec{A}\) or \(\varvec{W}\varvec{H}\) are evaluated in small chunks, by tiling one of the operands, such that the size of the tile sets the memory required for the calculation. In RNMF, for instance, the criterion \(\Arrowvert \varvec{A} - \varvec{W} \varvec{H} \Arrowvert _F \le \eta \) can be evaluated in m/p small chunks obtained by tiling \(\varvec{W}\) into smaller tiles of size \(p \times k\). This results in computing \(n_t = m/p\) chunks of \([\Arrowvert \varvec{A} - \varvec{W} \varvec{H} \Arrowvert _F]_t\), which are accumulated into the total error e such that \(e=\sum _{t=0}^{m/p-1} [\Arrowvert \varvec{A} - \varvec{W} \varvec{H} \Arrowvert _F]_t\), which can later be used to check the convergence condition \(e \le \eta \). This reduces the memory required to check the convergence criterion from \({\mathcal {O}}(m \times n)\) to \({\mathcal {O}}(p \times n)\). Because all matrices involved in the calculations are stored in GPU memory, the performance loss due to tiling can be negligible, especially on modern GPU architectures like the NVIDIA Ampere A100, which uses low-latency, high-bandwidth HBM memory. Using the tiling approach, the memory required to perform NMF on the GPU can be reduced from \(S_{NMF} \sim 4 S_A\) to approximately \(2 S_A \le S_{NMF} \le 3 S_A\).
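A minimal sketch of the tiled error evaluation is given below in plain Python. Only one \(p \times n\) tile of \(\varvec{W}\varvec{H}\) exists at any time, which is the source of the \({\mathcal {O}}(p \times n)\) footprint. As an implementation choice of this sketch, the per-tile contributions are accumulated as squared norms and the square root is taken once at the end, since squared Frobenius norms add exactly across row blocks. The function names are assumptions.

```python
def matmul(X, Y):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def tiled_fro_error(A, W, H, p):
    """Frobenius norm of A - W @ H, evaluated p rows of W at a time."""
    sq = 0.0
    for t0 in range(0, len(A), p):
        W_t = W[t0:t0 + p]                 # tile of W of size (at most) p x k
        X_t = matmul(W_t, H)               # only p x n values are live at once
        sq += sum((a - x) ** 2
                  for ra, rx in zip(A[t0:t0 + p], X_t)
                  for a, x in zip(ra, rx))
    return sq ** 0.5

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W = [[1.0], [1.0], [2.0]]
H = [[1.0, 1.0]]
print(tiled_fro_error(A, W, H, p=1), tiled_fro_error(A, W, H, p=3))
```

The tile size p is a pure memory/launch-overhead trade-off here: the result does not depend on it, up to floating-point summation order.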
When dealing with OOM1 cases, light arrays are cached in GPU memory, and heavier arrays are kept in host memory and batched to their respective GPUs as needed. Further, an appropriate batching strategy for the chosen memory partition is required to limit unnecessary D2H and H2D copies. In pyDNMFkGPU, we employ a 1D co-linear batching strategy, illustrated in Fig. 1b, where the elements in the batch are arrays of length equal to \(\max (m, n)\). This batching strategy requires half the D2H and H2D memory copies of an orthogonal batching strategy, illustrated in Fig. 1a for the column partition, where the elements in the batch are vectors of length equal to \(\min (m, n)\). Let p be a batch size control parameter. In RNMF (CNMF), the number of batches is then given by \(n_B = m/p\) (\(n_B = n/p\)). In the extreme case where both m and n are very large, only the light array \(\varvec{W}[J,:]\) is cached in GPU memory, and the heavier arrays \(\varvec{A}[J, b_0:b_1]\) (\(\varvec{A}[b_0:b_1, J]\)) and \(\varvec{H}[:, b_0:b_1]\) (\(\varvec{W}[b_0:b_1,:]\)) are batched to their respective GPUs, such that for the \(b^{th}\) batch, \(b_0=b \times p\) and \(b_1=(b+1) \times p\).
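The batch index arithmetic can be sketched as follows. The text defines \(b_0 = b \times p\), \(b_1 = (b+1) \times p\) and \(n_B\) as the ratio of the batched dimension to p; the ceiling division and the clamping of the last batch are additions of this sketch for the case where p does not divide the dimension evenly, and the helper names are hypothetical.

```python
def num_batches(length, p):
    """Number of batches of size p needed to cover `length` (ceiling division)."""
    return -(-length // p)

def batch_bounds(b, p, length):
    """Half-open index range [b0, b1) of the b-th batch, clamped to `length`."""
    b0 = b * p
    b1 = min((b + 1) * p, length)
    return b0, b1

m, p = 10, 4
for b in range(num_batches(m, p)):
    print(b, batch_bounds(b, p, m))
```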
An implementation of the distributed CNMF with orthogonal batching is given in Algorithm 4. The calculation of the different intermediate products is illustrated in Fig. 3, where batch delimitation is represented with dashed lines. The top row shows all intermediate products computed during the \(\varvec{H}\)-update, and products computed in the \(\varvec{W}\)-update are shown in the bottom row. Intermediate products \(\varvec{W}^T @ \varvec{A}\) and \(\varvec{W}^T@\varvec{W}\) can be computed with \(n_B\) independent batches, each containing the subproducts \([\varvec{W}^T @ \varvec{A}]_b\) and \([\varvec{W}^T @ \varvec{W}]_b\). Each batch is queued to a non-default CUDA stream \(Stm_b\) along with the transfer of \(\varvec{A}_b[b_0:b_1,J]\) and \(\varvec{W}_b[b_0:b_1,:]\), and when calculated, each subproduct is added to a local accumulator (see lines 10–11 of Algorithm 4). Once all batches have been processed, all accumulators are reduced to obtain the full values of \(\varvec{W}^T\varvec{W}\) and \(\varvec{W}^T\varvec{A}\) (see lines 15–16 of Algorithm 4). Note that this reduction is local to each GPU and does not involve communication. Special batch enqueuing and dequeuing policies are implemented with CUDA events, so as to limit the number of concurrent batches on the GPU to \(q_s\) (see lines 6–7 and 12 of Algorithm 4). This way, the memory requirement for the \(\varvec{H}\)-update is bounded by \(q_s \times [p\times J]\), as \(\varvec{W}^T\varvec{W}@\varvec{H}\) and \(\varvec{H}*(\varvec{W}^T\varvec{A})/(\varvec{W}^T\varvec{W}\varvec{H} + \epsilon )\) have a \(k \times J\) memory requirement. This is important, especially when dealing with large sparse arrays, which can be cheap to cache on the device but can also have cofactors that become prohibitively expensive to cache when k becomes large. For instance, in CNMF, when \(m \sim 10 \ million\), the size of \(\varvec{H}\) will approximate 20 GB in single precision when \(k \sim 512\).
Intermediate products \(\varvec{A}@\varvec{H}^T\) and \(\varvec{W}@\varvec{H}\varvec{H}^T\) of the \(\varvec{W}\)-update are computed similarly to \(\varvec{W}^T @ \varvec{A}\) and \(\varvec{W}^T@\varvec{W}\), except that \(\varvec{A}@\varvec{H}^T\) requires an intermediate all-reduce-sum of the subproducts \([\varvec{A} @ \varvec{H}^T]_b\) of batches with the same stream number from the different GPUs (see line 28 of Algorithm 4). The resulting memory complexity of this implementation is of the order of \({\mathcal {O}}(p \times n \times q_s)\) when \(p \gg k\), which is the aggregated memory utilization caused by the \(q_s\) concurrent uploads of batches of \(\varvec{A}\) of size \(p \times n\) at line 8 or line 24 of Algorithm 4. This is a significant saving compared to the estimated \(S_{NMF} \sim 3 \times S_A\) when not checking the convergence condition \(\Arrowvert \varvec{A} - \varvec{W} \varvec{H} \Arrowvert _F \le \eta \). When the convergence criterion is checked, the error computation is tiled as it was for OOM0 scenarios, resulting in a memory utilization of \(S_{NMF} \sim 2 \times p \times n \times q_s\) when \(p \gg k\).
Note that the use of batches here only increases intranode communication due to memory copies, as it is not possible to cache \(\varvec{A}\) and \(\varvec{W}\) on the device. However, the example of Algorithm 4 discussed above exposes major shortcomings of orthogonal batching. First, the need to upload batches twice (lines 8–9 and lines 24–25 of Algorithm 4) is very inefficient, as the second set of H2D copies almost doubles the data transfer costs. Second, unnecessary additional latency due to load balancing delays can occur at line 28 of Algorithm 4 when the streams are scheduled in a different order on the different GPUs. Above all, both inefficiencies multiply with the number of iterations (see line 4 of Algorithm 4).
A better implementation uses a co-linear batching strategy, as is done in the batched implementation of the distributed RNMF given in Algorithm 5. The calculation of the different intermediate products is illustrated in Fig. 4. The top row shows all intermediate products computed during the \(\varvec{W}\)-update, and the products computed in the \(\varvec{H}\)-update are shown in the bottom row. The \(\varvec{W}\)-update (cartoons 1–4 of Fig. 4) is embarrassingly parallel and can be done at the batch level. This means that within each batch, the updated partition of \(\varvec{W}\) is readily available to compute the local subproducts \(\varvec{W}^T@\varvec{A}\) and \(\varvec{W}^T@\varvec{W}\) of the \(\varvec{H}\)-update. This avoids the second data upload that the orthogonal batching strategy requires. Further, the aggregation of \(\varvec{W}^T@\varvec{A}\) and \(\varvec{W}^T@\varvec{W}\) consists of a local accumulation of the subproducts (lines 16–17 of Algorithm 5), followed by a local reduction (lines 21–22 of Algorithm 5) and then a global reduction (lines 23–24 of Algorithm 5), as illustrated in cartoons 5–6 of Fig. 4. This does not require communication between batches of the same stream number and consequently avoids the load-balancing issues discussed above for the orthogonal batching strategy.
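The single-pass structure of co-linear batching can be illustrated with a host-side NumPy sketch. It is a simplified, single-process stand-in for Algorithm 5 under stated assumptions: the function name `colinear_iteration` is hypothetical, array slicing replaces the host-to-device copies, and the global reduction across GPUs is omitted.

```python
import numpy as np

def colinear_iteration(A, W, H, n_batches, eps=1e-12):
    """One NMF iteration in which each row batch is visited exactly once."""
    k, n = H.shape
    HHT = H @ H.T                          # small k x k product, computed once
    WTA = np.zeros((k, n))
    WTW = np.zeros((k, k))
    W_new = np.empty_like(W)
    for rows in np.array_split(np.arange(A.shape[0]), n_batches):
        A_b, W_b = A[rows], W[rows]
        # The W-update is row-local, so it can finish inside the batch.
        W_b = W_b * (A_b @ H.T) / (W_b @ HHT + eps)
        W_new[rows] = W_b
        # Reuse the batch while it is still resident to feed the H-update.
        WTA += W_b.T @ A_b
        WTW += W_b.T @ W_b
    H_new = H * WTA / (WTW @ H + eps)
    return W_new, H_new
```

Each batch of \(\varvec{A}\) is touched once per iteration, and the result matches an unbatched \(\varvec{W}\)-update followed by an \(\varvec{H}\)-update up to floating-point rounding.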
4 Benchmark results and discussion
4.1 Hardware infrastructure and software environment
Benchmark tests were performed on three different HPC clusters to illustrate the portability and scalability of pyDNMFGPU. The first cluster, Kodiak, is a LANL internal HPC cluster with 133 compute nodes, each with dual Xeon E5-2695 v4 CPUs and four NVIDIA Pascal P100 GPGPUs. Each P100 has 16 GB of VRAM, uses PCIe 16X Gen 3 links, and peaks at 9.3 teraflops in single precision. The cluster peaks at 1850 TF/s and uses an InfiniBand interconnect. The second cluster, Chicoma, is also a LANL internal HPC cluster, composed of 118 compute nodes, each with 2 AMD EPYC 7713 processors and 4 NVIDIA Ampere A100 GPUs. The AMD EPYC 7713 CPUs have 64 cores peaking at 3.67 GHz and 256 GB RAM. Each of the four NVIDIA A100 GPUs in each node provides a theoretical double-precision arithmetic capability of approximately 19.5 teraflops with 40 GB of VRAM. The nodes are networked with an HPE/Cray Slingshot 10 interconnect with 100 Gbit/s bandwidth. Chicoma runs the Shasta 1.4 OS and the SLURM job manager. The third cluster, Summit, peaks at over 200 petaflops in double-precision theoretical performance and comprises 4600 IBM AC922 compute nodes, each with two IBM POWER9 CPUs and six NVIDIA Volta V100 GPUs, each of which peaks at 15.7 teraflops in single precision. The POWER9 CPUs have 22 cores running at 3.07 GHz. The six NVIDIA Tesla V100 GPUs in each node provide a theoretical double-precision arithmetic capability of approximately 40 teraflops with 16 GB of VRAM per GPU. Dual NVLink 2.0 connections between CPUs and GPUs provide a 25 GB/s transfer rate in each direction on each NVLink, yielding an aggregate bidirectional bandwidth of 100 GB/s. The nodes are networked in a non-blocking fat-tree topology by InfiniBand. Summit deploys an RHEL 7.4 OS and the IBM job step manager jsrun to run compute jobs. Jsrun provides fine control of how node-level resources are allocated on these systems, including CPU cores, GPUs, and hardware threads.
pyDNMFGPU is written in Python and uses off-the-shelf Python libraries such as CuPy [35], NumPy [36], mpi4py [37] and SciPy [38]. It supports dense and sparse datasets on various hardware architectures and handles communication using a low-latency NCCL-based communicator. NCCL is an open-source library providing inter-GPU communication primitives, developed and maintained by NVIDIA. NCCL performs automatic hardware topology detection, which it then uses in graph search algorithms to identify the communication paths that offer the highest bandwidth and lowest latency for intra-node and inter-node communication (i.e., between GPUs on the same compute node, as well as between GPUs on separate compute nodes). NCCL is compatible with many multi-GPU parallelization models, and provides MPI-like collective and point-to-point operations such as all-gather, reduce, broadcast, all-reduce, send, and recv. NCCL was initially proposed to address the need to transfer large-message GPU buffers efficiently in deep learning applications. Many leading deep learning frameworks such as Chainer, PyTorch, and TensorFlow have since integrated NCCL to accelerate deep learning training on multi-GPU and multi-node systems, which motivated us to use NCCL to handle communication in our work. All implementations discussed in the section above were found to benefit from reduced data transfer latency and improved communication performance (both intra-node and inter-node) when using our low-latency NCCL-based communicators instead of MPI. An example of this gain in communication performance is given in subsection 4.2 below, which compares the new NMF implementation proposed in this work, which uses an NCCL-based communicator, to the prior pyDNMFk, which uses a traditional MPI-based communicator. A more comprehensive and detailed comparative study of NCCL and MPI can be found in the analysis by Awan et al. [39].
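The semantics of the all-reduce collective used for the intermediate products can be illustrated without any GPU library. The sketch below only simulates ranks as entries of a Python list; it does not call NCCL or MPI, and the function name `simulated_allreduce_sum` and the toy sizes are hypothetical.

```python
import numpy as np

def simulated_allreduce_sum(per_rank_buffers):
    """Return what every rank would hold after an all-reduce with a sum operator."""
    total = np.sum(per_rank_buffers, axis=0)
    # Every rank receives its own copy of the same reduced buffer.
    return [total.copy() for _ in per_rank_buffers]

# Four simulated ranks, each holding a local k x k partial product W_r^T @ W_r.
rng = np.random.default_rng(1)
W_parts = [rng.random((16, 3)) for _ in range(4)]
partials = [W_r.T @ W_r for W_r in W_parts]
reduced = simulated_allreduce_sum(partials)
W_global = np.vstack(W_parts)   # the full W, assembled only to check the result
```

After the collective, every rank holds the same \(k \times k\) matrix, equal to \(\varvec{W}^T\varvec{W}\) of the row-stacked global \(\varvec{W}\), without any rank ever having to assemble the full factor.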
4.2 Performance benchmark results of pyDNMFGPU vs pyDNMFk
The performance gained using GPU over CPU is assessed with the speedup, computed as the ratio of the time measured on CPU with pyDNMFk [7] to the time measured on GPU with pyDNMFGPU. For this study, we used a dense matrix whose shape and memory size \(S_A\) (in bytes) scale as \([N \times 65536, 32768]\) and \(N \times 8\) GB, respectively, where N is the number of GPU or CPU units. Speedups measured on the Kodiak cluster are reported in Fig. 5. Figure 5a shows the speedup in NMF time as a function of the number of units for various k. First, we note an increasing speedup with the increasing number of units, and second, we note a decreasing performance with increasing k when \(k \ge 32\). The low performance observed at \(k < 32\) is explained by low GPU occupancy. The best performance is obtained when \(k=32\), peaking at 76X. We also report the speedup in communication time, computed as the ratio of the total communication time measured with pyDNMFk to the total communication time measured with pyDNMFGPU. The former used an MPI-based communicator and the latter used an NCCL-based communicator. The speedup in communication time is reported as a function of the number of units for various k in Fig. 5b. We note a speedup of \(\sim 80\)X to 100X when \(N>2\), the number of units above which inter-node communications start. This clearly shows a significant performance gain in communication when using NCCL in pyDNMFGPU over MPI in pyDNMFk.
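The problem size and the speedup metric used in this comparison reduce to simple arithmetic, sketched below. The helper names are hypothetical, the 4-byte value size assumes single precision, and the two timings are made-up values chosen only to illustrate the ratio, not measurements.

```python
def dense_matrix_bytes(rows, cols, bytes_per_value=4):
    """Storage of a dense matrix, single precision by default."""
    return rows * cols * bytes_per_value

def speedup(t_cpu, t_gpu):
    """Ratio of CPU time to GPU time."""
    return t_cpu / t_gpu

GIB = 2 ** 30
per_unit_gib = dense_matrix_bytes(65536, 32768) / GIB          # one unit, N = 1
eight_units_gib = dense_matrix_bytes(8 * 65536, 32768) / GIB   # N = 8

# Hypothetical timings in seconds, for illustration only.
example_speedup = speedup(t_cpu=152.0, t_gpu=2.0)
```

With these definitions, one unit holds \(2^{33}\) bytes, that is 8 GB in binary units, and the total size grows linearly with N.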
4.3 Strong and Weak scalability of pyDNMFGPU
The scalability of the proposed NMFk algorithm is assessed using both strong and weak scaling analysis. This scaling study measures NMF execution time for a given problem size as a function of the number of compute units. Compute nodes (with 4 GPUs each) are chosen as compute units in the strong scaling analysis, while individual GPUs are chosen as compute units in the weak scaling analysis. The problem size \(S_A\) is chosen to use most of the available 16 GB of VRAM per GPU. To this end, \(S_A\) is fixed at \(S_A \approx 4\times 8\) GB \(= 32\) GB in the strong scaling analysis and chosen to scale as \(S_A \approx 8\) GB \(\times N\) in the weak scaling analysis. This is accomplished by generating a random synthetic array A of shape \([4 \times 65536, 32768]\) for strong scaling and \([N \times 65536, 32768]\) for weak scaling. Cases of sparse A with density \(10^{-5}\) were also studied; for those cases, A was generated as a random synthetic array of shape \([4 \times 2097152, 65536]\) in the strong scaling analysis and of shape \([N \times 2097152, 65536]\) in the weak scaling analysis.
4.3.1 Strong scalability
Strong scaling results for cases where \(k=8,16,32,64,128,256\) are shown in Fig. 6a. NMF time is found to increase with k and to decrease with the increasing number of compute nodes. Good strong scaling is indicated by an NMF time that decreases in inverse proportion to the compute grid size, and such behavior is only observed in select parts of the obtained results. Strong scaling is maintained up to a count of 8 nodes when \(k=8\), then up to 4 nodes when \(k=16\), and is lost when \(k>16\). Identical scaling is observed for cases where A is sparse, as shown in Fig. 6b.
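One common way to quantify strong scaling is the parallel efficiency relative to a reference run, sketched below. The function name is hypothetical and the timings are invented for illustration; they are not values read from Fig. 6.

```python
def strong_scaling_efficiency(t_ref, n_ref, t_n, n):
    """Efficiency relative to a reference run; 1.0 means ideal strong scaling."""
    return (t_ref * n_ref) / (t_n * n)

# Hypothetical timings in seconds for a fixed problem size (not measured values).
timings = {1: 80.0, 2: 40.0, 4: 22.0, 8: 16.0}
efficiency = {
    n: strong_scaling_efficiency(timings[1], 1, t, n) for n, t in timings.items()
}
```

In this made-up example the efficiency stays at 1.0 up to 2 nodes and then drops, which is the kind of departure from ideal scaling described above for larger k.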
The worst-case scenario, \(k=256\), can be diagnosed from the breakdown of \(H_{\rm update}\), \(W_{\rm update}\) and combined all-reduce-sum (AR) execution times, as detailed in Fig. 6c. \(H_{\rm update}\) is shown to maintain good scaling at all compute grid sizes, while \(W_{\rm update}\) scales poorly at every tested compute grid size. The poor scaling of \(W_{\rm update}\) is strongly influenced by AR communication time, which already makes up more than \(80\%\) of \(W_{\rm update}\) at a node count of 2 and increases nonlinearly with node count. At full grid size, AR time makes up more than \(98\%\) of \(W_{\rm update}\), which in turn dominates the overall NMF time. The same explanation applies to cases where A is sparse, as can be seen in Fig. 6d.
4.3.2 Weak scalability
Weak scaling results for cases with \(k=8,16,32,64,128,256\) are shown in Fig. 7a. Good weak scaling is indicated by a constant NMF time with the increasing number of compute units, and this is observed only when \(N>8\). The lack of scaling when \(N <8\) can be explained using the breakdown of \(H_{\rm update}\), \(W_{\rm update}\) and combined AR execution times for the case where \(k=256\), shown in Fig. 7c. While \(W_{\rm update}\) maintains a perfect weak scaling at all N, \(H_{\rm update}\) is influenced by AR communication time, which increases with GPU count. Communication grows with noticeable transitions that indicate the use of slower channels. The first transition is from \(N=1\) to \(N=2\), indicating the beginning of intra-node communication between GPUs on the same node. While growing with N, intra-node communication remains a small portion of \(W_{\rm update}\) (\(\sim 10\%\)). The next major transition occurs between \(N=4\) and \(N=8\), indicating the beginning of inter-node communication, which quickly saturates to \(\sim 40\%\) of \(W_{\rm update}\) by \(N=32\). Identical weak scaling is observed for cases where A is sparse, as shown by the plots in Fig. 7b, and the explanation for the lack of scaling when \(N<8\) is consistent with the explanation given above for the dense case, as can be seen in Fig. 7d.
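The two quantities discussed here, weak scaling efficiency and the share of an update spent in communication, can be written down as a short sketch. The helper names and all timings are hypothetical and are not values read from Fig. 7.

```python
def weak_scaling_efficiency(t_1, t_n):
    """1.0 means the run time stayed constant as problem and units grew together."""
    return t_1 / t_n

def communication_share(t_comm, t_total):
    """Fraction of an update's time spent in all-reduce communication."""
    return t_comm / t_total

# Hypothetical NMF timings in seconds, keyed by the number of GPUs N.
nmf_time = {1: 10.0, 4: 11.0, 8: 16.0, 32: 16.5}
weak_eff = {n: weak_scaling_efficiency(nmf_time[1], t) for n, t in nmf_time.items()}
# Relative change between N = 8 and N = 32; a small value indicates a plateau.
plateau_change = abs(nmf_time[32] - nmf_time[8]) / nmf_time[8]
```

In this made-up example the time jumps when inter-node communication starts and then stays nearly flat, mirroring the transition and saturation pattern described above.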
In Fig. 8, we display the GFLOPS and efficiency results generated from our weak scaling experiments conducted on the Kodiak cluster. Notably, GFLOPS grows linearly as the GPU count rises in Fig. 8a, indicating an efficient distribution of the computational workload across GPUs. At the same time, the consistent efficiency with increasing GPU count shown in Fig. 8b underscores the effective GPU utilization, thereby confirming our implementation's efficacy in maintaining performance at scale, especially for larger ranks (k).
While all scaling results were obtained with RNMF, similar results will be obtained with \(\varvec{A}^T\) using CNMF.
4.4 Scaling benchmark results on Big Data
It’s important to note that as technology continues to evolve, the scale of data storage and processing capabilities will likely increase, leading to even more significant data sets in the future. “The world’s most valuable resource is no longer oil, but data” [40]. In national security and related research efforts, vast amounts of high-dimensional data are continuously being generated by massive computer simulations, large-scale experiments, surveillance systems, etc. [41, 42]. For example, Stanford Synchrotron Radiation Lightsource experiments at SLAC laboratory for revealing the inner structure of materials at nanometer scales [43, 44] and the Large Hadron Collider [45] produce terabytes of data in minutes. Another example is the petabytes of data generated by mission-critical simulations [46,47,48,49,50]. Exploration and analysis of such extra-large data mandates the development of novel machine learning (ML) approaches that are able to extract meaningful basic processes and fundamental features underlying the data [51].
Given our interest in exascale data, the proposed implementation was tested on a dense matrix of shape \([2618523648, 32768]\) with a size of \(\sim 340\) TB, and a sparse matrix of shape \([2.89 \times 10^{12}, 1.05 \times 10^6]\) with density \(10^{-6}\) and a size of \(\sim 11\) EB (\(\sim 34\) TB when compressed in a sparse format). Benchmarks were performed on Summit, with an allocation of 4096 nodes with 6 GPUs of 16 GB VRAM each, totaling a combined 394 TB of VRAM. While that is not enough to efficiently factorize either of the two matrices, we chose to cache A and the cofactors and to batch the computation of the heavy intermediate products (OOM0). This way, we can reduce performance loss by avoiding unnecessary data transfers from host to device and vice versa.
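The quoted sizes can be checked to within rounding with a few lines of arithmetic. The sketch assumes single-precision values, decimal units, and a COO-like sparse layout with one 4-byte value and two 4-byte indices per nonzero; the sparse layout in particular is an assumption made only for this estimate, and the rounded shape of the sparse matrix means the results only match the quoted figures approximately.

```python
TB = 10 ** 12
EB = 10 ** 18

# Dense matrix of shape [2618523648, 32768] in single precision.
dense_bytes = 2618523648 * 32768 * 4

# Sparse matrix of shape [2.89e12, 1.05e6]: size if it were stored densely.
sparse_dense_equiv_bytes = 2.89e12 * 1.05e6 * 4
nnz = 2.89e12 * 1.05e6 * 1e-6            # number of nonzeros at density 1e-6
# Assumed COO-like layout: 4-byte value plus two 4-byte indices per nonzero.
sparse_bytes = nnz * 12

n_gpus = 4096 * 6
total_vram_gb = n_gpus * 16

print(dense_bytes / TB, sparse_dense_equiv_bytes / EB, sparse_bytes / TB)
```

The dense matrix comes out at a few hundred terabytes and the compressed sparse matrix at a few tens of terabytes, both of the same order as the aggregate GPU memory of the allocation.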
The weak scaling benchmark results for the dense array are reported in Fig. 9a. The \(H_{\rm update}\) shows perfect weak scaling, while the \(W_{\rm update}\) does not scale appropriately. Loss of scaling in the \(W_{\rm update}\) is a consequence of the high communication cost associated with the all-reduce of \(\varvec{W}^T\varvec{A}\) and \(\varvec{W}^T\varvec{W}\), which combined make up a substantial portion of the \(W_{\rm update}\). The total NMF time, in turn, is significantly affected by the \(W_{\rm update}\), which takes about one order of magnitude more time to execute than the \(H_{\rm update}\). In contrast, the weak scaling benchmark results for the sparse array, reported in Fig. 9b, indicate that both \(W_{\rm update}\) and \(H_{\rm update}\) have excellent weak scaling. The \({\rm AR}(\varvec{W}^T\varvec{W})\) is similar in both cases, as \(\varvec{W}^T\varvec{W}\) is of shape \(k \times k\), but the \({\rm AR}(\varvec{W}^T\varvec{A})\) is two orders of magnitude higher in the case of the sparse dataset, proportional to n, which is also two orders of magnitude higher. Unlike in the case of the dense array, the communication costs associated with the \({\rm AR}(\varvec{W}^T\varvec{A})\) and \({\rm AR}(\varvec{W}^T\varvec{W})\), although higher, are not significant enough to affect the \(W_{\rm update}\), and consequently do not affect the overall scaling of the NMF.
4.5 Benchmark results on outofmemory problems
Next, we assess the effectiveness of the proposed batching technique for OOM scenarios and the use of the CUDA stream queues to reduce communication in Algorithm 5. To this end, the proposed implementation is tested in an OOM1 scenario, where a matrix of shape [524288, 4096] is factorized for \(k=[32,64,128,256,512,1024]\). The smaller array \(\varvec{H}\) is cached in GPU memory, and the large arrays \(\varvec{A}\) and \(\varvec{W}\) are stored on the host and batched to the GPU as needed. For this experiment, the number of iterations in Algorithm 5 (line 4) is fixed to \(max\_iters=100\), and the number of batches is fixed to \(n_b=32\). Given that the size of A in single precision is \(S_A=8\) GB, the resulting batch size is \(S_B=p\times n \sim 0.25\) GB. The GPU peak memory utilization and the NMF execution time for the 100 iterations, as functions of queue size, are reported in Fig. 10a and Fig. 10b, respectively.
In Fig. 10a, the peak memory utilization measured when \(q_s=1\) is \(S_{NMF} \sim 0.267\) GB, which is close to the estimated memory complexity of \({\mathcal {O}}(p \times n \times q_s) \approx 0.25\) GB from section 3.2, and which is a very large saving, roughly a 100X reduction, compared to the estimated \(S_{NMF} \sim 3 \times S_A \approx 24\) GB required by a normal implementation. This memory complexity is maintained for all k values and all queue sizes, as indicated by the lines with the same slope \(\sim 0.267\) in Fig. 10a. The increase in peak memory with increasing k for any given queue size is explained by the increase in the size of the arrays cached on the GPU (\(\varvec{H}\)), as well as the increase in the size of the computed intermediate products (see Fig. 4). Similarly, for each k value, we note an increase in peak memory utilization with the increasing number of concurrent batches (queue size), which is simply explained by the aggregated memory utilization from the concurrent streams. While from this figure it seems unproductive to use larger stream queue sizes due to the increase in peak memory utilization, the benefits of such a design choice are explained by the execution benchmark results reported in Fig. 10b.
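The memory figures quoted for this experiment follow from simple arithmetic, sketched below with binary units. The helper `estimated_peak_gib` is hypothetical and encodes only the leading-order estimate \(p \times n \times q_s\); it ignores the cached \(\varvec{H}\) and the intermediate products, which is why the measured 0.267 GB is slightly above it.

```python
GIB = 2 ** 30

S_A = 524288 * 4096 * 4 / GIB    # size of A in single precision, in GiB
n_b = 32                         # number of batches
S_B = S_A / n_b                  # size of one batch of A, in GiB

def estimated_peak_gib(q_s):
    """Leading-order peak memory estimate: q_s concurrent batches of A."""
    return q_s * S_B

baseline = 3 * S_A               # estimate for an implementation without batching
measured_peak = 0.267            # value reported for q_s = 1
saving = baseline / measured_peak
```

With these numbers the ratio between the unbatched estimate and the measured peak is about 90, consistent with the roughly two orders of magnitude quoted in the text.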
From Fig. 10b, we first see that it is, in all cases, a good idea to choose a queue size \(q_s>1\) to speed up the NMFk execution. This is because a larger stream queue makes more streams available to overlap memory copies, all-reduce communications, and compute concurrently. More streams do not always improve performance, however, as can be seen when \(q_s=16\), where the NMFk execution time is not optimal for any k value. This is explained by the fact that CUDA core counts are limited, so some streams block and wait while all cores are busy processing other streams, causing load-balancing delays. Consequently, it is crucial to fine-tune \(q_s\) for a given batch size and k to obtain optimal performance.
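The effect of the queue size \(q_s\) on the number of batches resident at once can be pictured with a bounded first-in, first-out queue. The sketch below is a pure host-side simulation with a hypothetical function name; it makes no CUDA calls, and waiting on a CUDA event is represented simply by removing the oldest batch from the queue.

```python
from collections import deque

def schedule_batches(n_batches, q_s):
    """Replay a bounded-queue policy and record how many batches are in flight."""
    in_flight = deque()
    trace = []
    for b in range(n_batches):
        if len(in_flight) == q_s:
            in_flight.popleft()      # wait for the oldest batch to finish first
        in_flight.append(b)          # enqueue copy and compute for batch b
        trace.append(len(in_flight))
    return trace

trace = schedule_batches(n_batches=32, q_s=4)
```

The trace never exceeds \(q_s\), which is what bounds the peak memory, while any value above 1 leaves room for copies and compute of different batches to overlap.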
4.6 Validation of the model selection capability
To demonstrate the correctness of the proposed algorithm on big synthetic datasets, we first integrate our pyDNMFGPU with the existing model selection algorithm pyDNMFk [7]. Then, we determine the number of latent features of a synthetic terabyte-size matrix (with a predetermined number of features) and show that the estimation is performed correctly. We generate a random matrix of dimensions \(8388608\times 32768\) as a product of two random matrices, \(\varvec{W}\) and \(\varvec{H}\), with a latent feature count of \(k=8\). We construct \(\varvec{W}\) with Gaussian features with different statistical means. The pyDNMFkGPU silhouette analysis corresponding to this decomposition is shown in Fig. 11a, and the correctness of the features is shown with a confusion matrix in Fig. 11b. pyDNMFkGPU estimates \(k=8\), as the minimum silhouette score is high and the relative error is low. For \(k>8\), the minimum silhouette score drops suddenly as the solutions begin to fit the noise (Fig. 11a). Figure 11b shows a Pearson correlation matrix that illustrates a large correlation between the features of the ground-truth \(\varvec{W}\) and the corresponding \(\varvec{W}\) extracted (predicted) by pyDNMFkGPU for \(k=8\). The analysis took approximately 1 h to correctly estimate the latent features on Kodiak. The average reconstruction error for the data is \(\sim 4\%\) with the Frobenius norm objective and MU update optimization. Our experiment demonstrates that pyDNMFkGPU correctly estimates the number of latent features, in addition to its scalability for large datasets as demonstrated in previous sections.
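The kind of comparison summarized by a Pearson correlation matrix between ground-truth and extracted features can be sketched with NumPy. This is not the pyDNMFk model selection procedure: the function name `feature_correlation` is hypothetical, and the "predicted" factor is a synthetic stand-in built by permuting, rescaling and slightly perturbing the true columns, since NMF recovers features only up to permutation and scaling.

```python
import numpy as np

def feature_correlation(W_true, W_pred):
    """Pearson correlation between every true column and every predicted column."""
    k = W_true.shape[1]
    # np.corrcoef treats rows as variables, so pass the features as rows.
    full = np.corrcoef(W_true.T, W_pred.T)
    return full[:k, k:]

rng = np.random.default_rng(2)
W_true = rng.random((200, 8))
perm = rng.permutation(8)
# Stand-in for extracted features: permuted, rescaled, lightly perturbed columns.
W_pred = 2.0 * W_true[:, perm] + 0.01 * rng.random((200, 8))
corr = feature_correlation(W_true, W_pred)
matches = corr.argmax(axis=0)    # true feature best matching each predicted column
```

A correct decomposition shows one correlation close to 1 per predicted feature, so after reordering by `matches` the matrix is close to the identity.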
5 Conclusion
In summary, we demonstrated a novel scalable and portable framework, pyDNMFkGPU, for non-negative matrix factorization based on custom multiplicative updates, with automatic determination of the number of latent features on exascale data. Scalability of the framework was demonstrated via strong and weak scaling benchmarks, and speedup gains on GPU over CPU were found to vary with k and to increase with the size of the HPC system. The efficacy of the proposed tiling technique was demonstrated on the OOM0 problem by factorizing a dense dataset of 340 TB and a sparse dataset of size 11 EB, where the implementation was found to have good weak scaling on up to 25k GPUs. We also demonstrated the efficacy of the proposed batching technique, along with the importance of using CUDA streams, by solving the OOM1 problem, where the memory complexity was shown to be of the order of \({\mathcal {O}}(p \times n \times q_s)\), resulting in a peak memory utilization that is \(\sim 100\)X smaller in some cases. The automatic model selection capability was verified by correctly decomposing large synthetic data with a predetermined number of latent features and factors.
Data availability
The code and the benchmark results used in this paper will be available at https://github.com/lanl/pyDNMFk.
Change history
28 September 2023
A Correction to this paper has been published: https://doi.org/10.1007/s11227-023-05675-5
Notes
pyDNMFk: https://github.com/lanl/pyDNMFk.
References
Lee DD, Seung HS (1999) Learning the parts of objects by nonnegative matrix factorization. Nature 401(6755):788–791
Cichocki A, Zdunek R, Phan AH, Amari Si (2009) Nonnegative matrix and tensor factorizations: applications to exploratory multiway data analysis and blind source separation
Everett B (2013) An introduction to latent variable models
Alexandrov LB, Nik-Zainal S, Wedge DC, Campbell PJ, Stratton MR (2013) Deciphering signatures of mutational processes operative in human cancer. Cell Rep 3(1):246–259
Alexandrov BS, Alexandrov LB, Iliev F, Stanev VG, Vesselinov V (2020) Source identification by nonnegative matrix factorization combined with semisupervised clustering. Google Patents. US Patent 10,776,718
Chennupati G, Vangara R, Skau E, Djidjev H, Alexandrov B (2020) Distributed nonnegative matrix factorization with determination of the number of latent features. The Journal of Supercomputing, 1–31
Bhattarai M, Nebgen B, Skau E, Eren M, Chennupati G, Vangara R, Djidjev H, Patchett J, Ahrens J, Alexandrov B (2021) pyDNMFk: python distributed non negative matrix factorization. GitHub. https://doi.org/10.5281/zenodo.4722448
Vangara R, Bhattarai M, Skau E, Chennupati G, Djidjev H, Tierney T et al (2021) Finding the number of latent topics with semantic nonnegative matrix factorization. IEEE Access, pp 117217–117231
Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Børresen-Dale AL et al (2013) Signatures of mutational processes in human cancer. Nature 500(7463):415
Alexandrov LB, Kim J, Haradhvala NJ, Huang MN, Ng AWT, Wu Y, Boot A, Covington KR, Gordenin DA, Bergstrom EN et al (2020) The repertoire of mutational signatures in human cancer. Nature 578(7793):94–101
Vangara R, Skau E, Chennupati G, Djidjev H, Tierney T, Smith JP, Bhattarai M, Stanev VG, Alexandrov BS (2020) Semantic nonnegative matrix factorization with automatic model determination for topic modeling, pp 328–335. IEEE
Bhattarai M, Chennupati G, Skau E, Vangara R, Djidjev H, Alexandrov BS (2020) Distributed nonnegative tensor train decomposition. In: 2020 IEEE High Performance Extreme Computing Conference (HPEC), pp 1–10. IEEE
Alexandrov BS, Stanev VG, Vesselinov VV, Rasmussen KØ (2019) Nonnegative tensor decomposition with custom clustering for microphase separation of block copolymers. Stat Anal Data Min ASA Data Sci J 12(4):302–310
Pulido J, Patchett J, Bhattarai M, Alexandrov B, Ahrens J (2021) Selection of optimal salient time steps by nonnegative tucker tensor decomposition. In: Agus M, Garth C, Kerren A (eds) EuroVis 2021—short papers. The Eurographics Association. https://doi.org/10.2312/evs.20211055
Bhattarai M, Kharat N, Skau E, Nebgen B, Djidjev H, Rajopadhye S, Smith JP, Alexandrov B (2022) Distributed nonnegative rescal with automatic model selection for exascale data. arXiv preprint arXiv:2202.09512
Bhattarai M, Kharat N, Skau E, Truong D, Eren M, Rajopadhye S, Djidjev H, Alexandrov B pyDRESCALk: python distributed non negative RESCAL decomposition with determination of latent features. https://doi.org/10.5281/zenodo.5758446
Eren ME, Moore JS, Skau E, Moore E, Bhattarai M, Chennupati G, Alexandrov BS (2022) Generalpurpose unsupervised cyber anomaly detection via nonnegative tensor factorization. Research and practice, digital threats
Eren ME, Richards LE, Bhattarai M, Yus R, Nicholas C, Alexandrov BS (2022) Fedsplit: Oneshot federated recommendation system based on nonnegative joint matrix factorization and knowledge distillation. arXiv preprint arXiv:2205.02359
Eren ME, Solovyev N, Bhattarai M, Rasmussen K, Nicholas C, Alexandrov BS (2022) Senmfksplit: Large corpora topic modeling by semantic nonnegative matrix factorization with automatic model selection. arXiv preprint arXiv:2208.09942
Févotte C, Cemgil AT (2009) Nonnegative matrix factorizations as probabilistic inference in composite models. In: 2009 17th European Signal Processing Conference, pp 1913–1917. IEEE
Phan AH, Cichocki A (2008) Multiway nonnegative tensor factorization using fast hierarchical alternating least squares algorithm (HALS). In: Proc. of The 2008 international symposium on nonlinear theory and its applications
Kim J, Park H (2012) Fast nonnegative tensor factorization with an activesetlike method, pp 311–326
Kim J, He Y, Park H (2014) Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J Global Optim 58(2):285–319
Battenberg E, Wessel D (2009) Accelerating nonnegative matrix factorization for audio source separation on multicore and manycore architectures. In: ISMIR, pp 501–506
Fairbanks JP, Kannan R, Park H, Bader DA (2015) Behavioral clusters in dynamic graphs. Parallel Comput 47:38–50
Moon GE, Ellis JA, SukumaranRajam A, Parthasarathy S, Sadayappan P (2020) ALONMF: Accelerated localityoptimized nonnegative matrix factorization. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 1758–1767
Phipps ET, Kolda TG (2019) Software for sparse tensor decomposition on emerging computing architectures. SIAM J Sci Comput 41(3):269–290
MejíaRoa E, TabasMadrid D, Setoain J, García C, Tirado F, PascualMontano A (2015) NMFmGPU: nonnegative matrix factorization on multiGPU systems. BMC Bioinf 16(1):1–12
Lopes N, Ribeiro B (2010) Nonnegative matrix factorization implementation using graphic processing units. In: International Conference on Intelligent Data Engineering and Automated Learning, pp 275–283. Springer
Kannan R, Ballard G, Park H (2016) A highperformance parallel algorithm for nonnegative matrix factorization. ACM SIGPLAN Not 51(8):1–11
Koitka S, Friedrich CM (2016) nmfgpu4R: GPUAccelerated Computation of the NonNegative Matrix Factorization (NMF) Using CUDA Capable Hardware. R J 8(2):382
Tang B, Kang L, Zhang L, Guo F, He H (2021) Collaborative filtering recommendation using nonnegative matrix factorization in GPU-accelerated spark platform. Scientific Programming 2021
Eswar S, Hayashi K, Ballard G, Kannan R, Matheson MA, Park H (2021) PLANC: parallel lowrank approximation with nonnegativity constraints. ACM Trans Math Softw 47(3):1–37
Boureima I, Bhattarai M, Eren ME, Solovyev N, Djidjev H, Alexandrov BS (2022) Distributed outofmemory SVD on CPU/GPU architectures. arXiv preprint arXiv:2208.08410
Okuta R, Unno Y, Nishino D, Hido S, Loomis C (2017) Cupy: A NumPycompatible library for NVIDIA GPU calculations. In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in the ThirtyFirst Annual Conference on Neural Information Processing Systems (NIPS). http://learningsys.org/nips17/assets/papers/paper_16.pdf
Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, del Río JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020) Array programming with NumPy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2
Dalcin L, Fang YLL (2021) mpi4py: Status update after 12 years of development. Comput Sci Eng 23(4):47–54
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ, Brett M, Wilson J, Millman KJ, Mayorov N, Nelson ARJ, Jones E, Kern R, Larson E, Carey CJ, Polat İ, Feng Y, Moore EW, VanderPlas J, Laxalde D, Perktold J, Cimrman R, Henriksen I, Quintero EA, Harris CR, Archibald AM, Ribeiro AH, Pedregosa F, van Mulbregt P (2020) SciPy 1.0 contributors: SciPy 1.0: fundamental algorithms for scientific computing in python. Nat Methods 17:261–272. https://doi.org/10.1038/s41592-019-0686-2
Awan AA, Hamidouche K, Venkatesh A, Panda DK (2016) Efficient large message broadcast using NCCL and CUDAaware MPI for deep learning. In: Proceedings of the 23rd European MPI Users’ Group Meeting, pp 15–22
Quigley E, Holme I, Doyle DM, Ho AK, Ambrose E, Kirkwood K, Doyle G (2021) Data is the new oil: citizen science and informed consent in an era of researchers handling of an economically valuable resource. Life Sci Soc Policy 17(1):1–13
Hickey A (2019) Zettabytes of data hog up space and resources
Akhgar B, Saathoff GB, Arabnia HR, Hill R, Staniforth A, Bayerl PS (2015) Application of Big Data for national security: a practitioner’s guide to emerging technologies. ButterworthHeinemann, Oxford
Sierra RG, Laksmono H, Kern J, Tran R, Hattne J, AlonsoMori R, LassalleKaiser B, Glöckner C, Hellmich J, Schafer DW et al (2012) Nanoflow electrospinning serial femtosecond crystallography. Acta Crystallogr D Biol Crystallogr 68(11):1584–1587
Sandberg RL, Huang Z, Xu R, Rodriguez JA, Miao J (2013) Studies of materials at the nanometer scale using coherent xray diffraction imaging. JOM 65:1208–1220
Butter A, Plehn T, Schumann S, Badger S, Caron S, Cranmer K, Di Bello FA, Dreyer E, Forte S, Ganguly S et al (2023) Machine learning and LHC event generation. SciPost Phys 14(4):079
Gubaev K, Podryabinkin EV, Shapeev AV (2018) Machine learning of molecular properties: locality and active learning. J Chem Phys 148(24):241727
Kruglov I, Sergeev O, Yanilkin A, Oganov AR (2017) Energyfree machine learning force field for aluminum. Sci Rep 7(1):8512
Haghighatlari M, HeidarZadeh F, Hirn M, Hoja J, Isayev O, Kondor R, Li L, Li Y, Martyna G, Meila M et al (2017) IPAM program on machine learning & manyparticle systemsrecent progress and open problems
Messina P, Lee S (2016) The us exascale computing project. In: Proc. ACM/IEEE conf. supercomputing (birds a feather)
Zhang J, Xiao M, Gao L (2019) An active learning reliability method combining kriging constructed with exploration and exploitation of failure region and subset simulation. Reliab Eng Syst Saf 188:90–102
Franke B, Plante JF, Roscher R, EsA Lee, Smyth C, Hatefi A, Chen F, Gil E, Schwing A, Selvitella A et al (2016) Statistical inference, learning and models in big data. Int Stat Rev 84(3):371–389
Acknowledgements
This research used resources of Los Alamos National Laboratory Institutional Computing Program, supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. 89233218CNA000001 and the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory under Director’s Discretionary allocation #CSC456, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Funding
This research was funded by DOE National Nuclear Security Administration (NNSA) Office of Defense Nuclear Nonproliferation R&D and by U.S. Department of Energy National Nuclear Security Administration under Contract No. DE-AC52-06NA25396 and through LANL laboratory directed research and development (LDRD) grant 20190020DR.
Author information
Contributions
IB was responsible for the GPU algorithm. The implementation of the code, performance of benchmarks, and complexity analysis was a collaborative effort between IB and MB. ES was consulted for the complexity analysis of the algorithm, along with the verification of the method’s correctness. The algorithms and their implementation, benchmark and verification results were thoroughly reviewed by ME, PR, SE, and BA. All authors significantly contributed to both the work and the composition of the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this article was revised: the Acknowledgements section was missing from this article. The original article has been corrected.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Boureima, I., Bhattarai, M., Eren, M. et al. Distributed out-of-memory NMF on CPU/GPU architectures. J Supercomput (2023). https://doi.org/10.1007/s11227-023-05587-4
DOI: https://doi.org/10.1007/s11227-023-05587-4