Distributed Out-of-Memory NMF on CPU/GPU Architectures

We propose an efficient distributed out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for heterogeneous high-performance-computing (HPC) systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operation on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory (OOM) problems where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/Output (I/O) latency associated with batch copies between host and device is hidden using CUDA streams to overlap data transfers and compute asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced using optimized NVIDIA Collective Communication Library NCCL based communicators. Benchmark results show significant improvement, from 32X to 76x speedup, with the new implementation using GPUs over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density 10e-6.


Introduction
NMF is a popular unsupervised learning method that extracts sparse and explainable latent features [1], which are often used to reveal explainable lowdimensional hidden structures that represent and classify the elements of the whole dataset [2].NMF is used in big data analysis, which plays a crucial role in many problems, including human health, cyber security, economic stability, emergency response, and scientific discovery.With the increased accessibility to data and technology, datasets continue to grow in size and complexity.At the same time, the operational value of the information hidden in patterns in such datasets continues to grow in significance.Extracting explainable hidden features from large datasets, collected experimentally or computer-generated, is vital because the data presumably carries essential (but often previously unknown) information about the investigated phenomenon's causality, relationships, and mechanisms.Discovering meaningful hidden patterns from data is not a trivial task because the datasets are formed only by directly observable quantities while the underlying processes or features, in general, remain unobserved, latent, or hidden [3].
Analysis of vast amounts of (usually sparse) data via NMF requires novel distributed approaches for reducing computational complexity, speeding up the computation, and dealing with data storage and data movement challenges.Most NMF computations are matrix-matrix multiplications, which GPU accelerators can speed up.The primary performance and scaling limiting factors in NMF implementations on modern heterogeneous HPC systems are high communication costs due to data movement across different system parts (inter-node and intra-node communications).In various cases, these communication delays exceed the time the actual computations take, resulting in poor performance and poor scalability on large distributed systems.
The growth in data volumes outpacing the improvement in hardware specifications is causing significant challenges in extracting useful information from large-scale datasets using algorithms like NMF.This motivates the need for out-of-memory implementations of NMF for distributed HPC systems, which will allow the decomposition of large datasets that does not fit in memory at once.Enabling out-of-memory factorization is very important because it removes the matrix size constraint imposed by the GPU memory, thus enabling the analysis of datasets up to the cumulative size of all RAM on the cluster.This is mainly required to address the challenges presented by the need to factorize the ever-growing datasets.We utilize this unique ability of pyDNMF-GPU to demonstrate the decomposition of record large dense, and sparse datasets.
To illustrate how pyDNMF-GPU can be used as a building block for more comprehensive workflows, we integrate pyDNMF-GPU with our existing model selection algorithm pyDNMFk1 that enables automatic determination of the (usually unknown) number of latent features on a large scale datasets [4][5][6][7][8].
This integration results in our out-of-memory scalable tool, pyDNMFk-GPU, to be capable of estimating the number of latent features in extra-large sparse (tens of EBs) and dense (hundreds of TBs) datasets while operating across CPU-GPU hardware.To the best of our knowledge, our framework is the first to identify hidden features in large-scale dense and sparse datasets.
In experiments on large HPC clusters, we show pyDNMF-GPU 's potential: we measure up to 76x improvement on a single GPU over running on a single 18-core CPU.We also demonstrate weak scaling on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density 10 −6 .
Our main contribution is a novel NMF parallel framework, called pyDNMF-GPU, that minimizes the data movement on GPUs, improving overall running times.Our work's main contribution and novelty is the proposal of a new distributed implementation of NMF with low memory complexity that enables the out-of-memory factorization of very large datasets.Our proposed implementation, pyDNMF-GPU, takes advantage of the following three modern design choices: • pyDNMF-GPU reduces the latency associated with local data transfer between the GPU and host (and vice-versa) by using CUDA streams.• Latency associated with collective communications (intra-node and internode) is reduced by using NCCL primitives. 2 .• We incorporate a batching approach for inter-node communication, which provides a unique ability to perform out-of-memory NMF while using multiple GPUs for the bulk of computations.
The main contributions of the paper include: • Introducing a novel distributed algorithm with out-of-memory support for NMF for sparse and dense matrices operating across CPU-GPU hardware.
• Report, the first NCCL communicator accelerated NMF decomposition tool in distributed GPUs.• Demonstrate the framework's scalability over a record-breaking 340 Terabytes (TB) dense and 11 Exabytes (EB) sparse synthetic datasets.
The remainder of the paper is organized as follows: Section 2 gives a summary of NMF and the existing parallel NMF implementations.In Section 3, we detail the design considerations and choices for a scalable, parallel, and efficient algorithm in different configurations of the data size and available GPU VRAM, as well as the complexity of the new implementation.The efficacy of pyDNMF-GPU with different benchmark results and the validation of benchmark results on a synthetic dataset with a predetermined number of latent features is shown in Section 4. We finally conclude with summaries and suggestions of possible future work directions in Section 5.
2 Background and related work 2.1 Non-negative matrix factorization algorithms NMF [1] approximates the non-negative observational matrix A ∈ R m×n + with a product of two non-negative factor matrices W ∈ R m×k + and H ∈ R k×n + where the columns of W represent the latent features, while the columns of H are the coordinates/weights of the analyzed samples (the columns of A) in the reduced latent space, and k is the latent dimension of the data.The NMF minimization is based on alternating update of each one of these two factor matrices until convergence indicated by the condition ∥A − W H∥ F ≤ η is reached.Here ∥.∥ F is the Frobenius norm, ∥A∥ F = i j a 2 ij , where a ij is the element on row i and column j, and η is the desired tolerance.Each iteration consists of a W -update sub-step followed by a H-update sub-step, given by The Frobenius norm (FRO) based multiplicative update (MU) algorithm is presented in Algorithm 1.In addition to the presented Frobenius normbased MU algorithm (which leads to a Gaussian model of the noise [20]) other similarities (e.g., KL-divergence that corresponds to a Poisson model) can also be used in the NMF minimization.Also, based on the update rules, several variants of NMF algorithms exist such as Hierarchical Alternating Least Squares (HALS) [21], Alternating Non-negative Least Squares with Block Principle Pivoting (ANLS-BPP) [22], and Block coordinate descent algorithm (BCD) [23].These algorithms have different advantages in the context of convergence rate, computational, and memory requirements.MU-based updates , k is the rank of approximation and max iters is the number of iterations.
▷ Ensure η i > η to enter loop 4: while (η i ≥ η or i ≤ imax) do @ stands for matrix multiplication operation 5: 11: end while are computationally and memory-wise cheap at the cost of slower convergence.Whereas HALS, BCD, and ANLS-BPP have faster convergence rates at the cost of higher computational and memory requirements and high communication costs for parallel implementations.In our experiments, we use the FRO-based MU algorithm to demonstrate record scalability on large datasets due to its lower computation and communication cost, which can easily be modified with another update algorithm or similarity metric.

Related work on distributed NMF
Several parallel implementations have been proposed to address the computational need of NMF for large datasets involving multiple and repeated matrix-matrix multiplications of several orders in magnitude.The existing parallel implementations can be grouped under two categories (i) with shared memory and (ii) with distributed memory.Majority of existing parallel works utilize shared-memory multiprocessor [24][25][26][27] and shared memory GPUs [26][27][28][29] via OpenMP and CUDA libraries respectively.A majority of distributed memory implementations rely on MPI primitives for distributed CPU [12,30] and CUDA-aware MPI primitives for distributed GPU [28,30] parallelization.Although shared-memory implementations drastically minimize the communication costs incurred for distributed memory implementation [26], there is a constraint on how much data such frameworks can decompose.Due to this constraint, shared-memory implementation often cannot provide the computational/memory requirements needed for the current large-scale datasets.
Almost all distributed GPU implementations including NMF-mGPU [28] and PLANC [33] rely on significant data communication for the update of the factors.This involves using CUDA-aware MPI primitives for data communication or MPI distributed memory offload through NVBLAS [33] without multi-node GPU communicators.Such implementation leads to high data movement costs due to data on-loading/offloading to/from the device, which significantly raises communication costs compared to the computation cost for large data decomposition.This is previously illustrated with distributed BPP in PLANC [30] and distributed MU and BCD [12] where the communication cost is minimized by communicating only with the two-factor matrices and other partitioned matrices among MPI processes.These works attempt to reduce the bandwidth and data latency using MPI collective communication operations.For distributed CPU implementations, this approach works well as the communication cost is significantly lower compared to the computation cost.However, for GPU implementation, communication cost is higher due to device/host data transfer; therefore, communication cost is a limiting factor for parallel performance when using many GPUs.Table 1 illustrates the comparison against the existing parallel NMF implementations.Further, support for factorization of sparse datasets equally adds value for our new pyDNMF-GPU framework.Since many of the extra-large datasets, such as the text corpora, knowledge graph embeddings (and, in general, most of the relational datasets), cyber network activity datasets, and many others, are highly sparse, having sparse decomposition support dramatically reduces the memory and computational requirements which otherwise would be a major bottleneck for the dense implementation.Despite the support for a sparse dataset for shared-memory in ALO-NMF and genten [26,27] and for distributed memory in PLANC [30], there is no specific solution aiming to address the bottlenecks due to extracted dense factors and their communications for large sparse datasets.Even though the largest sparse datasets may be a few MBs in size, due to their extreme sparsity, decomposing such datasets would be challenging for most existing frameworks as the extracted factors are dense and very large.Even for such a small non-zero valued size, the corresponding dense factors could easily explode and require an expensive communication of dense intermediate terms.However, our batching framework provides a solution by accommodating larger intermediate-dense factors, which have not been addressed previously.

Rationale for an algorithm for the out-of-memory distributed NMF
In pyDNMF-GPU, we use a distributed implementation of NMF that aims at efficiently factorizing matrices of all sizes, even those too big to fit on available memory, in out-of-memory scenarios.To this end, pyDNMF-GPU accelerates matrix operations using GPUs on modern heterogeneous systems, provides support for sparse matrix operations to deal with practical data sets which are often sparse, and can partition large problems into smaller problems solved in a distributed manner.Above all, and to the best of our knowledge, our proposed implementation is the first to provide a solution for practical out-of-memory cases that require the factorization of data too big to be stored on combined available GPU memory.
When performing NMF on GPUs, OOM situations can arise in various scenarios with different degrees of complexity.As discussed in [34], we distinguish three main types of OOM scenarios.Scenarios of type 0 (OOM-0) concern practical problems where the input data A and its co-factors W and H can easily be stored on GPU memory.However, an explosion of memory requirement can occur, either due to the unknown rank k becoming significant, causing W and H to become prohibitively expensive to store on memory, or when computing intermediate results such as X = W H (line 8 of Algorithm 1), when A is a large sparse matrix of very low density, where X resulting from the operation becomes dense and very likely impossible to store on GPU.For instance, if A ∈ R 10 6 ×10 6 is a sparse matrix, with density of δ ≈ 10 −3 , the size of A in dense format, in single precision, is S A ≈ 4T B, however representing A in CSR sparse format can lower the size of A down to S s ∼ 3 × S A × δ ≈ 12 GB (the factor of 3 accounts for storing the data, indices and index pointers for CSR format), consequently S N M F ≈ 2 × S A ≈ 4T B. Assuming very small k, A and all co-factors can be stored on GPU; however, the calculation of the intermediate product X from X = W H would still require a whopping ∼ 8 T B of GPU memory (line 8 of Algorithm 1), making this scenario an OOM-0 problem.
A more complex OOM scenario, type 1 (OOM-1), arises in cases where matrix A and at most one of its co-factors cannot be cached on GPU memory; this is typically the case when dealing with a large A that is dense or sparse with high density.Scenarios of type 2 (OOM-2) are the most complex and consist of practical cases where neither A, nor its co-factors can be stored on GPU memory.Note that more complexity can arise in cases where data cannot fit on host RAM memory, but that still is of type 2 as the OOM classification here is based on the GPU RAM memory utilization.In other words, in OOM-0 scenarios, all the data can be cached on GPU; in OOM-1 scenarios, the data can partially be cached on GPU, and in OOM-2 scenarios, none of the data can be cached on GPU.The treatment of OOM-2 scenarios is out of the scope of this study.
OOM-0 cases can easily be handled using tiling techniques, and OOM-1 cases can be handled with batching techniques.In extreme OOM-1 cases, we will complement batching by tiling to further reduce memory footprint.
Both batching and tiling are block-based computational techniques designed to simplify larger, memory-intensive computations into smaller, manageable, and partially solvable tasks.Each technique, however, functions in a distinct setting and serves a different purpose.Batching is a process that operates on the host, necessitating consistent data transfer between the host and the device.The efficacy of batching techniques is heavily reliant on the speed of the interconnecting buses between the host and device, such as PCIe or NV-Link.Batching techniques become crucial when dealing with OOM-1 problems, as they help in transferring partially computed results.Conversely, tiling happens directly within the device memory, resulting in data transfer between global memory and shared or cache memory.The performance of tiling techniques is primarily governed by the GPU architecture, including features like memory speed and available shared memory.Tiling techniques are especially effective for tackling OOM-0 problems, as they handle computational tasks directly on the device.Notably, batching is typically irrelevant for OOM-0 problems as these computations are already based on the device.Similarly, tiling techniques alone cannot address OOM-1 issues due to the preliminary need to transfer operands to the device.However, an optimized solution for extreme OOM-1 problems can be achieved by strategically combining both batching and tiling techniques, thus enhancing the overall performance.
In the section below, we discuss our implementation and design choices.

pyDNMF-GPU for heterogeneous systems
An efficient implementation of NMF for distributed heterogeneous systems should avoid high costs associated with communication (data transfer) resulting from poor consideration for data locality in the distribution of the computational work.Furthermore, cases, where resources such as available combined GPUs memory are limited will require additional considerations and various trade-offs.For instance, it is sometimes better to replicate data over the distributed compute grid to reduce communication.Other times, it is acceptable to use batching techniques that can increase communication costs to lower the memory footprint.Below we first discuss our distributed data partition strategies that partition large problems into smaller problems solvable on cooperative distributed systems in subsection 3.1, and then in subsection 3.2 we discuss our tiling and batching approaches, respectively used to handle practical scenarios of complexities type0 and type1.

Distributed implementation
Our implementation considers two one-dimensional data partition strategies based on the shape of matrix A (m × n).A column (vertical) partition, CNMF employed when n > m, and a row (horizontal) partition, RNMF, is used otherwise.Assuming a distributed system with N GPUs where each GPU is indexed by its global rank g ID .In the CNMF approach illustrated in Figure 1a, the j th GPU with g ID = j will work on array partitions A[:, j 0 : j 1 ], H[:, j 0 : j 1 ] and W , where j 0 = j × J, j 1 = (j + 1) × q, and J = n/N (partition size).Each GPU gets a full copy of W (W is replicated) and a unique partition of A and H.This translates into a segmentation of arrays A and H on global memory illustrated with solid lines in Figure 1a.These solid lines indicate boundaries in global memory and consequently help conceptualize where communication is required whenever information is exchanged from one bounded region to another.The H-update is embarrassingly parallel since W T W , (W T W )H, and W T A can all be computed locally on each GPU; the W -update on the other hand, will require two separate all-reduce-sum communications to compute AH T and HH T as indicated in Algorithm 2 lines 10 and7.
Following a similar analogy, a RNMF approach results with H replicated on the different GPUs and A and H distributed across the compute grid.This time W -update is embarrassingly parallel since HH T , W (HH T ), and , k is the rank of approximation and max iters is the number of iterations.Require: X distributed across N GPUs where X j ∈ R m×n/N where J = n/N if m < n.Similarly the co-factor W local to each GPU given by W ∈ R m×k which is reproduced across different GPUs.H is distributed across N GPUS such that H j ∈ R k×J .1: Initialize W ,H j = rand(m, k),rand(k, J) J = n/N, j 0 = gID * J, j 1 = (gID + 1) * J 2: for l = 0 to l < max iter do 3: @A ▷ @ stands for matrix multiplication operation 4: W T W = W (l) T @W 5: HHT = H (l+1) @H T (l+1) 7: HHT = All Reduce(HHT ) W HHT = W (l) @HHT 9: AHT = A@HT 10: AHT = All Reduce(AHT ) 11: , k is the rank of approximation and max iters is the number of iterations.Require: X distributed across N GPUs where X i ∈ R m/N ×n where I = m/N if n < m.Similarly the co-factor H is reproduced across different GPUs given by H ∈ R k×n .W is distributed across N GPUS such that W i ∈ R I×k .1: Initialize W i ,H = rand(I, k),rand(k, n) I = m/N, i 0 = gID * I, i 1 = (gID + 1) * I 2: for l = 0 to l < max iter do

9:
W HHT = W (l) @HHT 10: AHT = A@HT 11: end for can all be computed locally on each GPU, but the H-update will require separate all-reduce-sum communication to compute W T W and W T A as presented in Algorithm 3 .
Communication takes place through various channels with different bandwidths and latency.We refer to intra-node communications as any communication on the same node, i.e., yellow, pink, and black lines in Figure 2 and those between different nodes as inter-node communications.i.e red lines in Figure 2. The latter often have the lowest bandwidth and highest latency and could easily cause bottlenecks for distributed algorithms such as NMF.For these practical reasons, in our implementation, we avoid all-reduce collective calls as much as possible.When n > m, CNMF is more efficient than RNMF because it costs less to communicate AH T of shape m×k, and RNMF is more efficient when m > n because it cost less to communicate W T A of shape k ×n.
The FLOP (Floating Point Operations) count for the given Distributed RNMF (Row-wise Nonnegative Matrix Factorization) algorithm can be calculated by going through each of the operations performed in the algorithm.Below is a rough estimation of the FLOP count for each line of interest in the algorithm: Note: The All Reduce operation (Lines 4 and 6) are communication operations and are not considered in the FLOP count as they do not involve any computation.
So, total FLOPs for each iteration of the loop = 2k For max iter iterations, total FLOPs would be max iter times the FLOPs per iteration.Now, to compute GFLOPS, we have GFLOPS = total FLOPs/(total time×1e9).Morever, given device peak GFLOPS (peakG), we can compute efficiency as GFLOPS/peakG*100%.

H H * (W.T@A) / (W.T@W@H + 𝜖)
W W * (A@H.T) / (W@H@H.T + ) The total VRAM required to factorize A of size size(A) = S A (in Bytes) is typically in the order of S N M F ∼ 4 × S A .One fold of S A to store A in memory, another fold to store perturbed A [7], an additional fold to compute intermediate product X = W @H when checking the convergence condition ∥A − W H∥ F ≤ η, and almost one full fold to store the co-factors W , H, and heavy intermediate products such as W T A or AH T .When the total available combined GPU VRAM, S GV , is lower than S N M F , as in practical big data applications, batching techniques are imperative.The batching, in most cases, increase intra-node and inter-node communication overheads.Although this can significantly affect the algorithm's performance, proper use of asynchronous data copy and CUDA streams can reduce performance loss by overlapping compute and data transfers, as discussed in our out-of-memory implementation below.

Out-of-memory implementation and memory complexity analysis
In pyDNMF-GPU, OOM-0 problems are handled using a tiling approach where temporary results like AH T , W T A or W H are evaluated in small chunks, by tiling one of the operands, such that the size of the tile sets the memory required for the calculation.In RNMF for instance, the criterion ∥A − W H∥ F ≤ η, can be evaluated in m/p small chunks obtained by tiling W into smaller tiles of size p × k.This results in computing nt chunks of [∥A − W H∥ F ] p which are accumulated into the total error e such that , which can later be used to check the conversion condition e ≤ η.This allows the reduction of the memory required to check the conversion criterion from O(m × n) to O(p × n).Because all matrices involved in the calculations are stored on GPU memory, performance loss due to tiling can be negligible, especially on modern GPU architecture like NVIDIA Ampere A100, which uses low latency and high bandwidth HBM memory.Using the tiling approach, the memory required to perform NMF on GPU can be reduced from When dealing with OOM-1 cases, light arrays are cached on GPU memory, and heavier arrays are kept on host memory and batched to respective GPUs as needed.Further, an appropriate batching strategy for the chosen memory partition is required to limit unnecessary D2H and H2D copies.In PyDNMFk-GPU, we employ a 1D co-linear batching strategy, illustrated in Figure 1b, where the elements in the batch are arrays of length equal max(m, n).This batching strategy turns out to employ half the D2H and H2D memory copies required by an orthogonal batching strategy, illustrated in Figure 1a  , k is the rank of approximation and max iters is the number of iterations.Require: X distributed across N GPUs where X j ∈ R m×n/N where J = n N if m < n.Locally to each GPU, X j can be split into n b batches where n b = m p following a orthogonal batching strategy where the batch is of shape p × J. Similarly the co-factor W local to each GPU given by W ∈ R m×k is locally divided into n b batches where the batch is of shape p × k.H is distributed across N GPUS such that H j ∈ R k×J .1: Initialize W i ,H j = rand(I, k),rand(k, J) 2: Initialize SQ, a queue of CUDA-streams of size qs.while in context st do ▷ Calculations in loop are in non-default stream st i 0 = b * p, i 1 = (b + 1) * p, j 0 = gID * J, j 1 = (gID + 1) * J

13:
W HHT = W b @HHT + ϵ 14: ) stands for async copy of x from GPU to Host using non-default stream st.
▷ H update 28: end for An implementation of the distributed CNMF with orthogonal batching is given in Algorithm 4. The calculation of the different intermediate products is illustrated in Figure 3, where batch delimitation is represented with dashed lines.The top row shows all intermediate products computed during H-update, and products computed in W -update is shown in the bottom row.Intermediate products W T @A and W @W T can be computed with n B independent batches each containing [W T @A] b and [W T @W ] b sub-products.Each batch is queued to a non-default CUDA stream Stm b along with the transfer of A b [b 0 : b i , J] and W b [b 0 : b i , :], and when calculated, each sub-product is added to a local accumulator (see lines 10-11 of Algorithm 4).Once all batches have been processed, all accumulators are reduced to obtain the full values of W T W and W T A, (see lines 15-16 of Algorithm 4).Note that this reduction is local to each GPU and does not involve communication.Special batch enqueuing and de-queuing policies are implemented with CUDA events, so as to limit (control) the number of concurrent batches on GPU to q s (see lines 6-7,12 of Algorithm 4).This way, the memory requirement for H update is bounded by q s × [p × J], as W T W @H and H * (W T A)/(W T W H + ϵ) have a k × J memory requirement.This is important, especially when dealing with large sparse arrays, which can be cheap to cache on the device but can also have co-factors becoming prohibitively expensive to cache when k becomes large.For instance, in CNMF, when m ∼ 10 million, the size of H will approximate 20GB in single precision when k ∼ 512 .
Intermediate products A@H T and W @HH T of the W -update are computed similarly to W T @A and W @W T , except A@H T will require an intermediate all − reduce − sum of sub-products [A@H T ] b of batches of same stream number from the different GPUs (see line 28 of Algorithm 4).The resulting memory complexity of this implementation is found to be of the order of O(p × n × q s ) when p >> k which is the aggregated memory utilization caused by the q s concurrent uploads of batches of A of size p × n at line 8 or line 24 of Algorithm 4. This is a significant saving compared to the estimated S N M F ∼ 3 × S A when not checking the convergence condition ∥A − W H∥ F ≤ η.When the convergence criterion is checked, the error computation is tiled similarly as it was done for OOM-0 scenarios, resulting in a memory utilisation S N M F ∼ 2 × p × n × q s when p >> k.
Note that the use of batches here will only increase intra − node communication due to mem-copies, as it is not possible to cache A and W on the device, however major shortcomings of using the orthogonal batching can be pointed out through the example of Algorithm 4 discussed above.First, the need to upload batches two times at lines (8-9 and lines 24-25 of Algorithm 4) is very inefficient as the second set of H2D will significantly (almost double) data transfer costs.Second, unnecessary additional latency due to load balancing delays when the streams are scheduled in a different order on the different GPUs can occur at line 28 of Algorithm 4. Above all, the worst result here is that both inefficiencies multiply with the number of iterations ( see line 4 of Algorithm 4).
A better implementation uses a co-linear batching strategy as it is done in the batched implementation of the distributed RNMF given in Algorithm 5.The calculation of the different intermediate products is illustrated in Figure 4.The top row shows all intermediate products computed during W -update, and products computed in H-update is shown in the bottom row.The Wupdate (cartoons 1-4 of Figure 4 is embarrassingly parallel and can be done at a batch level.This means that within each batch, we have the updated partition of W readily available to compute local sub-products W T @A and W T @A in the H-update.This avoids the need for a second data upload, as was the case with implementation using an orthogonal batching strategy.Further, the aggregation of W T @A and W T @A first consists of a local accumulation of the sub-products (lines 16-17 of Algorithm 5) followed by a local reduction (lines 21-22 of Algorithm 5), then a global reduction (lines 23-24 of Algorithm 5) illustrated in cartoons 5-6 of Figure 4.This does not require communication between batches of the same stream number and consequently avoids load balancing issues as discussed above in case using an orthogonal batching strategy.

Hardware infrastructure and software environment
Benchmark tests were performed on three different HPC clusters to illustrate the portability and scalability of pyDNMF-GPU.The first cluster, Kodiak, is a LANL internal HPC cluster with 133 compute nodes with dual Xeon E5-2695 v4 CPUs and four NVIDIA Pascal P100 GPGPUs each.Each NVIDIA Pascal P100 GPGPU has 16GB VRAM and uses PCI-E 16X gen 3 Links.The cluster peaks at 1850TF/s and uses an Infiniband interconnect.Each GPU peaks at 9.3 teraflops for single precision.The second cluster, Chicoma, is also a LANL internal HPC cluster, composed of 118 compute nodes where each node has 2 AMD EPYC 7713 Processors and 4 NVIDIA Ampere A100 GPUs.The AMD EPYC 7713 CPUs have 64 cores peaking at 3.67 GHz and 256 GB RAM.Each of the four NVIDIA A100 GPUs in each node provides a theoretical double-precision arithmetic capability of approximately 19.5 teraflops with 40GB VRAM memory.The nodes are networked with HPE/Cray slingshot 10 interconnect with 100Gbit/s bandwidth.Chicoma runs Shasta 1.4 OS and SLURM Job manager.The third cluster, Summit, peaks at over 200 petaflops in double-precision theoretical performance and comprises 4600 IBM AC922 compute nodes, with two IBM POWER9 CPUs and six NVIDIA Volta V100 GPUs each which peak at 15.7 single precision.The POWER9 CPUs have 22 cores running at 3.07 GHz.The six NVIDIA Tesla V100 GPUs in each node provide a theoretical double-precision arithmetic capability of approximately 40 teraflops with VRAM memory of 16GB/GPU.Dual NVLink 2.0 connections between CPUs and GPUs provide a 25-GB/s transfer rate in each direction on each NVLink, yielding an aggregate bidirectional bandwidth of 100 GB/s.The nodes are networked in a non-blocking fat-tree topology by Infiniband.Summit deploys an RHEL 7.4 OS and IBM Job step manager jsrun to run compute jobs.Jsrun provides a fine control of how node-level resources are allocated on these systems, including CPU cores, GPUs, and hardware threads.
pyDNMF-GPU is written in python and uses other off the shelf python libraries such as CuPy [35], Numpy [36], MPI4PY [37] and Scipy [38].It supports dense and sparse datasets on various hardware architectures and handles communication using a low-latency NCCL-based communicator.NCCL is an open-source library providing inter-GPU communication primitives developed and maintained by NVIDIA.NCCL performs automatic hardware topology detection, which it then uses in graph search algorithms to identify communication paths that offer the highest bandwidth and lowest latencies for communication between GPUs intra-and inter-node (e.g., between GPUs that are on the same compute node, as well as between GPUs that are on separate compute nodes).NCCL is compatible with many multi-GPU parallelization models, and provides the ability to perform MPI-like collective and pointto-point operations such as allgather, reduce, broadcast, allreduce, send, and recv.NCCL was initially proposed to help with the need to transfer large message GPU buffers in deep learning applications efficiently.Many leading deep learning frameworks like Chainer, PyTorch, and TensorFlow have since integrated NCCL to accelerate deep learning training on multi-GPU, and multi-node systems, which has motivated us to use NCCL to handle communication in our work.All implementations discussed in the section above were found to benefit from a reduction in data transfer latency and communication performance (both intra-node and inter-node communications), using our low latency NCCL-based communicators versus MPI.An example of such benefit in communication performance gain is illustrated in the subsection 4.2 below that compares the new NMF implementation proposed in this work that uses an NCCL-based communicator to the prior pyDNMFk that uses a traditional MPI based communicator.A More comprehensive and detailed comparative study between NCCL and MPI can be found in the analysis by Awan [39].

Performance benchmark results of pyDNMF-GPU vs pyDNMFk
The performance gained using GPU over CPU is assessed with speedup computed as the ratio of time measured on CPU with pyDNMFk [7], to time measured on GPU with pyDNMF-GPU.For this study, we used a dense matrix of shape and size S A of memory (in bytes) that respectively scale as [N ×65536, 32768] and N ×8GB, where N is the number of GPU or CPU units.Speedup measured on the Kodiak cluster are reported in Figure 5. Figure 5a shows speedup in NMF time as a function of the number of units for various k .First, we note an increasing speedup with the increasing number of units, and second, we note a decreasing performance with increasing k when k ≥ 32.The low performance observed at k < 32 is explained by low GPU occupancy.The best performance is obtained when k = 32, peaking at 76X.We also report speedup in communication time computed as the ratio of total communication time measured with pyDNMFk to the total communication time measured pyDNMF-GPU.The former used MPI based communicator and the latter used an NCCL-based communicator.Speedup in communication time is reported as a function of number of units for various k in Figure 5b.We note ∼ 80X − 100X speedup when N > 2, the number of units above which inter-node communications start.This clearly shows a significant performance   gain in communication when using NCCL in pyDNMF-GPU over MPI in pyDNMFk.

Strong and Weak scalability of pyDNMF-GPU
The scalability of the proposed NMFk algorithm is assessed using both strong and weak scaling analysis.This scaling study measures NMF execution time for a given problem size as a function of the number of compute units.Compute nodes (with 4 GPUs each) are chosen as compute units in strong scaling analysis, while individual GPUs are chosen as compute units in weak scaling analysis.The problem size S A is chosen to use most of the available 16GB VRAM per GPU.To this end, S A is fixed at S A ≈ 4 × 8GB = 32GB in strong scaling analysis and chosen to scale as S A ≈ 8GB × N in weak scaling analysis.This is accomplished by generating a random synthetic array A of shape [4 × 65536, 32768] and [N × 65536, 32768] respectively in both strong and weak scaling.Cases of sparse A with density 10 −5 were also studied, and for those cases, A was generated as a random synthetic array of shape [4 × 2097152, 65536] in strong scaling analysis, and of shape [N × 2097152, 65536] was chosen in weak scaling analysis.

Strong scalability
Strong scaling results for cases where k = 8, 16, 32, 64, 128, 256 are shown Figure 6a.NMF time is found to increase with k and to decrease with the increasing number of compute nodes.Good strong scaling is indicated by a linear decrease of NMF time with increasing compute grid size, and such behavior Number of GPU  [43,44] and the Large Hadron Collider [45] produce terabytes of data in minutes.Another example is the petabytes of data generated by missioncritical simulations [46][47][48][49][50]. Exploration and analysis of such extra-large data mandates the development of novel machine learning (ML) approaches that are able to extract meaningful basic processes and fundamental features underlying the data [51].
Given our interest in exascale data, the proposed implementation was tested on a dense matrix of shape [2618523648; 32768] with a size of ∼ 340T B, and a sparse matrix of shape [2.89 * 10 12 , 1.05 * 10 6 ] with sparsity 10 −6 and size of ∼ 11EB (∼ 34T B when compressed in a sparse format).Benchmarks were performed on Summit, with an allocation of 4096 nodes with 6 GPUs of 16 GB VRAM each, totaling a combined 394T B VRAM.While that is not enough to efficiently factorize either of the two matrices, we chose to cache A and co-factors and batch the compute of heavy, intermediate products (OOM-0).This way, we can reduce performance loss by avoiding unnecessary data transfers from host to device and vice-versa.
On the one hand, the weak scaling benchmark results for the dense array are reported in Figure 9a.The H update is shown with a perfect weak scaling, while the W update is shown not to scale appropriately.Loss of scaling in the W update is a consequence of the high communication cost associated with the All-reduce of W T A and W T W , which combined, make up a substantial portion of the W update .The total NMF time, in turn, is significantly affected by the W update , which takes about one order of magnitude more time to execute than the H update .On the other, the weak scaling benchmark results for the sparse array, reported in Figure 9b, indicate both W update and H update to have an excellent weak scaling.The AR(W T W ) is similar in both cases, as W T W is of shape k × k, but the AR(W T A) is two orders of magnitude higher in the case of the spare dataset, proportional to n which is also two orders of magnitude higher.Unlike in the case of the dense array, the communication cost associated with the AR(W T A) and AR(W T W ), although higher, are not significant enough to affect the W update , consequently do not affect the overall scaling of the NMF.

Benchmark results on out-of-memory problems
Next, we assess the effectiveness of the proposed batching technique for OOM scenarios and the use of the CUDA stream queues to reduce communication in Algorithm 5. To this end, the proposed implementation is tested in an OOM-1 scenario, where a matrix of shape [524288, 4096] is factorized for k = [32,64,128,256,512,1024].Smaller array H is cached on GPU memory, and large arrays A and W are stored on the host and batched to GPU as needed.For this experiment, the number of iterations in Algorithm 5(line 4) fixed to max iters = 100, and the number of batches is fixed to n b = 32.Given the size of A in single precision is S A = 8GB, the resulting batch size is S B = p × n ∼ 0.25GB.The GPU peak memory utilization and NMF execution time for the 100 iteration, vs queue size, are respectively reported in Figure 10a and Figure 10b.
In Figure 10a, the peak memory utilization measured when q s = 1 is S nmf ∼ 0.267GB which is close to the estimated memory complexity of O(p × n × q s ) ≈ 0.25GB in section 3.2, and which is a very big saving, ∼ 1/100X, compared to the estimated S N M F 3 × S A ≈ 24 GB require by a normal implementation.This memory complexity is maintained for all k values and all queue sizes as indicated by the lines with the same slope ∼ 0.267 in Figure 10a.The increase in peak memory with increasing k for any given queue size is explained by the increase in the size of the arrays cached on GPU (H), as well as the increase in the size of the computed intermediate products (see Figure 4. Similarly, for each k value, we note an increase in peak memory utilization with the increasing number of batches which is simply explained by the aggregated memory utilization from the concurrent streams.While from this figure, it seems unproductive to use larger stream queue sizes due to the increase in peak memory utilization, the benefits of such design choice are explained in the execution benchmark results reported in Figure 10b.
From Figure 10b, we first see that it is, in all cases, a good idea to choose a queue size q s > 1 if one wants to speed up the NMFk execution time.This is explained by using large stream queue sizes makes more streams available to overlap memory copies, all-reduce communications, and compute concurrently.It is, however, not the case that more streams will always make this process better, as we can see it not being the case when q s = 16, where the NMFk execution time is not optimum for any k value.This is explained by the fact that CUDA core counts are limited and that some streams will block and wait when all cores are busy processing other streams, causing load-balancing delays.Consequently, it is crucial to fine-tune q s for a given batch size and k to obtain optimal performance.and relative error is low.For k > 8, the minimum silhouette score drops suddenly as the solutions begin to fit the noise Figure 11a.Figure 11b shows a Pearson correlation matrix that illustrates a large correlation between the features of ground truth W Ground truth and the corresponding pyDNMFk-GPU extracted W P redicted for k = 8.The analysis took approximately 1 hour to correctly estimate the latent features on Kodiak.The average reconstruction error for the data is ∼ 4% with the Frobenius norm objective and MU update optimization.Our experiment demonstrates that pyDNMFk-GPU correctly estimates the number of latent features in addition to its scalability for large datasets as demonstrated in previous sections.

Conclusion
In summary, we demonstrated a novel scalable and portable framework, pyDNMFk-GPU, for non-negative matrix factorization based on custom multiplicative updates, with automatic determination of the number of latent features on Exa-scale data.Scalability of the framework was demonstrated via strong and weak scaling benchmarks, and speedup gains on GPU over CPU were found to vary with k and to increase with the size of the HPC system.The efficacy of the proposed tiling technique was demonstrated through the OOM-0 problem by factorizing a dense dataset of 340TB and a sparse dataset of size 11EB, where the implementation was found to have good week scaling on upto to 25k GPU.We also demonstrated the efficacity of the proposed batching technique along with the importance of using CUDA streams by solving OOM-1 problem, where memory complexity was shown to be of the O(p × n × q s ), resulting in a significant saving of ∼ 100X smaller peak memory utilization in some cases.The automatic model selection capability was verified by correctly decomposing large synthetic data with a predetermined number of latent features and factors.

Fig. 1 :
Fig. 1: Illustration of distributed matrix A and co-factors W and H in CNMF and RNMF distributed partitions respectively in (a) and (b).Solid lines show distributed partition boundaries, and dashed lines show local partition segmentation in batch for Out-of-Memory decomposition.

Fig. 2 :
Fig. 2: Illustration of distributed HPC hardware and different communication channels.

Fig. 3 :
Fig.3: Illustration of the batched multiplicative update of Algorithm 4 for the column partition(CNMF).Green array is duplicated across different MPI ranks.Blue and red arrays are distributed, and only red array is cached on device.For CNMF, p is Out-of-Memory batch width and J is distributed partition width.

Fig. 4 :
Fig. 4: Illustration of the batched multiplicative update Algorithm 5 for the row partition(RNMF) and colinear batching.Green array is duplicated across different MPI ranks.Blue and red arrays are distributed, and only red array is cached on device.For RNMF, p is Out-of-Memory batch width and J is distributed partition width.
for the column partition, where the elements in the batch are vectors of length equal min(m, n).Let p be a batch size control parameter.In RNMF (CNMF ) the number of batches is then given by n B = m/p (n B = n/p).In the extreme case where both m and n are very large, only the light array, W [J, :] is cached on GPU memory, and heavier arrays A[J, b 0 : b i ] (A[b 0 : b 1 , J]) and H[:, b 0 : b 1 ] (W [b 0 : b 1 , :]) batched to their respective GPUs, such that for the b th batch, b 0 = b × p and b 1 = (b + 1) × p. Algorithm 4 W , H = CNMF(X, k)-CNMF with orthogonal batching Require: X ∈ R m×n +

3 : 5 :b in n b do 6 :
Initalize zero array accumulators WT A ∈ Z qs×k×n and WT W ∈ Z qs×k×k 4: for l in [1,max iters] do /* Update H given W */ for SQ − > st ▷ De-queue stream st from SQ 7:

19 : 23 :
= n b i=1 WT A i,:,: ▷ Reduce of WT A local to each GPU 16: W T W = n b i=1 WT W i,:,: ▷ Reduce of WT W local to each GPU HHT ← All Reduce sum(HHT ) 20: for b in 0 to n B do 21: SQ − > st ▷ De-queue stream st from SQ 22: while in context st do ▷ Calculations in loop are in non-default stream st Set i 0 = b * p, i 1 = (b + 1) * p 24:

3 : 4 :
end for Algorithm 5 W , H = RNMF(X, k)-RNMF with co-linear batching Require: X ∈ R m×n + , k is the rank of approximation and max iters is the number of iterations.Require: X distributed across N GPUs where X i ∈ R I×n where I = m N if m > n.Locally to each GPU, X i can be split into n b batches following a co-linear batching strategy where the batch size is bs = I n b × n.Similarly the co-factor W local to each GPU given by W i ∈ R I×k is locally divided into n b batches where bs = I n b × k.H ∈ R k×n is cached and replicated across the GPUs.1: Initialize W i ,H = rand(I, k),rand(k, n) 2: Initialize SQ, a queue of CUDA-streams of size qs.Initalize accumulators WT A ∈ R qs×k×n + and WT W ∈ R qs×k×k + for l in [1,max iters] do /* Update W given H */ 5:

8 :
SQ − > st ▷ De-queue stream st from SQ 9:while in context st do ▷ Calculations in loop are in non-default stream st 10:

16 : 17 :
WT A[st] += W T b @A ▷ Accumulate local W T @A WT W[st] += W T b W b ▷ Accumulate local W T @W b 18: st − > SQ ▷ En-queue stream st back into SQ, exit context st H given W */ 21: W T A = n b i=1 WT A i,:,: ▷ Reduce of WT A local to each GPU 22: W T W = n b i=1 WT W i,:,: ▷ Reduce of WT W local to each GPU 23: W T A = All Reduce sum(W T A) ▷ Global Reduce of W T A across all GPUs 24: W T W = All Reduce sum(W T W ) ▷ Global Reduce of W T W across all GPUs 25: Speedups of the compute times for various latent dimensions on different numbers of compute units.
Speedups of the communication times for various latent dimensions on different numbers of compute units.

Fig. 5 :
Fig. 5: Results of benchmarking experiment showing speedup gain using N GPUs vs N CPUs, for various k.Speedup gained on NMF calculation time is shown in Figure 5a and speedup gained on communication time is shown in Figure 5b.

Fig. 9 :
Fig. 9: Results of weak scaling study for dense and sparse A performed on Summit are shown respectively in (a) and (b).

Table 1 :
A comparison chart for different GPU based NMF implementations.
Results of strong scaling study performed on Kodiak.NMF time vs number of node for various k dense and sparse A respectively shown in (6a) and (6b).For the case k = 8, execution time of H update , W update and Allreduce communication are compared in (6c) and (6d), respectively for dense and sparse A.is only observed in select parts of the obtained results.Strong scaling is maintained up to a count of 8 nodes when k = 8, then to 4 nodes when k = 16, and lost when k > 16.Identical scaling is observed for cases where A is sparse, as shown in Figure6b.The worst case scenarios, when k = 256, can be diagnosed from breakdown of H update , W update and combined all-reduce-sum (AR) execution time, as detailed in Figure6c.H update is shown to maintain good scaling at all compute grid sizes, while W update had poor scaling at each tested compute grid size.W update 's poor scaling is strongly influenced by AR communications time, which already makes up more than 80% of W update at 2 node count, which increases non-linearly with node count.At full grid size, AR time makes up more than 98% of W update , influencing the overall NMF time Results of weak scaling study performed on Kodiak.NMF time vs number of GPU for various k are respectively shown in (7a) and (7b), for dense and sparse A. For the case k = 8, execution time of H update , W update and Allreduce communication are compared in (7c) and (7d), respectively for dense and sparse A. dominated by W update time.The same explanation applies to cases where A is sparse, as one can interpret from Figure6d.For example, Stanford Synchrotron Radiation Lightsource experiments at SLAC laboratory for revealing the inner structure of materials at nanometer scales The lack of scaling when N < 8 can be explained using the breakdown of H update , W update and combined AR execution time for the case where k = 256, shown in Figure7c.While W update maintains a perfect weak scaling at all N , H update is influenced by AR communications time, which increases withGPU  42].