1 Introduction

NMF is a popular unsupervised learning method that extracts sparse and explainable latent features [1], which are often used to reveal explainable low-dimensional hidden structures that represent and classify the elements of the whole dataset [2]. NMF is used in big data analysis, which plays a crucial role in many problems, including human health, cyber security, economic stability, emergency response, and scientific discovery. With the increased accessibility to data and technology, datasets continue to grow in size and complexity. At the same time, the operational value of the information hidden in patterns in such datasets continues to grow in significance. Extracting explainable hidden features from large datasets, collected experimentally or computer-generated, is vital because the data presumably carries essential (but often previously unknown) information about the investigated phenomenon’s causality, relationships, and mechanisms. Discovering meaningful hidden patterns from data is not a trivial task because the datasets are formed only by directly observable quantities while the underlying processes or features, in general, remain unobserved, latent, or hidden [3].

Analysis of vast amounts of (usually sparse) data via NMF requires novel distributed approaches for reducing computational complexity, speeding up the computation, and dealing with data storage and data movement challenges. Most NMF computations are matrix-matrix multiplications, which GPU accelerators can speed up. The primary performance and scaling limiting factors in NMF implementations on modern heterogeneous HPC systems are high communication costs due to data movement across different system parts (inter-node and intra-node communications). In various cases, these communication delays exceed the time the actual computations take, resulting in poor performance and poor scalability on large distributed systems.

The growth in data volumes is outpacing the improvement in hardware specifications, causing significant challenges in extracting useful information from large-scale datasets using algorithms like NMF. This motivates the need for out-of-memory implementations of NMF for distributed HPC systems, which allow the decomposition of large datasets that do not fit in memory at once. Enabling out-of-memory factorization is very important because it removes the matrix size constraint imposed by the GPU memory, thus enabling the analysis of datasets up to the cumulative size of all RAM on the cluster. This is mainly required to address the challenges presented by the need to factorize ever-growing datasets. We utilize this unique ability of pyDNMF-GPU to demonstrate the decomposition of record-large dense and sparse datasets.

To illustrate how pyDNMF-GPU can be used as a building block for more comprehensive workflows, we integrate pyDNMF-GPU with our existing model selection algorithm pyDNMFk, which enables automatic determination of the (usually unknown) number of latent features in large-scale datasets [4,5,6,7,8]. We previously utilized the integrated model selection algorithm to decompose the world's largest collection of human cancer genomes [9], defining cancer mutational signatures [10], and successfully applied it to solve real-world problems in various fields [8, 11,12,13,14,15,16,17,18,19].

This integration results in our out-of-memory scalable tool, pyDNMFk-GPU, which is capable of estimating the number of latent features in extra-large sparse (tens of EBs) and dense (hundreds of TBs) datasets while operating across CPU-GPU hardware. To the best of our knowledge, our framework is the first capable of identifying hidden features in dense and sparse datasets of this scale.

In experiments on large HPC clusters, we show pyDNMF-GPU’s potential: we measure up to 76x improvement on a single GPU over running on a single 18-core CPU. We also demonstrate weak scaling on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density \(10^{-6}\).

Our main contribution is a novel distributed NMF framework, called pyDNMF-GPU, with low memory complexity that minimizes data movement on GPUs, improves overall running times, and enables the out-of-memory factorization of very large datasets. Our proposed implementation, pyDNMF-GPU, takes advantage of the following three modern design choices:

  • pyDNMF-GPU reduces the latency associated with local data transfer between the GPU and host (and vice-versa) by using CUDA streams.

  • Latency associated with collective communications (intra-node and inter-node) is reduced by using NCCL primitives.

  • We incorporate a batching approach for inter-node communication, which provides a unique ability to perform out-of-memory NMF while using multiple GPUs for the bulk of computations.

The main contributions of the paper include:

  • Introducing a novel distributed algorithm with out-of-memory support for NMF for sparse and dense matrices operating across CPU-GPU hardware.

  • Reporting the first NCCL-communicator-accelerated NMF decomposition tool on distributed GPUs.

  • Demonstrating the framework’s scalability on record-breaking 340 Terabyte (TB) dense and 11 Exabyte (EB) sparse synthetic datasets.

The remainder of the paper is organized as follows: Sect. 2 gives a summary of NMF and the existing parallel NMF implementations. In Sect. 3, we detail the design considerations and choices for a scalable, parallel, and efficient algorithm under different configurations of data size and available GPU VRAM, as well as the complexity of the new implementation. Section 4 presents benchmark results demonstrating the efficacy of pyDNMF-GPU, along with a validation of the results on a synthetic dataset with a predetermined number of latent features. We conclude with a summary and suggestions for possible future work directions in Sect. 5.

2 Background and related work

2.1 Non-negative matrix factorization algorithms

NMF [1] approximates the non-negative observational matrix \(\varvec{A}\in \mathbb {R}_{+}^{m \times n}\) with a product of two non-negative factor matrices \(\varvec{W}\in \mathbb {R}_{+}^{m \times k}\) and \(\varvec{H}\in \mathbb {R}_{+}^{k \times n}\) where the columns of \(\varvec{W}\) represent the latent features, while the columns of \(\varvec{H}\) are the coordinates/weights of the analyzed samples (the columns of \(\varvec{A}\)) in the reduced latent space, and k is the latent dimension of the data. The NMF minimization is based on alternating update of each one of these two factor matrices until convergence indicated by the condition \(\left\| \varvec{A} - \varvec{W} \varvec{H} \right\| _F \le \eta \) is reached. Here \(\left\| . \right\| _F\) is the Frobenius norm, \(\left\| \varvec{A} \right\| _F = \sqrt{\sum _i \sum _j a_{ij}^2}\), where \(a_{ij}\) is the element on row i and column j, and \(\eta \) is the desired tolerance. Each iteration consists of a \(\varvec{W}\)-update sub-step followed by a \(\varvec{H}\)-update sub-step, given by

$$\begin{aligned} \varvec{W}&\leftarrow \mathop {\mathrm {arg\,min}}\limits _{\varvec{W} \geqslant 0} \left\| \varvec{A} - \varvec{W}\varvec{H} \right\| _{F}^{2}, \\ \varvec{H}&\leftarrow \mathop {\mathrm {arg\,min}}\limits _{\varvec{H} \geqslant 0} \left\| \varvec{A} - \varvec{W} \varvec{H} \right\| _{F}^{2}. \end{aligned}$$
(1)
Algorithm 1: Frobenius-norm-based multiplicative update (MU) NMF

The Frobenius norm (FRO) based multiplicative update (MU) algorithm is presented in Algorithm 1. In addition to the Frobenius-norm-based MU algorithm (which corresponds to a Gaussian model of the noise [20]), other similarity measures (e.g., KL-divergence, which corresponds to a Poisson model) can also be used in the NMF minimization. Also, based on the update rules, several variants of NMF algorithms exist, such as hierarchical alternating least squares (HALS) [21], alternating non-negative least squares with block principal pivoting (ANLS-BPP) [22], and the block coordinate descent algorithm (BCD) [23]. These algorithms have different advantages in terms of convergence rate and computational and memory requirements. MU-based updates are computationally and memory-wise cheap at the cost of slower convergence, whereas HALS, BCD, and ANLS-BPP have faster convergence rates at the cost of higher computational and memory requirements and high communication costs for parallel implementations. In our experiments, we use the FRO-based MU algorithm to demonstrate record scalability on large datasets due to its lower computation and communication cost; it can easily be replaced by another update algorithm or similarity metric.
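
For concreteness, a minimal single-GPU sketch of the two MU update rules is given below, written with CuPy (one of the libraries pyDNMF-GPU builds on). The function name and defaults are illustrative only, not pyDNMF-GPU's API.

```python
# Minimal single-GPU sketch of Algorithm 1 (FRO-based MU), assuming dense CuPy arrays.
import cupy as cp

def mu_nmf(A, k, max_iter=1000, eta=1e-4, eps=1e-16, seed=0):
    cp.random.seed(seed)
    m, n = A.shape
    W = cp.random.rand(m, k).astype(cp.float32)
    H = cp.random.rand(k, n).astype(cp.float32)
    for _ in range(max_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)        # H-update
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)      # W-update
        if cp.linalg.norm(A - W @ H) <= eta:        # Frobenius-norm convergence check
            break
    return W, H
```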

2.2 Related work on distributed NMF

Several parallel implementations have been proposed to address the computational needs of NMF for large datasets, which involve many repeated large matrix-matrix multiplications. The existing parallel implementations can be grouped into two categories: (i) shared memory and (ii) distributed memory. The majority of existing parallel works utilize shared-memory multiprocessors [24,25,26,27] and shared-memory GPUs [26,27,28,29] via the OpenMP and CUDA libraries, respectively. Most distributed-memory implementations rely on MPI primitives for distributed CPU [12, 30] and CUDA-aware MPI primitives for distributed GPU [28, 30] parallelization. Although shared-memory implementations drastically minimize the communication costs incurred by distributed-memory implementations [26], there is a constraint on how much data such frameworks can decompose. Due to this constraint, shared-memory implementations often cannot provide the computational/memory resources needed for current large-scale datasets.

Table 1 A comparison chart for different GPU-based NMF implementations

Almost all distributed GPU implementations, including NMF-mGPU [28] and PLANC [33], rely on significant data communication for the update of the factors. This involves using CUDA-aware MPI primitives for data communication or MPI distributed memory offload through NVBLAS [33] without multi-node GPU communicators. Such implementations lead to high data movement costs due to data on-loading/offloading to/from the device, which significantly raises communication costs compared to the computation cost for large data decompositions. This was previously illustrated with distributed BPP in PLANC [30] and distributed MU and BCD [12], where the communication cost is minimized by communicating only the two factor matrices and other partitioned matrices among MPI processes. These works attempt to reduce the bandwidth and data latency using MPI collective communication operations. For distributed CPU implementations, this approach works well, as the communication cost is significantly lower than the computation cost. However, for GPU implementations, the communication cost is higher due to device/host data transfer; therefore, communication cost is a limiting factor for parallel performance when using many GPUs.

Table 1 compares the existing parallel NMF implementations. Further, support for the factorization of sparse datasets adds value to our new pyDNMF-GPU framework. Since many of the extra-large datasets, such as text corpora, knowledge graph embeddings (and, in general, most relational datasets), cyber network activity datasets, and many others, are highly sparse, having sparse decomposition support dramatically reduces the memory and computational requirements, which would otherwise be a major bottleneck for a dense implementation. Despite the support for sparse datasets in shared memory in ALO-NMF and genten [26, 27] and in distributed memory in PLANC [30], there is no specific solution aiming to address the bottlenecks due to the extracted dense factors and their communication for large sparse datasets. Even though the largest sparse datasets may be only a few MBs in size due to their extreme sparsity, decomposing such datasets would be challenging for most existing frameworks, as the extracted factors are dense and very large. Even for such a small non-zero valued size, the corresponding dense factors can easily explode and require expensive communication of dense intermediate terms. However, our batching framework provides a solution by accommodating larger intermediate dense factors, which has not been addressed previously.

Fig. 1

Illustration of distributed matrix \(\varvec{A}\) and co-factors \(\varvec{W}\) and \(\varvec{H}\) in the CNMF and RNMF distributed partitions, respectively in (a) and (b). Solid lines show distributed partition boundaries, and dashed lines show local partition segmentation into batches for out-of-memory decomposition

2.3 Rationale for an algorithm for the out-of-memory distributed NMF

In pyDNMF-GPU, we use a distributed implementation of NMF that aims at efficiently factorizing matrices of all sizes, even those too big to fit in the available memory, i.e., in out-of-memory scenarios. To this end, pyDNMF-GPU accelerates matrix operations using GPUs on modern heterogeneous systems, provides support for sparse matrix operations to deal with practical datasets which are often sparse, and can partition large problems into smaller problems solved in a distributed manner. Above all, and to the best of our knowledge, our proposed implementation is the first to provide a solution for practical out-of-memory cases that require the factorization of data too big to be stored in the combined available GPU memory.

When performing NMF on GPUs, OOM situations can arise in various scenarios with different degrees of complexity. As discussed in [34], we distinguish three main types of OOM scenarios. Scenarios of type 0 (OOM-0) concern practical problems where the input data \(\varvec{A}\) and its co-factors \(\varvec{W}\) and \(\varvec{H}\) can easily be stored in GPU memory. However, an explosion of the memory requirement can occur, either when the unknown rank k becomes large, causing \(\varvec{W}\) and \(\varvec{H}\) to become prohibitively expensive to store in memory, or when computing intermediate results such as \(\varvec{X}=\varvec{W}\varvec{H}\) (line 8 of Algorithm 1) when \(\varvec{A}\) is a large sparse matrix of very low density, where the resulting \(\varvec{X}\) becomes dense and very likely impossible to store on the GPU. For instance, if \(\varvec{A}\in \mathbb {R}^{10^6 \times 10^6}\) is a sparse matrix with density \(\delta \approx 10^{-3}\), the size of \(\varvec{A}\) in dense format, in single precision, is \(S_{\varvec{A}}\approx 4~TB\); however, representing \(\varvec{A}\) in CSR sparse format can lower the size of \(\varvec{A}\) down to \(S_s \sim 3\times S_A \times \delta \approx 12~GB\) (the factor of 3 accounts for storing the data, indices, and index pointers of the CSR format), while the dense intermediates still require \(S_{NMF} \approx 2 \times S_{\varvec{A}}\approx 8~TB\). Assuming a very small k, \(\varvec{A}\) and all co-factors can be stored on the GPU; however, the calculation of the intermediate product \(\varvec{X} = \varvec{W}\varvec{H}\) would still require a whopping \(\sim 8~TB\) of GPU memory (line 8 of Algorithm 1), making this scenario an OOM-0 problem.
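
A back-of-the-envelope script reproducing the size estimates above (assuming single precision, i.e., 4 bytes per value); the variable names are illustrative only.

```python
# Back-of-the-envelope check of the OOM-0 example (single precision, 4 bytes per value).
m = n = 10**6
delta = 1e-3                                  # density of A
S_A_dense = 4 * m * n                         # ~4 TB if A were stored densely
S_A_csr = 3 * S_A_dense * delta               # ~12 GB in CSR (data + indices + index pointers)
S_intermediate = 2 * S_A_dense                # dense X = W @ H plus the residual, ~8 TB
print(f"dense A: {S_A_dense/1e12:.0f} TB, CSR A: {S_A_csr/1e9:.0f} GB, "
      f"dense intermediates: {S_intermediate/1e12:.0f} TB")
```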

A more complex OOM scenario, type 1 (OOM-1), arises in cases where matrix \(\varvec{A}\) and at most one of its co-factors cannot be cached in GPU memory; this is typically the case when dealing with a large \(\varvec{A}\) that is dense or sparse with high density. Scenarios of type 2 (OOM-2) are the most complex and consist of practical cases where neither \(\varvec{A}\) nor its co-factors can be stored in GPU memory. Note that more complexity can arise in cases where the data cannot fit in host RAM, but such cases are still of type 2, as the OOM classification here is based on GPU RAM utilization. In other words, in OOM-0 scenarios all the data can be cached on GPU, in OOM-1 scenarios the data can be partially cached on GPU, and in OOM-2 scenarios none of the data can be cached on GPU. The treatment of OOM-2 scenarios is out of the scope of this study.

OOM-0 cases can easily be handled using tiling techniques, and OOM-1 cases can be handled with batching techniques. In extreme OOM-1 cases, we will complement batching by tiling to further reduce memory footprint.

Both batching and tiling are block-based computational techniques designed to simplify larger, memory-intensive computations into smaller, manageable, and partially solvable tasks. Each technique, however, functions in a distinct setting and serves a different purpose. Batching is a process that operates on the host, necessitating consistent data transfer between the host and the device. The efficacy of batching techniques is heavily reliant on the speed of the interconnecting buses between the host and device, such as PCIe or NV-Link. Batching techniques become crucial when dealing with OOM-1 problems, as they help in transferring partially computed results. Conversely, tiling happens directly within the device memory, resulting in data transfer between global memory and shared or cache memory. The performance of tiling techniques is primarily governed by the GPU architecture, including features like memory speed and available shared memory. Tiling techniques are especially effective for tackling OOM-0 problems, as they handle computational tasks directly on the device. Notably, batching is typically irrelevant for OOM-0 problems as these computations are already based on the device. Similarly, tiling techniques alone cannot address OOM-1 issues due to the preliminary need to transfer operands to the device. However, an optimized solution for extreme OOM-1 problems can be achieved by strategically combining both batching and tiling techniques, thus enhancing the overall performance.

In the section below, we discuss our implementation and design choices.

Fig. 2

Illustration of distributed HPC hardware and different communication channels

3 pyDNMF-GPU for heterogeneous systems

An efficient implementation of NMF for distributed heterogeneous systems should avoid the high costs associated with communication (data transfer) resulting from poor consideration of data locality in the distribution of the computational work. Furthermore, cases where resources such as the available combined GPU memory are limited will require additional considerations and various trade-offs. For instance, it is sometimes better to replicate data over the distributed compute grid to reduce communication. Other times, it is acceptable to use batching techniques that increase communication costs to lower the memory footprint. Below, we first discuss, in subsection 3.1, our distributed data partition strategies that split large problems into smaller problems solvable on cooperative distributed systems, and then, in subsection 3.2, we discuss our tiling and batching approaches, used to handle practical scenarios of complexity types OOM-0 and OOM-1, respectively.

3.1 Distributed implementation

Our implementation considers two one-dimensional data partition strategies based on the shape of matrix \(\varvec{A}\) (\(m\times n\)): a column (vertical) partition, CNMF, is employed when \(n > m\), and a row (horizontal) partition, RNMF, is used otherwise.

Algorithm 2: Distributed CNMF multiplicative update
Algorithm 3: Distributed RNMF multiplicative update

Consider a distributed system with N GPUs, where each GPU is indexed by its global rank \(g_{ID}\). In the CNMF approach illustrated in Fig. 1a, the \(j^{th}\) GPU with \(g_{ID}=j\) works on the array partitions \(\varvec{A}[:, j_0:j_1]\), \(\varvec{H}[:, j_0:j_1]\), and \(\varvec{W}\), where \(j_0=j \times J\), \(j_1=(j+1) \times J\), and \(J=n/N\) is the partition size. Each GPU gets a full copy of \(\varvec{W}\) (\(\varvec{W}\) is replicated) and a unique partition of \(\varvec{A}\) and \(\varvec{H}\). This translates into a segmentation of arrays \(\varvec{A}\) and \(\varvec{H}\) in global memory, illustrated with solid lines in Fig. 1a. These solid lines indicate boundaries in global memory and consequently help conceptualize where communication is required whenever information is exchanged from one bounded region to another. The \(\varvec{H}\)-update is embarrassingly parallel since \(\varvec{W}^T\varvec{W}\), \((\varvec{W}^T\varvec{W})\varvec{H}\), and \(\varvec{W}^T\varvec{A}\) can all be computed locally on each GPU; the \(\varvec{W}\)-update, on the other hand, requires two separate all-reduce-sum communications to compute \(\varvec{A}\varvec{H}^T\) and \(\varvec{H}\varvec{H}^T\), as indicated in Algorithm 2, lines 10 and 7.

Following a similar analogy, the RNMF approach results in \(\varvec{H}\) being replicated on the different GPUs and \(\varvec{A}\) and \(\varvec{W}\) being distributed across the compute grid. This time the \(\varvec{W}\)-update is embarrassingly parallel since \(\varvec{H}\varvec{H}^T\), \(\varvec{W}(\varvec{H}\varvec{H}^T)\), and \(\varvec{A}\varvec{H}^T\) can all be computed locally on each GPU, but the \(\varvec{H}\)-update requires separate all-reduce-sum communications to compute \(\varvec{W}^T\varvec{W}\) and \(\varvec{W}^T\varvec{A}\), as presented in Algorithm 3.
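
The following minimal sketch illustrates one RNMF/MU iteration under this row partition, written with NumPy and mpi4py for readability; pyDNMF-GPU itself keeps these arrays on GPUs and, as discussed later, performs the reductions with NCCL rather than MPI. The function name is illustrative only.

```python
# One RNMF/MU iteration under the row partition (illustrative sketch with NumPy + mpi4py).
import numpy as np
from mpi4py import MPI

def rnmf_iteration(A_loc, W_loc, H, comm=MPI.COMM_WORLD, eps=1e-16):
    """A_loc: local (I x n) row block of A; W_loc: local (I x k) rows of W; H: replicated (k x n)."""
    # H-update: needs two all-reduce-sums over the row partitions (Algorithm 3).
    WTA = W_loc.T @ A_loc
    WTW = W_loc.T @ W_loc
    comm.Allreduce(MPI.IN_PLACE, WTA, op=MPI.SUM)
    comm.Allreduce(MPI.IN_PLACE, WTW, op=MPI.SUM)
    H *= WTA / (WTW @ H + eps)
    # W-update: embarrassingly parallel, everything it needs is already local.
    W_loc *= (A_loc @ H.T) / (W_loc @ (H @ H.T) + eps)
    return W_loc, H
```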

Communication takes place through various channels with different bandwidths and latencies. We refer to any communication within the same node as intra-node communication (the yellow, pink, and black lines in Fig. 2) and communication between different nodes as inter-node communication (the red lines in Fig. 2). The latter often has the lowest bandwidth and highest latency and can easily cause bottlenecks for distributed algorithms such as NMF. For these practical reasons, our implementation avoids all-reduce collective calls as much as possible. When \(n>m\), CNMF is more efficient than RNMF because it costs less to communicate \(\varvec{A}\varvec{H}^T\) of shape \(m \times k\), and RNMF is more efficient when \(m>n\) because it costs less to communicate \(\varvec{W}^T\varvec{A}\) of shape \(k \times n\).
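
A tiny illustrative helper for the partition choice just described (not part of pyDNMF-GPU's API): each strategy all-reduces a buffer whose dominant size is set by the replicated factor, so the layout with the smaller message is preferred.

```python
# Illustrative helper mirroring the communication-cost argument above.
def pick_partition(m: int, n: int, k: int) -> str:
    cnmf_msg = m * k   # CNMF all-reduces A @ H^T (m x k) and H @ H^T (k x k)
    rnmf_msg = k * n   # RNMF all-reduces W^T @ A (k x n) and W^T @ W (k x k)
    return "CNMF" if cnmf_msg < rnmf_msg else "RNMF"   # equivalent to choosing CNMF when n > m
```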

The FLOP (floating-point operation) count for the given distributed RNMF (row-wise non-negative matrix factorization) algorithm can be obtained by going through each operation performed in the algorithm. Below is a rough estimate of the FLOP count for each line of interest, where I denotes the number of rows in the local partition:

  • Matrix multiplication (Line 3): \(\varvec{W^TA} = \varvec{W}^{(l)T} @ \varvec{A}\). Here we have a matrix multiplication of size \((k \times I) * (I \times n)\), which results in \(2k*I*n - k*n\) FLOPs.

  • Matrix multiplication (Line 5): \(\varvec{W^TW} = \varvec{W}^{(l)T} @ \varvec{W}^{(l)}\). Here we have a matrix multiplication of size \((k \times I) * (I \times k)\), which results in \(2k*I*k - k*k\) FLOPs.

  • Elementwise multiplication and division (Line 7): \(\varvec{H}^{(l+1)} = (\varvec{H}^{(l)} * \varvec{W^TA}) / (\varvec{W^TW} @ \varvec{H}^{(l)} + \epsilon )\). This consists of \(k*n\) FLOPs for elementwise multiplication and \(k*n\) FLOPs for elementwise division, for a total of \(2*k*n\) FLOPs.

  • Matrix multiplication (Line 8): \(\varvec{HH^T} = \varvec{H}^{(l+1)} @ \varvec{H}^{(l+1)T}\). Here we have a matrix multiplication of size \((k \times n) * (n \times k)\), which results in \(2k*n*k - k*k\) FLOPs.

  • Matrix multiplication (Line 9): \(\varvec{WHH^T} = \varvec{W}^{(l)} @ \varvec{HH^T}\). Here we have a matrix multiplication of size \((I \times k) * (k \times k)\), which results in \(2I*k*k - I*k\) FLOPs.

  • Matrix multiplication (Line 10): \(\varvec{AH^T} = \varvec{A} @ \varvec{H}^{(l+1)T}\). Here we have a matrix multiplication of size \((I \times n) * (n \times k)\), which results in \(2I*n*k - I*k\) FLOPs.

  • Elementwise multiplication and division (Line 11): \(\varvec{W}^{(l+1)} = \varvec{W}^{(l)} * \varvec{AH^T}/(\varvec{WHH^T}+\epsilon )\). This consists of \(I*k\) FLOPs for elementwise multiplication and \(I*k\) FLOPs for elementwise division, for a total of \(2*I*k\) FLOPs.

Note: The All_Reduce operations (Lines 4 and 6) are communication operations and are not included in the FLOP count, as they do not involve any computation.

So, the total FLOPs for each iteration of the loop is \(2k*I*n + 2k*I*k + 2*k*n + 2k*n*k + 2I*k*k + 2I*n*k + 2*I*k - k*n - k*k - k*k - I*k - I*k = 4Ink + 4Ik^2 + 2nk^2 + kn - 2k^2\).

For \(max_{iter}\) iterations, the total FLOPs is \(max_{iter}\) times the FLOPs per iteration. Now, to compute GFLOPS, we have \(\text{GFLOPS} = \text{total\_FLOPs}/(\text{total\_time} \times 10^9)\). Moreover, given the device peak GFLOPS (peakG), we can compute the efficiency as \(\text{GFLOPS}/\text{peakG} \times 100\%\).
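
The per-line counts above can be collected into a small bookkeeping routine; the sketch below mirrors the formulas in this subsection, with I denoting the local row count per GPU. The names are illustrative, not pyDNMF-GPU's API.

```python
# FLOP/GFLOPS bookkeeping for one distributed RNMF iteration, following the counts above.
def rnmf_iter_flops(I: int, n: int, k: int) -> int:
    wta   = 2*k*I*n - k*n        # W^T @ A
    wtw   = 2*k*I*k - k*k        # W^T @ W
    h_upd = 2*k*n                # elementwise multiply + divide (H-update)
    hht   = 2*k*n*k - k*k        # H @ H^T
    whht  = 2*I*k*k - I*k        # W @ (H H^T)
    aht   = 2*I*n*k - I*k        # A @ H^T
    w_upd = 2*I*k                # elementwise multiply + divide (W-update)
    return wta + wtw + h_upd + hht + whht + aht + w_upd

def gflops_and_efficiency(flops_per_iter, max_iter, total_time_s, peak_gflops):
    gflops = flops_per_iter * max_iter / (total_time_s * 1e9)
    return gflops, 100.0 * gflops / peak_gflops
```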

The total VRAM required to factorize \(\varvec{A}\) of size \(size(\varvec{A}) = S_A\) (in Bytes) is typically in the order of \(S_{NMF} \sim 4 \times S_A\). One fold of \(S_A\) to store \(\varvec{A}\) in memory, another fold to store perturbed \(\varvec{A}\) [7], an additional fold to compute intermediate product \(\varvec{X}=\varvec{W}@\varvec{H}\) when checking the convergence condition \(\Arrowvert \varvec{A} - \varvec{W} \varvec{H} \Arrowvert _F \le \eta \), and almost one full fold to store the co-factors \(\varvec{W}\), \(\varvec{H}\), and heavy intermediate products such as \(\varvec{W}^{T}\varvec{A}\) or \(\varvec{A}\varvec{H}^{T}\). When the total available combined GPU VRAM, \(S_{GV}\), is lower than \(S_{NMF}\), as in practical big data applications, batching techniques are imperative. Batching, in most cases, increases intra-node and inter-node communication overheads. Although this can significantly affect the algorithm’s performance, proper use of asynchronous data copy and CUDA streams can reduce performance loss by overlapping compute and data transfers, as discussed in our out-of-memory implementation below.
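
A hedged rule of thumb following the estimate above: with roughly four folds of \(S_A\) needed, batching becomes necessary whenever that working set exceeds the combined GPU VRAM \(S_{GV}\). The helper below is illustrative only.

```python
# Illustrative rule of thumb: batching is needed once ~4 x S_A exceeds the combined GPU VRAM.
def needs_batching(S_A_bytes: float, S_GV_bytes: float, folds: float = 4.0) -> bool:
    return folds * S_A_bytes > S_GV_bytes
```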

Fig. 3

Illustration of the batched multiplicative update of Algorithm 4 for the column partition (CNMF). The green array is duplicated across different MPI ranks. The blue and red arrays are distributed, and only the red array is cached on the device. For CNMF, p is the out-of-memory batch width and J is the distributed partition width

Fig. 4

Illustration of the batched multiplicative update of Algorithm 5 for the row partition (RNMF) and co-linear batching. The green array is duplicated across different MPI ranks. The blue and red arrays are distributed, and only the red array is cached on the device. For RNMF, p is the out-of-memory batch width and J is the distributed partition width

Algorithm 4: Distributed CNMF with orthogonal batching

3.2 Out-of-memory implementation and memory complexity analysis

In pyDNMF-GPU, OOM-0 problems are handled using a tiling approach where temporary results like \(\varvec{A}\varvec{H}^T\), \(\varvec{W}^T\varvec{A}\), or \(\varvec{W}\varvec{H}\) are evaluated in small chunks, by tiling one of the operands, such that the size of the tile sets the memory required for the calculation. In RNMF, for instance, the criterion \(\Arrowvert \varvec{A} - \varvec{W} \varvec{H} \Arrowvert _F \le \eta \) can be evaluated in m/p small chunks obtained by tiling \(\varvec{W}\) into smaller tiles of size \(p \times k\). This results in computing m/p chunks of \([\Arrowvert \varvec{A} - \varvec{W} \varvec{H} \Arrowvert _F^2]_t\), which are accumulated into the total squared error \(e^2=\sum _{t=0}^{m/p-1}{[\Arrowvert \varvec{A} - \varvec{W} \varvec{H} \Arrowvert _F^2]_t}\), which can later be used to check the convergence condition \(e \le \eta \). This reduces the memory required to check the convergence criterion from \({\mathcal {O}}(m \times n)\) to \({\mathcal {O}}(p \times n)\). Because all matrices involved in the calculations are stored in GPU memory, the performance loss due to tiling can be negligible, especially on modern GPU architectures like the NVIDIA Ampere A100, which use low-latency, high-bandwidth HBM memory. Using the tiling approach, the memory required to perform NMF on a GPU can be reduced from \(S_{NMF} \sim 4 S_A\) to approximately \(2 S_A \le S_{NMF} \le 3 S_A\).
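
A minimal sketch of this tiled convergence check for dense CuPy arrays on a single GPU (illustrative, not pyDNMF-GPU's API): only a \(p \times n\) block of \(\varvec{W}\varvec{H}\) is ever materialized at a time.

```python
# Tiled convergence check: accumulate squared Frobenius norms of p-row tiles of (A - W @ H).
import cupy as cp

def tiled_frobenius_error(A, W, H, p):
    m = A.shape[0]
    sq_err = 0.0
    for t in range(0, m, p):                           # m/p tiles of (at most) p rows
        X_t = W[t:t + p, :] @ H                        # (p x n) dense chunk of W @ H
        sq_err += float(cp.linalg.norm(A[t:t + p, :] - X_t) ** 2)
    return sq_err ** 0.5                               # e = ||A - W H||_F, compared against eta
```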

Algorithm 5: Distributed RNMF with co-linear batching

When dealing with OOM-1 cases, light arrays are cached in GPU memory, and heavier arrays are kept in host memory and batched to the respective GPUs as needed. Further, an appropriate batching strategy for the chosen memory partition is required to limit unnecessary D2H and H2D copies. In pyDNMF-GPU, we employ a 1D co-linear batching strategy, illustrated in Fig. 1b, where the elements in the batch are arrays of length equal to max(m, n). This batching strategy turns out to employ half the D2H and H2D memory copies required by an orthogonal batching strategy, illustrated in Fig. 1a for the column partition, where the elements in the batch are vectors of length equal to min(m, n). Let p be a batch size control parameter. In RNMF (CNMF) the number of batches is then given by \(n_B = m/p\) (\(n_B = n/p\)). In the extreme case where both m and n are very large, only the light array, \(\varvec{W}[J,:]\), is cached in GPU memory, and the heavier arrays \(\varvec{A}[J, b_0:b_1]\) (\(\varvec{A}[b_0:b_1, J]\)) and \(\varvec{H}[:, b_0:b_1]\) (\(\varvec{W}[b_0:b_1,:]\)) are batched to their respective GPUs, such that for the \(b^{th}\) batch, \(b_0=b \times p\) and \(b_1=(b+1) \times p\).

An implementation of the distributed CNMF with orthogonal batching is given in Algorithm 4. The calculation of the different intermediate products is illustrated in Fig. 3, where batch delimitation is represented with dashed lines. The top row shows all intermediate products computed during the \(\varvec{H}\)-update, and the products computed in the \(\varvec{W}\)-update are shown in the bottom row. The intermediate products \(\varvec{W}^T @ \varvec{A}\) and \(\varvec{W}^T@\varvec{W}\) can be computed with \(n_B\) independent batches, each containing \([\varvec{W}^T @ \varvec{A}]_b\) and \([\varvec{W}^T @ \varvec{W}]_b\) sub-products. Each batch is queued to a non-default CUDA stream \(Stm_b\) along with the transfer of \(\varvec{A}_b[b_0:b_1,J]\) and \(\varvec{W}_b[b_0:b_1,:]\), and, when calculated, each sub-product is added to a local accumulator (see lines 10–11 of Algorithm 4). Once all batches have been processed, all accumulators are reduced to obtain the full values of \(\varvec{W}^T\varvec{W}\) and \(\varvec{W}^T\varvec{A}\) (see lines 15–16 of Algorithm 4). Note that this reduction is local to each GPU and does not involve communication. Special batch en-queuing and de-queuing policies are implemented with CUDA events so as to limit (control) the number of concurrent batches on the GPU to \(q_s\) (see lines 6–7 and 12 of Algorithm 4). This way, the memory requirement for the \(H_{\rm update}\) is bounded by \(q_s \times [p\times J]\), as \(\varvec{W}^T\varvec{W}@\varvec{H}\) and \(\varvec{H}*(\varvec{W}^T\varvec{A})/(\varvec{W}^T\varvec{W}\varvec{H} + \epsilon )\) have a \(k \times J\) memory requirement. This is important, especially when dealing with large sparse arrays, which can be cheap to cache on the device but whose co-factors can become prohibitively expensive to cache when k becomes large. For instance, in CNMF, when \(m \sim 10\) million, the size of the replicated \(\varvec{W}\) approaches 20GB in single precision when \(k \sim 512\).

The intermediate products \(\varvec{A}@\varvec{H}^T\) and \(\varvec{W}@\varvec{H}\varvec{H}^T\) of the \(\varvec{W}\)-update are computed similarly to \(\varvec{W}^T @ \varvec{A}\) and \(\varvec{W}^T@\varvec{W}\), except that \(\varvec{A}@\varvec{H}^T\) requires an intermediate all-reduce-sum of the sub-products \([\varvec{A} @ \varvec{H}^T]_b\) of batches with the same stream number from the different GPUs (see line 28 of Algorithm 4). The resulting memory complexity of this implementation is of the order of \({\mathcal {O}}(p \times n \times q_s)\) when \(p \gg k\), which is the aggregated memory utilization caused by the \(q_s\) concurrent uploads of batches of \(\varvec{A}\) of size \(p \times n\) at line 8 or line 24 of Algorithm 4. This is a significant saving compared to the estimated \(S_{NMF} \sim 3 \times S_A\) when not checking the convergence condition \(\Arrowvert \varvec{A} - \varvec{W} \varvec{H} \Arrowvert _F \le \eta \). When the convergence criterion is checked, the error computation is tiled similarly to the OOM-0 scenarios, resulting in a memory utilization of \(S_{NMF} \sim 2 \times p \times n \times q_s\) when \(p \gg k\).

Note that the use of batches here only increases intra-node communication due to mem-copies, as it is not possible to cache \(\varvec{A}\) and \(\varvec{W}\) on the device. However, major shortcomings of the orthogonal batching can be pointed out through the example of Algorithm 4 discussed above. First, the need to upload batches two times (lines 8–9 and lines 24–25 of Algorithm 4) is very inefficient, as the second set of H2D copies almost doubles the data transfer cost. Second, unnecessary additional latency due to load-balancing delays can occur at line 28 of Algorithm 4 when the streams are scheduled in a different order on the different GPUs. Above all, the worst result here is that both inefficiencies multiply with the number of iterations (see line 4 of Algorithm 4).

A better implementation uses a co-linear batching strategy, as is done in the batched implementation of the distributed RNMF given in Algorithm 5. The calculation of the different intermediate products is illustrated in Fig. 4. The top row shows all intermediate products computed during the \(\varvec{W}\)-update, and the products computed in the \(\varvec{H}\)-update are shown in the bottom row. The \(\varvec{W}\)-update (cartoons 1–4 of Fig. 4) is embarrassingly parallel and can be done at a batch level. This means that within each batch, we have the updated partition of \(\varvec{W}\) readily available to compute the local sub-products \(\varvec{W}^T@\varvec{A}\) and \(\varvec{W}^T@\varvec{W}\) in the \(\varvec{H}\)-update. This avoids the need for a second data upload, as was the case with the implementation using an orthogonal batching strategy. Further, the aggregation of \(\varvec{W}^T@\varvec{A}\) and \(\varvec{W}^T@\varvec{W}\) first consists of a local accumulation of the sub-products (lines 16–17 of Algorithm 5), followed by a local reduction (lines 21–22 of Algorithm 5), then a global reduction (lines 23–24 of Algorithm 5), illustrated in cartoons 5–6 of Fig. 4. This does not require communication between batches of the same stream number and consequently avoids the load-balancing issues discussed above for the orthogonal batching strategy.
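
The sketch below illustrates the co-linear batching idea of Algorithm 5 on a single GPU with CuPy streams. It is a simplified sketch: pinned host buffers, CUDA events, the \(q_s\) concurrency control, and the final NCCL all-reduce across GPUs are omitted, and the names are illustrative rather than pyDNMF-GPU's API.

```python
# Row batches of A and W are streamed to the device on q_s CUDA streams; each batch performs
# its local W-update and contributes to W^T A and W^T W, accumulated per stream then reduced.
import cupy as cp

def batched_rnmf_pass(A_host, W_host, H_dev, p, q_s, eps=1e-16):
    m, k = W_host.shape
    n = A_host.shape[1]
    streams = [cp.cuda.Stream(non_blocking=True) for _ in range(q_s)]
    WTA_acc = [cp.zeros((k, n), dtype=cp.float32) for _ in range(q_s)]   # per-stream accumulators
    WTW_acc = [cp.zeros((k, k), dtype=cp.float32) for _ in range(q_s)]
    HHT = H_dev @ H_dev.T                               # small (k x k), stays on the device
    for b, b0 in enumerate(range(0, m, p)):
        with streams[b % q_s]:                          # round-robin over q_s streams
            A_b = cp.asarray(A_host[b0:b0 + p])         # H2D copy of the batch
            W_b = cp.asarray(W_host[b0:b0 + p])
            W_b *= (A_b @ H_dev.T) / (W_b @ HHT + eps)  # local W-update for this batch
            W_host[b0:b0 + p] = cp.asnumpy(W_b)         # D2H copy of the updated rows
            WTA_acc[b % q_s] += W_b.T @ A_b             # contributions to the H-update
            WTW_acc[b % q_s] += W_b.T @ W_b
    for s in streams:
        s.synchronize()
    return sum(WTA_acc), sum(WTW_acc)                   # local reduction before the all-reduce
```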

4 Benchmark results and discussion

4.1 Hardware infrastructure and software environment

Benchmark tests were performed on three different HPC clusters to illustrate the portability and scalability of pyDNMF-GPU. The first cluster, Kodiak, is a LANL internal HPC cluster with 133 compute nodes, each with dual Xeon E5-2695 v4 CPUs and four NVIDIA Pascal P100 GPGPUs. Each NVIDIA Pascal P100 GPGPU has 16GB VRAM and uses PCI-E 16X gen 3 links. The cluster peaks at 1850 TF/s and uses an InfiniBand interconnect. Each GPU peaks at 9.3 teraflops in single precision. The second cluster, Chicoma, is also a LANL internal HPC cluster, composed of 118 compute nodes where each node has 2 AMD EPYC 7713 processors and 4 NVIDIA Ampere A100 GPUs. The AMD EPYC 7713 CPUs have 64 cores peaking at 3.67 GHz and 256 GB RAM. Each of the four NVIDIA A100 GPUs in each node provides a theoretical double-precision arithmetic capability of approximately 19.5 teraflops with 40GB of VRAM. The nodes are networked with an HPE/Cray Slingshot 10 interconnect with 100Gbit/s bandwidth. Chicoma runs the Shasta 1.4 OS and the SLURM job manager. The third cluster, Summit, peaks at over 200 petaflops in theoretical double-precision performance and comprises 4600 IBM AC922 compute nodes, with two IBM POWER9 CPUs and six NVIDIA Volta V100 GPUs each, which peak at 15.7 teraflops in single precision. The POWER9 CPUs have 22 cores running at 3.07 GHz. The six NVIDIA Tesla V100 GPUs in each node provide a theoretical double-precision arithmetic capability of approximately 40 teraflops with 16GB of VRAM per GPU. Dual NVLink 2.0 connections between CPUs and GPUs provide a 25-GB/s transfer rate in each direction on each NVLink, yielding an aggregate bidirectional bandwidth of 100 GB/s. The nodes are networked in a non-blocking fat-tree topology by InfiniBand. Summit deploys an RHEL 7.4 OS and the IBM job step manager jsrun to run compute jobs. Jsrun provides fine control of how node-level resources are allocated on these systems, including CPU cores, GPUs, and hardware threads.

pyDNMF-GPU is written in Python and uses off-the-shelf Python libraries such as CuPy [35], NumPy [36], mpi4py [37], and SciPy [38]. It supports dense and sparse datasets on various hardware architectures and handles communication using a low-latency NCCL-based communicator. NCCL is an open-source library providing inter-GPU communication primitives developed and maintained by NVIDIA. NCCL performs automatic hardware topology detection, which it then uses in graph search algorithms to identify communication paths that offer the highest bandwidth and lowest latencies for communication between GPUs intra- and inter-node (e.g., between GPUs on the same compute node, as well as between GPUs on separate compute nodes). NCCL is compatible with many multi-GPU parallelization models and provides the ability to perform MPI-like collective and point-to-point operations such as allgather, reduce, broadcast, allreduce, send, and recv. NCCL was initially proposed to efficiently transfer large GPU message buffers in deep learning applications. Many leading deep learning frameworks like Chainer, PyTorch, and TensorFlow have since integrated NCCL to accelerate deep learning training on multi-GPU and multi-node systems, which has motivated us to use NCCL to handle communication in our work. All implementations discussed in the section above were found to benefit from reduced data-transfer latency and improved communication performance (both intra-node and inter-node), using our low-latency NCCL-based communicators versus MPI. An example of such a gain in communication performance is illustrated in subsection 4.2 below, which compares the new NMF implementation proposed in this work, which uses an NCCL-based communicator, to the prior pyDNMFk, which uses a traditional MPI-based communicator. A more comprehensive and detailed comparative study between NCCL and MPI can be found in the analysis by Awan [39].
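
As an illustration of how such a communicator can be set up, the sketch below performs an NCCL all-reduce-sum of a CuPy buffer across MPI ranks, assuming CuPy is built with NCCL support (cupy.cuda.nccl) and mpi4py is available; it is a minimal sketch, not pyDNMF-GPU's actual communicator code.

```python
# Minimal sketch: NCCL all-reduce-sum of a CuPy buffer across MPI ranks.
import cupy as cp
from cupy.cuda import nccl
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

uid = nccl.get_unique_id() if rank == 0 else None      # NCCL id created once,
uid = comm.bcast(uid, root=0)                           # shared over MPI (control plane only)
nccl_comm = nccl.NcclCommunicator(size, uid, rank)

x = cp.ones((1024,), dtype=cp.float32)                  # e.g., a local partial W^T W buffer
nccl_comm.allReduce(x.data.ptr, x.data.ptr, x.size,     # in-place all-reduce-sum on GPU buffers
                    nccl.NCCL_FLOAT32, nccl.NCCL_SUM,
                    cp.cuda.Stream.null.ptr)
cp.cuda.Stream.null.synchronize()
```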

4.2 Performance benchmark results of pyDNMF-GPU vs pyDNMFk

The performance gained using GPUs over CPUs is assessed with the speedup computed as the ratio of the time measured on CPUs with pyDNMFk [7] to the time measured on GPUs with pyDNMF-GPU. For this study, we used a dense matrix whose shape and memory size \(S_A\) (in bytes) respectively scale as \([N \times 65536, 32768]\) and \(N \times 8GB\), where N is the number of GPU or CPU units. Speedups measured on the Kodiak cluster are reported in Fig. 5. Figure 5a shows the speedup in NMF time as a function of the number of units for various k. First, we note an increasing speedup with an increasing number of units, and second, we note decreasing performance with increasing k when \(k \ge 32\). The low performance observed at \(k < 32\) is explained by low GPU occupancy. The best performance is obtained when \(k=32\), peaking at \(\sim\)76X. We also report the speedup in communication time, computed as the ratio of the total communication time measured with pyDNMFk to the total communication time measured with pyDNMF-GPU. The former uses an MPI-based communicator and the latter an NCCL-based communicator. The speedup in communication time is reported as a function of the number of units for various k in Fig. 5b. We note a \(\sim 80X-100X\) speedup when \(N>2\), the number of units above which inter-node communications start. This clearly shows a significant performance gain in communication when using NCCL in pyDNMF-GPU over MPI in pyDNMFk.

Fig. 5

Results of the benchmarking experiment showing the speedup gained using N GPUs vs N CPUs for various k. The speedup gained on NMF calculation time is shown in Fig. 5a, and the speedup gained on communication time is shown in Fig. 5b

4.3 Strong and weak scalability of pyDNMF-GPU

The scalability of the proposed algorithm is assessed using both strong and weak scaling analyses. These scaling studies measure NMF execution time for a given problem size as a function of the number of compute units. Compute nodes (with 4 GPUs each) are chosen as compute units in the strong scaling analysis, while individual GPUs are chosen as compute units in the weak scaling analysis. The problem size \(S_A\) is chosen to use most of the available 16GB VRAM per GPU. To this end, \(S_A\) is fixed at \(S_A \approx 4\times 8GB = 32GB\) in the strong scaling analysis and chosen to scale as \(S_A \approx 8GB \times N\) in the weak scaling analysis. This is accomplished by generating a random synthetic array A of shape \([4 \times 65536, 32768]\) in strong scaling and \([N \times 65536, 32768]\) in weak scaling. Cases of sparse A with density \(10^{-5}\) were also studied; for those cases, A was generated as a random synthetic array of shape \([4 \times 2097152, 65536]\) in the strong scaling analysis and of shape \([N \times 2097152, 65536]\) in the weak scaling analysis.

4.3.1 Strong scalability

Strong scaling results for the cases \(k=8,16,32,64,128,256\) are shown in Fig. 6a. NMF time is found to increase with k and to decrease with an increasing number of compute nodes. Good strong scaling is indicated by a linear decrease of NMF time with increasing compute grid size, and such behavior is only observed in select parts of the obtained results. Strong scaling is maintained up to a count of 8 nodes when \(k=8\), up to 4 nodes when \(k=16\), and lost when \(k>16\). Identical scaling is observed for cases where A is sparse, as shown in Fig. 6b.

Fig. 6

Results of the strong scaling study performed on Kodiak. NMF time vs number of nodes for various k, for dense and sparse \(\varvec{A}\), is respectively shown in (6a) and (6b). For the case \(k=8\), the execution times of \(H_{\rm update}\), \(W_{\rm update}\), and all-reduce communication are compared in (6c) and (6d), respectively for dense and sparse \(\varvec{A}\)

The worst-case scenario, \(k=256\), can be diagnosed from the breakdown of \(H_{\rm update}\), \(W_{\rm update}\), and combined all-reduce-sum (AR) execution times, as detailed in Fig. 6c. \(H_{\rm update}\) is shown to maintain good scaling at all compute grid sizes, while \(W_{\rm update}\) scaled poorly at every tested compute grid size. \(W_{\rm update}\)'s poor scaling is strongly influenced by the AR communication time, which already makes up more than \(80\%\) of \(W_{\rm update}\) at a 2-node count and increases non-linearly with node count. At full grid size, AR time makes up more than \(98\%\) of \(W_{\rm update}\), which in turn dominates the overall NMF time. The same explanation applies to the cases where A is sparse, as one can interpret from Fig. 6d.

4.3.2 Weak scalability

Fig. 7

Results of the weak scaling study performed on Kodiak. NMF time vs number of GPUs for various k is respectively shown in (7a) and (7b), for dense and sparse \(\varvec{A}\). For the case \(k=8\), the execution times of \(H_{\rm update}\), \(W_{\rm update}\), and all-reduce communication are compared in (7c) and (7d), respectively for dense and sparse \(\varvec{A}\)

Weak scaling results for the cases \(k=8,16,32,64,128,256\) are shown in Fig. 7a. Good weak scaling is indicated by a constant NMF time with an increasing number of compute units, and this is observed only when \(N>8\). The lack of scaling when \(N <8\) can be explained using the breakdown of \(H_{\rm update}\), \(W_{\rm update}\), and combined AR execution times for the case \(k=256\), shown in Fig. 7c. While \(H_{\rm update}\) maintains perfect weak scaling at all N, \(W_{\rm update}\) is influenced by the AR communication time, which increases with GPU count. Communication grows with noticeable transitions indicating the use of slower channels. The first transition is from \(N=1\) to \(N=2\), indicating the beginning of intra-node communication between GPUs on the same node. While growing with N, intra-node communication remains a small portion of \(W_{\rm update}\) (\(\sim 10\%\)). The next major transition occurs between \(N=4\) and \(N=8\), indicating the beginning of inter-node communication, which quickly saturates to \(\sim 40\%\) of \(W_{\rm update}\) by \(N=32\). Identical weak scaling is observed for cases where A is sparse, as shown in Fig. 7b, and the explanation for the lack of scaling when \(N<8\) is consistent with the explanation given above for the dense case, as one can interpret from Fig. 7d.

In Fig. 8, we display the GFLOPS and efficiency results generated from our weak scaling experiments conducted on the Kodiak cluster. Notably, GFLOPS shows a linear progression as GPU counts rise in Fig. 8a, indicating an efficient distribution of the computational workload across GPUs. Simultaneously, the consistent efficiency with increasing GPU counts shown in Fig. 8b underscores effective GPU utilization, thereby confirming our implementation's efficacy in maintaining performance at scale, especially for larger ranks (k).

While all scaling results were obtained with RNMF, similar results would be obtained with \(\varvec{A}^T\) using CNMF.

Fig. 8

GFLOPS and efficiency graphs for the weak scaling results on the Kodiak cluster are shown respectively in (a) and (b)

4.4 Scaling benchmark results on Big Data

It is important to note that as technology continues to evolve, the scale of data storage and processing capabilities will likely increase, leading to even larger datasets in the future. “The world’s most valuable resource is no longer oil, but data” [40]. In national security and related research efforts, vast amounts of high-dimensional data are continuously being generated by massive computer simulations, large-scale experiments, surveillance systems, etc. [41, 42]. For example, the Stanford Synchrotron Radiation Lightsource experiments at the SLAC laboratory for revealing the inner structure of materials at nanometer scales [43, 44] and the Large Hadron Collider [45] produce terabytes of data in minutes. Another example is the petabytes of data generated by mission-critical simulations [46,47,48,49,50]. Exploration and analysis of such extra-large data mandate the development of novel machine learning (ML) approaches that are able to extract meaningful basic processes and fundamental features underlying the data [51].

Given our interest in exascale data, the proposed implementation was tested on a dense matrix of shape \([2618523648, 32768]\) with a size of \(\sim 340\) TB, and a sparse matrix of shape \([2.89 \times 10^{12}, 1.05 \times 10^6]\) with density \(10^{-6}\) and a size of \(\sim 11\) EB (\(\sim 34\) TB when compressed in a sparse format). Benchmarks were performed on Summit, with an allocation of 4096 nodes with 6 GPUs of 16 GB VRAM each, totaling a combined \(\sim\)394 TB of VRAM. While that is not enough to efficiently factorize either of the two matrices, we chose to cache A and the co-factors and batch the computation of heavy intermediate products (OOM-0). This way, we can reduce performance loss by avoiding unnecessary data transfers from host to device and vice-versa.

On the one hand, the weak scaling benchmark results for the dense array are reported in Fig. 9a. The \(H_{\rm update}\) shows perfect weak scaling, while the \(W_{\rm update}\) does not scale appropriately. The loss of scaling in the \(W_{\rm update}\) is a consequence of the high communication cost associated with the all-reduce of \(\varvec{W}^T\varvec{A}\) and \(\varvec{W}^T\varvec{W}\), which combined make up a substantial portion of the \(W_{\rm update}\). The total NMF time, in turn, is significantly affected by the \(W_{\rm update}\), which takes about one order of magnitude more time to execute than the \(H_{\rm update}\). On the other hand, the weak scaling benchmark results for the sparse array, reported in Fig. 9b, indicate that both \(W_{\rm update}\) and \(H_{\rm update}\) have excellent weak scaling. The \({\rm AR}(\varvec{W}^T\varvec{W})\) is similar in both cases, as \(\varvec{W}^T\varvec{W}\) is of shape \(k \times k\), but the \({\rm AR}(\varvec{W}^T\varvec{A})\) is two orders of magnitude higher in the case of the sparse dataset, proportional to n, which is also two orders of magnitude higher. Unlike in the case of the dense array, the communication costs associated with \({\rm AR}(\varvec{W}^T\varvec{A})\) and \({\rm AR}(\varvec{W}^T\varvec{W})\), although higher, are not significant enough to affect the \(W_{\rm update}\) and consequently do not affect the overall scaling of the NMF.

Fig. 9

Results of the weak scaling study for dense and sparse \(\varvec{A}\) performed on Summit are shown respectively in (a) and (b)

4.5 Benchmark results on out-of-memory problems

Next, we assess the effectiveness of the proposed batching technique for OOM scenarios and the use of CUDA stream queues to reduce communication in Algorithm 5. To this end, the proposed implementation is tested in an OOM-1 scenario, where a matrix of shape [524288, 4096] is factorized for \(k=[32,64,128,256,512,1024]\). The smaller array \(\varvec{H}\) is cached in GPU memory, and the large arrays \(\varvec{A}\) and \(\varvec{W}\) are stored on the host and batched to the GPU as needed. For this experiment, the number of iterations in Algorithm 5 (line 4) is fixed to \(max\_iters=100\), and the number of batches is fixed to \(n_B=32\). Given that the size of A in single precision is \(S_A=8\)GB, the resulting batch size is \(S_B=p\times n \sim 0.25\)GB. The GPU peak memory utilization and the NMF execution time for the 100 iterations, vs queue size, are reported in Fig. 10a and Fig. 10b, respectively.

In Fig. 10a, the peak memory utilization measured when \(q_s=1\) is \(S_{NMF} \sim 0.267\)GB, which is close to the memory complexity of \({\mathcal {O}}(p \times n \times q_s) \approx 0.25\)GB estimated in section 3.2, and which is a very large saving, \(\sim 1/100X\), compared to the estimated \(S_{NMF} \sim 3 \times S_A \approx 24\) GB required by a normal implementation. This memory complexity is maintained for all k values and all queue sizes, as indicated by the lines with the same slope \(\sim 0.267\) in Fig. 10a. The increase in peak memory with increasing k for any given queue size is explained by the increase in the size of the arrays cached on GPU (\(\varvec{H}\)), as well as the increase in the size of the computed intermediate products (see Fig. 4). Similarly, for each k value, we note an increase in peak memory utilization with increasing queue size, which is simply explained by the aggregated memory utilization from the concurrent streams. While from this figure it seems unproductive to use larger stream queue sizes due to the increase in peak memory utilization, the benefits of such a design choice are explained with the execution benchmark results reported in Fig. 10b.

Fig. 10

Results of out-of-memory NMF benchmarks on Chicoma showing (a) NMF peak memory vs queue size for different k, and (b) NMF execution time vs queue size for different k

From Fig. 10b, we first see that it is, in all cases, a good idea to choose a queue size \(q_s>1\) if one wants to speed up the NMF execution time. This is explained by the fact that larger stream queue sizes make more streams available to overlap memory copies, all-reduce communications, and compute concurrently. More streams, however, do not always improve this process, as seen when \(q_s=16\), where the NMF execution time is not optimal for any k value. This is explained by the fact that CUDA core counts are limited and that some streams will block and wait when all cores are busy processing other streams, causing load-balancing delays. Consequently, it is crucial to fine-tune \(q_s\) for a given batch size and k to obtain optimal performance.

4.6 Validation of the model selection capability

Fig. 11

(a) Estimation of the number of hidden features (k = 8) through silhouette analysis [6]. (b) Pearson correlation between columns of the ground-truth W and the reconstructed W

To demonstrate the correctness of the proposed algorithm on big synthetic datasets, we first integrate pyDNMF-GPU with the existing model selection algorithm pyDNMFk [7]. Then, we determine the number of latent features on a synthetic terabyte-size matrix (with a predetermined number of features) and show that the estimation is performed correctly. We generate a random matrix of dimensions \(8388608\times 32768\) as a product of two random matrices, \(\varvec{W}\) and \(\varvec{H}\), with a latent feature count of \(k=8\). We construct \(\varvec{W}\) with Gaussian features with different statistical means. The pyDNMFk-GPU silhouette analysis corresponding to this decomposition is shown in Fig. 11a, and the correctness of the features is shown with the correlation matrix in Fig. 11b. pyDNMFk-GPU estimates \(k=8\), as the minimum silhouette score is high and the relative error is low. For \(k>8\), the minimum silhouette score drops suddenly as the solutions begin to fit the noise (Fig. 11a). Figure 11b shows a Pearson correlation matrix that illustrates a large correlation between the features of the ground-truth \(\varvec{W}\) and the corresponding pyDNMFk-GPU-extracted (predicted) \(\varvec{W}\) for \(k=8\). The analysis took approximately 1 h to correctly estimate the latent features on Kodiak. The average reconstruction error for the data is \(\sim 4\%\) with the Frobenius norm objective and MU update optimization. Our experiment demonstrates that pyDNMFk-GPU correctly estimates the number of latent features, in addition to its scalability for large datasets as demonstrated in the previous sections.
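
For reference, the column-wise Pearson correlation used in this validation can be computed with a few lines of NumPy; the helper below is illustrative (any column matching/permutation between ground-truth and predicted factors is omitted).

```python
# Column-wise Pearson correlation between ground-truth and recovered W.
import numpy as np

def column_pearson(W_true: np.ndarray, W_pred: np.ndarray) -> np.ndarray:
    """Return the k x k matrix of correlations between columns of W_true and W_pred."""
    k = W_true.shape[1]
    C = np.corrcoef(W_true.T, W_pred.T)    # (2k x 2k) correlation of the stacked column-variables
    return C[:k, k:]                       # cross block: truth columns vs predicted columns
```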

5 Conclusion

In summary, we demonstrated a novel scalable and portable framework, pyDNMFk-GPU, for non-negative matrix factorization based on custom multiplicative updates, with automatic determination of the number of latent features on exascale data. The scalability of the framework was demonstrated via strong and weak scaling benchmarks, and the speedup gains on GPUs over CPUs were found to vary with k and to increase with the size of the HPC system. The efficacy of the proposed tiling technique was demonstrated on the OOM-0 problem by factorizing a dense dataset of 340TB and a sparse dataset of size 11EB, where the implementation was found to have good weak scaling on up to \(\sim\)25,000 GPUs. We also demonstrated the efficacy of the proposed batching technique, along with the importance of using CUDA streams, by solving an OOM-1 problem, where the memory complexity was shown to be of \({\mathcal {O}}(p \times n \times q_s)\), resulting in peak memory utilization up to \(\sim\)100X smaller in some cases. The automatic model selection capability was verified by correctly decomposing large synthetic data with a predetermined number of latent features and factors.