1 Introduction

Nonnegative matrix factorization (NMF) decomposes an input nonnegative data matrix, \(X\in \mathbb {R}_+^{m\times n}\), into the product of two nonnegative matrices \(W\in \mathbb R_{+}^{m\times k}\) and \(H\in \mathbb R_{+}^{k\times n}\) such that \(X\approx WH\), where k is the rank of the factorization. The nonnegativity constraints of NMF allow one to reconstruct the data using a purely additive model, which has the advantage of natural interpretability. This makes NMF useful in many applications (Cichocki et al. 2009; Gillis 2020) in various research fields such as image analysis (Ensari 2016; Prajapati and Jadhav 2015; Luce et al. 2016; Shiga et al. 2016), text mining (Du et al. 2019), signal processing (Yoshii et al. 2016), and computational biology (Maruyama et al. 2014).

NMF is in most cases computed by solving the following optimization problem:

$$\begin{aligned} \min _{ W \ge 0, H \ge 0 } \Vert X-WH \Vert _F^2, \end{aligned}$$
(1)

where \(||\cdot ||_F\) denotes the Frobenius norm. Most algorithms for solving (1) use standard non-linear optimization schemes such as block coordinate descent, updating the factors iteratively; see, e.g., Cichocki et al. (2009), Gillis (2020) and the references therein. Hence, they require an initialization of the variables W and H, which can greatly influence the factorization results. The initialization affects both (i) the number of iterations needed for an algorithm to converge (if the initial point is closer to a local minimum, fewer iterations are needed to converge to it), and (ii) the final solution the algorithm converges to. Different strategies have been proposed for NMF initialization (Esposito 2021), such as simple schemes (e.g., random initializations) (Langville et al. 2006), structured schemes (e.g., SVD- or clustering-based (Wild et al. 2004; Rezaei et al. 2011)), evolutionary approaches (e.g., genetic algorithms, particle swarm optimization) (Janecek and Tan 2011), and geometric methods. The latter class of methods tries to locate the vertices of the convex polytope formed by the columns of X; see, e.g., Zdunek (2012); Araújo et al. (2001); Nascimento and Dias (2005); Liu and Tan (2018) and Gillis (2020, Chapter 7). To the best of our knowledge, some of the most popular initialization methods for NMF are based on deterministic low-rank decompositions, such as rank-one decompositions (Liu and Tan 2018) and the SVD (Esposito 2021), and on geometric methods such as vertex component analysis (Nascimento and Dias 2005) and the successive projection algorithm (Araújo et al. 2001; Sauwen et al. 2017). In this paper, we focus on SVD-based initializations. In this class, two of the most widely used methods, introduced in Boutsidis and Gallopoulos (2008) and Qiao (2015), have the drawback that the approximation error, \(||X-WH||_F^2\), increases as the rank increases; see Sect. 2 for more details. More recently, a variant of these approaches, referred to as Nonnegative SVD with Low-Rank Correction (NNSVD-LRC), uses a low-rank correction to address this issue (Atif et al. 2019), while reducing the computational load by using a truncated SVD of smaller rank (roughly half).

In this paper, we improve upon NNSVD-LRC by proposing a new initialization scheme, referred to as accelerated Nonnegative SVD with Progressive Residual Projection (accNNSVD-PRP), which further reduces the computational load of NNSVD-LRC by keeping track of the residual using its low-rank structure.

The paper is organized as follows. Section 2 briefly reviews SVD-based initializations for NMF, with an emphasis on NNSVD-LRC. Section 3 details the proposed method and highlights the differences with existing SVD-based initializations. In Sect. 4, we show that the new initialization, accNNSVD-PRP, compares favorably against standard structured and random initializations on several real dense and sparse data sets. Section 5 concludes the paper and discusses future directions of research.

2 SVD-based NMF initializations: previous works

As mentioned in the introduction, SVD-based initializations are the most popular and effective for NMF. Recall that the rank-p truncated SVD approximates a data matrix \(X\in \mathbb {R}_+^{m\times n}\) with a lower-rank matrix composed of \(p\) rank-one terms as follows:

$$\begin{aligned} X\approx X_{p} = U_{p}\Sigma _{p} V_{p}^{\top }, \end{aligned}$$
(2)

where \(X_{p}\) is the best rank-p approximation of X in the Frobenius norm, \(U_{p} \in \mathbb {R}^{m\times p}\) and \(V_{p} \in \mathbb {R}^{n\times p}\) contain left and right singular vectors, and \(\Sigma _{p} \in \mathbb {R}^{p\times p}\) contains the singular values on its diagonal in nonincreasing order. Let us write the truncated SVD as the product of only two matrices, as follows

$$\begin{aligned} X\approx X_{p} = \sum _{i=1}^{p} y_{i}z_{i} = Y Z, \end{aligned}$$
(3)

where \(Y = U_p \Sigma _p^{1/2} \in \mathbb {R}^{m \times p}\) and \(Z = \Sigma _p^{1/2} V_p^\top \in \mathbb {R}^{p \times n}\), so that \(y_i = \sqrt{\sigma _i} U_p(:,i)\) and \(z_i = \sqrt{\sigma _i} V_p(:,i)^\top \) for \(1 \le i \le p\). (Note that the \(z_i\)’s are row vectors, which simplifies the notation by avoiding the transpose sign.) Due to their negative entries, Y and Z cannot be used directly for NMF initialization.

Let us define, for a generic vector \(b\in \mathbb {R}^{q}\), its nonnegative part, \(b^{(\ge 0)} = \max (0,b)\), and its nonpositive part, \(b^{(\le 0)} = \max (0,-b)\), so that \(b = b^{(\ge 0)} - b^{(\le 0)}\). In this way, (3) can be rewritten as:

$$\begin{aligned} X_{p} = \sum \limits _{i=1}^{p}\Big ( y_{i}^{(\ge 0)}z_{i}^{(\ge 0)}+y_i^{(\le 0)}z_{i}^{(\le 0)}\Big ) \; - \; \sum \limits _{i=1}^{p}\Big (y_{i}^{(\ge 0)}z_{i}^{(\le 0)} + y_{i}^{(\le 0)}z_{i}^{(\ge 0)}\Big ). \end{aligned}$$
(4)

The negative sign of the second term in (4) poses a challenge for NMF initialization, since it can lead to negative initialization values, which are not suitable for NMF. Different methods have been proposed in the literature to mitigate the impact of these negative values.
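As an illustration, the following snippet (a NumPy sketch written for exposition, not code from the cited papers; the helper names `truncated_svd_factors`, `pos` and `neg` are ours) builds Y and Z from a rank-p truncated SVD and verifies the splitting (4) numerically.

```python
import numpy as np

def truncated_svd_factors(X, p):
    """Return Y, Z such that X_p = Y @ Z is the best rank-p approximation, as in (3)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    sqrt_s = np.sqrt(s[:p])
    Y = U[:, :p] * sqrt_s            # Y = U_p Sigma_p^{1/2}
    Z = sqrt_s[:, None] * Vt[:p]     # Z = Sigma_p^{1/2} V_p^T (rows are the z_i's)
    return Y, Z

def pos(a):  # nonnegative part a^{(>=0)} = max(0, a)
    return np.maximum(0, a)

def neg(a):  # nonpositive part a^{(<=0)} = max(0, -a)
    return np.maximum(0, -a)

rng = np.random.default_rng(1)
X = rng.random((30, 20))
p = 5
Y, Z = truncated_svd_factors(X, p)
# Splitting (4): same-sign terms minus mixed-sign terms, one singular triplet at a time
Xp_check = sum(np.outer(pos(Y[:, i]), pos(Z[i])) + np.outer(neg(Y[:, i]), neg(Z[i]))
               - np.outer(pos(Y[:, i]), neg(Z[i])) - np.outer(neg(Y[:, i]), pos(Z[i]))
               for i in range(p))
print(np.allclose(Y @ Z, Xp_check))  # True
```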

The two most popular and widely used approaches are the following:

(1) Nonnegative Double SVD (NNDSVD) (Boutsidis and Gallopoulos 2008) discards the second term (with the minus sign) in (4), and selects \(p\) product terms (among the 2p) from the first term according to the following criterion: for each \(i\), it chooses \(y_{i}^{(\ge 0)}z_{i}^{(\ge 0)}\) if \(||y_{i}^{(\ge 0)}z_{i}^{(\ge 0)}||_F > ||y_i^{(\le 0)}z_{i}^{(\le 0)}||_F\), otherwise it opts for \(y_i^{(\le 0)}z_{i}^{(\le 0)}\). This takes advantage of the sign ambiguity of the SVD (Boutsidis and Gallopoulos 2008; Bro et al. 2008).

(2) SVD-NMF (Qiao 2015) adopts as an initialization the component-wise absolute values of Y and Z, that is, \(W = |Y|\) and \(H = |Z|\).
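For concreteness, a minimal sketch of these two selection rules follows (our simplified reading of the cited methods in Python/NumPy; it omits some refinements of the published algorithms, e.g., the rescaling of the selected parts in NNDSVD, and the function names are hypothetical).

```python
import numpy as np

def nndsvd_like_init(Y, Z, k):
    """NNDSVD-style selection (sketch): for each i, keep the larger (in Frobenius
    norm) of the two same-sign rank-one terms in the first sum of (4)."""
    m, p = Y.shape
    W, H = np.zeros((m, k)), np.zeros((k, Z.shape[1]))
    for i in range(min(k, p)):
        yp, yn = np.maximum(0, Y[:, i]), np.maximum(0, -Y[:, i])
        zp, zn = np.maximum(0, Z[i]), np.maximum(0, -Z[i])
        # ||y z||_F = ||y||_2 ||z||_2 for a rank-one term y z
        if np.linalg.norm(yp) * np.linalg.norm(zp) > np.linalg.norm(yn) * np.linalg.norm(zn):
            W[:, i], H[i] = yp, zp
        else:
            W[:, i], H[i] = yn, zn
    return W, H

def svd_nmf_init(Y, Z, k):
    """SVD-NMF: component-wise absolute values of Y and Z."""
    return np.abs(Y[:, :k]), np.abs(Z[:k])
```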

Unfortunately, these two methods have several drawbacks. First, the reconstruction error of the initialization, \(||X-WH||_F^2\), increases as the rank of the factorization increases, because all negative terms are discarded. Second, a lot of the information embedded in the first term of (4) is lost. To alleviate these drawbacks, another method, called Nonnegative SVD with Low-Rank Correction (NNSVD-LRC), was proposed more recently in Atif et al. (2019): it splits (4) into its different parts, keeps only the nonnegative ones to build the initial factors, and then corrects the resulting error using the rank-p approximation.

2.1 Nonnegative SVD with low-rank correction

NNSVD-LRC is composed of two phases: a nonnegative SVD (NNSVD) initialization, followed by a Low-Rank Correction (LRC) that allows NNSVD-LRC to decrease the error as the factorization rank increases. The pseudo-code of NNSVD-LRC is detailed in Algorithm 1.

Algorithm 1 NNSVD-LRC

The first phase, NNSVD, computes the nonnegative factors W and H from the rank-p truncated SVD of X, using \(p=\lfloor k/2+1 \rfloor \) as in (2), and then constructs Y and Z as in (3). By the Perron-Frobenius theorem, \(|y_1||z_1|\) is an optimal rank-one approximation of X in the Frobenius norm, hence we set \(W(:,1) = |y_1|\) and \(H(1,:) = |z_1|\).

The remaining \(k-1\) columns of W and rows of H are filled with the nonnegative and nonpositive parts of the next \(p-1\) factors of the truncated SVD as follows:

$$\begin{aligned} W(:,j) = {\left\{ \begin{array}{ll} y_{i}^{(\ge 0)} \quad \text {if { j} is even},\\ y_{i}^{(\le 0)} \quad \text {otherwise}, \end{array}\right. } H(j,:) = {\left\{ \begin{array}{ll} z_{i}^{(\ge 0)} \quad \text {if { j} is even},\\ z_{i}^{(\le 0)} \quad \text {otherwise}, \end{array}\right. } \end{aligned}$$
(5)

where \(j = 2, 3, \dots , k\) and \(i=\lfloor \frac{j}{2} + 1 \rfloor \).
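In code, this first phase could be sketched as follows (our own NumPy rendering of (5) with 0-based indexing; `nnsvd_phase` is a hypothetical name, not taken from the reference implementation). It expects Y and Z computed from the rank-p truncated SVD with \(p=\lfloor k/2 \rfloor +1\).

```python
import numpy as np

def nnsvd_phase(Y, Z, k):
    """First phase of NNSVD-LRC (sketch): build nonnegative W, H from Y, Z."""
    m, n = Y.shape[0], Z.shape[1]
    W, H = np.zeros((m, k)), np.zeros((k, n))
    W[:, 0], H[0] = np.abs(Y[:, 0]), np.abs(Z[0])   # Perron-Frobenius: |y_1|, |z_1|
    for j in range(2, k + 1):                       # j = 2, ..., k, 1-based as in (5)
        i = j // 2 + 1                              # i = floor(j/2 + 1)
        y, z = Y[:, i - 1], Z[i - 1]                # 0-based access to the i-th triplet
        if j % 2 == 0:                              # j even: nonnegative parts
            W[:, j - 1], H[j - 1] = np.maximum(0, y), np.maximum(0, z)
        else:                                       # j odd: nonpositive parts
            W[:, j - 1], H[j - 1] = np.maximum(0, -y), np.maximum(0, -z)
    return W, H
```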

The second phase feeds the preliminary factors W and H into LRC to reduce the error induced during the NNSVD phase by the loss of the information contained in the second term of (4). This is done by using a state-of-the-art NMF algorithm, namely the Accelerated Hierarchical Alternating Least Squares (A-HALS) algorithm (Gillis and Glineur 2012), to solve

$$\begin{aligned} \min _{ W \ge 0, H \ge 0 } \Vert X_p-WH \Vert _F^2. \end{aligned}$$
(6)

The reason for using \(X_p\) instead of X is that \(X_p=YZ\) has a low-rank representation, hence the cost of each iteration of the subsequent NMF algorithm is reduced from O(mnk) operations to \(O\left( (m+n)k^2 \right) \) operations. Although NNSVD-LRC performs only a few relatively cheap A-HALS iterations, it is found empirically that the time taken by these iterations is not negligible compared to the truncated SVD computation.
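To see where the \(O\left( (m+n)k^2 \right) \) count comes from, note that the matrix products needed by HALS-type updates can be formed from the factored form \(X_p = YZ\) without ever building the m-by-n matrix \(X_p\); a small sketch (ours, with assumed variable names and shapes) is given below.

```python
import numpy as np

def hals_gram_terms(Y, Z, W, H):
    """Quantities needed by (A-)HALS for min ||X_p - W H||_F^2 with X_p = Y @ Z
    kept in factored form. Shapes: Y (m x p), Z (p x n), W (m x k), H (k x n), p ~ k/2."""
    WtW  = W.T @ W             # O(m k^2)
    WtXp = (W.T @ Y) @ Z       # O(m k p) + O(n k p) = O((m + n) k^2); X_p never formed
    HHt  = H @ H.T             # O(n k^2)
    XpHt = Y @ (Z @ H.T)       # O(n k p) + O(m k p)
    return WtW, WtXp, HHt, XpHt
```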

The non-negligible cost of the LRC phase motivates us to introduce a new SVD-based initialization for NMF, namely accNNSVD-PRP, described in the next section.

3 Accelerated nonnegative SVD with progressive residual projection

In this section, we detail the new proposed SVD-based initialization for NMF, namely Accelerated NNSVD with Progressive Residual Projection (accNNSVD-PRP); see Algorithm 2.

Algorithm 2 Accelerated Nonnegative SVD with Progressive Residual Projection (accNNSVD-PRP)

It differs from NNSVD-LRC in the following ways:

(1) It modifies the first phase, NNSVD, to keep track of the second term of (4), which is discarded by NNSVD-LRC; see Algorithm 3.

(2) It replaces the second phase, LRC, by a cheaper one, which we refer to as Progressive Residual Projection (PRP). In contrast to NNSVD-LRC, it only updates the factor H to decrease the error.

Let us describe these two phases in more detail.

Phase 1: modified NNSVD (mNNSVD) This phase is very similar to NNSVD, the first phase of NNSVD-LRC; see Algorithm 1. The generated factors (W, H) are the same, since mNNSVD uses the same splitting (5), but it additionally keeps track of the discarded term in (4), denoted \(\overline{H}\), such that

$$\begin{aligned} X_p = WH - W \overline{H}, \quad \text { where } \left( W,H,\overline{H}\right) \ge 0, \end{aligned}$$

that is,

$$\begin{aligned} \overline{H}(1,:) = \varvec{0},\quad \text {and}\quad \overline{H}(j,:) = {\left\{ \begin{array}{ll} z_{i}^{(\le 0)} \quad \text {if { j} is even},\\ z_{i}^{(\ge 0)} \quad \text {otherwise}, \end{array}\right. } \end{aligned}$$

where \(j = 2, 3, \dots , k\), \(i=\lfloor \frac{j}{2} + 1 \rfloor \), and \(\varvec{0}\) denotes the vector of zeros of appropriate dimension. Algorithm 3 describes this procedure.

Algorithm 3 Modified NNSVD (mNNSVD)
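A possible NumPy rendering of Algorithm 3 is the following (a sketch based on the description above, not the authors' released Matlab code); it mirrors the NNSVD phase of Sect. 2.1 and additionally returns \(\overline{H}\), so that \(X_p = WH - W\overline{H}\) as stated above.

```python
import numpy as np

def mnnsvd(X, k):
    """Modified NNSVD (sketch): nonnegative W, H and the tracked term Hbar."""
    p = k // 2 + 1                                  # p = floor(k/2 + 1)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Y = U[:, :p] * np.sqrt(s[:p])                   # Y = U_p Sigma_p^{1/2}
    Z = np.sqrt(s[:p])[:, None] * Vt[:p]            # Z = Sigma_p^{1/2} V_p^T
    m, n = X.shape
    W, H, Hbar = np.zeros((m, k)), np.zeros((k, n)), np.zeros((k, n))
    W[:, 0], H[0] = np.abs(Y[:, 0]), np.abs(Z[0])   # Hbar(1,:) stays zero
    for j in range(2, k + 1):
        i = j // 2 + 1
        y, z = Y[:, i - 1], Z[i - 1]
        if j % 2 == 0:   # keep the nonnegative parts, track the nonpositive part of z
            W[:, j - 1], H[j - 1], Hbar[j - 1] = np.maximum(0, y), np.maximum(0, z), np.maximum(0, -z)
        else:            # keep the nonpositive parts, track the nonnegative part of z
            W[:, j - 1], H[j - 1], Hbar[j - 1] = np.maximum(0, -y), np.maximum(0, -z), np.maximum(0, z)
    return W, H, Hbar
```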

Phase 2: Progressive Residual Projection (PRP)

From the output of Algorithm 3, we directly use the basis matrix W as initialization. We will only update H to reduce the approximation error.

To this end, similarly to LRC, we want to solve

$$\begin{aligned} \min _{H \ge 0} F(H):= ||X_p - WH||_F^2. \end{aligned}$$
(7)

Let us denote by \(H^{(0)}\) and \(\overline{H}^{(0)}\) the output of mNNSVD. We have \(X_p = WH^{(0)} - W\overline{H}^{(0)}\). To solve (7), we resort to Nesterov's accelerated gradient method (Nesterov 1983, 2018), and denote the iterates \(H^{(t)}\) for \(t=0,1,\dots \). At every iteration, we will have

$$\begin{aligned} X_p = WH^{(0)} - W\overline{H}^{(0)} \approx WH^{(t)}. \end{aligned}$$

To progressively keep track of the residual, let us define

$$\begin{aligned} \overline{H}^{(t)} = \overline{H}^{(0)} + H^{(t)} - H^{(0)}. \end{aligned}$$

This implies that, for all t, the following equality holds

$$\begin{aligned} X_p = W \left( H^{(t)}-\overline{H}^{(t)}\right) . \end{aligned}$$
(8)

This can be used to compute the gradient efficiently, as follows:

$$\begin{aligned} \nabla F\left( H^{(t)} \right) = -2W^\top \left( X_p-WH^{(t)}\right) = 2W^\top W\overline{H}^{(t)}. \end{aligned}$$
(9)

It is worth noting that the formulation of the gradient in (9) implicitly depends on \(H^{(t)}\). Since Nesterov's accelerated gradient method does not ensure that the objective function decreases at every iteration, we embed a restarting scheme in the code: if the objective function increases, the algorithm abandons the extrapolated sequence and takes a standard gradient step (O'Donoghue and Candès 2015). The cost of each iteration of this phase is \(O\left( k^{2} (m+n) \right) \) operations. Although LRC has the same computational cost per iteration, it alternates between optimizing W and H, which requires a few inner iterations for each factor and results in a comparatively slower process. In contrast, PRP optimizes only H and typically requires fewer iterations, so it operates significantly faster than LRC.

Algorithm 4 provides the pseudo-code of PRP, where \(S^{(t)}\) denotes the extrapolated sequence of Nesterov's gradient method (Nesterov 1983). Recall that Nesterov's accelerated gradient method performs a projected gradient step (step 3 in Algorithm 4) from an extrapolated point \(S^{(t)}\), computed via an extrapolation step (step 5 in Algorithm 4) whose extrapolation parameters (step 4 in Algorithm 4) are chosen to guarantee an optimal convergence rate.

Algorithm 4 Progressive Residual Projection (PRP)
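A compact sketch of this phase is given below (ours; the step size \(1/L\) with \(L = 2\,\Vert W^\top W\Vert _2\), the exact restart rule and the function names are assumptions, since the pseudo-code of Algorithm 4 is not reproduced here). It only manipulates \(W^\top W\), \(H^{(t)}\) and \(\overline{H}^{(t)}\), in line with the \(O\left( k^{2}(m+n) \right) \) cost discussed above; note that, by (8), the objective can be evaluated cheaply as \(\Vert W\overline{H}^{(t)}\Vert _F^2\).

```python
import numpy as np

def prp(W, H0, Hbar0, n_iter=50):
    """Progressive Residual Projection (sketch): accelerated projected gradient on
    min_{H >= 0} ||X_p - W H||_F^2, where X_p = W @ (H0 - Hbar0) is never formed."""
    WtW = W.T @ W
    L = 2 * np.linalg.norm(WtW, 2)                 # Lipschitz constant of the gradient
    H, H_prev, Hbar = H0.copy(), H0.copy(), Hbar0.copy()
    t_par = 1.0
    obj = np.sum((WtW @ Hbar) * Hbar)              # F(H) = ||W Hbar||_F^2, by (8)
    for _ in range(n_iter):
        t_next = (1 + np.sqrt(1 + 4 * t_par ** 2)) / 2
        beta = (t_par - 1) / t_next                # extrapolation parameter
        S = H + beta * (H - H_prev)                # extrapolated point S^(t)
        grad_S = 2 * WtW @ (S - H + Hbar)          # = 2 W^T (W S - X_p), using (8)
        H_new = np.maximum(0, S - grad_S / L)      # projected gradient step
        Hbar_new = Hbar + H_new - H                # progressive residual update
        obj_new = np.sum((WtW @ Hbar_new) * Hbar_new)
        if obj_new > obj:                          # restart: plain projected gradient step from H
            H_new = np.maximum(0, H - (2 * WtW @ Hbar) / L)
            Hbar_new = Hbar + H_new - H
            obj_new = np.sum((WtW @ Hbar_new) * Hbar_new)
            t_next = 1.0
        H_prev, H, Hbar, obj, t_par = H, H_new, Hbar_new, obj_new, t_next
    return H
```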

Sparsity of factors in accNNSVD-PRP Since W is initialized via the SVD in accNNSVD-PRP, its columns will be roughly 50% sparse (except for the first one, by the Perron-Frobenius theorem). On the other hand, since H is optimized, it is hard to predict the sparsity it will achieve, and it will depend on the data set. On the dense data sets presented in Sect. 4.1, the average sparsity of H was about 50% (between 45% and 56% for all data sets and tested ranks). For the sparse data sets (see Sect. 4.1), the sparsity was between 64% and 74%.

4 Numerical experiments

In this section, we present numerical results to corroborate the effectiveness of our proposed method as an initialization for NMF. The code of accNNSVD-PRP is available on GitHub at https://github.com/5y3datif/accNNSVD-PRP. We compare the proposed method with the scaled random initialization (SRI) from Gillis and Glineur (2008, 2012), and with five structure-based NMF initializations: three are SVD-based, namely NNDSVD, SVD-NMF and NNSVD-LRC (Atif et al. 2019), and two are clustering-based, namely a recent hybrid method combining clustering and the computation of rank-one SVDs called CR1-NMF (Liu and Tan 2017), and SPKM, based on spherical k-means (Wild et al. 2004). The code for CR1-NMF and SPKM is available from https://github.com/zhaoqiangliu/cr1-nmf. Because of the non-deterministic nature of CR1-NMF, SPKM and SRI, we run these methods 100 times and report mean results along with the corresponding standard deviations. All tests are performed using Matlab R2021b on a laptop with an Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz and an NVIDIA GeForce GTX 1650 GPU, with CUDA toolkit 10.0 and cuDNN 7.4. We take advantage of the available GPU whenever possible during the simulations. We compare the various methods using the following two quantities:

1. the \(\text {relative error}(X,WH) \; = \; \frac{\Vert X - WH\Vert _{F}}{\Vert X\Vert _{F}}\),

2. the computational time, measured as the CPU time in Matlab.
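For reference, these two quantities can be computed as follows (a straightforward Python analogue of the Matlab measurements; `time.process_time` plays the role of Matlab's CPU timer here).

```python
import time
import numpy as np

def relative_error(X, W, H):
    return np.linalg.norm(X - W @ H, 'fro') / np.linalg.norm(X, 'fro')

t0 = time.process_time()
# ... run an initialization method on X here ...
cpu_time = time.process_time() - t0
```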

4.1 Datasets description

We conduct experiments on the following dense and sparse datasets.

4.1.1 Dense datasets

We use the following three widely used real dense datasets:

\(\bullet \) AT&T Faces is a dataset of face images taken between 1992 and 1994, and it is one of the most widely used face image datasets in the research community. It contains ten different images of each of 40 distinct subjects. The size of each image is 92\(\times \)112 pixels, with 256 grey levels per pixel. The dataset is freely available from http://cam-orl.co.uk/facedatabase.html.

\(\bullet \) Faces95 is a facial image dataset consisting of 72 subjects, each with a sequence of 20 images at a resolution of 180\(\times \)200 pixels. The dataset is freely accessible from https://www.essex.ac.uk/mv/allfaces/faces95.zip.

\(\bullet \) PaviaU is a hyperspectral image dataset acquired by the ROSIS sensor during a flight campaign over Pavia, Italy. Each band has a resolution of \(610\times 610\) pixels with a spatial resolution of 1.3 m, and there are 103 spectral bands. The dataset is freely available from http://www.ehu.eus/ccwintco/uploads/e/ee/PaviaU.mat, http://www.ehu.eus/ccwintco/uploads/5/50/PaviaU_gt.mat.

4.1.2 Sparse datasets

We also tested our approach on sparse document datasets. The datasets were derived from the San Jose Mercury newspaper articles that are distributed as part of the TREC collection (TIPSTER Vol. 3) and are accessible from https://catalog.ldc.upenn.edu/LDC93T3D (Zhong and Ghosh 2005):

\(\bullet \) Sports: it consists of documents about seven different sports (namely baseball, basketball, football, hockey, boxing, bicycle, and golf). It contains 8580 documents with 14870 words. Its sparsity is \(99.14\%\).

\(\bullet \) Reviews: it consists of documents about five topics (namely food, movies, music, radio, and restaurants). It contains 4069 documents and 18483 words. Its sparsity is \(98.99\%\).

\(\bullet \) Hitech: it consists of documents about six topics (namely computers, electronics, health, medical, research, and technology). It contains 2301 documents with 10080 words. Its sparsity is \(99.14\%\).

4.2 Results

First, as for NNSVD-LRC and as opposed to NNDSVD and SVD-NMF, our proposed method, accNNSVD-PRP, decreases the relative error as the rank increases; see Fig. 1 for an example on the AT&T Faces and Reviews data sets, which are representative of the two classes of data sets studied, dense and sparse. Moreover, for the Reviews data set, the relative error keeps decreasing for all the rank values tested (\(k=1,\dots ,20\)), even though these ranks over-estimate the \(k=5\) topics present in the data set, as explained in Zhong and Ghosh (2005).

Fig. 1 Relative errors of the SVD-based NMF initializations for different values of the rank k, for the AT&T Faces and Reviews data sets. Similar results are observed for the other datasets

We also observe in Fig. 1 that accNNSVD-PRP generates solutions with an initial error larger than that of NNSVD-LRC, but this is expected since accNNSVD-PRP only optimizes the factor H in its second phase. Table 1 reports the relative errors after a few iterations (namely 5, 25, and 125) of the HALS algorithm for NMF (Cichocki et al. 2007). We observe that accNNSVD-PRP provides relative error values similar to those of NNSVD-LRC, and this becomes even more evident after enough iterations of HALS. Compared to the other initialization strategies, accNNSVD-PRP and NNSVD-LRC perform better on average, as already reported in Atif et al. (2019).

Table 1 Relative error in percent of the different initialization methods after a few iterations (namely 5, 25, and 125) of HALS, for different over-estimated rank values, i.e., \(k=15,20,25\). Cases where accNNSVD-PRP performs better than NNSVD-LRC are indicated in bold

However, although accNNSVD-PRP does not outperform NNSVD-LRC in terms of relative errors, it runs significantly faster since its second phase is much cheaper. Table 2 reports the computational times of the tested NMF initializations for different over-estimated rank values, i.e., \(k=15, 20, 25, 35, 50\).

Table 2 CPU time (in seconds) taken by the different NMF initializations on the different data sets, for several over-estimated rank values, i.e., k = 15, 20, 25, 35, 50

We observe that, except for random initializations which are very fast to generate, accNNSVD-PRP runs among the fastest in all cases (always at least second best, and very close to the fastest when not the best). In particular, accNNSVD-PRP runs significantly faster than NNSVD-LRC: on average 4.5 times faster on dense data sets, and 1.7 times faster on sparse data sets.

5 Conclusion

In this paper, we have proposed a new SVD-based initialization for NMF, namely accNNSVD-PRP. It is inspired by NNSVD-LRC, a state-of-the-art initialization, but adapts its two phases: the NNSVD phase is reformulated to introduce a factor \(\overline{H}\) that keeps track of the discarded terms of (4), and the low-rank correction is replaced by a progressive residual projection that only updates H. This allows the proposed algorithm to run significantly faster, while keeping the nice properties of NNSVD-LRC, in particular the fact that the approximation error decreases as the factorization rank increases. Let us note that this approach could be adapted to alternate between the factors H and W, and to include W in the correction phase as well, to further improve the performance of accNNSVD-PRP; this is the object of future work. To conclude, we have shown on various sparse and dense datasets that accNNSVD-PRP runs significantly faster than NNSVD-LRC while performing similarly as an NMF initialization scheme in terms of relative errors. Moreover, it outperforms other state-of-the-art initializations such as NNDSVD and SVD-NMF.