Accelerated anti-lopsided algorithm for nonnegative least squares
Abstract
The nonnegative least squares (NNLS) problem has been widely used in scientific computation and data modeling, especially for low-rank representations such as nonnegative matrix and tensor factorization. When applied to large-scale datasets, first-order methods are preferred because they provide fast, flexible computation for regularized NNLS variants, but their performance and convergence remain key challenges. In this paper, we propose an accelerated anti-lopsided algorithm for NNLS with linear over-bounded convergence rate \(\left[ \left( 1 - \frac{\mu }{L}\right) \left( 1-\frac{\mu }{nL}\right) ^{2n}\right] ^k\) in the subspace of passive variables, where \(\mu \) and L are always bounded as \(\frac{1}{2} \le \mu \le L \le n\), and n is the dimension of the solution. This is highly competitive with current advanced methods such as accelerated gradient methods, which have sublinear convergence \(\frac{L}{k^2}\), and greedy coordinate descent methods, which have convergence \(\left( 1 - \frac{\mu }{nL}\right) ^k\), where \(\mu \) and L are unbounded. The proposed algorithm transforms the variable x into a new space in which the second derivative is constant, \(\frac{\partial ^2 f}{\partial x_i^2} = 1\) for all variables \(x_i\), to implicitly exploit second-order derivative information, to guarantee that \(\mu \) and L are always bounded so that the algorithm achieves over-bounded convergence, and to enhance the performance of the internal procedures based on exact line search, greedy coordinate descent, and accelerated search. Experiments on large matrices and on real applications of nonnegative matrix factorization clearly show the higher performance of the proposed algorithm in comparison with state-of-the-art algorithms.
Keywords
Nonnegative least squares · Accelerated anti-lopsided algorithm · First-order methods
1 Introduction
Minimizing the sum of squares of the errors is one of the most fundamental problems in numerical analysis, known as the nonnegative least squares (NNLS) problem. It has been widely used in scientific computation and data mining to approximate observations [5]. Especially in fields such as image processing, computer vision, text mining, environmetrics, chemometrics, and speech recognition, observations \(b \in {\mathbb {R}}^{d}\) are often approximated by a set of measurements or basis factors \(\{A_i\}\) contained in a matrix \(A \in {\mathbb {R}}^{d \times n}\) via minimizing \(\frac{1}{2}\Vert Ax - b\Vert ^2_2\). Moreover, in comparison with least squares (LS), NNLS has more concisely interpretable solutions, whose nonnegative coefficients \(x \in {\mathbb {R}}_+^n\) can be interpreted as contributions of the measurements to the observations. In contrast, the mixed-sign coefficients of LS solutions are hard to interpret because they lead to overlapping and mutual elimination of the measurements.
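As a minimal illustration of the NNLS objective above, the following sketch uses SciPy's reference NNLS solver (not the paper's algorithm) on a made-up instance; all names and data here are ours:

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical tiny instance: observations b approximated by nonnegative
# combinations of the basis factors stored in the columns of A.
rng = np.random.default_rng(0)
A = rng.random((50, 5))                        # d = 50 observations, n = 5 factors
x_true = np.array([1.0, 0.0, 2.0, 0.0, 0.5])  # nonnegative ground truth
b = A @ x_true                                 # b lies exactly in the cone of A

x, rnorm = nnls(A, b)                          # minimizes ||Ax - b||_2 s.t. x >= 0
```

Because b is generated from a nonnegative coefficient vector, the solver recovers it exactly; the zero coefficients illustrate the interpretability point made above, since inactive measurements contribute nothing to the approximation.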
Although NNLS is a convex optimization problem, it has no closed-form solution, unlike the least squares (LS) problem; hence, iterative algorithms and gradient methods are widely employed to solve it. The performance of NNLS algorithms mainly depends on selecting appropriate directions to optimize the objective function. To improve performance, most effective algorithms remove redundant variables based on the concept of active sets [2, 5] in each iteration, with different strategies [5]. These algorithms are fundamentally based on the observation that several variables can be ignored if they are negative when the problem is solved without constraints [2, 15, 21]. In other words, NNLS can be considered an unconstrained problem in the subspace of the variables [13] that are positive in the optimal solution. In addition, algorithms using the second derivative [2, 15, 21] discover directions that reduce the objective function value more effectively. However, these approaches have two main drawbacks: the invertibility of \(A^TA\) and its heavy computation, especially for methods that recompute \((A^TA)^{-1}\) several times for different passive sets. Hence, first-order methods [7, 13, 19] can be more effective for large-scale least squares problems.
Since the 1990s, methods for nonnegative matrix and tensor factorization have widely used NNLS to achieve low-rank representations of nonnegative data [14, 22]. In particular, a low-rank representation transfers data instances into a lower-dimensional space of latent components to obtain increased speed and accuracy, and more concise interpretability of data processing, which is essential in applications of signal and image processing, machine learning, and data mining [5]. However, the low-rank representation is usually a nonconvex problem, and it often employs iterative multiplicative update algorithms. In addition, exact algorithms often lack flexibility for low-rank regularized variants and also have high complexity and slow convergence. Hence, fast approximate algorithms based on first-order methods are preferred because they naturally provide a flexible framework for low-rank models [4, 9, 10, 11].

Convergence: the accelerated anti-lopsided algorithm for NNLS attains a linear convergence rate of \([(1 - \frac{\mu }{L})(1-\frac{\mu }{nL})^{2n}]^k\) in the subspace of passive variables, where n is the dimension of the solution, and \(\mu \) and L are always bounded as \(\frac{1}{2} \le \mu \le L \le n\) to guarantee an over-bounded convergence rate. Meanwhile, the current advanced first-order methods are accelerated gradient methods, with sublinear convergence \({\mathcal {O}}(\frac{L}{k^2})\), and greedy coordinate descent algorithms, with convergence \((1 - \frac{\mu }{nL})^k\), where \(\mu \) and L are unbounded.

Robustness: the algorithm works stably in ill-conditioned cases for NNLS regularizations since it is entirely based on the first derivative and does not require computing the inverse of the matrix \(A^TA\) like Newton methods. In addition, it can exploit the second derivative by guaranteeing \(\frac{\partial ^2 f}{\partial x_i^2} = 1, \forall i\) to avoid the worst cases and discover more effective gradient directions, while keeping the complexity of each iteration low at \({\mathcal {O}}(n^2)\). Moreover, \(\mu \) and L are always bounded as \(\frac{1}{2} \le \mu \le L \le n\), which increases the effectiveness of the greedy coordinate descent and exact line search procedures that depend on these parameters.

Effectiveness: the experimental results for NNLS are highly competitive with the state-of-the-art methods. These results additionally show that the algorithm is the fastest first-order method for NNLS in both practice and theory.
2 Background and related works
This section introduces the nonnegative least squares (NNLS) problem, its equivalent nonnegative quadratic programming (NQP) problem, and significant milestones in the algorithmic development for NNLS.
2.1 Background
Nonnegative least squares (NNLS) can be considered one of the most central problems in data modeling, whose solution can estimate the parameters of models for describing data [5]. It arises in scientific applications where we need to approximate a large number of vector observations \(b \in {\mathbb {R}}^{d}\) using a set of measurements or basis factors \(\{A_i\}\) contained in a matrix \(A \in {\mathbb {R}}^{d \times n}\) via minimizing \(\frac{1}{2}\Vert Ax - b\Vert ^2_2\). Hence, we can define NNLS as follows:
Definition 1
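The body of Definition 1 did not survive extraction; from the objective stated just above and the NQP form named in the section introduction, it presumably reads (our hedged reconstruction):

```latex
% NNLS: nonnegative least squares
x^* \;=\; \operatorname*{argmin}_{x \in \mathbb{R}^n_+}\; \frac{1}{2}\,\lVert Ax - b \rVert_2^2,
\qquad A \in \mathbb{R}^{d \times n},\; b \in \mathbb{R}^{d},
```

which is equivalent to the nonnegative quadratic programming (NQP) problem

```latex
\min_{x \succeq 0}\; \frac{1}{2}\, x^{T} H x + h^{T} x,
\qquad H = A^{T} A,\quad h = -A^{T} b .
```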
Table 1 Comparison summary of NNLS solvers

Criteria | ELS | Coord | Accer | Fast | Nm | Frugal | Antilop
Iteration complexity | \(n^2\) | \(n^2\) | \(n^2\) | \(n^3\) | \(\#(nd)\) | \(\#(nd)\) | \(n^2\)
Convergence rate | \(\left( 1-\frac{\mu }{L}\right) ^k\) | ? | \(\frac{L}{k^2}\) | ? | ? | ? | \(\left[ \left( 1-\frac{\mu }{L}\right) \left( 1-\frac{\mu }{nL}\right) ^{2n}\right] ^k\)
Over-bounded convergence | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓
Memory size | \(n(n+d)\) | \(n(n+d)\) | \(n(n+d)\) | \(n(n+d)\) | \(\#(nd)\) | \(\#(nd)\) | \(n(n+d)\)
Does not compute \(A^TA\) | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗
Does not compute \((A^TA)^{-1}\) | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓
2.2 Related works
In the last several decades of development, different approaches have been proposed to tackle the NNLS problem; they can be divided into two main groups: active set methods and iterative methods [5].
Active set methods are based on the observation that variables can be divided into subsets of active and passive variables [8]. Particularly, the active set contains variables that are zero or negative when the least squares problem is solved without nonnegativity constraints; the remaining variables belong to the passive set. Active set algorithms rely on the fact that, once the passive set is identified, the passive variables' values in NNLS are given by the unconstrained least squares solution with the active variables set to zero. However, these sets are unknown in advance. Hence, a number of iterations are employed to find the passive set, each of which solves an unconstrained least squares problem on the current passive set and then updates the passive and active sets.
Concerning the significant milestones of active set methods, Lawson and Hanson [15] proposed the standard active set algorithm. Subsequently, Bro and De Jong [2] avoided unnecessary recomputations on multiple right-hand sides to speed up the basic algorithm [15]. Finally, Dax [6] proposed selecting a good starting point by Gauss–Seidel iterations and moving away from a "dead point" to reduce the number of iterations. In contrast, iterative methods use the first-order gradient on the active set to handle multiple active constraints in each iteration, while active set methods handle only one active constraint at a time [5]. Hence, iterative methods can deal with larger-scale problems [12, 13] than active set methods. However, they do not guarantee the convergence rate.
More recently, Franc et al. [7] proposed a cyclic block coordinate descent method with fast convergence in practice and low complexity per iteration, but its convergence rate has not been theoretically guaranteed. Subsequently, Vamsi [19] suggested three modifications, namely random permutations [17], shrinking, and random projections, to speed up NNLS for the case in which the matrix A is not thin (\(d \le n\)). Furthermore, accelerated methods [16] and proximal methods [18], which have fast convergence \(O(1/k^2)\) [10], only require the first-order derivative. However, one major disadvantage of accelerated methods is that they require a large number of iterations to reach high accuracy, because the step size is limited by \(\frac{1}{L}\), which is usually small for large-scale NNLS problems with big matrices, where L is the Lipschitz constant. A comparison summary of NNLS solvers is presented in Table 1.
In summary, active set methods and iterative methods are the two major approaches to solving NNLS. Active set methods solve nonnegative least squares problems accurately, but they require heavy computation for solving unconstrained least squares problems and are unstable when \(A^TA\) is ill-conditioned. Iterative methods are more promising for solving large-scale NNLS because they can handle multiple active constraints per iteration. In our view, however, iterative methods are still ineffective due to the variable scaling problem, which seriously affects the search for appropriate gradient directions. Therefore, we propose an accelerated anti-lopsided algorithm combining several algorithms and ideas with different advantages to reduce the negative effects of the scaling problem, obtain appropriate gradient directions, and achieve over-bounded linear convergence in the subspace of passive variables.
3 Accelerated anti-lopsided algorithm

Part 1. Anti-lopsided transformation (Lines 3 to 5): the variable vector x is transformed into a new space by the inverse function \(x = \varphi (y)\). In the new space, the equivalent objective function \(g(y) = f(\varphi (y))\) satisfies \(\frac{\partial ^2 g}{\partial y_i^2} = 1,\ \forall i\); that is, the acceleration of each variable equals 1. As a result, the roles of the variables become more balanced, since the level curves of the function become more spherical while g(y) remains convex. This part aims to make the subsequent parts more effective, because it implicitly exploits the second derivative information \(\frac{\partial ^2 g}{\partial y_i^2} = 1,\ \forall i\) to guarantee that \(\mu \) and L are always bounded as \(\frac{1}{2} \le \mu \le L \le n\).
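A minimal sketch of this transformation for the equivalent NQP \(f(x) = \frac{1}{2}x^THx + h^Tx\) with \(H = A^TA\); the function and variable names here are ours, and the variable map assumed is \(x = y/\sqrt{\text{diag}(H)}\):

```python
import numpy as np

def anti_lopsided(H, h):
    """Rescale the NQP  f(x) = 0.5 x^T H x + h^T x  so that every diagonal
    entry of the new Hessian equals 1 (d^2 g / dy_i^2 = 1 for all i).
    Returns the new Hessian Q, new linear term p, and the scaling s,
    where the original variable is recovered via x = y / s."""
    s = np.sqrt(np.diag(H))      # scaling factors sqrt(H_ii)
    Q = H / np.outer(s, s)       # Q_ij = H_ij / sqrt(H_ii * H_jj)
    p = h / s                    # linear term in the new space
    return Q, p, s

rng = np.random.default_rng(1)
A = rng.random((30, 4))
b = rng.random(30)
H, h = A.T @ A, -(A.T @ b)       # NQP data for 0.5 * ||Ax - b||^2
Q, p, s = anti_lopsided(H, h)
```

By construction every diagonal entry of Q equals 1, and the transformed quadratic \(g(y) = \frac{1}{2}y^TQy + p^Ty\) takes the same values as f at corresponding points.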

Part 2. Exact line search (Lines 12 to 16): this part optimizes the objective function with a guaranteed over-bounded convergence rate \((1 - \frac{\mu }{L})^k\), where \(\frac{1}{2} \le \mu \le L \le n\), over the space of passive variables; it has complexity \({\mathcal {O}}(n^2)\). This part aims to reduce the objective function exponentially and precisely, although it suffers from variable scaling problems and nonnegativity constraints.
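A sketch of one such step on the NQP form, with the gradient masked to the passive set (our naming; for a quadratic, the exact step along \(-g\) is \(\alpha = \Vert g\Vert ^2/(g^TQg)\)):

```python
import numpy as np

def exact_line_search_step(x, Q, p):
    """One exact-line-search step for f(x) = 0.5 x^T Q x + p^T x, x >= 0.
    Active variables (x_i = 0 with grad_i > 0) are masked out, mirroring
    the passive-set idea; the remaining gradient is followed with the
    exact quadratic step, then the iterate is projected back onto x >= 0."""
    grad = Q @ x + p
    g = np.where((x > 0) | (grad < 0), grad, 0.0)  # passive-set gradient
    curvature = g @ Q @ g
    if curvature <= 0:                             # stationary on the passive set
        return x
    alpha = (g @ g) / curvature                    # exact step for a quadratic
    return np.maximum(x - alpha * g, 0.0)

rng = np.random.default_rng(2)
A = rng.random((20, 3))
A /= np.linalg.norm(A, axis=0)                     # unit columns, so diag(Q) = 1
Q, p = A.T @ A, -(A.T @ rng.random(20))
f = lambda v: 0.5 * v @ Q @ v + p @ v
x0 = np.zeros(3)
x1 = exact_line_search_step(x0, Q, p)
```

From \(x_0 = 0\) the masked gradient is nonpositive here, so the step stays feasible without projection and strictly decreases f.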

Part 3. Greedy coordinate descent (Lines 18 to 22, repeated in Line 29): this part employs greedy coordinate descent using the Gauss–Southwell rule with exact optimization to rapidly reduce the objective function, with convergence \((1-\frac{\mu }{nL})\) for each update [17, 20]; it has complexity \({\mathcal {O}}(n^2)\). This part aims to reduce the negative effects of variable scaling problems and nonnegativity constraints, although it suffers from zigzagging because it optimizes the objective function over one variable at a time. Because of its fast convergence in practice and its robustness to variable scaling problems and nonnegativity constraints, this part is repeated once more after Part 4.
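A sketch of such a pass on the NQP; our variant of the Gauss–Southwell rule picks the coordinate with the largest exact nonnegative move, and the paper's exact selection rule may differ:

```python
import numpy as np

def greedy_cd_pass(x, Q, p, n_updates):
    """Greedy coordinate descent on f(x) = 0.5 x^T Q x + p^T x, x >= 0.
    Each update solves one coordinate exactly, x_i <- max(0, x_i - g_i / Q_ii),
    choosing the coordinate with the largest such move (Gauss-Southwell style),
    and refreshes the gradient with a rank-one update in O(n)."""
    x = x.copy()
    grad = Q @ x + p
    d = np.diag(Q)
    for _ in range(n_updates):
        delta = np.maximum(x - grad / d, 0.0) - x   # exact per-coordinate moves
        i = int(np.argmax(np.abs(delta)))
        if delta[i] == 0.0:                         # no coordinate can improve
            break
        x[i] += delta[i]
        grad += delta[i] * Q[:, i]                  # rank-one gradient refresh
    return x

rng = np.random.default_rng(3)
A = rng.random((20, 4))
A /= np.linalg.norm(A, axis=0)
Q, p = A.T @ A, -(A.T @ rng.random(20))
f = lambda v: 0.5 * v @ Q @ v + p @ v
x = greedy_cd_pass(np.zeros(4), Q, p, n_updates=2 * 4)  # 2n updates, as in Part 3
```

Since each coordinate update is an exact one-dimensional minimization, the objective never increases within the pass.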

Part 4. Accelerated search (Lines 24 to 28): this step performs a momentum-based descent search using the previous changes in the variables made by Part 2 and Part 3; it has low complexity \({\mathcal {O}}(n{\cdot }\text {nn}(n))\), where \(\text {nn}(n)\) is the number of negative elements in \((x_{k+1} - \alpha \Delta x)\); see Line 27 in Algorithm 1. This part relies on the global information of two distinct points to escape the local issues of the first derivative raised by the function's complexity. It originates from the idea that if the function has been optimized from \(x_s\) to \(x_k\) by the exact line search and the coordinate descent algorithm, it is highly likely that the function value will be further reduced along the vector \((x_k - x_s)\), because the NNLS objective function is convex and has a (super-)ellipsoidal shape.
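A sketch of this momentum step (our naming); for a quadratic, the exact step along \(\Delta x = x_k - x_s\) is \(\alpha = -(\nabla f(x_k)^T\Delta x)/(\Delta x^TQ\Delta x)\), followed by projection onto the nonnegative orthant:

```python
import numpy as np

def accelerated_search(x_s, x_k, Q, p):
    """Momentum-style search: continue along the direction that the previous
    parts already found productive, take the exact quadratic step, project
    onto x >= 0, and keep the new point only if it improves the objective."""
    f = lambda v: 0.5 * v @ Q @ v + p @ v
    dx = x_k - x_s
    curvature = dx @ Q @ dx
    if curvature <= 0:
        return x_k
    alpha = -((Q @ x_k + p) @ dx) / curvature   # exact minimizer along dx
    x_new = np.maximum(x_k + alpha * dx, 0.0)   # projection may spoil exactness
    return x_new if f(x_new) <= f(x_k) else x_k  # keep only improving moves

rng = np.random.default_rng(4)
A = rng.random((15, 3))
A /= np.linalg.norm(A, axis=0)
Q, p = A.T @ A, -(A.T @ rng.random(15))
x_s = np.zeros(3)
x_k = np.maximum(-0.1 * p, 0.0)                 # some feasible point past x_s
x_new = accelerated_search(x_s, x_k, Q, p)
```

The final guard makes the step safe: projection after the exact step can in principle lose the descent property, so a non-improving candidate is simply discarded.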
Remark 1

\(\frac{\partial ^2 f}{\partial y_i^2}=Q_{ii}=\frac{H_{ii}}{\sqrt{H_{ii}^2}}=1,\ \forall \ i=1,\ldots ,n\)

\(\frac{\partial ^2 f}{\partial y_i \partial y_j}=Q_{ij}=\frac{H_{ij}}{\sqrt{H_{ii}H_{jj}}},\ \forall \ 1 \le i \ne j \le n\) \(\Rightarrow Q_{ij} \le \frac{\langle A_i,A_j\rangle }{\Vert A_i\Vert \Vert A_j\Vert +\alpha } \le \cos (A_i, A_j) \le 1\) since \(\alpha > 0\).
The variable scaling problem is significantly reduced because the acceleration of the function with respect to each variable equals 1, and the roles of the variables in the function become more balanced. For example, in Fig. 3 the change in function value over the variables is more balanced than in Fig. 2. In addition, the level curve of Function 4 is transformed from long-ellipse shaped (see Fig. 2) to short-ellipse shaped (see Fig. 3) in Function 6. Furthermore, by combining fast convergence algorithms such as exact line search, greedy coordinate descent, and accelerated search, the proposed algorithm works much more effectively. For example, only 21 iterations instead of 271 are needed to reach the optimal solution of Function 4 from the same initial point \(y_0\), where \(({y_0})_i=({x_0})_i \cdot H_{ii},\ \forall i\) (see Fig. 3).
4 Theoretical analysis
This section analyzes the convergence and complexity of the proposed algorithm.
4.1 Convergence
Concerning the convergence rate, our method relies on the note that NNLS is an unconstrained optimization problem on the passive set of variables [2]. Moreover, the orthogonal projection on the subspace of passive variables, \(x = [x_k - \alpha \nabla \bar{f}]_+\), is trivial [13] since NNLS and its equivalent problem (NQP) are strongly convex on a convex set. In addition, greedy coordinate descent using the Gauss–Southwell rule with exact optimization has a fast convergence rate of \((1 - \frac{\mu }{nL})\) per update [17, 20]. Hence, in this section we analyze the convergence rate of the exact line search in Algorithm 1 and determine the bounds of \(\mu \) and L in the subspace of passive variables, which significantly influence the convergence rate of the proposed algorithm. Furthermore, we only consider the convergence of the NNLS solver without regularizations, because the \(L_1\) and \(L_2\) coefficients \(\beta \) and \(\alpha \) are assumed to affect the convergence only slightly, for the following reasons: first, the \(L_1\) regularization coefficient \(\beta \) does not change the Hessian matrix; second, the \(L_2\) regularization coefficient \(\alpha \) is often small, and it only slightly influences \(\frac{\mu }{L}\) because it changes both the convexity parameter \(\mu \) and the Lipschitz constant L by adding the same positive value \(\alpha \).

\(\exists \mu , L > 0\) satisfying \( \mu I \preceq \nabla ^2 f \preceq LI\)

\(\forall x, y: f(y) \ge f(x) + \langle \nabla f(x), (y  x)\rangle + \frac{\mu }{2} \Vert y  x\Vert ^2 \)

\(\forall x, y: f(y) \le f(x) + \langle \nabla f(x), (y  x)\rangle + \frac{L}{2} \Vert y  x\Vert ^2 \)
Theorem 1
After \((k+1)\) iterations, the convergence rate of the exact line search is \(f(x^{k+1}) - f^* \le (1 - \frac{\mu }{L})^k (f(x^0) - f^*)\), where \(f^*\) is the minimum value of f(x).
Proof
Lemma 1
Consider \(\nabla ^2 f\) of \(f(x) = \frac{1}{2}x^TQx + q^Tx\). Then \(\frac{1}{2}I \preceq \nabla ^2 f \preceq \Vert Q\Vert _2I \preceq nI\), where \(\Vert Q\Vert _2 = \sqrt{\sum _{i=1}^{n}\sum _{j=1}^{n}Q_{ij}^2}\).
Proof
We have \(\nabla ^2 f = Q\) and \(\mathbf {a_i} = \frac{\mathbf {A_i}}{\Vert \mathbf {A_i}\Vert _2}\), so \(Q_{ij} = \langle \mathbf {a_i}, \mathbf {a_j}\rangle \) and \(Q_{ii} = \langle \mathbf {a_i}, \mathbf {a_i}\rangle = 1,\ \forall i, j\). Hence \(\frac{1}{2}x^TIx \le \frac{1}{2}\left( \sum _{i=1}^{n}x^2_i\right) + \frac{1}{2}\left\Vert \sum _{i=1}^{n}x_i \mathbf {a_i}\right\Vert _2^2 = \sum _{i=1}^{n}\sum _{j=1}^{n}Q_{ij}x_ix_j = x^TQx\hbox { for }\forall x \Rightarrow \frac{1}{2}I \preceq \nabla ^2 f\).
Therefore: \(\frac{1}{2}I \preceq \nabla ^2 f \preceq \Vert Q\Vert _2I \le nI\). \(\square \)
From Theorem 1 and Lemma 1 and by setting \(\mu = \frac{1}{2}\) and \(L = \Vert Q\Vert _2\), we have:
Lemma 2
After \(k+1\) iterations, \(f(x^{k+1}) - f(x^*) \le (1 - \frac{\mu }{L})^k (f(x^0) - f(x^*))\), and \(\mu \), L are always bounded as \(\frac{1}{2} \le \mu \le L \le n\), where n is the dimension of x. Hence, the convergence rate of the exact line search in Algorithm 1 is over-bounded as \((1 - \frac{\mu }{L})^k \le (1 - \frac{1}{2n})^k\).
Moreover, because greedy coordinate descent using the Gauss–Southwell rule with exact optimization has convergence rate \((1 - \frac{\mu }{nL})\) per update [17, 20], and these updates are conducted 2n times, we have:
Theorem 2
The convergence rate of Algorithm 1 is \([(1 - \frac{\mu }{L})(1 - \frac{\mu }{nL})^{2n}]^k\) in the subspace of passive variables, where \(\mu \) and L are always bounded as \(\frac{1}{2} \le \mu \le L \le n\). Hence, Algorithm 1 converges at the over-bounded rate \([(1 - \frac{\mu }{L})(1 - \frac{\mu }{nL})^{2n}]^k \le [(1 - \frac{1}{2n})(1 - \frac{1}{2n^2})^{2n}]^k\).
Proof
Based on Section 3.2 in [20], for greedy coordinate descent using the Gauss–Southwell rule, we have \(f(x^{k} - \frac{1}{L}\nabla _{i_k} f(x^k)) - f(x^*) \le (1 - \frac{\mu }{nL}) (f(x^k) - f(x^*))\).
By using exact optimization, \(f(x^{k+1}) - f(x^*) \le f(x^{k} - \frac{1}{L}\nabla _{i_k} f(x^k)) - f(x^*)\).
Hence, \(f(x^{k+1}) - f(x^*) \le (1 - \frac{\mu }{nL}) (f(x^k) - f(x^*))\). In other words, the convergence rate of each update of greedy coordinate descent using the Gauss–Southwell rule with exact optimization is \((1 - \frac{\mu }{nL})\).
Overall, each iteration of Algorithm 1, consisting of one exact line search and 2n updates of greedy coordinate descent, yields the convergence rate \([(1 - \frac{\mu }{L})(1 - \frac{\mu }{nL})^{2n}]^k\). \(\square \)
4.2 Complexity

The complexity of the exact line search is \({\mathcal {O}}(n^2+n{\cdot }\text {nn}(n))\),

The complexity of the greedy coordinate descent is \({\mathcal {O}}(n^2)\),

The complexity of the accelerated search is \({\mathcal {O}}(n{\cdot }\text {nn}(n))\).
Theorem 3
The average complexity of Algorithm 1 is \({\mathcal {O}}(dn+dn^2+\bar{k}n^2)\), where \(\bar{k}\) is the number of iterations.
5 Experimental evaluation
Table 2 Summary of test cases

Dataset | d | n | Type
Synthetic 1 | 8000 | 10,000 | \(d < n\)
Synthetic 2 | 15,000 | 10,000 | \(d > n\)
Synthetic 3 | 15,000 | 10,000 | Sparse, \(20\% \ne 0\)
ILSVRC2013 | 61,188 | 10,000 | \(d > n\)
CIFAR | 3072 | 10,000 | \(d < n\)
20NEWS | 61,185 | 10,000 | Sparse
Datasets: to investigate the effectiveness of the compared algorithms, six datasets are used, as shown in Table 2:

the matrix A is randomly generated by the function \(rand(d, n)\times 100\) for dense matrices, and \(sprand(d, n, 0.1) \times 100\) for sparse matrices.

the first 10,000 images of ILSVRC2013^{1} are extracted to form the matrix A; the images in ILSVRC2013 are resized to \(128 \times 128\) before being converted into vectors of 61,188 dimensions.

the first 10,000 instances of CIFAR^{2} are extracted to establish the matrix A,

the first 10,000 documents of 20NEWS^{3} are extracted to form the matrix A.
5.1 Comparison with stateoftheart algorithms

Coord: this is a cyclic block coordinate descent algorithm [7] with fast convergence in practice [17].

Accer: this is a Nesterov accelerated method with convergence rate \({\mathcal {O}}(\frac{L}{k^2})\) [10]. The source code, downloaded from,^{5} is extracted from a module accompanying the paper [10].

Fast: this is a modified, effective version of active set methods following Bro and De Jong, Journal of Chemometrics, 1997 [2], developed by S. Gunn.^{6} This algorithm can be considered one of the fastest active set methods.

Nm: this is a nonmonotonic fast method for largescale nonnegative least squares based on iterative methods [13]. The source code is downloaded from.^{7}

Frugal: this is frugal coordinate descent for large-scale NNLS. The code is provided by the author [19].

Concerning the convergence of the gradient square \(\Vert \nabla \bar{f}\Vert ^2_2\) over the passive variable set versus time in Fig. 4, the proposed algorithm has the fastest convergence on all 6 datasets. At the beginning, the frugal block coordinate descent algorithm FCD [19] and the nonmonotonic method Nm [13] approximate quickly because they do not compute \(A^TA\). Over longer runs, however, the faster-converging algorithms such as Antilop and Coord [7] dominate, even though they spend considerable time computing the Hessian matrix \(A^TA\). In comparison with Coord, the proposed algorithm Antilop converges much faster, because Coord suffers from zigzagging when optimizing functions of multiple variables. For the accelerated algorithm Accer, the gradient square decreases only gradually because the step size is limited by \(\frac{1}{L}\). The active set method Fast converges slowly because it has a high complexity of approximately \({\mathcal {O}}(n^3)\) per iteration and handles a single active constraint at a time.

Similarly, regarding the convergence of the objective value \(\Vert Ax-b\Vert ^2_2/2\) versus time in Fig. 5, the proposed algorithm has the fastest convergence on all 6 datasets. At the beginning, the fast approximate algorithms FCD and Nm converge faster than Antilop and Coord. However, Antilop and Coord soon converge more rapidly because they detect more appropriate directions in which to optimize the function. In comparison with Coord, the proposed algorithm Antilop converges much faster. For the accelerated algorithm Accer, with convergence \(1/k^2\), the objective value decreases only gradually because of its limited step size \(\frac{1}{L}\) and the negative effects of the nonnegativity constraints. In addition, the active set method Fast converges slowly due to its high complexity.
5.2 Application for lowrank representation
Table 3 Summary of test cases for NMF

Dataset | n | m
CIFAR | 3072 | 60,000
ILSVRC2013 | 61,188 | 60,000
In the NMF problem, a given nonnegative matrix V is factorized into the product of two matrices, \(V \approx WF\). For the Frobenius norm, an alternating iterative algorithm similar to the EM algorithm is usually employed, containing two main steps: in each step, one of the two matrices W or F is fixed, and the other matrix is optimized. For example, when the matrix W is fixed, the new matrix F is determined by \(F = \underset{F \succeq 0}{{\text {argmin}}}\; \Vert V - WF\Vert ^2_2 = \underset{F_i \succeq 0}{{\text {argmin}}}\; \sum _{i=1}^{r} \Vert V_i - WF_i\Vert ^2_2\). Hence, a large number of NNLS problems must be approximately solved in NMF, and employing the proposed algorithm in NMF is a reasonable way to test its effectiveness.
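A sketch of this alternating scheme, with SciPy's NNLS solver standing in for the paper's Antilop solver (function names and toy data are ours):

```python
import numpy as np
from scipy.optimize import nnls

def nmf_alternating(V, r, n_iter=20, seed=0):
    """Alternating nonnegative least squares for V ~ W F.
    Each half-step fixes one factor and solves a batch of independent NNLS
    problems: one per column of V (for F) and one per row of V (for W)."""
    rng = np.random.default_rng(seed)
    d, m = V.shape
    W = rng.random((d, r))
    F = rng.random((r, m))
    errs = []
    for _ in range(n_iter):
        # F-step: column j of F solves min ||V[:, j] - W f||, f >= 0
        F = np.column_stack([nnls(W, V[:, j])[0] for j in range(m)])
        # W-step: row i of W solves min ||V[i, :] - w F||, w >= 0
        W = np.column_stack([nnls(F.T, V[i, :])[0] for i in range(d)]).T
        errs.append(np.linalg.norm(V - W @ F))
    return W, F, errs

# Toy data with an exact rank-2 nonnegative factorization.
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 8))
W, F, errs = nmf_alternating(V, r=2, n_iter=20)
```

Since each half-step exactly minimizes over its block, the reconstruction error is monotonically nonincreasing across iterations.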

Figure 6 shows the convergence of the objective value \(\Vert V - WF\Vert ^2_2/2\) versus running time. For the dataset CIFAR, the algorithm NMF_Antilop always converges faster than the other compared algorithms. For the dataset ILSVRC2013, NMF_Antilop converges slowly at the beginning; however, it then converges faster than the other algorithms and obtains lower objective values \(\Vert V - WF\Vert ^2_2/2\) by the end of the runs. In addition, the algorithm NMF_Accer has the slowest convergence, and its results are not reported for the dataset ILSVRC2013 with \(r = \{200, 250\}\) because of its long running time.

Moreover, Table 4 shows the objective values after 300 iterations of the alternating iterative algorithm. Based on the results, the algorithm NMF_Antilop has the highest accuracy in all the test cases. The results indicate that the proposed NNLS algorithm obtains higher accuracy than the other algorithms employed in NMF for the following reasons: first, NMF is a hard optimization problem with a large number of variables, and it is difficult to reduce the objective value further as the variables converge to the optimal solution, as shown in Fig. 6; second, the algorithm Antilop converges quickly with high accuracy and therefore obtains better objective values.
Table 4 \(\Vert V-WF\Vert ^2_2/2\) of NMF solvers after 300 iterations (unit: \({\times }10^{10}\))

Method | NMF_Antilop | NMF_Coord | NMF_HALS | NMF_Accer
CIFAR, \(r = 150\) | 2.565 | 2.565 | 2.581 | 2.575
CIFAR, \(r = 200\) | 2.014 | 2.017 | 2.031 | 2.016
CIFAR, \(r = 250\) | 1.625 | 1.630 | 1.649 | 1.636
ILSVRC2013, \(r = 150\) | 12.390 | 12.409 | 12.400 | 12.433
ILSVRC2013, \(r = 200\) | 11.070 | 11.089 | 11.116 | –
ILSVRC2013, \(r = 250\) | 10.097 | 10.127 | 10.141 | –
6 Conclusion and discussion
In this paper, we proposed an accelerated anti-lopsided algorithm to solve the nonnegative least squares problem, one of the most fundamental problems in low-rank representation. The proposed algorithm combines several algorithms and ideas, namely the anti-lopsided variable transformation, exact line search, greedy coordinate descent, and accelerated search, to reduce the number of iterations and to increase the speed of the NNLS solver. These techniques aim to deal with the variable scaling problems and nonnegativity constraints of NNLS, although the combined algorithm's iteration complexity increases several times. In addition, the proposed algorithm has an over-bounded linear convergence rate \([(1-\frac{\mu }{L})(1-\frac{\mu }{nL})^{2n}]^k\) in the subspace of passive variables, where n is the dimension of the solution, and \(\mu \) and L are always bounded as \(\frac{1}{2} \le \mu \le L \le n\).
In addition, we carefully compared the proposed algorithm with state-of-the-art algorithms from different research directions on both synthetic and real datasets. The results clearly show that the proposed algorithm achieves the fastest convergence of both the gradient square over the passive variables and the objective value. Moreover, we investigated the effectiveness of the proposed algorithm in a real application, nonnegative matrix factorization, in which numerous NNLS problems must be approximately solved. The results also indicate that the NMF solver employing the proposed algorithm converges fastest and has the best accuracy in almost all test cases.
Besides these advantages, our proposed algorithm still has several drawbacks, such as computing and storing the Hessian matrix \(A^TA\). Fortunately, in low-rank representation, the Hessian matrix is computed only once, and parallel threads can use the same shared memory. Hence, the proposed algorithm can potentially be applied to low-rank representation models with the Frobenius norm. In future research, we will apply the proposed algorithm to low-rank representation problems, especially nonnegative matrix and tensor factorization.
Acknowledgements
This work was supported by 322 Scholarship from Vietnam Ministry of Education and Training, and Asian Office of Aerospace R&D under Agreement Number FA23861514006.
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
References
 1. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
 2. Bro, R., De Jong, S.: A fast non-negativity-constrained least squares algorithm. J. Chemom. 11(5), 393–401 (1997)
 3. Caramanis, L., Jo, S.J.: EE 381V: large scale optimization, fall 2012. http://users.ece.utexas.edu/~cmcaram/EE381V_2012F/Lecture_4_Scribe_Notes.final.pdf (2012)
 4. Cevher, V., Becker, S., Schmidt, M.: Convex optimization for big data: scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Process. Mag. 31(5), 32–43 (2014)
 5. Chen, D., Plemmons, R.J.: Nonnegativity constraints in numerical analysis. In: Symposium on the Birth of Numerical Analysis, pp. 109–140 (2009)
 6. Dax, A.: On computational aspects of bounded linear least squares problems. ACM Trans. Math. Softw. 17(1), 64–73 (1991)
 7. Franc, V., Hlaváč, V., Navara, M.: Sequential coordinate-wise algorithm for the non-negative least squares problem. In: Gagalowicz, A., Philips, W. (eds.) Computer Analysis of Images and Patterns, CAIP 2005, Versailles, France, September 5–8, 2005, pp. 407–414. Springer, Berlin (2005)
 8. Gill, P.E., Murray, W., Wright, M.H.: Practical Optimization. Academic Press, London (1981)
 9. Gillis, N., Glineur, F.: Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Comput. 24(4), 1085–1105 (2012)
 10. Guan, N., Tao, D., Luo, Z., Yuan, B.: NeNMF: an optimal gradient method for nonnegative matrix factorization. IEEE Trans. Signal Process. 60(6), 2882–2898 (2012)
 11. Hsieh, C.J., Dhillon, I.S.: Fast coordinate descent methods with variable selection for non-negative matrix factorization. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1064–1072. ACM (2011)
 12. Kim, D., Sra, S., Dhillon, I.S.: A new projected quasi-Newton approach for the nonnegative least squares problem. Computer Science Department, University of Texas at Austin (2006)
 13. Kim, D., Sra, S., Dhillon, I.S.: A non-monotonic method for large-scale non-negative least squares. Optim. Methods Softw. 28(5), 1012–1039 (2013)
 14. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
 15. Lawson, C.L., Hanson, R.J.: Solving Least Squares Problems, vol. 161. SIAM, Philadelphia (1974)
 16. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 27, 372–376 (1983)
 17. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. CORE Discussion Papers 2010002, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE) (2010)
 18. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)
 19. Potluru, V.K.: Frugal coordinate descent for large-scale NNLS. In: AAAI (2012)
 20. Schmidt, M., Friedlander, M.: Coordinate descent converges faster with the Gauss–Southwell rule than random selection. In: NIPS OPT-ML Workshop (2014)
 21. Van Benthem, M.H., Keenan, M.R.: Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems. J. Chemom. 18(10), 441–450 (2004)
 22. Zhang, Z.Y.: Nonnegative matrix factorization: models, algorithms and applications. In: Holmes, D.E., Jain, L.C. (eds.) Data Mining: Foundations and Intelligent Paradigms, vol. 2. Springer, Berlin (2012)
 23. Zhou, G., Cichocki, A., Zhao, Q., Xie, S.: Nonnegative matrix and tensor factorizations: an algorithmic perspective. IEEE Signal Process. Mag. 31(3), 54–65 (2014)