A novel update rule of HALS algorithm for nonnegative matrix factorization and Zangwill’s global convergence

Nonnegative Matrix Factorization (NMF) has attracted a great deal of attention as an effective technique for dimensionality reduction of large-scale nonnegative data. Given a nonnegative matrix, NMF aims to obtain two low-rank nonnegative factor matrices by solving a constrained optimization problem. The Hierarchical Alternating Least Squares (HALS) algorithm is a well-known and widely-used iterative method for solving such optimization problems. However, the original update rule used in the HALS algorithm is not well defined. In this paper, we propose a novel well-defined update rule of the HALS algorithm, and prove its global convergence in the sense of Zangwill. Unlike conventional globally-convergent update rules, the proposed one allows variables to take the value of zero and hence can obtain sparse factor matrices. We also present two stopping conditions that guarantee the finite termination of the HALS algorithm. The practical usefulness of the proposed update rule is shown through experiments using real-world datasets.


Introduction
Dimensionality reduction methods for large-scale and high-dimensional data have been actively studied in the fields of machine learning and signal processing because of their diverse applications such as feature extraction and visualization (see [8] and references therein). In recent years, Nonnegative Matrix Factorization (NMF) [2,32] has attracted a great deal of attention as an effective dimensionality reduction method for large-scale nonnegative data, and has been successfully applied to various tasks such as image processing [4,34], acoustic signal processing [14,31], network analysis [18,22,50], mobile sensor calibration [12] and so on. A key difference between NMF and other dimensionality reduction methods such as principal component analysis [51] is that the factor matrices obtained by NMF are nonnegative and tend to be sparse [32]. Thus NMF can learn a parts-based representation of the data [32].
Given an M × N nonnegative matrix X, NMF aims to decompose it into two nonnegative factor matrices W and H of sizes M × K and N × K, respectively, so that WH^T is approximately equal to X, where K is much less than min{M, N} (see Fig. 1). The problem of finding such factor matrices is often formulated as the constrained optimization problem:

minimize f(W, H) = (1/2)‖X − WH^T‖_F^2 subject to W ≥ 0_{M×K}, H ≥ 0_{N×K},   (1)

where ‖·‖_F denotes the Frobenius norm of matrices, and 0_{I×J} is the I × J matrix of all zeros. For matrices P and Q of the same size, P ≥ Q means element-wise inequality. The Frobenius norm can be replaced with one of several alternatives such as the I-divergence [33], the Itakura-Saito divergence [14] and others [52]. Also, one or more regularization terms can be added to the objective function in order to enforce desirable properties on the factor matrices [24,25,40,41]. As with many machine learning methods, the ℓ1-regularization term is often used in NMF.
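As a concrete reference point, the objective of the factorization problem can be sketched in a few lines of NumPy. This is an illustrative sketch only; the 1/2 factor in the objective and the matrix dimensions are assumptions made for this example.

```python
import numpy as np

def nmf_objective(X, W, H):
    """f(W, H) = (1/2) * ||X - W H^T||_F^2 (the 1/2 factor is a convention)."""
    return 0.5 * np.linalg.norm(X - W @ H.T, 'fro') ** 2

# If X itself admits an exact rank-K nonnegative factorization,
# the objective can be driven to zero.
rng = np.random.default_rng(0)
W = rng.random((6, 2))    # M x K, nonnegative
H = rng.random((5, 2))    # N x K, nonnegative
X = W @ H.T               # M x N nonnegative matrix of rank <= K
print(nmf_objective(X, W, H))
```

In general X has no exact low-rank nonnegative factorization, and the optimization problem seeks W and H that make this objective as small as possible.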
Various methods for finding a local optimal solution of the optimization problem (1) have been developed so far. Note that finding a global optimal solution is difficult in general because it is known that (1) is NP-hard [49]. Most of the conventional methods update some or all of the elements of one factor matrix at a time because the objective function f(W, H) is not jointly convex but convex in W or H. For example, the multiplicative update rule (MUR) [33], which is widely known as a simple and easy-to-implement method, alternately updates W and H according to the rule derived from strictly convex functions called the auxiliary functions [33]. An important advantage of the MUR is that the value of the objective function decreases monotonically as long as division by zero does not occur. However, division by zero is certainly possible in the MUR because elements of the factor matrices can become zero. For this reason, convergence of the factor matrices is not guaranteed. In fact, it was shown experimentally that the MUR sometimes fails to converge to a stationary point [19]. To solve this problem, some authors proposed modified MURs [16,35]. For example, Gillis and Glineur [16] proposed to replace all values less than a positive constant ε with ε after updating W and H using the original MUR. Their modified MUR was later proved by Takahashi and Hibi [46] to be globally convergent in the sense of Zangwill [53] (see Definition 1 of the present paper) to a stationary point of the corresponding optimization problem:

minimize f(W, H) subject to W ≥ ε1_{M×K}, H ≥ ε1_{N×K},   (2)

where 1_{I×J} denotes the I × J matrix of all ones. Lin [35] proposed a different kind of modified MUR and proved its global convergence to a stationary point of (1). However, this modified MUR is much more complicated than the one mentioned above, and requires a higher computational cost.

The sets of all real numbers, all nonnegative real numbers and all positive real numbers are denoted by R, R_+ and R_++, respectively. The I × J matrix of all zeros and that of all ones are denoted by 0_{I×J} and 1_{I×J}, respectively.
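The Frobenius-norm MUR mentioned above can be sketched as follows. This is a sketch of the classical Lee-Seung updates, not the modified rules of [16] or [35]; note how a zero entry in a denominator matrix would produce exactly the division-by-zero problem discussed in the text.

```python
import numpy as np

def mur_step(X, W, H):
    """One round of the multiplicative update rule for ||X - W H^T||_F^2.
    Each update multiplies by a nonnegative ratio, so nonnegativity is kept,
    but a zero entry in a denominator causes division by zero."""
    H = H * (X.T @ W) / (H @ (W.T @ W))
    W = W * (X @ H) / (W @ (H.T @ H))
    return W, H

# With a strictly positive random initialization the denominators stay positive
# and the objective value decreases monotonically.
rng = np.random.default_rng(1)
X = rng.random((8, 6))
W, H = rng.random((8, 3)), rng.random((6, 3))
f0 = np.linalg.norm(X - W @ H.T, 'fro')
for _ in range(20):
    W, H = mur_step(X, W, H)
f1 = np.linalg.norm(X - W @ H.T, 'fro')
```

The ε-modification of Gillis and Glineur would simply apply an entrywise floor `np.maximum(·, eps)` after each round.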
For any vector v = (v_1, v_2, ..., v_I)^T ∈ R^I, the ℓ1- and ℓ2-norms of v are denoted by ‖v‖_1 and ‖v‖_2, respectively. The notation [v]_+ represents the vector whose i-th element is given by max{0, v_i} for all i. Similarly, for any vector v ∈ R^I and any constant ε ∈ R_++, the notation [v]_{ε+} represents the vector whose i-th element is given by max{ε, v_i} for all i.
The feasible region of the constrained optimization problem (1) is denoted by F. That is,

F = {(W, H) | W ≥ 0_{M×K}, H ≥ 0_{N×K}}.

We call (W, H) ∈ R^{M×K} × R^{N×K} a stationary point of (1) if it satisfies the Karush-Kuhn-Tucker (KKT) conditions:

W ≥ 0_{M×K},  H ≥ 0_{N×K},
∇_W f(W, H) ≥ 0_{M×K},  ∇_H f(W, H) ≥ 0_{N×K},
W ∘ ∇_W f(W, H) = 0_{M×K},  H ∘ ∇_H f(W, H) = 0_{N×K},

where ∇_W f(W, H) = (WH^T − X)H, ∇_H f(W, H) = (WH^T − X)^T W, and ∘ represents the element-wise product. The set of stationary points of (1) is denoted by S.
Similarly, the feasible region of the constrained optimization problem (2) is denoted by

F_ε = {(W, H) | W ≥ ε1_{M×K}, H ≥ ε1_{N×K}}.

We call (W, H) ∈ R^{M×K} × R^{N×K} a stationary point of (2) if it satisfies the KKT conditions:

W ≥ ε1_{M×K},  H ≥ ε1_{N×K},
∇_W f(W, H) ≥ 0_{M×K},  ∇_H f(W, H) ≥ 0_{N×K},
(W − ε1_{M×K}) ∘ ∇_W f(W, H) = 0_{M×K},  (H − ε1_{N×K}) ∘ ∇_H f(W, H) = 0_{N×K}.

The set of stationary points of (2) is denoted by S_ε. Many iterative algorithms for solving (1) have been proposed so far. Such an algorithm starts with an initial point (W^(0), H^(0)) ∈ F and generates a sequence of points {(W^(t), H^(t))}_{t=0}^∞ ⊂ F that is expected to converge to a stationary point of (1). Following Zangwill [53], we define the global convergence of an iterative algorithm for solving (1) as follows.
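A small numerical check of the KKT conditions of (1) can be sketched as follows. The sketch assumes f(W, H) = (1/2)‖X − WH^T‖_F^2, so that ∇_W f = (WH^T − X)H and ∇_H f = (WH^T − X)^T W; the function name and residual definition are illustrative choices, not the paper's notation.

```python
import numpy as np

def kkt_residual(X, W, H):
    """Largest violation of the KKT conditions of (1) for nonnegative W, H:
    nonnegativity of the gradients plus complementarity W o grad_W f = 0."""
    GW = (W @ H.T - X) @ H        # gradient of (1/2)||X - W H^T||_F^2 w.r.t. W
    GH = (W @ H.T - X).T @ W      # gradient w.r.t. H
    return max(
        max(-GW.min(), 0.0), max(-GH.min(), 0.0),    # gradients must be >= 0
        np.abs(W * GW).max(), np.abs(H * GH).max(),  # complementarity terms
    )

# At an exact nonnegative factorization both gradients vanish,
# so the KKT residual is zero.
rng = np.random.default_rng(2)
W = rng.random((5, 2))
H = rng.random((4, 2))
X = W @ H.T
```

Such a residual is the quantity that the relaxed stopping conditions discussed later effectively drive below a tolerance.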
Definition 1 (Global Convergence) An iterative algorithm for solving (1) is said to be globally convergent to S if any sequence {(W^(t), H^(t))}_{t=0}^∞ ⊂ F generated by the algorithm has at least one convergent subsequence and the limit of any convergent subsequence belongs to S.
Note that Definition 1 does not mean the convergence of the whole sequence {(W^(t), H^(t))}_{t=0}^∞ to a stationary point. Nevertheless, the notion of global convergence as defined above is of great practical importance because the finite termination of the algorithm is guaranteed if we relax the KKT conditions in a proper way and use them as the stopping condition [29,46,47].
Using Zangwill's global convergence theorem [53], we can obtain a theorem that gives a sufficient condition for an iterative algorithm for solving (1) to be globally convergent to S. Before presenting the theorem, we introduce two important notions: point-to-set mappings and their closedness. We consider every iterative algorithm for solving (1) as an iterative process of defining a set of candidate points for the next iteration from the point in the current iteration, and selecting one of the candidate points in some way. Each algorithm is thus characterized by how it defines the set of candidate points, which is represented by a point-to-set mapping from F to its subsets. For point-to-set mappings from F to its subsets, their closedness is defined as follows.
Definition 2 (Closed Mapping) A point-to-set mapping A from F to its subsets is said to be closed on D ⊆ F if, for any sequence {(P^(t), Q^(t))}_{t=0}^∞ ⊂ D converging to (P, Q) ∈ D and any sequence {(U^(t), V^(t))}_{t=0}^∞ such that (U^(t), V^(t)) ∈ A(P^(t), Q^(t)) for all t and {(U^(t), V^(t))}_{t=0}^∞ converges to (U, V), it holds that (U, V) ∈ A(P, Q). It is often the case that the set A(W, H) consists of only one point in F for any (W, H) ∈ F. In this case, A can be considered as a point-to-point mapping from F to itself, and the closedness defined above can be considered as the continuity of A.
Now we are ready to present a theorem that can be obtained as a direct consequence of Zangwill's global convergence theorem [53].
Theorem 1 Let A be the point-to-set mapping from F to its subsets that represents an iterative algorithm for solving (1). If A satisfies the following conditions then the algorithm is globally convergent to S.

1. Any sequence {(W^(t), H^(t))}_{t=0}^∞ generated by the mapping A in such a way that (W^(0), H^(0)) ∈ F and (W^(t+1), H^(t+1)) ∈ A(W^(t), H^(t)) for all t ∈ Z_+ is contained in a compact subset of F.
2. The mapping A does not increase the value of f. To be more specific, for any point (W, H) ∈ F, the following statements hold true.
   (a) If (W, H) ∉ S then f(W', H') < f(W, H) for all (W', H') ∈ A(W, H).
   (b) If (W, H) ∈ S then f(W', H') ≤ f(W, H) for all (W', H') ∈ A(W, H).
3. The mapping A is closed on F \ S.
The global convergence of iterative algorithms for solving (2) and the closedness of point-to-set mappings from F_ε to its subsets can be defined in the same way as above. Also, if we replace F and S in Theorem 1 with F_ε and S_ε, respectively, we obtain a theorem that gives a sufficient condition for algorithms for solving (2) to be globally convergent to S_ε.
Zangwill's global convergence theorem is well known as a powerful framework for proving the global convergence of iterative algorithms. For example, it was used in proving the global convergence of the concave-convex procedure [45], the decomposition method for support vector machines [48], and the modified MUR for NMF [47].

HALS algorithm
In this section, we review the HALS algorithm [6] for solving the optimization problem (1) and some of its variants. We also review their convergence properties.
Let the k-th columns of W and H be denoted by w_k and h_k, respectively. Then the problem (1) is rewritten as follows:

minimize f(W, H) = (1/2)‖X − Σ_{k=1}^K w_k h_k^T‖_F^2 subject to w_k ≥ 0_{M×1}, h_k ≥ 0_{N×1}, k = 1, 2, ..., K.   (5)

The HALS algorithm, which can be viewed as a special case of the block coordinate descent (BCD) method [27], updates the 2K column vectors w_1, w_2, ..., w_K and h_1, h_2, ..., h_K one by one in a fixed order so that the value of the objective function of (5) decreases monotonically. When updating w_k, the HALS algorithm considers all other variables as constants and solves the following subproblem:

minimize p_k(w_k) = (1/2)‖R_k − w_k h_k^T‖_F^2 subject to w_k ≥ 0_{M×1},   (6)

where R_k = X − Σ_{l≠k} w_l h_l^T. If h_k ≠ 0_{N×1}, the objective function p_k(w_k) is strictly convex and its unconstrained minimum is attained at w_k = R_k h_k/‖h_k‖_2^2. Hence the subproblem (6) has the unique optimal solution w_k = [R_k h_k]_+/‖h_k‖_2^2 [27, Theorem 2]. Similarly, when updating h_k, the HALS algorithm considers all other variables as constants and solves the following subproblem:

minimize q_k(h_k) = (1/2)‖R_k − w_k h_k^T‖_F^2 subject to h_k ≥ 0_{N×1}.   (7)

Taking into account the correspondence between the variables and constants in (6) and those in (7), we can say that if w_k ≠ 0_{M×1} then the subproblem (7) has the unique optimal solution h_k = [R_k^T w_k]_+/‖w_k‖_2^2. Based on these analyses, the update rule described by

w_k ← [R_k h_k]_+/‖h_k‖_2^2,   (8)
h_k ← [R_k^T w_k]_+/‖w_k‖_2^2   (9)

is obtained [7,23,27]. In this paper, we call the algorithm based on this update rule the HALS algorithm [7] though it is also called the rank-one residue iteration algorithm [23]. For the HALS algorithm, the following result is known.

Theorem 2 Suppose that the columns of W and H remain nonzero throughout the iterations of the HALS algorithm. Then any limit point of the generated sequence {(W^(t), H^(t))}_{t=0}^∞ is a stationary point of (1).

Note that the global convergence of the HALS algorithm is not guaranteed by this theorem. There are two issues to consider. First, the assumption that the columns of W and H remain nonzero throughout the iterations may not always be valid. Once w_k becomes zero, for example, h_k cannot be updated because the right-hand side of (9) becomes an indeterminate form. Second, even though the assumption is valid, it may occur that the sequence generated by the HALS algorithm has no limit point.
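For concreteness, one sweep of the HALS updates (8) and (9) can be sketched in NumPy. The sketch simply assumes that the columns stay nonzero; if a column became zero, the corresponding division would break down, which is exactly the well-definedness issue raised above.

```python
import numpy as np

def hals_sweep(X, W, H):
    """One sweep of the original HALS updates (8) and (9) over k = 1..K.
    Assumes h_k and w_k stay nonzero so the divisions are defined."""
    K = W.shape[1]
    for k in range(K):
        # R_k = X - sum_{l != k} w_l h_l^T (independent of w_k and h_k)
        Rk = X - W @ H.T + np.outer(W[:, k], H[:, k])
        W[:, k] = np.maximum(Rk @ H[:, k], 0.0) / (H[:, k] @ H[:, k])
        H[:, k] = np.maximum(Rk.T @ W[:, k], 0.0) / (W[:, k] @ W[:, k])
    return W, H

# Small init keeps the residuals (and hence the columns) positive in this demo.
rng = np.random.default_rng(3)
X = rng.random((8, 6))
W, H = 0.1 * rng.random((8, 3)), 0.1 * rng.random((6, 3))
f0 = np.linalg.norm(X - W @ H.T, 'fro')
W, H = hals_sweep(X, W, H)
f1 = np.linalg.norm(X - W @ H.T, 'fro')
```

Each column update is an exact minimization of the corresponding subproblem, so the objective value is nonincreasing over the sweep.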
A simple way to avoid indeterminate forms is to use

w_k ← [R_k h_k/‖h_k‖_2^2]_{ε+},   (10)
h_k ← [R_k^T w_k/‖w_k‖_2^2]_{ε+}   (11)

instead of (8) and (9), where ε is a small positive constant. This update rule was introduced by Cichocki et al. [6] to avoid numerical instability, but was later proved to be globally convergent as shown in the following theorem.

Theorem 3 (Kimura and Takahashi [29]) The HALS algorithm using the update rule described by (10) and (11) is globally convergent to S_ε.
Note that the update rule described by (10) and (11) does not perform NMF but positive matrix factorization [39]. In addition, the limit of any convergent subsequence is not a stationary point of (1) but one of (2) as shown in Theorem 3. Hence this update rule produces only dense factor matrices. One may claim that sparse factor matrices will be obtained if we replace all ε in the factor matrices with zeros and that the pair of the resulting sparse factor matrices will be close to S. However, it is not clear whether this claim always holds true or not.
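A sketch of the ε-floored "positive" rule follows. The exact form of (10) and (11) may differ slightly; here the floor [·]_{ε+} is assumed to be applied entrywise to the HALS solution. Every entry stays at least ε, so the update is always well defined but the factors remain dense.

```python
import numpy as np

def positive_hals_sweep(X, W, H, eps=1e-4):
    """HALS sweep with an entrywise floor at eps (a reading of (10)-(11)).
    No entry can reach zero, so division by zero never occurs,
    but the resulting factor matrices are strictly positive (dense)."""
    K = W.shape[1]
    for k in range(K):
        Rk = X - W @ H.T + np.outer(W[:, k], H[:, k])
        W[:, k] = np.maximum(Rk @ H[:, k] / (H[:, k] @ H[:, k]), eps)
        H[:, k] = np.maximum(Rk.T @ W[:, k] / (W[:, k] @ W[:, k]), eps)
    return W, H

rng = np.random.default_rng(4)
X = rng.random((8, 6))
W, H = 0.1 * rng.random((8, 3)), 0.1 * rng.random((6, 3))
for _ in range(5):
    W, H = positive_hals_sweep(X, W, H)
```

After any number of sweeps every entry of W and H is at least ε, illustrating why this rule performs positive rather than nonnegative matrix factorization.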
Another simple way to avoid indeterminate forms is to use

w_k ← [(R_k h_k + δw_k)/(‖h_k‖_2^2 + δ)]_+,   (12)
h_k ← [(R_k^T w_k + δh_k)/(‖w_k‖_2^2 + δ)]_+   (13)

instead of (8) and (9), where δ is a positive constant. This update rule is derived from auxiliary functions of p_k(w_k) and q_k(h_k) [15]. Details will be shown in the proof of Lemma 2. For this update rule, the following result is known.

Theorem 4 If a sequence {(W^(t), H^(t))}_{t=0}^∞ generated by the update rule described by (12) and (13) has a convergent subsequence, then its limit belongs to S.
Just like Theorem 2 for the original HALS algorithm, Theorem 4 says nothing about the global convergence of the update rule described by (12) and (13) to S. The existence of a limit point is not guaranteed even though the objective function value decreases monotonically along the sequence {(W^(t), H^(t))}_{t=0}^∞ generated by the update rule, because the level set of the objective function f(W, H) is unbounded.

New update rule and its global convergence
In this section, we propose a new update rule of the HALS algorithm and prove that it is globally convergent to S.

Proposed update rule
The update rule we propose in this paper is described by

w_k ← [(R_k h_k + δw_k)/(‖h_k‖_2^2 + δ)]_+,   (14)
w_k ← w_k/‖w_k‖_2 if w_k ≠ 0_{M×1}, and w_k ← u_k otherwise,   (15)
h_k ← [R_k^T w_k]_+,   (16)

where δ is a positive constant and u_k is an arbitrary nonnegative unit vector. It is clear that division by zero never occurs in the proposed update rule. The first formula (14) is the same as (12). The second formula (15) is the normalization procedure for w_k. The third formula (16) is used instead of (9) because ‖w_k‖_2^2 = 1 always holds when h_k is updated. The normalization procedure plays an important role when we prove that any sequence generated by the proposed update rule is contained in a compact subset of F.
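One sweep of the proposed rule can be sketched as follows, assuming (14) takes the damped form [(R_k h_k + δw_k)/(‖h_k‖_2^2 + δ)]_+, (15) normalizes w_k with fallback to a nonnegative unit vector u_k, and (16) is h_k ← [R_k^T w_k]_+, consistent with the description in the text. The choice u_k = (1/√M)1 is just one admissible unit vector.

```python
import numpy as np

def proposed_sweep(X, W, H, delta=1e-8):
    """One sweep of the proposed update rule (14)-(16) (a sketch).
    The damped denominator ||h_k||^2 + delta is always positive, and the
    normalization keeps ||w_k||_2 = 1, so (16) needs no denominator."""
    M, K = W.shape
    for k in range(K):
        Rk = X - W @ H.T + np.outer(W[:, k], H[:, k])
        # (14): damped nonnegative update of w_k, well defined even if h_k = 0
        W[:, k] = np.maximum(Rk @ H[:, k] + delta * W[:, k], 0.0) \
                  / (H[:, k] @ H[:, k] + delta)
        # (15): normalize w_k, falling back to the unit vector u_k if w_k = 0
        nrm = np.linalg.norm(W[:, k])
        if nrm > 0.0:
            W[:, k] /= nrm
        else:
            W[:, k] = np.full(M, 1.0 / np.sqrt(M))   # u_k (an arbitrary choice)
        # (16): since ||w_k||_2 = 1, the optimal h_k is the projected product
        H[:, k] = np.maximum(Rk.T @ W[:, k], 0.0)
    return W, H

rng = np.random.default_rng(5)
X = rng.random((8, 6))
W, H = rng.random((8, 3)), rng.random((6, 3))
f0 = np.linalg.norm(X - W @ H.T, 'fro')
W, H = proposed_sweep(X, W, H)
f1 = np.linalg.norm(X - W @ H.T, 'fro')
```

Because h_k is recomputed optimally for the normalized w_k in (16), the explicit rescaling of h_k (Step 4 of Algorithm 1) is unnecessary in practice, and the objective value is still nonincreasing.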
In this paper, we focus our attention on the case where the columns of W are normalized, but the alternative case where the columns of H are normalized can be dealt with in the same way. Here we should note that the modified MUR [35] also uses a normalization procedure, but it is slightly different from ours. It uses 0_{M×1} instead of u_k in (15).
A formal statement of the proposed update rule is presented in Algorithm 1. Note that Step 4 is added to facilitate the global convergence analysis, though it is not necessary for practical purposes. Note also that Steps 2 and 3, as well as Step 6, can be rewritten for an efficient implementation (see Cichocki and Phan [5] for more details). It is easy to see that the proposed update rule has the same computational complexity per iteration as the original update rule. The following theorem establishes the global convergence of the proposed update rule.

Theorem 5 The HALS algorithm using the update rule shown in Algorithm 1 is globally convergent to S.
This theorem can be proved by using Theorem 1. Details are shown in the next subsection.

Proof of Theorem 5
We prove Theorem 5 by using Theorem 1. Let the point-to-set mapping representing Algorithm 1 be denoted by A. Also, let the point-to-set mappings corresponding to Steps 3, 4, 5 and 6 of Algorithm 1 be denoted by D_W^k, S_H^k, S_W^k and D_H^k, respectively. Then A is expressed as

A = (D_H^K ∘ S_W^K ∘ S_H^K ∘ D_W^K) ∘ ··· ∘ (D_H^1 ∘ S_W^1 ∘ S_H^1 ∘ D_W^1),

Algorithm 1 Proposed Update Rule for NMF
1: Set k ← 1.
2: Compute R_k = X − Σ_{l≠k} w_l h_l^T.
3: w_k ← [(R_k h_k + δw_k)/(‖h_k‖_2^2 + δ)]_+.
4: h_k ← ‖w_k‖_2 h_k.
5: If w_k ≠ 0_{M×1} then w_k ← w_k/‖w_k‖_2; otherwise set w_k to an arbitrary nonnegative unit vector u_k.
6: h_k ← [R_k^T w_k]_+.
7: If k = K then output (W, H) and stop. Otherwise set k ← k + 1 and go to Step 2.
and the mapping S_W^k(W, H) is given by

S_W^k(W, H) = {(W', H) | w'_k = w_k/‖w_k‖_2, w'_l = w_l for all l ≠ k} if w_k ≠ 0_{M×1},
S_W^k(W, H) = {(W', H) | w'_k is a nonnegative unit vector, w'_l = w_l for all l ≠ k} otherwise.

Note that the set D_W^k(W, H) consists of only one point in F, which is represented as a continuous function of (W, H). The same can be said for S_H^k(W, H) and D_H^k(W, H). We now prove that the proposed update rule satisfies the second condition in Theorem 1. Let us begin with the definition and an important property of the auxiliary function [33] because it plays an important role in our proof.

Definition 3 (Auxiliary Function [33]) For a function g : R_+ → R, a two-variable function ḡ : R_+ × R_+ → R is called an auxiliary function of g if the following conditions hold:

ḡ(x, x) = g(x) and g(x) ≤ ḡ(x, y)

for all x, y ∈ R_+.

Lemma 1 Let ḡ : R_+ × R_+ → R be an auxiliary function of g : R_+ → R. If the inequality ḡ(a, b) ≤ ḡ(b, b) holds for nonnegative numbers a and b then g(a) ≤ g(b). In particular, if ḡ(a, b) < ḡ(b, b) then g(a) < g(b).

Proof We have g(a) ≤ ḡ(a, b) ≤ ḡ(b, b) = g(b). The first inequality follows from the second condition in Definition 3 and the equality follows from the first condition in Definition 3. If ḡ(a, b) is strictly less than ḡ(b, b) then the second inequality above is strict, and hence g(a) < g(b).

Using Lemma 1, we obtain the following three lemmas.

Lemma 2
The objective function f (W , H) is nonincreasing under the proposed update rule shown in Algorithm 1.
Proof The objective function is nonincreasing under the composite mapping S_W^k ∘ S_H^k for all k because the value of w_k h_k^T does not change before and after the composite mapping is performed. Also, the objective function is nonincreasing under D_H^k for all k because h_k = [R_k^T w_k]_+ is the unique optimal solution of (7) when ‖w_k‖_2 = 1. So it suffices for us to show that the objective function is nonincreasing under D_W^k for all k.
We thus consider all variables other than w_k as constants, and show that the value of p_k(w_k), the objective function of (6), does not increase. Note that p_k(w_k) is rewritten as

p_k(w_k) = Σ_{m=1}^M p_mk(w_mk) + const,   (17)

where

p_mk(x) = (1/2)‖h_k‖_2^2 x^2 − (r_m h_k)x   (18)

and r_m is the m-th row of R_k. For the function p_mk(x), we define a two-variable function p̄_mk(x, y) as follows:

p̄_mk(x, y) = p_mk(y) + p'_mk(y)(x − y) + ((‖h_k‖_2^2 + δ)/2)(x − y)^2,   (19)

where δ is the positive constant used in Algorithm 1. It is clear that p̄_mk(x, y) is an auxiliary function of p_mk(x) and strongly convex in both x and y (but not jointly) [15,42]. For each value of y, the minimum point x* of p̄_mk(x, y) in R_+ is uniquely determined as

x* = [y − p'_mk(y)/(‖h_k‖_2^2 + δ)]_+ = [(r_m h_k + δy)/(‖h_k‖_2^2 + δ)]_+.   (20)

Therefore, by Lemma 1, we have p_mk(x*) ≤ p_mk(y). Substituting y = w_mk into this inequality, we see that the value of p_mk does not increase under the update (14), from which we see that the value of p_k(w_k) does not increase either. This means that f(W, H) is nonincreasing under D_W^k.

Lemma 3 (W*, H*) ∈ F is a stationary point of (1) if and only if w*_k is a stationary point of (6) with h_k = h*_k for k = 1, 2, ..., K and h*_k is a stationary point of (7) with w_k = w*_k for k = 1, 2, ..., K.
Proof We omit the proof because it is similar to [29,Lemma 3].

Lemma 4
For any (W, H) ∈ F, the following statements hold true.
1. If (W, H) ∉ S then f(W', H') < f(W, H) for all (W', H') ∈ A(W, H).
2. If (W, H) ∈ S then f(W', H') ≤ f(W, H) for all (W', H') ∈ A(W, H).
Proof It is clear from Lemma 2 that the second statement holds true. Thus we only have to consider the first statement. Let (W , H) be any point in F \ S. It follows from Lemma 3 that there exists at least one k such that i) w k is not a stationary point of (6) or ii) h k is not a stationary point of (7).
In the first case, there exists at least one m such that w_mk is not a minimum point of p_mk(x) in R_+, where p_mk is given by (18). For such an m, the auxiliary function p̄_mk(x, y) of p_mk(x), which is given by (19), satisfies

(∂p̄_mk/∂x)(w_mk, w_mk) = p'_mk(w_mk),

which is negative if w_mk = 0 and nonzero if w_mk > 0. This means that x = w_mk is not the unique minimum point of p̄_mk(x, w_mk). Hence p̄_mk(x*, w_mk) < p̄_mk(w_mk, w_mk), where x* is the unique minimum point given by (20). From this inequality and Lemma 1, we have p_mk(x*) < p_mk(w_mk). Therefore, f(W, H) strictly decreases under the mapping A.
In the second case, we can show in the same way as above that f (W , H) strictly decreases under the mapping A.
We next prove that the proposed update rule satisfies the first condition in Theorem 1. To do so, for any point (W^(0), H^(0)) in F, we define the set L_{(W^(0),H^(0))} as follows:

L_{(W^(0),H^(0))} = {(W, H) ∈ F | f(W, H) ≤ f(W^(0), H^(0)), ‖w_k‖_2 = 1 for k = 1, 2, ..., K}.

Note that this is not a level set of f because of the conditions that ‖w_k‖_2 = 1 for all k. The next lemma shows the boundedness of this set.

Lemma 5 The set L_{(W^(0),H^(0))} is bounded.

Proof Let (W, H) be any point in L_{(W^(0),H^(0))}. It suffices for us to show that ‖h_k‖_2 is bounded for k = 1, 2, ..., K. Because q_k(h_k) is convex, the inequality q_k(v) ≥ q_k(h_k) + ∇q_k(h_k)^T(v − h_k) holds for any v ∈ R^N [3].
This completes the proof.
Using Lemma 5, we obtain the following lemma.

Lemma 6 Any sequence {(W^(t), H^(t))}_{t=0}^∞ generated by Algorithm 1 is contained in a compact subset of F.

Proof
We easily see from Step 5 of Algorithm 1 that ‖w_k^(t)‖_2 = 1 for all k and all t ∈ Z_++, and from Lemma 2 that f(W^(t), H^(t)) ≤ f(W^(0), H^(0)) for all t ∈ Z_+. Therefore (W^(t), H^(t)) ∈ L_{(W^(0),H^(0))} for all t ∈ Z_++. Because L_{(W^(0),H^(0))} is bounded as shown in Lemma 5, the sequence {(W^(t), H^(t))}_{t=0}^∞ is contained in a compact subset of F.

We finally prove that the proposed update rule satisfies the third condition in Theorem 1. The next lemma shows the closedness of the point-to-set mappings S_W^1, S_W^2, ..., S_W^K.

Lemma 7 The point-to-set mappings S_W^1, S_W^2, ..., S_W^K are closed on F.
Proof Let {(W^(t), H^(t))}_{t=0}^∞ ⊂ F be any sequence converging to (W^(∞), H^(∞)) ∈ F, and let {(U^(t), V^(t))}_{t=0}^∞ be any convergent sequence such that (U^(t), V^(t)) ∈ S_W^k(W^(t), H^(t)) for all t ∈ Z_+. Let (U^(∞), V^(∞)) be the limits of these two sequences. It is clear from the definition of S_W^k that u_l^(t) = w_l^(t) for all l ≠ k and t ∈ Z_+, and V^(t) = H^(t) for all t ∈ Z_+. We first consider the case where w_k^(∞) ≠ 0_{M×1}. In this case, S_W^k is single-valued and continuous in a neighborhood of (W^(∞), H^(∞)), and {(U^(t), V^(t))}_{t=0}^∞ converges to S_W^k(W^(∞), H^(∞)). We next consider the case where w_k^(∞) = 0_{M×1}. In this case, any nonnegative unit vector is allowed as the k-th column of U^(∞), and hence (U^(∞), V^(∞)) ∈ S_W^k(W^(∞), H^(∞)).

We next define three subsets L^1_{(W^(0),H^(0))}, L^2_{(W^(0),H^(0))} and L^3_{(W^(0),H^(0))} of F, built from bounds μ_k on ‖w_k‖_2 and ν_k on ‖h_k‖_2 for k = 1, 2, ..., K, where σ_max(X) is the largest singular value of X. It is clear that all of the three sets defined above are compact subsets of F. It is also clear that they contain the generated sequence. Furthermore, the following lemma holds.

Lemma 8
The following statements are true for k = 1, 2, ..., K.

Proof We first prove the first statement. Suppose that (W, H) ∈ L^1_{(W^(0),H^(0))}. Then ‖w_k‖_2 ≤ μ_k and ‖h_k‖_2 ≤ ν_k hold. Using these inequalities, we can bound the updated columns. We next prove the second statement. Suppose that (W, H) ∈ L^2_{(W^(0),H^(0))}. Then ‖w_k‖_2 ≤ σ_max(X)ν_k/δ + μ_k and ‖h_k‖_2 ≤ ν_k hold. Using these inequalities, we can bound the updated columns in the same way. The third statement is clear from the definition of the point-to-set mapping S_W^k, and the fourth statement is clear from the proof of Lemma 5.

Proof It is clear that the composite mapping D_H^k ∘ S_W^k ∘ S_H^k ∘ D_W^k is closed on its domain for all k. Furthermore, since L^1_{(W^(0),H^(0))} is a compact subset of F, by [53, Corollary 4.2.1], we can conclude that A, which is a composition of these mappings, is closed on its domain.
We should note that even if u_k in Step 5 of Algorithm 1 is replaced with a constant nonnegative unit vector such as (1/√M)1_{M×1} or (1, 0, 0, ..., 0)^T, we can prove Theorem 5 without changing the definition of the mapping S_W^k.

Stopping conditions
We have proved that the HALS algorithm using the proposed update rule shown in Algorithm 1 is globally convergent to S in the sense of Definition 1. Therefore, combining this update rule with an appropriate stopping condition, we can design an algorithm that always stops in a finite number of iterations. In this section, we consider two approaches for deriving stopping conditions.
Relaxed KKT conditions

The first approach is to relax the KKT conditions using two positive constants κ_1 and κ_2; the relaxed conditions are described by (21) and (22). The HALS algorithm for NMF using Algorithm 1 and the stopping condition described by (21) and (22) is shown in Algorithm 2. For this algorithm, the following theorem holds. The proof is omitted because it is similar to that of Theorem 2 in [46].

Projected gradient norm
The second approach is to make use of the projected gradient [36]. To be more specific, the inequality

ψ_{τ2}(W, H) ≤ τ_1   (23)

is used as the stopping condition, where τ_1 and τ_2 are positive constants, and ψ_{τ2}(W, H) is defined as

ψ_{τ2}(W, H) = (‖G^W_{τ2}(W, H)‖_F^2 + ‖G^H_{τ2}(W, H)‖_F^2)^{1/2}.

The notations G^W_{τ2}(W, H) and G^H_{τ2}(W, H) denote modified projected gradients with respect to W and H, respectively, which are defined by

(G^W_{τ2}(W, H))_{mk} = (∇_W f(W, H))_{mk} if w_mk > τ_2, and min{0, (∇_W f(W, H))_{mk}} otherwise,
(G^H_{τ2}(W, H))_{nk} = (∇_H f(W, H))_{nk} if h_nk > τ_2, and min{0, (∇_H f(W, H))_{nk}} otherwise.

Note that our definition of the projected gradient is slightly different from the one used in the literature [20,26,27,36], which corresponds to the case where τ_2 = 0. It is clear that if (W, H) is a stationary point of (1) then (23) is satisfied because ψ_{τ2}(W, H) = 0 holds. Therefore, (23) is considered as relaxed KKT conditions. The proposed HALS algorithm for NMF using Algorithm 1 and the stopping condition (23) is shown in Algorithm 3. For this algorithm, the following theorem holds.
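A sketch of ψ_{τ2} under one reading of the modified projected gradient follows (τ_2 = 0 recovers the standard definition; the gradient formulas assume f(W, H) = (1/2)‖X − WH^T‖_F^2). The function name `psi` is an illustrative choice.

```python
import numpy as np

def psi(X, W, H, tau2=0.0):
    """Norm of the modified projected gradients G^W_{tau2}, G^H_{tau2}."""
    GW = (W @ H.T - X) @ H       # gradient w.r.t. W
    GH = (W @ H.T - X).T @ W     # gradient w.r.t. H
    # Keep the full gradient where the variable exceeds tau2;
    # keep only the negative part where it is at (or near) the bound.
    PW = np.where(W > tau2, GW, np.minimum(GW, 0.0))
    PH = np.where(H > tau2, GH, np.minimum(GH, 0.0))
    return np.sqrt(np.sum(PW**2) + np.sum(PH**2))

# At an exact factorization both gradients vanish, so psi = 0.
rng = np.random.default_rng(6)
W = rng.random((5, 2))
H = rng.random((4, 2))
X = W @ H.T
```

The stopping test then amounts to checking `psi(X, W, H, tau2) <= tau1` after each iteration.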
Let us define a positive constant μ in terms of the initial point (W^(0), H^(0)). Because ∇_W f(W, H) and ∇_H f(W, H) are continuous on F, the following statements hold true: for every (n, k) such that (∇_H f(W^(∞), H^(∞)))_{nk} = 0, there exists a positive integer I^H_{nk} such that the corresponding inequality holds for all i ≥ I, and likewise for the entries of ∇_W f. This means that the stopping condition (23) holds in a finite number of iterations. However, this contradicts the assumption that Algorithm 3 does not stop.

Numerical experiments
In order to examine the practical performance of the proposed update rule, the authors conducted numerical experiments using two real-world datasets: Olivetti 1 and CLUTO 2 (tr41). The former is a dataset of face images, and the latter is a dataset of documents. The statistics of these two datasets are shown in Table 1. In the experiments, two global-convergence-guaranteed update rules were applied to the nonnegative matrices obtained from the datasets. One is Algorithm 1 (denoted as 'proposed') and the other is the update rule described by (10) and (11) (denoted as 'positive'). These two update rules are compared in terms of the evolution of the objective function value, the number of unsatisfied inequalities in the relaxed KKT conditions, and the characteristics of the obtained factor matrices. The experimental setup is shown in Table 2. The value of δ in the proposed update rule is set to 10^-8 in all experiments, while the value of ε in the positive one is set to 10^-4 or 10^-8 depending on the experiment. The iteration is terminated when the stopping condition described by (21) and (22) is satisfied or the number of iterations reaches 500. The values of κ_1 and κ_2 in the stopping condition are set to 1.0 and 2ε, respectively, in all experiments. Note that the finite termination of the positive update rule is guaranteed if κ_2 is greater than ε. This can be proved in the same way as Theorem 7 (see [29] for details). Three different initial solutions are generated for each dataset in such a way that each element is drawn from independent uniform distributions on the intervals [0, 1], [0, 0.5] and [0, 0.25], which are called the 'large', 'medium' and 'small' initial solutions, respectively.
Results of Experiment 1 are summarized in Fig. 2 and Table 3. Figure 2 shows the evolution of the objective function value and the number of unsatisfied inequalities in (21) and (22). We easily see from the figure that the two update rules decrease the objective function value in a similar way until one of them satisfies the stopping condition. In contrast, the behavior of these update rules with respect to the number of unsatisfied inequalities is quite different. The proposed update rule decreases the number at a similar rate for all the initial solutions, and satisfies the stopping condition between 200 and 300 iterations. This is because the normalization process is included in the proposed update rule. The positive update rule decreases the number very slowly, and cannot satisfy the stopping condition within 500 iterations for the large and medium initial solutions, while for the small initial solution it decreases the number very fast and satisfies the stopping condition in less than 100 iterations. Table 3 shows the characteristics of the solutions obtained by the proposed and positive update rules. Some important facts are observed in this table. The first one is that a small objective function value does not necessarily mean that the number of unsatisfied inequalities is small. In fact, the solution obtained by the positive update rule for the large initial solution gives the smallest objective function value and the largest number of unsatisfied inequalities. Also, the solution obtained by the positive update rule for the small initial solution gives the largest objective function value but satisfies all the inequalities. The second fact is that about a quarter of the variables are at the lower bound in all cases. Hence the solutions obtained by the proposed update rule are sparse because the lower bound is zero. In contrast, the solutions obtained by the positive update rule are dense because the lower bound is a positive constant ε.
The third fact is that the replacement of all ε with zero in each solution obtained by the positive update rule increases the number of unsatisfied inequalities. In particular, the replacement changes a solution that satisfies the stopping condition into another one that does not. It is thus not always possible to find a sparse solution that satisfies the relaxed KKT conditions using the positive update rule, while we can always do so using the proposed update rule. This is an advantage of the proposed update rule over the positive one.
Results of Experiment 2 are summarized in Fig. 3 and Table 4 just like Experiment 1. The evolution of the objective function value and the number of unsatisfied inequalities in (21) and (22) shown in Fig. 3 is similar to that in Experiment 1 (see Fig. 2), though the values of ε and κ_2 are quite different. The characteristics shown in Table 4 are similar to those in Table 3, but there is one important difference. The number of unsatisfied inequalities is zero both before and after the replacement of all ε with zero in the solution obtained by the positive update rule for the small initial solution. This indicates that we can find a sparse solution that satisfies the relaxed KKT conditions using the positive update rule in this case. Results of Experiment 3 are summarized in Fig. 4 and Table 5. The evolution of the objective function value and the number of unsatisfied inequalities in (21) and (22) shown in Fig. 4 is similar to that in Experiment 1 (see Fig. 2), though the dataset is different. The characteristics shown in Table 5 are also similar to those in Table 3, but there are two main differences. One is that a solution with a smaller objective function value satisfies more inequalities in (21) and (22). The other concerns the number of variables at the lower bound. Results of Experiment 4 are summarized in Fig. 5 and Table 6. As for the proposed update rule, the evolution of the objective function value and the number of unsatisfied inequalities in (21) and (22) is similar to that in Experiment 3 (see Fig. 4). In contrast, the behavior of the positive update rule differs, as shown in Table 6. In addition, the number of unsatisfied inequalities is not affected by the replacement of all ε with zero for any of the solutions obtained by the positive update rule. This indicates that we can find a sparse solution that satisfies the relaxed KKT conditions using the positive update rule in these cases as well.

Applicability of proposed update rule to variants of HALS algorithm
In this section, we introduce some variants of the HALS algorithm to which our update rule can be applied in order to guarantee the well-definedness and/or the global convergence. The first one is the accelerated HALS algorithm [17]. The idea behind this algorithm is very simple.
In each iteration, W is updated several times while H is fixed, and then H is updated several times while W is fixed. It was shown through experiments using image and text datasets that this algorithm significantly outperforms the original HALS algorithm [17]. Now we claim that the global convergence of this algorithm is guaranteed if Algorithm 1 is incorporated into it. In each iteration, the algorithm updates all columns of W several times using the update rule in Step 3, next updates all columns of W and H using Steps 4 and 5 once, and then updates all columns of H several times using the update rule in Step 6. The second one is the fast coordinate descent algorithm with variable selection [26]. In each iteration, M rows of W are updated one by one and then N rows of H are updated one by one. Each row of W or H is updated by repeating the following two steps until some condition is satisfied: i) selection of one element based on the potential decrease in the objective function value, and ii) update of the selected element. It was shown through experiments using synthetic and real-world datasets that this algorithm is considerably faster than conventional algorithms [26]. Again, we claim that the global convergence of this algorithm is guaranteed if Algorithm 1 is incorporated into it. To be more specific, when each row of W or H is updated, the update rule in Step 3 or Step 6 of Algorithm 1 can be used for both the computation of the potential decrease in the objective function value and the update of the selected element. One important point is that the normalization procedure in Steps 4 and 5 should be done between the update of M rows of W and the update of N rows of H.
The third one is the randomized HALS algorithm [13], which is based on the probabilistic framework for low-rank approximations [21]. In the first step, this algorithm constructs a surrogate matrix B ∈ R^{L×N} with K < L ≪ M as follows. First, X is multiplied by a random matrix Ω ∈ R^{N×L} to get Y = XΩ. Next, a matrix Q ∈ R^{M×L} with orthonormal columns is obtained by performing the QR-decomposition of Y. Finally, the surrogate matrix is obtained by B = Q^T X. The surrogate matrix B obtained in this way is expected to capture the essential information of X. In the next step, this algorithm solves an optimization problem of the same form as (1), with X replaced by the surrogate matrix B, by an iterative algorithm very similar to the HALS algorithm. It was shown through experiments using hand-written digit and face image datasets that the randomized HALS algorithm has a substantially lower computational cost than the deterministic one, and attains almost the same reconstruction error as the deterministic one [13]. The technique used in our update rule can be easily applied to this algorithm in order to ensure that it is well-defined.
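The surrogate construction just described can be sketched as follows. When rank(X) ≤ L, the sketch QB recovers X almost surely, which is why B can stand in for X; the function name and dimensions are illustrative.

```python
import numpy as np

def surrogate(X, L, rng):
    """Randomized sketch: Y = X Omega, Q from QR(Y), B = Q^T X (L x N)."""
    N = X.shape[1]
    Omega = rng.standard_normal((N, L))   # random test matrix
    Y = X @ Omega                         # M x L sample of the range of X
    Q, _ = np.linalg.qr(Y)                # M x L, orthonormal columns
    return Q, Q.T @ X                     # surrogate B = Q^T X

# For a matrix of rank 2 and L = 4, the projection Q Q^T captures
# the whole range of X, so Q B reconstructs X (almost surely).
rng = np.random.default_rng(7)
A = rng.random((30, 2))
C = rng.random((2, 10))
X = A @ C                                 # rank-2 nonnegative matrix
Q, B = surrogate(X, L=4, rng=rng)
```

The HALS-type iteration is then run on the much smaller L × N matrix B, and the factor for X is recovered through Q.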
In addition to these three, there are many other algorithms to which our update rule can be applied. One example is the distributed HALS algorithm for multiagent networks [10]. This algorithm is based on the update rule given by (10) and (11) to guarantee the global convergence. By using our update rule instead, this algorithm can find a stationary point of the original optimization problem (1).

Conclusions
In this paper, we have proposed a novel update rule of the HALS algorithm for NMF, and proved its global convergence using Zangwill's global convergence theorem. The proposed update rule has the same computational complexity per iteration as the update rule in the original HALS algorithm. In addition, unlike the global-convergence-guaranteed update rules in the literature [29,30], the proposed update rule does not restrict the range of each variable to a subset of R_++. This allows us to obtain sparse factor matrices. We have also given two types of stopping conditions and proved the finite termination of the proposed update rule combined with these stopping conditions.
One future direction of this work is to extend our results to Nonnegative Tensor Factorization (NTF) [7,55], which is expected to be used in various applications such as recommender systems [56] but whose global convergence properties have not yet been analyzed in depth.