Abstract
Nonnegative Matrix Factorization (NMF) has attracted a great deal of attention as an effective technique for dimensionality reduction of large-scale nonnegative data. Given a nonnegative matrix, NMF aims to obtain two low-rank nonnegative factor matrices by solving a constrained optimization problem. The Hierarchical Alternating Least Squares (HALS) algorithm is a well-known and widely-used iterative method for solving such optimization problems. However, the original update rule used in the HALS algorithm is not well defined. In this paper, we propose a novel well-defined update rule of the HALS algorithm, and prove its global convergence in the sense of Zangwill. Unlike conventional globally-convergent update rules, the proposed one allows variables to take the value of zero and hence can obtain sparse factor matrices. We also present two stopping conditions that guarantee the finite termination of the HALS algorithm. The practical usefulness of the proposed update rule is shown through experiments using real-world datasets.
1 Introduction
Dimensionality reduction methods for large-scale and high-dimensional data have been actively studied in the fields of machine learning and signal processing because of their diverse applications such as feature extraction and visualization (see [8] and references therein). In recent years, Nonnegative Matrix Factorization (NMF) [2, 32] has attracted a great deal of attention as an effective dimensionality reduction method for large-scale nonnegative data, and has been successfully applied to various tasks such as image processing [4, 34], acoustic signal processing [14, 31], network analysis [18, 22, 50], mobile sensor calibration [12] and so on. A key difference between NMF and other dimensionality reduction methods such as the principal component analysis [51] is that the factor matrices obtained by NMF are nonnegative and tend to be sparse [32]. Thus NMF can learn a parts-based representation of the data [32].
Given an \(M \times N\) nonnegative matrix \(\varvec{X}\), NMF aims to decompose it into two nonnegative factor matrices \(\varvec{W}\) and \(\varvec{H}\) of sizes \(M \times K\) and \(N \times K\), respectively, so that \(\varvec{W}\varvec{H}^\mathrm {T}\) is approximately equal to \(\varvec{X}\), where K is much less than \(\min \{M, N\}\) (see Fig. 1). The problem of finding such factor matrices is often formulated as the constrained optimization problem:
where \(\Vert \cdot \Vert _\mathrm {F}\) denotes the Frobenius norm of matrices, and \(\varvec{0}_{I \times J}\) is the \(I \times J\) matrix of all zeros. For matrices \(\varvec{P}\) and \(\varvec{Q}\) of the same size, \(\varvec{P} \ge \varvec{Q}\) means element-wise inequality. The Frobenius norm can be replaced with one of several alternatives such as the I-divergence [33], the Itakura-Saito divergence [14] and others [52]. Also, one or more regularization terms can be added to the objective function in order to enforce desirable properties on the factor matrices [24, 25, 40, 41]. As with many machine learning methods, the \(\ell _1\)-regularization term is often used in NMF.
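As a concrete illustration (not part of the original formulation), the objective of (1) and its \(\ell _1\)-regularized variant can be evaluated in a few lines of NumPy. The function name `nmf_objective` is ours, and the \(1/2\) scale factor that some authors attach to the squared Frobenius norm is omitted, since conventions vary.

```python
import numpy as np

def nmf_objective(X, W, H, l1=0.0):
    """Squared Frobenius error ||X - W H^T||_F^2, optionally with an
    l1 penalty on both factors (the regularized variant mentioned above).
    Some authors include a 1/2 factor; it is omitted here."""
    residual = X - W @ H.T
    value = np.sum(residual ** 2)
    if l1 > 0.0:
        value += l1 * (np.sum(W) + np.sum(H))  # W, H are nonnegative
    return value

rng = np.random.default_rng(0)
X = rng.random((6, 5))   # M x N nonnegative data
W = rng.random((6, 2))   # M x K factor
H = rng.random((5, 2))   # N x K factor
print(nmf_objective(X, W, H))
```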
Various methods for finding a local optimal solution of the optimization problem (1) have been developed so far. Note that finding a global optimal solution is difficult in general because it is known that (1) is NP-hard [49]. Most of the conventional methods update some or all of the elements of one factor matrix at a time because the objective function \(f(\varvec{W},\varvec{H})\) is not jointly convex but convex in \(\varvec{W}\) or \(\varvec{H}\). For example, the multiplicative update rule (MUR) [33], which is widely known as a simple and easy-to-implement method, alternately updates \(\varvec{W}\) and \(\varvec{H}\) according to the rule derived from strictly convex functions called the auxiliary functions [33]. An important advantage of the MUR is that the value of the objective function decreases monotonically as long as division by zero does not occur. However, division by zero is certainly possible in the MUR because elements of the factor matrices can become zero. For this reason, convergence of the factor matrices is not guaranteed. In fact, it was shown experimentally that the MUR sometimes fails to converge to a stationary point [19]. To solve this problem, some authors proposed modified MURs [16, 35]. For example, Gillis and Glineur [16] proposed to replace all values less than a positive constant \(\epsilon \) with \(\epsilon \) after updating \(\varvec{W}\) and \(\varvec{H}\) using the original MUR. Their modified MUR was later proved by Takahashi and Hibi [46] to be globally convergent in the sense of Zangwill [53] (see Definition 1 of the present paper) to a stationary point of the corresponding optimization problem:
where \(\varvec{1}_{I \times J}\) denotes the \(I \times J\) matrix of all ones. Lin [35] proposed a different kind of modified MUR and proved its global convergence to a stationary point of (1). However, this modified MUR is much more complicated than the one mentioned above, and requires a higher computational cost.
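The modified MUR of Gillis and Glineur described above can be sketched as follows. This is an illustrative NumPy rendering (the function name `modified_mur_step` is ours) of the standard Frobenius-norm multiplicative updates followed by the \(\epsilon \)-clipping.

```python
import numpy as np

def modified_mur_step(X, W, H, eps=1e-9):
    """One sweep of the multiplicative update rule for the Frobenius-norm
    NMF objective, followed by the modification of Gillis and Glineur [16]:
    every entry below eps is replaced by eps, so no entry ever reaches
    zero and division by zero cannot occur in later sweeps."""
    W = W * (X @ H) / (W @ (H.T @ H))
    W = np.maximum(W, eps)
    H = H * (X.T @ W) / (H @ (W.T @ W))
    H = np.maximum(H, eps)
    return W, H
```

Iterating this step from a strictly positive initial point keeps the objective nonincreasing up to the (tiny) effect of the clipping, and every entry of the factors stays at least \(\epsilon \).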
Another well-known method for solving (1) is the Hierarchical Alternating Least Squares (HALS) algorithm [6, 7], which is much faster than the MUR in many cases, and much simpler than other fast algorithms [17, 20, 26, 28, 36, 54]. The HALS algorithm updates one column of the factor matrices at a time according to the rule derived from the partial derivative of the objective function with respect to the column. The value of the objective function decreases monotonically if the columns of the factor matrices remain nonzero throughout the iterations [27]. However, as with the MUR, elements of the factor matrices can become zero and this may cause division by zero. To solve this problem, some authors proposed modified update rules for the HALS algorithm [6, 15]. The one proposed by Cichocki et al. [6] takes the same approach as the modified MURs [16]. It replaces all values less than a positive constant \(\epsilon \) with \(\epsilon \) after updating each column of the factor matrices using the original update rule. Although the global convergence to a stationary point of (2) has been proved [29], this update rule cannot obtain sparse factor matrices for the same reason as stated above. In contrast, the update rule given by Gillis [15] not only allows variables to be zero but also avoids division by zero. Furthermore, the value of the objective function decreases monotonically under this update rule. However, the global convergence to a stationary point of (1) is not guaranteed because the level set of the objective function is unbounded.
In this paper, we propose a novel update rule for the HALS algorithm, and prove its global convergence to a stationary point of (1) using Zangwill’s global convergence theorem [53]. The proposed update rule is a combination of the original update rule, the update rule of Gillis [15] and a normalization step. The normalization step is elaborately designed to guarantee not only the boundedness of variables but also the closedness of the point-to-set mapping representing the proposed update rule. We also present two stopping conditions that guarantee the finite termination of the HALS algorithm using the proposed update rule. In addition, the practical usefulness of the proposed update rule is shown through experiments using real-world datasets.
There are many variants of NMF. For example, variants with additional constraints such as orthogonality [9], symmetry [37] and separability [1, 11, 43] have been extensively studied. These variants are important not only from a theoretical viewpoint but also in practice; they have many applications in document clustering, community detection, dictionary learning and so on. However, we do not consider them in this paper because each requires its own specialized algorithms.
The remainder of this paper is organized as follows. In Sect. 2, notations and definitions used in later sections are presented. In Sect. 3, the conventional update rules of the HALS algorithm and their convergence property are reviewed. In Sect. 4, a novel update rule of the HALS algorithm is proposed and its global convergence is proved. In Sect. 5, two stopping conditions are presented and the finite termination of the HALS algorithm using these stopping conditions is proved. In Sect. 6, some experimental results are presented to show the practical usefulness of the proposed update rule. Section 7 introduces some variants of the HALS algorithm to which the proposed update rule can be applied. Section 8 concludes this work and discusses a possible future direction.
2 Notations and definitions
The sets of integers, nonnegative integers, and positive integers are denoted by \({\mathbb {Z}}\), \({\mathbb {Z}}_{+}\) and \({\mathbb {Z}}_{++}\), respectively. Similarly, the sets of real numbers, nonnegative real numbers, and positive real numbers are denoted by \({\mathbb {R}}\), \({\mathbb {R}}_{+}\) and \({\mathbb {R}}_{++}\), respectively. The \(I \times J\) matrix of all zeros and that of all ones are denoted by \(\varvec{0}_{I \times J}\) and \(\varvec{1}_{I \times J}\), respectively.
For any vector \(\varvec{v}=(v_1,v_2,\ldots ,v_I)^\mathrm {T} \in {\mathbb {R}}^I\), \(\ell _1\)- and \(\ell _2\)-norms of \(\varvec{v}\) are denoted by \(\Vert \varvec{v}\Vert _1\) and \(\Vert \varvec{v}\Vert _2\), respectively. The notation \([\varvec{v}]_{+}\) represents the vector of which the i-th element is given by \(\max \{0,v_i\}\) for all i. Similarly, for any vector \(\varvec{v} \in {\mathbb {R}}^I\) and any constant \(\epsilon \in {\mathbb {R}}_{++}\), the notation \([\varvec{v}]_{\epsilon +}\) represents the vector of which the i-th element is given by \(\max \{\epsilon ,v_i\}\) for all i.
The feasible region of the constrained optimization problem (1) is denoted by \({\mathcal {F}}\). That is, \({\mathcal {F}}={\mathbb {R}}_{+}^{M \times K} \times {\mathbb {R}}_{+}^{N \times K}\). We call \((\varvec{W},\varvec{H}) \in {\mathbb {R}}^{M \times K} \times {\mathbb {R}}^{N \times K}\) a stationary point of (1) if it satisfies the Karush-Kuhn-Tucker (KKT) conditions:
where
and \(\odot \) represents the element-wise product. The set of stationary points of (1) is denoted by \({\mathcal {S}}\).
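Since relaxed KKT conditions later serve as stopping conditions, it is useful to be able to measure how far a pair \((\varvec{W},\varvec{H})\) is from \({\mathcal {S}}\). The following sketch is our own illustration (the name `kkt_residual` and the gradient scale are assumptions, taking \(f(\varvec{W},\varvec{H})=\Vert \varvec{X}-\varvec{W}\varvec{H}^\mathrm {T}\Vert _\mathrm {F}^2\)); it returns zero exactly at a stationary point.

```python
import numpy as np

def kkt_residual(X, W, H):
    """Residual of the KKT conditions for problem (1): the partial
    gradients must be nonnegative and complementary slackness
    W (*) grad_W f = 0 and H (*) grad_H f = 0 must hold, where (*) is
    the element-wise product.  Assumes W >= 0 and H >= 0 already hold."""
    E = W @ H.T - X
    gW = 2.0 * E @ H        # gradient of ||X - W H^T||_F^2 w.r.t. W
    gH = 2.0 * E.T @ W      # gradient w.r.t. H
    r = np.max(np.abs(np.minimum(gW, 0.0)))   # violation of grad_W f >= 0
    r += np.max(np.abs(np.minimum(gH, 0.0)))  # violation of grad_H f >= 0
    r += np.max(np.abs(W * gW))               # complementary slackness
    r += np.max(np.abs(H * gH))
    return r
```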
Similarly, the feasible region of the constrained optimization problem (2) is denoted by \({\mathcal {F}}_{\epsilon }\). That is, \({\mathcal {F}}_{\epsilon }=[\epsilon , \infty )^{M \times K} \times [\epsilon ,\infty )^{N \times K}\). We call \((\varvec{W},\varvec{H}) \in {\mathbb {R}}^{M \times K} \times {\mathbb {R}}^{N \times K}\) a stationary point of (2) if it satisfies the KKT conditions:
The set of stationary points of (2) is denoted by \({\mathcal {S}}_{\epsilon }\).
Many iterative algorithms for solving (1) have been proposed so far. Such an algorithm starts with an initial point \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {F}}\) and generates a sequence of points \(\{(\varvec{W}^{(t)},\varvec{H}^{(t)})\}_{t=0}^{\infty } \subset {\mathcal {F}}\) that is expected to converge to a stationary point of (1). Following Zangwill [53], we define the global convergence of an iterative algorithm for solving (1) as follows.
Definition 1
(Global Convergence) An iterative algorithm for solving (1) is said to be globally convergent to \({\mathcal {S}}\) if any sequence \(\{(\varvec{W}^{(t)},\varvec{H}^{(t)})\}_{t=0}^\infty \subset {\mathcal {F}}\) generated by the algorithm has at least one convergent subsequence and the limit of any convergent subsequence belongs to \({\mathcal {S}}\).
Note that Definition 1 does not mean the convergence of the whole sequence \(\{(\varvec{W}^{(t)},\varvec{H}^{(t)})\}_{t=0}^\infty \) to a stationary point. Nevertheless, the notion of global convergence as defined above is of great practical importance because the finite termination of the algorithm is guaranteed if we relax the KKT conditions in a proper way and use them as the stopping condition [29, 46, 47].
Using Zangwill’s global convergence theorem [53], we can obtain a theorem that gives a sufficient condition for an iterative algorithm for solving (1) to be globally convergent to \({\mathcal {S}}\). Before presenting the theorem, we introduce two important notions: point-to-set mappings and their closedness. We consider every iterative algorithm for solving (1) as an iterative process that, from the point in the current iteration, defines a set of candidate points for the next iteration and then selects one of them in some way. Each algorithm is thus characterized by how it defines the set of candidate points, which is represented by a point-to-set mapping from \({\mathcal {F}}\) to its subsets. For point-to-set mappings from \({\mathcal {F}}\) to its subsets, closedness is defined as follows.
Definition 2
(Closed Mapping) A point-to-set mapping A from \({\mathcal {F}}\) to its subsets is said to be closed on \({\mathcal {D}} \subseteq {\mathcal {F}}\) if, for any sequence \(\{(\varvec{P}^{(t)},\varvec{Q}^{(t)})\}_{t=0}^{\infty } \subset {\mathcal {F}}\) that converges to \((\varvec{P}^{(\infty )},\varvec{Q}^{(\infty )}) \in {\mathcal {D}}\) and any sequence \(\{(\varvec{U}^{(t)},\varvec{V}^{(t)})\}_{t=0}^{\infty } \subset {\mathcal {F}}\) such that \((\varvec{U}^{(t)},\varvec{V}^{(t)}) \in A(\varvec{P}^{(t)},\varvec{Q}^{(t)})\) for all \(t \in {\mathbb {Z}}_{+}\) and it converges to \((\varvec{U}^{(\infty )},\varvec{V}^{(\infty )}) \in {\mathcal {F}}\), their limits satisfy \((\varvec{U}^{(\infty )},\varvec{V}^{(\infty )}) \in A(\varvec{P}^{(\infty )},\varvec{Q}^{(\infty )})\).
It is often the case that the set \(A(\varvec{W},\varvec{H})\) consists of only one point in \({\mathcal {F}}\) for any \((\varvec{W},\varvec{H}) \in {\mathcal {F}}\). In this case, A can be considered as a point-to-point mapping from \({\mathcal {F}}\) to itself, and the closedness defined above can be considered as the continuity of A.
Now we are ready to present a theorem that can be obtained as a direct consequence of Zangwill’s global convergence theorem [53].
Theorem 1
Let A be the point-to-set mapping from \({\mathcal {F}}\) to its subsets that represents an iterative algorithm for solving (1). If A satisfies the following conditions then the algorithm is globally convergent to \({\mathcal {S}}\).
1. Any sequence \(\{(\varvec{W}^{(t)},\varvec{H}^{(t)})\}_{t=0}^\infty \) generated by the mapping A in such a way that \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {F}}\) and \((\varvec{W}^{(t+1)},\varvec{H}^{(t+1)}) \in A(\varvec{W}^{(t)},\varvec{H}^{(t)})\) for all \(t \in {\mathbb {Z}}_{+}\) is contained in a compact subset of \({\mathcal {F}}\).
2. The mapping A does not increase the value of f. To be more specific, for any point \((\varvec{W},\varvec{H}) \in {\mathcal {F}}\), the following statements hold true.
   (a) If \((\varvec{W},\varvec{H}) \not \in {\mathcal {S}}\) then \(f(\varvec{U},\varvec{V}) < f(\varvec{W},\varvec{H})\) for all \((\varvec{U},\varvec{V}) \in A(\varvec{W},\varvec{H})\).
   (b) If \((\varvec{W},\varvec{H}) \in {\mathcal {S}}\) then \(f(\varvec{U},\varvec{V}) \le f(\varvec{W},\varvec{H})\) for all \((\varvec{U},\varvec{V}) \in A(\varvec{W},\varvec{H})\).
3. The mapping A is closed on \({\mathcal {F}}\setminus {\mathcal {S}}\).
The global convergence of iterative algorithms for solving (2) and the closedness of point-to-set mappings from \({\mathcal {F}}_{\epsilon }\) to its subsets can be defined in the same way as above. Also, if we replace \({\mathcal {F}}\) and \({\mathcal {S}}\) in Theorem 1 with \({\mathcal {F}}_{\epsilon }\) and \({\mathcal {S}}_{\epsilon }\), respectively, we obtain a theorem that gives a sufficient condition for algorithms for solving (2) to be globally convergent to \({\mathcal {S}}_{\epsilon }\).
Zangwill’s global convergence theorem is well known as a powerful framework for proving the global convergence of iterative algorithms. For example, it was used in proving the global convergence of the concave-convex procedure [45], the decomposition method for support vector machines [48], and the modified MUR for NMF [47].
3 HALS algorithm
In this section, we review the HALS algorithm [6] for solving the optimization problem (1) and some of its variants. We also review their convergence property.
Let the k-th columns of \(\varvec{W}\) and \(\varvec{H}\) be denoted by \(\varvec{w}_k\) and \(\varvec{h}_k\), respectively. Then the problem (1) is rewritten as follows:
The HALS algorithm, which can be viewed as a special case of the block coordinate descent (BCD) method [27], updates 2K column vectors \(\varvec{w}_1,\varvec{w}_2,\ldots ,\varvec{w}_K\) and \(\varvec{h}_1,\varvec{h}_2,\ldots ,\varvec{h}_K\) one by one in a fixed order so that the value of the objective function of (5) decreases monotonically. When updating \(\varvec{w}_k\), the HALS algorithm considers all other variables as constants and solves the following subproblem:
where
If \(\varvec{h}_k {\ne } \varvec{0}_{N \times 1}\), the objective function \(p_k(\varvec{w}_k)\) is strictly convex and minimized at \(\varvec{w}_k{=}\varvec{R}_k \varvec{h}_k/\Vert \varvec{h}_k\Vert _2^2\). Hence the subproblem (6) has the unique optimal solution \(\varvec{w}_k=\left[ \varvec{R}_k \varvec{h}_k/\Vert \varvec{h}_k\Vert _2^2\right] _{+}\) [27, Theorem 2]. Similarly, when updating \(\varvec{h}_k\), the HALS algorithm considers all other variables as constants and solves the following subproblem:
Taking into account the correspondence between variables and constants in (6) and those in (7), we can say that the subproblem (7) has the unique optimal solution \(\varvec{h}_k=\left[ \varvec{R}_k^\mathrm {T} \varvec{w}_k/\Vert \varvec{w}_k\Vert _2^2\right] _{+}\) if \(\varvec{w}_k \ne \varvec{0}_{M \times 1}\). Based on these analyses, the update rule described by
is obtained [7, 23, 27]. In this paper, we call the algorithm based on this update rule the HALS algorithm [7], though it is also called the rank-one residue iteration algorithm [23].
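One sweep of the original update rule (8)–(9) can be sketched as follows. This is our illustrative NumPy rendering (the name `hals_sweep` is an assumption); it is well defined only while every column of \(\varvec{W}\) and \(\varvec{H}\) remains nonzero.

```python
import numpy as np

def hals_sweep(X, W, H):
    """One sweep of the original HALS update (8)-(9): each pair
    (w_k, h_k) is replaced by the closed-form minimizers of the
    subproblems (6) and (7).  Well defined only while every column of
    W and H remains nonzero."""
    K = W.shape[1]
    for k in range(K):
        # R_k = X - sum_{j != k} w_j h_j^T does not depend on w_k or h_k,
        # so it can be formed once and reused for both column updates.
        Rk = X - W @ H.T + np.outer(W[:, k], H[:, k])
        W[:, k] = np.maximum(0.0, Rk @ H[:, k] / (H[:, k] @ H[:, k]))
        H[:, k] = np.maximum(0.0, Rk.T @ W[:, k] / (W[:, k] @ W[:, k]))
    return W, H
```

Because each column update solves its subproblem exactly, the objective value is nonincreasing from sweep to sweep, in line with Theorem 2 below.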
For the HALS algorithm, the following result is known.
Theorem 2
(Kim et al. [27]) If the columns of \(\varvec{W}\) and \(\varvec{H}\) remain nonzero throughout the iterations, every limit point of the sequence \(\left\{ (\varvec{W}^{(t)},\varvec{H}^{(t)}) \right\} _{t=0}^{\infty }\) generated by the HALS algorithm belongs to \({\mathcal {S}}\).
Note that the global convergence of the HALS algorithm is not guaranteed by this theorem. There are two issues to consider. First, the assumption that the columns of \(\varvec{W}\) and \(\varvec{H}\) remain nonzero throughout the iterations may not always hold. Once \(\varvec{w}_k\) becomes zero, for example, \(\varvec{h}_k\) cannot be updated because the right-hand side of (9) becomes an indeterminate form. Second, even if the assumption holds, the sequence generated by the HALS algorithm may have no limit point.
A simple way to avoid indeterminate forms is to use
instead of (8) and (9), where \(\epsilon \) is a small positive constant. This update rule was introduced by Cichocki et al. [6] to avoid numerical instability, but was later proved to be globally convergent, as shown in the following theorem.
Theorem 3
(Kimura and Takahashi [29]) The HALS algorithm using the update rule described by (10) and (11) is globally convergent to \({\mathcal {S}}_{\epsilon }\).
Note that the update rule described by (10) and (11) does not perform NMF but positive matrix factorization [39]. In addition, the limit of any convergent subsequence is not a stationary point of (1) but one of (2) as shown in Theorem 3. Hence this update rule produces only dense factor matrices. One may claim that sparse factor matrices will be obtained if we replace all \(\epsilon \) in the factor matrices with zeros and that the pair of the resulting sparse factor matrices will be close to \({\mathcal {S}}\). However, it is not clear whether this claim always holds true or not.
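The \(\epsilon \)-clipped rule (10)–(11) can be sketched as follows (illustrative NumPy, consistent with the description of this rule in Sect. 1): the only change from the original sweep is that \([\cdot ]_{+}\) becomes \([\cdot ]_{\epsilon +}\), i.e. element-wise \(\max \{\epsilon ,\cdot \}\).

```python
import numpy as np

def hals_eps_sweep(X, W, H, eps=1e-9):
    """One sweep of the update rule (10)-(11): the original HALS step
    followed by the clipping [.]_{eps+} = max{eps, .} element-wise.
    Every entry stays >= eps, so no division by zero can occur, at the
    price of strictly positive (dense) factors."""
    K = W.shape[1]
    for k in range(K):
        Rk = X - W @ H.T + np.outer(W[:, k], H[:, k])
        W[:, k] = np.maximum(eps, Rk @ H[:, k] / (H[:, k] @ H[:, k]))
        H[:, k] = np.maximum(eps, Rk.T @ W[:, k] / (W[:, k] @ W[:, k]))
    return W, H
```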
Another simple way to avoid indeterminate forms is to use
instead of (8) and (9), where \(\delta \) is a positive constant. This update rule is derived from auxiliary functions of \(p_k(\varvec{w}_k)\) and \(q_k(\varvec{h}_k)\) [15]. Details will be shown in the proof of Lemma 2. For this update rule, the following result is known.
Theorem 4
(Gillis [15]) Every limit point of the sequence \(\left\{ (\varvec{W}^{(t)},\varvec{H}^{(t)}) \right\} _{t=0}^{\infty }\) generated by the HALS algorithm using the update rule described by (12) and (13) belongs to \({\mathcal {S}}\).
Just like Theorem 2 for the original HALS algorithm, Theorem 4 says nothing about the global convergence of the update rule described by (12) and (13) to \({\mathcal {S}}\). The existence of a limit point is not guaranteed even though the objective function value decreases monotonically along the sequence \(\left\{ (\varvec{W}^{(t)},\varvec{H}^{(t)}) \right\} _{t=0}^{\infty }\) generated by the update rule, because the level set of the objective function \(f(\varvec{W},\varvec{H})\) is unbounded.
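The displayed forms of (12)–(13) are not reproduced above. Following the auxiliary-function derivation outlined later in the proof of Lemma 2, a rule of this type can be read as a projected gradient step with step size \(1/\max \{\Vert \varvec{h}_k\Vert _2^2,\delta \}\), which coincides with the original update whenever \(\Vert \varvec{h}_k\Vert _2^2 \ge \delta \), never divides by zero, and still allows exact zeros. The NumPy sketch below is our own rendering under that assumption, not a verbatim transcription of (12)–(13).

```python
import numpy as np

def hals_delta_sweep(X, W, H, delta=1e-10):
    """Sketch of a delta-guarded HALS sweep in the spirit of (12)-(13):
    a projected gradient step on each column with step size
    1/max{||h_k||^2, delta} (resp. 1/max{||w_k||^2, delta}).  When the
    squared column norm exceeds delta this is exactly the original
    update (8)-(9); otherwise the guard avoids division by zero while
    still allowing entries, and whole columns, to be exactly zero."""
    K = W.shape[1]
    for k in range(K):
        Rk = X - W @ H.T + np.outer(W[:, k], H[:, k])
        hk = H[:, k]
        grad_w = (hk @ hk) * W[:, k] - Rk @ hk    # grad of (1/2)||R_k - w h_k^T||_F^2
        W[:, k] = np.maximum(0.0, W[:, k] - grad_w / max(hk @ hk, delta))
        wk = W[:, k]
        grad_h = (wk @ wk) * H[:, k] - Rk.T @ wk  # grad of (1/2)||R_k - w_k h^T||_F^2
        H[:, k] = np.maximum(0.0, H[:, k] - grad_h / max(wk @ wk, delta))
    return W, H
```

Since the guarded step size never exceeds the reciprocal of the curvature of the column subproblem, each step is nonincreasing in the objective.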
4 New update rule and its global convergence
In this section, we propose a new update rule of the HALS algorithm and prove that it is globally convergent to \({\mathcal {S}}\).
4.1 Proposed update rule
The update rule we propose in this paper is described by
where \(\delta \) is a positive constant and \(\varvec{u}_k\) is an arbitrary nonnegative unit vector. It is clear that division by zero never occurs in the proposed update rule. The first formula (14) is the same as (12). The second formula (15) is the normalization procedure for \(\varvec{w}_k\). The third formula (16) is used instead of (9) because \(\Vert \varvec{w}_k\Vert _2^2=1\) always holds when \(\varvec{h}_k\) is updated. The normalization procedure plays an important role when we prove that any sequence generated by the proposed update rule is contained in a compact subset of \({\mathcal {F}}\).
In this paper, we focus our attention on the case where the columns of \(\varvec{W}\) are normalized, but the alternative case where the columns of \(\varvec{H}\) are normalized can be dealt with in the same way. Here we should note that the modified MUR [35] also uses a normalization procedure, but this is slightly different from ours. It uses \(\varvec{0}_{M \times 1}\) instead of \(\varvec{u}_k\) in (15).
A formal statement of the proposed update rule is presented in Algorithm 1. Note that Step 4 is added to facilitate the global convergence analysis, though it is not necessary for practical purposes. Note also that Steps 2 and 3 can be replaced with
and Step 6 can be replaced with
for an efficient implementation (see Cichocki and Phan [5] for more details). It is easy to see that the proposed update rule has the same computational complexity per iteration as the original update rule. The following theorem establishes the global convergence of the proposed update rule.
[Algorithm 1: pseudocode of the proposed update rule (figure)]
Theorem 5
The HALS algorithm using the update rule shown in Algorithm 1 is globally convergent to \({\mathcal {S}}\).
This theorem can be proved by using Theorem 1. Details are shown in the next subsection.
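For concreteness, one sweep of Algorithm 1 can be sketched as follows. This is our illustrative NumPy rendering, not the verbatim algorithm: a \(\delta \)-guarded step stands in for (14) (whose displayed form is not reproduced above), Step 4 is skipped since the paper notes it is unnecessary in practice, (15) normalizes \(\varvec{w}_k\) with a fallback nonnegative unit vector \(\varvec{u}_k\) when \(\varvec{w}_k=\varvec{0}_{M \times 1}\), and (16) uses \(\varvec{h}_k=[\varvec{R}_k^\mathrm {T}\varvec{w}_k]_{+}\), valid because \(\Vert \varvec{w}_k\Vert _2=1\) after normalization.

```python
import numpy as np

def proposed_sweep(X, W, H, delta=1e-10):
    """Illustrative sweep of Algorithm 1 (Step 4 omitted, as it is only
    needed for the convergence analysis)."""
    M, K = W.shape
    for k in range(K):
        Rk = X - W @ H.T + np.outer(W[:, k], H[:, k])
        hk = H[:, k]
        # delta-guarded update of w_k, standing in for (14)
        grad_w = (hk @ hk) * W[:, k] - Rk @ hk
        wk = np.maximum(0.0, W[:, k] - grad_w / max(hk @ hk, delta))
        # normalization step (15): w_k <- w_k/||w_k||_2, or u_k if w_k = 0
        nrm = np.linalg.norm(wk)
        if nrm > 0.0:
            W[:, k] = wk / nrm
        else:
            u = np.zeros(M)
            u[0] = 1.0            # any nonnegative unit vector u_k works
            W[:, k] = u
        # update (16): exact minimizer of (7), since ||w_k||_2 = 1
        H[:, k] = np.maximum(0.0, Rk.T @ W[:, k])
    return W, H
```

Note that division by zero cannot occur anywhere in the sweep, the columns of \(\varvec{W}\) always have unit norm on exit, and zeros in the factors are preserved.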
4.2 Proof of Theorem 5
We prove Theorem 5 by using Theorem 1. Let the point-to-set mapping representing Algorithm 1 be denoted by A. Also, let the point-to-set mappings corresponding to Steps 3, 4, 5 and 6 of Algorithm 1 be denoted by \(D_k^\mathrm {W}\), \(S_k^\mathrm {H}\), \(S_k^\mathrm {W}\) and \(D_k^\mathrm {H}\), respectively. Then A is expressed as
where \(\circ \) denotes the composition of mappings. The mappings \(D_k^\mathrm {W}\), \(S_k^\mathrm {H}\) and \(D_k^\mathrm {H}\) are given by
and the mapping \(S_k^\mathrm {W}(\varvec{W},\varvec{H})\) is given by
if \(\varvec{w}_k \ne \varvec{0}_{M \times 1}\), and
otherwise. Note that the set \(D_k^\mathrm {W}(\varvec{W},\varvec{H})\) consists of only one point in \({\mathcal {F}}\), which is represented as a continuous function of \((\varvec{W},\varvec{H})\). The same can be said for \(S_k^\mathrm {H}(\varvec{W},\varvec{H})\) and \(D_k^\mathrm {H}(\varvec{W},\varvec{H})\).
We now prove that the proposed update rule satisfies the second condition in Theorem 1. Let us begin with the definition and an important property of the auxiliary function [33] because it plays an important role in our proof.
Definition 3
(Auxiliary Function [33]) For a function \(g: {\mathbb {R}}_{+} \rightarrow {\mathbb {R}}\), a two-variable function \({\bar{g}}: {\mathbb {R}}_{+} \times {\mathbb {R}}_{+} \rightarrow {\mathbb {R}}\) is called an auxiliary function of g if the following conditions hold:
1. \({\bar{g}}(x,x)=g(x)\) for all \(x \in {\mathbb {R}}_{+}\),
2. \({\bar{g}}(x,y) \ge g(x)\) for all \(x, y \in {\mathbb {R}}_{+}\).
Lemma 1
Let \({\bar{g}}: {\mathbb {R}}_{+} \times {\mathbb {R}}_{+} \rightarrow {\mathbb {R}}\) be an auxiliary function of \(g: {\mathbb {R}}_{+} \rightarrow {\mathbb {R}}\). If the inequality \({\bar{g}}(a,b) \le {\bar{g}}(b,b)\) holds for nonnegative numbers a and b then \(g(a) \le g(b)\). In particular, if \({\bar{g}}(a,b) < {\bar{g}}(b,b)\) then \(g(a) < g(b)\).
Proof
If \({\bar{g}}(a,b) \le {\bar{g}}(b,b)\), we have
The first inequality follows from the second condition in Definition 3 and the equality follows from the first condition in Definition 3. If \({\bar{g}}(a,b)\) is strictly less than \({\bar{g}}(b,b)\), it is clear from (17) that \(g(a) < g(b)\). \(\square \)
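A small numeric illustration (ours, not from the paper) of how Lemma 1 drives majorization-minimization: minimizing a quadratic auxiliary function \({\bar{g}}(x,y)=g(y)+g'(y)(x-y)+\frac{L}{2}(x-y)^2\) over \(x \ge 0\), with L at least the curvature of g, can only decrease g.

```python
def mm_step(dg, y, L):
    """Minimize the auxiliary function
        gbar(x, y) = g(y) + dg(y)(x - y) + (L/2)(x - y)^2
    over x >= 0.  gbar majorizes g when L bounds g's curvature, and
    gbar(x*, y) <= gbar(y, y) at the minimizer x*, so Lemma 1 gives
    g(x*) <= g(y)."""
    return max(0.0, y - dg(y) / L)

# toy example: g(x) = (x - 2)^2 + 1 has curvature 2, so L = 4 is valid
g = lambda x: (x - 2.0) ** 2 + 1.0
dg = lambda x: 2.0 * (x - 2.0)
y = 5.0
for _ in range(10):
    x = mm_step(dg, y, L=4.0)
    assert g(x) <= g(y)   # monotone decrease guaranteed by Lemma 1
    y = x
```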
Using Lemma 1, we obtain the following three lemmas.
Lemma 2
The objective function \(f(\varvec{W}, \varvec{H})\) is nonincreasing under the proposed update rule shown in Algorithm 1.
Proof
The objective function is nonincreasing under the composite mapping \(S_k^\mathrm {W} \circ S_k^\mathrm {H}\) for all k because the value of \(\varvec{w}_k\varvec{h}_k^\mathrm {T}\) does not change before and after the composite mapping is performed. Also, the objective function is nonincreasing under \(D_k^\mathrm {H}\) for all k because \(\varvec{h}_k=[\varvec{R}_k^\mathrm {T}\varvec{w}_k]_{+}\) is the unique optimal solution of (7) when \(\Vert \varvec{w}_k\Vert _2=1\). So it suffices for us to show that the objective function is nonincreasing under \(D_k^\mathrm {W}\) for all k.
When the mapping \(D_k^\mathrm {W}\) is performed, only \(\varvec{w}_k=[w_{1k},w_{2k},\ldots ,w_{Mk}]^\mathrm {T}\) is updated. We thus consider all variables other than \(\varvec{w}_k\) as constants, and show that the value of \(p_k(\varvec{w}_k)\), the objective function of (6), does not increase. Note that \(p_k(\varvec{w}_k)\) is rewritten as
where
and \(\varvec{r}_m^\mathrm {r}\) is the m-th row of \(\varvec{R}_k\). For the function \(p_{mk}(x)\), we define a two-variable function \({\bar{p}}_{mk}(x,y)\) as follows:
where \(\delta \) is a positive constant used in Algorithm 1. It is clear that \({\bar{p}}_{mk}(x,y)\) is an auxiliary function of \(p_{mk}(x)\) and strongly convex in both x and y (but not jointly) [15, 42]. For each value of y, the minimum point \(x^{*}\) of \({\bar{p}}_{mk}(x,y)\) in \({\mathbb {R}}_{+}\) is uniquely determined as
Therefore, by Lemma 1, we have
Substituting \(y=w_{mk}\) into this inequality, we have
from which we have
This means that \(f(\varvec{W},\varvec{H})\) is nonincreasing under \(D_k^\mathrm {W}\). \(\square \)
Lemma 3
A point \((\varvec{W}^{*},\varvec{H}^{*})\) is a stationary point of (1) if and only if \(\varvec{w}_k\) is a stationary point of (6) with \(\varvec{h}_k=\varvec{h}_k^{*}\) for \(k = 1,2,\ldots ,K\) and \(\varvec{h}_k\) is a stationary point of (7) with \(\varvec{w}_k=\varvec{w}_k^{*}\) for \(k = 1,2,\ldots ,K\).
Proof
We omit the proof because it is similar to [29, Lemma 3]. \(\square \)
Lemma 4
For any \((\varvec{W},\varvec{H}) \in {\mathcal {F}}\), the following statements hold true.
1. If \((\varvec{W},\varvec{H}) \not \in {\mathcal {S}}\) then \(f(\varvec{U},\varvec{V}) < f(\varvec{W},\varvec{H})\) for all \((\varvec{U},\varvec{V}) \in A(\varvec{W},\varvec{H})\).
2. If \((\varvec{W},\varvec{H}) \in {\mathcal {S}}\) then \(f(\varvec{U},\varvec{V}) \le f(\varvec{W},\varvec{H})\) for all \((\varvec{U},\varvec{V}) \in A(\varvec{W},\varvec{H})\).
Proof
It is clear from Lemma 2 that the second statement holds true. Thus we only have to consider the first statement. Let \((\varvec{W},\varvec{H})\) be any point in \({\mathcal {F}} \setminus {\mathcal {S}}\). It follows from Lemma 3 that there exists at least one k such that i) \(\varvec{w}_k\) is not a stationary point of (6) or ii) \(\varvec{h}_k\) is not a stationary point of (7).
In the first case, there exists at least one m such that either \(w_{mk}=0\) and \(p_{mk}'(w_{mk}) < 0\), or \(w_{mk}>0\) and \(p_{mk}'(w_{mk}) \ne 0\), where \(p_{mk}(x)\) is given by (18). For such an m, the auxiliary function \({\bar{p}}_{mk}(x,y)\) of \(p_{mk}(x)\), which is given by (19), satisfies
which is negative if \(w_{mk}=0\) and nonzero if \(w_{mk}>0\). This means that \(x=w_{mk}\) is not the unique minimum point of \({\bar{p}}_{mk}(x,w_{mk})\). Hence \({\bar{p}}_{mk}(x^{*},w_{mk})<{\bar{p}}_{mk}(w_{mk},w_{mk})\) where \(x^{*}\) is the unique minimum point given by (20). From this inequality and Lemma 1, we have \(p_{mk}(x^{*})<p_{mk}(w_{mk})\) which implies that
Therefore, \(f(\varvec{W},\varvec{H})\) strictly decreases under the mapping A.
In the second case, we can show in the same way as above that \(f(\varvec{W},\varvec{H})\) strictly decreases under the mapping A. \(\square \)
We next prove that the proposed update rule satisfies the first condition in Theorem 1. To do so, for any point \((\varvec{W}^{(0)},\varvec{H}^{(0)})\) in \({\mathcal {F}}\), we define the set \({\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) as follows:
Note that this is not a level set of f because of the additional condition that \(\Vert \varvec{w}_k\Vert _2=1\) for all k. The next lemma shows the boundedness of this set.
Lemma 5
The set \({\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is bounded for any \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {F}}\).
Proof
Let \((\varvec{W},\varvec{H})\) be any point in \({\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\). It suffices for us to show that \(\Vert \varvec{h}_k\Vert _2\) is bounded for \(k=1,2,\ldots ,K\). Because \(q_k(\varvec{h}_k)\) is convex, the inequality
holds for any \(\varvec{v} \in {\mathbb {R}}^N\) [3]. Substituting \(\varvec{v}=\varvec{R}_k^\mathrm {T}\varvec{w}_k+\varvec{1}_{N \times 1}\), we have
Hence, the inequality \(q_k(\varvec{h}_k) \le f(\varvec{W}^{(0)},\varvec{H}^{(0)})\) implies that
from which we have
This completes the proof. \(\square \)
Using Lemma 5, we obtain the following lemma.
Lemma 6
Any sequence \(\left\{ (\varvec{W}^{(t)},\varvec{H}^{(t)}) \right\} _{t=0}^{\infty }\) generated by Algorithm 1 is contained in a compact subset of \({\mathcal {F}}\).
Proof
We easily see from Step 5 of Algorithm 1 that \(\Vert \varvec{w}_k^{(t)}\Vert _2=1\) for all k and \(t \in {\mathbb {Z}}_{++}\), where \(\varvec{w}_k^{(t)}\) is the k-th column of \(\varvec{W}^{(t)}\). Also, it follows from Lemma 2 that \(f(\varvec{W}^{(t)},\varvec{H}^{(t)}) \le f(\varvec{W}^{(0)},\varvec{H}^{(0)})\) for all \(t \in {\mathbb {Z}}_{+}\). Therefore \((\varvec{W}^{(t)},\varvec{H}^{(t)}) \in {\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) for all \(t \in {\mathbb {Z}}_{++}\). Because \({\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is bounded as shown in Lemma 5, the sequence \(\{(\varvec{W}^{(t)},\varvec{H}^{(t)})\}_{t=0}^{\infty }\) is contained in a compact subset of \({\mathcal {F}}\). \(\square \)
We finally prove that the proposed update rule satisfies the third condition in Theorem 1. The next lemma shows the closedness of the point-to-set mappings \(S_1^\mathrm {W}, S_2^\mathrm {W}, \ldots , S_K^\mathrm {W}\).
Lemma 7
The point-to-set mappings \(S_1^\mathrm {W}, S_2^\mathrm {W}, \ldots , S_K^\mathrm {W}\) are closed on \({\mathcal {F}}\).
Proof
Let \(\{(\varvec{W}^{(t)},\varvec{H}^{(t)})\}_{t=0}^{\infty }\) and \(\{(\varvec{U}^{(t)},\varvec{V}^{(t)})\}_{t=0}^{\infty }\) be any two convergent sequences in \({\mathcal {F}}\) that satisfy \((\varvec{U}^{(t)},\varvec{V}^{(t)}) \in S_k^\mathrm {W}(\varvec{W}^{(t)},\varvec{H}^{(t)})\) for all \(t \in {\mathbb {Z}}_{+}\). Let \((\varvec{W}^{(\infty )},\varvec{H}^{(\infty )})\) and \((\varvec{U}^{(\infty )},\varvec{V}^{(\infty )})\) be the limits of these two sequences. It is clear from the definition of \(S_k^\mathrm {W}\) that \(\Vert \varvec{u}_k^{(t)}\Vert _2=1\) for all \(t \in {\mathbb {Z}}_{+}\), \(\varvec{u}_{{\tilde{k}}}^{(t)}=\varvec{w}_{{\tilde{k}}}^{(t)}\) for all \({\tilde{k}} \ne k\) and \(t \in {\mathbb {Z}}_{+}\), and \(\varvec{V}^{(t)}=\varvec{H}^{(t)}\) for all \(t \in {\mathbb {Z}}_{+}\). We first consider the case where \(\varvec{w}_k^{(\infty )} \ne \varvec{0}_{M \times 1}\). In this case, \(S_k^\mathrm {W}(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )})\) consists only of the point
and \(\{(\varvec{U}^{(t)},\varvec{V}^{(t)})\}_{t=0}^\infty \) converges to it. We next consider the case where \(\varvec{w}_k^{(\infty )}=\varvec{0}_{M \times 1}\). In this case, \(S_k^\mathrm {W}(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )})\) is the set of all \((\varvec{W},\varvec{H}) \in {\mathcal {F}}\) such that \(\Vert \varvec{w}_k\Vert _2=1\), \(\varvec{w}_{{\tilde{k}}}=\varvec{w}_{{\tilde{k}}}^{(\infty )}\) for all \({\tilde{k}} \ne k\) and \(\varvec{H}=\varvec{H}^{(\infty )}\). Also, \((\varvec{U}^{(\infty )},\varvec{V}^{(\infty )})\) satisfies \(\Vert \varvec{u}_k^{(\infty )}\Vert _2=1\), \(\varvec{u}_{{\tilde{k}}}^{(\infty )}=\varvec{w}_{{\tilde{k}}}^{(\infty )}\) for all \({\tilde{k}} \ne k\) and \(\varvec{V}^{(\infty )}=\varvec{H}^{(\infty )}\). Therefore, we have \((\varvec{U}^{(\infty )},\varvec{V}^{(\infty )}) \in S_{k}^\mathrm {W}(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )})\). \(\square \)
Given a point \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {F}}\), we define \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\), \({\mathcal {L}}^2_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) and \({\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) as follows:
where
for \(k=1,2,\ldots ,K\) and \(\sigma _{\max }(\varvec{X})\) is the largest singular value of \(\varvec{X}\). It is clear that all of the three sets defined above are compact subsets of \({\mathcal {F}}\). It is also clear that \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\). Furthermore, the following lemma holds.
Lemma 8
The following statements are true for \(k=1,2,\ldots ,K\).
1. If \((\varvec{W},\varvec{H}) \in {\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) then \(D_k^\mathrm {W}(\varvec{W},\varvec{H}) \subseteq {\mathcal {L}}^2_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).
2. If \((\varvec{W},\varvec{H}) \in {\mathcal {L}}^2_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) then \(S_k^\mathrm {H}(\varvec{W},\varvec{H}) \subseteq {\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).
3. If \((\varvec{W},\varvec{H}) \in {\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) then \(S_k^\mathrm {W}(\varvec{W},\varvec{H}) \subseteq {\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).
4. If \((\varvec{W},\varvec{H}) \in {\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) then \(D_k^\mathrm {H}(\varvec{W},\varvec{H}) \subseteq {\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).
Proof
We first prove the first statement. Suppose that \((\varvec{W},\varvec{H}) \in {\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\). Then \(\Vert \varvec{w}_k\Vert _2 \le \mu _k\) and \(\Vert \varvec{h}_k\Vert _2 \le \nu _k\) hold. Using these inequalities, we have
which means that \(D_k^\mathrm {W}(\varvec{W},\varvec{H}) \subseteq {\mathcal {L}}^2_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).
We next prove the second statement. Suppose that \((\varvec{W},\varvec{H}) \in {\mathcal {L}}^2_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\). Then \(\Vert \varvec{w}_k\Vert _2 \le (\sigma _{\max }(\varvec{X}) \nu _k/\delta +\mu _k)\) and \(\Vert \varvec{h}_k\Vert _2 \le \nu _k\) hold. Using these inequalities, we have
which means that \(S_k^\mathrm {H}(\varvec{W},\varvec{H}) \subseteq {\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).
The third statement is clear from the definition of the point-to-set mapping \(S_k^\mathrm {W}\), and the fourth statement is clear from the proof of Lemma 5. \(\square \)
From Lemma 8, we can restrict the domains of the point-to-set mappings \(D_k^\mathrm {W}\), \(S_k^\mathrm {H}\), \(S_k^\mathrm {W}\), \(D_k^\mathrm {H}\) to \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\), \({\mathcal {L}}^2_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\), \({\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) and \({\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\), respectively. This means that we can restrict the domain of the point-to-set mapping A to \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\). The next lemma shows the closedness of A restricted to \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).
Lemma 9
For any \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {F}}\), the point-to-set mapping A restricted to \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is closed on \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).
Proof
It is clear that the composite mapping \(S_k^\mathrm {H} \circ D_k^\mathrm {W}\) from \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) to the subsets of \({\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is closed on its domain for all k. Also, it follows from Lemma 7 and the continuity of \(D_k^\mathrm {H}\) that the composite mapping \(D_k^\mathrm {H} \circ S_k^\mathrm {W}\) from \({\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) to the subsets of \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is closed on its domain for all k. Because \({\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is a compact subset of \({\mathcal {F}}\), by [53, Corollary 4.2.1], the composite mapping \((D_k^\mathrm {H} \circ S_k^\mathrm {W}) \circ (S_k^\mathrm {H} \circ D_k^\mathrm {W})\) from \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) to the subsets of \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is closed on its domain for all k. Furthermore, since \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is a compact subset of \({\mathcal {F}}\), by [53, Corollary 4.2.1], we can conclude that A, which is a composition of the mappings \((D_k^\mathrm {H} \circ S_k^\mathrm {W}) \circ (S_k^\mathrm {H} \circ D_k^\mathrm {W})\), restricted to \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is closed on its domain. \(\square \)
We should note that even if \(\varvec{u}_k\) in Step 5 of Algorithm 1 is replaced with a constant nonnegative unit vector such as \((1/\sqrt{M})\varvec{1}_{M \times 1}\) or \((1,0,0,\ldots ,0)^\mathrm {T}\), we can prove Theorem 5 without changing the definition of the mapping \(S_k^\mathrm {W}\).
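Before moving on to stopping conditions, it may help to see the overall shape of a HALS-type sweep in code. The following Python sketch is illustrative only: it implements generic HALS column updates for \(f(\varvec{W},\varvec{H})=\frac{1}{2}\Vert \varvec{X}-\varvec{W}\varvec{H}^\mathrm {T}\Vert _F^2\) with the column normalization of Steps 4–5 and the constant-unit-vector fallback discussed above. The function name and the exact formulas are our own simplifications, not a verbatim transcription of Algorithm 1.

```python
import numpy as np

def hals_sweep(X, W, H, delta=1e-8):
    """One illustrative HALS-type sweep for f(W, H) = 0.5 * ||X - W H^T||_F^2.

    X: (M, N) nonnegative data, W: (M, K), H: (N, K).  The columns of W
    are kept on the unit sphere, mirroring the normalization in Steps 4-5
    of Algorithm 1 (sketch only; delta plays the role of a small threshold).
    """
    M, K = W.shape
    for k in range(K):
        # Residual excluding component k: R_k = X - sum_{j != k} w_j h_j^T.
        R = X - W @ H.T + np.outer(W[:, k], H[:, k])
        # Column update of W: project onto the nonnegative orthant, then
        # onto the unit sphere (schematic version of Steps 3-5).
        w = np.maximum(R @ H[:, k], 0.0)
        norm = np.linalg.norm(w)
        if norm > delta:
            W[:, k] = w / norm
        else:
            # Zero-column case: fall back to a constant nonnegative unit
            # vector, as permitted by the remark above.
            W[:, k] = np.full(M, 1.0 / np.sqrt(M))
        # Column update of H; since ||w_k||_2 = 1, no division is needed.
        H[:, k] = np.maximum(R.T @ W[:, k], 0.0)
    return W, H
```

Each column update is an exact minimization over its feasible set, so one sweep never increases the objective.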
5 Stopping conditions
We have proved that the HALS algorithm using the proposed update rule shown in Algorithm 1 is globally convergent to \({\mathcal {S}}\) in the sense of Definition 1. Therefore, combining this update rule with an appropriate stopping condition, we can design an algorithm that always stops in a finite number of iterations. In this section, we consider two approaches for deriving stopping conditions.
5.1 Relaxed KKT conditions
The first approach, which has already been used in the literature [29, 30, 38, 44, 46, 47], is to relax the KKT conditions (3) as follows:
where \(\kappa _1\) and \(\kappa _2\) are positive constants.
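The displayed inequalities (21) and (22) do not survive in this excerpt, so the following Python sketch assumes a common entrywise relaxation of the KKT conditions from the cited literature: every gradient entry is nearly nonnegative, and complementary slackness holds up to the tolerances \(\kappa _1\) and \(\kappa _2\). The function name and the exact inequality form are our own assumptions, not the paper's.

```python
import numpy as np

def count_unsatisfied(X, W, H, kappa1=1.0, kappa2=2e-4):
    """Count entries violating a relaxed KKT condition for
    f(W, H) = 0.5 * ||X - W H^T||_F^2.

    Hypothetical entrywise relaxation (for illustration only):
        grad >= -kappa1,  and  (variable <= kappa2 or |grad| <= kappa1),
    i.e. gradients nearly nonnegative and complementary slackness up to
    the tolerances kappa1 and kappa2.
    """
    E = W @ H.T - X
    gW, gH = E @ H, E.T @ W  # gradients of f w.r.t. W and H

    def violations(V, G):
        bad_sign = G < -kappa1                      # gradient too negative
        bad_comp = (V > kappa2) & (np.abs(G) > kappa1)  # slackness violated
        return int(np.count_nonzero(bad_sign | bad_comp))

    return violations(W, gW) + violations(H, gH)
```

A count of zero would then serve as the stopping test, in the spirit of the "number of unsatisfied inequalities" reported in the experiments.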
The HALS algorithm for NMF using Algorithm 1 and the stopping condition described by (21) and (22) is shown in Algorithm 2. For this algorithm, the following theorem holds. The proof is omitted because it is similar to that of Theorem 2 in [46].
Theorem 6
Algorithm 2 stops in a finite number of iterations.
[Figure b: pseudocode listing of Algorithm 2]
5.2 Projected gradient norm
The second approach is to make use of the projected gradient [36]. To be more specific, the inequality
is used as the stopping condition, where \(\tau _1\) and \(\tau _2\) are positive constants, and \(\psi _{\tau _2}(\varvec{W},\varvec{H})\) is defined as
The notations \(\varvec{G}_{\tau _2}^{\mathrm {W}}(\varvec{W},\varvec{H})\) and \(\varvec{G}_{\tau _2}^{\mathrm {H}}(\varvec{W},\varvec{H})\) denote modified projected gradients with respect to \(\varvec{W}\) and \(\varvec{H}\), respectively, which are defined by
and
Note that our definition of the projected gradient is slightly different from the one used in the literature [20, 26, 27, 36], which corresponds to the case where \(\tau _2=0\). It is clear that if \((\varvec{W},\varvec{H})\) is a stationary point of (1) then (23) is satisfied because \(\psi _{\tau _2}(\varvec{W},\varvec{H})=0\) holds. Therefore, (23) can be regarded as a relaxed form of the KKT conditions.
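The displayed definitions of \(\varvec{G}_{\tau _2}^{\mathrm {W}}\), \(\varvec{G}_{\tau _2}^{\mathrm {H}}\) and \(\psi _{\tau _2}\) are not reproduced in this excerpt. As an illustration, the following Python sketch assumes the standard projected gradient of [36] with the boundary test \(x_{mk}=0\) replaced by \(x_{mk} \le \tau _2\), which is consistent with the remark that \(\tau _2=0\) recovers the usual definition; \(\psi _{\tau _2}\) is taken here to be the combined Frobenius norm. These choices are our assumptions.

```python
import numpy as np

def projected_gradient_norm(X, W, H, tau2=1e-8):
    """Illustrative computation of psi_{tau2}(W, H) for
    f(W, H) = 0.5 * ||X - W H^T||_F^2 (assumed form, sketch only).
    """
    E = W @ H.T - X

    def proj(V, G):
        # In the interior (entries > tau2) keep the full gradient;
        # near the boundary (entries <= tau2) keep only its negative part.
        return np.where(V > tau2, G, np.minimum(G, 0.0))

    GW = proj(W, E @ H)      # modified projected gradient w.r.t. W
    GH = proj(H, E.T @ W)    # modified projected gradient w.r.t. H
    return np.sqrt(np.linalg.norm(GW) ** 2 + np.linalg.norm(GH) ** 2)
```

With this reading, the stopping condition (23) would amount to checking `projected_gradient_norm(X, W, H, tau2) <= tau1`.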
The proposed HALS algorithm for NMF using Algorithm 1 and the stopping condition (23) is shown in Algorithm 3. For this algorithm, the following theorem holds.
Theorem 7
Algorithm 3 stops in a finite number of iterations.
Proof
The proof is by contradiction. Suppose that Algorithm 3 does not stop for some \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {F}}\), and let \(\{(\varvec{W}^{(t)}, \varvec{H}^{(t)}) \}_{t=0}^\infty \) be the infinite sequence generated by Algorithm 3. Then we see from Step 1 that \(\psi _{\tau _2}(\varvec{W}^{(0)},\varvec{H}^{(0)})\) must be positive. Also, by Theorem 5, this sequence has at least one subsequence that converges to a stationary point of (1). Let \(\{(\varvec{W}^{(t_i)}, \varvec{H}^{(t_i)}) \}_{i=0}^\infty \) be one such subsequence and \((\varvec{W}^{(\infty )},\varvec{H}^{(\infty )}) \in {\mathcal {F}}\) its limit. Because the limit is a stationary point of (1), it satisfies
Let us define a positive constant \(\mu \) as
Because \(\nabla _{\varvec{W}} f(\varvec{W},\varvec{H})\) and \(\nabla _{\varvec{H}} f(\varvec{W},\varvec{H})\) are continuous on \({\mathcal {F}}\), the following statements hold true.
1. For any (m, k) such that \((\nabla _{\varvec{W}} f(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )}))_{mk} = 0\), there exists a positive integer \(I_{mk}^\mathrm {W}\) such that
$$\begin{aligned} \left| (\nabla _{\varvec{W}} f(\varvec{W}^{(t_i)},\varvec{H}^{(t_i)}))_{mk} \right| \le \mu \end{aligned}$$
for all \(i \ge I_{mk}^\mathrm {W}\).
2. For any (m, k) such that \((\nabla _{\varvec{W}} f(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )}))_{mk} > 0\), there exists a positive integer \(I_{mk}^\mathrm {W}\) such that
$$\begin{aligned} (\nabla _{\varvec{W}} f(\varvec{W}^{(t_i)},\varvec{H}^{(t_i)}))_{mk} > 0, \quad w_{mk}^{(t_i)} \le \tau _2 \end{aligned}$$
for all \(i \ge I_{mk}^\mathrm {W}\).
3. For any (n, k) such that \((\nabla _{\varvec{H}} f(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )}))_{nk} = 0\), there exists a positive integer \(I_{nk}^\mathrm {H}\) such that
$$\begin{aligned} \left| (\nabla _{\varvec{H}} f(\varvec{W}^{(t_i)},\varvec{H}^{(t_i)}))_{nk} \right| \le \mu \end{aligned}$$
for all \(i \ge I_{nk}^\mathrm {H}\).
4. For any (n, k) such that \((\nabla _{\varvec{H}} f(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )}))_{nk} > 0\), there exists a positive integer \(I_{nk}^\mathrm {H}\) such that
$$\begin{aligned} (\nabla _{\varvec{H}} f(\varvec{W}^{(t_i)},\varvec{H}^{(t_i)}))_{nk} > 0, \quad h_{nk}^{(t_i)} \le \tau _2 \end{aligned}$$
for all \(i \ge I_{nk}^\mathrm {H}\).
From these statements, we see that
for all \(i \ge I=\max \{I_{11}^\mathrm {W},\ldots ,I_{MK}^\mathrm {W},I_{11}^\mathrm {H},\ldots ,I_{NK}^\mathrm {H}\}\). Therefore, the inequality
holds for all \(i \ge I\). This means that the stopping condition (23) holds in a finite number of iterations. However, this contradicts the assumption that Algorithm 3 does not stop. \(\square \)
[Figure c: pseudocode listing of Algorithm 3]
6 Numerical experiments
In order to examine the practical performance of the proposed update rule, the authors conducted numerical experiments using two real-world datasets: Olivetti and CLUTO (tr41). The former is a dataset of face images, and the latter is a dataset of documents. The statistics of these two datasets are shown in Table 1. In the experiments, two global-convergence-guaranteed update rules were applied to the nonnegative matrices obtained from the datasets. One is Algorithm 1 (denoted as ‘proposed’) and the other is the update rule described by (10) and (11) (denoted as ‘positive’). These two update rules are compared in terms of the evolution of the objective function value, the number of unsatisfied inequalities in the relaxed KKT conditions, and the characteristics of the obtained factor matrices.
The experimental setup is shown in Table 2. The value of \(\delta \) in the proposed update rule is set to \(10^{-8}\) in all experiments, while the value of \(\epsilon \) in the positive one is set to \(10^{-4}\) or \(10^{-8}\) depending on the experiment. The iteration is terminated when the stopping condition described by (21) and (22) is satisfied or the number of iterations reaches 500. The values of \(\kappa _1\) and \(\kappa _2\) in the stopping condition are set to 1.0 and \(2\epsilon \), respectively, in all experiments. Note that the finite termination of the positive update rule is guaranteed if \(\kappa _2\) is greater than \(\epsilon \); this can be proved in the same way as Theorem 7 (see [29] for details). Three different initial solutions are generated for each dataset in such a way that each element is drawn from independent uniform distributions on the intervals [0, 1], [0, 0.5] and [0, 0.25], which are called the ‘large’, ‘medium’ and ‘small’ initial solutions, respectively.
Results of Experiment 1 are summarized in Fig. 2 and Table 3. Figure 2 shows the evolution of the objective function value and the number of unsatisfied inequalities in (21) and (22). We easily see from the figure that the two update rules decrease the objective function value in a similar way until one of them satisfies the stopping condition. In contrast, the behavior of these update rules with respect to the number of unsatisfied inequalities is quite different. The proposed update rule decreases the number at a similar rate for all the initial solutions, and satisfies the stopping condition between 200 and 300 iterations. This is because the normalization process is included in the proposed update rule. The positive update rule decreases the number very slowly, and cannot satisfy the stopping condition in 500 iterations for the large and medium initial solutions, while for the small initial solution it decreases the number very fast and satisfies the stopping condition in less than 100 iterations.
Table 3 shows the characteristics of the solutions obtained by the proposed and positive update rules. Some important facts are observed in this table. The first one is that a small objective function value does not necessarily mean that the number of unsatisfied inequalities is small. In fact, the solution obtained by the positive update rule for the large initial solution gives the smallest objective function value and the largest number of unsatisfied inequalities. Also, the solution obtained by the positive update rule for the small initial solution gives the largest objective function value but satisfies all the inequalities. The second fact is that about a quarter of the variables are at the lower bound in all cases. Hence the solutions obtained by the proposed update rule are sparse because the lower bound is zero. In contrast, the solutions obtained by the positive update rule are dense because the lower bound is a positive constant \(\epsilon \). The third fact is that the replacement of all \(\epsilon \) with zero in each solution obtained by the positive update rule increases the number of unsatisfied inequalities. In particular, the replacement changes a solution that satisfies the stopping condition to another one that does not. It is thus not always possible to find a sparse solution that satisfies the relaxed KKT conditions using the positive update rule, while we can always do it using the proposed update rule. This is an advantage of the proposed update rule against the positive one.
Results of Experiment 2 are summarized in Fig. 3 and Table 4 just like Experiment 1. The evolution of the objective function value and the number of unsatisfied inequalities in (21) and (22) shown in Fig. 3 are similar to those in Experiment 1 (see Fig. 2), though the values of \(\epsilon \) and \(\kappa _2\) are quite different. The characteristics shown in Table 4 are similar to those in Table 3 but there is one important difference. The number of unsatisfied inequalities is zero before and after the replacement of all \(\epsilon \) with zero in the solution obtained by the positive update rule for the small initial solution. This indicates that we can find a sparse solution that satisfies the relaxed KKT conditions using the positive update rule if the magnitude of the initial solution and the value of \(\epsilon \) are sufficiently small. However, it is difficult in general to know in advance how small these values should be.
Results of Experiment 3 are summarized in Fig. 4 and Table 5. The evolution of the objective function value and the number of unsatisfied inequalities in (21) and (22) shown in Fig. 4 are similar to those in Experiment 1 (see Fig. 2), though the dataset is different. The characteristics shown in Table 5 are also similar to those in Table 3 but there are two main differences. One is that a solution with a smaller objective function value satisfies more inequalities in (21) and (22). The other is that the number of variables at the lower bound in each solution obtained by the positive update rule is higher than that in the corresponding solution obtained by the proposed update rule.
Results of Experiment 4 are summarized in Fig. 5 and Table 6. As for the proposed update rule, the evolution of the objective function value and the number of unsatisfied inequalities in (21) and (22) are similar to those in Experiment 3 (see Fig. 4). In contrast, the behavior of the positive update rule is quite different from that in Experiment 3. The number of unsatisfied inequalities decreases faster than with the proposed update rule for all the initial solutions, and reaches zero in less than 200 iterations. All the solutions obtained by the proposed and positive update rules have almost the same objective function value and very similar numbers of unsatisfied inequalities, as shown in Table 6. In addition, the number of unsatisfied inequalities is not affected by the replacement of all \(\epsilon \) with zero for all the solutions obtained by the positive update rule. This indicates that we can find a sparse solution that satisfies the relaxed KKT conditions using the positive update rule if the magnitude of the initial solution and the value of \(\epsilon \) are properly selected.
7 Applicability of proposed update rule to variants of HALS algorithm
In this section, we introduce some variants of the HALS algorithm to which our update rule can be applied in order to guarantee the well-definedness and/or the global convergence. The first one is the accelerated HALS algorithm [17]. The idea behind this algorithm is very simple. In each iteration, \(\varvec{W}\) is updated several times while \(\varvec{H}\) is fixed, and then \(\varvec{H}\) is updated several times while \(\varvec{W}\) is fixed. It was shown through experiments using image and text datasets that this algorithm significantly outperforms the original HALS algorithm [17]. Now we claim that the global convergence of this algorithm is guaranteed if Algorithm 1 is incorporated into it. In each iteration, the algorithm updates all columns of \(\varvec{W}\) several times using the update rule in Step 3, next updates all columns of \(\varvec{W}\) and \(\varvec{H}\) using Steps 4 and 5 once, and then updates all columns of \(\varvec{H}\) several times using the update rule in Step 6.
The second one is the fast coordinate descent algorithm with variable selection [26]. In each iteration, M rows of \(\varvec{W}\) are updated one by one and then N rows of \(\varvec{H}\) are updated one by one. Each row of \(\varvec{W}\) or \(\varvec{H}\) is updated by repeating the following two steps until some condition is satisfied: i) selection of one element based on the potential decrease in the objective function value, and ii) update of the selected element. It was shown through experiments using synthetic and real-world datasets that this algorithm is considerably faster than conventional algorithms [26]. Again, we claim that the global convergence of this algorithm is guaranteed if Algorithm 1 is incorporated into it. To be more specific, when each row of \(\varvec{W}\) or \(\varvec{H}\) is updated, the update rule in Step 3 or Step 6 of Algorithm 1 can be used for both the computation of the potential decrease in the objective function value and the update of the selected element. One important point is that the normalization procedure in Steps 4 and 5 should be done between the update of M rows of \(\varvec{W}\) and the update of N rows of \(\varvec{H}\).
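The selection step i) above can be made concrete for a quadratic objective: for a single variable with current value x, gradient g and diagonal Hessian entry d, the nonnegativity-constrained Newton step and its exact decrease are available in closed form. The following Python sketch illustrates this variable-selection rule for one row; the helper name and interface are our own, and [26] should be consulted for the actual algorithm.

```python
import numpy as np

def best_variable_in_row(x, g, d):
    """Pick the element of one row of W (or H) whose single-variable
    update gives the largest predicted decrease of the quadratic
    objective, in the spirit of the variable-selection rule of [26].

    x: current row (K,), g: gradient of f w.r.t. that row,
    d: diagonal of the row's Hessian (e.g. diag(H^T H)), all (K,).
    Returns (index, new_value, predicted_decrease).  Sketch only.
    """
    # Constrained single-variable Newton step: s = max(x - g/d, 0) - x.
    s = np.maximum(x - g / d, 0.0) - x
    # Exact decrease for a quadratic: -(g*s + 0.5*d*s^2).
    dec = -(g * s) - 0.5 * d * s ** 2
    j = int(np.argmax(dec))
    return j, x[j] + s[j], float(dec[j])
```

Repeating this selection-and-update pair until the predicted decrease falls below a threshold yields the row update described above.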
The third one is the randomized HALS algorithm [13], which is based on the probabilistic framework for low-rank approximations [21]. In the first step, this algorithm constructs a surrogate matrix \(\varvec{B} \in {\mathbb {R}}^{L \times N}\) with \(K < L \ll M\) as follows. First, \(\varvec{X}\) is multiplied by a random matrix \(\varvec{\Omega } \in {\mathbb {R}}^{N \times L}\) to get \(\varvec{Y}=\varvec{X}\varvec{\Omega }\). Next, a matrix \(\varvec{Q} \in {\mathbb {R}}^{M \times L}\) with orthogonal columns is obtained by performing the QR-decomposition of \(\varvec{Y}\). Finally, the surrogate matrix is obtained by \(\varvec{B}=\varvec{Q}^\mathrm {T}\varvec{X}\). The surrogate matrix \(\varvec{B}\) obtained in this way is expected to capture the essential information of \(\varvec{X}\). In the next step, this algorithm solves the optimization problem:
by an iterative algorithm very similar to the HALS algorithm. It was shown through experiments using hand-written digit and face image datasets that the randomized HALS algorithm has a substantially lower computational cost than, and attains almost the same reconstruction error as, the deterministic one [13]. The technique used in our update rule can easily be applied to this algorithm in order to ensure that it is well defined.
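The compression step of the randomized HALS algorithm can be sketched directly from the three-step description above. In this Python sketch, the random test matrix \(\varvec{\Omega }\) is taken to be Gaussian, a common choice in the probabilistic framework [21]; the function name is our own.

```python
import numpy as np

def compress(X, L, rng=None):
    """Build the surrogate matrix B used by the randomized HALS
    algorithm [13]: Y = X @ Omega, Q from QR(Y), B = Q.T @ X.

    X: (M, N) data matrix, L: sketch size with K < L << M.
    Returns the surrogate B (L, N) and the orthonormal basis Q (M, L).
    """
    rng = np.random.default_rng() if rng is None else rng
    M, N = X.shape
    Omega = rng.standard_normal((N, L))   # random test matrix (Gaussian assumed)
    Y = X @ Omega                         # (M, L) sample of the range of X
    Q, _ = np.linalg.qr(Y)                # (M, L), orthonormal columns
    B = Q.T @ X                           # (L, N) surrogate matrix
    return B, Q
```

The subsequent HALS-like iterations then factor the much smaller \(\varvec{B}\) instead of \(\varvec{X}\), and \(\varvec{Q}\varvec{B} \approx \varvec{X}\) recovers the original scale.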
In addition to these three, there are many other algorithms to which our update rule can be applied. One example is the distributed HALS algorithm for multiagent networks [10]. This algorithm is based on the update rule given by (10) and (11) to guarantee the global convergence. By using our update rule instead, this algorithm can find a stationary point of the original optimization problem (1).
8 Conclusions
In this paper, we have proposed a novel update rule of the HALS algorithm for NMF, and proved its global convergence using Zangwill’s global convergence theorem. The proposed update rule has the same computational complexity per iteration as the update rule in the original HALS algorithm. In addition, unlike the global-convergence-guaranteed update rules in the literature [29, 30], the proposed update rule does not restrict the range of each variable to a subset of \({\mathbb {R}}_{++}\). This allows us to obtain sparse factor matrices. We have also given two types of stopping conditions and proved the finite termination of the proposed update rule combined with these stopping conditions.
One future direction of this work is to extend our results to Nonnegative Tensor Factorization (NTF) [7, 55], which is expected to be used in various applications such as recommender systems [56] but whose global convergence properties have not yet been analyzed in depth.
References
Arora, S., Ge, R., Kannan, R., Moitra, A.: Computing a nonnegative matrix factorization–provably. SIAM J. Comput. 45(4), 1582–1611 (2016)
Berry, M.W., Browne, M., Langville, A.N., Pauca, V.P., Plemmons, R.J.: Algorithms and applications for approximate nonnegative matrix factorization. Comput. Stat. Data Anal. 52(1), 155–173 (2007)
Boyd, S., Boyd, S.P., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Cai, D., He, X., Han, J., Huang, T.S.: Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1548–1560 (2010)
Cichocki, A., Phan, A.H.: Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 92(3), 708–721 (2009)
Cichocki, A., Zdunek, R., Amari, S.I.: Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization. In: proceedings of the 2007 International conference on independent component analysis and signal separation, pp. 169–176 (2007)
Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.I.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation. John Wiley & Sons, Hoboken (2009)
Cunningham, J.P., Ghahramani, Z.: Linear dimensionality reduction: survey, insights, and generalizations. J. Mach. Learn. Res. 16(1), 2859–2900 (2015)
Ding, C., Li, T., Peng, W., Park, H.: Orthogonal nonnegative matrix tri-factorizations for clustering. In: proceedings of the 12th ACM SIGKDD International conference on knowledge discovery and data mining, pp. 126–135 (2006)
Domen, Y., Migita, T., Takahashi, N.: A distributed HALS algorithm for Euclidean distance-based nonnegative matrix factorization. In: proceedings of the 2019 IEEE symposium series on computational intelligence, pp. 1332–1337 (2019)
Donoho, D., Stodden, V.: When does non-negative matrix factorization give a correct decomposition into parts? Adv. Neural Inf. Process. Syst. 16, 1141–1148 (2003)
Dorffer, C., Puigt, M., Delmaire, G., Roussel, G.: Informed nonnegative matrix factorization methods for mobile sensor network calibration. IEEE Trans. Signal Inf. Process. Netw. 4(4), 667–682 (2018)
Erichson, N.B., Mendible, A., Wihlborn, S., Kutz, J.N.: Randomized nonnegative matrix factorization. Pattern Recognit. Lett. 104, 1–7 (2018)
Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the Itakura–Saito divergence: with application to music analysis. Neural Comput. 21(3), 793–830 (2009)
Gillis, N.: Nonnegative Matrix Factorization. SIAM (2020)
Gillis, N., Glineur, F.: Nonnegative factorization and the maximum edge biclique problem. arXiv e-prints (2008)
Gillis, N., Glineur, F.: Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Comput. 24(4), 1085–1105 (2012)
Gligorijević, V., Panagakis, Y., Zafeiriou, S.: Non-negative matrix factorizations for multiplex network analysis. IEEE Trans. Pattern Anal. Mach. Intell. 41(4), 928–940 (2018)
Gonzalez, E.F., Zhang, Y.: Accelerating the Lee-Seung algorithm for nonnegative matrix factorization. Tech. rep. (2005)
Guan, N., Tao, D., Luo, Z., Yuan, B.: NeNMF: an optimal gradient method for nonnegative matrix factorization. IEEE Trans. Signal Process. 60(6), 2882–2898 (2012)
Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)
Hamon, R., Borgnat, P., Flandrin, P., Robardet, C.: Extraction of temporal network structures from graph-based signals. IEEE Trans. Signal Inf. Process. Netw. 2(2), 215–226 (2016)
Ho, N.D.: Nonnegative matrix factorization algorithms and applications. Ph.D. thesis, Université catholique de Louvain (2008)
Hoyer, P.O.: Non-negative sparse coding. In: proceedings of the 12th IEEE Workshop on neural networks for signal processing, pp. 557–565 (2002)
Hoyer, P.O.: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 5, 1457–1469 (2004)
Hsieh, C.J., Dhillon, I.S.: Fast coordinate descent methods with variable selection for non-negative matrix factorization. In: proceedings of the 17th ACM SIGKDD International conference on knowledge discovery and data mining, pp. 1064–1072 (2011)
Kim, J., He, Y., Park, H.: Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J. Glob. Optim. 58(2), 285–319 (2014)
Kim, J., Park, H.: Fast nonnegative matrix factorization: an active-set-like method and comparisons. SIAM J. Sci. Comput. 33(6), 3261–3281 (2011)
Kimura, T., Takahashi, N.: Global convergence of a modified HALS algorithm for nonnegative matrix factorization. In: proceedings of the 2015 IEEE 6th International Workshop on computational advances in multi-sensor adaptive processing, pp. 21–24 (2015)
Kimura, T., Takahashi, N.: Gauss-Seidel HALS algorithm for nonnegative matrix factorization with sparseness and smoothness constraints. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 100(12), 2925–2935 (2017)
Kitamura, D., Ono, N., Sawada, H., Kameoka, H., Saruwatari, H.: Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization. IEEE/ACM Trans. Audio Speech Lang. Process. 24(9), 1626–1641 (2016)
Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: advances in Neural Information Processing Systems, pp. 556–562 (2001)
Li, Z., Tang, J., He, X.: Robust structured nonnegative matrix factorization for image representation. IEEE Trans. Neural Netw. Learn. Syst. 29(5), 1947–1960 (2017)
Lin, C.J.: On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Trans. Neural Netw. 18(6), 1589–1596 (2007)
Lin, C.J.: Projected gradient methods for nonnegative matrix factorization. Neural Comput. 19(10), 2756–2779 (2007)
Lu, S., Hong, M., Wang, Z.: A nonconvex splitting method for symmetric nonnegative matrix factorization: convergence analysis and optimality. IEEE Trans. Signal Process. 65(12), 3120–3135 (2017)
Nakatsu, S., Takahashi, N.: A novel Newton-type algorithm for nonnegative matrix factorization with alpha-divergence. In: proceedings of the 2017 International conference on neural information processing, pp. 335–344. Springer (2017)
Paatero, P., Tapper, U.: Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5(2), 111–126 (1994)
Pauca, V.P., Piper, J., Plemmons, R.J.: Nonnegative matrix factorization for spectral data analysis. Linear Algebra Appl. 416(1), 29–47 (2006)
Pauca, V.P., Shahnaz, F., Berry, M.W., Plemmons, R.J.: Text mining using non-negative matrix factorizations. In: proceedings of the 2004 SIAM International conference on data mining, pp. 452–456 (2004)
Razaviyayn, M., Hong, M., Luo, Z.Q.: A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J. Optim. 23(2), 1126–1153 (2013)
Recht, B., Re, C., Tropp, J., Bittorf, V.: Factoring nonnegative matrices with linear programs. Adv. Neural Inf. Process. Syst. 25, 1214–1222 (2012)
Sano, T., Migita, T., Takahashi, N.: A damped Newton algorithm for nonnegative matrix factorization based on alpha-divergence. In: Proceedings of the 2019 6th International Conference on Systems and Informatics, pp. 463–468. IEEE (2019)
Sriperumbudur, B.K., Lanckriet, G.R.: On the convergence of the concave-convex procedure. In: Proceedings of the 22nd International Conference on Neural Information Processing Systems, pp. 1759–1767 (2009)
Takahashi, N., Hibi, R.: Global convergence of modified multiplicative updates for nonnegative matrix factorization. Comput. Optim. Appl. 57(2), 417–440 (2014)
Takahashi, N., Katayama, J., Seki, M., Takeuchi, J.: A unified global convergence analysis of multiplicative update rules for nonnegative matrix factorization. Comput. Optim. Appl. 71(1), 221–250 (2018)
Takahashi, N., Nishi, T.: Global convergence of decomposition learning methods for support vector machines. IEEE Trans. Neural Netw. 17(6), 1362–1369 (2006)
Vavasis, S.A.: On the complexity of nonnegative matrix factorization. SIAM J. Optim. 20(3), 1364–1377 (2010)
Wang, F., Li, T., Wang, X., Zhu, S., Ding, C.: Community discovery using nonnegative matrix factorization. Data Min. Knowl. Discov. 22(3), 493–521 (2011)
Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemom. Intell. Lab. Syst. 2(1–3), 37–52 (1987)
Yang, Z., Oja, E.: Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization. IEEE Trans. Neural Netw. 22(12), 1878–1891 (2011)
Zangwill, W.I.: Nonlinear Programming: A Unified Approach. Prentice-Hall, Englewood Cliffs, New Jersey (1969)
Zdunek, R., Cichocki, A.: Non-negative matrix factorization with quasi-Newton optimization. In: International Conference on Artificial Intelligence and Soft Computing, pp. 870–879 (2006)
Zdunek, R., Fonal, K.: Randomized nonnegative tensor factorization for feature extraction from high-dimensional signals. In: 2018 25th International Conference on Systems, Signals and Image Processing, pp. 1–5 (2018)
Zhang, W., Sun, H., Liu, X., Guo, X.: Temporal QoS-aware web service recommendation via non-negative tensor factorization. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 585–596 (2014)
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions, which improved the quality of the paper.
This work was supported by JSPS KAKENHI Grant Number JP21H03510.
Cite this article
Sano, T., Migita, T. & Takahashi, N. A novel update rule of HALS algorithm for nonnegative matrix factorization and Zangwill’s global convergence. J Glob Optim 84, 755–781 (2022). https://doi.org/10.1007/s10898-022-01167-7