1 Introduction

Dimensionality reduction methods for large-scale and high-dimensional data have been actively studied in the fields of machine learning and signal processing because of their diverse applications such as feature extraction and visualization (see [8] and references therein). In recent years, Nonnegative Matrix Factorization (NMF) [2, 32] has attracted a great deal of attention as an effective dimensionality reduction method for large-scale nonnegative data, and has been successfully applied to various tasks such as image processing [4, 34], acoustic signal processing [14, 31], network analysis [18, 22, 50], and mobile sensor calibration [12]. A key difference between NMF and other dimensionality reduction methods such as principal component analysis [51] is that the factor matrices obtained by NMF are nonnegative and tend to be sparse [32]. Thus, NMF can learn a parts-based representation of the data [32].

Given an \(M \times N\) nonnegative matrix \(\varvec{X}\), NMF aims to decompose it into two nonnegative factor matrices \(\varvec{W}\) and \(\varvec{H}\) of sizes \(M \times K\) and \(N \times K\), respectively, so that \(\varvec{W}\varvec{H}^\mathrm {T}\) is approximately equal to \(\varvec{X}\), where K is much less than \(\min \{M, N\}\) (see Fig. 1). The problem of finding such factor matrices is often formulated as the constrained optimization problem:

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} \displaystyle f(\varvec{W}, \varvec{H}) = \frac{1}{2} \left\| \varvec{X} - \varvec{W}\varvec{H}^\mathrm {T} \right\| _\mathrm {F}^2 \\ \text{ subject } \text{ to } &{} \varvec{W} \ge \varvec{0}_{M \times K},\ \varvec{H} \ge \varvec{0}_{N \times K}, \end{array} \end{aligned}$$
(1)

where \(\Vert \cdot \Vert _\mathrm {F}\) denotes the Frobenius norm of matrices, and \(\varvec{0}_{I \times J}\) is the \(I \times J\) matrix of all zeros. For matrices \(\varvec{P}\) and \(\varvec{Q}\) of the same size, \(\varvec{P} \ge \varvec{Q}\) means element-wise inequality. The Frobenius norm can be replaced with one of several alternatives such as the I-divergence [33], the Itakura-Saito divergence [14] and others [52]. Also, one or more regularization terms can be added to the objective function in order to enforce desirable properties on the factor matrices [24, 25, 40, 41]. As with many machine learning methods, the \(\ell _1\)-regularization term is often used in NMF.
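To make the objective in (1) concrete, the following minimal NumPy sketch (ours, not part of the original formulation) evaluates \(f(\varvec{W},\varvec{H})\) for given factor matrices; the function name and the random test data are illustrative only.

```python
import numpy as np

def nmf_objective(X, W, H):
    """f(W, H) = 0.5 * ||X - W H^T||_F^2 as in (1)."""
    R = X - W @ H.T
    return 0.5 * np.sum(R * R)

# Small usage example with random nonnegative data (M = 30, N = 20, K = 5).
rng = np.random.default_rng(0)
X = rng.random((30, 20))
W = rng.random((30, 5))
H = rng.random((20, 5))
print(nmf_objective(X, W, H))
```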

Fig. 1 Nonnegative matrix factorization

Various methods for finding a locally optimal solution of the optimization problem (1) have been developed so far. Note that finding a globally optimal solution is difficult in general because it is known that (1) is NP-hard [49]. Most of the conventional methods update some or all of the elements of one factor matrix at a time because the objective function \(f(\varvec{W},\varvec{H})\) is not jointly convex but convex in \(\varvec{W}\) or \(\varvec{H}\). For example, the multiplicative update rule (MUR) [33], which is widely known as a simple and easy-to-implement method, alternately updates \(\varvec{W}\) and \(\varvec{H}\) according to rules derived from strictly convex auxiliary functions [33]. An important advantage of the MUR is that the value of the objective function decreases monotonically as long as division by zero does not occur. However, division by zero can indeed occur in the MUR because elements of the factor matrices can become zero. For this reason, convergence of the factor matrices is not guaranteed. In fact, it was shown experimentally that the MUR sometimes fails to converge to a stationary point [19]. To solve this problem, some authors proposed modified MURs [16, 35]. For example, Gillis and Glineur [16] proposed to replace all values less than a positive constant \(\epsilon \) with \(\epsilon \) after updating \(\varvec{W}\) and \(\varvec{H}\) using the original MUR. Their modified MUR was later proved by Takahashi and Hibi [46] to be globally convergent in the sense of Zangwill [53] (see Definition 1 of the present paper) to a stationary point of the corresponding optimization problem:

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} \displaystyle f(\varvec{W}, \varvec{H}) = \frac{1}{2} \left\| \varvec{X} - \varvec{W}\varvec{H}^\mathrm {T} \right\| _\mathrm {F}^2 \\ \text{ subject } \text{ to } &{} \varvec{W} \ge \epsilon \varvec{1}_{M \times K},\ \varvec{H} \ge \epsilon \varvec{1}_{N \times K}, \end{array} \end{aligned}$$
(2)

where \(\varvec{1}_{I \times J}\) denotes the \(I \times J\) matrix of all ones. Lin [35] proposed a different kind of modified MUR and proved its global convergence to a stationary point of (1). However, this modified MUR is much more complicated than the one mentioned above, and requires a higher computational cost.

Another well-known method for solving (1) is the Hierarchical Alternating Least Squares (HALS) algorithm [6, 7], which is much faster than the MUR in many cases, and much simpler than other fast algorithms [17, 20, 26, 28, 36, 54]. The HALS algorithm updates one column of the factor matrices at a time according to a rule derived from the partial derivative of the objective function with respect to the column. The value of the objective function decreases monotonically if the columns of the factor matrices remain nonzero throughout the iterations [27]. However, as with the MUR, elements of the factor matrices can become zero, and this may cause division by zero. To solve this problem, some authors proposed modified update rules for the HALS algorithm [6, 15]. The one proposed by Cichocki et al. [6] takes the same approach as the modified MUR [16]. It replaces all values less than a positive constant \(\epsilon \) with \(\epsilon \) after updating each column of the factor matrices using the original update rule. Although the global convergence to a stationary point of (2) has been proved [29], this update rule cannot produce sparse factor matrices because every element of the resulting factor matrices is at least \(\epsilon \). In contrast, the update rule given by Gillis [15] not only allows variables to be zero but also avoids division by zero. Furthermore, the value of the objective function decreases monotonically under this update rule. However, the global convergence to a stationary point of (1) is not guaranteed because the level set of the objective function is unbounded.

In this paper, we propose a novel update rule for the HALS algorithm, and prove its global convergence to a stationary point of (1) using Zangwill’s global convergence theorem [53]. The proposed update rule is a combination of the original update rule, the update rule of Gillis [15] and a normalization step. The normalization step is elaborately designed to guarantee not only the boundedness of variables but also the closedness of the point-to-set mapping representing the proposed update rule. We also present two stopping conditions that guarantee the finite termination of the HALS algorithm using the proposed update rule. In addition, the practical usefulness of the proposed update rule is shown through experiments using real-world datasets.

There are many variants of NMF. For example, variants of NMF with additional constraints such as orthogonality [9], symmetry [37] and separability [1, 11, 43] have been extensively studied. These variants are important not only from a theoretical viewpoint but also in practice. In fact, they have many applications in document clustering, community detection, dictionary learning and so on. However, we do not consider these variants in this paper because each of them requires its own specialized algorithms.

The remainder of this paper is organized as follows. In Sect. 2, notations and definitions used in later sections are presented. In Sect. 3, the conventional update rules of the HALS algorithm and their convergence property are reviewed. In Sect. 4, a novel update rule of the HALS algorithm is proposed and its global convergence is proved. In Sect. 5, two stopping conditions are presented and the finite termination of the HALS algorithm using these stopping conditions is proved. In Sect. 6, some experimental results are presented to show the practical usefulness of the proposed update rule. Section 7 introduces some variants of the HALS algorithm to which the proposed update rule can be applied. Section 8 concludes this work and discusses a possible future direction.

2 Notations and definitions

The sets of integers, nonnegative integers, and positive integers are denoted by \({\mathbb {Z}}\), \({\mathbb {Z}}_{+}\) and \({\mathbb {Z}}_{++}\), respectively. Similarly, the sets of real numbers, nonnegative real numbers, and positive real numbers are denoted by \({\mathbb {R}}\), \({\mathbb {R}}_{+}\) and \({\mathbb {R}}_{++}\), respectively. The \(I \times J\) matrix of all zeros and that of all ones are denoted by \(\varvec{0}_{I \times J}\) and \(\varvec{1}_{I \times J}\), respectively.

For any vector \(\varvec{v}=(v_1,v_2,\ldots ,v_I)^\mathrm {T} \in {\mathbb {R}}^I\), the \(\ell _1\)- and \(\ell _2\)-norms of \(\varvec{v}\) are denoted by \(\Vert \varvec{v}\Vert _1\) and \(\Vert \varvec{v}\Vert _2\), respectively. The notation \([\varvec{v}]_{+}\) represents the vector whose i-th element is given by \(\max \{0,v_i\}\) for all i. Similarly, for any vector \(\varvec{v} \in {\mathbb {R}}^I\) and any constant \(\epsilon \in {\mathbb {R}}_{++}\), the notation \([\varvec{v}]_{\epsilon +}\) represents the vector whose i-th element is given by \(\max \{\epsilon ,v_i\}\) for all i.
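In NumPy terms, the projections \([\varvec{v}]_{+}\) and \([\varvec{v}]_{\epsilon +}\) are simply element-wise maxima; the following small helpers (ours, with illustrative names) make the notation concrete.

```python
import numpy as np

def proj_nonneg(v):
    """[v]_+ : replace each negative element with 0."""
    return np.maximum(v, 0.0)

def proj_eps(v, eps):
    """[v]_{eps+} : replace each element smaller than eps with eps."""
    return np.maximum(v, eps)
```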

The feasible region of the constrained optimization problem (1) is denoted by \({\mathcal {F}}\). That is, \({\mathcal {F}}={\mathbb {R}}_{+}^{M \times K} \times {\mathbb {R}}_{+}^{N \times K}\). We call \((\varvec{W},\varvec{H}) \in {\mathbb {R}}^{M \times K} \times {\mathbb {R}}^{N \times K}\) a stationary point of (1) if it satisfies the Karush-Kuhn-Tucker (KKT) conditions:

$$\begin{aligned} \varvec{W}&\ge \varvec{0}_{M \times K}, \end{aligned}$$
(3a)
$$\begin{aligned} \varvec{H}&\ge \varvec{0}_{N \times K}, \end{aligned}$$
(3b)
$$\begin{aligned} \nabla _{\varvec{W}} f(\varvec{W},\varvec{H})&\ge \varvec{0}_{M \times K}, \end{aligned}$$
(3c)
$$\begin{aligned} \nabla _{\varvec{H}} f(\varvec{W},\varvec{H})&\ge \varvec{0}_{N \times K}, \end{aligned}$$
(3d)
$$\begin{aligned} \nabla _{\varvec{W}} f(\varvec{W},\varvec{H}) \odot \varvec{W}&= \varvec{0}_{M \times K}, \end{aligned}$$
(3e)
$$\begin{aligned} \nabla _{\varvec{H}} f(\varvec{W},\varvec{H}) \odot \varvec{H}&= \varvec{0}_{N \times K}, \end{aligned}$$
(3f)

where

$$\begin{aligned} \begin{aligned} \nabla _{\varvec{W}} f(\varvec{W},\varvec{H})&= (\varvec{W}\varvec{H}^\mathrm {T} - \varvec{X})\varvec{H}, \\ \nabla _{\varvec{H}} f(\varvec{W},\varvec{H})&= (\varvec{H}\varvec{W}^\mathrm {T} - \varvec{X}^\mathrm {T})\varvec{W}, \end{aligned} \end{aligned}$$

and \(\odot \) represents the element-wise product. The set of stationary points of (1) is denoted by \({\mathcal {S}}\).
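As an illustration, the gradients above and a simple aggregate measure of how far a pair \((\varvec{W},\varvec{H})\) is from satisfying (3a)-(3f) can be computed as in the NumPy sketch below; the function names and the particular violation measure are ours and are not the stopping conditions used later in the paper.

```python
import numpy as np

def gradients(X, W, H):
    """Return nabla_W f and nabla_H f for the objective in (1)."""
    E = W @ H.T - X            # residual  W H^T - X
    return E @ H, E.T @ W

def kkt_violation(X, W, H):
    """Largest violation of the KKT conditions (3a)-(3f); zero at a stationary point."""
    GW, GH = gradients(X, W, H)
    return max(
        np.max(np.maximum(-W, 0.0)),  np.max(np.maximum(-H, 0.0)),   # (3a), (3b)
        np.max(np.maximum(-GW, 0.0)), np.max(np.maximum(-GH, 0.0)),  # (3c), (3d)
        np.max(np.abs(GW * W)),       np.max(np.abs(GH * H)),        # (3e), (3f)
    )
```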

Similarly, the feasible region of the constrained optimization problem (2) is denoted by \({\mathcal {F}}_{\epsilon }\). That is, \({\mathcal {F}}_{\epsilon }=[\epsilon , \infty )^{M \times K} \times [\epsilon ,\infty )^{N \times K}\). We call \((\varvec{W},\varvec{H}) \in {\mathbb {R}}^{M \times K} \times {\mathbb {R}}^{N \times K}\) a stationary point of (2) if it satisfies the KKT conditions:

$$\begin{aligned} \varvec{W}&\ge \epsilon \varvec{1}_{M \times K}, \end{aligned}$$
(4a)
$$\begin{aligned} \varvec{H}&\ge \epsilon \varvec{1}_{N \times K}, \end{aligned}$$
(4b)
$$\begin{aligned} \nabla _{\varvec{W}} f(\varvec{W},\varvec{H})&\ge \varvec{0}_{M \times K}, \end{aligned}$$
(4c)
$$\begin{aligned} \nabla _{\varvec{H}} f(\varvec{W},\varvec{H})&\ge \varvec{0}_{N \times K}, \end{aligned}$$
(4d)
$$\begin{aligned} \nabla _{\varvec{W}} f(\varvec{W},\varvec{H}) \odot (\varvec{W}-\epsilon \varvec{1}_{M \times K})&= \varvec{0}_{M \times K}, \end{aligned}$$
(4e)
$$\begin{aligned} \nabla _{\varvec{H}} f(\varvec{W},\varvec{H}) \odot (\varvec{H}-\epsilon \varvec{1}_{N \times K})&= \varvec{0}_{N \times K}. \end{aligned}$$
(4f)

The set of stationary points of (2) is denoted by \({\mathcal {S}}_{\epsilon }\).

Many iterative algorithms for solving (1) have been proposed so far. Such an algorithm starts with an initial point \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {F}}\) and generates a sequence of points \(\{(\varvec{W}^{(t)},\varvec{H}^{(t)})\}_{t=0}^{\infty } \subset {\mathcal {F}}\) that is expected to converge to a stationary point of (1). Following Zangwill [53], we define the global convergence of an iterative algorithm for solving (1) as follows.

Definition 1

(Global Convergence) An iterative algorithm for solving (1) is said to be globally convergent to \({\mathcal {S}}\) if any sequence \(\{(\varvec{W}^{(t)},\varvec{H}^{(t)})\}_{t=0}^\infty \subset {\mathcal {F}}\) generated by the algorithm has at least one convergent subsequence and the limit of any convergent subsequence belongs to \({\mathcal {S}}\).

Note that Definition 1 does not mean the convergence of the whole sequence \(\{(\varvec{W}^{(t)},\varvec{H}^{(t)})\}_{t=0}^\infty \) to a stationary point. Nevertheless, the notion of global convergence as defined above is of great practical importance because the finite termination of the algorithm is guaranteed if we relax the KKT conditions in a proper way and use them as the stopping condition [29, 46, 47].

Using Zangwill’s global convergence theorem [53], we can obtain a theorem that gives a sufficient condition for an iterative algorithm for solving (1) to be globally convergent to \({\mathcal {S}}\). Before presenting the theorem, we introduce two important notions: point-to-set mappings and their closedness. We view every iterative algorithm for solving (1) as a process that, at each iteration, defines a set of candidate points from the current point and then selects one of them as the next point. Each algorithm is thus characterized by how it defines the set of candidate points, which is represented by a point-to-set mapping from \({\mathcal {F}}\) to its subsets. The closedness of point-to-set mappings from \({\mathcal {F}}\) to its subsets is defined as follows.

Definition 2

(Closed Mapping) A point-to-set mapping A from \({\mathcal {F}}\) to its subsets is said to be closed on \({\mathcal {D}} \subseteq {\mathcal {F}}\) if, for any sequence \(\{(\varvec{P}^{(t)},\varvec{Q}^{(t)})\}_{t=0}^{\infty } \subset {\mathcal {F}}\) that converges to \((\varvec{P}^{(\infty )},\varvec{Q}^{(\infty )}) \in {\mathcal {D}}\) and any sequence \(\{(\varvec{U}^{(t)},\varvec{V}^{(t)})\}_{t=0}^{\infty } \subset {\mathcal {F}}\) such that \((\varvec{U}^{(t)},\varvec{V}^{(t)}) \in A(\varvec{P}^{(t)},\varvec{Q}^{(t)})\) for all \(t \in {\mathbb {Z}}_{+}\) and that converges to \((\varvec{U}^{(\infty )},\varvec{V}^{(\infty )}) \in {\mathcal {F}}\), the limits satisfy \((\varvec{U}^{(\infty )},\varvec{V}^{(\infty )}) \in A(\varvec{P}^{(\infty )},\varvec{Q}^{(\infty )})\).

It is often the case that the set \(A(\varvec{W},\varvec{H})\) consists of only one point in \({\mathcal {F}}\) for any \((\varvec{W},\varvec{H}) \in {\mathcal {F}}\). In this case, A can be considered as a point-to-point mapping from \({\mathcal {F}}\) to itself, and the closedness defined above can be considered as the continuity of A.

Now we are ready to present a theorem that can be obtained as a direct consequence of Zangwill’s global convergence theorem [53].

Theorem 1

Let A be the point-to-set mapping from \({\mathcal {F}}\) to its subsets that represents an iterative algorithm for solving (1). If A satisfies the following conditions then the algorithm is globally convergent to \({\mathcal {S}}\).

  1. Any sequence \(\{(\varvec{W}^{(t)},\varvec{H}^{(t)})\}_{t=0}^\infty \) generated by the mapping A in such a way that \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {F}}\) and \((\varvec{W}^{(t+1)},\varvec{H}^{(t+1)}) \in A(\varvec{W}^{(t)},\varvec{H}^{(t)})\) for all \(t \in {\mathbb {Z}}_{+}\) is contained in a compact subset of \({\mathcal {F}}\).

  2. The mapping A does not increase the value of f. To be more specific, for any point \((\varvec{W},\varvec{H}) \in {\mathcal {F}}\), the following statements hold true.

     (a) If \((\varvec{W},\varvec{H}) \not \in {\mathcal {S}}\) then \(f(\varvec{U},\varvec{V}) < f(\varvec{W},\varvec{H})\) for all \((\varvec{U},\varvec{V}) \in A(\varvec{W},\varvec{H})\).

     (b) If \((\varvec{W},\varvec{H}) \in {\mathcal {S}}\) then \(f(\varvec{U},\varvec{V}) \le f(\varvec{W},\varvec{H})\) for all \((\varvec{U},\varvec{V}) \in A(\varvec{W},\varvec{H})\).

  3. The mapping A is closed on \({\mathcal {F}}\setminus {\mathcal {S}}\).

The global convergence of iterative algorithms for solving (2) and the closedness of point-to-set mappings from \({\mathcal {F}}_{\epsilon }\) to its subsets can be defined in the same way as above. Also, if we replace \({\mathcal {F}}\) and \({\mathcal {S}}\) in Theorem 1 with \({\mathcal {F}}_{\epsilon }\) and \({\mathcal {S}}_{\epsilon }\), respectively, we obtain a theorem that gives a sufficient condition for algorithms for solving (2) to be globally convergent to \({\mathcal {S}}_{\epsilon }\).

Zangwill’s global convergence theorem is well known as a powerful framework for proving the global convergence of iterative algorithms. For example, it was used in proving the global convergence of the concave-convex procedure [45], the decomposition method for support vector machines [48], and the modified MUR for NMF [47].

3 HALS algorithm

In this section, we review the HALS algorithm [6] for solving the optimization problem (1) and some of its variants. We also review their convergence property.

Let the k-th columns of \(\varvec{W}\) and \(\varvec{H}\) be denoted by \(\varvec{w}_k\) and \(\varvec{h}_k\), respectively. Then the problem (1) is rewritten as follows:

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} \displaystyle \frac{1}{2} \left\| \varvec{X} - \sum _{k=1}^K \varvec{w}_k\varvec{h}_k^\mathrm {T} \right\| _\mathrm {F}^2 \\ \text{ subject } \text{ to } &{} \varvec{w}_k \ge \varvec{0}_{M \times 1},\;\; \varvec{h}_k \ge \varvec{0}_{N \times 1},\;\; k=1,2,\ldots ,K. \end{array} \end{aligned}$$
(5)

The HALS algorithm, which can be viewed as a special case of the block coordinate descent (BCD) method [27], updates 2K column vectors \(\varvec{w}_1,\varvec{w}_2,\ldots ,\varvec{w}_K\) and \(\varvec{h}_1,\varvec{h}_2,\ldots ,\varvec{h}_K\) one by one in a fixed order so that the value of the objective function of (5) decreases monotonically. When updating \(\varvec{w}_k\), the HALS algorithm considers all other variables as constants and solves the following subproblem:

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} \displaystyle p_k(\varvec{w}_k) = \frac{1}{2} \left\| \varvec{R}_k^\mathrm {T} - \varvec{h}_k\varvec{w}_k^\mathrm {T} \right\| _\mathrm {F}^2 \\ \text{ subject } \text{ to } &{} \varvec{w}_k \ge \varvec{0}_{M \times 1} \end{array} \end{aligned}$$
(6)

where

$$\begin{aligned} \varvec{R}_k = \varvec{X} - \sum _{{\tilde{k}}=1, {\tilde{k}} \ne k}^K \varvec{w}_{{\tilde{k}}}\varvec{h}_{{\tilde{k}}}^\mathrm {T}. \end{aligned}$$

If \(\varvec{h}_k {\ne } \varvec{0}_{N \times 1}\), the objective function \(p_k(\varvec{w}_k)\) is strictly convex and minimized at \(\varvec{w}_k{=}\varvec{R}_k \varvec{h}_k/\Vert \varvec{h}_k\Vert _2^2\). Hence the subproblem (6) has the unique optimal solution \(\varvec{w}_k=\left[ \varvec{R}_k \varvec{h}_k/\Vert \varvec{h}_k\Vert _2^2\right] _{+}\) [27, Theorem 2]. Similarly, when updating \(\varvec{h}_k\), the HALS algorithm considers all other variables as constants and solves the following subproblem:

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} \displaystyle q_k(\varvec{h}_k) = \frac{1}{2} \left\| \varvec{R}_k - \varvec{w}_k\varvec{h}_k^\mathrm {T} \right\| _\mathrm {F}^2 \\ \text{ subject } \text{ to } &{} \varvec{h}_k \ge \varvec{0}_{N \times 1} . \end{array} \end{aligned}$$
(7)

Taking into account the correspondence between variables and constants in (6) and those in (7), we can say that the subproblem (7) has the unique optimal solution \(\varvec{h}_k=\left[ \varvec{R}_k^\mathrm {T} \varvec{w}_k/\Vert \varvec{w}_k\Vert _2^2\right] _{+}\) if \(\varvec{w}_k \ne \varvec{0}_{M \times 1}\). Based on these analyses, the update rule described by

$$\begin{aligned} \varvec{w}_k&\leftarrow \left[ \frac{\varvec{R}_k \varvec{h}_k}{\Vert \varvec{h}_k\Vert _2^2} \right] _{+}, \end{aligned}$$
(8)
$$\begin{aligned} \varvec{h}_k&\leftarrow \left[ \frac{\varvec{R}_k^\mathrm {T} \varvec{w}_k}{\Vert \varvec{w}_k\Vert _2^2}\right] _{+} \end{aligned}$$
(9)

is obtained [7, 23, 27]. In this paper, we call the algorithm based on this update rule the HALS algorithm [7] though it is also called the rank-one residue iteration algorithm [23].
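For illustration, one full sweep of the original HALS updates (8) and (9) can be written in NumPy as follows. This sketch is ours; it assumes, as in Theorem 2 below, that every column of \(\varvec{W}\) and \(\varvec{H}\) stays nonzero, since otherwise the divisions are undefined.

```python
import numpy as np

def hals_sweep(X, W, H):
    """One sweep of the original HALS updates (8)-(9) over k = 1, ..., K (in place)."""
    for k in range(W.shape[1]):
        # Residual R_k excludes the contribution of component k, so it is
        # unaffected by the update of w_k below.
        R_k = X - W @ H.T + np.outer(W[:, k], H[:, k])
        W[:, k] = np.maximum(R_k @ H[:, k] / (H[:, k] @ H[:, k]), 0.0)    # (8)
        H[:, k] = np.maximum(R_k.T @ W[:, k] / (W[:, k] @ W[:, k]), 0.0)  # (9)
    return W, H
```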

For the HALS algorithm, the following result is known.

Theorem 2

(Kim et al. [27]) If the columns of \(\varvec{W}\) and \(\varvec{H}\) remain nonzero throughout the iterations, every limit point of the sequence \(\left\{ (\varvec{W}^{(t)},\varvec{H}^{(t)}) \right\} _{t=0}^{\infty }\) generated by the HALS algorithm belongs to \({\mathcal {S}}\).

Note that the global convergence of the HALS algorithm is not guaranteed by this theorem. There are two issues to consider. First, the assumption that the columns of \(\varvec{W}\) and \(\varvec{H}\) remain nonzero throughout the iterations may not always be valid. Once \(\varvec{w}_k\) becomes zero, for example, \(\varvec{h}_k\) cannot be updated because the right-hand side of (9) becomes an indeterminate form. Second, even when the assumption is valid, the sequence generated by the HALS algorithm may have no limit point.

A simple way to avoid indeterminate forms is to use

$$\begin{aligned} \varvec{w}_k&\leftarrow \left[ \frac{\varvec{R}_k \varvec{h}_k}{\Vert \varvec{h}_k\Vert _2^2} \right] _{\epsilon +}, \end{aligned}$$
(10)
$$\begin{aligned} \varvec{h}_k&\leftarrow \left[ \frac{\varvec{R}_k^\mathrm {T} \varvec{w}_k}{\Vert \varvec{w}_k\Vert _2^2} \right] _{\epsilon +} \end{aligned}$$
(11)

instead of (8) and (9), where \(\epsilon \) is a small positive constant. This update rule was introduced by Cichocki et al. [6] to avoid numerical instability, and it was later proved to be globally convergent, as shown in the following theorem.

Theorem 3

(Kimura and Takahashi [29]) The HALS algorithm using the update rule described by (10) and (11) is globally convergent to \({\mathcal {S}}_{\epsilon }\).

Note that the update rule described by (10) and (11) does not perform NMF but positive matrix factorization [39]. In addition, the limit of any convergent subsequence is not a stationary point of (1) but one of (2) as shown in Theorem 3. Hence this update rule produces only dense factor matrices. One may claim that sparse factor matrices will be obtained if we replace all \(\epsilon \) in the factor matrices with zeros and that the pair of the resulting sparse factor matrices will be close to \({\mathcal {S}}\). However, it is not clear whether this claim always holds true or not.
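A sketch of the \(\epsilon \)-clipped sweep (10)-(11) differs from the previous sketch only in the clipping constant; again, this NumPy fragment is ours and assumes the factor matrices are initialized with entries no smaller than \(\epsilon \), so that the denominators are always positive.

```python
import numpy as np

def hals_sweep_eps(X, W, H, eps=1e-8):
    """One sweep of the updates (10)-(11); every entry stays at or above eps."""
    for k in range(W.shape[1]):
        R_k = X - W @ H.T + np.outer(W[:, k], H[:, k])
        W[:, k] = np.maximum(R_k @ H[:, k] / (H[:, k] @ H[:, k]), eps)    # (10)
        H[:, k] = np.maximum(R_k.T @ W[:, k] / (W[:, k] @ W[:, k]), eps)  # (11)
    return W, H
```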

Another simple way to avoid indeterminate forms is to use

$$\begin{aligned} \varvec{w}_k&\leftarrow \frac{\left[ \varvec{R}_k \varvec{h}_k + \delta \varvec{w}_k \right] _{+}}{\Vert \varvec{h}_k\Vert _2^2 + \delta }, \end{aligned}$$
(12)
$$\begin{aligned} \varvec{h}_k&\leftarrow \frac{\left[ \varvec{R}_k^\mathrm {T} \varvec{w}_k + \delta \varvec{h}_k \right] _{+}}{\Vert \varvec{w}_k\Vert _2^2 + \delta } \end{aligned}$$
(13)

instead of (8) and (9), where \(\delta \) is a positive constant. This update rule is derived from auxiliary functions of \(p_k(\varvec{w}_k)\) and \(q_k(\varvec{h}_k)\) [15]. Details will be shown in the proof of Lemma 2. For this update rule, the following result is known.

Theorem 4

(Gillis [15]) Every limit point of the sequence \(\left\{ (\varvec{W}^{(t)},\varvec{H}^{(t)}) \right\} _{t=0}^{\infty }\) generated by the HALS algorithm using the update rule described by (12) and (13) belongs to \({\mathcal {S}}\).

Just like Theorem 2 for the original HALS algorithm, Theorem 4 says nothing about the global convergence of the update rule described by (12) and (13) to \({\mathcal {S}}\). The existence of a limit point is not guaranteed even though the objective function value decreases monotonically along the sequence \(\left\{ (\varvec{W}^{(t)},\varvec{H}^{(t)}) \right\} _{t=0}^{\infty }\) generated by the update rule, because the level set of the objective function \(f(\varvec{W},\varvec{H})\) is unbounded.
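For comparison with the proposed rule in the next section, here is a NumPy sketch (ours) of one sweep of the updates (12)-(13); the denominators are always at least \(\delta \), so no division by zero can occur.

```python
import numpy as np

def gillis_sweep(X, W, H, delta=1e-8):
    """One sweep of the updates (12)-(13) over k = 1, ..., K (in place)."""
    for k in range(W.shape[1]):
        R_k = X - W @ H.T + np.outer(W[:, k], H[:, k])
        W[:, k] = np.maximum(R_k @ H[:, k] + delta * W[:, k], 0.0) \
                  / (H[:, k] @ H[:, k] + delta)                            # (12)
        H[:, k] = np.maximum(R_k.T @ W[:, k] + delta * H[:, k], 0.0) \
                  / (W[:, k] @ W[:, k] + delta)                            # (13)
    return W, H
```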

4 New update rule and its global convergence

In this section, we propose a new update rule of the HALS algorithm and prove that it is globally convergent to \({\mathcal {S}}\).

4.1 Proposed update rule

The update rule we propose in this paper is described by

$$\begin{aligned} \varvec{w}_k&\leftarrow \frac{\left[ \varvec{R}_k \varvec{h}_k + \delta \varvec{w}_k \right] _{+}}{\Vert \varvec{h}_k\Vert _2^2 + \delta }, \end{aligned}$$
(14)
$$\begin{aligned} \varvec{w}_k&\leftarrow {\left\{ \begin{array}{ll} \varvec{w}_k/\Vert \varvec{w}_k\Vert _2, &{} \text{ if } \varvec{w}_k \ne \varvec{0}_{M \times 1}, \\ \varvec{u}_k, &{} \text{ otherwise }, \end{array}\right. } \end{aligned}$$
(15)
$$\begin{aligned} \varvec{h}_k&\leftarrow \left[ \varvec{R}_k^\mathrm {T} \varvec{w}_k\right] _{+} \end{aligned}$$
(16)

where \(\delta \) is a positive constant and \(\varvec{u}_k\) is an arbitrary nonnegative unit vector. It is clear that division by zero never occurs in the proposed update rule. The first formula (14) is the same as (12). The second formula (15) is the normalization procedure for \(\varvec{w}_k\). The third formula (16) is used instead of (9) because \(\Vert \varvec{w}_k\Vert _2^2=1\) always holds when \(\varvec{h}_k\) is updated. The normalization procedure plays an important role when we prove that any sequence generated by the proposed update rule is contained in a compact subset of \({\mathcal {F}}\).

In this paper, we focus our attention on the case where the columns of \(\varvec{W}\) are normalized, but the alternative case where the columns of \(\varvec{H}\) are normalized can be dealt with in the same way. Here we should note that the modified MUR [35] also uses a normalization procedure, but this is slightly different from ours. It uses \(\varvec{0}_{M \times 1}\) instead of \(\varvec{u}_k\) in (15).
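The following NumPy sketch (ours, not the paper's formal Algorithm 1) shows one sweep of (14)-(16). It uses the constant unit vector \((1,0,\ldots ,0)^\mathrm {T}\) for \(\varvec{u}_k\), which is permitted (see the remark at the end of Sect. 4.2), and it omits the rescaling of \(\varvec{h}_k\) (Step 4 of Algorithm 1), which is needed only for the convergence analysis.

```python
import numpy as np

def proposed_sweep(X, W, H, delta=1e-8):
    """One sweep of the proposed updates (14)-(16) over k = 1, ..., K (in place)."""
    for k in range(W.shape[1]):
        R_k = X - W @ H.T + np.outer(W[:, k], H[:, k])   # residual without component k
        # (14): damped, projected least-squares step for w_k.
        W[:, k] = np.maximum(R_k @ H[:, k] + delta * W[:, k], 0.0) \
                  / (H[:, k] @ H[:, k] + delta)
        # (15): normalize w_k, falling back to a fixed unit vector u_k if w_k = 0.
        norm = np.linalg.norm(W[:, k])
        if norm > 0.0:
            W[:, k] /= norm
        else:
            W[:, k] = 0.0
            W[0, k] = 1.0
        # (16): exact minimizer for h_k; no division is needed since ||w_k||_2 = 1.
        H[:, k] = np.maximum(R_k.T @ W[:, k], 0.0)
    return W, H
```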

A formal statement of the proposed update rule is presented in Algorithm 1. Note that Step 4 is added to facilitate the global convergence analysis, though it is not necessary for practical purposes. Note also that Steps 2 and 3 can be replaced with

$$\begin{aligned} \varvec{w}_k \leftarrow \left[ \varvec{w}_k+\frac{\left( \varvec{X}-\varvec{W}\varvec{H}^\mathrm {T}\right) \varvec{h}_k}{\Vert \varvec{h}_k\Vert _2^2+\delta }\right] _{+} \end{aligned}$$

and Step 6 can be replaced with

$$\begin{aligned} \varvec{h}_k \leftarrow \left[ \varvec{h}_k+\left( \varvec{X}-\varvec{W}\varvec{H}^\mathrm {T}\right) ^\mathrm {T}\varvec{w}_k\right] _{+} \end{aligned}$$

for an efficient implementation (see Cichocki and Phan [5] for more details). It is easy to see that the proposed update rule has the same computational complexity per iteration as the original update rule. The following theorem establishes the global convergence of the proposed update rule.

Algorithm 1

Theorem 5

The HALS algorithm using the update rule shown in Algorithm 1 is globally convergent to \({\mathcal {S}}\).

This theorem can be proved by using Theorem 1. Details are shown in the next subsection.

4.2 Proof of Theorem 5

We prove Theorem 5 by using Theorem 1. Let the point-to-set mapping representing Algorithm 1 be denoted by A. Also, let the point-to-set mappings corresponding to Steps 3, 4, 5 and 6 of Algorithm 1 be denoted by \(D_k^\mathrm {W}\), \(S_k^\mathrm {H}\), \(S_k^\mathrm {W}\) and \(D_k^\mathrm {H}\), respectively. Then A is expressed as

$$\begin{aligned} A = D_K^\mathrm {H} \circ S_K^\mathrm {W} \circ S_K^\mathrm {H} \circ D_K^\mathrm {W} \circ \cdots \circ D_1^\mathrm {H} \circ S_1^\mathrm {W} \circ S_1^\mathrm {H} \circ D_1^\mathrm {W} \end{aligned}$$

where \(\circ \) denotes the composition of mappings. The mappings \(D_k^\mathrm {W}\), \(S_k^\mathrm {H}\) and \(D_k^\mathrm {H}\) are given by

$$\begin{aligned} D_k^\mathrm {W}(\varvec{W},\varvec{H})&= \{(\varvec{U},\varvec{V}) \in {\mathcal {F}}\,|\, \varvec{u}_k=[\varvec{R}_k\varvec{h}_k+\delta \varvec{w}_k]_{+}/(\Vert \varvec{h}_k\Vert _2^2+\delta ), \\&\qquad \varvec{u}_{{\tilde{k}}}=\varvec{w}_{{\tilde{k}}} \text{ for } \text{ all } {\tilde{k}} \ne k, \varvec{V}=\varvec{H}\}, \\ S_k^\mathrm {H}(\varvec{W},\varvec{H})&= \{(\varvec{U},\varvec{V}) \in {\mathcal {F}}\,|\,\varvec{U}=\varvec{W}, \varvec{v}_k=\varvec{h}_k\Vert \varvec{w}_k\Vert _2, \varvec{v}_{{\tilde{k}}}=\varvec{h}_{{\tilde{k}}} \text{ for } \text{ all } {\tilde{k}} \ne k\}, \\ D_k^\mathrm {H}(\varvec{W},\varvec{H})&= \{(\varvec{U},\varvec{V}) \in {\mathcal {F}}\,|\,\varvec{U}=\varvec{W}, \varvec{v}_k=[\varvec{R}_k\varvec{w}_k]_{+},\varvec{v}_{{\tilde{k}}}=\varvec{h}_{{\tilde{k}}} \text{ for } \text{ all } {\tilde{k}} \ne k\}, \end{aligned}$$

and the mapping \(S_k^\mathrm {W}(\varvec{W},\varvec{H})\) is given by

$$\begin{aligned} S_k^\mathrm {W}(\varvec{W},\varvec{H}) = \{(\varvec{U},\varvec{V}) \in {\mathcal {F}}\,|\, \varvec{u}_k=\varvec{w}_k/\Vert \varvec{w}_k\Vert _2,\varvec{u}_{{\tilde{k}}}=\varvec{w}_{{\tilde{k}}} \text{ for } \text{ all } {\tilde{k}} \ne k, \varvec{V}=\varvec{H}\} \end{aligned}$$

if \(\varvec{w}_k \ne \varvec{0}_{M \times 1}\), and

$$\begin{aligned} S_k^\mathrm {W}(\varvec{W},\varvec{H}) = \{(\varvec{U},\varvec{V}) \in {\mathcal {F}}\,|\, \Vert \varvec{u}_k\Vert _2=1,\varvec{u}_{{\tilde{k}}}=\varvec{w}_{{\tilde{k}}} \text{ for } \text{ all } {\tilde{k}} \ne k, \varvec{V}=\varvec{H}\}, \end{aligned}$$

otherwise. Note that the set \(D_k^\mathrm {W}(\varvec{W},\varvec{H})\) consists of only one point in \({\mathcal {F}}\), which is represented as a continuous function of \((\varvec{W},\varvec{H})\). The same can be said for \(S_k^\mathrm {H}(\varvec{W},\varvec{H})\) and \(D_k^\mathrm {H}(\varvec{W},\varvec{H})\).

We now prove that the proposed update rule satisfies the second condition in Theorem 1. Let us begin with the definition and an important property of the auxiliary function [33] because it plays an important role in our proof.

Definition 3

(Auxiliary Function [33]) For a function \(g: {\mathbb {R}}_{+} \rightarrow {\mathbb {R}}\), a two-variable function \({\bar{g}}: {\mathbb {R}}_{+} \times {\mathbb {R}}_{+} \rightarrow {\mathbb {R}}\) is called an auxiliary function of g if the following conditions hold:

  1. \({\bar{g}}(x,x)=g(x)\) for all \(x \in {\mathbb {R}}_{+}\),

  2. \({\bar{g}}(x,y) \ge g(x)\) for all \(x, y \in {\mathbb {R}}_{+}\).

Lemma 1

Let \({\bar{g}}: {\mathbb {R}}_{+} \times {\mathbb {R}}_{+} \rightarrow {\mathbb {R}}\) be an auxiliary function of \(g: {\mathbb {R}}_{+} \rightarrow {\mathbb {R}}\). If the inequality \({\bar{g}}(a,b) \le {\bar{g}}(b,b)\) holds for nonnegative numbers a and b then \(g(a) \le g(b)\). In particular, if \({\bar{g}}(a,b) < {\bar{g}}(b,b)\) then \(g(a) < g(b)\).

Proof

If \({\bar{g}}(a,b) \le {\bar{g}}(b,b)\), we have

$$\begin{aligned} g(a) \le {\bar{g}}(a,b) \le {\bar{g}}(b,b)=g(b). \end{aligned}$$
(17)

The first inequality follows from the second condition in Definition 3 and the equality follows from the first condition in Definition 3. If \({\bar{g}}(a,b)\) is strictly less than \({\bar{g}}(b,b)\), it is clear from (17) that \(g(a) < g(b)\). \(\square \)

Using Lemma 1, we obtain the following three lemmas.

Lemma 2

The objective function \(f(\varvec{W}, \varvec{H})\) is nonincreasing under the proposed update rule shown in Algorithm 1.

Proof

The objective function is nonincreasing under the composite mapping \(S_k^\mathrm {W} \circ S_k^\mathrm {H}\) for all k because the value of \(\varvec{w}_k\varvec{h}_k^\mathrm {T}\) does not change before and after the composite mapping is performed. Also, the objective function is nonincreasing under \(D_k^\mathrm {H}\) for all k because \(\varvec{h}_k=[\varvec{R}_k^\mathrm {T}\varvec{w}_k]_{+}\) is the unique optimal solution of (7) when \(\Vert \varvec{w}_k\Vert _2=1\). So it suffices for us to show that the objective function is nonincreasing under \(D_k^\mathrm {W}\) for all k.

When the mapping \(D_k^\mathrm {W}\) is performed, only \(\varvec{w}_k=[w_{1k},w_{2k},\ldots ,w_{Mk}]^\mathrm {T}\) is updated. We thus consider all variables other than \(\varvec{w}_k\) as constants, and show that the value of \(p_k(\varvec{w}_k)\), the objective function of (6), does not increase. Note that \(p_k(\varvec{w}_k)\) is rewritten as

$$\begin{aligned} p_k(\varvec{w}_k)=\sum _{m=1}^M p_{mk}(w_{mk}) \end{aligned}$$

where

$$\begin{aligned} p_{mk}(x)&=\frac{1}{2} \left\| (\varvec{r}_{m}^\mathrm {r})^\mathrm {T}-\varvec{h}_k x\right\| _2^2 \nonumber \\&= \frac{1}{2} \Vert \varvec{h}_k\Vert _2^2 x^2 -\varvec{r}_m^\mathrm {r} \varvec{h}_k x+\frac{1}{2}\Vert \varvec{r}_m^\mathrm {r}\Vert _2^2 \end{aligned}$$
(18)

and \(\varvec{r}_m^\mathrm {r}\) is the m-th row of \(\varvec{R}_k\). For the function \(p_{mk}(x)\), we define a two-variable function \({\bar{p}}_{mk}(x,y)\) as follows:

$$\begin{aligned} {\bar{p}}_{mk}(x,y) = p_{mk}(x)+\frac{\delta }{2}(x-y)^2 \end{aligned}$$
(19)

where \(\delta \) is a positive constant used in Algorithm 1. It is clear that \({\bar{p}}_{mk}(x,y)\) is an auxiliary function of \(p_{mk}(x)\) and strongly convex in both x and y (but not jointly) [15, 42]. For each value of y, the minimum point \(x^{*}\) of \({\bar{p}}_{mk}(x,y)\) in \({\mathbb {R}}_{+}\) is uniquely determined as

$$\begin{aligned} x^{*} = \frac{\left[ \varvec{r}_m^\mathrm {r} \varvec{h}_k+\delta y \right] _{+}}{\Vert \varvec{h}_k\Vert _2^2+\delta } . \end{aligned}$$
(20)
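
For completeness, (20) can be verified directly: setting the partial derivative of \({\bar{p}}_{mk}\) with respect to x to zero gives the unconstrained minimizer, and projecting it onto \({\mathbb {R}}_{+}\) yields the constrained one because \({\bar{p}}_{mk}(\cdot ,y)\) is a one-dimensional strongly convex quadratic:

$$\begin{aligned} \frac{\partial {\bar{p}}_{mk}}{\partial x}(x,y) = \Vert \varvec{h}_k\Vert _2^2 x - \varvec{r}_m^\mathrm {r} \varvec{h}_k + \delta (x-y) = 0 \quad \Longleftrightarrow \quad x = \frac{\varvec{r}_m^\mathrm {r} \varvec{h}_k + \delta y}{\Vert \varvec{h}_k\Vert _2^2 + \delta }. \end{aligned}$$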

Therefore, by Lemma 1, we have

$$\begin{aligned} p_{mk}(x^{*}) \le p_{mk}(y) . \end{aligned}$$

Substituting \(y=w_{mk}\) into this inequality, we have

$$\begin{aligned} p_{mk}\left( \frac{\left[ \varvec{r}_m^\mathrm {r} \varvec{h}_k+\delta w_{mk} \right] _{+}}{\Vert \varvec{h}_k\Vert _2^2+\delta }\right) \le p_{mk}(w_{mk}) \end{aligned}$$

from which we have

$$\begin{aligned} p_k\left( \frac{\left[ \varvec{R}_k \varvec{h}_k+\delta \varvec{w}_k\right] _{+}}{\Vert \varvec{h}_k\Vert _2^2+\delta }\right)&=\sum _{m=1}^M p_{mk}\left( \frac{\left[ \varvec{r}_m^\mathrm {r} \varvec{h}_k+\delta w_{mk} \right] _{+}}{\Vert \varvec{h}_k\Vert _2^2+\delta }\right) \\&\le \sum _{m=1}^M p_{mk}(w_{mk}) \\&= p_k(\varvec{w}_k) . \end{aligned}$$

This means that \(f(\varvec{W},\varvec{H})\) is nonincreasing under \(D_k^\mathrm {W}\). \(\square \)

Lemma 3

A point \((\varvec{W}^{*},\varvec{H}^{*})\) is a stationary point of (1) if and only if \(\varvec{w}_k\) is a stationary point of (6) with \(\varvec{h}_k=\varvec{h}_k^{*}\) for \(k = 1,2,\ldots ,K\) and \(\varvec{h}_k\) is a stationary point of (7) with \(\varvec{w}_k=\varvec{w}_k^{*}\) for \(k = 1,2,\ldots ,K\).

Proof

We omit the proof because it is similar to [29, Lemma 3]. \(\square \)

Lemma 4

For any \((\varvec{W},\varvec{H}) \in {\mathcal {F}}\), the following statements hold true.

  1. If \((\varvec{W},\varvec{H}) \not \in {\mathcal {S}}\) then \(f(\varvec{U},\varvec{V}) < f(\varvec{W},\varvec{H})\) for all \((\varvec{U},\varvec{V}) \in A(\varvec{W},\varvec{H})\).

  2. If \((\varvec{W},\varvec{H}) \in {\mathcal {S}}\) then \(f(\varvec{U},\varvec{V}) \le f(\varvec{W},\varvec{H})\) for all \((\varvec{U},\varvec{V}) \in A(\varvec{W},\varvec{H})\).

Proof

It is clear from Lemma 2 that the second statement holds true. Thus we only have to consider the first statement. Let \((\varvec{W},\varvec{H})\) be any point in \({\mathcal {F}} \setminus {\mathcal {S}}\). It follows from Lemma 3 that there exists at least one k such that i) \(\varvec{w}_k\) is not a stationary point of (6) or ii) \(\varvec{h}_k\) is not a stationary point of (7).

In the first case, there exists at least one m such that either \(w_{mk}=0\) and \(p_{mk}'(w_{mk}) < 0\), or \(w_{mk}>0\) and \(p_{mk}'(w_{mk}) \ne 0\), where \(p_{mk}(x)\) is given by (18). For such an m, the auxiliary function \({\bar{p}}_{mk}(x,y)\) of \(p_{mk}(x)\), which is given by (19), satisfies

$$\begin{aligned} \frac{\partial {\bar{p}}_{mk}}{\partial x}(w_{mk},w_{mk}) = p'_{mk}(w_{mk}) + \delta (w_{mk}-w_{mk}) = p'_{mk}(w_{mk}) \end{aligned}$$

which is negative if \(w_{mk}=0\) and nonzero if \(w_{mk}>0\). This means that \(x=w_{mk}\) is not the unique minimum point of \({\bar{p}}_{mk}(x,w_{mk})\). Hence \({\bar{p}}_{mk}(x^{*},w_{mk})<{\bar{p}}_{mk}(w_{mk},w_{mk})\) where \(x^{*}\) is the unique minimum point given by (20). From this inequality and Lemma 1, we have \(p_{mk}(x^{*})<p_{mk}(w_{mk})\) which implies that

$$\begin{aligned} p_k\left( \frac{\left[ \varvec{R}_k \varvec{h}_k+\delta \varvec{w}_k\right] _{+}}{\Vert \varvec{h}_k\Vert _2^2+\delta }\right) < p_k(\varvec{w}_k). \end{aligned}$$

Therefore, \(f(\varvec{W},\varvec{H})\) strictly decreases under the mapping A.

In the second case, we can show in the same way as above that \(f(\varvec{W},\varvec{H})\) strictly decreases under the mapping A. \(\square \)

We next prove that the proposed update rule satisfies the first condition in Theorem 1. To do so, for any point \((\varvec{W}^{(0)},\varvec{H}^{(0)})\) in \({\mathcal {F}}\), we define the set \({\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) as follows:

$$\begin{aligned} {\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})} = \{(\varvec{W},\varvec{H}) \in {\mathcal {F}}\,|\,f(\varvec{W},\varvec{H}) \le f(\varvec{W}^{(0)},\varvec{H}^{(0)}), \Vert \varvec{w}_k\Vert _2=1 \text{ for } \text{ all } k\}. \end{aligned}$$

Note that this is not a level set of f because of the condition that \(\Vert \varvec{w}_k\Vert _2=1\) for all k. The next lemma shows the boundedness of this set.

Lemma 5

The set \({\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is bounded for any \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {F}}\).

Proof

Let \((\varvec{W},\varvec{H})\) be any point in \({\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\). It suffices for us to show that \(\Vert \varvec{h}_k\Vert _2\) is bounded for \(k=1,2,\ldots ,K\). Because \(q_k(\varvec{h}_k)\) is convex, the inequality

$$\begin{aligned} q_k(\varvec{h}_k)&\ge q_k(\varvec{v})+\nabla q_k(\varvec{v})^\mathrm {T}(\varvec{h}_k-\varvec{v}) \\&= q_k(\varvec{v})-\left( \varvec{R}_k^\mathrm {T}\varvec{w}_k - \Vert \varvec{w}_k\Vert _2^2 \varvec{v}\right) ^\mathrm {T} (\varvec{h}_k-\varvec{v}) \\&= q_k(\varvec{v})-\left( \varvec{R}_k^\mathrm {T}\varvec{w}_k - \varvec{v}\right) ^\mathrm {T} (\varvec{h}_k-\varvec{v}) \end{aligned}$$

holds for any \(\varvec{v} \in {\mathbb {R}}^N\) [3], where the last equality uses \(\Vert \varvec{w}_k\Vert _2=1\). Substituting \(\varvec{v}=\varvec{R}_k^\mathrm {T}\varvec{w}_k+\varvec{1}_{N \times 1}\), we have

$$\begin{aligned} q_k(\varvec{h}_k) \ge \frac{1}{2}\Vert \varvec{R}_k-\varvec{w}_k(\varvec{R}_k^\mathrm {T}\varvec{w}_k+\varvec{1}_{N \times 1})^\mathrm {T}\Vert _\mathrm {F}^2 +\varvec{1}_{N \times 1}^\mathrm {T} (\varvec{h}_k-\varvec{R}_k^\mathrm {T}\varvec{w}_k-\varvec{1}_{N \times 1}). \end{aligned}$$

Hence, the inequality \(q_k(\varvec{h}_k) \le f(\varvec{W}^{(0)},\varvec{H}^{(0)})\) implies that

$$\begin{aligned} \varvec{1}_{N \times 1}^\mathrm {T} (\varvec{h}_k-\varvec{R}_k^\mathrm {T}\varvec{w}_k-\varvec{1}_{N \times 1}) \le f(\varvec{W}^{(0)},\varvec{H}^{(0)}) \end{aligned}$$

from which we have

$$\begin{aligned} \Vert \varvec{h}_k\Vert _2&\le \Vert \varvec{h}_k\Vert _1 \\&\le f(\varvec{W}^{(0)},\varvec{H}^{(0)})+\varvec{1}_{N \times 1}^\mathrm {T}\varvec{R}_k^\mathrm {T}\varvec{w}_k+N \\&\le f(\varvec{W}^{(0)},\varvec{H}^{(0)})+\varvec{1}_{N \times 1}^\mathrm {T}\varvec{X}^\mathrm {T}\varvec{1}_{M \times 1}+N . \end{aligned}$$

This completes the proof. \(\square \)

Using Lemma 5, we obtain the following lemma.

Lemma 6

Any sequence \(\left\{ (\varvec{W}^{(t)},\varvec{H}^{(t)}) \right\} _{t=0}^{\infty }\) generated by Algorithm 1 is contained in a compact subset of \({\mathcal {F}}\).

Proof

We easily see from Step 5 of Algorithm 1 that \(\Vert \varvec{w}_k^{(t)}\Vert _2=1\) for all k and \(t \in {\mathbb {Z}}_{++}\), where \(\varvec{w}_k^{(t)}\) is the k-th column of \(\varvec{W}^{(t)}\). Also, it follows from Lemma 2 that \(f(\varvec{W}^{(t)},\varvec{H}^{(t)}) \le f(\varvec{W}^{(0)},\varvec{H}^{(0)})\) for all \(t \in {\mathbb {Z}}_{+}\). Therefore \((\varvec{W}^{(t)},\varvec{H}^{(t)}) \in {\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) for all \(t \in {\mathbb {Z}}_{++}\). Because \({\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is bounded as shown in Lemma 5, the sequence \(\{(\varvec{W}^{(t)},\varvec{H}^{(t)})\}_{t=0}^{\infty }\) is contained in a compact subset of \({\mathcal {F}}\). \(\square \)

We finally prove that the proposed update rule satisfies the third condition in Theorem 1. The next lemma shows the closedness of the point-to-set mappings \(S_1^\mathrm {W}, S_2^\mathrm {W}, \ldots , S_K^\mathrm {W}\).

Lemma 7

The point-to-set mappings \(S_1^\mathrm {W}, S_2^\mathrm {W}, \ldots , S_K^\mathrm {W}\) are closed on \({\mathcal {F}}\).

Proof

Let \(\{(\varvec{W}^{(t)},\varvec{H}^{(t)})\}_{t=0}^{\infty }\) and \(\{(\varvec{U}^{(t)},\varvec{V}^{(t)})\}_{t=0}^{\infty }\) be any two convergent sequences in \({\mathcal {F}}\) that satisfy \((\varvec{U}^{(t)},\varvec{V}^{(t)}) \in S_k^\mathrm {W}(\varvec{W}^{(t)},\varvec{H}^{(t)})\) for all \(t \in {\mathbb {Z}}_{+}\). Let \((\varvec{W}^{(\infty )},\varvec{H}^{(\infty )})\) and \((\varvec{U}^{(\infty )},\varvec{V}^{(\infty )})\) be the limits of these two sequences. It is clear from the definition of \(S_k^\mathrm {W}\) that \(\Vert \varvec{u}_k^{(t)}\Vert _2=1\) for all \(t \in {\mathbb {Z}}_{+}\), \(\varvec{u}_{{\tilde{k}}}^{(t)}=\varvec{w}_{{\tilde{k}}}^{(t)}\) for all \({\tilde{k}} \ne k\) and \(t \in {\mathbb {Z}}_{+}\), and \(\varvec{V}^{(t)}=\varvec{H}^{(t)}\) for all \(t \in {\mathbb {Z}}_{+}\). We first consider the case where \(\varvec{w}_k^{(\infty )} \ne \varvec{0}_{M \times 1}\). In this case, \(S_k^\mathrm {W}(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )})\) consists only of the point

$$\begin{aligned} \left( \left( \varvec{w}_1^{(\infty )},\ldots ,\varvec{w}_{k-1}^{(\infty )},\frac{\varvec{w}_k^{(\infty )}}{\Vert \varvec{w}_k^{(\infty )}\Vert _2},\varvec{w}_{k+1}^{(\infty )},\ldots ,\varvec{w}_K^{(\infty )}\right) ,\varvec{H}^{(\infty )}\right) \end{aligned}$$

and \(\{(\varvec{U}^{(t)},\varvec{V}^{(t)})\}_{t=0}^\infty \) converges to it. We next consider the case where \(\varvec{w}_k^{(\infty )}=\varvec{0}_{M \times 1}\). In this case, \(S_k^\mathrm {W}(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )})\) is the set of all \((\varvec{W},\varvec{H}) \in {\mathcal {F}}\) such that \(\Vert \varvec{w}_k\Vert _2=1\), \(\varvec{w}_{{\tilde{k}}}=\varvec{w}_{{\tilde{k}}}^{(\infty )}\) for all \({\tilde{k}} \ne k\) and \(\varvec{H}=\varvec{H}^{(\infty )}\). Also, \((\varvec{U}^{(\infty )},\varvec{V}^{(\infty )})\) satisfies \(\Vert \varvec{u}_k^{(\infty )}\Vert _2=1\), \(\varvec{u}_{{\tilde{k}}}^{(\infty )}=\varvec{w}_{{\tilde{k}}}^{(\infty )}\) for all \({\tilde{k}} \ne k\) and \(\varvec{V}^{(\infty )}=\varvec{H}^{(\infty )}\). Therefore, we have \((\varvec{U}^{(\infty )},\varvec{V}^{(\infty )}) \in S_{k}^\mathrm {W}(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )})\). \(\square \)

Given a point \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {F}}\), we define \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\), \({\mathcal {L}}^2_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) and \({\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) as follows:

$$\begin{aligned} {\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}&= \{(\varvec{W},\varvec{H}) \in {\mathcal {F}}\,|\, f(\varvec{W},\varvec{H}) \le f(\varvec{W}^{(0)},\varvec{H}^{(0)}), \\&\qquad \Vert \varvec{w}_k\Vert _2 \le \mu _k \text{ and } \Vert \varvec{h}_k\Vert _2 \le \nu _k \text{ for } \text{ all } k\} \\ {\mathcal {L}}^2_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}&= \{(\varvec{W},\varvec{H}) \in {\mathcal {F}}\,|\, f(\varvec{W},\varvec{H}) \le f(\varvec{W}^{(0)},\varvec{H}^{(0)}), \\&\qquad \Vert \varvec{w}_k\Vert _2 \le \mu _k+\sigma _{\max }(\varvec{X})\nu _k/\delta \text{ and } \Vert \varvec{h}_k\Vert _2 \le \nu _k \text{ for } \text{ all } k\}, \\ {\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}&= \{(\varvec{W},\varvec{H}) \in {\mathcal {F}}\,|\, f(\varvec{W},\varvec{H}) \le f(\varvec{W}^{(0)},\varvec{H}^{(0)}), \\&\qquad \Vert \varvec{w}_k\Vert _2 \le \mu _k+\sigma _{\max }(\varvec{X})\nu _k/\delta \text{ and } \\&\qquad \Vert \varvec{h}_k\Vert _2 \le \nu _k (\mu _k+\sigma _{\max }(\varvec{X})\nu _k/\delta ) \text{ for } \text{ all } k\} \end{aligned}$$

where

$$\begin{aligned} \mu _k&= \max \{1,\Vert \varvec{w}_k^{(0)}\Vert _2\}, \\ \nu _k&= \max \{f(\varvec{W}^{(0)},\varvec{H}^{(0)})+\varvec{1}_{N \times 1}^\mathrm {T}\varvec{X}^\mathrm {T}\varvec{1}_{M \times 1}+N, \Vert \varvec{h}_k^{(0)}\Vert _2\} \end{aligned}$$

for \(k=1,2,\ldots ,K\) and \(\sigma _{\max }(\varvec{X})\) is the largest singular value of \(\varvec{X}\). It is clear that all of the three sets defined above are compact subsets of \({\mathcal {F}}\). It is also clear that \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\). Furthermore, the following lemma holds.

Lemma 8

The following statements are true for \(k=1,2,\ldots ,K\).

  1. If \((\varvec{W},\varvec{H}) \in {\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) then \(D_k^\mathrm {W}(\varvec{W},\varvec{H}) \subseteq {\mathcal {L}}^2_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).

  2. If \((\varvec{W},\varvec{H}) \in {\mathcal {L}}^2_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) then \(S_k^\mathrm {H}(\varvec{W},\varvec{H}) \subseteq {\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).

  3. If \((\varvec{W},\varvec{H}) \in {\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) then \(S_k^\mathrm {W}(\varvec{W},\varvec{H}) \subseteq {\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).

  4. If \((\varvec{W},\varvec{H}) \in {\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) then \(D_k^\mathrm {H}(\varvec{W},\varvec{H}) \subseteq {\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).

Proof

We first prove the first statement. Suppose that \((\varvec{W},\varvec{H}) \in {\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\). Then \(\Vert \varvec{w}_k\Vert _2 \le \mu _k\) and \(\Vert \varvec{h}_k\Vert _2 \le \nu _k\) hold. Using these inequalities, we have

$$\begin{aligned} \left\| \frac{[\varvec{R}_k \varvec{h}_k+\delta \varvec{w}_k]_{+}}{\Vert \varvec{h}_k\Vert _2^2+\delta }\right\| _2&\le \left\| \frac{\varvec{X} \varvec{h}_k+\delta \varvec{w}_{k}}{\delta }\right\| _2 \\&\le \frac{1}{\delta } \Vert \varvec{X}\varvec{h}_k\Vert _2+\Vert \varvec{w}_k\Vert _2 \\&\le \frac{\sigma _{\max }(\varvec{X})}{\delta } \Vert \varvec{h}_k\Vert _2+\Vert \varvec{w}_k\Vert _2 \\&\le \frac{\sigma _{\max }(\varvec{X})}{\delta } \nu _k+\mu _k \end{aligned}$$

which means that \(D_k^\mathrm {W}(\varvec{W},\varvec{H}) \subseteq {\mathcal {L}}^2_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).

We next prove the second statement. Suppose that \((\varvec{W},\varvec{H}) \in {\mathcal {L}}^2_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\). Then \(\Vert \varvec{w}_k\Vert _2 \le (\sigma _{\max }(\varvec{X}) \nu _k/\delta +\mu _k)\) and \(\Vert \varvec{h}_k\Vert _2 \le \nu _k\) hold. Using these inequalities, we have

$$\begin{aligned} \Vert \varvec{h}_k\Vert _2 \Vert \varvec{w}_k\Vert _2 \le \nu _k \left( \frac{\sigma _{\max }(\varvec{X})}{\delta } \nu _k+\mu _k \right) \end{aligned}$$

which means that \(S_k^\mathrm {H}(\varvec{W},\varvec{H}) \subseteq {\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).

The third statement is clear from the definition of the point-to-set mapping \(S_k^\mathrm {W}\), and the fourth statement is clear from the proof of Lemma 5. \(\square \)

From Lemma 8, we can restrict the domains of the point-to-set mappings \(D_k^\mathrm {W}\), \(S_k^\mathrm {H}\), \(S_k^\mathrm {W}\), \(D_k^\mathrm {H}\) to \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\), \({\mathcal {L}}^2_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\), \({\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) and \({\mathcal {L}}_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\), respectively. This means that we can restrict the domain of the point-to-set mapping A to \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\). The next lemma shows the closedness of A restricted to \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).

Lemma 9

For any \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {F}}\), the point-to-set mapping A restricted to \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is closed on \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\).

Proof

It is clear that the composite mapping \(S_k^\mathrm {H} \circ D_k^\mathrm {W}\) from \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) to the subsets of \({\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is closed on its domain for all k. Also, it follows from Lemma 7 and the continuity of \(D_k^\mathrm {H}\) that the composite mapping \(D_k^\mathrm {H} \circ S_k^\mathrm {W}\) from \({\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) to the subsets of \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is closed on its domain for all k. Because \({\mathcal {L}}^3_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is a compact subset of \({\mathcal {F}}\), by [53, Corollary 4.2.1], the composite mapping \((D_k^\mathrm {H} \circ S_k^\mathrm {W}) \circ (S_k^\mathrm {H} \circ D_k^\mathrm {W})\) from \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) to the subsets of \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is closed on its domain for all k. Furthermore, since \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is a compact subset of \({\mathcal {F}}\), by [53, Corollary 4.2.1], we can conclude that A, which is a composition of the mappings \((D_k^\mathrm {H} \circ S_k^\mathrm {W}) \circ (S_k^\mathrm {H} \circ D_k^\mathrm {W})\), restricted to \({\mathcal {L}}^1_{(\varvec{W}^{(0)},\varvec{H}^{(0)})}\) is closed on its domain. \(\square \)

We should note that even if \(\varvec{u}_k\) in Step 5 of Algorithm 1 is replaced with a constant nonnegative unit vector such as \((1/\sqrt{M})\varvec{1}_{M \times 1}\) or \((1,0,0,\ldots ,0)^\mathrm {T}\), we can prove Theorem 5 without changing the definition of the mapping \(S_k^\mathrm {W}\).

5 Stopping conditions

We have proved that the HALS algorithm using the proposed update rule shown in Algorithm 1 is globally convergent to \({\mathcal {S}}\) in the sense of Definition 1. Therefore, combining this update rule with an appropriate stopping condition, we can design an algorithm that always stops in a finite number of iterations. In this section, we consider two approaches for deriving stopping conditions.

5.1 Relaxed KKT conditions

The first approach, which has already been used in the literature [29, 30, 38, 44, 46, 47], is to relax the KKT conditions (3) as follows:

$$\begin{aligned}&{\left\{ \begin{array}{ll} \left( \nabla _{\varvec{W}}f(\varvec{W},\varvec{H})\right) _{mk} \ge -\kappa _1, &{} \text{ if } w_{mk} \le \kappa _2, \\ \left| \left( \nabla _{\varvec{W}}f(\varvec{W},\varvec{H})\right) _{mk}\right| \le \kappa _1, &{} \text{ if } w_{mk} > \kappa _2, \end{array}\right. } \nonumber \\&\qquad \qquad m=1,2,\ldots ,M, \; k=1,2,\ldots ,K, \end{aligned}$$
(21)
$$\begin{aligned}&{\left\{ \begin{array}{ll} \left( \nabla _{\varvec{H}}f(\varvec{W},\varvec{H})\right) _{nk} \ge -\kappa _1, &{} \text{ if } h_{nk} \le \kappa _2, \\ \left| \left( \nabla _{\varvec{H}}f(\varvec{W},\varvec{H})\right) _{nk}\right| \le \kappa _1, &{} \text{ if } h_{nk} > \kappa _2, \end{array}\right. } \nonumber \\&\qquad \qquad n=1,2,\ldots ,N, \; k=1,2,\ldots ,K \end{aligned}$$
(22)

where \(\kappa _1\) and \(\kappa _2\) are positive constants.
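A direct NumPy implementation of the check (21)-(22) is straightforward; the sketch below (ours, with an illustrative function name) returns True when both relaxed conditions hold.

```python
import numpy as np

def relaxed_kkt_satisfied(X, W, H, kappa1, kappa2):
    """Stopping condition given by (21) and (22)."""
    GW = (W @ H.T - X) @ H          # nabla_W f
    GH = (H @ W.T - X.T) @ W        # nabla_H f
    ok_W = np.where(W <= kappa2, GW >= -kappa1, np.abs(GW) <= kappa1)   # (21)
    ok_H = np.where(H <= kappa2, GH >= -kappa1, np.abs(GH) <= kappa1)   # (22)
    return bool(np.all(ok_W) and np.all(ok_H))
```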

The HALS algorithm for NMF using Algorithm 1 and the stopping condition described by (21) and (22) is shown in Algorithm 2. For this algorithm, the following theorem holds. The proof is omitted because it is similar to that of Theorem 2 in [46].

Theorem 6

Algorithm 2 stops in a finite number of iterations.

Algorithm 2

5.2 Projected gradient norm

The second approach is to make use of the projected gradient [36]. To be more specific, the inequality

$$\begin{aligned} \psi _{\tau _2}(\varvec{W}, \varvec{H}) \le \tau _1 \psi _{\tau _2}(\varvec{W}^{(0)},\varvec{H}^{(0)}) \end{aligned}$$
(23)

is used as the stopping condition, where \(\tau _1\) and \(\tau _2\) are positive constants, and \(\psi _{\tau _2}(\varvec{W},\varvec{H})\) is defined as

$$\begin{aligned} \psi _{\tau _2}(\varvec{W},\varvec{H}) = \sqrt{\left\| \varvec{G}_{\tau _2}^{\mathrm {W}}(\varvec{W},\varvec{H})\right\| _\mathrm {F}^2+\left\| \varvec{G}_{\tau _2}^{\mathrm {H}}(\varvec{W},\varvec{H})\right\| _\mathrm {F}^2}. \end{aligned}$$

The notations \(\varvec{G}_{\tau _2}^{\mathrm {W}}(\varvec{W},\varvec{H})\) and \(\varvec{G}_{\tau _2}^{\mathrm {H}}(\varvec{W},\varvec{H})\) denote the modified projected gradients with respect to \(\varvec{W}\) and \(\varvec{H}\), respectively, which are defined by

$$\begin{aligned} (\varvec{G}_{\tau _2}^{\mathrm {W}}(\varvec{W},\varvec{H}))_{mk} = {\left\{ \begin{array}{ll} \min \{0,(\nabla _{\varvec{W}}f(\varvec{W},\varvec{H}))_{mk}\}, &{} \text{ if } w_{mk} \le \tau _2, \\ (\nabla _{\varvec{W}} f(\varvec{W},\varvec{H}))_{mk}, &{} \text{ if } w_{mk} > \tau _2 \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} (\varvec{G}_{\tau _2}^{\mathrm {H}}(\varvec{W},\varvec{H}))_{nk} = {\left\{ \begin{array}{ll} \min \{0,(\nabla _{\varvec{H}} f(\varvec{W},\varvec{H}))_{nk} \}, &{} \text{ if } h_{nk} \le \tau _2, \\ (\nabla _{\varvec{H}}f(\varvec{W},\varvec{H}))_{nk}, &{} \text{ if } h_{nk} > \tau _2 . \end{array}\right. } \end{aligned}$$

Note that our definition of the projected gradient is slightly different from the one used in the literature [20, 26, 27, 36], which corresponds to the case where \(\tau _2=0\). It is clear that if \((\varvec{W},\varvec{H})\) is a stationary point of (1) then (23) is satisfied because \(\psi _{\tau _2}(\varvec{W},\varvec{H})=0\) holds. Therefore, (23) can be regarded as a relaxed form of the KKT conditions.
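The quantity \(\psi _{\tau _2}\) and the check (23) can be computed as in the following NumPy sketch (ours); the initial value \(\psi _{\tau _2}(\varvec{W}^{(0)},\varvec{H}^{(0)})\) is evaluated once and reused.

```python
import numpy as np

def projected_gradient_norm(X, W, H, tau2):
    """psi_{tau2}(W, H) built from the modified projected gradients defined above."""
    GW = (W @ H.T - X) @ H
    GH = (H @ W.T - X.T) @ W
    PW = np.where(W <= tau2, np.minimum(GW, 0.0), GW)   # G^W_{tau2}
    PH = np.where(H <= tau2, np.minimum(GH, 0.0), GH)   # G^H_{tau2}
    return np.sqrt(np.sum(PW * PW) + np.sum(PH * PH))

# Stopping condition (23): stop as soon as
#   projected_gradient_norm(X, W, H, tau2) <= tau1 * psi0
# where psi0 = projected_gradient_norm(X, W0, H0, tau2) is computed at the initial point.
```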

The proposed HALS algorithm for NMF using Algorithm 1 and the stopping condition (23) is shown in Algorithm 3. For this algorithm, the following theorem holds.

Theorem 7

Algorithm 3 stops in a finite number of iterations.

Proof

The proof is by contradiction. Suppose that Algorithm 3 does not stop for some \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in {\mathcal {F}}\). Let \(\{(\varvec{W}^{(t)}, \varvec{H}^{(t)}) \}_{t=0}^\infty \) be an infinite sequence generated by Algorithm 3. Then, we see from Step 1 that \(\psi _{\tau _2}(\varvec{W}^{(0)},\varvec{H}^{(0)})\) must be positive. Also, by Theorem 5, this sequence has at least one subsequence that converges to a stationary point of (1). Let \(\{(\varvec{W}^{(t_i)}, \varvec{H}^{(t_i)}) \}_{i=0}^\infty \) be one such subsequence and \((\varvec{W}^{(\infty )},\varvec{H}^{(\infty )}) \in {\mathcal {F}}\) be its limit. Because the limit is a stationary point of (1), it satisfies

$$\begin{aligned}&\left( \nabla _{\varvec{W}}f(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )})\right) _{mk} {\left\{ \begin{array}{ll} \ge 0, &{} \text{ if } w_{mk}^{(\infty )}=0, \\ = 0, &{} \text{ if } w_{mk}^{(\infty )}> 0, \end{array}\right. } \nonumber \\&\qquad m=1,2,\ldots ,M, \; k=1,2,\ldots ,K, \\&\left( \nabla _{\varvec{H}}f(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )})\right) _{nk} {\left\{ \begin{array}{ll} \ge 0, &{} \text{ if } h_{nk}^{(\infty )} = 0, \\ = 0, &{} \text{ if } h_{nk}^{(\infty )} > 0, \end{array}\right. } \nonumber \\&\qquad n=1,2,\ldots ,N, \; k=1,2,\ldots ,K. \end{aligned}$$

Let us define a positive constant \(\mu \) as

$$\begin{aligned} \mu = \frac{1}{\sqrt{MK+NK}} \tau _1 \psi _{\tau _2}(\varvec{W}^{(0)},\varvec{H}^{(0)}). \end{aligned}$$

Because \(\nabla _{\varvec{W}} f(\varvec{W},\varvec{H})\) and \(\nabla _{\varvec{H}} f(\varvec{W},\varvec{H})\) are continuous on \({\mathcal {F}}\), the following statements hold true.

  1. 1.

    For any (mk) such that \((\nabla _{\varvec{W}} f(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )}))_{mk} = 0\), there exists a positive integer \(I_{mk}^\mathrm {W}\) such that

    $$\begin{aligned} \left| (\nabla _{\varvec{W}} f(\varvec{W}^{(t_i)},\varvec{H}^{(t_i)}))_{mk} \right| \le \mu \end{aligned}$$

    for all \(i \ge I_{mk}^\mathrm {W}\).

2.

    For any (mk) such that \((\nabla _{\varvec{W}} f(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )}))_{mk} > 0\), there exists a positive integer \(I_{mk}^\mathrm {W}\) such that

    $$\begin{aligned} (\nabla _{\varvec{W}} f(\varvec{W}^{(t_i)},\varvec{H}^{(t_i)}))_{mk} > 0, \quad w_{mk}^{(t_i)} \le \tau _2 \end{aligned}$$

    for all \(i \ge I_{mk}^\mathrm {W}\).

3.

    For any (nk) such that \((\nabla _{\varvec{H}} f(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )}))_{nk} = 0\), there exists a positive integer \(I_{nk}^\mathrm {H}\) such that

    $$\begin{aligned} \left| (\nabla _{\varvec{H}} f(\varvec{W}^{(t_i)},\varvec{H}^{(t_i)}))_{nk} \right| \le \mu \end{aligned}$$

    for all \(i \ge I_{nk}^\mathrm {H}\).

4.

    For any (nk) such that \((\nabla _{\varvec{H}} f(\varvec{W}^{(\infty )},\varvec{H}^{(\infty )}))_{nk} > 0\), there exists a positive integer \(I_{nk}^\mathrm {H}\) such that

    $$\begin{aligned} (\nabla _{\varvec{H}} f(\varvec{W}^{(t_i)},\varvec{H}^{(t_i)}))_{nk} > 0, \quad h_{nk}^{(t_i)} \le \tau _2 \end{aligned}$$

    for all \(i \ge I_{nk}^\mathrm {H}\).

From these statements, we see that

$$\begin{aligned}&\left| (\varvec{G}_{\tau _2}^{\mathrm {W}}(\varvec{W}^{(t_i)},\varvec{H}^{(t_i)}))_{mk} \right| \le \mu , \quad m=1,2,\ldots ,M, \quad k=1,2,\ldots ,K, \\&\left| (\varvec{G}_{\tau _2}^{\mathrm {H}}(\varvec{W}^{(t_i)},\varvec{H}^{(t_i)}))_{nk} \right| \le \mu , \quad n=1,2,\ldots ,N, \quad k=1,2,\ldots ,K \end{aligned}$$

for all \(i \ge I=\max \{I_{11}^\mathrm {W},\ldots ,I_{MK}^\mathrm {W},I_{11}^\mathrm {H},\ldots ,I_{NK}^\mathrm {H}\}\). Therefore, the inequality

$$\begin{aligned} \psi _{\tau _2}(\varvec{W}^{(t_i)},\varvec{H}^{(t_i)})&= \sqrt{\Vert \varvec{G}_{\tau _2}^\mathrm {W}(\varvec{W}^{(t_i)},\varvec{H}^{(t_i)})\Vert _{\mathrm {F}}^2+\Vert \varvec{G}_{\tau _2}^\mathrm {H}(\varvec{W}^{(t_i)},\varvec{H}^{(t_i)})\Vert _{\mathrm {F}}^2} \\&\le \mu \sqrt{MK+NK} \\&= \tau _1 \psi _{\tau _2}(\varvec{W}^{(0)}, \varvec{H}^{(0)}) \end{aligned}$$

holds for all \(i \ge I\). This means that the stopping condition (23) is satisfied after a finite number of iterations, which contradicts the assumption that Algorithm 3 does not stop. \(\square \)

Algorithm 3 (pseudocode figure)

6 Numerical experiments

In order to examine the practical performance of the proposed update rule, we conducted numerical experiments using two real-world datasets: Olivetti (face images) and CLUTO tr41 (documents). The statistics of these two datasets are shown in Table 1. In the experiments, two global-convergence-guaranteed update rules were applied to the nonnegative matrices obtained from the datasets: one is Algorithm 1 (denoted as ‘proposed’) and the other is the update rule described by (10) and (11) (denoted as ‘positive’). These two update rules are compared in terms of the evolution of the objective function value, the number of unsatisfied inequalities in the relaxed KKT conditions, and the characteristics of the obtained factor matrices.

The experimental setup is shown in Table 2. The value of \(\delta \) in the proposed update rule is set to \(10^{-8}\) in all experiments, while the value of \(\epsilon \) in the positive one is set to \(10^{-4}\) or \(10^{-8}\), depending on the experiment. The iteration is terminated when the stopping condition described by (21) and (22) is satisfied or the number of iterations reaches 500. The values of \(\kappa _1\) and \(\kappa _2\) in the stopping condition are set to 1.0 and \(2\epsilon \), respectively, in all experiments. Note that the finite termination of the positive update rule is guaranteed if \(\kappa _2\) is greater than \(\epsilon \); this can be proved in the same way as Theorem 7 (see [29] for details). Three different initial solutions are generated for each dataset in such a way that each element is drawn independently from the uniform distribution on the interval [0, 1], [0, 0.5] or [0, 0.25]; these are called the ‘large’, ‘medium’ and ‘small’ initial solutions, respectively.
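For illustration, the three types of initial solutions can be generated as in the following sketch (NumPy; the random seed and the function name are our own choices, not part of the experimental protocol):

import numpy as np

def initial_solution(M, N, K, scale, seed=0):
    # scale = 1.0, 0.5 and 0.25 correspond to the 'large', 'medium'
    # and 'small' initial solutions, respectively.
    rng = np.random.default_rng(seed)
    W0 = rng.uniform(0.0, scale, size=(M, K))   # M x K nonnegative factor
    H0 = rng.uniform(0.0, scale, size=(N, K))   # N x K nonnegative factor
    return W0, H0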

Table 1 Statistics of the datasets used in the experiments
Table 2 Experimental setup
Fig. 2

The evolution of the objective function value (left column) and the number of unsatisfied inequalities in (21) and (22) (right column) in Experiment 1. The first, second and third rows show the results for the large, medium and small initial solutions, respectively

Table 3 Characteristics of the solutions obtained by the proposed and positive update rules in Experiment 1. The notation ‘positive\(+\)replacement’ means that all \(\epsilon \) in the factor matrices obtained by the positive update rule are replaced with zero

Results of Experiment 1 are summarized in Fig. 2 and Table 3. Figure 2 shows the evolution of the objective function value and the number of unsatisfied inequalities in (21) and (22). We can easily see from the figure that the two update rules decrease the objective function value in a similar way until one of them satisfies the stopping condition. In contrast, the behavior of the two update rules with respect to the number of unsatisfied inequalities is quite different. The proposed update rule decreases the number at a similar rate for all the initial solutions, and satisfies the stopping condition after between 200 and 300 iterations; this is because the normalization process is included in the proposed update rule. The positive update rule decreases the number very slowly, and cannot satisfy the stopping condition within 500 iterations for the large and medium initial solutions, while for the small initial solution it decreases the number very quickly and satisfies the stopping condition in fewer than 100 iterations.

Table 3 shows the characteristics of the solutions obtained by the proposed and positive update rules. Several important facts can be observed from this table. The first is that a small objective function value does not necessarily mean that the number of unsatisfied inequalities is small: the solution obtained by the positive update rule for the large initial solution gives the smallest objective function value and the largest number of unsatisfied inequalities, while the solution obtained by the positive update rule for the small initial solution gives the largest objective function value but satisfies all the inequalities. The second is that about a quarter of the variables are at the lower bound in all cases. Hence the solutions obtained by the proposed update rule are sparse, because the lower bound is zero, whereas the solutions obtained by the positive update rule are dense, because the lower bound is the positive constant \(\epsilon \). The third is that replacing all \(\epsilon \) with zero in each solution obtained by the positive update rule increases the number of unsatisfied inequalities; in particular, the replacement turns a solution that satisfies the stopping condition into one that does not. It is thus not always possible to find a sparse solution that satisfies the relaxed KKT conditions using the positive update rule, whereas the proposed update rule always provides one. This is an advantage of the proposed update rule over the positive one.

Fig. 3

The evolution of the objective function value (left column) and the number of unsatisfied inequalities in (21) and (22) (right column) in Experiment 2. The first, second and third rows show the results for the large, medium and small initial solutions, respectively

Table 4 Characteristics of the solutions obtained by the proposed and positive update rules in Experiment 2
Fig. 4

The evolution of the objective function value (left column) and the number of unsatisfied inequalities in (21) and (22) (right column) in Experiment 3. The first, second and third rows show the results for the large, medium and small initial solutions, respectively

Table 5 Characteristics of the solutions obtained by the proposed and positive update rules in Experiment 3
Fig. 5

The evolution of the objective function value (left column) and the number of unsatisfied inequalities in (21) and (22) (right column) in Experiment 4. The first, second and third rows show the results for the large, medium and small initial solutions, respectively

Table 6 Characteristics of the solutions obtained by the proposed and positive update rules in Experiment 4

Results of Experiment 2 are summarized in Fig. 3 and Table 4 just like Experiment 1. The evolution of the objective function value and the number of unsatisfied inequalities in (21) and (22) shown in Fig. 3 are similar to those in Experiment 1 (see Fig. 2), though the values of \(\epsilon \) and \(\kappa _2\) are quite different. The characteristics shown in Table 4 are similar to those in Table 3 but there is one important difference. The number of unsatisfied inequalities is zero before and after the replacement of all \(\epsilon \) with zero in the solution obtained by the positive update rule for the small initial solution. This indicates that we can find a sparse solution that satisfies the relaxed KKT conditions using the positive update rule if the magnitude of the initial solution and the value of \(\epsilon \) are sufficiently small. However, it is difficult in general to know in advance how small these values should be.

Results of Experiment 3 are summarized in Fig. 4 and Table 5. The evolution of the objective function value and the number of unsatisfied inequalities in (21) and (22) shown in Fig. 4 are similar to those in Experiment 1 (see Fig. 2), though the dataset is different. The characteristics shown in Table 5 are also similar to those in Table 3 but there are two main differences. One is that a solution with a smaller objective function value satisfies more inequalities in (21) and (22). The other is that the number of variables at the lower bound in each solution obtained by the positive update rule is higher than that in the corresponding solution obtained by the proposed update rule.

Results of Experiment 4 are summarized in Fig. 5 and Table 6. As for the proposed update rule, the evolution of the objective function value and the number of unsatisfied inequalities in (21) and (22) are similar to those in Experiment 3 (see Fig. 4). In contrast, the behavior of the positive update rule is quite different from that in Experiment 3: the number of unsatisfied inequalities decreases faster than with the proposed update rule for all the initial solutions, and reaches zero in fewer than 200 iterations. All the solutions obtained by the proposed and positive update rules have almost the same objective function value and very similar numbers of unsatisfied inequalities, as shown in Table 6. In addition, the number of unsatisfied inequalities is not affected by the replacement of all \(\epsilon \) with zero for any of the solutions obtained by the positive update rule. This indicates that we can find a sparse solution that satisfies the relaxed KKT conditions using the positive update rule if the magnitude of the initial solution and the value of \(\epsilon \) are properly selected.

7 Applicability of proposed update rule to variants of HALS algorithm

In this section, we introduce some variants of the HALS algorithm to which our update rule can be applied in order to guarantee well-definedness and/or global convergence. The first one is the accelerated HALS algorithm [17]. The idea behind this algorithm is very simple: in each iteration, \(\varvec{W}\) is updated several times while \(\varvec{H}\) is fixed, and then \(\varvec{H}\) is updated several times while \(\varvec{W}\) is fixed. It was shown through experiments using image and text datasets that this algorithm significantly outperforms the original HALS algorithm [17]. We claim that the global convergence of this algorithm is guaranteed if Algorithm 1 is incorporated into it as follows: in each iteration, the algorithm updates all columns of \(\varvec{W}\) several times using the update rule in Step 3, next applies Steps 4 and 5 (the normalization) to the columns of \(\varvec{W}\) and \(\varvec{H}\) once, and then updates all columns of \(\varvec{H}\) several times using the update rule in Step 6.
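The looping scheme can be sketched as follows. For readability, the sketch uses the textbook HALS column update with a small positive guard \(\delta \) in the denominator, not the update rule of Algorithm 1 (whose Steps 3–6 are only referenced here), so it illustrates only the structure of the accelerated variant; all function names are ours.

import numpy as np

def hals_sweep_W(X, W, H, delta=1e-8):
    # One column-wise sweep over W with H fixed (textbook HALS update).
    HtH, XH = H.T @ H, X @ H
    for k in range(W.shape[1]):
        denom = max(HtH[k, k], delta)
        W[:, k] = np.maximum(0.0, W[:, k] + (XH[:, k] - W @ HtH[:, k]) / denom)
    return W

def hals_sweep_H(X, W, H, delta=1e-8):
    # The analogous sweep over H with W fixed.
    WtW, XtW = W.T @ W, X.T @ W
    for k in range(H.shape[1]):
        denom = max(WtW[k, k], delta)
        H[:, k] = np.maximum(0.0, H[:, k] + (XtW[:, k] - H @ WtW[:, k]) / denom)
    return H

def accelerated_outer_iteration(X, W, H, inner=3):
    # Several W-sweeps, then (in the globally convergent version) the
    # normalization of Steps 4 and 5 of Algorithm 1 once, then several H-sweeps.
    for _ in range(inner):
        W = hals_sweep_W(X, W, H)
    # Steps 4 and 5 of Algorithm 1 (normalization) would be applied here.
    for _ in range(inner):
        H = hals_sweep_H(X, W, H)
    return W, H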

The second one is the fast coordinate descent algorithm with variable selection [26]. In each iteration, M rows of \(\varvec{W}\) are updated one by one and then N rows of \(\varvec{H}\) are updated one by one. Each row of \(\varvec{W}\) or \(\varvec{H}\) is updated by repeating the following two steps until some condition is satisfied: i) selection of one element based on the potential decrease in the objective function value, and ii) update of the selected element. It was shown through experiments using synthetic and real-world datasets that this algorithm is considerably faster than conventional algorithms [26]. Again, we claim that the global convergence of this algorithm is guaranteed if Algorithm 1 is incorporated into it. To be more specific, when each row of \(\varvec{W}\) or \(\varvec{H}\) is updated, the update rule in Step 3 or Step 6 of Algorithm 1 can be used for both the computation of the potential decrease in the objective function value and the update of the selected element. One important point is that the normalization procedure in Steps 4 and 5 should be done between the update of M rows of \(\varvec{W}\) and the update of N rows of \(\varvec{H}\).

The third one is the randomized HALS algorithm [13], which is based on the probabilistic framework for low-rank approximations [21]. In the first step, this algorithm constructs a surrogate matrix \(\varvec{B} \in {\mathbb {R}}^{L \times N}\) with \(K < L \ll M\) as follows. First, \(\varvec{X}\) is multiplied by a random matrix \(\varvec{\Omega } \in {\mathbb {R}}^{N \times L}\) to get \(\varvec{Y}=\varvec{X}\varvec{\Omega }\). Next, a matrix \(\varvec{Q} \in {\mathbb {R}}^{M \times L}\) with orthonormal columns is obtained by performing the QR-decomposition of \(\varvec{Y}\). Finally, the surrogate matrix is obtained as \(\varvec{B}=\varvec{Q}^\mathrm {T}\varvec{X}\). The surrogate matrix \(\varvec{B}\) obtained in this way is expected to capture the essential information of \(\varvec{X}\). In the next step, this algorithm solves the optimization problem:

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} \displaystyle \frac{1}{2} \left\| \varvec{B} - \tilde{\varvec{W}}\varvec{H}^\mathrm {T} \right\| _\mathrm {F}^2 \\ \text{ subject } \text{ to } &{} \varvec{Q}\tilde{\varvec{W}} \ge \varvec{0}_{M \times K},\ \varvec{H} \ge \varvec{0}_{N \times K} \end{array} \end{aligned}$$

by an iterative algorithm very similar to the HALS algorithm. It was shown through experiments using hand-written digit and face image datasets that the randomized HALS algorithm has a substantially lower computational cost than, and attains almost the same reconstruction error as, the deterministic one [13]. The technique used in our update rule can easily be applied to this algorithm in order to ensure that it is well defined.
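The surrogate construction described above can be sketched as follows. We assume a standard Gaussian test matrix \(\varvec{\Omega }\), a common choice in the randomized low-rank approximation literature, although [13] may use a different distribution; the function name and the seed are ours.

import numpy as np

def compressed_surrogate(X, L, seed=0):
    # X is the M x N nonnegative data matrix; L is the sketch size (K < L << M).
    M, N = X.shape
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((N, L))    # random test matrix (assumed Gaussian)
    Y = X @ Omega                          # sample the column space of X
    Q, _ = np.linalg.qr(Y)                 # M x L matrix with orthonormal columns
    B = Q.T @ X                            # L x N surrogate matrix
    return Q, B

The factors are then presumably recovered as \(\varvec{W} = \varvec{Q}\tilde{\varvec{W}}\) and \(\varvec{H}\), which is consistent with the constraint \(\varvec{Q}\tilde{\varvec{W}} \ge \varvec{0}_{M \times K}\) in the problem above.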

In addition to these three, there are many other algorithms to which our update rule can be applied. One example is the distributed HALS algorithm for multiagent networks [10]. This algorithm is based on the update rule given by (10) and (11) to guarantee global convergence. By using one of our update rules instead, this algorithm can find a stationary point of the original optimization problem (1).

8 Conclusions

In this paper, we have proposed a novel update rule of the HALS algorithm for NMF, and proved its global convergence using Zangwill’s global convergence theorem. The proposed update rule has the same computational complexity per iteration as the update rule in the original HALS algorithm. In addition, unlike the global-convergence-guaranteed update rules in the literature [29, 30], the proposed update rule does not restrict the range of each variable to a subset of \({\mathbb {R}}_{++}\). This allows us to obtain sparse factor matrices. We have also given two types of stopping conditions and proved the finite termination of the proposed update rule combined with these stopping conditions.

One future direction of this work is to extend our results to Nonnegative Tensor Factorization (NTF) [7, 55], which is expected to be used in various applications such as recommender systems [56] but whose global convergence properties have not yet been analyzed in depth.