In this section, we introduce our method, MNMF, in detail. We use the basic NMF objective function as the measure of clustering quality and design a regularization term to quantify the redundancy between different clusterings. Then, we propose an efficient algorithm to solve the resulting optimization problem. We start with the basic NMF.
NMF
Given a nonnegative matrix, NMF factorizes it into the product of two nonnegative matrices (Lee and Seung 1999). Let \(X=[\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N]\in \mathbb {R}^{M\times N}\) be a data matrix, where each column is an instance. Denote the two new nonnegative matrices by \(U=[u_{ik}]\in \mathbb {R}^{M\times K}\) and \(V=[v_{jk}]\in \mathbb {R}^{N\times K}\), respectively. Then, we have
$$\begin{aligned} X\approx UV^\top \end{aligned}$$
Generally, the inner dimension K of the two matrices U and V is chosen to be much smaller than the dimensions of X, i.e., \(K\ll \min (M,N)\).
To measure the quality of the approximation, we need a cost function that quantifies the difference between X and \(U V^\top \). The most popular choice is the sum of squared errors, i.e., the squared Frobenius norm of \(X - UV^\top \), and the associated optimization problem is given by
$$\begin{aligned} \min \limits _{U,V \ge 0} \ {J_{sse}} = \left\| X - UV^\top \right\| _F^2 = \sum \limits _{i,j} \left( x_{ij} - \sum \limits _{k = 1}^K u_{ik}v_{jk} \right) ^2 \end{aligned}$$
(1)
Lee and Seung (2001) present an iterative algorithm which optimizes the above problem in the following way
$$\begin{aligned} {u_{ik}} \leftarrow {u_{ik}}\frac{{{{(XV)}_{ik}}}}{{{{(U{V^\top }V)}_{ik}}}}, \quad {v_{jk}} \leftarrow {v_{jk}}\frac{{{{({X^\top }U)}_{jk}}}}{{{{(V{U^\top }U)}_{jk}}}} . \end{aligned}$$
It has been proved that the objective function value is nonincreasing under the multiplicative updating rules (Lee and Seung 2001).
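For concreteness, one round of these updates can be sketched in NumPy as follows; the small constant eps added to the denominators is our own numerical safeguard against division by zero, not part of the original rule.

```python
import numpy as np

def nmf_update(X, U, V, eps=1e-10):
    """One round of the multiplicative updates of Lee and Seung (2001).

    X: (M, N) nonnegative data matrix; U: (M, K); V: (N, K).
    eps is a numerical safeguard, not part of the original derivation.
    """
    U *= (X @ V) / (U @ (V.T @ V) + eps)
    V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    return U, V
```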
Since \(K\ll \min (M,N)\), NMF can be treated as a dimension-reduction technique. We can also view the approximation column by column as follows
$$\begin{aligned} {\mathbf {x}_j} \approx \sum \limits _{k = 1}^K {{\mathbf {u}_k}{v_{jk}}} \end{aligned}$$
where \(\mathbf {u}_k\) is the k-th column vector of U. Thus, each data point \(\mathbf {x}_j\) is approximated by a linear combination of the columns of U, with the coefficients given in the j-th row of V. Therefore, U can be regarded as a basis consisting of nonnegative vectors, and each row of V is a new representation of an instance with respect to U. For the purpose of clustering, we can set K to be the number of clusters and assign \(\mathbf {x}_i\) to cluster \(c_i=\mathop {{{\mathrm{argmax}}}}\limits _{k} v_{ik}\).
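Concretely, once V has been computed (for instance with the sketch above), the assignment step amounts to a row-wise argmax:

```python
# labels[i] = argmax_k v_ik; each row of V is the representation of one instance
labels = np.argmax(V, axis=1)
```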
The most important difference between NMF and other matrix factorization methods, such as SVD, is the nonnegativity constraints on U and V, which only allow additive combinations of the basis vectors. For this reason, NMF is believed to learn a parts-based representation that reveals the inherent structure of the original data.
Multiple NMF
Suppose there exists a clustering \(C_1\) which partitions the original data into different groups. How can we make use of it to generate a new clustering that is, on the one hand, different from \(C_1\) and, on the other hand, of high quality?
First, from the reference clustering \(C_1\), we can extract a similarity matrix \(S \in \mathbb {R}^{N \times N}\) between N data points. Specifically, we have
$$\begin{aligned} S_{ij} = \left\{ \begin{array}{ll} 1, &{}\quad \text {if } \mathbf {x}_i \text { and } \mathbf {x}_j \text { are in the same cluster} \\ 0, &{}\quad \text {otherwise} \end{array} \right. \end{aligned}$$
(2)
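A minimal NumPy sketch of this construction, assuming the reference clustering \(C_1\) is given as a vector of integer cluster labels, is:

```python
def similarity_matrix(labels):
    """Similarity matrix of Eq. (2): S_ij = 1 iff x_i and x_j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)
```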
Then, the similarity matrix S can be used to guide the generation of the new clustering. Next, we discuss how to modify the standard NMF to exploit this additional information. Our goal is to generate a new clustering \(C_2\) by NMF. As discussed in Sect. 3.1, the column vectors of U form a set of basis vectors of the new subspace, and the rows of V provide new representations of the data points.
Define
$$\begin{aligned} W=VV^\top \in \mathbb {R}^{N \times N} \end{aligned}$$
where \(W_{ij}\) is the inner product of the i-th row and the j-th row of V. Since V is nonnegative, \(W_{ij}\ge 0\) represents the similarity between new representations of \(\mathbf {x}_i\) and \(\mathbf {x}_j\). Because the new clustering \(C_2\) is also derived from V, we can use W to approximate the similarity matrix of \(C_2\). Note that in the ideal case that V is an indicator matrix, W is equal to the similarity matrix of \(C_2\).
Given two similarity matrices S and W, we measure the redundancy between \(C_1\) and \(C_2\) as the inner product of S and W, i.e.,
$$\begin{aligned} \langle S, W \rangle =\sum \limits _{i,j=1}^N W_{ij} S_{ij} \end{aligned}$$
To minimize the redundancy, we want the value of \(\sum _{ij} W_{ij}S_{ij}\) to be as small as possible. Using properties of the trace operator, we can write this quantity as a simple quadratic term
$$\begin{aligned} {R} = \sum \limits _{i,j=1}^N {{W_{ij}}{S_{ij}} ={{\mathrm{tr}}}({W^\top }S)} = {{\mathrm{tr}}}(V{V^\top }S) = {{\mathrm{tr}}}({V^\top }SV) \end{aligned}$$
(3)
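The equalities in Eq. (3) are easy to verify numerically; for a given V and S (names as in the sketches above), for example:

```python
W = V @ V.T                          # similarity between new representations
R_direct = np.sum(W * S)             # sum_{i,j} W_ij S_ij
R_trace = np.trace(V.T @ S @ V)      # tr(V^T S V)
assert np.isclose(R_direct, R_trace)
```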
By minimizing R, we encourage two data points \(\mathbf {x}_i\) and \(\mathbf {x}_j\) that are in the same cluster in the reference clustering \(C_1\) (i.e., \(S_{ij}=1\)) to fall into different clusters in the new clustering \(C_2\). We then incorporate R as a regularization term into Eq. (1) and obtain the objective function of Multiple NMF as follows:
$$\begin{aligned} \min \limits _{U,V \ge 0} \ \phi =\left\| {X - U{V^\top }} \right\| _F^2 + \lambda {{\mathrm{tr}}}(V^\top SV). \end{aligned}$$
(4)
In the above equation, the regularization parameter \(\lambda >0\) controls the trade-off between the clustering quality and the dissimilarity between different clusterings. By minimizing \(\phi \), we can get an alternative clustering \(C_2\) with respect to the reference clustering \(C_1\). Although the objective function \(\phi \) is not convex in U and V jointly, it is convex in each of them separately. Thus, a local minimum can be found by optimizing U and V alternately, as in the optimization of the basic NMF. In mathematical form, our optimization problem is similar to that of graph regularized NMF (GNMF) (Cai et al. 2011), and thus we can borrow techniques of GNMF to optimize it.
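For later reference, the objective of Eq. (4) can be evaluated directly; a small helper (the naming is ours) is:

```python
def mnmf_objective(X, U, V, S, lam):
    """Objective phi of Eq. (4): reconstruction error plus lambda * redundancy."""
    return np.linalg.norm(X - U @ V.T, 'fro') ** 2 + lam * np.trace(V.T @ S @ V)
```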
The objective function \(\phi \) in Eq. (4) can be rewritten as follows:
$$\begin{aligned} \begin{aligned} \phi&= {{\mathrm{tr}}}((X - U{V^\top }){(X - U{V^\top })^\top }) + \lambda {{\mathrm{tr}}}({V^\top }SV)\\&= {{\mathrm{tr}}}(X{X^\top }) - 2{{\mathrm{tr}}}(XV{U^\top }) + {{\mathrm{tr}}}(U{V^\top }V{U^\top }) + \lambda {{\mathrm{tr}}}({V^\top }SV) \end{aligned} \end{aligned}$$
There are two nonnegativity constraints, \(U\ge 0\) and \(V\ge 0\). To eliminate these constraints, we derive the Lagrange function of \(\phi \). Let \(A=[a_{ik}]\in \mathbb {R}^{M\times K}\) and \(B=[b_{jk}]\in \mathbb {R}^{N\times K}\) be the matrices of dual variables. The Lagrange function L is
$$\begin{aligned} \begin{aligned} L =&{{\mathrm{tr}}}(X{X^\top }) - 2{{\mathrm{tr}}}(XV{U^\top }) + {{\mathrm{tr}}}(U{V^\top }V{U^\top }) \\&+\, \lambda {{\mathrm{tr}}}({V^\top }SV) + {{\mathrm{tr}}}(A{U^\top }) + {{\mathrm{tr}}}(B{V^\top }) \end{aligned} \end{aligned}$$
The partial derivatives of L with respect to U and V are:
$$\begin{aligned} \frac{{\partial L}}{{\partial U}}&= - 2XV + 2U{V^\top }V + A\\ \frac{{\partial L}}{{\partial V}}&= - 2{X^\top }U + 2V{U^\top }U +2\lambda SV + B \end{aligned}$$
From the KKT conditions that \(A_{ik}u_{ik}=0\) and \(B_{jk}v_{jk}=0\), we obtain the following equations for \(u_{ik}\) and \(v_{jk}\):
$$\begin{aligned} - {(XV)_{ik}}{u_{ik}} + {(U{V^\top }V)_{ik}}{u_{ik}}&= 0 \\ - {({X^\top }U)_{jk}}{v_{jk}} + {(V{U^\top }U)_{jk}}{v_{jk}} + \lambda {(SV)_{jk}}{v_{jk}}&= 0 \end{aligned}$$
leading to the following multiplicative updating rules:
$$\begin{aligned} {u_{ik}}&\leftarrow {u_{ik}}\frac{{{{(XV)}_{ik}}}}{{{{(U{V^\top }V)}_{ik}}}} \end{aligned}$$
(5)
$$\begin{aligned} {v_{jk}}&\leftarrow {v_{jk}}\frac{{{{({X^\top }U)}_{jk}}}}{{{{(V{U^\top }U + \lambda SV)}_{jk}}}} \end{aligned}$$
(6)
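In implementation terms, one iteration of Eqs. (5) and (6) can be sketched as below; only the denominator of the V update differs from plain NMF, and eps is again our own numerical safeguard.

```python
def mnmf_update(X, U, V, S, lam, eps=1e-10):
    """One iteration of the MNMF updates in Eqs. (5) and (6)."""
    U *= (X @ V) / (U @ (V.T @ V) + eps)                     # Eq. (5), same as NMF
    V *= (X.T @ U) / (V @ (U.T @ U) + lam * (S @ V) + eps)   # Eq. (6)
    return U, V
```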
For these two updating rules, we have the following theorem.
Theorem 1
The objective function \(\phi \) in Eq. (4) is nonincreasing under the updating rules in Eqs. (5) and (6).
The proof is given in the next section. The analysis essentially follows that of NMF and GNMF (Lee and Seung 2001; Cai et al. 2011). We note that the above theorem cannot guarantee the final solution is a stationary point. To obtain a stronger theoretical guarantee, one can adopt the technique of Lin (2007) to modify the updating rule.
In practice, to prevent the elements of V from becoming unbounded, we normalize the columns of U to unit length (Xu et al. 2003). The matrix V needs to be adjusted accordingly. The normalization steps are as follows
$$\begin{aligned} {u_{ik}} \leftarrow \frac{{{u_{ik}}}}{{\sqrt{\sum \nolimits _i {u_{ik}^2} } }},\quad {v_{jk}} \leftarrow {v_{jk}}\sqrt{\sum \limits _i {u_{ik}^2} } \end{aligned}$$
(7)
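In NumPy, this normalization amounts to rescaling the columns of U and V by the column norms of U, so that the product \(UV^\top \) is unchanged:

```python
# Eq. (7): scale each column of U to unit length and rescale V accordingly
norms = np.sqrt((U ** 2).sum(axis=0))   # column norms of U
U /= norms
V *= norms
```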
After obtaining the new representation V of the data, we get an alternative clustering by either assigning the instance \(\mathbf {x}_i\) to the cluster \(c_i=\mathop {{{\mathrm{argmax}}}}\limits _{k} {v_{ik}}\) or applying any clustering method, such as k-means, to V. Given a reference clustering \(C_1\), the whole process of generating an alternative clustering \(C_2\) by MNMF is summarized in Algorithm 1.
When \(\lambda =0\), the two updating rules are the same as those of NMF, and the algorithm reduces to traditional clustering by NMF. In addition, our multiple clustering method can be extended to the case where more than two clusterings are required. Each time we obtain a clustering \(C_i\), we compute the corresponding similarity matrix \(S_{i}\). Then, we simply calculate the accumulated similarity matrix \(S=\sum _i S_i\), and use S to generate another clustering.
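A sketch of this extension, reusing the helpers above, is given below; X, M, N, K, lam, the number of clusterings, the iteration budget, and the random generator rng are assumed to be given.

```python
S = np.zeros((N, N))                      # empty reference: first run is plain NMF
clusterings = []
for _ in range(num_clusterings):          # num_clusterings is an assumed input
    U, V = rng.random((M, K)), rng.random((N, K))
    for _ in range(n_iters):              # n_iters is an assumed input
        U, V = mnmf_update(X, U, V, S, lam)
    labels = np.argmax(V, axis=1)
    clusterings.append(labels)
    S += similarity_matrix(labels)        # accumulated S = sum_i S_i
```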
Proof of Theorem 1
The objective function \(\phi \) of MNMF in Eq. (4) is bounded from below by zero. To establish Theorem 1, we need to show that \(\phi \) is non-increasing under the updating rules in Eqs. (5) and (6). Since the second term of \(\phi \) involves only V, the updating rule for U in MNMF is exactly the same as in the original NMF. Thus, we can use the convergence proof of NMF to show that \(\phi \) is non-increasing under the update rule in Eq. (5); see Lee and Seung (2001) for details.
It remains to prove that \(\phi \) is non-increasing under the updating step in Eq. (6). We follow a procedure similar to that described in Cai et al. (2011). Define
$$\begin{aligned} F(V) = \phi (U,V) = \left\| {X - U{V^\top }} \right\| _F^2 + \lambda {{\mathrm{tr}}}({V^\top }SV) \end{aligned}$$
We will construct an auxiliary function which satisfies the following conditions:
$$\begin{aligned} G(v,v^{t})\ge F(v), \quad G(v,v)=F(v). \end{aligned}$$
Lemma 1
If G satisfies the conditions above, then F is non-increasing under the updating rule:
$$\begin{aligned} {v^{t + 1}} = \mathop {\arg \min }\limits _v G(v,{v^t}) \end{aligned}$$
(8)
Proof
$$\begin{aligned} F({v^{t + 1}}) \le G({v^{t + 1}},{v^t}) \le G({v^t},{v^t}) = F({v^t}) \end{aligned}$$
\(\square \)
Considering any element \(v_{ab}\) in V, we use \(F_{ab}\) to denote the part of \(\phi \) which is only relevant to \(v_{ab}\). It is easy to check that
$$\begin{aligned} F_{ab}^{'}&= {\left( \frac{{\partial \phi }}{{\partial V}}\right) _{ab}} = {( - 2{X^\top }U + 2V{U^\top }U + 2\lambda SV)_{ab}} \end{aligned}$$
(9)
$$\begin{aligned} F_{ab}^{''}&= 2{({U^\top }U)_{bb}} + 2\lambda {S_{aa}} \end{aligned}$$
(10)
Since our updating rule is essentially element-wise, it is sufficient to show that each \(F_{ab}\) is non-increasing under the updating step in Eq. (8).
Lemma 2
$$\begin{aligned} \begin{aligned} G(v,v_{ab}^t) =\,&F_{ab}(v_{ab}^t) + F_{ab}^{'}(v_{ab}^t)(v - v_{ab}^t) \\&+\, \frac{{{{(V{U^\top }U)}_{ab}} + \lambda {{(SV)}_{ab}}}}{{v_{ab}^t}}{(v - v_{ab}^t)^2} \end{aligned} \end{aligned}$$
(11)
is an auxiliary function for \(F_{ab}\) which satisfies the conditions in Lemma 1.
Proof
Obviously, \(G(v,v)=F_{ab}(v)\). So we only need to prove that \(G(v,v_{ab}^t) \ge F_{ab}(v)\). To do this, we compare it with the Taylor series expansion of \(F_{ab}(v)\), which is exact because \(F_{ab}\) is quadratic in v:
$$\begin{aligned} {F_{ab}}(v) = {F_{ab}}(v_{ab}^t) + F_{ab}^{'}(v_{ab}^t)(v - v_{ab}^t) + [{({U^\top }U)_{bb}} + \lambda {S_{aa}}]{(v - v_{ab}^t)^2} \end{aligned}$$
Compared with Eq. (11), we observe that \(G(v,v_{ab}^t)\ge F_{ab}(v)\) is equivalent to
$$\begin{aligned} \frac{{{{(V{U^\top }U)}_{ab}} + \lambda {{(SV)}_{ab}}}}{{v_{ab}^t}} \ge {({U^\top }U)_{bb}} + \lambda {S_{aa}} \end{aligned}$$
(12)
We have
$$\begin{aligned} {(V{U^\top }U)_{ab}} = \sum \limits _{l = 1}^K {v_{al}^t{{({U^\top }U)}_{lb}} \ge v_{ab}^t{{({U^\top }U)}_{bb}}} \end{aligned}$$
and
$$\begin{aligned} \lambda {(SV)_{ab}} = \lambda \sum \limits _{j = 1}^N {{S_{aj}}v_{jb}^t} \ge \lambda {S_{aa}}v_{ab}^t . \end{aligned}$$
As a result, Eq. (12) holds and we have \(G(v,v_{ab}^t)\ge F_{ab}(v)\).
\(\square \)
We can now complete the proof of Theorem 1:
Proof
Replacing \(G(v,v_{ab}^t)\) in Eq. (8) with Eq. (11) and solving the resulting minimization, we obtain the updating rule for \(v_{ab}\):
$$\begin{aligned} v_{ab}^{t + 1} = v_{ab}^t - v_{ab}^t\frac{{F_{ab}^{'}(v_{ab}^t)}}{{2{{(V{U^\top }U)}_{ab}} + 2\lambda {{(SV)}_{ab}}}} = v_{ab}^t\frac{{{{({X^\top }U)}_{ab}}}}{{{{(V{U^\top }U + \lambda SV)}_{ab}}}}. \end{aligned}$$
Since Eq. (11) is an auxiliary function, \(F_{ab}\) is nonincreasing under this updating rule. \(\square \)
In summary, we conclude that the objective function \(\phi \) is non-increasing under the updating rules in Eqs. (5) and (6). Furthermore, the convergence analysis of GNMF (Yang et al. 2014) implies our algorithm still converges under the additional normalization steps.
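As an empirical sanity check of Theorem 1 (not a substitute for the proof), one can run the sketches above on random nonnegative data and verify that \(\phi \) never increases:

```python
rng = np.random.default_rng(0)
M, N, K, lam = 20, 30, 3, 0.5
X = rng.random((M, N))
S = similarity_matrix(rng.integers(0, K, size=N))   # a random reference clustering
U, V = rng.random((M, K)), rng.random((N, K))
prev = mnmf_objective(X, U, V, S, lam)
for _ in range(200):
    U, V = mnmf_update(X, U, V, S, lam)
    cur = mnmf_objective(X, U, V, S, lam)
    assert cur <= prev + 1e-8   # phi is nonincreasing (Theorem 1)
    prev = cur
```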