# SAGA: sparse and geometry-aware non-negative matrix factorization through non-linear local embedding


## Abstract

This paper presents a new non-negative matrix factorization technique which (1) allows the decomposition of the original data on multiple latent factors accounting for the geometrical structure of the manifold embedding the data; (2) provides an optimal representation with a controllable level of sparsity; (3) has an overall linear complexity allowing handling in tractable time large and high dimensional datasets. It operates by coding the data with respect to local neighbors with non-linear weights. This locality is obtained as a consequence of the simultaneous sparsity and convexity constraints. Our method is demonstrated over several experiments, including a feature extraction and classification task, where it achieves better performances than the state-of-the-art factorization methods, with a shorter computational time.

## Keywords

Non-negative matrix factorization · Manifold sampling · Kernel methods · Sparse projections · Simplex methods · Convexity constraints

## 1 Introduction

### 1.1 Context

Non-negative matrix factorization (or NMF for short) has long been studied and used as a powerful data analysis tool providing a basis for numerous processing tasks, such as dimensionality reduction, clustering, denoising, unmixing, etc. This article is concerned with a variant of NMF, where one tries to decompose a given matrix \(\mathbf{X}=[\mathbf{x}_1, \ldots , \mathbf{x}_n]_{(\mathbf{x}_i\in \mathbb {R}^p)}\) formed by a set of \(n\) samples described with \(p\) variables (\(\in \mathbb {R}^p\)) as a product of two matrices \(\mathbf{F}\in \mathbb {R}^{p \times \ell }\) and \(\mathbf{G}\in \mathbb {R}^{n \times \ell }\) such that \(||\mathbf{X}- \mathbf{F}\mathbf{G}^\top ||^2\) is minimized, under the constraints that the elements of \(\mathbf{G}\) are positive or nil, and that its rows sum to one. In other words, \(\mathbf{G}\) serves as a convex surrogate description of \(\mathbf{X}\) in the reduced embedding formed by \(\ell \) prototypes (the columns of \(\mathbf{F}\)).
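For concreteness, the objective and its constraints can be sketched numerically; this is a minimal illustration with made-up data, where `l`, `F` and `G` are hypothetical stand-ins for the quantities above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, l = 100, 5, 4                       # samples, variables, prototypes

X = rng.random((p, n))                    # data matrix, one column per sample
F = X[:, rng.choice(n, l, replace=False)] # l prototype columns (illustrative)

# A feasible G: nonnegative rows summing to one (convex combination weights)
G = rng.random((n, l))
G /= G.sum(axis=1, keepdims=True)

# The quantity NMF minimizes: the Frobenius reconstruction error
err = np.linalg.norm(X - F @ G.T)
assert np.all(G >= 0) and np.allclose(G.sum(axis=1), 1.0)
```

Each row of `G` then gives the convex weights with which one sample is expressed on the \(\ell \) prototypes.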

This convexity property on \(\mathbf{G}\) is desirable in a lot of applications for its relation with the physics of the underlying observations: in this interpretation, the datapoints are usually observations of a mixing process where the components of the mixture are not known. Hyperspectral images are a good example, as each pixel describes a spectrum that can be defined as a combination of pure material spectra (trees, concrete, water, etc.); the combination occurs because of the sensor's spatial resolution and various scattering effects (Esser et al. 2012; Bioucas-Dias et al. 2012). Other examples of applications are found in archetypal analysis (Mørup and Hansen 2012), or biology (Kersting et al. 2012).

### 1.2 Notations

Bold capital letters, such as \(\mathbf{X}\) or \(\mathbf{M}\) refer to matrices. The transposition operator is denoted \({\cdot }^\top \) (so that \(\mathbf{M}^\top \) refers to the transpose of \(\mathbf{M}\)). The set of \(m\times n\) real matrices (respectively real matrices with nonnegative entries) is denoted \(\mathbb {R}^{m \times n}\) (respectively \(\mathbb {R}^{m \times n}_+\)). \(\mathbf{I}_n\) is the \(n\times n\) identity matrix, and \(||\cdot ||\) refers to the Frobenius norm. For any matrix \(\mathbf{M}\), \(\mathbf{M}_{\bullet i}\) (respectively \(\mathbf{M}_{i \bullet }\)) corresponds to the \(i\)th column (respectively \(i\)th row) of \(\mathbf{M}\). However, in the particular case of \(\mathbf{X}\), each column \(\mathbf{X}_{\bullet i}\) corresponds to the datum \(\mathbf{x}_i\), so that this latter more intuitive notation is preferred. \(\Delta ^{n}(\mathbf{M})\) is a \((n-1)\)-simplicial polytope formed by \(n\) columns of \(\mathbf{M}\). The unit \((n-1)\)-simplex \(\Delta ^{n}(\mathbf{I}_n)\) is simply referred to as \(\Delta ^{n}\).

### 1.3 Related work

Despite these appealing properties, NMF presents a number of difficulties: First, while real data are often embedded on complex manifolds, the seminal convex NMF formulation prohibits a large number of latent factors that would explain the data in a geometry-preserving way; Second, the level of sparsity in the final embedding (the matrix \(\mathbf{G}\)) is rarely controllable, and far from data analyst expectations; Finally, the computational time is prohibitive when it comes to handling very large matrices. Even if, to date, various state-of-the-art methods already tackle one or two of these issues, no method addresses all three together.

The convexity constraint we consider on \(\mathbf{G}\) differs from that of what is classically referred to as Convex NMF (Ding et al. 2010), i.e. NMF where the columns of \(\mathbf{F}\) are convex combinations of columns of \(\mathbf{X}\). However, our problem is also not unheard of in the literature, as, beyond its ease of interpretation, it provides a very appealing computational framework: if \(\mathbf{G}\) encodes convex combinations, the \(\ell \) columns of \(\mathbf{F}\) are expected to characterize a hull of \(\mathbf{X}\) in \(\mathbb {R}^p\), as data points lying within are perfectly reconstructed. Based on this idea, it is possible to separate the NMF problem into two steps, namely the computation of \(\mathbf{F}\) (in order to determine, or to best approximate, the hull of the dataset) and of \(\mathbf{G}\) (which corresponds to the projection of the dataset onto the region bound by the hull). From a computational point of view, this is really interesting, as it is now possible to decompose the NMF (which amounts to finding two matrices with a close relationship) into two simple problems, each focused on a particular matrix: first one computes \(\mathbf{F}\) regardless of \(\mathbf{G}\), and second, one computes \(\mathbf{G}\) by projecting the columns of \(\mathbf{X}\) onto \(\mathbf{F}\). With such an approach, one can expect a lower complexity than that of older methods based on singular value decomposition (with a \(o(n^3)\) complexity), or of demanding procedures that iteratively alternate between the minimization of the two matrices, such as in Ding et al. (2010).

Initially, \(\mathbf{F}\) was related to the *convex hull* of \(\mathbf{X}\) (Cutler and Breiman 1994): Notably, numerous works in the remote sensing and hyperspectral imaging community (see Bioucas-Dias et al. 2012 as well as its references) have pointed out that \(\mathbf{F}\) should be chosen so that \(\Delta ^{\ell }(\mathbf{F})\) best encompasses \(\mathbf{X}\), while other investigations focused on how to find this simplicial convex hull (Wang et al. 2010; Thurau et al. 2010; Çivril and Magdon-Ismail 2009). More recently, a series of works (Arora et al. 2012; Kumar et al. 2013; Gillis and Vavasis 2013; Recht et al. 2012) focused on the *conical hull* of \(\mathbf{X}\): Here, the idea is to find \(\mathbf{F}\) so that it spans a cone encompassing the dataset. The reason for the recent focus on the conical hull is that it is the geometric translation of an important assumption in NMF, the *separability assumption* proposed by Donoho and Stodden (2003). This separability assumption states that (1) the residue \(\mathbf{X}- \mathbf{F}\mathbf{G}^\top \) is nil, and (2) the columns of \(\mathbf{F}\) are collinear to columns of \(\mathbf{X}\). This second condition regarding the collinearity of the columns of \(\mathbf{F}\) and \(\mathbf{X}\) is interesting beyond the separability assumption: whatever the type of hull defined by \(\mathbf{F}\), it is possible to assume that the components of the mixture belong to the data, and that any datum is a convex combination of a restricted number of particular selected datapoints (those forming \(\mathbf{F}\)). This approach drastically reduces the complexity of finding \(\mathbf{F}\), as the latter is simply defined by a *column subset selection* (CSS) of \(\mathbf{X}\).

Figure 1 illustrates a dataset lying on a non-linear manifold resulting from some hidden factors that could be of interest, as well as its conical and convex hulls. It clearly appears that the geometries of these hulls are not adapted to that of the dataset. So far, there has been little interest in trying to characterize the boundary of the manifold dataset in spite of its non-convexity; in the sequel, we shall address this boundary with the shorter and more evocative name *manifold hull*, illustrated in Fig. 1d. Naturally, precisely characterizing such a manifold hull would require increasing \(\ell \), the number of datapoints involved in the CSS; this stresses, to an even larger extent, the need for an adapted control of the sparsity level that is already sought in numerous applications: Each point should be described as a convex combination of a restricted number \(\lambda \) of prototype points among the \(\ell \) of the CSS.

Even if the characterization of the manifold hull has never been addressed so far, it is well known that the kernel trick is an efficient way to provide a manifold-preserving description of a dataset. In fact, several pre-existing works applied the kernel trick to NMF (Zhang et al. 2006; Buciu et al. 2008; Cai et al. 2011). However, the convexity constraint on \(\mathbf{G}\) (and consequently the notion of manifold hull) does not appear. Moreover, neither sparsity nor linear complexity is achieved.

### 1.4 Proposal

In this article, we present SAGA (*Sparse And Geometry-Aware*), a new NMF method avoiding the aforementioned limits. It operates in a Reproducing Kernel Hilbert Space (RKHS) (Schölkopf and Smola 2002), where \(\mathbf{F}\) is defined by a CSS. Then, according to an expected sparsity level, \(\mathbf{G}\), the best dataset projection onto the simplex formed by \(\mathbf{F}\) is computed. The advantages of our method are:

*First*, kernelization is interesting to several extents: (i) it makes the algorithm compliant with datasets where only relationships among objects are available; (ii) its regularization property improves robustness to noise; (iii) it allows using more latent factors than the dimensions of the input data, providing an insightful tool to consider the geometric structure of the data manifold. As both the number of latent factors and the sparsity level increase, the locality of the support of \(\mathbf{G}\) increases, turning the factorization problem into a *non-linear local embedding* of the original data. As shown in some recent works (Yu et al. 2009; Guillemot and Turkan 2012), this kind of embedding is powerful for describing the non-linear structure of the data manifold, and as such serves as a very good feature extraction framework.

*Second*, the CSS is defined thanks to a manifold subsampling method (Shroff et al. 2011), which reaches a linear complexity with respect to \(n\), the size of the dataset.

*Third*, the computation of \(\mathbf{G}\) corresponds to a sparse RKHS simplex projection, which is a non-linear optimization problem. Based on recent advances on sparse projected gradient methods, the projection is solved with an algorithm of linear complexity (with respect to \(n\)), while naturally embedding sparsity mechanism control.

*Fourth* and finally, since the computations of both \(\mathbf{F}\) and \(\mathbf{G}\) are linear, the overall complexity of SAGA is linear. This makes the algorithm perfectly suitable for very large datasets.

To the best of our knowledge, no state-of-the-art method simultaneously compiles all these advantages.

### 1.5 Contributions

- 1.
The SiVM procedure (Thurau et al. 2012), on which our CSS computation builds, is slightly generalized in order to operate in the Hilbert space reproducing the Gaussian kernel, rather than in \(\mathbb {R}^{p}\).

- 2.
Most importantly, while SiVM proposes an approximate optimization (as several simplifications are made to reach a linear complexity), we propose another solution to the simplex volume maximization which performs an exact optimization.

- 3.
The latter is adapted to the projection over the CSS in the RKHS, and its linear complexity is kept.

- 4.
We provide theoretical bounds on the convergence of the projector (linked to the kernel bandwidth, the minimum pairwise distance within the CSS and the sparsity level).

### 1.6 Outline

We solve our NMF problem by separating the computations of the CSS matrix and of the projection matrix. This is why Sects. 2 and 3 focus on the computations of \(\mathbf{F}\) and \(\mathbf{G}\) respectively. More specifically, the structure of Sect. 2 is the following: First, a short introductory paragraph recalls the basics of the kernel trick, as well as why it is interesting to operate in a RKHS to fulfill our objectives. Then, Sect. 2.1 investigates the consequences of working in such a RKHS; they lead us to a particular strategy, which is to characterize the manifold hull of the dataset in \(\mathbb {R}^p\) via a simplicial convex hull in the RKHS, and to define the latter with a manifold sampling strategy. At this point, computational constraints direct us toward SiVM-like procedures rather than toward more resource-demanding ones. Section 2.2 jointly presents SiVM, as defined in the literature (Thurau et al. 2012), as well as the kernel generalization we propose, while Sect. 2.3 describes our modifications to reach the exact maximal simplex volume. In a similar way, Sect. 3 is divided into two parts: the first one (Sect. 3.1) explains how to project the image in the RKHS of any datum \(\mathbf{x}_i\) onto the image of the CSS in the RKHS, under sparsity constraints; Sect. 3.2 provides a proof that the projector defined in Sect. 3.1 converges. As it is established in the literature (Garg and Khandekar 2009) that the Restricted Isometry Property (RIP, Candes 2008) implies convergence, we prove that our projector respects the RIP for particular tunings of its parameters. At this stage, most of the required mathematics is established. Then, Sect. 4 summarizes the entire procedure, made of the concatenation of the two matrix computations, through an easy-to-implement algorithm, completed by some theoretical assessments of its linear complexity. Finally, Sect. 5 is devoted to experimental validations.
In the first series of experiments, we focus on toy examples, in order to illustrate the behavior of the manifold hull. Then follow several experiments on simulated datasets, in order to compare the computational efficiency and precision of SAGA with respect to the state-of-the-art. Finally, we consider real datasets through challenging image classification tasks.

## 2 Geometry aware CSS procedure

Despite living in \(\mathbb {R}^p\), \(\mathbf{X}\) may span a nonlinear manifold of intrinsic dimensionality lower than \(p\). A major problem is thus to extend to this nonlinear manifold the classical statistics that are used to work in a vector space. To do so, an essential element is to replace the Euclidean distances by geodesic distances. Depending on the manifold, the associated metric may be difficult to formally define. However, it is possible to characterize it through the time-scale of the well-studied heat diffusion process, the kernel formulation of which is well approximated by the Gaussian kernel (Lafferty and Lebanon 2005; Lafon and Lee 2006): The geometry of the manifold in which \(\mathbf{X}\) lies is captured by \(\mathbf{K}_{ij} = k(\mathbf{x}_i,\mathbf{x}_j) = \exp {\left( {-{||\mathbf{x}_i-\mathbf{x}_j||^2}/{2\sigma ^2}}\right) }\) with variance \(\sigma ^2\). Let us call \(\phi (\cdot )\) the implicit feature map from \(\mathbb {R}^p\) onto \(\mathcal {H}\), the RKHS associated to \(k(\cdot ,\cdot )\). We use the shorthand notation \(\mathbf{\Phi }= \phi (\mathbf{X}) = [\phi (\mathbf{x}_1), \ldots ,\phi (\mathbf{x}_n)]_{(\phi (\mathbf{x}_i) \in \mathcal {H})}\). Then, following the notations of Ding et al. (2010), the SAGA solution amounts to finding the indicator matrix \(\mathbf{W}^{(\ell )}\in \{ 0,1 \}^{n \times \ell }\) defining the CSS^{1} and the projection matrix \(\mathbf{G}\) where \(\phi (\mathbf{F})= \phi (\mathbf{X}\mathbf{W}^{(\ell )}) = \mathbf{\Phi }\mathbf{W}^{(\ell )}\), such that \(||\mathbf{\Phi }- \mathbf{\Phi }\mathbf{W}^{(\ell )}\mathbf{G}^\top ||^2\) is minimized under convexity constraints.
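As a minimal sketch (not the authors' implementation), the Gaussian Gram matrix \(\mathbf{K}\) can be computed as follows; note that, for distinct points and a small enough bandwidth, it is numerically full rank even though \(n > p\):

```python
import numpy as np

def gaussian_gram(X, sigma):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for columns x_i of X."""
    sq = (X ** 2).sum(axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # squared pairwise distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 50))   # n = 50 samples in R^p with p = 2
K = gaussian_gram(X, sigma=0.1)    # small bandwidth: K is close to I_n

assert np.allclose(K, K.T) and np.allclose(np.diag(K), 1.0)
assert np.linalg.matrix_rank(K) == 50   # full rank despite n > p
```

The full rank of \(\mathbf{K}\) reflects the fact that each sample spans its own dimension in the RKHS, which is the geometric content of Remark 1 below.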

### 2.1 CSS as a manifold sampling procedure

Let us start by a basic remark,

### *Remark 1*

In the input space, we must have \(\ell \le n\) and \(\ell \le p\). In the Gaussian RKHS, one still has \(\ell \le n\); however, \(\ell > p\) becomes possible, since each sample spans its own dimension.

which leads to the following property:

### *Property 1*

(Non-Separability) Separability assumption does not hold for NMF in the Gaussian RKHS.

### *Proof*

According to Remark 1, the image of any datum not included in \(\mathbf{X}\mathbf{W}\) cannot be expressed as a linear combination of the elements of \(\phi (\mathbf{X}\mathbf{W})\). \(\square \)

Thus, one should not consider conical hull in the RKHS. However, Remark 1 leads to:

### **Corollary 1**

In the Gaussian RKHS, it is possible to use more than \(p\) points in the CSS. The selected points form a non-simplicial polytope in \(\mathbb {R}^p\), while their image in the Gaussian RKHS is a simplex.

In other words, the manifold hull of \(\mathbf{X}\) can be characterized through the simplicial convex hull \(\Delta ^{\ell }(\phi (\mathbf{F}))\). Then, it follows that:

### **Corollary 2**

No sample lies in \(\Delta ^{\ell }(\phi (\mathbf{F}))\) and all the samples will be projected on hyperfaces of \(\Delta ^{\ell }(\phi (\mathbf{F}))\), leading to approximate reconstructions. Yet, such an approximation comes with the appealing property of sparsity, discussed later in the article.

At this point, finding \(\Delta ^{\ell }(\phi (\mathbf{F}))\) amounts to finding \(\mathbf{W}\), which turns out to subsample the boundary of the data manifold. Most of the methods from the literature address it with the objective of maximizing the representation of the dataset, while here, we are interested in its boundary, which makes the sampling completely different: For instance, a kernel \(k\)-means sampling, although very efficient to subsample a given manifold (Lafon and Lee 2006), leads to the Convex NMF of Ding et al. (2010), the aim of which is completely different from ours.

However, our problem is not completely unheard of: In Shroff et al. (2011), the authors consider maximizing the diversity of the selected samples using a Karcher variance^{2} maximization criterion. Alternative formulations exist, where one seeks the maximum-volume parallelepiped in the data matrix (Çivril and Magdon-Ismail 2009). Interestingly enough, whatever the interpretation (diversity or volume criterion), the corresponding computation can reduce to recursive QR decompositions of the data matrix (Gillis and Vavasis 2013; Shroff et al. 2011). However, in the RKHS, since no explicit coordinates are available, those methods cannot be transposed. The recent proposal of Courty and Burger (2013) regarding a kernel rank-revealing Cholesky technique is also of interest; unfortunately, it fails to scale up to big data, because it implies a computationally demanding decomposition at each selection step.

Finally, among all the methods available in the literature, if one discards those (1) which do not sample the boundary, (2) which cannot be conducted in a RKHS, (3) which do not have a linear complexity, we are aware of a single remaining method: the Simplex Volume Maximization (SiVM) (Thurau et al. 2012).

### 2.2 Original simplex volume maximization

### *Remark 2*

Naturally, the magnitude to which the constant distance assumption is violated strongly depends on the dataset. As a consequence, the instantiation of Eq. 3 into Eq. 5 may lead to some approximations, as already noted in Gillis and Vavasis (2013), where it is shown that SiVM can underperform on ill-conditioned datasets.

Despite this remark, the general principle advocated in Eq. 3 remains unalterably valid. This is why we rely on it to propose an alternative to SiVM, which provides exact volume computation, with a similar linear complexity.

### 2.3 Exact simplex volume maximization

### *Remark 3*

This procedure tends to add the datum that most changes the geometry of the manifold spanned by the CSS. As such, it acts like the spectral sampling procedure proposed in Öztireli et al. (2010) for sampling a 3D mesh in a computer graphics context.

## 3 Sparse projections onto the RKHS simplex

In this section, we focus on the computation of \(\mathbf{G}\). We give the formulation of our sparse RKHS simplex projector, and then we discuss its convergence. We notably show some analytical bounds required for the convergence of the method.

### 3.1 Projection on the RKHS simplex

### *Remark 4*

Since the GSSP procedure projects \(\mathbf{G}_{i \bullet }\) onto the subspace spanned by the \(\ell \) columns of \(\mathbf{F}\), the coordinates of \(\mathbf{G}_{i \bullet }\) embed the projection onto each of the \(\ell \) selected elements in the feature space, so that the sparsity support is chosen among the closest elements in the feature space.

The Gaussian kernel is monotonically decreasing with respect to the distance between neighbors; this implies that, for each datum, the sparsity support is made of the closest CSS elements in the input space. It follows that:

### *Property 2*

(Non-linear local coding) The sparse RKHS simplex projector describes any element \(\mathbf{x}_i\) of \(\mathbf{X}\) with \(\mathbf{G}_{i \bullet }\), which can be interpreted as its non-linear (because of the kernel non-linearity) barycentric coordinates with respect to \(\lambda \) prototype points. These prototype points are found among the closest elements of the CSS, thus providing a non-linear local coding for \(\mathbf{x}_i\).

### 3.2 Convergence of the projector

Finally, we establish the convergence of this projector, which ensures (Kyrillidis et al. 2013) that the final vector \(\mathbf{G}_{i \bullet }\) is the best \(\lambda \)-sparse solution. To do so, we rely on Garg and Khandekar (2009), which states that for Eq. 11 to be minimized via the projected gradient approach, \({\mathbf{\Phi }\mathbf{W}}\) has to satisfy the \(\lambda \)-restricted isometry property (or \(\lambda \)-RIP for short), with \(\delta _\lambda \in [0,1[\). The latter reads:

### **Definition 1**

A matrix \(\mathbf{M}\) satisfies the \(\lambda \)-*restricted isometry property* (Candes 2008) with constant \(\delta _\lambda \) if, for any \(\lambda \)-sparse vector \(\alpha \), \((1-\delta _\lambda )\,||\alpha ||^2 \le ||\mathbf{M}\alpha ||^2 \le (1+\delta _\lambda )\,||\alpha ||^2\).

As a matter of fact, such convergence holds for particular values of \(\lambda \) and \(\sigma \), as stated by the following proposition:

### **Proposition 1**

If \(\lambda > 2\), and if \(\sigma < \frac{d_{\mathrm{{min}}}}{\sqrt{2\ln (\lambda -1)}},\) then, \({\mathbf{\Phi }\mathbf{W}}\) satisfies the \(\lambda \)-RIP with constant \(\delta _\lambda \in [0,1[\). If \(\lambda \le 2\), there is no particular bound to \(\sigma \).

### *Proof*

Let us first note that, since we are working in the Gaussian RKHS, the columns of \({\mathbf{\Phi }\mathbf{W}}\) have unit norms. Let \(\mathbf{K}_\lambda \) denote the Gram matrix of any \(\lambda \) of these columns, indexed by \(\mathcal {L}\): its diagonal entries are 1, and \(\forall i,j \in \mathcal {L}\), the off-diagonal element \((i,j)\) is \(k(\mathbf{x}_i,\mathbf{x}_j)\).

This allows deriving conditions on \(\sigma \) for different convergence or approximation guarantees, such as \(\delta _{2\lambda }<1/3\) for convergence (Garg and Khandekar 2009) (Theorem 2.1). This point is discussed in Kyrillidis et al. (2013) (Sect. 2). We note that similar developments using the Gershgorin circle theorem have been used for the deterministic creation of projection matrices in the domain of compressed sensing (Bandeira et al. 2012) (Sect. 2.1).
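The Gershgorin argument behind Proposition 1 can be checked numerically. The sketch below assumes \(\lambda \) points at exactly the minimum pairwise distance \(d_{\mathrm{min}}\) (the worst case for the bound) and a bandwidth just under the threshold of Proposition 1:

```python
import numpy as np

lam, d_min = 5, 1.0
sigma = 0.9 * d_min / np.sqrt(2.0 * np.log(lam - 1))  # just under the bound

# lam points with all pairwise distances exactly d_min: scaled simplex vertices
P = d_min * np.eye(lam) / np.sqrt(2.0)
d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2.0 * sigma ** 2))                  # Gram matrix K_lambda

# Gershgorin: eigenvalues lie within 1 +/- the off-diagonal row sum, which is
# bounded by (lam - 1) * exp(-d_min^2 / (2 sigma^2)); RIP needs this < 1
radius = (lam - 1) * np.exp(-d_min ** 2 / (2.0 * sigma ** 2))
eig = np.linalg.eigvalsh(K)
assert radius < 1.0
assert eig.min() > 1.0 - radius - 1e-9 and eig.max() < 1.0 + radius + 1e-9
```

With \(\sigma \) above the bound, `radius` exceeds 1 and the Gershgorin discs may reach 0, so the RIP constant \(\delta _\lambda < 1\) is no longer guaranteed.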

## 4 Complete SAGA procedure

We give in Alg. 1 the complete SAGA matrix factorization procedure.^{3} We then discuss its computational complexity.

### 4.1 Algorithm

### 4.2 Computational complexity

### *Property 3*

(Computational complexity for \(\mathbf{W}\)—noted as **First step** in Alg. 1) The selection of the CSS defining \(\mathbf{W}\) based on the exact and incremental simplex volume maximization has a complexity of \(o(n \ell ^3)\), i.e. it has a linear complexity with respect to \(n\).

### *Proof*

The entire CSS construction is based on Eqs. 8 and 9. At each step \(i\) of the selection, the procedure amounts to a linear scan where the volume increment is computed for each element in the dataset (size \(n\)). For one candidate element \(\mathbf{x}_q\), this requires computing the associated bilinear form \(\mathbf{k}_{\mathbf{x}_q}^\top \cdot \mathbf{K}^{-1}_{i-1} \cdot \mathbf{k}_{\mathbf{x}_q}\), where \(\mathbf{k}_{\mathbf{x}_q}\) is the vector of kernel evaluations between \(\mathbf{x}_q\) and the current CSS, with an asymptotic computational complexity of \(o(i^2)\). As \(i\) varies from 1 to \(\ell \), one ends up with a cubic complexity term regarding \(\ell \). \(\square \)

We note that, thanks to Eq. 9, almost all the values (except the one corresponding to the kernel evaluation with the last chosen element) have already been computed in the previous iteration and do not need to be computed again, which makes the overall CSS computation that efficient. However, for exceptionally large data matrices, it is possible to improve it with the randomized approximate search of Thurau et al. (2010), which can be directly adapted. Finally, let us remark that, in spite of being linear in terms of \(n\) (the number of elements in the dataset), the procedure is not linear with respect to \(\ell \). However, \(\ell \) is classically several orders of magnitude smaller than \(n\), so this is seldom important; in a similar way, SiVM is also not linear with respect to \(\ell \).
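Although Eqs. 8 and 9 are not reproduced here, the kind of incremental quantity appearing in the proof of Property 3 can be illustrated with the standard Gram-determinant (Schur complement) identity: appending a candidate \(\mathbf{x}_q\) to the CSS multiplies the Gram determinant by \(k(\mathbf{x}_q,\mathbf{x}_q) - \mathbf{k}_{\mathbf{x}_q}^\top \mathbf{K}^{-1}\mathbf{k}_{\mathbf{x}_q}\). A minimal numerical check of this identity:

```python
import numpy as np

def gauss(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
pts = [rng.standard_normal(3) for _ in range(4)]   # current CSS (toy data)
cand = rng.standard_normal(3)                      # candidate element

K = np.array([[gauss(a, b) for b in pts] for a in pts])  # current Gram matrix
k = np.array([gauss(a, cand) for a in pts])              # kernel vector

# Schur complement identity: det(K_new) = det(K) * (k(c,c) - k^T K^{-1} k),
# so the bilinear form k^T K^{-1} k suffices to score each candidate
increment = gauss(cand, cand) - k @ np.linalg.solve(K, k)
K_new = np.block([[K, k[:, None]], [k[None, :], np.array([[1.0]])]])
assert np.isclose(np.linalg.det(K_new), np.linalg.det(K) * increment)
```

This is why one bilinear form per candidate, reusing the previously computed \(\mathbf{K}^{-1}\), is enough to rank all \(n\) candidates at each step.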

The complexity of the computation of the projection matrix \(\mathbf{G}\) is by construction linear with respect to \(n\), as in the algorithm it is decomposed into a succession of \(n\) independent projections. Moreover, the complexity of each projection is linear with respect to the sparsity level \(\lambda \):

### *Property 4*

(Computational complexity for \(\mathbf{G}\)—noted as **Second step** in Alg. 1) Each projection has a linear complexity regarding \(\lambda \).

### *Proof*

Each projection is computed through a gradient descent, each iteration of which has a complexity dominated by that of the GSSP (Eq. 12). The latter requires getting the \(\lambda \) greatest values of \(\mathbf{G}^{t}_{i \bullet }\). To do so, the GSSP classically relies on sorting the elements of \(\mathbf{G}^{t}_{i \bullet }\) (a vector of size \(\ell \)), with a \(o(\ell \log (\ell ))\) complexity. However, it is possible to be more efficient by achieving this task in \(o(\lambda )\) thanks to the median-finding algorithm (Maculan and Paula 1989). Finally, if \(q\) iterations are needed in the descent, then the total complexity of one projection is \(o(q \lambda )\). \(\square \)
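A minimal sketch of such a sparse simplex projection step (our reconstruction, not the authors' code): `argpartition` plays the role of the selection step, and the classical sort-based simplex projection is used for clarity, at \(o(\lambda \log \lambda )\) instead of the \(o(\lambda )\) median-finding variant mentioned above:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the unit simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / np.arange(1, len(v) + 1))[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def gssp(g, lam):
    """Keep the lam largest entries of g, project them onto the simplex."""
    out = np.zeros_like(g)
    idx = np.argpartition(g, -lam)[-lam:]   # support: lam greatest values
    out[idx] = project_simplex(g[idx])
    return out

g = np.array([0.4, -0.1, 0.9, 0.2, 0.05])
h = gssp(g, lam=2)
assert np.count_nonzero(h) <= 2 and np.isclose(h.sum(), 1.0) and np.all(h >= 0)
```

The output is a nonnegative, \(\lambda \)-sparse vector summing to one, i.e. a feasible row of \(\mathbf{G}\).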

Let us note here that \(q\) typically depends on the choice of the magnitude of the gradient step \(\varepsilon _t\), which can be efficiently set following the results of Garg and Khandekar (2009). In practice, only a few tens of iterations are necessary, and as long as \(n \gg q\lambda \), the influence of the number of iterations is immaterial.

Finally, as both the definition of the CSS and the projection have a linear complexity with respect to the dataset size, the overall SAGA procedure has a linear complexity as well. This allows factorizing very large matrices in tractable time.

## 5 Experiments and results

In this section, we first observe the behavior of SAGA on some toy datasets. Our goal is to verify the behavior of our algorithm with respect to the theoretical properties given in the previous section, with a special focus on the nature of the subsampling occurring in the RKHS. Then, the volume maximization strategy is discussed, as well as its impact on the reconstruction errors over a toy and a real dataset. We finally discuss the potential use of SAGA in a feature extraction context for classification purposes. The performances of our method are then compared to a selection of state-of-the-art methods performing NMF with characteristics shared by our method (sparsity, kernels, convexity, etc.).

### 5.1 Experiments on toy datasets

### 5.2 Comparison on subsampling strategies

The quality of the overall matrix factorization procedure relies on two elements: the ability of the CSS to correctly define the manifold hull, and the precision of the projection. A particular focus is given here on the first one, as various sampling strategies are compared and discussed.

The goal of our first comparison is to confirm that, among the methods based on volume maximization, ours is the most accurate. To this end, we compare three volume maximization methods (operating in the RKHS, to fit the objectives of this work): The first one is the very method from Thurau et al. (2012), reported through Eq. 5, including the constant distance assumption as well as the replacement of squared distances by simple distances. As this point is only supported by computational considerations, we also consider the method summarized by Eq. 4, where the constant distance assumption still holds, but where the squared distances are kept. These two methods are referred to as SiVM-approx-(4) and SiVM-approx-(5). Naturally, the third one is that of SAGA, based on Eqs. 8 and 9. As a reference, we consider the exact volume computation based on the Cayley–Menger determinant (CMD), and we compare the ratios of the volumes derived by the three methods over the reference one. Due to the computational cost of the CMD, we have restricted the size of the CSS to \(\ell =8\).

| | Uniform | Ill-conditioned | COIL-20 |
|---|---|---|---|
| SiVM-approx-(4) | \(98.93\ (0.07)\) | \(59.18\ (1.6)\) | \(97.37\ (0.12)\) |
| SiVM-approx-(5) | \(96.03\ (0.21)\) | \(55.85\ (1.5)\) | \(96.28\ (0.13)\) |
| SAGA | \(\mathbf{100.13}\ (\mathbf{2\times 10^{-3}})\) | \(\mathbf{100.04}\ (\mathbf{6\times 10^{-3}})\) | \(\mathbf{100.00}\ (\mathbf{0.00})\) |

Some conclusions can be drawn from Table 1. First, the lower ratios of SiVM-approx-(4) and -(5) for ill-conditioned datasets confirm the conclusion of Gillis and Vavasis (2013) regarding the constant distance approximation, while SAGA is unaffected. Second, one can notice a slight drop in performance due to the supplemental approximation in SiVM-approx-(5), where the squared distances are no longer considered for computational reasons. Also, for all datasets, SAGA finds the simplex volumes most similar to the reference ones, as its ratios are the closest to 1. Oddly enough, the ratios are sometimes even slightly greater than 1, due to the numerical imprecision of the determinant computations, which may induce different choices for the simplex vertices. In the meantime, while the volume differences between SAGA and CMD are low, their respective computational performances are quite different. As an example, with the COIL dataset, each CMD computation requires \(\approx 10\) s, whereas SAGA takes \(\approx 0.01\) s in an unoptimized implementation. As a conclusion, SAGA provides a volume as large as that of the CMD approach, yet in much less time. The computational accuracy of SAGA is further investigated in the next subsection.
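For reference, the exact volume computation via the Cayley–Menger determinant used as the baseline above can be sketched as follows (a generic textbook formula, not the authors' code):

```python
import numpy as np
from math import factorial

def cmd_volume(V):
    """Volume of the simplex with vertex matrix V (one vertex per row),
    computed from the Cayley-Menger determinant."""
    m = V.shape[0] - 1                                  # simplex dimension
    D2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1) # squared distances
    CM = np.ones((m + 2, m + 2))                        # bordered matrix
    CM[0, 0] = 0.0
    CM[1:, 1:] = D2
    vol2 = (-1) ** (m + 1) / (2 ** m * factorial(m) ** 2) * np.linalg.det(CM)
    return np.sqrt(max(vol2, 0.0))

# Sanity check on a unit right triangle, whose area is 1/2
V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
assert np.isclose(cmd_volume(V), 0.5)
```

The cost of the determinant grows quickly with \(\ell \), which is why the comparison above restricts the CSS size to \(\ell = 8\).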

Now that it is established that SAGA provides a better hull for the manifold than classical methods based on a kernelization of the simplex volume maximization, let us compare it to other manifold sampling strategies. As it is not meaningful to use the simplex volume as a criterion, we consider the normalized reconstruction errors, as with the toy datasets. We can also consider larger CSS sizes, ranging from 10 to more than 100, since the computation of the CMD is no longer an issue. In the comparison, we keep SiVM-approx-(5), as it corresponds to a kernelized version of the original paper (Thurau et al. 2012). We also consider random sampling (the NMF reducing to a random projection algorithm), kernel \(k\)-means, and the kernel rank-revealing Cholesky of Courty and Burger (2013), as a surrogate for the rank-revealing QR method of Shroff et al. (2011), which cannot be kernelized.

### 5.3 Computational complexity

If we now turn to the complexity of the projection, it is not meaningful to study it with respect to \(n\), since the projection is performed datum by datum. It is however interesting to consider the complexity with respect to \(\ell \). To do so, we use the same experimental setting, but \(n\) is fixed and \(\ell \) varies from 1 to 120. Fig. 6b clearly shows that our projector outperforms the quadratic programming approach used in SiVM.
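For intuition on why a dedicated projector can beat generic quadratic programming, consider the classical sort-based Euclidean projection onto the probability simplex (a median-pivot variant, as in Maculan and de Paula 1989, runs in linear time). This is only an illustrative sketch: SAGA's actual projector operates in the RKHS and enforces sparsity, but it shares this per-datum, low-complexity structure.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1},
    in O(l log l) per datum via sorting."""
    u = np.sort(v)[::-1]          # sort coordinates in decreasing order
    css = np.cumsum(u)
    # Largest index rho such that u_rho * rho > css_rho - 1 (1-based).
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)
```

Each datum is projected independently, so the overall cost grows linearly with the number of samples, in line with the complexity claims above.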

### 5.4 Application to classification on real datasets

In order to propose a more application-oriented test of SAGA, we use the result of various NMF methods as feature extraction processes in a classification problem. Experiments are conducted on four publicly available datasets: the previously described COIL-20 dataset, the CMU PIE dataset, the MNIST dataset and the CIFAR-10 dataset. The CMU PIE dataset contains \(32\times 32\) gray scale facial images of 68 people, each of whom appears in 42 images taken under different illumination and facial expression conditions (Sim et al. 2002). The MNIST dataset (LeCun et al. 1998) contains 70,000 \(28\times 28\) gray scale images of handwritten digits (from 0 to 9). Finally, CIFAR-10 is made of 60,000 \(32\times 32\) color images in 10 classes, with 6,000 images per class.
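The evaluation pipeline can be sketched as follows: each image is encoded by its row of coefficients in \(\mathbf{G}\) (its convex weights over the \(\ell \) selected prototypes), and a classifier is trained on these features. The snippet below uses a simple 1-nearest-neighbor rule for illustration only; the function name and the choice of classifier are ours, not a restatement of the exact experimental protocol.

```python
import numpy as np

def knn1_predict(G_train, y_train, G_test):
    """1-NN classification in the reduced embedding: each sample is
    described by its row of G (coefficients over the l prototypes)."""
    preds = []
    for g in np.atleast_2d(G_test):
        d = np.linalg.norm(G_train - g, axis=1)  # distances to train rows
        preds.append(y_train[int(np.argmin(d))])
    return np.array(preds)
```

Because \(\mathbf{G}\) is sparse and local, nearby samples share active prototypes, which is what makes such coefficients usable as discriminative features.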

Mean accuracy and variance of NMF-based classification

| | COIL-20 | PIE | MNIST | CIFAR-10 |
|---|---|---|---|---|
| Sparse NMF | 88.04 (0.66) | 80.24 (1.25) | 86.08 (1.99) | 38.34 (0.56) |
| Kernel NMF | 94.38 (1.75) | 79.89 (7.25) | 83.61 (2.86) | |
| Convex NMF | 83.46 (2.84) | 54.95 (9.07) | 80.66 (2.80) | 31.31 (0.45) |
| KC NMF | 85.18 (0.39) | 59.80 (1.75) | 81.94 (0.35) | 25.76 (0.16) |
| SK SiVM | 96.79 (1.34) | 83.73 (9.32) | 89.71 (3.78) | 35.50 (2.20) |
| SAGA | | | | 35.59 (2.44) |
| \(t\)-statistics | 3.2327 | 1.8872 | 2.3257 | X |
| Confidence level | \({\ge }99\,\%\) | \({\ge }95\,\%\) | \({\ge }97.5\,\%\) | X |
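The last two rows report a \(t\)-test on the accuracies obtained over repeated runs. As a sketch of how such a statistic can be obtained (assuming a paired test over per-run accuracies, which the table does not state explicitly; the function name is ours):

```python
import numpy as np

def paired_t_statistic(acc_a, acc_b):
    """Paired t-statistic between the per-run accuracies of two methods.
    A larger |t| means a more significant difference in means."""
    d = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    # Sample standard deviation of the differences (ddof=1).
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```

The statistic is then compared against the Student distribution with \(n-1\) degrees of freedom to read off a confidence level.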

## 6 Conclusion

SAGA (Sparse and Geometry Aware) is a new matrix factorization algorithm with the following properties: (1) it operates in the Gaussian RKHS, which accounts for potential non-linearity in the dataset geometry; (2) it provides sparse and convex projections onto a reduced embedding spanned by selected typical samples, which facilitates human interpretation and leads to a non-linear local representation of the data; (3) it has a complexity linear in the number of data entries, which allows dealing with big data. SAGA relies on both a manifold sampling strategy and a data projection. The latter has been proved to converge under some conditions on the Gaussian kernel variance. Finally, SAGA has been tested on toy, simulated and real datasets. The following conclusions can be drawn from the experiments: in accordance with the theory, SAGA encodes the data as convex combinations of neighboring samples; the proposed volume maximization heuristic leads to a better subsampling of the original data with respect to the volume of the simplex formed by the CSS; and its feature extraction has proved to outperform the selected state-of-the-art NMF methods on classification tasks.

## Footnotes

- 1.
We write \(\mathbf{W}\) instead of \(\mathbf{W}^{(\ell )}\) when \(\ell \) does not matter, or is implicit regarding the context.

- 2.
The Karcher variance is a variance accounting for Riemannian distances over the manifold, rather than Euclidean ones.

- 3.
The MATLAB source code is available on http://people.irisa.fr/Nicolas.Courty/SAGA.

## Notes

### Acknowledgments

This work was partially supported by the ANR grants ANR-10-INBS-08 (ProFI project, “Infrastructures Nationales en Biologie et Santé”, “Investissements d’Avenir”) and ANR-13-JS02-0005-01 (Asterix project), and by the Prospectom project (Mastodons 2012 CNRS challenge).

## References

- Arora, S., Ge, R., Kannan, R., & Moitra, A. (2012). Computing a nonnegative matrix factorization-provably. In *Proceedings of the 44th Symposium on Theory of Computing* (pp. 145–162). ACM.
- Bache, K., & Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 27 Dec 2013.
- Bandeira, A., Fickus, M., Mixon, D., & Wong, P. (2012). The road to deterministic matrices with the restricted isometry property. arXiv preprint arXiv:1202.1234.
- Bioucas-Dias, J. M., Plaza, A., Dobigeon, N., Parente, M., Du, Q., Gader, P., et al. (2012). Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. *IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing*, *5*(2), 354–379.
- Boyd, S. P., & Vandenberghe, L. (2004). *Convex optimization*. New York: Cambridge University Press.
- Buciu, I., Nikolaidis, N., & Pitas, I. (2008). Nonnegative matrix factorization in polynomial feature space. *IEEE Transactions on Neural Networks*, *19*(6), 1090–1100.
- Cai, D., He, X., Han, J., & Huang, T. (2011). Graph regularized nonnegative matrix factorization for data representation. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *33*(8), 1548–1560.
- Candes, E. (2008). The restricted isometry property and its implications for compressed sensing. *Comptes Rendus Mathematique*, *346*(9–10), 589–592.
- Çivril, A., & Magdon-Ismail, M. (2009). On selecting a maximum volume sub-matrix of a matrix and related problems. *Theoretical Computer Science*, *410*(47–49), 4801–4811.
- Courty, N., & Burger, T. (2013). A kernel view on manifold sub-sampling based on Karcher variance optimization. In *Geometric Science of Information* (pp. 751–758). Berlin: Springer.
- Courty, N., Burger, T., & Johann, L. (2011). PerTurbo: A new classification algorithm based on the spectrum perturbations of the Laplace-Beltrami operator. In *Proceedings of ECML/PKDD* (Vol. 1, pp. 359–374). Berlin: Springer.
- Cutler, A., & Breiman, L. (1994). Archetypal analysis. *Technometrics*, *36*(4), 338–347.
- Ding, C., Li, T., & Jordan, M. (2010). Convex and semi-nonnegative matrix factorizations. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *32*(1), 45–55.
- Donoho, D. L., & Stodden, V. C. (2003). When does non-negative matrix factorization give a correct decomposition into parts? In *NIPS*.
- Esser, E., Moller, M., Osher, S., Sapiro, G., & Xin, J. (2012). A convex model for nonnegative matrix factorization and dimensionality reduction on physical space. *IEEE Transactions on Image Processing*, *21*(7), 3239–3252.
- Garg, R., & Khandekar, R. (2009). Gradient descent with sparsification: An iterative algorithm for sparse recovery with restricted isometry property. In *ICML* (pp. 337–344).
- Gärtner, B. (1999). Fast and robust smallest enclosing balls. In *Proceedings of the 7th Annual European Symposium on Algorithms, ESA '99* (pp. 325–338).
- Gillis, N. (2012). Sparse and unique nonnegative matrix factorization through data preprocessing. *JMLR*, *13*, 3349–3386.
- Gillis, N., & Vavasis, S. (2013). Fast and robust recursive algorithms for separable nonnegative matrix factorization. *IEEE Transactions on Pattern Analysis and Machine Intelligence*. doi:10.1109/TPAMI.2013.226.
- Guillemot, C., & Turkan, M. (2012). Neighbor embedding with non-negative matrix factorization for image prediction. In *IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012* (pp. 785–788).
- Hoyer, P. O. (2004). Non-negative matrix factorization with sparseness constraints. *JMLR*, *5*, 1457–1469.
- Kersting, K., Xu, Z., Wahabzada, M., Bauckhage, C., Thurau, C., Roemer, C., Ballvora, A., Rascher, U., Leon, J., & Pluemer, L. (2012). Pre-symptomatic prediction of plant drought stress using Dirichlet-aggregation regression on hyperspectral images. In *AAAI Computational Sustainability and AI Track*.
- Kim, H., & Park, H. (2007). Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. *Bioinformatics*, *23*(12), 1495–1502.
- Kumar, A., Sindhwani, V., & Kambadur, P. (2013). Fast conical hull algorithms for near-separable non-negative matrix factorization. In *ICML* (pp. 231–239).
- Kyrillidis, A., Becker, S., & Cevher, V. (2013). Sparse projections onto the simplex. In *JMLR W&CP: Proceedings of the 30th International Conference on Machine Learning (ICML 2013)* (Vol. 28, pp. 235–243).
- Lafferty, J. D., & Lebanon, G. (2005). Diffusion kernels on statistical manifolds. *Journal of Machine Learning Research*, *6*, 129–163.
- Lafon, S., & Lee, A. (2006). Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. *PAMI*, *28*(9), 1393–1403.
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. *Proceedings of the IEEE*, *86*(11), 2278–2324.
- Li, Y., & Ngom, A. (2012). A new kernel non-negative matrix factorization and its application in microarray data analysis. In *IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2012 (CIBCB)* (pp. 371–378). IEEE.
- Maculan, N., & de Paula, G. (1989). A linear-time median-finding algorithm for projecting a vector on the simplex of \(\mathbb {R}^n\). *Operations Research Letters*, *8*(4), 219–222.
- Mørup, M., & Hansen, L. (2012). Archetypal analysis for machine learning and data mining. *Neurocomputing*, *80*, 54–63.
- Nene, S., Nayar, S., & Murase, H. (1996). *Columbia object image library (COIL-20)*. New York: Department of Computer Science, Columbia University.
- Öztireli, C., Alexa, M., & Gross, M. (2010). Spectral sampling of manifolds. In *SIGGRAPH ASIA*.
- Recht, B., Re, C., Tropp, J., & Bittorf, V. (2012). Factoring nonnegative matrices with linear programs. In *NIPS* (pp. 1223–1231).
- Schölkopf, B., & Smola, A. (2002). *Learning with kernels: Support vector machines, regularization, optimization, and beyond*. Cambridge: The MIT Press.
- Shroff, N., Turaga, P., & Chellappa, R. (2011). Manifold precis: An annealing technique for diverse sampling of manifolds. In *NIPS* (pp. 154–162).
- Sim, T., Baker, S., & Bsat, M. (2002). The CMU pose, illumination, and expression (PIE) database. In *Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, 2002* (pp. 46–51). IEEE.
- Thurau, C., Kersting, K., & Bauckhage, C. (2010). Yes we can: Simplex volume maximization for descriptive web-scale matrix factorization. In *CIKM* (pp. 1785–1788).
- Thurau, C., Kersting, K., Wahabzada, M., & Bauckhage, C. (2012). Descriptive matrix factorization for sustainability adopting the principle of opposites. *Data Mining and Knowledge Discovery*, *24*(2), 325–354.
- Von Luxburg, U. (2007). A tutorial on spectral clustering. *Statistics and Computing*, *17*(4), 395–416.
- Wang, F. Y., Chi, C. Y., Chan, T. H., & Wang, Y. (2010). Nonnegative least-correlated component analysis for separation of dependent sources by volume maximization. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *32*(5), 875–888.
- Yu, K., Zhang, T., & Gong, Y. (2009). Nonlinear learning using local coordinate coding. In *NIPS* (pp. 2223–2231).
- Zhang, D., Zhou, Z., & Chen, S. (2006). Non-negative matrix factorization on kernels. In *PRICAI 2006: Trends in Artificial Intelligence* (pp. 404–412). New York: Springer.