# Learning from patches by efficient spectral decomposition of a structured kernel

## Abstract

We present a kernel based method that learns from a small neighborhoods (patches) of multidimensional data points. This method is based on spectral decomposition of a large structured kernel accompanied by an out-of-sample extension method. In many cases, the performance of a spectral based learning mechanism is limited due to the use of a distance metric among the multidimensional data points in the kernel construction. Recently, different distance metrics have been proposed that are based on a spectral decomposition of an appropriate kernel prior to the application of learning mechanisms. The diffusion distance metric is a typical example where a distance metric is computed by incorporating the relation of a single measurement to the entire input dataset. A method, which is called patch-to-tensor embedding (PTE), generalizes the diffusion distance metric that incorporates matrix similarity relations into the kernel construction that replaces its scalar entries with matrices. The use of multidimensional similarities in PTE based spectral decomposition results in a bigger kernel that significantly increases its computational complexity. In this paper, we propose an efficient dictionary construction that approximates the oversized PTE kernel and its associated spectral decomposition. It is supplemented by providing an out-of-sample extension for vector fields. Furthermore, the approximation error is analyzed and the advantages of the proposed dictionary construction are demonstrated on several image processing tasks.

### Keywords

Kernel method Out of sample extension Non-scalar similarity Diffusion maps Vector field interpolation Manifold learning High-dimensional data analysis## 1 Introduction

Recent machine learning methods for massive high dimensional data analysis model observable parameters in such datasets by the application of nonlinear mappings of a small number of underlying factors (Belkin and Niyogi 2003; Singer and Coifman 2008). Mathematically, these mappings are characterized by the geometric structure of a low dimensional manifold that is immersed in a high dimensional ambient space. Under this model, the dataset is assumed to be sampled from an unknown manifold in the ambient observable space. Then, manifold learning methods are applied to reveal the underlying geometry. One approach for revealing elements of this geometry, such as dimensionally estimated and tangent space inference, is by encoding tensor similarities directly in the the ambient space, as proposed in Mordohai and Medioni (2010). Another approach, which also involves dimensionality reduction, is the to use kernel methods such as k-PCA, Laplacian Eigenmaps (Belkin and Niyogi 2003) and Diffusion Maps (DM) (Coifman and Lafon 2006), which are designed for this task and utilize the intrinsic manifold geometry. These methods have provided good results in representing and analyzing such data. Kernel methods are based on a scalar similarity measure between individual multidimensional data points.^{1}

In this paper, we focus on patch-to-tensor embedding (PTE) kernel construction from Salhov et al. (2012), which considers patches that are taken from the underlying manifold. The data analysis considers these patches instead of analyzing single data points from the manifold. Each patch is defined as a local neighborhood of a data point in a dataset sampled from an underlying manifold. The relation between any two patches in a PTE based kernel construction is described by a matrix that represents both the affinities between data points at the centers of these patches and the similarities between their local coordinate systems. Then, the constructed matrices between all patches are combined into a block matrix that is called super-kernel. This super-kernel is a non-scalar affinity kernel between patches (local neighborhood) of the underlaying manifold (Salhov et al. 2012). The super-kernel is positive definite. It captures the intrinsic geometry of the underlying manifold/s. However, the proposed enhanced kernel increases its size. Hence, it increases the computational complexity of a direct spectral decomposition of this kernel.

- 1.
We utilize the structure of the super-kernel from a PTE construction to find a necessary condition for updating a non-scalar based dictionary. This dictionary is used to approximate the spectral decomposition of the super-kernel and to derive the required conditions for achieving a bound on the super-kernel approximation error. In addition, we analyze its computational complexity and estimate its efficiency in comparison to the computation of a spectral decomposition of a full super-kernel.

- 2.
We show how to efficiently extend the PTE embedding via an out-of-sample extension to a newly arrived data point that did not participate in the dictionary construction. This includes the case when the analyzed dataset consists of tangential vector fields and provides an out-of-sample extension method to process vector fields.

*M*by exploring the geometry of the manifold \(\mathcal {M}\) from which it is sampled. This method is based on defining an isotropic kernel \(K \in \mathbbm {R}^{n \times n}\) whose elements are defined by

*M*. Data points in

*M*are the vertices and the weights (for example, we can use the weight in Eq. 1) of the edges are defined by the kernel

*K*. The degree of each data point (i.e., vertex) \(x \in M\) in this graph is \(q(x) \triangleq \sum \limits _{y \in M} k(x,y).\) Normalization of the kernel by this degree produces an \(n \times n\) row stochastic transition matrix

*P*whose elements are \(p(x,y) = k(x,y)/q(x)\), \(x,y \in M\), which defines a Markov process (i.e., a diffusion process) over the data points in

*M*. A symmetric conjugate \(\bar{P}\) of the transition operator

*P*defines the diffusion affinities between data points by

DM extends the classical Multi-Dimensional Scaling (MDS) core method (Cox and Cox 1994; Kruskal 1964) by considering nonlinear relations between data points instead of considering the original linear Gram matrix relations. Furthermore, DM has been utilized for a wide variety data types and for pattern analysis such as improving audio quality by suppressing transient interference (Talmon et al. 2013), detecting moving vehicles (Schclar et al. 2010), scene classification (Jingen et al. 2009), gene expression analysis (Rui et al. 2007), source localization (Talmon et al. 2011) and fusion of different data sources (Lafon et al. 2006; Keller et al. 2010).

DM methodology is also used in finding both a parametrization and an explicit metric, which reflects the intrinsic geometry of a given dataset. This is done by considering the entire pairwise connectivity between data points in a diffusion process (Lafon et al. 2006).

Recently, DM has been extended in several directions to handle orientation in local tangent spaces (Singer and Wu 2011, 2012; Salhov et al. 2012; Wolf and Averbuch 2013). Under the manifold assumption, the relation between two local patches of neighboring data points is described by a matrix instead of a scalar value. The resulting kernel captures enriched similarities between local structures in the underlying manifold. These enriched similarities are used to analyze local areas around data points instead of analyzing their specific individual locations. For example, this approach is beneficial in image processing for analyzing regions instead of individual pixels or when data points are perturbed, thus, their surrounding area is more important than their specific individual positions. Since the constructions of these similarities are based on local tangent spaces, they provide methods to manipulate tangential vector fields. For example, they allow to perform an out-of-sample extension for vector fields. These manipulations are beneficial when the analyzed data consists of directional information in addition to positional information on the manifold. For example, the goal in Ballester et al. (2001) is to recover missing data in images while interpolating the appropriate vector field. Another example is in the utilization of tangential vector fields interpolation on \(\mathcal {S}^2\) for modeling atmospheric air flow and oceanic water velocity (Fuselier and Wright 2009).

The main motivation of this paper is to show the beneficial aspects in using PTE based construction while reducing its computational complexity when spectral decomposition of a PTE based kernel is used.

The spectral decomposition computational complexity can be reduced in several ways. Kernel matrix sparsification is a widely used approach that utilizes a sparse eigensolver such as Lanczos to compute the relevant eigenvectors (Cullum and Willoughby 2002). Another sparsification approach is to translate the dense similarity matrix into a sparse matrix by selectively truncating elements outside a given neighborhood radius of each data point. More sparsification approaches are given in von Luxburg (2007). An alternative approach to reduce the computational complexity is based on Nyström extension for estimating the required eigenvectors (Fowlkes et al. 2004). The Nyström based method has three steps: (1) The given dataset is subsampled uniformly over its indices set. (2) The subsamples define a kernel that is smaller than the dataset size. Then, the corresponding spectral decomposition is achieved by the application of a SVD solver to this kernel. (3) The solution of the small spectral problem is extended to the entire dataset via the application of Nyström extension method. The result is an approximated spectral method. Although this solution is much more computationally efficient than the application of the spectral decomposition to the original kernel matrix, the resulting approximated spectral method suffers from several major problems originated from the subsampling effect on the spectral approximation and from the ill-conditioned effects in the Nyström extension method.

Recently, a novel multiscale scheme is suggested in Bermanis et al. (2013) for sampling scattered data and for extending functions defined on sampled data points. The suggested approach, also called Multiscale Extension (MSE), overcomes some limitations of the Nyström method. For example, it overcomes the ill-conditioned effect in the Nyström extension method and accelerates the computation. The MSE method is based on mutual distances between data points. It uses a coarse-to-fine hierarchy of a multiscale decomposition of a Gaussian kernel. Another interesting approach is suggested in Engel et al. (2004), Ouimet and Bengio (2005) to identify a subset of data points (called dictionary) and a corresponding extension coefficients that enable us to approximate the full scalar kernel. The number of dictionary atoms depends on the given data, kernel configuration and on the designed parameter that controls the quality of the full kernel approximation. These are efficient methods designed to process general kernels. However, they are not designed to utilize and preserve the inherent structure that resides in kernels such as a super-kernel (Salhov et al. 2012).

In summary, methods such as Nyström and kernel recursive least squares, to name some, have been suggested to approximate large kernels decomposition. However, neither the suggested methods utilize the inherent structure of the PTE kernel nor suggest an out-of-sample extension method for vector fields. In this paper, we propose a dictionary construction that approximates the oversized PTE kernel and its associated spectral decomposition. Furthermore, the approximation error is analyzed and the advantages of the proposed dictionary construction are demonstrated on a several image processing tasks.

The paper has the following structure: Sect. 2 formulates the problem. The patch-to-tensor embedding is described in Sect. 3. The patch-based dictionary construction and its properties are described in Sect. 4. Finally, Sect. 5 describes the experimental results for image segmentation and for handwritten digit classification that are based on the utilization of a dictionary-based analysis.

## 2 Problem formulation

In this paper, we approximate the spectral decomposition of a large and structured kernel. We assume that we have a dataset of *n* multidimensional data points that are sampled from a manifold \(\mathcal {M}\) that lies in the ambient space \(\mathbbm {R}^m\). We also assume that the intrinsic dimension of \(\mathcal {M}\) is \(d \ll m\). We consider two related tasks: (1) How to approximate the spectral decomposition of a super-kernel? This is explained in Sect. 2.2. (2) How to perform an out-of-sample extension for vector fields? This is explained in Sect. 2.3.

### 2.1 Manifold setup

*d*-dimensional tangent space \(T_x(\mathcal {M})\), which is a subspace of \(\mathbbm {R}^m\). We assume that the manifold is densely sampled thus, the tangential space \(T_x(\mathcal {M})\) can be approximated by a small enough patch (i.e., neighborhood) \(N(x) \subseteq M\) around \(x \in M\). Let \(o_x^1, \ldots , o_x^d \in \mathbbm {R}^m\), where \(o_{x}^{i}=(o_{x}^{i,1}, \ldots , o_{x}^{i,m})^T\), \(i=1, \ldots , d\), form an orthonormal basis of \(T_x(\mathcal {M})\) and let \(O_x \in \mathbbm {R}^{m \times d}\) be a matrix whose columns are these vectors

*d*coordinates according to the basis \(o_x^1, \ldots , o_x^d\). For each vector \(u \in T_x(\mathcal {M})\), \(\tilde{u} = O_x u \in \mathbbm {R}^m\) is the same vector as

*u*represented by

*m*coordinates, according to the basis of the ambient space. For each vector \(v \in \mathbbm {R}^m\) in the ambient space, the vector \(v' = O_x^T v \in T_x(\mathcal {M})\) is a linear projection of

*v*on the tangent space \(T_x(\mathcal {M})\). This setting was used in Salhov et al. (2012). In Sect. 2.2, we formulate the spectral decomposition of a super kernel.

### 2.2 Approximation of the super-kernel spectral decomposition

The super-kernel *G*, which is constructed in Salhov et al. (2012), on a dataset *M*, is based on *d* orthonormal basis vectors in each \(O_x\), \(x \in M\). *G* is a \(nd \times nd\) structured matrix. Each \(d \times d\) block in *G* is a similarity-index matrix between a pair of data points and their corresponding neighborhoods. More details of the structure of *G* are given in Sect. 3 (specifically, in Definition 1). Given a dataset *M*, the PTE method (Salhov et al. 2012) uses the spectral decomposition of *G* to embed each data point (or neighborhood around it) to a tensor. This embedding is computed by constructing the coordinate matrices of the embedded tensors using the eigenvalues and the eigenvectors of *G*.

*G*. Under the manifold assumption, the super-kernel

*G*has a small number of significant singular values in comparison with the dimensionality

*m*of the given data. Hence, a low rank approximation of the super-kernel

*G*exists. We aim to find a dictionary \(D_{\eta _{n}}\) of \(\eta _{n}\) representative data points that will be used to construct a small super-kernel \(\hat{G}_{n}\) and a corresponding extension matrix \(E_n \in \mathbbm {R}^{\eta _n d \times nd}\) that approximates the entire super-kernel

*G*by

*G*is computed by the application of QR decomposition (see Golub and Loan 2012, p. 246) to \(E_{n}\). In order to construct a dictionary of representatives, we design a criterion that will identify data points and their corresponding embedded tensors that are approximately linearly independent. This criterion is utilized to control the approximation error bound.

### 2.3 Out-of-sample extension method for vector fields

Consider a tangential vector field \(\mathbf{v}: M\rightarrow \mathbbm {R}^d\) such that for all \(x\in M\), \(\mathbf{v}(x)\in T_x(\mathcal {M})\). The goal is to find an out-of-sample extension method for such tangential vector fields. More specifically, given a new data point on the manifold, we aim to find the vector in its local tangential space that corresponds smoothly to the known values of the vector field. This out-of-sample extension method is used to extend the coordinate matrices of the embedded tensors of PTE and to approximate the embedding of a new data point \(x \notin M\).

*nd*. Figure 1 illustrates these two forms.

Let \(x^\prime \in \mathcal {M}\backslash M\) be a new data point associated with the matrix \(O_{x^\prime }\) whose columns \(o_{x^\prime }^1,\ldots ,o_{x^\prime }^d\) form an orthonormal basis for the tangential space \(T_{x^\prime }(\mathcal {M})\). The out-of-sample extension problem is formulated as finding an extension vector field \(\mathbf{u}\) that enables us to compute a vector \(\mathbf{v}(x^\prime )\) for any new data point \(x^\prime \).

*G*defines an integral operator on tangential vector fields of the manifold \(\mathcal {M}\) that are restricted to the dataset

*M*(see Wolf and Averbuch 2013, Definition 4.1). This integral operator can be formally described by the application of a super-kernel matrix to a long-vector (second representation form) that represents vector fields. By abuse of notation, if we treat \(\mathbf{u}\) as its vector representation in \(\mathbbm {R}^{nd}\), we define the relation between \(\mathbf{u}\) and \(\mathbf{v}(x^\prime )\) using the operator

*G*to be

*G*, which are vectors in \(\mathbbm {R}^{nd}\) that are regarded as tangential vector fields.

## 3 Patch-to-tensor embedding

The main goal of this paper is to enable to have a practical application of the PTE method from Salhov et al. (2012), Wolf and Averbuch (2013). In these works, the PTE methodology was introduced either in continuous settings or with a dense sampling of the manifold, which approximates the continuous settings. While these settings may not be suitable for efficient application to process modern datasets (indeed, this is the main focus of this paper), they enabled a thorough analysis regarding the geometrical interpretation of the obtained embedding and its metric structure, in relation to the manifold geometry. In order to make the presentation of this paper self-contained, we review in this section the main results from these previous works before introducing the proposed dictionary construction and out-of-sample extension method.

For \(x,y \in M\), define \(O_{xy} \mathop {=}\limits ^{{\varDelta }} O_x^TO_y \in \mathbbm {R}^{d \times d}\), where \(O_x\) and \(O_y\) were defined in Eq. 3. The matrices \(O_x\) and \(O_y\) represent the bases for the tangential spaces \(T_x(\mathcal {M})\) and \(T_y(\mathcal {M})\), respectively. Thus, the matrix \(O_{xy}\), which will be referred to as the tangential similarity matrix, represents a linear-projection between these tangential spaces. According to Salhov et al. (2012), Wolf and Averbuch (2013), this linear-projection quantifies the similarity between these tangential spaces via the angle between their relative orientations.

Let \({\varOmega }\) be an \(n \times n\) symmetric and positive semi-definite affinity kernel defined on \(M \subseteq \mathbbm {R}^m\). Each row or each column in \({\varOmega }\) corresponds to a data point in *M*, and each element \( [{\varOmega }]_{xy} = \omega (x,y)\), \(x,y \in M\), represents the affinity between *x* and *y* such that \(\omega (x,y) \ge 0\) for every \(x,y \in M\). The diffusion affinity kernel is an example for such an affinity kernel. Definition 1 uses the tangent similarity matrices and the affinity kernel \({\varOmega }\) to define the Linear-Projection *super-kernel*. When the diffusion affinities in \(\bar{P}\) (Eq. 2) are used instead of using the general affinities in \({\varOmega }\), then this structured kernel is called a Linear-Projection Diffusion (LPD) super-kernel.

**Definition 1**

(Salhov et al. 2012) [Linear-Projection Diffusion Super-Kernel] A Linear-Projection Diffusion (LPD) super-kernel is a block matrix \(G \in \mathbbm {R}^{nd \times nd}\) of size \(n \times n\) where each block in it is a \(d \times d\) matrix. Each row and each column of blocks in *G* correspond to a data point in *M*. A single block \(G_{(x,y)}\), \(x,y \in M\), represents an affinity (similarity) between the patches *N*(*x*) and *N*(*y*) around *x* and *y*, respectively. Each block \(G_{(x,y)} \in \mathbbm {R}^{d \times d}\) in *G* is defined by \(G_{(x,y)} \triangleq \bar{p}(x,y) O_{xy} = a(x,y) O_x^TO_y, \; x,y \in M\).

The super-kernel in Definition 1 encompasses both the diffusion affinities between data points from the manifold \(\mathcal {M}\) and the similarities between their tangential spaces. The latter are expressed by the linear-projection operators between tangential spaces. Specifically, for two tangential spaces \(T_x(\mathcal {M})\) and \(T_y(\mathcal {M})\) at \(x,y \in M\) of the manifold, the operator \(O_x^T O_y\) (i.e., their tangential similarity matrix) expresses a linear projection from \(T_y(\mathcal {M})\) to \(T_x(\mathcal {M})\) via the ambient space \(\mathbbm {R}^m\). The obvious extreme cases are the identity matrix, which indicates the existence of a complete similarity, and a zero matrix, which indicates the existence of orthogonality (i.e., a complete dissimilarity). These linear projection operators express some important properties of the manifold structure, e.g., curvatures between patches and differences in orientation. More details on the properties of this super-kernel are given in Salhov et al. (2012), Wolf and Averbuch (2013).

It is convenient to use the vectors \(o_x^i\) and \(o_y^j\) to apply a double-indexing scheme by using the notation \(g(o_x^i,o_y^j) \triangleq [G_{(x,y)}]_{ij}\) that considers each single cell in *G* as an element \([G_{(x,y)}]_{ij}\), \(1 \le i,j \le d\), in the block \(G_{(x,y)}\), \(x,y \in M\). It is important to note that \(g(o_x^i,o_y^j)\) is *only a convenient notation* and a single element of a block in *G* does not necessarily have any special meaning. The block itself, as a whole, holds meaningful similarity information.

Spectral decomposition is used to analyze the super-kernel *G*. It is utilized to embed the patches *N*(*x*), \(x\in M\), into a tensor space. Let \( \vert \lambda _1 \vert \ge \vert \lambda _2 \vert \ge \cdots \ge \vert \lambda _{\ell } \vert \) be the \(\ell \) most significant eigenvalues of *G* and let \(\phi _1, \phi _2, \ldots , \phi _{\ell }\) be their corresponding eigenvectors. Each eigenvector \(\phi _i\), \(i =1, \ldots ,\ell \), is a vector of length *nd*. We denote each of its elements by \(\phi _i(o_x^j)\), \(x \in M\), \(j=1,\ldots ,d\). An eigenvector \(\phi _i\) is a vector of *n* blocks, each of which is a vector of length *d* that corresponds to a data point \(x \in M\) on the manifold. To express this notion, we use the notation \(\varphi _i^j(x) = \phi _i(o_x^j)\). Thus, the block, which corresponds to \(x \in M\) in \(\phi _i\), is the vector \((\varphi _i^{1}(x), \ldots ,\varphi _i^{d}(x))^T\).

*G*are used to construct a spectral map

*t*is the diffusion transition time. This spectral map is then used to construct the embedded tensor \(\mathcal {T}_x \in \mathbbm {R}^{\ell } \otimes \mathbbm {R}^{d}\) for each \(x \in M\). These tensors are represented by the \(\ell \times d\) matrices

*N*(

*x*), \(x \in M\), into the tensor space.

The computational complexity of the direct spectral decomposition of a super-kernel is in the order of \(O\left( n^3 d^3 \right) \). However, the structure of the super-kernel can be utilized for simplifying the decomposition. The super kernel can be viewed as the Khatri–Rao product (Rao 1968) of the diffusion affinity matrix \({\varOmega }\) with the \(nd \times nd\) matrix \(B = O O^T \), where the matrix \(O \in \mathbbm {R}^{nd \times d}\) is a block matrix with \(O_i^T\) as its \(d \times d\) *i*th sub-block Hence, let \(G = {\varOmega } \odot B \) be the corresponding Khatri–Rao product where the blocks of \({\varOmega }\) are the affinity scalars and the blocks of *B* are the \(d \times d\) respected matrices \(O_i^T O_j \). It can be shown that the symmetric partition of *B* can be used to reformulate the Khatri–Rao product in terms of a Kronecker product as \(G = Z \left( {\varOmega } *B \right) Z^T\), where *Z* is a proper \(n d \times n^2 d^2\) selection matrix (Liu and Trenkler 2008), \(Z^T Z = I\) and \( {\varOmega } *B\) is the Kronecker product of \({\varOmega }\) and *B*. Let \( {\varOmega } =U_{{\varOmega }} {\varSigma }{_{\varOmega }} V_{{\varOmega }}^T \) be the SVD of \({\varOmega }\) and \(B=U_{B} {\varSigma }_B V_B^T\) be the SVD of *B*, then the spectral decomposition of *G* can be reformulated using the spectral decomposition characterization of a Kronecker products as \(G =U {\varSigma } V = Z \left( U_{\varOmega } *U_B \right) \left( {\varSigma }_{\varOmega } *{\varSigma }_B \right) \left( V_{\varOmega } *V_B \right) Z^T\). Hence, the computation of the left eigenvectors \(U = Z \left( U_{\varOmega } *U_B \right) \) requires \(O\left( n^3\right) \) operations to decompose \({\varOmega }\) and additional operations that are needed for the decomposition of *O*. The total computational complexity is more efficient than what is needed to decompose \(nd \times nd\) super-kernel. However, computational cost of \(O\left( n^3\right) \) to process large datasets is still impractical. Hence, we are looking for a reduction of at least one order of magnitude in the spectral decomposition of large matrices that represent the super kernel.

The previous works, which is described in this section, mainly focused on geometric interpretations of the embedded space obtained by PTE and its metric structure. In this paper, on the other hand, we focus on the practical application of this analysis, which requires a dictionary construction and an out-of-sample extension method, as described in Sect. 2. To this end, we consider the numerical and algebraic sides of the PTE super-kernel and the resulting embedded space to augment its extensively studied geometric side. In particular, the dictionary construction introduced in Sect. 4 aims to reduce redundancies by considering (and eliminating) approximate linear dependencies between dictionary members, while ensuring that the span of the their embedded tensors approximates the embedded space of PTE. By ensuring that the metric structure of the embedded space is approximated by the presented construction, the geometrically-oriented results from Salhov et al. (2012), Wolf and Averbuch (2013) remain valid for this analysis as well. They provide meaningful relations between the approximate embedding and the manifold geometry.

## 4 Patch-based dictionary construction

### 4.1 Dictionary construction

We use an iterative approach to construct the described dictionary by one sequential scan of the entire data points in *M*. In the first iteration, we define the scanned set to be \(X_1=\{x_1\}\) and the dictionary \(D_1=\{x_1\}\). At each iteration \(s=2,\ldots ,n\), we have a new data point \(x_s\), the scanned set \(X_{s-1}=\{x_1,\ldots ,x_{s-1}\}\) from the previous iteration and the dictionary \(D_{s-1}\) that represents \(X_{s-1}\). The dictionary \(D_{s-1}\) is a subset of \(\eta _{s-1}\) data points from \(X_{s-1}\) that are sufficient to represent its embedded tensors. We define the scanned set to be \(X_s=X_{s-1}\cup \{x_{s}\}\). Our goal is to define the dictionary \(D_s\) of \(X_s\) based on the dictionary \(D_{s-1}\) with the new data point \(x_s\). To achieve it, a dependency criterion has to be established. If this criterion is satisfied, then the the dictionary remains the same such that \(D_s=D_{s-1}\). Otherwise, it is updated to include the new data point such that \(D_s=D_{s-1}\cup \{x_{s}\}\).

**Lemma 1**

Let \(\hat{G}_{s-1} \in \mathbbm {R}^{d\eta _{s-1} \times d\eta _{s-1}}\) be a super-kernel of the data points in the dictionary \(D_{s-1}\) and let \(H_s \in \mathbbm {R}^{d \eta _{s-1} \times d}\) be a \(\eta _{s-1} \times 1\) block matrix whose *j*th \(d\times d\) block is \(G_{(y_j,x_s)}\), \(j=1,\ldots ,\eta _{s-1}\). Then, the optimal matrix coefficients (from Eq. 11) are \(C^s_j, ~j=1,\ldots ,\eta _{s-1}\), where \(C^s_j\) is the *j*th \(d\times d\) block of the \(\eta _{s-1}\times 1\) of \(d\times d\) blocks matrix \(\hat{G}_{s-1}^{-1}H_s\). The error \(\delta _s\) in Eq. 11 satisfies \(\delta _s = {{\mathrm{tr}}}[G_{(x_s,x_s)} - H_s^T \hat{G}_{s-1}^{-1} H_s]\).

*Proof*

*j*th \(d\times d\) block is \(C^s_j\), \(j=1,\ldots ,\eta _{s-1}\). Solving the optimization (i.e., minimization) in Eq. 12 yields the solution

Lemma 1 provides an expression for a dictionary-based approximation in super-kernel terms. Essentially, this eliminates the need for having a prior knowledge of the embedded tensors during the dictionary construction. At each iteration *s*, we consider the criterion \(\delta _s < \mu \). Based on this condition, we decide whether to add \(x_s\) to the dictionary or to approximate its tensor. The threshold \(\mu \) is given anyway in advance as a meta-parameter and \(\delta _s\) is computed by using the expression in Lemma 1, which does not depend on the embedded tensors. Therefore, the dictionary construction processes only the required knowledge of the super-kernel blocks that are needed to compute this expression in every iteration. In fact, the number of required blocks is relatively limited since it is determined by the size of the dictionary and not by the size of the dataset.

### 4.2 The dictionary construction algorithm

*s*in the dictionary construction algorithm. Let \(E_s\) be a \((\eta _s\times d ) \times (s\times d)\) block matrix at iteration

*s*whose (

*i*,

*j*) entry (block) is the optimal coefficient matrix \(C_i^{j}\), \(1\le i \le \eta _s,\, 1\le j\le s\) computed in Lemma 1. The coefficient matrix \(C_i^{j}\) has two solutions that depend on the ATD criterion calculated in iteration

*s*. First, Lemma 1 is applied to compute \(\delta _s\). Then, two possible scenarios are considered:

- 1.
\(\delta _s \le \mu \), hence, \(\mathcal {T}_{x_s}\) (Eq. 7) satisfies the ATD on \(D_{s-1}\). The dictionary remains the same, i.e., \(D_s = D_{s-1}\) and \(\hat{G}_s = \hat{G}_{s-1}\). The extension matrix \(E_s = [E_{s-1} \; A_s]\) is computed by concatenating the matrix \(A_s = \hat{G}_{s-1}^{-1} H_s\) (from the proof of Lemma 1) with the coefficient matrix from the previous iteration \(s-1\).

- 2.\(\delta _s > \mu \), hence, \(\mathcal {T}_{x_s}\) does not satisfy the ATD on \(D_{s-1}\). The vector \(x_s\) is added to the dictionary, i.e., \(D_s = D_{s-1} \cup \{x_s\}\). The computation of \(E_s\) is done by adding the identity matrix and the zero matrix in the appropriate places. The dictionary related super-kernel \(\hat{G}_s\) becomes$$\begin{aligned} \hat{G}_s = \left[ \begin{array}{c@{\quad }c} \hat{G}_{s-1} &{} H_s \\ H_s^T &{} G_{(x_s,x_s)} \end{array} \right] . \end{aligned}$$(14)

At each iteration *s*, the PTE of the data points \(X_s\) (or, more accurately, their patches) is based on the spectral decomposition of the super-kernel \(G_s\) of \(X_s\). This spectral decomposition is approximated by using the extension matrix \(E_s\) and the dictionary related super-kernel \(\hat{G}_s\) (Eq. 14).

*d*, however, in general the intrinsic dimension

*d*is assumed to be small. Finally, embedding via Tensor Voting (Mordohai and Medioni 2010) is more cost effective.

PTEA computational complexity

Operation | Complexity |
---|---|

Computation of \(\delta _s\) | \(O(\eta _s^2 d^3)\) |

Update of \(G_s\),\(E_s\) and \(G_s^{-1}\) | \(O(\eta _s^2 d^3)\) |

QR of \(E_s\) | \(O(s \eta _s^2 d^3)\) |

SVD of \(R_s\hat{G}_sR_s^T\) | \(O( \eta _s^3 d^3)\) |

\(Q_s \tilde{U}_s\) that approximates \(\mathcal {T}_{x_i}\) | \(O(s \eta _s^2 d^3)\) |

Computational complexity comparison

Method | Complexity |
---|---|

PTEA | \(O(n \eta _n^2 d^3)\) |

PTE | \(O( \eta ^3_n d^3 )\) |

DM | \(O(n^3+n^2m)\) |

KRLS | \(O( n \eta ^2_s +n^2m )\) |

Tensor voting | |

### 4.3 The super-kernel approximation error bound

For the quantification of Eq. 17, let \({\varPsi }_s \mathop {=}\limits ^{{\varDelta }} [\mathcal {T}_{x_1},\ldots , \mathcal {T}_{x_s}]\) be the \(\ell \times sd\) matrix that aggregates all the embedded tensors up to a step *s*, where \(\mathcal {T}_{x_{i}}\in \mathbbm {R}^{l\times d}\), \(i=1, \ldots ,s\). Let \(\hat{{\varPsi }}_s \mathop {=}\limits ^{{\varDelta }} [ \mathcal {T}_{y_1},\ldots , \mathcal {T}_{y_{\eta _s}} ]\) be the \(\ell \times \eta _sd\) matrix that aggregates all the embedded tensors of the dictionary members \(D_s\) and let \({\varPsi }_s^\text {res}\triangleq {\varPsi }_s-\hat{{\varPsi }}_s E_s = [\psi _1^\text {res},\ldots ,\psi _s^\text {res}]\), where \(\psi _s^\text {res}\triangleq \mathcal {T}_{x_s} - \sum ^{\eta _{s-1}}_{j=1} \mathcal {T}_{y_j} C_j^{s}\). Then, due to Eq. 11 and the ATD condition, \(\left\| {\varPsi }_s^\text {res} \right\| _F^2\le s\mu \). From Definition 1 of the super-kernel, we get \(G_s={\varPsi }_s^T{\varPsi }_s\) and \(\hat{G}_s=\hat{{\varPsi }}_s^T\hat{{\varPsi }}_s\). Therefore, by substituting \({\varPsi }_s\) into Eq. 17 we get \(G_s = E_s \hat{G}_s E_s^T + ({\varPsi }_s^\text {res})^T {\varPsi }_s^\text {res}\) where all the cross terms vanish by the optimality of \(E_s\). As a consequence, the approximation error \(R_s = G_s - E_s \hat{G}_s E_s^T = ({\varPsi }_s^\text {res})^T {\varPsi }_s^\text {res}\) in step *s* satisfies \(\left\| R_s \right\| _F^2\le s\mu \). In particular, \(\left\| R_n \right\| _F^2\le n\mu \) for \(s=n\).

### 4.4 Out-of-sample extension for vector fields

*s*and the actual given vector field (as part of the training set) by

^{2}The Recursive Least Squares solution, which minimizes \(J(\mathbf{w})\), is given by \(\mathbf{w}_o = G^{-1}{\varvec{\nu }}\). In the case when the number of vector examples is large, the complexity of inverting the super-kernel tends to be expensive in terms of computational complexity and memory requirements. Furthermore, the length of the predictor weight vector \(\mathbf{w}_o\) depends on the number of training samples. Hence, redundant samples, which generate linearly-dependent rows in the super-kernel, will generate over fitting by using the predictor. One possible remedy for these problems is to utilize the sparsification procedure from Sect. 4.2 in order to design the predictor. The optimizer \(\mathbf{w}_o\) is formulated by introducing the dictionary estimated super-kernel as \(J(\mathbf{w}) = \Vert G \mathbf{w} - \varvec{\nu } \Vert ^2_2 \approx \Vert E_s^T \hat{G}_s E_s w - \varvec{\nu } \Vert ^2_2\) using Eq. 17. Let \(\alpha = E_s w \) and let \(\alpha _{j}\) be the

*j*th element of vector \(\alpha \), then the predictor is reformulated as \(\hat{\varvec{\nu }}(x_i) = \sum _{j=1}^{s} C_j^{s} G_{(x_j,x_i)} {\alpha }_j\), and the corresponding predictor’s \(l_2\) error is given by \(J(\alpha ) = \Vert E_s^T \hat{G}_s E_s w - f \Vert = \Vert E_s^T \hat{G}_s \alpha - \varvec{\nu } \Vert \), which can be minimized by

## 5 Data analysis illustration via dictionary based PTE

The PTE methodology provides a general framework that can be applied to a wide varieties of data analysis tasks such as clustering, classification, anomaly detection and manifold learning. In this section, we demonstrate how the proposed dictionary construction based approximated PTE is utilized to solve several relevant data analysis applications such as: I. MNIST Handwritten digit classification. II. Image segmentation. III. Vector field extension over a sphere. The experiments were done on an off-the-shelf PC with a I7-2600 quad core CPU and a 32 GB of DDR3 memory.

### 5.1 Example I: MNIST handwritten digit classification

The computerized handwritten character recognition has been extensively studied. Relevant papers such as Dimond (1958), Bomba (1959) can be found from the late 50s of the last century. Since then, this challenge has remained relevant and many more studies have been published (Suen et al. 1980). Handwritten characters can be recognized in many different ways, such as K-Nearest Neighbors (Keysers et al. 2007), linear and nonlinear classifiers, neural nets and SVMs (Lecun et al. 1998), to name some.

The MNIST database of handwritten digits (Lecun et al. 1998) (available from http://yann.lecun.com/exdb/mnist/) consists of a training set of 60,000 examples and a test set of 10,000 examples. Each digit is given as a gray level image of size \(28 \times 28\) pixels. The digit images were centered by computing the center of mass of the pixels. Then, the image was translated to the position of this data point at the center of the \(28 \times 28\) field. MNIST is a subset of a larger available set from NIST.

Currently, convolutional networks achievements are considered as the state-of-the-art in recognition accuracy with an error of 0.23 % (Ciresan et al. 2012). For our purpose, the MNIST dataset provides 70,000 data points of very high dimensional measurements having 728 pixels per a measured digit. In our experiments, the images were used as is.

The PTEA Algorithm 4.1 was applied to embed the MNIST dataset of 70,000 examples (data points) using the following steps: 1. For each data point, we identified the 300 nearest neighbors and computed the corresponding local PCA og this neighborhood. For each local tangential space, we kept the 2 significant eigenvectors. 2. The diffusion affinity (Eq. 1) was computed with \(\varepsilon =105\) which is the Euclidean distance mean for all the pairwise data points. The proposed dictionary construction with \(\mu =0.0002/3\) identified 381 important data points and their corresponding local tangential spaces. 3. The approximated tensors were constructed utilizing a tensor length \(l=14\). The labeling of each test data point was determined by using the label of the nearest training data point where the pairwise distance was computed by the Frobenius norm of the difference between the corresponding embedded tensors. The software-based implementation of the PTEA algorithm is described in Salhov et al. (2015).

Performance comparison between the application of the approximated SVD step in the PTEA Algorithm 4.1 and the application of the full SVD to the NIST dataset

Dataset | Size | | SVD complexity—full | SVD complexity—approx. | Dict. size |
---|---|---|---|---|---|

MNIST | 70,000 | 2 | \(O \left( 70,000^3 \times 2^3 \right) \) | \(O \left( 70,000 \times 145,161 \times 2^3 \right) \) | 381 |

Comparisons between the computational complexities of different methods

Operation | Measured time (min) |
---|---|

300NN | 126 |

Local PCA | 50 |

PTEA | 267 |

Total | 443 |

Although we are not far away from the state-of-the-art result in digit recognition, the proposed method has the following advantages: 1. It shows that patch processing can be utilized for recognition and data analysis tasks. 2. High dimensional big datasets can be processed on a “cheap” hardware such as in our case where the algorithm was executed on less than 1000$ worth of hardware.

### 5.2 Example II: Image segmentation

Image segmentation aims to cluster pixels into image regions that correspond to individual surfaces, objects or natural parts of objects. Image segmentation plays a key rule in many computer vision tasks.

Under the PTEA framework, the image is viewed as a super-kernel construction that reflects the affinities between the pixels and the projection of the related tangential spaces. The PTE construction translates a given pixel related features into tensors in the embedded space. The image segmentation is achieved through tensorial clustering in the embedded space.

For the image segmentation examples, we utilized the pixel color information and its spatial (*x*,*y*) location multiplied by a scaling factor \(w=0.1\). Hence, given a RGB image with \(Ix \times Iy\) pixels, we generated the dataset *X* of size \(5 \times (Ix \cdot Iy)\).

Algorithm 4.1 embeds *X* into a tensorial space. The first step in Algorithm 4.1 constructs local patches. Each generated patch captures the relevant neighborhood and considers both color and spatial similarities. Hence, a patch is more likely to include attributes related to spatially close pixels. It is important to note that the affinity kernel is computed according to Eq. 1 where \(\varepsilon \) equals to the mean Euclidean distance between all pairs in *X*. The PTE parameters *l* and \(\rho \) were chosen to generate homogeneous segments. The dictionary’s approximation tolerance \(\mu \) was chosen arbitrarily to be \(\mu =0.001\). The k-means algorithm with sum of square differences clustered the tensors into similar clusters. The final clustering results from the embedded tensors as a function of the diffusion transition time *t* (Eq. 6) are presented in Figs. 2 and 3.

*t*. The effect of the diffusion transition time on the segmentation of the ‘Hand’ image is significant. For example, the first three images in Fig. 2, which correspond to \(t=1\), \(t=2\) and \(t=3\), respectively, show a poor segmentation results. As

*t*increases, the segmentation becomes more homogeneous, thus the main structures in the original image can be separated such as \(t=4\) in (e). Another consequence from the increase in the diffusion transition time parameter

*t*is the smoothing effect on the pairwise distances between data points in the embedded space. By increasing

*t*, the pairwise distances between similar tensors are reduced while he distances between dissimilar tensors increase.

Performance comparison between the application of the approximated SVD step in the PTEA Algorithm 4.1 and the application of the full SVD to imaging datasets

Image | Size | d | SVD complexity—full | SVD complexity—approx. | Dict. size |
---|---|---|---|---|---|

Hand | \(104 \times 128\) | 2 | \(O \left( 26624^3 \right) \) | \(O \left( 26624 \times 16 \right) \) | 2 |

Sport | \(40 \times 77\) | 2 | \(O \left( 6160^3 \right) \) | \(O \left( 6160 \times 16 \right) \) | 2 |

The constructed dictionary enables us to practically utilize the PTE to segment medium sized images. The computational complexities were significantly lower in comparison to the application of the SVD decomposition that utilizes the full super-kernel. The reduction in the computational complexity of image segmentation was achieved by the application of the dictionary-based SVD/QR step of the extension coefficient matrix *E* in the PTEA algorithm.

### 5.3 Example III: Vector field extension over a sphere

*F*. The synthetic vector field consists of \(\left| M \right| =N=8050\) data points sampled from a two dimensional half sphere \(\mathcal {M}\) that is immersed in \(\mathbbm {R}^3\) by \(F:M\rightarrow \mathbb {R}^3\) such that for any \(x \in M\), we have \(F(x)\in T_x(\mathcal {M})\). For each two-dimensional data point, we generate a vector that lies on the tangent plane of the sphere at a corresponding data point. Figure 4 illustrates the described vector field from two views.

*F*(

*x*) in the local basis of \(T_x(\mathcal {M})\). Our goal is to extend

*F*to any data point \(y \in \mathcal {M} \backslash M\). In our example, for each data point \(x \in M\), we have a closed form for its local patch. Each patch at the data point \(x = (x_1,x_2)\) is spanned by two vectors \(S_1 = (1,0,-x_1 \sqrt{1-x_1^2-x_2^2} )\) and \(S_2 = (0,1,-x_2 \sqrt{1-x_1^2-x_2^2} )\). Hence, the corresponding tangential space \(O_x\) is given by the application of SVD to \([S_1,S_2]^T\). Furthermore, we choose the vectors from the vector field to be the corresponding vectors \(F(x)= (1,0,-x_1 \sqrt{1-x_1^2-x_2^2})\). Equation 20 provides the solution to the out-of-sample extension problem given the vector field members

*f*that corresponds to the dictionary members and the corresponding super-kernel \(G_s\). For each data point

*y*, which is not in the dictionary, we apply Eq. 5 to find the extended vector field. The resulting vector field is compared with the original vector field in Fig. 6.

## 6 Conclusions

The proposed construction in this paper extends the dictionary construction in Engel et al. (2004) by using the LPD super-kernel from Salhov et al. (2012), Wolf and Averbuch (2013). This is achieved by an efficient dictionary-based construction assuming the data is sampled from an underlying manifold while utilizing the non-scalar relations and the similarities among manifold patches instead of representing them by individual data points. The constructed dictionary contains these patches from the underlying manifold, which are represented by the embedded tensors from Salhov et al. (2012), instead of representing them by individual data points. Therefore, it encompasses multidimensional similarities between local areas in the data. Furthermore, the dictionary insertion criterion is based on tensorial linear dependencies that incorporate both distances on the manifold as well as relations between its tangential spaces. Therefore, dictionary representatives are expected to either be sufficiently far from each other on the manifold or have sufficiently different tangent spaces. As a result, this criterion is able to adapt the dictionary selection to the curvatures of the manifold, since it allows a denser selection in high-curvature areas and a sparser one in low-curvature areas. The PTEA Algorithm reduces the computational complexities of the spectral analysis in comparison to the regular use of PTE based algorithm.

## Footnotes

## Notes

### Acknowledgments

This research was supported by the the Israel Ministry of Science & Technology (Grants No. 3-9096, 3-10898), by US–Israel Binational Science Foundation (BSF 2012282), by Blavatnik Funds and by a Fellowship from University of Jyväskylä. The third author was supported by the Eshkol Fellowship from the Israel Ministry of Science & Technology.

### References

- Ballester, C., Bertalmio, M., Sapiro, G., & Verdera, J. (2001). Filling-in by joint interpolation of vector fields and gray levels.
*IEEE Transactions on Image Processing*,*10*, 1200–1211.MathSciNetCrossRefMATHGoogle Scholar - Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation.
*Neural Computation*,*15*(6), 1373–1396.CrossRefMATHGoogle Scholar - Bermanis, A., Averbuch, A., & Coifman, R. (2013). Multiscale data sampling and function extension.
*Applied and Computational Harmonic Analysis*,*34*, 182–203.MathSciNetCrossRefMATHGoogle Scholar - Bomba, J. S. (1959).
*Alpha-numeric character recognition using local operations*. In Papers presented at the December 1–3, 1959, eastern joint IRE-AIEE-ACM computer conference, ACM, New York, NY, USA, IRE-AIEE-ACM ’59 (Eastern) (pp. 218–224).Google Scholar - Bronstein, M., & Bronstein, A. (2011). Shape recognition with spectral distances.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,*33*(5), 1065–1071.CrossRefGoogle Scholar - Ciresan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In
*2012 IEEE conference on computer vision and pattern recognition (CVPR)*(pp. 3642–3649).Google Scholar - Coifman, R., & Lafon, S. (2006). Diffusion maps.
*Applied and Computational Harmonic Analysis*,*21*(1), 5–30.MathSciNetCrossRefMATHGoogle Scholar - Cox, T., & Cox, M. (1994).
*Multidimensional scaling*. London: Chapman and Hall.MATHGoogle Scholar - Cullum, J. K., & Willoughby, R. A. (2002).
*Lanczos algorithms for large symmetric eigenvalue computations*, (Vol. 1). Society for Industrial and Applied Mathematics.Google Scholar - David, G. (2009). Anomaly detection and classification via diffusion processes in hyper-networks. PhD thesis, School of Computer Science, Tel Aviv University.Google Scholar
- David, G., & Averbuch, A. (2012). Hierarchical data organization, clustering and denoising via localized diffusion folders.
*Applied and Computational Harmonic Analysis*,*33*(1), 1–23.MathSciNetCrossRefMATHGoogle Scholar - Dimond, T. L. (1958).
*Devices for reading handwritten characters*. In Papers and discussions presented at the December 9–13, 1957, eastern joint computer conference: Computers with deadlines to meet, ACM, New York, NY, USA, IRE-ACM-AIEE ’57 (Eastern) (pp. 232–237).Google Scholar - Engel, Y., Mannor, S., & Meir, R. (2004). The kernel recursive least-squares algorithm.
*IEEE Transactions on Signal Processing*,*52*(8), 2275–2285.MathSciNetCrossRefGoogle Scholar - Fowlkes, C., Belongie, S., Chung, F., & Malik, J. (2004). Spectral grouping using the Nyström method.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,*26*, 214–225.CrossRefGoogle Scholar - Fuselier, E. J., & Wright, G. (2009). Stability and error estimates for vector field interpolation and decomposition on the sphere with RFBs.
*SIAM Journal on Numerical Analysis*,*47*(5), 3213–3239.MathSciNetCrossRefMATHGoogle Scholar - Golub, G., & Van Loan, C. (2012).
*Matrix computations*(4th ed.). Baltimore: John Hopkins University Press.MATHGoogle Scholar - Jingen, L., Yang, Y., & Shah, M. (2009). Learning semantic visual vocabularies using diffusion distance. In
*IEEE conference on computer vision and pattern recognition, 2009. CVPR 2009*(pp. 461–468).Google Scholar - Keller, Y., Coifman, R., Lafon, S., & Zucker, S. (2010). Audio-visual group recognition using diffusion maps.
*IEEE Transactions on Signal Processing*,*58*(1), 403–413.MathSciNetCrossRefGoogle Scholar - Keysers, D., Deselaers, T., Gollan, C., & Ney, H. (2007). Deformation models for image recognition.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,*29*, 1422–1435.CrossRefGoogle Scholar - Kruskal, J. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis.
*Psychometrika*,*29*, 1–27.MathSciNetCrossRefMATHGoogle Scholar - Lafon, S. (2004). Diffusion maps and geometric harmonics. PhD thesis, Yale University.Google Scholar
- Lafon, S., Keller, Y., & Coifman, R. (2006). Data fusion and multicue data matching by diffusion maps.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,*28*(11), 1784–1797.CrossRefGoogle Scholar - Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition.
*Proceedings of the IEEE*,*86*, 2278–2324.CrossRefGoogle Scholar - Liu, S., & Trenkler, G. (2008). Hadamard, Khatri-Rao, Kronecker and other matrix products.
*International Journal of Information & Systems Sciences*,*4*(1), 160–177.MathSciNetMATHGoogle Scholar - Mordohai, P., & Medioni, G. G. (2010). Dimensionality estimation, manifold learning and function approximation using tensor voting.
*Journal of Machine Learning Research*,*11*, 411–450.MathSciNetMATHGoogle Scholar - Ouimet, M., & Bengio, Y. (2005). Greedy spectral embedding. In
*Proceedings of the tenth international workshop on artificial intelligence and statistics*(pp. 253–260).Google Scholar - Rao, C. K. C. (1968). Solutions to some functional equations and their applications to characterization of probability distributions.
*Sankhya: The Indian Journal of Statistics, Series A (1961–2002)*,*30*(2), 167–180.MathSciNetMATHGoogle Scholar - Rui, X., Damelin, S., & Wunsch, D. (2007). Applications of diffusion maps in gene expression data-based cancer diagnosis analysis. In:
*Engineering in medicine and biology society, 2007. EMBS 2007. 29th annual international conference of the IEEE*(pp. 4613–4616).Google Scholar - Salhov, M., Wolf, G., & Averbuch, A. (2012). Patch-to-tensor embedding.
*Applied and Computational Harmonic Analysis*,*33*(2), 182–203.MathSciNetCrossRefMATHGoogle Scholar - Salhov, M., Bermanis, A., Wolf, G., & Averbuch, A. (2015). PTEA matlab implementation. http://www.cs.tau.ac.il/~moshesa/HA/PAtchDiffusion.rar.
- Schclar, A., Averbuch, A., Rabin, N., Zheludev, V., & Hochman, K. (2010). A diffusion framework for detection of moving vehicles.
*Digital Signal Processing*,*20*(1), 111–122.CrossRefGoogle Scholar - Singer, A., & Coifman, R. (2008). Non-linear independent component analysis with diffusion maps.
*Applied and Computational Harmonic Analysis*,*25*(2), 226–239.MathSciNetCrossRefMATHGoogle Scholar - Singer, A., & Wu, H. (2011). Orientability and diffusion maps.
*Applied and Computational Harmonic Analysis*,*31*(1), 44–58.MathSciNetCrossRefMATHGoogle Scholar - Singer, A., & Wu, H. (2012). Vector diffusion maps and the connection laplacian.
*Communications on Pure and Applied Mathematics*,*65*(8), 1067–1144.MathSciNetCrossRefMATHGoogle Scholar - Suen, C., Berthod, M., & Mori, S. (1980). Automatic recognition of handprinted characters; the state of the art.
*Proceedings of the IEEE*,*68*(4), 469–487.CrossRefGoogle Scholar - Talmon, R., Cohen, I., & Gannot, S. (2011). Supervised source localization using diffusion kernels. In
*2011 IEEE workshop on applications of signal processing to audio and acoustics (WASPAA)*(pp. 245–248).Google Scholar - Talmon, R., Kushnir, D., Coifman, R., Cohen, I., & Gannot, S. (2012). Parametrization of linear systems using diffusion kernels.
*IEEE Transactions on Signal Processing*,*60*(3), 1159–1173.MathSciNetCrossRefGoogle Scholar - Talmon, R., Cohen, I., & Gannot, S. (2013). Single-channel transient interference suppression with diffusion maps.
*IEEE Transactions on Audio, Speech, and Language Processing*,*21*(1–2), 132–144.CrossRefGoogle Scholar - von Luxburg, U. (2007). A tutorial on spectral clustering.
*Statistics and Computing*,*17*, 395–416.MathSciNetCrossRefGoogle Scholar - Wolf, G., & Averbuch, A. (2013). Linear-projection diffusion on smooth Euclidean submanifolds.
*Applied and Computational Harmonic Analysis*,*34*, 1–14.MathSciNetCrossRefMATHGoogle Scholar