Abstract
Many real-world datasets can be considered functional, in the sense that the processes that generate them are continuous. A fundamental property of this type of data is that, in theory, they belong to an infinite-dimensional space. Although in practice we usually observe finite samples, these are still high-dimensional, and dimensionality reduction methods are therefore crucial. In this vein, the main state-of-the-art method for functional data analysis is Functional PCA. Nevertheless, this classic technique assumes that the data lie in a linear manifold, and it may therefore perform poorly when this hypothesis is not fulfilled. This research focuses on a non-linear manifold learning method: Diffusion Maps. The article explains how to extend this multivariate method to functional data and compares its behavior against Functional PCA on several simulated and real examples.
1 Introduction
In many real-world applications, data must be considered as discretized functions rather than standard vectors, since the process that generates them is assumed to be continuous. Natural examples are found in medicine, economics, speech recognition or meteorological problems. This type of data, commonly called functional data, is usually represented by high-dimensional vectors whose coordinates are highly correlated, and thus multivariate methods lead to ill-conditioned problems. As a consequence, most traditional data analysis tools, such as regression, classification or clustering, are being adapted to functional inputs, as well as some dimensionality reduction techniques.
Functional data analysis (FDA; Vieu 2018; Cuevas 2014) is an active field of statistics in which the data of interest are often considered as realizations of stochastic processes over a compact domain. Traditionally, FDA focused on linear dimensionality reduction methods, with special attention given to Functional Principal Component Analysis (FPCA; Ramsay and Silverman 2005) as an unsupervised technique that captures most of the variance of the original data in a lower-dimensional space. In particular, in this method the infinite-dimensional random functions are projected onto the lower dimensional subspace generated by the eigenfunctions of the covariance operator. Moreover, when functional data are expressed as a linear combination of known basis functions, the eigenequation problem can be reduced to a discrete form expressing the eigenfunctions in the same known basis (Riesz and Nagy 2012).
Other examples of popular linear algorithms that have been transferred to FDA include Independent Component Analysis (ICA; Virta et al. 2020), which attempts to project data onto independent components, and functional Partial Least Squares (FPLS; Delaigle and Hall 2012), which aims to identify the underlying latent variables in both the response and predictor variables.
An extreme case of linear projections are variable selection methods, which reduce trajectories to their values at a few impact points. In recent years, a multitude of proposals for functional variable selection have emerged, most notably methods based on the regularization of linear regression models, such as the Lasso or the Dantzig selector (Aneiros et al. 2022; Matsui and Konishi 2011). Other proposals are based on logistic regression (McKeague and Sen 2010), wrapper methods (Delaigle et al. 2012), or the selection of local maxima of dependence measures (Berrendero et al. 2016). While all of these proposals have notable strengths, they are inadequate for addressing non-linear problems.
A first step to alleviate this deficiency is the extension of linear dimensionality reduction techniques in FDA to a non-linear context. This is the case of NAFPCA (Song and Li 2021), a method proposed as a generalization of FPCA via two additively nested Hilbert spaces of functions. The first space characterizes the functional nature of the data, and the second space captures the nonlinear dependence. This formulation is based on the estimator proposed in Li and Song (2017) in a context of sufficient dimension reduction. This method has the advantage of being suitable for vector-valued functional data.
From a different perspective, multivariate statistics offers several non-linear dimensionality reduction techniques based on the assumption that high-dimensional data observed in a D-dimensional space \({\mathbb {R}}^D\) actually lie on (or close to) a d-dimensional manifold \({\mathcal {M}}\subset {\mathbb {R}}^D\), with \(d < D\) (Cayton 2005). These techniques can be used to discover high-density low-dimensional surfaces in unsupervised problems, or as a preprocessing step for supervised models, being very useful, for example, for data analysis and visualization (Bengio et al. 2006). Their goal is to achieve a new representation that captures the structure of the data in a few dimensions while preserving the original local information. To do this, most of these methods rely on the spectral analysis of the similarity matrix of a graph previously constructed over the original data. Another important characteristic of these methods is that they arrive at a new space where the Euclidean distance between embedded points corresponds, in some sense, to the information preserved from the original space.
Isometric feature mapping (Isomap; Tenenbaum et al. 2000), t-distributed Stochastic Neighbor Embedding (t-SNE; Cox and Cox 2008), Locally Linear Embedding (LLE; Roweis and Saul 2000) and Diffusion Maps (DM; Coifman and Lafon 2006; Coifman et al. 2005; De la Porte et al. 2008) are some of the main manifold methods used for dimensionality reduction over multivariate data. They all take advantage of the local linearity of manifolds and create a mapping that preserves local neighborhoods at each point of the underlying manifold.
Out of all these techniques, only Isomap (Chen and Müller 2012) has been extended to the functional setting. This method is an extension of classical metric multidimensional scaling, characterized by the use of geodesic distances, induced by a neighborhood graph, to estimate the intrinsic geometry of a data manifold. Although Isomap is often easy to visualize and interpret, making it useful for exploratory data analysis and visualization, the computation of geodesic distances can be computationally expensive. Therefore, there is a need to extend to the functional context other non-linear dimensionality reduction methods that deal with these difficulties. A promising candidate is Diffusion Maps, which can handle more complex data with multiple scales and is much more robust against noise and outliers.
The current paper specifically focuses on extending this manifold learning technique to the FDA context by proposing a new method: functional diffusion maps (FDM). Like DM, which has been successfully applied to a variety of contexts (including multivariate datasets (Fernández et al. 2012; Dov et al. 2015), time-series problems (Marshall and Hirn 2018; Lian et al. 2015), the analysis of complex dynamical systems (Nadler et al. 2006), and the low-dimensional representation of stochastic systems (Coifman et al. 2008)), FDM has the potential to deliver competitive results in functional data applications, particularly when the data are distributed on specific manifolds. The main contribution of this research can thus be summarized as extending DM to the functional domain, offering a formalization of DM for functions, and comparing FDM with DM, FPCA and Isomap in simulated and real examples.
The rest of the paper is organized as follows: Sect. 2 presents some ideas and definitions from FDA that will be useful for the subsequent discussion in this article, paying special attention to functional manifolds, which will be the main assumption in DM. Section 3 formalizes the FDM approach and Sect. 4 illustrates and compares the performance of DM, FPCA, Isomap and FDM on synthetic and real datasets. Finally, Sect. 5 presents some conclusions of this work, as well as possible research lines to further improve it.
2 Functional data analysis
In this work, we consider the typical case in Functional Data Analysis (Ramsay and Silverman 2005; Ferraty and Vieu 2006) where we have a sample of functions \(x_1(t), \dotsc , x_N(t)\), where \(t\in {\mathcal {J}}\), \({\mathcal {J}}\) is a compact interval, and each \(x_i(t)\) is the observation of an independent functional variable \(X_i\) identically distributed as X. It is usual to assume that the functional variable X is a second-order stochastic process, \(\textrm{E}[X^2]<\infty \), taking values in the Hilbert space of square integrable functions \(L^2([a,b])\) defined on the closed interval \([a,b] \subset {\mathbb {R}}\). Square integrable functions form a vector space, and we can define the inner product of two functions by \(\langle x,y \rangle = \int _{a}^b x(t) y(t) dt\). The inner product allows us to introduce a notion of distance between functions through the \(L^2\)-norm, \(\Vert x\Vert ^2 = \langle x,x \rangle = \int _{a}^b x^2(t)dt\). Therefore, in the \(L^2\) space, the distance between two functions can be calculated as the norm of their difference, \(\Vert x-y\Vert \).
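In practice, functions are observed on a discrete grid, and these \(L^2\) quantities are approximated by numerical integration. The following is a minimal NumPy sketch (the function names are ours, for illustration only):

```python
import numpy as np

def l2_inner(x, y, t):
    """Approximate <x, y> = integral of x(t) y(t) dt with the trapezoidal
    rule, given the values of x and y on a common (possibly non-uniform)
    grid t."""
    f = x * y
    return float(np.sum((f[1:] + f[:-1]) / 2 * np.diff(t)))

def l2_distance(x, y, t):
    """L2 distance ||x - y|| between two discretized functions."""
    return np.sqrt(l2_inner(x - y, x - y, t))

# Example: x(t) = sin(t), y(t) = cos(t) on [0, 2*pi];
# ||x - y||^2 = integral of (1 - sin(2t)) dt = 2*pi over this interval,
# so the distance is sqrt(2*pi) ~ 2.5066
t = np.linspace(0, 2 * np.pi, 1001)
x, y = np.sin(t), np.cos(t)
dist = l2_distance(x, y, t)
```

Note that the trapezoidal weights `np.diff(t)` make the approximation valid even when the grid is not equispaced, a point that becomes relevant in the experiments of Sect. 4.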
2.1 Functional data representation
Although each function \(x_i\) consists of infinitely many values \(x_i(t)\) with \(t \in [a,b]\), in practice we only observe a finite set of function values at a set of arguments that are not necessarily equidistant or equal for all functions (Wang et al. 2016). A central problem when representing functional data is that the underlying function cannot be directly recovered, so we must work with one of two approaches: using the discrete data directly, or transforming them into functions via a basis function expansion. We briefly sketch both approaches below (for a thorough explanation the reader can refer to Ramsay and Silverman (2005)).
The first approach consists in directly using the observed values of the functions, and it does not require any additional adjustment or assumption. Let \(\{x_i(t)\}_{i=1}^N\) be the N underlying functions from which we have obtained our data \(\textrm{x}_i = ( x_i(t_1), \dotsc ,x_i(t_M))^{\top }\); we denote our dataset as the \(N \times M\) matrix \(\textrm{X}\) with entries \(\textrm{X}_{ij} = x_i(t_j)\), so that row i contains the discretized values of \(x_i\).
The other possible representation is based on a basis function expansion and consists in representing each function \(x_i(t)\) using a basis of \(L^2\), such that \(x_i(t) = \sum _{k=1}^{\infty } c_{ik} \phi _k(t),\)
where \(\phi _k(t)\) are the basis functions and \(c_{ik}\) are the coefficients that determine the weight of each basis function in the representation of \(x_i(t)\).
One way to approximate each function is by truncating the basis to its first K elements. This results in smoothing and dimensionality reduction, such that \(x_i(t) \approx \sum _{k=1}^{K} c_{ik} \phi _k(t) = \textrm{c}_i^{\top } \upphi (t),\)
where \(\upphi (t) = (\phi _1(t), \dotsc , \phi _K(t))^{\top }\) is the vector containing the functions of the truncated basis and \(\textrm{c}_i = (c_{i1}, \dotsc , c_{iK})^{\top }\) are the coefficients in the new basis. In this way, the set of N functional data evaluated at time t can be expressed as \(\textrm{x}(t) = (x_1(t), \dotsc , x_N(t))^{\top } = \textrm{C}\, \upphi (t),\)
where \(\textrm{C}\) is an \(N\times K\) matrix.
There are several possible bases, such as the Fourier, wavelet, B-spline, or polynomial bases. Depending on the data, the choice of basis can be key to capturing as much information as possible about the underlying function. For example, if the data are periodic, a Fourier basis will be of interest, while if they are not, the B-spline basis is a common choice.
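As an illustration, the coefficients \(c_{ik}\) of a truncated basis can be obtained by least squares from the discretized curves. The sketch below, in plain NumPy, fits a small Fourier basis (function names and parameter values are ours, for illustration only):

```python
import numpy as np

def fourier_basis(t, K, period):
    """Evaluate the first K Fourier basis functions at the points t:
    1, sin(wt), cos(wt), sin(2wt), cos(2wt), ...  with w = 2*pi/period."""
    w = 2 * np.pi / period
    cols = [np.ones_like(t)]
    k = 1
    while len(cols) < K:
        cols.append(np.sin(k * w * t))
        if len(cols) < K:
            cols.append(np.cos(k * w * t))
        k += 1
    return np.column_stack(cols)            # M x K matrix Phi

def fit_coefficients(X, t, K, period):
    """Least-squares coefficient matrix C (N x K) so that X ~ C @ Phi.T."""
    Phi = fourier_basis(t, K, period)
    C, *_ = np.linalg.lstsq(Phi, X.T, rcond=None)
    return C.T

# Noisy periodic curves sampled on a common grid
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)
X = np.sin(2 * np.pi * t) + 0.05 * rng.standard_normal((5, 100))
C = fit_coefficients(X, t, K=5, period=1.0)
X_smooth = C @ fourier_basis(t, 5, 1.0).T   # smoothed reconstruction
```

The truncation acts as a smoother: the noise component that is orthogonal to the K basis functions is discarded in the reconstruction.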
2.2 Functional manifolds
A differential manifold \({\mathcal {M}}\) is a topological space which is locally diffeomorphic to \({\mathbb {R}}^d\) (Lee 2012). These manifolds are expressed in terms of an atlas consisting of a set of charts \(\{(U_\alpha , \varphi _\alpha )\}\), where \(U_\alpha \) are open sets covering \({\mathcal {M}}\), and the coordinate maps \(\varphi _\alpha : U_\alpha \rightarrow {\mathbb {R}}^d\), are diffeomorphisms from \(U_\alpha \) to open sets of \({\mathbb {R}}^d\). These coordinate maps are commonly used to describe the geometry of the manifold. In the context of non-linear dimensionality reduction, it is usual to consider “simple” differential manifolds which are locally isometric to \({\mathbb {R}}^d\). These manifolds are characterized by a single chart \((U, \varphi )\), where U is an open set covering \({\mathcal {M}}\), and the coordinate map \(\varphi : U\rightarrow {\mathbb {R}}^d\) is an isometry from U to an open set of \({\mathbb {R}}^d\). A typical example of a “simple” manifold is the well-known Swiss roll manifold depicted in Fig. 4.
In line with Chen and Müller (2012) and in accordance with standard practices in FDA, our research will focus on “simple” manifolds \({\mathcal {M}}\subset L^2\). We will refer to them as functional manifolds. These manifolds are naturally equipped with the \(L^2\) distance, although as in the case of multivariate data, this metric may not be the most appropriate one. Hence, we will use manifold learning methods to approximate \(\varphi \) and understand the inherent structure of the functional data.
3 Functional diffusion maps
In this work, we propose a new Functional Manifold Learning technique, which we call functional diffusion maps, that aims to find low-dimensional representations of \(L^2\) functions on non-linear functional manifolds. It first builds a weighted graph based on pairwise similarities between functional data and then, just like Diffusion Maps does, uses a diffusion process on the normalized graph to reveal the overall geometric structure of the data at different scales.
In the following, the mathematical framework for DM is detailed, and then a generalization for functional data is proposed.
3.1 Mathematical framework for diffusion maps
Diffusion Maps is a non-linear dimensionality reduction algorithm introduced by Coifman and Lafon (2006) which focuses on discovering the underlying manifold from which multivariate data have been sampled. Since it was first proposed, the theoretical framework behind DM has laid the groundwork for many other studies and proposals (Singer and Coifman 2008; Berry and Harlim 2016; Lederman and Talmon 2018; Maggioni and Murphy 2019).
In particular, DM computes a family of embeddings of a dataset into a low-dimensional Euclidean space whose coordinates are computed from the eigenvalues and eigenvectors of a diffusion operator on the data. In more detail, let \({\mathcal {X}}=\{\textrm{x}_{1}, \dotsc ,\textrm{x}_{\textrm{N}}\}\) be our multivariate dataset, such that \({\mathcal {X}}\) lies along a manifold \({\mathcal {M}} \subset {\mathbb {R}}^D\). A random walk will be defined on the data, so that the connectivity or similarity between two data points \(\textrm{x}\) and \(\textrm{y}\) can be used as the probability of walking from \(\textrm{x}\) to \(\textrm{y}\) in one step of the random walk. Usually, this probability is specified in terms of a kernel function of the two points, \(k:{\mathbb {R}}^D\times {\mathbb {R}}^D\rightarrow {\mathbb {R}}\), which has the following properties:
-
k is symmetric: \(k(\textrm{x},\textrm{y}) = k (\textrm{y},\textrm{x})\);
-
k is positivity preserving: \(k(\textrm{x}, \textrm{y})\ge 0\) \(\forall \textrm{x},\textrm{y}\in {\mathcal {X}}\).
Let \(\textrm{G} = ({\mathcal {X}}, \textrm{K})\) be a weighted graph where \(\textrm{K}\) is an \(N\times N\) kernel matrix whose components are \(k_{ij} = k(\textrm{x}_i,\textrm{x}_j)\). This graph is connected and captures some geometric features of interest in the dataset. Now, to build a random walk, \(\textrm{G}\) has to be normalized.
Since the graph depends on both the geometry of the underlying manifold and the distribution of the data on it, in order to control the influence of the latter, DM uses a density parameter \(\alpha \in [0,1]\) in the normalization step. The normalized graph \(\textrm{G}'=({\mathcal {X}}, \textrm{K}^{(\alpha )})\) uses the weight matrix \(\textrm{K}^{(\alpha )}\) obtained by normalizing \(\textrm{K}\): \(k_{ij}^{(\alpha )} = \frac{k_{ij}}{d_i^{\alpha } d_j^{\alpha }},\)
where \(d_i = \sum _{j=1}^N k_{ij}\) is the degree of node i in the graph, and the power \(d_i^\alpha \) approximates the density of each vertex.
Now, we can create a Markov chain on the normalized graph whose transition matrix \(\textrm{P}\) is defined by \(p_{ij} = \frac{k_{ij}^{(\alpha )}}{d_i^{(\alpha )}},\)
where \(d_i^{(\alpha )} = \sum _{j=1}^N k_{ij}^{(\alpha )}\) is the degree of node i in the normalized graph.
Each component \(p_{ij}\) of the transition matrix \(\textrm{P}\) provides the probabilities of arriving from node i to node j in a single step. By taking powers of the \(\textrm{P}\) matrix we can increase the number of steps taken in the random walk. Thus, the components of \(P^T\), namely \(p_{ij}^T\), represent the sum of the probabilities of all paths of length T from i to j. This defines a diffusion process that reveals the global geometric structure of \({\mathcal {X}}\) at different scales.
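The role of the matrix power can be seen on a toy transition matrix (a hypothetical 3-node chain, unrelated to any dataset in this paper):

```python
import numpy as np

# Transition matrix of a tiny 3-node random walk (each row sums to 1)
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

# T-step transition probabilities are the entries of the matrix power P^T
P5 = np.linalg.matrix_power(P, 5)

# After 5 steps there is a nonzero probability of reaching node 2 from
# node 0, even though a single step cannot (P[0, 2] == 0): P5 aggregates
# the probabilities of all paths of length 5.
```

Each power remains a stochastic matrix, so rows of `P5` still sum to one.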
The transition matrix of a Markov chain has a stationary distribution, given by \(\pi _i = \frac{d_i^{(\alpha )}}{\sum _{j=1}^N d_j^{(\alpha )}},\)
which satisfies the stationarity property \(\pi _j = \sum _{i=1}^N \pi _i p_{ij}.\)
Since the graph is connected, the associated stationary distribution is unique. Moreover, since \({\mathcal {X}}\) is finite, the chain is ergodic. Another property of the Markov chain is that it is reversible, satisfying the detailed balance condition \(\pi _i p_{ij} = \pi _j p_{ji}.\)
Taking into account all these properties, we can now define a diffusion distance \(\textrm{D}_T\) based on the geometric structure of the obtained diffusion process. This metric measures the similarity between data points as the connectivity, or probability of transition, between them: \(\textrm{D}_T^2(\textrm{x}_i,\textrm{x}_j) = \Vert p_{i\cdot }^T - p_{j\cdot }^T \Vert _{L^2(1/\pi )}^2 = \sum _{k=1}^N \frac{\left( p_{ik}^T - p_{jk}^T\right) ^2}{\pi _k},\)
where \(\Vert \cdot \Vert _{L^2(1/\pi )}\) denotes the Euclidean norm weighted by the reciprocal of the stationary distribution \(\pi \), which takes into account the local density of the data points. This metric is also robust to noise perturbation, since it is defined as an average over all paths of length T. Therefore, T plays the role of a scale parameter, and \(\textrm{D}_T^2 (\textrm{x}_i,\textrm{x}_j)\) will be small if there exist many paths of length T connecting \(\textrm{x}_i\) and \(\textrm{x}_j\).
Figure 1 shows a visual example of the diffusion distances and the \(L^2\) distances over the classic Moons dataset, which consists of two interleaved half-circles. A spectral color bar is used, with warm colors indicating nearness between the data points and cold colors representing remoteness.
Clearly, the Moons data live on a non-linear manifold. The picture displays the distances between the point marked with a star and the rest. It can be seen that the diffusion distance approximates the distance along the manifold, so that points on the same moon are closer to each other than to points on the other moon. In contrast, the \(L^2\) distances are independent of the moons and are not convenient when data live on a non-linear manifold such as this one.
Spectral analysis of the Markov chain allows us to consider an alternative formulation of the diffusion distance that uses eigenvalues and eigenvectors of \(\textrm{P}\).
As detailed in Coifman and Lafon (2006), even though \(\textrm{P}\) is not symmetric, it makes sense to perform its spectral decomposition using its left and right eigenvectors, such that \(p_{ij} = \sum _{l=0}^{N-1} \lambda _l \psi _l(i) \varphi _l(j),\) where \(\{\lambda _l\}\) is the decreasing sequence of eigenvalues of \(\textrm{P}\), and \(\varphi _l\) and \(\psi _l\) are the corresponding left and right eigenvectors.
The eigenvalue \(\lambda _0=1\) is discarded since \(\psi _0\) is a vector with all its values equal to one. The remaining eigenvalues \(\lambda _1, \lambda _2, \dotsc ,\lambda _{N-1}\) tend to 0 and satisfy \(|\lambda _l|<1\) for all \(l \ge 1\). Now, the diffusion distance can be rewritten using the new representation of \(\textrm{P}\): \(\textrm{D}_T^2(\textrm{x}_i,\textrm{x}_j) = \sum _{l=1}^{N-1} \lambda _l^{2T} \left( \psi _l(i) - \psi _l(j)\right) ^2.\)
Since \(\lambda _l\rightarrow 0\) as l grows, the diffusion distance can be approximated using only the first L eigenvalues and eigenvectors. One possibility is to select L such that \(L = \max \left\{ l \in {\mathbb {N}} : |\lambda _l|^T > \delta |\lambda _1|^T \right\} ,\)
where \(\delta \) is a parameter that controls the precision of the approximation. Thus, the diffusion distance approximated with L terms is expressed as \(\textrm{D}_{T}^2(\textrm{x}_i,\textrm{x}_j) \approx \sum _{l=1}^{L} \lambda _l^{2T} \left( \psi _l(i) - \psi _l(j)\right) ^2.\)
Finally, the diffusion map \(\Psi _T: {\mathcal {X}}\rightarrow {\mathbb {R}}^L\) is defined as \(\Psi _T(\textrm{x}_i) = \left( \lambda _1^T \psi _1(i), \lambda _2^T \psi _2(i), \dotsc , \lambda _L^T \psi _L(i)\right) ^{\top }.\)
With this definition, the diffusion maps family \(\{\Psi _T\}_{T\in {\mathbb {N}}}\) projects the data points into a Euclidean space \({\mathbb {R}}^{L}\) so that the diffusion distance in the original space can be approximated by the Euclidean distance between the \(\Psi _T\) projections in \({\mathbb {R}}^{L}\), \(\textrm{D}_T(\textrm{x}_i,\textrm{x}_j) \approx \Vert \Psi _T(\textrm{x}_i) - \Psi _T(\textrm{x}_j)\Vert \),
preserving the local geometry of the original space.
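The whole construction above (kernel, \(\alpha \)-normalization, Markov matrix, spectral decomposition, and embedding) can be sketched in a few lines of NumPy. This is a minimal dense implementation for illustration, not the reference code of the paper:

```python
import numpy as np

def diffusion_maps(X, sigma=1.0, alpha=0.5, T=1, L=2):
    """Minimal dense Diffusion Maps sketch: returns the embedding
    Psi_T(x_i) = (lambda_1^T psi_1(i), ..., lambda_L^T psi_L(i))."""
    # 1. RBF kernel matrix K
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma ** 2))
    # 2. Density normalization with parameter alpha
    d = K.sum(axis=1)
    K_alpha = K / np.outer(d ** alpha, d ** alpha)
    # 3. Row-normalization: Markov transition matrix P
    d_alpha = K_alpha.sum(axis=1)
    P = K_alpha / d_alpha[:, None]
    # 4. Spectral decomposition via the symmetric conjugate of P,
    #    S = D^{1/2} P D^{-1/2}, which shares its eigenvalues
    S = K_alpha / np.outer(np.sqrt(d_alpha), np.sqrt(d_alpha))
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(vals)[::-1]               # decreasing eigenvalues
    vals, vecs = vals[idx], vecs[:, idx]
    psi = vecs / np.sqrt(d_alpha)[:, None]     # right eigenvectors of P
    # 5. Discard the trivial pair (lambda_0 = 1) and keep the next L
    return (vals[1:L + 1] ** T) * psi[:, 1:L + 1]

# Two well-separated Gaussian blobs: the first diffusion coordinate
# should take one sign on each blob.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(3.0, 0.1, (20, 2))])
emb = diffusion_maps(X, sigma=0.5, alpha=0.0, T=1, L=2)
```

The symmetric conjugate trick is a standard way to diagonalize the non-symmetric \(\textrm{P}\) with a stable symmetric eigensolver; the hyperparameter values here are arbitrary choices for the toy data.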
3.2 Extending diffusion maps to functional data
Let X be a centered square integrable functional variable of the Hilbert space \( L^2([a,b])\), where [a, b] is a compact interval. Let \({\mathcal {X}} = \{x_1(t), \dotsc , x_N(t)\}\) be the observations of N independent functional variables \(X_1, \dotsc , X_N\) identically distributed as X. We assume that \({\mathcal {X}}\) lies on a functional manifold \({\mathcal {M}}\subset L^2([a,b])\).
Before defining a diffusion process over \({\mathcal {X}}\), a random walk over a weighted graph has to be built. Graph vertices are the functions \(x_i(t) \in {\mathcal {X}}\), and the weights \(k_{ij}\) are given by a symmetric positive-definite kernel operator \({\mathcal {K}}:L^2([a,b])\times L^2([a,b])\rightarrow {\mathbb {R}}\), so that the weight between two vertices \(x_i(t)\) and \(x_j(t)\) is \(k_{ij} = {\mathcal {K}}(x_i(t), x_j(t))\). Thus, the weighted graph is \(\textrm{G}=({\mathcal {X}}, \textrm{K})\), where \(\textrm{K}\) is the \(N\times N\) kernel matrix resulting from evaluating \({\mathcal {K}}\) on \({\mathcal {X}}\). The kernel operator defines a local measure of similarity within a certain neighborhood, outside of which the function quickly goes to zero. The standard kernel used to compute the similarity between functional data is the Radial Basis Function (RBF) kernel, defined as \({\mathcal {K}}(x(t), y(t)) = \exp \left( -\frac{\Vert x-y\Vert ^2}{2\sigma ^2}\right) ,\)
where the size of the local neighborhood is determined by the hyperparameter \(\sigma \). Another classical option is the Laplacian kernel, defined as \({\mathcal {K}}(x(t), y(t)) = \exp \left( -\frac{\Vert x-y\Vert }{\sigma }\right) .\)
These kernels satisfy the property \({\mathcal {K}}(x(t),y(t))=\hat{{\mathcal {K}}}(x(t)-y(t))\), i.e. the kernel only depends on the difference between both elements.
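For discretized curves, the kernel matrix \(\textrm{K}\) can be computed by plugging a numerical approximation of the \(L^2\) norm into the RBF expression. A minimal sketch, assuming curves sampled on a common grid and the \(\exp (-\Vert x-y\Vert ^2/(2\sigma ^2))\) parameterization:

```python
import numpy as np

def functional_rbf_kernel(X, t, sigma):
    """N x N kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)),
    with the squared L2 norm approximated by the trapezoidal rule on
    the (possibly non-uniform) grid t."""
    w = np.diff(t)

    def sq_l2(f):                       # integral of f(t)^2 dt
        g = f ** 2
        return float(np.sum((g[1:] + g[:-1]) / 2 * w))

    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(i, N):
            K[i, j] = K[j, i] = np.exp(-sq_l2(X[i] - X[j])
                                       / (2 * sigma ** 2))
    return K

# Three toy curves: the first and third are nearly identical,
# so their kernel value should be larger than that of the first
# and second.
t = np.linspace(0, 1, 50)
X = np.array([np.sin(2 * np.pi * t),
              np.cos(2 * np.pi * t),
              np.sin(2 * np.pi * t) + 0.01])
K = functional_rbf_kernel(X, t, sigma=0.5)
```

By construction, the matrix is symmetric with unit diagonal, as required by the kernel properties listed in Sect. 3.1.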
Once the functional data graph \(\textrm{G}\) has been constructed, the algorithm performs the same operations as multivariate DM. The full methodology is presented in Algorithm 1. Furthermore, the Python implementation of the FDM method is available at https://github.com/mariabarrosoh/functional-diffusion-maps/.
To conclude, note that, since the FDM algorithm differs from that of DM only in the definition of the graph via a functional metric, all the properties derived from the probability matrix and from the diffusion distances are preserved. In particular, any theoretical result regarding the standard multivariate DM method in which the data are involved only through a kernel function can be immediately transferred to its functional version. This means that classical results, such as the ability to learn intrinsic coordinates or to modulate the influence of the data density through the \(\alpha \) parameter, are inherited by FDM.
4 Examples and simulation study
In this section we apply the FDM technique described above to synthetic and classic functional datasets, using the FDA utilities provided by the scikit-fda package (Ramos-Carreño et al. 2022), as well as our FDM implementation described above. In particular, we compare the performance of FDM against its multivariate version and we also evaluate the performance of FDM alongside other FDA techniques, including FPCA and Isomap, to determine its efficacy.
Both DM and FDM require an initial analysis to identify the suitable parameters to cluster the data or reveal the structure of the manifold where the data is supposed to be embedded. To achieve this, a grid search is performed in each experiment using the hyperparameters specified in Table 1.
The functional version of the Isomap method has been used in practical applications by directly discretizing the values (Herrmann and Scheipl 2020). This technique only requires setting the number of neighbors used to create the graph. To determine the optimal number of neighbors, we perform a grid search over the set \(\{5k: 1 \le k \le 5, k\in {\mathbb {N}}\}\).
4.1 Cauchy densities data
The purpose of this first experiment is to demonstrate that the straightforward application of Isomap and Diffusion Maps to functional data may not be sufficient when the functions are not observed on an equispaced grid and the metric employed is unsuitable.
To illustrate this assertion, we apply Isomap, DM and FDM to a synthetic example consisting of 50 Cauchy probability density functions with \(\gamma =1.0\), observed on a non-equispaced grid. In particular, we generate 25 Cauchy densities with amplitude 1.0 regularly centered in the interval \([-5,5]\) and 25 Cauchy densities with amplitude 1.5 regularly centered in the same interval. Each function is evaluated on 100 equally spaced points in each of the intervals \([-10,-5]\), \((-5,5)\), and \([5, 10]\), so that the point density in the middle interval is half that of the outer ones. The generated functions are displayed in Fig. 2.
After evaluating all possible parameter settings for Isomap, DM and FDM on the Cauchy densities, we found that 15 neighbors are optimal for the Isomap method. An RBF kernel with a scale parameter of \(\sigma = 0.6\) and a density parameter of \(\alpha = 1.0\) works better for the DM embedding, while an RBF kernel with \(\sigma = 0.1\) and \(\alpha = 0.0\) is suitable for the FDM technique.
Figure 3 shows the scatterplots of the multivariate and functional scores in the first two components for the Cauchy densities dataset.
Both the Isomap and DM scores exhibit a semi-circular shape, with the two function classes located in close proximity. In contrast, the FDM scores exhibit a complete separation between the function classes. Consequently, we can deduce that Isomap and multivariate DM may not be appropriate methods for analyzing non-equispaced functions, for which the functional version of DM provides an essential alternative. The main reason behind this discrepancy lies in the computation of distances: the \(L^2\) distance used by FDM weights each grid point by its spacing, whereas the multivariate methods treat all coordinates equally and thus overweight densely sampled regions.
4.2 Moons and Swiss roll data
After establishing the benefits of using FDM instead of DM for functional data, we aim to contrast FDM with FPCA, which is by far the most popular dimensionality reduction method in FDA, by evaluating their performance on the functional versions of the Moons and Swiss Roll datasets.
The Moons dataset shown in Sect. 3.1 is typically used to visualize clustering algorithms, while the Swiss Roll dataset is typically used to test dimensionality reduction techniques. Both are common examples where non-linearity in the data makes multivariate PCA perform poorly, and therefore manifold learning techniques are preferable.
Figure 4 shows, on the left, the multivariate Moons and Swiss Roll datasets generated without noise. To obtain the functional versions of these datasets, shown in the right panel of Fig. 4, the features of the multivariate data are used as the coefficients of a chosen functional basis. Specifically, we represent the functional Moons data using the following non-orthogonal basis functions:
while for the functional Swiss Roll data we instead use the non-orthogonal basis functions:
Since FPCA and FDM act on the basis function coordinates, we expect to obtain the same results as in the multivariate case, except for the effect of the inner product between the chosen basis functions.
After evaluating all possible parameter configurations (see Table 1), we found that an RBF kernel with a scale parameter of \(\sigma = 0.2\) and a density parameter of \(\alpha = 0.5\) works better for the Moons dataset, while an RBF kernel with \(\sigma =0.6\) and \(\alpha =1.0\) is more suitable for the Swiss Roll dataset.
The top panels of Figs. 5 and 6 depict the scatterplots of the FPCA and FDM scores in the first two components for the Moons and Swiss Roll datasets, respectively. Below them, the same figures also present the projection of the scores onto the first component. These visualizations enable us to discern whether clusters have been recognized or whether the underlying manifold has been “unrolled”.
The FPCA embeddings exhibit similarity to the original Moons and Swiss Roll data, except for rotation, sign, and scale adjustments. In contrast, the FDM embeddings reveal that the Moons data are entirely separated by the first component. Regarding the Swiss Roll data, the underlying two-dimensional manifold is exposed, and even in one dimension it can be seen that the color order is preserved by FDM.
4.3 Real datasets
In this last experiment we have applied the main functional dimensionality reduction techniques—FPCA, Isomap and FDM—to two real problems: the Phoneme and the Symbols datasets.
The Phoneme dataset, which consists of log-periodograms computed from continuous speech of male speakers for five distinct phonemes, was extracted from the TIMIT database (Garofolo et al. 1993), and it is widely used in speech recognition research. In this example, the phonemes are transcribed as follows: /sh/ (as in ‘she’), /dcl/ (as in ‘dark’), /iy/ (as the vowel in ‘she’), /aa/ (as the vowel in ‘dark’) and /ao/ (as the first vowel in ‘water’). For our experiment, due to computational cost, we take a random sample of 1500 log-periodogram curves of length 256 as a representative sample of the population. Table 2 shows the phoneme frequencies by class membership. The log-periodogram curves of the Phoneme dataset are characterized by high irregularity and noise at their endpoints. Hence, in order to prepare them for a dimensionality reduction analysis, a preprocessing step of trimming and smoothing is typically performed. To do this, we truncate the log-periodogram curves to a maximum length of 50. Figure 7 displays the resulting functional representation of the Phoneme dataset. Looking at the curves, we see higher similarities between the phonemes /aa/ and /ao/, as their sounds are very similar, and also between the last part of the curves of phonemes /iy/ and /sh/, since the phoneme /iy/ is contained at the end of the phoneme /sh/. On the other hand, the curves of the phoneme /dcl/ seem to be far away from the rest of the curves.
For the comparison, FPCA, Isomap and FDM were applied to the preprocessed dataset. In this experiment, both the Isomap and FDM techniques require an initial analysis to determine the most suitable parameters. After evaluating all possible parameter configurations, we found that an RBF kernel with \(\sigma = 1.0\) and \(\alpha = 1.0\) yields the best results for the FDM method, while for the Isomap method, using 15 neighbors provides suitable performance. In the left panel of Fig. 8, we display the FPCA, Isomap and FDM scores in the first two components for the Phoneme dataset. In the right panel of the same figure, we also present the histograms of the scores associated with the first component. The outcomes yielded by each technique offer different perspectives on phoneme similarities.
FPCA groups each phoneme, maintaining the similarities that we had observed in the trajectories. We can also observe in the FPCA histogram the large overlap between the first components of the phonemes, especially for /aa/ with /ao/ and /iy/ with /sh/. Isomap keeps clusters similar to those of FPCA, but without overlapping the /iy/ and /sh/ phonemes. Although a more effective separation of /iy/ and /sh/ is observed in the 2D embedding, there is a significant overlap between /dcl/ and /sh/ in the 1D embedding, despite the fact that they should be completely distinct. FDM embeds the different phoneme groups, ordering them from the vowel phonemes (/aa/, /ao/, /iy/) to the word phonemes (/sh/, /dcl/), in the same order as described. In addition, the similarity between the vowel of ‘dark’ (/aa/) and the first vowel of ‘water’ (/ao/) is maintained, with both appearing close in the embedding, as is the similarity between the vowel phoneme of ‘she’ (/iy/) and the phoneme (/sh/) corresponding to that word. We can also observe that the vowel phoneme of ‘she’ (/iy/) is more similar to the first vowel phoneme of ‘water’ (/ao/) than to the vowel phoneme of ‘dark’ (/aa/), information that can be seen in the curvature of both trajectories. These similarities may be because the phoneme /iy/ is more open than the phonemes /aa/ and /ao/; neither FPCA nor Isomap was able to capture this information.
Regarding the Symbols problem, this dataset comes from the UCR database (Dau et al. 2018). It is composed of the pictograms of three different symbols: s1, s2, and s3. Each of these symbols is registered separately for its x and y components, giving rise to the six different classes that can be appreciated in Fig. 9 (where the suffix indicates the component). Figure 10, in turn, displays the resulting functional representation.
Following the same methodology as in the previous experiment, FPCA, Isomap, and FDM have been applied to this dataset. The results, both the 2D embeddings and the score histograms, are depicted in Fig. 11.
As observed, FPCA perfectly separates the y component of the third symbol, which appears to be the most distinctive. However, in this projection there is overlap between s1 \(-\) x and s2 \(-\) x, and also between s1 \(-\) y and s2 \(-\) y. This is in accordance with what can be seen in Figs. 9 and 10, which show these two pairs to be very similar. The same effect can be appreciated in the second-row plots, which represent the Isomap embeddings. In this case, s3 \(-\) x is more separated from the other x components, but s1 \(-\) x and s2 \(-\) x appear again overlapped, as do s1 \(-\) y and s2 \(-\) y. Finally, the FDM method is able to unambiguously separate the x components of the three symbols. Moreover, they are arranged along the same line, reflecting their similarity. On the other hand, s3 \(-\) y once again appears perfectly separated from the other classes, due to its clearly different shape. Unfortunately, FDM is also unable to separate s1 \(-\) y and s2 \(-\) y, which are probably the most similar classes. Still, we can conclude that, for this last example, FDM seems to provide the clearest embedding, with the best separation among classes.
Finally, note that for both previous examples we have shown the best results for each method; for the sake of completeness, Fig. 12 includes the FDM embeddings for different hyperparameter configurations. The dependence shown resembles that of the multivariate DM method.
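One simple way to explore such hyperparameter configurations programmatically is to sweep \(\sigma\) and inspect a summary of the diffusion operator's spectrum. The sketch below uses the gap between the first two non-trivial eigenvalues as a heuristic; the paper does not specify its selection criterion, so this helper, its name, and the stand-in data are ours.

```python
import numpy as np

def spectral_gap(X, sigma, alpha=1.0):
    """Gap between the first two non-trivial eigenvalues of the
    diffusion operator built from the rows of X (heuristic sketch)."""
    # RBF kernel on pairwise squared distances.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    # alpha-normalization followed by row normalization.
    q = K.sum(axis=1)
    K = K / np.outer(q ** alpha, q ** alpha)
    P = K / K.sum(axis=1, keepdims=True)
    # Eigenvalues sorted in decreasing order; vals[0] is the trivial 1.
    vals = np.sort(np.linalg.eigvals(P).real)[::-1]
    return vals[1] - vals[2]

# Stand-in for a matrix of discretized curves (one curve per row).
X = np.random.RandomState(0).randn(60, 8)
for sigma in (0.5, 1.0, 2.0, 4.0):
    print(f"sigma={sigma}: gap={spectral_gap(X, sigma):.3f}")
```

A larger gap suggests that the leading diffusion coordinates dominate the embedding; in practice one would run such a sweep on the actual functional distance matrix rather than on raw vectors.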
5 Conclusions
Functional dimensionality reduction methods are statistical techniques that represent infinite-dimensional data in lower dimensional spaces, for example, by capturing most of the variance of the original data in the new space or by reflecting lower dimensional underlying manifolds where the original data lie. Following this second line of research, Diffusion Maps has been extended to the infinite-dimensional case.
As a result, functional diffusion maps emerges as a functional manifold learning technique that finds low-dimensional representations of \(L^2\) functions on non-linear functional manifolds. FDM performs the same operations as the DM algorithm once a functional similarity graph is created. Even though a geometric interpretation of similarity between functional data is sometimes not possible, FDM retains the advantages of multivariate DM: it is robust to noise perturbations, and it is a very flexible algorithm that allows fine-tuning of the parameters that shape the resulting embedding.
The performance of this method has been tested on simulated and real functional examples, and the results have been compared with those obtained from the multivariate DM, the linear FPCA method, and the non-linear Isomap technique. It should be noted that Isomap and multivariate DM cannot be applied directly to non-equispaced functions, as they may not effectively differentiate the particularities of the functions under study. FDM outperforms FPCA for functions lying on non-linear manifolds such as the Moons and the Swiss Roll examples, where FDM obtains a representation that maintains the original structure of the manifold. Besides being an advantage for these simulated examples, it also provides good results and allows new similarity interpretations for real examples such as the Phoneme dataset.
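The non-equispaced case can be handled by approximating the \(L^2\) distance between discretized functions with a quadrature rule that respects the observation grid. The following sketch uses the trapezoidal rule; the helper name and the example data are ours, not from the paper.

```python
import numpy as np

def l2_distance(f, g, grid):
    """Approximate L2 distance between two discretized functions observed
    on a common, possibly non-equispaced, grid (trapezoidal rule)."""
    diff2 = (f - g) ** 2
    dt = np.diff(grid)  # varying step sizes are handled naturally
    return float(np.sqrt(np.sum(0.5 * dt * (diff2[:-1] + diff2[1:]))))

# Non-equispaced sampling of [0, 1].
rng = np.random.RandomState(0)
t = np.concatenate(([0.0], np.sort(rng.uniform(0.0, 1.0, 200)), [1.0]))
f, g = np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)
print(l2_distance(f, g, t))  # close to the exact value, 1
```

Plugging such quadrature-based distances into the similarity graph is what allows the functional version of the algorithm to treat irregular grids that a plain Euclidean distance between sample vectors would misrepresent.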
Overall, we find that the proposed manifold FDM method is an interesting technique that can provide useful representations which are competitive with, and often superior to, some classical linear and non-linear representations for functional data.
Nevertheless, some work remains to be done. In particular, the distance between functions can be interpreted as an earth mover's distance between trajectories using a Besov space distance, which can be computed by expanding each function in a Haar basis in time and minimizing over its dual Hölder norm (Ankenman and Leeb 2018). It would also be interesting to extend the proposed method to vector-valued functions and to perform a thorough comparison study against other non-linear methods, including the one presented in Song and Li (2021).
References
Aneiros, G., Novo, S., Vieu, P.: Variable selection in functional regression models: a review. J. Multivar. Anal. 188, 104871 (2022). https://doi.org/10.1016/j.jmva.2021.104871. (50th Anniversary Jubilee Edition)
Ankenman, J., Leeb, W.: Mixed Hölder matrix discovery via wavelet shrinkage and Calderón-Zygmund decompositions. Appl. Comput. Harmon. Anal. 45(3), 551–596 (2018). https://doi.org/10.1016/j.acha.2017.01.003
Bengio, Y., Delalleau, O., Le Roux, N., Paiement, J.-F., Vincent, P., Ouimet, M.: Spectral Dimensionality Reduction. Springer, Canada (2006)
Berrendero, J.R., Cuevas, A., Torrecilla, J.L.: Variable selection in functional data classification: a maxima-hunting proposal. Stat. Sin. 619–638 (2016)
Berry, T., Harlim, J.: Variable bandwidth diffusion kernels. Appl. Comput. Harmon. Anal. 40(1), 68–96 (2016). https://doi.org/10.1016/j.acha.2015.01.001
Cayton, L.: Algorithms for manifold learning. Technical Report, University of California (2005)
Chen, D., Müller, H.-G.: Nonlinear manifold representations for functional data. Ann. Stat. (2012). https://doi.org/10.1214/11-aos936
Coifman, R.R., Lafon, S.: Diffusion maps. Appl. Comput. Harmon. Anal. 21(1), 5–30 (2006). https://doi.org/10.1016/j.acha.2006.04.006
Coifman, R.R., Lafon, S., Lee, A.B., Maggioni, M., Nadler, B., Warner, F., Zucker, S.W.: Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc. Natl. Acad. Sci. 102(21), 7426–7431 (2005). https://doi.org/10.1073/pnas.0500334102
Coifman, R.R., Kevrekidis, I.G., Lafon, S., Maggioni, M., Nadler, B.: Diffusion maps, reduction coordinates, and low dimensional representation of stochastic systems. Multiscale Model. Simul. 7(2), 842–864 (2008). https://doi.org/10.1137/070696325
Cox, M.A.A., Cox, T.F.: Multidimensional Scaling, pp. 315–347. Springer, Berlin, Heidelberg (2008). https://doi.org/10.1007/978-3-540-33037-0_14
Cuevas, A.: A partial overview of the theory of statistics with functional data. J. Stat. Plan. Inference 147, 1–23 (2014). https://doi.org/10.1016/j.jspi.2013.04.002
Dau, H.A., Keogh, E., Kamgar, K., Yeh, C.-C.M., Zhu, Y., Gharghabi, S., Ratanamahatana, C.A., Yanping, Hu, B., Begum, N., Bagnall, A., Mueen, A., Batista, G., Hexagon-ML: The UCR Time Series Classification Archive. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/ (2018)
De la Porte, J., Herbst, B., Hereman, W., Van Der Walt, S.: An introduction to diffusion maps. In: Proceedings of the 19th Symposium of the Pattern Recognition Association of South Africa (PRASA 2008), Cape Town, South Africa, pp. 15–25 (2008)
Delaigle, A., Hall, P.: Methodology and theory for partial least squares applied to functional data. Preprint at Statistics Theory (2012)
Delaigle, A., Hall, P., Bathia, N.: Componentwise classification and clustering of functional data. Biometrika (2012). https://doi.org/10.2307/41720693
Dov, D., Talmon, R., Cohen, I.: Audio-visual voice activity detection using diffusion maps. IEEE/ACM Trans. Audio Speech Lang. Process. 23(4), 732–745 (2015). https://doi.org/10.1109/TASLP.2015.2405481
Fernández, Á., González, A.M., Díaz, J., Dorronsoro, J.R.: Diffusion maps for the description of meteorological data. In: Corchado, E., Snášel, V., Abraham, A., Woźniak, M., Graña, M., Cho, S.-B. (eds.) Hybrid Artificial Intelligent Systems, pp. 276–287. Springer, Berlin, Heidelberg (2012)
Ferraty, F., Vieu, P.: Nonparametric Functional Data Analysis: Theory and Practice. Springer, New York (2006)
Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L.: DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM. NIST (1993)
Herrmann, M., Scheipl, F.: Unsupervised Functional Data Analysis via Nonlinear Dimension Reduction. arXiv (2020). https://doi.org/10.48550/ARXIV.2012.11987
Lederman, R.R., Talmon, R.: Learning the geometry of common latent variables using alternating-diffusion. Appl. Comput. Harmon. Anal. 44(3), 509–536 (2018). https://doi.org/10.1016/j.acha.2015.09.002
Lee, J.M.: Smooth Manifolds, pp. 1–31. Springer, New York (2012). https://doi.org/10.1007/978-1-4419-9982-5_1
Li, B., Song, J.: Nonlinear sufficient dimension reduction for functional data. Ann. Stat. 45(3), 1059–1095 (2017). https://doi.org/10.1214/16-AOS1475
Lian, W., Talmon, R., Zaveri, H., Carin, L., Coifman, R.: Multivariate time-series analysis and diffusion maps. Signal Process. 116, 13–28 (2015). https://doi.org/10.1016/j.sigpro.2015.04.003
Maggioni, M., Murphy, J.M.: Learning by unsupervised nonlinear diffusion. J. Mach. Learn. Res. 20, 1–56 (2019)
Marshall, N.F., Hirn, M.J.: Time coupled diffusion maps. Appl. Comput. Harmon. Anal. 45(3), 709–728 (2018). https://doi.org/10.1016/j.acha.2017.11.003
Matsui, H., Konishi, S.: Variable selection for functional regression models via the \(L_1\) regularization. Comput. Stat. Data Anal. 55, 3304–3310 (2011). https://doi.org/10.1016/j.csda.2011.06.016
McKeague, I.W., Sen, B.: Fractals with point impact in functional linear regression. Ann. Stat. 38(4), 2559–2586 (2010). https://doi.org/10.1214/10-AOS791
Nadler, B., Lafon, S., Coifman, R.R., Kevrekidis, I.G.: Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Appl. Comput. Harmon. Anal. 21(1), 113–127 (2006). https://doi.org/10.1016/j.acha.2005.07.004. (Special Issue: Diffusion Maps and Wavelets)
Ramos-Carreño, C., Torrecilla, J.L., Carbajo-Berrocal, M., Marcos, P., Suárez, A.: scikit-fda: a Python package for functional data analysis. To appear in Journal of Statistical Software. Preprint at arXiv:2211.02566 (2022)
Ramsay, J.O., Silverman, B.W.: Functional Data Analysis. Springer, New York (2005)
Riesz, F., Nagy, B.S.: Functional Analysis. Courier Corporation, Chelmsford (2012)
Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)
Singer, A., Coifman, R.R.: Non-linear independent component analysis with diffusion maps. Appl. Comput. Harmon. Anal. 25(2), 226–239 (2008). https://doi.org/10.1016/j.acha.2007.11.001
Song, J., Li, B.: Nonlinear and additive principal component analysis for functional data. J. Multivar. Anal. 181, 104675 (2021). https://doi.org/10.1016/j.jmva.2020.104675
Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000). https://doi.org/10.1126/science.290.5500.2319
Vieu, P.: On dimension reduction models for functional data. Stat. Probab. Lett. 136, 134–138 (2018). https://doi.org/10.1016/j.spl.2018.02.032. (The role of Statistics in the era of big data)
Virta, J., Li, B., Nordhausen, K., Oja, H.: Independent component analysis for multivariate functional data. J. Multivar. Anal. 176, 104568 (2020). https://doi.org/10.1016/j.jmva.2019.104568
Wang, J.-L., Chiou, J.-M., Müller, H.-G.: Functional data analysis. Annu. Rev. Stat. Its Appl. 3, 257–295 (2016)
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Barroso, M., Alaíz, C.M., Torrecilla, J.L. et al. Functional diffusion maps. Stat Comput 34, 22 (2024). https://doi.org/10.1007/s11222-023-10332-1