1 Introduction

In many real-world applications, data must be considered as discretized functions rather than standard vectors, since the process that generates them is assumed to be continuous. Natural examples are found in medicine, economics, speech recognition or meteorology. This type of data, commonly called functional data, is usually represented by high-dimensional vectors whose coordinates are highly correlated, so multivariate methods lead to ill-conditioned problems. As a consequence, most traditional data analysis tools, such as regression, classification or clustering, as well as some dimensionality reduction techniques, are being adapted to functional inputs.

Functional data analysis (FDA; Vieu 2018; Cuevas 2014) is an active field of statistics in which the data of interest are often considered as realizations of stochastic processes over a compact domain. Traditionally, FDA focused on linear dimensionality reduction methods, with special attention given to Functional Principal Component Analysis (FPCA; Ramsay and Silverman 2005) as an unsupervised technique that captures most of the variance of the original data in a lower-dimensional space. In particular, in this method the infinite-dimensional random functions are projected onto the lower dimensional subspace generated by the eigenfunctions of the covariance operator. Moreover, when functional data are expressed as a linear combination of known basis functions, the eigenequation problem can be reduced to a discrete form expressing the eigenfunctions in the same known basis (Riesz and Nagy 2012).

Other examples of popular linear algorithms that have been transferred to FDA include Independent Component Analysis (ICA; Virta et al. 2020), which attempts to project data onto independent components, and Partial Least Squares (FPLS; Delaigle and Hall 2012), which aims to identify the underlying latent variables in both the response and predictor variables.

An extreme case of linear projection is given by variable selection methods, which reduce trajectories to their values at a few impact points. In recent years, a multitude of proposals for functional variable selection have emerged, notably methods based on the regularization of linear regression models such as Lasso or Dantzig (Aneiros et al. 2022; Matsui and Konishi 2011). Other proposals are based on logistic regression (McKeague and Sen 2010), wrapper methods (Delaigle et al. 2012), or the selection of local maxima of dependence measures (Berrendero et al. 2016). While all of these proposals have notable strengths, they are inadequate for addressing non-linear problems.

A first step to alleviate this deficiency is the extension of linear dimensionality reduction techniques in FDA to a non-linear context. This is the case of NAFPCA (Song and Li 2021), a method proposed as a generalization of FPCA via two additively nested Hilbert spaces of functions. The first space characterizes the functional nature of the data, and the second space captures the nonlinear dependence. This formulation is based on the estimator proposed in Li and Song (2017) in a context of sufficient dimension reduction. This method has the advantage of being suitable for vector-valued functional data.

From a different perspective, multivariate statistics offers several non-linear dimensionality reduction techniques based on the assumption that high-dimensional data observed in a D-dimensional space \({\mathbb {R}}^D\) actually lie on (or close to) a d-dimensional manifold \({\mathcal {M}}\subset {\mathbb {R}}^D\), with \(d < D\) (Cayton 2005). These techniques can be used for discovering high-density low-dimensional surfaces in unsupervised problems, or as a preprocessing step before applying supervised models, and are very useful, for example, for data analysis and visualization (Bengio et al. 2006). Their goal is to achieve a new representation that captures the structure of the data in a few dimensions while preserving the original local information. To do this, most of these methods rely on the spectral analysis of the similarity matrix of a graph previously constructed over the original data. Another important characteristic of these methods is that they may arrive at a new space where the Euclidean distance between embedded points reflects the information preserved from the original space.

Isometric feature mapping (Isomap; Tenenbaum et al. 2000), t-distributed Stochastic Neighbor Embedding (t-SNE; Cox and Cox 2008), Locally Linear Embedding (LLE; Roweis and Saul 2000) and Diffusion Maps (DM; Coifman and Lafon 2006; Coifman et al. 2005; De la Porte et al. 2008) are some of the main manifold learning methods used for dimensionality reduction of multivariate data. They all take advantage of the local linearity of manifolds and create a mapping that preserves local neighborhoods at each point of the underlying manifold.

Out of all of these techniques, only Isomap (Chen and Müller 2012) has been extended to the functional setting. This method extends classical metric multidimensional scaling and is characterized by the use of geodesic distances, induced by a neighborhood graph, to estimate the intrinsic geometry of a data manifold. Although Isomap is often easy to visualize and interpret, making it useful for exploratory data analysis and visualization, the computation of geodesic distances can be computationally expensive. Therefore, there is a need to extend other non-linear dimensionality reduction methods to the functional context that deal with these difficulties. A promising approach is the use of Diffusion Maps, which can handle more complex data with multiple scales and is much more robust against noise and outliers.

The current paper specifically focuses on extending this manifold learning technique to the FDA context by proposing a new method: functional diffusion maps (FDM). Like DM, which has been successfully applied in a variety of contexts, including multivariate datasets (Fernández et al. 2012; Dov et al. 2015), time-series problems (Marshall and Hirn 2018; Lian et al. 2015), the analysis of complex dynamical systems (Nadler et al. 2006), and the low-dimensional representation of stochastic systems (Coifman et al. 2008), FDM has the potential to deliver competitive results in functional data applications, particularly in cases where the data are distributed on specific manifolds. The main contribution of this research can thus be summarized as extending DM to the functional domain, offering a formalization of DM for functions and comparing FDM with DM, FPCA and Isomap in simulated and real examples.

The rest of the paper is organized as follows: Sect. 2 presents some ideas and definitions from FDA that will be useful for the subsequent discussion in this article, paying special attention to functional manifolds, which will be the main assumption in DM. Section 3 formalizes the FDM approach and Sect. 4 illustrates and compares the performance of DM, FPCA, Isomap and FDM on synthetic and real datasets. Finally, Sect. 5 presents some conclusions of this work, as well as possible research lines to further improve it.

2 Functional data analysis

In this work, we consider the typical case in Functional Data Analysis (Ramsay and Silverman 2005; Ferraty and Vieu 2006) where we have a sample of functions \(x_1(t), \dotsc , x_N(t)\), where \(t\in {\mathcal {J}}\), \({\mathcal {J}}\) is a compact interval, and each \(x_i(t)\) is the observation of an independent functional variable \(X_i\) identically distributed as X. It is usual to assume that the functional variable X is a second order stochastic process, \(\textrm{E}[X^2]<\infty \), taking values in the Hilbert space of square integrable functions \(L^2([a,b])\) defined on the closed interval \([a,b] \subset {\mathbb {R}}\). Square integrable functions form a vector space, and we can define the inner product of two functions by \(\langle x,y \rangle = \int _{a}^b x(t) y(t) dt\). The inner product allows us to introduce the notion of distance between functions through the \(L^2\)-norm \(\Vert x\Vert ^2 = \langle x,x \rangle = \int _{a}^b x^2(t)dt\). Therefore, in \(L^2\), the distance between two functions can be computed as the norm of their difference, \(\Vert x-y\Vert \).
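In practice, these quantities are approximated numerically from discretized curves. The following minimal sketch (an illustrative approximation using the trapezoidal rule on a toy grid, not part of the original formulation) computes the \(L^2\) inner product and distance:

```python
import numpy as np

# Illustrative sketch: approximate the L2 inner product and distance between
# two curves observed on a common (possibly non-equispaced) grid t,
# using the trapezoidal rule.
def l2_inner(x, y, t):
    integrand = x * y
    return np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(t))

def l2_distance(x, y, t):
    return np.sqrt(l2_inner(x - y, x - y, t))

t = np.linspace(0, 1, 200)
x, y = np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)
print(l2_inner(x, y, t))     # close to 0: sin and cos are orthogonal on [0, 1]
print(l2_distance(x, y, t))  # close to 1: ||sin - cos||_{L2([0,1])} = 1
```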

2.1 Functional data representation

Although each function \(x_i\) consists of infinitely many values \(x_i(t)\) with \(t \in [a,b]\), in practice we only observe a finite set of function values at arguments that are not necessarily equidistant or equal for all functions (Wang et al. 2016). A main difficulty when representing functional data is that we cannot directly recover the underlying function, so we need to work with it through one of two approaches: using the discrete data directly, or transforming them into a function by means of a basis expansion. We briefly sketch both approaches below (for a thorough explanation the reader can refer to Ramsay and Silverman (2005)).

The first approach consists of directly using the observed values of the functions and does not require any additional adjustment or assumption. Let \(\{x_i(t)\}_{i=1}^N\) be the N underlying functions from which we have obtained our data \(\textrm{x}_i = ( x_i(t_1), \dotsc ,x_i(t_M))^{\top }\); we collect the data in the \(N \times M\) matrix

$$\begin{aligned} \textrm{X} = \begin{pmatrix} \textrm{x}_1^{\top }\\ \textrm{x}_2^{\top }\\ \vdots \\ \textrm{x}_N^{\top } \end{pmatrix} = \begin{pmatrix} x_1(t_1) & x_1(t_2) & \cdots & x_1(t_M) \\ x_2(t_1) & x_2(t_2) & \cdots & x_2(t_M) \\ \vdots & \vdots & \ddots & \vdots \\ x_N(t_1) & x_N(t_2) & \cdots & x_N(t_M) \end{pmatrix}. \end{aligned}$$
(1)

The other possible representation is based on a basis function expansion and consists of representing each function \(x_i(t)\) using a basis of \(L^2\), such that

$$\begin{aligned} x_i(t) = \sum _{k=1}^\infty c_{ik} \phi _k(t), \end{aligned}$$

where \(\phi _k(t)\) are the basis functions and \(c_{ik}\) are the coefficients that determine the weight of each basis function in the representation of \(x_i(t)\).

One way to approximate the function is by truncating the basis to the first K elements. This results in smoothing and dimensionality reduction, such that,

$$\begin{aligned} x_i(t) \approx \sum _{k=1}^K c_{ik}\phi _k(t) = \textrm{c}_i^{\top } \upphi (t), \end{aligned}$$

where \(\upphi (t) = (\phi _1(t), \dotsc , \phi _K(t))^{\top }\) is the vector containing the functions of the truncated basis and \(\textrm{c}_i = (c_{i1}, \dotsc , c_{iK})^{\top }\) are the coefficients in the new basis. In this way, the set of N functional data evaluated at time t can be expressed as follows:

$$\begin{aligned} \textrm{X}(t) = \textrm{C} \upphi (t)&= \begin{pmatrix} \textrm{c}_1^{\top } \\ \textrm{c}_2^{\top } \\ \vdots \\ \textrm{c}_N^{\top } \end{pmatrix} \begin{pmatrix} \phi _1(t) \\ \phi _2(t) \\ \vdots \\ \phi _K(t) \end{pmatrix}\nonumber \\&= \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1K} \\ c_{21} & c_{22} & \cdots & c_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ c_{N1} & c_{N2} & \cdots & c_{NK} \end{pmatrix} \begin{pmatrix} \phi _1(t) \\ \phi _2(t) \\ \vdots \\ \phi _K(t) \end{pmatrix}, \end{aligned}$$
(2)

where \(\textrm{C}\) is an \(N\times K\) matrix.

There are several possible bases, such as the Fourier basis, wavelets, B-splines, polynomial bases, etc. Depending on the data, the choice of basis can be key to capturing as much information as possible about the underlying function. For example, if the data are periodic, a Fourier basis will be of interest, whereas if they are not, a B-spline basis is a common choice.
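As an illustration of Eq. (2), the following sketch fits a truncated Fourier basis to discretized curves by least squares; the toy data and the basis size K are arbitrary choices made only for the example (in practice the scikit-fda package used in Sect. 4 provides basis representations directly):

```python
import numpy as np

a, b = 0.0, 1.0
M, N, K = 100, 5, 5                      # grid points, curves, basis size
t = np.linspace(a, b, M)

# Toy sample of curves (placeholder for real functional observations).
rng = np.random.default_rng(0)
X = np.sin(2 * np.pi * np.outer(rng.uniform(1, 3, N), t)) \
    + 0.1 * rng.standard_normal((N, M))

# Truncated Fourier basis evaluated on the grid: Phi is K x M.
def fourier_basis(t, K, period=b - a):
    w = 2 * np.pi / period
    rows = [np.ones_like(t)]
    k = 1
    while len(rows) < K:
        rows.append(np.sin(k * w * t))
        if len(rows) < K:
            rows.append(np.cos(k * w * t))
        k += 1
    return np.vstack(rows)

Phi = fourier_basis(t, K)                             # K x M
# Least-squares coefficients: solve Phi^T c_i = x_i for each curve (C is N x K).
C = np.linalg.lstsq(Phi.T, X.T, rcond=None)[0].T
X_hat = C @ Phi                                       # smoothed reconstruction, N x M
```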

2.2 Functional manifolds

A differential manifold \({\mathcal {M}}\) is a topological space which is locally diffeomorphic to \({\mathbb {R}}^d\) (Lee 2012). These manifolds are expressed in terms of an atlas consisting of a set of charts \(\{(U_\alpha , \varphi _\alpha )\}\), where \(U_\alpha \) are open sets covering \({\mathcal {M}}\), and the coordinate maps \(\varphi _\alpha : U_\alpha \rightarrow {\mathbb {R}}^d\), are diffeomorphisms from \(U_\alpha \) to open sets of \({\mathbb {R}}^d\). These coordinate maps are commonly used to describe the geometry of the manifold. In the context of non-linear dimensionality reduction, it is usual to consider “simple” differential manifolds which are locally isometric to \({\mathbb {R}}^d\). These manifolds are characterized by a single chart \((U, \varphi )\), where U is an open set covering \({\mathcal {M}}\), and the coordinate map \(\varphi : U\rightarrow {\mathbb {R}}^d\) is an isometry from U to an open set of \({\mathbb {R}}^d\). A typical example of a “simple” manifold is the well-known Swiss roll manifold depicted in Fig. 4.

In line with Chen and Müller (2012) and in accordance with standard practices in FDA, our research will focus on “simple” manifolds \({\mathcal {M}}\subset L^2\). We will refer to them as functional manifolds. These manifolds are naturally equipped with the \(L^2\) distance, although as in the case of multivariate data, this metric may not be the most appropriate one. Hence, we will use manifold learning methods to approximate \(\varphi \) and understand the inherent structure of the functional data.

3 Functional diffusion maps

In this work, we propose a new Functional Manifold Learning technique, which we call functional diffusion maps, that aims to find low-dimensional representations of \(L^2\) functions on non-linear functional manifolds. It first builds a weighted graph based on pairwise similarities between functional data and then, just like Diffusion Maps does, uses a diffusion process on the normalized graph to reveal the overall geometric structure of the data at different scales.

In the following, the mathematical framework for DM is detailed, and then a generalization for functional data is proposed.

Fig. 1 A comparison between the \(L^2\) distance and the diffusion distance over the Moons dataset

3.1 Mathematical framework for diffusion maps

Diffusion Maps is a non-linear dimensionality reduction algorithm introduced by Coifman and Lafon (2006) which focuses on discovering the underlying manifold from which multivariate data have been sampled. Since it was first proposed, the theoretical framework behind DM has laid the basis for many other studies and proposals (Singer and Coifman 2008; Berry and Harlim 2016; Lederman and Talmon 2018; Maggioni and Murphy 2019).

In particular, DM computes a family of embeddings of a dataset into a low-dimensional Euclidean space whose coordinates are computed from the eigenvectors and eigenvalues of a diffusion operator on the data. In more detail, let \({\mathcal {X}}=\{\textrm{x}_{1}, \dotsc ,\textrm{x}_{\textrm{N}}\}\) be our multivariate dataset, lying on a manifold \({\mathcal {M}} \subset {\mathbb {R}}^D\). A random walk is defined on the data, so that the connectivity or similarity between two data points \(\textrm{x}\) and \(\textrm{y}\) can be used as the probability of walking from \(\textrm{x}\) to \(\textrm{y}\) in one step of the random walk. Usually, this probability is specified in terms of a kernel function of the two points, \(k:{\mathbb {R}}^D\times {\mathbb {R}}^D\rightarrow {\mathbb {R}}\), with the following properties:

  • k is symmetric: \(k(\textrm{x},\textrm{y}) = k (\textrm{y},\textrm{x})\);

  • k is positivity preserving: \(k(\textrm{x}, \textrm{y})\ge 0\) \(\forall \textrm{x},\textrm{y}\in {\mathcal {X}}\).

Let \(\textrm{G} = ({\mathcal {X}}, \textrm{K})\) be a weighted graph where \(\textrm{K}\) is an \(N\times N\) kernel matrix whose components are \(k_{ij} = k(\textrm{x}_i,\textrm{x}_j)\). This graph is connected and captures some geometric features of interest in the dataset. Now, to build a random walk, \(\textrm{G}\) has to be normalized.

Since the graph depends on both the geometry of the underlying manifold and the data distribution on the manifold, in order to control the influence of the latter, DM uses a density parameter \(\alpha \in [0,1]\) in the normalization step. The normalized graph \(\textrm{G}'=({\mathcal {X}}, \textrm{K}^{(\alpha )})\) uses the weight matrix \(\textrm{K}^{(\alpha )}\) obtained by normalizing \(\textrm{K}\):

$$\begin{aligned} k_{ij}^{(\alpha )} = \frac{k_{ij}}{d_i^\alpha d_j^\alpha }, \end{aligned}$$

where \(d_i = \sum _{j=1}^N k_{ij}\) is the degree of node i in the graph, and the power \(d_i^\alpha \) approximates the density of each vertex.

Now, we can create a Markov chain on the normalized graph whose transition matrix \(\textrm{P}\) is defined by

$$\begin{aligned} p_{ij} = \frac{k_{ij}^{(\alpha )}}{d_i^{(\alpha )}}, \end{aligned}$$

where \(d_i^{(\alpha )}\) is the degree of the normalized graph,

$$\begin{aligned} d_i^{(\alpha )} = \sum _j k_{ij}^{(\alpha )} = \sum _j \frac{k_{ij}}{d_i^\alpha d_j^\alpha }. \end{aligned}$$

Each component \(p_{ij}\) of the transition matrix \(\textrm{P}\) gives the probability of going from node i to node j in a single step. By taking powers of \(\textrm{P}\) we increase the number of steps of the random walk, so the components of \(\textrm{P}^T\), denoted \(p_{ij}^T\), represent the sum of the probabilities of all paths of length T from i to j. This defines a diffusion process that reveals the global geometric structure of \({\mathcal {X}}\) at different scales.

This Markov chain has a stationary distribution, given by

$$\begin{aligned} \pi _i = \frac{d_i^{(\alpha )}}{\sum _k d_k^{(\alpha )}}, \end{aligned}$$

that satisfies the stationary property as follows:

$$\begin{aligned} \sum _{i=1}^N \pi _i p_{ij} = \sum _{i=1}^N \frac{d_i^{(\alpha )}}{\sum _k d_k^{(\alpha )}} \frac{k_{ij}^{(\alpha )}}{d_i^{(\alpha )}} = \sum _{i=1}^N \frac{k_{ij}^{(\alpha )}}{\sum _k d_k^{(\alpha )}} = \frac{d_j^{(\alpha )}}{\sum _k d_k^{(\alpha )}} = \pi _j. \end{aligned}$$

Since the graph is connected, the chain is irreducible and the stationary distribution is unique. Moreover, since \({\mathcal {X}}\) is finite and the kernel assigns positive weight to self-loops, the chain is ergodic. Another property of the Markov chain is that it is reversible, satisfying

$$\begin{aligned} \pi _i p_{ij} = \frac{d_i^{(\alpha )}}{\sum _k d_k^{(\alpha )}} \frac{k_{ij}^{(\alpha )}}{d_i^{(\alpha )}} = \frac{k_{ij}^{(\alpha )}}{\sum _k d_k^{(\alpha )}} = \frac{k_{ji}^{(\alpha )}}{\sum _k d_k^{(\alpha )}} = \frac{d_j^{(\alpha )}}{\sum _k d_k^{(\alpha )}} \frac{k_{ji}^{(\alpha )}}{d_j^{(\alpha )}} = \pi _j p_{ji}. \end{aligned}$$
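The construction above can be summarized in a few lines of code. The following sketch (with an arbitrary toy dataset, an RBF kernel and \(\sigma =1\), \(\alpha =1\) as illustrative assumptions) builds the \(\alpha \)-normalized kernel, the transition matrix \(\textrm{P}\) and the stationary distribution \(\pi \), and checks the stationarity and reversibility properties numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))                 # toy multivariate dataset

# RBF kernel matrix k_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
sigma = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

# alpha-normalization and random-walk transition matrix.
alpha = 1.0
d = K.sum(axis=1)                                # degrees d_i
K_alpha = K / np.outer(d**alpha, d**alpha)       # k_ij^(alpha)
d_alpha = K_alpha.sum(axis=1)                    # degrees of the normalized graph
P = K_alpha / d_alpha[:, None]                   # p_ij = k_ij^(alpha) / d_i^(alpha)

# Stationary distribution and numerical checks of its properties.
pi = d_alpha / d_alpha.sum()
assert np.allclose(pi @ P, pi)                             # stationarity: pi P = pi
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)   # reversibility
```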

Taking into account all these properties, we can now define a diffusion distance \(\textrm{D}_T\) based on the geometrical structure of the obtained diffusion process. The metric measures the similarities between data as the connectivity or probability of transition between them, such that

$$\begin{aligned} \textrm{D}_T^2 (\textrm{x}_i,\textrm{x}_j) = \Vert p_{i\cdot }^T - p_{j\cdot }^T\Vert ^2_{\textrm{L}^2(\frac{1}{\pi })} =\sum _k \frac{\left( p_{ik}^T - p_{jk}^T\right) ^2}{\pi _k}, \end{aligned}$$
(3)

where \(\Vert \cdot \Vert _{L^2(\frac{1}{\pi })}\) denotes the Euclidean norm weighted by the reciprocal of the stationary distribution \(\pi \), which takes into account the local data point density. This metric is also robust to noise perturbation, since it is defined as an average over all paths of length T. Therefore, T plays the role of a scale parameter, and \(\textrm{D}_T^2 (\textrm{x}_i,\textrm{x}_j)\) will be small if there exist many paths of length T connecting \(\textrm{x}_i\) and \(\textrm{x}_j\).

Figure 1 shows a visual example of the diffusion distances and the \(L^2\) distances over the classic Moons dataset, which consists of two interleaved half circles. A spectral color bar has been used, with warm colors indicating nearness between the data and cold colors representing remoteness.

Clearly, Moons data live on a non-linear manifold. In the picture, the distances between the point marked with a star and the rest are displayed. It can be seen that the diffusion distance approximates the distance on the manifold, so that points on the same moon are closer to one another than to points on the other moon. In contrast, \(L^2\) distances are independent of the moons and are not convenient when data live on a non-linear manifold such as this one.

Spectral analysis of the Markov chain allows us to consider an alternative formulation of the diffusion distance that uses eigenvalues and eigenvectors of \(\textrm{P}\).

As detailed in Coifman and Lafon (2006), even though \(\textrm{P}\) is not symmetric, it makes sense to perform its spectral decomposition using its left and right eigenvectors, such that

$$\begin{aligned} p_{ij}=\sum _{l\ge 0} \lambda _l (\psi _l)_i (\varphi _l)_j. \end{aligned}$$

The eigenvalue \(\lambda _0=1\) and its eigenvector are discarded, since \(\psi _0\) is a constant vector with all entries equal to one and therefore carries no information about the data. The remaining eigenvalues \(\lambda _1, \lambda _2, \dotsc ,\lambda _{N-1}\) decay towards 0 and satisfy \(\lambda _l<1\) for all \(l \ge 1\). Now, the diffusion distance can be rewritten using this representation of \(\textrm{P}\):

$$\begin{aligned} \textrm{D}_T^2 (\textrm{x}_i,\textrm{x}_j)&= \sum _k \frac{1}{\pi _k} \left( \sum _{l\ge 1}\lambda _l^T (\psi _l)_i (\varphi _l)_k - \sum _{l\ge 1}\lambda _l^T (\psi _l)_j (\varphi _l)_k\right) ^2 \\&= \sum _{l\ge 1}\lambda _l^{2T}\left( (\psi _l)_i - (\psi _l)_j\right) ^2 \sum _k \frac{(\varphi _l)_k^2}{\pi _k} \\&= \sum _{l\ge 1}\lambda _l^{2T} \left( (\psi _l)_i - (\psi _l)_j\right) ^2 \sum _k \frac{\left( (\psi _l)_k\, \pi _k\right) ^2}{\pi _k} \\&= \sum _{l\ge 1}\lambda _l^{2T} \left( (\psi _l)_i - (\psi _l)_j\right) ^2, \end{aligned}$$

where the cross terms vanish by the orthogonality \(\sum _k (\varphi _l)_k (\varphi _m)_k / \pi _k = \delta _{lm}\), and the last two equalities use the relation \((\varphi _l)_k = (\psi _l)_k \pi _k\) together with the normalization \(\sum _k (\psi _l)_k^2 \pi _k = 1\).

Since \(\lambda _l\rightarrow 0\) when l grows, the diffusion distance can be approximated using the first L eigenvalues and eigenvectors. One possibility is to select L such that

$$\begin{aligned} L = \max \{l:\lambda _l^T>\delta \lambda _1^T\}, \end{aligned}$$

where \(\delta \) is a parameter that controls the precision of the approximation. Thus, the diffusion distance truncated at L components is

$$\begin{aligned} \textrm{D}_T^2(\textrm{x}_i, \textrm{x}_j) \approx \sum _{l=1}^{L}\lambda _l^{2T} \left( (\psi _l)_i - (\psi _l)_j\right) ^2. \end{aligned}$$

Finally, the diffusion map \(\Psi _T: {\mathcal {X}}\rightarrow {\mathbb {R}}^L\) is defined as

$$\begin{aligned} \Psi _T(\textrm{x}_i) = \begin{pmatrix} \lambda _1^T \psi _1(\textrm{x}_i) \\ \lambda _2^T \psi _2(\textrm{x}_i) \\ \vdots \\ \lambda _{L}^T \psi _{L}(\textrm{x}_i) \end{pmatrix}. \end{aligned}$$
(4)

With this definition, the family of diffusion maps \(\{\Psi _T\}_{T\in {\mathbb {N}}}\) projects the data points into the Euclidean space \({\mathbb {R}}^{L}\) in such a way that the diffusion distance in the original space is approximated by the Euclidean distance between the \(\Psi _T\) projections in \({\mathbb {R}}^{L}\),

$$\begin{aligned} \textrm{D}_T^2(\textrm{x}_i, \textrm{x}_j) \approx \sum _{l=1}^{L}\lambda _l^{2T} \left( (\psi _l)_i - (\psi _l)_j\right) ^2 = \Vert \Psi _T(\textrm{x}_i) - \Psi _T(\textrm{x}_j)\Vert ^2, \end{aligned}$$

preserving the local geometry of the original space.
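To make this equivalence concrete, the following self-contained sketch (reusing the same toy data and hyperparameters as in the previous sketch, which are purely illustrative) computes the diffusion coordinates from the spectral decomposition of \(\textrm{P}\) through its symmetric conjugate, and verifies numerically that the Euclidean distances between embedded points match the diffusion distances of Eq. (3):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))                 # same toy dataset as above

sigma, alpha, T = 1.0, 1.0, 2
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))
d = K.sum(axis=1)
K_alpha = K / np.outer(d**alpha, d**alpha)
d_alpha = K_alpha.sum(axis=1)
P = K_alpha / d_alpha[:, None]
pi = d_alpha / d_alpha.sum()

# Spectral decomposition through the symmetric conjugate of P (numerically stable).
S = K_alpha / np.outer(np.sqrt(d_alpha), np.sqrt(d_alpha))
lambdas, V = np.linalg.eigh(S)                   # ascending eigenvalues
order = np.argsort(lambdas)[::-1]
lambdas, V = lambdas[order], V[:, order]
psi = V / np.sqrt(pi)[:, None]                   # right eigenvectors of P; psi_0 is constant

# Diffusion map of Eq. (4) with L components (trivial eigenpair discarded).
L = 2
embedding = lambdas[1:L + 1] ** T * psi[:, 1:L + 1]

# Check: Euclidean distances in the full embedding reproduce Eq. (3).
P_T = np.linalg.matrix_power(P, T)
full = lambdas[1:] ** T * psi[:, 1:]
i, j = 0, 1
D2_direct = np.sum((P_T[i] - P_T[j]) ** 2 / pi)
D2_spectral = np.sum((full[i] - full[j]) ** 2)
assert np.allclose(D2_direct, D2_spectral)
```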

3.2 Extending diffusion maps to functional data

Let X be a centered square integrable functional variable of the Hilbert space \(L^2([a,b])\), where \([a,b]\) is a compact interval. Let \({\mathcal {X}} = \{x_1(t), \dotsc , x_N(t)\}\) be the observations of N independent functional variables \(X_1, \dotsc , X_N\) identically distributed as X. We assume that \({\mathcal {X}}\) lies on a functional manifold \({\mathcal {M}}\subset L^2([a,b])\).

Before defining a diffusion process over \({\mathcal {X}}\), a random walk over a weighted graph has to be built. Graph vertices are the functions \(x_i(t) \in {\mathcal {X}}\), and weights \(k_{ij}\) are given by a symmetric positive-definite kernel operator \({\mathcal {K}}:L^2([a,b])\times L^2([a,b])\rightarrow {\mathbb {R}}\), so that the weight between two vertices \(x_i(t)\) and \(x_j(t)\) is \(k_{ij} = {\mathcal {K}}(x_i(t), x_j(t))\). Thus, the weighted graph is \(\textrm{G}=({\mathcal {X}}, \textrm{K})\), where \(\textrm{K}\) is the \(N\times N\) kernel matrix resulting from evaluating \({\mathcal {K}}\) on \({\mathcal {X}}\). The kernel operator defines a local measure of similarity within a certain neighborhood, outside of which the function quickly decays to zero. The standard kernel used to compute the similarity between functional data is the Radial Basis Function (RBF) kernel, defined as:

$$\begin{aligned} {\mathcal {K}}(x_i(t), x_j(t)) = \exp {\frac{-\Vert x_i(t)-x_j(t) \Vert ^2_{L^2}}{2\sigma ^2}}, \end{aligned}$$

where the size of the local neighborhood is determined by the hyperparameter \(\sigma \). Another classical option is the Laplacian kernel, which is defined as:

$$\begin{aligned} {\mathcal {K}}(x_i(t), x_j(t)) = \exp {\frac{-\Vert x_i(t)-x_j(t)\Vert _{L^1}}{\sigma ^2}}. \end{aligned}$$

These kernels satisfy the property \({\mathcal {K}}(x(t),y(t))=\hat{{\mathcal {K}}}(x(t)-y(t))\), i.e. the kernel only depends on the difference between both elements.
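For discretized curves, the \(L^2\) distances in these kernels can be approximated by numerical quadrature. The following sketch (an illustrative implementation over a toy sample; the trapezoidal rule and the value of \(\sigma \) are assumptions of the example) computes the functional RBF kernel matrix:

```python
import numpy as np

def functional_rbf_kernel(curves, t, sigma):
    """RBF kernel matrix for curves sampled on a common grid t (trapezoidal L2)."""
    weights = np.zeros_like(t)
    weights[:-1] += np.diff(t) / 2
    weights[1:] += np.diff(t) / 2                # trapezoidal quadrature weights
    diffs = curves[:, None, :] - curves[None, :, :]
    sq_l2 = np.sum(weights * diffs**2, axis=-1)  # ||x_i - x_j||_{L2}^2 for all pairs
    return np.exp(-sq_l2 / (2 * sigma**2))

t = np.linspace(0, 1, 100)
rng = np.random.default_rng(0)
curves = np.sin(2 * np.pi * np.outer(rng.uniform(1, 2, 20), t))  # toy functional sample
K = functional_rbf_kernel(curves, t, sigma=0.5)
```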

Once the functional data graph \(\textrm{G}\) has been constructed, the algorithm performs the same operations as multivariate DM. The complete procedure is presented in Algorithm 1. Furthermore, a Python implementation of the FDM method is available at https://github.com/mariabarrosoh/functional-diffusion-maps/.

Algorithm 1 FDM
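As a complement to Algorithm 1, the following end-to-end sketch summarizes the FDM steps on curves discretized on a common grid; it reuses the illustrative helpers above (quadrature-based kernel and spectral decomposition through the symmetric conjugate), and the toy data and hyperparameter values are assumptions made for the example rather than the reference implementation linked above:

```python
import numpy as np

def functional_diffusion_maps(curves, t, sigma=0.5, alpha=1.0, T=1, n_components=2):
    """Illustrative FDM sketch: curves is an (N, M) array sampled on grid t."""
    # 1. Functional RBF kernel matrix via trapezoidal quadrature of the L2 norm.
    w = np.zeros_like(t)
    w[:-1] += np.diff(t) / 2
    w[1:] += np.diff(t) / 2
    diffs = curves[:, None, :] - curves[None, :, :]
    K = np.exp(-np.sum(w * diffs**2, axis=-1) / (2 * sigma**2))

    # 2. alpha-normalization and degrees of the normalized graph.
    d = K.sum(axis=1)
    K_alpha = K / np.outer(d**alpha, d**alpha)
    d_alpha = K_alpha.sum(axis=1)
    pi = d_alpha / d_alpha.sum()

    # 3. Spectral decomposition via the symmetric conjugate of P.
    S = K_alpha / np.outer(np.sqrt(d_alpha), np.sqrt(d_alpha))
    lambdas, V = np.linalg.eigh(S)
    order = np.argsort(lambdas)[::-1]
    lambdas, V = lambdas[order], V[:, order]
    psi = V / np.sqrt(pi)[:, None]               # right eigenvectors of P

    # 4. Diffusion map of Eq. (4), discarding the trivial first eigenpair.
    return lambdas[1:n_components + 1] ** T * psi[:, 1:n_components + 1]

# Usage on a toy functional sample.
t = np.linspace(0, 1, 100)
rng = np.random.default_rng(0)
curves = np.sin(2 * np.pi * np.outer(rng.uniform(1, 2, 30), t))
scores = functional_diffusion_maps(curves, t, sigma=0.5, alpha=1.0, T=1)
```

The Laplacian kernel variant of Sect. 3.2 would only change step 1, replacing the squared \(L^2\) distance by an \(L^1\) distance.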

Table 1 DM and FDM hyperparameter grid used for finding the best values for the different datasets
Fig. 2 Cauchy densities data

Fig. 3 Isomap, DM and FDM scores in the first two components for the Cauchy densities dataset

To conclude, note that, since the FDM algorithm differs from DM only in the definition of the graph through a functional metric, all the properties derived from the probability matrix and from the diffusion distances are preserved. In particular, any theoretical result regarding the standard multivariate DM method in which the data enter only through a kernel function transfers immediately to its functional version. This means that classical results, such as the ability to learn intrinsic coordinates or to modulate the influence of the data density through the \(\alpha \) parameter, are inherited by FDM.

4 Examples and simulation study

In this section we apply the FDM technique described above to synthetic and classic functional datasets, using the FDA utilities provided by the scikit-fda package (Ramos-Carreño et al. 2022) together with our FDM implementation. In particular, we compare the performance of FDM against its multivariate version, and we also evaluate FDM alongside other FDA techniques, namely FPCA and Isomap, to assess its efficacy.

Both DM and FDM require an initial analysis to identify suitable parameters that cluster the data or reveal the structure of the manifold in which the data are assumed to be embedded. To achieve this, a grid search is performed in each experiment using the hyperparameters specified in Table 1.
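Since the exact grid of Table 1 is not reproduced here, the sketch below only illustrates how such a search could be organized; the candidate values and the visual-inspection criterion are assumptions of the example, not the grid actually used:

```python
from itertools import product

# Hypothetical hyperparameter grid, in the spirit of Table 1.
kernels = ["rbf", "laplacian"]
sigmas = [0.1, 0.2, 0.6, 1.0]
alphas = [0.0, 0.5, 1.0]

for kernel, sigma, alpha in product(kernels, sigmas, alphas):
    # For each configuration, the embedding would be computed (e.g. with the
    # functional_diffusion_maps sketch above, adapted to the chosen kernel)
    # and assessed by inspecting the resulting scores.
    print(kernel, sigma, alpha)
```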

Fig. 4 Left: Multivariate Moons and Swiss Roll data. Right: Functional version of Moons and Swiss Roll data

The functional version of the Isomap method has been used in practical applications by directly discretizing the values (Herrmann and Scheipl 2020). This technique only requires setting the number of neighbors used to build the graph. To determine the optimal number of neighbors, we perform a grid search over the set \(\{5k: 1 \le k \le 5, k\in {\mathbb {N}}\}\).
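A minimal sketch of this discretization-based use of Isomap, assuming the curves are stored row-wise in a data matrix as in Eq. (1) and using scikit-learn's implementation on toy data:

```python
import numpy as np
from sklearn.manifold import Isomap

# Toy data matrix of discretized curves (N x M), as in Eq. (1).
t = np.linspace(0, 1, 100)
rng = np.random.default_rng(0)
X = np.sin(2 * np.pi * np.outer(rng.uniform(1, 2, 50), t))

# Isomap on the discretized values; n_neighbors is the only graph parameter tuned.
scores = Isomap(n_neighbors=15, n_components=2).fit_transform(X)
```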

4.1 Cauchy densities data

The purpose of this first experiment is to demonstrate that the straightforward implementation of Isomap and Diffusion Maps on functional data may not be sufficient if the functions are not equally spaced and the metric employed is unsuitable.

To illustrate this assertion, we apply Isomap, DM and FDM to a synthetic example consisting of 50 Cauchy probability density functions, observed on a non-equispaced grid, with scale \(\gamma =1.0\). In particular, we generate 25 Cauchy densities with amplitude 1.0 regularly centered in the interval \([-5,5]\) and 25 Cauchy densities with amplitude 1.5 regularly centered in the same interval. The grid is partitioned into equally spaced sections with 100 points in each of the intervals \([-10,-5]\), \((-5,5)\), and [5, 10], so that the point density in the middle interval is half that of the other two. The generated functions are displayed in Fig. 2.
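A sketch of how a sample following this description could be generated (the exact construction used in the experiment may differ in details such as endpoint handling):

```python
import numpy as np
from scipy.stats import cauchy

# Non-equispaced grid: 100 points in each of [-10, -5], (-5, 5) and [5, 10],
# so the middle interval has half the point density of the outer ones.
grid = np.concatenate([
    np.linspace(-10, -5, 100, endpoint=False),
    np.linspace(-5, 5, 100, endpoint=False),
    np.linspace(5, 10, 100),
])

centers = np.linspace(-5, 5, 25)               # regularly spaced centers
gamma = 1.0
curves, labels = [], []
for amplitude in (1.0, 1.5):
    for c in centers:
        curves.append(amplitude * cauchy.pdf(grid, loc=c, scale=gamma))
        labels.append(amplitude)
curves = np.array(curves)                      # 50 curves on a non-equispaced grid
```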

After evaluating all possible parameter settings for Isomap, DM and FDM on the Cauchy densities, we found that 15 neighbors are optimal for the Isomap method. An RBF kernel with a scale parameter of \(\sigma = 0.6\) and a density parameter of \(\alpha = 1.0\) works better for the DM embedding, while an RBF kernel with \(\sigma = 0.1\) and \(\alpha = 0.0\) is suitable for the FDM technique.

Figure 3 shows the scatterplots of the multivariate and functional scores in the first two components for the Cauchy densities dataset.

Fig. 5 Top: FPCA and FDM scores in the first two components for the Moons dataset. Bottom: FPCA and FDM scores histogram in the first component for the Moons dataset

Both the Isomap and DM scores exhibit a semi-circular shape, with the two function classes located in close proximity. In contrast, the FDM scores exhibit a complete separation of the function classes. Consequently, we can deduce that Isomap and multivariate DM may not be appropriate methods for analyzing non-equispaced functions, for which the functional version of DM provides an essential alternative. The main reason behind this discrepancy lies in the computation of distances.

4.2 Moons and Swiss roll data

After establishing the benefits of using FDM instead of DM for functional data, we aim to contrast FDM with FPCA, which is by far the most popular dimensionality reduction method in FDA, by evaluating their performance on the functional versions of the Moons and Swiss Roll datasets.

The Moons dataset, shown in Sect. 3.1, is typically used to visualize clustering algorithms, whereas the Swiss Roll dataset is typically used to test dimensionality reduction techniques. Both are common examples where non-linearity in the data makes multivariate PCA perform poorly and, therefore, manifold learning techniques are preferable.

Figure 4 shows, on the left, the multivariate Moons and Swiss Roll datasets generated without noise. To obtain the functional version of these datasets, shown in the right panel of Fig. 4, the features of the multivariate data are used as the coefficients of a chosen functional basis. Specifically, we represent the functional Moons data using the following non-orthogonal basis:

$$\begin{aligned} \upphi (x) = \{\phi _1(x) = \sin (4x), \phi _2(x) = x^2 + 2x - 2\}; \end{aligned}$$

for the functional Swiss Roll data we use instead the following non-orthogonal basis:

$$\begin{aligned} \upphi (x) = \{\phi _1(x) = \sin (4x), \phi _2(x) = \cos (8x), \phi _3(x) = \sin (12x)\}. \end{aligned}$$
Fig. 6 FPCA and FDM scores for the Swiss Roll dataset

Since FPCA and FDM act on the basis function coordinates, we expect to obtain essentially the same results as in the multivariate case, except for the effect of the inner products between the chosen basis functions.
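A sketch of this construction, using scikit-learn's generators for the multivariate datasets and the bases above (the sample sizes and the evaluation grid are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_moons, make_swiss_roll

x = np.linspace(-1, 1, 100)                    # common evaluation grid

# Multivariate samples generated without noise.
moons, moons_labels = make_moons(n_samples=200, noise=0.0, random_state=0)
swiss, swiss_color = make_swiss_roll(n_samples=200, noise=0.0, random_state=0)

# Multivariate features are used as coefficients of the chosen bases.
moons_basis = np.vstack([np.sin(4 * x), x**2 + 2 * x - 2])               # 2 x 100
swiss_basis = np.vstack([np.sin(4 * x), np.cos(8 * x), np.sin(12 * x)])  # 3 x 100

functional_moons = moons @ moons_basis         # 200 curves evaluated on the grid
functional_swiss = swiss @ swiss_basis
```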

After evaluating all possible parameter configurations (see Table 1), we found that an RBF kernel with a scale parameter of \(\sigma = 0.2\) and a density parameter of \(\alpha = 0.5\) works better for the Moons dataset, while an RBF kernel with \(\sigma =0.6\) and \(\alpha =1.0\) is more suitable for the Swiss Roll dataset.

The top panels of Figs. 5 and 6 show the Moons and Swiss Roll scatterplots of the FPCA and FDM scores in the first two components, respectively. Furthermore, the bottom panels of the same figures present the projection of the scores onto the first component. These visualizations enable us to discern whether clusters have been recognized or whether the underlying manifold has been “unrolled”.

Table 2 Phoneme frequencies
Fig. 7 Phoneme dataset

The FPCA embeddings resemble the original Moons and Swiss Roll data, up to rotation, sign, and scale adjustments. In contrast, the FDM embeddings reveal that the Moons data are entirely separated by the first component. Regarding the Swiss Roll data, the underlying two-dimensional manifold is exposed, and even in one dimension it can be seen that the color order is preserved by FDM.

Fig. 8 Left: FPCA, Isomap and FDM scores in the first two components for the Phoneme dataset. Right: FPCA, Isomap and FDM scores histogram in the first component for the Phoneme dataset

Fig. 9 A sample of each one of the three pictograms in the Symbols dataset. The left column shows the complete pictograms, whereas the middle and right columns show only the x and y components, respectively

4.3 Real datasets

In this last experiment we have applied the main functional dimensionality reduction techniques—FPCA, Isomap and FDM—to two real problems: the Phoneme and the Symbols datasets.

The Phoneme dataset, which consists of log-periodograms computed from continuous speech of male speakers for five distinct phonemes, was extracted from the TIMIT database (Garofolo et al. 1993) and is widely used in speech recognition research. In this example, the phonemes are transcribed as follows: /sh/ (as in ‘she’), /dcl/ (as in ‘dark’), /iy/ (as the vowel in ‘she’), /aa/ (as the vowel in ‘dark’) and /ao/ (as the first vowel in ‘water’). For our experiment, due to computational cost, we take a random sample of 1500 log-periodogram curves of length 256 as a representative sample of the population. Table 2 shows the phoneme frequencies by class membership. The log-periodogram curves of the Phoneme dataset are characterized by high irregularity and noise at their endpoints. Hence, in order to prepare them for a dimensionality reduction analysis, a preprocessing step of trimming and smoothing is typically performed. To do this, we truncate the log-periodogram curves to a maximum length of 50. Figure 7 displays the resulting functional representation of the Phoneme dataset. Looking at the curves, we see higher similarities between the phonemes /aa/ and /ao/, as their sounds are very similar, and also between the last part of the curves of the phonemes /iy/ and /sh/, since the phoneme /iy/ appears right after /sh/ in the word ‘she’. On the other hand, the curves of the phoneme /dcl/ seem to be far away from the rest.

Fig. 10 Symbols dataset

Fig. 11 Left: FPCA, Isomap, and FDM scores in the first two components for the Symbols dataset. Right: FPCA, Isomap, and FDM scores histogram in the first component for the Symbols dataset

Fig. 12 FDM scores for different values of the hyperparameters \(\sigma \) and \(\alpha \) for both real datasets

For the comparison, FPCA, Isomap and FDM were applied to the preprocessed dataset. In this experiment, both Isomap and FDM techniques require an initial analysis to determine the most suitable parameters. After evaluating all possible parameter configurations, we found that an RBF kernel with \(\sigma = 1.0\) and \(\alpha = 1.0\) yields the best results for the FDM method, while for the Isomap method, we found that using 15 neighbors provides suitable performance. In the left panel of Fig. 8, we display the FPCA, Isomap and FDM scores in the first two components for the Phoneme dataset. In the right panel of the same figure, we also present the histograms of the scores associated with the first component. The outcomes yielded by each technique offer different perspectives on phoneme similarities.

FPCA groups each phoneme while maintaining the similarities that we had observed in the trajectories. We can also observe in the FPCA histogram the large overlap between the first components of the phonemes, especially for /aa/ with /ao/ and /iy/ with /sh/. Isomap keeps clusters similar to those of FPCA, but without overlapping the /iy/ and /sh/ phonemes. Although a more effective separation of /iy/ and /sh/ is observed in the 2D embedding, there is a significant overlap between /dcl/ and /sh/ in the 1D embedding, despite the fact that they should be completely distinct. FDM embeds the different phoneme groups ordering them from the vowel phonemes (/aa/, /ao/, /iy/) to the consonant phonemes (/sh/, /dcl/), in the same order as described. In addition, the similarity between the vowel of ‘dark’ (/aa/) and the first vowel of ‘water’ (/ao/) is maintained, as they appear close in the embedding, as is the similarity between the vowel of ‘she’ (/iy/) and the phoneme /sh/ corresponding to that word. We can also observe that the vowel of ‘she’ (/iy/) is more similar to the first vowel of ‘water’ (/ao/) than to the vowel of ‘dark’ (/aa/), information that can be seen in the curvature of both trajectories. These similarities may be due to differences in the openness of these vowels, information that neither FPCA nor Isomap was able to capture.

Regarding the Symbols problem, this dataset comes from the UCR database (Dau et al. 2018). It is composed of the pictograms of three different symbols: s1, s2, and s3. Each of these symbols is registered separately for its x and y components, giving rise to the six different classes that can be appreciated in Fig. 9 (where the suffix indicates the component). Figure 10 displays the resulting functional representation.

Following the same methodology as in the previous experiment, FPCA, Isomap, and FDM have been applied to this dataset. The results, both the 2D embedding and the scores histograms, are depicted in Fig. 11.

As observed, FPCA perfectly separates the y component of the third symbol, which appears to be the most distinctive. However, in this projection there is an overlap between s1-x and s2-x, and also between s1-y and s2-y. This is in accordance with Figs. 9 and 10, which show these two pairs to be very similar. The same effect can be appreciated in the second-row plots, which represent the Isomap embeddings. In this case, s3-x is more separated from the other x components, but s1-x and s2-x appear overlapped again, as do s1-y and s2-y. Finally, the FDM method is able to unambiguously separate the x components of the three symbols. Moreover, they are depicted along the same line, reflecting their similarity. On the other hand, s3-y appears once again perfectly separated from the other classes, due to its clearly different shape. Unfortunately, FDM is not able to separate s1-y and s2-y either, which are probably the most similar classes. Still, we can conclude that, for this last example, FDM seems to provide the clearest embedding, with the best separation among classes.

Finally, note that for both previous examples we have shown the best results for each method, but for the sake of completeness we include the FDM embeddings for different hyperparameter configurations in Fig. 12. The dependence shown resembles the one present in the multivariate DM method.

5 Conclusions

Functional dimensionality reduction methods are statistical techniques that represent infinite-dimensional data in lower-dimensional spaces, for example by capturing most of the variance of the original data in the new space or by reflecting lower-dimensional underlying manifolds on which the original data lie. Following this second line of research, Diffusion Maps has been extended to the infinite-dimensional case.

As a result, functional diffusion maps emerges as a functional manifold learning technique that finds low-dimensional representations of \(L^2\) functions on non-linear functional manifolds. FDM performs the same operations as the DM algorithm once a functional similarity graph has been created. Even though giving a geometric interpretation of the similarity between functional data is sometimes not possible, FDM retains the advantages of multivariate DM: it is robust to noise perturbation and it is a very flexible algorithm that allows fine-tuning of the parameters that influence the resulting embedding.

The performance of this method has been tested on simulated and real functional examples, and the results have been compared with those obtained from multivariate DM, the linear FPCA method, and the non-linear Isomap technique. It should be noted that Isomap and multivariate DM cannot be applied directly to non-equispaced functions, as they may not effectively differentiate the particularities of the functions under study. FDM outperforms FPCA for functions lying on non-linear manifolds such as the Moons and Swiss Roll examples, where FDM obtains a representation that maintains the original structure of the manifold. Besides being an advantage in these simulated examples, it also provides good results and allows new similarity interpretations for real examples such as the Phoneme dataset.

Overall, we find that the proposed manifold FDM method is an interesting technique that can provide useful representations which are competitive with, and often superior to, some classical linear and non-linear representations for functional data.

Nevertheless, some work remains to be done. In particular, the distance between functions could be interpreted as an earth mover's distance between trajectories using a Besov space metric, which can be computed by expanding each function in a Haar basis in time and using its duality with Hölder norms (Ankenman and Leeb 2018). It would also be interesting to extend the proposed method to vector-valued functions and to perform a thorough comparison against other non-linear methods, including the one presented in Song and Li (2021).