1 Introduction

Toroidal data is formed by tuples of observations that lie on the p-dimensional torus \({\mathbb {T}}^p =[-\pi , \pi )^p\), \(p\ge 1\), where \(-\pi \) and \(\pi \) are identified. This kind of data is present in a variety of applied fields ranging from bioinformatics, when modeling angles in atom bonds (e.g., Boomsma et al. 2008), to environmental sciences, when modeling sea conditions (e.g., Jona-Lasinio et al. 2012), among many others (see, e.g., Ley and Verdebout 2018). The fact that circular variables present different properties than linear variables renders spurious the application of standard statistical tools, with their adaptations comprising the so-called directional statistics (Mardia and Jupp 1999); see, e.g., Pewsey and García-Portugués (2021) for a recent review. In particular, Principal Component Analysis (PCA) is a widely used (linear) statistical tool whose extension to circular variables has been challenging due to the ill-posedness of linear relations on \({\mathbb {T}}^p\). On \({\mathbb {T}}^2\), PCA produces linear subspaces that are either non-periodic or, if extended periodically by “wrapping” at the \(-\pi \equiv \pi \) boundaries, would almost surely result in useless perfect fits to a sample. Indeed, infinitely-dense line wrappings are produced by irrational slopes for the first sample principal component, which arise with probability one in a sample generated by a continuous random vector.

There have been several attempts to extend PCA to toroidal data, with all of them facing some relevant limitations or foundational issues; see Section 8.1 in Pewsey and García-Portugués (2021) for an overview. These approaches can be divided into two general branches. The first one consists of “geodesic-based” methods, like Fletcher et al. (2004)’s Principal Geodesic Analysis (PGA). PGA defines the first principal geodesic as the one passing through the intrinsic mean that minimizes the sum of squared intrinsic residuals. Nodehi et al. (2015) applied PGA to the torus. Both of these approaches inevitably may lead to the aforementioned infinite wrapping. The other branch of techniques is based on embedding toroidal data in Euclidean space to use classical PCA afterward. For example, Mu et al. (2005) proposed dihedral PCA (dPCA), which maps the angles onto \({\mathbb {S}}^1 \times {\mathbb {S}}^1\) via \((\phi , \psi ) \mapsto (\cos \phi ,\sin \phi ,\cos \psi ,\sin \psi )\) and then applies regular PCA. Although this method transforms the data in a unique way, the geometry of \({\mathbb {S}}^1 \times {\mathbb {S}}^1\) is ignored in the subsequent PCA application, leading to substantial transformation-induced artifacts in the resulting PCA scores. Riccardi et al. (2009) introduced angular PCA (aPCA): a centering of the data with the circular mean followed by an application of PCA that disregards periodicity. Another idea was that of characterizing the covariance matrix, which Kent and Mardia (2009) did on a wrapped normal model using trigonometric moments to facilitate PCA on it. To reduce the distortion generated by transformation-based approaches, Sittel et al. (2017) proposed a modification on aPCA consisting of a shift of the data so that the “periodic region” is located at the lowest-density area (determined by histogram approximation) and most of the data is located in the more “non-periodic region”. 
Like aPCA, the periodicity of the resulting principal components and scores is not enforced. Eltzner et al. (2018) introduced Torus PCA (T-PCA) by mapping \({\mathbb {T}}^d\) onto \({\mathbb {S}}^d\), a geometrically-benign space that, unlike geodesic PCA, produces non-dense linear subspaces. Principal Nested Spheres (PNS) (Jung et al. 2012) is then carried out on \({\mathbb {S}}^d\). Although a principled approach, the transformation in T-PCA is not invariant under permutations of variables, which may yield significantly different outcomes, and is prone to create artifacts in, e.g., datasets with marginal uniform-like distributions. More recently, Zoubouloglou et al. (2022) introduced Scaled Torus PCA (ST-PCA), which uses multidimensional scaling maps between \({\mathbb {T}}^d\) and \({\mathbb {S}}^d\) designed to alleviate the aforementioned drawbacks in T-PCA, and then applies PNS. However, these maps are computationally demanding and difficult to invert. Despite many partial advances, the development of a flexible and successful PCA on the torus that is free of significant drawbacks remains remarkably open.

Differently from the previous PCA approaches, the purpose of this article is to advance a flexible density-driven PCA on \({\mathbb {T}}^2\). A landmark in flexible dimension reduction on \({\mathbb {R}}^p\) was the principal curves of Hastie and Stuetzle (1989). The linear subspaces spanned by PCA’s principal directions are self-consistent when the data is normally distributed, i.e., the expectation of all the points projecting onto a point \({\textbf{x}} \in {\mathbb {R}}^p\) of the subspace is also \({\textbf{x}}\). This fact motivates the definition of principal curves as those that are self-consistent, although their existence is not guaranteed. Another alternative definition of principal curves was introduced by Delicado (2001, 2003) in terms of the total variance, which has the advantage of a guaranteed existence. The total variance of a random variable is minimal when its hyperplane is orthogonal to the first principal component. A different approach to flexible PCA was given by Ozertem and Erdogmus (2011), who defined the concept of density ridges, sets of points that characterize the main features of the density, for locally defined principal curves and surfaces. Notably, Ozertem and Erdogmus (2011) introduced an estimation approach for density ridges based on the mean shift algorithm. Although the first notions of density ridges were already introduced by Hall et al. (1992), it has been in the last decade that the topic has attracted more attention. To review just a few contributions, Genovese et al. (2014) showed Hausdorff-consistency of the mean-shift estimated ridge sets, Chen et al. (2015) gave a Gaussian-process representation of the asymptotic distribution of the Hausdorff distances between estimated and theoretical ridges, and Qiao and Polonik (2016) established the pointwise asymptotic normality of the estimated ridges and the asymptotic distribution of its uniform deviation from theoretical ridges.

This article aims to generalize PCA to bivariate toroidal data by making use of density ridges. The contributions towards this goal are manifold. First, we provide efficient algorithms for computing periodic density ridges for two standard toroidal distributions, the Bivariate Sine von Mises (BSvM) (Singh et al. 2002) and the Bivariate Wrapped Cauchy (BWC) (Kato and Pewsey 2015). These two algorithms substantially reduce the computational burden of iterative methods, based on integral curves, by (1) targeting the implicit equations that define the density ridge, (2) leveraging specific solutions of these implicit equations, and (3) exploiting the ridge symmetries induced by the BSvM and BWC densities. Second, we present Toroidal Ridge PCA (TR-PCA), which uses the connected component of the density ridge that passes through the mode(s) of the density as a flexible first principal component in the parametric setting of BSvM and BWC distributions. TR-PCA provides PCA-like scores by computing signed distances along the ridge, signed projections onto the ridge, and Fréchet means within the ridge. These operations are facilitated by Fourier approximations of the ridge curve, which provide a fast analytical handle. Using likelihood theory, TR-PCA automatically handles edge cases arising in practice, such as diagonal ridges, and decides the best-fitting underlying distribution. Third, we empirically demonstrate the advantages of TR-PCA over aPCA, arguably the most readily implementable alternative, in a series of illustrative numerical examples. In particular, unlike transformation-based approaches, TR-PCA does not introduce distortions and, unlike other approaches such as aPCA, yields scores that inherit the periodicity of the data. Fourth, the usefulness of TR-PCA is demonstrated with a novel application to the study of ocean currents off the coast of Santa Barbara, where the obtained principal ridge explains the currents’ behavior successfully.
A final contribution is the companion R package ridgetorus, which provides an implementation of TR-PCA and allows the end-to-end replicability of the data application.

The organization of the rest of this article is as follows: Sects. 2–4 study the population case of density ridges, and their sample version appears in Sect. 5, with the definition of TR-PCA, and in Sect. 6. More precisely, Sect. 2 provides the definition and some useful properties of density ridges on \({\mathbb {R}}^p\), as well as the computation methods thereof. Section 3 is devoted to the computation and properties of the density ridge for the BSvM distribution on \({\mathbb {T}}^2\). Section 4 gives an analogous construction for the BWC distribution. Then, Sect. 5 introduces an effective parametrization of the connected component of the BSvM/BWC ridge and explains how to obtain scores from it in the presence of a sample, resulting in the definition of TR-PCA in Algorithm 1. Section 6 shows an application of TR-PCA on real bivariate data. Finally, a discussion of the main conclusions, alternatives, and limitations of the present work is given in Sect. 7.

2 Density ridges

2.1 Definition

Density ridges are higher-dimensional extensions of the concept of mode that are informative of the main features of a density. A mode is a local maximum of the density function f that, if \(f \in {\mathcal {C}}^2({\mathbb {R}}^p)\), \(p\ge 2\), can be characterized by a null gradient and negative Hessian eigenvalues, if the degenerate cases are excluded. The eigendecomposition of the Hessian of a density function \(f \in {\mathcal {C}}^2({\mathbb {R}}^p)\) evaluated at a point \({\textbf{x}} \in {\mathbb {R}}^p\) is given by \(\textrm{H} f({\textbf{x}})={\textbf{U}}({\textbf{x}}) \varvec{\Lambda }({\textbf{x}}) ({\textbf{U}}({\textbf{x}}))^\prime \), where \({\textbf{U}}({\textbf{x}})=({\textbf{u}}_{1}({\textbf{x}}), \ldots , {\textbf{u}}_{p}({\textbf{x}}))\) is a matrix whose columns are the eigenvectors of \(\textrm{H} f({\textbf{x}})\) and \(\varvec{\Lambda }({\textbf{x}})=\textrm{diag}(\lambda _{1}({\textbf{x}}), \ldots ,\lambda _{p}({\textbf{x}}))\), \(\lambda _{1}({\textbf{x}}) \ge \ldots \ge \lambda _{p}({\textbf{x}})\), contains their eigenvalues. Denoting \({\textbf{U}}_{(p-1)}({\textbf{x}}):=({\textbf{u}}_{2}({\textbf{x}}), \ldots , {\textbf{u}}_{p}({\textbf{x}}))\) and defining the projected gradient on \(\{{\textbf{u}}_2({\textbf{x}}),\ldots ,{\textbf{u}}_p({\textbf{x}}) \}\) as \(\textrm{D}_{(p-1)} f({\textbf{x}}):={\textbf{U}}_{(p-1)}({\textbf{x}}) ({\textbf{U}}_{(p-1)}({\textbf{x}}))^{\prime } \textrm{D} f({\textbf{x}})\), where \(\textrm{D}f({\textbf{x}})\) stands for the (column vector) gradient, the density ridge is defined by Genovese et al. (2014) as follows.

Definition 1

(Density ridge; Genovese et al. 2014). Let \(f\in {\mathcal {C}}^2({\mathbb {R}}^p)\), \(p\ge 2\). The density ridge of f is the set

$$\begin{aligned} {\mathcal {R}}(f):=\big \{&{\textbf{x}} \in {\mathbb {R}}^{p}:\big \Vert \textrm{D}_{(p-1)} f({\textbf{x}})\big \Vert =0,\\&\lambda _{2}({\textbf{x}}),\ldots ,\lambda _{p}({\textbf{x}})<0\big \}. \end{aligned}$$
Fig. 1 From left to right, representation of the eigenvector fields \({\textbf{u}}_1,{\textbf{u}}_2:{\mathbb {R}}^2\rightarrow {\mathbb {R}}^2\), and the gradient and projected gradient vector fields \(\textrm{D}f,\textrm{D}_{1}f:{\mathbb {R}}^2\rightarrow {\mathbb {R}}^2\) for a Gaussian density f on \({\mathbb {R}}^2\). Red plus (respectively, blue minus) indicate a positive (negative) eigenvalue \(\lambda _j({\textbf{x}})\) associated with \({\textbf{u}}_j({\textbf{x}})\), for \(j=1,2\), with \(\lambda _1({\textbf{x}})\ge \lambda _2({\textbf{x}})\) and \({\textbf{x}}\in {\mathbb {R}}^2\). For visualization purposes, vector fields are unit-norm standardized and low-density regions are shown in white.

Clearly, two cases imply that \({\textbf{x}}\in {\mathbb {R}}^p\) satisfies \(\textrm{D}_{(p-1)} f({\textbf{x}})={\textbf{0}}\). The first is \(\textrm{D}f({\textbf{x}})={\textbf{0}}\). In that case, if \(\lambda _{2}({\textbf{x}}),\ldots ,\lambda _{p}({\textbf{x}})<0\), \({\textbf{x}}\) is a maximum or a saddle point. The second is that \(\textrm{D}f({\textbf{x}})\) is orthogonal to the columns of \({\textbf{U}}_{(p-1)}({\textbf{x}})\), that is, the gradient is parallel to \({\textbf{u}}_1({\textbf{x}})\). In this case, the directions of maximum ascent (gradient) and maximum signed curvature (\({\textbf{u}}_1({\textbf{x}})\)) coincide (see Fig. 1), with “signed” emphasizing that the maximum is not in absolute value terms since \(\lambda _1({\textbf{x}})\in {\mathbb {R}}\).

2.2 Some useful properties

Two useful properties of density ridges are given below. Their proofs are given in Appendix A.

Proposition 1

(Ridge invariance to translations and rotations). Let \({\textbf{X}}\) be a p-random vector with density \(f\in {\mathcal {C}}^2({\mathbb {R}}^p)\), \(p\ge 2\). Let \({\textbf{Y}}=\varvec{\mu }+{\textbf{R}}{\textbf{X}}\) represent a shifting and rotation family of p-random vectors spanned by \({\textbf{X}}\), where \(\varvec{\mu }\in {\mathbb {R}}^p\) and \({\textbf{R}}\) is a \(p\times p\) orthogonal matrix. If \(f(\cdot ;\varvec{\mu },{\textbf{R}})\) stands for the density function of \({\textbf{Y}}\) and \({\textbf{x}}\in {\mathbb {R}}^p\), then

$$\begin{aligned} {\textbf{x}}\in {\mathcal {R}}(f(\cdot ; \varvec{\mu },{\textbf{R}}))\iff {\textbf{R}}^\prime ({\textbf{x}}-\varvec{\mu }) \in {\mathcal {R}}(f). \end{aligned}$$

This result yields two convenient handles to manipulate density ridges. First, it reduces the problem of computing density ridges to those that are centered at a certain origin. Second, it shows how to exploit the symmetries of f to reduce the computational costs of evaluating \({\mathcal {R}}(f)\) (e.g., halve them if f is reflective symmetric in \({\mathbb {R}}^2\)).

Proposition 2

(Ridges for elliptically-symmetric densities). Let \(f\in {\mathcal {C}}^2({\mathbb {R}}^p)\), \(p\ge 2\), be an elliptically-symmetric density of the form \(f({\textbf{x}})=g(({\textbf{x}}-\varvec{\mu })^{\prime }\varvec{\Sigma }^{-1}({\textbf{x}}-\varvec{\mu }))\), with \({\textbf{x}},\varvec{\mu }\in {\mathbb {R}}^p\), \(\varvec{\Sigma }\) a \(p\times p\) positive definite matrix, and \(g\in {\mathcal {C}}^2({\mathbb {R}}^+_0)\) a (strictly) decreasing function. Then, the subspace spanned by the first eigenvector \({\textbf{v}}_1\) of \(\varvec{\Sigma }\) belongs to the density ridge of f:

$$\begin{aligned} \{\varvec{\mu }+c {\textbf{v}}_1: c\in {\mathbb {R}}\} \subset {\mathcal {R}}(f). \end{aligned}$$
(1)

This proposition has several important consequences. First, it proves that \({\mathcal {R}}(f)\) is intimately related to the first principal component of PCA for elliptically-symmetric densities, since the subspace generated by the latter direction is included in it. A decreasing g implies that f has \(\varvec{\mu }\) as a unique mode. This is satisfied by many elliptically-symmetric distributions, with its prime representative being the normal distribution, a class of distributions in which PCA is particularly meaningful. Second, like the first principal component, the density ridge contains the center (\(\varvec{\mu }\)) in elliptically-symmetric distributions. Third, the proposition shows that the density ridge is more than just the first principal direction subspace. This observation is key for defining the \(\varvec{\mu }\)-connected density ridge for a mode or mean \(\varvec{\mu }\) of f with a view to constructing a density-driven first principal component analog.
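As a numerical illustration of Definition 1 and of inclusion (1), the following Python sketch checks the two ridge conditions for a centered bivariate Gaussian with covariance \(\textrm{diag}(s_1^2,s_2^2)\), \(s_1>s_2\). The function names, the values of `s1` and `s2`, and the tolerance are illustrative assumptions and are not taken from the ridgetorus package.

```python
import math

def gaussian_terms(x, y, s1=2.0, s2=1.0):
    """Density, gradient, and Hessian entries (u, v, w) of a centered
    Gaussian with covariance diag(s1^2, s2^2) (illustrative choice)."""
    f = math.exp(-0.5 * (x**2 / s1**2 + y**2 / s2**2)) / (2 * math.pi * s1 * s2)
    a, b = -x / s1**2, -y / s2**2            # gradient of log f
    grad = (f * a, f * b)
    u = f * (a * a - 1 / s1**2)
    v = f * a * b
    w = f * (b * b - 1 / s2**2)
    return f, grad, (u, v, w)

def second_eigenpair(u, v, w):
    """Smallest eigenvalue and a unit eigenvector of [[u, v], [v, w]]."""
    lam2 = 0.5 * (u + w - math.sqrt((u - w)**2 + 4 * v**2))
    # Two equivalent eigenvector expressions; keep the better-conditioned one
    c1, c2 = (v, lam2 - u), (lam2 - w, v)
    e = c1 if math.hypot(*c1) >= math.hypot(*c2) else c2
    n = math.hypot(*e)
    if n == 0.0:                             # Hessian proportional to identity
        return lam2, (0.0, 1.0)
    return lam2, (e[0] / n, e[1] / n)

def on_ridge(x, y, tol=1e-12):
    """Check Definition 1 for p = 2: null projected gradient and lambda_2 < 0."""
    _, grad, (u, v, w) = gaussian_terms(x, y)
    lam2, u2 = second_eigenpair(u, v, w)
    dot = grad[0] * u2[0] + grad[1] * u2[1]  # gradient component along u_2
    return abs(dot) < tol and lam2 < 0

print(on_ridge(1.5, 0.0), on_ridge(1.0, 0.5), on_ridge(0.0, 1.5))
```

Under these assumptions, points on the first principal axis pass the check, a generic off-axis point does not, and the point (0, 1.5) on the second axis also passes, which is in line with the remark that the density ridge is more than the first principal direction subspace.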

Definition 2

(\(\varvec{\mu }\)-connected density ridge). Let \(f\in {\mathcal {C}}^2({\mathbb {R}}^p)\). The \(\varvec{\mu }\)-connected density ridge of f, \({\mathcal {R}}_{\varvec{\mu }}(f)\), is defined as the connected component of \({\mathcal {R}}(f)\) that contains \(\varvec{\mu }\in {\mathbb {R}}^p\).

We return to the \(\varvec{\mu }\)-connected density ridges in Sects. 3 and 4. Before that, we describe in the following subsections two approaches to determining in practice the set of points that form \({\mathcal {R}}(f)\) for \(f \in {\mathcal {C}}^2({\mathbb {R}}^p)\). Both approaches are exploited in Sects. 3 and 4.

2.3 Integral curve approach

Genovese et al. (2014) showed that the vector field of the projected gradient defines a global flow. Hence, the trajectory defined by the projected gradient converges to the points where it is null. This fact allows characterizing the density ridge as an integral curve:

$$\begin{aligned} {\mathcal {R}}(f)=\big \{&{\textbf{x}} \in {\mathbb {R}}^{p}: \lim _{t \rightarrow \infty } \phi _{{\textbf{x}}_{0}}(t)={\textbf{x}},\ {\textbf{x}}_{0} \in {\mathbb {R}}^{p},\ \\&\lambda _{2}({\textbf{x}}),\ldots ,\lambda _{p}({\textbf{x}})<0\big \}, \end{aligned}$$

where \(\phi _{{\textbf{x}}_{0}}: {\mathbb {R}} \rightarrow {\mathbb {R}}^{p}\) is a flow curve in \({\mathbb {R}}^{p}\) that satisfies \((\textrm{d}/\textrm{d} t) \phi _{{{\textbf{x}}}_{0}}(t)=\textrm{D}_{(p-1)} f(\phi _{{{\textbf{x}}}_{0}}(t))\), \(t>0\), and \( \phi _{{{\textbf{x}}}_{0}}(0)={\textbf{x}}_{0}\). This differential equation can be solved numerically using the Euler method until attaining a point where \(\textrm{D}_{(p-1)} f({\textbf{x}})\approx {\textbf{0}}\). To move faster in low-density regions and slower in high-density regions, it is useful to consider the normalized projected gradient \(\varvec{\eta }_{(p-1)}({\textbf{x}}):=\textrm{D}_{(p-1)} f({\textbf{x}})/f({\textbf{x}})\):

$$\begin{aligned} {\textbf{x}}_{t+1}={\textbf{x}}_{t}+h \varvec{\eta }_{(p-1)}({\textbf{x}}_{t}), \text { for}\quad t=0, 1, \ldots . \end{aligned}$$
(2)

The above scheme is started from an initial point \({\textbf{x}}_0\) and is iterated using a step \(h>0\) until a criterion for convergence is met after N iterations. The resulting \({\textbf{x}}_N\) (approximately) belongs to \({\mathcal {R}}(f)\). Then, it is possible to approximately determine \({\mathcal {R}}(f)\) by running (2) from a sufficiently dense grid of initial values \({\textbf{x}}_0\).
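A minimal Python sketch of scheme (2) for \(p=2\) is given next. It assumes a centered Gaussian density with covariance \(\textrm{diag}(2^2,1^2)\) as a toy target, and the step `h`, the stopping tolerance, and the function names are illustrative choices rather than the settings used for the figures.

```python
import math

S1, S2 = 2.0, 1.0  # assumed standard deviations of the toy Gaussian

def normalized_projected_gradient(x, y):
    """eta_(1)(x) = u_2 u_2' D f(x) / f(x) for the toy Gaussian."""
    a, b = -x / S1**2, -y / S2**2        # D f / f
    u = a * a - 1 / S1**2                # Hessian entries divided by f
    v = a * b
    w = b * b - 1 / S2**2
    lam2 = 0.5 * (u + w - math.sqrt((u - w)**2 + 4 * v**2))
    c1, c2 = (v, lam2 - u), (lam2 - w, v)
    e = c1 if math.hypot(*c1) >= math.hypot(*c2) else c2
    n = math.hypot(*e)
    if n == 0.0:                         # degenerate Hessian: pick a direction
        e, n = (0.0, 1.0), 1.0
    dot = (a * e[0] + b * e[1]) / n      # component of D f / f along u_2
    return dot * e[0] / n, dot * e[1] / n

def euler_ridge(x, y, h=0.1, tol=1e-10, max_iter=10000):
    """Iterate x_{t+1} = x_t + h * eta(x_t) until eta is numerically null."""
    for it in range(max_iter):
        ex, ey = normalized_projected_gradient(x, y)
        if math.hypot(ex, ey) < tol:
            return x, y, it
        x, y = x + h * ex, y + h * ey
    return x, y, max_iter

x_end, y_end, n_iter = euler_ridge(1.0, 0.5)
print(x_end, y_end, n_iter)
```

Started from (1, 0.5), the iterates move towards the horizontal axis, which is the first principal direction of this toy density. Repeating the call over a grid of starting points gives the approximation of \({\mathcal {R}}(f)\) described above, after discarding end points with \(\lambda _2\ge 0\).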

2.4 Implicit equation approach

When \(p=2\), from Definition 1 it can be seen that \({\mathcal {R}}(f):=\{{\textbf{x}} \in {\mathbb {R}}^{2}:\textrm{D}_1f({\textbf{x}}) u_{2,1}({\textbf{x}})+\textrm{D}_2f({\textbf{x}})u_{2,2}({\textbf{x}})=0,\ \lambda _{2}({\textbf{x}})<0\}\), with the subscript indicating the first/second component of the gradient and the first/second component of the second eigenvector \({\textbf{u}}_2({\textbf{x}})\) of the Hessian at \({\textbf{x}}\). Furthermore, still when \(p=2\), the eigenvectors and eigenvalues of the Hessian admit a closed form. For example, Qiao and Polonik (2016) give:

$$\begin{aligned} {\textbf{u}}_{2}(u, v, w)&=\!\left( \!\begin{array}{c} 2 u-2 w+2 v-2 \sqrt{(w-u)^{2}+4 v^{2}} \\ w-u+4 v-\sqrt{(w-u)^{2}+4 v^{2}} \end{array}\right) \!,\\ \lambda _{2}(u, v, w)&=\frac{u+w-\sqrt{(u-w)^{2}+4 v^{2}}}{2}, \end{aligned}$$

with \(u=(\partial ^2 /\partial _1 ^2)f\), \(v=(\partial ^2 /\partial _1 \partial _2)f\), and \(w=(\partial ^2 /\partial _2 ^2)f\). This means that the implicit equation in terms of the derivatives of f is given by

$$\begin{aligned}&\textrm{D}_1\left( 2 u-2 w+2 v-2 \sqrt{(w-u)^{2}+4 v^{2}}\right) \nonumber \\&\quad +\textrm{D}_2\left( w-u+4 v-\sqrt{(w-u)^{2}+4 v^{2}}\right) =0 \end{aligned}$$
(3)

and that the eigenvalue condition reads as

$$\begin{aligned} \lambda _{2}(u, v, w)=\frac{u+w-\sqrt{(u-w)^{2}+4 v^{2}}}{2}<0. \end{aligned}$$
(4)

Therefore, it is possible to obtain \({\mathcal {R}}(f)\) for a bivariate density f by solving Equation (3), restricted to Equation (4), along a grid on one of its variables. This approach is much faster than that in Sect. 2.3, yet it is restricted to \(p=2\).
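The closed forms above translate directly into code. The sketch below, with hypothetical function names, evaluates the displayed eigenvector and eigenvalue expressions and the left-hand side of (3), and checks numerically that the vector is an eigenvector of the Hessian for \(\lambda _2\).

```python
import math

def second_eigen(u, v, w):
    """Displayed closed forms for p = 2: an (unnormalized) eigenvector u_2
    and the eigenvalue lambda_2 of the Hessian [[u, v], [v, w]]."""
    s = math.sqrt((w - u)**2 + 4 * v**2)
    u2 = (2 * u - 2 * w + 2 * v - 2 * s, w - u + 4 * v - s)
    lam2 = 0.5 * (u + w - s)
    return u2, lam2

def ridge_equation(d1, d2, u, v, w):
    """Left-hand side of the implicit equation (3)."""
    u2, _ = second_eigen(u, v, w)
    return d1 * u2[0] + d2 * u2[1]

def is_ridge_point(d1, d2, u, v, w, tol=1e-10):
    """Equation (3) together with the eigenvalue condition (4)."""
    _, lam2 = second_eigen(u, v, w)
    return abs(ridge_equation(d1, d2, u, v, w)) < tol and lam2 < 0

# Eigenvector check on an arbitrary symmetric matrix
u, v, w = 1.0, 2.0, 3.0
u2, lam2 = second_eigen(u, v, w)
residual = (u * u2[0] + v * u2[1] - lam2 * u2[0],
            v * u2[0] + w * u2[1] - lam2 * u2[1])
print(residual)
```

One caveat when using this unnormalized vector in practice: it can be the zero vector for particular Hessians (for instance, \((u,v,w)=(3,2,0)\)), in which case (3) holds trivially. A solver built on (3) may therefore benefit from re-checking candidate roots with a normalized eigenvector.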

Fig. 2 Catalog of the Euler-approximated ridges of a zero-centered BSvM with parameters \((\kappa _1, \kappa _2, \lambda ) \in \{(0.3, 0.15, 0.25), (0.3, 0.6, 0.5), (0.3, 0.3, 1.0), (1.0, 0.5, 1.5)\}\). The initial set of points, together with their trajectories, are shown in black. The red points represent the final points of the algorithm, which describe \({\mathcal {R}}(f_\textrm{BSvM})\). The background shows the density contour of the BSvM. All the contourplots share the same color scale.

3 Density ridges for bivariate sine von Mises

3.1 Bivariate sine von Mises

Let \(\Theta _{1}\) and \(\Theta _{2}\) be two circular random variables. Singh et al. (2002)’s Bivariate Sine von Mises (BSvM) distribution has density given by

$$\begin{aligned} f_\textrm{BSvM}&(\theta _{1}, \theta _{2};\mu _1,\mu _2,\kappa _1,\kappa _2,\lambda ):=T(\kappa _{1}, \kappa _{2}, \lambda )\nonumber \\&\times \exp \big \{\kappa _{1} \cos (\theta _{1}-\mu _{1})+\kappa _{2} \cos (\theta _{2}-\mu _{2})\nonumber \\&\quad \qquad +\lambda \sin (\theta _{1}-\mu _{1}) \sin (\theta _{2}-\mu _{2})\big \} \end{aligned}$$
(5)

for \(\theta _{1}, \theta _{2}, \mu _{1}, \mu _{2}\in {\mathbb {T}}\), \(\kappa _{1}, \kappa _{2} \ge 0\), \(\lambda \in {\mathbb {R}}\), and a normalizing constant expressible through \(T(\kappa _{1}, \kappa _{2}, \lambda )^{-1}=4 \pi ^{2} \sum _{m=0}^{\infty }\binom{2m}{m} \big (\lambda /2\big )^{2m} \kappa _{1}^{-m} {\mathcal {I}}_{m}(\kappa _{1}) \kappa _{2}^{-m} {\mathcal {I}}_{m}(\kappa _{2})\), where \({\mathcal {I}}_{m}\) is the modified Bessel function of the first kind and order m. The distribution is pointwise symmetric about \((\mu _1, \mu _2)\), with these two parameters representing the marginal circular means. The parameters \(\kappa _1\) and \(\kappa _2\) measure the marginal concentrations of the distribution about \(\mu _1\) and \(\mu _2\), respectively. The parameter \(\lambda \) measures dependence: positive/negative values of \(\lambda \) correspond to positive/negative correlation between \(\Theta _{1}\) and \(\Theta _{2}\). If \(\lambda =0\), then \(\Theta _{1}\) and \(\Theta _{2}\) are independent, with each variable having a (univariate) von Mises distribution. Density (5) can be bimodal. A sufficient unimodality condition is given by \(\kappa _1\kappa _2>\lambda ^2\) (Mardia et al. 2007). In the bimodal case where \(\varvec{\mu }_1,\varvec{\mu }_2\in {\mathbb {T}}^2\) are the two modes of (5), due to the symmetry of the BSvM, it is easy to see that \({\mathcal {R}}_{\varvec{\mu }_1}(f_\textrm{BSvM}) = {\mathcal {R}}_{\varvec{\mu }_2}(f_\textrm{BSvM})\), thus ensuring that a connected component is well defined.
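The series in the normalizing constant can be checked numerically. The Python sketch below evaluates the series with a power-series routine for \(\kappa ^{-m}{\mathcal {I}}_m(\kappa )\) and compares it with a grid approximation of the integral of the exponential term in (5) over \({\mathbb {T}}^2\), so that dividing the exponential term by this value yields a density integrating to one. The truncation orders, the grid size, and the function names are assumptions made for illustration.

```python
import math

def scaled_bessel(m, kappa, terms=40):
    """kappa^(-m) * I_m(kappa) through its power series (finite at kappa = 0)."""
    return sum((kappa / 2)**(2 * k)
               / (2**m * math.factorial(k) * math.factorial(k + m))
               for k in range(terms))

def bsvm_series(k1, k2, lam, m_max=30):
    """4 pi^2 sum_m binom(2m, m) (lam/2)^(2m) k1^(-m) I_m(k1) k2^(-m) I_m(k2)."""
    total = sum(math.comb(2 * m, m) * (lam / 2)**(2 * m)
                * scaled_bessel(m, k1) * scaled_bessel(m, k2)
                for m in range(m_max))
    return 4 * math.pi**2 * total

def bsvm_integral(k1, k2, lam, n=100):
    """Rectangle-rule integral over the torus of the zero-centered
    exponential term in (5); accurate for smooth periodic integrands."""
    step = 2 * math.pi / n
    grid = [-math.pi + i * step for i in range(n)]
    total = 0.0
    for t1 in grid:
        for t2 in grid:
            total += math.exp(k1 * math.cos(t1) + k2 * math.cos(t2)
                              + lam * math.sin(t1) * math.sin(t2))
    return total * step**2

series = bsvm_series(1.0, 0.5, 1.5)
integral = bsvm_integral(1.0, 0.5, 1.5)
print(series, integral)
```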

The BSvM has some properties that make it a candidate toroidal analog to the bivariate normal. Among these are the facts that it is also part of the exponential family and that it has asymptotic bivariate normal distributions under high concentrations (Singh et al. 2002). Furthermore, the marginals of the BSvM are not von Mises, but the conditionals are. Figure 2 shows different forms of the BSvM density. For simplicity, and without loss of generality due to Proposition 1, we will work with \((\mu _1,\mu _2)=(0,0)\).

No closed expressions for the maximum likelihood estimators of the BSvM parameters exist (Mardia et al. 2008), but these can be computed numerically using moment-based estimators as starting values in the optimization of the log-likelihood. In particular, this enables testing the homogeneity hypothesis \({\mathcal {H}}_0:\kappa _1 = \kappa _2\) vs. \({\mathcal {H}}_1:\kappa _1 \ne \kappa _2\) using the Likelihood Ratio Test (LRT) that asymptotically rejects \({\mathcal {H}}_0\) at the significance level \(\alpha \) whenever \(2({\hat{\ell }}-{\hat{\ell }}_0)>\chi ^2_{\alpha ;1}\), where \({\hat{\ell }}\) and \({\hat{\ell }}_0\) are the maximized log-likelihoods of (5), the latter under \({\mathcal {H}}_0\), and \(\chi ^2_{\alpha ;1}\) is the \(\alpha \)-upper quantile of the chi-square distribution with one degree of freedom. This homogeneity LRT is convenient for distinguishing in practice the edge case \(\kappa _1=\kappa _2\), which has a simple diagonal ridge associated. An analogous LRT can be constructed for testing \({\mathcal {H}}_0:\lambda = 0\) vs. \({\mathcal {H}}_1:\lambda \ne 0\), a null hypothesis of independence associated with horizontal/vertical ridges if \(\kappa _1\) is smaller/larger than \(\kappa _2\).
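The decision rule of these one-degree-of-freedom LRTs can be sketched as follows, with the statistic written as twice the gain in maximized log-likelihood (a nonnegative quantity). The maximized log-likelihoods are assumed to come from a numerical optimizer that is not shown, the numbers in the usage lines are made up, and the function names are hypothetical. The chi-square quantile uses the fact that a chi-square variable with one degree of freedom is a squared standard normal.

```python
from statistics import NormalDist

def chi2_1_upper_quantile(alpha):
    """alpha-upper quantile of the chi-square distribution with 1 df."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z

def lrt_rejects(loglik_full, loglik_null, alpha=0.05):
    """Return (decision, statistic) for the asymptotic LRT with 1 df.
    loglik_full: maximized log-likelihood of the unrestricted model.
    loglik_null: maximized log-likelihood under H0."""
    stat = 2 * (loglik_full - loglik_null)
    return stat > chi2_1_upper_quantile(alpha), stat

# Made-up log-likelihood values, for illustration only
print(lrt_rejects(-512.3, -515.9))
print(lrt_rejects(-512.3, -512.8))
```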

3.2 Integral curve approach

Fig. 3 Catalog of the implicit-equation-approximated ridges \({\mathcal {R}}(f_\textrm{BSvM})\) (black) and \({\mathcal {R}}_{{\textbf{0}}}(f_\textrm{BSvM})\) (red) of a zero-centered BSvM for the same parameter values as in Fig. 2. The background shows the density contour of the BSvM. All the contourplots share the same color scale.

The first alternative for computing \({\mathcal {R}}(f_\textrm{BSvM})\) of a \(\textrm{BSvM}(0,0,\kappa _1, \kappa _2, \lambda )\) uses the integral curve approach. With the Euler algorithm, an initial grid of points converges to the density ridge by following the trajectory defined by the projected gradient, as Fig. 2 shows. It can be seen that the main part of the density ridge captures the main features of the distribution (shape and correlation) while also being periodic. The figure also shows that there is a significant number of “secondary” ridges.

3.3 Implicit ridge equation approach and connected ridges

The implicit equation offers a faster alternative to the computationally-demanding Euler algorithm. In the following, the normalizing constant of the BSvM and a common positive factor will be ignored, as these do not affect the direction of \(\varvec{\eta }_{(p-1)}\). It is easy to check that \( \textrm{D}_1(f_\textrm{BSvM})\propto -\kappa _1 \sin \theta _1 + \lambda \sin \theta _2 \cos \theta _1\), \( \textrm{D}_2(f_\textrm{BSvM})\propto -\kappa _2 \sin \theta _2 + \lambda \sin \theta _1 \cos \theta _2\), \(u\propto \textrm{D}_1^2-\kappa _1 \cos \theta _1 - \lambda \sin \theta _1 \sin \theta _2\), \(v\propto \lambda \cos \theta _1 \cos \theta _2+\textrm{D}_1\textrm{D}_2\), and \(w\propto \textrm{D}_2^2-\kappa _2 \cos \theta _2 - \lambda \sin \theta _1 \sin \theta _2\). These expressions are readily usable in (3)–(4). Figure 3 shows the density ridges obtained with this method, which allow a clear identification of \({\mathcal {R}}_{{\textbf{0}}}(f_\textrm{BSvM})\).
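The following Python sketch illustrates one way of solving the implicit equation along a grid on \(\theta _1\) for a zero-centered BSvM: for each fixed \(\theta _1\), sign changes of the left-hand side of (3) are bracketed on a \(\theta _2\) grid and refined by bisection, keeping only roots with \(\lambda _2<0\). The grid size, the bisection depth, the extra re-check with a normalized eigenvector (which guards against points where the unnormalized eigenvector vanishes), and all function names are assumptions of this sketch, not a description of the ridgetorus implementation.

```python
import math

def bsvm_terms(t1, t2, k1, k2, lam):
    """Derivative terms of the zero-centered BSvM, up to a positive factor."""
    d1 = -k1 * math.sin(t1) + lam * math.sin(t2) * math.cos(t1)
    d2 = -k2 * math.sin(t2) + lam * math.sin(t1) * math.cos(t2)
    u = d1 * d1 - k1 * math.cos(t1) - lam * math.sin(t1) * math.sin(t2)
    v = lam * math.cos(t1) * math.cos(t2) + d1 * d2
    w = d2 * d2 - k2 * math.cos(t2) - lam * math.sin(t1) * math.sin(t2)
    return d1, d2, u, v, w

def implicit_lhs(t1, t2, k1, k2, lam):
    """Left-hand side of the implicit equation (3)."""
    d1, d2, u, v, w = bsvm_terms(t1, t2, k1, k2, lam)
    s = math.sqrt((w - u)**2 + 4 * v**2)
    return d1 * (2*u - 2*w + 2*v - 2*s) + d2 * (w - u + 4*v - s)

def passes_ridge_check(t1, t2, k1, k2, lam, tol=1e-6):
    """lambda_2 < 0 and gradient orthogonal to a unit eigenvector u_2."""
    d1, d2, u, v, w = bsvm_terms(t1, t2, k1, k2, lam)
    lam2 = 0.5 * (u + w - math.sqrt((u - w)**2 + 4 * v**2))
    c1, c2 = (v, lam2 - u), (lam2 - w, v)
    e = c1 if math.hypot(*c1) >= math.hypot(*c2) else c2
    n = math.hypot(*e)
    if n == 0.0 or lam2 >= 0:
        return False
    proj = (d1 * e[0] + d2 * e[1]) / n
    return abs(proj) <= tol * math.hypot(d1, d2) + 1e-12

def ridge_roots(t1, k1, k2, lam, n_grid=200, n_bisect=60):
    """Ridge points (t1, theta_2) for a fixed t1, with theta_2 in [-pi, pi)."""
    step = 2 * math.pi / n_grid
    # Half-step offset avoids evaluating exactly at 0; last cell wraps around
    grid = [-math.pi + (j + 0.5) * step for j in range(n_grid + 1)]
    roots = []
    for a, b in zip(grid[:-1], grid[1:]):
        g_lo, g_hi = implicit_lhs(t1, a, k1, k2, lam), implicit_lhs(t1, b, k1, k2, lam)
        if g_lo * g_hi >= 0:
            continue
        lo, hi = a, b
        for _ in range(n_bisect):
            mid = 0.5 * (lo + hi)
            g_mid = implicit_lhs(t1, mid, k1, k2, lam)
            if g_lo * g_mid <= 0:
                hi = mid
            else:
                lo, g_lo = mid, g_mid
        root = 0.5 * (lo + hi)
        if passes_ridge_check(t1, root, k1, k2, lam):
            roots.append((root + math.pi) % (2 * math.pi) - math.pi)
    return roots

print(ridge_roots(1.0, 0.3, 0.3, 1.0))
```

For \(\kappa _1=\kappa _2=0.3\) and \(\lambda =1\), the roots returned at \(\theta _1=1\) include \(\theta _2\approx 1\), the diagonal point. Looping `ridge_roots` over a grid of \(\theta _1\) values gives a point-cloud approximation of \({\mathcal {R}}(f_\textrm{BSvM})\).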

However, in general, there is no explicit parametrization of \({\mathcal {R}}_{\varvec{\mu }}(f_\textrm{BSvM})\) with \(\varvec{\mu } \in {\mathbb {T}}^2\), so a method is needed to filter it from the full \({\mathcal {R}}(f_\textrm{BSvM})\) obtained from the implicit equation. The symmetry of \(f_{\textrm{BSvM}}\) about \(\varvec{\mu }\) and Proposition 1 imply (see Fig. 3 for graphical insight) that: (1) \({\mathcal {R}}_{\varvec{\mu }}(f_\textrm{BSvM}) \subset \{(\theta _1, \theta _2)\in {\mathbb {T}}^2: \textrm{sign}(\theta _1/\theta _2) = \textrm{sign}(\lambda )\}\); and (2) the computation of \({\mathcal {R}}_{\varvec{\mu }}(f_\textrm{BSvM})\) reduces to the first or second quadrant, depending on \(\textrm{sign}(\lambda )\). Finally, since \({\mathcal {R}}_{\varvec{\mu }}(f_\textrm{BSvM})\) passes through \(\varvec{\mu }\) by definition, a connected component can be obtained by iteratively adding the sufficiently close ridge points to the last point added to the set, starting from \(\varvec{\mu }\).
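The filtering step described above can be sketched as a greedy chain on the torus: starting from \(\varvec{\mu }\), the closest remaining ridge point is appended to the set while it lies within a threshold of the last point added. The threshold `eps`, the toy point cloud, and the function names are illustrative assumptions. The sketch follows a single branch from the starting point, with the remaining part of the component assumed to be recovered from the symmetry about \(\varvec{\mu }\).

```python
import math

def torus_dist(p, q):
    """Euclidean-type distance on the torus with 2*pi-periodic coordinates."""
    total = 0.0
    for a, b in zip(p, q):
        delta = abs(a - b) % (2 * math.pi)
        delta = min(delta, 2 * math.pi - delta)
        total += delta * delta
    return math.sqrt(total)

def connected_component(points, start, eps):
    """Greedy chain of ridge points reachable from start with gaps <= eps."""
    remaining = list(points)
    chain = [start]
    while remaining:
        last = chain[-1]
        j = min(range(len(remaining)),
                key=lambda i: torus_dist(remaining[i], last))
        if torus_dist(remaining[j], last) > eps:
            break                      # no sufficiently close point is left
        chain.append(remaining.pop(j))
    return chain

# Toy ridge cloud: a diagonal branch through the origin plus two stray points
main = [(0.1 * i, 0.1 * i) for i in range(1, 31)]
stray = [(2.0, -2.0), (2.1, -2.1)]
component = connected_component(main + stray, (0.0, 0.0), eps=0.2)
print(len(component))
```

The threshold should be somewhat larger than the spacing of the grid used to solve the implicit equation, so that consecutive points of the same branch are linked while secondary ridges are left out.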

The following result presents limit cases for which the density ridge \({\mathcal {R}}_{\textbf{0}}(f_{\textrm{BSvM}})\) is explicit.

Proposition 3

The \({\textbf{0}}\)-connected density ridge \({\mathcal {R}}_{\textbf{0}}(f_{\textrm{BSvM}})\) of a \(\textrm{BSvM}(0,0,\kappa _1, \kappa _2, \lambda )\) admits the following representations:

(i) When \(\kappa _2>\kappa _1=0\) and \(\lambda =0\), \({\mathcal {R}}_{\textbf{0}}(f_\textrm{BSvM})=\{(\theta _1,\theta _2)\in {\mathbb {T}}^2: \theta _2=0\}\).

(ii) When \(\kappa _1=\kappa _2\ge 0\) and \(\lambda \in {\mathbb {R}}\), \({\mathcal {R}}_{\textbf{0}}(f_\textrm{BSvM})\supset \{(\theta _1,\theta _2)\in {\mathbb {T}}^2: -\cos \theta _1< \vert \lambda \vert /\kappa _1,\ \theta _2=\textrm{sign}(\lambda )\theta _1 \}\).

Remark 1

Note that (ii) does not characterize \({\mathcal {R}}_{{\textbf{0}}}\), unlike (i). Although in view of the third leftmost panel in Fig. 3 it seems possible to achieve this characterization with an additional restriction in Definition 2, we do not pursue this further due to its limited practical interest: given that \({\mathcal {R}}_{\textbf{0}}(f_\textrm{BSvM})\not \supset \{(\theta _1,\theta _2)\in {\mathbb {T}}^2: \theta _2=\textrm{sign}(\lambda )\theta _1\}\), \({\mathcal {R}}_{\textbf{0}}(f_\textrm{BSvM})\) will not always wrap periodically at \(-\pi \equiv \pi \). To solve this edge-case issue in practice, and in coherence with the other cases, we simply extend \({\mathcal {R}}_{\textbf{0}}(f_\textrm{BSvM})\) along the diagonal.

4 Density ridges for bivariate wrapped Cauchy

4.1 Bivariate wrapped Cauchy

The density of a random vector \((\Theta _{1}, \Theta _{2})\) following the Bivariate Wrapped Cauchy (BWC) distribution proposed by Kato and Pewsey (2015) is

$$\begin{aligned}&f_\textrm{BWC}(\theta _{1}, \theta _{2};\mu _1,\mu _2,\xi _1,\xi _2,\rho )\\&\quad :=\;c\{c_{0}-c_{1} \cos (\theta _{1}-\mu _{1})-c_{2} \cos (\theta _{2}-\mu _{2})\\&\qquad -c_{3} \cos (\theta _{1}-\mu _{1}) \cos (\theta _{2}-\mu _{2})\\&\qquad -c_{4} \sin (\theta _{1}-\mu _{1}) \sin (\theta _{2}-\mu _{2})\}^{-1}, \end{aligned}$$

for \(\theta _{1}, \theta _{2}, \mu _1, \mu _2\in {\mathbb {T}}\), \(0 \le \xi _{1}, \xi _{2}<1\), and \(-1<\rho <1\). The several constants are given as follows: \(c=(1-\rho ^{2})(1-\xi _{1}^{2})(1-\xi _{2}^{2}) /(4 \pi ^{2})\), \(c_{0}=(1+\rho ^{2})(1+\xi _{1}^{2})(1+\xi _{2}^{2})-8\vert \rho \vert \xi _{1} \xi _{2}\), \(c_{1}= 2(1+\rho ^{2}) \xi _{1}(1+\xi _{2}^{2})-4\vert \rho \vert (1+\xi _{1}^{2}) \xi _{2}\), \(c_{2}=2(1+\rho ^{2})(1+\xi _{1}^{2}) \xi _{2}-4\vert \rho \vert \xi _{1}(1+\xi _{2}^{2})\), \(c_{3}=-4(1+\rho ^{2}) \xi _{1} \xi _ 2+2\vert \rho \vert (1+\xi _{1}^{2})(1+\xi _{2}^{2})\), and \(c_{4}=2 \rho (1-\xi _{1}^{2})(1-\xi _{2}^{2})\). Analogously to the BSvM, \(\mu _1\) and \(\mu _2\) represent the marginal circular means of the density. The parameters \(\xi _{1}\) and \(\xi _{2}\) regulate the concentrations of the marginal distributions, that of \(\Theta _j\) being circular uniform when \(\xi _j=0\) (\(j=1,2\)). When \(\xi _{1}, \xi _{2}>0,\) the density is unimodal and pointwise symmetric about \((\mu _{1}, \mu _{2})\). As \(\xi _j \rightarrow 1\), the marginal distribution of \(\Theta _j\) tends to a point mass at \(\mu _j\). The association between \(\Theta _{1}\) and \(\Theta _{2}\) is controlled by the parameter \(\rho \), \(\rho =0\) corresponding to independence. Positive/negative values of \(\rho \) correspond to positive/negative correlation between \(\Theta _{1}\) and \(\Theta _{2}\). Figure 4 shows different forms of the BWC density. This distribution is always unimodal (Kato and Pewsey 2015), thus ensuring that \({\mathcal {R}}_{\varvec{\mu }}\) is well-defined.
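A direct transcription of the BWC density and its constants into Python is sketched below, together with a grid-based check that the density integrates to one in two special cases: independence (\(\rho =0\)), where the density factorizes into two wrapped Cauchy densities, and uniform marginals (\(\xi _1=\xi _2=0\)). The function names and the grid size are illustrative assumptions.

```python
import math

def bwc_constants(xi1, xi2, rho):
    """Constants c, c0, ..., c4 of the BWC density as listed in the text."""
    r = abs(rho)
    c = (1 - rho**2) * (1 - xi1**2) * (1 - xi2**2) / (4 * math.pi**2)
    c0 = (1 + rho**2) * (1 + xi1**2) * (1 + xi2**2) - 8 * r * xi1 * xi2
    c1 = 2 * (1 + rho**2) * xi1 * (1 + xi2**2) - 4 * r * (1 + xi1**2) * xi2
    c2 = 2 * (1 + rho**2) * (1 + xi1**2) * xi2 - 4 * r * xi1 * (1 + xi2**2)
    c3 = -4 * (1 + rho**2) * xi1 * xi2 + 2 * r * (1 + xi1**2) * (1 + xi2**2)
    c4 = 2 * rho * (1 - xi1**2) * (1 - xi2**2)
    return c, c0, c1, c2, c3, c4

def bwc_density(t1, t2, xi1, xi2, rho, mu1=0.0, mu2=0.0):
    """BWC density evaluated at (t1, t2)."""
    c, c0, c1, c2, c3, c4 = bwc_constants(xi1, xi2, rho)
    a, b = t1 - mu1, t2 - mu2
    denom = (c0 - c1 * math.cos(a) - c2 * math.cos(b)
             - c3 * math.cos(a) * math.cos(b)
             - c4 * math.sin(a) * math.sin(b))
    return c / denom

def grid_integral(xi1, xi2, rho, n=200):
    """Rectangle-rule integral of the density over the torus."""
    step = 2 * math.pi / n
    grid = [-math.pi + i * step for i in range(n)]
    return sum(bwc_density(t1, t2, xi1, xi2, rho)
               for t1 in grid for t2 in grid) * step**2

print(grid_integral(0.3, 0.5, 0.0), grid_integral(0.0, 0.0, 0.5))
```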

The BWC is closed under marginalization and conditioning, meaning that the resulting marginals and conditional densities belong to the wrapped Cauchy family. This distinctive closedness property is shared with the bivariate normal distribution, whose marginals and conditionals are also normal. The functional form of the marginals is immediate from the BWC’s construction as a particular Wehrly and Johnson (1980) model, but that of the conditionals is not (despite being wrapped Cauchys).

As in the BSvM case, there are no closed expressions for the maximum likelihood estimators for the BWC parameters (Kato and Jones 2015), so numerical routines are needed. Analogously to the BSvM case, the LRTs for \({\mathcal {H}}_0:\xi _1=\xi _2\) vs. \({\mathcal {H}}_1:\xi _1\ne \xi _2\) (homogeneity) and \({\mathcal {H}}_0:\rho =0\) vs. \({\mathcal {H}}_1:\rho \ne 0\) (independence) are also relevant to distinguish diagonal and horizontal/vertical ridges in a practical and principled manner.

The integral curve approach for the BWC is completely analogous to the BSvM case.

Fig. 4

Catalog of the implicit-equation-approximated ridges \({\mathcal {R}}(f_\textrm{BWC})\) (black) and \({\mathcal {R}}_{{\textbf{0}}}(f_\textrm{BWC})\) (red) with parameter values \((\xi _1, \xi _2, \rho ) \in \{(0.15, 0.075, 0.25),(0.2, 0.7, 0.2),(0.3, 0.3, 0.6),(0.025, 0.6, 0.7) \}\). The background shows the density contours of the BWC. All the contour plots share a common color scale

4.2 Implicit ridge equation approach and connected ridges

It is simple to check that the derivatives of the BWC density, excluding a common positive factor, are \( \textrm{D}_1(f_\textrm{BWC})\propto -c_1 \sin \theta _1 - c_3 \sin \theta _1 \cos \theta _2 + c_4 \sin \theta _2 \cos \theta _1\), \( \textrm{D}_2(f_\textrm{BWC})\propto -c_2 \sin \theta _2 - c_3 \sin \theta _2 \cos \theta _1 + c_4 \sin \theta _1 \cos \theta _2\), \(u\propto 2 \textrm{D}_1^2f^* -c_1 \cos \theta _1 - c_3 \cos \theta _1 \cos \theta _2 -c_4 \sin \theta _1\sin \theta _2\), \(v\propto 2\textrm{D}_1\textrm{D}_2f^*+c_3\sin \theta _2\sin \theta _1+c_4\cos \theta _1\cos \theta _2\), and \(w\propto 2 \textrm{D}_2^2f^* - c_2 \cos \theta _2 - c_3 \cos \theta _2 \cos \theta _1 -c_4 \sin \theta _2\sin \theta _1\), where \(f^*=f/c\). Figure 4 shows the density ridges obtained with the implicit equation approach. This figure illustrates that, unlike the sinusoidal shapes of \({\mathcal {R}}_{{\varvec{0}}}(f_\textrm{BSvM})\), the shapes of \({\mathcal {R}}_{{\varvec{0}}}(f_\textrm{BWC})\) involve ridges that connect (0, 0) with \((\pm \pi ,\pm \pi )\) for unequal marginal concentrations.
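The first-order expressions can be verified against finite differences. The sketch below assumes \(\varvec{\mu }={\textbf{0}}\) and the density form \(f=c/(c_0-c_1\cos \theta _1-c_2\cos \theta _2-c_3\cos \theta _1\cos \theta _2-c_4\sin \theta _1\sin \theta _2)\); under that assumption the common positive factor is \(f^2/c\). All function names are hypothetical.

```python
import math

def bwc_constants(xi1, xi2, rho):
    """Constants c, c0, ..., c4 of the BWC density as listed in Sect. 4.1."""
    a = abs(rho)
    c = (1 - rho**2) * (1 - xi1**2) * (1 - xi2**2) / (4 * math.pi**2)
    c0 = (1 + rho**2) * (1 + xi1**2) * (1 + xi2**2) - 8 * a * xi1 * xi2
    c1 = 2 * (1 + rho**2) * xi1 * (1 + xi2**2) - 4 * a * (1 + xi1**2) * xi2
    c2 = 2 * (1 + rho**2) * (1 + xi1**2) * xi2 - 4 * a * xi1 * (1 + xi2**2)
    c3 = -4 * (1 + rho**2) * xi1 * xi2 + 2 * a * (1 + xi1**2) * (1 + xi2**2)
    c4 = 2 * rho * (1 - xi1**2) * (1 - xi2**2)
    return c, c0, c1, c2, c3, c4

def density0(th1, th2, consts):
    """Assumed BWC density with location (0, 0)."""
    c, c0, c1, c2, c3, c4 = consts
    denom = (c0 - c1 * math.cos(th1) - c2 * math.cos(th2)
             - c3 * math.cos(th1) * math.cos(th2)
             - c4 * math.sin(th1) * math.sin(th2))
    return c / denom

def gradient_numerators(th1, th2, consts):
    """Right-hand sides of the expressions for D_1 f and D_2 f in the text."""
    _, _, c1, c2, c3, c4 = consts
    g1 = (-c1 * math.sin(th1) - c3 * math.sin(th1) * math.cos(th2)
          + c4 * math.sin(th2) * math.cos(th1))
    g2 = (-c2 * math.sin(th2) - c3 * math.sin(th2) * math.cos(th1)
          + c4 * math.sin(th1) * math.cos(th2))
    return g1, g2

def finite_difference_gradient(th1, th2, consts, h=1e-5):
    """Central finite differences of the assumed density."""
    d1 = (density0(th1 + h, th2, consts) - density0(th1 - h, th2, consts)) / (2 * h)
    d2 = (density0(th1, th2 + h, consts) - density0(th1, th2 - h, consts)) / (2 * h)
    return d1, d2

consts = bwc_constants(0.3, 0.5, -0.4)   # arbitrary illustrative parameters
point = (0.7, -1.1)
factor = density0(*point, consts) ** 2 / consts[0]   # common factor f^2 / c
g1, g2 = gradient_numerators(*point, consts)
fd1, fd2 = finite_difference_gradient(*point, consts)
print(fd1 - factor * g1, fd2 - factor * g2)  # both differences should be tiny
```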

The following proposition shows limit cases for which \({\mathcal {R}}_{{\textbf{0}}}(f_\textrm{BWC})\) is explicit.

Proposition 4

The \({\textbf{0}}\)-connected density ridge \({\mathcal {R}}_{{\textbf{0}}}(f_\textrm{BWC})\) of a \(\textrm{BWC}(0,0,\xi _1,\xi _2,\rho )\) admits the following explicit representations:

  1. (i)

    When \(\xi _1=\rho =0\) and \(\xi _2\in (0,1)\), \({\mathcal {R}}_{{\textbf{0}}}(f_\textrm{BWC})=\{(\theta _1,\theta _2)\in {\mathbb {T}}^2: \theta _2=0\}\).

  2. (ii)

    When \(0\le \xi _1=\xi _2<1\) and \(\rho \in (-1,1)\), \({\mathcal {R}}_{{\textbf{0}}}(f_\textrm{BWC})\supset \{(\theta _1,\theta _2)\in {\mathbb {T}}^2: \theta _2=\textrm{sign}(\rho )\theta _1\}\).

Remark 2

Analogously to Remark 1, in practical applications we consider \({\mathcal {R}}_{\textbf{0}}(f_\textrm{BWC}):=\{(\theta _1,\theta _2)\in {\mathbb {T}}^2: \theta _2=\textrm{sign}(\rho )\theta _1\}\) for case (ii).

5 Toroidal ridge PCA

5.1 Ridge parametrization

Due to the difficulty in explicitly solving the implicit equation (3), we have not found any simple parametric form for the curve defined by \({\mathcal {R}}_{\varvec{\mu }}(f)\) beyond limit cases. This poses an important hindrance to the tractability of the distance operation along \({\mathcal {R}}_{\varvec{\mu }}(f)\) and the projection operation that maps a point on \({\mathbb {T}}^2\) to its closest point on \({\mathcal {R}}_{\varvec{\mu }}(f)\), both being core mechanisms for the definition of the forthcoming scores. To prevent this issue from draining the performance of TR-PCA, we consider a Fourier-type parametrization of \({\mathcal {R}}_{\varvec{\mu }}(f)\) given by

$$\begin{aligned} r_{f,\varvec{\mu },j}(\phi )&:=\textrm{cmod}\big (\mu _l+ \rho _m(\phi -\mu _j)-\rho _m(0)\big ),\\ \rho _m(\theta )&:=\textrm{atan2}({\mathcal {S}}_{m} (\theta ), {\mathcal {C}}_{m}(\theta )),\nonumber \end{aligned}$$
(6)

where \(\phi ,\theta \in {\mathbb {T}}\), \(\textrm{cmod}(\cdot ):=(\cdot +\pi )\mod 2\pi -\pi \), \(j,l\in \{1,2\}\), \(l\ne j\), and \({\mathcal {C}}_{m}(\theta ):= a_{0}/2 + \sum _{k=1}^m a_k \cos (k\theta )\) and \({\mathcal {S}}_{m}(\theta ):= \sum _{k=1}^m b_k \sin (k\theta )\) are truncated cosine/sine Fourier series with \(m\ge 1\). The coefficients \(\{(a_k,b_k)\}_{k=0}^m\) (with \(b_0=0\)) are \(a_k:=\frac{1}{\pi }\int _{{\mathbb {T}}} \cos (R_{f,{\textbf{0}},j}(\theta )) \cos (k\theta )\,\textrm{d}\theta \) and \(b_k:=\frac{1}{\pi }\int _{{\mathbb {T}}} \sin (R_{f,{\textbf{0}},j}(\theta )) \sin (k\theta )\,\textrm{d}\theta \), where \(\{(\phi ,R_{f,\varvec{\mu },1}(\phi )):\phi \in {\mathbb {T}}\}={\mathcal {R}}_{\varvec{\mu }}(f)\) or \(\{(R_{f,\varvec{\mu },2}(\phi ),\phi ):\phi \in {\mathbb {T}}\}={\mathcal {R}}_{\varvec{\mu }}(f)\), depending on which is the index coordinate j (see the next paragraph). The consideration of only the cosine/sine parts in (6) is justified by the pointwise symmetry of \({\mathcal {R}}_{{\textbf{0}}}(f_{\textrm{BSvM}})\) and \({\mathcal {R}}_{{\textbf{0}}}(f_{\textrm{BWC}})\), which renders null Fourier cosine/sine coefficients for sine/cosine components. In practice, \(\{(a_k,b_k)\}_{k=0}^m\) is approximated by Gaussian quadrature using a grid of points in \({\mathcal {R}}_{\varvec{\mu }}(f)\) determined using the implicit equation method. In numerical experiments, the truncation of the Fourier series (6) to \(m=15\) was found to be a sensible choice for a large number of parameter specifications of the BSvM and BWC densities. With \(m=15\), the maximum distance between the implicit-equation computation of \({\mathcal {R}}_{\varvec{\mu }}(f)\) and the Fourier-parametrized approximation was found to be smaller than \(10^{-2}\). Therefore, we consider \(m=15\) to Fourier-parametrize \({\mathcal {R}}_{\varvec{\mu }}(f_\textrm{BSvM})\) and \({\mathcal {R}}_{\varvec{\mu }}(f_\textrm{BWC})\).
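A minimal Python sketch of the Fourier-type parametrization (6) follows. It replaces the Gaussian quadrature mentioned above with a plain periodic rectangle rule, and uses the hypothetical ridge function \(R(\theta )=0.5\sin \theta \) as a stand-in for the implicit-equation grid. All names are illustrative.

```python
import math

def cmod(x):
    """Wrap x to [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi

def fourier_coefficients(R, m=15, n_grid=1024):
    """Approximate a_k and b_k with the rectangle rule on [-pi, pi)."""
    h = 2 * math.pi / n_grid
    grid = [-math.pi + i * h for i in range(n_grid)]
    a = [sum(math.cos(R(t)) * math.cos(k * t) for t in grid) * h / math.pi
         for k in range(m + 1)]
    b = [0.0] + [sum(math.sin(R(t)) * math.sin(k * t) for t in grid) * h / math.pi
                 for k in range(1, m + 1)]
    return a, b

def rho_m(theta, a, b):
    """atan2 of the truncated sine and cosine Fourier series."""
    C = a[0] / 2 + sum(a[k] * math.cos(k * theta) for k in range(1, len(a)))
    S = sum(b[k] * math.sin(k * theta) for k in range(1, len(b)))
    return math.atan2(S, C)

def ridge_curve(phi, mu_j, mu_l, a, b):
    """r_{f,mu,j}(phi) in (6): shift to the location and wrap."""
    return cmod(mu_l + rho_m(phi - mu_j, a, b) - rho_m(0.0, a, b))

R_example = lambda t: 0.5 * math.sin(t)   # hypothetical ridge through (0, 0)
a, b = fourier_coefficients(R_example)
check_grid = [-math.pi + 0.01 * i for i in range(629)]
max_err = max(abs(ridge_curve(p, 0.0, 0.0, a, b) - R_example(p))
              for p in check_grid)
print(max_err)  # approximation error of the truncated series
```

For this smooth example the error is far below the \(10^{-2}\) threshold quoted above; ridges with sharper turns need the accuracy check to be repeated.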

To obtain a one-to-one parametrization for the BSvM and BWC densities, in practice \(r_{f,\varvec{\mu },j}\) is indexed along the variable with the smallest concentration (e.g., \(\theta _2\) and \(\theta _1\) for the first and second leftmost panels in Figure 2, respectively), which is straightforward to identify. For the sake of notational simplicity, henceforth we assume without loss of generality that \(j=1\).

To define the first scores, it is fundamental to parametrize the curve \(r_{f,\varvec{\mu },1}\) through its arc length from \(\phi =\mu _1\) to \(\phi =\mu _1+t\), \(t\in [0,2\pi )\): \(L(t):= \int _{\mu _1}^{\mu _1+t} \sqrt{1 + (r_{f,\varvec{\mu },1}'(\phi ))^2}\,\textrm{d}\phi \). The length of the curve of \(r_{f,\varvec{\mu },1}\) is \(R:=\lim _{t\rightarrow (2\pi )^{-}}L(t)\). The arc-length parametrized curve in \({\mathbb {T}}^2\) is thus

$$\begin{aligned} s\in [0, R)\mapsto {\textbf{r}}_{f,\varvec{\mu }}(s):=\big ((\textrm{id},r_{f,\varvec{\mu },1})'\circ L^{-1}\big )(s),\!\! \end{aligned}$$
(7)

with \(\textrm{id}\) denoting the identity function. Two further tweaks are required on \({\textbf{r}}_{f,\varvec{\mu }}\) to: (1) scale the parametrization to \(s\in [0,2\pi )\); (2) center the parametrization to \(s\in [-\pi ,\pi )\). The first step sets the range of the scores to \(2\pi \), just as the original data in \({\mathbb {T}}\), while the second step is crucial for assigning signs. The definition below summarizes the construction.

Definition 3

(Scaled-centered arc-length ridge parametrization). The scaled-centered arc-length parametrization of \({\mathcal {R}}_{\varvec{\mu }}(f)\) based on (6)–(7) is \(\alpha \in {\mathbb {T}}\mapsto \tilde{{\textbf{r}}}_{f,\varvec{\mu }}(\alpha )\) with

$$\begin{aligned} \tilde{{\textbf{r}}}_{f,\varvec{\mu }}(\alpha ):={\textbf{r}}_{f,\varvec{\mu }}\left( [(R/(2\pi ))\alpha ] \mod R\right) . \end{aligned}$$
(8)
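The construction in (7)–(8) can be sketched numerically as follows. The arc length is accumulated with the trapezoidal rule and inverted by linear interpolation, and the curve is evaluated at \(\phi =\mu _1+L^{-1}(s)\), which is our reading of (7). The diagonal ridge used in the example and all function names are assumptions for illustration.

```python
import math
from bisect import bisect_right

def cmod(x):
    """Wrap x to [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi

def arc_length_table(dr, mu1, n_grid=2000):
    """Tabulate L(t) on t in [0, 2*pi] for the curve phi -> (phi, r(phi))."""
    h = 2 * math.pi / n_grid
    ts = [i * h for i in range(n_grid + 1)]
    speed = [math.sqrt(1 + dr(mu1 + t) ** 2) for t in ts]
    L = [0.0]
    for i in range(n_grid):
        L.append(L[-1] + 0.5 * h * (speed[i] + speed[i + 1]))
    return ts, L

def scaled_centered_curve(alpha, r, mu1, ts, L):
    """Evaluate the scaled-centered arc-length parametrization at alpha."""
    R = L[-1]                                    # total curve length
    s = (R / (2 * math.pi) * alpha) % R          # arc length in [0, R)
    i = min(bisect_right(L, s) - 1, len(L) - 2)  # L[i] <= s < L[i + 1]
    w = (s - L[i]) / (L[i + 1] - L[i])
    t = ts[i] + w * (ts[i + 1] - ts[i])          # t = L^{-1}(s), interpolated
    phi = mu1 + t
    return cmod(phi), cmod(r(phi))

# Hypothetical diagonal ridge through mu = (0, 0): r(phi) = phi, r'(phi) = 1.
r_diag = lambda phi: phi
dr_diag = lambda phi: 1.0
ts, L = arc_length_table(dr_diag, 0.0)
print(L[-1])                                              # curve length R
print(scaled_centered_curve(0.0, r_diag, 0.0, ts, L))     # location mu
print(scaled_centered_curve(-1.0, r_diag, 0.0, ts, L))    # negative side
```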

The proposition below collects two immediate properties of \(\tilde{{\textbf{r}}}_{f,\varvec{\mu }}\) for the BSvM and BWC densities.

Proposition 5

For \(f_\textrm{BSvM}\) and \(f_\textrm{BWC}\), with \(\varvec{\mu }\) being its location parameter, the curve in (8) is such that:

  1. (i)

    \(\tilde{{\textbf{r}}}_{f,\varvec{\mu }}(0)=\varvec{\mu }\) and \(\tilde{{\textbf{r}}}_{f,\varvec{\mu }}(\pm \pi )\in \{\textrm{cmod}(\mu _1\pm \pi ,\mu _2)', \textrm{cmod}(\mu _1\pm \pi ,\mu _2\pm \pi )'\}\);

  2. (ii)

    the signed distance between two points \(\varvec{\phi }_i:=\tilde{{\textbf{r}}}_{f,\varvec{\mu }}(\alpha _i)\), \(\alpha _i\in {\mathbb {T}}\), \(i=1,2\), along the curve \({\textbf{r}}_{f,\varvec{\mu }}\) is \((R/(2\pi ))\,\textrm{cmod}(\alpha _1-\alpha _2)\).

5.2 Scores

Projections on \({\mathcal {R}}_{\varvec{\mu }}(f)\) are defined via the fast handle \(\tilde{{\textbf{r}}}_{f,\varvec{\mu }}\). They involve the toroidal distance \(d_{{\mathbb {T}}^p}(\varvec{\theta },\varvec{\phi }):=\sqrt{\sum _{j=1}^p d_{{\mathbb {T}}}(\theta _j,\phi _j)^2}\), \(p\ge 1\), where \(d_{{\mathbb {T}}}(\theta _j,\phi _j):=\min \{\vert \theta _j-\phi _j\vert , 2\pi -\vert \theta _j-\phi _j\vert \}\), for \(\theta _j,\phi _j\in {\mathbb {T}}\).
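The toroidal distance is straightforward to compute. In the sketch below, the extra reduction modulo \(2\pi \) is a convenience that is not in the definition above; it allows arguments outside \([-\pi ,\pi )\). The function names are illustrative.

```python
import math

def circ_dist(theta, phi):
    """Circular distance d_T: the shorter of the two arcs joining the angles."""
    d = abs(theta - phi) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def tor_dist(theta, phi):
    """Toroidal distance d_{T^p} between two tuples of p angles."""
    return math.sqrt(sum(circ_dist(t, p) ** 2 for t, p in zip(theta, phi)))

# Angles 3 and -3 are close on the circle even though |3 - (-3)| = 6.
print(circ_dist(3.0, -3.0))
print(tor_dist((3.0, 0.0), (-3.0, 0.5)))
```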

Definition 4

(Ridge projections). For \(\varvec{\theta } \in {\mathbb {T}}^2\), its projection to the Fourier-parametrized \({\mathcal {R}}_{\varvec{\mu }}(f)\), \(\varvec{\mu }\in {\mathbb {T}}^2\), is

$$\begin{aligned} \textrm{proj}_{f,\varvec{\mu }}(\varvec{\theta })&:=\tilde{{\textbf{r}}}_{f,\varvec{\mu }}\left( \alpha _{f,\varvec{\mu }}(\varvec{\theta })\right) , \end{aligned}$$

where the projection argument is

$$\begin{aligned} \alpha _{f,\varvec{\mu }}(\varvec{\theta })&:= \arg \min _{\alpha \in {\mathbb {T}}} d_{{\mathbb {T}}^2}\left( \varvec{\theta },\tilde{{\textbf{r}}}_{f,\varvec{\mu }}(\alpha )\right) . \end{aligned}$$
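One simple way to approximate the projection argument is a grid search over \(\alpha \) that is refined on nested grids. This is only a sketch: the search strategy, the function names, and the diagonal curve used as an example are assumptions. When the minimizer is not unique, the first one found is returned.

```python
import math

def cmod(x):
    """Wrap x to [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi

def circ_dist(a, b):
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def tor_dist(theta, phi):
    return math.sqrt(sum(circ_dist(t, p) ** 2 for t, p in zip(theta, phi)))

def project(theta, curve, n_grid=720, n_levels=3):
    """Return (alpha, curve(alpha)) minimizing the toroidal distance to theta."""
    lo, hi = -math.pi, math.pi
    best = 0.0
    for _ in range(n_levels):
        h = (hi - lo) / n_grid
        candidates = [lo + i * h for i in range(n_grid + 1)]
        best = min(candidates,
                   key=lambda al: tor_dist(theta, curve(cmod(al))))
        lo, hi = best - h, best + h      # zoom in around the current minimizer
    alpha = cmod(best)
    return alpha, curve(alpha)

# Hypothetical scaled-centered curve: the diagonal alpha -> (alpha, alpha).
diagonal = lambda al: (al, al)
alpha_hat, proj = project((1.0, 1.2), diagonal)
print(alpha_hat, proj)  # alpha close to 1.1, the midpoint of the two angles
```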

The TR-PCA scores for an arbitrary point \(\varvec{\theta }\in {\mathbb {T}}^2\) are defined by analogy with ordinary PCA. The first score is the signed distance along \(\tilde{{\textbf{r}}}_{f,\varvec{\mu }}\) between \(\textrm{proj}_{f,\varvec{\mu }}(\varvec{\theta })\) and \(\varvec{\mu }\). The second score is the signed distance between \(\varvec{\theta }\) and \(\textrm{proj}_{f,\varvec{\mu }}(\varvec{\theta })\). The sign of the second score is set according to the relative position of the tangent and normal vectors, both at the projection point.

Definition 5

(Scores in TR-PCA). For \(\varvec{\theta } \in {\mathbb {T}}^2\), its first TR-PCA score, \(s_1(\varvec{\theta })\), is

$$\begin{aligned} s_{1}(\varvec{\theta }):=\alpha _{f,\varvec{\mu }}(\varvec{\theta }). \end{aligned}$$

The second TR-PCA score, \(s_2(\varvec{\theta })\), is

$$\begin{aligned} \vert s_{2}(\varvec{\theta })\vert&:= (\pi / m_2)\,d_{{\mathbb {T}}^2}(\varvec{\theta },\textrm{proj}_{f,\varvec{\mu }}(\varvec{\theta })),\\ \textrm{sign}(s_2(\varvec{\theta }))&:=\textrm{sign}\big (\angle ({\varvec{t}}(\varvec{\theta }))-\angle ({\varvec{n}}(\varvec{\theta }))\big ),\nonumber \end{aligned}$$
(9)

where \({\varvec{t}}(\varvec{\theta }):= \tilde{{{\textbf {r}}}}'_{f,\varvec{\mu }}(\alpha _{f,\varvec{\mu }}(\varvec{\theta }))\), \({\varvec{n}}(\varvec{\theta }):=\textrm{cmod}(\textrm{proj}_{f,\varvec{\mu }}(\varvec{\theta })-\varvec{\theta })\), and \(\angle ({\textbf{v}}):=\textrm{atan2}(v_2,v_1)\).

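The first factor in (9), where \(m_2:=\max _{\varvec{\theta }\in {\mathbb {T}}^2} d_{{\mathbb {T}}^2}(\varvec{\theta },\textrm{proj}_{f,\varvec{\mu }}(\varvec{\theta }))\) is the maximum distance from a point on \({\mathbb {T}}^2\) to the ridge, is included to homogenize the scales of both scores, so that \((s_{1}(\varvec{\theta }),s_{2}(\varvec{\theta }))'\in {\mathbb {T}}^2\).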
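The scores of Definition 5 can be sketched in Python once the projection argument \(\alpha _{f,\varvec{\mu }}(\varvec{\theta })\) is available. Here the tangent is approximated by central differences, and \(m_2\) is passed in as the maximum toroidal distance from a point of \({\mathbb {T}}^2\) to the curve, which is how we read the normalizing constant. The diagonal curve and all names are illustrative assumptions.

```python
import math

def cmod(x):
    """Wrap x to [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi

def tr_pca_scores(theta, alpha, curve, m2, h=1e-6):
    """Scores (s1, s2) of theta, given its projection argument alpha."""
    proj = curve(alpha)
    # Tangent t: wrapped central difference of the curve at alpha.
    fwd, bwd = curve(cmod(alpha + h)), curve(cmod(alpha - h))
    tangent = [cmod(f - b) / (2 * h) for f, b in zip(fwd, bwd)]
    # Normal n = cmod(proj - theta), as in Definition 5.
    normal = [cmod(p - t) for p, t in zip(proj, theta)]
    angle = lambda v: math.atan2(v[1], v[0])
    diff = angle(tangent) - angle(normal)
    sign = (diff > 0) - (diff < 0)
    # The wrapped components of n have the circular distances as magnitudes.
    dist = math.hypot(normal[0], normal[1])
    return alpha, sign * (math.pi / m2) * dist

# Hypothetical diagonal curve; its farthest points are at distance pi / sqrt(2).
diagonal = lambda al: (al, al)
m2_diag = math.pi / math.sqrt(2)
print(tr_pca_scores((0.5, -0.5), 0.0, diagonal, m2_diag))
print(tr_pca_scores((-0.5, 0.5), 0.0, diagonal, m2_diag))
```

In this example the two points lie on opposite sides of the curve, so their second scores have opposite signs and equal magnitude.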

5.3 Proportion of variance explained

Given a sample \(\varvec{\Theta }_1,\ldots ,\varvec{\Theta }_n\) in \({\mathbb {T}}^p\), \(n,p\ge 1\), its Fréchet mean (or intrinsic mean) is defined as

$$\begin{aligned} \hat{\varvec{\mu }}_{\textrm{F}}&:= \arg \min _{\varvec{\phi }\in {\mathbb {T}}^p} \sum _{i=1}^n d_{{\mathbb {T}}^p}(\varvec{\phi },\varvec{\Theta }_i)^2 \in {\mathbb {T}}^p. \end{aligned}$$

The Fréchet variance of the sample is the minimum of the previous objective function:

$$\begin{aligned} \widehat{\textrm{var}}_{\textrm{F}}&:= \sum _{i=1}^n d_{{\mathbb {T}}^p}(\hat{\varvec{\mu }}_\textrm{F},\varvec{\Theta }_i)^2. \end{aligned}$$

Due to the product structure of \({\mathbb {T}}^p\), \(\hat{\varvec{\mu }}_{\textrm{F}}=\big ({\hat{\mu }}_{\textrm{F}}^{(1)},\ldots ,{\hat{\mu }}_{\textrm{F}}^{(p)}\big )'\) and \(\widehat{\textrm{var}}_{\textrm{F}}=\sum _{j=1}^p \widehat{\textrm{var}}_{\textrm{F}}^{(j)}\), where the superscript denotes a marginal Fréchet mean/variance. For \(p=2\), this variance decomposition facilitates the definition of the Proportion of Variance Explained (PVE) in TR-PCA as

$$\begin{aligned} \text {PVE} := \frac{\widehat{\textrm{svar}}_{\textrm{F}}^{(1)}}{\widehat{\textrm{svar}}_{\textrm{F}}^{(1)}+\widehat{\textrm{svar}}_{\textrm{F}}^{(2)}}, \end{aligned}$$
(10)

where \(\widehat{\textrm{svar}}_{\textrm{F}}^{(j)}\) stands for the Fréchet variance of the jth scores \(\{s_j(\varvec{\Theta }_i)\}_{i=1}^n\), \(j=1,2\).
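The PVE in (10) can be sketched as follows. The marginal Fréchet mean is approximated here by a grid search over \({\mathbb {T}}\), which is a simplification chosen for brevity, and the Fréchet variance is the unnormalized sum of squared distances, as defined above. The function names and the score values are hypothetical.

```python
import math

def circ_dist(a, b):
    """Circular distance between two angles."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def frechet_mean_var(angles, n_grid=3600):
    """Grid approximation of the circular Frechet mean and variance."""
    grid = [-math.pi + 2 * math.pi * i / n_grid for i in range(n_grid)]
    objective = lambda phi: sum(circ_dist(phi, t) ** 2 for t in angles)
    mean = min(grid, key=objective)
    return mean, objective(mean)

def pve(scores1, scores2):
    """Proportion of variance explained by the first scores, as in (10)."""
    v1 = frechet_mean_var(scores1)[1]
    v2 = frechet_mean_var(scores2)[1]
    return v1 / (v1 + v2)

# Hypothetical scores for four observations.
first_scores = [1.0, -1.0, 0.5, -0.5]
second_scores = [0.1, -0.1, 0.2, -0.2]
print(frechet_mean_var(second_scores))
print(pve(first_scores, second_scores))
```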

5.4 Complete TR-PCA procedure

The complete TR-PCA procedure for a given toroidal sample involves all the concepts introduced so far. It is divided into three main stages.

Algorithm 1

(TR-PCA). Given a sample \(\{\varvec{\Theta }_i\}_{i=1}^n\) in \({\mathbb {T}}^2\), TR-PCA proceeds as follows:

  1. (i)

    Modeling.

    1. (a)

      Fit the BSvM and/or BWC models (Sects. 3.1 and 4.1) with maximum likelihood estimation. If both models are fit, select the one with the smallest Bayesian Information Criterion (BIC).

    2. (b)

      Inspect edge cases using LRTs (Sects. 3.1 and 4.1) at \(5\%\) significance level.

      1. (b.i)

        Test diagonal vs. non-diagonal ridges with the homogeneity LRT.

      2. (b.ii)

        Test straight (horizontal/vertical) vs. non-straight ridges with the independence LRT.

    3. (c)

      If any of the LRTs does not reject, refit the model with maximum likelihood estimation restricted to the decisions of (b.i)–(b.ii).

    4. (d)

      Retrieve \({\hat{f}}\), \(\hat{\varvec{\mu }}\) (location parameter), and \({\hat{j}}\) (index of the lowest concentration).

  2. (ii)

    Ridge computation.

    1. (a)

      Determine a grid of \({\mathcal {R}}_{{\textbf{0}}}({\hat{f}})\) with the implicit equation approach (Sects. 3.3 and 4.2).

    2. (b)

      Construct \(r_{{\hat{f}},\hat{\varvec{\mu }},{\hat{j}}}\) in (6) with \(\{({\hat{a}}_k,{\hat{b}}_k)\}_{k=0}^m\) computed with Gauss–Legendre quadrature on the previous grid.

    3. (c)

      Obtain the scaled-centered arc-length parametrized ridge curve \(\tilde{{\textbf{r}}}_{{\hat{f}},\hat{\varvec{\mu }}}\) from Definition 3.

  3. (iii)

    Scores and PVE computation.

    1. (a)

      Compute \(\{(s_1(\varvec{\Theta }_i), s_2(\varvec{\Theta }_i))'\}_{i=1}^n\) using Definition 5.

    2. (b)

      Obtain the PVE using (10).
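The decisions of the modeling stage can be sketched in Python once the maximized log-likelihoods are available. The sketch assumes that each LRT involves a single restriction and uses the asymptotic \(\chi ^2_1\) null distribution, that both models have five parameters, and that the log-likelihood values are made up for illustration. The function names are hypothetical.

```python
import math

def bic(loglik, n_params, n):
    """Bayesian Information Criterion; smaller is better."""
    return n_params * math.log(n) - 2.0 * loglik

def lrt_pvalue(loglik_full, loglik_restricted):
    """Asymptotic p-value of an LRT with one restriction (chi-square, 1 df)."""
    stat = max(0.0, 2.0 * (loglik_full - loglik_restricted))
    # For a chi-square with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

def select_by_bic(fits, n):
    """fits maps a model name to (log-likelihood, number of parameters)."""
    return min(fits, key=lambda name: bic(fits[name][0], fits[name][1], n))

# Hypothetical fits on a sample of size n = 300.
fits = {"bsvm": (-520.3, 5), "bwc": (-515.8, 5)}
best = select_by_bic(fits, n=300)

# Hypothetical restricted fits of the selected model, step (b) at the 5% level.
reject_homogeneity = lrt_pvalue(-515.8, -516.1) < 0.05
reject_independence = lrt_pvalue(-515.8, -540.2) < 0.05
print(best, reject_homogeneity, reject_independence)
```

With these made-up values homogeneity is not rejected, so step (c) would refit the selected model under equal concentrations.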

5.5 Illustrative examples

We now compare the performance of TR-PCA versus an adaptation of PCA to the torus, angular PCA (aPCA) (Riccardi et al. 2009), which is arguably the most readily implementable alternative to PCA in the torus. aPCA centers the data using the circular mean prior to the application of standard PCA. Hence, periodicity is not preserved, which introduces artifacts on the scores when dealing with non-concentrated data. TR-PCA follows the steps defined in Algorithm 1. Figure 5 shows this comparison for four samples simulated from Bivariate Wrapped Normal (BWN) and BWC distributions. The BWN distribution is that of a bivariate random vector distributed as \({\mathcal {N}}_2(\varvec{\mu },\varvec{\Sigma })\), for a mean \(\varvec{\mu }\in {\mathbb {T}}^2\) and a covariance matrix \(\varvec{\Sigma }\), after each vector component is transformed by applying \(\textrm{cmod}(\cdot )\).
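The BWN construction described above, together with the circular-mean centering used by aPCA, can be sketched in Python as follows. The function names, the seed, and the sample size are arbitrary illustrative choices; the parameters are those of the first scenario in Figure 5.

```python
import math
import random

def cmod(x):
    """Wrap x to [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi

def rbwn(n, mu, sigma1_sq, sigma2_sq, rho, seed=None):
    """Draw n observations of a bivariate normal and wrap each component."""
    rng = random.Random(seed)
    s1, s2 = math.sqrt(sigma1_sq), math.sqrt(sigma2_sq)
    sample = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = mu[0] + s1 * z1                                    # Cholesky factor,
        x2 = mu[1] + s2 * (rho * z1 + math.sqrt(1 - rho**2) * z2)  # 2 x 2 case
        sample.append((cmod(x1), cmod(x2)))
    return sample

def circular_mean(angles):
    """Extrinsic circular mean: direction of the mean resultant vector."""
    return math.atan2(sum(math.sin(a) for a in angles),
                      sum(math.cos(a) for a in angles))

sample = rbwn(2000, (-math.pi, 0.0), 0.2, 0.8, 0.35, seed=1)
m1 = circular_mean([s[0] for s in sample])
m2 = circular_mean([s[1] for s in sample])
# aPCA-style centering; standard PCA would then ignore the periodicity.
centered = [(cmod(s[0] - m1), cmod(s[1] - m2)) for s in sample]
print(m1, m2)
```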

Fig. 5

Performance of aPCA (red line) versus TR-PCA (black curve) on four different samples (top row), together with TR-PCA (middle) and aPCA scores (bottom). From left to right: a concentrated BWN with \(\varvec{\mu } = (-\pi , 0)^{\prime }\) and \(\varvec{\Sigma } = (\sigma _1^2, \rho \sigma _1\sigma _2; \rho \sigma _1\sigma _2, \sigma _2^2)\) with \((\sigma _1^2,\sigma _2^2,\rho )=(0.2,0.8,0.35)\); a more spread BWN with \(\varvec{\mu } = (\pi /2, 0)^{\prime }\) and \((\sigma _1^2,\sigma _2^2,\rho )=(3,1.5,0.85)\); a BWC with \(\varvec{\mu } = (1,2)^{\prime }\) and \((\xi _1, \xi _2, \rho ) = (0.5, 0.1, -0.75)\); and an equal mixture of two BWNs with \(\varvec{\mu }_1 =(\pi /2, -\pi /2)^{\prime }\) and \((\sigma _{11}^2,\sigma _{12}^2,\rho _1)=(0.4,0.16,0.35)\), and \(\varvec{\mu }_2 =(-\pi /2, \pi /2)^{\prime }\) and \((\sigma _{21}^2,\sigma _{22}^2,\rho _2)=(0.16,0.4,0.35)\). To allow a principled comparison between aPCA and TR-PCA, in the first row the sample has been centered by its circular mean. TR-PCA is invariant to this centering, while the non-periodic projection in aPCA depends on it. The rainbow color palette indicates the main mode of variation. The two clusters in the rightmost column are colored separately

The first scenario (leftmost column) shows the near equivalence of both methods when dealing with highly-concentrated samples, an expected consequence of the torus being locally Euclidean. Here, both the first scores of TR-PCA and aPCA explain \(81\%\) of the variance. When the distribution is more dispersed, as in the other three samples, periodicity becomes relevant and aPCA begins introducing artifacts. In the second scenario, TR-PCA (\(83\%\)) does not introduce any distortion on the scores thanks to honoring the toroidal geometry, while aPCA (\(75\%\)) creates some spurious clusters with scores that exit \({\mathbb {T}}^2\) (observe the vertical scale) and induces a slight rotation on the scores. These artifacts are magnified in the third scenario (TR-PCA: \(88\%\); aPCA: \(74\%\)). Finally, the fourth scenario represents a Simpson’s paradox case in which TR-PCA (\(78\%\)) is able to successfully separate the two clusters along the first scores, while aPCA (\(38\%\)) fails to do so. In all scenarios, TR-PCA yields periodic scores, unlike aPCA. The PVEs of both methods were computed according to (10).

6 Data application

This section illustrates the application of TR-PCA to the study of currents in the Santa Barbara Channel (California, USA). Studying the direction of currents is of crucial importance to understand the supply of nutrients in marine habitats (Allen et al. 2012) and the genetic propagation of marine fauna and flora (White et al. 2010), as well as for environmental purposes and the prevention of contamination (DiGiacomo et al. 2004). In the present context of increased contamination and climate change, the Santa Barbara Channel is an area well known for the confluence of several important ocean currents and vast marine biodiversity. The complexity of local currents (Winant et al. 2003) makes the use of standard statistical tools not directly applicable and motivates the search for new methodologies able to explain their variability. In their Figure 10, Winant et al. (2003) explain that there exists a counterclockwise vortex in the Santa Barbara Channel, which is represented by a general westward flux on its northern coast. In addition, studies such as Auad et al. (1998) show that there exists a net influx of water in the Santa Barbara Channel that exits heading south through the Santa Cruz Channel. These two facts are taken as a guideline to validate the ability of TR-PCA to index the main variability of the data in a fully data-driven way.

The data was obtained from the High-Frequency Radar Database (https://hfradar.ndbc.noaa.gov/), which maps surface currents and wave fields over wide areas. In particular, the currents’ direction, \(d = \textrm{atan2}(u,v)\), is available hourly through the measured eastward and northward surface velocities (u and v, respectively) of the water body. We smoothed this direction by taking the daily speed-weighted circular mean in a given zone Z, obtaining the circular variable of the study, \(\theta _\textrm{Z}\). Since large-scale currents involve timespans longer than hours, daily averages allow smoothing of the data while still keeping the variability associated with these ocean currents. We used the data from the three-year period 2019–2021, using complete years so that seasonality events were neither over- nor underrepresented. Four locations were selected: zones A and B, located along the northern coast of the Santa Barbara Channel; and zones C and D, corresponding to the north and south of the Santa Cruz Channel, respectively. These areas are shown in Fig. 6. The study focuses on the dependency between \(\theta _\textrm{A}\) and \(\theta _\textrm{B}\) to further analyze the top part of the vortex shown in Winant et al. (2003, Figure 10) and the water flux parallel to the coast, as well as on the dependency between \(\theta _\textrm{C}\) and \(\theta _\textrm{D}\) for the water output through the inter-island strait (Auad et al. 1998). A third analysis on \((\theta _\textrm{A}, \theta _\textrm{C})\) is performed to investigate the dependence between both regions.
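The smoothing step can be sketched in Python. We assume that the speed-weighted circular mean weights each hourly direction \(d\) by the speed \(\sqrt{u^2+v^2}\); under that reading it coincides with the direction of the summed velocity vector. The function names and the velocity values are hypothetical.

```python
import math

def current_direction(u, v):
    """Direction d = atan2(u, v) from eastward (u) and northward (v) velocities."""
    return math.atan2(u, v)

def speed_weighted_circular_mean(us, vs):
    """Circular mean of the directions, each weighted by its speed."""
    num = den = 0.0
    for u, v in zip(us, vs):
        speed = math.hypot(u, v)
        d = current_direction(u, v)
        num += speed * math.sin(d)   # equals u, since sin(d) = u / speed
        den += speed * math.cos(d)   # equals v, since cos(d) = v / speed
    return math.atan2(num, den)

# Hypothetical hourly velocities within one zone and day.
us = [0.3, -0.1, 0.5]
vs = [-0.2, 0.4, 0.1]
theta_z = speed_weighted_circular_mean(us, vs)
print(theta_z, math.atan2(sum(us), sum(vs)))  # the two values agree
```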

Fig. 6

Satellite map of Santa Barbara coast with the main zones of the analysis depicted in different colors. The coordinates (in degrees) delimiting the different areas are \(\textrm{A} \equiv (-120.24, -120.12)\times (34.37, 34.44)\), \(\textrm{B} \equiv (-119.86, -119.74) \times (34.31, 34.39)\), \(\textrm{C} \equiv (-120.05, -119.93) \times (34.04, 34.12) \), and \(\textrm{D} \equiv (-119.95, -119.82) \times (33.86, 33.94)\). The connection A–B represents the coast of the Santa Barbara Channel, while C–D is the Santa Cruz Channel

A first analysis of the data’s density for the three zone pairs (A–B, A–C, C–D) was performed to check the adequacy of the BSvM and BWC distributions to model the current directions. Toroidal-adapted kernel density estimators for estimating the unknown bivariate densities were compared with the estimated parametric densities of the BSvM and BWC. These estimates are shown in Fig. 7, where it can be seen that the BWC and BSvM give similar shapes, although simplifying some of the asymmetries of the kernel density estimators. To formally assess the existence of dependence in the three scenarios, we performed the \(\phi ^{(n)}_{\textrm{dc}}\) test described in García-Portugués et al. (2023, Section 3), emphatically rejecting the null hypothesis of independence in all cases.

The BSvM and BWC models were fitted to the samples of \((\theta _\textrm{A}, \theta _\textrm{B})\), \((\theta _\textrm{A}, \theta _\textrm{C})\), and \((\theta _\textrm{C}, \theta _\textrm{D})\) using maximum likelihood. The estimated parameters and log-likelihoods are shown in Table 1. In terms of log-likelihoods, it can be seen that the BSvM estimate has better performance in two of the analyses. As expected, the BWC distribution is more concentrated, leading to higher values of the density around the modes.

Table 1 Estimated parameters, log-likelihood (\({\hat{\ell }}\)), and Proportion of Variance Explained (PVE) by TR-PCA for the different analyses. Left and right blocks give the BSvM and BWC fits, respectively. The bold font indicates the best-performing model in terms of the log-likelihood

The estimated ridges \({\mathcal {R}}_{\hat{\varvec{\mu }}}({\hat{f}})\) for the three bivariate analyses are shown in Fig. 7. It can be seen how \({\mathcal {R}}_{\hat{\varvec{\mu }}}({\hat{f}})\) indexes the main variability of the sample and informs on the positioning of high-density regions. On the one hand, the best-fitting models capture the local correlations of the currents’ directions about their modes, which are seen to be positive in the A–B case, and negative in the A–C and C–D cases. On the other hand, the curve \({\mathcal {R}}_{\hat{\varvec{\mu }}}({\hat{f}})\) allows synthesizing the behavior of the currents in a sensible way. For example, in the A–B case, the BSvM model not only reproduces the W–W mode (western direction both at zone A and B) found in previous studies (Winant et al. 2003), but also provides additional information in terms of the sinusoidal-like dependence between A–B. A final insight revealed by the estimated ridges is that there is a high negative correlation in C–D. Due to the geographical shape of the strait, water can flow through it in two main directions: SE–SE and NW–NW, as previous studies show (Auad et al. 1998). As seen in Fig. 7, both main modes are captured fairly well: there is a high concentration toward SE–S and a secondary concentration at NW–W. Furthermore, it can also be seen that the higher-density transition region between the two modes has a negative correlation, a result that cannot be easily obtained by only analyzing the modes.

Fig. 7

From left to right, the columns represent the density contours of kernel density estimates, fitted BSvM densities, fitted BWC densities, and kernel density estimates of the TR-PCA scores of the best-performing model. The second and third columns show the estimated ridge curves \({\mathcal {R}}_{\hat{\varvec{\mu }}}({\hat{f}})\) (red curves) from the sample (black points in the first column) for the three bivariate analyses and two models. From top to bottom, rows represent zones A–B, A–C, and C–D. The contour plots within the same row share a common color scale in the first three columns. The parameters of the fits are shown in Table 1

The summarizing capability of \({\mathcal {R}}_{\hat{\varvec{\mu }}}({\hat{f}})\) can be further exploited by visualizing a march along it. Figure 8 shows this march in the case C–D using the BSvM fit shown previously in Fig. 7. Unlike previous studies that were limited to finding the main modes of direction (Auad et al. 1998; Winant et al. 2003), this parametrization also allows an estimation of the variability when the flow has other directions. For example, the net influx of water outside the strait is reproduced in the second leftmost plot in Fig. 8.

Fig. 8

Four snapshots of the march along the ridge curve \({\mathcal {R}}_{\hat{\varvec{\mu }}}({\hat{f}})\) for C–D. The arrows at C (top) and D (bottom) are colored according to their position on the ridge curve. Past directions are also shown with transparencies in order to visualize the movement’s direction. The main variability of the currents on C–D is manifested on a pendular-like variation of the current direction at D that is aligned with a counterclockwise variation of the current direction at C. Video available at https://github.com/egarpor/ridgetorus/tree/main/application

The TR-PCA scores, whose kernel density estimates are shown in the rightmost column of Fig. 7, have narrower, more concentrated distributions that capture a large part of the total variance, with the PVEs collected in Table 1 being greater than \(74\%\). These scores could be used for clustering purposes to find particular meteorological events or outliers.

7 Discussion and conclusions

In this work we have advocated the use of density ridges for constructing a well-grounded bivariate toroidal PCA. The construction is based on using the implicit equation approach to determine ridges, which has proven to be more efficient and robust than the Euler algorithm. By tailoring this procedure to bivariate data, we have corroborated empirically that two reference toroidal distributions, the BWC and the BSvM, present stable connected components that go through the distributions’ modes. We have proposed a practical way to parametrize \({\mathcal {R}}_{\varvec{\mu }}(f)\) to yield a tractable computation of PCA-like scores that allows for a full dimension-reduction analysis. The real data application has shown that TR-PCA explains \(75\%\)–\(80\%\) of the variability of the ocean currents and gives interpretations that are consistent with previous explanations of large-scale water movements in the area.

An important takeaway of our investigations is that the BWC seems to be, in general, a more robust parametric distribution than the BSvM for TR-PCA. Although the BSvM is recognized as a somewhat canonical choice for “the normal density on \({\mathbb {T}}^2\)”, for this application, the squarish form of its density contours may introduce “elbows” on its associated density ridges, these in turn introducing artifacts on the resulting scores. In comparison, the BWC does not present this problem and tends to yield more flexible and descriptive principal ridges. Nevertheless, the choice of the reference parametric distribution is subject to improvement, as any sufficiently well-behaved density in \({\mathcal {C}}^2({\mathbb {T}}^2)\) could be considered within our methodology. In this regard, we highlight that the ridge parametrization introduced in Sect. 5.1 could be used in other densities apart from BSvM/BWC. Deriving theoretical results giving conditions that guarantee the existence and well-definedness of a connected ridge for parametric families of densities would be a useful endeavor to be addressed in the future. In this respect, we experimented with another normal analog on the torus, the wrapped normal, for the analyses in Sections 3.1, 3.2, and 5.5. We found numerically that some ridges could be multivalued functions in both variables \(\theta _1\) and \(\theta _2\), giving convoluted ridge curves. When this does not happen, somewhat unpleasant Z-shaped ridges with rough corners arise, thus making this distribution less appropriate for TR-PCA than BSvM and BWC. Although not pursued in this paper, we note that asymptotic inference for \({\mathcal {R}}_{\varvec{\mu }}(f)\) is readily addressable when using a density model f with tractable maximum likelihood.

TR-PCA has been constructed as a toroidal analog of PCA on \({\mathbb {R}}^2\) from the following perspective. PCA can be seen as a parametric dimension-reduction method driven by Gaussianity: in an arbitrary sample, it fits a normal distribution \({\mathcal {N}}(\varvec{\mu },\varvec{\Sigma })\) by maximum likelihood and extracts the eigendecomposition of \(\hat{\varvec{\Sigma }}\); the subspace spanned by the first eigenvector coincides with the \(\hat{\varvec{\mu }}\)-connected ridge. TR-PCA follows this view of PCA, replacing the normal distribution with the BSvM/BWC model. It might be argued that PCA is a model-free technique since \((\hat{\varvec{\mu }},\hat{\varvec{\Sigma }})\) estimate the mean and covariance matrix of a general population. In the torus, these two descriptive summaries are less canonical; e.g., there exist two definitions of circular means (extrinsic or intrinsic). The maximum likelihood estimators of the BSvM and BWC distributions do not necessarily coincide with extrinsic/intrinsic means, and so TR-PCA does not mimic this Gaussian-specific aspect of PCA. However, the maximum likelihood estimators of the BSvM/BWC model \(f_{\varvec{\xi }}\) estimate \(\varvec{\xi }_0 =\arg \min _{\varvec{\xi }}\textrm{D}_{\textrm{KL}}(f_0\Vert f_{\varvec{\xi }})\), the parameter that minimizes the Kullback–Leibler divergence of a general population \(f_0\) from \(f_{\varvec{\xi }}\). In the Gaussian model \(f_{\varvec{\mu },\varvec{\Sigma }}\), \((\varvec{\mu }_0,\varvec{\Sigma }_0)=\arg \min _{\varvec{\mu },\varvec{\Sigma }}\textrm{D}_{\textrm{KL}}(f_0\Vert f_{\varvec{\mu },\varvec{\Sigma }})\) are the mean and covariance matrix of the population \(f_0\), highlighting that both TR-PCA and PCA estimate a general population descriptor under a parametric model.

A clear limitation of the present work is its restriction to \(p=2\). The concept of ridges can be extended to higher dimensions by redefining the eigenvector matrix, the projected gradient, and the density ridge of a multivariate toroidal density \(f\in {\mathcal {C}}^2({\mathbb {R}}^p)\) as \({\textbf{U}}_{(p-q)}({\textbf{x}}):=({\textbf{u}}_{q+1}({\textbf{x}}), \ldots , {\textbf{u}}_{p}({\textbf{x}}))\), \(\textrm{D}_{(p-q)} f({\textbf{x}}):={\textbf{U}}_{(p-q)}({\textbf{x}}) ({\textbf{U}}_{(p-q)}({\textbf{x}}))^{\prime } \textrm{D} f({\textbf{x}})\), and \({\mathcal {R}}_{q}(f):=\{{\textbf{x}}\in {\mathbb {R}}^{p}:\Vert \textrm{D}_{(p-q)} f({\textbf{x}})\Vert =0,\ \lambda _{q+1}({\textbf{x}}),\ldots ,\lambda _{p}({\textbf{x}})<0\}\), with \(1\le q\le p-1\). However, in this multivariate case, obtaining an equivalent of the implicit equation is more challenging, so one may need to rely on the (computationally expensive) Euler algorithm. Still, important concepts such as iterative projections on nested spaces and scores computation would need to be carefully defined for the multivariate case. The multivariate extension of TR-PCA is also currently limited by the scarcity of multivariate toroidal models beyond the multivariate von Mises distribution (Mardia et al. 2008) (which extends (5)). For example, currently there does not exist a multivariate extension of the BWC. Therefore, the larger endeavor of extending TR-PCA to higher dimensions is open for future research.

SUPPLEMENTARY MATERIAL

R-package ridgetorus: It implements TR-PCA on the two-dimensional torus (ridge_pca()). It provides functions to estimate the parameters of BSvM and BWC models and compute their density ridges. It also contains all the data used for the application in Section 6 (santabarbara), as well as the code for its end-to-end replicability. The package is available at https://CRAN.R-project.org/package=ridgetorus.