1 Introduction

Set estimation is focused on the reconstruction of a set (or the approximation of any of its characteristic features such as its boundary or its volume) from a random sample of points. One of the specific topics in this area is concerned with the estimation of sets directly related to density functions such as level sets. Mathematically, for a given level \(t>0\), the goal is to reconstruct the unknown set

$$\begin{aligned} G_g(t)=\{x\in \mathbb {R}^d:g(x)\ge t\} \end{aligned}$$
(1)

from a random sample of points of a density function g on \(\mathbb {R}^d\). This topic has received considerable attention in the statistical literature, especially since the notion of population clusters was established in Hartigan (1975) as the connected components of the set in (1). This cluster definition clearly relies on the user-specified level t, so, to address this problem, an algorithm for estimating the smallest level with more than a single connected component was proposed in Steinwart (2015). For a general review on clustering, see (Anderberg 1973; Everitt 1993; Cuevas and Fraiman 1997) and (Rinaldo and Wasserman 2001).

The number of clusters is a basic feature for a statistical population. However, the problem of its estimation is not always taken into account in cluster analysis where it is usually chosen by the practitioner as a first step. Since the number of clusters is equal to the number of connected components of a level set, a very natural estimator for this populational parameter is the number of the connected components of the level set reconstruction. This perspective that solves the problem of selecting this unknown population parameter is considered, for instance, in Cuevas et al. (2000), Cuevas et al. (2001) and Biau et al. (2007).

Level set estimation theory has been mainly established for a density supported on a Euclidean space such as in Eq. (1), with very few contributions in other domains. Cuevas et al. (2006) consider the estimation of level sets for general functions (not necessarily a density), providing some theoretical consistency results and showing a level set on the sphere for illustration. More recently, the reconstruction of density level sets on manifolds is studied in Cholaquidis et al. (2020). Through some simulations, the behavior of the proposed method is analyzed on the torus and on the sphere.

Unfortunately, for most applications, the specific value of the level t in (1) is fully unknown by the practitioner. In addition, areas of the distribution support where g is close to zero (non-effective support) are usually of limited interest for applications. If the practitioner establishes the probability content instead of the level t, a new kind of density level sets emerges known as highest density regions (HDRs) (see Box and Tiao 1973 and Hyndman 1996). The estimation of HDRs involves further complexities given that the threshold of this particular type of level sets must be determined from the established probability content. Perhaps due to its practical importance, HDRs plug-in reconstruction from the linear kernel density estimator has been widely studied considering also the problem of selecting an appropriate bandwidth specifically devised for the HDR reconstruction (see, for instance, Baíllo and Cuevas 2006 or Samworth and Wand 2010). However, as far as we know, the notion of HDR has not been introduced for directional data yet. Therefore, the main goals of this work are to (1) generalize HDRs definition to the directional setting, (2) establish a plug-in procedure for HDRs reconstruction from the proposal of a new bootstrap bandwidth for a well-known directional kernel density estimator that can be seen as the first specific selector for directional HDRs, (3) check its performance through an extensive simulation study analysing the effect of considering a smoothing parameter not specifically designed for HDR estimation and (4) apply this methodology to analyze data on animal orientation and on seismology.

One may argue that such an absence of a general and effective proposal for directional HDRs estimation may be due to a lack of practical interest, but this is far from the truth, so let us present two application examples that motivate the developments in this work. The first one concerns a problem from animal orientation studies and the second one is related to earthquake occurrences. Both datasets are available in the R package HDiR.

1.1 Some motivating examples

Animal orientation example. Behavioral plasticity is considered by biologists as a feature of adaptation to changing beach environments. In particular, orientation is an adaptation characteristic that cannot be modified by a single factor. Nonetheless, experts found some regularities in the orientation of sandhoppers and other animals from beach environments by changing one factor at a time under otherwise controlled conditions.

Fig. 1

Geographical location of Zouara beach (right). Talorchestia brito (center) and Talitrus saltator (left)

For instance, the orientation of two sandhopper species (Talitrus saltator and Talorchestia brito) is analyzed in Scapini et al. (2002). Both species are shown in Fig. 1. Bottom pictures can be found in Dekker (1978). Comparing the two species through regression procedures, Scapini et al. (2002) conclude that Talitrus saltator showed more differentiated orientations, depending on the time of day, period of the year and sex, with respect to Talorchestia brito. Moreover, it seems that Talitrus saltator shows a higher flexibility (variation) of orientation than Talorchestia brito under the same environmental conditions, supporting the hypothesis that the former has a higher level of terrestrialization. As an illustration, Fig. 2 (left panel) shows the 36 orientation points (slightly jittered) corresponding to males of the species Talitrus saltator measured at noon in April. The right panel of Fig. 2 contains the 77 angles (slightly jittered) measured in October. Differences in the distribution on the circle of these two samples can be easily observed. Therefore, the month of the year seems to play a significant role in sandhopper behavior. In particular, two clusters for October measurements can be detected around the angle \(\pi \) but they are not present for the April sample. Similar comments could be made for the situation registered around the angle \(\pi /2\). HDR reconstruction (with low probability content) would allow one to determine the largest modes of the distribution and, hence, its clusters. Therefore, HDRs can be seen as a useful alternative to analyze sandhopper orientation.

Fig. 2

Orientation data (slightly jittered) corresponding to males of the species Talitrus saltator registered at noon in April (left) and October (right)

Earthquake occurrences. The European-Mediterranean Seismological Centre (EMSC) is a non-governmental and non-profit organisation that was established in 1975 at the request of the European Seismological Commission. Since the European-Mediterranean region has suffered several destructive earthquakes, there was a need for a scientific organisation to be in charge of the determination, as quickly as possible (within one hour of the earthquake occurrence), of the characteristics of such earthquakes. These predictions are based on the seismological data received from more than 65 national seismological agencies, mostly in the Euro-Med region. Figure 3 (left) shows the geographical coordinates (red points), downloaded from the EMSC website, of a total of 272 medium and strong world earthquakes registered between 1\(^{st}\) October 2004 and 9\(^{th}\) April 2020. The magnitude of all these events is at least 2.5 degrees on the Richter scale. Of course, these planar points correspond to spherical coordinates on Earth. Given the significant damage that earthquakes cause, cluster detection through HDRs could also be useful to identify, from a real dataset, where earthquakes are especially likely. This information is key for decision-making, for example, to update construction codes guaranteeing better seismic resistance of buildings. An interactive representation of the sphere can be seen in Appendix D.

Fig. 3

Distribution of earthquakes around the world between October 2004 and April 2020 (left). HDR contour obtained from the sample of world earthquakes registered between October 2004 and April 2020 (right)

1.2 Paper organization

This paper is organized as follows. Section 2 contains some background ideas on directional level set estimation, including some discussion on error measurements and some existing consistency results in the directional setting that will be useful to extend the definition of HDRs to directional data in Sect. 3. There, plug-in estimators and the corresponding confidence regions are also established. Concretely, we consider the plug-in methods based on a well-known directional kernel density estimator, which requires a smoothing parameter (bandwidth) for its practical implementation. An appropriate bootstrap bandwidth selector, the first one specifically designed for directional HDRs estimation, is also introduced in this section. Section 4 presents an extensive simulation study illustrating the performance of the resulting plug-in reconstruction for the HDRs (for circular and spherical domains) considering the new bandwidth selector. These results are compared with those obtained with directional smoothing parameters not specifically designed for HDR estimation. In Sect. 5, the proposed methodology is applied to analyze the two real data examples presented in the Introduction. Finally, some conclusions and ideas for further research are presented in Sect. 6. Appendix A and the supplementary material that completes this work include further information on the datasets. Appendix B specifies the parameters taken for the construction of the spherical densities in the simulation study. Appendix C contains some additional simulation results. Appendix D collects the description of the bandwidth selectors considered in the simulation study. All the methods presented in this paper, along with the real data examples, are accessible in the HDiR package.

2 Some background on directional level sets

The specific problem of reconstructing density level sets in the directional setting is reviewed in this section: the definition of directional level set is introduced jointly with a plug-in estimator. Based on the real data and simulated examples, some discussion about how to measure the estimation error and some asymptotic results are also included.

2.1 On directional level sets

Consider a random vector X taking values on a \(d-\)dimensional unit sphere \(S^{d-1}\) with density f. Given a level \(t>0\), the directional level set is defined as:

$$\begin{aligned} G_f(t)=\{x\in S^{d-1}:f(x)\ge t\}. \end{aligned}$$
(2)

The nature of different level sets is shown in Fig. 4, which represents \(G_f(t)\) in grey color for three different circular densities and three different values of the level t. The threshold t is represented through a dotted grey line. Note that, if large values of t are considered (bottom row in Fig. 4), \(G_f(t)\) coincides with the greatest modes. However, for small values of t, the level set \(G_f(t)\) is virtually equal to the support of the distribution.

Fig. 4

For three different circular densities, \(G_f(t)\) for \(t=t_1\) (first column), \(t=t_2\) (second column) and \(t=t_3\) (third column) verifying \(0<t_1<t_2<t_3\). Equivalently, \(L(f_\tau )\) for \(\tau =0.2\) (first column), \(\tau =0.5\) (second column) and \(\tau =0.8\) (third column)

It is important to note that, following Hartigan (1975), we may also establish the concept of cluster in the directional setting as the connected components of the level set \(G_f(t)\). With this view in mind, note that the density represented in the second row of Fig. 4 presents four connected components for all of the considered values of t, determining four population clusters.
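The cluster notion above can be illustrated numerically on the circle. The following is only a minimal sketch: the four-component von Mises mixture, its parameters and the grid resolution are assumptions chosen for illustration (they are not the models of Fig. 4), and the level set in (2) is approximated on an equispaced grid of angles.

```python
import numpy as np

def von_mises_pdf(theta, mu, kappa):
    """Density of a von Mises distribution on the circle (angles in radians)."""
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * np.i0(kappa))

def mixture_pdf(theta, mus, kappas, weights):
    """Finite mixture of von Mises densities (hypothetical illustrative model)."""
    theta = np.asarray(theta, dtype=float)
    return sum(w * von_mises_pdf(theta, m, k)
               for m, k, w in zip(mus, kappas, weights))

def count_components(inside):
    """Number of connected arcs of a set given as a boolean mask on a circular grid."""
    inside = np.asarray(inside, dtype=bool)
    if inside.all():
        return 1  # the whole circle is a single component
    # An arc starts wherever the mask switches from False to True; np.roll makes
    # the last grid point the left neighbour of the first one (wrap-around).
    starts = inside & ~np.roll(inside, 1)
    return int(starts.sum())

grid = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
dens = mixture_pdf(grid,
                   mus=[0.0, np.pi / 2, np.pi, 3 * np.pi / 2],
                   kappas=[20.0] * 4,
                   weights=[0.25] * 4)

for t in (0.05, 0.2, 0.4):
    level_set = dens >= t  # grid approximation of G_f(t)
    print(f"t = {t}: {count_components(level_set)} connected component(s)")
```

For a well-separated mixture such as this one, the number of components stays the same over a wide range of levels, which is the behaviour described for the second row of Fig. 4.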

Plug-in estimation is the most natural and common choice for reconstructing density level sets in the Euclidean space. A review of other existing estimation alternatives can be seen in Rodríguez-Casal and Saavedra-Nieves (2019). Plug-in methods are devised to reconstruct the level set in (1) as

$$\begin{aligned} \hat{G}_g(t)=\{x\in \mathbb {R}^d:g_n(x)\ge t\} \end{aligned}$$

where \(g_n\) usually denotes the classical kernel estimator for Euclidean data (see Parzen 1962 and Rosenblatt 1956). This methodology, which has received considerable attention (see, for instance, Tsybakov 1997; Baíllo 2003; Mason and Polonik 2009; Rigollet and Vert 2009; Mammen and Polonik 2013; Polonik 2013 or Chen et al. 2017), can be easily generalized to the directional setting. Given a random sample \(\mathcal {X}_n=\{X_1,\ldots ,X_n\}\subset S^{d-1}\) of the unknown directional density f, the corresponding level set \(G_f(t)\) in (2) can be reconstructed as

$$\begin{aligned} \hat{G}_f(t)=\{x\in S^{d-1}:f_n(x)\ge t\} \end{aligned}$$
(3)

where \(f_n\) denotes a nonparametric directional density estimator. Following the classical ideas for real-valued random variables, a kernel estimator on \(S^{d-1}\) is provided in Bai et al. (1989) (\(d>2\)) who also proved strong pointwise consistency, uniform consistency, and \(L_1-\)norm consistency of the estimator (see also Hall et al. 1987 and Klemelä 2000 for further results). Following Bai et al. (1989), from a random sample \(\mathcal {X}_n\) on a \(d-\)dimensional sphere the directional kernel density estimator at a point \(x\in S^{d-1}\) is defined as

$$\begin{aligned} f_n(x)= \frac{1}{n} \sum _{i=1}^n K_{vM}(x;X_i;1/h^2), \end{aligned}$$
(4)

where \(1/h^2 > 0\) is the concentration parameter and \(K_{vM}\) denotes the von Mises-Fisher kernel density (see Appendix B for explicit formulae). The von Mises-Fisher kernel in Eq. (4) is not the only option, and it is particularly interesting to point out the use of a wrapped-normal kernel in the circular setting. In this case, Huckemann et al. (2016) proved that this kernel guarantees the monotonicity of the number of modes with respect to the smoothing parameter, something that also happens for the Gaussian kernel in the linear case. It may be argued that such a kernel should be used in our problem. Nevertheless, it is computationally more expensive and our practical experience shows that the results are quite similar.

Note that the kernel estimator in (4) can be viewed as a mixture of von Mises-Fisher densities. Furthermore, the concentration parameter \(1/h^2\) plays a role analogous to that of the bandwidth in the Euclidean case. For small values of \(1/h^2\), the density estimator is oversmoothed. The opposite effect is obtained as \(1/h^2\) increases: with a large value of \(1/h^2\), the estimator clearly undersmooths the underlying target density. Hence, the choice of h is a crucial issue. For simplicity, in what follows, we refer to h as the bandwidth parameter. Several approaches for selecting h in practice, in circular and even directional settings, have been proposed in the literature (see Appendix D). All the existing proposals aim to minimize some error criterion on the target density, but none of them is specifically designed for the reconstruction of a directional level set.
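To make the estimator in (4) concrete, here is a minimal Python sketch for the circle and the ordinary sphere. It assumes the standard von Mises-Fisher normalising constants, \(1/(2\pi I_0(\kappa ))\) on the circle and \(\kappa /(4\pi \sinh \kappa )\) on the sphere, with \(\kappa =1/h^2\); the function names are hypothetical and this is not the implementation of the HDiR package.

```python
import numpy as np

def vmf_kernel(dots, kappa, d):
    """von Mises-Fisher density with concentration kappa, evaluated at the
    inner products x'mu between unit vectors of R^d (d = 2 or d = 3 only)."""
    if d == 2:  # circle: exp(kappa x'mu) / (2 pi I_0(kappa))
        return np.exp(kappa * dots) / (2.0 * np.pi * np.i0(kappa))
    if d == 3:  # sphere: kappa exp(kappa x'mu) / (4 pi sinh(kappa)), written stably
        return (kappa * np.exp(kappa * (dots - 1.0))
                / (2.0 * np.pi * (1.0 - np.exp(-2.0 * kappa))))
    raise NotImplementedError("this sketch covers d = 2 and d = 3 only")

def directional_kde(x, sample, h):
    """Kernel estimator f_n at the rows of x, from the rows of `sample`.
    Both arrays hold unit vectors; h is the bandwidth, so kappa = 1 / h^2."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    sample = np.atleast_2d(np.asarray(sample, dtype=float))
    kappa = 1.0 / h**2
    dots = x @ sample.T  # (m, n) matrix of inner products x_k' X_i
    return vmf_kernel(dots, kappa, x.shape[1]).mean(axis=1)

def angles_to_unit(theta):
    """Map angles in radians to unit vectors on the circle."""
    theta = np.asarray(theta, dtype=float)
    return np.column_stack([np.cos(theta), np.sin(theta)])

# Illustrative use on simulated circular data (assumed model: von Mises(0, 2)).
rng = np.random.default_rng(0)
sample = angles_to_unit(rng.vonmises(0.0, 2.0, size=50))
grid = np.linspace(0.0, 2.0 * np.pi, 4000, endpoint=False)
f_n = directional_kde(angles_to_unit(grid), sample, h=0.3)
```

A large h (small \(1/h^2\)) flattens `f_n`, whereas a small h produces one narrow bump per observation. Very small bandwidths make `np.exp(kappa * dots)` overflow in the circular branch, so a production implementation would need a scaled Bessel function.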

Fig. 5

Plug-in density level sets \(\hat{G}_f(t)\) from \(\mathcal {X}_{250}\) for three different circular densities with \(t=t_1\) (first column), \(t=t_2\) (second column) and \(t=t_3\) (third column) verifying \(0<t_1<t_2<t_3\)

Figure 5 shows three plug-in estimators \(\hat{G}_f(t)\) for models (black colour) and levels \(t_1\), \(t_2\) and \(t_3\) (dotted grey line) considered in Fig. 4. Kernel density estimators (grey color) in (4) have been determined from samples of size 250 considering the proposal in Oliveira et al. (2012) as bandwidth parameter.

Fig. 6

Plug-in level set estimators obtained from the orientation samples corresponding to males of the species Talitrus saltator registered at noon in April (left) and October (right)

For instance, for the sandhoppers example, Fig. 6 shows the plug-in estimators obtained for the two samples of sandhoppers represented in Fig. 2. It is possible to detect the largest modes of the two sample distributions corresponding to the April and October samples. These results allow us to confirm the differences between the two populations. The largest cluster of April orientations is located around the angle \(7\pi /4\). However, the pattern observed for the October records is completely different. Although only one cluster is identified around the angle \(3\pi /2\), if the level t decreases slightly, two additional groups can be detected around the angles \(3\pi /4\) and \(5\pi /4\), respectively.

Regarding the earthquakes illustration, Fig. 3 (right) shows the plug-in contour in blue obtained from the selected sample of world earthquakes. Choosing a convenient value of the level t, the greatest mode of the sample distribution is identified in the Southeast of Europe. Countries such as Italy, Greece or Turkey (located within this cluster) are the most affected areas in the recent past.

2.2 Error measures and consistency results on directional level sets

The level set \(\hat{G}_f(t_3)\) represented in Fig. 5 (third column) presents two connected components. However, Fig. 4 (right plot in the bottom row) shows that the theoretical level set \(G_f(t_3)\) has exactly three components. Therefore, the estimation error is considerable. Distances between sets are the common criteria considered in set estimation to measure the discrepancies between the theoretical region to be estimated and the corresponding reconstruction. Of course, this is also applicable when the goal is to estimate level sets or HDRs.

The distance in measure \(d_{\mu }\) between two Borel sets A and B in \(\mathbb {R}^d\) is defined as

$$\begin{aligned} d_\mu (A,B)=\mu (A\triangle B) \end{aligned}$$
(5)

where \(\mu \) denotes the Lebesgue measure and \(A\triangle B\) the symmetric difference of A and B, calculated as \((A\cap B^c)\cup (A^c\cap B)\) with \(A^c\) representing the complement of A. Consistency results for directional plug-in estimators have already been obtained in the literature for this distance. For the estimator established in (3) defined on \(S^2\), Cuevas et al. (2006) and Cholaquidis et al. (2020) check that \(\lim _{n\rightarrow \infty }d_\mu ( G_f(t), \hat{G}_f(t))=0, \text{ a.s. },\) and \(\lim _{n\rightarrow \infty }d_\mu (G_f(t),\hat{G}_f^{'}(t))=0, \text{ a.s. }\), where \(\hat{G}_f^{'}(t)=\{x\in S^{d-1}:f_n^{'}(x)\ge t\} \) and \(f_n^{'}\) denotes the kernel estimator (for manifolds with boundary) proposed in Berry and Sauer (2017). From the definition in (5), it is easy to check that the distance in measure \(d_\mu \) does not penalize those level set estimators that have an isolated point as a connected component or any other set with null Lebesgue measure. Additionally, the undersmoothing caused by the choice of a small bandwidth value may cause the estimator \(\hat{G}_f(t)\) to present non-significant connected components with small Lebesgue measure. In this case, \(d_{\mu }\) would not be as effective as, for instance, the Hausdorff distance in detecting this situation.

Let us recall that, if A and B are now non-empty compact sets in \(\mathbb {R}^d\), the Hausdorff distance between A and B is established as follows

$$\begin{aligned} d_H(A,B)=\max \left\{ \sup _{x\in A}\rho \left( \{x\},B\right) ,\sup _{y\in B}\rho \left( \{y\},A\right) \right\} \end{aligned}$$
(6)

where \(\rho (\{x\},B)=\inf _{y\in B}\{\rho (x,y)\}\), with \(\rho (x,y)\) denoting the distance between two points. Note that the definition of the Hausdorff distance is very general and, depending on the selection of the distance \(\rho \), different error criteria emerge. Usually, \(\rho \) corresponds to the chordal distance (Euclidean distance in \(\mathbb {R}^d\), \(\rho _1\)).
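Both error criteria can be approximated numerically for subsets of the unit circle. The sketch below rests on two assumptions that are not part of the text: the sets are discretised on an equispaced grid of angles, and arc length plays the role of \(\mu \) in (5). The Hausdorff distance in (6) is computed between the finite sets of grid points with the chordal distance \(\rho _1\).

```python
import numpy as np

def distance_in_measure(mask_a, mask_b):
    """Approximate mu(A symmetric-difference B) for two subsets of the unit
    circle given as boolean masks on the same equispaced grid of angles."""
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    cell = 2.0 * np.pi / mask_a.size  # arc length represented by one grid point
    return cell * np.count_nonzero(mask_a ^ mask_b)

def hausdorff_chordal(angles_a, angles_b):
    """Hausdorff distance between two finite, non-empty sets of angles,
    using the chordal distance on the unit circle."""
    a = np.asarray(angles_a, dtype=float)
    b = np.asarray(angles_b, dtype=float)
    # Chord between the unit vectors at angles s and t: 2 |sin((s - t) / 2)|.
    dist = 2.0 * np.abs(np.sin(0.5 * np.subtract.outer(a, b)))
    # max of sup_{x in A} rho(x, B) and sup_{y in B} rho(y, A)
    return max(dist.min(axis=1).max(), dist.min(axis=0).max())

# Toy example: A is the arc [0, pi/2] and B is the arc [0, pi].
grid = np.linspace(0.0, 2.0 * np.pi, 720, endpoint=False)
arc_a = grid <= np.pi / 2
arc_b = grid <= np.pi
d_mu = distance_in_measure(arc_a, arc_b)
d_h = hausdorff_chordal(grid[arc_a], grid[arc_b])
```

In this toy case the symmetric difference is an arc of length close to \(\pi /2\), while the Hausdorff distance is close to the chord subtending a right angle, \(\sqrt{2}\). Adding a single isolated grid point to one of the sets would barely change `d_mu` but could change `d_h` substantially, which is the contrast drawn in the text.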

Remark 1

Other natural choices such as the geodesic distance (great circle, \(\rho _2\)) could be considered in Eq. (6). The Hopf-Rinow Theorem states that \(\rho _1\) and \(\rho _2\) induce the same topology on \(S^{d-1}\). Figure 1 in Jeong et al. (2017) illustrates that \(\rho _1(x,y)\le \rho _2(x,y)\) for any pair of points x, y in the unit circle. Following Lemma 3 in Boissonnat et al. (2019), a general upper bound for \(\rho _2(x,y)\), valid for all \(x,y\in S^{d-1}\) and depending on \(\rho _1(x,y)\), can be obtained. Specifically, it is possible to prove that \(\rho _2(x,y)\le \arcsin (\rho _1(x,y))\) for all \(x,y\in S^{d-1}\) when the constant r is equal to 1/2.
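The relationship between the two distances can be checked numerically. The short sketch below compares \(\rho _1\) and \(\rho _2\) on random pairs of points of the unit sphere in \(\mathbb {R}^3\) and evaluates the elementary identity \(\rho _2=2\arcsin (\rho _1/2)\), a standard fact for unit spheres that is not stated in the text and is used here only as an illustration of the monotone link between both distances.

```python
import numpy as np

def chordal(x, y):
    """Chordal distance rho_1: Euclidean distance between unit vectors."""
    return np.linalg.norm(x - y, axis=-1)

def geodesic(x, y):
    """Geodesic (great-circle) distance rho_2 between unit vectors."""
    inner = np.sum(x * y, axis=-1)
    return np.arccos(np.clip(inner, -1.0, 1.0))  # clip guards against rounding

# Random pairs of points on the unit sphere (normalised Gaussian vectors).
rng = np.random.default_rng(0)
pts = rng.normal(size=(2, 1000, 3))
pts /= np.linalg.norm(pts, axis=-1, keepdims=True)

rho1 = chordal(pts[0], pts[1])
rho2 = geodesic(pts[0], pts[1])

print("rho_1 <= rho_2 for all pairs:", bool(np.all(rho1 <= rho2 + 1e-9)))
print("max |rho_2 - 2 arcsin(rho_1 / 2)|:",
      float(np.max(np.abs(rho2 - 2.0 * np.arcsin(np.clip(rho1 / 2.0, 0.0, 1.0))))))
```

Since \(\rho _2\) is an increasing function of \(\rho _1\), convergence to zero in one distance implies convergence in the other.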

The metric \(d_H\) is not completely successful in detecting differences in shape properties. In other words, two sets can be very close in Hausdorff distance and still show quite different shapes. This typically happens when the boundaries \(\partial A\) and \(\partial B\) are far apart, no matter the proximity of A and B. So a natural way to reinforce the notion of visual proximity between two sets provided by the Hausdorff distance is to account also for the proximity of the respective boundaries. This error criterion has also been considered for establishing consistency results of several directional plug-in reconstructions. Cuevas et al. (2006) prove that \(\lim _{n\rightarrow \infty }d_H(\partial G_f(t),\partial \hat{G}_f(t))=0, \text{ a.s. },\) when the Hausdorff distance is defined from \(\rho _1\). If the Hausdorff distance involves \(\rho _2\), Cholaquidis et al. (2020) prove that \(\lim _{n\rightarrow \infty }d_H(\partial G_f(t),\partial \hat{G}_f^{'}(t))=0\) and \(\lim _{n\rightarrow \infty }d_H(G_f(t),\hat{G}_f^{'}(t))=0, \text{ a.s. }\). The existing monotone relationship between chordal and geodesic distances guarantees the consistency of the plug-in estimator in Cholaquidis et al. (2020) also when the Hausdorff distance depends on \(\rho _1\) instead of \(\rho _2\). Therefore, if the target is the reconstruction of a set, the Hausdorff metric (defined from the chordal distance) can be seen as a suitable error criterion in the directional setting.

3 HDRs in the directional setting

As noted in the Introduction, the level t is usually unknown in (1) and, for practical purposes, the practitioner chooses the probability content of the set instead of the level t. This particular class of level sets, widely considered for Euclidean data, are the so-called HDRs (see Box and Tiao 1973; Hyndman 1996 or Samworth and Wand 2010). However, as far as we know, HDRs have not yet been defined in the directional context. The need for a proper extension of this notion and for adequate estimation tools can be easily justified. Figure 7 (top) shows four different 50% circular regions (regions containing 50% of the probability, empirically approximated) for the kernel density estimator \(f_n\) represented in grey. Although all of them have probability content equal to 50%, they exhibit completely different shapes. Therefore, it is obvious that there exists an infinite number of ways to choose a region with given coverage probability and, in a general scenario, it may not be clear which region must be chosen. The same happens for real-valued random variables, and Hyndman (1996) suggests that HDRs are the best subset to summarize a probability distribution.

The usual purpose in summarizing a probability distribution by a region of the sample space is to delineate a comparatively small set which contains most of the probability, although the density may be nonzero over infinite regions of the sample space. Therefore, as in the Euclidean case, it is necessary to decide what properties the region has to verify. The following conditions are natural:

(C1) The region should occupy the smallest possible volume in the sample space.

(C2) Every point inside the region should have probability density at least as large as every point outside the region.

Following Box and Tiao (1973), conditions (C1) and (C2) are equivalent and lead to regions called HDRs. Definition 1 formalizes this concept in the directional context taking into account the second criterion.

Definition 1

Let f be a directional density function on \(S^{d-1}\) of a random vector X. Given \(\tau \in (0,1)\), the \(100(1 - \tau )\)% HDR is the subset

$$\begin{aligned} L(f_\tau )=\{x\in S^{d-1}:f(x)\ge f_\tau \} \end{aligned}$$
(7)

where \(f_\tau \) can be seen as the largest constant such that

$$\begin{aligned} \mathbb {P}(X\in L(f_\tau ))\ge 1-\tau \end{aligned}$$
(8)

with respect to the distribution induced by f.

According to Polonik (1997) and García et al. (2003) in the Euclidean context, \(L(f_\tau )\) is the minimum volume level set with probability content at least \((1-\tau )\). Figure 4 shows the HDR \(L(f_\tau )\) in grey for three different circular densities and three different values of \(\tau \). The threshold \(f_\tau \) is represented through a dotted grey line. Note that, if large values of \(\tau \) are considered, \(L(f_\tau )\) is equal to the greatest modes and, therefore, the most differentiated clusters can be easily identified. However, for small values of \(\tau \), \(L(f_\tau )\) is almost equal to the support of the distribution.

3.1 Plug-in estimation of directional HDRs

The first step to reconstruct the HDR in Definition 1 for a given \(\tau \in (0,1)\) is to estimate the threshold \(f_\tau \). As in the Euclidean case, numerical integration methods could be also used in the directional setting in order to approximate its value. However, when the dimension increases, the computational cost becomes a major issue due to the complexity of the numerical integration algorithms considered on high dimensional spaces. An alternative approach for estimating \(f_{\tau }\) with a feasible computational cost is described next.

As before, let X be a random vector with directional density f and let \(Y = f(X)\) be the random variable obtained by transforming X by its own density function. Since \(\mathbb {P}(f(X)\ge f_\tau )=1-\tau \), \(f_{\tau }\) is exactly the \(\tau -\)quantile of Y. Following Hyndman (1996), \(f_{\tau }\) can therefore be estimated as a sample quantile from a set of independent and identically distributed random variables with the same distribution as Y.

In particular, if \(\mathcal X_n=\{X_1,\ldots ,X_{n}\}\) denotes a set of independent observations in \(S^{d-1}\) from a density f, \(\{f(X_1), \ldots , f(X_n)\}\) is a set of independent observations from the distribution of Y. Let \(f_{(j)}\) be the \(j-\)th order statistic of \(\{f(X_i)\}_{i=1}^n\) (the \(j-\)th value when they are sorted in increasing order), so that \(f_{(j)}\) is the (j/n) sample quantile of Y. We shall use \(f_{(j)}\) as an estimate of \(f_\tau \). Specifically, we choose \(\hat{f}_\tau = f_{(j)}\) where \(j = \lfloor \tau n\rfloor \). Cadre et al. (2009) study the convergence of \(\hat{f}_\tau \) to \(f_\tau \) in the linear setting.
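The density quantile rule can be sketched in a few lines of Python. The sketch reads \(f_{(j)}\) as the \(j-\)th value in increasing order, which is the reading that makes it the (j/n) sample quantile of Y; the guard that keeps j between 1 and n is an added assumption for very small \(\tau n\), and the von Mises model used in the example is purely illustrative.

```python
import numpy as np

def hdr_threshold(density_values, tau):
    """Estimate f_tau by the order statistic f_(j), j = floor(tau * n),
    of the values f(X_1), ..., f(X_n) sorted in increasing order."""
    y = np.sort(np.asarray(density_values, dtype=float))
    n = y.size
    j = int(np.floor(tau * n))
    j = min(max(j, 1), n)  # assumption: clamp j to 1..n when tau * n < 1
    return y[j - 1]        # j is 1-based, array indices are 0-based

# Illustrative use with a known circular density (assumed: von Mises(0, 2)).
rng = np.random.default_rng(0)
kappa = 2.0
x = rng.vonmises(0.0, kappa, size=1000)
y = np.exp(kappa * np.cos(x)) / (2.0 * np.pi * np.i0(kappa))  # Y_i = f(X_i)

f_hat = hdr_threshold(y, tau=0.2)
coverage = np.mean(y >= f_hat)  # empirical probability content of {f >= f_hat}
```

By construction, about \(100(1-\tau )\)% of the transformed observations lie at or above the estimated threshold, so `coverage` should be close to 0.8 here.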

Fig. 7

50% circular regions for the circular kernel estimator \(f_n\) (grey color) computed from a sample \(\mathcal {X}_{250}\) (top). Boxplots of \(\{f_n(X_1),\ldots ,f_n(X_{250})\}\) and quantiles (dotted lines) that determine the 50% regions (bottom)

Obviously, if f is a known function, the observations can be pseudorandomly generated and the estimation of \(f_\tau \) could be made arbitrarily accurate by increasing n. In practice, f is often unknown and the only available information is a random sample of points \(\mathcal {X}_n\) from it. From this sample, we propose first to determine the kernel estimator \(f_n\) in (4). If n is large enough, the set \(\{f_n(X_1),\ldots ,f_n(X_n)\}\) is then calculated in order to estimate \(f_\tau \) empirically. If n is moderate, it may be preferable to generate observations \(\mathcal X_N=\{X_1,\ldots ,X_N\}\) of large size N from \(f_n\). For small values of n it may not be possible to get a reasonable density estimate. Besides, with few observations and no prior knowledge of the underlying density, there seems little point in attempting to summarize the sample space (see Wand and Jones 1995 for some discussion on the number of observations needed for a reasonable linear density estimate). Note that the problem here is not with the density quantile algorithm (which gives results to an arbitrary degree of accuracy given a density), but with estimating the density from insufficient data.

Once the threshold \(f_\tau \) is estimated, plug-in methods reconstruct the \(100(1 - \tau )\)% HDR namely \(L(f_\tau )\) in (7) as

$$\begin{aligned} \hat{L}(\hat{f}_\tau )=\{x\in S^{d-1}:f_n(x)\ge \hat{f}_\tau \}. \end{aligned}$$
(9)

Figure 7 shows the circular kernel estimator \(f_n\) (grey color) calculated from a sample \(\mathcal {X}_{250}\) generated from the second model (black color) in Fig. 4 and different empirically approximated 50% circular regions (grey color, top). The boxplot of the transformed values denoted by \(\{f_n(X_1),\ldots ,f_n(X_{250})\}\) is also shown (bottom). The dotted lines represent the quantiles that determine the corresponding 50% (probability coverage) circular region. Note that only the estimated HDR (left), \(\hat{L}(\hat{f}_\tau )\), is able to show the existence of the five existing modes.
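The plug-in rule in (9) can be sketched for circular data as follows. This is a minimal illustration under stated assumptions, not the HDiR implementation: angles are in radians, the kernel is von Mises with concentration \(1/h^2\), the HDR is represented by a boolean mask on a grid, and the function names are hypothetical. The optional argument `n_mc` implements the variant in which a large sample is generated from \(f_n\) (a data point is picked at random and perturbed with von Mises noise) before taking the quantile.

```python
import numpy as np

TWO_PI = 2.0 * np.pi

def circ_kde(theta, sample, h):
    """Kernel estimator (4) on the circle with a von Mises kernel, kappa = 1/h^2."""
    kappa = 1.0 / h**2
    diffs = np.subtract.outer(np.asarray(theta, float), np.asarray(sample, float))
    return np.exp(kappa * np.cos(diffs)).mean(axis=1) / (TWO_PI * np.i0(kappa))

def sample_from_kde(sample, h, size, rng):
    """Draw from f_n: choose an observation at random, add von Mises noise."""
    centres = rng.choice(sample, size=size, replace=True)
    return np.mod(rng.vonmises(centres, 1.0 / h**2), TWO_PI)

def plugin_hdr(sample, h, tau, grid, n_mc=None, rng=None):
    """Return (mask, f_hat_tau): the plug-in HDR on `grid` and its threshold."""
    sample = np.asarray(sample, dtype=float)
    if n_mc is None:
        points = sample  # use f_n(X_1), ..., f_n(X_n) directly
    else:
        rng = np.random.default_rng() if rng is None else rng
        points = sample_from_kde(sample, h, n_mc, rng)  # Monte Carlo variant
    values = np.sort(circ_kde(points, sample, h))
    j = min(max(int(np.floor(tau * values.size)), 1), values.size)
    f_hat_tau = values[j - 1]
    return circ_kde(grid, sample, h) >= f_hat_tau, f_hat_tau

# Illustrative bimodal sample (assumed model, not one of the paper's models).
rng = np.random.default_rng(0)
theta = np.mod(np.concatenate([rng.vonmises(1.0, 8.0, 300),
                               rng.vonmises(-2.0, 8.0, 300)]), TWO_PI)
grid = np.linspace(0.0, TWO_PI, 2000, endpoint=False)
mask, thr = plugin_hdr(theta, h=0.3, tau=0.5, grid=grid)
```

The bandwidth `h=0.3` is an arbitrary choice for the example; in practice it would come from one of the selectors discussed in the paper.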

Apart from the consistency of \(\hat{f}_\tau \), Cadre et al. (2009) establish the exact convergence rate (considering the distance in measure \(d_\mu \) as error criterion) for Euclidean HDRs. The extension of these results to the directional setting does not seem straightforward. However, if the consistency of \(\hat{f}_{\tau }\) remains true, we could prove that \(\hat{L}(\hat{f}_\tau )\) is also a \(d_H-\)consistent estimator of \(L(f_\tau )\) in \(S^2\) under the assumptions of Corollary 1 and condition (T) in Cuevas et al. (2006). To complete the proof, it is only necessary to apply a triangle inequality to \(d_H(L(f_\tau ),\hat{L}(\hat{f}_\tau ))\).

3.1.1 Confidence regions for estimated HDRs

The density quantile algorithm detailed above for approximating the threshold \(f_{\tau }\) involves an empirical approximation. Then, it is convenient to compute some uncertainty limits on the estimated regions.

For the simplest case of X being a circular random variable (following Hyndman 1996), standard asymptotic results for sample quantiles in Cox and Hinkley (1979) allow one to prove that \(\hat{f}_{\tau }\) is asymptotically normally distributed with mean \(f_{\tau }\) and variance \(\tau (1 - \tau )/(n[F(f_\tau )]^2)\) where

$$\begin{aligned} F(y)=y\sum _{i=1}^{n(y)}|f^{'}(z_i)|^{-1} \end{aligned}$$

and \(\{z_i\}\) denote those points in the sample space of X such that \(f(z_i) = y\), \(i = 1, 2,\ldots ,n(y)\).

Alternatively, a bootstrap algorithm can be easily designed to compute confidence regions for estimated HDRs. The procedure is detailed in Algorithm 1.

Algorithm 1

Fig. 8

95 % Confidence regions considering the asymptotic approach (first row, in dark red color) and the bootstrap procedure with \(B=250\) (second row, in purple color) from \(\mathcal {X}_{500}\) of a circular density with \(\tau _1\) (first column), \(\tau _2\) (second column) and \(\tau _3\) (third column) verifying \(0<\tau _1<\tau _2<\tau _3<1\)

As an illustration, Fig. 8 shows the estimated confidence regions using the asymptotic approach (first row, in dark red color) and the bootstrap procedure (second row, in purple color) for three different values of \(\tau \) when \(\alpha =0.05\) and \(B=250\). Cross validation bandwidths introduced in Hall et al. (1987) were used as smoothing parameters for circular density estimation in both approaches.
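The following Python sketch shows one generic way a bootstrap confidence region for a circular HDR could be computed. It is not a transcription of Algorithm 1, whose steps are not reproduced in this text. It assumes a smoothed bootstrap (resamples drawn from \(f_n\)), a percentile interval for the threshold, and a confidence region delimited by the plug-in sets at the two interval endpoints; all function names are hypothetical.

```python
import numpy as np

TWO_PI = 2.0 * np.pi

def circ_kde(theta, sample, h):
    """Circular kernel estimator with a von Mises kernel, kappa = 1/h^2."""
    kappa = 1.0 / h**2
    diffs = np.subtract.outer(np.asarray(theta, float), np.asarray(sample, float))
    return np.exp(kappa * np.cos(diffs)).mean(axis=1) / (TWO_PI * np.i0(kappa))

def threshold(values, tau):
    """Order statistic f_(j), j = floor(tau * n), in increasing order."""
    y = np.sort(np.asarray(values, dtype=float))
    j = min(max(int(np.floor(tau * y.size)), 1), y.size)
    return y[j - 1]

def bootstrap_threshold_interval(sample, h, tau, alpha=0.05, n_boot=250, rng=None):
    """Percentile bootstrap interval for the HDR threshold (assumed scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    sample = np.asarray(sample, dtype=float)
    boot_thr = np.empty(n_boot)
    for b in range(n_boot):
        # Smoothed bootstrap: resample an observation, add von Mises noise.
        centres = rng.choice(sample, size=sample.size, replace=True)
        boot = np.mod(rng.vonmises(centres, 1.0 / h**2), TWO_PI)
        boot_thr[b] = threshold(circ_kde(boot, boot, h), tau)
    lower, upper = np.quantile(boot_thr, [alpha / 2.0, 1.0 - alpha / 2.0])
    return lower, upper

def confidence_band(sample, h, grid, lower, upper):
    """Inner and outer plug-in regions given by the two threshold limits."""
    dens = circ_kde(grid, sample, h)
    return dens >= upper, dens >= lower  # inner region is contained in outer

rng = np.random.default_rng(2)
sample = np.mod(rng.vonmises(0.0, 4.0, size=200), TWO_PI)
grid = np.linspace(0.0, TWO_PI, 1000, endpoint=False)
lo, up = bootstrap_threshold_interval(sample, h=0.3, tau=0.5, n_boot=40, rng=rng)
inner, outer = confidence_band(sample, 0.3, grid, lo, up)
```

Re-estimating the density on each resample with the same h smooths twice, so the bootstrap thresholds can be biased; a more careful scheme would correct for this.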

3.2 A suitable bootstrap bandwidth selector

The plug-in reconstruction of the directional HDRs in (9) involves the calculation of the kernel density estimator in (4), which is known to be heavily dependent on the selection of h. The existing methods for selecting an optimal value for h aim to minimize some error criterion on the target density f, but they are not specifically designed for the estimation of HDRs. The goal of this section is to propose the first selector of h specifically designed for HDR reconstruction.

A bootstrap bandwidth selector focused on the problem of reconstructing HDRs is introduced in what follows. The idea is to use an error criterion that quantifies the differences between the theoretical region and its plug-in reconstruction. In the real-valued setting, Samworth and Wand (2010) propose one of the first bandwidth selectors for HDR estimation, studying a relatively uncommon distance (depending on both \(\mu \) and g) between these sets. In this work, we consider the classical Hausdorff distance (introduced in Sect. 2.2) between the boundaries of the HDR and the corresponding estimator.

In the directional case, a closed-form expression for \(d_H(\partial L(f_\tau ),\partial \hat{L}(\hat{f}_\tau ))\) is not known. However, it can be estimated through a bootstrap procedure. Therefore, a new bandwidth selector can be established as

$$\begin{aligned} h_{1}=\arg \min _{h>0}\mathbb {E}_B\left[ d_H(\partial L^{*}(\hat{f}_\tau ^*),\partial \hat{L}(\hat{f}_\tau ))\right] \end{aligned}$$
(10)

where \(\mathbb {E}_B\) denotes the bootstrap expectation with respect to random samples \(\mathcal X_n^*=\{X_1^*,\ldots ,X_n^*\}\) generated from the directional kernel estimator \(f_n\). Of course, the resulting selector depends on a pilot bandwidth and also on the choice of the distance \(\rho \) in Eq. (6).
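The minimization in (10) can be sketched for circular data as follows. This is a minimal illustration under stated assumptions, not the implementation available in HDiR: it assumes a von Mises kernel with concentration \(1/h^2\), a finite grid of candidate bandwidths, a reference boundary computed from the original sample with the pilot bandwidth, and the chordal distance between angles. All function names are hypothetical.

```python
import math
import random

TWO_PI = 2.0 * math.pi
GRID = [TWO_PI * j / 180 for j in range(180)]

def kde(points, sample, h):
    """Von Mises kernel estimate (kappa = 1/h**2), up to a constant factor.

    The constant is omitted because it rescales the density and its sample
    quantile alike, leaving the plug-in HDR unchanged.
    """
    kappa = 1.0 / h ** 2
    return [sum(math.exp(kappa * (math.cos(t - x) - 1.0)) for x in sample)
            for t in points]

def hdr_boundary(sample, h, tau):
    """Grid angles at which the plug-in HDR indicator changes value."""
    values = sorted(kde(sample, sample, h))
    level = values[max(0, math.ceil(tau * len(values)) - 1)]
    inside = [v >= level for v in kde(GRID, sample, h)]
    # inside[j - 1] wraps around at j = 0, as required on the circle.
    return [GRID[j] for j in range(len(GRID)) if inside[j] != inside[j - 1]]

def chordal(s, t):
    """Chord length between the unit-circle points with angles s and t."""
    return 2.0 * abs(math.sin((s - t) / 2.0))

def hausdorff(A, B):
    """Hausdorff distance between two finite sets of angles."""
    if not A or not B:
        # Assumption: an empty boundary is at maximal distance from a
        # non-empty one, and at distance zero from another empty one.
        return 0.0 if not A and not B else 2.0
    return max(max(min(chordal(a, b) for b in B) for a in A),
               max(min(chordal(a, b) for a in A) for b in B))

def bootstrap_bandwidth(sample, tau, pilot, candidates, B=20, seed=0):
    """Minimise the bootstrap mean Hausdorff error over candidate bandwidths."""
    rng = random.Random(seed)
    n = len(sample)
    reference = hdr_boundary(sample, pilot, tau)
    # Smoothed bootstrap: draw from the pilot kernel estimate, i.e. pick a
    # data point at random and perturb it with von Mises noise.
    resamples = [[rng.vonmisesvariate(rng.choice(sample), 1.0 / pilot ** 2)
                  for _ in range(n)] for _ in range(B)]
    risk = {h: sum(hausdorff(hdr_boundary(r, h, tau), reference)
                   for r in resamples) / B for h in candidates}
    return min(risk, key=risk.get), risk

# Illustrative data: 60 angles simulated from a von Mises density.
rng = random.Random(1)
data = [rng.vonmisesvariate(math.pi, 4.0) for _ in range(60)]
h1, risk = bootstrap_bandwidth(data, tau=0.5, pilot=0.3,
                               candidates=[0.15, 0.3, 0.45, 0.6], B=10)
print(h1, risk)
```

The same resamples are reused for every candidate bandwidth, which keeps the comparison between candidates free of extra Monte Carlo noise. This is a design choice of the sketch rather than a requirement of (10).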

Figure 12 shows the theoretical HDR for model S3 (see Sect. 4.2) when \(\tau =0.5\) (first and second columns). Moreover, the plug-in estimator \(\hat{L}(\hat{f}_\tau )\) obtained from a sample of size \(n=1000\) and considering the bandwidth proposed in García-Portugués (2013) when \(\tau =0.5\) is also represented (third column). Note that, for this sample size, only the largest mode is detected. In this particular case, the Hausdorff error is smaller if the HDR is reconstructed from a cross-validation bandwidth designed for density estimation (fourth and fifth columns). A relevant issue appears when \(h_1\) is estimated from imprecise HDR estimators. Recall that the minimization procedure considered for determining \(h_1\) involves the boundary of the set \(\hat{L}(\hat{f}_\tau )\). If this set is poorly approximated, the resulting bandwidth will most likely not provide competitive results. Therefore, larger sample sizes will be considered in this section to avoid this problem.

Another point worth mentioning is that diverse bandwidth selectors emerge from the different choices of \(\rho \) in the definition of the Hausdorff distance. In fact, other bandwidths could be defined if, for example, \(d_H\) in (10) is replaced by a completely different error criterion such as the distance in measure \(d_\mu \) that, unlike the Hausdorff distance, does not take into account connected components of a set that consist of a single isolated point. Therefore, we could propose as many bandwidths as there are distances between sets, attending to the specific properties and characteristics of each distance.
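The different behavior of the two criteria in the presence of an isolated point can be checked numerically on the circle. The sketch below is only illustrative: sets are represented as unions of closed arcs given in radians, an isolated point is encoded as a degenerate arc, \(d_\mu \) is taken with respect to arc length, and the Hausdorff distance uses the chord length between angles. All helper names are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi

def contains(arcs, t):
    """True if angle t lies in one of the closed arcs (start <= end, radians)."""
    return any(a <= t <= b for a, b in arcs)

def distance_in_measure(arcs_a, arcs_b, m=3600):
    """Arc length of the symmetric difference, by a midpoint rule on m cells."""
    mids = [TWO_PI * (j + 0.5) / m for j in range(m)]
    differing = sum(contains(arcs_a, t) != contains(arcs_b, t) for t in mids)
    return TWO_PI * differing / m

def discretise(arcs, step=0.01):
    """Finite point cloud covering each arc, endpoints included."""
    points = []
    for a, b in arcs:
        points.extend(a + i * step for i in range(int((b - a) / step) + 1))
        points.append(b)
    return points

def chordal(s, t):
    """Chord length between the unit-circle points with angles s and t."""
    return 2.0 * abs(math.sin((s - t) / 2.0))

def hausdorff(arcs_a, arcs_b):
    """Hausdorff distance (chordal metric) between discretised unions of arcs."""
    A, B = discretise(arcs_a), discretise(arcs_b)
    return max(max(min(chordal(a, b) for b in B) for a in A),
               max(min(chordal(a, b) for a in A) for b in B))

arc = [(0.0, 1.0)]
# Same arc plus an isolated point at angle pi (degenerate arc).
arc_plus_point = [(0.0, 1.0), (math.pi, math.pi)]
print(distance_in_measure(arc, arc_plus_point))
print(hausdorff(arc, arc_plus_point))
```

In this example the distance in measure is zero, because a single point has no arc length, whereas the Hausdorff distance equals the chord between the isolated point and the nearest endpoint of the arc, \(2\cos (0.5)\approx 1.755\).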

Fig. 9 Circular density models for simulations. Dotted circles represent the threshold \(f_\tau \) when \(\tau =0.2\), \(\tau =0.5\) and \(\tau =0.8\), respectively

4 Simulation study

The performance of the proposed bandwidth selector is explored in this section. As mentioned above, there exist other bandwidth selectors for directional kernel density estimation (see Appendix D), although they are not specifically designed for HDR reconstruction. We will also check the impact of considering some of these selectors in the HDR plug-in estimation. Specifically, the selector \(h_1\) established in (10) was implemented considering the chordal distance \(\rho _1\) that, as we show in Sect. 2.2, guarantees good asymptotic properties of directional level sets. The code for computing it is available in the R library HDiR. All the other bandwidths are implemented in the R packages NPCirc and Directional. Sections 4.1 and 4.2 contain the results obtained in circular and spherical settings, respectively. Some additional simulation results are also contained in Appendix C.

4.1 Estimation of circular HDRs

A collection of 9 circular densities (models C1 to C9) has been considered in this simulation study. These models are mixtures of different circular distributions and they correspond to densities 5, 6, 7, 8, 10, 11, 16, 19 and 20 fully described in Oliveira et al. (2014). Figure 9 shows these densities and the thresholds \(f_\tau \) for \(\tau =0.2\), \(\tau =0.5\) and \(\tau =0.8\) through dotted circles.

A total of 250 random samples of sizes \(n=500\) and \(n=1000\) were generated for each of these models. From each sample, circular HDRs are reconstructed for \(\tau =0.2\), \(\tau =0.5\) and \(\tau =0.8\). Results for \(\tau =0.2\) and \(\tau =0.8\) are reported in Appendix C.1. The behavior of the plug-in methods that emerge from the consideration of different bandwidth parameters will be checked. Apart from \(h_1\), we will consider the circular rule-of-thumb by Taylor (2008) (\(h_2\)); its improved version by Oliveira et al. (2012) (\(h_3\)); cross-validation methods (likelihood \(h_4\) and least squares \(h_5\)) introduced by Hall et al. (1987) and the bootstrap bandwidth (\(h_6\)) presented by Di Marzio et al. (2011). Note that for computing \(h_1\), a pilot bandwidth is required. In this study, \(h_3\) has been taken as a pilot, and \(B=200\) resamples are considered for obtaining \(h_1\).

Table 1 Means (M) and standard deviations (SD) of 250 errors in Hausdorff distance for \(\tau =0.5\), \(n=500\) and \(B=200\)
Table 2 Means (M) and standard deviations (SD) of 250 errors in Hausdorff distance for \(\tau =0.5\), \(n=1000\) and \(B=200\)

For each method and each sample, the estimation error is measured by computing the Hausdorff distance (\(d_H\)) between the boundary of the estimated HDR and the boundary of the theoretical set. As a reference, note that the maximum value of this criterion in \(S^1\) is 2 (the length of the diameter of the circle).
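The upper bound can be verified in one line, assuming that the chordal distance \(\rho _1\) is the Euclidean norm of the difference between two unit vectors:

$$\begin{aligned} \rho _1(x,y)^2=\Vert x-y\Vert ^2=\Vert x\Vert ^2+\Vert y\Vert ^2-2x^{\top }y=2-2x^{\top }y\le 4,\qquad x,y\in S^{d-1}, \end{aligned}$$

with equality if and only if \(y=-x\). Consequently, under this assumption the Hausdorff distance between any two non-empty subsets of the sphere is at most 2, whatever the dimension.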

Tables 1 and 2 show the means and the standard deviations of the 250 estimation errors obtained when \(\tau =0.5\) from samples of sizes \(n=500\) and \(n=1000\), respectively. Bold numbers correspond to the lowest mean errors obtained for each density. Taking into account the variety of models considered, exhibiting different features, it is not surprising that each of the bandwidth selectors is the best one for some model, with \(h_1\) showing a competitive behavior in all cases. In fact, it is the best one for models C3, C5, C6 and C8 (with \(n=1000\)).

Figure 10 shows the violin plots of Hausdorff errors obtained for some of the simulation models when \(\tau =0.5\) (\(n=1000\)). It shows that \(h_2\) is the selector that presents the worst behavior for models C3 and C6. Furthermore, its variance is again especially large for model C3.

Finally, it is worth mentioning that the results in Appendix C.1 show that the competitive behavior of \(h_1\) improves considerably when high values of \(\tau \) are selected. This is not a minor issue when the goal is to estimate the biggest modes of a distribution.

4.2 Estimation of spherical HDRs

For the spherical scenario, 9 density models have been considered. These models, namely S1 to S9, are mixtures of von Mises-Fisher densities on the sphere, which makes it possible to represent complex structures showing multimodality and/or asymmetry. Parameters of the mixtures are fully established in Table 8 in Appendix B for reproducibility. Moreover, these densities are also implemented in the R package HDiR. Figure 11 shows them and the corresponding HDRs for \(\tau =0.2\), \(\tau =0.5\) and \(\tau =0.8\).

Fig. 10 Violin plots of Hausdorff errors for models C3, C6 and C8 when \(\tau =0.5\) and \(n=1000\). Note that due to the behavior of \(h_2\), the scales of these figures are different

Fig. 11 Finite mixtures of von Mises-Fisher spherical models for simulations. HDRs are represented for \(\tau =0.2\), \(\tau =0.5\) and \(\tau =0.8\)

For sample sizes \(n=500\), \(n=1500\) and \(n=2500\), 200 random samples were generated from models S1 to S9. From each sample, HDRs are reconstructed for \(\tau =0.2\), \(\tau =0.5\) and \(\tau =0.8\). As before, results for \(\tau =0.2\) and \(\tau =0.8\) are contained in Appendix C.2. The performance of different plug-in methods that emerge from the consideration of different bandwidth parameters discussed in this work is checked. Apart from \(h_1\), cross-validation bandwidth selectors for data on a sphere \(S^{d-1}\) (\(h_5\)) and the plug-in bandwidth selector introduced by García-Portugués (2013) (\(h_7\)) are taken into account. In this case, a total of \(B=50\) resamples are established for estimating the proposed bootstrap bandwidth \(h_1\), taking \(h_5\) as a pilot bandwidth.

For each method and each sample, the estimation error is again measured by calculating the Hausdorff distance between the boundary of the estimated HDR and the boundary of the theoretical set. As a reference, note that the maximum value of this criterion in \(S^2\) is also 2. In this case, this upper bound coincides exactly with the length of the diameter of the sphere.

Table 3 Means (M) and standard deviations (SD) of 200 errors in Hausdorff distance for \(\tau =0.5\), \(n=500\) and \(B=50\)
Table 4 Means (M) and standard deviations (SD) of 200 errors in Hausdorff distance for \(\tau =0.5\), \(n=1500\) and \(B=50\)
Table 5 Means (M) and standard deviations (SD) of 200 errors in Hausdorff distance for \(\tau =0.5\), \(n=2500\) and \(B=50\)

Tables 3, 4 and 5 contain the results for \(\tau =0.5\) when \(n=500\), \(n=1500\) and \(n=2500\), respectively. Bold numbers correspond to the lowest mean errors obtained for each density. The proposed selector \(h_1\) is the best or second best in all cases. In fact, \(h_1\) and \(h_5\) usually behave similarly and \(h_7\) is the worst selector for S3.

Figure 13 contains the violin plots of Hausdorff errors for some of the considered models when \(\tau =0.5\) and \(n=1500\). Note that \(h_1\) and \(h_5\) usually present similar results (see densities S5 and S9). However, \(h_1\) is clearly more competitive for models S1 and S8.

To conclude, the simulations in Appendix C.2 confirm the good performance of the selector \(h_1\) when small or large values of \(\tau \) are considered for spherical data.

Fig. 12 Theoretical HDR for model S3 when \(\tau =0.5\) (first and second columns). Sample of size \(n=1000\) of model S3 (blue color) and corresponding plug-in estimators (black color) when \(\tau =0.5\) considering \(h_7\) (third column) and \(h_5\) (fourth and fifth columns) as smoothing parameters. Note that the last two columns show two views of the sphere

Fig. 13 Violin plots of Hausdorff errors for models S1, S5, S8 and S9 when \(\tau =0.5\) and \(n=1500\). Note that the scales of these figures are different

5 Real data analysis

The proposed methodology is now applied to the two real datasets presented in the Introduction, exemplifying the applicability of the method for circular and spherical data.

5.1 Behavioral plasticity of sandhoppers

Adaptation to changing beach environments for the real example on sandhoppers introduced in Sect. 1.1 is analyzed from the perspective of HDR estimation. HDRs are estimated for \(\tau =0.8\), disaggregating the sandhoppers data according to the categories of the variables species, sex, time of day and month of year. As a consequence, a total of 24 set estimators are determined, numbered E1 to E24. The variable combinations yielding this group classification are presented in Table 7 in Appendix A.

Table 6 Upper triangular matrix contains the Hausdorff distances between boundaries of sets from 1 to 24. HDRs representations for \(\tau =0.8\) for some sets between 1 and 24 (top)

Note that the estimated HDRs correspond to the largest modes of the orientation distributions. Hausdorff distances between the boundaries of these 24 sets make it possible to establish the degree of dissimilarity of the HDRs. In general, large distances between the boundaries of two sets indicate the existence of modes in different directions. If the categories of all variables except one are fixed, it is possible to check whether the different values of the non-fixed variable have some influence on sandhoppers orientation through the comparison of the estimated HDRs. The upper triangular matrix in Table 6 contains the Hausdorff distances (defined from \(\rho _1\)) between boundaries. The largest distances are represented in blue color. Grey color is used to depict the next largest values. Furthermore, Table 6 (top) contains some of the estimated HDRs that present the largest distances.

In particular, the Hausdorff distance between regions 5 and 11 is equal to 1.91 (close to 2, the maximum value of the Hausdorff distance). According to Table 7, the variable configuration 5 corresponds to the largest orientation modes for females of the species Talitrus saltator when the orientation is measured at noon during October. Region 11 refers to the same measurements taken in April. Therefore, the month can be seen as a variable that has influence on the orientation of sandhoppers.

The Hausdorff distance between regions 5 and 6 is equal to 1.93. According to Table 7, set 6 also corresponds to the HDR for females of the species Talitrus saltator but, in this case, when the orientation is registered in the morning during October. Hence, the moment of the day also seems to be a factor with influence on the sandhoppers behavior.

Several cells in Table 6 are represented in pink color. All of them correspond to considerably large values of distances (larger than 1.00) and they are used to analyze briefly the influence of each of the variables in the dataset. For the same values of the remaining variables, Talitrus saltator and Talorchestia brito present different behaviors. For instance, distances between sets 5 and 17 or 3 and 15 correspond to this situation. Sets 5 and 17 can be compared using their representations in Table 6 (top).

The importance of the sex variable for the species Talitrus saltator can also be seen considering the Hausdorff distances of the sets 2 and 5, 3 and 6 or 18 and 15. According to the images in Table 6, these sets present their largest modes in completely different directions. Note that the role of the variable month is clearly remarkable. The relatively high values of the distances between sets 1 and 7, 6 and 12, or 14 and 20 for the species Talitrus saltator and Talorchestia brito also correspond to the existence of modes in different directions. Finally, the importance of the moment of the day for Talitrus saltator can be studied through the distances between sets 4 and 5 or 4 and 6. Note that set 4 has two connected components while set 5 only presents one.

Finally, the analysis of Hausdorff distances for the two species of sandhoppers shows that the median Hausdorff distance for Talitrus saltator is 0.76, clearly larger than the median for Talorchestia brito, which is equal to 0.52. Therefore, Talitrus saltator presents more differentiated orientations, depending on the time of day, period of year and sex, than Talorchestia brito. Hence, the conclusions in Scapini et al. (2002) are corroborated from this perspective.

5.2 Earthquakes distribution on Earth

According to the theory of plate tectonics, Earth is an active planet. Its surface is composed of about 15 individual plates that move and interact, constantly changing and reshaping Earth’s outer layer. These movements are usually the main cause of volcanoes and earthquakes. Seismologists have related these natural phenomena to the boundaries of tectonic plates because they tend to occur there, see Selley et al. (2004). In fact, the concentration of earthquake epicenters traces the filamentary network of fault lines and, consequently, they could be analyzed alternatively from the perspective of nonparametric filamentary structure estimation (see, for instance, Genovese et al. 2012). Moreover, tectonic hazards can cause significant damage (destroying buildings and infrastructure, or even causing deaths). Therefore, it is important to detect which areas are especially risky. As an illustration, the recent world earthquakes distribution is analyzed next through HDR estimation.

Fig. 14 Contours of HDRs for \(\tau _1=0.1\), \(\tau _2=0.3\), \(\tau _3=0.5\), \(\tau _4=0.7\) and \(\tau _5=0.9\) obtained from the sample of world earthquakes registered between October 2004 and April 2020

Figure 14 shows the margins of the tectonic plates (grey color) and the geographical coordinates (red points) of a total of 272 medium and strong earthquakes registered between 1st October 2004 and 9th April 2020, already introduced in Sect. 1.1. Note that most events are located exactly on the plate boundaries.

Our main goal is to detect which areas are really problematic nowadays. In Sect. 1.1, we show that the largest mode is located in Southeast Europe considering a value of \(\tau =0.8\). However, a more general view on the earthquakes distribution could be obtained if more HDRs are reconstructed for a range of values of \(\tau \). Specifically, they were estimated choosing \(\tau _1=0.1\), \(\tau _2=0.3\), \(\tau _3=0.5\), \(\tau _4=0.7\) and \(\tau _5=0.9\). The bandwidth parameter used is the one proposed in García-Portugués (2013). The corresponding contours are also represented in Fig. 14 using blue colors. An interactive representation of these HDRs can be seen in Appendix D.

The two smallest contours (dark blue colors) correspond to density regions with probability at least \(1-\tau _5=0.1\) and \(1-\tau _4=0.3\), respectively. Therefore, they match the greatest modes of the world earthquakes distribution and they identify the riskiest parts of the world. They are located in Europe, specifically at the intersection of the boundaries of the Eurasian and African Plates. Note that the second of these regions even includes the frontier of the Arabian Plate. Contours for \(\tau _2=0.3\) and \(\tau _3=0.5\) are related to the Indo-Australian Plate, and the margins of the Philippine Sea and Pacific Plates appear when \(\tau _1=0.1\).

As for America, the most problematic area is detected in Central America. Specifically, it is mainly located on the frontiers of the Cocos, Nazca and Caribbean Plates. According to the contours shown, this region belongs to the zone of the world where 70% (that is, \(1-\tau _2=0.7\)) of earthquakes are registered. If \(\tau _1=0.1\) is considered, then the Pacific, North American and South American Plates appear as risky areas.

6 Conclusions and discussion

The main goals of this work are to extend the definition of HDRs for directional data and propose a plug-in estimator based on a new bootstrap bandwidth selector that is focused on HDRs reconstruction. The route designed to reach this goal can be summarized as follows: (1) Extending the definition of HDRs for directional data, (2) proposing general HDRs plug-in estimators and two different procedures for estimating confidence regions, (3) introducing the first specific selector of the bandwidth parameter for directional HDRs reconstruction, (4) studying the practical behavior of the plug-in estimators (using the new selector and other classical directional bandwidths not specifically designed for HDR reconstruction) and (5) applying the plug-in reconstruction of HDRs to the real data on sandhoppers orientation and earthquakes.

Some further research on the proposed estimator and some natural extensions of this work are discussed. The performance of the procedures for estimating the HDR confidence regions should be compared, for instance, through simulations. Additionally, consistency results on the proposed HDR estimator and the bootstrap bandwidth selector could be explored following the scheme in Cadre et al. (2009). Regarding the procedure for bandwidth selection, there are two natural extensions. Firstly, as mentioned throughout the paper, other distances may be used. Secondly, the consideration of the kernel density estimates proposed in Di Marzio et al. (2011) (torus) and García-Portugués et al. (2013) (cylinder) enables the adaptation of our proposal to these settings.

Note also that in the Introduction, we refer to the notion of clusters as the connected components of a density level set. With this view in mind, an estimator of the number of directional clusters can be given by the number of connected components of the HDR plug-in estimator. In addition, (two or more) directional densities could also be compared using the ideas explored in this work: we may compare the discrepancy between directional HDR estimations, for instance, measuring distances between boundaries. The simple geometric structure of the estimators could be used to compute the procedure and calibrate the test using re-sampling schemes.
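On the circle, counting the connected components of a level set evaluated on a grid amounts to counting runs of consecutive grid points inside the set, taking the wrap-around into account. The following sketch is a hypothetical illustration: the bimodal density, the grid size and the helper names are assumptions chosen only for the example.

```python
import math

def count_components(inside):
    """Number of maximal runs of True values in a circular boolean sequence."""
    if not inside or not any(inside):
        return 0
    if all(inside):
        return 1
    # Each run starts where a False cell is followed by a True cell;
    # inside[j - 1] wraps around at j = 0, closing the circle.
    return sum(1 for j in range(len(inside)) if inside[j] and not inside[j - 1])

def level_set_indicator(density, level, m=360):
    """Indicator of {density >= level} on m equally spaced angles."""
    return [density(2.0 * math.pi * j / m) >= level for j in range(m)]

def bimodal(theta):
    """Unnormalised mixture of two von Mises bumps centred at 0 and pi."""
    return (math.exp(4.0 * math.cos(theta))
            + math.exp(4.0 * math.cos(theta - math.pi)))

for level in (1.0, 10.0, 100.0):
    print(level, count_components(level_set_indicator(bimodal, level)))
```

For this assumed density, the level 1 lies below its minimum, so the level set is the whole circle (one component); the level 10 isolates the two modes (two components, one of which straddles the angle 0); and the level 100 exceeds its maximum, so the level set is empty.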

Finally, earthquakes on Earth could be analyzed following alternative approaches. Note that the contour lines in Fig. 14 do not clearly follow the geometry of tectonic plates. A possible cause of this behavior is that earthquakes occur very close to the boundary of the density support (that is, the frontiers of the tectonic plates) and this issue may produce a bias in the estimator. For manifolds with known boundaries, Theorem 3.1 in Berry and Sauer (2017) provides a consistent estimate of the density both in the interior and on the frontier, reducing the bias for density evaluations close to the boundaries. Since the concentration of earthquake epicenters traces the filamentary network of fault lines, following Genovese et al. (2012), the performance of nonparametric filament estimators should also be checked for further insight into this problem.