Nonparametric estimation of directional highest density regions

Saavedra-Nieves, Paula; Crujeiras, Rosa M.

doi:10.1007/s11634-021-00457-4

Regular Article
Open access
Published: 10 October 2021

Volume 16, pages 761–796, (2022)

Advances in Data Analysis and Classification

Paula Saavedra-Nieves¹ &
Rosa M. Crujeiras¹

This article has been updated

Abstract

Highest density regions (HDRs) are defined as level sets containing sample points of relatively high density. Although Euclidean HDR estimation from a random sample, generated from the underlying density, has been widely considered in the statistical literature, this problem has not yet been contemplated for directional data. In this work, directional HDRs are formally defined and plug-in estimators based on kernel smoothing and associated confidence regions are proposed. We also provide a new suitable bootstrap bandwidth selector for plug-in HDR estimation based on the minimization of an error criterion that involves the Hausdorff distance between the boundaries of the theoretical and estimated HDRs. An extensive simulation study shows the performance of the resulting estimator for the circle and for the sphere. The methodology is applied to analyze two real data sets in animal orientation and seismology.

1 Introduction

Set estimation is focused on the reconstruction of a set (or the approximation of any of its characteristic features such as its boundary or its volume) from a random sample of points. One of the specific topics in this area is concerned with the estimation of sets directly related to density functions such as level sets. Mathematically, for a given level $t>0$, the goal is to reconstruct the unknown set

$$\begin{aligned} G_g(t)=\{x\in \mathbb {R}^d:g(x)\ge t\} \end{aligned}$$

(1)

from a random sample of points of a density function g on $\mathbb {R}^d$. This topic has received considerable attention in the statistical literature, especially since the notion of population clusters was established in Hartigan (1975) as the connected components of the set in (1). This cluster definition clearly relies on the user-specified level t, so to address this problem, an algorithm for estimating the smallest level with more than a single connected component was proposed in Steinwart (2015). For a general review on clustering, see Anderberg (1973), Everitt (1993), Cuevas and Fraiman (1997) and Rinaldo and Wasserman (2001).

The number of clusters is a basic feature of a statistical population. However, the problem of its estimation is not always taken into account in cluster analysis, where it is usually chosen by the practitioner as a first step. Since the number of clusters is equal to the number of connected components of a level set, a very natural estimator for this population parameter is the number of connected components of the level set reconstruction. This perspective, which solves the problem of selecting this unknown population parameter, is considered, for instance, in Cuevas et al. (2000), Cuevas et al. (2001) and Biau et al. (2007).

Level set estimation theory has been mainly established for a density supported on a Euclidean space such as in Eq. (1), with very few contributions in other domains. Cuevas et al. (2006) consider the estimation of level sets for general functions (not necessarily a density), providing some theoretical consistency results and showing a level set on the sphere for illustration. More recently, the reconstruction of density level sets on manifolds is studied in Cholaquidis et al. (2020). Through some simulations, the behavior of the proposed method is analyzed on the torus and on the sphere.

Unfortunately, for most applications, the specific value of the level t in (1) is unknown to the practitioner. In addition, areas of the distribution support where g is close to zero (non-effective support) are usually of limited interest for applications. If the practitioner fixes the probability content instead of the level t, a new kind of density level set emerges, known as highest density regions (HDRs) (see Box and Tiao 1973 and Hyndman 1996). The estimation of HDRs involves further complexities, given that the threshold of this particular type of level set must be determined from the prescribed probability content. Perhaps due to its practical importance, plug-in reconstruction of HDRs from the linear kernel density estimator has been widely studied, including the problem of selecting an appropriate bandwidth specifically devised for HDR reconstruction (see, for instance, Baíllo and Cuevas 2006 or Samworth and Wand 2010). However, as far as we know, the notion of HDR has not yet been introduced for directional data. Therefore, the main goals of this work are to (1) generalize the definition of HDRs to the directional setting, (2) establish a plug-in procedure for HDR reconstruction based on a new bootstrap bandwidth for a well-known directional kernel density estimator, which can be seen as the first selector specifically designed for directional HDRs, (3) check its performance through an extensive simulation study that analyses the effect of using a smoothing parameter not specifically designed for HDR estimation and (4) apply this methodology to analyze data on animal orientation and on seismology.

One may argue that the absence of a general and effective proposal for directional HDR estimation is due to a lack of practical interest, but this is far from the truth, so let us present two application examples that motivate the developments in this work. The first one concerns a problem from animal orientation studies and the second one is related to earthquake occurrences. Both datasets are available in the R package HDiR.^{Footnote 1}

1.1 Some motivating examples

Animal orientation example. Behavioral plasticity is considered by biologists as a feature of adaptation to changing beach environments. In particular, orientation is an adaptation characteristic that cannot be modified by a single factor. Nonetheless, experts found some regularities in the orientation of sandhoppers and other animals from beach environments by changing one factor at a time under otherwise controlled conditions.

For instance, the orientation of two sandhopper species (Talitrus saltator and Talorchestia brito) is analyzed in Scapini et al. (2002). Both species are shown in Fig. 1. Bottom pictures can be found in Dekker (1978). Comparing the two species through regression procedures, Scapini et al. (2002) conclude that Talitrus saltator showed more differentiated orientations, depending on the time of day, period of the year and sex, with respect to Talorchestia brito. Moreover, it seems that Talitrus saltator shows a higher flexibility (variation) of orientation than Talorchestia brito under the same environmental conditions, supporting the hypothesis that the former has a higher level of terrestrialization. As an illustration, Fig. 2 (left panel) shows the 36 orientation points (slightly jittered) corresponding to measurements of males of the species Talitrus saltator taken at noon in April. It also contains the 77 angles (slightly jittered) obtained when the measurements are taken in October (Fig. 2, right panel). Differences in the distribution on the circle of these two samples can be easily observed. Therefore, the month of the year seems to play a significant role in sandhopper behavior. In particular, two clusters for October measurements can be detected around the angle $\pi $ but they are not present for the April sample. Similar comments could be made for the situation registered around the angle $\pi /2$. HDR reconstruction (with low probability content) would allow one to determine the biggest modes of the distribution and, hence, its clusters. Therefore, HDRs can be seen as a useful alternative to analyze sandhopper orientation.

Earthquake occurrences. The European-Mediterranean Seismological Centre (EMSC)^{Footnote 2} is a non-governmental and non-profit organisation that was established in 1975 at the request of the European Seismological Commission. Since the European-Mediterranean region has suffered several destructive earthquakes, there was a need for a scientific organisation to be in charge of the determination, as quickly as possible (within one hour of the earthquake occurrence), of the characteristics of such earthquakes. These predictions are based on the seismological data received from more than 65 national seismological agencies, mostly in the Euro-Med region. Figure 3 (left) shows the geographical coordinates (red points), downloaded from the EMSC website, of a total of 272 medium and strong world earthquakes registered between 1st October 2004 and 9th April 2020. The magnitude of all these events is at least 2.5 degrees on the Richter scale. Of course, these planar points correspond to spherical coordinates on Earth. Due to the significant damage that earthquakes cause, cluster detection through HDRs could also be useful to identify, from a real dataset, where earthquakes are especially likely. This information is key for decision-making, for example, to update construction codes guaranteeing better seismic resistance of buildings. An interactive representation of the sphere can be seen in Appendix D.

1.2 Paper organization

This paper is organized as follows. Section 2 contains some background ideas on directional level set estimation, including some discussion on error measurements and some existing consistency results in the directional setting that will be useful to extend the definition of HDRs to directional data in Sect. 3. There, plug-in estimators and the corresponding confidence regions are also established. Concretely, we consider the plug-in methods based on a well-known directional kernel density estimator, which requires a smoothing parameter (bandwidth) for its practical implementation. An appropriate bootstrap bandwidth selector, the first one specifically designed for directional HDR estimation, is also introduced in this section. Section 4 presents an extensive simulation study illustrating the performance of the resulting plug-in reconstruction of the HDRs (for circular and spherical domains) with the new bandwidth selector. These results are compared with those obtained with directional smoothing parameters not specifically designed for HDR estimation. In Sect. 5, the proposed methodology is applied to analyze the two real data examples presented in the Introduction. Finally, some conclusions and ideas for further research are presented in Sect. 6. Appendix A and the supplementary material that completes this work include further information on the datasets. Appendix B specifies the parameters taken for the construction of the spherical densities in the simulation study. Appendix C contains some additional simulation results. Appendix D collects the description of the bandwidth selectors considered in the simulation study. All the methods presented in this paper, along with the real data examples, are accessible in the HDiR package.

2 Some background on directional level sets

The specific problem of reconstructing density level sets in the directional setting is reviewed in this section: the definition of directional level set is introduced jointly with a plug-in estimator. Based on the real data and simulated examples, some discussion about how to measure the estimation error and some asymptotic results are also included.

2.1 On directional level sets

Consider a random vector X taking values on a $d$-dimensional unit sphere $S^{d-1}$ with density f. Given a level $t>0$, the directional level set is defined as:

$$\begin{aligned} G_f(t)=\{x\in S^{d-1}:f(x)\ge t\}. \end{aligned}$$

(2)

The nature of different level sets is shown in Fig. 4, which represents $G_f(t)$ in grey color for three different circular densities and three different values of the level t. The threshold t is represented through a dotted grey line. Note that, if large values of t are considered (bottom row in Fig. 4), $G_f(t)$ coincides with the greatest modes. However, for small values of t, the level set $G_f(t)$ is virtually equal to the support of the distribution.

It is important to note that, following Hartigan (1975), we may also establish the concept of cluster in the directional setting as the connected components of the level set $G_f(t)$. With this view in mind, note that the density represented in the second row of Fig. 4 presents four connected components for all of the considered values of t, determining four population clusters.

Plug-in estimation is the most natural and common choice for reconstructing density level sets in the Euclidean space. A review of other existing estimation alternatives can be seen in Rodríguez-Casal and Saavedra-Nieves (2019). Plug-in methods are devised to reconstruct the level set in (1) as

$$\begin{aligned} \hat{G}_g(t)=\{x\in \mathbb {R}^d:g_n(x)\ge t\} \end{aligned}$$

where $g_n$ usually denotes the classical kernel estimator for Euclidean data (see Parzen 1962 and Rosenblatt 1956). This methodology, which has received considerable attention (see, for instance, Tsybakov 1997; Baíllo 2003; Mason and Polonik 2009; Rigollet and Vert 2009; Mammen and Polonik 2013; Polonik 2013 or Chen et al. 2017), can be easily generalized to the directional setting. Given a random sample $\mathcal {X}_n=\{X_1,\ldots ,X_n\}\subset S^{d-1}$ of the unknown directional density f, the corresponding level set $G_f(t)$ in (2) can be reconstructed as

$$\begin{aligned} \hat{G}_f(t)=\{x\in S^{d-1}:f_n(x)\ge t\} \end{aligned}$$

(3)

where $f_n$ denotes a nonparametric directional density estimator. Following the classical ideas for real-valued random variables, a kernel estimator on $S^{d-1}$ is provided in Bai et al. (1989) ($d>2$), who also proved strong pointwise consistency, uniform consistency, and $L_1$-norm consistency of the estimator (see also Hall et al. 1987 and Klemelä 2000 for further results). Following Bai et al. (1989), from a random sample $\mathcal {X}_n$ on a $d$-dimensional sphere, the directional kernel density estimator at a point $x\in S^{d-1}$ is defined as

$$\begin{aligned} f_n(x)= \frac{1}{n} \sum _{i=1}^n K_{vM}(x;X_i;1/h^2), \end{aligned}$$

(4)

where $1/h^2 > 0$ is the concentration parameter and $K_{vM}$ denotes the von Mises-Fisher kernel density (see Appendix B for explicit formulae). The von Mises kernel in Eq. (4) is not the only option, and it is particularly interesting to point out the use of a wrapped-normal kernel in the circular setting. In this case, Huckemann et al. (2016) proved that this kernel guarantees the monotonicity of the number of modes with respect to the smoothing parameter, something that also happens for the Gaussian kernel in the linear case. It may be argued that such a kernel should be used in our problem. Nevertheless, it is computationally more expensive and our practical experience shows that the results are quite similar.

Note that the kernel estimator in (4) can be viewed as a mixture of von Mises-Fisher densities. Furthermore, the concentration parameter $1/h^2$ plays a role analogous to the bandwidth in the Euclidean case. For small values of $1/h^2$, the density estimator is oversmoothed. The opposite effect is obtained as $1/h^2$ increases: with a large value of $1/h^2$, the estimator clearly undersmooths the underlying target density. Hence, the choice of h is a crucial issue. For simplicity, in what follows, we refer to h as the bandwidth parameter. Several approaches for selecting h in practice, in circular and even directional settings, have been proposed in the literature (see Appendix D). All the existing proposals aim to minimize some error criterion on the target density, but none of them is specifically designed for the reconstruction of a directional level set.
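The role of h in the estimator (4) can be illustrated on the circle, where the von Mises-Fisher kernel reduces to the von Mises density with concentration $1/h^2$. The following is a minimal sketch under that assumption; the function name `vm_kde`, the simulated sample and the two bandwidth values are illustrative choices, not part of the original work.

```python
import numpy as np

def vm_kde(theta, sample, h):
    """Circular kernel density estimate with a von Mises kernel.

    theta  : angles (radians) at which the estimate is evaluated
    sample : observed angles (radians)
    h      : bandwidth; the kernel concentration is 1 / h**2
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    sample = np.asarray(sample, dtype=float)
    kappa = 1.0 / h**2
    # von Mises density centred at each observation, evaluated at each theta
    diffs = theta[:, None] - sample[None, :]
    kernel = np.exp(kappa * np.cos(diffs)) / (2.0 * np.pi * np.i0(kappa))
    # average over the observations, as in Eq. (4)
    return kernel.mean(axis=1)

# Illustrative data: 250 angles from a von Mises density (an assumption)
rng = np.random.default_rng(0)
sample = rng.vonmises(0.0, 2.0, size=250)
grid = np.linspace(-np.pi, np.pi, 1000, endpoint=False)

dens_smooth = vm_kde(grid, sample, h=1.5)  # small concentration: oversmoothing
dens_rough = vm_kde(grid, sample, h=0.1)   # large concentration: undersmoothing
```

Both estimates integrate to one over the circle; the one computed with the small bandwidth is much more peaked. For very small h the term `np.exp(kappa * ...)` may overflow, so a numerically scaled Bessel function would be preferable in that regime.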

Figure 5 shows three plug-in estimators $\hat{G}_f(t)$ for models (black colour) and levels $t_1$, $t_2$ and $t_3$ (dotted grey line) considered in Fig. 4. Kernel density estimators (grey color) in (4) have been determined from samples of size 250 considering the proposal in Oliveira et al. (2012) as bandwidth parameter.

For the sandhoppers example, Fig. 6 shows the plug-in estimators obtained for the two samples of sandhoppers represented in Fig. 2. It is possible to detect the largest modes of the two sample distributions corresponding to the April and October samples. These results allow us to confirm the differences between the two populations. The largest cluster of April orientations is located around the angle $7\pi /4$. However, the pattern observed for the October records is completely different. Although only one cluster is identified around the angle $3\pi /2$, if the level t decreases slightly, two additional groups can be detected around the angles $3\pi /4$ and $5\pi /4$, respectively.

Regarding the earthquakes illustration, Fig. 3 (right) shows the plug-in contour in blue obtained from the selected sample of world earthquakes. Choosing a convenient value of the level t, the greatest mode of the sample distribution is identified in the Southeast of Europe. Countries such as Italy, Greece or Turkey (located within this cluster) are the most affected areas in the recent past.

2.2 Error measures and consistency results on directional level sets

The level set $\hat{G}_f(t_3)$ represented in Fig. 5 (third column) presents two connected components. However, Fig. 4 (right plot in the bottom row) shows that the theoretical level set $G_f(t_3)$ has exactly three components. Therefore, the estimation error is considerably large. Distances between sets are the common criteria considered in set estimation to measure the discrepancies between the theoretical region to be estimated and the corresponding reconstruction. Of course, this is also applicable when the goal is to estimate level sets or HDRs.

The distance in measure $d_{\mu }$ between two Borel sets A and B in $\mathbb {R}^d$ is defined as

$$\begin{aligned} d_\mu (A,B)=\mu (A\triangle B) \end{aligned}$$

(5)

where $\mu $ denotes the Lebesgue measure and $A\triangle B$ is the symmetric difference of A and B, calculated as $(A\cap B^c)\cup (A^c\cap B)$ with $A^c$ representing the complement of A. Consistency results for directional plug-in estimators have already been obtained in the literature for this distance. For the estimator established in (3) defined on $S^2$, Cuevas et al. (2006) and Cholaquidis et al. (2020) show that $\lim _{n\rightarrow \infty }d_\mu ( G_f(t), \hat{G}_f(t))=0, \text{ a.s. },$ and $\lim _{n\rightarrow \infty }d_\mu (G_f(t),\hat{G}_f^{'}(t))=0, \text{ a.s. }$, where $\hat{G}_f^{'}(t)=\{x\in S^{d-1}:f_n^{'}(x)\ge t\} $ and $f_n^{'}$ denotes the kernel estimator (for manifolds with boundary) proposed in Berry and Sauer (2017). From the definition in (5), it is easy to check that the distance in measure $d_\mu $ does not penalize those level set estimators that have an isolated point as a connected component or any other set with null Lebesgue measure. Additionally, the undersmoothing caused by the choice of a small bandwidth value may cause the estimator $\hat{G}_f(t)$ to present non-significant connected components with small Lebesgue measure. In this case, $d_{\mu }$ would not be as effective as, for instance, the Hausdorff distance in detecting this situation.

Let us recall that, if A and B are now non-empty compact sets in $\mathbb {R}^d$, the Hausdorff distance between A and B is established as follows

$$\begin{aligned} d_H(A,B)=\max \left\{ \sup _{x\in A}\rho \left( \{x\},B\right) ,\sup _{y\in B}\rho \left( \{y\},A\right) \right\} \end{aligned}$$

(6)

where $\rho (\{x\},B)=\inf _{y\in B}\{\rho (x,y)\}$, with $\rho (x,y)$ denoting the distance between two points. Note that the definition of the Hausdorff distance is very general and, depending on the selection of the distance $\rho $, different error criteria emerge. Usually, $\rho $ corresponds to the chordal distance (Euclidean distance in $\mathbb {R}^d$, $\rho _1$).
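The contrast between the two error criteria in (5) and (6) can be checked numerically on the circle. The sketch below is only an illustration under simplifying assumptions: sets are discretized on a regular grid of angles, arc length plays the role of the measure $\mu$, and $\rho$ is the chordal distance $\rho_1$. The sets A and B (an arc, and the same arc plus a tiny spurious component) and all function names are hypothetical.

```python
import numpy as np

def chordal(a, b):
    """Pairwise chordal distances between two arrays of angles on the unit circle."""
    return 2.0 * np.abs(np.sin((a[:, None] - b[None, :]) / 2.0))

def hausdorff(a, b):
    """Hausdorff distance (6) between two non-empty finite sets of angles."""
    d = chordal(np.asarray(a, dtype=float), np.asarray(b, dtype=float))
    # largest distance from a point of one set to the other set, both ways
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def measure_distance(in_a, in_b, cell):
    """Arc length of the symmetric difference of two sets given as grid masks."""
    return np.count_nonzero(in_a ^ in_b) * cell

m = 3600
grid = np.linspace(0.0, 2.0 * np.pi, m, endpoint=False)
cell = 2.0 * np.pi / m

in_a = grid <= 1.0                              # A: the arc [0, 1]
in_b = in_a | (np.abs(grid - np.pi) <= 0.01)    # B: A plus a tiny arc near pi

d_mu = measure_distance(in_a, in_b, cell)       # small: the extra arc is short
d_h = hausdorff(grid[in_a], grid[in_b])         # large: the extra arc is far from A
```

The spurious component barely changes the distance in measure, whereas the Hausdorff distance is of the order of the distance between that component and A.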

Remark 1

Other natural choices such as the geodesic distance (great circle, $\rho _2$) could be considered in Eq. (6). The Hopf-Rinow theorem states that $\rho _1$ and $\rho _2$ induce the same topology on $S^{d-1}$. Figure 1 in Jeong et al. (2017) illustrates that $\rho _1(x,y)\le \rho _2(x,y)$ for any pair of points x, y in the unit circle. Following Lemma 3 in Boissonnat et al. (2019), a general upper bound for $\rho _2(x,y)$ depending on $\rho _1(x,y)$ can be established for all $x,y\in S^{d-1}$. Specifically, it is possible to prove that $\rho _2(x,y)\le \arcsin (\rho _1(x,y))$ for all $x,y\in S^{d-1}$ when the constant r is equal to 1/2.
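As a complementary worked identity (standard geometry of the unit sphere, not taken from the references above), note that for unit vectors $\Vert x-y\Vert ^2=2-2x^{\top }y$, so the two distances are linked explicitly by

$$\begin{aligned} \rho _1(x,y)=\Vert x-y\Vert =2\sin \left( \frac{\rho _2(x,y)}{2}\right) ,\qquad \rho _2(x,y)=\arccos (x^{\top }y)\in [0,\pi ]. \end{aligned}$$

Since $2u/\pi \le \sin u\le u$ for $u\in [0,\pi /2]$, this gives

$$\begin{aligned} \rho _1(x,y)\le \rho _2(x,y)\le \frac{\pi }{2}\,\rho _1(x,y), \end{aligned}$$

so each distance is an increasing function of the other and convergence to zero in one of them is equivalent to convergence to zero in the other.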

The metric $d_H$ is not completely successful in detecting differences in shape properties. In other words, two sets can be very close in Hausdorff distance and still show quite different shapes. This typically happens when the boundaries $\partial A$ and $\partial B$ are far apart, no matter the proximity of A and B. So a natural way to reinforce the notion of visual proximity between two sets provided by the Hausdorff distance is to account also for the proximity of the respective boundaries. This error criterion has also been considered for establishing consistency results of several directional plug-in reconstructions. Cuevas et al. (2006) prove that $\lim _{n\rightarrow \infty }d_H(\partial G_f(t),\partial \hat{G}_f(t))=0, \text{ a.s. },$ when the Hausdorff distance is defined from $\rho _1$. If the Hausdorff distance involves $\rho _2$, Cholaquidis et al. (2020) prove that $\lim _{n\rightarrow \infty }d_H(\partial G_f(t),\partial \hat{G}_f^{'}(t))=0$ and $\lim _{n\rightarrow \infty }d_H(G_f(t),\hat{G}_f^{'}(t))=0, \text{ a.s. }$. The existing monotone relationship between chordal and geodesic distances guarantees the consistency of the plug-in estimator in Cholaquidis et al. (2020) also when the Hausdorff distance depends on $\rho _1$ instead of $\rho _2$. Therefore, if the target is the reconstruction of a set, the Hausdorff metric (defined from the chordal distance) can be seen as a suitable error criterion in the directional setting.

3 HDRs in the directional setting

As noted in the Introduction, the level t is usually unknown in (1) and, for practical purposes, the practitioner chooses the probability content of the set instead of the level t. This particular class of level sets, widely considered for Euclidean data, are the so-called HDRs (see Box and Tiao 1973; Hyndman 1996 or Samworth and Wand 2010). However, as far as we know, HDRs have not yet been defined in the directional context. The need for a proper extension of this notion and for adequate estimation tools can be easily justified. Figure 7 (top) shows four different 50% circular regions (regions containing 50% of the probability, empirically approximated) for the kernel density estimator $f_n$ represented in grey. Although all of them have probability content equal to 50%, they exhibit completely different shapes. Therefore, there exists an infinite number of ways to choose a region with a given coverage probability and, in a general scenario, it may not be clear which region must be chosen. The same happens for real-valued random variables, and Hyndman (1996) suggests that HDRs are the best subset to summarize a probability distribution.

The usual purpose in summarizing a probability distribution by a region of the sample space is to delineate a comparatively small set which contains most of the probability, although the density may be nonzero over infinite regions of the sample space. Therefore, as in the Euclidean case, it is necessary to decide what properties the region has to verify. The following conditions are natural:

(C1)
The region should occupy the smallest possible volume in the sample space.
(C2)
Every point inside the region should have probability density at least as large as every point outside the region.

Following Box and Tiao (1973), conditions (C1) and (C2) are equivalent and lead to regions called HDRs. Definition 1 formalizes this concept in the directional context taking into account the second criterion.

Definition 1

Let f be a directional density function on $S^{d-1}$ of a random vector X. Given $\tau \in (0,1)$, the $100(1 - \tau )$% HDR is the subset

$$\begin{aligned} L(f_\tau )=\{x\in S^{d-1}:f(x)\ge f_\tau \} \end{aligned}$$

(7)

where $f_\tau $ can be seen as the largest constant such that

$$\begin{aligned} \mathbb {P}(X\in L(f_\tau ))\ge 1-\tau \end{aligned}$$

(8)

with respect to the distribution induced by f.

According to Polonik (1997) and García et al. (2003) in the Euclidean context, $L(f_\tau )$ is the minimum volume level set with probability content at least $(1-\tau )$. Figure 4 shows the HDR $L(f_\tau )$ in grey for three different circular densities and three different values of $\tau $. The threshold $f_\tau $ is represented through a dotted grey line. Note that, if large values of $\tau $ are considered, $L(f_\tau )$ is equal to the greatest modes and, therefore, the most differentiated clusters can be easily identified. However, for small values of $\tau $, $L(f_\tau )$ is almost equal to the support of the distribution.

3.1 Plug-in estimation of directional HDRs

The first step to reconstruct the HDR in Definition 1 for a given $\tau \in (0,1)$ is to estimate the threshold $f_\tau $. As in the Euclidean case, numerical integration methods could be also used in the directional setting in order to approximate its value. However, when the dimension increases, the computational cost becomes a major issue due to the complexity of the numerical integration algorithms considered on high dimensional spaces. An alternative approach for estimating $f_{\tau }$ with a feasible computational cost is described next.

As before, let X be a random vector with directional density f and let $Y = f(X)$ be the random variable obtained by transforming X by its own density function. Since $\mathbb {P}(f(X)\ge f_\tau )=1-\tau $, $f_{\tau }$ is exactly the $\tau $-quantile of Y. Following Hyndman (1996), $f_{\tau }$ can be estimated as a sample quantile from a set of independent and identically distributed random variables with the same distribution as Y.

In particular, if $\mathcal X_n=\{X_1,\ldots ,X_{n}\}$ denotes a set of independent observations in $S^{d-1}$ from a density f, $\{f(X_1), \ldots , f(X_n)\}$ is a set of independent observations from the distribution of Y. Let $f_{(j)}$ be the $j$-th smallest value of $\{f(X_i)\}_{i=1}^n$ (the $j$-th order statistic in increasing order), so that $f_{(j)}$ is the (j/n) sample quantile of Y. We shall use $f_{(j)}$ as an estimate of $f_\tau $. Specifically, we choose $\hat{f}_\tau = f_{(j)}$ where $j = \lfloor \tau n\rfloor $. Cadre et al. (2009) study the convergence of $\hat{f}_\tau $ to $f_\tau $ in the linear setting.
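The density quantile approach just described can be sketched for a known circular density. The example below assumes that $f_{(j)}$ is read as the $j$-th order statistic of the values $f(X_i)$ sorted increasingly (which is what makes it the (j/n) sample quantile), uses a von Mises density as a stand-in for f, and checks by numerical integration that the resulting region has probability content close to $1-\tau$. Function names and parameter values are illustrative.

```python
import numpy as np

def vm_density(theta, mu, kappa):
    """von Mises density on the circle (used here as a known density f)."""
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * np.i0(kappa))

def hdr_threshold(density_values, tau):
    """Order statistic f_(j), j = floor(tau * n), of the values sorted increasingly."""
    values = np.sort(np.asarray(density_values, dtype=float))
    j = int(np.floor(tau * values.size))
    j = min(max(j, 1), values.size)   # keep the index within 1..n
    return values[j - 1]

rng = np.random.default_rng(1)
tau = 0.5
x = rng.vonmises(0.0, 2.0, size=20000)            # sample from f
f_tau_hat = hdr_threshold(vm_density(x, 0.0, 2.0), tau)

# Probability content of {f >= f_tau_hat}, approximated on a regular grid
grid = np.linspace(-np.pi, np.pi, 20000, endpoint=False)
dens = vm_density(grid, 0.0, 2.0)
coverage = dens[dens >= f_tau_hat].sum() * (2.0 * np.pi / grid.size)
```

With this sample size, `coverage` should be close to $1-\tau = 0.5$, up to sampling error of the quantile estimate.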

Obviously, if f is a known function, the observations can be pseudorandomly generated and the estimation of $f_\tau $ can be made arbitrarily accurate by increasing n. In practice, f is often unknown and the only information available is a random sample of points $\mathcal {X}_n$ from f. From this sample, we propose first to determine the kernel estimator $f_n$ in (4). If n is large enough, the set $\{f_n(X_1),\ldots ,f_n(X_n)\}$ is then calculated in order to estimate $f_\tau $ empirically. If n is moderate, it may be preferable to generate observations $\mathcal X_N=\{X_1,\ldots ,X_N\}$ of large size N from $f_n$. For small values of n it may not be possible to get a reasonable density estimate. Besides, with few observations and no prior knowledge of the underlying density, there seems little point in attempting to summarize the sample space (see Wand and Jones 1995 for some discussion on the number of observations needed for a reasonable linear density estimate). Note that the problem here is not with the density quantile algorithm (which gives results to an arbitrary degree of accuracy given a density), but with estimating the density from insufficient data.

Once the threshold $f_\tau $ is estimated, plug-in methods reconstruct the $100(1 - \tau )$% HDR namely $L(f_\tau )$ in (7) as

$$\begin{aligned} \hat{L}(\hat{f}_\tau )=\{x\in S^{d-1}:f_n(x)\ge \hat{f}_\tau \}. \end{aligned}$$

(9)
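A minimal sketch of the plug-in reconstruction (9) on the circle follows. It assumes a von Mises kernel, a fixed illustrative bandwidth (no data-driven selector is used), a bimodal simulated sample and a grid discretization of the circle; all names are hypothetical.

```python
import numpy as np

def vm_kde(theta, sample, h):
    """von Mises kernel density estimate with concentration 1 / h**2."""
    kappa = 1.0 / h**2
    diffs = np.asarray(theta, dtype=float)[:, None] - np.asarray(sample, dtype=float)[None, :]
    return (np.exp(kappa * np.cos(diffs)) / (2.0 * np.pi * np.i0(kappa))).mean(axis=1)

def plugin_hdr(sample, h, tau, grid):
    """Estimated threshold and boolean mask of the plug-in HDR on `grid`."""
    # density quantile step: order statistic of f_n(X_i), sorted increasingly
    values = np.sort(vm_kde(sample, sample, h))
    j = min(max(int(np.floor(tau * values.size)), 1), values.size)
    threshold = values[j - 1]
    # plug-in step: grid points where the estimate reaches the threshold
    return threshold, vm_kde(grid, sample, h) >= threshold

def count_arcs(mask):
    """Number of connected components (arcs) of a set given by a circular mask."""
    starts = np.count_nonzero(mask & ~np.roll(mask, 1))
    return 1 if (starts == 0 and mask.all()) else starts

rng = np.random.default_rng(2)
sample = np.concatenate([rng.vonmises(-2.0, 8.0, 125), rng.vonmises(1.0, 8.0, 125)])
grid = np.linspace(-np.pi, np.pi, 2000, endpoint=False)
tau = 0.5

threshold, mask = plugin_hdr(sample, h=0.3, tau=tau, grid=grid)
inside_fraction = np.mean(vm_kde(sample, sample, 0.3) >= threshold)
n_arcs = count_arcs(mask)
```

By construction, the fraction of sample points falling in the estimated region is approximately $1-\tau$, and the number of arcs in `mask` gives the number of clusters suggested by the reconstruction.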

Figure 7 shows the circular kernel estimator $f_n$ (grey color) calculated from a sample $\mathcal {X}_{250}$ generated from the second model (black color) in Fig. 4 and different empirically approximated 50% circular regions (grey color, top). The boxplot of the transformed values denoted by $\{f_n(X_1),\ldots ,f_n(X_{250})\}$ is also shown (bottom). The dotted lines represent the quantiles that determine the corresponding 50% (probability coverage) circular region. Note that only the estimated HDR (left), $\hat{L}(\hat{f}_\tau )$, is able to show the existence of the five existing modes.

Apart from the consistency of $\hat{f}_\tau $, Cadre et al. (2009) establish the exact convergence rate (considering the distance in measure $d_\mu $ as error criterion) for Euclidean HDRs. The extension of these results to the directional setting does not seem straightforward. However, if the consistency of $\hat{f}_{\tau }$ remains true, we could prove that $\hat{L}(\hat{f}_\tau )$ is also a $d_H$-consistent estimator of $L(f_\tau )$ in $S^2$ under the assumptions of Corollary 1 and condition (T) in Cuevas et al. (2006). To complete the proof, it is only necessary to apply the triangle inequality to $d_H(L(f_\tau ),\hat{L}(\hat{f}_\tau ))$.

3.1.1 Confidence regions for estimated HDRs

The density quantile algorithm detailed above for approximating the threshold $f_{\tau }$ involves an empirical approximation. Then, it is convenient to compute some uncertainty limits on the estimated regions.

For the simplest case of X being a circular random variable (following Hyndman 1996), standard asymptotic results for a sample quantile in Cox and Hinkley (1979) allow one to prove that $\hat{f}_{\tau }$ is asymptotically normally distributed with mean $f_{\tau }$ and variance $\tau (1 - \tau )/(n[F(f_\tau )]^2)$ where

$$\begin{aligned} F(y)=y\sum _{i=1}^{n(y)}|f^{'}(z_i)|^{-1} \end{aligned}$$

and $\{z_i\}$ denote those points in the sample space of X such that $f(z_i) = y$, $i = 1, 2,\ldots ,n(y)$.
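As a direct consequence of this asymptotic normality (a worked step, with $z_{1-\alpha /2}$ denoting the standard normal quantile of order $1-\alpha /2$, a symbol introduced here for illustration), approximate $100(1-\alpha )$% limits for the threshold take the form

$$\begin{aligned} \hat{f}_\tau \pm z_{1-\alpha /2}\,\frac{\sqrt{\tau (1-\tau )}}{\sqrt{n}\,F(f_\tau )}. \end{aligned}$$

In practice $F(f_\tau )$ is unknown and would have to be replaced by an estimate, for instance by evaluating F with $f_n$ and $\hat{f}_\tau $ in place of f and $f_\tau $; this substitution is an assumption of this sketch.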

Alternatively, a bootstrap algorithm can be easily designed to compute confidence regions for estimated HDRs. The procedure is detailed in Algorithm 1.

As an illustration, Fig. 8 shows the estimated confidence regions using the asymptotic approach (first row, in dark red color) and the bootstrap procedure (second row, in purple color) for three different values of $\tau $ when $\alpha =0.05$ and $B=250$. Cross validation bandwidths introduced in Hall et al. (1987) were used as smoothing parameters for circular density estimation in both approaches.
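Algorithm 1 is not reproduced in this section, so the following is only a generic percentile-bootstrap sketch for the threshold on the circle, not necessarily the authors' procedure. It assumes a smoothed bootstrap (resamples drawn from the von Mises kernel estimator $f_n$), a fixed illustrative bandwidth, and the values $\alpha =0.05$ and $B=250$ used in Fig. 8; all function names are hypothetical.

```python
import numpy as np

def vm_kde(theta, sample, h):
    """von Mises kernel density estimate with concentration 1 / h**2."""
    kappa = 1.0 / h**2
    diffs = np.asarray(theta, dtype=float)[:, None] - np.asarray(sample, dtype=float)[None, :]
    return (np.exp(kappa * np.cos(diffs)) / (2.0 * np.pi * np.i0(kappa))).mean(axis=1)

def hdr_threshold(sample, h, tau):
    """Density quantile estimate of the threshold from the kernel estimator."""
    values = np.sort(vm_kde(sample, sample, h))
    j = min(max(int(np.floor(tau * values.size)), 1), values.size)
    return values[j - 1]

def bootstrap_threshold_interval(sample, h, tau, alpha=0.05, n_boot=250, seed=0):
    """Percentile bootstrap limits for the threshold (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    kappa = 1.0 / h**2
    boot = np.empty(n_boot)
    for b in range(n_boot):
        # draw from f_n: pick an observation, then add von Mises kernel noise
        centres = rng.choice(sample, size=sample.size, replace=True)
        resample = rng.vonmises(centres, kappa)
        boot[b] = hdr_threshold(resample, h, tau)
    lower, upper = np.quantile(boot, [alpha / 2.0, 1.0 - alpha / 2.0])
    return lower, upper

rng = np.random.default_rng(4)
sample = rng.vonmises(0.0, 2.0, size=200)
lower, upper = bootstrap_threshold_interval(sample, h=0.4, tau=0.5)
```

Under these assumptions, the nested regions $\{x:f_n(x)\ge \texttt{upper}\}$ and $\{x:f_n(x)\ge \texttt{lower}\}$ give inner and outer limits for the estimated HDR. Since the smoothed bootstrap adds extra smoothing, the limits need not be centred at $\hat{f}_\tau $.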

3.2 A suitable bootstrap bandwidth selector

The plug-in reconstruction of the directional HDRs in (9) involves the calculation of the kernel density estimator in (4), which is known to be heavily dependent on the selection of h. The existing methods for selecting an optimal value of h aim to minimize some error criterion on the target density f, but they are not specifically designed for the estimation of HDRs. The goal of this section is to propose the first selector of h specifically designed for HDR reconstruction.

A bootstrap bandwidth selector focused on the problem of reconstructing HDRs is introduced in what follows. The idea is to use an error criterion that quantifies the differences between the theoretical region and its plug-in reconstruction. In the real-valued setting, Samworth and Wand (2010) propose one of the first bandwidth selectors for HDR estimation, studying a relatively uncommon distance (depending on both $\mu $ and g) between these sets. In this work, we consider the classical Hausdorff distance (introduced in Sect. 2.2) between the boundaries of the HDR and the corresponding estimator.

In the directional case, a closed-form expression for $d_H(\partial L(f_\tau ),\partial \hat{L}(\hat{f}_\tau ))$ is not known. However, it can be estimated through a bootstrap procedure. Therefore, a new bandwidth selector can be established as

$$\begin{aligned} h_{1}=\arg \min _{h>0}\mathbb {E}_B\left[ d_H(\partial L^{*}(\hat{f}_\tau ^*),\partial \hat{L}(\hat{f}_\tau ))\right] \end{aligned}$$

(10)

where $\mathbb {E}_B$ denotes the bootstrap expectation with respect to random samples $\mathcal X_n^*=\{X_1^*,\ldots ,X_n^*\}$ generated from the directional kernel estimator $f_n$, which depends on a pilot bandwidth and also on the choice of the distance $\rho $ in Eq. (6).
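The minimization in (10) can be approximated over a finite grid of candidate bandwidths. The sketch below is a simplified circular illustration under several assumptions that are not taken from the paper: a von Mises kernel indexed by a concentration `kappa` (larger values mean less smoothing, the opposite direction to $h$), a reference boundary computed from the original sample with the pilot concentration, boundaries located on a regular grid of angles, and the chordal distance between angles. The kernel normalizing constant is dropped because the set $\{\hat{f}\ge \hat{f}_\tau \}$ does not change when $\hat{f}$ is rescaled. All names are hypothetical and this is not the HDiR implementation.

```python
import math
import random

def kde_unnorm(theta, sample, kappa):
    """von Mises kernel estimate up to a constant factor (numerically stable form)."""
    return sum(math.exp(kappa * (math.cos(theta - x) - 1.0)) for x in sample) / len(sample)

def hdr_boundary(sample, kappa, tau, grid):
    """Grid angles where membership in the plug-in HDR changes (cyclic grid)."""
    values = sorted(kde_unnorm(x, sample, kappa) for x in sample)
    t = values[max(0, math.ceil(tau * len(values)) - 1)]  # tau-sample quantile
    inside = [kde_unnorm(g, sample, kappa) >= t for g in grid]
    return [grid[i] for i in range(len(grid)) if inside[i] != inside[i - 1]]

def chordal(a, b):
    """Euclidean (chordal) distance between two points on the unit circle."""
    return 2.0 * abs(math.sin((a - b) / 2.0))

def hausdorff(A, B):
    """Hausdorff distance between two finite sets of angles."""
    if not A or not B:
        # Convention assumed here: maximal chord if only one boundary is empty.
        return 0.0 if (not A and not B) else 2.0
    d_ab = max(min(chordal(a, b) for b in B) for a in A)
    d_ba = max(min(chordal(a, b) for a in A) for b in B)
    return max(d_ab, d_ba)

def smoothed_resample(sample, pilot_kappa, rng):
    """Draw n points from the pilot kernel estimate: random datum plus von Mises noise."""
    return [rng.vonmisesvariate(rng.choice(sample), pilot_kappa) for _ in sample]

def select_bandwidth(sample, kappas, pilot_kappa, tau, B=8, grid_size=180, seed=0):
    """Grid search for the concentration minimizing the mean bootstrap Hausdorff error."""
    rng = random.Random(seed)
    grid = [2.0 * math.pi * i / grid_size for i in range(grid_size)]
    reference = hdr_boundary(sample, pilot_kappa, tau, grid)
    # The same resamples are reused for every candidate (common random numbers).
    resamples = [smoothed_resample(sample, pilot_kappa, rng) for _ in range(B)]
    risk = {}
    for kappa in kappas:
        errors = [hausdorff(hdr_boundary(s, kappa, tau, grid), reference) for s in resamples]
        risk[kappa] = sum(errors) / B
    return min(risk, key=risk.get), risk

rng = random.Random(1)
data = [rng.vonmisesvariate(math.pi, 3.0) for _ in range(80)]
best, risk = select_bandwidth(data, kappas=[2.0, 5.0, 10.0, 20.0], pilot_kappa=5.0, tau=0.5)
print("selected concentration:", best)
```

Reusing the same bootstrap resamples for every candidate keeps the comparison between candidates from being dominated by resampling noise; the small values of `B` and the sample size are chosen only to keep the example fast.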

Figure 12 shows the theoretical HDR for model S3 (see Sect. 4.2) when $\tau =0.5$ (first and second columns). The plug-in estimator $\hat{L}(\hat{f}_\tau )$ for $\tau =0.5$, obtained from a sample of size $n=1000$ with the bandwidth proposed in García-Portugués (2013), is also represented (third column). Note that, for this sample size, only the largest mode is detected. In this particular case, the Hausdorff error is smaller if the HDR is reconstructed from a cross-validation bandwidth designed for density estimation (fourth and fifth columns). A relevant issue appears when $h_1$ is estimated from imprecise HDR estimators. Recall that the minimization procedure considered for determining $h_1$ involves the boundary of the set $\hat{L}(\hat{f}_\tau )$. If this set is poorly approximated, the resulting bandwidth will not provide competitive results. Therefore, larger sample sizes will be considered in this section to avoid this problem.

Another point worth mentioning is that different bandwidth selectors emerge from the different choices of $\rho $ in the definition of the Hausdorff distance. In fact, other bandwidths could be defined if, for example, $d_H$ in (10) is replaced by a completely different error criterion such as the distance in measure $d_\mu $ that, unlike the Hausdorff distance, does not take into account connected components of a set that consist of a single isolated point. Therefore, we could propose as many bandwidths as there are distances between sets, attending to the specific properties and characteristics of each distance.
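The contrast between the two criteria can be seen in a small constructed example (not taken from the paper). Assume that $\mu $ is the arc-length measure on the circle, that $d_\mu (A,B)=\mu (A\triangle B)$, and that $d_H$ is computed with the chordal distance. Take the arc $A=\{\theta \in [0,1]\}$ and the set $B=A\cup \{\pi \}$, which adds one isolated point. The nearest point of $A$ to the angle $\pi $ is the angle 1, so

$$\begin{aligned} d_\mu (A,B)=\mu (\{\pi \})=0,\qquad d_H(A,B)=2\sin \left( \frac{\pi -1}{2}\right) =2\cos (1/2)\approx 1.755. \end{aligned}$$

The isolated point is invisible to $d_\mu $ but produces a large Hausdorff error, so selectors built on each distance would penalize spurious components very differently.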

4 Simulation study

The performance of the proposed bandwidth selector is explored in this section. As mentioned, there exist other bandwidth selectors for directional kernel density estimation (see Appendix D), although they are not specifically designed for HDR reconstruction. We will also check the impact of considering some of these selectors in the HDR plug-in estimation. Specifically, the selector $h_1$ established in (10) was implemented considering the chordal distance $\rho _1$ that, as shown in Sect. 2.2, guarantees good asymptotic properties of directional level sets. The code for computing it is available in the R library HDiR. All the other bandwidths are implemented in the R packages NPCirc^{Footnote 3} and Directional^{Footnote 4}. Sections 4.1 and 4.2 contain the results obtained in the circular and spherical settings, respectively. Some additional simulation results are also contained in Appendix C.

4.1 Estimation of circular HDRs

A collection of 9 circular densities (models C1 to C9) has been considered in this simulation study. These models are mixtures of different circular distributions and they correspond to densities 5, 6, 7, 8, 10, 11, 16, 19 and 20 fully described in Oliveira et al. (2014). Figure 9 shows these densities and the thresholds $f_\tau $ for $\tau =0.2$, $\tau =0.5$ and $\tau =0.8$ through dotted circles.

A total of 250 random samples of sizes $n=500$ and $n=1000$ were generated for each of these models. From each sample, circular HDRs are reconstructed for $\tau =0.2$, $\tau =0.5$ and $\tau =0.8$. Results for $\tau =0.2$ and $\tau =0.8$ are presented in Appendix C.1. The behavior of the plug-in methods that emerge from the consideration of different bandwidth parameters will be checked. Apart from $h_1$, we will consider the circular rule-of-thumb by Taylor (2008) ($h_2$); its improved version by Oliveira et al. (2012) ($h_3$); the cross-validation methods (likelihood $h_4$ and least squares $h_5$) introduced by Hall et al. (1987); and the bootstrap bandwidth ($h_6$) presented by Di Marzio et al. (2011). Note that a pilot bandwidth is required for computing $h_1$. In this study, $h_3$ has been taken as the pilot, and $B=200$ resamples are considered for obtaining $h_1$.

Table 1 Means (M) and standard deviations (SD) of 250 errors in Hausdorff distance for $\tau =0.5$, $n=500$ and $B=200$
