Circular local likelihood

We introduce a class of local likelihood circular density estimators, which includes the kernel density estimator as a special case. The idea lies in optimizing a spatially weighted version of the log-likelihood function, where the logarithm of the density is locally approximated by a periodic polynomial. The use of von Mises density functions as weights reduces the computational burden. Also, we propose closed-form estimators which could form the basis of counterparts in the multidimensional Euclidean setting. Simulation results and a real data case study are used to evaluate the performance and illustrate the results.


Introduction
A circular observation can be represented by a point on the unit circle and measured by an angle θ ∈ [−π, π), after both an origin and an orientation have been chosen. Its real-line representation is provided by the equivalence class {2mπ + θ, m ∈ Z}, and therefore standard linear methods are not suitable for circular data analysis.
Classic examples include flight direction of birds, wind and ocean current direction. Time of day, or time of year are also obvious candidates for directional modelling. When, along with a direction, we report also the time of the day when it has been recorded, we are collecting two-dimensional circular data. In zoology many multidimensional instances arise. For example, Fisher (1993) considers the orientations of the nests of noisy scrub birds along the bank of a creek bed, together with the corresponding directions of creek flow at the nearest point to the nest: here the joint behaviour of these random variables is of interest. Multidimensional circular data are also commonly found in the analysis of protein structure (Lovell et al. 2003). In political science, Gill and Hangartner (2010) study directional party preferences in a two-dimensional ideological space for the German Bundestag elections.
Maximum likelihood estimation is a common approach in many statistical problems, although it requires an assumption that the unknown target belongs to a restricted class of functions. To obtain more general models, Tibshirani and Hastie (1987) introduced the concept of local likelihood. They proposed to fit a regression function using only the observations falling within a certain window around the estimation point. In the context of density estimation, local likelihood requires spatially weighting the log-densities. Depending on the smoothing degree, the methodology can be viewed, in practice, as basically parametric or nonparametric.
The log-densities can be modelled in various ways corresponding to various techniques. Hjort and Jones (1996) have established a general framework, where a parametric family is locally modelled, by allowing its parameters change along the support. Loader (1996a) focused on the use of log-polynomials. Eguchi and Copas (1998) proposed an alternative construction and focus on properties related to asymptotics when the smoothing degree is fixed. Delicado (2006) proposed a unified formulation of these local likelihood approaches based on the concept of sample truncation.
In the present paper, we discuss local polynomial likelihood in order to introduce small-biased circular density estimation. The current literature on nonparametric circular density estimation is substantially limited to the classical kernel estimator introduced by Hall et al. (1987), and contributions introducing more sophisticated nonparametric methods, able to arbitrarily reduce the asymptotic bias without asymptotic variance inflation, have, until now, never been systematically studied. Our proposed methods make it possible to employ a priori knowledge about the smoothness of the target density. It seems that in the current literature there is no specific focus on local methods which allow for this. A strong, technical motivation could be that in the circular setting there is no exact counterpart for the concept of kernel order, as in the Euclidean setting. On the other hand, the absence of boundaries in the support of directional distributions (circle, sphere or torus) could make nonparametric estimation less challenging. Therefore, the serious problem of boundary bias is unknown for non-Euclidean data. However, small bias estimation in nonparametric circular regression has been introduced by Di Marzio et al. (2009Marzio et al. ( , 2013. Recently Di Marzio et al. (2016) have presented a computational study where density estimation based on local polynomial likelihood is investigated for two practical issues not treated in this paper: the impact of the density normalization step when the sample size is moderate; and the effectiveness in identifying the number of population modes.
In Sect. 2 we present the model together with some major features of the estimators, while Sect. 3 is devoted to asymptotic accuracy. In Sect. 4 we show how some numerical aspects can be greatly simplified if the d-fold product (d ≥ 1) of von Mises densities is used as the weight function. After some asymptotic approximations and interpretations, two new estimators are proposed, which could inspire similar counterparts in the multidimensional Euclidean setting. Section 5 is devoted to numerical experiments where the main theoretical properties are confirmed in small to moderate sample sizes. Finally, Sect. 6 contains a real data example related to the three-dimensional structure of proteins.
Once the domain of f has been partitioned into S equal cells, say C 1 , . . . , C S , let n s and P s denote the cell counts and probabilities, respectively. Due to the mean value theorem we can write P s = f (θ s )(2π) d /S for some θ s ∈ C s . The likelihood is c S s=1 P n s s subject to S s=1 P s = 1, where c is a multinomial coefficient. Using Lagrange multipliers, this leads to a penalized log-likelihood Assuming that the number of cells is sufficiently large so that not more that one observation falls in each one, the sum can be taken over the n observations, n s = 1, and P s can be replaced by P i , the probability for the cell containing the ith observation.
For β ∈ [−π, π) d with jth entry denoted by β ( j) , we define the kernel function where K κ j is a circular kernel with zero mean direction and concentration parameter κ j ≥ 0; see Definition 1 given by Di Marzio et al. (2011). The weight function K κ j is usually chosen to be a continuous density function whose support is the circle with the property that as κ j → ∞ the density tends to concentrate at the mode. If we spatially weight each summand of L( f ) by K κ 1 ,...,κ d , and approximate the second sum with an integral, ignoring the constant term, we obtain the following definition of a local log-likelihood at θ ∈ [−π, π) d , The number of observations contributing to the estimate in the jth direction is related to the magnitude of the concentration κ j .
The main motivation for defining an estimator of f (θ ) as the maximizer of (g)] for all non-negative functions g, with equality holding when f (u) = g(u), for any u ∈ [−π, π) d . This is shown by noting that x ≥ log x + 1 if x > 0. The condition x > 0 extends our methodology also to non-negative regression function estimation.
As a model for log f consider a (2π -periodic) pth degree sin-polynomial (Di Marzio et al. 2009) and S ⊗s λ denoting the sth order Kronecker power of S λ . We call P p a sin-polynomial because the functions sin s are reminiscent of the monomial bases for ordinary polynomials. Likewise, we associate the terms linear and quadratic, respectively, to P 1 and P 2 . We use sin functions since the absolute value of the sine of a difference depends only on the magnitude of the smallest arc between the two respective points. The use of a simple difference, as for standard local polynomial modelling, would not be suited to angles because it depends on whether the origin belongs to above the smallest arc or not. In both cases, the sign depends on the orientation choice, that is also arbitrary, but this is not relevant due to the symmetry of our weight functions.
Since f is determined by a = (a 0 , a 1 , . . . , a p ) , we get Differentiating L θ (a) with respect to the elements of a, and setting these partial derivatives equal to 0, leads to p s=0 d s equations: where A(λ) = vec 1, S λ , . . . , S λ ⊗ p / p! . If log f is smooth enough at θ, the sin-polynomial P p represents a series expansion of log f up to order p, and solving the system of Eq. (1) for a gives the estimatesâ = andã s is the vector of the mixed partial derivatives of total order s of log f at θ . For example,ã 1 is the gradient vector, andã 2 = vec(H), where H denotes the Hessian matrix. Arguments in Loader (1996a) assure the existence and uniqueness ofâ since cartesian products of circles are compact. Setting g = log f , andĝ(θ ) =â 0 the density estimate at θ ∈ [−π, π) d is then given bŷ When p = 0, formula (2) simplifies to the standard kernel estimator (Di Marzio et al. 2011), whereas for p > 0 it generally becomes nonlinear and the denominator is required to make it a bona fide density. It is rotationally invariant, that isâ =â * , wherê a * is the estimate using translated data θ 1 + ω, θ 2 + ω, . . . , θ n + ω, ω ∈ [−π, π) d . Thus, if we rotate the initial direction by ω the estimate is not affected. This, in circular statistics, is an important property as the choice of the origin is arbitrary.
Maximum smoothing (κ = 0) yields non-uniform estimates when p > 0, whereas, if p = 0, it gives 1/(2π) for any data. When κ = 0 the contribution of one observation to the estimate does not depend on its location, and this makes the inferential problem fully parametric. Therefore, the use of a constant weight function in the estimate amounts to selecting an element within the parametric family of sin-polynomials we are modelling as a global estimate. When p = 0 this family contains only the element 1/(2π), while p > 0 produces a sinusoidal estimate. Figure 1 shows an example using von Mises data, in which the amplitude of the sinusoidal behaviour depends on the population.

Accuracy
As a starting point, we establish that asymptotic properties of estimator (2) can be conveniently expressed by referring to those of the estimators of log-densities. This can be seen by a very general argument that requires consistency of the estimator and smoothness of both the population and the estimate. To simplify notation, we initially consider the one-dimensional case.
Using again g = log f , in virtue of Corollary 1, we have that, for large n, R n = g − g ≈ 0 and so exp(g + R n ) ≈ f ×(1+ R n ). This shows that the rate of convergence of the log-density estimator at θ ∈ [−π, π) does not change when we exponentiate it, whereas its magnitude varies due to the multiplicative factor f (θ ). Clearly, this transformation improves the estimation at the tails, and, more generally, over the regions where f (θ ) < 1. Such regions are generally a large part of the support when we note that our densities are continuous functions over [−π, π). Concerning the convergence rate of the normalized estimator, we see that so the rate of convergence does not change even after normalization. Coming to the magnitude of the mean integrated squared error (MISE), it is interesting to note that above expression makes it possible to invoke Theorem 1 of Glad et al. (2003). They prove that, when the un-normalized area is bigger than one, then it does not worsen if we add to it a fixed quantity that makes its integral equal to one. This result can be considered very strong because it holds for any sample size. Obviously, if the area of our estimate is smaller than one, this theorem will not apply; however, we can still be confident that severe, often negative, bias at the peaks has been reduced. Comparisons of normalized and un-normalized estimators can be found in Di Marzio et al. (2016).
In general system (1) has only numerical roots when p > 0, and direct accuracy calculations are impossible, and so we expand (1) to obtain an expression for the estimation error. We consider un-normalized estimators throughout this section.

Asymptotics
The starting point for obtaining asymptotic properties is the expansion of system (1), forâ aroundã, dα is the Jacobian matrix of the local likelihood system at a =ã. It follows that Starting from Eq. (3), asymptotic bias and variance, for the case of order p ≥ 1 of the approximating sin-polynomial, are provided by Moreover, assume that, for odd p, all the mixed derivatives of total order p + 1 of the log-likelihood function exist and are continuous in [−π, π) d , and, for even p, this also holds for all the mixed derivatives of total order p + 2, then, we have where D f (θ) denotes the gradient of f at θ , and 1 is a vector of ones of length d p+1 for odd p, and d p+2 for even p > 0.

Moreover, for both (i) and (ii)
Using results in Theorem 1, component-wise consistency of (â 0 ,â 1 , . . . ,â p ) comes from a direct application of Chebychev's inequality, as stated in Remark 1 If we focus onâ 0 , the results in Theorem 1 can be simplified as follows. For a multiindex j = ( j 1 , . . . , j d ), and a kernel K κ 1 ,...,κ d , set and notice that, due to the symmetry of and g = log f . Then, using the results in Theorem 1, along with the approximatioñ while, in either case, the leading term of the variance is

Remark 2 The above results can be further simplified if
is the modified Bessel function of the first kind and order s. Then, for j ≥ 0, it holds that where OF(z) stands for the odd factorial of z ∈ Z + , with OF(0) = 1. For large enough κ, , can be approximated by 1 with an error of magnitude O(1/κ). These approximations, along with the assumptions in Theorem 1, give an asymptotic bias of O κ −( p+1)/2 for odd p and O κ −( p+2)/2 for even p, while, in both cases, the asymptotic variance is O n −1 κ d/2 . As a consequence, the value of κ which minimizes the asymptotic mean squared error ofâ 0 is O n 2/(2( p+1)+d) for odd p, and O n 2/(2( p+2)+d) for even p, which lead to rates of convergence of orders n −2( p+1)/(2( p+1)+d) and n −2( p+2)/(2( p+2)+d) , respectively.
As previously noted, when p = 0 system (1) has a closed form solution. This allows direct calculations, without using Theorem 1. For d = 1, the leading terms of bias and variance are, respectively, 1/2 f (θ )/ f (θ ) sin 2 K κ , and 1/(n f (θ )) K 2 κ . Because the local linear fit has the bias leading term equal to and the same variance, the respective convergence rates are the same. Therefore, apart from the stationary points, the P 0 fit is asymptotically superior to the (un-normalized) P 1 one where the population density is concave, as is usually the case around the modes.
The previous results can be formulated forf instead ofâ 0 . When d = 1, the leading terms of the biases of the un-normalized estimates, up to order two, are whereas the asymptotic variances are all equal to f (θ )/n K 2 κ .

Smoothing degree selection
In order to select the smoothing degree, we prefer likelihood cross-validation since it does not require explicit estimation of higher order derivatives, as happens for any plug-in approach, and explicitly takes account of the risk function we use for our sinpolynomial modelling. We start with a caveat as follows. The local likelihood estimator is nonlinear in nature when p > 0. Consequently, when the smoothing parameter(s) is (are) fixed, iff i are the normalized estimates from N samples of size n i , then the (normalized) estimate using all the data from the combined samples is not the same as N i=1 n ifi / n i , as would be the case for p = 0. This anomaly, which leads to an increased computational burden when cross-validation is used, is also apparent in the Euclidean setting.
We could use a normalized estimate, or just penalize the un-normalized one. Under the first perspective, the likelihood cross-validation criterion for density estimation suggests maximizing the leave-one-out log-likelihood wheref −i indicates an estimate obtained after removing the ith observation. The second approach leads to a penalized likelihood given by the target where λ is some penalty; here, the difficulty lies in choosing an appropriate λ.
The first approach appears more direct, but turns out to be very computationally intensive; this is a consequence of the caveat explained above. However, passing to the logarithm we can approximate the second term by n log f (α)dα. Noting that f ≈ 1, a Taylor series approximation of the logarithm leads to This can be seen as the same as a penalized likelihood when λ = n. Formula (4) has been presented by Loader (1996b, p. 90) as a direct application of standard crossvalidation to his log-likelihood model log f (X i )−n f (u)du − 1 , that is slightly different from our L( f ).

Computational aspects and interpretation
System (1) is nonlinear and contains a number of integrals; hence, closed-form solutions are in general unavailable. Nevertheless, when products of von Mises densities are used as kernels, it is possible to alleviate this issue.
In Sect. 4.1 we indicate a way to avoid numerical integration based on the properties of Bessel functions when P 1 is used. This strategy does not apply for higher order sin-polynomials ( p > 1) because cross-terms do not allow us to obtain separable integrals. Even avoiding cross-terms would not work since the resulting integrals do not have any explicit expression. In Sect. 4.2, based on asymptotic arguments, we present, for P 1 and P 2 fits, a simple way to obtain closed-form solutions without resorting to numerical integration.

Local linear fit
A local linear fit for f at θ ∈ [−π, π) d can be obtained starting from the solution for a 0 , of d + 1 equations We will denote the quantities on the LHS by the statistics and, for j ∈ (1, . . . , d), Using a von Mises kernel, and denoting {(2π) d d j=1 I 0 (κ j )} −1 by B, the quantities in the RHS of system (5), respectively, become ∈ (1, . . . , d). Hence, expressing the integrals as Bessel functions, the above quantities can be, respectively, rewritten as where atan2(y, x) gives the angle (in radians) between the x-axis and the vector from the origin to (x, y). Then, taking the ratio gives As for the existence conditions, due to the circular kernel definition, M 0 > 0. This quantity has to be solved in order to obtainâ ( j) 1 . Such an approach gives the numerical solutions for all the partial derivatives ( j ∈ (1, . . . , d)). Finally, substituting these into the first equation of system (5), we obtain This expression suggests that P 1 modelling can be seen as a correction of the kernel density estimator which basically reduces the estimate where the density gradient has nonzero norm, and leaves it unchanged at the maxima and minima. Thus, if κ j are the same both for p = 1 and p = 0, the un-normalized area of the case p = 1 is strictly smaller than one. Hence, normalization would result in bias reduction (increase) near the mode (along the valleys) and in variance inflation, drastically contrasting with the un-normalized fit that has same bias as case p = 0 at stationary points.

Asymptotic approach
Closed-form solutions for system (1) do not exist for usual circular kernels if p > 0. This is in contrast with the Euclidean setting, where the use of the Gaussian kernel makes them available if we implement P 1 and d ∈ (1, 2), or p = 2 and d = 1. Hjort and Jones (1996) report them; see their formulas (5.1), (5.2) and (7.3). In this section we obtain closed-form approximate solutions when the von Mises kernel is used by appealing to some asymptotic arguments. Like any Euclidean kernel, as n increases circular kernels also concentrate, giving significant weight only to those observations which are close to the estimation point. For large sample sizes, this allows these approximations ⎧ ⎨ ⎩ cos(u) ≈ 1 − 1 2 atan2(sin(u), cos(u)) 2 I 0 (κ) ≈ (2πκ) −1/2 exp(κ) sin(u) ≈ atan2(sin(u), cos(u)).
The resulting functions lead to closed forms for both integrals and estimators when P 1 or P 2 are modelled. We use periodic functions since the difference between two angles can fall outside the interval [−π, π). Numerically, it could be convenient to rescale the kernel in order to avoid the weight values diverging too much which would affect stability. For this reason, we have nevertheless included a scale factor, although K κ j = 1.
Result 1 Consider the log-likelihood system (1) with p = 1, d ≥ 1, and a d-fold product of von Mises densities as the weight function. The use of approximations (9) within the integrands in the RHSs of the system leads to the following closed form solutions:â with M 0 and M ( j) 1 defined in Eqs. (6) and (7).
Since theâ 0 andâ . This latter integral, for large κ j , is approximately equal to 1/κ j , which is consistent with the rate of convergence in the linear case.
Result 2 Consider the log-likelihood system (1) with p = 2, d ≥ 1, and a d-fold product of von Mises densities as the weight function. Using the approximations (9) leads to these expressions for the RHS of this system: where D indicates the following quantity If d = 1 then the above RHSs give these closed-form solutions: A check of their consistency requires to additionally know that M 2 p → f (θ ) sin 2 K κ . A brief examination also reveals that estimator (13) has first-order approximation exactly equal to d log f (θ )/dθ , differently from that seen for (11), where this is only asymptotically true. Substituting Eq. (11) into (10) we can writeâ and then model P 1 can be described as the kernel estimator plus a correction based on slope. Similarly, model P 2 can be written aŝ where a 2 1 , which is a different quantity fromâ 2 1 , denotes a consistent estimator (via Slutsky's theorem) ofã 2 1 given by the product of the derivative estimators (11) and (13). This formulation suggests that the quadratic estimator can be seen as a correction of a "linear" estimator where the bias introduced both at the minima and peaks is reduced by a logarithmic term involving second derivative estimation. Specifically, slope and curvature corrections have the same magnitude for big κ, and their relative impact on the estimate is described by the ratio a 2 1 /â 2 . Also, P 2 fits tend to have bigger area than linear ones when f is convex nearby the maxima, as often happens.
The above considerations lead to a couple of new estimators ofã 0 which do not have Euclidean counterparts. A promising competitor of the linear fit (10), based on the fact that estimator (13) is more efficient than its counterpart (11), could be: A multidimensional, quadratic estimator can be conceived as a direct generalization of the one-dimensional case:

Simulations
Firstly, we examine the efficiency of our (normalized) estimators on 200 samples of n observations drawn from a von Mises population with null mean direction and concentration parameter equal to five (v M(0, 5)). Figure 2 shows the estimated log(MISE) for n = 100 and n = 500. It can be seen that the approximation of exp(â 0 ) using Eq. (10) is very good for both values of n. As expected, a larger κ (corresponding to less smoothing) is required for larger sample sizes. Despite their asymptotic equivalence, the use of P 1 improves on the standard kernel density estimate, whereas the P 2 performance is even better. This is despite the fact that the normalization step has a bigger (beneficial) impact for p = 1 than for p = 2 in terms of bias reduction. Overall, the estimator L 0 has the best performance. In the second simulation experiment, we consider some mixture distributions. In the first case, we use an equal mixture between wrapped Cauchy with mean direction 0 and concentration 0.225 (WC(0, 0.225)) and uniform (UC(−π, π)) distributions. This population model is not as well behaved as a von Mises one as it has very thick tails, and the integral of the squared second derivative is larger. In such a context, we may expect that the case p = 2 would be superior to other models. Simulations confirm that this is indeed the case; see Fig. 3. The second example uses a equal mixture of von Mises densities to form a bimodal distribution. In this case, also, p = 2 is the best, with p = 0 the worst.
The final experiments are designed to gain knowledge about practical performance in the case that the smoothing degree is data-driven. We consider sixteen models, eight of which are bivariate. They are unimodal or multimodal and are more or less (rotationally) symmetric around the origin. We use, for each sample, two bandwidth selectors: the maximizer of likelihood cross-validation (LCV) given by Eq. (4) (which uses an un-normalized estimator); and classical least-squares cross-validation (LSCV).
Computations were made using MATLAB, and some example code is available from http://www1.maths.leeds.ac.uk/~charles/TESTpaper/matlab.zip. The results are presented in terms of average integrated squared errors evaluated using normalized estimates; see Tables 1 (univariate populations) and 2 (bivariate populations). After noting that the relative merits of the estimators do not depend on which selector has been used, the main message is that the standard kernel is the worst density estimator, even with n = 100, when asymptotic performance is not relevant. First-order fits, i.e. P 1 , always have satisfactory performance because the bias reduction at the peaks, which is mainly due to normalization, is often decisive. From additional results not reported in Table 1, it appears that estimator L 0 seems better suited than P 1 when the population shape is simpler but their performances are very similar.
Quadratic modelling behaves unexpectedly well, being the best one in the majority of cases. In general, it is more efficient when the roughness is more pronounced. Indeed, in the case of the uniform population, which can be considered as a proper counterexample in that all derivatives are zero, P 0 and P 1 fits have very similar behaviour (for both the smoothing degree selectors) which outperform Q 0 . Smoothing degree is selected by likelihood (least-squares) cross-validation U unimodal, B bimodal, T trimodal, F five-modes, S rotationally symmetric, A rotationally asymmetric  The saddle-shaped population (bottom three in Table 2) has large regions where the asymptotic bias is mainly due only to first-order properties of the model since second derivatives have different signs at opposite sides. In such case, variance inflation of the quadratic estimator dominates its bias reduction and the overall performance degrades. The last two models in Table 2 deserve particular attention because they concern correlated variables. Specifically, we use the bivariate von Mises model proposed by Singh et al. (2002). Similar to a bivariate Gaussian family, we have five parameters: two locations, two concentrations, and a "correlation". Our two casesboth bimodal-are, respectively, featured by small concordance (concentrations equal to 1.9 and correlation equal to 2), and moderate concordance (concentrations equal to 4.9 and correlation equal to 5). We see that in these cases our proposals are by far superior to the standard kernel method reaching an improvement of nearly 40% in the case of moderate concordance.
A comparison between the two smoothing selectors shows that LCV performs a little better in the majority of cases, although for n = 500 they appear almost equivalent. The fact that we measure discrepancy by squared errors suggests that we still could try to select at least p by LSCV. In order to investigate the effectiveness of this approach, we have considered, for each sample of the previous simulation, the four integrated squared error curves as functions of κ (one for each estimator), and then selected the p associated to the smallest minimum. Then we have checked if such p is the same as the one optimizing LSCV or LCV curves over κ. In Table 3 we report the number of such matches for a few populations. Since the results suggest that LSCV is the best in this regard, we might thus envisage choosing the estimator based on the optimal LSCV, and then choosing the smoothing parameter for that estimator based on LCV. This idea is taken up further in the next section. carbon-carbon) to which other atoms are linked. In particular, to one of the carbon atoms a "side chain" is formed, and the structure of this side chain defines the type of amino acid. The sequence of dihedral angles (φ, ψ, ω) along the backbone is determined by the relative positions of the atoms, and this determines the overall shape of the protein after folding. Since ω is highly predictable, with little variation, a study of − π < φ, ψ < π is useful for many purposes, particularly in validation of newly determined protein structures.
A plot of pairs (φ i , ψ i ), i = 1, . . .-known as a Ramachandran plot-for any protein reveals several subgroups, and mixture models of von Mises distributions have been used to summarize these from a parametric perspective. Work by Mardia et al. (2007) used an EM algorithm to fit a mixture, with the number of components being determined by AIC. Lennox et al. (2010) proposed a Dirichlet process to determine the number of components. A related Bayesian approach was developed by Boomsma et al. (2008) which allowed for many more components in a hidden Markov model, in which mixtures were trained on a set of angles from each amino acid. Kernel density estimation has also been used on the protein data (Taylor et al. 2012) where interest lay in considering a bivariate density estimate conditional on the amino acid type. This has been considered as an alternative approach to validation (Lovell et al. 2003), in which procheck-based on histograms-is currently used. Finally, we note that Fernández-Durán and Gregorio-Domínguez (2016) have anaylzed similar datasets using trigonometric sums, with visual results that exhibit a periodic structure.
In this case study, we use data from 500 high-quality, "representative" proteins available from the Richardson Laboratory. 1 Each protein is represented by a sequence of bivariate angles, each of which is associated with an amino acid. These are pooled together, and we thus obtain 20 datasets corresponding to the 20 amino acids. It should be noted that, within a protein, we expect observations not to be independent. However, in each of the 20 datasets we are pooling data from 500 unrelated proteins, and so strong dependence between observations in the same dataset will only occur when an amino acid occurs consecutively along the backbone; this is relatively uncommon. For each dataset, we can obtain a bivariate density estimate using the methods described in this paper, in which the smoothing parameter is selected by cross-validation. Our objective is to examine the differences in the estimates, and to see how these results may relate to previous work.
Obviously, we are unable to compare density estimates with the true densities for these data, as will be the case in any real-life application. One of the uses of density estimation is to obtain information about subgroups, or clusters, within the data. This could be subsequently used in the fitting of (parametric) models, for example. A natural way to investigate this is to identify the location (and height) of local modes. The ability of our estimators to identify bumps has been discussed, by presenting extensive simulation evidence, in Di Marzio et al. (2016). This motivates our focus on such data, where amino acid distributions are partially characterized by modes. As it is customary in peak recognition, we applied two filters, a global and a local Sample sizes, number of modes (maximum), and roughness in the estimates using various methods. The smoothing parameters are selected by likelihood cross-validation one, in order to reduce false positives. So, for a global threshold for the modes of estimator E ∈ {P 0 , P 1 , L 0 , Q 0 } we used c 0 m E where c 0 = 0.005, and m E indicates the maximum among the estimates made over an equispaced grid (of 50 × 50) locations using E. Secondly, to avoid locations which were more akin to saddle points, we required that, at a mode, say where δ m represents the set of neighbouring points around a mode, and c 1 = 0.0001. Table  4 gives the maximum off , the number of modes, and the roughness for some of the amino acids. It can be seen that P 0 has the largest maxima and the largest number of modes, but that P 1 has the largest roughness. Correspondingly, L 0 has the lowest maximum, the fewest number of modes and the smallest roughness, with Q 0 generally being closer to L 0 . We found that the number of elements in the union of mode locations, over all 20 amino acid datasets, for each of the methods in Table  4 is: 111, 74, 43 and 38, respectively. We note that the hidden Markov model of Boomsma et al. (2008) used 50 components in a mixture of bivariate von Mises distributions which seems to be consistent with these values, except for the standard kernel. The most striking difference, however, is in the visual appearance of the density estimates, in which the difference in roughness is very evident. We first note that the height of the highest mode is much greater than the rest of the density, so in order to visualize the whole density estimate we have taken cube roots throughout. To illustrate the differences, we focus on two amino acid datasets: Alanine (A), and Asparagine (N). Figures 4 and 5 show (transformed) contour plots of estimates P 0 and Q 0 , as well as slices through the density estimates at the indicated values of φ and ψ. The slices were chosen to pass through (or close to) a mode. Comparison of the contour plots confirms that the estimates for Q 0 are much smoother than those for P 0 . The profile densities, which have been chosen to pass near to a local mode, show an "adaptive" smoothing character of the Q 0 estimate, in which the tails of the density are noticeably smoother, while the height of the modes are not much less than those for P 0 . We note, also, the possibility of spurious modes far in the tails of Q 0 which may arise for similar reasons to the sinusoidal behaviour seen in Fig. 1. In all these interpretations, it should be remembered that the comparisons are made on the basis of an LCV choice of smoothing parameter.  Top: transformed (cube-root) contour plots of the bivariate density estimates for amino acid A (alanine) using P 0 (left) and Q 0 (right). Bottom: profile densities for P 0 (continuous) and Q 0 (dashed) corresponding to ψ = 2.5 (left) and φ = − 1.1 (right)

Discussion
Nonparametric density estimation for circular data has previously focussed on classical kernel density estimation (using a circular kernel). In this paper, we propose a family of estimators that have better theoretical properties in the sense of an arbitrarily small asymptotic bias without asymptotic variance inflation. Such methods have the potential to be more useful when the shape of the density shows unfriendly features like big curvature or large tails. A disadvantage of our proposal is that it is more restrictive than traditional circular kernel method because it requires that population densities are as smooth as required by the sin-polynomial order. Moreover, it is not guaranteed to give favourable results with small sample sizes. However, in our simulations it seems that improvements come with reasonable sample sizes like 100 and 500. A final drawback is that our estimators are more computationally expensive since they have not, in general, closed forms, although in the paper we have seen that use of the von Mises kernel gives an efficient closed form (in approximate or exact form) for p ≤ 2. Thus, the computational effort in our simulation examples is never more than twice that of the standard ( p = 0) case.  Top: transformed (cube-root) contour plots of the bivariate density estimates for amino acid N (asparagine) using P 0 (left) and Q 0 (right). Bottom: profile densities for P 0 (continuous) and Q 0 (dashed) corresponding to ψ = 0.7 (left) and φ = − 1.1 (right) A promising development could lie in replacing our sin-polynomial expansion by a proper, flexible circular parametric family, namely the distributions based on non-negative trigonometric sums introduced by Fernández-Durán (2004) and Fernández-Durán and Gregorio-Domínguez (2016). This would give a fully parametric method for a null concentration of the kernel, becoming more nonparametric with increasing concentration. The main difficulty would be to find the global maximum of the likelihood functions, which have many local maxima. Kernel weighting would still be used to make the method nonparametric. The formal asymptotic theory would need to be studied, and this task appears less straightforward.
The second term on the RHS is O n −1 11 K κ 1 ,...,κ d (α) sin p+1 α (1) dα 2 , and so it is negligible under hypothesis (a). After a change of variable, the first-order expansion for f (α) around θ leads to the variance under assumption (b).