Estimation and computations for Gaussian mixtures with uniform noise under separation constraints

In this paper we study a finite Gaussian mixture model with an additional uniform component whose role is to catch points in the tails of the data distribution. An adaptive constraint enforces a certain level of separation between the Gaussian mixture components and the uniform component representing noise and outliers in the tail of the distribution. This makes the proposed tool particularly useful for robust estimation and outlier identification. A constrained ML estimator is introduced, for which existence and consistency are shown. One of the attractive features of the methodology is that the noise level is estimated from the data. We also develop an EM-type algorithm with proven convergence. Based on numerical evidence, we show how the methods developed in this paper are useful for several fundamental data analysis tasks: outlier identification, robust location-scale estimation, clustering, and density estimation.


Introduction
We study a class of mixture models for univariate data sets that can be used for several distinct goals: outlier identification, robust location-scale estimation, robust clustering, and density estimation.
Atypical outlying samples can break down most routine procedures such as location-scale estimation, visualization, density approximation, etc. There may be different reasons why one would flag an observation as an outlier; however, a universal definition of outliers does not exist. In homogeneous populations, they are thought of as points lying at some distance from the distribution center. However, extreme points may well be non-atypical when they reflect the heavy-tail structure of the underlying generating mechanism. In clustered populations, the distinction between outlying vs. non-outlying points is even more problematic; what we would qualify as a subset of atypical points could constitute a special cluster instead. Following Banfield and Raftery (1993) and Hennig (2011, 2016), among others, in this paper we consider a model strategy that accommodates the presence of ''noise''. By noise, we mean a mechanism that: (i) generates outlying points that arise in low-density regions of the data space; (ii) generates observations that have an unstructured behavior compared with the majority of the data set, so that they are considered outside the scope of the modeling. To motivate the methods proposed in this paper, we introduce three data sets that will be analyzed in Sect. 6.
TiO2 concentration: the data set was introduced in Reimann et al. (2000). A total of n = 768 measurements of elements in soil samples around the Baltic Sea were taken. Extensive details about how these concentration measures are constructed are given in the aforementioned paper. We focus on TiO2 (titanium dioxide) concentration.
pH of blood: the data set is obtained from Hartigan (1975) and contains n = 40 samples of various compounds in cerebrospinal fluid and blood for acidosis patients. A total of 6 variables are sampled; in this paper, we focus on pH of blood.
Realized volatility: the data set was studied in Coretto et al. (2020); it contains a cross-sectional measurement of realized volatility for n = 123 stocks traded on the New York Stock Exchange (NYSE) between 1998 and 2008. Here we consider a cross-section taken on 25/Aug/1998.
In Fig. 1 we report kernel density estimates of the distributions of the three data sets. Kernel estimates are computed using the Epanechnikov kernel function and the adaptive optimal bandwidth estimator of Sheather and Jones (1991). Other data-driven bandwidth selectors have been tried, but the method of Sheather and Jones (1991) provided the most credible results. The common characteristics of these data distributions are that: (i) they have an elongated right tail with few scattered points that sometimes appear far from the main bulk of the data; (ii) for all of them the tail behavior affects the optimal bandwidth estimate, which introduces undersmoothing, so that these extreme points are fitted essentially alone by the kernel function. This is particularly evident in Panel (c) of Fig. 1; (iii) in all panels there is evidence of multimodality, which may indicate the existence of groups. However, this multimodality may well be introduced by the undersmoothing issue, and therefore the existence of groups in the data is not clear.
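For readers who want to reproduce this kind of picture, the following sketch computes an Epanechnikov kernel density estimate on synthetic data with an elongated right tail. The data, bandwidth, and grid are illustrative choices, not the Sheather and Jones (1991) selector used in the paper.

```python
import numpy as np

def epanechnikov_kde(x_grid, data, h):
    """Epanechnikov kernel density estimate on x_grid with bandwidth h."""
    u = (x_grid[:, None] - data[None, :]) / h       # shape (grid, n)
    k = 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)    # kernel, support |u| <= 1
    return k.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(0)
# Main bulk plus scattered right-tail points, mimicking the data of Fig. 1.
data = np.concatenate([rng.normal(0.0, 1.0, 300), rng.uniform(4.0, 10.0, 30)])
grid = np.linspace(-4.0, 12.0, 400)
dens = epanechnikov_kde(grid, data, h=0.5)
print(np.trapz(dens, grid))   # the estimate integrates to (about) one
```

With a small fixed bandwidth, each extreme point carries its own little bump, which is exactly the undersmoothing effect described above.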
The discovery of clusters in the presence of noise is a challenging issue, for an in-depth discussion consider Hennig (2004) and Ritter (2014), and references given therein.
In this paper, we study estimation methods and algorithms for a model consisting of a mixture of Gaussian distributions with the addition of a uniform component whose role is to capture the noise, and that is constrained to overlap with the main location-scale components in a user-controlled way. The model was introduced in Coretto et al. (2020) as an extension of a general class of models of this kind proposed in Coretto and Hennig (2011), which in turn generalized an original proposal of Banfield and Raftery (1993). We introduce a model restriction that ensures that the overlap between the noise and non-noise components can be controlled by the user. The model restriction is embodied in the formulation of a constrained ML estimator for which existence and consistency are studied. A feasible EM algorithm is shown to approximate the constrained ML estimator. A detailed treatment of the computational properties of the algorithm is given.
Other robust approaches can handle groups in the data. Coretto and Hennig (2016) and Coretto and Hennig (2017) propose the addition of a noise component with an improper distribution. There also exists an approach where outliers are represented as points arising from the tail regions of the mixture components. In this direction, McLachlan and Peel (2000a) proposed to achieve robustness with ML estimation for mixtures of Student-t distributions. Punzo and McNicholas (2016) and Farcomeni and Punzo (2020) recently proposed ML methods for mixtures of contaminated Gaussian distributions. Another body of work defines the outliers as points drawn from a ''spurious'' distribution not specified at the modeling stage. Within that framework, robustness is typically achieved through trimming methods. Gallegos and Ritter (2005) and García-Escudero et al. (2008) introduced ML-type procedures for partition models with trimming that discards a fixed proportion of observations treated as outliers. For a review of the main ideas in robust model-based clustering see Ritter (2014), Farcomeni and Greco (2015), and Hennig et al. (2016).
Most well-established robust methods treat the tails of the distribution symmetrically. The model presented here is particularly useful when contamination processes affect the data distribution asymmetrically, in the sense that noise affects only one of the two tails. This is a specific situation, but it occurs often in practice, as shown in the examples of Fig. 1. An additional advantage of the proposed methodology is that it estimates the noise level from the data, so that the method decides whether or not contamination affects the data (see Sect. 6). The latter is important when expert supervision is not possible, for example when thousands of features are machine-scanned for outlier identification (e.g., this is routinely done in genomic studies; see Marshall 2004). The methodology can also be used to investigate group structures in the data by applying model-based clustering techniques. In Sects. 5 and 6 we show how the method under study applies to fundamental tasks in data analysis: density approximation, robust estimation, outlier identification, and clustering.
The present paper is organized as follows: in Sect. 2 we introduce and discuss the model and its possible applications; in Sect. 3 we analyze the ML estimation method and propose a feasible EM algorithm; in Sect. 5 extensive Monte Carlo experiments are performed to assess the finite sample properties of the proposed methods; in Sect. 6 we analyze the three data sets discussed above. Finally, in Sect. 7 we conclude the paper with some final remarks.
Gaussian mixtures with tail uniform noise

Let us fix some notation first. For $g = 1, 2, \ldots, G$ define a collection of normal densities $\phi(x; \theta_g)$, where $\phi(x; \theta_g)$ is the density at a point $x \in \mathbb{R}$ and $\theta_g = (\mu_g, \sigma_g) \in \mathbb{R} \times (0, +\infty)$ is the mean and standard deviation parameter vector. Let $I_A(x)$ be the usual indicator function, that is $I_A(x) = 1$ if $x \in A$, and $I_A(x) = 0$ otherwise. Let $(\pi_0, \pi_1, \ldots, \pi_G)$ be a vector of mixing proportions such that $\pi_g \in (0,1)$ for all $g = 0, 1, \ldots, G$ and $\pi_0 + \pi_1 + \cdots + \pi_G = 1$. The density function is a finite mixture of $G$ Gaussian densities with uniform noise:

$$f(x; \theta) = \pi_0 \, \frac{I_{[b_1, b_2]}(x)}{b_2 - b_1} + \sum_{g=1}^{G} \pi_g \, \phi(x; \theta_g). \qquad (1)$$

The parameter vector $\theta_0 = (b_1, b_2)$, with $b_1 < b_2$, contains the limits of the support of the uniform distribution. The mixing proportions have the usual sampling interpretation (McLachlan and Peel 2000a): in an iid sample from a distribution having density (1), a proportion $\pi_g$ of points is expected to be sampled under the $g$-th component density $\phi(\cdot; \theta_g)$, while a proportion $\pi_0$ of points is expected to be generated under the uniform component. The model parameter vector is defined as $\theta = (\pi_0, \pi_1, \ldots, \pi_G, \theta_0, \theta_1, \ldots, \theta_G)$. Originally, this class of models was introduced in Banfield and Raftery (1993) with the main goal of performing robust model-based clustering; the uniform component had the role of catching outliers. Banfield and Raftery (1993) proposed to fix $[b_1, b_2]$ equal to the data range. Coretto and Hennig (2011) generalized this to mixtures of general location-scale densities with an arbitrary finite number of uniform distributions having disjoint supports.
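Model (1) is straightforward to simulate: draw a component label with probabilities $(\pi_0, \pi_1, \ldots, \pi_G)$, then draw from the uniform or from the corresponding Gaussian. A minimal NumPy sketch with illustrative parameter values:

```python
import numpy as np

def sample_mixture(n, pi, theta, b, rng):
    """Draw n iid points from model (1): pi[0] is the noise proportion,
    pi[1:] the Gaussian proportions, theta a list of (mu, sigma) pairs,
    b = (b1, b2) the support of the uniform component."""
    comp = rng.choice(len(pi), size=n, p=pi)
    x = np.empty(n)
    noise = comp == 0
    x[noise] = rng.uniform(b[0], b[1], noise.sum())
    for g, (mu, sigma) in enumerate(theta, start=1):
        idx = comp == g
        x[idx] = rng.normal(mu, sigma, idx.sum())
    return x

rng = np.random.default_rng(1)
x = sample_mixture(10_000, [0.1, 0.54, 0.36],
                   [(0.0, 1.0), (5.0, 1.0)], (8.0, 15.0), rng)
# Roughly a proportion pi_0 = 10% of the points falls in the uniform support.
print(np.mean(x >= 8.0))
```

This matches the sampling interpretation of the mixing proportions given above.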
It is well known that mixtures of normal distributions can fit almost any distribution in a semiparametric way. Here the uniform distribution in (1) is called the ''noise component''. Such a model can be used so that the Gaussian mixture part represents most of the data structure, while the uniform noise component is meant to represent regions of the data that do not have a clear shape or structure. Moreover, we want to use model (1) so that the uniform distribution is specifically designed to catch ''atypical'' points when these arise from one of the two tails of the distribution.
Although the estimator and algorithms developed in Coretto and Hennig (2011) should be able to cope with the situations described above, in some situations their method is not able to distinguish the tails of the normal mixture part of the model from the uniform component, as noted in Coretto and Hennig (2010). The latter would result in estimates where both the location and the variance of the normal components are seriously biased. For these reasons, Coretto et al. (2020) used model (1) to cluster financial data, and they modified the algorithm of Coretto and Hennig (2011) to ensure a certain separation between the tails of the normal components and the uniform component. In order to achieve the previous goal, we introduce three ''model-constraints''. By ''model-constraints'', we mean constraints on the parameters that translate the interpretation of the model that needs to be addressed by the fitting procedure. Consider the following constraints:

$$\mu_g + c\,\sigma_g \le b_1 \quad \text{for all } g = 1, \ldots, G; \qquad \text{(RS)}$$
$$b_2 \le \mu_g - c\,\sigma_g \quad \text{for all } g = 1, \ldots, G. \qquad \text{(LS)}$$

Since $\phi(\cdot)$ is symmetric about $\mu_g$, the constraints (RS) and (LS) imply that the support of the uniform component has a bounded amount of overlap with the region where the location-scale components put most of their mass. The degree of overlap is controlled by $c$. For example, assume $G = 2$ with $\sigma_1 = \sigma_2$ and $\mu_1 < \mu_2$. Take $c = z_{0.98}$, i.e. $c$ is the 98% quantile of the standard Gaussian distribution. Now, if (RS) holds, the second Gaussian component (the component with the right-most peak) can only overlap with the uniform noise component in the region corresponding to its 2% tail probability. Moreover, since (RS) has to hold for all other $g$, the first component cannot overlap with the uniform support more than the second one. The constraints (RS) and (LS) are mutually exclusive: only one of them can hold.
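The overlap bound is easy to verify numerically. The sketch below (hypothetical helper `rs_constraint_ok`, illustrative parameter values) checks the (RS) inequality $\mu_g + c\sigma_g \le b_1$ and computes the maximal overlap probability implied by $c$:

```python
from scipy.stats import norm

def rs_constraint_ok(mus, sigmas, b1, c):
    """(RS): every Gaussian component must satisfy mu_g + c*sigma_g <= b1,
    so at most its tail beyond c standard deviations can overlap [b1, b2]."""
    return all(mu + c * s <= b1 for mu, s in zip(mus, sigmas))

c = norm.ppf(0.98)           # c = z_0.98, the 98% standard normal quantile
print(rs_constraint_ok([0.0, 27.5], [7.0, 7.0], b1=50.0, c=c))   # True
print(1.0 - norm.cdf(c))     # maximal overlap mass per component, about 0.02
```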
The third constraint is called the ''noise proportion constraint'' (NPR). It bounds the noise level, i.e. the expected fraction of points generated from the uniform distribution under (1), requiring $\pi_0 \le \pi_{\max}$. If $\pi_{\max} < 50\%$, the NPR rules out the possibility that the noise becomes the majority. The latter corresponds to the classical assumption of robust statistics that one cannot distinguish outliers if they are the majority. However, in clustering applications one would set $\pi_{\max}$ not smaller than the smallest fraction of points that would genuinely be considered a ''cluster''.
A model like (1) can be used with different aims: density approximation, robust estimation, robust model-based clustering, etc. Some of the applications will be seen in Sect. 5. The use of the word robust here does not mean formal robustness. Hennig (2004) showed that the MLE corresponding to model (1) could break down in the presence of arbitrarily large observations. In practice, this is not common, and in fact we will show that the proposed MLE can adapt to heavy tails and strong skewness, showing good resistance to outliers.

Clustering
Insights into the cluster-generating ability of (1) are central for the subsequent developments. Consider an iid sample $\{X_1, X_2, \ldots, X_n\}$, where $X_i \in \mathbb{R}$ for all $i = 1, 2, \ldots, n$, and $X_i$ has a distribution with density $f(\cdot; \theta)$. Let $x_n := \{x_1, x_2, \ldots, x_n\}$ be the observed sample. If the component densities of (1) are sufficiently separated, one would observe $G$ distinct clusters of points, and if $\pi_0 > 0$ an additional group of more ''unstructured'' points would appear in the sample, depending on the size of the interval $[b_1, b_2]$. Define the following quantities:

$$s_0(x; \theta) = \frac{\pi_0 \, I_{[b_1, b_2]}(x)/(b_2 - b_1)}{f(x; \theta)}, \quad \text{and} \quad s_g(x; \theta) = \frac{\pi_g \, \phi(x; \theta_g)}{f(x; \theta)}, \quad \text{for } g = 1, 2, \ldots, G. \qquad (2)$$

The quantities in (2) will have a crucial role. These ratios are called ''posterior membership probabilities'', because in the iid sampling case $s_g(x_i; \theta)$ is the posterior probability that a point $x_i$ has been generated by the $g$-th component of (1) at $\theta$, conditional on the observed sample. Assuming that $\theta$ is known, the optimal assignment rule minimizing the misclassification rate is the Bayes assignment

$$A(x_i; \theta) := \arg\max_{g = 0, 1, \ldots, G} s_g(x_i; \theta).$$

In reality $\theta$ is not known, and typically the assignment is based on an estimate, usually an ML estimate.
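The posterior membership probabilities of (2) and the Bayes assignment can be coded directly; the sketch below assumes known parameters (illustrative values) and uses SciPy for the normal density.

```python
import numpy as np
from scipy.stats import norm

def posteriors(x, pi, theta, b):
    """Posterior membership probabilities s_g(x; theta) of Eq. (2);
    column 0 corresponds to the uniform noise component."""
    x = np.asarray(x, dtype=float)
    b1, b2 = b
    cols = [pi[0] * ((x >= b1) & (x <= b2)) / (b2 - b1)]
    for pg, (mu, sigma) in zip(pi[1:], theta):
        cols.append(pg * norm.pdf(x, mu, sigma))
    s = np.column_stack(cols)
    return s / s.sum(axis=1, keepdims=True)     # each row sums to one

def bayes_assign(x, pi, theta, b):
    """Bayes rule: assign each point to the component maximizing s_g."""
    return posteriors(x, pi, theta, b).argmax(axis=1)

labels = bayes_assign([0.0, 5.2, 12.0], [0.1, 0.54, 0.36],
                      [(0.0, 1.0), (5.0, 1.0)], (8.0, 15.0))
print(labels)   # [1 2 0]: two "clustered" points and one noise point
```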

Estimation
This section gives a detailed account of ML estimation and computation for fitting $\theta$. First, we consider $G$ fixed and known. Data-driven choice of $G$ will be considered in Remark 1 and Sect. 5. ML estimation for finite Gaussian mixtures is a fascinating problem that has been studied for a long time. A framework that allows for simple existence and consistency proofs does not exist; Chen (2017) is the most recent comprehensive account of the main theoretical contributions on this subject. The log-likelihood function associated with an observed data set $x_n$ from an iid sample from (1) is given by

$$L_n(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta). \qquad (4)$$

Day (1969) noted a fundamental issue connected to the unboundedness of $L_n(\theta)$. In fact, fix $\mu_g = x_i$ for arbitrary $i$ and $g$: taking $\sigma_g \to 0$ implies $L_n(\theta) \to +\infty$. The same happens if we fix an arbitrary $b_1 < x_i$ for some $i$: taking $b_2 \to b_1$, the variance of the uniform component $(b_2 - b_1)^2/12 \to 0$ and $L_n(\theta) \to +\infty$. Therefore, ML estimation needs to take into account constraints that bound the scale parameters from below. One can simply require, for example, that $\sigma_g \ge \sigma_0 > 0$, but: (i) these constraints will not produce a scale-equivariant ML; (ii) the choice of an effective $\sigma_0$ is not trivial, and it is shown to affect the fitting. There are several alternative approaches, although none of them translates into simple numerical approximating algorithms. The ML theory developed in Coretto and Hennig (2011) is based on scale constraints proposed by Dennis (1981) and Hathaway (1985) that are scale-equivariant. In the multivariate setting these constraints were rediscovered later by Ingrassia (2004). However, these are not feasible here; in fact, the (RS) and (LS) constraints are an additional complication in the numerical optimization of (4). Define the scale parameters $s_0 := (b_2 - b_1)/\sqrt{12}$ (the standard deviation of the uniform component) and $s_g := \sigma_g$ for $g = 1, \ldots, G$. Fix $0 < \delta < 1$, and consider the constraint $s_g \ge s_{\min} = \exp(-n^{\delta})$ studied in Tanaka and Takemura (2006).
Although these constraints are not scale-equivariant at fixed $n$, they are asymptotically scale-equivariant since $s_{\min} \to 0$ as $n \to +\infty$. Suppose that $n = 1300$ and that we fix $\delta = 0.5$: the resulting constraint would be $s_g \ge s_{\min} = 2.22 \times 10^{-16}$, where $s_{\min}$ equals the so-called machine epsilon. The latter means that for any $n > 1300$, in practical computations, we would treat $s_{\min}$ as a positive number that numerically cannot be distinguished from zero. (RS) and (LS) are mutually exclusive; hence it is simpler to treat them separately, and finally we will give some ideas on how to choose between them. We treat estimation with (RS) in mind because positively skewed data are more common in applications. The second constraint for model interpretation is the (NPR) constraint. While the scale constraint is the method's regulation tool for obtaining a well-defined and consistent ML estimator, the (RS), (LS), and (NPR) constraints restrict the domain of the parameters to gain a certain interpretation of the model. We look for the maximizer of $L_n(\theta)$ over the following constrained parameter space, for fixed constants $0 < \delta < 1$ and $c > 0$:

$$\Theta_n = \left\{ \theta : \; s_g \ge s_{\min} \text{ for all } g = 0, 1, \ldots, G; \;\; \mu_g + c\,\sigma_g \le b_1 \text{ for all } g = 1, \ldots, G; \;\; \pi_0 \le \pi_{\max} \right\}. \qquad (5)$$

The constrained sample ML estimator is defined as $\hat{\theta}_n \in \arg\max_{\theta \in \Theta_n} L_n(\theta)$. For simplicity we defined the ML estimator in the (RS) case; for the (LS) case one replaces $\mu_g + c\,\sigma_g \le b_1$ in (5) with $b_2 \le \mu_g - c\,\sigma_g$. Note from (5) that, contrary to the classical case, here the parameter space changes with $n$. Moreover, because of the uniform component, the objective function of the ML problem is not continuous. Therefore, the existence of the ML estimator in both finite and infinite samples is not obvious. The next proposition states the finite-sample existence of the sample ML estimator.
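The numerical claim about the bound is easy to check: with $\delta = 0.5$, the lower bound $\exp(-n^\delta)$ drops to the order of the double-precision machine epsilon around $n \approx 1300$.

```python
import math
import sys

def s_min(n, delta):
    """Lower scale bound of Tanaka and Takemura (2006): exp(-n**delta)."""
    return math.exp(-(n ** delta))

print(s_min(1300, 0.5))         # about 2.2e-16
print(sys.float_info.epsilon)   # machine epsilon, about 2.22e-16
```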
Proposition 1 $L_n(\theta)$ achieves its maximum over $\Theta_n$.
The proof of the statement above is given in the final Appendix. The next proposition states the consistency of the ML estimator. A finite mixture can only be identifiable up to component label switching: in this case, any permutation of the indexes $g = 1, \ldots, G$ leads to the same mixture model (1). As in Teicher (1963), we state consistency on a quotient space of the original parameter space; in practice, we look for consistency with respect to one such permutation of the Gaussian components' indexes.

Proposition 2 Let $\theta^0 \in \Theta$, and suppose that (RS) holds for $\theta^0$. Let $\theta^0$ be the generating parameter vector, in the sense that the sample $\{X_1, \ldots, X_n\}$ is an iid sequence from a distribution having density $f(\cdot; \theta^0)$. Then

$$\Pr\left\{ \lim_{n \to \infty} \hat{\theta}_n = \theta^0 \right\} = 1,$$

where the convergence is meant on the quotient space defined by component label permutations.
The proof of the statement above is given in the final Appendix. We treated G as fixed and known. However, in density estimation, G controls the number of peaks in the final density estimate. In clustering applications, often G coincides with the number of clusters. There are several approaches to select an appropriate G. The most popular approach is to rely on information-theoretic quantities discussed in more detail in Sect. 5.
Remark 1 For univariate samples, the choice of the appropriate constrained model, i.e. deciding between (RS) and (LS), could be based on subject-matter considerations and visual inspection of the data. However, in cases where data are processed in an unsupervised way, it may be possible to choose the model version that fits the data better in terms of expected log-likelihood. This approach appears technically correct because, for a given $G$, both (RS) and (LS) have the same number of parameters. The joint tuning of the constraints' hyperparameters and $G$ is more problematic here. In the mixture context, the most popular approach for choosing the number of mixture components $G$ is to use information criteria such as the AIC and the BIC. Classical approximate estimators of the information criteria have the form (fitted likelihood − penalty), where the penalty term increases with the number of parameters. The number of parameters is generally interpreted as a proxy for model complexity and degrees of freedom. Observe that for a given $G$, different specifications of $(c, \pi_{\max}, \delta)$ modify the ability of the model to adapt to the data distribution, which in practice means that the constraints will change the ''effective degrees of freedom'' of the model. A ''more constrained parameter space'' allows the fitted model to adapt less to the data, and therefore it will lead to a ''simpler'' model. Thus, the well-known bias-variance trade-off arising in model selection is certainly affected by changing the parameter constraints. However, classical estimators of the AIC and the BIC will not reflect the effects of the constraints. Model selection frameworks that account for these effects exist only for specific cases (for instance, see Kuiper et al. 2011). In practice, we suggest that whenever $G$ needs to be selected, classical information criteria are used for a fixed set of model hyperparameters $(c, \pi_{\max}, \delta)$.

MEMR algorithm
It is well known that, even in the one-dimensional case, computation of the sample ML estimator in the finite mixture context is not an easy task due to the highly complex likelihood surface (see McLachlan and Krishnan 1997). Moreover, the uniform distribution here adds further difficulties due to its discontinuities and the fact that it introduces many local maxima in $L_n(\cdot)$ (see the proof of Proposition 1). The Expectation-Maximization (EM) algorithm of Dempster et al. (1977) is a popular choice for approximating the MLE of finite mixture models. We exploit the discontinuities of $L_n(\cdot)$, and we propose the ''multiple EM runs'' (MEMR) Algorithm 1. We show that the MEMR algorithm is monotonic and converges to a stationary point of the likelihood surface.
Let $s = 0, 1, \ldots$ be the iteration index, and let $a^{(s)}$ be the quantity $a$ computed at the $s$-th step of the algorithm. Define

$$Q(\theta'; \theta^{(s)}) = \sum_{i=1}^{n} \left[ s_0(x_i; \theta^{(s)}) \log\!\left( \pi'_0 \, \frac{I_{[b'_1, b'_2]}(x_i)}{b'_2 - b'_1} \right) + \sum_{g=1}^{G} s_g(x_i; \theta^{(s)}) \log\!\left( \pi'_g \, \phi(x_i; \theta'_g) \right) \right]. \qquad (8)$$

By applying Theorem 4.1 in Redner and Walker (1984), it can be established that if there exists $\theta'$ such that $Q(\theta'; \theta^{(s)}) \ge Q(\theta^{(s)}; \theta^{(s)})$, then $L_n(\theta') \ge L_n(\theta^{(s)})$. Therefore, iteratively increasing (8) with an appropriate choice of a sequence $\{\theta^{(s)}\}$, one obtains a monotonically increasing sequence $\{L_n(\theta^{(s)})\}$. This is a standard EM scheme, which is not directly feasible in our case. Rewrite (8) according to the following decomposition:

$$Q(\theta'; \theta^{(s)}) = Q_\phi(\theta'_\phi; \theta^{(s)}) + Q_u(\theta'_u; \theta^{(s)}) + Q_\pi(\theta'_\pi; \theta^{(s)}),$$

where

$$Q_\phi(\theta'_\phi; \theta^{(s)}) = \sum_{i=1}^{n} \sum_{g=1}^{G} s_g(x_i; \theta^{(s)}) \log \phi(x_i; \theta'_g), \quad Q_u(\theta'_u; \theta^{(s)}) = \sum_{i=1}^{n} s_0(x_i; \theta^{(s)}) \log \frac{I_{[b'_1, b'_2]}(x_i)}{b'_2 - b'_1}, \quad Q_\pi(\theta'_\pi; \theta^{(s)}) = \sum_{i=1}^{n} \sum_{g=0}^{G} s_g(x_i; \theta^{(s)}) \log \pi'_g.$$

The sequential maximization of $Q(\cdot)$ is separable in its three components above.
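The separability can be checked numerically. In the sketch below (illustrative values, one Gaussian component plus noise), the complete-data objective computed directly coincides with the sum of a piece depending only on the proportions, one depending only on $(b_1, b_2)$, and one depending only on the Gaussian parameters:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.sort(rng.normal(0.0, 1.0, 200))
b1, b2 = x[180], x[-1]                 # a candidate uniform support
pi = np.array([0.1, 0.9])              # (noise, Gaussian) proportions

# E-step: membership probabilities at the current parameters.
dens = np.column_stack([pi[0] * ((x >= b1) & (x <= b2)) / (b2 - b1),
                        pi[1] * norm.pdf(x, 0.0, 1.0)])
s = dens / dens.sum(axis=1, keepdims=True)

inside = (x >= b1) & (x <= b2)
q_pi = (s * np.log(pi)).sum()                          # proportions only
q_u = (s[inside, 0] * np.log(1.0 / (b2 - b1))).sum()   # (b1, b2) only
q_phi = (s[:, 1] * norm.logpdf(x, 0.0, 1.0)).sum()     # Gaussian part only
print(q_pi + q_u + q_phi)   # equals Q computed without the decomposition
```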
Remark 2 The usual M-step for Gaussian parameters does not work here, and the optimal solution for $Q_\phi(\cdot)$ cannot be calculated with a closed-form formula as in the case of the unconstrained Gaussian mixture model. Because of the scale and (RS) constraints, the optimization with respect to $\theta_\phi$ requires numerical optimization. In our implementation (see Sects. 5 and 6), we perform the M1-step using the gradient-based algorithm of Svanberg (2001) included in the NLopt library of Johnson (2020). Although solving the M1-step requires iterative numerical optimization, both the gradient and the Jacobian of the objective function are computed in closed form.
$Q_u(\cdot)$ is not differentiable, and it introduces many stationary points into $Q(\cdot)$. The latter is related to the multiple-local-maxima behavior of $L_n(\cdot)$ seen in the proof of Proposition 1. The next proposition characterizes this behavior, which is rather central in developing the MEMR algorithm.
Proposition 3 At every local maximum of $Q_u(\cdot; \theta^{(s)})$, the uniform parameter $\theta_u = (b_1, b_2)$ coincides with a pair of distinct data points, for any $s = 1, 2, \ldots$.

The proof of Proposition 3 is given in the Appendix. The previous statement is crucial because it handles the likelihood's discontinuities caused by the uniform component. The local maxima of $Q_u(\cdot)$, and therefore the local maxima of $Q(\cdot)$ and $L_n(\cdot)$, are such that the corresponding $\theta_u$ coincides with a pair of distinct data points. In Algorithm 1, this argument is exploited to decompose the optimization of $L_n(\cdot)$ into multiple local searches where, for each pair of distinct data points (which are locally optimal for $(b_1, b_2)$), the optimization is performed on the remaining parameters $(\theta_\phi, \theta_\pi)$.
Remark 3 Proposition 3 does not depend on whether we implement the (RS) or the (LS) constraint. Algorithm 1 can be adapted to find a solution with (LS) constraints rather easily. In particular, two modifications are required: (i) the trimmed set in the initialization step, $\{x_i \in x_n \mid x_i < b_1^{(0)}\}$, is replaced with the trimmed set $\{x_i \in x_n \mid x_i > b_2^{(0)}\}$; (ii) in the M1-step, the (RS) constraint is replaced with the (LS) constraint. The proportion parameters belong to the compact set $P = [0, \pi_{\max}] \times [0, 1]^G$. The M2-step in Algorithm 1 takes care of the noise proportion constraint. The next proposition states that at any step $s$, the updating given in M2a-M2b maximizes $Q_\pi(\cdot)$ fulfilling the noise proportion constraint.
Proposition 4 Assume that Algorithm 1 is performed for steps $1, 2, \ldots, S$. M2a-M2b solve the following optimization program:

$$\max_{\theta_\pi \in P} \; Q_\pi(\theta_\pi; \theta^{(s)}), \quad \text{subject to } \pi_0 - \pi_{\max} \le 0, \;\; \pi_g \in [0, 1] \text{ for all } g = 1, 2, \ldots, G.$$

The proof of Proposition 4 is given in the Appendix. Proposition 4 establishes that the updates in M2a and M2b of Algorithm 1 implement the closed-form solutions for the constrained M-step with respect to the proportion parameters. Although the implementation of the noise proportion constraint in the MEMR algorithm is similar to the RIMLE algorithm of Coretto and Hennig (2017), there are substantial differences. Here the M-step is unconditional and based on closed-form calculations, whereas the RIMLE relies on conditional optimization, and its noise proportion update requires solving nonlinear equations iteratively. Proposition 5 ensures that, for an appropriately small $\varepsilon$ (stopping criterion), $\theta^{(r)}$ approximates a local optimum of the log-likelihood function or, at worst, a stationary point of the likelihood surface. The initialization strategy forms all possible candidates for ML estimates of the uniform parameter, each of which will qualify as a local maximum of $L_n(\cdot)$. Therefore, the last step of the algorithm selects the best of such local maxima, $\theta^{(r^*)}$, as a candidate global solution of the sample ML problem.
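A sketch of how such a closed-form constrained update can look (hypothetical helper name; `s` holds the membership probabilities row-wise, column 0 being the noise component): if the unconstrained average violates $\pi_0 \le \pi_{\max}$, the noise proportion is set to the bound and the remaining proportions are rescaled to sum to $1 - \pi_{\max}$.

```python
import numpy as np

def update_proportions(s, pi_max):
    """Constrained M-step for the proportions: maximize Q_pi over the
    simplex subject to pi_0 <= pi_max (sketch of an M2a-M2b-style update)."""
    m = s.mean(axis=0)                  # unconstrained maximizer
    if m[0] <= pi_max:
        return m
    pi = np.empty_like(m)
    pi[0] = pi_max                                   # bind the constraint
    pi[1:] = (1.0 - pi_max) * m[1:] / m[1:].sum()    # rescale the rest
    return pi

s = np.array([[0.8, 0.1, 0.1],
              [0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5]])
print(update_proportions(s, pi_max=0.3))   # noise proportion capped at 0.3
```

This is the standard KKT solution for maximizing a weighted sum of logarithms over the simplex with one upper-bound constraint.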
Remark 4 The algorithm requires several EM runs, one for each pair of distinct data points that initialize the uniform parameter. This requires too many EM runs if $n$ is large. In both the (RS) and (LS) cases, we seek models where the uniform component captures noise in the tails. Therefore, a possible strategy is not to run the EM iteration for all pairs $\{x_i^{(r)}, x_j^{(r)}\}$ of Proposition 5, but only for a subsample of the data. Random subsampling is a possibility, although in experiments it introduced additional variability. For $n > 500$, our software implementation allows switching to faster options that work as follows.
Option A. Compute $\tilde{x}_n = \{Q_n(a_k); \; k = 1, 2, \ldots, K\}$, where $Q_n(a)$ is the empirical quantile function at $a$, $\{a_k\}$ is an appropriate sequence of probabilities (e.g. $\tilde{x}_n$ are empirical percentiles), and $K < n$. Note that an empirical quantile is always an observed data point. Form all pairs $\{x_i^{(r)}, x_j^{(r)}\}$ of Proposition 5 from the data subset $\tilde{x}_n$. In this case the number of pairs used to initialize the uniform parameters depends on $K$ and not on $n$.
Option B. This is similar to Option A, but it excludes further non-tail points from $\tilde{x}_n$ depending on $\pi_{\max}$. If the (RS) version is fitted, set $\min\{a_k; \; k = 1, 2, \ldots, K\} = 1 - \pi_{\max}$. Otherwise, if the (LS) version is fitted, set $\max\{a_k; \; k = 1, 2, \ldots, K\} = \pi_{\max}$. This shortcut strategy exploits the fact that the noise level is bounded by $\pi_{\max}$, and since the separation constraints push the uniform component into the tails, it is not necessary to look for $(b_1, b_2)$ everywhere on the data range.
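A sketch of the quantile-based shortcut (hypothetical helper; the actual software implementation may differ): Option A restricts the candidate pairs to $K$ empirical quantiles, and Option B further drops non-tail probabilities using $\pi_{\max}$.

```python
import numpy as np
from itertools import combinations

def candidate_pairs(x, K, p_max=None, rs=True):
    """Candidate (b1, b2) initializations restricted to empirical quantiles.
    With p_max given (Option B), only tail probabilities are kept:
    a_k >= 1 - p_max in the (RS) case, a_k <= p_max in the (LS) case."""
    probs = np.linspace(0.0, 1.0, K)
    if p_max is not None:
        probs = probs[probs >= 1.0 - p_max] if rs else probs[probs <= p_max]
    # method="lower" guarantees each quantile is an observed data point
    q = np.unique(np.quantile(x, probs, method="lower"))
    return list(combinations(q, 2))

x = np.random.default_rng(3).normal(size=2000)
print(len(candidate_pairs(x, K=20)))               # Option A: K*(K-1)/2 pairs
print(len(candidate_pairs(x, K=20, p_max=0.25)))   # Option B: far fewer pairs
```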
We experimented extensively with samples of size $n > 250$ using both Options A and B, and we never experienced significantly detrimental effects. On the other hand, we obtained significant gains in computational efficiency.

Experimental settings
In this section, we assess the proposed ML estimator's finite sample performance, and we provide comparisons with existing methods in the literature. We consider three different data-generating mechanisms to show possible different uses of the proposal in the context of semiparametric density estimation, mixture parameters estimation, clustering, outlier identification, and robust location-scale estimation. We want to assess the methodology under three different circumstances embodied in the following three sampling designs.
MIXs. Data are generated from the following Gaussian mixture model with the uniform noise acting on the right tail:

$$\pi_0 \, U(50, 123) + (1 - \pi_0) \, \{0.6 \, N(0, 50) + 0.4 \, N(27.5, 50)\}, \qquad (11)$$

where $U(b_1, b_2)$ denotes the uniform distribution on the interval $[b_1, b_2]$, and $N(\mu_j, \sigma_j^2)$ is the Gaussian distribution with mean $\mu_j$ and variance $\sigma_j^2$ for $j \in \{1, 2\}$. Three values of the expected proportion of the uniform component are considered: $\pi_0 \in \{0.05, 0.1, 0.25\}$. The choice of the parameters is based on the following arguments. The three values of $\pi_0$ produce increasing levels of uniform noise. The separation ratio between the Gaussian components, $|\mu_1 - \mu_2|/\sigma$, is sufficient to produce a multimodal density (McLachlan and Peel 2000a). The variance is set equal for both components so that the separation ratio is easily calibrated. $\Pr\{N(\mu_2, \sigma_2^2) \le b_1\} = 0.9993$; therefore, there is a small overlap between the right-most Gaussian component and the uniform component. Given $b_1 = 50$, the value $b_2 = 123$ is fixed so that, when $\pi_0 = 10\%$, the density contribution of the second Gaussian component in (11) at its 99%-quantile matches the density contribution of the uniform noise. This setting is challenging because it makes it difficult to distinguish the normal tails from the uniform component. The MIXs data generating process fulfills the model assumptions of the proposed method, producing points that are ''clearly'' clustered, plus an elongated right tail.
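The stated overlap between the right-most Gaussian component and the uniform support is easy to verify with SciPy:

```python
from math import sqrt
from scipy.stats import norm

# Mass of the right-most MIXs component N(27.5, 50) below b1 = 50,
# i.e. the probability that it does not reach the uniform support.
p = norm.cdf(50.0, loc=27.5, scale=sqrt(50.0))
print(round(p, 4))   # 0.9993
```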
MIXo. This is a variation of MIXs that reduces the separation between the Gaussian components by decreasing $\mu_2$. The data generating model is

$$\pi_0 \, U(50, 123) + (1 - \pi_0) \, \{0.6 \, N(0, 50) + 0.4 \, N(17.5, 50)\}.$$

Now the separation $|\mu_1 - \mu_2|/\sigma$ is roughly 2.5, producing a strong overlap that leads to unimodality. Note that now $\Pr\{N(\mu_2, \sigma_2^2) \le b_1\} \approx 1$, so in practice the uniform component and the right-most Gaussian distribution do not overlap. With the MIXo sampling design we again have a data generating process fulfilling the model assumptions, but it does not have a multimodal distribution and it does not produce well-clustered points.
ASY. Points are sampled from a non-Gaussian mixture distribution built around the $\chi^2(m)$ distribution, where $\chi^2(m)$ denotes the $\chi^2$ distribution with $m$ degrees of freedom. The aim of the ASY experiments differs from that of the previous MIX cases. The ASY model has a skewed density, and its right tail is much heavier than what we would expect under a $\chi^2(3)$. The model assumptions of the proposed method are not fulfilled here: the core of the non-tail part of the sampling mechanism is neither Gaussian nor symmetric. For the ASY model we also consider $\pi_0 \in \{0.05, 0.1, 0.25\}$. For each of the three sampling designs, we experiment with sample sizes $n \in \{50, 100, 1000\}$. The proposed method is compared against alternative methods depending on the investigated aspect. To ease the presentation, each method is labeled as follows.

TRIMADSE: location-scale estimation based on the trimean and MAD estimators. A further benchmark performs location-scale estimation based on the sample mean and the standard deviation.
Software for the GM, GUM, and GRUM methods has been specifically implemented to manage all their common elements in exactly the same way. For example, in Coretto and Hennig (2011) the GUM is implemented with a different class of scale constraints; here it is implemented with the same scale constraints proposed for the GRUM method. Moreover, the work of Coretto and Hennig (2011) did not consider the noise proportion constraint, which here we implement as for the GRUM method. Initialization for all mixture-based methods is performed in the same manner as for the GRUM method, ignoring the uniform component's initialization where it does not apply. Throughout the experiments, the hyperparameter settings are: $\delta = 0.5$, $c = z_{0.975}$ (the 97.5%-quantile of the standard normal distribution), and $\pi_{\max} = 50\%$. For sample sizes $n > 500$, we perform the GRUM computations using the fast Option B explained in Remark 4. For the GUM computations, we consider Option A because the method does not constrain the uniform component to catch the tails of the data distribution. For each combination of sampling design, $n$, and $\pi_0$, we perform 1000 Monte Carlo replicates.

Experimental results
This study investigates parameter estimation, model selection and density fit, outlier identification, clustering, and robust location-scale estimation.
Parameter estimation. The comparison only involves the MIXs and MIXo designs and the GUM and GRUM methods, whose parameters are consistent with the ground truth. The number of Gaussian mixture components is fixed at G = 2, and true parameters are compared with estimated parameters in terms of Mean Squared Error (MSE). Since the three classes of parameters (proportions, locations, scales) play different roles, the MSE is given for the sub-vectors of parameters p = (p₀, p₁, p₂), μ = (μ₁, μ₂), σ = (σ₁, σ₂), b = (b₁, b₂). Results are shown in Fig. 2. The estimation of the mean parameters is satisfactory for both methods. However, GRUM outperforms GUM in the estimation of the scale and uniform parameters. The latter is because the GUM method often fails to identify b₁, and it tends to confuse the tail of the right-most Gaussian with the uniform component. The GUM bias is larger in the case p₀ = 10% because, as explained before, this is the case where, by construction, it is hardest for a method to distinguish between the tails of the non-uniform components and the uniform component.
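The per-sub-vector MSE can be computed as the Monte Carlo average of the squared Euclidean error, as sketched below; `mu_true` and `mu_hat` are toy values, and any label switching between components is assumed to have been resolved beforehand.

```python
import numpy as np

def mse_subvector(estimates, truth):
    """Monte Carlo MSE for a sub-vector of parameters: the average over
    replicates of the squared Euclidean distance from the truth.
    `estimates` has shape (n_replicates, dim); component labels are
    assumed already matched to the truth (no label switching)."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean(np.sum((estimates - np.asarray(truth)) ** 2, axis=1)))

# Toy example for the location sub-vector mu = (mu1, mu2)
mu_true = np.array([0.0, 5.0])
mu_hat = np.array([[0.1, 4.9],
                   [-0.2, 5.1]])
mse_mu = mse_subvector(mu_hat, mu_true)
```

The same function applies unchanged to the proportion, scale, and uniform-endpoint sub-vectors.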
Model selection and density estimation. With a non-fixed G, the model can fit the data distribution in a semiparametric fashion. Information criteria are popular methods for performing model selection in the mixture framework (McLachlan and Peel 2000a). The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are not specific to the mixture context (see Konishi and Kitagawa 2008). In this work, we also consider the Integrated Completed Likelihood criterion (ICL) of Biernacki et al. (2000), which is specifically designed for selecting G when the mixture model is fitted to recover clusters. Whenever the AIC is reported in this work, it is corrected for small-sample bias; in other words, we computed what is known in the literature as the AICc (see Konishi and Kitagawa 2008). All the mixture-based methods under comparison have been fitted for G ∈ {1, 2, ..., 10}, and the three information criteria are used to select G. We then compared the fitted density against the true underlying density in terms of Mean Integrated Squared Error (MISE). The computation of the MISE requires numerical integration, which has been performed using stratified Monte Carlo integration with the strata adaptively chosen to achieve an integration error never larger than 10⁻⁴.
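For concreteness, the standard small-sample correction is AICc = AIC + 2k(k+1)/(n−k−1), with k free parameters and n observations. The MISE sketch below uses plain (unstratified) Monte Carlo integration, which is simpler than the adaptive stratified scheme the paper uses:

```python
import numpy as np

def aicc(loglik, k, n):
    """AICc: the AIC plus the small-sample correction 2k(k+1)/(n-k-1)
    (see Konishi and Kitagawa 2008)."""
    aic = -2.0 * loglik + 2.0 * k
    return aic + 2.0 * k * (k + 1) / (n - k - 1)

def mise_mc(f_hat, f_true, low, high, m=100_000, seed=0):
    """Plain Monte Carlo approximation of the integrated squared error
    of a fitted density on [low, high]. The paper uses stratified MC
    with adaptively chosen strata; this sketch does not reproduce that."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(low, high, size=m)
    return float((high - low) * np.mean((f_hat(u) - f_true(u)) ** 2))
```

Averaging `mise_mc` over Monte Carlo replicates of the fit gives the MISE reported in Fig. 3.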
In terms of density estimation, the AIC selection performed best, although the BIC was fairly close. The latter was expected since it is known that the AIC is not generally consistent in model selection but provides better fits of the data distribution in a nonparametric sense (Burnham and Anderson 2002). In Fig. 3 we report the MISE performance, while in Fig. 4 we report results about the AIC selection. For both the MIX and ASY designs, all mixture methods outperformed the nonparametric kernel density estimator with asymptotically optimal bandwidth selection (KERN). For large n, all mixture-based methods perform remarkably well. For smaller n, SGM does marginally better overall. It is interesting to note that the AIC selects the model so adaptively that the noise level does not affect the performance; the sample size drives the MISE. As expected for the ASY case, SGM does particularly better than the other mixture model-based competitors, although the gap almost disappears for n = 1000. Figure 4 shows that SGM, GUM, and GRUM need fewer components than GM to fit the data. The latter is because the elongated tails can be captured by the uniform component in GUM and GRUM and by the skewness parameters in the SGM method.
Outlier identification. True outliers are defined as points sampled from the uniform distribution. Both GUM and GRUM can be used to detect outliers by applying the assignment rule (3) using the estimated θ with G selected based on one of the three information criteria. Points assigned to the uniform noise component are flagged as ''positives'', i.e., outliers. While there are many effective methods to identify outliers in the elliptical-symmetric family framework, the world of asymmetric contamination is less explored due to the intrinsic difficulty of defining outliers in such circumstances. We compare GUM and GRUM with the ''adjusted outlyingness'' (AO) method developed by Brys et al. (2005) and Hubert and Vandervieren (2008), which extends the Stahel-Donoho outlyingness towards skewed distributions (see Stahel 1981). For both GUM and GRUM, G needs to be fixed. The idea is that the method should be able to distinguish between a majority of points having a more compact shape and a minority of points (the outliers) not consistent with the main mass of the data. Because the AIC showed better behavior in recovering the underlying density, in this case we fix G based on the AIC.
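A minimal sketch of the flagging step, assuming assignment rule (3) is the usual maximum-posterior rule over the uniform component and the G Gaussian components (the exact rule is stated earlier in the paper):

```python
import numpy as np

def _phi(x, mu, sigma):
    # Gaussian density N(mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def flag_outliers(x, p, mu, sigma, b1, b2):
    """Flag x_i as an outlier when the uniform noise component attains
    the largest proportion-weighted density (maximum-posterior rule).
    p = (p0, p1, ..., pG), with p0 the noise proportion and [b1, b2]
    the uniform support."""
    x = np.asarray(x, dtype=float)
    dens = [p[0] * ((x >= b1) & (x <= b2)) / (b2 - b1)]   # noise component
    dens += [pj * _phi(x, mj, sj) for pj, mj, sj in zip(p[1:], mu, sigma)]
    return np.vstack(dens).argmax(axis=0) == 0            # True -> flagged
```

In practice the fitted parameters (with G chosen by the AIC, as in the experiments) would replace the ones supplied here.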
In Fig. 5 we report the True Negative Rate (TNR, also known as specificity), the True Positive Rate (TPR, also known as sensitivity), and the Precision. Recall that Precision = 1 − FDR, where FDR is the False Discovery Rate. The proposed method outperforms its competitors in all situations for each of the three metrics. AO is the second best, although its performance is far from that of GRUM, even in the ASY case, where it is expected to be at its best. For different reasons, both AO and GUM cannot detect the location where the uniform component takes over. Again, comparing the performance metrics in each case, the GUM tends to confuse the Gaussian tails with the uniform contribution.
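The three metrics follow directly from the confusion counts between flagged points and true outliers:

```python
import numpy as np

def detection_metrics(flagged, true_outlier):
    """TNR (specificity), TPR (sensitivity), and Precision = 1 - FDR,
    computed from boolean vectors of flagged points ("positives")
    and true outlier labels."""
    flagged = np.asarray(flagged, dtype=bool)
    true_outlier = np.asarray(true_outlier, dtype=bool)
    tp = np.sum(flagged & true_outlier)
    tn = np.sum(~flagged & ~true_outlier)
    fp = np.sum(flagged & ~true_outlier)
    fn = np.sum(~flagged & true_outlier)
    tnr = tn / (tn + fp)
    tpr = tp / (tp + fn)
    precision = tp / (tp + fp)
    return tnr, tpr, precision
```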
Clustering. We assess clustering on the MIXs and MIXo designs; the ASY sampling does not produce clustered points. Evaluation of the clustering performance requires the definition of a ''true'' partition. The true cluster label of a point is defined as the mixture component from which the point is drawn. To compute misclassification rates (expected 0-1 loss) for the mixture-based methods, we only consider the fit with G fixed at the true G = 2. Only GUM and GRUM are considered because the other mixture-based methods do not have a noise component. In Fig. 6 we report the Misclassification Rates [%] of GRUM and GUM compared to the Optimal Bayes Classifier (OBC). The OBC is the classifier that one would build if one knew the true model parameters; in other words, the OBC is obtained using the assignment rule (3) with the true generating mixture parameters. This implies that the MCR obtained for the OBC in Fig. 6 represents the optimal Bayes risk, which, under the usual 0-1 loss, gives the best possible MCR. Comparing with the OBC is particularly convenient in situations with strong overlap, as for MIXo. GRUM outperforms GUM for smaller sample sizes, and its performance is closer to the optimal error rate corresponding to the OBC. The performance gap between GUM and GRUM is clearer in the case of overlapping clusters (MIXo). If one wants to estimate the number of clusters as the number of non-noise mixture components G, the ICL and the BIC probably do the best job. The ICL performed slightly better, and for brevity we only report its performance in Fig. 7. In the separated case (MIXs), G = 2 clusters have a clear intuition because of the separation; in this case, GRUM and GUM often pick G = 2, with GRUM showing an overall advantage over its competitor. In the MIXo case, the clustering is not obvious, and in fact the ICL tends to choose G = 1 for GRUM and GUM, while it chooses G = 2 for the remaining methods.
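The OBC's optimal error can be approximated by Monte Carlo: draw labeled data from the true model and classify with the rule evaluated at the true parameters. In the sketch below, `sampler` and `obc` are hypothetical callables standing in for the true sampling design and assignment rule (3):

```python
import numpy as np

def mcr(pred, truth):
    """Misclassification rate under 0-1 loss (labels already aligned)."""
    return float(np.mean(np.asarray(pred) != np.asarray(truth)))

def bayes_risk_mc(sampler, obc, n=100_000, seed=0):
    """Monte Carlo estimate of the optimal Bayes risk: sample labeled
    data (x, y) from the true model and apply the classifier built on
    the true parameters. `sampler(rng, n)` and `obc(x)` are
    user-supplied callables, named here for illustration only."""
    rng = np.random.default_rng(seed)
    x, y = sampler(rng, n)
    return mcr(obc(x), y)
```

For mixture-based fits, the same `mcr` function scores the estimated partition against the true component labels.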
From the clustering perspective, G = 1 is probably a more sensible choice for MIXo.
Robust location-scale estimation. In a non-clustered population like the ASY model, one could be interested in robustly estimating the location and the scale of the uncontaminated data. For the ASY model, the true location and scale are defined as the mean and the standard deviation of the χ²(3) component. Here we compare the mixture-based methods GUM and GRUM with AO, TRIMAD, and SE. In Fig. 8 we report the MSE for the location and scale parameters.
For GUM, GRUM, and AO, we define robust location-scale estimates as the weighted mean and standard deviation pair

m̂ = (1/W) Σᵢ w(xᵢ) xᵢ,   ŝ = [(1/W) Σᵢ w(xᵢ)(xᵢ − m̂)²]^(1/2),   where W = Σᵢ₌₁ⁿ w(xᵢ).

For GUM and GRUM, xᵢ is weighted by the estimated posterior probability that xᵢ does not belong to the uniform noise component, i.e., w(xᵢ; θ) = 1 − s₀(xᵢ; θ); these weights are smooth. For the AO method, w(xᵢ) = 1 if xᵢ is not flagged as an outlier, and w(xᵢ) = 0 otherwise. GRUM and GUM require, again, a decision about how to select G. In this case, the underlying model is used to fit the data distribution and not to find groups in the data; therefore, the GRUM and GUM parameters are fitted based on the AIC selection. TRIMAD consists of Tukey's Trimean and the MAD about the Trimean. Tukey's Trimean is a particularly effective measure of location in situations deviating from symmetry. The Trimean and the MAD are very easy to compute, and they are robust routine alternatives to the sample estimators (SE). The SE estimates are included as non-robust benchmarks. There is a huge catalog of robust location-scale estimators for one-dimensional data, and considering all of them is outside the scope of this paper. GRUM and TRIMAD report the best performance, although, for larger contamination rates, GRUM does better. Note that the nominal breakdown point of the Trimean is 25%; therefore, in finite samples, we would expect a deterioration of the performance even before p₀ hits 25%. The problematic performance of GUM and AO was expected based on the outlier detection performance documented in Fig. 5.
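The weighted estimates above amount to the following, where the weights are either the smooth posterior weights 1 − s₀(xᵢ; θ) (GUM, GRUM) or hard 0/1 outlier flags (AO):

```python
import numpy as np

def weighted_location_scale(x, w):
    """Weighted mean and standard deviation with weights w_i and
    total weight W = sum_i w_i. For GUM/GRUM, w_i is the posterior
    probability of NOT belonging to the uniform noise component;
    for AO, w_i is 1 for non-flagged points and 0 otherwise."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    W = w.sum()
    loc = np.sum(w * x) / W
    scale = np.sqrt(np.sum(w * (x - loc) ** 2) / W)
    return loc, scale
```

With hard weights this reduces to the sample mean and standard deviation of the non-flagged points; smooth weights instead downweight borderline observations gradually.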

Applications to real data sets
This section presents an application of the GRUM method to the three data sets introduced in Sect. 1. We considered G = 1, 2, ..., 10, and the corresponding AIC and BIC values are shown in Fig. 9. GRUM's hyperparameters are set as in the numerical experiments of the previous section. In Fig. 9, the information criteria have been individually rescaled onto the interval [0, 100] in order to ease the comparison.
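The individual rescaling used in Fig. 9 is presumably a min-max transformation of each criterion's values over the candidate G; a sketch:

```python
import numpy as np

def rescale_0_100(values):
    """Min-max rescale a vector of information-criterion values onto
    [0, 100], applied separately to AIC, BIC, and ICL as in Fig. 9.
    The minimum maps to 0 and the maximum to 100."""
    v = np.asarray(values, dtype=float)
    return 100.0 * (v - v.min()) / (v.max() - v.min())
```

The rescaling preserves the ordering of the fitted models, so the selected G is unchanged; only the curves become visually comparable across criteria.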
Overall, the ordering of the fitted models provided by the AIC and the BIC agrees, with some exceptions. The ICL behaves differently, pursuing a more parsimonious representation of the data distribution. The latter confirms the tendency of the ICL to detect clustered regions rather than to pursue the distribution fit targeted by the AIC and the BIC. The BIC has the most monotonic behavior, whereas the ICL is at the opposite end. In the non-uniform part of the data distribution there seems to be some skewness rather than a clustering structure. The fact that GRUM fits a p₀ > 0 is an indication that the tail behavior of the data distribution can be distinguished from the shape of the main part of the distribution. The ICL selects G = 1, completely merging the two Gaussian components fitted by both AIC and BIC. The interesting thing here is that the set of points assigned to the uniform component remains unchanged under all three information criteria.
The density fitted by both AIC and BIC is not dramatically different from that produced by the kernel method shown in Panel (a) of Fig. 1. However, GRUM provides a smoother approximation, although, looking carefully at the density plot, there is a discontinuity at the estimated b₁.
pH of blood data. In Fig. 9, all three information criteria select G = 1. GRUM assigns 15.114% of the observed data to the uniform noise component. The latter can be seen from the density estimate in Panel (b) of Fig. 10.
The GRUM mean estimate of the pH of blood concentration is 38.71, while the sample mean is 41.55. GRUM estimates a pH of blood variance of 9.3, while the sample variance is 74.54. Further checks showed that the GRUM method does not agree with some of the best-performing robust univariate location-scale estimators, for example the optimal M-estimator computed at 95% efficiency.

Realized volatility data. The AIC and the BIC both agree on G = 2 plus a uniform noise component representing a small group of 3 data points (estimated p₀ = 2.329%); see Panel (c) in Fig. 10. The effect of these 3 data points on the optimal bandwidth estimation for the kernel density is enormous: to see this, compare Panel (c) in Fig. 10 with Panel (c) in Fig. 1. With the G = 2 selected by AIC and BIC, the GRUM method fits two normal components of almost equal size (p₁, p₂) = (57.309%, 40.362%), with means (μ₁, μ₂) = (0.00092, 0.00255) and variances (σ₁², σ₂²) = (1.13×10⁻⁷, 1.04×10⁻⁶). As explained before, GRUM's δ hyperparameter is always set to δ = 0.5 for all computations in this paper. For this data set, the scale lower bound in terms of the minimum variance would be 5.4×10⁻²⁰; hence, none of the fitted mixture components hits the border of the parameter space. With G = 2, we obtain a multimodal density that may correspond to low- vs high-volatility clusters of stocks in the data set. In this case, the two identified volatility clusters may conveniently be used to transform a continuous notion (volatility) into two easy-to-interpret categories that are particularly useful for risk analysts. In fact, in risk analysis it is not the absolute level of volatility that matters, but how the volatility of an asset compares with the cross-sectional distribution at some time point.
The three outliers correspond to three exceptionally risky assets traded on 28/Aug/1998: ''Nationstar Mortgage HO'', ''Micron Technology'', and ''Intuit Inc''.
The ICL chooses G = 1, merging the two Gaussian components found by both AIC and BIC. The set of identified outlying data points remains almost unchanged, except for one data point. We note that the ICL has a more ''unstable'' behavior (Fig. 11).

Conclusions and final remarks
In this paper, we study a mixture model where atypical observations in one of the two tails are represented by a uniform noise component whose separation from the main Gaussian components is controlled by the user. The proposed method can be used for several purposes: cluster analysis, outlier identification, robust location-scale estimation in unclustered populations, and semiparametric density estimation. As with most flexible tools (e.g., robust estimators, nonparametric density estimators, etc.), the GRUM method requires hyperparameter settings, and these hyperparameters can be easily interpreted.
We showed theoretical results, and we supported the analysis with both artificial-data experiments and real-data applications. The extension of the methods studied here to the multidimensional setting is difficult both theoretically and computationally. High-dimensional analysis plays a crucial role in modern applications, but the proposed method is a valid tool for the many one-dimensional analyses that we continue to perform routinely.

Appendix: proofs

or (b₁, b₂) is such that b₂ − b₁ = √12 · s_min (the scale constraint binds) and [b₁, b₂] contains at least one data point.
All possible values of the uniform parameters for which the corresponding θ is a candidate for a local maximum can be obtained from the previous argument. This leads to only finitely many possible values of L_n(θ). Consider the vector t(b₁, b₂), with (b₁, b₂) fixed to be one of the finitely many local maxima described in Step 3. Note that t(b₁, b₂) is contained in a compact set because of Step 1. L_n(t(b₁, b₂)) is continuous with respect to t(b₁, b₂) for every choice of (b₁, b₂); moreover, since t(b₁, b₂) is contained in a compact set, L_n(t(b₁, b₂)) has a well-defined maximum. Applying this argument to all of the finitely many possible choices of (b₁, b₂), we can find all possible local maxima of L_n(θ) on Θ_n, and hence among these we obtain the global maximum.
Proof of Theorem 2. The present framework is consistent with the setup of Tanaka and Takemura (2006). For brevity, we refer to Assumptions 1-4 in Tanaka and Takemura (2006) as TT1-TT4. It suffices to show that TT1-TT4 are satisfied in the case considered in this paper. First, notice that both the Gaussian density and the uniform density have tails of order smaller than o(|x|⁻q) for q > 1; therefore TT1 is satisfied. Also, the Gaussian density and the uniform density can be bounded above on Θ; therefore TT2 is fulfilled. TT4 is also trivially satisfied because it can easily be seen that ∫ |log f(x; θ₀)| f(x; θ₀) dx < ∞. TT3 requires some specific care. TT3 is fulfilled if, for any sequence {θ⁽ᵐ⁾} taken in a compact subset A ⊆ Θ and any θ₀ ∈ A such that θ⁽ᵐ⁾ → θ₀, the limit f(x; θ⁽ᵐ⁾) → f(x; θ₀) holds except perhaps on a set E, which may depend on θ₀ and which has zero Lebesgue measure. Since f(·) is a linear combination, it suffices to check the condition on the summands of f(·). φ(·) is continuous with respect to both x and the parameters; however, the uniform density is discontinuous with respect to the parameters for a given x. If x ≠ b₁⁰ and x ≠ b₂⁰, TT3 holds because 1_[b₁⁽ᵐ⁾, b₂⁽ᵐ⁾](x) → 1_[b₁⁰, b₂⁰](x). However, the latter is not true on the set E = {b₁⁰, b₂⁰}. Hence, TT3 holds except for x ∈ E, which has zero Lebesgue measure. Since TT1-TT4 are fulfilled, Theorem 2 in Tanaka and Takemura (2006) applies, which proves the result.
Proof of Theorem 3. The proof follows the proof of Theorem 4 in Coretto and Hennig (2011), taking their q = 1 and their g(·) = φ(·).
Proof of Theorem 4. For now, we ignore the third constraint; we will show that any solution to (10) without the third constraint automatically fulfills it. The objective function in (10) is strictly concave and the equality constraint is linear. The Karush-Kuhn-Tucker (KKT) conditions are necessary for a globally optimal solution (see Bertsekas 1999). Such a solution will be a stationary point of the Lagrangian function