# Flexible Distributions as an Approach to Robustness: The Skew-t Case

Conference paper

## Abstract

The use of flexible distributions with adaptive tails as a route to robustness has a long tradition. Recent developments in distribution theory, especially of non-symmetric form, provide additional tools for this purpose. We discuss merits and limitations of this approach to robustness as compared with classical methodology. Operationally, we adopt the skew-t as the working family of distributions used to implement this line of thinking.

## Keywords

Base density · Kullback–Leibler divergence · Outlying observation · Pearson system · Tail weight

## 1 Flexible Distributions and Adaptive Tails

### 1.1 Some Early Proposals

The study of parametric families of distributions with high degree of flexibility, suitable to fit a wide range of shapes of empirical distributions, has a long-standing tradition in statistics; for brevity, we shall refer to this context with the phrase ‘flexible distributions’. An archetypal exemplification is provided by the Pearson system with its 12 types of distributions, but many others could be mentioned.

Recall that, for non-transition families of the Pearson system as well as in various other formulations, a specific distribution is identified by four parameters. This allows us to regulate separately from each other four qualitative aspects of a distribution, namely location, scale, slant and tail weight. In the context of robust methods, the appealing aspect of flexibility is represented by the possibility of regulating the tail weight of a continuous distribution to accommodate outlying observations.

When a continuous variable of interest spans the whole real line, an interesting distribution is the one with density function
\begin{aligned} c_\nu \,\exp \left( -\frac{|x|^\nu }{\nu }\right) , \qquad \quad x\in \mathbb {R}\,, \end{aligned}
(1)
where $$\nu >0$$ and the normalizing constant is $$c_\nu =\{2\,\nu ^{1/\nu }\,\varGamma (1+1/\nu )\}^{-1}$$. Here the parameter $$\nu$$ manoeuvres the tail weight in the sense that $$\nu =2$$ corresponds to the normal distribution, $$0<\nu <2$$ produces tails heavier than the normal ones, $$\nu >2$$ produces lighter tails. The original expression of the density put forward by Subbotin (1923) was set in a different parameterization, but this does not affect our discussion.
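As a minimal sketch (ours, in Python, not part of the original paper), density (1) can be coded directly; at $$\nu =2$$ it reduces to the standard normal density, and at $$\nu =1$$ to the Laplace density.

```python
import math

def subbotin_pdf(x, nu):
    """Subbotin density (1): c_nu * exp(-|x|^nu / nu) on the real line."""
    c_nu = 1.0 / (2.0 * nu ** (1.0 / nu) * math.gamma(1.0 + 1.0 / nu))
    return c_nu * math.exp(-abs(x) ** nu / nu)
```

For $$\nu =2$$ the constant is $$c_2=\{2\sqrt{2}\,\varGamma (3/2)\}^{-1}=(2\pi )^{-1/2}$$, matching the normal normalizing constant.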

This flexibility of tail weight provides the motivation for Box and Tiao (1962), Box and Tiao (1973, Sect. 3.2.1), within a Bayesian framework, to adopt Subbotin's family of distributions, complemented with a location parameter $$\mu$$ and a scale parameter $$\sigma$$, as the parametric reference family, allowing for departures from normality in the tail behaviour. This logic provides a form of robustness in inference on the parameters of interest, namely $$\mu$$ and $$\sigma$$, since the tail weight parameter adjusts itself to non-normality of the data. Strictly speaking, they consider only a subset of the whole family (1), since the role of $$\nu$$ is played by the non-normality parameter $$\beta \in (-1,1]$$ whose range corresponds to $$\nu \in [1,\infty )$$, with $$\beta =0$$ corresponding to $$\nu =2$$.
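For the record, if Box and Tiao's density is written with exponent $$2/(1+\beta )$$ for $$|x|$$, the correspondence with the parameterization of (1) is presumably
\begin{aligned} \nu = \frac{2}{1+\beta }\,, \end{aligned}
so that $$\beta =1$$ gives $$\nu =1$$, $$\beta =0$$ gives $$\nu =2$$, and $$\beta \rightarrow -1$$ gives $$\nu \rightarrow \infty$$, consistent with the ranges just stated.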

Another formulation with a similar, and even more explicit, logic is the one of Lange et al. (1989). They work in a multivariate context and the error probability distribution is taken to be the Student’s t distribution, where the tail weight parameter $$\nu$$ is constituted by the degrees of freedom. Again the basic distribution is complemented by a location and a scale parameter, which are now represented by a vector $$\mu$$ and a symmetric positive-definite matrix, possibly parametrized by some lower dimensional parameter, say $$\omega$$. Robustness of maximum likelihood estimates (MLEs) of the parameters of interest, $$\mu$$ and $$\omega$$, occurs “in the sense that outlying cases with large Mahalanobis distances [...] are downweighted”, as visible from consideration of the likelihood equations.

The Student’s t family allows departures from normality in the form of heavier tails, but does not allow lighter tails. However, in a robustness context, this is commonly perceived as a minor limitation, while there is the important advantage of closure of the family of distributions with respect to marginalization, a property which does not hold for the multivariate version of Subbotin’s distribution (Kano 1994).

The present paper proceeds in a similar conceptual framework, with two main aims: (a) to bring into consideration more recent and general proposals of parametric families, (b) to discuss advantages and disadvantages of this approach compared to canonical methods of robustness. For simplicity of presentation, we shall confine our discussion almost entirely to the univariate context, but the same logic carries over to the multivariate case.

### 1.2 Flexibility via Perturbation of Symmetry

In more recent years, much work has been devoted to the construction of highly flexible families of distributions generated by applying a perturbation factor to a ‘base’ symmetric density. More specifically, in the univariate case, a density $$f_0$$ symmetric about 0 can be modulated to generate a new density
\begin{aligned} f(x) = 2\,f_0(x) \,G_0\{w(x)\}, \qquad \quad x\in \mathbb {R}, \end{aligned}
(2)
for any odd function w(x) and any continuous distribution function $$G_0$$ having density symmetric about 0. By varying the ingredients w and $$G_0$$, a base density $$f_0$$ can give rise to a multitude of new densities f, typically asymmetric but also of more varied shapes. A recent comprehensive account of this formulation, inclusive of its multivariate version, is provided by Azzalini and Capitanio (2014).
One use of mechanism (2) is to introduce asymmetric versions of the Subbotin and Student’s t distributions via the modulation factor $$G_0\{w(x)\}$$. Consider specifically the case when the base density is taken to be the Student’s t on $$\nu$$ degrees of freedom, that is,
\begin{aligned} t(x; \nu ) = \frac{\varGamma ((\nu +1)/2)}{\sqrt{\pi \,\nu }\,\varGamma (\nu /2)} \,\left( 1+\frac{x^2}{\nu }\right) ^{-(\nu +1)/2}, \qquad \quad x\in \mathbb {R}. \end{aligned}
(3)
In principle, the choice of the factor $$G_0\{w(x)\}$$ is bewilderingly wide, but there are reasons for focusing on the density, denoted skew-t (ST for short),
\begin{aligned} t(x;\alpha , \nu ) = 2\, t(x; \nu )\, T\left( \alpha \,x\sqrt{\frac{\nu +1}{\nu +x^2}}; \nu +1\right) , \end{aligned}
(4)
where $$T(\cdot ;\rho )$$ represents the distribution function of a t variate with $$\rho$$ degrees of freedom and $$\alpha \in \mathbb {R}$$ is a parameter which regulates slant; $$\alpha =0$$ gives back the original Student’s t. Density (4) is displayed in Fig. 1 for a few values of $$\nu$$ and $$\alpha$$.
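In code, density (4) is essentially a one-liner given Student's t routines; the following Python sketch (our own, assuming scipy is available) mirrors (4) term by term.

```python
import math
from scipy import stats

def st_pdf(x, alpha, nu):
    """Skew-t density (4): 2 * t(x; nu) * T(alpha*x*sqrt((nu+1)/(nu+x^2)); nu+1)."""
    w = alpha * x * math.sqrt((nu + 1.0) / (nu + x * x))
    return 2.0 * stats.t.pdf(x, df=nu) * stats.t.cdf(w, df=nu + 1.0)
```

Setting $$\alpha =0$$ makes the final factor equal to 1/2 and gives back the symmetric density (3).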
We indicate only one of the reasons leading to the apparently peculiar final factor of (4). Start from a continuous random variable $$Z_0$$ of skew-normal type, that is, with density function
\begin{aligned} \varphi (x;\alpha ) = 2\, \varphi (x)\,\varPhi (\alpha \,x), \qquad \quad x\in \mathbb {R} \end{aligned}
(5)
where $$\varphi$$ and $$\varPhi$$ denote the $$\mathrm {N}(0,1)$$ density and distribution function. An overview of this distribution is provided in Chap. 2 of Azzalini and Capitanio (2014). Consider further $$V\sim \chi ^2_\nu /\nu$$, independent of $$Z_0$$, and the transformation $$Z=Z_0/\sqrt{V}$$, traditionally applied with $$Z_0\sim \mathrm {N}(0,1)$$ to obtain the classical t distribution (3). On assuming instead that $$Z_0$$ is of type (5), it can be shown that Z has distribution (4).
For practical work, we introduce location and scale parameters via the transformation $$Y=\xi +\omega \,Z$$, leading to a distribution with parameters $$(\xi , \omega , \alpha , \nu )$$; in this case we write
\begin{aligned} Y \sim \mathrm {ST}(\xi , \omega ^2, \alpha , \nu )\,. \end{aligned}
(6)
Because of asymmetry of Z, here $$\xi$$ does not coincide with the mean value $$\mu$$; similarly, $$\omega$$ does not equal the standard deviation $$\sigma$$. Actually, a certain moment exists only if $$\nu$$ exceeds the order of that moment, like for an ordinary t distribution. Provided $$\nu >4$$, there are known expressions connecting $$(\xi , \omega , \alpha , \nu )$$ with $$(\mu ,\sigma , \gamma _1, \gamma _2)$$, where the last two elements denote the third and fourth standardized cumulants, commonly taken to be the measures of skewness and excess kurtosis. Inspection of these measures indicates a wide flexibility of the distribution as the parameters vary; notice however that the distribution can be employed also with $$\nu \le 4$$, and actually low values of $$\nu$$ represent an interesting situation for applications. Mathematical details omitted here and additional information on the ST distribution are provided in Sects. 4.3 and 4.4 of Azzalini and Capitanio (2014).
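The first two of these known expressions can be stated explicitly: with $$\delta =\alpha /\sqrt{1+\alpha ^2}$$ and $$b_\nu =\sqrt{\nu /\pi }\;\varGamma ((\nu -1)/2)/\varGamma (\nu /2)$$, the mean and standard deviation of (6) are $$\mu =\xi +\omega \,b_\nu \delta$$ (for $$\nu >1$$) and $$\sigma =\omega \,\{\nu /(\nu -2)-(b_\nu \delta )^2\}^{1/2}$$ (for $$\nu >2$$). A small Python helper (ours, not from the paper):

```python
import math

def st_mean_sd(xi, om, alpha, nu):
    """Mean and standard deviation of ST(xi, om^2, alpha, nu); requires nu > 2."""
    delta = alpha / math.sqrt(1.0 + alpha ** 2)
    b_nu = math.sqrt(nu / math.pi) * math.gamma((nu - 1.0) / 2.0) / math.gamma(nu / 2.0)
    mu_z = b_nu * delta                   # mean of the standardized variable Z
    var_z = nu / (nu - 2.0) - mu_z ** 2   # variance of Z
    return xi + om * mu_z, om * math.sqrt(var_z)
```

With $$\alpha =0$$ this reduces to the familiar Student's t values, mean $$\xi$$ and standard deviation $$\omega \sqrt{\nu /(\nu -2)}$$.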

Clearly, expression (2) can also be employed with other base distributions, and another such option is distribution (1), as expounded in Sect. 4.2 of Azzalini and Capitanio (2014). We do not dwell in this direction because (i) conceptually the underlying logical frame is the same as for the ST distribution and (ii) there is a mild preference for the ST proposal. One of the reasons for this preference is similar to the one indicated near the end of Sect. 1.1 in favour of the symmetric t distribution, which is closed under marginalization in the multivariate case; this fact carries over to the ST distribution. Azzalini and Genton (2008) and Sect. 4.3.2 of Azzalini and Capitanio (2014) provide a more extensive discussion of this issue, including additional arguments.

To avoid confusion, the reader must be aware of the existence of other distributions named skew-t in the literature. The one considered here was, presumably, the first construction with this name. The original expression of the density by Branco and Dey (2001) appeared different, since it was stated in an integral form, but it was subsequently proved by Azzalini and Capitanio (2003) to be equivalent to (4).

The high flexibility of these distributions, specifically the possibility to regulate their tail weight combined with asymmetry, supports their use in the same logic of the papers recalled in Sect. 1.1. Azzalini (1986) has motivated the introduction of asymmetric versions of Subbotin distribution precisely by robustness considerations, although this idea has not been complemented by numerical exploration. Azzalini and Genton (2008) have worked in a similar logic, but focusing mainly on the ST distribution as the working reference distribution; more details are given in Sect. 3.4.

To give a first perception of the sort of outcome to be expected, let us consider a very classical benchmark of robustness methodology, perhaps the most classical: the ‘stack loss’ data. We use the data following the same scheme as many existing publications, by fitting a linear regression model with the three available explanatory variables plus intercept to the response variable y, i.e. the stack loss, and examining the discrepancy between observed and fitted values along the $$n=21$$ data points. A simple measure of the achieved goodness of fit is represented by the total absolute deviation
$$Q = \sum _{i=1}^{n} |y_i - \hat{y}_i|,$$
where $$y_i$$ denotes the ith observation of the response variable and $$\hat{y}_i$$ is the corresponding fitted value produced by any candidate method. The methods considered are the following: least squares (LS, in short), Huber estimator with scale parameter estimated by minimum absolute deviation, least trimmed sum of squares (LTS) of Rousseeuw and Leroy (1987), MM estimation proposed by Yohai (1987), MLE under assumption of ST distribution of the error term (MLE-ST). For the ST case, an adjustment to the intercept must be made to account for the asymmetry of the distribution; here we have added the median of the fitted ST error distribution to the crude estimate of the intercept. The outcome is reported in Table 1, whose entries have appeared in Table 5 of Azzalini and Genton (2008) except that MM estimation was not considered there. The Q value of MLE-ST is the smallest.
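To fix ideas on the criterion Q, here is how it would be computed for a least-squares fit; the data below are a hypothetical stand-in (we do not reproduce the stack loss values here), and numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 21
# hypothetical regressors mimicking three explanatory variables plus intercept
X = np.column_stack([np.ones(n),
                     rng.uniform(50, 80, n),
                     rng.uniform(17, 27, n),
                     rng.uniform(72, 93, n)])
beta = np.array([-40.0, 0.7, 1.3, -0.15])   # arbitrary illustrative coefficients
y = X @ beta + rng.standard_normal(n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_fit = X @ beta_hat
Q = np.abs(y - y_fit).sum()                 # total absolute deviation
```

Any other candidate fit (Huber, LTS, MM, MLE-ST) would be scored by the same Q on its own fitted values.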
Table 1

Total absolute deviation of various fitting methods applied to the stack loss data

| Method | LS | Huber | LTS | MM | MLE-ST |
|---|---|---|---|---|---|
| Q | 49.7 | 46.1 | 49.4 | 45.3 | 43.4 |

## 2 Aspects of Robustness

### 2.1 Robustness and Real Data

The effectiveness of classical robust methods in work with real data has been questioned in a well-known paper by Stigler (1977). In the opening section, the author lamented that ‘most simulation studies of the robustness of statistical procedures have concentrated on a rather narrow range of alternatives to normality: independent, identically distributed samples from long-tailed symmetric continuous distributions’ and proposed instead ‘why not evaluate the performance of statistical procedures with real data?’ He then examined 24 data sets arising from classical experiments, all targeted to measure some physical or astronomical quantity, for which the modern measurement can be regarded as the true value. After studying these data sets, including application of a battery of 11 estimators on each of them, the author concluded in the final section that ‘the data sets examined do exhibit a slight tendency towards more extreme values than one would expect from normal samples, but a very small amount of trimming seems to be the best way to deal with this. [...] The more drastic modern remedies for feared gross errors [...] lead here to an unnecessary loss of efficiency.’

Similarly, Hill and Dixon (1982) start by remarking that in the robustness literature ‘most estimators have been developed and evaluated for mathematically well-behaved symmetric distributions with varying degrees of high tail’, while ‘limited consideration has been given to asymmetric distributions’. Also in this paper the programme is to examine the distribution of really observed data, in this case originating in a clinical laboratory context, and to evaluate the behaviour of proposed methods on them. Specifically, the data represent four biomedical variables recorded on ‘3000 apparently well visitors’ of which, to obtain a fairly homogeneous population, only data from women 20–50 years old were used, leading to sample sizes in the range 1037–1110 for the four variables. Also for these data, the observed distributions ‘differ from many of the generated situations currently in vogue: the tails of the biomedical distributions are not so extreme, and the densities are often asymmetric, lumpy and have relatively few unique values’. Other interesting aspects arise by repeatedly extracting subsamples of size 10, 20 and 40 from the full set, computing various estimators on these subsamples and examining the distributions of the estimators. The indications that emerge include the fact that the population values of the robust estimators do not estimate the population mean; moreover, as the distributions become more asymmetric, the robust estimates approach the population median, moving away from the mean.

A common indication from the two above-quoted papers is that the observed distributions display some departure from normality, but tail heaviness is not as extreme as in many simulation studies of the robustness literature. The data display instead other forms of departures from ideal conditions for classical methods, especially asymmetry and “lumpiness” or granularity. However, the problem of granularity will presumably be of decreasing importance as technology evolves, since data collection takes place more and more frequently in an automated manner, without involving manual transcription and the consequent tendency to number rounding, as was commonly the case in the past.

Clearly, these indications must not be regarded as universal. Stigler (1977, Sect. 6) himself recognizes that the existence of ‘some real data sets with symmetric heavy tails’ cannot be denied. In addition, it can be remarked that the data considered in the quoted papers are all of experimental or laboratory origin, and possibly in a social sciences context the picture may be somewhat different. However, at the least, the indication remains that the distribution of real data sets is not systematically symmetric and not so heavy tailed as one could perceive from the simulation studies employed in a number of publications.

### 2.2 Some Qualitative Considerations

The plan of this section is to discuss qualitatively the advantages and limitation of the proposed approach, also in the light of the facts recalled in the preceding subsection.

For the sake of completeness, let us state again and even more explicitly the proposed line of work. For the estimation of parameters of interest in a given inferential problem, typically location and scale, we embed them in a parametric class which includes some additional parameters capable of regulating the shape and tail behaviour of the distribution, so as to accommodate outlying observations as manifestations of the departures from normality of these distributions, hence providing a form of robustness. In a regression context, the location parameter is replaced by the regression parameters as the focus of primary interest.

In this logic, an especially interesting family of distributions is the skew-t, which allows one to regulate both its asymmetry and tail weight, besides location and scale. Such a usage of the distribution was not the original motivation of its design, which aimed at the flexibility to adapt to a variety of situations, but this flexibility leads naturally to this other role.

The formulation prompts a number of remarks, in different and even contrasting directions, partly drawing from Azzalini and Genton (2008) and from Azzalini and Capitanio (2014, Sect. 4.3.5).

1.

Clearly the proposed route does not belong to the canonical formulation of robust methods, as presented for instance by Huber and Ronchetti (2009), and one cannot expect it to fulfil the criteria stemming from that theory. However, some connections exist. Hill and Dixon (1982, Sect. 3.1) have noted that the Laszlo robust estimator of location coincides with the MLE for the location parameter of a Student’s t when its degrees of freedom are fixed. Lucas (1997) and He et al. (2000) examine this connection in more detail, confirming the good robustness properties of the MLE of the location parameter derived from an assumption of t distribution with fixed degrees of freedom.

2.

The key motivation for adopting the flexible distributions approach is to work with a fully specified parametric model. Among the implied advantages, an important one is that it is logically clear what the estimands are: the parameters of the model. The same question is less transparent with classical robust methods. For the important family of M-estimators, the estimands are given implicitly as the solution of a certain nonlinear equation; see for instance Theorem 6.4 of Huber and Ronchetti (2009). In the simple case of a location parameter estimated using an odd $$\psi$$-function when the underlying distribution is symmetric around a certain value, the estimand is that centre of symmetry, but in a more general setting we are unable to make a similarly explicit statement.

3.

Another advantage of a fully specified parametric model is that, at the end of the inference process, we obtain precisely that, a fitted probability model. Hence, as a simple example, one can assess the probability that a variable of interest lies in a given interval (a, b), a question which cannot be tackled if one works with estimating equations as with M-estimates.

4.

The critical point for a parametric model is of course the inclusion of the true distribution underlying the data generation among those contemplated by the model. Since models can only approximate reality, this ideal situation cannot be met exactly in practice, except in exceptional situations. If we denote by $$\theta \in \varTheta \subseteq \mathbb {R}^p$$ the parameter of a certain family of distributions, $$f(x;\theta )$$, recall that, under suitable regularity conditions, the MLE $$\hat{\theta }$$ of $$\theta$$ converges in probability to the value $$\theta _0\in \varTheta$$ such that $$f(x;\theta _0)$$ has minimal Kullback–Leibler divergence from the true distribution. The approach via flexible distributions can work satisfactorily insofar as it manages to keep this divergence limited in a wide range of cases.

5.

Classical robust methods are instead designed to work under all possible situations, even the most extreme. On the other hand, empirical evidence recalled in Sect. 2.1 indicates that protection against all possible alternatives may be more than we need, as in the real world the most extreme situations do not arise that often.

6.

As for the issue discussed in item 4, we are not disarmed, because the adequacy of a parametric model can be tested a posteriori using model diagnostic tools, hence providing a safeguard against appreciable Kullback–Leibler divergence.

## 3 Some Quantitative Indications

The arguments presented in Sect. 2.2, especially in items 4 and 5 of the list there, call for quantitative examination of how the flexible distribution approach works in specific cases, especially when the data generating distribution does not belong to the specified parametric family, and how it compares with classical robust methods.

This is the task of the present section, adopting the ST parametric family (6) and using MLE for estimation; for brevity we refer to this option as MLE-ST. Notice that $$\nu$$ is not fixed in advance, but estimated along with the other parameters. When a similar scheme is adopted for the classical Student’s t distribution, Lucas (1997) has shown that the influence function becomes unbounded, hence violating the canonical criteria for robustness. A similar fact can be shown to happen with the ST distribution.

### 3.1 Limit Behaviour Under a Mixture Distribution

Recall the general result about the limit behaviour of the MLE when a certain parametric assumption is made on the distribution of an observed random variable Y, whose actual distribution $$p(\cdot )$$ may not be a member of the parametric class. Under the assumption of independent sampling from Y with constant distribution p and various regularity conditions, Theorem 2 of Huber (1967) states that the MLE of parameter $$\theta$$ converges almost surely to the solution $$\theta _0$$, assumed to be unique, of the equation
\begin{aligned} \mathbb {E}_p\{\psi (Y;\theta )\} = 0, \end{aligned}
(7)
where the subscript p indicates that the expectation is taken with respect to that distribution and $$\psi (\cdot ;\theta )$$ denotes the score function of the parametric model.
We examine numerically the case where the parametric assumption is of ST type with $$\theta =(\xi , \omega , \alpha , \nu )$$ and p(x) represents a contaminated normal distribution, that is, a mixture density of the form
\begin{aligned} p(x) = (1-\pi )\,\varphi (x) + \pi \,\sigma ^{-1}\,\varphi \{\sigma ^{-1}(x-\varDelta )\} \,. \end{aligned}
(8)
In our numerical work, we have set $$\pi =0.05$$, $$\varDelta =10$$, $$\sigma =3$$. The corresponding p(x) is depicted as a grey-shaded area in Fig. 2 and its mean value, 0.5, is marked by a small circle on the horizontal axis. The expression of the four-dimensional score function for the ST assumption is given by DiCiccio and Monti (2011), reproduced with inessential changes of notation in Sect. 4.3.3 of Azzalini and Capitanio (2014). The solution of (7) obtained via numerical methods is $$\theta _0=(-0.647, 1.023, 1.073, 2.138)$$, whose corresponding ST density is represented by the dashed curve in Fig. 2. From $$\theta _0$$, we can compute standard measures of location, such as the median and the mean of the ST distribution with that parameter; their values, 0.0031 and 0.3547, are marked by vertical bars on the plot. The first of these values is almost equal to the centre of the main component of p(x), i.e. $$\varphi (x)$$, while the mean of the ST distribution is not far from the mean of p(x). Which of the two quantities is more appropriate to consider depends, at least partly, on the specific application under consideration.
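These two values are easy to re-derive: the mean follows from the moment formula $$\mu =\xi +\omega \,b_\nu \delta$$, and the median from numerical inversion of the ST distribution function. A Python sketch of ours (assuming scipy; tolerances reflect the rounding of $$\theta _0$$):

```python
import math
from scipy import stats, integrate, optimize

xi, om, alpha, nu = -0.647, 1.023, 1.073, 2.138   # theta_0 found above

def st_pdf_y(y):
    """ST density of Y = xi + om*Z, with Z distributed as in (4)."""
    z = (y - xi) / om
    w = alpha * z * math.sqrt((nu + 1.0) / (nu + z * z))
    return 2.0 * stats.t.pdf(z, nu) * stats.t.cdf(w, nu + 1.0) / om

# mean via the moment formula (valid since nu > 1)
delta = alpha / math.sqrt(1.0 + alpha ** 2)
b_nu = math.sqrt(nu / math.pi) * math.gamma((nu - 1.0) / 2.0) / math.gamma(nu / 2.0)
st_mean = xi + om * b_nu * delta

# median via root-finding on the numerically integrated distribution function
cdf = lambda y: integrate.quad(st_pdf_y, -math.inf, y)[0]
st_median = optimize.brentq(lambda y: cdf(y) - 0.5, -3.0, 3.0)
```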

To obtain a comparison term from a classical robust technique, a similar numerical evaluation has been carried out for ‘proposal 2’ of Huber (1964), where $$\theta$$ comprises a location and a scale parameter. The corresponding estimands are computed solving an equation formally identical to (7), except that now $$\psi$$ represents the set of estimating equations, not the score function; see Theorem 6.4 of Huber and Ronchetti (2009). For the case under consideration, the location estimand is 0.0957, which is also marked by a vertical bar in Fig. 2. This value is intermediate between the earlier two values for the ST distribution, somewhat closer to the median, but in any case all three are fairly close to each other.
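The mechanics of such estimand equations can be seen in a simpler case: for a normal working model, the solution of (7) under the mixture (8) is just the mean and standard deviation of p, obtainable by minimizing the expected negative log-likelihood, equivalently the Kullback–Leibler divergence. A sketch of ours, again with scipy assumed:

```python
import numpy as np
from scipy import stats, integrate, optimize

pi_c, Delta, sig = 0.05, 10.0, 3.0

def p(x):
    """Contaminated normal density (8)."""
    return (1 - pi_c) * stats.norm.pdf(x) + pi_c * stats.norm.pdf(x, Delta, sig)

def neg_exp_loglik(theta):
    """E_p of minus the normal log-likelihood, by numerical integration."""
    mu, log_s = theta
    f = lambda x: -p(x) * stats.norm.logpdf(x, mu, np.exp(log_s))
    return integrate.quad(f, -15.0, 60.0, points=[0.0, 10.0])[0]

mu0, log_s0 = optimize.minimize(neg_exp_loglik, x0=[0.0, 0.0],
                                method="Nelder-Mead").x
# mu0 approaches the mixture mean 0.5; exp(log_s0) the mixture s.d. sqrt(6.15)
```

The fact that the normal-family estimands are the moment-matching values makes explicit why this family offers no robustness, in contrast with the ST fit above.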

For the ST distribution, alternative measures of location, scale and so on, which are formally similar to the corresponding moment-based quantities but exist for all $$\nu >0$$, have been proposed by Arellano-Valle and Azzalini (2013). In the present case, the location measure of this type, denoted pseudomean, is equal to 0.1633, which is about halfway between the ST mean and median; this value is not marked on Fig. 2 to avoid cluttering.

### 3.2 A Non-random Simulation

We examine the behaviour of MLE-ST and other estimators when an “ideal sample” is perturbed by suitably modifying one of its components. As an ideal sample we take the vector $$z_1, \dots , z_n$$, where $$z_i$$ denotes the expected value of the ith order statistic of a random sample of size n drawn from the $$\mathrm {N}(0,1)$$ distribution, and its perturbed version has ith component as follows:
$$y_i = \left\{ \begin{array}{cl} z_i &{}\quad \hbox {if } i=1,\dots ,n-1,\\ z_n+\varDelta &{}\quad \hbox {if } i=n. \end{array} \right.$$
For any given $$\varDelta >0$$, we examine the corresponding estimates of location obtained from various estimation methods and then repeat the process for an increasing sequence of displacements $$\varDelta$$. Since the $$y_i$$’s are artificial data, the experiment represents a simulation, but no randomness is involved. Another way of looking at this construction is as a variant form of the sensitivity curve.
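This construction is easy to reproduce approximately. In the sketch below (ours), the expected normal order statistics are replaced by Blom's approximation $$z_i \approx \varPhi ^{-1}\{(i-0.375)/(n+0.25)\}$$, an assumption of ours rather than the exact expected values, and we track two simple location estimates as $$\varDelta$$ grows.

```python
import numpy as np
from scipy import stats

def perturbed_location(Delta, n=100):
    """Mean and median of the 'ideal sample' with its largest point moved by Delta."""
    i = np.arange(1, n + 1)
    z = stats.norm.ppf((i - 0.375) / (n + 0.25))  # Blom approximation to E[z_(i)]
    y = z.copy()
    y[-1] += Delta
    return y.mean(), np.median(y)
```

The mean moves by exactly $$\varDelta /n$$, while the median is untouched for every $$\varDelta >0$$, a first hint of the contrast displayed by the full comparison.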

In the subsequent numerical work, we have set $$n=100$$, so that $$-2.5< z_i < 2.5$$, and $$\varDelta$$ ranges from 0 to 15. Computation of the MLE for the ST distribution has been accomplished using the R package sn (Azzalini 2015), while support for classical robust procedures is provided by packages robust (Wang et al. 2014) and robustbase (Rousseeuw et al. 2014); these packages have been used at their default settings. The degrees of freedom of the MLE-ST fitted distributions decrease from about $$4\times 10^4$$ (which essentially is a numerical substitute for $$\infty$$) when $$\varDelta =0$$, down to $$\hat{\nu }=3.57$$ when $$\varDelta =15$$.

For each MLE-ST fit, the corresponding median, mean value and pseudomean of the distribution have been computed and these are the values plotted in Fig. 3 along with the sample average and some representatives of the classical robust methodology. The slight difference between the two curves of MM estimates is due to a small difference in the tuning parameters of the R packages. Inevitably, the sample average diverges linearly as $$\varDelta$$ increases. The ST median and pseudomean behave qualitatively much like the robust methods, while the mean increases steadily, but far more gently than the sample average, following a logarithmic-like sort of curve.

### 3.3 A Random Simulation

Our last numerical exhibit refers to a regular stochastic simulation. We replicate an experiment where $$n=100$$ data points are sampled independently from the regression scheme
$$y = \beta _0 + \beta _1\, x + \varepsilon ,$$
where the values of x are equally spaced in (0, 10), $$\beta _0=0$$, $$\beta _1=2$$ and the error term $$\varepsilon$$ has contaminated normal distribution of type (8) with $$\varDelta \in \{2.5,5, 7.5, 10\}$$, $$\pi \in \{0.05, 0.10\}$$, $$\sigma =3$$.
For each generated sample, estimates of $$\beta _0$$ and $$\beta _1$$ have been computed using least squares (LS), least trimmed sum of squares (LTS), MM estimation and MLE-ST with median adjustment of the intercept; all of them have already been considered and described in an earlier section. After 50,000 replications of this step, the root-mean-square (RMS) error of the estimates has been computed and the final outcome is presented in Fig. 4 in the form of plots of RMS error versus $$\varDelta$$, separately for each parameter and each contamination level.
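A reduced version of this experiment can be sketched in a few lines. The code below (ours) uses a least absolute deviations fit, computed by iteratively reweighted least squares, as a simple robust stand-in for LTS, MM and MLE-ST, and a smaller replication count; only numpy is assumed.

```python
import numpy as np

def fit_ls(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_lad(X, y, iters=50, eps=1e-6):
    """L1 regression via iteratively reweighted least squares (IRLS):
    weights 1/|r_i| turn the weighted sum of squares into sum |r_i|."""
    b = fit_ls(X, y)
    for _ in range(iters):
        w = 1.0 / np.maximum(np.abs(y - X @ b), eps)
        b = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
    return b

rng = np.random.default_rng(1)
n, reps = 100, 300
x = np.linspace(0.05, 9.95, n)
X = np.column_stack([np.ones(n), x])
pi_c, Delta, sig = 0.10, 10.0, 3.0

err_ls, err_lad = [], []
for _ in range(reps):
    e = rng.standard_normal(n)
    contam = rng.random(n) < pi_c                      # contaminated cases
    e[contam] = Delta + sig * rng.standard_normal(contam.sum())
    y = 0.0 + 2.0 * x + e                              # beta0 = 0, beta1 = 2
    err_ls.append(fit_ls(X, y)[1] - 2.0)
    err_lad.append(fit_lad(X, y)[1] - 2.0)

rms_ls = float(np.sqrt(np.mean(np.square(err_ls))))
rms_lad = float(np.sqrt(np.mean(np.square(err_lad))))
```

At this contamination level the slope RMS error of the robust fit comes out noticeably smaller than the LS one, in qualitative agreement with Fig. 4.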

The main indication emerging from Fig. 4 is that the MLE-ST procedure behaves very much like the classical robust methods over a wide span of $$\varDelta$$. There is a slight increase of the RMS error of MLE-ST over MM and LTS when we move to the far right of the plots; this is in line with the known non-robustness of MLE-ST with respect to the classical criteria. However, this discrepancy is of modest entity and presumably it would require very large values of $$\varDelta$$ to become appreciable. Notice that on the right side of the plots we are already 10 standard deviations away from the centre of $$\varphi (x)$$, the main component of distribution (8).

### 3.4 Empirical and Applied Work

The MLE-ST methodology has been tested on a number of real datasets and application areas. A fairly systematic empirical study has been presented by Azzalini and Genton (2008), employing data originated from a range of situations: multiple linear regression, linear regression on time series data, multivariate observations, classification of high dimensional data. Work with multivariate data involves using the multivariate skew-t distribution, of which an account is presented in Chap. 6 of Azzalini and Capitanio (2014). In all the above-mentioned cases, the outcome has been satisfactory, sometimes very satisfactory, and has compared favourably with techniques specifically developed for the different situations under consideration.

Applications of the ST distribution arise in a number of fields. We do not attempt a complete review, but only indicate some directions. One point to bear in mind is that often, in applied work, the distinction between long tails and outlying observations is effectively blurred.

A crystalline exemplification of the last statement is provided by the returns generated in the industry of artistic productions, especially from films and music. Here the so-called ‘superstar effect’ leads to values of a few isolated units which are far higher than the main body of the production. These extremely large values are outlying but not spurious; they are genuine manifestations of the phenomenon under study, whose probability distribution is strongly asymmetric and heavy tailed, even after log transformation of the original data. See Walls (2005) and Pitt (2010) for a complete discussion and for illustrations of successful use of the ST distribution.

The above-described data pattern and corresponding explorations of use of the MLE-ST procedure exist also in other application areas. Among these, quantitative finance represents a prominent example and this has prompted also significant theoretical contributions to the development of this area; see Adcock (2010, 2014). Another important context is represented by natural phenomena, where occasionally extreme values jump far away from the main body of the observations; applied work in this direction includes multivariate modelling of coastal flooding (Thompson and Shen 2004), monthly precipitations (Marchenko and Genton 2010), riverflow intensity (Ghizzoni et al. 2010, 2012).

Another direction currently under vigorous investigation is model-based cluster analysis. The traditional assumption that each component of the underlying mixture distribution is multivariate normal is often too restrictive, leading to an inappropriate increase in the number of component distributions. A more flexible distribution, such as the multivariate ST, can overcome this limitation, as shown in an early application by Pyne et al. (2009); various other papers follow a similar line, including of course the adoption of other flexible distributions.
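This component-inflation effect is easy to reproduce numerically. The sketch below uses hypothetical simulated data, with scikit-learn’s `GaussianMixture` standing in for a dedicated mixture package: a single skewed cluster is fitted with normal mixtures, and BIC is left to choose the number of components.

```python
# A single skewed, heavy-tailed cluster forces a normal mixture to spend
# extra components on its long tail.  Hypothetical data: one lognormal
# coordinate paired with a normal one, i.e. one genuine cluster.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.column_stack([rng.lognormal(size=1000), rng.normal(size=1000)])

# BIC-based choice of the number of Gaussian components.
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in (1, 2, 3, 4)}
best_k = min(bic, key=bic.get)
print(f"BIC selects {best_k} normal components for 1 actual cluster")
```

A single flexible component would typically absorb the skewness and tail weight directly; in R, packages such as mixsmsn implement mixtures of skew-normal and skew-t components.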

At least a mention is due to methods for longitudinal data and mixed-effects models, such as those of Lachos et al. (2010) and Ho and Lin (2010).

We stress once more that the above-quoted contributions have been picked as representatives of a substantially broader collection, which includes additional methodological themes and application areas. A more extensive summary of this activity is provided in the monograph of Azzalini and Capitanio (2014).

In connection with applied work, it is appropriate to underline that care must be exercised in the numerical maximization of the likelihood function, at least with certain datasets. It is known that fitting a classical Student’s t distribution with unconstrained degrees of freedom can be problematic, especially in the multivariate case, and the inclusion of a skewness parameter adds a further level of complexity. It is then advisable to start the maximization process from several starting points. In problematic cases, computation of the profile likelihood function with respect to $$\nu$$ can be a useful device. Advances in the reliability and efficiency of optimization techniques for this formulation would be valuable.
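As an illustration of this profile-likelihood device, the following sketch profiles the log-likelihood over a grid of values of $$\nu$$. SciPy does not provide the skew-t, so the symmetric Student’s t is used here as a stand-in (in R, the sn package cited above fits the full ST model); the data, grid and seed are hypothetical.

```python
# Profile log-likelihood over the degrees of freedom nu: for each fixed nu,
# maximize over location and scale, then compare the maximized values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.standard_t(df=4, size=500) * 2.0 + 10.0  # heavy-tailed sample

def profile_loglik(nu, x):
    """Maximize the likelihood over (location, scale) with nu held fixed."""
    df, loc, scale = stats.t.fit(x, fdf=nu)  # fdf pins the shape parameter
    return stats.t.logpdf(x, df=df, loc=loc, scale=scale).sum()

nu_grid = [1, 2, 3, 4, 5, 8, 12, 20, 50]
prof = {nu: profile_loglik(nu, data) for nu in nu_grid}
nu_hat = max(prof, key=prof.get)
print(f"profile MLE of nu on the grid: {nu_hat}")
```

Plotting the profile values against the grid also exposes flat stretches of the likelihood in $$\nu$$, which are the signature of the numerical difficulties mentioned above.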

## 4 Concluding Remarks

The overall message which can be extracted from the preceding pages is that flexible distributions constitute a credible approach to the problem of robustness. Since it does not descend from the canonical scheme of classical robust methods, this approach cannot meet the classical robustness optimality criteria. However, those criteria are targeted at offering protection against extreme situations which are not commonly encountered in real data, perhaps even seldom encountered. In less extreme situations, but still allowing for appreciable departure from normality, flexible distributions, especially in the representative case of the skew-t distribution, offer adequate protection against problematic situations, while providing a fully specified probability model, with the qualitative advantages discussed in Sect. 2.2.

We have adopted the ST family as our working parametric family, but the reasons for this preference, explained briefly above and more extensively by Azzalini and Genton (2008), are not definitive; in certain problems it may well be appropriate to work with some other distribution. For instance, if one envisages that the problem under consideration involves departure from normality in the form of shorter tails, or possibly a combination of longer and shorter tails in different subcases, and the setting is univariate, then the Subbotin distribution and its asymmetric variants represent an interesting option.
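A minimal sketch of this last option: SciPy implements the Subbotin (exponential power) family as `gennorm`, with density $$\frac{\beta}{2\Gamma(1/\beta)}\exp(-|x|^{\beta})$$, which matches the density displayed in Sect. 1.1 up to a rescaling of $$x$$ (take scale $$\nu^{1/\nu}$$). Here $$\beta=2$$ recovers the normal, $$\beta>2$$ gives shorter tails and $$\beta<2$$ longer tails; the simulated samples are hypothetical.

```python
# Fit the Subbotin family to a short-tailed and a long-tailed sample:
# the fitted shape parameter beta adapts in the direction of the tails.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
short_tailed = rng.uniform(-1, 1, size=2000)  # flat, short-tailed sample
long_tailed = rng.laplace(size=2000)          # Laplace, i.e. beta = 1

beta_short, _, _ = stats.gennorm.fit(short_tailed)
beta_long, _, _ = stats.gennorm.fit(long_tailed)
print(f"fitted beta: short-tailed {beta_short:.2f}, long-tailed {beta_long:.2f}")
```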

## Notes

### Acknowledgments

This paper stems directly from my oral presentation with the same title delivered at the ICORS 2015 conference held in Kolkata, India. I am grateful to the conference organizers for the kind invitation to present my work on that occasion. Thanks are also due to the attendees who contributed useful comments to the discussion, some of which have been incorporated here.

## References

1. Adcock CJ (2010) Asset pricing and portfolio selection based on the multivariate extended skew-Student-$$t$$ distribution. Ann Oper Res 176(1):221–234
2. Adcock CJ (2014) Mean-variance-skewness efficient surfaces, Stein’s lemma and the multivariate extended skew-Student distribution. Eur J Oper Res 234(2):392–401
3. Arellano-Valle RB, Azzalini A (2013) The centred parameterization and related quantities of the skew-$$t$$ distribution. J Multiv Anal 113:73–90
4. Azzalini A (1986) Further results on a class of distributions which includes the normal ones. Statistica XLVI(2):199–208
5. Azzalini A (2015) The R package sn: The skew-normal and skew-$$t$$ distributions (version 1.2-1). Università di Padova, Italia. http://azzalini.stat.unipd.it/SN
6. Azzalini A, Capitanio A (2003) Distributions generated by perturbation of symmetry with emphasis on a multivariate skew $$t$$ distribution. J R Statis Soc ser B 65(2):367–389, full version of the paper at arXiv.org:0911.2342
7. Azzalini A with the collaboration of Capitanio A (2014) The Skew-Normal and Related Families. IMS Monographs, Cambridge University Press, Cambridge. http://www.cambridge.org/9781107029279
8. Azzalini A, Genton MG (2008) Robust likelihood methods based on the skew-$$t$$ and related distributions. Int Statis Rev 76:106–129
9. Box GEP, Tiao GC (1962) A further look at robustness via Bayes’s theorem. Biometrika 49:419–432
10. Box GEP, Tiao GC (1973) Bayesian inference in statistical analysis. Addison-Wesley Publishing Co
11. Branco MD, Dey DK (2001) A general class of multivariate skew-elliptical distributions. J Multiv Anal 79(1):99–113
12. DiCiccio TJ, Monti AC (2011) Inferential aspects of the skew $$t$$-distribution. Quaderni di Statistica 13:1–21
13. Ghizzoni T, Roth G, Rudari R (2012) Multisite flooding hazard assessment in the Upper Mississippi River. J Hydrol 412–413(Hydrology Conference 2010):101–113
14. Ghizzoni T, Roth G, Rudari R (2010) Multivariate skew-$$t$$ approach to the design of accumulation risk scenarios for the flooding hazard. Adv Water Res 33(10, Sp. Iss. SI):1243–1255
15. He X, Simpson DG, Wang GY (2000) Breakdown points of $$t$$-type regression estimators. Biometrika 87:675–687
16. Hill MA, Dixon WJ (1982) Robustness in real life: a study of clinical laboratory data. Biometrics 38:377–396
17. Ho HJ, Lin TI (2010) Robust linear mixed models using the skew $$t$$ distribution with application to schizophrenia data. Biometr J 52:449–469
18. Huber PJ (1964) Robust estimation of a location parameter. Ann Math Statis 35:73–101
19. Huber PJ (1967) The behaviour of maximum likelihood estimators under nonstandard conditions. In: Le Cam LM, Neyman J (eds) Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol 1. University of California Press, pp 221–23
20. Huber PJ, Ronchetti EM (2009) Robust statistics, 2nd edn. Wiley
21. Kano Y (1994) Consistency property of elliptical probability density functions. J Multiv Anal 51:139–147
22. Lachos VH, Ghosh P, Arellano-Valle RB (2010) Likelihood based inference for skew-normal independent linear mixed models. Statist Sinica 20:303–322
23. Lange KL, Little RJA, Taylor JMG (1989) Robust statistical modeling using the $$t$$-distribution. J Am Statis Assoc 84:881–896
24. Lucas A (1997) Robustness of the Student $$t$$ based M-estimator. Commun Statis Theory Meth 26(5):1165–1182
25. Marchenko YV, Genton MG (2010) Multivariate log-skew-elliptical distributions with applications to precipitation data. Environmetrics 21(3-4, Sp. Iss. SI):318–340
26. Pitt IL (2010) Economic analysis of music copyright: income, media and performances. Springer Science & Business Media. http://www.springer.com/book/9781441963178
27. Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier LM, Baecher-Alland C, McLachlan GJ, Tamayo P, Hafler DA, De Jagera PL, Mesirov JP (2009) Automated high-dimensional flow cytometric data analysis. PNAS 106(21):8519–8524
28. Rousseeuw P, Croux C, Todorov V, Ruckstuhl A, Salibian-Barrera M, Verbeke T, Koller M, Maechler M (2014) robustbase: basic robust statistics. http://CRAN.R-project.org/package=robustbase, R package version 0.91-1
29. Rousseeuw PJ, Leroy AM (1987) Robust regression and outlier detection. Wiley, New York
30. Stigler SM (1977) Do robust estimators work with real data? (with discussion). Ann Statis 5(6):1055–1098
31. Subbotin MT (1923) On the law of frequency of error. Matematicheskii Sbornik 31:296–301
32. Thompson KR, Shen Y (2004) Coastal flooding and the multivariate skew-$$t$$ distribution. In: Genton MG (ed) Skew-elliptical distributions and their applications: a journey beyond normality, Chap 14. Chapman & Hall/CRC, pp 243–258
33. Walls WD (2005) Modeling heavy tails and skewness in film returns. Appl Financ Econ 15(17):1181–1188
34. Wang J, Zamar R, Marazzi A, Yohai V, Salibian-Barrera M, Maronna R, Zivot E, Rocke D, Martin D, Maechler M, Konis K (2014) robust: robust library. http://CRAN.R-project.org/package=robust, R package version 0.4-16
35. Yohai VJ (1987) High breakdown-point and high efficiency robust estimates for regression. Ann Statis 15(2):642–656