Kurtosis removal for data pre-processing

Mesokurtic projections are linear projections with null fourth cumulants. They might be useful data pre-processing tools when nonnormality, as measured by the fourth cumulants, is either an opportunity or a challenge. Nonnull fourth cumulants are opportunities when projections with extreme kurtosis are used to identify interesting nonnormal features, as for example clusters and outliers. Unfortunately, this approach suffers from the curse of dimensionality, which may be addressed by projecting the data onto the subspace orthogonal to mesokurtic projections. Nonnull fourth cumulants are challenges when using statistical methods whose sampling properties heavily depend on the fourth cumulant themselves. Mesokurtic projections ease the problem by allowing to use the inferential properties of the same methods under normality. The paper shows necessary and sufficient conditions for the existence of mesokurtic projections and compares them with other gaussianization methods. Theoretical and empirical results suggest that mesokurtic transformations are particularly useful when sampling from finite normal mixtures. The practical use of mesokurtic projections is illustrated with the AIS and the RANDU datasets.


Introduction
A fourth cumulant of a random vector with finite fourth moments is the fourth derivative, evaluated at the origin, of the cumulant generating function of the random vector itself, that is the logarithm of its characteristic function. All fourth cumulants of a B Nicola Loperfido nicola.loperfido@uniurb.it 1 Dipartimento di Economia, Società e Politica (DESP), Università degli Studi di Urbino "Carlo Bo", Via Saffi 42, 61029 Urbino, PU, Italy multivariate normal distribution equal zero, and we refer to distributions with the same property as to mesokurtic distributions. Also, we refer to transformations leading to null fourth cumulants as to mesokurtic transformations. In particular, when the mesokurtic transformation is a linear function of the original variables, we refer to it as to a mesokurtic projection. With a little abuse of language, and for ease of description, we write that mesokurtic transformations remove kurtosis. Mesokurtic transformations include as special cases transformations to normality, also known as gaussianizing transformations. They have been extensively studied in scientific fields other than Statistics, as for example Physics (Yu et al. 2016). Mesokurtic transformations are of interest whenever the performance of the chosen statistical method either increases or decreases with the absolute values of fourth-order cumulants.
Researchers from social and life sciences are often concerned with the lack of normality of their data, since many multivariate statistical techniques are not robust when the sampled distribution is wrongly assumed to be normal. Sometimes the performance of the statistical technique depends on the fourth cumulants of the sampled distribution, as exemplified by the following cases.
Inference on covariance matrices is notoriously very sensitive to fourth cumulants. Mardia (1974) showed that the performance of a likelihood test based on the erroneous assumption of normality crucially depends on a measure of multivariate kurtosis which is a simple function of fourth cumulants. Schott (2002) used fourth cumulants to derive a robust procedure for testing the equality of the population covariance and a given matrix. Yanagihara et al. (2005) extended Mardia's results to several likelihood tests on covariance matrices based on the normality assumption, and showed that their performances depend on their fourth cumulants.
Variogram estimation is of paramount importance in spatial statistics but spatial data are often nonnormal, also due to preferential sampling (Loperfido and Guttorp 2008). Genton et al. (2001) modelled nonnormality with the multivariate skew-normal distribution and derived analytical formulae for covariances of variogram estimators, and showed that they depend on the fourth cumulants of the same distribution. Kim (2005) extended their results to mixtures of skew-normal distributions. Rezvandehy and Deutsch (2018) addressed preferential sampling with fourth cumulants.
Relevance of fourth cumulants is not limited to covariance testing and variogram estimation. Yanagihara (2007) stressed the importance of fourth cumulants in the multivariate linear model. Arevalillo and Navarro (2012) investigated the effect of fourth cumulants on the Fisher discriminant function (i.e. the projection which best separates the means of two multivariate distributions), and found it to be nonnegligible. The large-sample approximation of the multivariate sample mean by the skew-normal distribution improves when the fourth cumulants of the two distributions are close to each other (Christiansen and Loperfido 2014).
Multivariate normality is usually pursued by means of the componentwise, nonlinear, univariate transformations proposed by either Box and Cox (1964) or Tukey (1977). Tsay et al. (2017) thoroughly review these and other transformations to normality, while proposing a new one which is particularly apt at dealing with platykurtic distributions. Unfortunately, nonlinearly transformed variables might not be easily interpretable nor jointly normal, as pointed out by Lin and Lin (2010), among others. Moreover, the same methods are inappropriate when the joint distribution is not normal, but univariate marginals are, as it happens for the distributions described in Arnold et al. (2001). Loperfido (2014) and Loperfido (2019) used a simple argument to show how nonliner transformations to normality are inappropriate when using Hotelling's one-sample test.
Mesokurtic projections do not suffer from these limitations. We illustrate this point with the bivariate distribution 2φ (x) φ (y) Φ (λx y) introduced by Arnold et al. (2001), where φ (·) and Φ (·) denote the probability and the cumulative density functions of a standard normal distribution, while λ is a real value. Adcock (2021) thoroughly investigated its properties and proposed some generalizations. The default approach to the gaussianization of the random vector (X , Y ) would be raising both its components to an appropriate power. However, there do not exist two real values α and β such that the joint distribution of X α , Y β is bivariate normal. On the other hand, there are two normal, and therefore mesokurtic, projections of (X , Y ), that is X and Y .
The performance of kurtosis-based projection pursuit tends to increase with the absolute values of fourth cumulants. Projection pursuit is a multivariate statistical technique aimed at finding interesting data projections, with interestingness quantified by the projection index: the data projection with the highest value of the projection index is regarded as the most interesting. The normal distribution is commonly regarded as the least interesting (Huber 1985), so that the projection index often measures a nonnormality feature such as skewness, kurtosis or multimodality. In particular, kurtosis-based projection pursuit uses the absolute value of the fourth standardized cumulant as a projection index, consistently with the projection pursuit criteria stated in Huber (1985). Kurtosis-based projection pursuit therefore looks for interesting projections by either maximizing or minimizing kurtosis, that is the fourth standardized moment. Kurtosis-based projection pursuit appears in cluster analysis Prieto 2000, 2001b), outlier detection (Galeano et al. 2006;Peña and Prieto 2001a), normality testing (Malkovich and Afifi 1973), independent component analysis (Girolami and Fyfe 1996), invariant coordinate selection (Alashwali and Kent 2016), chemometrics (Hou and Wentzell 2014), finance (Loperfido 2020b).
Projection pursuit is commonly regarded as more problematic when applied to datasets with more variables than units (see, for example, Hui and Lindsay 2010;Lindsay and Yao 2012). Bickel et al. (2018) thoroughly investigated the asymptotic properties of projection pursuit for several ratios of the number of variables to the number of units, finding serious shortcomings of the method when the number of variables is much greater than the number of units. Pires and Branco (2019) showed that two-dimensional projections of datasets with more variables than units could approximate any given set of bivariate data with the same number of units. Lee and Cook (2010) illustrated the limitation of projection pursuit for classification with a real dataset containing 3571 variables recorded from 72 cases. However, the performance of kurtosis-based projection pursuit might rapidly deteriorate as the number of variables increases while the number of units remains fixed, even if the former remains much smaller than the latter (Loperfido 2020a).
The problems posed by high-dimensional data might be eased by means of sparse projection pursuit, that is projection pursuit performed on a small subset of variables, either original or projected (Bickel et al. 2018). Jones and Sibson (1987) proposed a linear transformation of the data into two mutually disjoint subsets of variables, one deemed uninteresting and the other deemed interesting. The former should be discarded while the latter should be analysed by means of projection pursuit. Following this approach, Blough (1989), Loperfido (2017), Franceschini and Loperfido (2019) removed skewness from the data by means of linear transformations, within the framework of skewness-based projection pursuit. Hui and Lindsay (2010) and Lindsay and Yao (2012) linearly transformed the data into two mutually orthogonal subsets of interesting and uninteresting projections. Ray (2010) used a simulated dataset with many more variables than units to illustrate the merits of this approach. In kurtosis-based projection pursuit, where the least interesting distributions are those with null fourth order cumulants, data are linearly transformed into two mutually orthogonal sets of variables, of which only one is mesokurtic. Then the mesokurtic set of variables is removed and kurtosis optimization is carried out on the remaining variables.
This paper uses several kurtosis matrices to obtain mesokurtic projections. It states both sufficient and necessary conditions for the existence of mesokurtic projections. The approach is algebraic in nature, since it relies on matrix concepts, including null linear spaces, spectra and generalized eigenvalues. As a major advantage, sampling properties of mesokurtic projections might be easily derived from the theory of random matrices and their spectra. The remainder of the paper is organized as follows. Section 2 uses kurtosis matrices to investigate the existence and the properties of mesokurtic projections. Section 3 contains some simulation studies related to model-based clustering. Section 4 applies the proposed method to a subset of the AIS dataset containing 23 units and 11 variables, which is a nonnegligible variables-to-units ratio. Section 5 uses the RANDU dataset to show that mesokurtic projections might be a viable alternative to nonlinear transformations to normality. Section 6 contains some concluding remarks, mentions further applications of mesokurtic projections, discusses their limitations and suggests some extensions of the proposed method to datasets with more variables than units. The proofs of the theorems are in the Appendix.

Theory
This section describes a method for obtaining projections with null or negligible fourth cumulants, that is mesokurtic and nearly mesokurtic projections. Firstly, it uses bivariate random vectors to illustrate situations where mesokurtic projections do not exist. Secondly, provides necessary and sufficient conditions for the existence of mesokurtic projections of random vectors in any dimension. Thirdly, the section briefly discusses the sampling properties of mesokurtic projections.
The i jhk-th cumulant of a d-dimensional random vector x = (X 1 , . . . , where ι = √ −1, t ∈ R d and E exp ιt T x is the cumulant generating function. In the univariate case, the only fourth cumulant is where μ and σ 2 are the mean and the variance of a random variable X satisfying E X 4 < ∞. The bivariate case provides a good insight into nonexistence of mesokurtic projections. Let U and W be two standardized, independent random variables whose fourth cumulants κ 4 (U ) and κ 4 (W ) are both positive. Elementary properties of cumulants imply that the fourth cumulant of the projection hU + kW is positive, too: As a direct consequence, there does not exist a mesokurtic projection of (U , W ).
Nonexistence of mesokurtic projections of bivariate and standardized random vectors becomes more difficult to ascertain when the independence assumption is removed. Let (X , Y ) be a bivariate and standardized random vector: Also, let β i = E X i Y 4−i , for i = 0, 1, 2, 3, 4. The fourth cumulant of the linear combination a X + Y is a fourth-order polynomial in a: A mesokurtic linear function of X and Y exists if and only if the polynomial has real roots.
The problem of ascertain the existence of mesokurtic projections becomes even more complicated when considering standardized random vectors with more than two components. We address the problem by recalling that fourth cumulants are simple functions of the fourth and second central moments. The i jhk-th moment of x is the expectation μ i jhk = E X i X j X h X k , for i, j, h, k = 1, . . . , d. The i jhk-th central moment of x is the i jhk-th moment of x − μ, where μ = (μ 1 , . . . , μ d ) T is the mean of x: The i jhk-cumulant can then be represented as where σ ab is the covariance between the a-th and the b-th components of x, with a, b=1,…, d. For example, the fourth cumulants of (X 1 , X 2 ) T , expressed as functions of its second and fourth central moments, are κ 1111 = μ 1111 − 3σ 2 11 , κ 2222 = μ 2222 − 3σ 2 22 , κ 1112 = μ 1112 − 3σ 11 σ 12 , κ 1122 = μ 1122 − σ 11 σ 22 − 2σ 2 12 , κ 1222 = μ 1222 − 3σ 12 σ 22 .
We first consider the fourth moment (matrix) of a d−dimensional random vector where "⊗" denotes the Kronecker product. It conveniently arranges all the fourth-order moments μ i jhk = E X i X j X h X k of x (where i, j, h, k = 1, . . . , d) and admits the block matrix representation M 4,x = B pq , where B pq = E X p X q x x T , for p, q = 1, . . . , d. For example, the fourth moment of (X 1 , X 2 ) T is If the variance Σ of x is positive definite, its fourth standardized moment M 4,z is the fourth moment of z = Σ −1/2 (x − μ), where Σ −1/2 is the symmetric, positive definite square root of the concentration matrix Σ −1 . Most measures of multivariate kurtosis are either scalar-type or matrix-type functions of M 4,z (Loperfido 2017(Loperfido , 2020a. In the univariate case M 4,z coincides with Pearson's measure of kurtosis: The eigenstructure of M 4,z has several interesting features: eigenvectors of M 4,z associated with its positive eigenvalues are vectorized, symmetric matrices and the eigenvector of M 4,z associated with its largest eigenvalue (that is the dominant eigenvector of M 4,z ) is a semidefinite matrix (Loperfido 2017). The following theorem gives a necessary condition for the existence of mesokurtic projections based on the eigenstructure of the fourth standardized moment matrix. It is reminescent of the distinction between platykurtic and leptokurtic random variables, whose fourth standardized moments are smaller and greater than three, respectively. Proposition 1 suggests the following procedure for computing projections with negligible fourth cumulants. First, compute the nonnull eigenvalues of the fourth standardized moment matrix. If they are all greater (smaller) than three the projections with fourth cumulants closer to zero are those which minimize (maximize) kurtosis. In these circumstances, kurtosis alleviation coincides with kurtosis optimization, that is kurtosis-based projection pursuit. The computational issues of the latter might be addressed by the method proposed in Franceschini and Loperfido (2018), which are implemented in the R package Kurt (Franceschini and Loperfido 2020).
On the other hand, if some nonnull eigenvalues of the fourth standardized moment are smaller than three while other eigenvalues are greater than three, it is worth looking for mesokurtic projections. An intuitively appealing approach has been proposed by Peña and Prieto (2001a) and Peña and Prieto (2001b) within the framework of outlier detection. First, find the projection with maximal kurtosis. Second, project the data onto its orthogonal subspace. Then repeat these steps until all remaining projections are mesokurtic or nearly so. This approach, which may be referred to as to iteractive projection pursuit approach, proved to be useful when nonnormality is due to the presence of outliers, but may not detect mesokurtic projections when data are generated from different models, as it happens in independent component analysis.
Independent component analysis (ICA) is a multivariate statistical technique aimed at recovering independent, unobserved signals by appropriate data projections. The basic ICA model is invertible real matrix and the components of s are mutually independent, standardized random variables of which at most one is normal (Miettinen et al. 2015).
Iteractive projection pursuit does not detect mesokurtic projections when the fourth cumulants of the signals are nonnull and take both signs. We illustrate the point with the bivariate ICA model where the first, second, fourth cumulants of S 1 and S 1 are The projection of x with maximal kurtosis coincides, up to location and scale changes, with the first signal, which is leptokurtic: The projection of x orthogonal to 2X 2 − 3X 1 coincides, up to location and scale changes, with the second signal, which is platykurtic: Neither projection is mesokurtic, but the model implies the existence of the mesokurtic projection X 1 + X 2 . A sufficient condition for the existence of mesokurtic projections might be established using the cokurtosis matrix (see, for example, Jondeau and Rockinger 2006), that is the d × d 3 matrix For example, the cokurtosis of (X 1 , The standardized cokurtosis of x is just the cokurtosis of The fourth standardized matrix M 4,z = M pq = E Z p Z q zz T and the standardized cokurtosis cok (z) might be regarded as block matrices where the same blocks are arranged in different ways. The latter aligns side by side the blocks of the fourth standardized moment in such a way that the blocks with the smallest first indices appear first, followed by those having the smallest second indices: Both M 4,z and cok (z) may contain up to d (d + 1) (d + 2) (d + 3) /24 distinct elements. Since this number grows very quickly with the vector's dimension, it is convenient to summarize them with smaller matrices. Cardoso (1989) proposed K z = E z T zzz T as a kurtosis matrix. Mòri et al. (1993) independently proposed and discussed its variant K z − (d + 2) I d . Statistical applications of both kurtosis matrices include independent component analysis (Cardoso 1989), generalized principal components (Caussinus and Ruiz-Gazen 2009), invariant coordinate selection (Alashwali and Kent 2016 and cluster analysis (Peña et al. 2010). Loperfido 2017) highlighted the connection between K z and tensor contraction. The kurtosis matrix K z is a function of the fourth moment matrix of a standardized random vector and is commonly regarded as a good compromise between detail and synthesis, when studying the kurtosis structure of the random vector itself. The kurtosis matrix K z does not account for expectations of products of four different components of z: elements in K z do not depend on E Z i Z j Z h Z k , when i, j, h, k differ from each other. For this reason we refer to K z as the partial kurtosis matrix. In the bivariate case it is The following theorem establishes a sufficient condition for the existence of mesokurtic projections. It describes a matrix depending on both standardized cokurtosis and partial kurtosis which recovers mesokurtic projections, when it is not of full rank.
Proposition 2 Let cok (z) and K z be the cokurtosis and the partial kurtosis of the standardized, d− dimensional random vector z. Also, let the columns of the d × h matrix B T , with h > 0, span the null space of the matrix where I d is the d−dimensional identity matrix. Then the fourth cumulant of the random vector Bz is a null matrix.
In practice, kurtosis matrices need to be estimated. To this end, we use a n × d data matrix X and the corresponding sample covariance matrix S , which is assumed to be of full rank. This assumption implies than there are more units than variables, the opposite case falling outside the scope of the present paper. Possible extensions of the proposed method to datasets with more variables than units are discussed in the Conclusions section.
First we obtain the standardized data matrix Z = H n X S −1/2 , where H n is the n × n centring matrix and S −1/2 is the symmetric square root of S −1 . The fourth sample standardized moment, the sample standardized cokurtosis and the sample partial kurtosis are almost surely converge to the matrices M 4,z , cok (z) and K z , as long as E X 4 i < ∞, for i = 1, . . ., d . Since eigenvalues are continuous functions of their matrices (Ortega 1987 , page 41) eigenvalues of matrices in the three sequences converge almost surely to eigenvalues of the corresponding matrices. Convergence of eigenvectors and convergence in law require some additional assumptions, which are thoroughly described in Rublik (2001). Propositions 1, 2 connect mesokurtic projections to the eigenvalues of symmetric, positive semidefinite matrices, thus easing the task of making inference on the former by using the well-established theory regarding the latter.
Consistently with Proposition 2, mesokurtic projections might be obtained as follows, if some nonnull eigenvalues of the fourth sample standardized moment being smaller than three while other eigenvalues are greater than three. First, compute the eigenvectors corresponding to the smallest eigenvalues of the matrix For the sake of clarity we assume that all eigenvalues of Q n are distinct, as it often occurs in practice, according to our personal experience. The matrix Z B n estimates the mesokurtic projections that can be obtained from the sampled distributions, where the columns of the matrix B n are the eigenvectors associated with the k smallest eigenvalues of Q n , with k < d. Alternatively, we could look for random projections onto the subspace generated by the same eigenvectors. Random projections proved to be very useful for finding interesting projections (Peña and Prieto 2007), but they might also be useful for detecting uninteresting ones, since most data projections are approximatively normal, under mild assumptions (Diaconis and Freedman 1984). The number of projections might be chosen on the ground of subject-matter considerations or after testing for nullity the smallest eigenvalues of Q n . The above mentioned approach aimed at finding mesokurtic projections relies on little restrictive assumptions, being Propositions 1, 2 nonparametric in nature. However, a practitioner would greatly appreciate more precise indications when looking for mesokurtic projections. We address the problem with another kurtosis matrix and within the framework of closed skew-normal distributions. Fourth cumulants are often arranged in the exkurtosis, that is the d × d 3 block matrix and the blocks with smaller first indices appear first, followed by those having smaller second indices: the block K i j appears before the block K q j if i < q, and the block K i j appears before the block K i p if j < p. The name "exkurtosis" is a shorthand for "excess kurtosis matrix", since the best known measures of multivariate excess kurtosis are functions of the exkurtosis. Loperfido (2020c) investigates some properties of the exkurtosis matrix. In particular, we have where b is a d-dimensional real vector, y is a d-dimensional random vector with finite fourth-order moments and independent of x, A is a k × d real matrix.
The closed skew-normal distribution introduced by Gonzalez-Farias et al. (2003) provides another useful tool for modelling the sample bias. Its name reminds that it is closed with respect to conditioning, affine transformations and convolutions. The random vector x has a closed skew-normal distribution of parameters ξ, Ω, Ψ , η, Δ, while Ω and Δ are symmetric, positive definite matrices. The following proposition gives a sufficient condition for the existence of a linear projection of a closed skew-normal random vector which is normal (and hence mesokurtic).

Proposition 3 Let x ∼ C SN (ξ, Ω, Ψ , η, Δ) be a d -dimensional random vector whose distribution is closed skew-normal with parameters
Then there exists a linear projection Bx of x which is normal, where B is a (d − h)×d full-rank matrix whose rows belong to the null space of the transposed exkurtosis of Proposition 3 and known inferential results on the closed skew-normal distribution pave the way toward a criterion aimend at finding normal (and therefore mesokurtic) projections. First, estimate the parameters of the closed skew-normal distribution either by maximum likelihood or by alternative methods (Flecher et al. 2009;He et al. 2019). The latter methods should be preferred when the likelihood function is deemed too difficult to compute, since it involves the evaluation of multivariate integrals.
Second, estimate the exkurtosis matrix exk (x) by its sample counterpart E and compute the normalized eigenvectors associated with the d − h smallest eigenvalues of E E T , where the estimate h of h is just the size of the symmetric matrix Δ, i.e. the estimate of Δ. Finally, estimate the matrix B with the matrix B whose i-th row is the normalized eigenvector associated with the above mentioned matrix. Convergence of B to a matrix whose columns belong to the null space of exk (x) T follows from the mild assumptions stated in Tyler (1981) for eigenvectors of symmetric random matrices.

Simulations
The scope of the simulation studies in this section is twofold. Firstly, simulations are used to assess the relevance of the problems posed by projection pursuit when the number of variables approaches the number of units. Secondly, simulations are used to assess the number of mesokurtic projections detected by the method described in the previous section, when sampling from finite normal mixtures.
We addressed the first question by simulating 1000 samples of size n = 25 and n = 50 from where 0 d , 1 d and I d are the d−dimensional null vector, the d−dimensional vector of ones and the d × d identity matrix, π 1 = 0.1, 0.2, 0.3, μ = 1, 2, 3, 4, d = 3, 6, 12, 24 if n = 25 and d = 6, 12, 24, 48 if n = 50. The ratio of the number of variables to the number of units increases from 12 to 96%, with greater values of μ denoting better separations between the mixture's components and greater values of π 1 denoting smaller skewness and kurtosis. The projection maximizing kurtosis is also the one which best separates the two normal components and is proportional to x T 1 d (see, for example, Loperfido 2020b). For each simulated sample, we computed the squared correlation between the projection maximizing sample kurtosis and the projection onto the direction of 1 d : higher values of the squared correlation indicate a better  The table reports the integer part of the average squared correlation between the projection maximizing sample kurtosis and the data projection on the direction of 1 d performance of kurtosis-based projection pursuit in detecting the cluster structure, that is in separating the mixture components. The simulation's results in Table 1 clearly indicate that kurtosis-based projection pursuit is highly sensitive to the ratio of the number of variables to the number of units: higher ratios lead to worse performances. Kurtosis-based projection pursuit becomes virtually useless when the number of variables is just slightly smaller than the number of units. Another simulations study (not reported here) suggests that under this circumstance the projection maximizing sample kurtosis and the projection of the data onto the direction of 1 d are uncorrelated. Quite surprisingly, the problem gets more serious when the number of units increases. On the other hand, the performance of kurtosis-based projection pursuit improves when the components are better separated and nonnormality increases.
A natural question to ask is whether mesokurtic projections exist for widely used families of multivariate distributions, such as normal mixtures. Let the distribution of the d−dimensional random vector x be the mixture, with weights π 1 ,…, π k of the normal distributions N d (μ 1 , Σ), . . ., N d (μ k , Σ), with k < d. In the general case, the fourth cumulants of x are nonnull. Consider now a k ×d matrix B satisfying Bμ i = 0 k for i = 1, …, k. It follows that Ax is normally distributed and hence mesokurtic. Does the method described in the previous section detect all these mesokurtic projections?
The question becomes more difficult to answer when the covariances of the normal components are not assumed to coincide. As an example, consider the mixture of two normal distributions with different means and proportional covariance matrices: π 1 N d (μ 1 , Σ) + π 2 N d (μ 2 , αΣ) , with π 1 + π 2 = 1, μ 1 = μ 2 , π 1 , π 2 > 0 , α = 1. There is no projection whose distribution is normal. Does this mean that there are not mesokurtic projections?
We use simulations to address these problems. Multivariate samples of size 100 were simulated from mixtures of two multivariate normal distributions. Other simulation studies, not reported here, were based on samples of sizes 200 and 300 and led to conclusions similar to the below-mentioned ones.
We first simulated 1000 samples from the normal, homoscedastic mixture where π 1 is either 0.1 or 0.5, for d = 8, 12, 16, 20. The sampled distribution is skewed and leptokurtic (symmetric and platykurtic) when π 1 = 0.1 (π 1 = 0.5), and admits d − 1 jointly normal projections onto the linear subspace orthogonal to 1 d . All samples exhibit either moderate or large units-to-variables ratios, thus motivating mesokurtic projections, either for variable selection or for parametric inference.
Then we simulated 1000 samples from the normal, heteroscedastic mixture where α is either 0.5 or 2, for d = 8, 12, 16, 20. It models the presence of multivariate outliers, by assuming that the sampled distribution is a two-component mixture, with one mixture weight much smaller than the other. The components with the largest and smallest mixture weights model the bulk of the data and the outliers, respectively (Loperfido 2020b). The outliers generated from the model are said to be concentrated or dispersed depending on whether the parameter α equals 0.5 or 2 . We measure kurtosis by Koziol's statistic (Koziol 1989), that is the squared Euclidean norm of the fourth standardized moment: where m is the sample mean, S is the sample variance (which is assumed to be nonsingular) and x T h is the h-th row of the n × d data matrix X , for h = 1, . . ., d. We favored Koziol's kurtosis over the more popular Mardia's kurtosis because the former, unlike the latter, depends on all fourth-order standardized moments. Other properties of Koziol's kurtosis, including its advantages over Mardia's kurtosis, are discussed in Loperfido (2020a).
The Koziol's kurtosis of either a random vector or a sampled dataset increases with its dimension: the Koziol's kurtosis of a d-dimensional normal random vector is 3d (d + 2). In order to compare the nonnormality of datasets with different numbers of variables we use the relative difference between their Koziol's kurtoses and the Koziol's kurtoses of normal random vectors with the same dimensions. For example, if the dataset at hand contains three variables and its Koziol's kurtosis is 54, its relative difference to the Koziol's kurtosis of a three-dimensional normal random vector is (54 − 45)/45 = 0.2.
We applied the proposed method to each sample to obtain d−1, d/2, √ d projections. Ideally, a method purported to obtain mesokurtic projections should detect all the d −1 jointly normal projections. When this does not happen, it is desirable to have some guidance about the achievable number of jointly normal linear projections, as for example d/2 or √ d. For both the original and the projected data we computed the absolute relative difference between the observed Koziol measure of kurtosis and its expectation under the normality assumption. We would like this ratio to be much lower for the projected data than for the original ones. For each sample we also computed the p−value of Koziol kurtosis, using its asymptotically normal approximation. We would like it to be, on average and for the projected data, well above the commonly used threshold value of 0.05.
Tables 2 and 3 contain the results of the two simulation studies. Columns A and B report the integer part of the average absolute relative difference, multiplied by 100, between the Koziol measure of kurtosis and the value which it is expected to attain in the normal case, for the original and the projected variables, respectively. Column C reports the integer part of the average p values, multiplied by 100, corresponding to the Koziol measure of multivariate kurtosis computed for the projected variables. The subtables denoted with d − 1, d/2 and √ d report the simulation's results corresponding to a number of projections equal to the number of variables minus one, half the number of variables and the integer part of the square root of the number of variables, respectively.
The subtables of Table 2 with headers "Leptokurtic" and "Platykurtic" report the results for samples simulated from mixtures where the weight of the standard normal component are 0.1 and 0.5, respectively. The results of the first simulation, related to homoscedastic normal mixtures, may be summarized as follows. The ratio statistic is always much smaller for the projected data than for the original ones. It increases (decreases) with the number of variables when d − 1 ( √ d) mesokurtic projections are seeked. When d/2 mesokurtic projections are seeked, the ratio statistic first decreases and then increases. Normality of the projections, as measured by Koziol's kurtosis, is better achieved when their number is small compared to the number of variables, and the latter is small, either.
The subtables of Table 3 with headers "Concentrated" and "Dispersed" report the results for samples simulated from mixtures where the outliers generated from the model are more concentrated and more dispersed, respectively. The proposed method appears to perform better in the second simulation study. The differences between the ratio statistics for the original and the projected variables are always greater in the second simulation study. The performances of the proposed method are worst and best when d − 1 and √ d mesokurtic projections are seeked, and when the number of original variables is small. The performances of mesokurtic transformations appear to be similar in the presence of concentrated and dispersed outliers. Similar remarks hold when considering the p values of Koziol's kurtosis tests for the projected variables.
We conclude that the proposed method succeeds in either removing or alleviating kurtosis. It is particularly successful when the number of projections is moderate with

AIS data
In this section we illustrate the use of mesokurtic projections in projection pursuit, as described in Sect. 2, with the data collected from the Australian Institute of Sport by Telford and Cunningham (1991), also known as AIS dataset. It contains eleven blood and body measurements from 202 athletes of both genders competing in different sport events: red blood cell count, white blood cell count, hematocrit, hemoglobin concentration, plasma ferritins, body mass index, sum of skin folds, body fat, lean body mass, height and weight. Here, we focus on the 23 female athletes playing netball, so that the number of units is about twice the number of variables. Nine variables in the dataset are mildly platykurtic, while the remaining two variables are mildly leptokurtic: their kurtoses range from 2.079 to 4.681, with just two of them being greater than 3. We assess the joint normality of the eleven variables by means of Koziol's kurtosis tests for multivariate normality, whose p value is virtually zero. The maximum number of jointly mesokurtic projection detected by the method described in Sect. 2 is three (the p value of Koziol's kurtosis test for multivariate normality is greater than 0.17), which is much smaller than the number of original variables. Further dimension reduction might be achieved by looking for two sets of mutually orthogonal projections, one nearly mesokurtic and the other containing the information on the interesting structure.
Mild nonnormality might be due to the presence of outliers, often detectable by means of visual inspection. The boxplots in Fig. 1 suggest the presence of several Healy's plot of the eleven variables in the AIS dataset recorded from female netball players potential outliers, represented by the asterisks above and below the whiskers. The third, ninth and eleventh variables correspond to the third, ninth and eleventh boxplots starting from the left of Fig. 1. They hint that the potential outliers might be the first, the second and the fifteenth units. On the other hand, the Healy's plot in Fig. 1 clearly suggests that the twenty-first unit is the most likely to be an outlier, since it has the highest Mahalanobis distance from the mean and it is represented by the point farthest away from the bisector line. We conclude that visual inspection of the original variables gives contradictory results about the presence of outliers (Fig. 2).
We addressed outlier detection by means of kurtosis maximization, as suggested by several authors (Peña and Prieto 2001a;Hou and Wentzell 2014;Loperfido 2020b). The projection with maximal kurtosis is represented by the last boxplot from the left in Fig. 3, which clearly shows the presence of a single outlier, that is the twenty-first observation. Then we performed kurtosis maximization on the projections orthogonal to those with small absolute fourth cumulants, as computed with the method described in Sect. 2. The first, second, third, fourth and fifth boxplots from the left of Fig. 3 represent the projections with maximal kurtoses orthogonal to 10, 9, 8, 7 and 6 nearly mesokurtic projections. All boxplots clearly suggest that the twenty-first unit is the most likely to be an outlier and all but the first two boxplots on the left suggest that it is the only potential outlier. All projections maximizing kurtosis are strongly correlated with each other: the correlations between them are never smaller than 0.93. This statistical analysis is exploratory in nature and more sophisticated, inferential methods are needed to assess the presence of outliers. This section is just aimed at showing how mesokurtic projections might be helpful for dimension reduction while retaining the relevant data structure to be uncovered by kurtosis-based projection pursuit.

Randu data
In simulation studies, a normal random sample is often obtained by applying the normal quantile transformation to data generated from a uniform distribution. The method might lead to unsatisfactory results, even if visual and formal procedures support the hypothesis of the randomly generated data being uniform. We illustrate the problem with the RANDU data and show how mesokurtic projections might be helpful in alleviating it. The same dataset provides a good case of mesokurtic projections virtually identical to those with extreme kurtosis. The RANDU dataset contains 400 cases and 3 variables. Each datum was generated by the RANDU multiplicative congruential scheme x n+1 = (2 16 + 3)x n (mod 2 31 ). RANDU data satisfy the constraint x n − 6x n+1 + x n+2 ≡ 0 (mod 2 31 ), so that all triplets (x n , x n+1 , x n+2 ) lie on 15 parallel lines through the unit cube. This structure might be detected by means of appropriate rotations. Despite that, each variable in the dataset appear to be generated from a uniform distribution in the interval [0, 1]. For example, their means, standard deviations, skewnesses and kurtoses (reported in Table 4) are very close to their expected values 0.5, 0.2889, 0 and 1.8, respectively. These features made the RANDU dataset a perfect case for projection pursuit (see, for example, Huber 1985).  Since each variable appears to be uniformly distributed, it is natural to trasform them to normality by applying the normal quantile transformation. Unfortunately, this method fails to achieve normality, as it is apparent from their univariate kurtoses: 3.626, 3.449, 3.795. The p values of the kurtosis-based tests for normality of the three transformed variables are smaller than 0.05. The Box-Cox transformation gives even worse results. It leads to variables which are significantly platykurtic: their kurtoses are 2.056, 2.026, 2.117. The p values of both the skewness and the kurtosis tests for normality of the three transformed variables are smaller than 0.05.
Finally, we applied the proposed method to obtain two data projections. The first one is virtually normal (its skewness and kurtosis are −0.0359 and 2.9815), while the second one is only slightly so (its skewness and kurtosis are −0.1784 and 2.5485). The p values of the kurtosis tests for testing the normality of the first and second projections are 0.94 and 0.0654, respectively. The two projections are not jointly normal: the p− value of the Koziol test for multivariate normality is 0.0067. The histograms (Figs. 4 and 5) and the scatterplot of the two projections (Fig. 6) are consistent with these findings.
Interestingly, the first projection coincides, up to location and scale changes, with the projection achieving maximal kurtosis. Hence the RANDU dataset provides a good example of kurtosis-based projection pursuit as a tool for recovering normality, rather than detecting nonnormal features.
The empirical findings in this section might be summarized as follows. We applied the proposed method to a well-known, three-dimensional, platykurtic dataset, obtaining a normal univariate projection and a slightly nonnormal bivariate projection. We therefore succeeded in achieving univariate normality and in alleviating bivariate nonnormality. Commonly used, componentwise, nonlinear methods failed in both recovering either univariate or bivariate normality. We conclude that the RANDU dataset supports the use of mesokurtic projections when a randomly generated normal sample is seeked.

Conclusions
The paper describes a method for obtaining data projection with null or negligible fourth cumulants. It uses several kurtosis matrices to establish either sufficient or necessary conditions for the existence of mesokurtic projections. It also relates mesokurtic projections to spectra of kurtosis matrices, thus easing inferential tasks. Mesokurtic projections address the problems arising in the application of kurtosis-based projection pursuit to datasets where the number of variables is only slightly smaller than the number of units. The proposed method has been implemented in the R package Kurt (Franceschini and Loperfido 2020).
Data preprocessing is a default practice in projection pursuit (see, for example, Jones and Sibson (1987)). Laa and Cook (2020) proposed to remove skewness from the data when preprocessing them, which can be done by means of linear projections (Loperfido 2014;Franceschini and Loperfido 2019). We recommend the use of mesokurtic projections after linear projections to symmetry when preprocessing the data. Centering, sphering, symmetrizing and mesokurtic projections conveniently address the first, second, third and fourth sample cumulants.
Mesokurtic projections might also be used in independent component analysis, a multivariate statistical method aimed at recovering independent random variables from the observed data which are their one-to-one affine transformations. The model is identifiable if at most one independent variable is normal (Miettinen et al. 2015). The nonnormal independent variables are commonly assumed to be leptokurtic, so that fourth cumulants may be used to recover them (Girolami and Fyfe 1996). The same assumption implies that the model is identifiable if there is at most one linear combination of the observed variables which is mesokurtic. Hence model identification is related to the rank of the matrix which characterizes mesokurtic projections.
Mesokurtic projections might also preprocess the data before using a statistical method whose performance heavily depends on the sampled distribution having null or at least negligible cumulants. Examples include inference on covariance matrices (Yanagihara et al. 2005), linear discriminant analysis (Arevalillo and Navarro 2012), multivariate linear regression (Yanagihara 2007) and variogram estimation (Rezvandehy and Deutsch 2018). Mesokurtic projections might provide a viable alternative to other statistical methods which are commonly used to address these problems: flexible modelling, nonparametric inference and nonlinear transformations.
Despite its theoretical properties and practical usefulness, the proposed method suffers from several drawbacks. In the first place, mesokurtic projections do not exist for some distributions, as for example scale mixtures of multivariate normal distributions. In the second place, the proposed method may not detect all mesokurtic projections, especially in high dimensional cases. In the third place, mesokurtic projections are not appropriate when the sampled distribution is nonnormal, but all its fourth cumulants equal zero. In the fourth place, mesokurtic projections imply some information loss, since the original data are projected onto a lower dimensional space.
A natural question to ask is whether the proposed method might be extended to datasets with more variables than units. Both Propositions 1, 2 rely on the standardized random variable z = Σ −1/2 (x − μ), which is defined only if the number of units is greater than the number of variables. In the opposite case, a common solution is the replacement of the covariance matrix Σ with the positive definite convex linear combination Γ = λΣ + (1 − λ) Ω of Σ itself with a full rank matrix Ω of the same size, where 0 < λ < 1 (see, for example, Hastie et al. 2008). This approach already appeared in the literature on projection pursuit (Lee and Cook 2010;Hui and Lindsay 2010). We are currently investigating the generalizations of Propositions 1, 2 to situations where z = Σ −1/2 (x − μ) is replaced by y = Γ −1/2 (x − μ), where Γ −1/2 is the positive definite symmetric square root of the inverse Γ −1 of Γ .

Appendix
We recall some fundamental properties of the Kronecker product which we shall use repeatedly in the following proofs (see, for example, Rao and Rao 1998 , pages 194-201): (P1) the Kronecker product is associative: Finally, we recall some properties of the commutation matrix C p,q ∈ R pq × R pq (Kollo and (Ortega 1987, page 232). This would in turn imply that tr (AB) were positive, thus leading to a contradiction. We conclude that A and B cannot be both positive definite. Equivalently, we showed that M 4,z has at most one eigenvector which might be represented as a vectorized, definite symmetric matrix. We denote this eigenvector by v = vec (V ), where V is a vectorized, symmetric and positive definite d × d matrix.
Assume now that v is not an eigenvector associated to the dominant eigenvalue λ of M 4,z , and denote with u such eigenvector. Using an argument similar to the previous one, we can show that u T v = 0, which would be possible if an only if u were a vectorized, symmetric and indefinite matrix. However, u is a vectorized, positive semidefinite symmetric matrix (Loperfido 2017), thus leading to a contradiction. We have therefore proved that v must be an eigenvector associated with the dominant eigenvalue of M 4,z .
We now prove by contradiction that λ is simple. Assume that λ is not a simple eigenvalue of M 4,z , so that there is another eigenvector m = vec (M) associated to it, where M is a vectorized and symmetric d × d matrix of unit norm. Without loss of generality we can assume that v and m are mutually orthogonal: v T m = tr (V M) = 0. The last equality, together with positive definiteness of V , imply that M is indefinite. The inequality where S is a symmetric, d × d matrix of unit norm, follows from P6 and from m being an eigenvector associated to the dominant eigenvalue of M 4,z . Since V is symmetric, it might be decomposed into ΩΔΩ T , where the columns ω 1 , …, ω d of Ω ∈ R d × R d are the normalized eigenvectors of M and Δ is a diagonal matrix whose i−th diagonal entry is the i−th eigenvalue of M: By assumption, the covariance matrix of z is nonsingular, so that E Y 2 i > 0 for i = 1, . . ., d. Suppose now that M is indefinite, so that the smallest k eigenvalues of M are negative, for some integer k beween 1 and d − 1, implying (19) Equivalently, the inequality would hold true, for Q = ΩΔ T + Ω and Δ + = diag (δ 1 , . . . , δ k−1 , −δ k , . . . , −δ d ). Also, Q would be positive semidefinite with the same norm of M: M 2 = δ 2 1 + · · ·+δ 2 d = Q 2 . As a direct consequence, m = vec (M) would not be an eigenvector associated with the dominant eigenvalue of M 4,z , thus leading to a contradiction. We conclude that the dominant eigenvalue λ of M 4,z is simple.
We now prove that no mesokurtic projection of z exists when all positive eigenvalues of M 4,z are greater than three. Let w = vec (W ) be the eigenvector associated with the smallest positive eigenvalue η > 3 of M 4,x , where W is a vectorized, symmetric d × d matrix of unit norm. Also, let c be a d−dimensional real vector of unit length, so that c T z is a standardized random variable: E c T z = 0 and E c T z 2 = 1. The where S is a symmetric, d × d matrix of unit norm, follows from P6 and from m being an eigenvector associated to the smallest positive eigenvalue of M 4,z . The kurtosis of c T z is and the d × d matrix cc T is real, symmetric and of unit norm. Hence the kurtosis of c T z is always greater than η, which in turn is greater than three by assumption. The proof that no mesokurtic projection of z exists when all positive eigenvalues of M 4,z are smaller than three is very similar to the previous one and is therefore omitted.
Proof of Proposition 2 Repeated use of P4 leads to the identities Now apply P2, P6 and P4 to obtain Similarly, apply P2 and P8 and P7 to obtain The definitions of cok (z) = E z ⊗ z T ⊗ z T ⊗ z T and K z = E z T zzz T , together with linear properties of the expected value, leads to By definition, e i is the d−dimensional vector whose i−th component is one while the others are zero. Hence e T i e j is either one or zero depending on whether i = j or i = j. We can then write The last identity follows from e i e T i being the matrix whose only nonnull element is the i−th diagonal element, which equals one. All fourth-order cumulants κ i jhk = ∂ 4 log E exp ιt T x /∂t i ∂t j ∂t h ∂t k , where ι = √ −1 and t T = (t 1 , . . . , t d ), might be conveniently arranged into the d × d 3 matrix F = ∂ 4 log E exp ιt T x /∂t∂ 3 t T , which admits the representation (Kollo and von Rosen 2005, page 187). The above identities, together with P3, helps in simplifying the product of F and its transpose F T : Without loss of generality we shall assume that the columns of B T are mutually orthogonal, normalized vectors, so that B B T = I h . Let where y = Bz. By ordinary properties of cumulants, the assumption B B T = I h and P2, we have By definition, the columns of B T span the null space of cok (z) cok T (z) − 6K + 3 (d + 2) I d , so that GG T is a null matrix, which implies that all fourth-order cumulants of Bz equal zero, and this completes the proof.

Proof of Proposition 3
Let u and v be a d-dimensional and a h-dimensional random vectors whose joint distribution is Consider now the decomposition u = u − Cv + Cv, where Basic properties of normal random vectors imply that u − Cv and v are independent, normal random vectors. Gonzalez-Farias et al. (2003) showed that x and u|v > 0 are identically distributed, so that we can write x ∼ u − Cv + Cv + , where v + = v|v > 0.
The identity exk (x) = exk (Cv + ) follows from u − Cv being a normally distributed random vector. Apply now multilinear properties of the fourth-order cumulants (Loperfido 2020c): By definition, C is a d × h matrix and by assumption d > h, so that there exists a full rank (d − h) × d matrix B such that BC is a null matrix. As a first implication, the rows of B belongs to the null space of the transposed exkurtosis: As a second implication, Bx is a normally distributed random vector: Bx ∼ Bu − BCv + BCv + = Bu.
The proof is then complete.