Generalized Tietjen–Moore test to detect outliers

An outlier is an observation that appears to deviate markedly from the other observations in the sample, and outlier detection is one of the most important tasks in data analysis. One of the fundamental assumptions of most parametric multivariate techniques is multivariate normality, which implies the absence of multivariate outliers. Multivariate outlier detection is commonly based on the Mahalanobis distance, and detection methods have been suggested for numerous applications in the literature. In this work, the Tietjen–Moore test is generalized to multivariate data. A simulation study is carried out to evaluate the performance of multivariate outlier detection methods under various conditions. The results show that the proposed method performs well whether or not the data set is multivariate normal.


Introduction
An outlier is an observation that appears to deviate from other observations, namely, one inconsistent with the remainder [2,6]. The detection of outliers in multivariate data is one of the most important problems in the physical, chemical, medical and engineering sciences. Interest in outlier detection procedures has been growing, since researchers are not only interested in the regular data but also wish to find the irregular data and, consequently, the source of the data abnormality. Most standard multivariate analysis techniques rely on the assumption of normality and require estimates of both the location and scale parameters of the distribution, and most statistical techniques are sensitive to the presence of outliers. Outliers may be univariate or multivariate. The most common way of identifying multivariate outliers in a multivariate normal data set is to calculate the Mahalanobis distance. Moreover, there are both robust and nonrobust procedures to identify outliers in multivariate data. Many methods have been proposed for multivariate outlier detection. Garrett [5] introduced the chi-square plot, which draws the empirical distribution function of the robust Mahalanobis distances against the chi-square distribution. Franklin et al. [4] used Stahel-Donoho estimators to identify multivariate outliers. Alameddine et al. [1] presented a case study analyzing the effectiveness of the minimum covariance determinant (MCD), the minimum volume ellipsoid (MVE), and the M-estimator. Jackson and Chen [8] compared Mahalanobis distances to the minimum volume ellipsoid for identifying outliers in multivariate data. Dang and Serfling [3] introduced nonparametric multivariate outlier identifiers based on multivariate depth functions. Pena and Prieto [9] presented a simple multivariate outlier detection procedure and a robust estimator for the covariance matrix. The methodology is described in Sect. 2. The simulation study is given in detail in Sect. 3 to assess the proposed tests under different conditions. The paper ends with the conclusions in Sect. 4.

Methodology
In this section, the Tietjen-Moore test for univariate outliers and the robust minimum covariance determinant estimator (MCDE) are reviewed for motivation. The generalized Tietjen-Moore test is then proposed, using the MCDE to detect multivariate outliers.

Tietjen-Moore test for univariate outliers
The Tietjen-Moore test is used to detect multiple outliers in a univariate data set [13]. This test assumes that the underlying distribution is approximately normal. The suspected number of outliers k must be specified exactly to apply the test properly. The Tietjen-Moore test is defined for the hypotheses:
• H_0: There are no outliers in the data set.
• H_A: There are exactly k outliers in the data set.
First, the n data points are sorted from the smallest to the largest so that x_i denotes the ith largest data value. Then, the test statistic for the k largest points is

$$L_k = \frac{\sum_{i=1}^{n-k} (x_i - \bar{x}_k)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$

where $\bar{x}$ is the sample mean for the full data set and $\bar{x}_k$ is the sample mean with the largest k points removed. Similarly, the test statistic for the k smallest points is

$$L_k = \frac{\sum_{i=k+1}^{n} (x_i - \bar{x}_k)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$

where $\bar{x}_k$ is now the sample mean with the smallest k points removed. To test for outliers in both tails, the absolute residuals are calculated as $r_i = |x_i - \bar{x}|$, and $z_i$ denotes the $x_i$ values sorted by their absolute residuals in ascending order. The test statistic can then be expressed in terms of the z values as

$$E_k = \frac{\sum_{i=1}^{n-k} (z_i - \bar{z}_k)^2}{\sum_{i=1}^{n} (z_i - \bar{z})^2},$$

with $\bar{z}$ denoting the sample mean for the full data set and $\bar{z}_k$ the sample mean with the largest k points removed. The value of the test statistic lies between zero and one. If there are no outliers in the data, the test statistic is close to 1.
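The two-sided statistic $E_k$ above can be computed directly; a minimal sketch in numpy (the function name is mine):

```python
import numpy as np

def tietjen_moore_ek(x, k):
    """Two-sided Tietjen-Moore statistic E_k for the k most extreme points.

    Sorts the data by absolute residuals |x_i - mean(x)|, then compares the
    sum of squares with the k largest residuals removed to the full sum of
    squares.  Values near 1 suggest no outliers; values near 0 suggest the
    k suspected points are outliers.
    """
    x = np.asarray(x, dtype=float)
    r = np.abs(x - x.mean())      # absolute residuals from the full-data mean
    z = x[np.argsort(r)]          # x values sorted by residual, ascending
    z_trim = z[:-k]               # drop the k points with largest residuals
    num = np.sum((z_trim - z_trim.mean()) ** 2)
    den = np.sum((z - z.mean()) ** 2)
    return num / den
```

On a clean normal sample the statistic stays close to 1; appending a couple of gross outliers drives it toward 0. In practice the null critical value for a given n and k is obtained by simulation.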

Robust minimum covariance determinant estimator
The minimum covariance determinant (MCD) estimator is robust in the sense that the estimates are not unduly influenced by outliers in the data, even if there are many outliers. The MCD estimator proposed by Rousseeuw [11] is highly robust and very useful for detecting outliers in multivariate data. Owing to the MCD's robustness, multivariate outliers can be detected by their large robust distances. The robust distance is defined like the usual Mahalanobis distance (MD), which is sensitive to the masking effect. In the multivariate location and scatter setting, the data are stored in an $n \times p$ data matrix $X = (x_1, \ldots, x_n)^T$ with $x_i = (x_{i1}, \ldots, x_{ip})^T$ the ith observation, where n stands for the number of objects and p for the number of variables. The Mahalanobis distance $MD(x_i)$ expresses how far away $x_i$ is from the center of the cloud, relative to the size of the cloud, and is defined as

$$MD(x_i) = \sqrt{(x_i - \bar{x})^T S^{-1} (x_i - \bar{x})},$$

where $\bar{x}$ is the sample mean and S the sample covariance matrix. However, instead of the nonrobust sample mean and covariance matrix, the robust distance is based on the MCD location estimate and scatter matrix:

$$RD(x_i) = \sqrt{(x_i - \hat{\mu}_{MCD})^T \hat{\Sigma}_{MCD}^{-1} (x_i - \hat{\mu}_{MCD})},$$

where $\hat{\mu}_{MCD}$ is the MCD estimate of location, given by

$$\hat{\mu}_{MCD} = \frac{\sum_{i=1}^{n} W(d_i^2)\, x_i}{\sum_{i=1}^{n} W(d_i^2)},$$

and $\hat{\Sigma}_{MCD}$ is the MCD estimator of covariance, given by

$$\hat{\Sigma}_{MCD} = c_1 \frac{1}{n} \sum_{i=1}^{n} W(d_i^2)\, (x_i - \hat{\mu}_{MCD})(x_i - \hat{\mu}_{MCD})^T,$$

with W an appropriate weight function. The constant $c_1$ is a consistency factor [7]. The MCD estimators $(\hat{\mu}_{MCD}, \hat{\Sigma}_{MCD})$ of multivariate location and scatter have breakdown value $\varepsilon^*_n(\hat{\mu}_{MCD}) = \varepsilon^*_n(\hat{\Sigma}_{MCD}) \approx (n-h)/n$; the MCD attains its highest possible breakdown value when $h = \lfloor (n+p+1)/2 \rfloor$. The MCD estimator also has a bounded influence function [7].
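The classical distance MD and the robust distance RD differ only in which location and scatter estimates are plugged in; a small numpy helper (the function name is mine) makes this explicit:

```python
import numpy as np

def mahalanobis_distances(X, center, cov):
    """Distances d_i = sqrt((x_i - center)^T cov^{-1} (x_i - center)).

    Passing the sample mean and covariance gives the classical MD;
    passing MCD estimates of location and scatter gives the robust RD.
    """
    X = np.asarray(X, dtype=float)
    diff = X - np.asarray(center, dtype=float)
    cov_inv = np.linalg.inv(cov)
    # quadratic form (x_i - c)^T cov^{-1} (x_i - c), one value per row
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return np.sqrt(d2)
```

Observations whose squared distance exceeds a chi-square quantile such as $\chi^2_{p,0.975}$ are then flagged as outliers.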
The FAST-MCD algorithm of Rousseeuw and Van Driessen [12] is mainly used to compute the MCD estimator efficiently. MCDCOV computes the MCD estimator of a multivariate data set. This estimator is given by the subset of h observations with the smallest covariance determinant. The MCD location estimate is then the mean of those h points, and the MCD scatter estimate is their covariance matrix. The default value of h is roughly 0.75n (where n is the total number of observations), but the user may choose any value between n/2 and n. Based on the raw estimates, weights are assigned to the observations such that outliers get zero weight. The reweighted MCD estimator is then given by the mean and covariance matrix of the cases with non-zero weight [12].
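A highly simplified sketch of the concentration ("C-") step at the heart of FAST-MCD, assuming the subset size h is given. A production implementation such as MCDCOV adds nested partitioning for large n, selective iteration of the best starts, the consistency factor, and the reweighting step, none of which are shown here:

```python
import numpy as np

def c_step_mcd(X, h, n_starts=20, n_iter=25, seed=0):
    """Approximate the raw MCD solution by iterating concentration steps.

    From small random 'elemental' starting subsets, each C-step refits the
    mean/covariance on the current subset and keeps the h points with the
    smallest Mahalanobis distances.  The covariance determinant can only
    decrease, so each start converges to a local optimum; the subset with
    the overall smallest determinant is returned.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    rng = np.random.default_rng(seed)
    best_det, best_idx = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=p + 1, replace=False)   # elemental start
        for _ in range(n_iter):
            mu = X[idx].mean(axis=0)
            S = np.cov(X[idx], rowvar=False)
            diff = X - mu
            # pinv guards against near-singular fits on tiny subsets
            d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.pinv(S), diff)
            idx = np.argsort(d2)[:h]                     # keep h closest points
        det = np.linalg.det(np.cov(X[idx], rowvar=False))
        if det < best_det:
            best_det, best_idx = det, idx
    mu = X[best_idx].mean(axis=0)
    S = np.cov(X[best_idx], rowvar=False)
    return mu, S
```

On contaminated data the returned location stays near the center of the clean bulk, while the classical mean is pulled toward the outlying cluster.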

Generalized Tietjen-Moore Test for multivariate outliers
Numerous methods have been suggested to detect multivariate outliers. The most popular one is the method based on the Mahalanobis distance. The presence of multivariate outliers may lead to biased estimation of the parameters and other drawbacks. The basis of the generalized Tietjen-Moore test is the univariate form of the Tietjen-Moore test. Suppose we have a set of multivariate data and wish to test for multivariate outliers. To test for outliers in both tails, the absolute residuals are calculated in each dimension as $r_{ij} = |x_{ij} - \bar{x}_j|$, where $z_{ij}$ denotes the $x_{ij}$ values sorted by their absolute residuals in ascending order. The Tietjen-Moore test is generalized for multivariate data as

$$E_k = \frac{\sum_{j=1}^{p} \sum_{i=1}^{n-k} (z_{ij} - \bar{z}_{jk})^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} (z_{ij} - \bar{z}_j)^2},$$

where $z_{ij}$ is the ith observation in the jth dimension, $\bar{z}_{jk}$ is the jth dimension mean with the k points with largest residuals removed, and $\bar{z}_j$ is the jth dimension mean for the full data.
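The generalized statistic can be sketched in numpy, following the per-dimension trimming described above (the function name, and my reading that k points are trimmed in each dimension, are assumptions reconstructed from the text):

```python
import numpy as np

def generalized_ek(X, k):
    """Generalized Tietjen-Moore statistic for an n x p data matrix.

    In each dimension j, observations are sorted by absolute residuals
    |x_ij - mean_j|; the k points with largest residuals are trimmed, and
    trimmed/full sums of squares are accumulated over all p dimensions.
    """
    X = np.asarray(X, dtype=float)
    num = den = 0.0
    for j in range(X.shape[1]):
        x = X[:, j]
        z = x[np.argsort(np.abs(x - x.mean()))]   # sort by residual, ascending
        z_trim = z[:-k]                           # drop k most extreme points
        num += np.sum((z_trim - z_trim.mean()) ** 2)
        den += np.sum((z - z.mean()) ** 2)
    return num / den
```

As in the univariate case, a value near 1 is consistent with no outliers and a value near 0 with k outliers; robust residuals (based on MCD estimates) can be substituted for the classical ones.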
The generalized Tietjen-Moore test is defined for the hypotheses:
• H_0: There are no outliers in the data set.
• H_A: There are exactly k outliers in the data set.
The value of the test statistic lies between zero and one. If there are no outliers in the data, the test statistic is close to 1; if there are outliers, it will be closer to zero. $E_k$ follows a Beta distribution and is not affected by the sample size. In the next section, a simulation study is given to evaluate the performance of the multivariate outlier detection methods under various conditions.

Simulation study
To evaluate the performance of the proposed test and to compare the estimators with each other, we conduct a simulation study with different schemes. We use two pairs of location–scatter estimators: the classical $(\bar{x}, S)$ and the MCD [12] with an approximate 25% breakdown point (denoted RMCD25), which has better efficiency than the one with the (maximal) 50% breakdown point.
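The paper does not spell out its data-generation scheme; a typical shift-contamination setup for such a study (all parameter choices here are illustrative assumptions, not the paper's) might look like:

```python
import numpy as np

def contaminated_sample(n, p, eps, shift, seed=0):
    """Draw n points from N(0, I_p), then replace a fraction eps of them
    with points from the shifted distribution N(shift * 1, I_p)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    n_out = int(round(eps * n))       # number of contaminated observations
    if n_out:
        X[:n_out] = rng.normal(loc=shift, size=(n_out, p))
    return X
```

Each scheme then varies n, p, the contamination fraction eps, and the shift magnitude, and the test statistics are computed on many replicate samples.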
The index-robust distance plots are given in Fig. 1 for both clean and contaminated data; the horizontal axis indexes the observations. Figure 1 clearly shows the outliers for the contaminated data. Figure 2 displays the robust 97.5% tolerance ellipse based on robust distances for multivariate data with N = 60, 240 and p = 2.
Mahalanobis distances and robust distances for the multivariate data with p = 2 and N = 60, 240 are illustrated in Fig. 3. This illustrates the masking effect: classical estimates can be highly affected by outlying observations. To obtain a reliable analysis of multivariate data with outliers, robust estimators that can resist possible outliers are required.
The value of the test statistic lies between zero and one. If there is no outlier in the data, the test statistic is close to 1; if there are outliers, it will be closer to zero. The robust test statistics give smaller values and should therefore be used in the case of contamination.
In Tables 1 and 2, $E_{k1}$ denotes the $E_k$ values obtained from the classical residuals based on the classical estimators, $E_{k2}$ those obtained from robust residuals based on the MCD estimators, and $E_{k3}$ those obtained from weighted residuals based on the MCD estimators.
[Fig. 1: Index-robust distance plot for multivariate data, N = 60, 240 and p = 2. Fig. 2: The robust tolerance ellipse for multivariate data with N = 60, 240 and p = 2.]

The results for normal and non-normal multivariate data in Tables 1 and 2 can be summarized as follows:
• In the case of contamination, $E_{k3}$'s performance is better than $E_{k2}$'s. These two robust test statistics reveal the outliers and so can be used to detect them.
• When the contamination amount and the sample size decrease, $E_{k2}$ gives better results.
• In the case of contamination, the $E_k$ test statistic based on the classical estimators deteriorates.
• The weighted robust test statistic has the best performance.
• The robust test statistics are not affected by the sample size.
The MCD estimator is a highly robust estimator of multivariate location and scatter. Therefore, detecting multivariate outliers using the MCD estimator is a good solution. The results are valid for both normal and non-normal multivariate cases: the proposed method performs well whether or not the data set is multivariate normal. From the simulation study, we conclude that the proposed method is applicable for multivariate outlier detection.

Conclusions
Univariate or multivariate outliers are important because they change the results of data analysis. Although the easiest way to detect multivariate outliers is a multidimensional scatter plot, methods based on the Mahalanobis distance or Cook's distance have been suggested in the literature. These distances use estimates of location and scatter to identify values that lie considerably far from the bulk of the data. Principal components might be a good alternative, but they may fail when the distribution is multimodal. In this paper, we generalize the Tietjen-Moore test to multivariate data. In the formulation, the classical estimators of the mean and the covariance matrix are replaced by robust estimators to avoid the masking effect. The value of the test statistic always lies between zero and one. A simulation study is conducted to evaluate the performance of the multivariate outlier detection methods under various conditions. The results reveal that the proposed method performs well whether or not the data set is multivariate normal, even though multivariate analyses require checking multivariate normality.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creative commons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.