1 Introduction

1.1 What are outliers

Outliers are extreme or atypical values that can reduce and distort the information in a dataset. The problem of how to deal with outliers has long been a concern. Barnett and Lewis (1994, p. 3), one of the pioneering books in mathematical statistics dealing with outlier detection, references Pierce (1852), published more than 150 years ago. Eliminating outliers from estimation carries the risk of losing information, while including them carries the risk of contamination. To deal with this problem, Barnett and Lewis (1994, p. 3) devised a principle of accommodating outliers using robust methods of inference, allowing all the data to be used while alleviating the undue influence of outliers. We follow this principle and focus on the robust statistical methods introduced by Huber (1964), which are the most suitable for survey data processing. Statistical tests are therefore beyond the scope of our discussion.

Some outliers in survey statistics contain a mistake of some sort that requires correction. Others may not involve a mistake, but represent a trend different from that of the majority while having a large design weight in the dataset. Careful consideration of the influence of outliers on estimation needs to be given, and statisticians compiling official statistics need to determine whether such extreme values deserve their prescribed sampling weights in terms of representativeness, as discussed by Chambers (1986).

While the UNECE Data Editing Group broadly defines outliers as observations in the tails of a distribution (Economic Commission for Europe of the United Nations 2000, p. 10), narrower definitions vary depending on the purpose of the activities in the statistical production process. Outliers require appropriate treatment at each of the processing steps; otherwise, they may reduce estimation efficiency and introduce bias into the resulting statistical product. The objective of this paper is to introduce both practical methods currently in use and experimental methods under research intended for use in statistical production to address the problem of data outliers.

Most conventional outlier detection methods in the field of official statistics are univariate approaches mainly applied to the search for erroneous observations so that they can be corrected and entirely valid datasets can be established. A range check that sets upper and lower thresholds for “normal” (i.e., not outlying) data is a typical example, as is the quartile method. However, such univariate methods cannot detect multivariate outliers, that is, outliers involving different relationships among the variables. In multivariate cases, scatter plot matrices and other visualization techniques have been used more frequently than multivariate detection methods because of the latter's computational complexity and processing time, or the difficulty associated with inspecting the detected multivariate outliers. Further complicating matters, with multivariate methods, just which outliers are detected can depend on which particular method is used.

Historically, statistical tables have been the major final product of official statistics, which means that the demand for detecting multivariate outliers having different relationships among the variables has not been high. However, in 2007, the Statistics Act of Japan (Act No. 53) was revised for the first time in 60 years. The new act recognizes official statistics as an information infrastructure and promotes the use of microdata (e.g., Nakamura 2017). Given this change in policy, the need to detect multivariate outliers has increased, since outliers tend to be more problematic in microdata, not only for users but also for providers, in terms of privacy protection. In addition, the practical usability of multivariate outlier detection methods is increasing with continuing improvements in computer hardware and statistical software.

In the next subsection, a general model of the statistical production process is described. The model consists of three steps: data cleaning, imputation, and estimation and formatting. The outliers to be focused on depend on the purpose of each step.

In Sect. 2, available multivariate outlier detection methods for the data cleaning step are discussed. Section 3 describes robust regression for imputation. The M-estimators discussed in Sect. 3 are then extended to the ratio model in Sect. 4. Calibration of design weights to cope with outliers having large design weights is discussed in Sect. 5. Section 6 provides two examples of practical use of the introduced methods. Concluding remarks and a discussion of future work are given in Sect. 7.

1.2 General model of the statistical production process

Figure 1 provides a general model of the statistical production process for surveys, beginning with raw electronic data. The first step is data cleaning. In this step, erroneous data are detected for correction to ensure a clean, valid dataset. The second step is imputation, where missing values are estimated and replaced as necessary to produce complete datasets for the analysis to be conducted in the next step. The final step involves estimation and formatting to produce the final statistical product.

Fig. 1 General process of statistics production with relation to outliers

1.2.1 Data cleaning

The objective of the data cleaning step is to find and correct errors and inconsistencies. Consequently, the outliers in this step are those with a high likelihood of containing an error or inconsistency. Any detected outlier is checked and may be left unchanged if it is not wrong. Otherwise, it is corrected based on available information when possible, or removed and, if necessary, estimated in the imputation step to ensure a clean dataset.

Section 2 focuses on multivariate outlier detection methods, especially those for elliptical distributions, since these types of methods have not been widely used in practice.

1.2.2 Imputation

Missing data are often unavoidable in survey statistics. Discarding records with missing values may cause biased estimation even when the values are missing at random (MAR) (Little and Rubin 2002, pp. 117–127). Therefore, essential variables for estimation often require missing data imputation. Since the input for the imputation step is clean data (from Step 1), the outliers here are not erroneous data but rather extreme values that may distort estimations for imputation. An example of this is high leverage points in regression estimation. Such points may have a substantial influence on the resulting estimation for imputed values.

From among the many imputation methods available, this paper focuses on linear regression and ratio imputation. In general, introducing robust estimation improves the efficiency of the imputation compared to ordinary least squares (OLS) when applied to datasets that have longer tails than the normal distribution.

Robust regression imputation is discussed in Sect. 3, followed by robust ratio imputation in Sect. 4.

1.2.3 Estimation and formatting

In the final step of the statistical production process, the outliers in need of attention are those having large design weights. As an illustration, suppose a particular record in a household survey has a design weight of 1000 and a household income of 5 million yen (approximately 46,000 USD) per month (an atypically high-income level). This is very likely to cause a problem in the statistical tables produced from the survey. This one very wealthy household is treated as a representative of 1000 other households in the area that were not surveyed. As a consequence, the population estimate of the household income for the area will reflect that there are 1000 households with a monthly income of 5 million yen. Design weight calibration based on such “outlyingness” is discussed in Sect. 5.

2 Multivariate outlier detection methods for elliptically distributed datasets

We begin with outlier detection methods for unimodal numerical data, first establishing the difference between univariate and multivariate methods, and then introducing several multivariate methods with desirable characteristics. The methods introduced in this section are mainly used for data cleaning purposes.

2.1 Univariate methods versus multivariate methods

Univariate methods for numerical data are conventionally used in the data cleaning step to identify erroneous observations. A common practice is to set the thresholds for valid data (i.e., non-outliers) at a distance of three sigma (or more, depending on the distribution) from the mean of a target dataset. This method is essentially the idea of a control chart in the field of total quality management (TQM); however, this simple method is not robust, as the thresholds are supposed to be decided with a dataset in stable condition (i.e., a dataset without outliers) (Teshima et al. 2012, pp. 173–174). It is well known that, with the three-sigma rule or any other non-robust method, deciding thresholds with contaminated datasets induces a masking effect, and therefore the thresholds of such methods must be determined with datasets free from outliers. We need robust methods to determine thresholds with contaminated datasets.
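
The following R sketch, with simulated data, illustrates the masking effect described above: the three-sigma limits are computed from the contaminated mean and standard deviation and are stretched by the outliers, whereas the quartile-based fences discussed next remain close to the bulk of the data. The variable names and cutoffs are illustrative assumptions only.

set.seed(1)
x <- c(rnorm(90, mean = 100, sd = 10), rep(250, 10))   # 10 planted outliers

## Three-sigma rule: both limits depend on the contaminated mean and SD
three_sigma <- mean(x) + c(-3, 3) * sd(x)

## Quartile-based (box-and-whisker) fences using the interquartile range
q      <- quantile(x, c(0.25, 0.75))
fences <- c(q[1] - 1.5 * (q[2] - q[1]), q[2] + 1.5 * (q[2] - q[1]))

three_sigma                              # limits inflated by the contamination
fences                                   # order statistics remain stable
which(x < fences[1] | x > fences[2])     # flags the planted outliers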

Noro and Wada (2015) illustrate the problem and recommend using order statistics such as the interquartile range (IQR). A box-and-whisker plot using the IQR, as proposed by Tukey (1977), is commonly used when the target dataset is slightly asymmetric. If the dataset is highly asymmetric, an appropriate data transformation may be necessary before applying the method. The scatterplot in Fig. 2 highlights the differences between robust methods and their non-robust counterparts, as well as the distinction between univariate and multivariate methods. It displays the Hertzsprung–Russell star dataset (Rousseeuw and Leroy 1987, p. 28), which contains extreme outliers. The yellow-colored rectangular area shows the thresholds according to the three-sigma rule; the green area shows the thresholds identified by the box-and-whisker method. Both are univariate methods. The orange lines in the diagram show probability ellipses drawn with a mean vector and covariance matrix. Although this represents a multivariate approach, it, too, suffers from the masking effect, just like the three-sigma rule, when applied to contaminated datasets. The red probability ellipses are drawn using modified Stahel-Donoho (MSD) estimators produced by robust principal component analysis (PCA) based on Béguin and Hulliger (2003). MSD and other multivariate methods are discussed in the next subsection.

Fig. 2 Differences between robust and non-robust methods both for univariate and multivariate methods. [After Wada (2010), Fig. 1.4.3, p. 98.]

2.2 Multivariate outlier detection methods for elliptical distributions

To evaluate and compare current methods for the editing and imputation of data, Eurostat conducted the EUREDIT project between March 2001 and February 2003. A series of reports were published and made available at https://www.cs.york.ac.uk/euredit/, along with five papers published in the Journal of the Royal Statistical Society. In one of the papers, Béguin and Hulliger (2004) note that national statistical offices had not used multivariate methods except for the Annual Wholesale and Retail Trade Survey (AWRTS) at Statistics Canada. Franklin and Brodeur (1997) report that modified Stahel-Donoho (MSD) estimators have been adopted for AWRTS and describe the algorithm. Béguin and Hulliger (2003) suggest several improvements to the estimators. Wada (2010) implemented both the original and improved MSD estimators in R and confirmed that the suggestions by Béguin and Hulliger (2003) do indeed improve performance, although the improved version of the MSD estimators suffers from the curse of dimensionality. Since the improved version is incapable of processing more than 11 variables on a 32-bit PC, Wada and Tsubaki (2013) implemented an R function using parallel computing so that it can be applied to higher-dimensional datasets.

Béguin and Hulliger (2003) suggest guiding principles for outlier detection, including good detection capability, high versatility, and simplicity. They examined several methods to estimate a mean vector and covariance matrix for elliptically distributed datasets with a higher breakdown point than M-estimators (Huber 1981), as well as other desirable properties such as affine and orthogonal equivariance. The methods include Fast-MCD (Rousseeuw and van Driessen 1999), which approximates the minimum covariance determinant (MCD) proposed by Rousseeuw (1985) and Rousseeuw and Leroy (1987); BACON by Billor et al. (2000), named for Francis Bacon; and the Epidemic Algorithm (EA) proposed by Hulliger and Béguin (2001), in addition to the MSD estimators used by Statistics Canada. Béguin and Hulliger (2003) compared some of these methods and found that BACON showed better detection capacity than EA for the UK Annual Business Inquiry (ABI) dataset; however, they conclude that this particular dataset does not require a sophisticated robust method.
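
As an illustration of how such estimators are used in practice, the following R sketch flags multivariate outliers with robust Mahalanobis distances. Fast-MCD is used here simply because it is readily available through covMcd in the CRAN package robustbase; MSD or any other robust location and scatter estimate would be plugged in the same way. The simulated data, the planted contamination, and the chi-squared cutoff are illustrative assumptions.

library(robustbase)   # covMcd: Fast-MCD location and scatter
library(MASS)         # mvrnorm: multivariate normal random numbers

set.seed(1)
p <- 3
sigma <- matrix(0.8, p, p); diag(sigma) <- 1
x <- mvrnorm(200, mu = rep(0, p), Sigma = sigma)
x[1:10, ] <- x[1:10, ] + 6                     # plant 10 multivariate outliers

fit <- covMcd(x)                               # robust mean vector and covariance
d2  <- mahalanobis(x, center = fit$center, cov = fit$cov)

cutoff <- qchisq(0.975, df = p)                # usual chi-squared cutoff
which(d2 > cutoff)                             # observations flagged as outliers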

3 Multivariate outlier detection for regression imputation

After removing or correcting erroneous data in the data cleaning step, the next step is the imputation of missing values of essential variables. From the variety of imputation methods available, the focus here is on regression imputation. Typically, OLS is used to estimate the parameters of a linear regression model; however, it is well known that the existence of outliers makes such parameter estimation unreliable. After going through the data cleaning step, survey datasets may still contain outliers in another sense. These remaining outliers are assumed to be correct; however, any extreme values in the long tails of a data distribution carry the risk of distorting the parameter estimation used for imputation, regardless of their correctness. OLS regression requires such outliers to be removed manually. Survey observations are divided into (sometimes a large number of) imputation classes so that a uniform response mechanism can be assumed within each class, and parameter estimation is conducted in each imputation class separately. A robust regression method relieves us of the burden of removing outliers from each imputation class beforehand.

In this section, we examine M-estimation for regression, which is one of the most popular methods. Its disadvantages are also introduced, together with other methods that cope with them.

3.1 M-estimators

3.1.1 Parameter estimation of the location and regression

Generally, an M-estimate is defined as the minimization problem of

$$\begin{array}{*{20}c} {\mathop \sum \limits_{i = 1}^{n} \rho \left( {x_{i} ;T_{n} } \right),} \\ \end{array}$$

for any estimate \(T_{n}\) with independent random variables \(x_{1} , \ldots , x_{n}\). Supposing an arbitrary function \(\rho\) has a derivative \(\psi \left( {x;\theta } \right) = \left( {\partial /\partial \theta } \right)\rho \left( {x;\theta } \right)\), \(T_{n}\) satisfies the implicit equation

$$\begin{array}{*{20}c} {\mathop \sum \limits_{i = 1}^{n} \psi \left( {x_{i} ;T_{n} } \right) = 0.} \\ \end{array}$$

Huber (1964) discusses the robust estimation of a mean vector, proposes M-estimation of a location with \(\sum\nolimits_{i = 1}^{n} {\psi \left( {x_{i} - T_{n} } \right) = 0}\), and proves their consistency as well as asymptotic normality. Huber (1973) then extends the idea to the regression model

$${y_{i} = \beta_{0} + \beta_{1} x_{i1} + \cdots + \beta_{p} x_{ip} + \varepsilon_{i} = \varvec{x}_{i}^{\top} \varvec{\beta} + \varepsilon_{i} } ,$$
(1)

with an objective variable \({\varvec{y}} = \left( {y_{1} , \ldots , y_{n} } \right)^{\top}\), where the error term \({\varvec{\varepsilon}} = \left( {\varepsilon_{1} , \ldots ,\varepsilon_{n} } \right)^{\top} \sim N\left( {0,\sigma^{2} } \right),\) i.i.d. and independent of \(\left( {p + 1} \right)\)-dimensional explanatory variables \({\varvec{x}}_{i} = \left( {1, x_{i1} , \ldots , x_{ip} } \right)\), and regression parameters \({\varvec{\beta}} = \left( {\beta_{0} , \beta_{1} , \ldots , \beta_{p} } \right)^{\top}\). The M-estimator of \(\varvec{\beta}\) minimizes

$$\begin{array}{*{20}c} {\mathop \sum \limits_{i = 1}^{n} \rho \left( {y_{i} - {\varvec{x}}_{i}^{\top} {\varvec{\beta}} } \right),} \\ \end{array}$$

on condition that \(\rho\) is differentiable, convex, and symmetric around zero. The estimation equation is

$$\mathop \sum \limits_{i = 1}^{n} \psi \left( {\frac{{y_{i} - {\varvec{x}}_{i}^{\top} {\varvec{\beta}} }}{\sigma }} \right)x_{i} = \mathop \sum \limits_{i = 1}^{n} \psi \left( {e_{i} } \right){\varvec{x}}_{i} = 0.$$

Due to the condition on \(\rho\) described above, \(\psi\) is supposed to be a bounded and continuous odd function, since \(\psi = \rho^{\prime}\). Residuals \(\left( {y_{i} - x_{i}^{\top} \beta } \right)\) are standardized by a measure of scale \(\sigma\) to make the estimation scale equivariant. Then, \(\varvec{\beta}\) is estimated by solving

$${\mathop \sum \limits_{i = 1}^{n} w_{i} e_{i} {\varvec{x}}_{i} = \mathop \sum \limits_{i = 1}^{n} w_{i} \left( {\frac{{y_{i} - {\varvec{x}}_{i}^{\top} {\varvec{\beta}} }}{\sigma }} \right) {\varvec{x}}_{i} = 0},$$
(2)

with weights \(w_{i} = w\left( {e_{i} } \right) = \psi \left( {e_{i} } \right)/e_{i}\) defined through the weight function \(w\). Then, (2) can be re-expressed as

$$\mathop \sum \limits_{i = 1}^{n} {\varvec{x}}_{i}^{\top} w_{i} {\varvec{x}}_{i} {\varvec{\beta}} = \mathop \sum \limits_{i = 1}^{n} {\varvec{x}}_{i}^{\top} w_{i} y_{i} .$$

It can be re-expressed in a matrix form as \(\left( {{\varvec{X}}^{\top} {\varvec{WX}}} \right){\varvec{\beta}} = {\varvec{X}}^{\top} {\varvec{Wy}}\), and consequently, \({\varvec{\beta}}\) is estimated by

$${\hat{\varvec{\beta }}} = \left[ {\varvec{X}}^{\top} {\varvec{WX}} \right]^{ - 1} {\varvec{X}}^{\top} {\varvec{Wy}},$$
(3)

where \({\varvec{X}} = \left( {{\varvec{x}}_{1} , \ldots ,{\varvec{x}}_{n} } \right)^{\top}\) is an \(n \times \left( {p + 1} \right)\) matrix of explanatory variables and \({\varvec{W}} = {\text{diag}}\left\{ {w_{i} } \right\}\) is an \(n \times n\) diagonal matrix of weights. Thus, M-estimators for regression can be regarded as weighted least squares (WLS) estimators with weights based on the residuals.

3.1.2 IRLS algorithm for regression

The intercept of M-estimators for regression is location equivariant, and the slope is location invariant; however, they are not scale equivariant when the scale parameter is fixed in advance. Scale equivariance is achieved by estimating the scale parameter simultaneously and using it to standardize the residuals. Beaton and Tukey (1974) propose the IRLS algorithm to solve (3) with simultaneous estimation of the scale parameter. Holland and Welsch (1977) recommend it over Newton's method, which is theoretically desirable but difficult to implement, and over Huber's method (Huber 1973; Bickel 1973), which requires more iterations.

The IRLS algorithm requires an appropriate initial estimate \(\hat{\varvec{\beta}}^{\left( 0 \right)}\) and uses it to obtain a better next estimate \(\hat{\varvec{\beta}}^{\left( 1 \right)}\), together with \({\hat{\sigma }}\), based on the equation

$$\hat{\varvec{\beta }}^{\left( j \right)} = \hat{\varvec{\beta }}^{{\left( {j - 1} \right)}} + \left\{ {{\varvec{X}}^{\top} \left[ {{\varvec{W}}\left( {\frac{{{\varvec{y}} - {\varvec{X}\hat{\varvec{\beta}}}^{{\left( {j - 1} \right)}} }}{\hat{\sigma }}} \right)} \right]{\varvec{X}}} \right\}^{ - 1} {\varvec{X}}^{\top} \left\{ {\left[ {{\varvec{W}}\left( {\frac{{\varvec{y}} - {\varvec{X}\hat{\varvec{\beta}}^{{\left( {j - 1} \right)}} }}{\hat{\sigma }}} \right)} \right]\left( {\varvec{y} - {\varvec{X}\hat{\varvec{\beta}}}^{\left( {j - 1} \right)}} \right)} \right\}.$$

The calculation is repeated until a convergence criterion is met. The superscript \(j\) represents the iteration number.

There are several choices of measure for \({\hat{\sigma }}\). They are discussed together with the selection of a weight function in the next subsection, since the two are closely related.
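
The following R sketch outlines the IRLS iteration in its simplest form: a weight function w_fun (e.g., Huber's or Tukey's, Sect. 3.1.3) and a scale estimator scale_fun (e.g., MAD or AAD, Sect. 3.1.3) are passed as arguments, and each step is a weighted least squares fit as in (3). The function name, the stopping rule, and the omission of the safeguards found in production implementations such as MASS::rlm are all simplifying assumptions.

irls_m <- function(X, y, w_fun, scale_fun, max_iter = 50, tol = 1e-8) {
  beta <- solve(crossprod(X), crossprod(X, y))        # OLS as the initial estimate
  for (j in seq_len(max_iter)) {
    r     <- as.vector(y - X %*% beta)                # residuals
    sigma <- scale_fun(r)                             # robust scale estimate
    w     <- w_fun(r / sigma)                         # robust weights from the weight function
    W     <- diag(w)
    beta_new <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)   # WLS step, cf. (3)
    if (max(abs(beta_new - beta)) < tol) {            # stop when estimates stabilize
      beta <- beta_new
      break
    }
    beta <- beta_new
  }
  list(coefficients = as.vector(beta), weights = w, scale = sigma)
}
## Example call (with the weight and scale functions sketched in Sect. 3.1.3):
## fit <- irls_m(cbind(1, x1, x2), y, w_fun = huber_w, scale_fun = mad_scale)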

3.1.3 Weight functions and measures of scale

Robust weights \(w_{i}\) in (2) are computed based on a weight function. Although there are a variety of choices (see, e.g., Antoch and Ekblom 1995; Zhang 1997), we discuss the two most popular weight functions here. One is Huber's weight function

$$ \begin{array}{*{20}c} {w_{i} = w\left( {e_{i} } \right) = w\left( {\frac{{y_{i} - {\varvec{x}}_{i}^{\top} \hat{\varvec{\beta }}}}{{\hat{\sigma }}}} \right) = \left\{ {\begin{array}{*{20}c} 1 & {\left| {e_{i} } \right| \le c } \\ {c}/{\left| {e_{i} } \right|} & {\left| {e_{i} } \right| > c} \\ \end{array} } \right.,} \\ \end{array} $$
(4)

proposed by Huber (1964). This weight function has been proved to yield a unique solution regardless of the initial values (e.g., Maronna et al. 2006, p. 350), and its estimation efficiency is high with normal or nearly normal datasets (e.g., Hampel 2001; Wada and Noro 2019). The other is Tukey's biweight function

$${w_{i} = w\left( {e_{i} } \right) = w\left( {\frac{{y_{i} - {\varvec{x}}_{i}^{\top} \hat{\varvec{\beta }}}}{{\hat{\sigma }}}} \right) = \left\{ {\begin{array}{*{20}c} {\left[ {1 - \left( {e_{i}/{k}} \right)^{2} } \right]^{2} } & {\left| {e_{i} } \right| \le k } \\ 0 & {\left| {e_{i} } \right| > k} \\ \end{array} } \right.,}$$
(5)

by Beaton and Tukey (1974). This weight function performs well with longer-tailed datasets, although, unlike Huber's weight function, it does not guarantee a global solution. The difference between the two weight functions lies in their treatment of extreme outliers: Tukey's function gives zero weight to observations very far from the others, while Huber's function never gives zero weight and therefore cannot escape the influence of extreme outliers. The tuning constants \(c\) in (4) and \(k\) in (5) are sometimes called Huber's \(c\) and Tukey's \(k\), respectively. Their actual values depend on the measure of scale used.
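
Written out for residuals \(e_{i}\) that have already been standardized by a scale estimate, the two weight functions can be sketched as the following R functions; the function names are illustrative, and the default tuning constants shown are the SD-based values quoted below, which must be rescaled when the MAD or AAD is used instead (cf. Tables 1 and 2).

huber_w <- function(e, c = 1.345) {
  ifelse(abs(e) <= c, 1, c / abs(e))            # (4): weights never reach zero
}
biweight_w <- function(e, k = 4.685) {
  ifelse(abs(e) <= k, (1 - (e / k)^2)^2, 0)     # (5): zero weight for extreme outliers
}

These can be passed as w_fun to the IRLS sketch in Sect. 3.1.2.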

The most popular measure of scale is median absolute deviation (MAD) defined as follows:

$$\begin{array}{*{20}c} {{\hat{\sigma }}_{{{\text{MAD}}}} = {\text{median}}\left( {\left| {r_{i} - {\text{median}}\left( {r_{i} } \right)} \right|} \right),} \\ \end{array}$$

where the residuals are \(r_{i} = y_{i} - {\varvec{x}}_{i}^{\top} {\varvec{\beta}}\). Huber's weight function is commonly used with the MAD. Tukey's biweight function is also used with the MAD (e.g., Holland and Welsch 1977; Mosteller and Tukey 1977, p. 357); however, it is sometimes used with the average absolute deviation (AAD),

$$\begin{array}{*{20}c} {{\hat{\sigma }}_{{{\text{AAD}}}} = {\text{mean}}\left( {\left| {r_{i} - {\text{mean}}\left( {r_{i} } \right)} \right|} \right).} \\ \end{array}$$

Andrews et al. (1972), who conducted a large-scale Monte Carlo experiment involving robust estimation of the location parameter, show that the MAD is better than the AAD or IQR for M-estimators; however, it has not been proved that the MAD is better than other scale parameters in the case of regression (Huber and Ronchetti 2009, pp. 172–173). Holland and Welsch (1977) compare several weight functions with the MAD as the measure of scale and show in a Monte Carlo experiment that Huber's weight function has better efficiency than the biweight function, while Bienias et al. (1997) use Tukey's biweight function with an AAD scale and mention its convergence efficiency.
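
For reference, the two measures of scale, exactly as defined above (i.e., without any consistency factor), can be written as the following R functions; they can be passed as scale_fun to the IRLS sketch in Sect. 3.1.2, and the function names are illustrative.

mad_scale <- function(r) median(abs(r - median(r)))   # sigma_MAD
aad_scale <- function(r) mean(abs(r - mean(r)))       # sigma_AAD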

Wada and Noro (2019) compare the four estimators obtained by combining these two weight functions with the two measures of scale in a Monte Carlo experiment with long-tailed datasets under asymmetric contamination. It is known that 95% asymptotic efficiency at the standard normal distribution is obtained with the tuning constant \(c = 1.3450\) for Huber's function (e.g., Ray 1983, p. 108), and \(k = 4.6851\) for the biweight function (e.g., Ray 1983, p. 112). These figures are based on the standard deviation (SD), and the corresponding figures for the MAD and AAD can be obtained by the relations

$$ \begin{gathered} \frac{{\sigma_{{{\text{AAD}}}} }}{{\sigma_{{{\text{SD}}}} }} = \frac{{E\left| e \right|}}{{\sqrt {E\left( {e^{2} } \right)} }} = \sqrt {\frac{2}{\pi }} \approx 0.80,\quad {\text{and}} \hfill \\ \sigma_{{{\text{SD}}}} = \frac{1}{{{\Phi }^{ - 1} \left( {3/4} \right)}} \cdot \sigma_{{{\text{MAD}}}} \approx 1.4826 \cdot \sigma_{{{\text{MAD}}}} , \hfill \\ \end{gathered} $$

with the cumulative distribution function of the standard normal distribution \({\Phi }\), where \(\sigma_{{{\text{SD}}}}\), \(\sigma_{{{\text{MAD}}}}\), and \(\sigma_{{{\text{AAD}}}}\) are scale parameters based on the SD, MAD, and AAD, respectively. Wada and Noro (2019) obtain the results shown in Table 1 and compare the four estimators based on the standardized tuning constants shown in Table 2. The range of these constants follows the values of Tukey's \(k\) used with the AAD scale by Bienias et al. (1997), which is part of the reports for official statistics of the Eurostat-funded Euredit Project (Barcaroli 2002). A smaller tuning constant makes the estimation more resistant to outliers, while a larger one increases estimation efficiency. Wada and Noro (2019) conclude that the AAD is computationally more efficient than the widely used MAD for both weight functions. Moreover, the AAD is more suitable than the MAD for Tukey's biweight function. The compared estimators are available in a public repository (see Table B in Appendix).

Table 1 Tuning constants for 95% asymptotic efficiency with different measures of scale. The figures first appeared in Wada (2012); the values for Huber's function with \(\sigma_{{{\text{AAD}}}}\) are corrected in Wada and Noro (2019)
Table 2 Tuning constants scaled for a comparison. The figures appeared in Wada (2012) and Wada and Noro (2019)

3.2 Selection of the weight function and breakdown point

Wada and Tsubaki (2018) suggest choosing between these two weight functions according to purpose. They suggest Tukey's biweight function rather than Huber's weight function in the case of imputation, since the breakdown point of M-estimators for regression is \(1/n\), the same as that of OLS. Rousseeuw and Leroy (1987) report that the oldest definition of the breakdown point was given by Hodges (1967) for univariate parameter estimation and that Hampel (1971) generalized it. The definition offered by Donoho and Huber (1983) is for a finite sample:

Given sample size \(n\) for any sample, let

$$\varvec{Z} = \left[ {\left( {x_{11} , \ldots , x_{1p} , y_{1} } \right), \ldots ,\left( {x_{n1} , \ldots , x_{np} , y_{n} } \right)} \right],$$

and let \(\varvec{T}\) be the regression estimator applied to \(\varvec{Z}\). A new sample, \(\varvec{Z}^{\prime}\), is created by replacing \(m\) of the observations arbitrarily in \(\varvec{Z}\). Let \({\text{bias}}\left( {m; \varvec{T}, \varvec{Z}} \right)\) be the maximum bias produced by the contamination of the replacements in the sample. The value of \({\text{bias}}\left( {m; \varvec{T}, \varvec{Z}} \right)\) is determined as follows:

$${\text{bias}}\left( {m;{\varvec{T}}, {\varvec{Z}}} \right) = {\sup_{\varvec{Z}^{\prime}}} \|{\varvec{T}}\left( {\varvec{Z}^{\prime}} \right) - {\varvec{T}}\left( {\varvec{Z}} \right) \|.$$

If \({\text{bias}}\left( {m;{\varvec{T}}, {\varvec{Z}}} \right)\) is infinite, the indication is that contamination of size \(m\) breaks down the estimator. In general, the finite-sample breakdown point of estimator \({\varvec{T}}\) for sample \(\varvec{Z}\) is

$$\varepsilon_{n}^{*} \left( {{\varvec{T},\varvec{Z}}} \right) = \min \left[ {\frac{m}{n}; {\text{bias}}\left( {m;{\varvec{T},\varvec{Z}}} \right)\;{\text{is}}\;{\text{infinite}}} \right].$$

This can be regarded as the smallest fraction of outliers that can move the value of \(\varvec{T}\) arbitrarily far from the one originally obtained. A breakdown point of \(1/n\) means that only one extreme observation in a dataset of any size can adversely affect the estimation, and that the breakdown point approaches 0% with large samples. Nevertheless, Tukey's biweight function can eliminate the influence of extreme observations by giving them zero weight, unlike Huber's weight function. This is the reason it is recommended for imputation: such outliers are ignored only in estimating imputed values, while they are still used in survey enumeration. On the other hand, if M-estimators for regression are used for population estimation, i.e., for directly estimating the figures that appear in final products such as statistical tables, Huber's weight function might be more suitable, as it never gives zero weight. Giving zero weight to observations in producing final survey statistics means discarding valid observations. Generally, survey statisticians working in official statistics avoid wasting precious data, since the data are obtained from questionnaires filled in by respondents who bear the response burden in good faith.

3.3 Robust estimators to cope with outliers in explanatory variables

M-estimators have another weakness in addition to the low breakdown point: they are not robust against outliers in the explanatory variables. LMS (least median of squares), proposed by Hampel (1975) and extended by Rousseeuw (1984), LTS (least trimmed squares) by Rousseeuw (1984), and the S-estimator by Rousseeuw and Yohai (1984) have higher breakdown points than M-estimators and can also cope with outliers in the explanatory variables. Unfortunately, all of them are computationally demanding. (See, e.g., Rousseeuw and Leroy 1987; and Huber and Ronchetti 2009, pp. 195–198 for more details.)

The use of these estimators may still be at the research stage in the field of official statistics, although the software is available and widely used in some other fields. Generalized M (GM)-estimators and MM-estimators are popular methods. GM-estimators were introduced by Schweppe (as given in Hill 1977) and by Coakley and Hettmansperger (1993); their algorithms and software are available in Wilcox (2005). MM-estimators were first presented by Yohai (1987). Wilcox (2005) implemented an R function called bmreg for Schweppe-type GM-estimators and chreg for the GM-estimators of Coakley and Hettmansperger (1993). The CRAN package robustbase also provides the lmrob function, which implements both the MM-estimators of Yohai (1987) and the SMDM-estimators of Koller and Stahel (2011). Koller and Stahel (2011) achieve a 50% breakdown point and 95% asymptotic efficiency by improving MM-estimators.
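
For orientation, the following R calls show how such implementations are typically invoked; the simulated data frame dat and the variables y, x1, and x2 are placeholders, and the options shown are defaults rather than recommendations.

library(MASS)         # rlm: M-estimation of regression
library(robustbase)   # lmrob: MM-estimation and extensions

set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(100)
dat$y[1:5] <- dat$y[1:5] + 20                   # a few outliers in the objective variable

fit_m  <- rlm(y ~ x1 + x2, data = dat, psi = psi.huber)   # M-estimator with Huber's weight
fit_mm <- lmrob(y ~ x1 + x2, data = dat)                  # MM-estimator (default setting)
summary(fit_mm)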

Bagheri et al. (2010) compare M-estimators, the MM-estimator, the Schweppe-type GM-estimator, and the GM-estimator proposed by Coakley and Hettmansperger (1993), concluding that the GM-estimator proposed by Coakley and Hettmansperger was the best among the group. Wada and Tsubaki (2018) examine M-estimators and the GM-estimators of Coakley and Hettmansperger (1993) for weight calibration, which will be discussed in Sect. 5. They mention that the explanatory variables chosen for imputation are often selected from among the auxiliary variables used for stratification in sample surveys. If this is the case, outliers in the explanatory variables are not expected, and M-estimators could be more suitable than GM-estimators. GM-estimators reduce the robust weight of leverage points in addition to that of outliers in the objective variable, which provides robustness but reduces estimation efficiency.

4 Robustification of the ratio estimation for imputation

4.1 Difference between regression imputation and ratio imputation

In regression imputation, missing values \(y_{i}\) in the target variable are replaced by estimated values \(\hat{y}_{i}\) based on a regression model with auxiliary \(x\) variables using complete observations regarding all those \(x\) and \(y\) in the target dataset (e.g., De Waal et al. 2011, p. 230).

Ratio imputation is a special case of regression imputation (De Waal et al. 2011, pp. 244–245), where a missing \(y_{i}\) is replaced using the estimated ratio of \(y\) to a single observed auxiliary variable \(x\). Specifically, the ratio model is

$$\begin{array}{*{20}c} {y_{i} = \beta x_{i} + \epsilon_{i} ,} \\ \end{array}$$
(6)

where missing \(y_{i}\) are replaced by \(\hat{y}_{i} = \hat{\beta }x_{i}\) with the estimated ratio

$${\hat{\beta } = \frac{{\mathop \sum \nolimits_{i = 1}^{n} y_{i} }}{{\mathop \sum \nolimits_{i = 1}^{n} x_{i} }}}$$
(7)

where \(\left( {x_{i} , y_{i} } \right)\), \(i = 1, \ldots , n\), are the \(n\) observed elements in the imputation class. (See Cochran 1977, pp. 150–164 for further details.)
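
In code, ratio imputation within a single imputation class amounts to estimating (7) from the complete pairs and applying it to the records with missing \(y\); the following minimal R sketch assumes positive auxiliary values and uses illustrative names.

ratio_impute <- function(x, y) {
  obs  <- !is.na(y)                        # complete (x, y) pairs
  beta <- sum(y[obs]) / sum(x[obs])        # estimated ratio, cf. (7)
  y[!obs] <- beta * x[!obs]                # impute y_hat = beta * x
  y
}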

The ratio model (6) resembles the single regression model without an intercept,

$$\begin{array}{*{20}c} {y_{i} = \beta x_{i} + \varepsilon_{i} .} \\ \end{array}$$
(8)

However, there is a difference in the error terms. The error term for the ratio model (6) is heteroscedastic and expressed as \(\epsilon_{i} \sim N\left( {0, x_{i} \sigma^{2} } \right)\) with scale parameter \(\sigma\), while the error in the regression model (8) is homoscedastic, as described for the regression model (1), and expressed as \(\varepsilon_{i} \sim N\left( {0, \sigma^{2} } \right)\). Because of this error term difference, the ratio model has an advantage in imputation over regression models in its ability to fit heteroscedastic datasets without data transformations that can make the estimation of means and totals unstable. On the other hand, the heteroscedastic error is an obstacle to robustifying the ratio estimator by means of M-estimation.

4.2 Generalization and robustification of the ratio model

For robustification, Wada and Sakashita (2017) and Wada et al. (2021) re-formulate the original ratio model with the heteroscedastic error term \(\epsilon_{i}\) as follows:

$$\begin{array}{*{20}c} {y_{i} = \beta x_{i} + \sqrt {x_{i} } \varepsilon_{i} , } \\ \end{array}$$

since the two error terms discussed above have the relation, \(\epsilon_{i} = \sqrt {x_{i} } \varepsilon_{i}\).

They then extend the model to

$$\begin{array}{*{20}c} {y_{i} = \beta x_{i} + x_{i}^{\gamma } \varepsilon_{i} ,} \\ \end{array}$$
(9)

with an error term proportional to \(x_{i}^{\gamma }\). The corresponding ratio estimator becomes

$${\hat{\beta } = \frac{{\mathop \sum \nolimits_{i = 1}^{n} y_{i} x_{i}^{1 - 2\gamma } }}{{\mathop \sum \nolimits_{i = 1}^{n} x_{i}^{{2\left( {1 - \gamma } \right)}} }}.}$$
(10)

When \(\gamma = 1/2\), the model (9) and the estimator (10) correspond to the original ratio model (6) and the estimator (7). Depending on the value of \(\gamma\), the generalized model has different features; for example, it corresponds to the single regression model without an intercept (8) when \(\gamma = 0\). Takahashi et al. (2017) also discuss the same model for datasets following the log-normal model and propose an estimation method for \(\gamma\).

The robustified generalized ratio estimator by Wada and Sakashita (2017) and Wada et al. (2021) is

$${\hat{\beta }_{{{\text{rob}}}} = \frac{{\sum w_{i} y_{i} x_{i}^{1 - 2\gamma } }}{{\sum w_{i} x_{i}^{{2\left( {1 - \gamma } \right)}} }},}$$
(11)

where \(w_{i}\) is obtained by a weight function with homoscedastic quasi-residuals,

$${\check{r}}_{i} = \frac{{y_{i} - \hat{\beta }_{{{\text{rob}}}} x_{i} }}{{x_{i}^{\gamma } }},$$

and a scale parameter \(\sigma\).
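
A minimal R sketch of this robustified estimator for a fixed \(\gamma\) is given below: \(\beta\) is iterated, with the weights recomputed at each step from the standardized quasi-residuals. The function name is illustrative, biweight_w and mad_scale are the illustrative functions sketched in Sect. 3.1.3 (in practice the tuning constant must be chosen to match the scale measure), and positive \(x\) values are assumed.

robust_ratio <- function(x, y, gamma = 0.5, w_fun = biweight_w,
                         scale_fun = mad_scale, max_iter = 50, tol = 1e-8) {
  beta <- sum(y * x^(1 - 2 * gamma)) / sum(x^(2 * (1 - gamma)))    # start from (10)
  for (j in seq_len(max_iter)) {
    r <- (y - beta * x) / x^gamma                                  # quasi-residuals
    w <- w_fun(r / scale_fun(r))                                   # robust weights
    beta_new <- sum(w * y * x^(1 - 2 * gamma)) /
                sum(w * x^(2 * (1 - gamma)))                       # update by (11)
    if (abs(beta_new - beta) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
## Imputation then replaces a missing y_i by beta * x_i, as in Sect. 4.1.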

4.3 Further development: simultaneous estimation of \(\gamma\)

Wada and Sakashita (2017) and Wada et al. (2021) considered the generalized ratio model with fixed \(\gamma\) values, which requires model selection before estimation for imputation. Wada et al. (2019) proposed eliminating the model selection step by simultaneously estimating \(\beta\) and \(\gamma\) in (11) by means of two-stage least squares (2SLS) (e.g., Greene 2002, p. 79) with iterations. The initial estimate of \(\beta\) is obtained by OLS under the model (6); this estimation is not efficient but is unbiased under heteroscedasticity. Using the instrumental variable \(r_{i}^{2} = \left( {y_{i} - \hat{\beta }x_{i} } \right)^{2}\), the relation

$$\log \left| {r_{i} } \right| = \gamma \log \left| {x_{i} } \right| + \log \left( \sigma \right),$$

is obtained with \(r_{i} = y_{i} - \hat{\beta }x_{i}\), which means that \(\gamma\) can be obtained as a single regression parameter. A new \(\hat{\beta }\) is then obtained using \(\hat{\gamma }\), and a new \(\hat{\gamma }\) is estimated based on the latest \(\hat{\beta }\). See Wada, Takata and Tsubaki (2019) for more details of the algorithm. The implemented function, named RBred, is available together with its non-robust version, called Bred (see Appendix). An evaluation was made using contaminated random datasets: estimation by RBred was found to have better efficiency than the R optim function; in particular, its estimation of \(\beta\) appears to be useful for imputation. However, further evaluation may be necessary, especially regarding the estimation of \(\gamma\), although the value of \(\gamma\) itself is not needed for imputation.
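
To make the alternating scheme concrete, the following simplified, non-robust R sketch follows the description above: \(\beta\) is initialized by OLS under model (6), \(\gamma\) is re-estimated as the slope of the regression of \(\log \left| {r_{i} } \right|\) on \(\log \left| {x_{i} } \right|\), and \(\beta\) is then updated by (10) with the current \(\gamma\). It is only an illustration of the idea; the published Bred and RBred functions (see Appendix) differ in detail, and the robust version additionally applies the weighting of (11).

estimate_beta_gamma <- function(x, y, max_iter = 100, tol = 1e-8) {
  beta  <- sum(x * y) / sum(x^2)       # initial OLS estimate under model (6)
  gamma <- 0.5
  for (j in seq_len(max_iter)) {
    r         <- y - beta * x
    gamma_new <- unname(coef(lm(log(abs(r)) ~ log(abs(x))))[2])   # slope = gamma
    beta_new  <- sum(y * x^(1 - 2 * gamma_new)) /
                 sum(x^(2 * (1 - gamma_new)))                     # update beta by (10)
    if (max(abs(c(beta_new - beta, gamma_new - gamma))) < tol) {
      beta <- beta_new; gamma <- gamma_new; break
    }
    beta <- beta_new; gamma <- gamma_new
  }
  c(beta = beta, gamma = gamma)
}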

5 Weight calibration

The outliers focused on in the estimation and formatting step are extreme values with large design weights. The Horvitz–Thompson estimator is widely used to estimate finite population means and totals for conventional statistical surveys. In such cases, design weights, which are the inverse of the sampling rate, are used as multipliers for each observation. The problem lies in deciding whether an extreme observation deserves the corresponding design weight. (A weight of 1000 applied to an observation means that the value of the observation represents 1000 population elements that were not sampled.) Chambers (1986) considers this outlier problem and argues that for “nonrepresentative” outliers or unique data points that have been judged free of any errors or mistakes, the design weight should be one corresponding to a single population element. This implies that these outliers do not represent other population elements that are not sampled, and consequently, they do not influence the estimation process in any substantial way.

Wada and Tsubaki (2018) propose a design weight calibration method utilizing the robust weights obtained by the M-estimators for regression described in Sect. 3. Henry and Valliant (2012) classified estimation methods for population means or totals in sample surveys into model-based, design-based, and model-assisted approaches. The proposed method corresponds to the last of these: the model-assisted approach.

To illustrate, consider selecting a sample using random sampling without replacement from a finite population \(U\) containing \(N\) elements \(u_{l}\), \(l = 1, \ldots, N\). The extracted sample \(S\) contains \(n\) elements \(v_{i}\), \(i = 1, \ldots, n\). Let \(\pi = n/N\) be the probability that a population element is included in the sample \(S\). The associated design weight for a sampled element \(i\) in \(S\) is \(g_{i} = 1/\pi\); therefore, \(\sum\nolimits_{i = 1}^{n} {g_{i} = N}\). The Horvitz–Thompson (HT) estimator (Horvitz and Thompson 1952) for the population total \(T = \sum\nolimits_{l = 1}^{N} {u_{l} }\) in this case is

$${T_{{{\text{HT}}}} = \frac{N}{n}\mathop \sum \limits_{i = 1}^{n} v_{i} . }$$

This is also called the inverse probability weight (IPW) estimator.

It is known that the efficiency of the HT-estimator decreases in the presence of outliers or when it is applied to a non-normal data distribution, as HT-estimators have characteristics similar to OLS estimators. To improve the efficiency of the HT-estimator, Wada and Tsubaki (2018) use robust regression estimation. The idea is to adjust the conventional design weight \(g_{i}\) by multiplying it by the robust weight \(w_{i}\) obtained by (3) after the iterations converge, since \(w_{i}\), determined by the residual, can be regarded as an indicator of “outlyingness.” An additional adjustment is then necessary to make the adjusted weights \(g_{i}^{*} = g_{i} w_{i}\) meet the necessary condition that the weights sum to \(N\). The following two adjustments have been proposed:

$${g_{i}^{**} = \frac{{g_{i}^{*} \mathop \sum \nolimits_{i = 1}^{n} g_{i} }}{{\mathop \sum \nolimits_{i = 1}^{n} g_{i}^{*} }},\quad {\text{and}}}$$
(12)
$${g_{i}^{***} = 1 + \frac{{g_{i}^{*} \mathop \sum \nolimits_{i = 1}^{n} (g_{i} - 1)}}{{\mathop \sum \nolimits_{i = 1}^{n} g_{i}^{*} }}.}$$
(13)

The adjustment shown in (12) is a natural form; however, the adjusted weight, \(g_{i}^{**} ,\) becomes zero when \(w_{i} = 0\). In such cases, the corresponding observation is actually removed from the population estimation process. However, for official statistics, ignoring a sampled observation is not desirable, since the observation exists in the population. For this reason, the adjustment shown in (13), which guarantees a minimum value of 1 for each \(g_{i}^{***}\), is proposed.
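
A small numeric sketch of the two adjustments, with made-up design weights g and robust weights w, may help to see how (12) and (13) behave: both restore the sum to \(N = \sum g_{i}\), but only (13) keeps every calibrated weight at 1 or above.

g <- c(1000, 1000, 500, 500, 250)     # design weights; N = sum(g)
w <- c(1.0, 0.0, 0.7, 1.0, 1.0)       # converged robust weights (one is zero)

g_star <- g * w                       # g_i^* = g_i * w_i

g_2star <- g_star * sum(g) / sum(g_star)            # (12): can drop an observation to 0
g_3star <- 1 + g_star * sum(g - 1) / sum(g_star)    # (13): every weight is at least 1

c(sum(g), sum(g_2star), sum(g_3star))               # all equal N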

Design weight calibration by robust weights has the advantage that the loss of estimation efficiency is smaller than that of model-based approaches when the data distribution deviates from the assumed model. Wada and Tsubaki (2018) confirm the usefulness of the proposed adjustment (13) in Monte Carlo simulations with random and real datasets.

One disadvantage of this weight calibration method may be that the calibrated weight is assigned to a single variable within an observation, whereas in conventional approaches the design weight is assigned to the observation as a whole (i.e., all variables in an observation share the same design weight).

6 Examples of practical applications

6.1 MSD estimators for unincorporated enterprise survey

The Unincorporated Enterprise Survey, conducted by the Statistics Bureau of Japan, Ministry of Internal Affairs and Communications (MIC), underwent major changes in 2019. The industries surveyed were extended, the sample size was increased tenfold (to approximately 37 thousand from 3.7 thousand), and questionnaires are now collected by mail instead of by enumerators. A hot deck imputation process was added for accounting items such as sales, total expenses, purchases, operating expenses, and inventories, together with a cleaning process for the hot deck donor candidates. Imputation had not been necessary before, since the response rate of the previous survey was almost 100%; however, mail surveys are typically expected to increase non-response.

Outlier detection methods based on a mean vector and covariance matrix assume symmetric data distributions. Wada et al. (2020) evaluate such methods for this survey: the blocked adaptive computationally efficient outlier nominators (BACON) by Billor et al. (2000), improved MSD by Béguin and Hulliger (2003), Fast-MCD by Rousseeuw and van Driessen (1999), and NNVE by Wang and Raftery (2002) are compared using skewed and long-tailed random datasets with asymmetric contamination, and improved MSD is selected. They also examine an appropriate data transformation for the highly skewed target variables based on the number of outliers detected and on scatter plot matrices. The target variables are highly correlated accounting items that never take values below zero, and the expected outliers mostly have large values. Therefore, a suitable data transformation in this case can be somewhat looser than one that makes the data strictly symmetric. They select the transformation that detects the minimum number of outliers among the larger values and show that removing outliers from the hot deck donor candidates improves estimation for imputation. Log transformation, biquadratic root transformation, and square root transformation are compared, and the biquadratic root transformation is selected. The lower triangular matrix of Fig. 3 shows an example for the manufacturing industry. Based on these results, outliers in the four highly correlated variables (sales, total expenses, purchases, and operating expenses) of the Unincorporated Enterprise Survey are detected by improved MSD after biquadratic root transformation. Beginning inventory and ending inventory are excluded, since a certain number of observations take zero values in some industries and the two variables do not have high covariance with the other variables, although they are highly correlated with each other. The detected outliers are removed from the hot deck donor candidates in each imputation class, while they are still used in aggregation for producing statistical tables.

Fig. 3 Outliers detected by MSD estimators in the manufacturing industry with square root in the upper triangular matrix, and biquadratic root transformation in the lower matrix. [After Wada et al. (2020), Fig. 3, p. 10.]

6.2 Application of the robust estimator of the generalized ratio model

The robust estimator of the generalized ratio model (9) is adopted for the imputation of major corporate accounting items in the 2016 Economic Census for Business Activity in Japan (Wada and Sakashita 2017; Wada et al. 2021), conducted jointly by the Ministry of Internal Affairs and Communications (MIC) and the Ministry of Economy, Trade and Industry (METI). The items to be imputed are sales (explained by expenses), expenses (explained by sales), and salaries (explained by expenses). The two versions of model (9), with \(\gamma = 1/2\) and with \(\gamma = 1\), are compared, and \(\gamma = 1/2\) is adopted for all of the imputed items.

The imputation classes are determined by CART (classification and regression trees). The target variable is the ratio used for imputation, and the possible explanatory variables are the 3.5-digit industrial classification code, legal organization, number of employees, type of cooperative, number of regular domestic employees, number of domestic branch offices, and number of branch offices. These are the variables available from the Statistical Business Register before the 2016 Census, since the imputation classes have to be determined before the questionnaires are collected. The Statistical Business Register is a nationwide database of business establishments and enterprises compiled from the previous Census and surveys as well as various administrative information. Among these explanatory variables, type of cooperative and number of regular domestic employees are adopted. The minimum number of complete observations for parameter estimation in each imputation class is 30. If an imputation class does not have enough observations, it is merged with another class within the same 1-digit industrial classification code. If there are several candidate classes, the class to merge with is determined by the Mann–Whitney U test.

7 Concluding remarks

The focus of this paper is controlling the influence of outliers in survey data processing. In addition to conventional univariate methods, some of the multivariate methods introduced here have come to be used in practice, although examples of their use remain limited for the time being. Other methods are still in the research stage. For example, the simultaneous estimation of \(\beta\) and \(\gamma\) described in Sect. 4.3 is being improved, and validation of the weight calibration approach in Sect. 5 may require more time, since weighting by item within each observation (rather than by observation) is not yet common practice.

The author hopes that this paper will promote efficiency improvements in official statistics, in terms of both estimation and statistical production.