Robust estimation and moment selection in dynamic fixed-effects panel data models

Considering linear dynamic panel data models with fixed effects, existing outlier–robust estimators based on the median ratio of two consecutive pairs of first-differenced data are extended to higher-order differencing. The estimation procedure is thus based on many pairwise differences and their ratios and is designed to combine high precision and good robust properties. In particular, the proposed two-step GMM estimator based on the corresponding moment equations relies on an innovative weighting scheme reflecting both the variance and bias of those moment equations, where the bias is assumed to stem from data contamination. To estimate the bias, the influence function is derived and evaluated. The robust properties of the estimator are characterized both under contamination by independent additive outliers and the patches of additive outliers. The proposed estimator is additionally compared with existing methods by means of Monte Carlo simulations.


Introduction
Dynamic panel data models with fixed effects have been used in many empirical applications in economics; see Bun and Sarafidis (2015) and Harris et al. (2008) for an overview of the methodology and applications. Despite the complex data structure of dynamic panels, a vast majority of literature focuses on the models assuming that data are free of influential observations or outliers. This is often not the case in reality (Janz 2002;Verardi and Wagner 2011;Zaman et al. 2001), and procedures robust to outliers are thus very important in the case of panel data, where erroneous observations can be easily masked by the complex data structure.
The robust methods for panel data have been studied only to a limited extent till now. There are some methods available for static models (e.g., Bramati and Croux 2007;Aquaro andČížek 2013) and just a handful for the dynamic models. Locally robust estimation procedures have been proposed by Lucas et al. (2007), based on the generalized method of moment estimator with a bounded influence function, and by Galvao (2011), using quantile regression techniques. On the other hand, Dhaene and Zhu (2017) and Aquaro andČížek (2014) propose globally robust estimators that are based on the median ratios of the first differences of the dependent variable and of the first-or higher-order differences of the lagged dependent variable [note that previously studied median-unbiased estimation such as Cermeño (1999) was based on the least squares method and was thus not robust to outliers]. The main shortcomings of these methods follow from the use of a fixed number of the differences and their ratios. On the one hand, using just the first differences as in Dhaene and Zhu (2017) can be beneficial for the robustness of the estimator, but it results in a lower precision of estimates. On the other hand, Aquaro andČížek (2014) employ multiple differences of the explanatory variables to improve the precision of estimation, but it leads to a high sensitivity to sequences of outliers. Additionally, estimation using higher-order differences of the dependent variables has not been explored in neither case.
Our aim is to extend these median-based estimators of Dhaene and Zhu (2017) and Aquaro andČížek (2014) by employing multiple pairwise difference transformations in such a way that the resulting estimator is robust and also exhibits good finitesample performance in data without outliers. The use of higher-order differences of the dependent variable is not new (see Aquaro andČížek 2013), but presents two big challenges when applied in dynamic models. In particular, higher-order differences have not been previously used since (1) they can result in a substantial increase in bias in the presence of particular types of outliers and (2) their number grows quadratically with the number of time periods, which can lead to additional biases due to weak identification or outliers. We address this by proposing a data-driven weighting and selection of the median ratios of differenced data since the traditional strategy used in the robust statistics-using an initial robust estimator to detect outlying observations, and after removing them, applying an efficient non-robust estimator (c.f., Gervini and Yohai 2002)-is not feasible in this context. Even in the case of using the first differences only, removing a single observation means that the observation and its two or three following data points (depending on the actual estimation method) cannot be used in estimation. Especially in short panels with less than five time periods, removing a single observation for a given individual thus means that no observations of that individual can be used in estimation and the problem gets worse if higher-order differences are used.
In this paper, we generalize the estimation method of Dhaene and Zhu (2017) to a combination of the pth and sth order differences, p, s ∈ N, and combine multiple pairwise differences by means of the generalized method of moments (GMM). To account for the shortcomings of the current methods and to extend the analysis of Aquaro andČížek (2014), we first analyze the robustness of the median-based moment conditions, derive their influence functions, and quantify the bias caused by data contamination. Subsequently, we use the maximum bias and propose a two-step GMM estimator, which weights the (median-based) moment conditions both by their variance and bias; this guarantees that imprecise or biased moment conditions get low weights in estimation. Finally, as the number of applicable moment conditions grows quadratically with the number of time periods, a suitable number of moment conditions for the underlying data generating process needs to be selected using a robust version of moment selection procedure of Hall et al. (2007).
In the rest of the paper, the new estimator is introduced first in Sect. 2. Its robust properties are studied in Sect. 3 and are used to define the data-dependent GMM weights. The existing and proposed methods are then compared by means of Monte Carlo simulations in Sect. 4 and the proofs can be found in the "Appendix".

Median-based estimation of dynamic panel models
The dynamic panel data model (Sect. 2.1) and its median-based estimation (Sect. 2.2) will be now discussed. Later, the two-step GMM estimation procedure (Sect. 2.3) and the moment selection method (Sect. 2.4) will be introduced.

Dynamic panel data model
Consider the simple dynamic panel data model (i = 1, . . . , n; t = 1, . . . , T ; T ≥ 3) where y it denotes the response variable, η i is the unobservable fixed effect, and ε it represents the idiosyncratic error. Parameter |α| < 1 so that this data generating process can be stationary. The number T of time periods is fixed, which implies that fixed or stochastic effects η i are nuisance parameters and cannot be consistently estimated. Finally note that the extension of the discussed estimators to a model with exogenous covariates is straightforward (see Dhaene and Zhu 2017, Section 4.1). As in Aquaro andČížek (2014) and similarly to Cermeño (1999) and Han et al. (2014), we will consider model (1) under the following assumptions: A.1 Errors ε it are independent across i = 1, . . . , n and t = 1, . . . , T and possess finite second moments. Errors {ε it } T t=1 are also independent of fixed effects η i . A.2 The sequences {y it } T t=1 are time stationary for all i = 1, . . . , n. In particular, the first and second moments of y it conditional on η i exist and do not depend of time.
Except of the independence in Assumption A.1, there are no assumptions are made about the unobservable fixed effects η i . Although we impose rather strict Assumptions A.1 and A.3 on idiosyncratic errors, they can be relaxed. The errors ε it do not have to follow the same distribution across cross-sectional units i, allowing for heteroscedasticity. Additionally, the consistency of the estimators introduced below requires that the joint distributions of errors {ε it } T t=1 are elliptically contoured, making the normality Assumption A.3 sufficient, but not necessary (see Dhaene and Zhu 2017, Section 4.2). On the other hand, the violation of the time-homoscedasticity in Assumption A.3 leads to the inconsistency of the discussed estimators. If ε it ∼ N(0, σ 2 t ) for t = 1, . . . , T , the model equation (1) has to be therefore rescaled by unknown standard deviations σ t , which can be treated as unknown parameters and estimated along with α by GMM. Finally, the stationarity Assumption A.2 is used not only by the proposed estimators, but also by frequently applied GMM estimators such as Blundell and Bond (1998) and it is implied by the assumptions of Han et al. (2014) if |α| < 1.

Median-based moment conditions
To generalize the estimator by Dhaene and Zhu (2017), let s denote the sth difference operator, that is, s υ t := υ t − υ t−s (cf. Abrevaya 2000; Aquaro andČížek 2013). Given model (1), stationarity Assumption A.2 implies for any integers s, q, p ∈ N that where the triplet j = (s, q, p) and r j = cov( s y it , p y it−q )/var( p y it−q ) are independent of i and t, max{s, p +q} < T . Consequently, the variables s y it −r j p y it−q and p y it−q are uncorrelated, and by Assumption A.3, independent and symmetrically distributed around zero. Hence, E[sgn( s y it − r j p y it−q ) sgn( p y it−q )] = 0 and E sgn s y it / p y it−q − r j = 0. The estimation of r j can be therefore based on the sample analog of this moment condition: To relate r j to the autoregressive coefficient α in (1), Aquaro andČížek (2014) derived under Assumptions A.1 and A.2 that the correlation coefficient r j satisfies the moment condition and asymptotically normal. Aquaro andČížek (2014)'s estimator (AC-DZ) of α uses s = q = 1 and p being odd, p < T − 1. They do not use differences with s > 1 due to their robustness properties: while they seem robust to sequences of outliers, they can lead to large biases if outliers occur at random times.

Two-step GMM estimation
To increase the precision and robustness of the estimation, we propose to extend the (AC-)DZ estimator by allowing for multiple differences with s = q ≥ 1 and p ≥ 1. We consider only s = q as the moment conditions (4) do not allow distinguishing outlying and regular observations for s = q as shown in Aquaro andČížek (2014). For s = q, (4) simplifies after dividing by 1 − α p and accordingly redefining g j (α) to The full set of moment conditions in (5) can be then written as where g(α) = {g j (α)} j ∈J and a fixed finite set J contains all triplets j = (s, q, p) that are considered in estimation. The DZ estimator then corresponds to the special case J = {(1, 1, 1)} and the AC-DZ relies on a set J = {(1, 1, p): 1 ≤ p < T − 1 odd}.
Here we consider all combinations with any s = q odd and p odd, J ⊆ J o = {(s, s, p): s ∈ N odd, p ∈ N odd, 1 ≤ s + p < T }, as the single moment conditions do not identify uniquely α for even values of s or p and this could negatively affect the bias caused by contamination.
Given the system of equations in (6), the parameter α can be estimated by the GMM procedure. This GMM estimator is referred to here as the pairwise-difference DZ (PD-DZ) estimator and is defined bŷ where g n (c) = (g n j (c)) j∈J is the sample analog of g(α) and corresponds to (5) with r j being replaced byr n j defined in (3). The weighting matrix A n can be initially chosen as in Aquaro andČížek (2014) proportional to the number of observations available for the estimation of each moment equation: A n = A = diag{(T − p−s)/T }. The traditional variance-minimizing choice of the GMM weighting matrix A n however equals the inverse of the variance matrix V n of the moment conditions g n (α), which converges to the asymptotic variance matrix V of the moment conditions (5); see the Appendix for the asymptotic distribution of α n and the asymptotic variance matrix V previously obtained by Aquaro andČížek (2014).
On the other hand, we aim to account also for the presence of outlying observations that can substantially bias the estimates. Since simply removing outliers would result in a substantial data loss as explained in the introduction, we propose instead to use the moment conditions (5) and minimize the mean squared error (MSE) of estimates instead of the asymptotic variance. First, let us denote the MSE of g n (α) by W n , Given a weighting matrix A n and the asymptotic linearity ofα n (see Aquaro andČížek 2014, the proof of Theorem 1) as n → ∞, it immediately follows that the MSE ofα n equals which is (asymptotically) minimized by choosing A n = W −1 n (Hansen 1982, Theorem 3.2).
Next, to create a feasible procedure, both the variance and squared bias matrices have to be estimated. The estimation thus proceeds in two steps: first, the (AC-)DZ estimator is applied to obtain an initial parameter estimates; then-after estimating the bias b n and variance V n of moment conditions-the GMM estimator with all applicable pairwise differences is evaluated using an estimate of the weighting matrix A n = [b n b n + V n ] −1 . On the one hand, the estimateV n of V n can be directly obtained from Theorem 5 in the "Appendix" using initial estimates of r j and α because both the responses y it as well as estimatesα n are continuously distributed with bounded densities due to the stationarity Assumptions A.2 andA.3. On the other hand, estimating b n byb n requires first studying the biases of median-based moment conditions and constructing a feasible estimate thereof in Sect. 3. Using estimatesV n andb n to constructŴ n =b nb n +V n andÂ n =Ŵ −1 n then leads to the proposed second-step GMM estimator α n = arg min c∈(−1,1) g n (c) Â n g n (c) = arg min c∈ (−1,1) g n (c) [b nb n +V n ] −1 g n (c). (9)

Robust moment selection
The proposed two-step GMM estimator is based on the moment conditions (5), and given that we consider only odd s and p, their number equals approximately T (T − 1)/8 and grows quadratically with the number of time periods. Although the extra moment conditions based on higher-order differences might improve precision of estimation for larger values of |α|, their usefulness is rather limited if α is close to zero. At the same time, a large number of moment conditions might increase estimation bias due to outliers. More specifically, Aquaro andČížek (2014) showed for α close to 0 that the original moment condition of the DZ estimator s = q = p = 1 is least sensitive to random outliers, for instance; including higher-order moment conditions then just increases bias, does not improve the variance, and is thus harmful.
To account for this, we propose to select the moment conditions used in estimation by a robust analog of a moment selection criterion (e.g., see Cheng and Liao 2015, for an overview). Since all moments are valid and no weak instruments are involved, the information content of the moment equations and their number have to be balanced as in Hall et al. (2007), whose approach to moment selection in the presence of nearly redundant moment conditions can be adapted to robust estimation. They propose the so-called relevant moment selection criterion (RMSC) that-for a given set of moment conditions defined by triplets J in our case-equals MatrixV n,J represents an estimate of the variance matrix V J of moment conditions (6) defined by triplets J and κ(·, ·) is a deterministic penalty term depending on the number |J | of triplets (or moment conditions) and on the sample size n used for estimating the elements of V n (see Theorem 5). To select relevant moment conditions, this criterion has to be minimized: Two examples of the penalization term used by Hall et al. (2007) where the number of estimated parameters K = 1 in model (1) and constant κ c > 2.
As in Sect. 2.3, the proposed robust estimator (9) should minimize the MSE error rather than just the variance of the estimates. We therefore suggest to use the relevant robust moment selection criterion (RRMSC), which is based on the determinant of an estimateŴ n of the MSE matrix W n rather than on the variance matrix estimateV n of the moment conditions. The relevant robust moment conditions are then obtained by minimizinĝ

Robustness properties
There are many measures of robustness that are related to the bias of an estimator, or more typically, the worst-case bias of an estimator due to an unknown form of outlier contamination. In this section, various kinds of contamination are introduced and some relevant measures of robustness are defined (Sect. 3.1). Using these measures, we characterize the robustness of moment conditions (5) in Sect. 3.2 and the robustness of the GMM estimator (7) in Sect. 3.3. Next, we use these results to estimate of the bias of the moment conditions (5) as discussed in Sect. 3.4. Finally, the whole estimation procedure is summarized in Sect. 3.5.

Measures of robustness
Given that the analyzed data from model (1) are dependent, the effect of outliers can depend on their structure. Therefore, we first describe the considered contamination schemes and then the relevant measures of robustness. More formally, let Z be the set of all possible samples Z = {z it } of size (n, T ) following model (1) and let Z = {z it } be a contaminating sample of size (n, T ) following a fixed data-generating process, where the index of Z indicates the probability that an observations in Z is different from zero. The observed contaminated sample is Similarly to Dhaene and Zhu (2017), we consider the contamination by independent additive outliers following a degenerate distribution with the point mass at ζ , and by patches of k additive outliers, where ν it follows the Bernoulli distribution with the parameter˜ such that where Pr (a it−l = ζ ) = 1/2 and Pr (a it−l = −ζ ) = 1/2 and where ν it is defined as in Z 2 ,ζ . Note that (12) and (13) are special cases of a more general type of contamination and − 1 ≤ ρ ≤ 1. Note that this general type of contamination closely corresponds to the contamination by innovation outliers for large k and ρ = α and it is therefore important to study. As we can however conjecture from Dhaene and Zhu (2017)'s results for s = p = 1 that the contamination scheme Z 4 ,ζ biases estimates towards ρ for ζ → +∞ and ρ is unknown in practice, we are not analysing this most general case with ρ ∈ [− 1, 1]. Instead, we concentrate on the most extreme cases of ρ = 1 and ρ = − 1 as they can arguably bias the estimate most. Hence, the contamination schemes Z 1 ,ζ , Z 2 ,ζ , and Z 3 ,ζ bias the DZ estimates of α towards 0, 1, and − 1, respectively-see Sect. 3.2 and Dhaene and Zhu (2017).
Given the contamination schemes, one of the traditional measures of the global robustness of an estimator is the breakdown point. It can be defined as the smallest fraction of the data that can be changed in such a way that the estimator will not reflect any information concerning the remaining (non-contaminated) observations. Aquaro andČížek (2014) derived the breakdown points of the estimatorsr j , j ∈ J , for contamination schemes Z 1 ,ζ , Z 2 ,ζ , and Z 3 ,ζ , and under some regularity conditions, proved that the breakdown point of the GMM estimator (7) equals the breakdown point of the DZ estimatorr (1,1,1) if (1, 1, 1) ∈ J . While such results characterize the global robustness of the PD-DZ estimators, they are not informative about the size of the bias caused by outliers.
We therefore base the estimation of the bias due to contamination on the influence function. It is a traditional measure of local robustness and can be defined as follows. Let T (Z + Z ) denote a generic estimator of an unknown parameter θ based on a contaminated sample Z , where Z and Z have been defined at the beginning of Sect. 3. As the definition is asymptotic, let T (θ, ζ, , T ) be the probability limit of T (Z + Z ) when T is fixed and n → ∞. Note that T (θ, ζ, , T ) depends on the unknown parameter θ describing the data generating process, on the fraction of data contamination, on the non-zero value ζ characterizing the outliers, and on the number of time periods T . Assume T is consistent under non-contaminated data, that is, T (θ, ζ, 0, T ) = θ . The influence function (IF) of estimator T at data generating process Z due to contamination Z is defined as where the equality follows by the definition of asymptotic bias of T due to the data contamination Z , bias(T ; θ, ζ, , (If IF does not depend on the number T of time periods, T can be omitted from its arguments.) Clearly, the knowledge of the influence function allows us to approximate the bias of an estimator T at Z + Z by · IF(T ; θ, ζ, T ). Although such an approximation is often valid only for small values of > 0 (e.g., in the linear regression model, where the bias can get infinite), it is relevant in a much wider range of contamination levels in model (1) given that the parameter space (− 1, 1) is bounded and so is the bias (the dependence of the bias on the contamination level has been studied by Dhaene and Zhu 2017).
The disadvantage of approximating bias by · IF(T ; θ, ζ, T ) is that it depends on the unknown magnitude ζ of outliers. We therefore suggest to evaluate the supremum of the influence function, the gross error sensitivity (GES) and approximate the worst-case bias by · GES(T ; θ, T ). For the PD-DZ estimator and the corresponding moment conditions, IF and GES are derived in the following Sects. 3.2 and 3.3, where T will equal toα andr j , respectively (without the subscript n since the IF and GES definitions depend only on the probability limits of the estimators).

Influence function
The GMM estimator (7) is based on moment conditions depending on the data only by means of the medians r j . We therefore derive first the influence functions of the estimatorsr j and then combine them to derive the influence function of the GMM estimator. Building on Dhaene and Zhu (2017, Theorems 3.2 and 3.7), the IFs ofr j in model (1) under contamination schemes Z 1 ,ζ , Z 2 ,ζ , and Z 3 ,ζ are derived in the following Theorems 1-3. Only the point-mass distribution G ζ with the mass at ζ ∈ R is considered. In all theorems, denotes the cumulative distribution function of the standard normal distribution N(0, 1).

Theorem 3 Let Assumptions
Then it holds in model (1) under the patched-additive-outlier contamination Z 3 ,ζ with point-mass distribution at ζ = 0 and patch length k ≥ 2 that (74), (75), (77), (79), The influence functions reported in Theorems 1-3 are complicated objects both due to their algebraic forms and their dependence on the unknown parameter values α and ζ . As ζ is generally unknown, we characterize the worst-case scenario by means of the gross error sensitivity: recall that GES(r j ; α) = sup ζ IF(r j ; α, ζ ) by Eq. (16). Inspection of the influence functions and their elements in Theorems 1-3 reveals though that the largest effect can be attributed to outliers with magnitude |ζ | → +∞ (possibly with an exception of term E(1/2) in Theorem 3).
Given the results in Theorems 1-3, we thus have to compute the GES of estimatorŝ r j numerically for each j = (s, s, p) ∈ J o and α ∈ (− 1, 1). Although this might be relatively demanding if T is large and a dense grid for α is used, note that the GES values are asymptotic and independent of a particular data set. They have to be therefore evaluated just once and then used repeatedly during any application of the proposed PD-DZ estimator. We computed the GES ofr j for j ∈ {(s, s, p); s = 1, 3, 5, 7 and p = 1, 3, 5, 7, 9, 11} with the variance σ 2 ε set equal to one without loss of generality. The results corresponding to Theorems 1-3 are depicted on Figs. 1, 2 and 3. Irrespective of the contamination scheme, most GES curves display typically higher sensitivity to outliers for |α| close to one than for values of the autoregressive parameter around zero. One can also see that the DZ estimator corresponding to s = 1 and p = 1 is indeed biased towards 0, 1, and − 1 for the contamination schemes Z 1 ,ζ , Z 2 ,ζ , and Z 3 ,ζ , respectively. Concerning the higher-order differences we propose to add to the (AC-)DZ methods, Fig. 1   outliers. On the other hand, their sensitivity to the patches of outliers on Fig. 2, for instance, decreases with an increasing s and becomes very low (relative to s = 1 and p ≥ 1) if s is larger than the patch length k, for example, s = 7 > k = 6.

Robust properties of the GMM estimatorα n
Given the results of the previous sections, we will now analyze the robust properties of the general GMM estimatorα based on moment equations (6) for j = (s, s, p) ∈ J o . The results are stated first for the initial PD-DZ estimator (7) with a deterministic weight matrix and later for the second step of the PD-DZ estimator (9). Since the weight matrix and the bias in particular can be estimated in different ways, we consider in the latter case a weight matrix as a general function of the parameter α and the considered fraction of outliers.  First, assume that A n = A 0 is a positive definite diagonal matrix. Then the influence function of the GMM estimatorα 0 using moment conditions indexed by J is given by where d is defined in Theorem 5 and ψ is the |J | × 1 vector of the influence functions of each singler j , ψ = IF(r j ; α, ζ ) j ∈J .
Next, assume that A n = A(α 0 n , ) is a positive definite matrix function of the initial estimateα 0 n based on the deterministic weight matrix A 0 . If A n = A(α 0 n , ) → A = A(α, ) has a finite probability limit and bounded influence function as n → ∞, then the influence function ofα using moment conditions indexed by J is again given by alpha GES (s, p) ( Contrary to the breakdown point of Aquaro andČížek (2014) mentioned earlier, the bias of the proposed PD-DZ estimators is a linear combination of the biases of the individual moment conditions depending onr j . To minimize the influence of outliers on the estimator, one could theoretically select the moment condition with the smallest IF value, which could however result in a poor estimation if the moment condition is not very informative of the parameter α. As suggested in Sect. 2.3, we aim to minimize the MSE of the estimates and thus downweight the individual moment conditions if their biases or variances are large. Obviously, this will also lead to lower effects of biased or imprecise moment conditions on the IF in Theorem 4. To quantify the maximum influence of generally unknown outliers on the estimate, the GES function of the GMM estimator, that is, the supremum of IF in (21) with respect to ζ can be used again.

Estimating the bias
The IF and GES derived in Sect. 3.2 characterize only the derivative of the bias caused by outlier contamination. We will refer to them in the case of contamination schemes Z 1 ,ζ , Z 2 ,ζ , and Z 3 ,ζ by IF c k and GES c k , c = 1, 2, 3, respectively, where k denotes the number of consecutive outliers (patch length) in schemes Z 2 ,ζ , and Z 3 ,ζ . Whenever the sequence of consecutive outliers is mentioned in this section, we understand by that a sequence of observations y it , t = t 1 , . . . , t 2 , that can all be considered outliers.
To approximate b n = Bias{g n (α)} introduced in Sect. 2.3, we therefore need to estimate the type and amount of outliers in a given sample. Assuming that the consecutive outliers form sequences of length k and the fraction of such outliers in data is denoted k , the bias can be approximated using the k -multiple of | IF 1 1 | or GES 1 1 if k = 1 and of max{| IF 2 k |, | IF 3 k |} or max{GES 2 k , GES 3 k } if k > 1 since we cannot reliably distinguish contamination Z 2 ,ζ and Z 3 ,ζ . Given that the outlier locations cannot be reliably computed either, GES is preferred for estimating the bias due to contamination.
We therefore suggest to compute the bias vector b n in the following way, provided that the estimatesˆ k of the fractions of outliers forming sequences or patches of length k are available: whereα 0 n is an initial estimate of the parameter α and the inner maximum is taken over c ∈ {1} for k = 1 and c ∈ {2, 3} for k > 1. Note that if outliers (or particular types of outliers) are not present,ˆ k = 0 and the corresponding bias term is zero.
To estimateˆ k , an initial estimateα 0 n is needed. Once it is obtained by the DZ or AC-DZ estimator, the regression residualsε it can be constructed, for example, byû it = y it −α 0 n y it−1 andε it =û it − med t=2,...,Tûit for any i = 1, . . . , n and t = 2, . . . , T ; the median med t=2,...,Tûit is used here as an estimate of the individual effect η i similarly to Bramati and Croux (2007). Having estimated residualsε it , the outliers are detected and the fractions k of outliers in data forming the patches or sequences of k consecutive outliers are computed. We consider as outliers all observations with |ε it | > γσ ε , whereσ ε estimates the standard deviation of ε it , for example, by the median absolute deviationσ ε = MAD(ε it )/ −1 (3/4), and γ is a cut-off point. Although one typically uses a fixed cut-off point such as γ = 2.5, it can be chosen in a data-adaptive way by determining the fraction of residuals compatible with the normal distribution function of errors, for instance. This approach pioneered by Gervini and Yohai (2002) determines the cut-off point as the quantile of the empirical distribution function F + n of |ε it |/σ ε :γ for where F + 0 (t) = (t) − (−t), t ≥ 0, denotes the distribution function of |V |, V ∼ N (0, 1).

Algorithm
The whole procedure of the bias estimation, and subsequently, the proposed GMM estimation with the robust moment selection can be summarized as follows.
1. Obtain an initial estimateα 0 n by DZ or AC-DZ estimator. 2. Compute residualsû it = y it −α 0 n y it−1 andε it =û it −med t=2,...,Tûit and estimate their standard deviationσ ε . 3. Using the data-adaptive cut-off point (23), determine the fractionsˆ k of outliers present in the data in the forms of outlier sequences of length k. 4. Approximate the bias b n due to outliers byb n using (22)  Let us note that the algorithm in step 5 does not evaluate the GMM estimates for all subsets of indices J ⊆ J o and the corresponding moment conditions as that would be very time-consuming. It is therefore suggested to limit the number of J o subsets and one possible proposal, which always includes the DZ condition in the estimation, is described in point 5 of the algorithm. If an extensive evaluation of many GMM estimators has to be avoided, it is possible to opt for a simple selection between the DZ, AC-DZ, and PD-DZ estimator, where PD-DZ uses all moment conditions defined by J o .

Monte Carlo simulation
In this section, we evaluate the finite sample performance of the proposed and existing estimators by Monte Carlo simulations to see whether the proposed method can weight the moment conditions so that it picks and mimicks the performance of the better estimator (e.g., out of those with fixed sets of moment conditions such as DZ and AC-DZ) for each considered data generating process.
Let {y it } follow model (1). We generate T + 100 observations for each i and discard the first 100 observations to reduce the effect of the initial observations and to achieve stationarity. We consider cases with α = 0.1, 0.5, 0.9, n = 25, 50, 100, 200, T = 6, 12, η i ∼ N(0, σ 2 η ), and ε it ∼ N(0, 1). If data contamination is present, it follows the contamination schemes (11) and (12) for = 0.20. More specifically, Z 1 ,ζ and Z 2 ,ζ used with p = 3 are both based on ζ drawn for each outlier or patch of outliers randomly from U (10, 90); U (·, ·) denotes here the uniform distribution. The extreme values of outliers are chosen as they are supposed to have the largest influence on the estimates-cf. Theorem 1, for instance. Note that we have also considered mixes of two contamination schemes, for example, mixing equally independent additive outliers and patches of outliers, but the results are not reported as they are just convex combinations of the corresponding results obtained with only the first and only the second contamination schemes.
All estimators are compared by means of the mean bias and the root mean squared error (RMSE) evaluated using 1000 replications. The included estimators are chosen as follows. The non-robust estimators are represented by the Arellano-Bond (AB) twostep GMM estimator 1 (Arellano and Bond 1991), the system Blundell and Bond (BB) estimator 2 (Blundell and Bond 1998), and the X-differencing (XD) estimator (Han et al. 2014). The globally robust estimators are represented by the original DZ and AC-DZ estimators and by the proposed PD-DZ estimator. For the latter, we consider two different moment selection criteria RRMSC: BIC and HQIC with κ c = 2.1 introduced in Sect. 2.4.
Considering the clean data first (see Table 1), most estimators exhibit small RMSEs except of the AB estimator that is usually strongly negatively biased if α is close to 1. The BB estimator performs well under these circumstances as expected, but is outperformed by the XD estimation. Regarding the robust estimators, the results are closer to each other for T = 6 than for T = 12 since there are only three possible moment conditions (5) if T = 6. The DZ estimator based on the first moment condition only is lacking behind AC-DZ and PD-DZ when α is not close to zero and additional higher-order moment conditions thus improve estimation. The results for AC-DZ and PD-DZ are rather similar in most situations, with PD-DZ becoming relatively more precise as n increases due to less noisy moment selection. Overall, adding moment conditions improves performance of AC-DZ and PD-DZ relative to DZ; the performance of PD-DZ is worse than that of the AB and BB estimators for α = 0.1, matches them for α = 0.5, and outperforms them for α = 0.9.
Next, the two different data contaminations schemes are considered: independent additive outliers and the patches of additive outliers. Considering the independent additive outliers first (see Table 2), which generally bias estimates toward zero and lead thus to larger biases especially for values of α close to 1, AB, BB, and XD are 1 The (optimal) inverse weight matrix, which is used here, is the matrix of instruments per individual and H is a (T − 1) × (T − 1) tridiagonal matrix with 2 in the main diagonal, − 1 in the first two sub-diagonals, and zeros elsewhere (see Arellano and Bond 1991, p. 279). 2 The inverse weight matrix is i Z BB i G Z BB i , where Z BB i is the matrix of instruments per individual and G is a partitioned matrix, G = diag (H, I), where H is as in Arellano-Bond and I is the identity matrix [see Kiviet 2007, Eq. (38)].  strongly biased in all cases as expected. In the case of robust estimators, the negative biases of DZ, AC-DZ, and PD-DZ are rather small, although increasing with α. As AC-DZ outperforms DZ in this case, PD-DZ should and does exhibit performance more similar to AC-DZ than to DZ; PD-DZ even outperforms AC-DZ for α = 0.9 or the largest sample size. This confirms the functionality of the weighting as the inclusion of higher-order differences with s > 1 in PD-DZ could lead to large biases due to independent additive outliers especially for α = 0.9, see Fig. 1. On the other hand, the higher-order differences with s > 1 should provide benefits when the data are contaminated by the patches of additive outliers, see Table 3. This type of contamination leads again to substantially biased non-robust estimates by XD, AB, and BB, In the case of the robust estimators, the patches of outliers tend to bias them toward 1 and have thus largest effect for α close to 0. Hence, the biases of and more generally differences among the robust estimators are smallest for α = 0.9. For smaller values of α, DZ outperforms AC-DZ, in particular for α = 0.1, as the patches of outliers have a larger impact on the higher-order differences of AC-DZsee Fig. 2a. Thus, the proposed PD-DZ should and does perform similarly to DZ and actually outperforms it in most situations most cases with α ≤ 0.5, which again Table 2 Biases and RMSE for all estimators in data with  Table 3 Biases and RMSE for all estimators in data with confirms that the proposed weighting scheme is able to choose moment conditions that are less affected by the outliers. Note that the largest difference between DZ and PD-DZ is observed for T = 12 and α = 0.5 as the higher-order differences can be used only if the number T of time periods is sufficiently large and they have a reasonable precision only if α is not close to zero.

Concluding remarks
In this paper, we propose an extension of the median-based robust estimator for dynamic panel data model of Dhaene and Zhu (2017) by means of multiple pairwise differences. The newly proposed GMM estimation procedure that uses weights accounting both for the variance and outlier-related bias of the moment conditions is combined with the moment selection method. As a result, the estimator performs well in non-contaminated data as well as in data containing both independent outliers and patches of outliers.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

A Appendix
The outlier contamination schemes Z 1 ,ζ , Z 2 ,ζ , and Z 3 ,ζ are generally described by the contamination fraction and the magnitude of outliers ζ (recall that only the point-mass distribution G ζ is considered here). Therefore, we will denote the noncontaminated sample observations following model (1) by y it and the contaminated sample observations by y ζ, it . By definition of Z 1 ,ζ , Z 2 ,ζ , and Z 3 ,ζ , the difference w it = y ζ, it − y it can only equal −ζ , 0, or ζ . In order to prove the theorems concerning the influence function ofα, it is useful to derive first the asymptotic bias ofr j as an estimator of r j . Similarly to Sect. 3.1, it is defined as bias r j ; r j , ζ, where plim denotes the probability limit operator. Let b := b(r j , ζ, ) be a short-hand notation for (24). Then, b solves the following equation: Since r j is considered only for j = (s, s, p) ∈ J o , where both s and p are odd, r j = −(1 − α s )/2. This mapping of α to r j = −(1 − α s )/2 has the same important properties for s = 1 and any odd s > 1: it maps interval (− 1, 0) to (− 1, − 1/2) and interval (0, 1) to (− 1/2, 0), it is continuous, and it is strictly increasing on (− 1, 1).
One can thus follow the proofs in Dhaene and Zhu (2017, Theorems 3.5 and 3.8) and apply them not only to the case of s = p = 1, but any odd s and p with only two adjustments: (i) the variables s y it − r j p y it−s and p y it−s have to be standardized (Dhaene and Zhu 2017, equation (A.3)) and their variances generally depend on the values of s and p and (ii) in the case of patches of outliers, the probability that a patch contaminates the ratio s y it / p y it−s needs to be generalized.
As for (i), note that, by Eq.
(2), the variables s y it − r j p y it−s and p y it−s are uncorrelated, and by Assumption A.3, they are independent and normally distributed around zero. Additionally, the stationarity Assumptions A.1 and A.2 imply that, after substituting from the model equation, cov(y it , η) = α cov(y it−1 , η i ) + var(η i ), . From this result and Aquaro andČížek (2014, Equation (24)), we can thus conclude that (the diagonal structure of the covariance matrix can be also seen from Equation 2.2 that implies cov( s y it , p y it−s ) = r j var( p y it−s )).
Based on these observations, Aquaro andČížek (2014, Theorem 1) derived the asymptotic distribution of the PD-DZ estimator defined by Eq. (7), which is presented here for the case of the triplet sets J ⊆ J o .
Then for a fixed T and n → ∞,α n is consistent and asymptotically normal, where d = ∂ g(α)/∂α = {−sα s−1 } j=(s,s, p)∈J and V is has a typical element with indices j = (s, s, p) ∈ J , j = (s , s , p ) ∈ J defined by where residual is a random vector, and f (w it ) is a random scalar. Let w it be the set of the eight possible outcomes of w it , that is, where the number of elements is # w it = 8. To simplify the notation, let us refer to (29) as it , and denote each of its element as ω it j , j = 1, . . . , 8. Then it holds Note that Pr w it = ω it j = Pr w it = ω it j for some j and j because the data contamination Z 1 ,ζ is characterized by outliers occurring independently from each other.  [(ζ, ζ, ζ ) ]. Therefore, Eq. (28) can be decomposed as where the equality follows from the implicit function theorem applied to (37) and where As in Dhaene and Zhu (2017, Equation (A.4)), where σ * is defined in (36) and X, Y ∼ N(0, 1). Hence, A(r j , 0) = 1/2 and (recall that r j = (1 − α s )/2). Next, Dhaene and Zhu (2017, Lemma A.1) implies that, for X, Z ∼ N(0, 1) and constants c, c , c , P{(X + c)/Z ≤ 0} = 1/2 and P{(X Hence, the definition of B(r j , ζ, b) and the standardization (34) imply Substituting for σ u := var(u it j ) and σ p := var( p y it−s ) from (26) and r j = −(1 − α s )/2 into (42) and for terms A(r j , 0), B(r j , ζ, 0), and A b (r j , 0) in (38) completes the proof.
A.2 Patch additive outlier contamination Z 2

,ζ
As in "Appendix A.1", it is useful to derive first the asymptotic bias ofr j under the outlier contamination Z 2 ,ζ as defined in (12). This is given by b := b(r j , ζ, , k) solving the equation where the notation is defined below. Note that the decomposition in the second equality follows along the same lines as in "Appendix A.1", in particular Eq. (30). In this case, the only difference is that outliers no longer occur independently but in patches. The number of elements of it increases to # it = 13 as now, if we observe multiple outliers, we shall distinguish the event of the outliers belonging to the same patch from the event of these outliers belonging to different patches. For instance, (0, ζ, ζ ) may be that result of one patch only, (0, ζ 1 , ζ 1 ) , or of two patches, (0, ζ 2 , ζ 1 ) , where the subscript of ζ indicates the patch. Recalling that (1 −˜ ) k = 1 − , and p A = 1 − p B − p C − p D . Next, the terms A, B, C, D are defined for r j , ζ , and b as follows: where the symmetry L(k, l, b) = L(−k, −l, b) has been used, recall Eq. (33).