Assessing the consistency of the fixed-effects estimator: a regression-based Wald test

Under large-n and fixed-T panel data asymptotics, we develop a method to test a sufficient condition for the FE estimator’s consistency using a stacked regression framework. The resulting test exploits a previously unnoted relation between the fixed-effects estimator and the short- and long-differences estimators. It takes the familiar form of a panel-robust Wald test, but is also shown to be asymptotically equivalent to a GMM test. We provide a theoretical comparison between our test and two existing ones from the literature, which are shown to focus on generic strict exogeneity conditions instead of being specifically related to the FE estimator’s moment conditions. We investigate our test’s finite-sample properties in a simulation study, where we continue the comparison with the other tests. We show that our test has good finite-sample properties, especially if the estimator of the covariance matrix is based on a panel bootstrap. The practical use of our test is illustrated in two applications to existing data from the literature.


Introduction
Since the early days of econometrics, the fixed-effects (or within) estimator has been widely used to estimate the linear panel regression model in the presence of individual effects correlated with the regressors (Mundlak 1961; Mundlak and Hoch 1965). Because the fixed-effects (FE) estimator exploits a single moment condition for each covariate, it is just identified. Viewed from a GMM perspective, it is therefore not possible to test the validity of these moment conditions by means of the J-test for overidentifying moment conditions. In other situations, we may suspect that some covariates are unlikely to satisfy the moment conditions imposed by the FE estimator. If instrumental variables (IVs) are available for these covariates, we could test the FE estimator against the FE-IV estimator using a Hausman test (Hausman 1978).

Laura Spierdijk (L.Spierdijk@utwente.nl), Section Financial Engineering, Department of High-tech Business and Entrepreneurship, Faculty of Behavioural, Management and Social Sciences, University of Twente, Enschede, The Netherlands
To the best of our knowledge, consistency tests for the FE estimator that do not require IVs are rare. The present study seeks to fill this gap in the literature by developing a test to validate a sufficient condition for the consistency of the FE estimator that does not use such side information. We derive the test by exploiting a previously unnoted relation between the FE estimator and the short- and long-differences estimators. The resulting test can be performed in the familiar form of a panel-robust Wald test for certain parameter restrictions in a stacked regression framework.
Because our test turns out asymptotically equivalent to a GMM test, our approach also fits in the familiar setting of GMM estimation and specification testing. Consequently, the asymptotic properties of our Wald test are standard and well documented in the literature (Newey 1985;Cameron and Trivedi 2005;Hall 2005). The theoretical part of our study uses the link with GMM testing to draw a formal comparison between our test and the ones proposed by Wooldridge (2010) and Su et al. (2016), where the latter is an extension of the former. Both tests make use of auxiliary regressions to assess the validity of certain strict exogeneity conditions. We show that our test validates sufficient conditions for the consistency of the FE estimator, while the other two focus on more generic strict exogeneity conditions that are neither sufficient nor necessary for the consistency of the FE estimator.
We investigate our test's finite-sample properties in a simulation study, where we continue the comparison between our Wald test and the strict exogeneity test of Su et al. (2016). We use a simulation design in which the two tests are either both consistent or both inconsistent. Our Wald test generally exhibits good finite-sample properties, especially if the estimator of the covariance matrix is based on a panel bootstrap.
The empirical behavior of our Wald test is illustrated in two empirical applications that elaborate on existing studies from the literature. Both McKinnish (2008) and Erickson and Whited (2000) apply the linear panel regression model to data sets containing an explanatory variable that is suspected to be subject to measurement error, which would render the FE estimator inconsistent. In the context of our theoretical results, these data sets provide a particularly relevant empirical case for our Wald test. For the linear panel regression model applied to the data of McKinnish (2008), our Wald test rejects the sufficient condition for the FE estimator's consistency. Applied to the data of Erickson and Whited (2000), however, our Wald test finds no such evidence. We also run the test of Su et al. (2016) and draw the comparison with the outcomes of our test.
Because our test validates a sufficient condition for the consistency of the FE estimator, it is possible that the FE estimator is consistent even though this condition does not hold. Furthermore, we will show that there is also a possibility that the Wald test has low power in certain cases. We will provide recommendations on how to remedy such type I and type II errors using additional analysis. Hence, although our test does not require instrumental variables, it should be combined with further investigations.
Our approach connects to different strands of literature. From the time series literature, we take the idea of a test that exploits taking differences (e.g., Plosser et al. 1982; Davidson et al. 1985; Breusch and Godfrey 1986; Thursby 1989). We combine this idea with the insight of Griliches and Hausman (1986, p. 114) that the linear panel regression model is misspecified if short- and long-differences estimators differ significantly. The resulting stacked regression framework makes it easy for researchers to run our Wald test routinely. In this way, we extend the panel data literature on large-n and fixed-T specification testing, which includes but is not limited to tests for overidentifying restrictions (Hayakawa 2019), random effects vs. fixed effects and FE vs. FE-IV (Hausman 1978; Baltagi et al. 2003; Amini et al. 2012; Joshi and Wooldridge 2019), unit roots (Harris and Tzavalis 1999), selectivity bias (Verbeek and Nijman 1992; Wooldridge 1995), cross-sectional dependence (Sarafidis and Wansbeek 2012) and GMM-based tests for autocorrelation in error terms (Arellano and Bond 1991).
The setup of the remainder of this study is as follows. Section 2 describes the regression framework that we propose to estimate and test the FE estimator. Section 3 introduces the test statistic and discusses its statistical properties based on the literature (asymptotic behavior), followed by a simulation study (finite-sample behavior) in Sect. 4. Both sections draw the comparison with the test proposed by Su et al. (2016). Our approach is illustrated in Sect. 5, where we provide two applications to existing data from the literature. Lastly, Sect. 6 concludes. An appendix with supplementary material is available.

Regression framework
We are interested in estimating the static linear panel regression model with $T \geq 3$ time observations, given by

$$y_i = \gamma_i \iota_T + X_i \beta_0 + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{1}$$

where $y_i$ ($T \times 1$) is the dependent variable, $\gamma_i$ the individual-specific intercept, $\iota_T$ ($T \times 1$) a vector of ones, $X_i$ ($T \times k$) the matrix of observed covariates, $\beta_0$ ($k \times 1$) the unknown coefficient vector and $\varepsilon_i$ ($T \times 1$) the error term.

FE and differences estimators
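Let $D_j$ be the $(T-j) \times T$ matrix that takes differences over time span $j = 1, \ldots, T-1$ and write $\Delta_j = D_j' D_j$. We define $\hat\beta_j$ as the OLS estimator of $\beta_0$ in (1) after taking differences over time span $j$, yielding

$$\hat\beta_j = \Big(\sum_{i=1}^{n} X_i' \Delta_j X_i\Big)^{-1} \sum_{i=1}^{n} X_i' \Delta_j y_i. \tag{2}$$

We denote the centering matrix of order $T$ by $A_T = I_T - \iota_T \iota_T'/T$. This matrix is symmetric, idempotent of rank $T-1$ and orthogonal to $\iota_T$. We observe that $\Delta_j$ has the $j$th pseudo-diagonal equal to $-1$, while all other pseudo-diagonals are zero. Moreover, $\sum_j \Delta_j$ has diagonal elements equal to $T-1$ since all rows add to zero. As a result,

$$A_T = \frac{1}{T} \sum_{j=1}^{T-1} \Delta_j. \tag{3}$$

We use the relation between $A_T$ and the $\Delta_j$s to rewrite the FE estimator of $\beta_0$ in terms of the differences estimators $\hat\beta_j$, resulting in

$$\hat\beta_{FE} = \Big(\sum_{i=1}^{n} X_i' A_T X_i\Big)^{-1} \sum_{i=1}^{n} X_i' A_T y_i = \sum_{j=1}^{T-1} W_j \hat\beta_j,$$

where

$$W_j = \Big(\sum_{l=1}^{T-1} \sum_{i=1}^{n} X_i' \Delta_l X_i\Big)^{-1} \sum_{i=1}^{n} X_i' \Delta_j X_i. \tag{4}$$

The complete derivation of this result is given in Appendix A. This leads to Result 1.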

Result 1
The FE estimator is the weighted matrix average of differences estimators, i.e.,

$$\hat\beta_{FE} = \sum_{j=1}^{T-1} W_j \hat\beta_j, \tag{5}$$

with $W_j$ as in (4).

Now let $\beta_j \equiv \operatorname{plim}_{n\to\infty} \hat\beta_j$ and assume that standard regularity conditions for large-n and fixed-T panel data hold. We note that

$$\operatorname{plim}_{n\to\infty} \hat\beta_{FE} = \sum_{j=1}^{T-1} \Big(\operatorname{plim}_{n\to\infty} W_j\Big) \beta_j, \tag{6}$$

and observe that these limiting weight matrices also sum to the identity matrix. Consequently, from (6) it follows that $\operatorname{plim}_{n\to\infty} \hat\beta_{FE} = \beta_0$ under $H_0: \beta_1 = \ldots = \beta_{T-1} = \beta_0$. Stated differently, a sufficient condition for the consistency of the FE estimator is that all of the differences estimators are consistent. This leads to Corollary 1.

Corollary 1 Assume that the large-n and fixed-T panel data regularity conditions for GMM estimators as listed in Su et al. (2016) hold. Then the FE estimator is consistent if each of the differences estimators is consistent; i.e., if $\operatorname{plim}_{n\to\infty} \hat\beta_j = \beta_0$ ($j = 1, \ldots, T-1$), then $\operatorname{plim}_{n\to\infty} \hat\beta_{FE} = \beta_0$.
We note that $H_0$ is not a necessary condition for the consistency of the FE estimator. Under $H_1: \beta_j \neq \beta_{j+1}$ for at least one $j \in \{1, \ldots, T-2\}$, the FE estimator can still be consistent. This result follows directly from Result 1, but we will come back to it in Sect. 3.4.4.
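The algebra behind Result 1 can be checked numerically. The following is a minimal sketch, assuming NumPy and a simulated panel whose dimensions, seed and coefficient values are purely illustrative; the helper names `diff_matrix` and `weighted_moments` are hypothetical and not taken from the paper. It verifies that the centering matrix equals the scaled sum of the $\Delta_j$ matrices, that the weights $W_j$ sum to the identity, and that the FE estimator coincides with the weighted average of the differences estimators.

```python
import numpy as np

def diff_matrix(T, j):
    """(T - j) x T matrix D_j that takes differences over time span j."""
    D = np.zeros((T - j, T))
    for t in range(T - j):
        D[t, t] = -1.0
        D[t, t + j] = 1.0
    return D

# Illustrative simulated panel (all values are assumptions for the sketch).
rng = np.random.default_rng(0)
n, T, k = 200, 5, 2
X = rng.normal(size=(n, T, k))
beta0 = np.array([1.0, -0.5])
y = rng.normal(size=(n, 1)) + X @ beta0 + rng.normal(size=(n, T))

# Centering matrix A_T and Delta_j = D_j' D_j for j = 1, ..., T - 1.
A_T = np.eye(T) - np.ones((T, T)) / T
Deltas = [diff_matrix(T, j).T @ diff_matrix(T, j) for j in range(1, T)]
identity_holds = np.allclose(A_T, sum(Deltas) / T)

def weighted_moments(M):
    """Return sum_i X_i' M X_i and sum_i X_i' M y_i."""
    Sxx = np.einsum("itk,ts,isl->kl", X, M, X)
    Sxy = np.einsum("itk,ts,is->k", X, M, y)
    return Sxx, Sxy

# FE (within) estimator.
Sxx_fe, Sxy_fe = weighted_moments(A_T)
beta_fe = np.linalg.solve(Sxx_fe, Sxy_fe)

# Differences estimators and their weight matrices W_j.
Sxx_total = sum(weighted_moments(D)[0] for D in Deltas)
beta_j, W_j = [], []
for D in Deltas:
    Sxx, Sxy = weighted_moments(D)
    beta_j.append(np.linalg.solve(Sxx, Sxy))
    W_j.append(np.linalg.solve(Sxx_total, Sxx))

beta_avg = sum(W @ b for W, b in zip(W_j, beta_j))
print(identity_holds, np.allclose(beta_fe, beta_avg))
```

Because the identity is exact in finite samples, the check holds for any simulated data set, not only for the seed used here.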

Motivating examples
To illustrate the link between the (in)consistency of the FE and differences estimators, Table 1 provides four motivating examples. We consider the linear panel regression model with (i) classical measurement error, (ii) non-classical measurement error, (iii) omitted variables and (iv) simultaneity. The precise model specifications are described in the first column of Table 1 and given in more detail in Section B of the appendix with supplementary material. To ensure stationarity, we assume that all autoregressive parameters fall in the interval (−1, 1). In each of the four cases, the FE estimator and differences estimators are inconsistent for non-trivial parameter values. The second and third columns of Table 1 report the inconsistencies.

Stacked regression
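To use Corollary 1 for the construction of a statistical test for a sufficient condition for the FE estimator's consistency, it turns out to be useful to estimate the $\beta_j$s jointly. Let

$$y_{ij} = D_j y_i, \qquad j = 1, \ldots, T-1, \tag{7}$$

and define $X_{i1}, \ldots, X_{i,T-1}$ and $\varepsilon_{i1}, \ldots, \varepsilon_{i,T-1}$ analogously. Next, let

$$\tilde y_i = (y_{i1}', \ldots, y_{i,T-1}')', \qquad \tilde X_i = \operatorname{diag}(X_{i1}, \ldots, X_{i,T-1}), \qquad \tilde\varepsilon_i = (\varepsilon_{i1}', \ldots, \varepsilon_{i,T-1}')'.$$

We then write

$$\tilde y_i = \tilde X_i \beta_D + \tilde\varepsilon_i. \tag{8}$$

The stacked regression model in (8) allows us to estimate the $\beta_j$s jointly by means of OLS. We observe that the FE estimator arises as the constrained OLS estimator of $\beta_D = (\beta_1', \beta_2', \ldots, \beta_{T-1}')'$ in (8), under the parameter restriction $\beta_1 = \ldots = \beta_{T-1}$.

[Table 1: motivating examples. Rows include classical ME; non-classical ME (as classical, but $\sigma_{\theta\eta} \neq 0$); omitted variables; simultaneity. Notation: $\sigma^2_w$ denotes the variance of $w_{it}$, with $w \in \{\eta, \theta, \varepsilon\}$; $\sigma_{\theta\eta}$ stands for the covariance between $\theta_{it}$ and $\eta_{it}$; $A$ denotes the $T \times T$ centering matrix; $\Sigma_w$ is the matrix containing the covariances $\operatorname{Cov}(w_{ns}, w_{nt})$ for $w \in \{\xi, v, x, \varepsilon, u\}$; the matrix $\Sigma_{\xi v}$ contains the covariances $\operatorname{Cov}(\xi_{ns}, v_{nt})$; $\Sigma_{zx}$ is defined analogously.]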

Test procedure
Corollary 1 states that, if each differences estimator is consistent, also the FE estimator must be consistent. This observation is the starting point of our Wald test. In brief, we first estimate β D in (8) using (unconstrained) OLS. Subsequently, we use a Wald test to test H 0 against H 1 .

Wald test
To calculate the Wald test statistic, we need a cluster-robust estimator of the asymptotic covariance matrix in addition to $\hat\beta_D$. This estimator is given by

$$\hat V = \Big(\sum_{i=1}^{n} \tilde X_i' \tilde X_i\Big)^{-1} \Big(\sum_{i=1}^{n} \tilde X_i' \hat u_i \hat u_i' \tilde X_i\Big) \Big(\sum_{i=1}^{n} \tilde X_i' \tilde X_i\Big)^{-1}, \tag{9}$$

where $\hat u_i = \tilde y_i - \tilde X_i \hat\beta_D$. The Wald test statistic for the parameter restrictions $\beta_j = \beta_{j+1}$ ($j = 1, \ldots, T-2$) is given by

$$q_W = (R \hat\beta_D)' \big(R \hat V R'\big)^{-1} R \hat\beta_D, \tag{10}$$

where $R = B \otimes I_k$ and $B$ is the $(T-2) \times (T-1)$ matrix taking first differences, given by

$$B = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & -1 \end{pmatrix}. \tag{11}$$

Under $H_0$, the asymptotic distribution of the Wald test statistic is Chi-square with $k(T-2)$ degrees of freedom, while under fixed alternatives the test statistic converges in probability to infinity (Cameron and Trivedi 2005, Section 7.6.2). We therefore reject $H_0$ if $q_W$ exceeds the $100(1-\alpha)\%$ critical value of the Chi-square distribution with $k(T-2)$ degrees of freedom, with $\alpha$ the chosen significance level. The usual asymptotic properties of the Wald test hold under standard large-n and fixed-T panel data regularity conditions, namely the regularity assumptions listed in Su et al. (2016); we refer to this as Result 2.

If $H_0$ is rejected, the pattern in the $\hat\beta_j$s can help to assess the economic relevance of the rejection. This becomes particularly relevant if n is large, since large samples incur the risk of detecting economically minor violations of the null (Griliches and Hausman 1986, p. 110). If all $\hat\beta_j$s are close in value to the FE estimator, the economic importance of the rejection is considered limited. An informal visualization of the Wald test is obtained by plotting each element of $\hat\beta_j$ as a function of $j$, with the value of the FE estimator of each covariate's coefficient added as a horizontal line. We will refer to these plots as the 'differences curves.' These curves will be illustrated in the section with empirical applications.
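The test procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `differences_wald`, the simulated data and all parameter values are assumptions. It builds the block-diagonal stacked regressor matrix per individual, estimates the stacked coefficient vector by OLS, forms a cluster-robust sandwich covariance (clustered by individual, without finite-sample corrections) and evaluates the Wald statistic for equality of adjacent differences coefficients.

```python
import numpy as np

def differences_wald(y, X):
    """Wald statistic for beta_1 = ... = beta_{T-1} in the stacked regression.

    y has shape (n, T); X has shape (n, T, k).
    Returns (beta_D as a (T-1, k) array, q_W, degrees of freedom).
    """
    n, T, k = X.shape
    p = k * (T - 1)
    bread = np.zeros((p, p))
    Xy = np.zeros(p)
    blocks = []
    for i in range(n):
        rows_X, rows_y = [], []
        for j in range(1, T):
            block = np.zeros((T - j, p))
            # j-period differences D_j X_i go into the j-th column block.
            block[:, (j - 1) * k : j * k] = X[i, j:] - X[i, :-j]
            rows_X.append(block)
            rows_y.append(y[i, j:] - y[i, :-j])
        Xi, yi = np.vstack(rows_X), np.concatenate(rows_y)
        blocks.append((Xi, yi))
        bread += Xi.T @ Xi
        Xy += Xi.T @ yi
    beta_D = np.linalg.solve(bread, Xy)

    # Cluster-robust "meat": sum over individuals of score outer products.
    meat = np.zeros((p, p))
    for Xi, yi in blocks:
        score = Xi.T @ (yi - Xi @ beta_D)
        meat += np.outer(score, score)
    bread_inv = np.linalg.inv(bread)
    V = bread_inv @ meat @ bread_inv

    # R = B kron I_k, with B taking first differences of adjacent beta_j.
    B = np.eye(T - 2, T - 1) - np.eye(T - 2, T - 1, k=1)
    R = np.kron(B, np.eye(k))
    r = R @ beta_D
    q_W = float(r @ np.linalg.solve(R @ V @ R.T, r))
    return beta_D.reshape(T - 1, k), q_W, k * (T - 2)

# Illustrative data generated under the null (values are assumptions).
rng = np.random.default_rng(1)
n, T, k = 500, 4, 1
X = rng.normal(size=(n, T, k))
y = rng.normal(size=(n, 1)) + X @ np.array([1.0]) + rng.normal(size=(n, T))
beta_D, q_W, df = differences_wald(y, X)
print(beta_D.ravel(), q_W, df)
```

The statistic would be compared with the Chi-square critical value with `df` degrees of freedom. Plotting the rows of `beta_D` against $j$ gives the differences curves mentioned above.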

Relation to GMM tests
To analyze the properties of our Wald test in more detail, it is useful to draw the parallel with overidentifying tests in a GMM framework. From Newey and West (1987) and Newey and McFadden (1994) and the linearity of the moment conditions, we infer that $q_W$ is numerically identical to a GMM test statistic for a stacked regression model. This is formalized in Result 3.

Result 3 Under the regularity assumptions as listed in Su et al. (2016), the Wald test statistic $q_W$ is numerically identical to the overidentifying test statistic based on the two-step GMM estimator $\hat\beta_{GMM}$ of $\beta_0$ in the stacked regression model

$$\tilde y_i = \bar X_i \beta_0 + \tilde\varepsilon_i, \qquad \bar X_i = (X_{i1}', \ldots, X_{i,T-1}')', \tag{12}$$

using the instrument matrix $Z_i = \tilde X_i$, provided that both test statistics use the same estimator for the covariance matrix, with the requirement that this estimator is consistent under $H_0$.
Result 3 requires both test statistics to use the same consistent estimator for the covariance matrix. In practice, the GMM test statistic uses $\hat\beta_{GMM}$ to obtain a panel-robust estimator of the covariance matrix, while the Wald test statistic uses $\hat\beta_D$ to do so. Because both estimators of the covariance matrix are consistent under the null, this difference in covariance matrices does not matter for the asymptotic properties of the test statistics (Cameron and Trivedi 2005). We thus conclude that our Wald test statistic (with the panel-robust estimator of the covariance matrix based on $\hat\beta_D$) is asymptotically equivalent to the GMM test statistic (with the panel-robust estimator of the covariance matrix based on $\hat\beta_{GMM}$). The two tests have the same asymptotic power and size, under both the null and any (fixed or local) alternative hypothesis (Newey and West 1987; Newey and McFadden 1994).

Corollary 2 Under the regularity assumptions as listed in Su et al. (2016), the Wald test statistic $q_W$ is asymptotically equivalent to the overidentifying J-statistic corresponding to the two-step estimator of $\beta_0$ in (12) with instruments $Z_i$. The two tests have the same asymptotic power and size, under both the null and any (fixed or local) alternative hypothesis.
With $Z_i$ as the instrument matrix in the equivalent GMM test, we thus see that the overidentifying moment conditions are the $k(T-1)$ moment conditions imposed by the differences estimators. Hence, our test boils down to a GMM test for the overidentifying moment conditions

$$E\big(X_{ij}' \varepsilon_{ij}\big) = E\big(X_i' \Delta_j \varepsilon_i\big) = 0, \qquad j = 1, \ldots, T-1. \tag{13}$$

These moment conditions arise by 'unfolding' the moment condition imposed by the FE estimator

$$E\big(X_i' A_T \varepsilon_i\big) = 0. \tag{14}$$

Corollary 1 already established the link between (the probability limits of) $\hat\beta_{FE}$ and the $\hat\beta_j$s. By means of (13) and (14), we have now also shown how 'unfolding' connects the moment conditions of the FE and differences estimators.
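The 'unfolding' step can be written out explicitly. The short derivation below is a sketch that uses only the relation between the centering matrix and the differencing matrices stated earlier, $A_T = \frac{1}{T}\sum_{j} \Delta_j$ with $\Delta_j = D_j' D_j$; the symbols follow the notation of the text.

```latex
\begin{align*}
E\left(X_i' A_T \varepsilon_i\right)
  &= \frac{1}{T} \sum_{j=1}^{T-1} E\left(X_i' \Delta_j \varepsilon_i\right)
   && \text{since } A_T = \tfrac{1}{T} \sum_{j=1}^{T-1} \Delta_j \\
  &= \frac{1}{T} \sum_{j=1}^{T-1} E\left[(D_j X_i)'(D_j \varepsilon_i)\right]
   && \text{since } \Delta_j = D_j' D_j .
\end{align*}
```

The FE moment condition is thus the average of the $T-1$ differences moment conditions. If every term in the sum is zero, the FE condition holds; the converse need not be true, because nonzero terms can cancel. This is why the null hypothesis of the test is sufficient but not necessary for the consistency of the FE estimator.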

Trivial power
Result 2 makes clear that the power of the Wald test arises from the differences in the $\beta_j$s for different values of $j$ under $H_1$. In certain cases where the FE estimator is inconsistent, however, such differences may not exist. Despite the FE estimator's inconsistency, we will then find $\beta_1 = \beta_2 = \ldots = \beta_{T-1} \neq \beta_0$. Consequently, the asymptotic rejection rate of the Wald test will be equal to the chosen significance level, yielding 'trivial' asymptotic power. As shown by Newey (1985) and Hall (2005, Ch. 5), the issue of trivial power is inherent in overidentifying tests. These authors also provide a more technical discussion of the region where GMM tests have trivial power.

The practical implication of the existence of a parameter region with trivial asymptotic power is that our Wald test may have low empirical power in certain situations. In the motivating examples of omitted variables and measurement error shown in Table 1, trivial power arises for $\rho = \delta$. This can be inferred from the expressions for the $\beta_j$s in the table, which do not vary with $j$ for $\rho = \delta$. A test to assess whether two panel variables have the same degree of persistence could therefore prove useful in this scenario. In practice, however, misspecification is likely to be much more complex than in the motivating examples of Table 1. Consequently, we typically do not know when trivial power will arise and to what extent it is related to the persistence in the observed variables. It is therefore hard to think of a statistical test that could be used to recognize a case of trivial power. In fact, to the best of our knowledge, no remedy against trivial power exists other than cautiously interpreting the outcomes of GMM tests (Parente and Santos Silva 2012). It therefore remains important to look for other evidence against the FE estimator if the test does not reject the null hypothesis, such as coefficient signs and magnitudes that are implausible from an economic perspective.

Comparison with existing tests
As mentioned in the Introduction, tests for the consistency of the FE estimator that do not require IVs are rare. The two tests that come closest are those of Wooldridge (2010, pp. 324-325) and Su et al. (2016). These approaches also take the linear panel regression model in (1) as the starting point.

Wald test of Wooldridge (2010)
The test of Wooldridge (2010) is based on the auxiliary OLS regression

$$L_1 y_i = \gamma_i \iota_{T-1} + L_1 X_i \beta_0 + F_1 X_i \zeta_1 + L_1 \varepsilon_i \tag{15}$$

after taking the within transformation. With $I_{T-1}$ the identity matrix of order $T-1$, the matrices $L_1$ and $F_1$ are defined as the $(T-1) \times T$ block matrices

$$L_1 = \begin{pmatrix} I_{T-1} & 0 \end{pmatrix}, \qquad F_1 = \begin{pmatrix} 0 & I_{T-1} \end{pmatrix}. \tag{16}$$

The matrix $F_1$ in (15) ensures that the regression model contains the one-period-ahead lead values of the covariates as regressors. Because this leads to the loss of the last time period, we need the matrix $L_1$ to ensure that the other vectors also contain the right time observations. In more familiar notation, we would write (15) as $y_{it} = \gamma_i + x_{it}'\beta_0 + x_{i,t+1}'\zeta_1 + \varepsilon_{it}$. We use the above alternative notation to facilitate the comparison with our own 'differences' approach, as will become clear below.
The test takes the form of a standard panel-robust Wald test for the null hypothesis $\tilde H_0: \zeta_1 = 0$ against the alternative hypothesis $\tilde H_1: \zeta_1 \neq 0$. It is motivated by the fact that, under strict exogeneity, the FE estimator of $\zeta_1$ will have a zero probability limit, while the FE estimator of $\beta_0$ will converge in probability to $\beta_0$. Under the usual regularity conditions for panel data, the resulting test has nominal asymptotic size under $\tilde H_0$ and unit asymptotic power under $\tilde H_1$.

Sup-Wald test of Su et al. (2016)
The extension proposed by Su et al. (2016) is based on the idea that Wooldridge's approach of adding one-period-ahead lead values to the regression model is rather arbitrary. They overcome this by allowing for a wider range of leads and lags. More specifically, Su et al. (2016) consider auxiliary regressions of the form

$$L_\ell y_i = \gamma_i \iota_{T-|\ell|} + L_\ell X_i \beta_0 + F_\ell X_i \zeta_\ell + L_\ell \varepsilon_i, \tag{17}$$

where $L_\ell$ and $F_\ell$ are defined in analogy with (16). They estimate (17) for all $\ell \in S_T$ by means of OLS after applying the within transformation. Subsequently, they test the null hypothesis $\bar H_0: \zeta_\ell = 0$ for all $\ell \in S_T$ against the alternative hypothesis $\bar H_1: \zeta_\ell \neq 0$ for some $\ell \in S_T$ using a sup-Wald test.

The sup-Wald test statistic is obtained as follows. For each individual null hypothesis $\zeta_\ell = 0$, they calculate the corresponding individual Wald test statistic. Subsequently, the supremum is taken over the individual Wald test statistics, yielding the sup-Wald test statistic. The underlying idea is that the supremum of a range of test statistics will behave more like the most powerful among them. The critical values of the sup-Wald test statistic are determined by means of a panel bootstrap. Under standard panel data regularity conditions, the resulting sup-Wald test has nominal asymptotic size under $\bar H_0$ and unit asymptotic power under $\bar H_1$.

GMM framework
To facilitate the comparison with our own approach, it is convenient to draw the parallel with GMM overidentifying tests one more time. From Newey and West (1987) and Newey and McFadden (1994), we infer that the Wald test of Wooldridge (2010) is asymptotically equivalent to a GMM test based on the two-step GMM estimator of $\beta_0$ in

$$L_1 y_i = \gamma_i \iota_{T-1} + L_1 X_i \beta_0 + L_1 \varepsilon_i \tag{18}$$

with instruments $L_1 X_i$ and $F_1 X_i$, after applying the within transformation to both the regression equation and the instruments; i.e., after pre-multiplying both with the centering matrix of order $T-1$. Hence, Wooldridge's approach tests $2k$ moment conditions, namely

$$E\big(X_i' A^{*}_{T-1} \varepsilon_i\big) = 0, \qquad E\big(X_i' A^{\dagger}_{T-1} \varepsilon_i\big) = 0, \tag{19}$$

where

$$A^{*}_{T-1} = L_1' A_{T-1} L_1, \qquad A^{\dagger}_{T-1} = F_1' A_{T-1} L_1. \tag{20}$$

We apply Newey and West (1987) and Newey and McFadden (1994) one more time to infer that, for each $\ell \in S_T$, the individual Wald test of $\zeta_\ell = 0$ is equivalent to a GMM test of the $2k$ overidentifying moment conditions

$$E\big(X_i' A^{*}_{T-|\ell|} \varepsilon_i\big) = 0, \qquad E\big(X_i' A^{\dagger}_{T-|\ell|} \varepsilon_i\big) = 0, \tag{21}$$

where $A^{*}_{T-|\ell|}$ and $A^{\dagger}_{T-|\ell|}$ are defined in analogy with (20). For fixed $\ell$, the first moment condition in (21) comes close to the moment condition of the within estimator, $E(X_i' A_T \varepsilon_i) = 0$, but applies to the subsample that excludes $|\ell|$ years from the full sample. The second moment condition is similar to the first, but is formulated in terms of leads or lags of the regressors. The moment conditions in (21) are necessary, but not sufficient, for strict exogeneity.
We thus see that the Wald test of Wooldridge (2010) and the individual Wald tests constituting the sup-Wald test of Su et al. (2016) reduce to overidentifying tests, like our own Wald test. Both test conditions that are necessary, but not sufficient, for strict exogeneity. Because strict exogeneity is in turn a sufficient, but not necessary, condition for the consistency of the FE estimator, the conditions validated by these tests are neither necessary nor sufficient for the consistency of the FE estimator.

Comparison Wald and sup-Wald tests
Similarities Because the Wald test of Wooldridge (2010) and the individual Wald tests constituting the sup-Wald test of Su et al. (2016) also reduce to overidentifying tests, like our own Wald test, they will also have a parameter region with trivial power (Newey 1985;Hall 2005). This property will be illustrated in detail in Sect. 4.
Differences The main difference among the three tests is that each of them tests a different null hypothesis, which we have labeled as $H_0$, $\tilde H_0$ and $\bar H_0$, respectively. Because the three tests involve different parameter restrictions in different auxiliary regressions, it is not straightforward to compare them. Translating each test to a GMM framework facilitates the comparison of the three null hypotheses and makes clear that each of the three tests focuses on different moment conditions. According to Result 3, our Wald test looks at the moment conditions of the differences estimators, which arise as the 'unfolded' moment condition of the FE estimator. The underlying relation between the (probability limits of the) FE and differences estimators is specified in Corollary 1 and holds under both $H_0$ and $H_1$.
The test of Wooldridge (2010) considers the moment conditions in (19), while the sup-Wald test focuses on those in (21). Both tests focus on generic strict exogeneity conditions instead of being specifically related to the FE estimator's orthogonality conditions. For fixed $\ell$, the FE estimator of $\zeta_\ell$ in (17) has a zero probability limit under strict exogeneity, while the FE estimator of $\beta_0$ then converges in probability to $\beta_0$ (Su et al. 2016). In other cases, however, it is not known how the inconsistency of the FE estimator of $\beta_0$ (our parameter of interest) relates to the probability limit of the FE estimator of $\zeta_\ell$ (the auxiliary parameter to run the test).
To formalize the above considerations, let $\beta_{FE} = \operatorname{plim}_{n\to\infty} \hat\beta_{FE}$ and define $H_0^{FE}: \beta_{FE} = \beta_0$ and $H_1^{FE}: \beta_{FE} \neq \beta_0$. We will now view the Wald and sup-Wald tests as tests of $H_0^{FE}$ instead of $H_0$ and $\bar H_0$, which means that we will derive expressions for the rejection probabilities under $H_0^{FE}$ and $H_1^{FE}$. This will allow us to compare both tests' type I and type II errors if viewed as tests of $H_0^{FE}$.

We start with the Wald test and assume that $H_0^{FE}$ is true. In this scenario, there are two cases: (i) $H_0$ is true or (ii) $H_1$ is true. In case (i), the rejection probability under $H_0^{FE}$ converges to the nominal size of the Wald test according to Result 2. Case (ii) can occur because consistency of the differences estimators is a sufficient but not a necessary condition for the consistency of the FE estimator. In case (ii), the rejection probability has a unit asymptotic value, unless there is trivial power. In the latter case, the asymptotic value of the rejection probability is equal to the nominal size of the Wald test. In sum, the asymptotic size of the Wald test is at least nominal if we view the test as a test of $H_0^{FE}$.

At this point, we would ideally provide a specific example of case (ii), where some of the differences estimators are inconsistent while the FE estimator is still consistent. However, it is not straightforward to illustrate the Wald test's oversizedness in this way, because the process of constructing such an example quickly goes beyond what is still analytically tractable. Fortunately, we can characterize the conditions under which case (ii) will occur in general terms using the GMM analogy. If (14) holds true but (13) does not, then the FE estimator is consistent while some of the differences estimators are inconsistent.
Next we assume that $H_1^{FE}$ is true. Because $H_0$ is a sufficient condition for the consistency of the FE estimator, $H_1$ must then also hold true. Hence, the rejection probability under $H_1^{FE}$ is also a rejection probability under $H_1$, with a unit asymptotic value according to Result 2. In sum, the Wald test viewed as a test of $H_0^{FE}$ has unit asymptotic power under $H_1^{FE}$, unless we encounter a scenario of trivial power.

For the sup-Wald test, the derivation of the rejection probability under $H_0^{FE}$ is similar to the above and leads to the insight that the asymptotic value of this probability is larger than or equal to the nominal size of the sup-Wald test. Under $H_1^{FE}$, the situation is more complex than above, since either $\bar H_0$ or $\bar H_1$ may hold true. The reason for this is that the sup-Wald test validates conditions that are neither necessary nor sufficient for the consistency of the FE estimator. As a result, there are two possibilities for the asymptotic power of the sup-Wald test if seen as a test of $H_0^{FE}$. In case (i), it has a unit value under $H_1^{FE}$, which occurs if $\bar H_1$ holds true and no trivial power arises. In case (ii), it has a value equal to the nominal size of the sup-Wald test. This occurs if $\bar H_0$ holds true, or if $\bar H_1$ holds true in combination with trivial power. Here, too, it is hard to construct analytically tractable examples that illustrate both cases.
In sum, if we view the Wald and sup-Wald tests as tests of the consistency of the FE estimator, we conclude that both of them can be asymptotically oversized. Furthermore, the Wald test will have unit asymptotic power, unless we encounter a case of trivial power that is inherent in GMM testing. The sup-Wald test has one additional possibility for trivial asymptotic power, which arises because the specific moment conditions validated by this test are neither sufficient nor necessary for the consistency of the FE estimator.

Simulation study
Our simulation study focuses on the motivating examples listed in Table 1, for which the three tests' finite-sample performance is an empirical matter. Because the sup-Wald test of Su et al. (2016) generally turns out to be superior to the test of Wooldridge (2010), we confine the comparison to our Wald test and the sup-Wald test. For a detailed simulation study of the finite-sample differences between the tests of Wooldridge (2010) and Su et al. (2016), we refer to the latter study.

Simulation setup
We start with some explanation about the design of our simulation study. As said, our simulations focus on the motivating examples in Table 1. With the exception of two cases of trivial power, both our Wald test and the sup-Wald test turn out to be consistent in each of these examples. The consistency of our Wald test follows directly from the results in Table 1.
To determine whether the sup-Wald test is consistent in the motivating examples, we have to assess whether $\bar H_1$ holds true. However, even for fixed $\ell$, it turns out to be infeasible to obtain an analytical expression for the probability limit of the FE estimator of $\zeta_\ell$ under the alternative $\zeta_\ell \neq 0$. Calculations quickly grow complex because the constituent regressions of the sup-Wald test already contain two covariates in the simplest possible case. We therefore take a more practical view to determine whether the sup-Wald test is consistent in each of our motivating examples. Whenever our simulations yield unit empirical power for sufficiently large values of n, we view this as a reliable indicator that the sup-Wald test is consistent.
We set ρ = δ in two of our simulation experiments. Both our Wald test and the sup-Wald test turn out to have trivial power in these cases. For our Wald test, the trivial power follows immediately from the results in Table 1, which show that the β j s do not vary with j. For the sup-Wald test, the inconsistency follows from the low empirical power that persists across large values of n in our simulations.
The more technical details of our simulation design are as follows. We run simulations for three of the motivating examples considered in Table 1: classical measurement error ('ME'), omitted variables ('OV') and simultaneity ('S'). Throughout, we use 10,000 simulation runs for each simulation experiment to obtain the empirical rejection rates for our Wald test. We consider values n = 100, n = 500 and n = 1000, each with T = 5 and T = 10. We set the significance level for each test equal to 5%. For classical measurement error and omitted variables, we simulate two scenarios in terms of the persistence parameters: (δ, ρ) = (0.3, 0.9) and (δ, ρ) = (0.3, 0.6). For simultaneity, we consider the cases ρ = 0.6 and ρ = 0.9. For each simulation experiment, we report the empirical size and power for both tests. The reported empirical power has not been size adjusted.

Empirical power and size
The right-hand side of Table 3 reports the empirical power for our Wald test, with the full set of parameters for each of the three models as specified in the table notes. These notes also report the probability limits of the underlying models' R², as well as the reliability and noise-to-signal ratios for the models with classical measurement error. The simulation results show that the Wald test's finite-sample power can turn out relatively low for smaller values of n and T. This is especially the case if the distance between ρ and δ is relatively small.

Notes to Tables 2 and 3: All simulation results are based on 10,000 simulation runs. The empirical power is the fraction of simulation runs in which the test rejects the null hypothesis when the null is false, while the empirical size is the fraction of simulation runs in which the test rejects the null hypothesis when the null is true. For the Wald test, the rejection rates for the bootstrap-based version of the test are in parentheses (1,000 bootstrap runs). Throughout, the significance level is 5%. The simulated models correspond to three of the illustrative cases listed in Table 1 of the main text. Parameters for classical measurement error ('ME'): β = 1, σ²_ε = 1, σ²_θ = 1.44, σ²_η = 0.64, δ = 0.3. This yields reliabilities of 0.92 (ρ = 0.9) and 0.76 (ρ = 0.6), noise-to-signal ratios of 0.09 (ρ = 0.9) and 0.312 (ρ = 0.6) and probability limits for the model's R² of 0.88 (ρ = 0.9) and 0.69 (ρ = 0.6). Parameters for omitted variables ('OV'): β = 1, γ = 1, σ²_ε = 0.25, σ²_θ = 0.36, σ²_η = 0.36, ρ_θη = −0.6, δ = 0.3. This yields probability limits for the model's R² of 0.89 (ρ = 0.9) and 0.74 (ρ = 0.6). Parameters for simultaneity ('S'): β = 1, α = 2, σ²_ε = 4, σ²_θ = 1. This yields probability limits for the model's R² of 0.84 (ρ = 0.9) and 0.81 (ρ = 0.6).
We also run simulations for these three models in the absence of any measurement error, omitted variables or simultaneity, yielding the empirical size of the Wald test. The results in the right-hand side of Table 2 show that the size of the Wald test is substantially above nominal for n = 100 and T = 10. In the other cases, the rejection rates are fairly close to nominal. For the large-n and fixed-T asymptotics to apply, we must have n >> T. For n = 100 and T = 10, the ratio n/T might be too small. We therefore try to improve the finite-sample results by resorting to a bootstrap-based test statistic. Indeed, the above-nominal rejection rates for n = 100 and T = 10 can be circumvented by using a panel wild bootstrap to estimate the covariance matrix. The bootstrap-based covariance matrix then replaces the formula-based covariance matrix in the Wald test statistic. 7 Table 2 reports the bootstrap-based rejection rates in parentheses, which are close to nominal. The use of the bootstrap only turned out to be necessary for n = 100 and T = 10, but from a robustness perspective one may consider using the bootstrap regardless of the panel dimensions.
We consider the same simulation experiments for the sup-Wald strict exogeneity test of Su et al. (2016) and report the outcomes in the left-hand sides of Tables 2 and 3. This test's empirical size is also close to nominal. In terms of empirical power, the simulation results make clear that there are two situations: one in which the sup-Wald test's empirical power is relatively low in comparison to our Wald test and one in which both tests perform similarly. Especially for n = 100, our Wald test tends to have relatively high empirical power. For the ME model with ρ = 0.6, this pattern also persists for larger values of n. However, there is one exception and that is the model with an omitted variable. Here we again identify the above two situations, but with the roles of the two tests reversed: one in which the sup-Wald test's empirical power is relatively high in comparison to our Wald test and one in which both tests perform similarly. We explain the good performance of the sup-Wald test in the presence of an omitted variable by the fact that the sup-Wald test's auxiliary regressions are close to the true panel regression model, because the observed covariate's lags and leads are correlated with the omitted variable. Hence, the sup-Wald test performs very well in this specific case of misspecification. As we will see below, however, this result does not hold true for all parameter values.
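The idea of replacing the formula-based covariance matrix by a bootstrap-based one can be sketched in a deliberately simplified setting. The code below is not the test as implemented in this paper. It contrasts only two differences estimators (time spans `j_short` and `j_long`) for a single regressor, so the Wald statistic is a scalar, and it resamples whole individuals with replacement (a pairs or cluster bootstrap) rather than using the panel wild bootstrap. All function names are hypothetical.

```python
import random
import statistics

CHI2_1_CRIT_5PCT = 3.841  # 95% quantile of the chi-square(1) distribution


def unit_moments(y, x, j):
    """Per-individual sums of dx*dy and dx*dx for differences over span j."""
    moments = []
    for y_i, x_i in zip(y, x):
        num = den = 0.0
        for t in range(j, len(y_i)):
            dx = x_i[t] - x_i[t - j]
            num += dx * (y_i[t] - y_i[t - j])
            den += dx * dx
        moments.append((num, den))
    return moments


def slope(moments, units):
    """Differences estimator computed from the selected individuals."""
    num = sum(moments[i][0] for i in units)
    den = sum(moments[i][1] for i in units)
    return num / den


def bootstrap_wald_contrast(y, x, j_short, j_long, n_boot, rng):
    """Wald statistic for equal slopes at two differencing spans."""
    n = len(y)
    m_short = unit_moments(y, x, j_short)
    m_long = unit_moments(y, x, j_long)
    contrast = slope(m_short, range(n)) - slope(m_long, range(n))
    draws = []
    for _ in range(n_boot):
        # Resample whole individuals so within-unit dependence is preserved.
        units = [rng.randrange(n) for _ in range(n)]
        draws.append(slope(m_short, units) - slope(m_long, units))
    # The bootstrap variance of the contrast replaces a formula-based variance.
    return contrast ** 2 / statistics.variance(draws)
```

Here `y` and `x` are lists with one list of T observations per individual. In this scalar version, the statistic is compared with `CHI2_1_CRIT_5PCT` for a test at the 5% level. A joint test over all adjacent time spans would instead require the full bootstrap covariance matrix of the stacked estimators.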

Additional simulations
We consider five additional sets of simulations. First, we simulate the three motivating examples (measurement error, omitted variable and simultaneity) with n and T equal to the values that we will later encounter in our empirical applications. This gives rise to simulations (i) and (ii). In (i), we take n = 51 and T = 20, leading to a relatively small value of n/T instead of n >> T. In (ii), we set n = 737 and T = 4. In (iii), we extend the basic ME model with n = 51 and n = 100 so as to include an additional error-free regressor. In (iv) and (v), we reconsider the basic ME and omitted variable models, respectively, with a parameter setting that results in trivial power for our Wald test according to Table 1 (δ = ρ = 0.6). Table 4 reports the empirical power and size for both tests in each of the five cases. For (i) and (iii), the table reports bootstrap-based rejection rates for our Wald test.
In (i), we observe that the sup-Wald test has higher empirical power in the omitted variables case, while the two tests perform comparably in terms of empirical power in the presence of simultaneity. In the case of ME, our Wald test's empirical power is slightly better. In (ii), the sup-Wald test outperforms our test in terms of empirical power only in the omitted variable case. In the other two cases, our test outperforms. In (iii), the sup-Wald test outperforms in terms of empirical power. In (iv) and (v), we observe that both tests have trivial power. For our test, this outcome was expected on the basis of Table 1. Apparently, the sup-Wald test also needs a difference in persistence between the unobserved regressor and the measurement error in (iv) and between the observed and omitted regressor in (v) in order to have non-trivial power. 8
Notes to Table 4: All simulation results are based on 10,000 simulation runs. The empirical power is the fraction of simulation runs in which the test rejects the null hypothesis when the null hypothesis is false; the empirical size is the fraction of simulation runs in which the test rejects the null hypothesis when the null hypothesis is true. Throughout, the significance level is 5%. The simulated models are the same as for Tables 2 and 3, with ρ = 0.6 and δ = 0.3 in (i), (ii) and (iii) and with ρ = δ = 0.6 in (iv) and (v). Furthermore, in (iii) there is an additional i.i.d. normally distributed error-free covariate included in the ME model, with coefficient 1, mean 0 and variance 4. Its correlation with θ is 0.75. The additional covariate is uncorrelated with the remaining model variables.

Cautionary remark
Because it is infeasible to do analytical calculations for the sup-Wald test and the constituent individual Wald tests, we do not provide an example that illustrates that the sup-Wald test indeed tests a different null hypothesis than our Wald test. For the same reason, our simulation study considers misspecification that renders the two tests both consistent or both inconsistent. Although these simulations prove insightful, we emphasize that in more general cases of misspecification the two tests will typically test different null hypotheses. The null hypothesis of our Wald test is a sufficient condition for the FE estimator's consistency, while the null hypothesis of the sup-Wald test is neither sufficient nor necessary for the consistency of the FE estimator. The finding of our simulation study that our Wald test does not always have higher empirical power than the sup-Wald test must be viewed in this context.

Empirical applications
This section considers two existing panel data sets from the literature, which each contain an explanatory variable that was suspected to be subject to measurement error in the studies that introduced these data. In the context of our theoretical results, these data sets provide a particularly relevant empirical case for our test. Our goal is to investigate the consistency of the FE estimator using our Wald test. For the sake of comparison, we will also report results for the sup-Wald test.

Birth rates and welfare
Economic theory suggests that a government transfer program that reduces the cost of supporting a child should lead to a rise in birth rates. As pointed out by McKinnish (2008), childbearing is a commitment to current and future consumption. We may therefore expect fertility decisions to be relatively unresponsive to transitory fluctuations in welfare benefits. This would imply that welfare benefits are erroneous relative to the conceptual variable of interest, even though these benefits are generally reported without error in administrative records. As explained by Griliches and Hausman (1986), this kind of 'conceptual' measurement error is isomorphic to the errors-in-variables model with measurement error that is less persistent than the unobserved regressor and would render the FE estimator inconsistent due to endogeneity of the observed regressor.
McKinnish (2008) aims to provide an empirical investigation of the presence of such conceptual measurement error in welfare benefits. She uses a panel data set consisting of US state-level birth rates by white women in the age group 20-24.5 years and AFDC benefit levels for a family of four with no additional income. The panel data set with n = 51 and T = 20 covers the 1973-1992 period. The data set also contains a measure of the earnings per capita in each state. Both welfare benefits and earnings per capita are deflated and expressed in prices of the base year 1982-84.
We consider the linear panel regression model specified as
$$ y_{it} = \alpha_i + \delta_t + \beta_w w_{it} + \beta_e e_{it} + \varepsilon_{it}, \qquad (22) $$
where $y_{it}$ denotes the birth rate in state $i$ in year $t$, $\alpha_i$ a state fixed effect, $\delta_t$ a year fixed effect, $w_{it}$ the welfare benefit (i.e., the allegedly error-ridden regressor), $e_{it}$ the earnings per capita, and $\varepsilon_{it}$ the error term. McKinnish (2008) estimates the linear panel regression model in (22) using data that are differenced over a time span of j = 1, 3, 5, 7 years. We denote the resulting coefficient estimates of the welfare benefit by $\beta_{w,j}$. McKinnish (2008) compares the $\beta_{w,j}$ for different values of j. In this way, she proceeds in a similar fashion as Goolsbee (2000). McKinnish (2008) establishes a monotonically increasing pattern in the $\beta_{w,j}$, which she attributes to the presence of conceptual measurement error.
We estimate the linear panel regression model in (22) using data that are differenced over a time span of j = 1, . . . , 10 years. The estimation results are summarized in the upper panel of Table 5. This table also reports the results based on the FE estimator. At the 5% significance level, our bootstrap-based Wald test rejects the null hypothesis $H_0$: $\beta_{e,j} = \beta_{e,j+1}$ and $\beta_{w,j} = \beta_{w,j+1}$ for j = 1, . . . , 9; see Table 6. Although we do not have a case of large n here, we show the differences curves in Fig. 1a and b for the welfare benefit and the earnings variables anyhow, for the sake of illustration. The differences curves confirm the economic relevance of the rejection. 9 In sum, our test outcome substantiates the doubts of McKinnish (2008) about the consistency of the FE estimator of (22). As explained in Sect. 3.4.4, if the Wald test rejects, it is still possible that the FE estimator is consistent. Because of the additional evidence against the FE estimator's consistency provided by McKinnish (2008), we consider that possibility unlikely here. Although our test results are consistent with the presence of conceptual measurement error in the welfare benefit variable, the source of the inconsistency (measurement error or something else) remains an open question. For example, the data used by McKinnish (2008) are aggregated across different cohorts and states that may respond differently to changes in welfare over time, which may also render the FE estimator inconsistent.
We note that the sup-Wald test rejects its null hypothesis H 0 at the 5% level; see Table 6. As explained in Sect. 3.4.4, this test focuses on generic strict exogeneity conditions instead of being specifically related to the FE estimator's orthogonality conditions. We therefore view the rejection as a sign that other moment conditions related to strict exogeneity do not hold either. We refer to McKinnish (2008) for additional estimations that exploit less stringent moment conditions.

Investments and Tobin's q
Erickson and Whited (2000) analyze the impact of Tobin's q on the investment rate, with Tobin's q the ratio of the market valuation of a firm's capital stock to its replacement value. The theoretical motivation for studying this relation is the standard model of a perfectly competitive firm. This model is based on the maximization of net shareholder wealth, in the presence of convex adjustment costs following changes in the capital stock (e.g., Blundell et al. 1992). According to this model, Tobin's q has a positive effect on the investment rate. An empirical complication is the measurement error problem associated with Tobin's q. This problem arises due to the difference between marginal q, the conceptual variable of interest, and measured q as defined above. Erickson and Whited (2000) discuss the possible sources of measurement error in measured q and propose an estimator that controls for such error by exploiting higher-order moments. Their empirical analysis is based on a Compustat firm-level panel data set for the 1992-1995 period, with n = 737 and T = 4.
We consider the linear panel regression model given by
$$ y_{it} = \alpha_i + \delta_t + \beta_q q_{it} + \beta_c c_{it} + \beta_{fc} f_i c_{it} + \varepsilon_{it}, \qquad (23) $$
where $y_{it}$ denotes the ratio of investments to the replacement value of the capital stock for firm $i$ in year $t$, $\alpha_i$ a firm fixed effect, $\delta_t$ a year fixed effect, $q_{it}$ the proxy of marginal Tobin's q, $c_{it}$ cash flow divided by the replacement value of the capital stock, $f_i$ a 0-1 variable indicating whether a firm is financially constrained or not, and $\varepsilon_{it}$ the error term. The indicator variable $f_i$ is constructed on the basis of a firm's lack of bond rating and does not vary over time; its own marginal effect is therefore contained in the fixed effect $\alpha_i$. In the presence of measurement error in q, the FE estimator of (23) will typically be inconsistent due to a lack of strict exogeneity of the proxy of marginal q. We estimate the linear panel regression model in (23) after differencing over a time span of j = 1, 2, 3 years. 10 Detailed estimation results are given in the lower panel of Table 5. This table also reports the estimation results based on the FE estimator. At the 5% significance level, our Wald test fails to reject the null hypothesis $H_0$: $\beta_{q,j} = \beta_{q,j+1}$, $\beta_{c,j} = \beta_{c,j+1}$ and $\beta_{fc,j} = \beta_{fc,j+1}$ for j = 1, 2; see again Table 6.
We note that the sup-Wald test rejects its null hypothesis H 0 at the 5% level, as shown in Table 6. As noted in Sect. 3.4.4, it remains unclear what the rejection of this null hypothesis means for the inconsistency of the FE estimator.
We conclude that our Wald test finds no evidence against the consistency of the FE estimator. As mentioned in Sect. 3.3, we should remain aware of the possibility that the test may have low power in certain cases. Low power could also arise from limited data variability due to taking differences, yielding coefficient estimates with relatively large standard errors. In such a scenario, our test could fail to reject in the presence of misspecification. This explanation does not seem very likely in the present case, though. The strong significance of the estimated coefficients in the lower panel of Table 5 suggests that the time-differenced data still contain a sufficient amount of variation. Another possibility is that the inconsistencies in the differences estimators do not depend on j.
Given these considerations and the rejection by the sup-Wald test, it remains important to look for other evidence against the FE estimator's consistency, such as coefficient signs that are unlikely from an economic perspective. Here, we find the coefficient signs that we would expect on the basis of economic theory: Tobin's q and the cash flow variable both have a positive effect on the expected investment rate, which is smaller if firms are financially constrained. In sum, these additional investigations also do not find evidence against the consistency of the FE estimator.

Conclusion
The FE estimator is widely used to estimate the linear panel regression model. Under large-n and fixed-T panel data asymptotics, we have developed a test to validate a sufficient condition for the FE estimator's consistency using a stacked regression framework. Our test takes the familiar form of a panel-robust Wald test. Because our test is asymptotically equivalent to a specific GMM test for overidentifying restrictions, our approach also fits in the familiar setting of GMM estimation and specification testing.
We have shown that our Wald test will generally test a different null hypothesis than the strict exogeneity test of Wooldridge (2010) and the extension proposed by Su et al. (2016). Our Wald test is specifically tailored for testing a sufficient condition for the FE estimator's consistency. The other two tests, by contrast, consider more generic strict exogeneity conditions that are neither sufficient nor necessary for the FE estimator's consistency.
The Wald test's finite-sample properties have been investigated in a simulation study, where we continued the comparison with the strict exogeneity test of Su et al. (2016). Our Wald test has been shown to possess good finite-sample properties, especially if the estimator of the covariance matrix is based on a panel bootstrap. We have also illustrated the test in two applications to existing studies from the literature.
If our test rejects, it is still possible that the FE estimator is consistent. Further testing, as discussed below, would be needed to exclude this possibility. If our test does not reject, there is no evidence against the FE estimator's consistency. Although this is the most favorable outcome, researchers should still be aware of the possibility that the test may have low power in certain cases. It therefore remains important to look for other evidence against the FE estimator, such as coefficient signs and magnitudes that are unlikely from an economic perspective. Researchers should also recognize the possibility that low power could arise from limited data variability due to taking differences, yielding coefficient estimates with relatively large standard errors. Hence, although our test does not require IVs, it should be used in combination with additional analysis.
As usual, finding a well-specified model remains to a large extent a case-by-case puzzle without guaranteed success, depending on, e.g., prior information and the availability of valid and strong instruments. As a general guideline, it is nevertheless useful to consider the existing literature on the selection of moment conditions. Here a distinction is made between (i) separating valid moment conditions from invalid ones and (ii) the elimination of redundant conditions; i.e., conditions that do not contribute to a reduction in the GMM estimator's variance (Okui 2009). Various consistent selection procedures have been proposed, including methods that add a penalty term to the usual J -statistic for overidentification (Andrews 1999). Ideally, a fully integrated selection procedure for moment conditions should include our Wald test's moment conditions, as well as the more generic exogeneity conditions underlying the sup-Wald test of Su et al. (2016). We leave the development of such an integrated approach as a topic for future research.
A final topic for further research relates to the panel dimensions. The asymptotic distribution of our test statistic has been derived under large-n and fixed-T asymptotics, making the test suitable when n >> T. This was the format in the classical panel data literature, but there has been increasing attention to panel data where n and T are of a different relative size, requiring different asymptotics. A first step would be to investigate the asymptotic behavior of our Wald test statistic for n fixed and T → ∞, or for n → ∞ and T → ∞ jointly in some way.

Declarations
Conflicts of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

B. Motivating examples: calculations
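This appendix makes use of a few elementary properties of stationary AR(1) processes, which we summarize here for completeness. Assume that $x_{it}$ and $z_{it}$ are generated by stationary AR(1) processes, such that
$$ x_{it} = \rho x_{i,t-1} + \theta_{it}, \qquad z_{it} = \delta z_{i,t-1} + \eta_{it}. $$
We assume that $\mathbb{E}(\theta_{it}) = \mathbb{E}(\eta_{it}) = 0$, $\mathbb{E}(\theta_{it}^2) = \sigma^2_\theta$ and $\mathbb{E}(\eta_{it}^2) = \sigma^2_\eta$ for all $i$ and $t$. We also assume that $\mathrm{Cov}(\theta_{mt}, \eta_{is}) = 0$ for $m \neq i$, $\mathrm{Cov}(\theta_{is}, \eta_{it}) = 0$ for $s \neq t$, and $\mathrm{Cov}(\theta_{it}, \eta_{it}) = \sigma_{\theta\eta}$. Lastly, we assume that $\mathrm{Cov}(\theta_{mt}, \varepsilon_{is}) = \mathrm{Cov}(\eta_{mt}, \varepsilon_{is}) = 0$ for all $m, i, s, t$.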
For k ≥ 1, we can write By letting k → ∞, we find Using these alternative formulations for x it and z it , we find for j ≥ 0, We also have Similarly, we find
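As a sketch of the properties referred to here, assume that $x_{it} = \rho x_{i,t-1} + \theta_{it}$ and $z_{it} = \delta z_{i,t-1} + \eta_{it}$ with $|\rho| < 1$, $|\delta| < 1$ and serially uncorrelated innovations. The assignment of ρ to x and δ to z, and the arrangement of the displays, are assumptions made for this illustration. Repeated substitution and the moment assumptions stated above then give the following standard results.

```latex
\begin{align*}
x_{it} &= \rho^{k} x_{i,t-k} + \sum_{l=0}^{k-1} \rho^{l}\,\theta_{i,t-l}, \qquad k \ge 1,\\
x_{it} &= \sum_{l=0}^{\infty} \rho^{l}\,\theta_{i,t-l} \qquad (k \to \infty),\\
\operatorname{Cov}(x_{it}, x_{i,t-j}) &= \frac{\rho^{j}\sigma_\theta^{2}}{1-\rho^{2}}, \qquad
\operatorname{Cov}(z_{it}, z_{i,t-j}) = \frac{\delta^{j}\sigma_\eta^{2}}{1-\delta^{2}}, \qquad j \ge 0,\\
\operatorname{Cov}(x_{it}, z_{i,t-j}) &= \frac{\rho^{j}\sigma_{\theta\eta}}{1-\rho\delta}, \qquad
\operatorname{Cov}(x_{i,t-j}, z_{it}) = \frac{\delta^{j}\sigma_{\theta\eta}}{1-\rho\delta}, \qquad j \ge 0.
\end{align*}
```

The analogous expansion holds for $z_{it}$ with δ and $\eta_{it}$ in place of ρ and $\theta_{it}$. The cross-covariances follow because only innovations dated at the same time are correlated.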

B.1. (Non-)Classical measurement error
We start with the errors-in-variables model and allow for non-classical measurement error, with classical measurement error as a special case. We will derive the inconsistency in both cases.
Model Consider the linear panel regression model with measurement error, given by
$$ y_{it} = \alpha_i + \beta \xi_{it} + \varepsilon_{it}, \qquad x_{it} = \xi_{it} + v_{it}, $$
where $i = 1, \ldots, n$ and $t = 1, \ldots, T$. We assume that $(\varepsilon_{it})$ is i.i.d. with $\mathbb{E}(\varepsilon_{it}) = 0$ and $\mathbb{E}(\varepsilon_{it}^2) = \sigma^2_\varepsilon$ for all $i$ and $t$. Regarding $(\xi_{it})$ and $(v_{it})$, we assume that they are generated by stationary AR(1) processes, such that
$$ \xi_{it} = \rho \xi_{i,t-1} + \theta_{it}, \qquad v_{it} = \delta v_{i,t-1} + \eta_{it}. $$
We assume that $\mathbb{E}(\theta_{it}) = \mathbb{E}(\eta_{it}) = 0$, $\mathbb{E}(\theta_{it}^2) = \sigma^2_\theta$ and $\mathbb{E}(\eta_{it}^2) = \sigma^2_\eta$ for all $i$ and $t$. Furthermore, we assume that $\mathrm{Cov}(\theta_{mt}, \eta_{is}) = 0$ for $m \neq i$, $\mathrm{Cov}(\theta_{is}, \eta_{it}) = 0$ for $s \neq t$, $\mathrm{Cov}(\theta_{it}, \eta_{it}) = \sigma_{\theta\eta}$ and $\mathrm{Cov}(\theta_{mt}, \varepsilon_{is}) = 0$ for all $m, i, s, t$. Lastly, we assume that $\mathrm{Cov}(\varepsilon_{mt}, \eta_{is}) = 0$ for $m \neq i$ and $\mathrm{Cov}(\varepsilon_{is}, \eta_{it}) = 0$ for all $s, t$. If $\sigma_{\theta\eta} \neq 0$, we have a form of non-classical measurement error.
Inconsistency We first show that the FE estimator will usually be inconsistent. Let $\Sigma_v$ contain the covariances $\mathrm{Cov}(v_{is}, v_{it})$ and $\Sigma_{\xi v}$ the covariances $\mathrm{Cov}(\xi_{is}, v_{it})$. The inconsistency will typically be nonzero if at least $\Sigma_v \neq 0$.
We now turn to the estimators $\beta_j$ that are obtained after taking differences over time span $j$. Under the given assumptions, it is readily seen that the inconsistency's magnitude decreases with $j$ if and only if δ < ρ. For δ > ρ, the magnitude of the inconsistency is increasing in $j$, and for δ = ρ the inconsistency does not depend on $j$. For both classical and non-classical measurement error, the inconsistency does not vanish for larger values of $j$.
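The dependence of the inconsistency on $j$ can be sketched as follows. The sketch assumes the model $y_{it} = \alpha_i + \beta\xi_{it} + \varepsilon_{it}$, $x_{it} = \xi_{it} + v_{it}$, with AR(1) coefficient ρ and innovation variance $\sigma^2_\theta$ for $\xi_{it}$ and AR(1) coefficient δ and innovation variance $\sigma^2_\eta$ for $v_{it}$. The notation $\Delta_j a_{it} = a_{it} - a_{i,t-j}$ is introduced here for illustration and need not match the original displays.

```latex
\begin{align*}
\operatorname*{plim}_{n\to\infty} \beta_j - \beta
 &= -\beta\,
 \frac{\operatorname{Cov}(\Delta_j \xi_{it}, \Delta_j v_{it}) + \operatorname{Var}(\Delta_j v_{it})}
      {\operatorname{Var}(\Delta_j \xi_{it}) + 2\operatorname{Cov}(\Delta_j \xi_{it}, \Delta_j v_{it}) + \operatorname{Var}(\Delta_j v_{it})},\\[4pt]
\operatorname{Var}(\Delta_j \xi_{it}) &= \frac{2\sigma_\theta^{2}(1-\rho^{j})}{1-\rho^{2}}, \qquad
\operatorname{Var}(\Delta_j v_{it}) = \frac{2\sigma_\eta^{2}(1-\delta^{j})}{1-\delta^{2}},\\[4pt]
\operatorname{Cov}(\Delta_j \xi_{it}, \Delta_j v_{it}) &= \frac{\sigma_{\theta\eta}\,(2-\rho^{j}-\delta^{j})}{1-\rho\delta},\\[4pt]
\text{classical case } (\sigma_{\theta\eta} = 0):\quad
\operatorname*{plim}_{n\to\infty} \beta_j - \beta
 &= -\beta \left[\, 1 + \frac{\sigma_\theta^{2}(1-\delta^{2})}{\sigma_\eta^{2}(1-\rho^{2})}
    \cdot \frac{1-\rho^{j}}{1-\delta^{j}} \,\right]^{-1}.
\end{align*}
```

In the classical case with $0 < \delta, \rho < 1$, the inconsistency depends on $j$ only through the ratio $(1-\rho^{j})/(1-\delta^{j})$. This ratio tends to one as $j$ grows, so the inconsistency approaches $-\beta\sigma^2_v/(\sigma^2_\xi + \sigma^2_v)$ rather than zero, where $\sigma^2_\xi$ and $\sigma^2_v$ denote the stationary variances of $\xi_{it}$ and $v_{it}$.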
We estimate the omitted variable regression (B.23) and are interested in the probability limit of $\beta_j$, the estimator of $\beta$ based on the model after taking differences over time span $j$.
Inconsistency We first show that the FE estimator for $\beta$ will usually be inconsistent. Using similar matrix notation as before, we find that the inconsistency will be nonzero for $\gamma \neq 0$ and $\Sigma_{zx} \neq 0$.
We now turn to the estimators $\beta_j$ that are obtained after taking differences over time span $j$. As a sanity check, we notice that the inconsistency is zero for $\sigma_{\theta\eta} = 0$. The inconsistency should be zero in this particular case, because $\sigma_{\theta\eta} = 0$ implies that $x_{it}$ and $z_{it}$ are uncorrelated. Under the given assumptions, it is readily seen that $\operatorname{plim}_{n\to\infty} |\beta_j - \beta| > \operatorname{plim}_{n\to\infty} |\beta_{j+1} - \beta|$ if and only if δ < ρ. The inconsistency's magnitude is increasing in $j$ for δ > ρ, and for δ = ρ the inconsistency does not depend on $j$. We note that the inconsistency does not vanish for larger values of $j$.
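These monotonicity statements can be checked numerically. The sketch below assumes that the omitted variable model has the form $y_{it} = \alpha_i + \beta x_{it} + \gamma z_{it} + \varepsilon_{it}$ with $z_{it}$ omitted, $x_{it}$ an AR(1) process with coefficient ρ and $z_{it}$ an AR(1) process with coefficient δ. Under that assumption the inconsistency of the estimator in jth differences equals γ times the covariance of the differenced regressors divided by the variance of the differenced observed regressor. The function name is hypothetical, and the example values are taken from the OV parameter setting in the table notes, where $\sigma_{\theta\eta} = -0.6 \cdot 0.6 \cdot 0.6 = -0.216$.

```python
def ov_inconsistency(j, gamma, rho, delta, s2_theta, s_theta_eta):
    """plim(beta_j) - beta for the assumed omitted variable model."""
    # Covariance between the j-th differences of x and the omitted z.
    cov_dx_dz = s_theta_eta * (2.0 - rho ** j - delta ** j) / (1.0 - rho * delta)
    # Variance of the j-th difference of the observed regressor x.
    var_dx = 2.0 * s2_theta * (1.0 - rho ** j) / (1.0 - rho ** 2)
    return gamma * cov_dx_dz / var_dx


for j in range(1, 6):
    print(j, ov_inconsistency(j, 1.0, 0.6, 0.3, 0.36, -0.216))
```

With δ < ρ the printed magnitudes shrink as j grows but do not approach zero. Setting δ = ρ makes them constant in j, and setting the innovation covariance to zero makes them vanish.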

B.3. Simultaneity
The third source of endogeneity that we consider is simultaneity.
Model We consider the simultaneous equations model given by the structural equations
$$ y_{it} = \beta_i + \beta x_{it} + \varepsilon_{it}, \qquad (B.32) $$
$$ x_{it} = \alpha_i + \alpha y_{it} + u_{it}. \qquad (B.33) $$
We assume that $(\varepsilon_{it})$ is i.i.d. with $\mathbb{E}(\varepsilon_{it}) = 0$ and $\mathrm{Var}(\varepsilon_{it}) = \sigma^2_\varepsilon$, independent of $(u_{it})$. Here $(u_{it})$ is a stationary AR(1) process defined by
$$ u_{it} = \rho u_{i,t-1} + \theta_{it}, $$
with $\mathbb{E}(\theta_{it}) = 0$, $\mathbb{E}(\theta_{it}^2) = \sigma^2_\theta$ and $\mathrm{Cov}(\theta_{mt}, \varepsilon_{is}) = 0$ for all $m, i, t, s$. Solving the two equations for $y_{it}$ and $x_{it}$ yields the reduced forms. We estimate (B.32) in jth differences, thereby ignoring (B.33). We are interested in the probability limit of $\beta_j$, the estimator of $\beta$ based on the model in jth differences. We want to know how the inconsistency depends on $j$.
This gives the inconsistency .
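As a sketch under the stated assumptions, the reduced forms and the resulting inconsistency can be worked out as follows. The sketch additionally assumes $\alpha\beta \neq 1$ and that $u_{it}$ has AR(1) coefficient ρ with innovation $\theta_{it}$. The notation $\Delta_j a_{it} = a_{it} - a_{i,t-j}$ is introduced for illustration and need not match the original displays.

```latex
\begin{align*}
y_{it} &= \frac{\beta_i + \beta\alpha_i + \beta u_{it} + \varepsilon_{it}}{1-\alpha\beta}, \qquad
x_{it} = \frac{\alpha_i + \alpha\beta_i + \alpha\varepsilon_{it} + u_{it}}{1-\alpha\beta},\\[4pt]
\Delta_j x_{it} &= \frac{\alpha\,\Delta_j\varepsilon_{it} + \Delta_j u_{it}}{1-\alpha\beta}, \qquad
\operatorname{Var}(\Delta_j \varepsilon_{it}) = 2\sigma_\varepsilon^{2}, \qquad
\operatorname{Var}(\Delta_j u_{it}) = \frac{2\sigma_\theta^{2}(1-\rho^{j})}{1-\rho^{2}},\\[4pt]
\operatorname*{plim}_{n\to\infty} \beta_j - \beta
 &= \frac{\operatorname{Cov}(\Delta_j x_{it}, \Delta_j \varepsilon_{it})}{\operatorname{Var}(\Delta_j x_{it})}
 = \frac{\alpha(1-\alpha\beta)\,\sigma_\varepsilon^{2}}
        {\alpha^{2}\sigma_\varepsilon^{2} + \sigma_\theta^{2}(1-\rho^{j})/(1-\rho^{2})}.
\end{align*}
```

For $0 < \rho < 1$ the term $1-\rho^{j}$ increases with $j$, so the magnitude of this expression decreases with $j$ without vanishing. For ρ = 0 it does not depend on $j$.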