Alternative estimation approaches for the factor augmented panel data model with small T

In this paper we compare alternative estimation approaches for factor augmented panel data models. Our focus lies on panel data sets where the number of panel groups (N) is large relative to the number of time periods (T ). The Principal Component (PC) and Common Correlated Effects (CCE) estimators were originally developed for panel data with large N and T , whereas the GMM approaches of Ahn et al. (2013) and Robertson and Sarafidis (2015) assumes that T is small (that is T is fixed in the asymptotic analysis). Our comparison of existing methods addresses three different issues. First, we analyze the possibility of an inappropriate normalization of the factor space (the so-called normalization failure). In particular we propose a variant of the CCE estimator that avoids the normalization failure by adapting a weighting scheme inspired by the analysis of Mundlak (1978). Second, we demonstrate how the design of the Monte Carlo simulations favors some estimators, which explains the conflicting findings from existing Monte Carlo experiments. Third, we analyze the effects of estimating versus fixing the number of factors in advance. JEL classification: C23, C38


Introduction
The seminal work of Holtz-Eakin et al. (1988) has provided two important contributions to the statistical analysis of panel data. First, it proposes a GMM framework for estimating dynamic panel data models that were further developed and popularized by Arellano and Bond (1991). This approach has become standard in the dynamic analysis of panel data. The second contribution, the introduction of time varying individual effects, was less influential and went largely unnoticed for many years. For example, the excellent monograph of Baltagi (2005) -as all other textbooks on panel data analysis of the early 2000s -does not consider time varying individual effects or any other factor structure. Bai (2009) pointed out that time varying individual effects are just a special case of a factor structure and provided a general framework for estimating a panel data model with "interactive fixed effects", which is also referred to as the factor-augmented panel data model.
With the work of Ahn et al. (2001, Pesaran (2006), and Bai (2009) the interest in models that account for time varying heterogeneity and cross-section correlation surged considerably and the 25th International Conference on Panel Data in Vilnius 2019 included a large number of papers dealing with factoraugmented panel data models. In empirical practice, the Common Correlated Effects (CCE) approach proposed by Pesaran (2006) has recently become very popular among empirical researchers. This is due to the fact that this estimator is easy to understand and implement, a STATA routine (xtmg) and a Gretl add-on (xtcsd) is available and it performs well in Monte Carlo studies. It is however not clear, whether the CCE approach is similarly attractive in empirical applications where the number of time periods T is small (say 5 -15).  and Robertson and Sarafidis (2015) proposed a GMM approach that is shown to be consistent for finite T , whereas the CCE and the Principal Component (PC) estimator were developed for samples with large T and N . Su and Jin (2012) and Westerlund et al. (2019) showed that the CCE approach is consistent and asymptotically (mixed) normal if T is fixed and N → ∞, whereas the consistency of the PC estimator requires quite restrictive assumptions (such as i.i.d. errors across time) in this case. It is however not clear how large T should be in order to ensure reliable estimation and inference.
An important assumption for the CCE estimator is that the (weighted) mean of the factor loadings is different from zero. This assumption is difficult to verify as the factors loadings are typically unknown. Furthermore, we show that the CCE estimator is already biased if the mean of the factor loadings is O(N −1/2 ). To escape such a "normalization failure", we suggest a data dependent weighting scheme that is inspired by the Mundlak (1978) approach. In our Monte Carlo simulations we show that this simple weighting scheme performs well, whenever the original CCE estimator suffers from a normalization failure. The rest of the paper is organized as follows. Section 2 compares the existing estimation methods and Section 3 reviews and complements the asymptotic results for fixed T and N → ∞. Possible problems with the normalization of the estimators are analyzed in Section 4. An extension to multiple factors is considered in Section 5 and empirical approaches for selecting the number of common factors are examined in Section 6. We argue that popular selection rules for the number of factors are generally inconsistent if T is fixed. The small sample properties of alternative estimation procedures are investigated in Section 7. Specifically, we illustrate the detrimental effect of a normalization failure and demonstrate the robustness of the Mundlak type CCE estimator. Furthermore, we investigate the effects of estimating the number of factors on the performance of the estimation procedures. Finally, we employ three general model setups from the literature in order to compare the competing methods in more challenging and realistic scenarios. Section 8 concludes.

Existing estimation approaches
Consider the factor augmented panel data model: 1 where x it and β are k × 1 vectors. For the ease of exposition we first consider a single factor with r = 1, that is, f t and λ i are scalars. The extension to multiple factors is considered in Section 5.
We adopt a "classical" panel data framework where the coefficient vector β is the same for all cross-section units (homogeneous panel). Furthermore, we assume that T may be small relative to N , which is typical for many panel data applications. It should be noted that the asymptotic framework of Pesaran (2006) and Bai (2009) assumes that N and T tend to infinity, whereas  and Robertson and Sarafidis (2015) suppose that T is small and fixed. Furthermore, the latter approach treats f t as parameters and thereby avoids making any assumptions on these parameters, whereas Pesaran (2006) and Bai (2009) assume that the factors are weakly correlated random variables and the loadings are treated as parameters (or also as random variables). We make the assumption that u it is independent (strictly exogenous) of x it , f t and λ i . This rules out dynamic specifications. 2 It is well known that in the two way panel data model the individual and time specific effects (which result as special cases of the factor model with constant factor and loading, respectively) can be removed by a simple data transformation, where the variables are adjusted by the individual and time specific averages. It is not difficult to see that a similar transformation exists for the model with interactive fixed effects, which is given by where λ λ λ = (λ 1 , . . . , λ N ) and The weighted averages x t (λ) and u t (λ) are constructed in an analogous manner. Note that e t (λ λ λ) = y t (λ λ λ) − β x t (λ) = f t + u t (λ) serves as an estimate of f t . Estimating the transformed regression (3) is equivalent to the least-squares estimator, treating β and f 1 , . . . , f T as parameters and x it and λ i as regressors. Accordingly, the resulting estimator is efficient if u it iid ∼ N (0, σ 2 ).
2 In panels with individual specific parameters and fixed T , including weakly dependent regressors (such as lagged dependent variables) result in a bias of order 1/T (the incidental parameter problem). The GMM based estimators of Section 2.3 are able to cope with this bias by introducing time dependent vectors of instruments. In this paper we abstract from such complications. The reader interested in dynamic models is referred to Juodis and Sarafidis (2018).

The PC estimator
For the PC approach suggested by Bai (2009), equation (3) is replaced by the feasible version where e it = y it − β x it = λ i f t + u it and λ i denotes the PC estimator of the factor loading λ i , which is equivalent to the eigenvector associated with the largest eigenvalue of the sample covariance matrix Ω ee (β) = T −1 T t=1 e t (β)e t (β) with e t (β) = (y i1 −β x i1 , . . . , y iT −β x iT ) . As shown by Moon and Weidner (2015) the sum of squared residuals can be obtained by minimizing the objective function where y i = (y i1 , . . . , y iT ) and X i = (x i1 , . . . , x iT ) and µ min {A} denotes the smallest eigenvalue of the matrix A. The minimum can be obtained by standard numerical methods, whereas Bai (2009) proposed to compute the (nonlinear) least-squares estimator of (4) sequentially by starting with the pooled OLS or within-group estimator of β (that is by ignoring the factor structure in the errors). The first principal component of the residual e it ( β) yields a first estimator of the common factor and the associated loadings are used to obtain the estimated analog of the weighted averages in (4). The estimation procedure is iterated until the estimators converge to the least-squares estimators of β and λ. Moon and Weidner (2019) pointed out that the least-squares objective function may exhibit several local minima and therefore it is possible that the gradient based minimization based algorithm fails to find the global minimum. To cope with this problem, Moon and Weidner (2019) propose a nuclear norm penalty that results in a convex optimization problem. Another possibility is to initialize the minimization algorithm by a √ N T -consistent initial estimator. In this case it is sufficient to assume convexity in the 1/ √ N T vicinity around the true value.

The CCE Estimator
In contrast to the PC estimator, the CCE approach proposed by Pesaran (2006) does not adopt an (asymptotically) efficient weighting scheme, but employs instead pre-specified weights λ 0 . 3 In practice λ 0 = (1, . . . , 1) is the default option, but any other granual weighting scheme is possible. This gives rise to a modified transformation, where (6), we obtain the cross-section augmented regression equation, where γ i = −λ * i β and v it = e it − λ * i e t (λ λ λ 0 ). In practice, the nonlinear restriction γ i = −λ * i β is ignored and, therefore, γ i is treated as an additional parameter. 4

The HNR and ALS approach
While the CCE and PC approach replace the unobserved factor by (weighted) averages of y 1t , . . . , y N t and x 1t , . . . , x N t , the approaches suggested by Holtz-Eakin et al. (1988) (HNR) and  (ALS) replace the unknown factor loadings by linear combinations of y i1 , . . . , y iT and x i1 , . . . , x iT : HNR: ALS: The main difference between these two approaches is that in (8) the linear combination is time dependent whereas in (9) the linear combination is the same for all time series. As we do not see any advantage in using the variant HNR (and in our simulations the HNR estimator tends to perform worse than the ALS estimator), we focus on the ALS variant in the following analysis. Inserting (9) in the model (1) yields where Note that this approach involves T − 1 additional parameters θ 1 , . . . , θ T −1 , whereas the CCE approach involves N (k + 1) additional parameters, which may be a much larger number of parameters, in particular if N is large relative to T . Equation (10) can be estimated as a linear equation by ignoring the nonlinear relationship δ t = θ t β and treating δ t as additional parameters, cf. Hayakawa (2012). Furthermore, as the regressor y i,T is correlated with the errors, an instrumental variable approach is required for estimating the coefficients efficiently. Since it is assumed that x it is strictly exogenous, we employ observations of all time periods to construct the T k × 1 instrumental variable vector z i = (x i1 , x i2 , . . . , x iT ) . The first stage regression yields y iT = π z i , where π z i is the fitted value from a regression of y iT on z i . The second stage regression is Estimating the latter equation by OLS yields the two-stage least squares (2SLS) estimator. Since the error term ν it is autocorrelated (due to the common component θ t u iT ), a GMM estimator based on the moment condition E(ν i ⊗ z i ) = 0 with ν i = (ν i1 , . . . , ν iT ) is more efficient, in general.

The RS estimator
The GMM estimator of Robertson and Sarafidis (2015) results from multiplying the original model by the vector of instruments z i (e.g. the instruments of the ALS estimator) such that where The transformed model (11) results in a new factor model with m = dim(z i ) = kT observations in N cross section units. In this model γ i and f t are treated as unknown parameters and the GMM estimator results from minimizing the criterion function where W N is a consistent estimator of the optimal weighting matrix E 1 Robertson and Sarafidis (2015) propose to minimize the function Q(·) by applying a sequential least-squares (SLS) estimator. Let f 0 t denote some starting value. Replacing f t by f 0 t the parameters β and γ i can be estimated by OLS from (11). Replacing γ i by the respective OLS estimator, we can obtain an updated estimator for f t from running T cross-section regressions (11) for t = 1, . . . , T . A linear variant of this estimation approach is proposed by Juodis and Sarafidis (2020).
It is important to notice that the first order condition of the SLS estimator is invariant to some scaling factor c, such as f * t = cf t and λ * i = λ i /c. The PC estimator implies c = 1/ T t=1 f 2 t and the original ALS estimator imposes c = 1/f T . The objective function of the least-squares estimator does not impose any normalization of the factors. There exists a unique minimum for the product γ i f t , but the decomposition into γ i and f t is somewhat arbitrary and depends on the starting value of iterative algorithm.

Asymptotic properties for fixed T
The asymptotic properties of the PC and CCE estimators are typically derived by adopting a joint limit theory, where T and N tend to infinity (e.g. Pesaran 2006, Bai 2009, Greenaway-McGrevy et al. 2012and Westerlund and Urbain 2015. The asymptotic analysis revealed that the PC and CCE estimators are √ N Tconsistent whenever √ T /N → 0 and √ N /T → 0. This requirement is fulfilled if for some fixed constant, 0 < a < ∞, the paths of the sample sizes admit the inequality aT 0.5+ < N < aT 2− for some > 0. Statistical inference based on these estimators suffers from an asymptotic bias whenever T /N → κ > 0. This bias does not show up in the asymptotic analysis of Pesaran (2006), as he assumes that the coefficient vector β i = β + v i is individual specific, where v i is a random error that prevents the estimator from achieving the usual √ N T convergence rate. In the literature cited above, bias-corrected estimators are suggested that remove the asymptotic bias from the limiting distribution.
For fixed T and N → ∞ the CCE estimator of the factors is consistent as e t (λ 0 ) converges in probability to cf t , where c is some scale factor that is different from zero. Therefore, the errors-in-variable problem vanishes for N → ∞ and fixed T (cf. Westerlund et al. 2019).
For the asymptotic analysis of the PC estimator, it is usually assumed that min(N, T ) → ∞ (cf. Bai 2009) and, therefore, the PC estimator may be inconsistent if T is fixed and N → ∞ (see Remark 1 of Bai 2009). Under more restrictive assumptions it is however possible to show that the PC estimator of the factors is consistent if T is fixed and N → ∞. To focus on the main issues assume that β is known. Furthermore, we assume that the vectors f f f = (f 1 , . . . , f T ) and λ λ λ = (λ 1 , . . . , λ N ) are parameter vectors to be estimated. The PC estimator solves the first order conditions: where As N → ∞ the moment condition is solved by letting f = f and, therefore, the PC estimator for f is consistent (up to a scaling factor). If u it is heteroskedastic or autocorrelated, then M f Σ Σ Σ u f f f = 0 in general and, therefore, the PC estimator is inconsistent as N → ∞. On the other hand, if both N and T tend to infinity, the PC estimator is consistent no matter of a possible heteroskedasticity or (weak) autocorrelation (cf. Chamberlain and Rothschild 1983).
The asymptotic theory for the HNR and ALS estimators assumes that T is fixed and N tends to infinity. The GMM estimator is based on kT (T − 1) moment conditions with k + T − 1 unknown parameters. Therefore, no problem arises if T is fixed and N tends to infinity. Accordingly, the estimators are asymptotically normally distributed and centered around zero. Of course the problem of instrument proliferation arises if T gets large and the asymptotic theory breaks down if T 3 /N → κ > 0 (cf. Bekker 1994 andLee et al. 2017). 5

Identification
All estimation approaches require some normalization of the factors or loadings some of which may be problematical in empirical practice. The CCE and ALS approaches are based on the following conditions: ALS: whereas the restriction for the PC estimator T −1 T t=1 f 2 t = 1 is unproblematic in practice. The violation of the restrictions (15) and (16) may result in poor distributional properties of the estimator. If, for example, N −1 λ 0,i λ i = 0, then the cross section mean e t (λ λ λ 0 ) does not depend on the factor and, therefore, the CCE estimator is biased whenever x it and λ i f t are correlated (cf. Westerlund and Urbain 2013). Similarly, if f T = 0, then y iT = β x iT + u iT and the instruments are not able to identify the parameters θ t and δ t .
One may argue that the chance that (15) or (16) is exactly zero is negligible, so that problems only occur in rare cases (if at all). Unfortunately, this is not true, as the problems already arise whenever Including the cross-section averages y t and x t is equivalent to augmenting with e t and x t . Furthermore, where f * t = f t + (u t /λ). Since in our case u t /λ = O(1), it follows that the factor f * t is different from f t . In this case, e t does not represent the true factor and the CCE estimator of β is inconsistent whenever the factor is correlated with the regressors.
To sidestep this difficulty, we follow the analysis of Mundlak (1978) and decompose the factor loadings into a systematic component related to the ordinary average x i and the projection error ξ i : where x i = T −1 T t=1 x it and ξ i is uncorrelated with x i . In this specification γ 1 x i represents a possible linear dependence of λ i on the regressors that gives rise to an endogeneity bias. Inserting (17) in (1) yields This estimation equation is related to the projection approach of Hayakawa (2012), who considers a projection of λ i on the vector z i = vec(X i ), also known as Chamberlain projection. A second difference to the Hayakawa (2012) approach is that he employs the projection for GMM estimation of ALS, whereas we employ the Mundlak projection in the context of CCE estimation.
The weighting scheme for the CCE estimator results as Since γ 0 and γ 1 are unknown, we augment the regression by the following (k + 1) 2 cross section averages: This estimator is referred to as CCE(M ). 6 Similar normalization problems arise for the HNR and ALS approaches, but these estimators apply a normalization to the factors. For example, if f T is zero, then the linear combination of y iT and x iT is not able to identify the factor and, therefore, the ALS approach is biased whenever f T = 0 and x it is correlated with λ i f t . If T is small then one may try out all possible time periods for normalization and select the normalization that minimizes the GMM objective function. For a large number of time series this approach is rather time consuming. In such cases the normalization may be selected by estimating the factor by the PC approach. Then, the normalization period with the largest factor (in absolute value) is selected as the normalization period.
In the appendix of  a more flexible approach is proposed, which we refer to as ALS To avoid normalizing T − 1 elements to unity, we transform the equations for unit i by using a more general matrix with property Given β, the estimator of H is based on the moment condition E(H e i z i ) = 0, where z i = vec(X i ). Accordingly, a GMM estimator for H can be obtained as Accordingly, the estimator H is obtained as the matrix of eigenvectors corresponding to the smallest T − 1 eigenvalues of the matrix Ω ez Ω −1 zz Ω ez . Given H, the estimator for β is obtained from the OLS regression This estimation step yields an updated estimator for β that can be used to obtain a new estimator of H, until convergence. A drawback of this variant of the ALS estimator is that no standard errors for β are readily available, as the respective estimation step is affected by the estimation error in H.
It is interesting to compare this approach to the PC estimator of Bai (2009), which can be obtained by solving the problem Accordingly, the difference between the PC and ALS/RS approaches is that the former extracts the factors from the residual vector e i , whereas the ALS/RS approach first projects the residuals on the space spanned by the vector of instruments z i . Accordingly, the latter approach requires that the factors are correlated with the regressors, whereas the PC approach does not. Robertson and Sarafidis (2015) show that their estimator considered in Section 2.4 is asymptotically equivalent to ALS * if the error u it is i.i.d. If u it is heteroskedastic and/or serially correlated, then the weighting matrix W n results in an asymptotic efficiency gain.

Multiple factors
So far we assumed that there is only a single factor. It is not difficult to see that for a panel data model with a vector of r ≥ 1 factors f f f t and the conformable r × 1 loading vector λ λ λ i , the estimation equation (3) is given by where Λ Λ Λ = (λ λ λ 1 , . . . , λ λ λ N ) and and the r × 1 vector u t (Λ) is constructed in a similar manner. This shows that efficient estimation requires r linear independent weighting schemes applied to y y y t = (x 1t , . . . , y N t ) and X X X t = (x 1t , . . . , x N t ) .
To show consistency of the modified CCE estimator, CCE(M ), a different reasoning is required. For the ease of exposition assume k = 2 regressors and r = 2 factors. We obtain 2 different weighting schemes: that are used to obtain the following relationships: is invertible, we can obtain the linear combinations that represent the factors as Thus, the common component λ 1,i f 1,t + λ 2,i f 2,t can be (asymptotically) represented by a linear combination of the 6 means y (1) t , y t , x 2,t , x 1,t , and, x 2,t . 7 6 Determining the number of factors As argued by Pesaran (2006), the CCE estimator is consistent if the actual number of factors r is not larger than k + 1. This requires however that r − 1 factors 7 The alert reader may have noticed that the linear combination does not involve the ordinary cross-section averages N −1 i y it , N −1 i x 1,it and N −1 i x 2,it that are employed in the CCE estimator. These additional means are not required for identification but often improve the statistical properties of the estimator. They may also help to escape the problems resulting from a (nearly) singular matrix Ξ.
are correlated with the k regressors. This is due to the fact that one factor can be identified by the cross-section average e t (λ 0 ) = y t (λ 0 ) − β x t (λ 0 ), whereas the identification of the other factors requires some relationship to the cross-section averages of the regressors x t . Furthermore, the correlation pattern needs to be sufficiently informative for identifying the factors.
It is often argued that the CCE approach is attractive, as we do not need to select the number of factors, whereas for all other approaches, the number of factors needs to be known (or determined from the data). If the number of factors is smaller than k + 1 and the normalization requirements are satisfied, then the CCE estimator is consistent, but the small sample properties may suffer from including many cross-section averages. This is comparable to applying the PC estimator with r = k + 1 factors. As shown by Moon and Weidner (2015), under some additional assumptions, 8 the PC estimator is robust against over-specifying the number of factors. A similar result is obtained by Westerlund et al. (2019) for the CCE estimator. Since under certain conditions the CCE estimator for β is as efficient as the OLS estimator using the true factors, there is no gain in (asymptotic) efficiency by changing the weighting scheme or imposing nonlinear restrictions to the auxiliary parameters that are implied by knowing the number of factors. It is however not clear whether this result provides a good guidance for empirical applications in finite samples.
In practice, it may therefore be interesting to estimate the number of factors. To this end we may invoke the criteria proposed by Bai and Ng (2002) and Ahn and Horenstein (2013). Both approaches are based on the eigenvalues of the residual covariance matrix. Denote by µ 1 ≥ · · · ≥ µ T the ordered eigenvalues of the T × T sample covariance matrix Ω ee = N −1 N i=1 e i e i , where the residual vector e i is obtained by estimating the model with maximum number of factors r * . Furthermore, let Accordingly, the BN and AH criteria include an additional term of order O p ((N T ) −1/2 ) that does not affect the asymptotic properties as N and T tend to infinity. Let us consider the asymptotic properties of the respective estimators r if T is fixed and N → ∞. In this case lim N →∞ P ( r < r 0 ) = 0 is ensured by (cf. Bai and Ng 2002) As condition (19) is not satisfied for fixed T , the BN criterion may select some r < r 0 , even if N → ∞. The requirement lim N →∞ P ( r > r 0 ) = 0 implies Since log σ 2 u (r 0 ) − log σ 2 u (r) = O p (N −1 ) + O p (T −1 ) for r > r 0 (cf. Lemma 4 of Bai and Ng 2002), it may happen that for small T , condition (20) is violated as well. Hence, the BN criterion may not be consistent for fixed T . In practice it is nevertheless possible that the BN criterion selects the number of factors consistently, if the eigenvalues µ 1 , . . . , µ r 0 −1 are sufficiently large and µ r 0 +1 , . . . , µ r * are sufficiently small relative to µ r 0 .
Since for fixed T , µ r is O p (1) for all r = 1, . . . , T , it follows that the eigenvalue ratio AH(r) is O p (1) for fixed T and all r ∈ {1, . . . , r * }. Therefore, the AH criterion cannot be shown to be a consistent selection rule for fixed T . It may nevertheless perform well, if the slope of the eigenvalue function is sufficiently steep at r = r 0 .
A possibility to sidestep these problems is to adopt the BIC selection criteria of  and Robertson and Sarafidis (2015). These criteria are based on the Sargan-Hansen specification test for GMM estimators. If the number of factors is too small, then the remaining cross-correlation among the residuals results in a large value of the test statistic. The penalty function is constructed such that the sum of the test statistic and the penalty function obtains a minimum at the correct number of factors as N tends to infinity.

Monte Carlo Simulations
In this section we assess the performance of alternative estimation methods in various settings and highlight some favorable and problematic aspects of alternative estimation methods. The simulation results in Sections 7.1 -7.2 are based on the following simple data-generating process with β = 0.5 and r = 1. Hence, the regressor is correlated with the loadings, the factor and the product of both. The regression error u it and the idiosyncratic component of the regressor, ε it , are independent standard normal random variables. The constant µ is drawn from a U [0, 1] distribution. The DGPs in Sections 7.1 to 7.2 differ with respect to the distributional assumptions on the factors and their loadings. The (near) violation of the normalization restrictions for the CCE and ALS estimators are examined in Section 7.1. In Section 7.2, we compare the PC and CCE estimator with regard to their different weighting schemes. In Section 7.3 we address the estimation of the number of factors, r, for the PC, ALS* and RS approaches. There, we consider a similar DGP as in (21) and (22) for r = 1 and r = 2. The last subsection 7.4 considers the relative performance of the CCE, PC, ALS* and RS, estimation approaches in more general settings that are based on the DGPs considered by Bai (2009)

Normalization failure
As argued in Section 4, the CCE and ALS/HRN approaches may suffer from a violation of their normalization conditions. The performance already deteriorates if the parameters approach the √ N -vicinity of the problematic subspace. In a model with a single factor, the normalization of the equally weighted CCE estimator (λ 0,i = 1) requires that λ = N −1 N i=1 λ i = 0. We have argued that whenever λ = c/ √ N , the factor cannot be represented by a linear combination of y i and x i as N → ∞. Sarafidis and Wansbeek (2012) and Westerlund and Urbain (2013) analyze the performance of the CCE estimator when the normalization condition is violated. In order to study the performance of the CCE estimator when λ is different but close to zero, we consider the model in (21) and (22), where we generate the factor loadings as DGP1: λ i ∼ N (µ λ , 1) for µ λ ∈ [0, 1] and f t ∼ N (0, 1).
Hence, the loadings are normally distributed with expectation that ranges from 0 to 1. Figures 1 (a) -(d) present the absolute bias for the original CCE, the Mundlak type CCE(M) estimator suggested in Section 4, and the PC estimator for N = 100 and N = 500 with a small (T = 10) and moderate (T = 50) number of time periods. The PC estimator of Bai (2009) is obtained by a sequential estimation procedure using the pooled OLS estimator as starting value for β (see Section 2.1). It turns out that the CCE estimator is severely biased even if the mean of λ i is substantially different from zero. This is due to the fact that a bias already occurs whenever µ λ = O(N −1/2 ). This reasoning predicts that for fixed µ λ the bias gets smaller if N increases. Indeed, this is what we observe when comparing panel (a) and (c) as well as (b) and (d). Note that √ 100/ √ 500 ≈ 0.44 and, therefore, we expect that the bias reduces to a value less than one half which is a good approximation for µ λ > 0.1. The other two estimators, PC and CCE(M), are virtually unbiased, which is expected as the estimators do not rely on the assumption µ λ = 0.
In a similar manner, the normalization of the ALS estimator may be problematic if the factors approach the problematic subspace. The ALS estimator requires f T = 0. To examine the consequences of an (approximate) violation of this normalization condition, we consider the model in (21) and (22) where the factors are generated as: DGP2: f t ∼ N (0, 1) for t = 1, ..., T − 1 and f T ∼ N (µ T , 0.5) for µ T ∈ [0, 1] and the factor loadings are standard normally distributed. As the final value of the factor is crucial, we generate it by a distribution with expectation ranging from 0 to 1.
Figures 1 (e) -(f) present the bias for the ALS estimator when T = 5 and N = 100 or N = 500, respectively. As expected, the ALS estimator is severely biased whenever µ T = E(f T ) is small. But even for moderate values of µ T the bias remains substantial and decreases only gradually for larger values of µ T . It should be noted that if the regression includes an individual specific intercept, then the factors are demeaned and, therefore, assuming a nonzero mean appears inappropriate.
Figures 1 (e) -(f) also present the bias of two estimators that circumvent the problems with the normalization of the original ALS estimator. The estimator ALS * refers to the GMM estimator that estimates the matrix H that is used to remove the factors (see Section 4). 9 Our simulation results suggest that this estimator performs quite well in terms of bias, as it is virtually unbiased for all values of µ T . Another approach to escape the normalization problem is the GMM max estimator, where in a first step the factor is estimated using the PC approach. In the second step, the time period for the normalization is chosen according to the maximum absolute value of the estimated factor and the original ALS estimator is adapted, where the time period with the largest factor is shifted to the end of the sample. Both estimators are able to reduce the bias dramatically.
The figures also include the RS estimator, which corresponds to the FIVU estimator of Robertson and Sarafidis (2015). This estimator does not require f T = 0 for normalization (see Section 2.4) and thus the bias does not depend on the value of µ T . The RS estimator has a slight advantage in terms of bias when N = 100. With N = 500, the bias of the ALS * , GMM max and RS estimators is nearly zero.
To summarize, our findings confirm earlier evidence that the normalization applied for the original CCE or ALS/HNR estimators may be problematical, whenever the factors or loadings approach a normalization failure. It is however easy to adjust the estimators such that they perform well for all values of the parameter space. Our Monte Carlo exercise indicates that the PC and CCE(M) estimators as well as ALS * , GMM max and RS are very robust against a possible normalization failure.

Fixed versus data driven weights
From the reasoning of Section 2, it turns out that the CCE estimator is expected to outperform the PC estimator whenever the weighting scheme λ λ λ 0 comes close to the actual set of loadings λ λ λ, see also Westerlund and Urbain (2015). For equal weights with λ 0,i = 1 for all i, the CCE estimator performs well, whenever (i) the absolute value of the mean of the loadings is large (to avoid the normalization failure) and (ii) the variance of the loadings is small. Our DGP3 represents such a scenario, whereas the DGP4 favors the PC estimator by generating factor loadings with large variance, The remaining details of the simulation setup are identical to the model in (21) and (22).
The results reported in Table 1 clearly confirm our assertion that the CCE estimator outperforms the PC estimator in DGP3, whereas the PC estimator performs better for DGP4. This finding suggests to find a weighting scheme that comes close to the actual distribution of the loadings. This is the notion behind the Mundlak type CCE variant that employs the individual specific means y i and x i , since a linear combination of these averages can be seen as (CCE type) estimates of the loadings λ i . Therefore, we hope to improve the original CCE estimator by applying weights that are correlated with the loadings. Our results from the simple Monte Carlo experiment suggest that the CCE(M) approach of choosing a data driven weighting scheme performs similar to the best estimator in the respective situation. Furthermore, as shown in the previous subsection, the CCE(M) estimator sidesteps the risk of a normalization failure. Provided that this estimator is similarly easy to compute as the original CCE estimator, it appears as if this estimator is a robust and efficient variant of the original CCE estimator.

Selecting the number of factors
In practice, it is necessary to select the number of factors for the PC and GMM estimation procedures. The choice is important, since misspecifying the number of factors can have severe consequences: Overspecifying the number of factors can have adverse effects on the sampling properties of the estimators, while an underspecification may lead to inconsistent estimates if the ignored factors are correlated with the regressors. One possibility for selecting the number of factors is simply to specify the number according to some ad hoc rule, for instance r = k + 1, as usually advocated for the CCE approach. Another option is to use a consistent criterion for the number of factors, such as the ones proposed by Bai and Ng (2002) (hereafter: BN) and Ahn and Horenstein (2013) (AH). Note that these selection criteria were developed for the pure factor model without regressors. Furthermore, the asymptotic theory underlying these approaches requires T → ∞ (see Section 6). It is therefore interesting to investigate the performance of these criteria that were not initially developed for a small number of time periods. For the GMM estimators, the number of factors can be estimated using model information criteria, such as the Schwarz Criterion (BIC) considered by  and Robertson and Sarafidis (2015).
In order to study the performance of these selection criteria, we consider a similar model as in (21) and (22) with r = 1 and r = 2. For the loadings and factors, we assume the following DGP, DGP5: λ j,i ∼ N (0, 1), f j,t ∼ N (0, 1) for j = 1, 2.
As reported in Table 2, the hit rates for a single factor, r = 1, are nearly 100% for the BN and AH criteria whenever T ≥ 10. For T = 5 the BN criterium does not work and nearly always picks the maximum number of factors. On the other hand the AH criterion works remarkably well, even for a number of time periods as small as T = 5. 10 The hit rates for the BIC criteria exceed 90% in all but one case. For r = 2 the hit rates for the AH criterion are substantially lower, but the estimators are still quite accurate, even if T = 10 and N is large. For the BIC criteria, the hit rates decrease by only a small amount and do not seem to be very sensitive to the number of factors, in particular if N > 100.
In Table 3, we report bias and RMSE for the PC, ALS * and RS estimators based on the true number of factors (r = 1 and r = 2) as a benchmark. In addition we assess the performance of the estimators, when the number of factors is estimated based on selection criteria. 11 As expected, using the AH method for r = 1 in order to estimate the number of factors for the PC estimator produces bias and RMSE results that are of similar magnitude as the true number of factors. Applying the BIC criterion to estimate the number of factors for the GMM estimators produces very accurate estimates when N > 100, accordingly.
For r = 2, the performance of the PC estimator using the AH criterion shows a considerable bias, in particular if T is as small as 5. In contrast, bias and RMSE of the GMM estimators applying the BIC criterion are similar to the estimators based on the true number of factors when N > 100. When T increases to 10, there is still a substantial performance gap between the PC estimator using the AH method and the PC estimator based on the true number of factors, whereas the GMM estimators based on the BIC criterion perform much better. This is surprising as Table 2 suggests that the hit rates of the BIC criterion are only slightly better in these cases. The reason is that the AH criterion tends to underestimate the number of factors whereas the BIC criterion overestimates the number of factors in case the correct number of factors is not found.
Consider, for instance, T = 10 and N = 500. The BIC estimator finds the 10 The performance is similar to the case where β is known (not shown). Therefore, the estimation of β does not seem to have an important effect on the performance of the BN and AH selection criteria. Furthermore, the growth ratio statistic of Ahn and Horenstein (2013) performs similar to the eigenvalue ratio statistic. For reasons of space we do not show the respective results.
11 To save space, we do not show results for the estimators based on the BN criterion, since the hit rates are either 0% or (close to) 100%. correct number of factors (r = 2) in more than 95% of the cases and overestimates the number in the other (< 5%) cases. The AH estimator finds the correct value of r = 2 in 89.8% of the cases, however underestimates the number in all other cases. Since the estimator is biased if the number of factors is too small, the AH criterion tends to produce a large negative bias in some cases, whereas the BIC criterion tends to produce unbiased estimators with a slightly larger variance than estimating with the correct number of factors in some very rare cases.

Performance in more general setups
So far the DGPs considered in this paper were simplified versions of the ones considered in the literature and focus on the particular features of these models. In the following, we study the relative performance of the CCE, PC, ALS * and RS approaches in more sophisticated simulation setups, similar to the simulation experiments of Bai (2009), Chudik et al. (2011 and . The details of these data generating processes are presented in the online appendix to this paper. The Monte Carlo design of Bai (2009) employs two regressors that are correlated with two factors, their loadings and the product of both. The idiosyncratic error is i.i.d. across individuals and time periods. We refer to this model as DGP6. DGP7 refers to the factor model of Chudik et al. (2011) that includes two regressors and three factors. A special feature of this DGP is that the factor loadings of the regressors are independent of the loadings in the errors e it . Accordingly, no endogeneity bias arises from estimating the model by a pooled OLS estimator. The factors are generated by independent AR(1) processes and the idiosyncratic component u it is heteroskedastic but mutually and serially uncorrelated. DGP8 corresponds to the Monte Carlo design of , which includes two regressors and two factors. The first regressor is correlated with the first factor and the second regressor is correlated with the second factor. The idiosyncratic error is autocorrelated but the variances are identical across panel units and time periods.
The results in Table 4 indicate that the relative performance of the estimators depends quite sensitively on the DGP considered. The first panel of Table 4 presents the results for DGP6. The CCE estimator is not consistent in this setting, since the rank condition is violated and both factor and loading vectors are correlated with both regressors. The other three estimators are consistent in this setting, where the RS estimator is the least biased when T = 5 and the ALS * exhibits the lowest bias for T ≥ 10. The latter performs best in terms of RMSE with only slight advantages over the PC estimator when T ≥ 10.
The second panel of Table 4 reports the results for DGP7. The CCE estimator is the favored one in this setting. It has a very small bias and exhibits the lowest RMSE for nearly all considered (N, T ) combinations, in particular if T is as small as 5. Comparing the PC and GMM estimators, the results slightly favor the PC estimator in terms of RMSE. The difference between the PC and the CCE estimator is negligible when T = 15 and N = 500. With regard to the GMM estimators, the RS estimator has a marginally lower RMSE when T = 5 and N is large, while the results indicate small advantages for the ALS * estimator when T ≥ 10.
The third panel of Table 4 presents the results for DGP8. The GMM estimators are the least biased estimators in this setting. The ALS * estimator exhibits the smallest RMSE for all (N, T ) combinations with only slight advantages over the RS estimator. For example, for T = 10 and N = 500, the RMSE of the ALS * estimator is about 40% lower than the RMSE of the PC estimator and more than 60% lower than the RMSE of the CCE estimator. The CCE estimator is problematic in this setting, since the expectation of the loadings is equal to zero. The PC estimator is problematic in this small T setting. However, the RMSE is lower for larger samples with T = 15 and N = 500.

Conclusion
In this paper we compare three existing approaches for estimating factor augmented panel data models. We argue that the PC estimator can be seen as an estimated analog of the optimal transformation for eliminating the common factors from the data. The CCE estimator applies a data transformation that has the important advantage that the weighting scheme is fixed and does not involve any sampling error. This ensures that the estimator is consistent even if T is fixed, whereas the PC estimator requires much more restrictive assumptions (such as i.i.d. errors) if T is fixed. The third estimation approach is the nonlinear GMM estimators of  and Robertson and Sarafidis (2015). In contrast to the PC and CCE estimators, this estimator treats the T observations of the factor as parameters, whereas the factor loadings are substituted out. Therefore the number of parameters involved by this approach does not depend on N .
An important difference between the PC estimator and all other approaches is that the PC estimator does not require that the factors are correlated with the regressors. In contrast, the ALS/RS approaches and the CCE estimator (for r > 1) rely on the assumption that the factors are linearly related to the regressors (that is, the instruments are relevant). Accordingly, if some factors are uncorrelated (or weakly correlated) with the regressors, one can expect the PC estimator to be more efficient.
In this paper we focus on the typical micro panel data setup where T is small compared to N . Since for an approximate factor model the consistency of the PC estimator requires T → ∞, it is interesting to investigate how large T needs to be for ensuring the PC estimator to be approximately unbiased. Our Monte Carlo experiments indicate that for all data generating mechanisms considered in this paper T = 10 is already sufficient to achieve reasonable small sample properties of the PC estimator. Sometimes the CCE and ALS * estimators perform slightly better than the PC estimator, but in other Monte Carlo setups the PC estimator clearly outperforms all other competitors. Furthermore, we show that for small T the selection criteria for the number of factors proposed by Bai and Ng (2002) and Ahn and Horenstein (2013) may be inconsistent, whereas the BIC criteria of  and Robertson and Sarafidis (2015) perform well.

Compliance with Ethical Standards
Conflict of Interest: The author Jörg Breitung declares that he has no conflict of interest. Author Philipp Hansen declares that he has no conflict of interest. Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.
The first 50 observations of η l,t are discarded as "burn-in" period.
The first 50 time observations of u it are discarded. The idiosyncractic components of the regressors are η l,it iid ∼ N (0, 1) and µ l,i iid ∼ N (0, 1) for l = 1, 2.