Approximate functional differencing

Inference on common parameters in panel data models with individual-specific fixed effects is a classic example of Neyman and Scott’s (Econometrica 36:1–32, 1948) incidental parameter problem (IPP). One solution to this IPP is functional differencing (Bonhomme in Econometrica 80(4):1337–1385, 2012), which works when the number of time periods T is fixed (and may be small), but this solution is not applicable to all panel data models of interest. Another solution, which applies to a larger class of models, is “large-T” bias correction [pioneered by Hahn and Kuersteiner (Econometrica 70(4):1639–1657, 2002) and Hahn and Newey (Econometrica 72(4):1295–1319, 2004)], but this is only guaranteed to work well when T is sufficiently large. This paper provides a unified approach that connects these two seemingly disparate solutions to the IPP. In doing so, we provide an approximate version of functional differencing, that is, an approximate solution to the IPP that is applicable to a large class of panel data models even when T is relatively small.


Introduction
Panel data offer the potential to account for unobserved heterogeneity, typically through the inclusion of unit-specific parameters; see Arellano (2003) and Arellano and Bonhomme (2011) for reviews. Nonlinear panel data models, however, remain challenging to estimate, precisely because in many models the presence of unit-specific, or "incidental", parameters makes the maximum likelihood estimator (MLE) of the common parameters inconsistent when the number of observations per unit, T, is finite (Neyman and Scott 1948). The failure of maximum likelihood has prompted two kinds of reactions.
One approach is to look for point-identifying moment conditions that are free of incidental parameters. Such moment conditions can come from a conditional or a marginal likelihood (e.g., Rasch 1960, Lancaster 2000), an invariant likelihood (Moreira 2009), an integrated likelihood (Lancaster 2002), functional differencing (Bonhomme 2012), or from some other reasoning to eliminate the incidental parameters, for example, differencing in linear dynamic models (e.g., Arellano and Bond 1991). This approach is usually model-specific and is "fixed-T", i.e., it seeks consistent estimation when T is fixed (and usually small). However, point-identifying moment conditions for small T may not exist because point identification simply may fail; see Honoré and Tamer (2006) and Chamberlain (2010) for examples.
The other main approach is motivated by "large-T" arguments and seeks to reduce the large-T bias of the MLE or of the likelihood function itself or its score function (e.g., Hahn and Kuersteiner 2002, Alvarez and Arellano 2003, Hahn and Newey 2004, Arellano and Bonhomme 2009, Bonhomme and Manresa 2015, Dhaene and Jochmans 2015b, Arellano and Hahn 2016, Fernández-Val and Weidner 2016). This approach is less model-specific and may also be applied to models where point identification fails for small T.
The functional differencing method of Bonhomme (2012) provides an algebraic approach to systematically find valid moment conditions in panel models with incidental parameters, if such moment conditions exist. Related ideas are used in Honoré (1992), Hu (2002), Johnson (2004), Kitazawa (2013), Honoré and Weidner (2020), Honoré, Muris and Weidner (2021), and Davezies, D'Haultfoeuille and Mugnier (2022). In this paper, we extend the scope of functional differencing to models where point identification may fail. In such models, exact functional differencing (as in Bonhomme 2012) is not possible, but an approximate version thereof yields moment conditions that are free of incidental parameters and that are approximately valid in the sense that their solution yields a point close to the true common parameter value. Bonhomme's method relies on the existence of (one or more) zero eigenvalues of a matrix of posterior predictive probabilities (or a posterior predictive density function) defined by the model. Our extension considers the case where all eigenvalues are positive and, therefore, point identification fails, but where some eigenvalues are very close to zero. This occurs as the number of support points of the outcome variable increases. Eigenvalues close to zero then lead to approximate moment conditions obtained as a bias correction of an initially chosen moment condition. The bias correction can be iterated, possibly infinitely many times. In point-identified models, the infinitely iterated bias correction is equivalent to functional differencing. Therefore, approximate functional differencing can be viewed as finite-T inference in point-identified models, and as a large-T iterative bias correction method in models that are not point-identified.
The construction of approximate moment conditions is our main focus. Once such moment conditions are found, estimation follows easily using the (generalized) method of moments, and the discussion of estimation is therefore deferred to later parts of the paper (from Section 6).
We illustrate approximate functional differencing in a probit binary choice model.
The implementation, including the iteration, is straightforward, only requiring elementary matrix operations. Indeed, one of the contributions of this paper is to show how to iterate score-based bias correction methods for discrete choice panel data relatively efficiently.
After introducing the model setup in Section 2, we review the main ideas behind functional differencing in Section 3. In Section 4 we introduce our novel bias corrections and explain how they relate to functional differencing. Section 5 examines the eigenvalues of the matrix of posterior predictive probabilities in a numerical example. Section 6 briefly discusses estimation. Further numerical illustration of the methods and some Monte Carlo simulation results are presented in Section 7. Section 8 discusses some extensions, in particular, a generalization of the estimation method to average effects. Finally, we provide some concluding remarks in Section 9.

Setup
We observe outcomes Y_i ∈ Y and covariates X_i ∈ X for units i = 1, ..., n. We only consider finite outcome sets Y in this paper, but in principle all our results can be generalized to infinite sets Y. There are also latent variables A_i ∈ A, which are treated as nuisance parameters. We assume that (Y_i, X_i, A_i), i = 1, ..., n, are independent and identically distributed draws from a distribution with conditional outcome probabilities

P(Y_i = y_i | X_i = x_i, A_i = α_i) = f(y_i | x_i, α_i, θ_0),    (1)

where the function f(y_i | x_i, α_i, θ) is known (this function specifies "the model"), but the true parameter value θ_0 ∈ Θ ⊂ R^{d_θ} is unknown. Our primary goal in this paper is inference on θ_0.
Let π_0(α_i | x_i) be the true distribution of A_i conditional on X_i = x_i. Then, the conditional distribution that can be identified from the data is

p_0(y_i | x_i) = ∫_A f(y_i | x_i, α_i, θ_0) π_0(α_i | x_i) dα_i.    (2)

No restrictions are imposed on π_0(α_i | x_i) nor on the marginal distribution of X_i; that is, we have a semi-parametric model with unknown parametric component θ_0 and unknown nonparametric component π_0. The setup just described covers many nonlinear panel data models with fixed effects.
There, we observe outcomes Y_it and covariates X_it for unit i over time periods t = 1, ..., T.
For static panel models we then set Y_i = (Y_i1, ..., Y_iT) and X_i = (X_i1, ..., X_iT), and the model is typically specified as

f(y_i | x_i, α_i, θ) = ∏_{t=1}^{T} f*(y_it | x_it, α_i, θ).

Here, f*(y_it | x_it, α_i, θ) often depends on x_it, α_i, θ only through a single index x_it' θ + α_i, where θ is a regression coefficient vector of the same dimension as x_it, and α_i ∈ R is an individual-specific fixed effect. Of course, θ may also contain additional parameters (e.g., the variance of the error term in a Tobit model).
For dynamic nonlinear panel models, we usually have to model the dynamics explicitly. For example, we may include a lagged dependent variable in the model. In that case, assuming that Y_it at t = 0 is observed, we have Y_i = (Y_i1, ..., Y_iT) and X_i = (Y_i0, X_i1, ..., X_iT), and the model is usually specified as

f(y_i | x_i, α_i, θ) = ∏_{t=1}^{T} f*(y_it | y_i,t−1, x_it, α_i, θ).

Here, the initial observation Y_i0 is included in the conditioning variable X_i. In this way, the setup in equations (1) and (2) also covers dynamic panel data models.
The setup may also be relevant for applications outside of standard panel data, e.g., pseudo-panels, network models, or games. But one typically needs Y_i to be a vector of more than one outcome to learn anything about θ_0 since, in most models, the value of α_i alone can fully fit any possible outcome value if there is only a single outcome per unit (i.e., if the sample is purely cross-sectional).
The main insights of our paper are therefore applicable more broadly, but our focus will be on panel data.In particular, the following static binary choice panel data model will be our running example throughout the paper.
Example 1A (Static binary choice panel data model). Consider a static panel data model with binary outcomes Y_it ∈ {0, 1} that are generated by

Y_it = 1{X_it θ_0 + A_i − U_it ≥ 0},

where the errors U_it are independent of X_i and of A_i ∈ R, and are i.i.d. across i and t with cdf F(u). This implies

P(Y_it = 1 | X_i = x_i, A_i = α_i) = F(x_it θ_0 + α_i).

For the probit model, we have F(u) = Φ(u), where Φ is the standard normal cdf, and for the logistic model we have F(u) = (1 + e^{−u})^{−1}. To make the example even more specific, we consider a single binary covariate X_it ∈ {0, 1} such that for all i = 1, ..., n we have X_it = 1{t > T_0} for some T_0 ∈ {1, ..., T − 1}, that is, X_it is equal to zero for the initial T_0 time periods, and is equal to one for the remaining T_1 = T − T_0 time periods. Here T_0, and therefore X_i, is non-random and constant across i, so we can simply write f(y_i | α_i, θ) instead of f(y_i | x_i, α_i, θ).

Example 1B (Example 1A reframed). Consider Example 1A, but denote the binary outcomes now as Y*_it ∈ {0, 1} and define the outcome Y_i for unit i as the pair Y_i = (Y_{i,0}, Y_{i,1}). Here, Y_{i,0} = Σ_{t=1}^{T_0} Y*_it is the number of outcomes for unit i for which Y*_it = 1 within those time periods that have X_it = 0, and Y_{i,1} = Σ_{t=T_0+1}^{T} Y*_it is the number of outcomes with Y*_it = 1 for the time periods with X_it = 1. This implies that

f(y_i | α_i, θ) = (T_0 choose y_{i,0}) [F(α_i)]^{y_{i,0}} [1 − F(α_i)]^{T_0 − y_{i,0}} (T_1 choose y_{i,1}) [F(α_i + θ)]^{y_{i,1}} [1 − F(α_i + θ)]^{T_1 − y_{i,1}},

where we drop x_i from f(y_i | x_i, α_i, θ) since it is non-random and constant across i. The parameter of interest, θ_0 ∈ R, is unchanged.
From the perspective of parameter estimation, Example 1B is completely equivalent to Example 1A, because Y_i in Example 1B is a minimal sufficient statistic for the parameters (θ_0, α_i) in Example 1A. Nevertheless, the outcome space in Example 1A is larger (|Y| = 2^T) than the outcome space in Example 1B (|Y| = (T_0 + 1)(T_1 + 1)), and this will make a difference in our discussion of moment conditions in these two examples below.
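As a concrete companion to Example 1B, the following short Python sketch (our own illustration, not part of the paper's formal development) computes the outcome probabilities f(y | α, θ) for the probit case as a product of two binomial probabilities; the function names `F_probit` and `f_outcome` are ours.

```python
import math

def F_probit(u):
    # standard normal cdf, written via the error function
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def f_outcome(y0, y1, alpha, theta, T0, T1, F=F_probit):
    """Model probability f(y | alpha, theta) in Example 1B, where the
    outcome y = (y0, y1) counts successes in the X=0 and X=1 periods."""
    p0 = F(alpha)          # P(Y*_it = 1) in periods with X_it = 0
    p1 = F(alpha + theta)  # P(Y*_it = 1) in periods with X_it = 1
    return (math.comb(T0, y0) * p0**y0 * (1 - p0)**(T0 - y0)
            * math.comb(T1, y1) * p1**y1 * (1 - p1)**(T1 - y1))

# sanity check: the (T0+1)(T1+1) outcome probabilities sum to one
T0 = T1 = 2
total = sum(f_outcome(y0, y1, alpha=0.3, theta=1.0, T0=T0, T1=T1)
            for y0 in range(T0 + 1) for y1 in range(T1 + 1))
```

The example values α = 0.3, θ = 1 are arbitrary; any values give a probability vector over the (T_0 + 1)(T_1 + 1) outcomes.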

Main idea behind functional differencing
We now explain the main idea behind the functional differencing method of Bonhomme (2012). Our presentation is similar to that in Honoré and Weidner (2020). However, our goal here is much closer to that in Bonhomme's original paper, because we want to describe a general estimation method, one that is applicable to a very large class of models, as opposed to obtaining an analytical expression for moment conditions in specific models.

Exact moment conditions
Consider the model described by (1) and (2), where our goal is to estimate θ_0. Functional differencing (Bonhomme 2012) aims to find moment functions m(y_i, x_i, θ) ∈ R^{d_m} such that the model implies, for all x_i and α_i, that

E[m(Y_i, X_i, θ_0) | X_i = x_i, A_i = α_i] = 0,    (4)

or, equivalently,

Σ_{y∈Y} m(y, x_i, θ) f(y | x_i, α_i, θ) = 0,    (5)

since we want (4) to hold for all possible θ_0 ∈ Θ. Verifying that m(y_i, x_i, θ) satisfies this conditional moment condition only requires knowledge of the model f(y_i | x_i, α_i, θ), not of the observed data. Note that m(y_i, x_i, θ) does not depend on α_i, but nevertheless should have zero mean conditional on any realization A_i = α_i. This is a strong requirement, and we will get back to this below.
Once we have found such valid moment functions m(y_i, x_i, θ), we can choose an arbitrary (matrix-valued) function g(x_i, θ) ∈ R^{d_m × d_m}, and define m_g(y_i, x_i, θ) := g(x_i, θ) m(y_i, x_i, θ), which is a vector of dimension d_m. By the law of iterated expectations, we then obtain, under weak regularity conditions, the unconditional moment condition

E[g(X_i, θ_0) m(Y_i, X_i, θ_0)] = 0,

which we can use to estimate θ_0 by the generalized method of moments (GMM, Hansen 1982). The nuisance parameters α_i do not feature in the GMM estimation at all; that is, functional differencing provides a solution to the incidental parameter problem (Neyman and Scott 1948).
Of course, the key condition for consistent GMM estimation is that E[g(X_i, θ) m(Y_i, X_i, θ)] ≠ 0 for any θ ≠ θ_0. This identification condition is violated if m(y_i, x_i, θ) does not depend on θ (a special case of which is m(y_i, x_i, θ) = 0, which is a trivial solution to (4)). Hence the moment functions must depend on θ to be informative about θ_0.

Uninformative moment functions in Example 1A
To give an example of a moment function that is uninformative about θ_0, consider Example 1A. Let t and s be two time periods where X_it = X_is. Let Y_i,−(t,s) ∈ {0, 1}^{T−2} be the outcome vector Y_i from which the outcomes Y_it and Y_is are dropped. Then, since X_it = X_is, the outcomes Y_it and Y_is are exchangeable, and therefore, for any function g : {0, 1}^{T−2} → R, the moment function

m(y_i, x_i, θ) = (y_it − y_is) g(y_i,−(t,s))    (6)

satisfies (4), but it does not depend on θ and is therefore uninformative about θ_0. For the logit model, by contrast, Y_{i,0} + Y_{i,1} is a sufficient statistic for A_i, so the distribution of Y_i conditional on Y_{i,0} + Y_{i,1} does not depend on A_i. It is well known that this implies that the corresponding conditional MLE of θ_0 is consistent as n → ∞, for any fixed T ≥ 2; see, e.g., Rasch (1960) and Andersen (1970). The null space of the linear mapping m_A → m_B, from moment functions in Example 1A to moment functions in Example 1B, is spanned by moment functions of the form (6). This implies that the difference m_A(y*_i, θ) − m_B(y(y*_i), θ) is a linear combination of moment functions of the form (6).
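The exchangeability argument can be checked numerically. The sketch below (our own construction) takes two periods t, s with X_it = X_is and an arbitrary illustrative function g, and verifies that the moment function (y_it − y_is) g(y_i,−(t,s)) has zero mean conditional on every value of α, while not depending on θ at all; all names here are ours.

```python
import itertools
import math

def F(u):  # probit link: standard normal cdf
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

T, theta = 3, 1.0
t, s = 0, 1                  # two periods with the same covariate value
x = [0, 0, 1]                # X_it = 0 in periods 0,1 and X_it = 1 in period 2

def f_y(y, alpha):
    """f(y | x, alpha, theta) in Example 1A: product over time periods."""
    out = 1.0
    for yt, xt in zip(y, x):
        p = F(xt * theta + alpha)
        out *= p if yt == 1 else 1 - p
    return out

def m(y):
    """(y_t - y_s) * g(y_{-(t,s)}) for an arbitrary illustrative g."""
    rest = tuple(yt for j, yt in enumerate(y) if j not in (t, s))
    return (y[t] - y[s]) * (1 + 2 * rest[0])

# the conditional mean is zero at every alpha: valid, but free of theta
bias = max(abs(sum(m(y) * f_y(y, a)
                   for y in itertools.product((0, 1), repeat=T)))
           for a in (-1.0, 0.0, 0.7, 2.0))
```

Because Y_t and Y_s are i.i.d. given (X, A), the two terms with (y_t, y_s) = (1, 0) and (0, 1) cancel exactly for every α.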
The observation that the existence of a sufficient statistic for the nuisance parameter A_i allows for identification and estimation of θ_0 is quite old (e.g., Rasch 1960). However, the reason that the functional differencing method is truly powerful is that moment functions satisfying (4) and identifying θ_0 may exist even in models where no sufficient statistic for A_i is available. Examples of this are given by Honoré (1992), Hu (2002), Johnson (2004), Kitazawa (2013), Honoré and Weidner (2020), Honoré, Muris and Weidner (2021), and Davezies, D'Haultfoeuille and Mugnier (2022). Bonhomme (2012) provides a computational method for obtaining moment functions m(y_i, x_i, θ) such that (4) holds in a large class of models, while Honoré and Weidner (2020) discuss how to obtain explicit algebraic formulas for moment conditions in specific models. Dobronyi, Gu and Kim (2021) show that additional moment inequalities may exist that contain identifying information on θ_0 that is not contained in the moment equalities.
Our example of a moment function in (7) is convenient and easy to understand, but it is not really representative of the potential complexity of more general moment functions.
The papers cited in the previous paragraph give a better view of the true capability of the functional differencing method in more challenging settings.

Approximate moment conditions
Functional differencing is a very powerful and useful method. Nevertheless, there are many models to which it is not applicable. The reason is that the condition in equation (4) is actually quite strong. It requires us to find a function m(Y_i, X_i, θ) that does not depend on A_i at all, but that is supposed to have a conditional mean of zero for any possible realization of A_i. In most standard panel data models A_i takes values in R (A_i can also be a vector), implying that (4) imposes an infinite number of linear restrictions.
It is therefore perhaps unsurprising that there are many panel data models for which (4) has no non-trivial solution at all. In Example 1B we have shown the existence of valid moment functions for the logit model, but it turns out that no valid moment function exists for the probit model when θ_0 ≠ 0 (we have verified this non-existence numerically for many values of T and T_0).
Instead of trying to find moment functions satisfying (4), and hence (5), exactly, we argue that it can also be fruitful to search for moment functions that satisfy these conditions only approximately, i.e.,

E[m(Y_i, X_i, θ_0) | X_i = x_i, A_i = α_i] ≈ 0.    (8)

For a given model f(y_i | x_i, α_i, θ) we might not be able to find an exact solution to (5), but we might be able to find a very good approximate solution.
Examples of approximate moment conditions are provided by the "large-T " panel data literature, which considers asymptotic sequences where also T → ∞ (jointly with n → ∞).
To illustrate the insights of this literature, let α̂_i(θ) be the MLE of α_i obtained from maximizing the log-likelihood over α for given θ, that is, α̂_i(θ) = argmax_{α∈A} log f(Y_i | X_i, α, θ). Then, under standard regularity conditions, α̂_i(θ_0) is consistent for A_i as T → ∞, and a useful approximate moment function is therefore given by the profile score m(y_i, x_i, θ) = ψ(y_i, x_i, α̂_i(θ), θ), where ψ(y_i, x_i, α_i, θ) := ∇_θ log f(y_i | x_i, α_i, θ).
In this example, the vague approximate statement in (8) can be made precise, namely one can show that, as T → ∞,

E[m(Y_i, X_i, θ_0) | X_i = x_i, A_i = α_i] = O(T^{−1}),    (9)

implying that the estimator of θ_0 obtained from this approximate moment condition also has a bias of order T^{−1}. It is possible to correct this bias and obtain moment functions where the right hand side of (9) is of order T^{−2} or smaller, implying an even smaller bias for estimates of θ_0 when T is sufficiently large; see, e.g., Hahn and Newey (2004), Arellano and Hahn (2007, 2016), Arellano and Bonhomme (2009), and Dhaene and Jochmans (2015b). In this paper, we aim for even higher-order bias correction, where the remaining bias is only of order T^{−q}, for some integer q > 0, because we expect better small-T estimation properties from such higher-order corrections, and it allows us to connect the bias correction results with the functional differencing method. However, by correcting the bias in this way, one might very well increase the variance of the resulting estimator for θ_0, as we will see in our Monte Carlo simulations in Section 7. The question of how to optimally trade off the bias vs. the variance (using, e.g., a mean squared error criterion function, as in Bonhomme and Weidner 2022) is interesting and could lead to an optimal choice of the bias correction order q, but we will not formalize this in the current paper.

Approximate functional differencing
In this section, we answer the following questions: In a model where exactly valid moment functions as in (4) do not exist, is it still possible to construct useful moment functions m(y_i, x_i, θ) that are approximately valid as in (8)? And if yes, how can we construct such moment functions in a principled way?

Notation and preliminaries
Our discussion in this section is at the "population level", that is, for one representative unit i. In the following, we therefore drop the index i throughout. For example, instead of Y_i, X_i, A_i we simply write Y, X, A.

Prior distribution of the fixed effects
Let π_prior(α | x) be a chosen prior distribution for A, conditional on X = x. The prior should integrate to one, that is, ∫_A π_prior(α | x) dα = 1, for all x ∈ X, but we do not require π_prior to be identical to π_0. We require non-zero prior probability for all points in the support of A, i.e.,

π_prior(α | x) > 0  for all α ∈ A and all x ∈ X.    (10)

The prior does not need to depend on x; we may choose π_prior(α | x) = π_prior(α), which may be easier to specify in practice, but we allow for general priors π_prior(α | x) in the following.

Posterior distribution of the fixed effects
Given the chosen prior π_prior, the posterior distribution of A, conditional on Y = y and X = x, is given by

π_post(α | y, x, θ) = f(y | x, α, θ) π_prior(α | x) / p_prior(y | x, θ),

where

p_prior(y | x, θ) = ∫_A f(y | x, α, θ) π_prior(α | x) dα

is the prior predictive probability of outcome y.
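These two objects are easy to compute numerically. The sketch below (our own, for the probit version of Example 1B with a standard normal prior) discretizes the prior by Gauss-Hermite quadrature; the function names `p_prior` and `pi_post`, the quadrature order, and the test values are our choices.

```python
import math
import numpy as np

def F(u):  # probit link: standard normal cdf
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def f_outcome(y0, y1, alpha, theta, T0, T1):
    p0, p1 = F(alpha), F(alpha + theta)
    return (math.comb(T0, y0) * p0**y0 * (1 - p0)**(T0 - y0)
            * math.comb(T1, y1) * p1**y1 * (1 - p1)**(T1 - y1))

# Gauss-Hermite discretization of the N(0,1) prior:
# int g(a) phi(a) da  ~=  sum_k pi_prior[k] * g(alphas[k])
nodes, w = np.polynomial.hermite.hermgauss(60)
alphas = math.sqrt(2.0) * nodes
pi_prior = w / math.sqrt(math.pi)        # weights sum to one

def p_prior(y0, y1, theta, T0, T1):
    """Prior predictive probability p_prior(y | theta) of y = (y0, y1)."""
    vals = np.array([f_outcome(y0, y1, a, theta, T0, T1) for a in alphas])
    return float(vals @ pi_prior)

def pi_post(y0, y1, theta, T0, T1):
    """Posterior weights pi_post(alpha_k | y, theta) on the quadrature grid."""
    vals = np.array([f_outcome(y0, y1, a, theta, T0, T1) for a in alphas])
    wpost = vals * pi_prior
    return wpost / wpost.sum()

T0 = T1 = 2
mass = sum(p_prior(y0, y1, 1.0, T0, T1)
           for y0 in range(T0 + 1) for y1 in range(T1 + 1))
post = pi_post(2, 1, 1.0, T0, T1)
```

By construction, the prior predictive probabilities sum to one over the outcome set, and each posterior is a proper distribution over the grid.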

Score function
Let s : Y × X × Θ → R^{d_s} be some function, which we will call the "score function". In our numerical illustrations in Section 7 we choose the integrated score

s(y, x, θ) = ∇_θ log p_prior(y | x, θ),

where d_s = d_θ. However, for our construction of moment functions in the following subsection, which is based on a chosen score function, we can actually choose almost any function s(y, x, θ), as long as it is differentiable in θ and not identically zero. For example, the profile score s(y, x, θ) = ∇_θ [max_{α∈A} log f(y | x, α, θ)] would be an equally natural choice.
Whatever the choice of s(y, x, θ), we now rewrite it using matrix notation. Let n_Y = |Y| be the number of possible outcomes, and label the elements of the outcome set as Y = {y^(1), ..., y^(n_Y)}. Collect the score values into the d_s × n_Y matrix S(x, θ) with columns s(y^(j), x, θ), so that s(y, x, θ) = S(x, θ) δ(y), where δ(y) ∈ R^{n_Y} is the indicator vector selecting the element of Y equal to y. Also define the n_Y × n_Y matrix Q(x, θ) of posterior predictive probabilities, with entries

Q_{jk}(x, θ) = ∫_A f(y^(k) | x, α, θ) π_post(α | y^(j), x, θ) dα.

The following lemma states some properties of the matrix Q(x, θ) that will be useful later.
Lemma 1. Let x ∈ X and θ ∈ Θ. Assume that p_prior(y | x, θ) > 0 for all y ∈ Y. Then Q(x, θ) is diagonalizable and all its eigenvalues are real numbers in the interval [0, 1].
The proof of the lemma is given in the appendix.
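To make the matrix Q(x, θ) concrete, here is a small numerical sketch (ours, not from the paper's formal development) that builds Q(θ) for the probit version of Example 1B with T_0 = T_1 = 2 and θ = 1, integrating against a standard normal prior by Gauss-Hermite quadrature; the names `Fm`, `prior`, and `Q` are our own. The checks at the end illustrate Lemma 1 numerically: the rows of Q sum to one, and all eigenvalues are real numbers in [0, 1].

```python
import math
import numpy as np

def F(u):
    """Standard normal cdf (probit link)."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

T0 = T1 = 2
theta = 1.0
outcomes = [(y0, y1) for y0 in range(T0 + 1) for y1 in range(T1 + 1)]  # n_Y = 9

# Gauss-Hermite quadrature for the N(0,1) prior
nodes, w = np.polynomial.hermite.hermgauss(60)
alphas = math.sqrt(2.0) * nodes
prior = w / math.sqrt(math.pi)            # weights sum to one

# n_Y x K matrix of model probabilities f(y^{(j)} | alpha_k, theta)
Fm = np.array([[math.comb(T0, y0) * F(a)**y0 * (1 - F(a))**(T0 - y0)
                * math.comb(T1, y1) * F(a + theta)**y1 * (1 - F(a + theta))**(T1 - y1)
                for a in alphas] for (y0, y1) in outcomes])

p = Fm @ prior                            # prior predictive probabilities
Q = (Fm * prior) @ Fm.T / p[:, None]      # posterior predictive matrix Q(theta)

# Lemma 1, numerically: Q is row-stochastic with real eigenvalues in [0, 1];
# the symmetric similarity transform below has the same spectrum as Q
row_sums = Q.sum(axis=1)
evals = np.linalg.eigvalsh((Fm * prior) @ Fm.T / np.sqrt(np.outer(p, p)))
```

The largest eigenvalue equals one because Q is a stochastic matrix.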

Bias-corrected score functions
We consider the chosen score function s(y, x, θ) as our first candidate for a moment function m(y, x, θ) to achieve (8). However, E[s(Y, X, θ_0)] need neither be zero nor close to zero. (As discussed above, there are choices of s(y, x, θ) such that E[s(Y, X, θ_0)] is close to zero for large T, but even for those choices E[s(Y, X, θ_0)] need not be close to zero for small T.) Therefore, we aim to "bias-correct" the score by defining an improved score

s^(1)(y, x, θ) := s(y, x, θ) − b(y, x, θ),

for some correction function b(y, x, θ). The goal is to choose b(y, x, θ) such that the mean of the corrected score is closer to zero than that of the original score. We would achieve exact bias correction (i.e., E[s^(1)(Y, X, θ_0)] = 0) by choosing

b(y, x, θ) = Σ_{ỹ∈Y} s(ỹ, x, θ) ∫_A f(ỹ | x, α, θ) π_0(α | x) dα,    (15)

but this choice is infeasible because the true distribution π_0(α | x) is unknown. However, even though A is unobserved, the posterior distribution π_post(α | y, x, θ) should contain some useful information about A. Inspired by (15) we suggest choosing

b(y, x, θ) = Σ_{ỹ∈Y} s(ỹ, x, θ) ∫_A f(ỹ | x, α, θ) π_post(α | y, x, θ) dα,

where we have replaced π_0(α | x) by π_post(α | y, x, θ). This gives the bias-corrected score

s^(1)(y, x, θ) = S(x, θ) [I_{n_Y} − Q(x, θ)]' δ(y),    (16)

where I_{n_Y} is the n_Y × n_Y identity matrix. Now, if the expected posterior distribution E[π_post(α | Y, x, θ_0) | X = x] is close to π_0(α | x), then E[s^(1)(Y, X, θ_0)] should be close to zero; that is, s^(1)(y, x, θ) should be a better choice than s(y, x, θ) as a moment function m(y, x, θ) satisfying (8). Note also that in the very special case where the prior equals the true distribution of A (i.e., π_prior = π_0), the expected posterior distribution equals π_0(α | x) exactly, and hence s^(1)(Y, X, θ_0) has exactly zero mean regardless of the choice of initial score function.
Of course, generically we still expect that E[s^(1)(Y, X, θ_0)] ≠ 0. It is, therefore, natural to iterate the above procedure, that is, to apply the same bias-correction method that we applied to s(y, x, θ) also to s^(1)(y, x, θ), which gives s^(2)(y, x, θ), and to continue iterating this procedure. Since the bias-correction method applied to s(y, x, θ) = S(x, θ) δ(y) gives (16), it is easy to see that after q ∈ {1, 2, ...} iterations of the same bias-correction procedure we obtain

s^(q)(y, x, θ) = S(x, θ) {[I_{n_Y} − Q(x, θ)]^q}' δ(y).    (17)

The bias-corrected functions s^(q)(y, x, θ) are the main choices of moment function m(y, x, θ) that we consider in this paper.
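The iteration is just repeated multiplication by I − Q. The sketch below (our own, for the probit version of Example 1B with T_0 = T_1 = 2 and θ = 1, standard normal prior discretized by Gauss-Hermite quadrature) also illustrates the special case noted above: when the prior equals the true distribution of A, as it does here by construction, a single correction step makes even an arbitrary (clearly biased) initial score exactly unbiased. All names (`corrected`, `Fm`, `s0`, ...) are ours.

```python
import math
import numpy as np

def F(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

T0 = T1 = 2
theta = 1.0
outcomes = [(y0, y1) for y0 in range(T0 + 1) for y1 in range(T1 + 1)]

nodes, w = np.polynomial.hermite.hermgauss(60)
alphas = math.sqrt(2.0) * nodes
prior = w / math.sqrt(math.pi)

Fm = np.array([[math.comb(T0, y0) * F(a)**y0 * (1 - F(a))**(T0 - y0)
                * math.comb(T1, y1) * F(a + theta)**y1 * (1 - F(a + theta))**(T1 - y1)
                for a in alphas] for (y0, y1) in outcomes])
p = Fm @ prior                          # here also the true outcome distribution
Q = (Fm * prior) @ Fm.T / p[:, None]

def corrected(s, q):
    """q bias-correction steps: s^{(q)} = (I - Q)^q s, applied iteratively."""
    v = np.asarray(s, dtype=float).copy()
    for _ in range(q):
        v = v - Q @ v                   # subtract b(y) = sum_k Q[y,k] * v[k]
    return v

# arbitrary initial score over the outcome set, clearly biased at q = 0
s0 = np.array([float(y0 - 2 * y1) for (y0, y1) in outcomes])
bias0 = abs(float(p @ corrected(s0, 0)))
bias1 = abs(float(p @ corrected(s0, 1)))
```

The exact unbiasedness after one step reflects that the prior predictive probability vector p is a left eigenvector of Q with eigenvalue one.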
To our knowledge, the bias-corrected scores s^(q)(y, x, θ) have not previously been discussed in the literature. However, as will be explained in Section 8.1, these bias-corrected scores are closely related to existing bias-correction methods. In particular, if in (16) we replace the posterior distribution π_post(α | y, x, θ) with a point-mass distribution at the MLE of A for fixed θ, then the profile-score adjustments of Dhaene and Jochmans (2015a) are obtained. In analogy to such existing methods, we make the following conjecture, which is supported by our numerical results in Section 7 below for the model in Examples 1A and 1B, and which we presume to hold more generally under appropriate regularity conditions.
Conjecture 1. If we choose the original score function s(y, x, θ) to be the integrated or profile score, then both E[s^(q)(Y, X, θ_0)] and θ* − θ_0 are at most of order T^{−q−1}, as T → ∞ while q is fixed.
From the perspective of the large-T panel literature, we have simply provided another way to achieve and iterate large-T bias correction. What is truly novel, however, is that in the limit q → ∞, our correction can be directly related to the functional differencing method of Bonhomme (2012), which delivers exact (finite-T) inference results in models to which it is applicable. Our procedure, therefore, interpolates between large-T bias correction and exact finite-T inference.
Remark 1: If the initial score function s(y, x, θ) is unbiased, that is, if it satisfies the exact moment condition (4) or, equivalently, (5), then the bias correction step (16) does not change the score function at all. More generally, if at any point in the iteration procedure (17) we obtain an unbiased score function, i.e., one for which Σ_{y∈Y} s^(q)(y, x, θ) f(y | x, α, θ) = 0 for all α, for some q, then we have s^(r)(y, x, θ) = s^(q)(y, x, θ) for all further iterations r ≥ q. Hence, unbiased score functions correspond to fixed points in our iteration procedure.

Relation to functional differencing
It turns out that exact functional differencing (Bonhomme 2012) corresponds to choosing s^∞(y, x, θ) := lim_{q→∞} s^(q)(y, x, θ) as the moment function for estimating θ_0, and the lemma below formalizes this relationship.
Before presenting the lemma, it is useful to rewrite the bias-corrected score function in (17) in terms of the spectral decomposition of Q(x, θ). Let λ_1(x, θ) ≥ ... ≥ λ_{n_Y}(x, θ) be the eigenvalues of Q(x, θ) sorted in descending order, and let U(x, θ) be the n_Y × n_Y matrix whose columns are the corresponding right-eigenvectors of Q(x, θ). Lemma 1 guarantees that λ_k(x, θ) ∈ [0, 1], for all k = 1, ..., n_Y, and that Q(x, θ) can be diagonalized, that is,

Q(x, θ) = U(x, θ) diag[λ_1(x, θ), ..., λ_{n_Y}(x, θ)] U(x, θ)^{−1}.

Now let h : [0, 1] → R be a stem function and let h(·) be the associated primary matrix function (Horn and Johnson 1994), so that

h[Q(x, θ)] = U(x, θ) diag[h(λ_1(x, θ)), ..., h(λ_{n_Y}(x, θ))] U(x, θ)^{−1}.    (18)

That is, applying h(·) to the matrix Q(x, θ) simply means applying the stem function separately to each eigenvalue of Q(x, θ) while leaving the eigenvectors unchanged. Every stem function h defines a moment function

s_h(y, x, θ) := S(x, θ) {h[Q(x, θ)]}' δ(y),

hence generalizing (17). In particular, the stem function h_q(λ) := (1 − λ)^q gives the moment function s^(q)(y, x, θ) = s_{h_q}(y, x, θ) in (17). In the limit q → ∞ we obtain lim_{q→∞} h_q(λ) = 1{λ = 0}, for λ ∈ [0, 1], and the limiting bias-corrected score function can therefore be written as

s^∞(y, x, θ) = S(x, θ) {h_∞[Q(x, θ)]}' δ(y),  with h_∞(λ) := 1{λ = 0}.

Thus, s^∞(y, x, θ) is obtained by applying the projection h_∞[Q(x, θ)] to the original score function s(y, x, θ). The projection matrix h_∞[Q(x, θ)] is obtained according to (18) by giving weight only to eigenvectors of Q(x, θ) associated with zero eigenvalues.
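A primary matrix function is easy to implement directly from an eigendecomposition. The following sketch (our own; probit Example 1B with T_0 = T_1 = 2, θ = 1, standard normal prior, quadrature order of our choosing) checks that the stem function h_q(λ) = (1 − λ)^q applied to Q reproduces the matrix power (I − Q)^q.

```python
import math
import numpy as np

def F(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

T0 = T1 = 2
theta = 1.0
outcomes = [(y0, y1) for y0 in range(T0 + 1) for y1 in range(T1 + 1)]
nodes, w = np.polynomial.hermite.hermgauss(60)
alphas = math.sqrt(2.0) * nodes
prior = w / math.sqrt(math.pi)
Fm = np.array([[math.comb(T0, y0) * F(a)**y0 * (1 - F(a))**(T0 - y0)
                * math.comb(T1, y1) * F(a + theta)**y1 * (1 - F(a + theta))**(T1 - y1)
                for a in alphas] for (y0, y1) in outcomes])
p = Fm @ prior
Q = (Fm * prior) @ Fm.T / p[:, None]

def primary_matrix_function(Q, h):
    """Apply a stem function h to each eigenvalue of Q, keeping eigenvectors."""
    lam, U = np.linalg.eig(Q)
    return np.real((U * h(np.real(lam))) @ np.linalg.inv(U))

q = 3
Hq = primary_matrix_function(Q, lambda lam: (1.0 - lam) ** q)
Pq = np.linalg.matrix_power(np.eye(len(outcomes)) - Q, q)
err = np.max(np.abs(Hq - Pq))
```

The same function with h(λ) = 1{λ = 0} (e.g., `lambda lam: (np.abs(lam) < tol).astype(float)` for a small tolerance of one's choosing) gives the projection used for s^∞.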
Lemma 2. Let x ∈ X. Suppose that (1) and (10) hold, that p_prior(y | x, θ_0) > 0 for all y ∈ Y, and that f(y | x, α, θ_0) is continuous in α ∈ A. Then (i) we have E[s^∞(Y, x, θ_0) | X = x, A = α] = 0 for all α ∈ A; (ii) the matrix Q(x, θ_0) has a zero eigenvalue if and only if there exists a non-zero moment function m(y, x, θ_0) ∈ R with E[m(Y, x, θ_0) | X = x, A = α] = 0 for all α ∈ A; (iii) for every moment function m(y, x, θ_0) ∈ R that satisfies the condition in part (ii), there exists a function s(y, x, θ_0) ∈ R such that m(y, x, θ_0) = s^∞(y, x, θ_0), for all y ∈ Y.
The proof of the lemma is given in the appendix. Note that the true parameter value, θ_0, only takes a special role in Lemma 2 because the expectation is evaluated using f(y | x, α, θ_0), according to (1). If we had written these conditional expectations as explicit sums over f(y | x, α, θ_0), then we could have replaced θ_0 in the lemma by an arbitrary value θ ∈ Θ; that is, there is nothing special about the parameter value θ_0 that generates the data.
Part (i) of the lemma states that s^∞(y, x, θ) is an exactly valid moment function in the sense of (4). If Q(x, θ) does not have any zero eigenvalues, then this part of the lemma is a trivial result, because then we simply have s^∞(y, x, θ) = 0, which is not useful for estimating θ_0. However, if Q(x, θ) does have one or more zero eigenvalues, then, for a generic choice of s(y, x, θ), we have s^∞(y, x, θ) ≠ 0, and part (i) of the lemma becomes non-trivial.
Part (ii) of the lemma states that the existence of a zero eigenvalue of Q(x, θ) is indeed a necessary and sufficient condition for the existence of an exactly valid moment function in the sense of (4). As explained in the proof, if Q(x, θ) has a zero eigenvalue, then an exactly valid moment function m(y, x, θ) is simply obtained from the entries of the corresponding left-eigenvector of Q(x, θ).
Finally, part (iii) of the lemma states that any such exactly valid moment function m(y, x, θ) can be obtained as lim_{q→∞} s^(q)(y, x, θ), i.e., as the limit of our iterative bias correction scheme above, for some appropriately chosen initial score function s(y, x, θ). Thus, the set of valid moment functions is identical to the set of all possible limits s^∞(y, x, θ).
Recall that finding such exactly valid moment functions is the underlying idea of the functional differencing method of Bonhomme (2012). Thus, Lemma 2 establishes a very close relationship between our bias correction method and functional differencing.
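Lemma 2 can be illustrated numerically in the logit version of Example 1B, where exact moment functions exist. The sketch below (our own construction; T_0 = T_1 = 2, θ = 1, standardized logistic errors, standard normal prior on a Gauss-Hermite grid, all thresholds our choices) extracts the zero-eigenvalue directions of Q, maps them to moment functions over outcomes, and checks both their validity at α values off the quadrature grid and the fixed-point property from Remark 1.

```python
import math
import numpy as np

T0 = T1 = 2
theta = 1.0
c = math.pi / math.sqrt(3.0)                    # standardized logistic errors
F = lambda u: 1.0 / (1.0 + math.exp(-c * u))

outcomes = [(y0, y1) for y0 in range(T0 + 1) for y1 in range(T1 + 1)]
nodes, w = np.polynomial.hermite.hermgauss(80)
alphas = math.sqrt(2.0) * nodes
prior = w / math.sqrt(math.pi)

def f_row(alpha):
    """Vector of f(y | alpha, theta) over all n_Y outcomes."""
    return np.array([math.comb(T0, y0) * F(alpha)**y0 * (1 - F(alpha))**(T0 - y0)
                     * math.comb(T1, y1) * F(alpha + theta)**y1
                     * (1 - F(alpha + theta))**(T1 - y1)
                     for (y0, y1) in outcomes])

Fm = np.column_stack([f_row(a) for a in alphas])   # n_Y x K
p = Fm @ prior
M = (Fm * prior) @ Fm.T
lam, V = np.linalg.eigh(M / np.sqrt(np.outer(p, p)))   # symmetrized Q

# zero-eigenvalue directions, mapped back to moment functions over outcomes
null_vecs = V[:, np.abs(lam) < 1e-10]
moments = null_vecs / np.sqrt(p)[:, None]

# validity: zero conditional mean at alpha values NOT on the quadrature grid
resid = max(abs(float(moments[:, j] @ f_row(a)))
            for j in range(moments.shape[1]) for a in (-0.63, 0.11, 1.37))

# fixed-point property: one bias-correction step leaves these unchanged
Q = M / p[:, None]
fp_err = float(np.max(np.abs((moments - Q @ moments) - moments)))
```

For the logit model with T = 4 there are (T/2)^2 = 4 such directions, matching the discussion of zero eigenvalues in Section 5.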
Remark 2: If the set A is finite with cardinality n_A = |A|, then by construction rank[Q(x, θ)] ≤ n_A. For our purposes, the fact that rank[Q(x, θ)] ≤ n_A matters only in our numerical implementation, where the rank of Q(x, θ) might be truncated by the discretization of the set A.
Eigenvalues of Q(x, θ): numerical example

Lemma 1 guarantees that all eigenvalues of the matrix Q(x, θ) lie in the interval [0, 1], and Lemma 2 shows that exact moment conditions that are free of the incidental parameter A are only available if Q(x, θ) has a zero eigenvalue. However, even in models where Q(x, θ) does not have a zero eigenvalue, we suggest that calculating the eigenvalues of Q(x, θ) is generally informative about whether moment conditions exist that are approximately free of the incidental parameters. This is because in typical applications we expect the distinction between a zero eigenvalue and a very small eigenvalue of Q(x, θ) to be practically irrelevant; that is, as long as Q(x, θ) has one or more eigenvalues that are very close to zero, very good approximate moment conditions in the sense of (8) should exist.
It is difficult to make a general statement about how small an eigenvalue of Q(x, θ) needs to be to qualify as sufficiently small. However, in a typical model with a sufficiently large number n_Y of outcomes (which for discrete choice panel data usually requires only moderately large T) one will often have multiple eigenvalues of Q(x, θ) that are so small (say, smaller than 10^{−5}) that there is little doubt that they can be considered equal to zero for practical purposes.
To illustrate this, consider Example 1B with normally distributed errors, F(u) = Φ(u), even values of T, and T_0 = T_1 = T/2, which implies n_Y = (1 + T/2)^2. For the prior distribution of A we choose the standard normal distribution, π_prior(α) = φ(α). Figures 1 and 2 plot the resulting eigenvalues of Q(θ) for θ = 1. From these figures, we see that for T ≥ 4 the smallest eigenvalue of Q(1) is less than 10^{−9}, which we argue can be considered equal to zero for practical purposes. In Figures 1 and 2 we see that the largest eigenvalue, λ_1, is equal to one (because Q(1) is a stochastic matrix), but then the eigenvalues λ_j decay exponentially fast as j increases. If we were to replace the standard normal distribution of the errors by the standardized logistic cdf F(u) = (1 + e^{−πu/√3})^{−1} (normalized to have variance one), then the left-hand side (non-logarithmic) plots in Figures 1 and 2 would look almost identical, but there would be (T/2)^2 eigenvalues exactly equal to zero. These zero eigenvalues for the logit model are due to the existence of a sufficient statistic for A and, correspondingly, the existence of exact moment functions (associated with the left null-space of Q(θ)), as discussed in Section 3.1 above. Given that the change from standardized logistic errors to standard normal errors is a relatively minor modification of the model, it is not surprising that we see many eigenvalues close to zero in Figures 1 and 2.
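The probit versus logit comparison is easy to reproduce. The following sketch (ours; T = 4, θ = 1, standard normal prior, Gauss-Hermite quadrature, with tolerances of our choosing) computes the full spectrum of Q(1) for both error distributions and counts the logit model's exact zeros.

```python
import math
import numpy as np

T0 = T1 = 2                    # T = 4, so n_Y = 9
theta = 1.0
outcomes = [(y0, y1) for y0 in range(T0 + 1) for y1 in range(T1 + 1)]
nodes, w = np.polynomial.hermite.hermgauss(80)
alphas = math.sqrt(2.0) * nodes
prior = w / math.sqrt(math.pi)

def eigenvalues(F):
    """Eigenvalues of Q(theta), computed via a symmetric similarity transform."""
    Fm = np.array([[math.comb(T0, y0) * F(a)**y0 * (1 - F(a))**(T0 - y0)
                    * math.comb(T1, y1) * F(a + theta)**y1 * (1 - F(a + theta))**(T1 - y1)
                    for a in alphas] for (y0, y1) in outcomes])
    p = Fm @ prior
    M = (Fm * prior) @ Fm.T
    return np.sort(np.linalg.eigvalsh(M / np.sqrt(np.outer(p, p))))[::-1]

probit = eigenvalues(lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0))))
# standardized logistic errors (normalized to variance one)
logit = eigenvalues(lambda u: 1.0 / (1.0 + math.exp(-math.pi * u / math.sqrt(3.0))))

n_zero_logit = int(np.sum(np.abs(logit) < 1e-10))   # expect (T/2)^2 = 4
smallest_probit = abs(probit[-1])                    # tiny, but not exactly zero
```

The logit zeros are exact because the outcome probabilities are linearly dependent as functions of α; the probit spectrum only comes very close to zero.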
Figure 3 shows that the smallest eigenvalue of Q(1) in this example also decays exponentially fast as T increases. 4However, for none of the values of T that we considered here, did we find an exact zero eigenvalue for the static binary choice probit model.We conjecture that this is true for all T ≥ 2, but we have no proof. 51 2 3 4 5 6 7 8 9  0   0  Figure 3: For the same setting as in Figure 1, but for different values of T (with T 0 = T 1 = T /2), we plot only the smallest eigenvalue of Q(θ) for θ = 1.Notice that the smallest eigenvalue is never zero, that is, Q(1) has full rank for all values of T considered.This example illustrates that eigenvalues of Q(x, θ) very close to zero but not exactly zero may exist in interesting models.When aiming to estimate the parameter θ 0 in a particular model of the form (2), our first recommendation is to calculate the eigenvalues of Q(x, θ) for some representative values of θ and x to see if some of them are zero or close to zero.If some are equal to zero, then exact functional differencing (Bonhomme 2012) is applicable.If some are very close to zero, then approximate moment functions (as in ( 8)) are available.
The eigenvalues of Q(x, θ) are useful to examine whether exact or approximate moment functions for θ are available in a given model. However, as explained in Section 3.1, the corresponding moment functions also have to depend on θ to be useful for parameter estimation. For example, the matrix Q(1) in Example 1A has exactly the same non-zero eigenvalues as the matrix Q(1) in Example 1B that we just discussed, but in addition it has a zero eigenvalue with multiplicity equal to 2^T − (T_0 + 1)(T_1 + 1), corresponding to the uninformative moment functions in equation (6). As a diagnostic tool, it can also be useful to calculate the matrix Q(x) for the model f(y|x, θ_i, α_i), which has no common parameters and in which both θ_i and α_i are individual-specific fixed effects (this requires choosing a prior for θ as well, which may have finite support to keep the computation simple). Every zero eigenvalue of that matrix Q(x) then corresponds to a moment function, for that value of x, that does not depend on θ (within the range of the chosen prior for θ). The existence of the uninformative moment functions (6) in Example 1A, for example, can be detected in this way.

Estimation
Suppose we have chosen a prior distribution π_prior, an initial score function s(y, x, θ), and an order of bias correction, q. This gives the bias-corrected score function s^{(q)}(y, x, θ) as the moment function m(y, x, θ) for which the approximate moment condition (8) is assumed to hold. For simplicity, suppose that d_s = d_θ, so that the number of moment conditions equals the number of common parameters we want to estimate. We can then define a pseudo-true value θ* ∈ Θ as the solution of the population moment condition E[m(Y_i, X_i, θ*)] = 0. The corresponding method of moments estimator θ̂ satisfies (1/n) Σ_{i=1}^n m(Y_i, X_i, θ̂) = 0. Under appropriate regularity conditions, including existence and uniqueness of θ*, we then have asymptotic normality, √n (θ̂ − θ*) →_d N(0, V*), with asymptotic variance V* given by the usual sandwich formula. In Section 7 we will report the bias θ* − θ_0 and the asymptotic variance V* for different choices of moment functions in Example 1B. Note that reporting the bias θ* − θ_0 of the parameter estimates is more informative than reporting the bias E[m(Y_i, X_i, θ_0)] of the moment condition, in particular since the moment condition can be rescaled by an arbitrary factor.

(Footnote to the conjecture above: the conjecture need not extend to other error distributions. If, in Example 1B with θ = 1 and T_0 = T_1 = T/2, one chooses the error distribution F(u) to be the Laplace distribution with mean zero and scale one, then numerically we found that, for any choice of prior, the matrix Q(1) does not have a zero eigenvalue for T = 2 and T = 4, but it does for T = 6. So it is not impossible that something similar could happen for the probit model for sufficiently large T, although we do not expect it.)
How should q be chosen? If Q(x, θ) is singular, then a natural choice is q = ∞, as described in Section 4.3, because this delivers an exactly unbiased moment condition. If Q(x, θ) is nonsingular but some of its eigenvalues are small, then, recalling our discussion in Section 5, our general recommendation is to choose relatively large values of q. The larger the chosen q, the more we rely on the smallest eigenvalues of Q(x, θ), because contributions to s^{(q)}(y, x, θ) from larger eigenvalues of Q(x, θ) are downweighted more heavily as q increases. If none of the eigenvalues of Q(x, θ) is close to zero, then there are no moment conditions that hold approximately in the sense of (8). Yet even then, setting q > 0 is likely to improve on q = 0, although the remaining bias will in general still be non-negligible. Whatever the eigenvalues of Q(x, θ) are (and, indeed, whether Q(x, θ) is singular or not), q is a tuning parameter, and a principled way to choose it would be to optimize some criterion, for example the (estimated) mean squared error of θ̂. We leave this for further study. In our numerical illustrations in Section 7 we consider finite values of q up to q = 1000, as well as q = ∞.
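The downweighting can be seen directly from the spectral decomposition. A footnote later in the paper notes that, in this part of the development, h_q(λ) = (1 − λ)^q; assuming accordingly that the bias correction applies the matrix polynomial (I − Q)^q to the initial score, an eigenvector of Q(x, θ) with eigenvalue λ is scaled by the factor (1 − λ)^q. A toy check with a hypothetical 3 × 3 doubly stochastic matrix (eigenvalues 1, 0.7, 0.7):

```python
import numpy as np

# Hypothetical doubly stochastic Q with eigenvalues {1, 0.7, 0.7}:
# Q = 0.7 * I + 0.1 * J, where J is the all-ones matrix.
Q = 0.7 * np.eye(3) + 0.1 * np.ones((3, 3))

v1 = np.ones(3) / np.sqrt(3)                   # eigenvector for lambda = 1
v2 = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)   # eigenvector for lambda = 0.7

for q in (0, 1, 5, 20):
    M = np.linalg.matrix_power(np.eye(3) - Q, q)
    # each eigendirection is scaled by (1 - lambda)**q:
    # the lambda = 1 direction is annihilated for q >= 1,
    # the lambda = 0.7 direction shrinks like 0.3**q
    print(q, np.linalg.norm(M @ v1), np.linalg.norm(M @ v2))
```

So as q grows, only the directions associated with eigenvalues near zero survive in the corrected score, which is exactly the sense in which large q "relies on the smallest eigenvalues".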
Table 4 presents the corresponding simulations, showing that the finite-sample results are again very close to the asymptotic results reported in Table 3.

Case 3: Continuous regressor
Here we illustrate approximate functional differencing in a panel probit model with a single continuous regressor and fixed effects. Apart from the continuity of the regressor, the setup is identical to that in Cases 1 and 2 above; specifically, we consider Example 1A, with the integrated score as initial score. We set X_it ∼ N(0.5, 0.25), so that X_it has the same mean and variance (across t) as the binary regressor in Cases 1 and 2. For T ∈ {4, 6} and n = 1000, we generated a single data set X_it (t = 1, ..., T; i = 1, ..., n) to form X = {X_1, ..., X_n}, so the results are to be understood with reference to this X. Table 5 presents the asymptotic biases and variances for q up to 1000. (We do not consider q = ∞ here because, even though θ* and θ̂ remain well-defined in the limit q → ∞, the limiting values are generically determined by a single X_i ∈ X, namely the observation i for which Q(X_i, θ) has the smallest eigenvalue within the sample; that is, estimation would effectively be based on a single observation.) The results are similar to those in Table 1, although the asymptotic biases and variances are somewhat larger (for all q). Table 6 presents the corresponding simulation results, which are in line with those in Table 5.

Numerical calculations related to Conjecture 1
Table 7 reports bias calculations for larger values of T that support our conjecture about the rate of the bias as T grows. The model is as in Example 1B with standard normal errors, π_prior(α) = φ(α), T_0 = T_1 = T/2, and θ_0 = 1. We calculated the bias, b_T := θ* − θ_0, with θ* based on the integrated score, for q up to 3, T ∈ {64, 128, 256, 512}, and for five different choices of π_0: three degenerate distributions, δ_z, with mass 1 at z ∈ {0, 1, 2} (i.e., A = z is constant); and two uniform distributions, U[0.5, 1.5] and U[0, 2]. (So in all these cases π_prior is very different from π_0.) Table 7 gives the bias b_T for the chosen values of T, and also the successive bias ratios b_{T/2}/b_T (the three rightmost columns). If Conjecture 1 is correct, we should see these ratios converge to 2^{q+1} as T → ∞. For comparison, we also report the bias for q = 0, where the known rate is confirmed: b_{T/2}/b_T converges to 2 for every π_0, although for π_0 = δ_2 the convergence to 2 is not yet quite visible.
Presumably, this is because π_0 and π_prior are then quite different, requiring a larger T for the ratio b_{T/2}/b_T to become stable. For q = 1, b_{T/2}/b_T is seen to converge to 4 (as conjectured) when π_0 ∈ {δ_0, δ_1, U[0.5, 1.5]}, while for π_0 ∈ {δ_2, U[0, 2]} the convergence is less visible. Overall, the picture is a little more blurred for q = 1 than for q = 0. For q = 2, where b_{T/2}/b_T should converge to 8, we tend to see this convergence for π_0 ∈ {δ_0, δ_1}, although the picture is more blurred; but here, too, the order of magnitude of b_{T/2}/b_T is in line with convergence to 8. Finally, for q = 3, the picture is even more blurred: we tend to see convergence of b_{T/2}/b_T to 16 only for π_0 = δ_0, but even here the order of magnitude of b_{T/2}/b_T is not incompatible with convergence to 16 (apart from the cases that clearly need larger values of T for b_{T/2}/b_T to stabilize). Certainly, these numerical calculations are by no means proof of the conjectured rates, but, looking at the last column of Table 7, the ratios b_{T/2}/b_T are broadly in line with the conjecture.
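The target ratio 2^{q+1} follows directly from the conjectured rate: if b_T ≈ c T^{-(q+1)} for some constant c, then b_{T/2}/b_T = (T/2)^{-(q+1)} / T^{-(q+1)} = 2^{q+1}, independently of c. A one-line check of this arithmetic (with an arbitrary hypothetical constant c):

```python
# If the bias decays as b_T = c * T**-(q+1), halving T multiplies it by 2**(q+1).
c = 0.37  # arbitrary hypothetical constant; it cancels in the ratio
for q in range(4):
    b = lambda T: c * T ** -(q + 1)
    ratios = [b(T // 2) / b(T) for T in (64, 128, 256, 512)]
    print(q, ratios)  # each ratio equals 2**(q+1)
```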
Note, furthermore, that for any q ≥ 1 the remaining bias in Table 7 is extremely small in most cases, unlike the bias of the maximum integrated likelihood estimator (reported as q = 0 in the table).
The expression in (23) is identical to that in (16), except that Q(x, θ) is replaced with its MLE-based analog Q̄(x, θ). Iterating this alternative bias correction q times therefore also gives the formula in (17), with Q(x, θ) replaced with Q̄(x, θ). Thus, by the same arguments as before, for large values of q the corresponding score function (the analog of s^{(q)}(y, x, θ) with Q̄ in place of Q) will be dominated by contributions from eigenvectors of Q̄(x, θ) that correspond to eigenvalues close to or equal to zero.
It is therefore natural to ask why, in our presentation above, we have chosen the bias correction in (16), based on the posterior distribution of A, instead of the bias correction in (23), based on the MLE of A. The answer is that the matrix Q̄(x, θ) does not have the same convenient algebraic properties as the matrix Q(x, θ). In particular, none of parts (i), (ii), (iii) of Lemma 2 would hold if we replaced Q(x, θ) by Q̄(x, θ), implying that the close relationship between iterated bias correction and functional differencing does not generally hold for the alternative bias correction in (23).
To explain why the MLE-based matrix Q̄(x, θ) does not have these properties, consider the following. For given values of x and θ, assume that there exist two outcomes y and ȳ that give the same MLE of A, that is, α̂(y, x, θ) = α̂(ȳ, x, θ). Then the two columns of Q̄(x, θ) that correspond to y and ȳ are identical, and therefore Q̄(x, θ) does not have full rank, implying that it has a zero eigenvalue. The existence of this zero eigenvalue is simply a consequence of α̂(y, x, θ) = α̂(ȳ, x, θ). Now, in models where there exists a sufficient statistic for A (conditional on X), if the outcomes y and ȳ have the same value of the sufficient statistic, then α̂(y, x, θ) = α̂(ȳ, x, θ), and in that case the zero eigenvalue of Q̄(x, θ) just discussed is closely related to functional differencing, because the existence of the sufficient statistic generates valid moment functions; recall the example in equation (7).
However, we may also have α̂(y, x, θ) = α̂(ȳ, x, θ) for reasons that have nothing to do with functional differencing. For example, consider Example 1B with normally distributed errors, T ≥ 2 even, and T_0 = T_1 = T/2. Then all outcomes y with y_0 + y_1 = T/2 have α̂(y, θ) = −θ/2. So there are at least 1 + T/2 different outcomes with the same value of α̂(y, θ), which implies that the MLE-based matrix Q̄(θ) has at least T/2 zero eigenvalues. But we have found numerically that Q(θ) does not have zero eigenvalues in this example for θ ≠ 0, that is, according to Lemma 2, no exact moment function exists.
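This symmetry is easy to verify numerically. Under the probit specification of Example 1B as described above (T_0 periods at x_t = 0 and T_1 periods at x_t = 1, so that the log-likelihood in α is y_0 log Φ(α) + (T_0 − y_0) log(1 − Φ(α)) + y_1 log Φ(α + θ) + (T_1 − y_1) log(1 − Φ(α + θ)); the exact parametrization is our reading of the example), the following sketch confirms that every outcome with y_0 + y_1 = T/2 yields the same MLE α̂ = −θ/2:

```python
import numpy as np
from math import erf, log

def Phi(z):
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def alpha_hat(y0, y1, theta, T0, T1):
    # MLE of the fixed effect alpha for outcome (y0, y1);
    # the probit log-likelihood is concave in alpha, so ternary search suffices
    def negloglik(a):
        p0, p1 = Phi(a), Phi(a + theta)
        return -(y0 * log(p0) + (T0 - y0) * log(1 - p0)
                 + y1 * log(p1) + (T1 - y1) * log(1 - p1))
    lo, hi = -10.0, 10.0
    for _ in range(100):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if negloglik(m1) < negloglik(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

theta, T = 1.0, 4
T0 = T1 = T // 2
# all outcomes with y0 + y1 = T/2 share the same MLE, alpha_hat = -theta/2
mles = [alpha_hat(y0, T // 2 - y0, theta, T0, T1) for y0 in range(T // 2 + 1)]
print(mles)  # all ≈ -0.5
```

The equality follows from the symmetry Φ(−z) = 1 − Φ(z): at α = −θ/2 the first-order condition cancels exactly whenever y_0 + y_1 = T/2 and T_0 = T_1.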
This example shows that Q(x, θ) and its MLE-based analog Q̄(x, θ) have different algebraic properties, and it explains why we have focused on Q(x, θ) rather than Q̄(x, θ) in our discussion. Nevertheless, the observation that bias correction can be iterated using the formula in (17) can be useful for alternative methods as well.
The tuning parameter q ∈ {0, 1, 2, ...} is replaced here by the bandwidth parameter c > 0, which specifies which eigenvalues of Q(x, θ) are considered to be close to zero.

Average effect estimation
In models of the form (2) we are often interested not only in the unknown θ_0 but also in functionals of the unknown π_0(α|x). In particular, consider average effects of the form μ_0 = E[μ(X_i, A_i, θ_0)], where μ(x, α, θ) is a known function that specifies the average effect of interest. For example, in a panel data model, if we are interested in the average partial effect with respect to the p-th regressor in period t, we could choose μ(x, α, θ) = (∂/∂x_{t,p}) Σ_{y∈Y} y_t f(y|x, θ, α). For other examples of functionals of the individual-specific effects see, e.g., Arellano and Bonhomme (2012).
We now focus on the problem of estimating μ_0. In this subsection we therefore take the problem of estimating θ_0 as already resolved (with corresponding estimator θ̂), and we concentrate on the difficulty that π_0(α|x) is unknown when estimating average effects.
Let W(x, θ) be the n_Y-vector with entries w^{(0)}(y^{(k)}, x, θ), k = 1, ..., n_Y. Then the analog, for average effects, of the limiting estimating function in (20), corresponding to q → ∞, is w^{(∞)}(y, x, θ) = W'(x, θ) Q†(x, θ) δ(y), where Q†(x, θ) is a pseudo-inverse of Q(x, θ); the application of a function h^{(∞)} : [0, 1] → R to the matrix Q(x, θ) was defined in equation (18). The motivation for choosing w^{(∞)}(y, x, θ) in this way is that it gives an unbiased estimator of the average effect (i.e., μ*^{(∞)} = μ_0) whenever we can write μ(x, α, θ) = Σ_{y∈Y} ν(y, x, θ) f(y|x, θ, α) for some function ν(y, x, θ). Of course, average effects with this form of μ(x, α, θ) are a very special case, but they are usually the only cases for which we can expect unbiased estimation of the average effect to be feasible (for fixed T); see also Aguirregabiria and Carro (2021).
Notice that we do not assume here that μ(x, α, θ) is of this form; it is only used to motivate (24).
As we have seen before, the non-zero eigenvalues of Q(x, θ) can be very small, which implies that the pseudo-inverse Q†(x, θ) can have very large elements. The corresponding estimator μ̂^{(∞)} based on (24) therefore typically has a very large variance, and we do not recommend it in practice. Instead, to balance the bias-variance trade-off of the average effect estimator, some regularization of the pseudo-inverse of Q(x, θ) in (24) is required. There are various ways to implement this regularization, just as there are various ways to implement approximate functional differencing (see Section 8.2).
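The variance problem with the unregularized pseudo-inverse is generic: when the smallest retained eigenvalue of Q is of order 10^{-9} (as in Figure 3 for moderate T), the pseudo-inverse contains entries of order 10^{9}, and any estimating function built from it amplifies sampling noise accordingly. A minimal illustration with a hypothetical diagonal matrix standing in for Q:

```python
import numpy as np

# hypothetical spectrum mimicking Q: eigenvalues decay exponentially toward zero
lam = np.array([1.0, 1e-2, 1e-5, 1e-9])
Q = np.diag(lam)

Q_pinv = np.linalg.pinv(Q)       # inverts every eigenvalue above the rcond cutoff
print(np.abs(Q_pinv).max())      # of order 1e9: tiny eigenvalues give huge entries
```

Note that numpy's default `rcond` cutoff (about 1e-15 times the largest singular value) would not discard an eigenvalue of 1e-9, so the blow-up is not avoided automatically.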
Here, regularization means that we want to find functions h_q(λ) that approximate the inverse function 1/λ well for large values of λ ∈ [0, 1], but that deviate from 1/λ for values of λ close to zero, to avoid divergence. This gives, for q ∈ {0, 1, 2, ...}, h_q(λ) = Σ_{j=0}^q (1 − λ)^j. The corresponding estimating function that regularizes w^{(∞)}(y, x, θ) is therefore w^{(q)}(y, x, θ) = W'(x, θ) h_q[Q(x, θ)] δ(y). This is a polynomial in Q(x, θ), as was the case for s^{(q)}(y, x, θ). Choosing a value of q that is not too large therefore ensures that the variance of the corresponding estimator μ̂^{(q)} remains reasonably small (for fixed q), because we do not need the pseudo-inverse of Q(x, θ).
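To see how the truncated expansion trades bias for variance, note that under the truncated-Neumann-series reading of h_q (h_q(λ) = Σ_{j=0}^q (1 − λ)^j, which is consistent with the facts stated in the text that h_1(λ) = 2 − λ and h_q(0) = q + 1), we have h_q(λ) = (1 − (1 − λ)^{q+1})/λ for λ > 0: it matches 1/λ up to a relative error (1 − λ)^{q+1} and never exceeds q + 1. A sketch with a hypothetical diagonal spectrum:

```python
import numpy as np

lam = np.array([1.0, 0.5, 1e-2, 1e-9])   # hypothetical eigenvalues of Q
Q = np.diag(lam)

def h_q_of_Q(Q, q):
    # h_q[Q] = sum_{j=0}^{q} (I - Q)^j, the truncated Neumann series for Q^{-1}
    I = np.eye(Q.shape[0])
    out = np.zeros_like(Q)
    P = np.eye(Q.shape[0])
    for _ in range(q + 1):
        out += P
        P = P @ (I - Q)
    return out

for q in (0, 1, 10, 100):
    H = h_q_of_Q(Q, q)
    d = np.diag(H)
    # eigenvalues far from zero are approximately inverted,
    # while h_q stays bounded by q + 1 near lambda = 0
    print(q, d)
```

On a diagonal Q the entries of h_q[Q] equal h_q(λ) eigenvalue by eigenvalue, so the large-λ directions are (approximately) inverted while the near-zero directions are capped at q + 1 instead of exploding like 1/λ.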
Note also that w^{(q)}(y, x, θ) and the corresponding estimators μ̂^{(q)} have a large-T bias-correction interpretation very similar to that of s^{(q)}(y, x, θ). For example, we have h_1(λ) = 2 − λ, and therefore w^{(1)}(y, x, θ) = 2 w^{(0)}(y, x, θ) − W'(x, θ) Q(x, θ) δ(y). We conjecture that the estimator of μ_0 corresponding to W'(x, θ) Q(x, θ) δ(y) alone has twice the leading-order 1/T asymptotic bias of the estimator μ̂^{(0)} corresponding to w^{(0)}(y, x, θ), so that w^{(1)}(y, x, θ) is exactly the jackknife linear combination that eliminates the leading large-T bias.

(Footnote 8: In previous sections, the functions h_q(λ) = (1 − λ)^q were polynomial approximations of (rescaled versions of) the function h_∞(λ) = 1{λ = 0}. The regularization analogous to s^{(q)}(y, x, θ) in (17) is given by a q-th order Taylor expansion of the function 1/λ around λ = 1.)

(Footnote 9: Here we use the convention that 0^0 = 1, which implies that [I_{n_Y} − Q(x, θ)]^0 = I_{n_Y} even though Q(x, θ) has an eigenvalue equal to one. Also, there is some ambiguity in what value we should assign to h_q(λ) at λ = 0. We choose h_q(0) = q + 1 because it results in the simple polynomial expression (26) for h_q[Q(x, θ)], which is convenient since h_q[Q(x, θ)] can then be evaluated without ever calculating the eigenvalues and eigenvectors of Q(x, θ). However, if we want to obtain w^{(∞)}(y, x, θ) in (24) as the limit of w^{(q)}(y, x, θ) as q → ∞, then we should instead assign h_q(0) = 0, but this would deviate from the polynomial expression.)
We do not consider average effects further here. But we found it noteworthy that there is a formalism for average effect calculation that closely mirrors the development of approximate functional differencing for the estimation of θ_0 introduced above. This does not imply, however, that we expect the results for average-effect estimation to be similar to those for the estimation of the common parameters θ_0. In particular, for small values of T, the identified set for the average effects in discrete-choice panel data models tends to be much larger than the identified set for the common parameters (see, e.g., Chernozhukov, Fernández-Val, Hahn and Newey 2013, Davezies, D'Haultfoeuille and Laage 2021, Liu, Poirier and Shiu 2021, and Pakel and Weidner 2021). We therefore expect larger values of T to be required for the point estimators μ̂^{(q)} to perform well, and we also expect the bias-variance trade-off in the choice of q to be quite different. For a closely related discussion see Bonhomme and Davezies (2017), and also the section on "Average marginal effects" in the 2010 working paper version of Bonhomme (2012).

Conclusions
We have linked the large-T panel data literature with the functional differencing method through a bias correction that converges to functional differencing when iterated. Our numerical illustrations show that in models where exact functional differencing is not possible, one may still apply it approximately to obtain estimates that can be essentially unbiased, even when the number of time periods T is small.

The key element in our construction is the n_Y × n_Y matrix Q(x, θ). The eigenvalues of this matrix are informative about whether (approximate) functional differencing is applicable in a given model. The matrix Q(x, θ) also features prominently in our bias-corrected score functions in (17) and in our regularized estimating functions for average effects in (26). We have assumed a discrete outcome space with a finite number of elements, n_Y. When the outcome space is infinite, the matrix Q(x, θ) has to be replaced by the corresponding operator.

The goal of this paper was primarily to introduce and illustrate an approximate version of functional differencing. Future work is needed to better understand the properties of the method and to explore its usefulness in empirical work, both for the estimation of common parameters, which was our primary focus, and for the estimation of average effects, briefly introduced in Section 8.3.

We then calculate the eigenvalues of the n_Y × n_Y matrix Q(θ) for θ = 1 (there are no longer covariates x in this example, as they are assigned non-random values). For T = 2 we have n_Y = 4, and the four eigenvalues of Q(1) are λ_1 = 1, λ_2 = 0.47463, λ_3 = 0.10727, and λ_4 = 0.00016. For T = 4 and T = 6 we have n_Y = 9 and n_Y = 16, respectively, and the corresponding eigenvalues of Q(1) are plotted in Figures 1 and 2. Figure 3 plots only the smallest eigenvalue of Q(1) for T = 2, 4, ..., 20.

Figure 1: The eigenvalues λ_j(θ) of the matrix Q(θ) in Example 1B, plotted for θ = 1, T = 4, T_0 = T_1 = 2, with both the error distribution and the prior distribution of A standard normal. The left and right plots show the same eigenvalues, just with a different scaling of the y-axis.
Let δ(y) denote the n_Y-vector with entries 1{y = y^{(k)}}, k = 1, ..., n_Y, where 1(·) is the indicator function. Recall that s(y, x, θ) is a d_s-vector. Let S(x, θ) be the corresponding d_s × n_Y matrix with columns s(y^{(k)}, x, θ), k = 1, ..., n_Y. We can then write s(y, x, θ) as S(x, θ) δ(y).

Note, however, that this assumes not only that α takes on only a finite number of values, but also that these values are known (they constitute the known set A). By contrast, the literature on discretizing heterogeneity in panel data (e.g., Bonhomme and Manresa 2015, Su, Shi and Phillips 2016, Bonhomme, Lamadon and Manresa 2022) usually considers the support points of the heterogeneity as unknown parameters.

Table 1: Asymptotic biases and variances, and approximate RMSEs and coverage rates (n = 1000). The model is as in Example 1B with standard normal errors and π_prior(α) = φ(α).

Table 7: Asymptotic bias rates. The model is as in Example 1B with standard normal errors, T_0 = T_1 = T/2, π_prior(α) = φ(α), and π_0 as given in the table. The left part of the table gives b_T; the three rightmost columns give the ratios b_{T/2}/b_T.

Regarding the thresholding function, one could in principle consider a simple indicator function K(ξ) = 1{ξ ≤ 1}, but since this function is discontinuous, the resulting score function s_h(y, x, θ) defined in (19) would then be discontinuous in θ, so we would not recommend this. Another possibility to implement approximate functional differencing is to replace the set A by a finite set A* with cardinality n_A less than n_Y. As explained in Remark 2 above, after this replacement the matrix Q(x, θ) will have at least n_Y − n_A zero eigenvalues, that is, one can then use the moment function s^{(∞)}(y, x, θ) defined in (20) to implement the MM or GMM estimator. In this case, the key tuning parameter to choose is the number of points n_A in the set A* that "approximates" A.