Optimal data collection design in machine learning: the case of the fixed effects generalized least squares panel data model

This work belongs to the strand of literature that combines machine learning, optimization, and econometrics. The aim is to optimize the data collection process in a specific statistical model, commonly used in econometrics, employing an optimization criterion inspired by machine learning, namely, the generalization error conditioned on the training input data. More specifically, the paper focuses on the analysis of the conditional generalization error of the Fixed Effects Generalized Least Squares (FEGLS) panel data model, i.e., a linear regression model with applications in several fields, able to represent unobserved heterogeneity in the data associated with different units, and in which distinct observations related to the same unit are corrupted by correlated measurement errors. The framework considered in this work differs from the classical FEGLS model in the additional possibility of controlling the conditional variance of the output variable, given the associated unit and input variables, by changing the cost per supervision of each training example. Assuming an upper bound on the total supervision cost, i.e., the cost associated with the whole training set, the trade-off between the training set size and the precision of supervision (i.e., the reciprocal of the conditional variance of the output variable) is analyzed and optimized. This is achieved by formulating and solving in closed form suitable optimization problems, based on large-sample approximations of the generalization error associated with the FEGLS estimates of the model parameters, conditioned on the training input data. The results of the analysis extend the ones obtained by the authors in recent works for simpler linear regression models to the FEGLS case and to various large-sample approximations of its conditional generalization error.
They highlight the importance of how the precision of supervision scales with respect to the cost per training example in determining the optimal trade-off between training set size and precision. Numerical results confirm the validity of the theoretical findings.


Introduction
In various applications in economics, engineering, medicine, physics, and several other fields, one often needs to approximate a function, based on a finite set of input/output noisy examples. This belongs to the typical class of problems investigated by supervised machine learning (Hastie et al. 2009; Vapnik 1998). In some cases, the noise variance of the output variable can be decreased, at least to some extent, by making the cost of each supervision larger. As an example, observations could be acquired by using more precise measurement devices (then, likely, also having a larger acquisition cost). Similarly, each supervision could be made by an expert (also in this case, a larger cost would be expected when increasing the level of expertise). In all these situations, it can be useful to optimize the trade-off between the training set size and the precision of supervision. In a previous conference work, this kind of analysis was conducted by proposing a modification of the classical linear regression model, in which one has the additional possibility to control the conditional variance of the output variable given the associated input variables, by changing the time (hence, the cost) dedicated to providing a label for each training input example, and fixing an upper bound on the time available for the supervision of the whole training set. Based on a large-sample approximation of the output of the ordinary least squares regression algorithm, it was shown in that work that the optimal choice of the supervision time per example highly depends on how the precision of supervision scales with respect to the cost per training example. The analysis was refined in a subsequent work, where a related optimization problem, based on the analysis of the output produced by a different regression algorithm (namely, weighted least squares), was considered, obtaining similar results at optimality, for a model in which distinct training examples are possibly associated with different supervision times.
Finally, in a further conference work, the analysis of the optimal trade-off between training set size and precision of supervision was extended to a more general linear model of the input/output relationship, namely, the fixed effects panel data model. In this model, observations associated with different units (individuals) depend also on additional constants, one for each unit, which make it possible to include, in the input/output relationship, unobserved heterogeneity in the data. Moreover, each unit is observed along another dimension, which is typically time. This kind of model (and its variations) is commonly applied in the analysis of data in both microeconomics and macroeconomics (Arellano 2004; Cameron and Trivedi 2005; Wooldridge 2002), where each unit may represent, for instance, a firm, or a country. It is also applied in biostatistics (Härdle et al. 2007) and sociology (Frees 2004). An important engineering application of the model (and of its variations) is the calibration of sensors (Reeve 1988, Sect. 4.1).
In order to increase the applicability of the analysis carried out in our previous conference work, in this paper we extend it thoroughly in the following directions. First, that work investigated only the case in which the measurement errors of observations associated with the same unit are mutually independent. In this paper, we extend such analysis to the case of dependent measurement errors. Moreover, differently from that work, we confirm the validity of the obtained theoretical results numerically. Further, in that work, the optimal trade-off between training set size and precision of supervision was analyzed only for a fixed number of units, assuming that the number of observations associated with the same unit is large enough to justify a large-sample approximation with respect to the number of observations. In the last part of this work, we additionally consider the cases of a large-sample approximation with respect to the number of units, and of a large-sample approximation with respect to both the number of units and the number of observations per unit.
In line with the results of the theoretical analyses made in our previous works, including Gnecco and Nutarelli (2019), for simpler linear regression models, we show that, also for the more applicable fixed effects generalized least squares panel data model, the following holds in general: when the precision of each supervision (i.e., the reciprocal of the conditional variance of the output variable, given the associated unit and input variables) increases less than proportionally with respect to an increase of the supervision cost per training example, the minimum (large-sample approximation of the) generalization error (conditioned on the training input data) is obtained in correspondence of the smallest supervision cost per example (hence, of the largest number of examples); when that precision increases more than proportionally with respect to an increase of the supervision cost per example, the optimal supervision cost per example is the largest one (which corresponds to the smallest number of examples). Differently from those previous works, in the analysis made in the present work, the number of training examples can be varied either by increasing the number of observations per unit, or the number of units, or both. In summary, the results of the analyses made in our previous works and, for a different and more complex regression model, in this paper, highlight that increasing the training set size is not always beneficial, if a smaller number of more reliable data can be collected. Hence, not only the quantity of the data, but of course also their quality matters. This looks particularly relevant when the data collection process can be designed before the data are actually collected.
The paper is structured as follows. Section 2 provides a background on the fixed effects generalized least squares panel data model. Section 3 presents the analysis of its conditional generalization error, and of the large-sample approximation of the latter with respect to the time horizon T. Section 4 formulates and solves an optimization problem we propose in order to provide an optimal trade-off between training set size and precision of supervision for the fixed effects generalized least squares panel data model, using the large-sample approximation above. Section 5 presents some numerical results, which validate the theoretical ones. Finally, Sect. 6 discusses some possible applications and extensions of the theoretical results obtained in the work. Some technical proofs and remarks about the extension of the analysis made in the paper to other large-sample settings are reported in the Appendices.

Background
In this section, we recall some basic facts about the Fixed Effects Generalized Least Squares (FEGLS) panel data model (see, e.g., Wooldridge 2002, Chapter 10). Specifically, we refer to the following model:

y_{n,t} := α_n + β′ x_{n,t}, for n = 1, …, N, t = 1, …, T,  (1)

where the outputs y_{n,t} ∈ ℝ are scalars, whereas the inputs x_{n,t} (n = 1, …, N, t = 1, …, T) are column vectors in ℝ^p, and are modeled as random vectors. The superscript ′ denotes transposition. The parameters of the model are the individual constants α_n (n = 1, …, N), one for each unit, and the column vector β ∈ ℝ^p. The constants α_n are also called fixed effects. Eq. (1) represents a balanced panel data model, in which each unit n is associated with the same number T of outputs, each one at a different time t. The model represents the case in which the input/output relationship is linear, and different units, which are observed at the times t = 1, …, T, are associated with possibly different constants.
Note that the outputs y_{n,t} are actually unavailable; only their noisy measurements z_{n,t} can be obtained, which are assumed to be generated according to the following additive noise model:

z_{n,t} := y_{n,t} + ε_{n,t}, for n = 1, …, N, t = 1, …, T,  (2)

where, for any n, the ε_{n,t} are identically distributed and possibly dependent random variables, having mean 0, and are further independent from all the x_{n,t}. For any two units n ≠ m and any two time instants t₁, t₂ ∈ {1, …, T}, ε_{n,t₁} and ε_{m,t₂} are assumed to be independent. Hence, only the possibility of temporal dependence for the measurement errors associated with the same unit is considered in the following, in line with several works in the literature (see, e.g., Bhargava et al. (1982) and (Wooldridge 2002, Section 10.5.5)).
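As an illustration, the data-generating process defined by Eqs. (1) and (2) can be sketched in Python. All dimensions and parameter values below (N, T, p, σ, ρ, and the distributions of the α_n and of β) are arbitrary illustrative choices of ours; the AR(1) recursion is one standard way to obtain stationary noise with the serial correlation structure assumed in the next paragraphs:

```python
import numpy as np

rng = np.random.default_rng(0)

N, T, p = 4, 50, 3               # units, time periods, input dimension
beta = rng.uniform(-1, 1, p)     # common parameter vector (Eq. (1))
alpha = rng.uniform(-1, 1, N)    # fixed effects, one per unit (Eq. (1))
sigma, rho = 0.5, 0.3            # noise std. deviation and serial correlation

X = rng.normal(size=(N, T, p))   # inputs x_{n,t}, modeled as random vectors
Y = alpha[:, None] + X @ beta    # noiseless outputs y_{n,t} (Eq. (1))

# stationary AR(1) measurement noise per unit, with Var(eps_{n,t}) = sigma**2
eps = np.empty((N, T))
eps[:, 0] = sigma * rng.normal(size=N)
for t in range(1, T):
    eps[:, t] = rho * eps[:, t - 1] + sigma * np.sqrt(1 - rho**2) * rng.normal(size=N)

Z = Y + eps                      # noisy measurements z_{n,t} (Eq. (2))
```

The scaling factor sqrt(1 − ρ²) keeps the marginal variance of the noise equal to σ² at every time instant, so that the ε_{n,t} are identically distributed, as assumed above.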
For each unit n, let X_n ∈ ℝ^{T×p} be the matrix whose rows are the transposes of the x_{n,t}; further, let z_n ∈ ℝ^T be the column vector which collects the noisy measurements z_{n,t}, and ε_n ∈ ℝ^T the column vector which collects the measurement noises ε_{n,t}. The input/corrupted output pairs (x_{n,t}, z_{n,t}), for n = 1, …, N, t = 1, …, T, are used to train the FEGLS model, i.e., to estimate its parameters.
The following first-order serial covariance form is assumed (see, e.g., Bhargava et al. (1982) and Wooldridge (2002, Section 10.5.5)) for the (unconditional) covariance matrix of the vector of measurement noises associated with the n-th unit¹:

Ω := E{ε_n ε_n′} ∈ ℝ^{T×T}, with entries (Ω)_{t₁,t₂} := σ² ρ^{|t₁ − t₂|},  (3)

where σ > 0 and ρ ∈ (−1, 1) hold (here, E denotes the expectation operator); Ω is a symmetric and positive-definite matrix. In other words, the measurement noise is assumed to be generated by a first-order autoregressive (AR(1)) process (Ruud 2000, Section 25.2). In the particular case of uncorrelated (ρ = 0) and independent measurement noises, one obtains the model considered in our previous conference work. Let the matrix Q_T ∈ ℝ^{T×T} be defined as

Q_T := I_T − (1∕T) 1_T 1_T′,  (4)

where I_T ∈ ℝ^{T×T} is the identity matrix, and 1_T ∈ ℝ^T a column vector whose elements are all equal to 1. One can check that Q_T is a symmetric and idempotent matrix (i.e., Q_T′ = Q_T = Q_T²), and its eigenvalues are 0 with multiplicity 1, and 1 with multiplicity T − 1. Hence, for each unit n, one can define

Ẍ_n := Q_T X_n ∈ ℝ^{T×p},  (5)
z̈_n := Q_T z_n ∈ ℝ^T,  (6)
ε̈_n := Q_T ε_n ∈ ℝ^T,  (7)

which represent, respectively, the matrix of time de-meaned training inputs, the vector of time de-meaned corrupted training outputs (where z_{n,t} := y_{n,t} + ε_{n,t} are the noisy measurements of Eq. (2)), and the vector of time de-meaned measurement noises. The goal of time de-meaning is to obtain a derived dataset in which the fixed effects are removed, making it possible to estimate first the vector β, then, turning back to the original dataset, the fixed effects α_n. The covariance matrix E{ε̈_n ε̈_n′} has the expression

Ω̈ := E{ε̈_n ε̈_n′} = Q_T Ω Q_T,  (8)

which is symmetric and positive semi-definite, and has rank T − 1 < T (Wooldridge 2002). Although this deficient rank prevents the application of the most usual approach to Generalized Least Squares (GLS) estimation, based on the inversion of the covariance matrix (which in this case cannot be inverted), one can still apply GLS by projecting Eqs. (1) and (2) onto the orthogonal complement L of the vector 1_T through Q_T, then solving a standard GLS problem on L (Aitken 1936).
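The stated properties of the de-meaning matrix Q_T can be checked numerically; the following Python snippet (with an arbitrary small T of our choosing) verifies symmetry, idempotence, the eigenvalue multiplicities, and the removal of constants:

```python
import numpy as np

T = 6
Q = np.eye(T) - np.ones((T, T)) / T   # Q_T = I_T - (1/T) 1_T 1_T'

# symmetry and idempotence: Q' = Q = Q^2
assert np.allclose(Q, Q.T) and np.allclose(Q @ Q, Q)

# eigenvalues: 0 with multiplicity 1, and 1 with multiplicity T - 1
eigs = np.sort(np.linalg.eigvalsh(Q))
assert np.allclose(eigs, [0.0] + [1.0] * (T - 1))

# applying Q_T de-means a vector; constant vectors (the fixed effects) vanish
v = np.arange(T, dtype=float)
assert np.allclose(Q @ v, v - v.mean())
assert np.allclose(Q @ np.ones(T), np.zeros(T))
```

The last assertion is the reason why de-meaning removes the fixed effects: each α_n enters Eq. (1) as a constant vector over time, which lies in the kernel of Q_T.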
This is formally obtained by replacing the inverse of the covariance matrix with its Moore-Penrose pseudoinverse² (denoted by the superscript ⁺), as made in the context of FEGLS estimation in Kiefer (1980) and Im et al. (1999). More precisely, assuming the invertibility of the matrix ∑_{n=1}^N Ẍ_n′ Ω̈⁺ Ẍ_n (see Remark 3.1 for a justification of this assumption), the FEGLS estimate of β is

β̂_FEGLS := (∑_{n=1}^N Ẍ_n′ Ω̈⁺ Ẍ_n)⁻¹ ∑_{n=1}^N Ẍ_n′ Ω̈⁺ z̈_n.  (9)

² It is recalled here from Strang (1993) that the Moore-Penrose pseudoinverse M⁺ of a matrix M ∈ ℝ^{T×T} inverts a special restriction of the linear application represented by the matrix M, whose domain and codomain are restricted, respectively, to the row space of M and to the column space of M (which coincide in the case of a symmetric matrix). For a matrix M ∈ ℝ^{T×T} with singular value decomposition M = UΣV′ (where U, V ∈ ℝ^{T×T} are orthogonal matrices, and Σ ∈ ℝ^{T×T} is a diagonal matrix whose non-zero entries are the singular values of M), the singular value decomposition of its Moore-Penrose pseudoinverse is M⁺ = VΣ⁺U′ (where Σ⁺ ∈ ℝ^{T×T} is a diagonal matrix whose non-zero entries are the reciprocals of the non-zero singular values of M). Finally, in the particular case in which M is symmetric and positive semi-definite (e.g., when it is a covariance matrix), its singular values coincide with its positive eigenvalues, U ∈ ℝ^{T×T} is an orthogonal matrix whose columns are its eigenvectors, V′ = U⁻¹, and M⁺ is symmetric and positive semi-definite. The concept of Moore-Penrose pseudoinversion can be extended to the cases of rectangular matrices and matrices with complex entries, but such extensions are not needed in this work.

The estimate β̂_FEGLS in (9) can be interpreted as the GLS estimate of β obtained by replacing the original input/corrupted output training data with their de-meaned versions reported above. It is worth observing that the training input/corrupted output pairs (x_{n,t}, z_{n,t}) (n = 1, …, N, t = 1, …, T) are all used to estimate β.

Remark 2.1
Another commonly used approach to deal with the issue above is to drop one of the time periods from the analysis, in order to get an invertible covariance matrix. It can be rigorously proved (see, e.g., Im et al. (1999, Theorem 4.3)) that this second approach is equivalent to the one based on the Moore-Penrose pseudoinverse (producing exactly the same FEGLS estimate), and that it does not matter which time period is dropped, as the resulting GLS estimator has always the same form. Therefore, dropping the last row of Q_T, one gets the matrix Q̃_T ∈ ℝ^{(T−1)×T}, from which one obtains the matrix X̃_n := Q̃_T X_n ∈ ℝ^{(T−1)×p}, the column vector z̃_n := Q̃_T z_n ∈ ℝ^{T−1}, and the column vector ε̃_n := Q̃_T ε_n ∈ ℝ^{T−1} (equivalently, these are obtained by removing the last row, respectively, from Ẍ_n, z̈_n, and ε̈_n). Moreover, one gets

Ω̃ := E{ε̃_n ε̃_n′} = Q̃_T Ω Q̃_T′,  (10)

which is, differently from Ω̈, an invertible matrix. The resulting FEGLS estimate is

β̂_alt,FEGLS := (∑_{n=1}^N X̃_n′ Ω̃⁻¹ X̃_n)⁻¹ ∑_{n=1}^N X̃_n′ Ω̃⁻¹ z̃_n  (11)

(see, e.g., Wooldridge (2002)). The FEGLS estimate β̂_FEGLS and the alternative one β̂_alt,FEGLS are actually identical (Im et al. 1999, Theorem 4.3). This equivalence is obtained by expressing such estimates in terms of the original variables before de-meaning, then exploiting the proof of (Im et al. 1999, Theorem 4.3), which shows that Q̃_T′ Ω̃⁻¹ Q̃_T = Ω̈⁺ (this still holds if an observation different from the last one is dropped, and Q̃_T is redefined accordingly).
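The equivalence stated in the remark can be verified numerically. The following Python sketch builds both weighting matrices (the pseudoinverse-based one and the drop-one-period one) for an AR(1) covariance and checks that the two GLS estimates coincide on arbitrary data; all names and the small dimensions are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, p = 3, 8, 2
Q = np.eye(T) - np.ones((T, T)) / T
Q_tilde = Q[:-1, :]                   # Q_T with its last row dropped

# AR(1) covariance of the measurement noise; sigma and rho arbitrary
sigma, rho = 1.0, 0.4
tt = np.arange(T)
Omega = sigma**2 * rho**np.abs(tt[:, None] - tt[None, :])

# route 1: Moore-Penrose pseudoinverse of the singular de-meaned covariance
W_pinv = Q @ np.linalg.pinv(Q @ Omega @ Q) @ Q
# route 2: drop one time period, then invert the reduced covariance
W_drop = Q_tilde.T @ np.linalg.inv(Q_tilde @ Omega @ Q_tilde.T) @ Q_tilde

X = rng.normal(size=(N, T, p))
z = rng.normal(size=(N, T))           # arbitrary data: the estimates must agree anyway

A1 = sum(X[n].T @ W_pinv @ X[n] for n in range(N))
b1 = sum(X[n].T @ W_pinv @ z[n] for n in range(N))
A2 = sum(X[n].T @ W_drop @ X[n] for n in range(N))
b2 = sum(X[n].T @ W_drop @ z[n] for n in range(N))

beta_pinv = np.linalg.solve(A1, b1)
beta_drop = np.linalg.solve(A2, b2)
assert np.allclose(beta_pinv, beta_drop)   # same FEGLS estimate from both routes
```

The agreement of the two solutions reflects the identity between the two weighting matrices discussed in the remark.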
The FEGLS estimates of the α_n (also called fixed effects residuals (Wooldridge 2002)) are

α̂_{n,FEGLS} := (1∕T) ∑_{t=1}^T (z_{n,t} − β̂′_FEGLS x_{n,t}), for n = 1, …, N.  (12)

They are obtained by subtracting the estimate β̂′_FEGLS x_{n,t} of β′ x_{n,t} from each corrupted output z_{n,t}, then performing an empirical average, limited to the training data associated with the unit n. The FEGLS estimates reported in Eq. (12) are motivated by the fact that the α_n are constants, whereas the ε_{n,t} have mean 0.
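As a minimal numerical sketch (with illustrative dimensions, parameters, and seed of our choosing), the estimates (9) and (12) can be computed as follows; with a moderately large T, both β̂_FEGLS and the α̂_{n,FEGLS} should land close to the true parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, p = 5, 200, 2
beta = np.array([1.5, -0.7])             # true parameter vector
alpha = rng.uniform(-1, 1, N)            # true fixed effects
sigma, rho = 0.4, 0.3

# AR(1) noise covariance and de-meaning matrix Q_T
tt = np.arange(T)
Omega = sigma**2 * rho**np.abs(tt[:, None] - tt[None, :])
Q = np.eye(T) - np.ones((T, T)) / T
Om_dd_pinv = np.linalg.pinv(Q @ Omega @ Q)   # pseudoinverse of the de-meaned covariance

# generate training data with noise drawn from N(0, Omega) for each unit
L = np.linalg.cholesky(Omega)
X = rng.normal(size=(N, T, p))
Z = alpha[:, None] + X @ beta + (L @ rng.normal(size=(N, T, 1)))[..., 0]

# FEGLS estimate of beta (Eq. (9)), computed on the de-meaned data
A = np.zeros((p, p))
b = np.zeros(p)
for n in range(N):
    Xdd, zdd = Q @ X[n], Q @ Z[n]
    A += Xdd.T @ Om_dd_pinv @ Xdd
    b += Xdd.T @ Om_dd_pinv @ zdd
beta_hat = np.linalg.solve(A, b)

# FEGLS estimates of the fixed effects (Eq. (12)): per-unit residual averages
alpha_hat = (Z - X @ beta_hat).mean(axis=1)
```

The accuracy of the estimates depends on σ, ρ, T, and the random seed; it improves as T grows, consistently with the large-sample analysis of the next section.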
By taking expectations, it readily follows from their definitions that the estimates (9) and (12) are conditionally unbiased with respect to the training input data {x_{n,t}} (n = 1, …, N, t = 1, …, T), i.e., that

E{β̂_FEGLS − β | {x_{n,t}}} = 0_p,  (13)

where 0_p ∈ ℝ^p is a column vector whose elements are all equal to 0, and, for any i = 1, …, N,

E{α̂_{i,FEGLS} − α_i | {x_{n,t}}} = 0.  (14)

Finally, the covariance matrix of β̂_FEGLS, conditioned on the training input data, is

Var{β̂_FEGLS | {x_{n,t}}} = (∑_{n=1}^N Ẍ_n′ Ω̈⁺ Ẍ_n)⁻¹.  (15)

Conditional generalization error and its large-sample approximation
The goal of this section is to analyze the generalization error associated with the FEGLS estimates (9) and (12), conditioned on the training input data, by providing its large-sample approximation. Then, in Sect. 4, the resulting expression is optimized, after choosing suitable models for the standard deviation of the measurement noise and for the time horizon, the latter being chosen in such a way that it satisfies a suitable budget constraint.
First, we express the generalization error, or expected risk, for the i-th unit (i = 1, …, N), conditioned on the training input data, as

R_i := E{(α̂_{i,FEGLS} + β̂′_FEGLS x_i^test − α_i − β′ x_i^test)² | {x_{n,t}}},  (16)

where x_i^test ∈ ℝ^p is independent from the training data. It is the expected mean squared error of the prediction of the output associated with a test input, conditioned on the training input data.
As shown in Appendix 1, the conditional generalization error (16) can be expressed in a form, reported there as Eq. (17), which highlights its dependence on σ²; some computations (also reported in Appendix 1) provide the expressions (18) and (19) of its terms. Next, we obtain a large-sample approximation of the conditional generalization error (17) with respect to T, for a fixed number of units N. Such an approximation is useful, e.g., in the application of the model to macroeconomic data, for which it is common to investigate the case of a large horizon T. Under mild conditions (e.g., if for the unit i the x_{i,t} are mutually independent, identically distributed, and have finite moments up to the order 4), the convergences in probability³ (20) and (21) hold (their proofs are reported in Appendix 2). Similarly, if for each fixed unit n the x_{n,t} are mutually independent, identically distributed⁴, and have finite moments up to the order 4, and one makes the additional assumption (22) (whose validity is discussed extensively in Appendix 2), involving the spectral norm (where, for a symmetric matrix M ∈ ℝ^{T×T}, ‖M‖₂ = max_{t=1,…,T} |λ_t(M)| denotes its spectral norm), then also the following convergence in probability holds:

plim_{T→+∞} (1∕T) ∑_{n=1}^N Ẍ_n′ Ω̈⁺ Ẍ_n = A_N,  (23)

where A_N is a symmetric and positive semi-definite matrix. In the following, its positive definiteness (hence, its invertibility) is also assumed.

³ We recall that a sequence of random real matrices M_T, T = 1, 2, …, converges in probability to the real matrix M if, for every ε > 0, Prob{‖M_T − M‖ > ε} (where ‖⋅‖ is an arbitrary matrix norm) tends to 0 as T tends to +∞. In this case, we write plim_{T→+∞} M_T = M.
Remark 3.1 The existence of the probability limit (23) and the assumed positive definiteness of the matrix A_N guarantee that the invertibility of the matrix ∑_{n=1}^N Ẍ_n′ Ω̈⁺ Ẍ_n (see the invertibility assumption before Eq. (9)) holds with probability near 1 for large T.
When (20), (21), and (23) hold, inserting such probability limits in Eq. (17), one gets the large-sample approximation (25) of the conditional generalization error (17) with respect to T, in which, for a vector v ∈ ℝ^p, ‖v‖₂ denotes its l₂ (Euclidean) norm, and A_N^{−1∕2} is the principal square root (i.e., the symmetric and positive definite square root) of the symmetric and positive definite matrix A_N⁻¹. Eq. (25) is obtained by taking into account that, as a consequence of the Continuous Mapping Theorem (Florescu 2015, Theorem 7.33), the probability limit of the product of two random variables equals the product of their probability limits, when the latter two exist. In this way, the third and sixth terms of Eq. (17) cancel out due to Eq. (21), whereas the second term is computed using Eq. (18).
Interestingly, the large-sample approximation (25) has the form (σ²∕T) K_i, where K_i is a positive constant (possibly, a different constant for each unit i). This simplifies the analysis of the trade-off between training set size and precision of supervision performed in the next section, since one does not need to compute the exact expression of K_i to find the optimal trade-off. In Appendix 3, an extension of the analysis made above is presented, considering, respectively, the case of large N, and the one in which both N and T are large.

Optimal trade-off between training set size and precision of supervision for the fixed effects generalized least squares panel data model under the large-sample approximation
In this section, we are interested in optimizing the large-sample approximation (25) of the conditional generalization error when the variance σ² is modeled as a decreasing function of the supervision cost per example c, and there is an upper bound C on the total supervision cost NTc associated with the whole training set. In the analysis, N is fixed, and T is chosen as the largest integer compatible with the budget constraint, i.e., T = ⌊C∕(Nc)⌋. Moreover, the supervision cost per example c is allowed to take values in a closed interval [c_min, c_max], with 0 < c_min < c_max. In the following, C is supposed to be sufficiently large, so that the large-sample approximation (25) can be assumed to hold for every c ∈ [c_min, c_max].
Consistently with our previous works, we adopt the following model for the variance σ², as a function of the supervision cost per example c:

σ²(c) := k c^{−γ},  (27)

where k, γ > 0. For 0 < γ < 1, the precision of each supervision is characterized by "decreasing returns of scale" with respect to its cost because, if one doubles the supervision cost per example c, then the precision 1∕σ²(c) becomes less than two times its initial value (or, equivalently, the variance σ²(c) becomes more than one half of its initial value). Conversely, for γ > 1, there are "increasing returns of scale" because, if one doubles the supervision cost per example c, then the precision 1∕σ²(c) becomes more than two times its initial value (or, equivalently, the variance σ²(c) becomes less than one half of its initial value). The case γ = 1 is intermediate and refers to "constant returns of scale". In all the cases above, the precision of each supervision increases by increasing the supervision cost per example c. Finally, it is worth observing that, according to the model (3) for the covariance matrix of the vector of measurement noises, the correlation coefficient ρ between successive measurement noises does not depend on c.
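The three regimes of the model (27) can be illustrated directly; the constant k and the sample cost below are arbitrary choices of ours:

```python
import numpy as np

k = 2.0  # arbitrary positive constant in the model sigma^2(c) = k * c**(-gamma)

def precision(c, gamma):
    """Precision of supervision 1/sigma^2(c) under the model of Eq. (27)."""
    return c**gamma / k

c = 1.0
# gamma < 1 ("decreasing returns of scale"): doubling c less than doubles precision
assert precision(2 * c, 0.5) < 2 * precision(c, 0.5)
# gamma > 1 ("increasing returns of scale"): doubling c more than doubles precision
assert precision(2 * c, 1.5) > 2 * precision(c, 1.5)
# gamma = 1 ("constant returns of scale"): precision exactly doubles
assert np.isclose(precision(2 * c, 1.0), 2 * precision(c, 1.0))
```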
Hence, under the assumptions above, the optimal trade-off between the training set size and the precision of supervision for the fixed effects generalized least squares panel data model is modeled by the following optimization problem:

minimize_{c ∈ [c_min, c_max]}  σ²(c) K_i ∕ ⌊C∕(Nc)⌋.  (28)

By a similar argument as in the proof of Gnecco and Nutarelli (2019, Proposition 3.2), when C is sufficiently large, the objective function of (28) can be approximated by replacing ⌊C∕(Nc)⌋ with C∕(Nc), obtaining (N K_i k ∕ C) c^{1−γ}. Concluding, under the approximation above, one can replace the optimization problem (28) with

minimize_{c ∈ [c_min, c_max]}  (N K_i k ∕ C) c^{1−γ},  (29)

whose optimal solutions c° have the following expressions: c° = c_min for 0 < γ < 1; c° = c_max for γ > 1; any c° ∈ [c_min, c_max] for γ = 1 (in this last case, the objective function of (29) does not depend on c). In summary, the results of the analysis show that, in the case of "decreasing returns of scale", "many but bad" examples are associated with a smaller conditional generalization error than "few but good" ones. The opposite occurs for "increasing returns of scale", whereas the case of "constant returns of scale" is intermediate. These results are qualitatively in line with the ones obtained in our previous works for simpler linear regression problems, to which different regression algorithms were applied (respectively, ordinary least squares, weighted least squares, and fixed effects ordinary least squares). This depends on the fact that, in all these cases, the large-sample approximation of the conditional generalization error has the same functional form (σ²∕T) K_i (although different positive constants K_i are involved in the various cases).
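A simple grid search over c confirms the closed-form solutions above; the constants N, C, k, and K_i below are arbitrary, since the location of the minimizer depends only on γ:

```python
import numpy as np

# arbitrary illustrative constants: the minimizer's location depends only on gamma
N, C, k, K_i = 10, 1e4, 1.0, 1.0
c_grid = np.linspace(0.5, 5.0, 1000)   # candidate costs in [c_min, c_max]

def objective(c, gamma):
    # large-sample objective sigma^2(c) * K_i / T with sigma^2(c) = k * c**(-gamma)
    # and T approximated by C / (N c), i.e., (N K_i k / C) * c**(1 - gamma)
    return (N * K_i * k / C) * c**(1.0 - gamma)

# gamma < 1: minimized at c_min ("many but bad" examples)
assert c_grid[np.argmin(objective(c_grid, 0.5))] == c_grid[0]
# gamma > 1: minimized at c_max ("few but good" examples)
assert c_grid[np.argmin(objective(c_grid, 1.5))] == c_grid[-1]
# gamma = 1: the objective does not depend on c
assert np.allclose(objective(c_grid, 1.0), objective(c_grid, 1.0)[0])
```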
One can observe that, in order to discriminate among the three cases of the analysis reported above, one does not need to know the exact values of the constants C, k, K_i, and N. Moreover, to discriminate between the first two cases, it is not necessary to know the exact value of the positive constant γ (indeed, it suffices to know whether γ belongs, respectively, to the interval (0, 1) or to the interval (1, +∞)). Interestingly, no precise knowledge of the probability distributions of the input examples (one for each unit) is needed. In particular, different probability distributions may be associated with different units, without affecting the results of the analysis. Finally, the same conclusions as above are reached if the objective function in (29) is replaced by the summation of the large-sample approximations of the conditional generalization error over all the N units. In that case, the constant K_i in (29) is simply replaced by the summation of such constants over the N units.

Numerical results
In this section, the theoretical results obtained in the paper are tested through simulations. For each c, an empirical approximation, denoted as Eq. (30), of the summation of the generalization error over all the units, conditioned on the training input data, is adopted. It is based on N_tr training sets and N_test test examples. For each choice of c, all the N_tr generated training sets share the same training input data matrices X_n, but differ in the random choice of the measurement noise. The number of rows in each matrix X_n is increased when c is reduced from c_max to c_min, by increasing the number of observations T. For a fair comparison, when doing this, the rows already present in each matrix X_n are kept fixed. Finally, the same test examples (generated independently from the training sets) are used to assess the performance of the fixed effects generalized least squares estimates for different costs per example c. Similarly, the fixed effects α_n (for n = 1, …, N) are generated randomly and independently according to a uniform distribution on [−1, 1], obtaining the vector α ∈ ℝ^N. For both the training and the test sets, the input data associated with each unit are generated as realizations of a multivariate Gaussian distribution with mean 0 and a symmetric and positive definite covariance matrix, the same for the training and the test inputs, constructed from a matrix A_x ∈ ℝ^{p×p} whose elements have been randomly and independently generated according to a uniform probability density on the interval [0, 1]. The parameter ρ in the covariance matrix (3) of the zero-mean vector of measurement noises (modeled in the simulations by a multivariate Gaussian distribution) is chosen to be equal to 0.3. As a robustness check, the whole procedure is repeated 100 times.
The results of the analysis are displayed in Tables 1 (for γ = 0.5), 2 (for γ = 1.5), and 3 (for γ = 1). Each table reports the results obtained in each repetition for c = c_min and c = c_max. The total simulation time (for a MATLAB 9.4 implementation of the procedure) is of about 501 sec on a notebook with a 2.30 GHz Intel(R) Core(TM) i5-4200U CPU and 6 GB of RAM. A statistical analysis of the elements of the tables leads to the following conclusions:

1. for γ = 0.5 (Table 1), the application of a one-sided Wilcoxon matched-pairs signed-rank test (Barlow 1989, Sect. 9.2.3) rejects the null hypothesis that the difference between the approximated performance index from Eq. (30) for c = c_max and the one for c = c_min has a symmetric distribution around its median and that this median is smaller than or equal to 0 (p-value = 1.9780 ⋅ 10⁻¹⁸, significance level set to 0.05);
2. for γ = 1.5 (Table 2), the application of a one-sided Wilcoxon matched-pairs signed-rank test rejects the null hypothesis that the same difference as above has a symmetric distribution around its median and that this median is larger than or equal to 0 (p-value = 1.9780 ⋅ 10⁻¹⁸, significance level set to 0.05);
3. for γ = 1 (Table 3), the application of a two-sided Wilcoxon matched-pairs signed-rank test fails to reject the null hypothesis that the same difference as above has a symmetric distribution around its median and that this median is equal to 0 (p-value = 0.4453, significance level set to 0.05).
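The statistical tests above can be reproduced in outline with scipy.stats.wilcoxon; the data below are synthetic stand-ins (not the paper's actual simulation results), constructed so that the c_max errors systematically exceed the c_min ones, as in the γ = 0.5 case:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# synthetic paired performance indices over 100 repetitions, built so that the
# error for c = c_max systematically exceeds the one for c = c_min, mimicking
# the gamma = 0.5 ("decreasing returns of scale") scenario
err_cmin = 5e-4 + 1e-5 * rng.normal(size=100)
err_cmax = err_cmin + 2e-4 + 1e-5 * rng.normal(size=100)

# one-sided Wilcoxon matched-pairs signed-rank test on the paired differences;
# alternative hypothesis: the median of err_cmax - err_cmin is greater than 0
res = stats.wilcoxon(err_cmax, err_cmin, alternative='greater')
assert res.pvalue < 0.05   # the null is rejected: smaller error at c = c_min
```

The matched-pairs structure (the same training input data shared across the two costs within each repetition) is what motivates a paired test rather than an unpaired one.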
Concluding, the tables show that the simulation results are in perfect agreement with the theoretical ones, leading to the same conclusions. Interestingly, this holds even though relatively small values for T have been chosen for the simulations.

Discussion and possible extensions
To the best of the authors' knowledge, the analysis and the optimization, made in the present article, of the conditional generalization error in regression as a trade-off between training set size and precision of supervision have been carried out only rarely in the literature. In particular, the authors believe that they were never addressed for the fixed effects generalized least squares panel data model. Nevertheless, the methodology used in the present article is similar to the one exploited in the context of the optimization of sample survey design, in which some parameters of the design are optimized to minimize, e.g., the sampling variance (see, for instance, the classical solution provided by Neyman allocation for optimal stratified sampling design, in case the dataset has a fixed size (Groves et al. 2004)). It also shares some similarity with the approach used in Nguyen et al. (2009) for the optimal design of measurement devices (in a context, however, in which linear regression is only marginally involved, since only arithmetic averages of measurement results are considered). Finally, the present article can also be related to some recent works which, according to an emerging trend in the literature, combine methods from machine learning, optimization, and econometrics (Athey and Imbens 2016; Bargagli Stoffi and Gnecco 2018, 2020; Varian 2014); e.g., the generalization error, which is typically considered in machine learning and optimized by solving suitable optimization problems, is not investigated in the classical analysis of the fixed effects generalized least squares panel data model (Wooldridge 2002, Chapter 10). In this way, the interaction between machine learning and optimization, which appears commonly in the literature (Bennett and Parrado-Hernández 2006; Bianchini et al. 1998; Gnecco et al. 2013; Gori 2017; Özöğür-Akyüz et al. 2011; Sra et al. 2011), is extended to the econometrics field.
For what concerns practical applications, the theoretical results obtained in this work could be applied to the design of the acquisition of fixed effects panel data in both microeconometrics and macroeconometrics (Greene 2003, Chapter 13). A semi-artificial validation on existing datasets could also be performed by inserting artificial noise with variance expressed as in Eq. (27), possibly with the inclusion of an additional constant term in that variance, to model the case of the original dataset before the insertion of the artificial noise. As a possible extension, one could investigate and optimize the trade-off between training set size and precision of supervision for the unbalanced FEGLS case (in which different units are associated with possibly different numbers of observations)⁵, for the situation in which some parameters of the noise model have to be estimated either from the data or from a subset of the data, and for the case of a non-zero correlation of measurement errors for the observations associated with different units. Such developments could open the way to the application of the proposed framework to real-world problems, e.g., in econometrics. Another possible extension concerns the replacement, in the investigation, of the fixed effects panel data model with the random effects one (Greene 2003, Chapter 13), which is also commonly applied to deal with the analysis of economic data, and differs from the fixed effects panel data model in that its parameters are random variables⁶.
In the present analysis, however, a possible advantage of the fixed effects panel data model is that it also provides estimates of the unit-specific constants (see Eq. (12)), which appear in the expression (16) of the conditional generalization error. Finally, another possible extension involves the case of dynamic panel data models (Cameron and Trivedi 2005, Chapter 21).

Appendix 1: proofs of Eqs. (17), (18), and (19)
To simplify the notation, here and in the next appendices, Q_T will often be replaced by the shorthand Q. Likewise, the dependence of this and other matrices/vectors on T will typically be omitted from the notation, apart from a few cases (e.g., in part of Appendix 2).

Proof of Eq. (17)
For n = 1, …, N, consider the column vector in ℝ^T whose elements are all equal to the unit-specific constant of unit n. Using the expressions (1), (2), (6), and (9) of y_n,t, z_n,t, z̄_n, and the FEGLS estimate, respectively, one can write the estimation error of the FEGLS estimate as follows:

Footnote 5: The unbalanced case in which all the measurement errors are uncorrelated (so that the FEGLS model is replaced in the analysis by the simpler Fixed Effects (FE) model) is the subject of our recent work.

Footnote 6: If the additional assumptions of the random effects model hold, then both the fixed effects and the random effects estimates are consistent, but the latter are more efficient than the former. However, if those assumptions do not hold, then the random effects model provides inconsistent estimates (Greene 2003, Chapter 13).
In the following, to simplify the notation, we set B as indicated above. Now, we expand the conditional generalization error (16), exploiting the conditional unbiasedness of the FEGLS estimate and the expressions (1) of y_n,t, (2) of z_n,t, and (12). Expanding the square in the first term of the resulting expression, and splitting its last term into two parts, one obtains the intermediate expression (A.6). To simplify the expressions above, one exploits the properties that hold for n ≠ m (where 0_{T×T} ∈ ℝ^{T×T} denotes the matrix whose elements are all equal to 0). Then, expanding b and exploiting also the symmetry B = B′, the fact that the matrix involved is symmetric and deterministic, and the fact that all the Ẍ_n are known once all the x_n,t are given, one gets (A.9). Finally, inserting (A.9) into (A.6), expanding the expressions of B and b, and recalling the norm identity reported above, one obtains Eq. (17). ◻

Proof of Eq. (18)
The expression 1_T′(·)1_T in Eq. (18) is the summation of all the elements of the within-unit correlation matrix, whose generic element is ρ^{|t−s|}. Now, the element ρ^0 = 1 appears in that summation T times, whereas the generic element ρ^t (for t = 1, …, T−1) appears 2(T−t) times. Hence, the summation equals

T + 2 ∑_{t=1}^{T−1} (T−t) ρ^t. (A.10)

Then, Eq. (18) is obtained from (A.10) by exploiting the following well-known expressions for the partial sums of the geometric series and of its derivative:

∑_{t=0}^{T−1} ρ^t = (1 − ρ^T)/(1 − ρ) (A.11)

and

∑_{t=1}^{T−1} t ρ^{t−1} = (1 − T ρ^{T−1} + (T−1) ρ^T)/(1 − ρ)², (A.12)

where the right-hand side in Eq. (A.12) has been obtained by simplifying common factors in the numerator and the denominator. ◻
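The counting argument above can be checked numerically. In the sketch below (our own; we call the within-unit correlation matrix `Omega` and the correlation parameter `rho`, since the original symbols are typeset elsewhere), the direct summation of all matrix elements is compared with the counting formula T + 2∑(T−t)ρ^t and with one algebraically equivalent closed form obtained from the geometric-series identities (the paper's Eq. (18) may be arranged differently):

```python
rho, T = 0.5, 8
# Toeplitz matrix whose generic element is rho^{|t-s|} (the name "Omega" is ours)
Omega = [[rho ** abs(t - s) for s in range(T)] for t in range(T)]

direct = sum(sum(row) for row in Omega)  # summation of all matrix elements
# counting argument: rho^0 appears T times, rho^t appears 2(T - t) times
counted = T + 2 * sum((T - t) * rho ** t for t in range(1, T))
# one closed form, derived from the partial sums of the geometric series
# and of its derivative (algebraically equivalent, possibly arranged differently)
closed = T + 2 * rho * (T * (1 - rho) - 1 + rho ** T) / (1 - rho) ** 2
```

All three quantities coincide up to floating-point rounding.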

Proof of Eq. (19)
We compute Q(·)1_T as follows, where the expression of (1_T′(·)1_T)1_T comes from Eq. (18). Finally, Eq. (19) is obtained from Eq. (A.13) by expanding the elements of the vector, then simplifying the resulting expressions. ◻

Appendix 2: proofs of the probability limits in Section 3
In the following, Eqs. (20) and (21) are derived under the common assumption that, for the unit i, the x_i,t are mutually independent, identically distributed, and have finite moments up to order 4. To derive Eq. (23), one makes the similar assumption that, for each fixed unit n, the x_n,t are mutually independent and identically distributed with finite moments up to order 4, together with the additional limit assumption reported in Eq. (22). The validity of this last assumption is discussed extensively at the end of this appendix. Eqs. (20), (21), and (23) could be derived under more general conditions, but such an extension is beyond the scope of the paper.

Proof of Eq. (20)
Eq. (20) simply replaces the empirical average of the transposes of the x_i,t (which is (1/T) 1_T′ X_i) with the transpose of their common expected value, and follows from Chebyshev's weak law of large numbers (Ruud 2000, Sect. 13.4.2). ◻

Proof of Eq. (21)
In order to prove Eq. (21), it is convenient to introduce (recalling the definition of u_T provided in Eq. (19)) the vector v_T, since the argument of the probability limit in Eq. (21) can be written as follows: in other words, the T elements of each row of X_i′ are summed with (different and deterministic) weights v_T,t (the components of v_T), for t = 1, …, T, and their weighted sum is then divided by T. This suggests the application of a suitable form of the law of large numbers, which holds in this case: specifically, the one provided in (Bai et al. 1997, Theorem 2.1). In view of the application of that theorem, we first investigate the following properties of the various terms involved in Eqs. (A.14) and (A.15).
(i) The Euclidean norm of the vector u_T is bounded from above as follows, for some K_u > 0 independent of T:

This follows from the fact that the absolute values of all the components of the vector u_T, whose expression is reported in Eq. (19), are bounded from above by a sufficiently large K_u > 0, which is independent of T.

(ii) All the eigenvalues of the within-unit correlation matrix (with generic element ρ^{|t−s|}) belong to the interval ((1−ρ²)/(1+ρ²+2ρ), (1−ρ²)/(1+ρ²−2ρ)). This result follows by observing that it is a symmetric Toeplitz matrix (see footnote 7) (Gray 2006), whose associated spectral density can be written as a Fourier series (here i denotes the imaginary unit). By inverting the Fourier series as in (Gilgen 2006, Eqs. (7.77)–(7.79)), one gets the essential infimum m_f = (1−ρ²)/(1+ρ²+2ρ) > 0 and the essential supremum M_f = (1−ρ²)/(1+ρ²−2ρ) < +∞, since ρ ∈ (−1, 1), which concludes the proof of item (ii).

(iii) The matrix has 0 as an eigenvalue with multiplicity 1, and an associated eigenvector is 1_T.
This result follows from the characterization of the eigenvalues of a symmetric matrix M ∈ ℝ^{T×T} as the stationary values of its Rayleigh quotient x′Mx/(x′x) (with x ∈ ℝ^T and x ≠ 0_T) (Parlett 1998, Chapter 1), from the invertibility of the correlation matrix, and from the fact that Q has the eigenvalue 0 with multiplicity 1, with associated eigenvector 1_T. Hence, the corresponding quadratic form vanishes for x ≠ 0_T only if x is proportional to 1_T.
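The eigenvalue bounds of item (ii) can be verified numerically. The sketch below is our own (the names `Omega` and `rho` stand for the within-unit correlation matrix and the correlation parameter): all eigenvalues of the ρ^{|t−s|} Toeplitz matrix fall strictly inside the stated interval.

```python
import numpy as np

rho, T = 0.6, 50
idx = np.arange(T)
# symmetric Toeplitz matrix with generic element rho^{|t-s|} (the name is ours)
Omega = rho ** np.abs(np.subtract.outer(idx, idx))
eigs = np.linalg.eigvalsh(Omega)

# endpoints of the interval in item (ii): essential infimum and supremum
# of the associated spectral density
m_f = (1 - rho ** 2) / (1 + rho ** 2 + 2 * rho)
M_f = (1 - rho ** 2) / (1 + rho ** 2 - 2 * rho)
```

For ρ = 0.6 the interval is (0.25, 4.0), and every computed eigenvalue lies strictly inside it, as predicted by the Toeplitz-matrix argument.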
(iv) All the other eigenvalues of the matrix belong to the interval ((1−ρ²)/(1+ρ²+2ρ), (1−ρ²)/(1+ρ²−2ρ)). This follows again from the characterization of the eigenvalues of a symmetric matrix as the stationary values of its Rayleigh quotient, from the Courant–Fischer max-min theorem (Parlett 1998, Theorem 10.2.1), and from item (ii). Indeed, one orders the (real) eigenvalues of the two matrices involved non-decreasingly and applies the Courant–Fischer max-min theorem; the remaining bound is obtained by expressing Q′x using a basis of orthonormal eigenvectors e_t of the correlation matrix, associated with the respective eigenvalues λ_t, t = 1, …, T. Indeed, for some coefficients c_t, t = 1, …, T (depending on x), one can expand Q′x on that basis; evaluating the corresponding quadratic forms term by term, one gets the stated bounds for any x ≠ 0_T.

Footnote 7: We recall that a matrix M ∈ ℝ^{T×T} is a symmetric Toeplitz matrix if its generic element in position (t, s) equals m_{|t−s|}, where m_0, m_1, …, m_{T−1} ∈ ℝ.
(v) All the non-zero eigenvalues of the Moore–Penrose pseudoinverse of the matrix in item (iv) belong to the interval ((1+ρ²−2ρ)/(1−ρ²), (1+ρ²+2ρ)/(1−ρ²)). This follows from items (iii) and (iv) and from the relation between the singular value decomposition of a symmetric positive semi-definite matrix and that of its Moore–Penrose pseudoinverse, reported in footnote 2.

(vi) The Euclidean norm of the vector v_T is bounded from above as follows, for some K_v > 0 independent of T: this is obtained by combining the definition of v_T provided in Eq. (A.14) with items (i) and (v), and with the fact that the eigenvalue of Q = Q′ with maximum modulus is 1. A possible expression for K_v is K_v = ((1+ρ²+2ρ)/(1−ρ²)) K_u.

(vii) The following holds: this is obtained immediately from item (vi).
To conclude the proof of Eq. (21), we first consider the case in which, for the unit i, all the x_i,t have mean 0_p; later, this additional assumption is removed.

(viii) Proof of Eq. (21) when all the x_i,t have mean 0_p. Item (vii) and the fact that all the elements of each row of X_i′ have mean 0, are independent and identically distributed, and have finite moments up to order 4 allow one to apply (Bai et al. 1997, Theorem 2.1), yielding an almost-sure convergence result (see footnote 8). This, combined with the inequalities above and with the fact that almost-sure convergence implies convergence in probability (Rao 1973), shows that the convergence in probability holds for all r = 1, …, p. To conclude, one gets Eq. (21) from Eqs. (A.15) and (A.30) by exploiting the fact that, for a sequence of random matrices of fixed dimensions, element-wise convergence in probability implies convergence in probability of the whole sequence (Lee 2010).

(ix) Proof of Eq. (21) when all the x_i,t have the same mean m ∈ ℝ^p. Setting x̃_i,t := x_i,t − m, so that the x̃_i,t have mean 0_p, and redefining the associated matrices and vectors accordingly, one reduces the analysis to the one made in item (viii).
Remark 6.1 As a variation of item (ii), a simpler argument (not based on the theory of Toeplitz matrices) can be used to prove that all the eigenvalues of the matrix belong to the interval (1 − 2ρ/(1−ρ), 1 + 2ρ/(1−ρ)). This result follows by seeing the matrix as a perturbation of the identity matrix, then applying Gershgorin's circle theorem (Gerschgorin 1931). Indeed, all its eigenvalues (which are non-negative, the matrix being symmetric and positive semi-definite) belong to the union of the T Gershgorin circles C_i (i = 1, …, T) in the complex plane, which have the same center 1 and respective radii given by the off-diagonal absolute row sums. The latter radii can be bounded from above by 2ρ/(1−ρ), which follows from a geometric-series argument based on Eq. (A.11). We have preferred to use the argument based on Toeplitz matrices in the main text, since imposing (1 − 2ρ/(1−ρ), 1 + 2ρ/(1−ρ)) ⊂ (0, +∞) requires the additional assumption ρ < 1/3 (instead, ((1−ρ²)/(1+ρ²+2ρ), (1−ρ²)/(1+ρ²−2ρ)) ⊂ (0, +∞) holds for any ρ ∈ (−1, 1)). Moreover, the Toeplitz argument produces an even better (i.e., larger) lower bound (when the off-diagonal absolute row sums are replaced by their upper bound 2ρ/(1−ρ)), since 1 − 2ρ/(1−ρ) < (1−ρ²)/(1+ρ²+2ρ).

Footnote 8: We recall that a sequence of random real variables b_T, T = 1, 2, …, converges almost surely to b ∈ ℝ if Prob(lim_{T→∞} b_T = b) = 1.
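The comparison between the two bounds in Remark 6.1 can be illustrated numerically. The sketch below is our own (symbols `Omega` and `rho` as before); it takes ρ < 1/3, so that the Gershgorin lower bound is positive, and checks that the Toeplitz-based lower bound is the larger (better) of the two.

```python
import numpy as np

rho, T = 0.25, 40   # rho < 1/3, so the Gershgorin lower bound stays positive
idx = np.arange(T)
Omega = rho ** np.abs(np.subtract.outer(idx, idx))
eigs = np.linalg.eigvalsh(Omega)

# Gershgorin-based interval: center 1, radius bounded by 2*rho/(1 - rho)
gersh_lo = 1 - 2 * rho / (1 - rho)
gersh_hi = 1 + 2 * rho / (1 - rho)
# Toeplitz-based lower bound (item (ii))
toep_lo = (1 - rho ** 2) / (1 + rho ** 2 + 2 * rho)
```

For ρ = 0.25 one gets gersh_lo = 1/3 and toep_lo = 0.6, so the Toeplitz argument indeed gives the tighter lower bound, while all computed eigenvalues remain inside both intervals.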

Proof of Eq. (23)
To make the reading easier, the proof of Eq. (23) is divided into several steps.
(i) The following holds: this is obtained as follows. First, since the matrix Q = Q′ is idempotent, one gets the first identity by (Maciejewski and Klein 1985, Appendix). Additionally, since Moore–Penrose pseudoinversion commutes with transposition (Barata and Hussein 2012), we get the second identity, where the last step follows again from (Maciejewski and Klein 1985, Appendix).

(iii) This is obtained as follows. Denoting by ε > 0 an upper bound on the spectral norm of the matrix appearing in Eq. (22), and by c_n,h the h-th column of X_n, the absolute value of the element in position (h, k) of the matrix obtained by sandwiching that matrix between X_n′ and X_n can be bounded from above by applying the Cauchy–Schwarz inequality, together with the elementary inequality |a||b| ≤ (a² + b²)/2 for a, b ∈ ℝ. Since, by the assumption stated in Eq. (22), one can make ε tend to 0 as T tends to +∞, and both ‖c_n,h‖² and ‖c_n,k‖² are summations of T independent and identically distributed random variables with finite mean and finite second-order moments, by applying Chebyshev's weak law of large numbers one gets Eq. (A.37). Finally, one gets Eq. (A.35) from Eq. (A.37), since, for a sequence of random matrices of fixed dimensions, element-wise convergence in probability implies convergence in probability of the whole sequence (Lee 2010).

(iv) The following probability limit holds: this is obtained as follows. Exploiting the symmetry of Q and the Cholesky factorization of the inverse correlation matrix (see, e.g., Ruud 2000, Sect. 19.2), one gets Eq. (A.41), and hence Eq. (A.42). Now, we compute the probability limit on the right-hand side of Eq. (A.42) by considering the following terms separately.
(iv.a) The following holds: this is obtained by directly applying Chebyshev's inequality (Ruud 2000, Section D.2), since each element of the matrix ẍ_n,1 ẍ′_n,1 has finite mean and finite second-order moments.
(iv.b) Similarly, since the addition of a finite number of terms like the one reported in Eq. (A.43) does not change the probability limit, one gets the displayed probability limit, where the last equality is obtained by exploiting the eigendecomposition of Q′Q (which, combined with the assumptions on the x_n,t, shows that each element in position (h, k) of the matrix X_n′Q′QX_n is the summation of T − 1 independent random variables with mean (E{(x_n,1 − E{x_n,1})(x_n,1 − E{x_n,1})′})_h,k and the same finite variance), then applying Chebyshev's weak law of large numbers.

(iv.c) Moreover, the second-to-last equality in the corresponding chain comes from the fact that the probability limit of the product of two factors equals the product of the probability limits of the two factors, when the latter exist (a consequence of the Continuous Mapping Theorem (Florescu 2015, Theorem 7.33)), and from the assumptions on the x_n,t.

Remark 6.2 One can easily check (e.g., by a numerical study for selected values of T and of the correlation parameter) that, in general, the stronger result of exact equality between the two matrices compared in Eq. (22) does not hold. This depends on the fact that, given two matrices M_1, M_2 ∈ ℝ^{T×T}, typically (M_1 M_2)^+ ≠ M_2^+ M_1^+, apart from particular cases (Dattorro 2019, Eq. (E.0.0.0.1)).
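The failure of the reverse-order law for the Moore–Penrose pseudoinverse, mentioned in Remark 6.2, is easy to exhibit. A minimal deterministic counterexample (our own choice of matrices; any two rank-deficient matrices violating the relevant range conditions would do):

```python
import numpy as np

M1 = np.array([[1.0, 1.0], [0.0, 0.0]])  # rank-one matrix
M2 = np.array([[1.0, 0.0], [2.0, 0.0]])  # rank-one matrix

lhs = np.linalg.pinv(M1 @ M2)                   # (M1 M2)^+
rhs = np.linalg.pinv(M2) @ np.linalg.pinv(M1)   # M2^+ M1^+
gap = np.abs(lhs - rhs).max()                   # nonzero: reverse-order law fails
```

Here M1 @ M2 = [[3, 0], [0, 0]], so the (1, 1) element of the left-hand side is 1/3, whereas the (1, 1) element of the right-hand side is 0.3; note that for invertible matrices the two sides would coincide, which is why rank-deficient factors are needed.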
Eq. (22), i.e., the requirement that the spectral norm of the difference between the two matrices compared there tends to 0 as T → ∞, represents a stronger convergence requirement than the convergence result provided by Eq. (A.46), which has been proved above. This depends on the fact that, for any matrix M ∈ ℝ^{T×T}, one has ‖M‖₂ ≥ ‖M‖_HS, where ‖·‖_HS denotes the (normalized) Hilbert–Schmidt, or weak, norm (Gray 2006, Eq. (2.19)). The validity of Eq. (22) has been assumed to complete the proof of Eq. (23) (specifically, of part of item (iii) therein), since a similar argument based on Eq. (A.46) would not be enough to complete that proof. Although we are currently unable to provide a formal proof of Eq. (22) (which is why it has been reported here as an assumption), its validity is strongly supported by the numerical results shown in Fig. 2, where the spectral norm error is reported for several choices of T and of the correlation parameter (similar results are obtained for a wider range of values of T and other values of that parameter). The difficulty in proving Eq. (22) stems from the fact that the vector 1_T is not an eigenvector of the correlation matrix (although it is an eigenvector of its circulant approximation). Hence, there is no a priori guarantee that all the elements of an orthonormal basis of eigenvectors of that matrix have nonzero orthogonal projections onto 1_T (indeed, it can easily be checked numerically, e.g., by finding a basis of eigenvectors for a few choices of T and computing such orthogonal projections, that they are typically nonzero) (see footnote 11). This suggests, as a possible way to proceed in the proof, to investigate theoretically whether such orthogonal projections converge uniformly to 0 as T tends to +∞. In any case, we have reported Eq. (A.46) together with its proof in this appendix, because that equation is obviously related to Eq. (22), and because a proof of the latter could be obtained by combining Eq. (A.46) with specific properties of the correlation matrix.
It is also worth mentioning that such a proof would not necessarily be based on the use of a circulant approximation of the correlation matrix, and then of its inverse (for which negative results on the spectral-norm approximation error are known, unless a related but restricted notion of finite-term strong convergence is considered (Sun 2003, Theorem 1)).
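Numerical evidence of the kind reported in Fig. 2 can be reproduced along the following lines. The sketch below is our own reconstruction, under two stated assumptions: Q is taken to be the within-transformation (centering) matrix I_T − (1/T)1_T1_T′, consistent with the simplified expressions discussed in the special case below, and the correlation matrix is taken to have generic element ρ^{|t−s|} (both names are ours). The spectral-norm error of Eq. (22) is then computed for increasing T:

```python
import numpy as np

def spectral_err(T, rho):
    """Spectral-norm distance between (Q Omega Q')^+ and Q Omega^{-1} Q',
    under the assumed forms of Q (centering matrix) and Omega (AR(1)-type
    Toeplitz correlation matrix)."""
    Q = np.eye(T) - np.ones((T, T)) / T
    idx = np.arange(T)
    Omega = rho ** np.abs(np.subtract.outer(idx, idx))
    A = np.linalg.pinv(Q @ Omega @ Q)   # (Q Omega Q')^+
    B = Q @ np.linalg.inv(Omega) @ Q    # Q Omega^{-1} Q'
    return np.linalg.norm(A - B, 2)     # spectral norm of the difference

errs = [spectral_err(T, 0.5) for T in (20, 80, 320)]
```

In this reconstruction the error decays as T grows (roughly like 1/T), in line with the behavior reported in Fig. 2 in support of the assumption in Eq. (22).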
In that case, the correlation matrix reduces to I_T, and the simplified expression QQ′ = I_T − (1/T) 1_T 1_T′ holds. Then, one gets Q′(QQ′)^+Q = QQ′ = Q′Q by combining Eq. (A.31) and the relation between the singular value decomposition of a matrix and that of its Moore–Penrose pseudoinverse.
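The identities in the identity-correlation special case can be checked directly. A minimal sketch (our own; we take Q = I_T − (1/T)1_T1_T′, which is symmetric and idempotent, so its Moore–Penrose pseudoinverse coincides with the matrix itself):

```python
import numpy as np

T = 6
Q = np.eye(T) - np.ones((T, T)) / T   # symmetric and idempotent: Q @ Q == Q

# For a symmetric idempotent matrix, the pseudoinverse is the matrix itself,
# hence Q' (Q Q')^+ Q = Q Q' = Q' Q = Q.
pinv_Q = np.linalg.pinv(Q)
lhs = Q.T @ np.linalg.pinv(Q @ Q.T) @ Q
```

Both `pinv_Q` and `lhs` coincide with Q up to floating-point rounding.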
First, we consider the case in which N is large. Assuming stationarity and mutual independence of different observations associated with the same unit, one computes the elements of the matrix below, where A^{−1/2} is the principal square root of the symmetric and positive definite matrix A^{−1}. Second, we consider the case in which both N and T are large. In this case, (A.49) is replaced by (A.52), for the same matrix A as above.
When (20) and (A.52) hold and the correlation parameter is equal to 0, the conditional generalization error (17) has the following large-sample approximation with respect to N and T: starting from the large-sample approximations (A.51) and (A.53) of the conditional generalization error, and adopting the model (27) for the variance σ², two optimization problems similar to (28) can be stated and solved. For simplicity, in the following we make some approximations in the analysis of their optimal solutions.
In the first problem, one optimizes the corresponding large-sample approximation of the conditional generalization error with respect to N (or, equivalently, with respect to c, as in (28)), whereas T is fixed. More precisely, for C sufficiently large (so that the large-sample approximation (A.51) can be assumed to hold for every c ∈ [c_min, c_max]), under the approximation NTc ≃ C at optimality (see footnote 13), and after introducing suitable shorthand notation, the first optimization problem can be written as follows.