Optimal trade-off between sample size, precision of supervision, and selection probabilities for the unbalanced fixed effects panel data model

This paper is focused on the unbalanced fixed effects panel data model. This is a linear regression model that is able to represent unobserved heterogeneity in the data, and that allows any two distinct observational units to have possibly different numbers of associated observations. We specifically address the case in which the model includes the additional possibility of controlling the conditional variance of the output given the input, and the selection probabilities of the different units per unit time. This is achieved by varying the cost associated with the supervision of each training example. Assuming an upper bound on the expected total supervision cost and fixing the expected number of observed units for each instant, we analyze and optimize the trade-off between sample size, precision of supervision (the reciprocal of the conditional variance of the output), and selection probabilities. This is obtained by formulating and solving a suitable optimization problem. The formulation of such a problem is based on a large-sample upper bound on the generalization error associated with the estimates of the parameters of the unbalanced fixed effects panel data model, conditioned on the training input dataset. We prove that, under appropriate assumptions, in some cases “many but bad” examples provide a smaller large-sample upper bound on the conditional generalization error than “few but good” ones, whereas in other cases the opposite occurs. We conclude by discussing possible applications of the presented results, and extensions of the proposed optimization framework to other panel data models.


Introduction
In many situations involving economics, engineering, physics, and other fields, it is required to approximate a function from a finite number of noisy examples (Vapnik 1998). In some cases, the output noise variance can be reduced to some extent, by increasing the cost of each supervision. For example, devices with higher precision could be used to acquire measurements, or experts could be involved in the data analysis procedure. However, in the presence of a budget constraint, increasing the cost of each supervision could reduce the total number of available labeled examples. In such cases, the investigation of an optimal trade-off between the sample size and the precision of supervision plays a key role. In Gnecco and Nutarelli (2019a), this analysis was carried out by employing the classical linear regression model, suitably modified in order to include the additional possibility of controlling the conditional variance of the output given the input. Specifically, this was pursued by varying the time (hence, the cost) dedicated to the supervision of each training example, and fixing an upper bound on the total available supervision time. Based on a large-sample approximation of the output of the ordinary least squares regression algorithm, it was shown therein that the optimal choice of the supervision time per example is highly dependent on the noise model. The analysis was refined in Gnecco and Nutarelli (2019b), where an additional algorithm (weighted least squares) was considered, and shown to produce similar results at optimality as the ordinary least squares algorithm, for a model in which different training examples are possibly associated with different supervision times.
In this work, we analyze the optimal trade-off between sample size, precision of supervision, and selection probabilities for a more general linear model of the input-output relationship, which is the unbalanced fixed effects panel data model. The (either balanced or unbalanced) fixed effects model is commonly applied in the econometric analysis of microeconomic and macroeconomic data (Andreß et al. 2013; Arellano 2004; Cameron and Trivedi 2005; Wooldridge 2002), where each unit may represent, e.g., a firm, or a country. It is also applied, among other fields, in biostatistics (Härdle et al. 2007), educational research (Sherron et al. 2000), engineering (Reeve 1988; Yu et al. 2018; Zeifman 2015), neuroscience (Friston et al. 1999), political science (Bell and Jones 2014), and sociology (Frees 2004). In a fixed effects panel data model, observations related to different observational units (individuals) are associated with possibly different constants, which are able to represent unobserved heterogeneity in the data. Moreover, the same unit is observed along another dimension, which is typically time. In the unbalanced case, at each instant, different units may not be observed with some positive probability (possibly unit-dependent), resulting in a possibly unbalanced panel. In this framework, the balanced case corresponds to the situation in which the number of observations is the same for all the units.
The present work significantly extends the analysis of our previous conference article (Gnecco and Nutarelli 2020) to the unbalanced fixed effects panel data model, which is more general than the balanced case considered therein, and leads to an optimization problem that is more complex to investigate. Indeed, in Gnecco and Nutarelli (2020), all the units are always selected at each instant; therefore, the selection probabilities do not appear as optimization variables in the corresponding model. Moreover, theoretical arguments are reported in much more detail in the current work.
The results that will be presented in this paper concerning the unbalanced fixed effects panel data model are consistent with those of Gnecco and Nutarelli (2020) for the balanced case, and those of Gnecco and Nutarelli (2019a, b) concerning simpler linear regression models. Specifically, we show that, also for the unbalanced fixed effects panel data model, the following holds. When the precision of the supervision increases less than proportionally with respect to the supervision cost per example, the minimum (large-sample upper bound on the) generalization error (conditioned on the training input dataset) is obtained for the smallest supervision cost per example. As a consequence of the problem formulation, this corresponds to the choice of the largest number of examples. Instead, when the precision of the supervision increases more than proportionally with respect to the supervision cost per example, the optimal supervision cost per example is the largest one. Again, as a consequence of the problem formulation, this corresponds to the choice of the smallest number of examples. The structure of the optimal selection probabilities is also investigated, under the constraint of a constant expected number of observed units for each instant. In summary, the results of the theoretical analyses performed, for different regression models of increasing complexity, in Gnecco and Nutarelli (2019a, b, 2020), and in this paper highlight that, in some circumstances, collecting a smaller number of more reliable data is preferable to increasing the size of the sample set. This is particularly relevant when one is given a certain flexibility in designing the data collection process.
To the best of our knowledge, the analysis and the optimization of the trade-off between sample size, precision of supervision, and selection probabilities in regression has been carried out rarely in the machine-learning literature. Nevertheless, the approach applied in this paper resembles the one used in the optimization of sample survey design, where some of the design parameters are optimized to minimize the sampling variance (Groves et al. 2004). Such an approach is also similar to the one exploited in Nguyen et al. (2009) for the optimization of the design of measurement devices. In that framework, however, linear regression is marginally involved, since only arithmetic averages of measurement results are considered therein. The search for optimal sample designs can also be performed by the Optimal Computing Budget Allocation (OCBA) method (Chen and Lee 2010). Differently from that approach, however, our analysis provides the optimal design a priori, i.e., before actually collecting the data. Our work can also be related to recent literature dealing with the joint application of machine learning, optimization, and econometrics (Varian 2014; Athey and Imbens 2016; Bargagli-Stoffi and Gnecco 2018, 2019; Crane-Droesch 2017). For instance, the generalization error, which is typically investigated in machine learning and optimized by solving suitable optimization problems, is not addressed in the classical analysis of the either balanced or unbalanced fixed effects panel data model (Wooldridge 2002, Chapters 10 and 17). Finally, an advantage of the approach considered in this work with respect to other possible ones grounded on Statistical Learning Theory (SLT) (Vapnik 1998) is that, being based on a large-sample approximation, it provides bounds on the conditional generalization error that do not need any a-posteriori evaluation of empirical risks.
The paper is structured as follows. Section 2 provides a background on the unbalanced fixed effects panel data model. Section 3 presents the analysis of its conditional generalization error, and of the large-sample upper bound on the latter with respect to time. Section 4 formulates and solves the optimization problem modeling the trade-off between sample size, precision of supervision, and selection probabilities for the unbalanced fixed effects panel data model, using the large-sample upper bound above. Finally, Sect. 5 discusses some possible applications and extensions of the theoretical results obtained in the work.

Background
We recall some basic facts about the following (static) unbalanced fixed effects panel data model (see, e.g., (Wooldridge 2002, Chapters 10 and 17)). Let n = 1, …, N denote observational units and, for each n, let t = 1, …, T_n be time instants. Moreover, let the inputs x_{n,t} (n = 1, …, N, t = 1, …, T_n) to the model be random column vectors in R^p and, for each n = 1, …, N and t = 1, …, T_n, let the output y_{n,t} ∈ R be a scalar. The parameters of the model are some individual constants η_n (n = 1, …, N), one for each unit, and a column vector β ∈ R^p. The (noise-free) input-output relationship is expressed as follows:

y_{n,t} := η_n + β^⊤ x_{n,t}, n = 1, …, N, t = 1, …, T_n. (1)

Equation (1) represents an unbalanced panel data model, which can be applied in the following two situations:
• distinct units n are associated with possibly different numbers T_n of data collected at each time instant t = 1, …, T_n over a whole observation period T ≥ max_{n=1,…,N} T_n;
• the observations related to the same unit are associated with a subsequence {t_1, t_2, …, t_{T_n}} of the sequence {1, 2, …, T}.
In the next sections, we focus on the second situation. To avoid burdening the notation by introducing an additional index, we still indicate, also in this case, by {1, 2, …, T_n} the subsequence {t_1, t_2, …, t_{T_n}}. A possible way to get different numbers of observations T_n for distinct units consists in associating to each unit n a scalar q_n ∈ (0, 1], which denotes the (positive) probability that n is observed at any time t. Selections for different units are supposed to be mutually independent. For simplicity, for each unit, selections at different times are also assumed to be mutually independent. For a total observation time T, denoting by E the expectation operator, the expected number of observations for each unit n is E{T_n} = q_n T. The balanced case, which was considered in the analysis of Gnecco and Nutarelli (2020), corresponds to the situation q_n = 1 for each n. Let {ε_{n,t}}_{n=1,…,N, t=1,…,T_n} be a collection of mutually independent and identically distributed random variables, having mean 0 and the same variance σ². Moreover, let all the ε_{n,t} be independent also from all the x_{n,t}. It is assumed that noisy measurements ỹ_{n,t} of the outputs y_{n,t} are available; specifically, the following additive noise model is considered:

ỹ_{n,t} = y_{n,t} + ε_{n,t}, n = 1, …, N, t = 1, …, T_n. (2)

The input-output pairs (x_{n,t}, ỹ_{n,t}), for n = 1, …, N, t = 1, …, T_n, are used to train the model, i.e., to estimate its parameters. In the following, for n = 1, …, N, let X_n ∈ R^{T_n × p} denote the matrix whose rows are the transposes of the x_{n,t}; ỹ_n ∈ R^{T_n} be the column vector that collects the noisy measurements ỹ_{n,t}; I_{T_n} ∈ R^{T_n × T_n} denote the identity matrix; 1_{T_n} ∈ R^{T_n} be the column vector whose elements are all equal to 1; and

Q_n := I_{T_n} − (1/T_n) 1_{T_n} 1_{T_n}^⊤ ∈ R^{T_n × T_n}

be a symmetric and idempotent matrix, i.e., such that Q_n = Q_n^⊤ = Q_n².
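As a minimal numerical sketch of this selection mechanism (all parameter values are hypothetical), one can simulate the independent Bernoulli selections and check that the realized numbers of observations T_n concentrate around their expectations q_n T:

```python
import numpy as np

rng = np.random.default_rng(0)

N, T = 5, 10_000                      # units and total observation horizon
q = rng.uniform(0.3, 0.9, size=N)     # hypothetical selection probabilities q_n

# For each unit n, each instant t is observed independently with probability q_n.
selected = rng.random((N, T)) < q[:, None]
T_n = selected.sum(axis=1)            # realized numbers of observations

# By construction E{T_n} = q_n * T, so T_n / T is close to q_n for large T.
print(np.max(np.abs(T_n / T - q)))    # small for large T
```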
Hence, for each unit n, Q_n X_n and Q_n ỹ_n represent, respectively, the matrix of time de-meaned training inputs, and the vector of time de-meaned corrupted training outputs. The aim of time de-meaning is to generate another dataset that does not include the fixed effects, making it possible to estimate first the vector β, then (going back to the original dataset) the fixed effects η_n. Assuming in the following the invertibility of the matrix Σ_{n=1}^N X_n^⊤ Q_n X_n (see the next Remark 3.2 for a mild condition ensuring this), the fixed effects estimate of β for the unbalanced case is

β̂_FE = (Σ_{n=1}^N X_n^⊤ Q_n X_n)^{−1} Σ_{n=1}^N X_n^⊤ Q_n ỹ_n. (6)

The unbalanced Fixed Effects (FE) estimates of the η_n, for n = 1, …, N, are

η̂_{n,FE} = (1/T_n) 1_{T_n}^⊤ (ỹ_n − X_n β̂_FE). (7)

Let 0_p ∈ R^p be the column vector whose elements are all equal to 0. By taking expectations and recalling the respective definitions and the fact that the measurement errors have 0 mean, it follows that the estimates (6) and (7) are conditionally unbiased with respect to the training input dataset and, for any i = 1, …, N,

E{β̂_FE − β | {X_n}_{n=1}^N} = 0_p, E{η̂_{i,FE} − η_i | {X_n}_{n=1}^N} = 0.

Finally, the covariance matrix of β̂_FE, conditioned on the training input dataset, is

σ² (Σ_{n=1}^N X_n^⊤ Q_n X_n)^{−1}.
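The de-meaning and estimation steps above can be sketched on simulated data (a minimal illustration with hypothetical parameter values; it follows the FE estimates (6) and (7) recalled in this section):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, T = 4, 3, 2000
beta = np.array([1.0, -2.0, 0.5])            # true slope vector
eta = rng.normal(size=N)                     # true fixed effects eta_n
q = np.array([0.9, 0.7, 0.5, 0.8])           # hypothetical selection probabilities

num = np.zeros((p, p))
rhs = np.zeros(p)
X_list, y_list = [], []
for n in range(N):
    T_n = (rng.random(T) < q[n]).sum()       # unbalanced sample sizes
    X = rng.normal(size=(T_n, p))
    y = eta[n] + X @ beta + 0.1 * rng.normal(size=T_n)   # noisy outputs
    Q = np.eye(T_n) - np.ones((T_n, T_n)) / T_n          # de-meaning matrix Q_n
    num += X.T @ Q @ X
    rhs += X.T @ Q @ y
    X_list.append(X)
    y_list.append(y)

beta_fe = np.linalg.solve(num, rhs)                              # estimate (6)
eta_fe = [np.mean(y_list[n] - X_list[n] @ beta_fe) for n in range(N)]  # estimates (7)
print(beta_fe, eta_fe)                       # close to beta and eta for large T
```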

Large-sample upper bound on the conditional generalization error
This section analyzes the generalization error associated with the FE estimates (6) and (7), conditioned on the training input dataset, by providing its large-sample approximation, and a related large-sample upper bound on it. Then, in the next section, the resulting expression is optimized, after choosing a suitable model for the variance σ 2 of the measurement noise, and imposing appropriate constraints.
Let x_i^test ∈ R^p be a random test vector, which is assumed to have finite mean and finite covariance matrix, and to be independent from the training data. We express the generalization error for the i-th unit (i = 1, …, N), conditioned on the training input dataset, as follows:

E{(η̂_{i,FE} + β̂_FE^⊤ x_i^test − η_i − β^⊤ x_i^test)² | {X_n}_{n=1}^N}. (11)

The conditional generalization error (11) represents the expected mean squared error of the prediction of the output associated with a test input, conditioned on the training input dataset. For n = 1, …, N, let ε_n ∈ R^{T_n} be the column vector whose elements are the ε_{n,t}; η_n ∈ R^{T_n} be the column vector whose elements are all equal to η_n; and 0_{T_n × T_n} ∈ R^{T_n × T_n} be the matrix whose elements are all equal to 0. After some algebraic manipulations, the conditional generalization error (11) can be expressed in the form (18), which highlights its dependence on σ² and T_i (see “Appendix 1” for the details). Next, we obtain a large-sample approximation of the conditional generalization error (18) with respect to T, for a fixed number N of units.
For n = 1, …, N, let the symmetric and positive semi-definite matrices A_n ∈ R^{p×p} be defined as

A_n := E{(x_{n,1} − E{x_{n,1}})(x_{n,1} − E{x_{n,1}})^⊤}. (19)

In the following, the positive definiteness (hence, the invertibility) of each matrix A_n is assumed. This is a quite mild condition, because it is associated with the fact that, with positive probability, the random vectors x_{n,1} − E{x_{n,1}} do not belong to any given subspace of R^p with dimension smaller than p (so, they are effectively p-dimensional random vectors). Under mild conditions (e.g., if the x_{n,t} are mutually independent, identically distributed, and have finite moments up to the order 4), the following convergences in probability hold:

plim_{T→+∞} (1/T_n) X_n^⊤ Q_n X_n = A_n, n = 1, …, N, (20)

and

plim_{T→+∞} (1/T) Σ_{n=1}^N X_n^⊤ Q_n X_n = Ā_N, (21)

where

Ā_N := Σ_{n=1}^N q_n A_n, (22)

which is the weighted summation, with positive weights q_n, of the symmetric and positive definite matrices A_n; hence, it is also a symmetric and positive definite matrix.

(Footnote 3) The large-sample analysis refers to the case of a large horizon T. The case of finite T and large N is of more interest for microeconometrics (Cameron and Trivedi 2005), and will be investigated in future research.

(Footnote 4) We recall that a sequence of random real matrices M_T of the same dimension, T = 1, 2, …, converges in probability to the real matrix M if, for every ε > 0, Prob(∥M_T − M∥ > ε) (where ∥·∥ is an arbitrary matrix norm) tends to 0 as T tends to +∞. In this case, one writes plim_{T→+∞} M_T = M.

Remark 3.1 Equations (20) and (21) follow from the extension of Chebyschev’s weak law of large numbers (Ruud 2000, Section 13.4.2) to the case of the summation of a random number of mutually independent random variables (Révész 1968, Theorem 10.1), combined with other technical results. First, for each n = 1, …, N, convergence in probability of (1/T_n) X_n^⊤ Q_n Q_n X_n to A_n is proved element-wise, by applying (Révész 1968, Theorem 10.1). Then, one exploits the fact that, as a consequence of the Continuous Mapping Theorem (Florescu 2015, Theorem 7.33), the probability limit of the product of two random variables (in this case, T_n/T and each element of (1/T_n) X_n^⊤ Q_n Q_n X_n) equals the product of their probability limits, when the latter two exist (which is the case for T_n/T and each element of (1/T_n) X_n^⊤ Q_n Q_n X_n). Finally, one applies the fact that, for a random matrix, element-wise convergence in probability implies convergence in probability of the whole random matrix (Lee 2010).
Remark 3.2 The existence of the probability limit (21) and the positive definiteness of the matrix Ā_N guarantee that the invertibility of the matrix Σ_{n=1}^N X_n^⊤ Q_n X_n (see Sect. 2) holds with probability close to 1 for large T. Due to the generalization of Slutsky’s theorem reported in (Greene 2003, Theorem D.14), under the stated assumptions also the sequence of random matrices ((1/T) Σ_{n=1}^N X_n^⊤ Q_n X_n)^{−1} converges in probability to Ā_N^{−1}. This is needed to obtain the next large-sample approximation (25) of the conditional generalization error.
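The probability limit (21) can be checked by a small Monte Carlo experiment (a sketch under the simplifying, hypothetical assumption that all units share the same input distribution, so that each A_n equals the common input covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 3, 2
q = np.array([0.5, 0.8, 0.6])                 # hypothetical selection probabilities
cov = np.array([[1.0, 0.3], [0.3, 2.0]])      # common input covariance, so A_n = cov

def demeaned_gram_over_T(T):
    """(1/T) * sum_n X_n' Q_n X_n for one simulated unbalanced panel of horizon T."""
    total = np.zeros((p, p))
    for n in range(N):
        T_n = (rng.random(T) < q[n]).sum()    # Bernoulli unit selection
        X = rng.multivariate_normal(np.array([1.0, -1.0]), cov, size=T_n)
        Xc = X - X.mean(axis=0)               # multiplying by Q_n just de-means columns
        total += Xc.T @ Xc
    return total / T

A_bar = q.sum() * cov                         # probability limit: sum_n q_n A_n
approx = demeaned_gram_over_T(50_000)
print(np.max(np.abs(approx - A_bar)))         # shrinks as T grows
```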

Remark 3.3
We point out that the conditional generalization error (11) is investigated in this work, instead of its unconditional version, because, in general, probability limits and expectations cannot be interchanged. This could prevent the application of (Greene 2003, Theorem D.14) (or of similar results about probability limits) when performing a similar analysis for the unconditional generalization error.
Let ∥·∥₂ denote the l₂-norm. Replacing (1/T) Σ_{n=1}^N X_n^⊤ Q_n X_n in Eq. (18) by its probability limit Ā_N, one obtains the large-sample approximation (25) (with respect to T) of the conditional generalization error (11). In the following, we denote, for a generic symmetric matrix A ∈ R^{s×s}, by λ_min(A) and λ_max(A), respectively, its minimum and maximum eigenvalues. Starting from the large-sample approximation (25), the chain of relations (26) can be proved (see “Appendix 2” for the details). We refer to the resulting inequality (27) as the large-sample upper bound on the conditional generalization error. Interestingly, its right-hand side is expressed in the separable form (σ²/T) K_i({q_n}_{n=1}^N), where the factor K_i({q_n}_{n=1}^N) depends only on the q_n. As shown in the next section, this simplifies the analysis of the trade-off between sample size, precision of supervision, and selection probabilities performed therein, since one does not need to compute the exact expression of the function K_i({q_n}_{n=1}^N) to find the optimal trade-off with respect to a suitable subset of optimization variables.
Optimal trade-off between sample size, precision of supervision, and selection probabilities

In this section, we are interested in optimizing the large-sample upper bound (27) on the conditional generalization error when the variance σ² is modeled as a decreasing function of the supervision cost per example c, and a given upper bound C > 0 is imposed on the expected total supervision cost Σ_{n=1}^N q_n T c associated with the whole training set. For large T, this upper bound practically coincides with the total supervision cost Σ_{n=1}^N T_n c. This follows by an application of Chebyschev’s weak law of large numbers.
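The near-coincidence, for large T, of the expected and realized total supervision costs can be verified numerically (hypothetical parameter values):

```python
import numpy as np

rng = np.random.default_rng(3)
q = np.array([0.4, 0.7, 0.9])           # hypothetical selection probabilities
T, c = 100_000, 0.5                     # horizon and supervision cost per example

T_n = (rng.random((len(q), T)) < q[:, None]).sum(axis=1)
realized = T_n.sum() * c                # total supervision cost  sum_n T_n c
expected = q.sum() * T * c              # expected total cost     sum_n q_n T c
print(abs(realized / expected - 1))     # relative gap, vanishing for large T
```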

Remark 4.1
In our previous conference work (Gnecco and Nutarelli 2020), the large-sample approximation (25) was optimized, instead of the upper bound (27). This was motivated by the fact that all the selection probabilities q_n were fixed to 1, implying that both q_i and Ā_N, hence also the term K_i({q_n}_{n=1}^N), were constant therein.
In the following analysis of the optimal trade-off, N is kept fixed; furthermore, one imposes the constraints q_{n,min} ≤ q_n ≤ q_{n,max}, n = 1, …, N, for some given q_{n,min} ∈ (0, 1) and q_{n,max} ∈ [q_{n,min}, 1], and the constraint

Σ_{n=1}^N q_n = q̄N, (31)

for some given q̄ ∈ (0, 1]. In Eq. (31), Σ_{n=1}^N q_n represents the expected number of observed units for each instant, which is fixed. Moreover, T is chosen as ⌊C/(q̄Nc)⌋. Finally, the supervision cost per example c is allowed to take values in the interval [c_min, c_max], where 0 < c_min < c_max, so that the resulting T belongs to {⌊C/(q̄Nc_max)⌋, …, ⌊C/(q̄Nc_min)⌋}. In the following, C is supposed to be sufficiently large, so that the large-sample upper bound (27) can be assumed to hold for every c ∈ [c_min, c_max] and every q_n ∈ [q_{n,min}, q_{n,max}] (for n = 1, …, N).
Consistently with Gnecco and Nutarelli (2019a, b, 2020), we adopt the following model for the variance σ², as a function of the supervision cost per example c:

σ²(c) = k c^{−α}, (32)

where k, α > 0. For 0 < α < 1, if one doubles the supervision cost per example c, then the precision 1/σ²(c) (i.e., the reciprocal of the conditional variance of the output) becomes less than two times its initial value (or equivalently, the variance σ²(c) becomes more than one half its initial value). This case is referred to as “decreasing returns of scale” in the precision of each supervision. Conversely, for α > 1, if one doubles the supervision cost per example c, then the precision 1/σ²(c) becomes more than two times its initial value (or equivalently, the variance σ²(c) becomes less than one half its initial value). This case is referred to as “increasing returns of scale” in the precision of each supervision. Finally, the case α = 1 is intermediate and refers to “constant returns of scale”. In all the cases above, the precision of each supervision increases by increasing the supervision cost per example c. Summarizing, under the assumptions above, the optimal trade-off between sample size, precision of supervision, and selection probabilities for the unbalanced fixed effects panel data model is modeled by the following optimization problem:

minimize_{c ∈ [c_min, c_max], q_n ∈ [q_{n,min}, q_{n,max}], n = 1, …, N}  K_i({q_n}_{n=1}^N) k c^{−α} / ⌊C/(q̄Nc)⌋, subject to the constraint (31). (33)

By a similar argument as in the proof of (Gnecco and Nutarelli 2019b, Proposition 3.2), which refers to an analogous function approximation problem, when C is sufficiently large, the objective function C K_i({q_n}_{n=1}^N) k c^{−α} / ⌊C/(q̄Nc)⌋ of the optimization problem (33), rescaled by the multiplicative factor C, can be approximated, with a negligible error in the maximum norm on [c_min, c_max] × ∏_{n=1}^N [q_{n,min}, q_{n,max}], by q̄N K_i({q_n}_{n=1}^N) k c^{1−α}. Figure 1 shows the behavior of the rescaled objective functions for the three cases 0 < α = 0.5 < 1, α = 1.5 > 1, and α = 1.
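The three returns-of-scale regimes can be illustrated numerically: under the model σ²(c) = k c^{−α}, doubling c multiplies the precision 1/σ²(c) by 2^α (the value of k is hypothetical; the conclusion does not depend on it):

```python
# Variance model sigma^2(c) = k * c**(-alpha): doubling the cost per example
# multiplies the precision 1/sigma^2(c) by 2**alpha.
k = 0.5  # hypothetical value

def precision(c, alpha):
    return 1.0 / (k * c ** (-alpha))

for alpha in (0.5, 1.0, 1.5):
    ratio = precision(2.0, alpha) / precision(1.0, alpha)
    # ratio = 2**alpha: <2 (decreasing), =2 (constant), >2 (increasing returns of scale)
    print(alpha, ratio)
```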
The values of the other parameters are k = 0.5, q̄ = 0.5, K_i({q_n}_{n=1}^N) = 2 (which can be assumed to hold for a fixed choice of the set of the q_n), N = 10, C = 125, c_min = 0.4, and c_max = 0.8. One can show by standard calculus that, for C → +∞ and the q_n fixed to constant values, the number of discontinuity points of the rescaled objective function C K_i({q_n}_{n=1}^N) k c^{−α} / ⌊C/(q̄Nc)⌋ tends to infinity, whereas the amplitude of its oscillations above the lower envelope q̄N K_i({q_n}_{n=1}^N) k c^{1−α} tends to 0 uniformly with respect to c ∈ [c_min, c_max].
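This oscillating behavior can be reproduced numerically (a sketch assuming, consistently with the discussion above, that the rescaled objective is C K_i k c^{−α}/⌊C/(q̄Nc)⌋ and the lower envelope is q̄N K_i k c^{1−α}; parameter values as quoted for Figure 1):

```python
import numpy as np

# Rescaled objective C*K_i*k*c**(-alpha) / floor(C/(q_bar*N*c)) versus its lower
# envelope q_bar*N*K_i*k*c**(1-alpha), with the parameter values quoted in the text.
k, q_bar, K_i, N, C = 0.5, 0.5, 2.0, 10, 125.0
c = np.linspace(0.4, 0.8, 1001)

def rescaled_objective(c, alpha, C):
    T = np.floor(C / (q_bar * N * c))       # integer sample horizon (source of jumps)
    return C * K_i * k * c ** (-alpha) / T

for alpha in (0.5, 1.0, 1.5):
    envelope = q_bar * N * K_i * k * c ** (1 - alpha)
    gap = rescaled_objective(c, alpha, C) - envelope
    # The objective oscillates above the envelope; the gap shrinks as C grows.
    print(alpha, gap.min() >= -1e-9, gap.max())
```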
Concluding, under the approximation above, one can replace the optimization problem (33) with the following optimization problem:

minimize_{c ∈ [c_min, c_max], q_n ∈ [q_{n,min}, q_{n,max}], n = 1, …, N}  q̄N K_i({q_n}_{n=1}^N) k c^{1−α}, subject to the constraint (31). (36)

This optimization problem appears in a separable form, in which one can optimize separately the variable c and the variables q_n, for n = 1, …, N. In particular, the optimal solutions c• have the following expressions: c• = c_min for 0 < α < 1; c• = c_max for α > 1; and c• equal to any value in [c_min, c_max] for α = 1. In summary, the results of this part of the analysis show that, in the case of “decreasing returns of scale”, “many but bad” examples are associated with a smaller large-sample upper bound on the conditional generalization error than “few but good” ones. The opposite occurs for “increasing returns of scale”, whereas the case of “constant returns of scale” is intermediate. These results are qualitatively in line with the ones obtained in Gnecco and Nutarelli (2020) for the balanced case and in Gnecco and Nutarelli (2019a, b) for simpler linear regression problems, to which the ordinary/weighted least squares algorithms were applied. This depends on the fact that, in all these cases, the large-sample approximation of the conditional generalization error (or its large-sample upper bound) has the functional form (σ²/T) K_i, where K_i is either a constant, or depends on optimization variables related to neither σ nor T.
One can observe that, in order to discriminate among the three cases of the analysis reported above, one needs to know neither the exact values of the constants k and N, nor the expression of K_i as a function of the q_n. Moreover, to discriminate between the first two cases, it is not necessary to know the exact value of the positive constant α: it suffices to know whether α belongs to the interval (0, 1) or to the interval (1, +∞). Finally, for this part of the analysis, the knowledge needed about the probability distributions of the input examples associated with the different units is limited to the determination of the constants λ_min(A_n) involved in the optimization of the variables q_n.
Assuming that the constant terms λ_min(A_n) and E{∥E{x_{i,1}} − x_i^test∥²₂} are known, the optimal q•_n can be derived as follows. First, note that, for each fixed admissible choice of q_i, the optimization of the other q_n can be restated as the following linear program:

maximize_{q_n ∈ [q_{n,min}, q_{n,max}], n = 1, …, N, n ≠ i}  Σ_{n=1,…,N, n≠i} λ_min(A_n) q_n, subject to Σ_{n=1,…,N, n≠i} q_n = q̄N − q_i. (37)

More precisely, an admissible choice for q_i is one for which q_i ∈ [q̂_{i,min}, q̂_{i,max}], where

q̂_{i,min} := max(q_{i,min}, q̄N − Σ_{n=1,…,N, n≠i} q_{n,max})

and

q̂_{i,max} := min(q_{i,max}, q̄N − Σ_{n=1,…,N, n≠i} q_{n,min}).

The optimization problem (37) is a linear programming one, which can be reduced to a continuous knapsack problem (Martello and Toth 1990, Section 2.2.1), after a rescaling of all its optimization variables and of their respective bounds. It is well known that, due to its particular structure, such a problem can be solved by the following greedy algorithm, which is divided into three steps (for simplicity of exposition, we assume that all the λ_min(A_n) are different from each other):
1. first, the variables q_n are re-ordered according to decreasing values of the associated λ_min(A_n). So, let q̌_n := q_{π(n)} and Ǎ_n := A_{π(n)}, where the function π : {1, …, N} → {1, …, N} is a permutation satisfying λ_min(Ǎ_m) < λ_min(Ǎ_n) for every m > n. Let also ǐ := π^{−1}(i);
2. starting from q̌_n = q̌_{n,min} for every n ≠ ǐ, the first variable q̌_1 (if ǐ ≠ 1) is increased until either the constraint Σ_{n=1,…,N, n≠ǐ} q̌_n = q̄N − q̌_ǐ, or the constraint q̌_1 = q̌_{1,max}, is met; if ǐ = 1, then the procedure is applied to the second variable q̌_2;
3. step 2 is repeated for the successive variables (excluding q̌_ǐ), terminating the first time the constraint Σ_{n=1,…,N, n≠ǐ} q̌_n = q̄N − q̌_ǐ is met (this surely occurs, since q_i is admissible).
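The three greedy steps above can be sketched as follows (an illustrative implementation with hypothetical data; consistently with the re-ordering step, the budget is distributed by filling the units with the largest λ_min(A_n) first):

```python
import numpy as np

def greedy_q(lam, q_min, q_max, budget, i, q_i):
    """Greedy solution of the continuous-knapsack step: given q_i for the test
    unit i, distribute the remaining budget sum_{n != i} q_n = budget - q_i among
    the other units, filling those with the largest lambda_min(A_n) first.
    (A sketch: lam[n] plays the role of lambda_min(A_n).)"""
    q = np.array(q_min, dtype=float)                     # start from the lower bounds
    q[i] = q_i
    remaining = budget - q_i - sum(q_min[n] for n in range(len(lam)) if n != i)
    order = sorted((n for n in range(len(lam)) if n != i),
                   key=lambda n: lam[n], reverse=True)   # decreasing lambda_min(A_n)
    for n in order:
        if remaining <= 0:
            break
        inc = min(q_max[n] - q_min[n], remaining)        # raise q_n up to its cap
        q[n] += inc
        remaining -= inc
    return q

lam = [3.0, 1.0, 2.0, 0.5]                               # hypothetical lambda_min(A_n)
q_opt = greedy_q(lam, q_min=[0.1] * 4, q_max=[0.9] * 4, budget=2.0, i=0, q_i=0.6)
print(q_opt)   # units with larger lambda_min(A_n) saturate first
```

Here the unit with λ_min = 2.0 saturates at its upper bound before any budget reaches the unit with λ_min = 0.5, mirroring the threshold structure discussed below.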
The resulting optimal q•_n (for n = 1, …, N with n ≠ i) are parametrized by the remaining variable q_i. Then, the optimal value of the objective function of the optimization problem (37) is a real-valued function of q_i which, in the following, is denoted by f_i(q_i). It follows from the procedure above that f_i(q_i) is a continuous and piece-wise affine function of q_i, with piece-wise constant slopes λ_min(Ǎ_ǐ) − λ_min(Ǎ_{n(q_i)}), where the choice of the index n is a function of q_i, and is such that λ_min(Ǎ_{n(q_i)}) is a nonincreasing function of q_i. Hence, f_i(q_i) is concave, and is nondecreasing for q_i ≤ q̄N − Σ_{n=1}^{ǐ−1} q̌_{n,max}, where q̌_{n,max} := q_{π(n),max}, and nonincreasing otherwise.
Exploiting the results above, the optimal value of q_i for the original optimization problem (36) is obtained by solving the one-dimensional optimization problem (40) in the variable q_i ∈ [q̂_{i,min}, q̂_{i,max}]. This is a convex optimization problem, since the function 1/q_i appearing in its objective is convex, whereas the remaining term is of the form h(f_i), where f_i is concave and h is convex and nonincreasing, so h(f_i) is convex (Boyd and Vandenberghe 2004, Section 3.2). After solving the optimization problem (40), the optimal values of the other q_n for the original optimization problem (36) are obtained as a consequence of the three steps detailed above. It follows from the reasoning above that the structure of the optimal solutions q•_n is as follows. First, there exists a threshold λ̄• > 0 such that (i) for any n ≠ i with λ_min(A_n) > λ̄•, q•_n is equal to its maximum admissible value q_{n,max}; (ii) for any n ≠ i with λ_min(A_n) < λ̄•, q•_n is equal to its minimum admissible value q_{n,min}; (iii) for at most one unit n ≠ i (for which λ_min(A_n) = λ̄•, provided that there exists one value of n for which this condition holds), q•_n belongs to the interior of the interval [q_{n,min}, q_{n,max}]. Finally, it is worth observing that the structure highlighted above for the optimal solutions q•_n and c• (the latter reported under Eq. (36)), which is valid for any fixed value of q̄, can be useful to solve the modification of the optimization problem (36) obtained in case the constraint (31) is replaced by

q̄_min N ≤ Σ_{n=1}^N q_n ≤ q̄_max N,

for some given q̄_min, q̄_max ∈ (0, 1], with q̄_min < q̄_max.

Conclusions
In this paper, the optimal trade-off between sample size, precision of supervision, and selection probabilities has been studied with specific reference to a quite general linear model of the input-output relationship representing unobserved heterogeneity in the data, namely, the unbalanced fixed effects panel data model. First, we have analyzed its conditional generalization error; then, we have minimized a large-sample upper bound on it with respect to some of its parameters. We have proved that, under suitable assumptions, in some cases “many but bad” examples provide a smaller upper bound on the conditional generalization error than “few but good” ones, whereas in other cases the opposite occurs. The choice between “many but bad” and “few but good” examples plays an important role when better supervision implies higher costs.
The theoretical results obtained in this work could be applied to the acquisition design of unbalanced panel data related to several fields, such as biostatistics, econometrics, educational research, engineering, neuroscience, political science, and sociology. Moreover, the analysis of the large-sample case could be extended to deal with large N, or with both large N and T. These cases would be of interest for their potential applications in microeconometrics (Cameron and Trivedi 2005). Another possible extension concerns the introduction, in the noise model, of a subset of non-controllable parameters (beyond the controllable one, i.e., the noise variance), which could be estimated from a subset of training data. As a final extension, one could investigate and optimize the trade-off between sample size and precision of supervision (and possibly, also selection probabilities) for the random effects panel data model (Greene 2003, Chapter 13). This is also commonly applied in the analysis of economic data, and differs from the fixed effects panel data model in that its parameters are considered as random variables. In the present context, however, a possible advantage of the fixed effects panel data model is that it also allows one to obtain estimates of the individual constants η_n (see Eq. (7)), which appear in the expression (11) of the conditional generalization error. Moreover, the application of the random effects model to the unbalanced case requires stronger assumptions than the ones needed for the application of the fixed effects model (Wooldridge 2002, Chapter 17).

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Ethical approval/informed consent This article does not contain any studies with human participants or animals performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix 1: Proof of Equation (18)
First, we expand the conditional generalization error (11) as in Eq. (47). Exploiting the conditional unbiasedness of η̂_{i,FE}, and the expressions (1) of y_{n,t}, (2) of ỹ_{n,t}, and (7) of η̂_{i,FE} (with the index n replaced by the index i), one gets Eq. (48). It follows from Eq. (48) that Eq. (47) can be re-written as Eq. (49). Using the expression (6) of β̂_FE, and Eq. (16), one can simplify the term β̂_FE − β as in Eq. (50). Then, Eq. (49) becomes Eq. (51). Expanding the square in the first term of the resulting expression, and splitting its last term in two parts, one obtains the expression (52) for Eq. (51). In order to simplify the various terms contained in Eq. (52), one observes that, due to Eqs. (12), (13), and (15), one gets Eqs. (53) and (54). Then, by an application of the two equations just derived, one obtains the equivalent expression (55) for Eq. (52), where, in some cases, the conditional expectations of deterministic matrices (and of random matrices, like X_i, that become known once the set of conditioning matrices {X_n}_{n=1}^N has been fixed) have been replaced by the matrices themselves. Finally, exploiting Eq. (17), one can get rid of the third and sixth terms in Eq. (55), which then reduces to Eq. (18).

Appendix 2: Proof of Equation (26)
The first inequality in Eq. (26) is obtained by exploiting the definition of induced l₂-matrix norm, i.e., ∥M∥₂ := sup_{v ≠ 0_s} ∥Mv∥₂/∥v∥₂ for M ∈ R^{s×s}. Then, one exploits the equality ∥Ā_N^{−1}∥₂ = λ_max(Ā_N^{−1}) = 1/λ_min(Ā_N), which holds since Ā_N^{−1} is symmetric and positive definite.
Finally, the last inequality in Eq. (26) is obtained by exploiting Weyl's inequalities (Bhatia 1997, Theorem III.2.1) for the eigenvalues of the sum of symmetric matrices, as detailed in the following remark.
Remark 5.1 Given any pair of symmetric matrices A, B ∈ R^{s×s}, let their eigenvalues and those of C := A + B be ordered nondecreasingly (with possible repetitions in case of multiplicity larger than 1) as λ_1(A) ≤ … ≤ λ_s(A), λ_1(B) ≤ … ≤ λ_s(B), and λ_1(C) ≤ … ≤ λ_s(C). Then, Weyl’s inequalities, in their simplest form, state that, for every k = 1, …, s, one has

λ_k(A) + λ_1(B) ≤ λ_k(C) ≤ λ_k(A) + λ_s(B).

Hence, λ_min(C) ≥ λ_min(A) + λ_min(B). Similarly, for any μ_1, μ_2 ≥ 0, when A and B are also positive semi-definite (as in the case of the matrices A_n defined in Eq. (19)), one gets

λ_min(μ_1 A + μ_2 B) ≥ μ_1 λ_min(A) + μ_2 λ_min(B). (63)
Finally, Eq. (63) extends directly to the case of a weighted summation (with non-negative weights) of symmetric and positive semi-definite matrices, proving the last inequality in Eq. (26).
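Remark 5.1 can be checked numerically on random positive semi-definite matrices (a small sketch; `numpy.linalg.eigvalsh` returns eigenvalues in ascending order, so index 0 is the minimum):

```python
import numpy as np

rng = np.random.default_rng(4)

def random_psd(s):
    M = rng.normal(size=(s, s))
    return M @ M.T                         # symmetric positive semi-definite

# Check lambda_min(mu1*A + mu2*B) >= mu1*lambda_min(A) + mu2*lambda_min(B)
# for random PSD matrices and non-negative weights, as in Eq. (63).
for _ in range(100):
    A, B = random_psd(4), random_psd(4)
    mu1, mu2 = rng.random(2)
    lhs = np.linalg.eigvalsh(mu1 * A + mu2 * B)[0]
    rhs = mu1 * np.linalg.eigvalsh(A)[0] + mu2 * np.linalg.eigvalsh(B)[0]
    assert lhs >= rhs - 1e-9
print("Weyl lower bound verified on 100 random instances")
```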