1 Introduction

In many situations involving economics, engineering, physics, and other fields, one needs to approximate a function on the basis of a finite set of noisy input–output examples. This is a typical problem addressed by supervised machine learning (Vapnik 1998). In some cases, the output noise variance can be reduced to some extent by increasing the cost of each supervision. For example, devices with higher precision could be used to acquire measurements, or experts could be involved in the data analysis procedure. However, in the presence of a budget constraint, increasing the cost of each supervision could reduce the total number of available labeled examples. In such cases, investigating the optimal trade-off between the sample size and the precision of supervision plays a key role. In Gnecco and Nutarelli (2019a), this analysis was carried out by employing the classical linear regression model, suitably modified to include the additional possibility of controlling the conditional variance of the output given the input. Specifically, this was pursued by varying the time (hence, the cost) dedicated to the supervision of each training example, and fixing an upper bound on the total available supervision time. Based on a large-sample approximation of the output of the ordinary least squares regression algorithm, it was shown therein that the optimal choice of the supervision time per example depends strongly on the noise model. The analysis was refined in Gnecco and Nutarelli (2019b), where an additional algorithm (weighted least squares) was considered, and shown to produce, at optimality, results similar to those of the ordinary least squares algorithm, for a model in which different training examples are possibly associated with different supervision times.

In this work, we analyze the optimal trade-off between sample size, precision of supervision, and selection probabilities for a more general linear model of the input–output relationship, namely the unbalanced fixed effects panel data model. The (either balanced or unbalanced) fixed effects model is commonly applied in the econometric analysis of microeconomic and macroeconomic data (Andreß et al. 2013; Arellano 2004; Cameron and Trivedi 2005; Wooldridge 2002), where each unit may represent, e.g., a firm or a country. It is also applied, among other fields, in biostatistics (Härdle et al. 2007), educational research (Sherron et al. 2000), engineering (Reeve 1988; Yu et al. 2018; Zeifman 2015), neuroscience (Friston et al. 1999), political science (Bell and Jones 2014), and sociology (Frees 2004). In a fixed effects panel data model, observations related to different observational units (individuals) are associated with possibly different constants, which represent unobserved heterogeneity in the data. Moreover, the same unit is observed along another dimension, which is typically time. In the unbalanced case, at each time instant each unit may not be observed, with some positive (possibly unit-dependent) probability, resulting in a possibly unbalanced panel. The balanced case corresponds to the situation in which the number of observations is the same for all the units.

The present work significantly extends the analysis of our previous conference article (Gnecco and Nutarelli 2020) to the unbalanced fixed effects panel data model, which is more general than the balanced case considered therein and leads to an optimization problem that is more complex to investigate. Indeed, in Gnecco and Nutarelli (2020), all the units are always selected at each time instant; therefore, the selection probabilities do not appear as optimization variables in the corresponding model. Moreover, the theoretical arguments are reported in much more detail in the current work.

The results presented in this paper for the unbalanced fixed effects panel data model are consistent with those of Gnecco and Nutarelli (2020) for the balanced case, and with those of Gnecco and Nutarelli (2019a, b) concerning simpler linear regression models. Specifically, we show that the following also holds for the unbalanced fixed effects panel data model. When the precision of the supervision increases less than proportionally with respect to the supervision cost per example, the minimum (large-sample upper bound on the) generalization error (conditioned on the training input dataset) is attained for the smallest supervision cost per example. As a consequence of the problem formulation, this corresponds to the choice of the largest number of examples. Instead, when the precision of the supervision increases more than proportionally with respect to the supervision cost per example, the optimal supervision cost per example is the largest one. Again, as a consequence of the problem formulation, this corresponds to the choice of the smallest number of examples. The structure of the optimal selection probabilities is also investigated, under the constraint of a constant expected number of observed units at each time instant. In summary, the results of the theoretical analyses performed, for regression models of increasing complexity, in Gnecco and Nutarelli (2019a, b, 2020) and in this paper highlight that, in some circumstances, collecting a smaller number of more reliable data is preferable to increasing the size of the sample set. This is particularly relevant when one is given a certain flexibility in designing the data collection process.

To the best of our knowledge, the analysis and optimization of the trade-off between sample size, precision of supervision, and selection probabilities in regression has rarely been carried out in the machine-learning literature. Nevertheless, the approach applied in this paper resembles the one used in the optimization of sample survey design, where some of the design parameters are optimized to minimize the sampling variance (Groves et al. 2004). Such an approach is also similar to the one exploited in Nguyen et al. (2009) for the optimization of the design of measurement devices. In that framework, however, linear regression is only marginally involved, since only arithmetic averages of measurement results are considered therein. The search for optimal sample designs can also be performed by the Optimal Computing Budget Allocation (OCBA) method (Chen and Lee 2010). Unlike that approach, however, our analysis provides the optimal design a priori, i.e., before actually collecting the data. Our work can also be related to recent literature dealing with the joint application of machine learning, optimization, and econometrics (Varian 2014; Athey and Imbens 2016; Bargagli Stoffi and Gnecco 2018, 2019; Crane-Droesch 2017). For instance, the generalization error, which is typically investigated in machine learning and optimized by solving suitable optimization problems, is not addressed in the classical analysis of the either balanced or unbalanced fixed effects panel data model (Wooldridge 2002, Chapters 10 and 17). Finally, an advantage of the approach considered in this work, with respect to other possible ones grounded in Statistical Learning Theory (SLT) (Vapnik 1998), is that, being based on a large-sample approximation, it provides bounds on the conditional generalization error that do not require any a-posteriori evaluation of empirical risks.

The paper is structured as follows. Section 2 provides background on the unbalanced fixed effects panel data model. Section 3 analyzes its conditional generalization error and derives a large-sample (with respect to time) upper bound on it. Section 4 formulates and solves the optimization problem modeling the trade-off between sample size, precision of supervision, and selection probabilities for the unbalanced fixed effects panel data model, using the large-sample upper bound above. Finally, Sect. 5 discusses some possible applications and extensions of the theoretical results obtained in the work.

2 Background

We recall some basic facts about the following (static) unbalanced fixed effects panel data model (see, e.g., (Wooldridge 2002, Chapters 10 and 17)). Let \(n=1,\ldots ,N\) denote observational units and, for each n, let \(t=1,\ldots ,T_n\) be time instants. Moreover, let the inputs \(\varvec{x}_{n,t}\) (\(n=1,\ldots ,N,t=1,\ldots ,T_n\)) to the model be random column vectors in \({\mathbb {R}}^p\) and, for each \(n=1,\ldots ,N\) and \(t=1,\ldots ,T_n\), let the output \(y_{n,t} \in {\mathbb {R}}\) be a scalar. The parameters of the model are some individual constants \(\eta _n\) (\(n=1,\ldots ,N)\), one for each unit, and a column vector \(\varvec{\beta }\in {\mathbb {R}}^p\). The (noise-free) input–output relationship is expressed as follows:

$$\begin{aligned} y_{n,t}:=\eta _n+\varvec{\beta }'\varvec{x}_{n,t}, \quad n=1,\ldots ,N,\quad t=1,\ldots ,T_n. \end{aligned}$$
(1)

Equation (1) represents an unbalanced panel data model, which can be applied in the following two situations:

  • distinct units n are associated with possibly different numbers \(T_n\) of observations, collected at the time instants \(t=1,\ldots ,T_n\) within a whole observation period \(T \ge \max _{n=1}^N T_n\);

  • the observations related to the same unit are associated with a subsequence \(\{t_1,t_2,\ldots ,t_{T_n}\}\) of the sequence \(\{1,2,\ldots ,T\}\).

In the next sections, we focus on the second situation. To avoid burdening the notation with an additional index, we still denote, also in this case, the subsequence \(\{t_1,t_2,\ldots ,t_{T_n}\}\) by \(\{1,2,\ldots ,T_n\}\). A possible way to obtain different numbers of observations \(T_n\) for distinct units consists in associating with each unit n a scalar \(q_n \in (0, 1]\), which denotes the (positive) probability that n is observed at any time t. Selections for different units are assumed to be mutually independent. For simplicity, for each unit, selections at different times are also assumed to be mutually independent. For a total observation time T, denoting by \({\mathbb {E}}\) the expectation operator, the expected number of observations for each unit n is \({\mathbb {E}}\left\{ T_n\right\} =q_n T\). The balanced case, which was considered in the analysis of Gnecco and Nutarelli (2020), corresponds to the situation \(q_n=1\) for each n.

Let \(\{\varepsilon _{n,t}\}_{n=1,\ldots ,N,\,t=1,\ldots ,T_n}\) be a collection of mutually independent and identically distributed random variables, having mean 0 and common variance \(\sigma ^2\). Moreover, let all the \(\varepsilon _{n,t}\) be independent also of all the \(\varvec{x}_{n,t}\). It is assumed that noisy measurements \({\tilde{y}}_{n,t}\) of the outputs \(y_{n,t}\) are available; specifically, the following additive noise model is considered:

$$\begin{aligned} {\tilde{y}}_{n,t}=y_{n,t}+\varepsilon _{n,t}, \quad n=1,\ldots ,N,\quad t=1,\ldots ,T_n. \end{aligned}$$
(2)

The input–output pairs \(\left( \varvec{x}_{n,t}, {\tilde{y}}_{n,t}\right) \) for \(n=1,\ldots ,N\), \(t=1,\ldots ,T_n\), are used to train the model, i.e., to estimate its parameters. In the following, for \(n=1,\ldots ,N\), let \(\varvec{X}_n \in {\mathbb {R}}^{T_n \times p}\) denote the matrix whose rows are the transposes of the \(\varvec{x}_{n,t}\); \(\tilde{\varvec{y}}_n\) be the column vector that collects the noisy measurements \({\tilde{y}}_{n,t}\); \(\varvec{I}_{T_n} \in {\mathbb {R}}^{T_n \times T_n}\) denote the identity matrix; \(\varvec{1}_{T_n} \in {\mathbb {R}}^{T_n}\) be the column vector whose elements are all equal to 1; and

$$\begin{aligned} \varvec{Q}_n:=\varvec{I}_{T_n}-\frac{1}{T_n} \varvec{1}_{T_n} \varvec{1}_{T_n}' \end{aligned}$$
(3)

be a symmetric and idempotent matrix, i.e., such that \(\varvec{Q}_n'=\varvec{Q}_n=\varvec{Q}_n^2\). Hence, for each unit n,

$$\begin{aligned} \varvec{Q}_n \varvec{X}_n= \begin{bmatrix} \varvec{x}_{n,1}-\frac{1}{T_n}\sum _{t=1}^{T_n} \varvec{x}_{n,t} \\ \varvec{x}_{n,2}-\frac{1}{T_n}\sum _{t=1}^{T_n} \varvec{x}_{n,t} \\ \vdots \\ \varvec{x}_{n,T_n}-\frac{1}{T_n}\sum _{t=1}^{T_n} \varvec{x}_{n,t} \end{bmatrix}, \end{aligned}$$
(4)

and

$$\begin{aligned} \varvec{Q}_n {\tilde{\varvec{y}}}_n= \begin{bmatrix} {{\tilde{y}}}_{n,1}-\frac{1}{T_n}\sum _{t=1}^{T_n} {\tilde{y}}_{n,t} \\ {{\tilde{y}}}_{n,2}-\frac{1}{T_n}\sum _{t=1}^{T_n} {\tilde{y}}_{n,t} \\ \vdots \\ {{\tilde{y}}}_{n,T_n}-\frac{1}{T_n}\sum _{t=1}^{T_n} {\tilde{y}}_{n,t} \end{bmatrix} \end{aligned}$$
(5)

represent, respectively, the matrix of time de-meaned training inputs and the vector of time de-meaned corrupted training outputs. The aim of time de-meaning is to generate a transformed dataset that does not include the fixed effects, making it possible to estimate first the vector \({\varvec{\beta }}\) and then, going back to the original dataset, the fixed effects \(\eta _n\).

Assuming in the following the invertibility of the matrix \(\sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n \varvec{X}_n\) (see Remark 3.2 below for a mild condition ensuring this), the fixed effects (FE) estimate of \(\varvec{\beta }\) for the unbalanced case is

$$\begin{aligned}&\varvec{{\hat{\beta }}}_\mathrm{FE} \nonumber \\&\quad :=\left( \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n \varvec{X}_n\right) ^{-1} \left( \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n \tilde{\varvec{y}}_n\right) \nonumber \\&\quad = \left( \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n' \varvec{Q}_n \varvec{X}_n\right) ^{-1} \left( \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n' \varvec{Q}_n \tilde{\varvec{y}}_n\right) . \end{aligned}$$
(6)

The unbalanced FE estimates of the \(\eta _n\), for \(n=1,\ldots ,N\), are

$$\begin{aligned} {\hat{\eta }}_{n,\mathrm{FE}}:=\frac{1}{T_n} \sum _{t=1}^{T_n} \left( {\tilde{y}}_{n,t}-\varvec{{\hat{\beta }}}_\mathrm{FE}'\varvec{x}_{n,t}\right) . \end{aligned}$$
(7)
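
To make the estimation pipeline concrete, the following minimal Python sketch (our own illustration: all parameter values, variable names, and the synthetic data are arbitrary choices, not part of the model) generates an unbalanced panel according to (1)–(2) and computes the estimates (6) and (7) via time de-meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, T = 5, 3, 200                    # units, input dimension, observation period
beta = np.array([1.0, -2.0, 0.5])      # true beta (arbitrary choice)
eta = rng.normal(size=N)               # true fixed effects eta_n
q = rng.uniform(0.3, 1.0, size=N)      # selection probabilities q_n
sigma = 0.3                            # noise standard deviation

# generate an unbalanced panel: unit n is observed, on average, q_n * T times
X, y = [], []
for n in range(N):
    T_n = rng.binomial(T, q[n])                                  # number of observations of unit n
    X_n = rng.normal(size=(T_n, p))                              # inputs x_{n,t}
    y_n = eta[n] + X_n @ beta + sigma * rng.normal(size=T_n)     # Eqs. (1)-(2)
    X.append(X_n)
    y.append(y_n)

# FE estimates via time de-meaning, Eqs. (3)-(7)
A = np.zeros((p, p))
b = np.zeros(p)
for X_n, y_n in zip(X, y):
    Xd = X_n - X_n.mean(axis=0)        # Q_n X_n, Eq. (4)
    yd = y_n - y_n.mean()              # Q_n y_n (noisy), Eq. (5)
    A += Xd.T @ Xd
    b += Xd.T @ yd
beta_FE = np.linalg.solve(A, b)        # Eq. (6)
eta_FE = np.array([y_n.mean() - X_n.mean(axis=0) @ beta_FE
                   for X_n, y_n in zip(X, y)])                   # Eq. (7)
print(beta_FE, eta_FE)
```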

Let \(\varvec{0}_p \in {\mathbb {R}}^p\) be the column vector whose elements are all equal to 0. By taking expectations and recalling the respective definitions and the fact that the measurement errors have 0 mean, it follows that the estimates (6) and (7) are conditionally unbiased with respect to the training input dataset \(\{\varvec{X}_{n}\}_{n=1}^{N}\), i.e.,

$$\begin{aligned} {\mathbb {E}} \left\{ \left( \varvec{{\hat{\beta }}}_\mathrm{FE}-\varvec{\beta }\right) | \{\varvec{X}_{n}\}_{n=1}^{N}\right\} =\varvec{0}_p, \end{aligned}$$
(8)

and, for any \(i=1,\ldots ,N\),

$$\begin{aligned} {\mathbb {E}} \left\{ \left( {\hat{\eta }}_{i,\mathrm{FE}}-\eta _i\right) | \{\varvec{X}_{n}\}_{n=1}^{N}\right\} =0. \end{aligned}$$
(9)

Finally, the covariance matrix of \(\varvec{{\hat{\beta }}}_\mathrm{FE}\), conditioned on the training input dataset, is

$$\begin{aligned}&\mathrm{Var}\left( \varvec{{\hat{\beta }}}_\mathrm{FE}|\{\varvec{X}_{n}\}_{n=1}^{N}\right) \nonumber \\&\quad =\sigma ^2 \left( \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n \varvec{X}_n\right) ^{-1} \nonumber \\&\quad =\sigma ^2 \left( \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n' \varvec{Q}_n \varvec{X}_n\right) ^{-1}. \end{aligned}$$
(10)
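
The conditional unbiasedness (8) and the conditional covariance formula (10) can be checked numerically: holding the training inputs fixed and redrawing the measurement noise many times, the sample mean and covariance of \(\varvec{{\hat{\beta }}}_\mathrm{FE}\) should approach \(\varvec{\beta }\) and \(\sigma ^2 \left( \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n \varvec{X}_n\right) ^{-1}\), respectively. A self-contained sketch (synthetic data and illustrative parameter values of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 4, 2, 0.5
beta = np.array([1.0, -1.0])
eta = rng.normal(size=N)
T_n = [30, 50, 20, 40]                      # an arbitrary unbalanced design
X = [rng.normal(size=(T, p)) for T in T_n]  # training inputs, kept fixed below

Xd = [Xn - Xn.mean(axis=0) for Xn in X]     # Q_n X_n
A = sum(Xdn.T @ Xdn for Xdn in Xd)          # sum_n X_n' Q_n X_n

betas = []
for _ in range(20000):                      # redraw only the measurement noise
    b = np.zeros(p)
    for n, (Xn, Xdn) in enumerate(zip(X, Xd)):
        y_noisy = eta[n] + Xn @ beta + sigma * rng.normal(size=T_n[n])
        b += Xdn.T @ (y_noisy - y_noisy.mean())
    betas.append(np.linalg.solve(A, b))
betas = np.array(betas)

print(betas.mean(axis=0) - beta)                      # close to 0, cf. Eq. (8)
print(np.cov(betas.T) - sigma**2 * np.linalg.inv(A))  # close to 0, cf. Eq. (10)
```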

3 Large-sample upper bound on the conditional generalization error

This section analyzes the generalization error associated with the FE estimates (6) and (7), conditioned on the training input dataset, by providing its large-sample approximation, and a related large-sample upper bound on it. Then, in the next section, the resulting expression is optimized, after choosing a suitable model for the variance \(\sigma ^2\) of the measurement noise, and imposing appropriate constraints.

Let \(\varvec{x}_i^\mathrm{test} \in {\mathbb {R}}^p\) be a random test vector, which is assumed to have finite mean and finite covariance matrix, and to be independent of the training data. We express the generalization error for the i-th unit (\(i=1,\ldots ,N\)), conditioned on the training input dataset, as follows:

$$\begin{aligned} {\mathbb {E}} \left\{ \left( {\hat{\eta }}_{i,\mathrm{FE}}+\varvec{{\hat{\beta }}}_\mathrm{FE}'\varvec{x}_i^\mathrm{test}-\eta _i-\varvec{\beta }'\varvec{x}_i^\mathrm{test}\right) ^2 \big | \{\varvec{X}_{n}\}_{n=1}^{N}\right\} . \end{aligned}$$
(11)

The conditional generalization error (11) represents the expected mean squared error of the prediction of the output associated with a test input, conditioned on the training input dataset.

For \(n=1,\ldots ,N\), let \(\varvec{\varepsilon }_n \in {\mathbb {R}}^{T_n}\) be the column vector whose elements are the \(\varepsilon _{n,t}\); \(\varvec{\eta }_n \in {\mathbb {R}}^{T_n}\) be the column vector whose elements are all equal to \(\eta _n\); and \(\varvec{0}_{T_n \times T_m} \in {\mathbb {R}}^{T_n \times T_m}\) be a matrix whose elements are all equal to 0. Noting that

$$\begin{aligned}&{\mathbb {E}} \left\{ \varvec{\varepsilon }_n \varvec{\varepsilon }_m' \right\} =\varvec{0}_{T_n \times T_m},\quad \text{ for }\,\, n \ne m, \end{aligned}$$
(12)
$$\begin{aligned}&{\mathbb {E}} \left\{ \varvec{\varepsilon }_n \varvec{\varepsilon }_n' \right\} =\sigma ^2 \varvec{I}_{T_n}, \end{aligned}$$
(13)
$$\begin{aligned}&\varvec{Q}_n'\varvec{Q}_n=\varvec{Q}_n, \end{aligned}$$
(14)
$$\begin{aligned}&\varvec{Q}_n'\varvec{Q}_n\varvec{Q}_n'\varvec{Q}_n=\varvec{Q}_n'\varvec{Q}_n, \end{aligned}$$
(15)
$$\begin{aligned}&\varvec{Q}_n \varvec{\eta }_n= \varvec{Q}_n' \varvec{\eta }_n= \varvec{0}_{T_n}, \end{aligned}$$
(16)

and

$$\begin{aligned} \varvec{Q}_n \varvec{1}_{T_n}= \varvec{Q}_n' \varvec{1}_{T_n}= \varvec{0}_{T_n}, \end{aligned}$$
(17)

we can express the conditional generalization error (11) as follows, highlighting its dependence on \(\sigma ^2\) and \(T_i\) (see “Appendix 1” for the details):

$$\begin{aligned}&{\mathbb {E}} \left\{ \left( {\hat{\eta }}_{i,\mathrm{FE}}+\varvec{{\hat{\beta }}}_\mathrm{FE}'\varvec{x}_i^\mathrm{test}-\eta _i-\varvec{\beta }'\varvec{x}_i^\mathrm{test}\right) ^2 \big | \{\varvec{X}_{n}\}_{n=1}^{N}\right\} \nonumber \\&\quad =\frac{\sigma ^2}{T_i^2} \varvec{1}_{T_i}' \varvec{X}_i \left( \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n' \varvec{Q}_n \varvec{X}_n\right) ^{-1} \varvec{X}_i' \varvec{1}_{T_i} \nonumber \\&\qquad +\frac{\sigma ^2}{T_i} \nonumber \\&\qquad +{\mathbb {E}} \Bigg \{\sigma ^2 \left( \varvec{x}_i^\mathrm{test}\right) '\left( \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n' \varvec{Q}_n \varvec{X}_n\right) ^{-1} \varvec{x}_i^\mathrm{test} \big | \{\varvec{X}_{n}\}_{n=1}^{N}\Bigg \} \nonumber \\&\qquad -2 {\mathbb {E}} \Bigg \{\frac{\sigma ^2}{T_i} \varvec{1}_{T_i}' \varvec{X}_i \left( \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n' \varvec{Q}_n \varvec{X}_n\right) ^{-1} \varvec{x}_i^\mathrm{test} \big | \{\varvec{X}_{n}\}_{n=1}^{N} \Bigg \}. \end{aligned}$$
(18)

Next, we obtain a large-sample approximation of the conditional generalization error (18) with respect to T, for a fixed number N of units.

For \(n=1,\ldots ,N\), let the symmetric and positive semi-definite matrices \(\varvec{A}_n \in {\mathbb {R}}^{p \times p}\) be defined as

$$\begin{aligned} \varvec{A}_n=\varvec{A}_n':= {\mathbb {E}} \left\{ \left( \varvec{x}_{n,1}-{\mathbb {E}} \left\{ \varvec{x}_{n,1}\right\} \right) \left( \varvec{x}_{n,1}-{\mathbb {E}} \left\{ \varvec{x}_{n,1}\right\} \right) '\right\} . \end{aligned}$$
(19)

In the following, the positive definiteness (hence, the invertibility) of each matrix \(\varvec{A}_n\) is assumed. This is quite a mild condition, because it amounts to requiring that, with positive probability, the random vectors \(\varvec{x}_{n,1}-{\mathbb {E}} \left\{ \varvec{x}_{n,1}\right\} \) do not belong to any given subspace of \({\mathbb {R}}^p\) with dimension smaller than p (so, they are effectively p-dimensional random vectors).

Under mild conditions (e.g., if the \(\varvec{x}_{n,t}\) are mutually independent, identically distributed, and have finite moments up to order 4), the following convergences in probability hold:

$$\begin{aligned}&\mathrm{plim}_{T \rightarrow +\infty } \frac{1}{T_i} \varvec{1}_{T_i}' \varvec{X}_i \nonumber \\&\quad = \mathrm{plim}_{T \rightarrow +\infty } \frac{1}{T_i} \sum _{t=1}^{T_i} \varvec{x}_{i,t}' \nonumber \\&\quad = \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} \right) ', \end{aligned}$$
(20)

and

$$\begin{aligned}&\mathrm{plim}_{T \rightarrow +\infty } \frac{1}{T} \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n' \varvec{Q}_n \varvec{X}_n \nonumber \\&\quad = \mathrm{plim}_{T \rightarrow +\infty } \sum _{n=1}^N \frac{T_n}{T} \frac{1}{T_n} \varvec{X}_n' \varvec{Q}_n' \varvec{Q}_n \varvec{X}_n \nonumber \\&\quad = \varvec{A}_N, \end{aligned}$$
(21)

where

$$\begin{aligned} \varvec{A}_N=\varvec{A}_N':= \sum _{n=1}^N q_n \varvec{A}_n \end{aligned}$$
(22)

is the weighted sum, with positive weights \(q_n\), of the symmetric and positive definite matrices \(\varvec{A}_n\); hence, \(\varvec{A}_N\) is itself a symmetric and positive definite matrix.
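
The probability limit (21) can be illustrated numerically: for i.i.d. inputs with per-unit covariance matrices \(\varvec{A}_n\), the matrix \(\frac{1}{T} \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n' \varvec{Q}_n \varvec{X}_n\) concentrates around \(\sum _{n=1}^N q_n \varvec{A}_n\) as T grows. A rough sketch (Gaussian inputs and parameter values chosen by us purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 3, 2
q = np.array([0.9, 0.6, 0.3])                      # selection probabilities q_n
# per-unit input covariance matrices A_n (arbitrary, positive definite)
A_units = [np.array([[1.0, 0.2], [0.2, 0.5]]),
           np.eye(p),
           np.array([[2.0, -0.3], [-0.3, 1.0]])]
A_N = sum(qn * An for qn, An in zip(q, A_units))   # Eq. (22)

for T in [100, 1000, 10000, 100000]:
    S = np.zeros((p, p))
    for qn, An in zip(q, A_units):
        T_n = rng.binomial(T, qn)                  # unit observed w.p. q_n at each instant
        L = np.linalg.cholesky(An)
        Xn = rng.normal(size=(T_n, p)) @ L.T       # inputs with covariance A_n
        Xd = Xn - Xn.mean(axis=0)                  # Q_n X_n
        S += Xd.T @ Xd
    # discrepancy between (1/T) sum_n X_n' Q_n' Q_n X_n and A_N: shrinks as T grows
    print(T, np.linalg.norm(S / T - A_N))
```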

Remark 3.1

Equations (20) and (21) follow from the extension of Chebyshev’s weak law of large numbers (Ruud 2000, Section 13.4.2) to the case of the summation of a random number of mutually independent random variables (Révész 1968, Theorem 10.1), combined with other technical results. First, for each \(n=1,\ldots ,N\), convergence in probability of \(\frac{1}{T_n} \varvec{X}_n' \varvec{Q}_n' \varvec{Q}_n \varvec{X}_n\) to \(\varvec{A}_n\) is proved element-wise, by applying (Révész 1968, Theorem 10.1). Then, one exploits the fact that, as a consequence of the Continuous Mapping Theorem (Florescu 2015, Theorem 7.33), the probability limit of the product of two random variables (in this case, \(\frac{T_n}{T}\) and each element of \(\frac{1}{T_n} \varvec{X}_n' \varvec{Q}_n' \varvec{Q}_n \varvec{X}_n\)) equals the product of their probability limits, when the latter two exist (which is the case here). Finally, one applies the fact that, for a random matrix, element-wise convergence in probability implies convergence in probability of the whole random matrix (Lee 2010).

Remark 3.2

The existence of the probability limit (21) and the positive definiteness of the matrix \(\varvec{A}_N\) guarantee that the invertibility of the matrix

$$\begin{aligned} \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n \varvec{X}_n=\sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n' \varvec{Q}_n \varvec{X}_n \end{aligned}$$
(23)

(see Sect. 2) holds with probability close to 1 for large T. Due to the generalization of Slutsky’s theorem reported in (Greene 2003, Theorem D.14), under the stated assumptions the sequence of random matrices

$$\begin{aligned} \left( \frac{1}{T} \sum _{n=1}^N \varvec{X}_n' \varvec{Q}_n' \varvec{Q}_n \varvec{X}_n\right) ^{-1} \end{aligned}$$
(24)

also converges in probability to \(\varvec{A}_N^{-1}\). This is needed to obtain the next large-sample approximation (25) of the conditional generalization error.

Remark 3.3

We point out that the conditional generalization error (11) is investigated in this work, instead of its unconditional version, because, in general, probability limits and expectations cannot be interchanged. This could prevent the application of (Greene 2003, Theorem D.14) (or of similar results about probability limits) when performing a similar analysis for the unconditional generalization error.

Let \(\Vert \cdot \Vert _2\) denote the \(l_2\)-norm, and \(\varvec{A}_N^{-\frac{1}{2}}\) be the principal square root (i.e., the symmetric and positive definite square root) of the symmetric and positive definite matrix \(\varvec{A}_N^{-1}\). When (20) and (21) hold, from (18) and the assumed independence of \(\varvec{x}_i^\mathrm{test}\) from all the other random vectors we get the following large-sample approximation (with respect to T) for the conditional generalization error (11):

$$\begin{aligned}&{\mathbb {E}} \left\{ \left( {\hat{\eta }}_{i,\mathrm{FE}}+\varvec{{\hat{\beta }}}_\mathrm{FE}'\varvec{x}_i^\mathrm{test}-\eta _i-\varvec{\beta }'\varvec{x}_i^\mathrm{test}\right) ^2 \big | \{\varvec{X}_{n}\}_{n=1}^{N}\right\} \nonumber \\\simeq & {} \frac{\sigma ^2}{T} \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} \right) ' \varvec{A}_N^{-1} {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} \nonumber \\&+\frac{\sigma ^2}{q_i T} \nonumber \\&+\frac{\sigma ^2}{T} {\mathbb {E}} \left\{ \left( \varvec{x}_i^\mathrm{test}\right) ' \varvec{A}_N^{-1} \varvec{x}_i^\mathrm{test} \right\} \nonumber \\&-2 \frac{\sigma ^2}{T} \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} \right) ' \varvec{A}_N^{-1} {\mathbb {E}} \left\{ \varvec{x}_i^\mathrm{test} \right\} \nonumber \\= & {} \frac{\sigma ^2}{T} \left( \frac{1}{q_i} + {\mathbb {E}} \left\{ \left\| \varvec{A}_N^{-\frac{1}{2}} \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} -\varvec{x}_i^\mathrm{test}\right) \right\| _2^2\right\} \right) . \end{aligned}$$
(25)

In the following, for a generic symmetric matrix \(\varvec{A} \in {\mathbb {R}}^{s \times s}\), we denote by \(\lambda _\mathrm{min}(\varvec{A})\) and \(\lambda _{\mathrm{max}}(\varvec{A})\), respectively, its minimum and maximum eigenvalues. Starting from the large-sample approximation (25), the following chain of inequalities can be proved (see “Appendix 2” for the details):

$$\begin{aligned}&\frac{\sigma ^2}{T} \left( \frac{1}{q_i} + {\mathbb {E}} \left\{ \left\| \varvec{A}_N^{-\frac{1}{2}} \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} -\varvec{x}_i^\mathrm{test}\right) \right\| _2^2\right\} \right) \nonumber \\&\quad \le \frac{\sigma ^2}{T} \left( \frac{1}{q_i} + \lambda _{\mathrm{max}}(\varvec{A}_N^{-1}) {\mathbb {E}} \left\{ \left\| \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} -\varvec{x}_i^\mathrm{test}\right) \right\| _2^2\right\} \right) \nonumber \\&\quad = \frac{\sigma ^2}{T} \left( \frac{1}{q_i} + \frac{1}{\lambda _{\mathrm{min}}(\varvec{A}_N)} {\mathbb {E}} \left\{ \left\| \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} -\varvec{x}_i^\mathrm{test}\right) \right\| _2^2\right\} \right) \nonumber \\&\quad \le \frac{\sigma ^2}{T} \left( \frac{1}{q_i} + \frac{1}{\sum _{n=1}^{N} q_n \lambda _{\mathrm{min}}(\varvec{A}_n)} {\mathbb {E}} \left\{ \left\| \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} -\varvec{x}_i^\mathrm{test}\right) \right\| _2^2\right\} \right) . \end{aligned}$$
(26)
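
The two bounding steps in (26) rest on the inequalities \(\varvec{x}' \varvec{A}_N^{-1} \varvec{x} \le \Vert \varvec{x}\Vert _2^2 / \lambda _{\mathrm{min}}(\varvec{A}_N)\) and \(\lambda _{\mathrm{min}}\left( \sum _{n=1}^N q_n \varvec{A}_n\right) \ge \sum _{n=1}^N q_n \lambda _{\mathrm{min}}(\varvec{A}_n)\). A quick numerical check, with randomly generated positive definite matrices of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
p, N = 4, 5
q = rng.uniform(0.2, 1.0, size=N)                  # selection probabilities q_n
A_units = []                                       # random symmetric positive definite A_n
for _ in range(N):
    M = rng.normal(size=(p, p))
    A_units.append(M @ M.T + 0.1 * np.eye(p))
A_N = sum(qn * An for qn, An in zip(q, A_units))   # Eq. (22)

x = rng.normal(size=p)                             # plays the role of E{x_{i,1}} - x_i^test
lhs = x @ np.linalg.inv(A_N) @ x                   # x' A_N^{-1} x
bound1 = (x @ x) / np.linalg.eigvalsh(A_N)[0]      # ||x||^2 / lambda_min(A_N)
bound2 = (x @ x) / sum(qn * np.linalg.eigvalsh(An)[0]
                       for qn, An in zip(q, A_units))
print(lhs <= bound1 <= bound2)                     # True: the two inequalities in (26)
```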

We refer to the inequality

$$\begin{aligned}&\frac{\sigma ^2}{T} \left( \frac{1}{q_i} + {\mathbb {E}} \left\{ \left\| \varvec{A}_N^{-\frac{1}{2}} \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} -\varvec{x}_i^\mathrm{test}\right) \right\| _2^2\right\} \right) \nonumber \\&\quad \le \frac{\sigma ^2}{T} \left( \frac{1}{q_i} + \frac{1}{\sum _{n=1}^{N} q_n \lambda _{\mathrm{min}}(\varvec{A}_n)} {\mathbb {E}} \left\{ \left\| \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} -\varvec{x}_i^\mathrm{test}\right) \right\| _2^2\right\} \right) \end{aligned}$$
(27)

as the large-sample upper bound on the conditional generalization error. Interestingly, its right-hand side is expressed in the separable form \(\frac{\sigma ^2}{T} K_i (\{q_n\}_{n=1}^N)\), where

$$\begin{aligned}&K_i (\{q_n\}_{n=1}^N) \nonumber \\&\quad :=\left( \frac{1}{q_i} + \frac{1}{\sum _{n=1}^{N} q_n \lambda _\mathrm{min}(\varvec{A}_n)} {\mathbb {E}} \left\{ \left\| \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} -\varvec{x}_i^\mathrm{test}\right) \right\| _2^2\right\} \right) \end{aligned}$$
(28)

depends only on the \(q_n\). As shown in the next section, this simplifies the analysis of the trade-off between sample size, precision of supervision, and selection probabilities performed therein, since one does not need to compute the exact expression of the function \(K_i (\{q_n\}_{n=1}^N)\) to find the optimal trade-off with respect to a suitable subset of optimization variables.
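
Once the quantities entering (27)–(28) are available, evaluating the bound is immediate, as in the following sketch (all numbers are illustrative assumptions, not taken from the paper):

```python
import numpy as np

lam_min = np.array([1.2, 0.8, 0.5, 0.3])   # lambda_min(A_n), n = 1,...,N (assumed known)
q = np.array([0.9, 0.7, 0.5, 0.4])         # selection probabilities q_n
E_dist2 = 3.0                              # E{ ||E{x_{i,1}} - x_i^test||_2^2 } for unit i
sigma2, T, i = 0.25, 500, 0

K_i = 1.0 / q[i] + E_dist2 / np.sum(q * lam_min)   # Eq. (28)
bound = sigma2 / T * K_i                           # right-hand side of Eq. (27)
print(K_i, bound)
```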

4 Optimal trade-off between sample size, precision of supervision, and selection probabilities

In this section, we are interested in optimizing the large-sample upper bound (27) on the conditional generalization error when the variance \(\sigma ^2\) is modeled as a decreasing function of the supervision cost per example c, and a given upper bound \(C>0\) is imposed on the expected total supervision cost \(\sum _{n=1}^N q_n T c\) associated with the whole training set. For large T, this expected cost practically coincides with the actual total supervision cost \(\sum _{n=1}^N T_n c\); this follows from an application of Chebyshev’s weak law of large numbers.

Remark 4.1

In our previous conference work (Gnecco and Nutarelli 2020), the large-sample approximation (25) was optimized instead of (27). This was motivated by the fact that all the selection probabilities \(q_n\) were fixed to 1, implying that both \(q_i\) and \(\varvec{A}_N\), hence also the term

$$\begin{aligned} \left( \frac{1}{q_i} + {\mathbb {E}} \left\{ \left\| \varvec{A}_N^{-\frac{1}{2}} \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} -\varvec{x}_i^\mathrm{test}\right) \right\| _2^2\right\} \right) , \end{aligned}$$
(29)

were constant therein.

In the following analysis of the optimal trade-off, N is kept fixed; furthermore, one imposes the constraints

$$\begin{aligned} q_{n,\mathrm{min}} \le q_n \le q_{n,\mathrm{max}}, \quad n=1,\ldots ,N, \end{aligned}$$
(30)

for some given \(q_{n,\mathrm{min}} \in (0,1)\) and \(q_{n,\mathrm{max}} \in [q_{n,\mathrm{min}}, 1]\), and

$$\begin{aligned} \sum _{n=1}^N q_n = {\bar{q}} N, \end{aligned}$$
(31)

for some given \({\bar{q}} \in \left[ \frac{\sum _{n=1}^N q_{n,\mathrm{min}}}{N},\frac{\sum _{n=1}^N q_{n,\mathrm{max}}}{N}\right] \subseteq (0,1]\). In Eq. (31), \(\sum _{n=1}^N q_n\) represents the expected number of observed units at each time instant, which is kept fixed. Moreover, T is chosen as \(\left\lfloor \frac{C}{{\bar{q}} N c} \right\rfloor \). Finally, the supervision cost per example c is allowed to take values in the interval \([c_{\mathrm{min}}, c_{\mathrm{max}}]\), where \(0< c_{\mathrm{min}} < c_{\mathrm{max}}\), so that the resulting T belongs to \(\left\{ \left\lfloor \frac{C}{{\bar{q}} N c_{\mathrm{max}}} \right\rfloor , \ldots , \left\lfloor \frac{C}{{\bar{q}} N c_{\mathrm{min}}} \right\rfloor \right\} \). In the following, C is assumed to be sufficiently large, so that the large-sample upper bound (27) can be taken to hold for every \(c \in [c_{\mathrm{min}}, c_{\mathrm{max}}]\) and every \(q_n \in [q_{n, \mathrm{min}}, q_{n, \mathrm{max}}]\) (for \(n=1,\ldots ,N\)).

Consistently with (Gnecco and Nutarelli 2019a, b, 2020), we adopt the following model for the variance \(\sigma ^2\), as a function of the supervision cost per example c:

$$\begin{aligned} \sigma ^2(c)=k c^{-\alpha }, \end{aligned}$$
(32)

where \(k,\alpha > 0\). For \(0< \alpha < 1\), if one doubles the supervision cost per example c, then the precision \(1/\sigma ^2(c)\) (i.e., the reciprocal of the conditional variance of the output) becomes less than two times its initial value (or equivalently, the variance \(\sigma ^2(c)\) becomes more than one half its initial value). This case is referred to as “decreasing returns of scale” in the precision of each supervision. Conversely, for \(\alpha > 1\), if one doubles the supervision cost per example c, then the precision \(1/\sigma ^2(c)\) becomes more than two times its initial value (or equivalently, the variance \(\sigma ^2(c)\) becomes less than one half its initial value). This case is referred to as “increasing returns of scale” in the precision of each supervision. Finally, the case \(\alpha =1\) is intermediate and refers to “constant returns of scale”. In all the cases above, the precision of each supervision increases by increasing the supervision cost per example c.

Fig. 1 Plots of the rescaled objective functions \(C K_i (\{q_n\}_{n=1}^N) k\frac{c^{-\alpha }}{\left\lfloor \frac{C}{{\bar{q}} N c} \right\rfloor }\) and \({\bar{q}} N K_i (\{q_n\}_{n=1}^N) k c^{1-\alpha }\) for \(\alpha = 0.5\) (a), \(\alpha = 1.5\) (b), and \(\alpha = 1\) (c). The values chosen for the other parameters are detailed in the text

Summarizing, under the assumptions above, the optimal trade-off between sample size, precision of supervision, and selection probabilities for the unbalanced fixed effects panel data model is modeled by the following optimization problem:

$$\begin{aligned}&\underset{\begin{array}{c} c \in [c_{\mathrm{min}}, c_{\mathrm{max}}], \\ q_n \in [q_{n,\mathrm{min}}, q_{n,\mathrm{max}}], \\ n=1,\ldots ,N \end{array}}{\mathrm{minimize}} K_i (\{q_n\}_{n=1}^N) k \frac{c^{-\alpha }}{\left\lfloor \frac{C}{{\bar{q}} N c} \right\rfloor } \nonumber \\&\mathrm{s. t.} \sum _{n=1}^N q_n = {\bar{q}} N. \end{aligned}$$
(33)

By a similar argument as in the proof of (Gnecco and Nutarelli 2019b, Proposition 3.2), which refers to an analogous function approximation problem, when C is sufficiently large the objective function of the optimization problem (33), rescaled by the multiplicative factor C, i.e., \(C K_i (\{q_n\}_{n=1}^N) k\frac{c^{-\alpha }}{\left\lfloor \frac{C}{{\bar{q}} N c} \right\rfloor }\), can be approximated, with a negligible error in the maximum norm on \([c_{\mathrm{min}}, c_{\mathrm{max}}] \times \Pi _{n=1}^N [q_{n,\mathrm{min}}, q_{n,\mathrm{max}}]\), by \({\bar{q}} N K_i (\{q_n\}_{n=1}^N) k c^{1-\alpha }\). Figure 1 shows the behavior of the rescaled objective functions

$$\begin{aligned} C K_i (\{q_n\}_{n=1}^N) k\frac{c^{-\alpha }}{\left\lfloor \frac{C}{{\bar{q}} N c} \right\rfloor } \end{aligned}$$
(34)

and

$$\begin{aligned} {\bar{q}} N K_i (\{q_n\}_{n=1}^N) k c^{1-\alpha } \end{aligned}$$
(35)

for the three cases \(0<\alpha = 0.5 < 1\), \(\alpha = 1.5 > 1\), and \(\alpha = 1\). The values of the other parameters are \(k=0.5\), \({\bar{q}}=0.5\), \(K_i (\{q_n\}_{n=1}^N)=2\) (which can be assumed to hold for a fixed choice of the set of the \(q_n\)), \(N=10\), \(C=125\), \(c_{\mathrm{min}}=0.4\), and \(c_{\mathrm{max}}=0.8\). One can show by standard calculus that, for \(C \rightarrow +\infty \) and the \(q_n\) fixed to constant values, the number of discontinuity points of the rescaled objective function \(C K_i (\{q_n\}_{n=1}^N) k\frac{c^{-\alpha }}{\left\lfloor \frac{C}{{\bar{q}} N c} \right\rfloor }\) tends to infinity, whereas the amplitude of its oscillations above the lower envelope \({\bar{q}} N K_i (\{q_n\}_{n=1}^N) k c^{1-\alpha }\) tends to 0 uniformly with respect to \(c \in [c_{\mathrm{min}}, c_{\mathrm{max}}]\).
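
A short sketch (using the parameter values listed above) evaluates the rescaled objective (34) and its approximation (35) on a grid of values of c, and confirms that the gap between the two curves shrinks as C grows:

```python
import numpy as np

k, q_bar, K_i, N, C = 0.5, 0.5, 2.0, 10, 125       # values used for Fig. 1
c_min, c_max = 0.4, 0.8

def rescaled_objective(c, alpha, C=C):
    T = np.floor(C / (q_bar * N * c))              # T = floor(C / (q_bar N c))
    return C * K_i * k * c**(-alpha) / T           # Eq. (34)

def lower_envelope(c, alpha):
    return q_bar * N * K_i * k * c**(1.0 - alpha)  # Eq. (35)

c = np.linspace(c_min, c_max, 2001)
for alpha in (0.5, 1.5, 1.0):
    gap = np.max(rescaled_objective(c, alpha) - lower_envelope(c, alpha))
    gap_large_C = np.max(rescaled_objective(c, alpha, C=1e6) - lower_envelope(c, alpha))
    print(alpha, gap, gap_large_C)                 # the gap shrinks as C grows
```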

Concluding, under the approximation above, one can replace the optimization problem (33) with

$$\begin{aligned}&\underset{\begin{array}{c} c \in [c_{\mathrm{min}}, c_{\mathrm{max}}], \\ q_n \in [q_{n,\mathrm{min}}, q_{n,\mathrm{max}}], \\ n=1,\ldots ,N \end{array}}{\mathrm{minimize}} {\bar{q}} N K_i (\{q_n\}_{n=1}^N) k c^{1-\alpha } \nonumber \\&\mathrm{s. t.} \sum _{n=1}^N q_n = {\bar{q}} N. \end{aligned}$$
(36)

Such an optimization problem is in separable form: one can optimize the variable c separately from the variables \(q_n\), for \(n=1,\ldots ,N\). In particular, the optimal solutions \(c^\circ \) have the following expressions:

  (a) if \(0< \alpha < 1\) (“decreasing returns of scale”): \(c^\circ = c_{\mathrm{min}}\);

  (b) if \(\alpha > 1\) (“increasing returns of scale”): \(c^\circ = c_{\mathrm{max}}\);

  (c) if \(\alpha = 1\) (“constant returns of scale”): \(c^\circ =\) any cost c in the interval \([c_{\mathrm{min}}, c_{\mathrm{max}}]\).

In summary, the results of this part of the analysis show that, in the case of “decreasing returns of scale”, “many but bad” examples are associated with a smaller large-sample upper bound on the conditional generalization error than “few but good” ones. The opposite occurs for “increasing returns of scale”, whereas the case of “constant returns of scale” is intermediate. These results are qualitatively in line with the ones obtained in Gnecco and Nutarelli (2020) for the balanced case and in Gnecco and Nutarelli (2019a, b) for simpler linear regression problems, to which the ordinary/weighted least squares algorithms were applied. This depends on the fact that, in all these cases, the large-sample approximation of the conditional generalization error (or its large-sample upper bound) has the functional form \(\frac{\sigma ^2}{T} K_i\), where \(K_i\) is either a constant, or depends on optimization variables related to neither \(\sigma \) nor T.

One can observe that, in order to discriminate among the three cases of the analysis reported above, it is not necessary to know the exact values of the constants k and N, nor the expression of \(K_i\) as a function of the \(q_n\). Moreover, to discriminate between the first two cases, it is not necessary to know the exact value of the positive constant \(\alpha \); it suffices to know whether \(\alpha \) belongs to the interval (0, 1) or to the interval \((1,+\infty )\). Finally, for this part of the analysis, the only knowledge required about the probability distributions of the input examples associated with the different units is the values of the constants \(\lambda _{\mathrm{min}}({\varvec{A}}_n)\), which are involved in the optimization of the variables \(q_n\).

Assuming that the constant terms \(\lambda _{\mathrm{min}}({\varvec{A}}_n)\) and \({\mathbb {E}} \left\{ \left\| \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} -\varvec{x}_i^\mathrm{test}\right) \right\| _2^2\right\} \) are known, optimal \(q_n^{\circ }\) can be derived as follows. First, note that, for each fixed admissible choice of \(q_i\), the optimization of the other \(q_n\) can be restated as follows:

$$\begin{aligned}&\underset{\begin{array}{c} q_n \in [q_{n,\mathrm{min}}, q_{n,\mathrm{max}}], \\ n=1,\ldots ,N, n \ne i \end{array}}{\mathrm{maximize}} \left( \lambda _{\min }({\varvec{A}}_i) q_i + \sum _{n=1,\ldots ,N, n \ne i} \lambda _{\min }({\varvec{A}}_n) q_n \right) \nonumber \\&\mathrm{s. t.} \sum _{n=1,\ldots ,N, n \ne i} q_n = {\bar{q}} N-q_i. \end{aligned}$$
(37)

More precisely, an admissible choice for \(q_i\) is one for which \(q_i \in \left[ {\hat{q}}_{i,\mathrm{min}}, {\hat{q}}_{i,\mathrm{max}} \right] \), where

$$\begin{aligned} {\hat{q}}_{i,\mathrm{min}}:=\max \{q_{i,\mathrm{min}},{\bar{q}} N -\sum _{n=1,\ldots ,N, n \ne i} q_{n, \mathrm{max}}\}, \end{aligned}$$
(38)

and

$$\begin{aligned} {\hat{q}}_{i,\mathrm{max}}:=\min \{q_{i,\mathrm{max}},{\bar{q}} N -\sum _{n=1,\ldots ,N, n \ne i} q_{n, \mathrm{min}}\}. \end{aligned}$$
(39)

The optimization problem (37) is a linear program, which can be reduced to a continuous knapsack problem (Martello and Toth 1990, Section 2.2.1) after a rescaling of all its optimization variables and of their respective bounds. It is well known that, due to its particular structure, such a problem can be solved by the following greedy algorithm, divided into three steps and illustrated by the sketch reported after the list (for simplicity of exposition, we assume that all the \(\lambda _{\min }({\varvec{A}}_n)\) are distinct):

  1. first, the variables \(q_n\) are re-ordered according to decreasing values of the associated \(\lambda _{\min }({\varvec{A}}_n)\). So, let \(\check{q}_{n}:=q_{\pi (n)}\) and \(\check{\varvec{A}}_n:={\varvec{A}}_{\pi (n)}\), where the function \(\pi : \{1,\ldots ,N\} \rightarrow \{1,\ldots ,N\}\) is a permutation satisfying \(\lambda _{\min }(\check{\varvec{A}}_m) < \lambda _{\min }(\check{\varvec{A}}_n)\) for every \(m > n\). Let also \(\check{i}=\pi (i)\);

  2. starting from \(\check{q}_{n}=\check{q}_{n,\mathrm{min}}\) for every \(n \ne \check{i}\), the first variable \(\check{q}_{1}\) (if \(\check{i} \ne 1\)) is increased until either the constraint \(\sum _{n=1,\ldots ,N, n \ne \check{i}} \check{q}_n = {\bar{q}} N-\check{q}_{\check{i}}\), or the constraint \(\check{q}_{1}=\check{q}_{1,\mathrm{max}}\), is met; if \(\check{i}=1\), then the procedure is applied to the second variable \(\check{q}_{2}\);

  3. step 2 is repeated for the successive variables (excluding \(\check{q}_{\check{i}}\)), terminating the first time the constraint \(\sum _{n=1,\ldots ,N, n \ne \check{i}} \check{q}_n = {\bar{q}} N-\check{q}_{\check{i}}\) is met (this surely occurs, since \(q_i\) is admissible).
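
A minimal Python sketch of this greedy procedure follows (the function name greedy_q and all numerical values are ours, introduced only for illustration):

```python
import numpy as np

def greedy_q(lam, q_min, q_max, i, q_i, q_bar_N):
    """Solve problem (37) for a fixed admissible q_i: maximize sum_n lam[n]*q[n]
    over q[n] in [q_min[n], q_max[n]] for n != i, subject to
    sum_{n != i} q[n] = q_bar_N - q_i (a continuous knapsack problem)."""
    N = len(lam)
    q = np.array(q_min, dtype=float)
    q[i] = q_i
    # budget left after setting all the other q_n to their lower bounds;
    # admissibility of q_i (Eqs. (38)-(39)) keeps it between 0 and the total room
    budget = (q_bar_N - q_i) - sum(q_min[n] for n in range(N) if n != i)
    # raise the q_n (n != i) in decreasing order of lam[n] until the budget is spent
    for n in sorted((n for n in range(N) if n != i), key=lambda n: -lam[n]):
        inc = min(q_max[n] - q_min[n], budget)
        q[n] += inc
        budget -= inc
        if budget <= 1e-12:
            break
    return q

# illustrative numbers (not taken from the paper)
lam = [2.0, 1.5, 1.0, 0.5]                         # lambda_min(A_n)
q_min, q_max = [0.2, 0.2, 0.2, 0.2], [0.9, 0.9, 0.9, 0.9]
q_bar_N, i = 2.0, 0                                # q_bar * N and the index i of the test unit
print(greedy_q(lam, q_min, q_max, i, 0.5, q_bar_N))   # approx. [0.5, 0.9, 0.4, 0.2]
```

In this example, one of the units other than i ends up at its maximum, at most one lies in the interior of its interval, and the remaining one stays at its minimum, in agreement with the threshold structure described later in this section.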

The resulting optimal \(q_n^{\circ }\) (for \(n=1,\ldots ,N\) with \(n \ne i\)) are parametrized by the remaining variable \(q_i\). Then, the optimal value of the objective function of the optimization problem (37) is a real-valued function of \(q_i\) which, in the following, is denoted by \(f_i(q_i)\). It follows from the procedure above that \(f_i(q_i)\) is a continuous and piece-wise affine function of \(q_i\), with piece-wise constant slopes \(\lambda _{\min }(\check{\varvec{A}}_{\check{i}})-\lambda _{\min }(\check{\varvec{A}}_{n(q_i)})\), where the choice of the index n is a function of \(q_i\), and is such that \(\lambda _{\min }(\check{\varvec{A}}_{n(q_i)})\) is a nondecreasing function of \(q_i\). Hence, \(f_i(q_i)\) is concave, and is nondecreasing for \(q_i \le {\bar{q}} N - \sum _{n=1}^{\check{i}-1} \check{q}_{n, \mathrm{max}}\), where \(\check{q}_{n, \mathrm{max}}:=q_{\pi (n), \mathrm{max}}\), and nonincreasing otherwise.

Exploiting the results above, the optimal value of \(q_i\) for the original optimization problem (36) is obtained by solving the following optimization problem:

$$\begin{aligned} \underset{q_i \in [{\hat{q}}_{i,\mathrm{min}}, {\hat{q}}_{i,\mathrm{max}}]}{\mathrm{minimize}} \left( \frac{1}{q_i} + \frac{{\mathbb {E}} \left\{ \left\| \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} -\varvec{x}_i^\mathrm{test}\right) \right\| _2^2\right\} }{f_i(q_i)}\right) . \end{aligned}$$
(40)

This is a convex optimization problem: the function \(\frac{1}{q_i}\) is convex, and the function

$$\begin{aligned} \frac{{\mathbb {E}} \left\{ \left\| \left( {\mathbb {E}} \left\{ \varvec{x}_{i,1}\right\} -\varvec{x}_i^\mathrm{test}\right) \right\| _2^2\right\} }{f_i(q_i)} \end{aligned}$$
(41)

is of the form \(h(f_i)\), where \(f_i\) is concave and h is convex and nonincreasing, so that \(h(f_i)\) is convex (Boyd and Vandenberghe 2004, Section 3.2). After solving the optimization problem (40), the optimal values of the other \(q_n\) for the original optimization problem (36) are obtained by applying the three-step greedy procedure detailed above.
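
Continuing the sketch above (it reuses greedy_q and the same illustrative numbers lam, q_min, q_max, i, and q_bar_N defined there), problem (40) can be solved, under our assumptions, by evaluating its convex objective on a fine grid of admissible values of \(q_i\):

```python
import numpy as np

E_dist2 = 3.0                              # E{ ||E{x_{i,1}} - x_i^test||_2^2 }, assumed known

def f_i(q_i):
    # optimal value of problem (37) for this q_i, via greedy_q from the sketch above
    return float(np.dot(lam, greedy_q(lam, q_min, q_max, i, q_i, q_bar_N)))

# admissible range for q_i, Eqs. (38)-(39)
q_lo = max(q_min[i], q_bar_N - sum(q_max[n] for n in range(len(lam)) if n != i))
q_hi = min(q_max[i], q_bar_N - sum(q_min[n] for n in range(len(lam)) if n != i))

grid = np.linspace(q_lo, q_hi, 2001)
vals = [1.0 / qi + E_dist2 / f_i(qi) for qi in grid]   # objective of problem (40)
q_i_opt = grid[int(np.argmin(vals))]
q_opt = greedy_q(lam, q_min, q_max, i, q_i_opt, q_bar_N)
print(q_i_opt, q_opt)
```

Since the objective is convex on the admissible interval, the grid minimizer is close to the exact one; a one-dimensional convex solver could be used instead.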

It follows from the reasoning above that the structure of the optimal solutions \(q_n^\circ \) is as follows. First, there exists a threshold \({\bar{\lambda }}^\circ > 0\) such that

  (i) for any \(n \ne i\) with \(\lambda _{\min }({\varvec{A}}_n) > {{\bar{\lambda }}^\circ }\), \(q_n^\circ \) is equal to its maximum admissible value \(q_{n,\mathrm{max}}\);

  (ii) for any \(n \ne i\) with \(\lambda _{\min }({\varvec{A}}_n) < {{\bar{\lambda }}^\circ }\), \(q_n^\circ \) is equal to its minimum admissible value \(q_{n,\mathrm{min}}\);

  (iii) for at most one unit \(n \ne i\) (for which \(\lambda _{\min }({\varvec{A}}_n) = {{\bar{\lambda }}^\circ }\), provided that there exists one value of n for which this condition holds), \(q_n^\circ \) belongs to the interior of the interval \([q_{n,\mathrm{min}}, q_{n,\mathrm{max}}]\).

Moreover,

  (iv) if \(\left( {\bar{q}} N - \sum _{n=1}^{\check{i}-1} \check{q}_{n, \mathrm{max}}\right) \ge {\hat{q}}_{i,\mathrm{max}}\), then

    $$\begin{aligned} q_i^\circ ={\hat{q}}_{i,\mathrm{max}}, \end{aligned}$$
    (42)

    and

    $$\begin{aligned} {\bar{\lambda }}^\circ \in (0,\lambda _{\min }({\varvec{A}}_i))\,; \end{aligned}$$
    (43)

  (v) if \(\left( {\bar{q}} N - \sum _{n=1}^{\check{i}-1} \check{q}_{n, \mathrm{max}}\right) < {\hat{q}}_{i,\mathrm{max}}\), then

    $$\begin{aligned} q_i^\circ \in \left[ \left( {\bar{q}} N - \sum _{n=1}^{\check{i}-1} \check{q}_{n, \mathrm{max}}\right) , {\hat{q}}_{i,\mathrm{max}}\right] , \end{aligned}$$
    (44)

    and

    $$\begin{aligned} {\bar{\lambda }}^\circ > \lambda _{\min }({\varvec{A}}_i). \end{aligned}$$
    (45)

Finally, it is worth observing that the structure highlighted above for the optimal solutions \(q_n^\circ \) and \(c^\circ \) (the latter reported below Eq. (36)), which is valid for any fixed value of \({\bar{q}}\), can be useful for solving the modification of the optimization problem (36) obtained when the constraint (31) is replaced by

$$\begin{aligned} {\bar{q}}_{\mathrm{min}} N \le \sum _{n=1}^N q_n \le {\bar{q}}_{\mathrm{max}} N, \end{aligned}$$
(46)

for some given \({\bar{q}}_{\mathrm{min}}, {\bar{q}}_{\mathrm{max}} \in (0,1]\), with \({\bar{q}}_{\mathrm{min}}<{\bar{q}}_{\mathrm{max}}\).

5 Conclusions

In this paper, the optimal trade-off between sample size, precision of supervision, and selection probabilities has been studied with specific reference to a quite general linear model of the input–output relationship representing unobserved heterogeneity in the data, namely the unbalanced fixed effects panel data model. First, we have analyzed its conditional generalization error; then, we have minimized a large-sample upper bound on it with respect to some of its parameters. We have proved that, under suitable assumptions, “many but bad” examples provide a smaller upper bound on the conditional generalization error than “few but good” ones, whereas in other cases the opposite occurs. The choice between “many but bad” and “few but good” examples plays an important role when better supervision implies higher costs.

The theoretical results obtained in this work could be applied to the design of the acquisition of unbalanced panel data in several fields, such as biostatistics, econometrics, educational research, engineering, neuroscience, political science, and sociology. Moreover, the analysis of the large-sample case could be extended to deal with large N, or with both large N and large T. These cases would be of interest for their potential applications in microeconometrics (Cameron and Trivedi 2005). Another possible extension concerns the introduction, in the noise model, of a subset of non-controllable parameters (beyond the controllable one, i.e., the noise variance), which could be estimated from a subset of training data. As a final extension, one could investigate and optimize the trade-off between sample size and precision of supervision (and possibly, also selection probabilities) for the random effects panel data model (Greene 2003, Chapter 13). This model is also commonly applied in the analysis of economic data, and differs from the fixed effects panel data model in that the individual effects are treated as random variables. In the present context, however, a possible advantage of the fixed effects panel data model is that it also allows one to obtain estimates of the individual constants \(\eta _n\) (see Eq. (7)), which appear in the expression (11) of the conditional generalization error. Moreover, the application of the random effects model to the unbalanced case requires stronger assumptions than the ones needed for the application of the fixed effects model (Wooldridge 2002, Chapter 17).