On the trade-off between number of examples and precision of supervision in machine learning problems

Abstract

We investigate linear regression problems in which one can additionally control the conditional variance of the output given the input by varying the computational time dedicated to supervising each example. For a given upper bound on the total computational time for supervision, we optimize the trade-off between the number of examples and their precision (the reciprocal of the conditional variance of the output) by formulating and solving suitable optimization problems, based on large-sample approximations of the outputs of the classical ordinary least squares and weighted least squares regression algorithms. Considering a specific functional form for that precision, we prove that there are cases in which “many but bad” examples provide a smaller generalization error than “few but good” ones, but also that the converse can occur, depending on the “returns to scale” of the precision with respect to the computational time assigned to supervise each example. Hence, the results of this study highlight that increasing the size of the dataset is not always beneficial when one can instead collect a smaller number of more reliable examples. We conclude by presenting numerical results that validate the theory and by discussing extensions of the proposed framework to other optimization problems.
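The trade-off described in the abstract can be illustrated with a minimal sketch. It assumes that the conditional variance of the output has the form \(C (\varDelta T)^{-\alpha }\) and that the large-sample error behaves like \(p C (\varDelta T)^{-\alpha } / \left\lfloor T / \varDelta T \right\rfloor \), as suggested by the rescaled objective functions quoted in the notes. The function names, the choice \(C=1\), the identity covariance of the inputs, the Gaussian noise, and the value \(T=300\) (larger than the one used for Fig. 1, so that the large-sample regime is plausible) are all assumptions made for illustration, not the authors' implementation. The squared parameter estimation error is used as a proxy for the generalization error.

```python
import math

import numpy as np


def noise_variance(dt, alpha, c=1.0):
    """Conditional variance of the output, assumed to be c * dt**(-alpha)."""
    return c * dt ** (-alpha)


def large_sample_error(dt, alpha, total_time, p, c=1.0):
    """Approximate error p * variance / N, with N = floor(total_time / dt)."""
    n = math.floor(total_time / dt)
    return p * noise_variance(dt, alpha, c) / n


def simulated_error(dt, alpha, total_time, p, c=1.0, trials=200, seed=0):
    """Monte Carlo mean of ||beta_hat - beta||^2 for ordinary least squares."""
    rng = np.random.default_rng(seed)
    n = math.floor(total_time / dt)
    sigma = math.sqrt(noise_variance(dt, alpha, c))
    beta = rng.standard_normal(p)  # hypothetical true parameter vector
    errors = []
    for _ in range(trials):
        # Inputs with identity covariance (an assumption of this sketch).
        x = rng.standard_normal((n, p))
        # Noisy supervision: the noise level depends on the time per example.
        y = x @ beta + sigma * rng.standard_normal(n)
        beta_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
        errors.append(np.sum((beta_hat - beta) ** 2))
    return float(np.mean(errors))


if __name__ == "__main__":
    p, total_time = 10, 300.0
    for alpha in (0.5, 1.0, 1.5):
        for dt in (0.3, 0.7):
            approx = large_sample_error(dt, alpha, total_time, p)
            sim = simulated_error(dt, alpha, total_time, p)
            print(f"alpha={alpha}, dt={dt}: approx={approx:.4f}, simulated={sim:.4f}")
```

Under these assumptions, the smaller supervision time should give the lower error for \(\alpha < 1\) (“many but bad” examples) and the larger supervision time should give the lower error for \(\alpha > 1\) (“few but good” examples).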

Fig. 1

Notes

  1. https://www.advanpix.com.

  2. Having test examples independent from the training set is a very common assumption in machine learning, made to get a fair estimate of the generalization capability of the trained model (e.g., for the specific case considered in this work, of the performance index reported later in Eq. (4)). Replacing the test examples with the training ones to obtain such an estimate would produce misleading results if the trained learning machine overfits the training set.

  3. We use the expressions “decreasing returns to scale”, “constant returns to scale”, and “increasing returns to scale” when modeling the precision of supervision as a function of the computational time per example to refer to the cases in which such precision is modeled, respectively, as a strictly concave and increasing function, a linear increasing function, and a strictly convex and increasing function of the computational time per example. Given the connections between the subject of the paper and econometrics, this terminology has been used to provide an interpretation of the measurement error model that can be of interest to readers with a background in economics.

  4. This topic is clearly in line with the call for papers of the special issue “Optimization Models and Solution Techniques” of the journal Optimization Letters, from which we report the following paragraph: “Recent advances in information technology enable the treatment of big data volumes, devising effective solution methods toward better decisions”.

  5. In this interaction, optimization also plays an important role, as shown by the large number of papers dealing with the application of optimization techniques to machine learning, published in journals such as Operations Research, Mathematics of Operations Research, and Optimization Letters. As a further example, the International Annual Conference on machine Learning, Optimization and Data science (LOD) is specifically devoted to the interaction between optimization and machine learning.

  6. Both ordinary least squares and weighted least squares (considered later in this section) implement optimal solutions of related unconstrained convex quadratic optimization problems; see, e.g., [15, Chapters 2 and 18].

  7. Although in the paper it is assumed that all the training examples are available simultaneously to the learning machine (e.g., at the end of the time interval used for all their supervisions), the analysis could be extended to online learning, by replacing T with the current time, and applying results such as [7, Proposition 5].

  8. I.e., for every \(\varepsilon >0\), \(\mathrm{Prob} \left( \left\| \frac{X_{N(\varDelta T)}' X_{N(\varDelta T)}}{N(\varDelta T)} - \mathbb {E} \left\{ \underline{x} \, \underline{x}'\right\} \right\| > \varepsilon \right) \) (where \(\Vert \cdot \Vert \) is an arbitrary matrix norm) tends to 0 as \(N(\varDelta T)\) tends to \(+\infty \).

  9. This assumption has been introduced only to avoid the pathological case in which the set of \(\varepsilon \)-optimal solutions of one of the optimization problems (13) or (14) coincides trivially with the whole admissible domain \([\varDelta T_{\mathrm{min}}, \varDelta T_{\mathrm{max}}]\).

  10. For a better understanding of this part of the proof, Fig. 1 shows the behavior of the rescaled objective functions \(T \frac{p C (\varDelta T)^{-\alpha }}{\left\lfloor \frac{T}{\varDelta T} \right\rfloor }\) and \(p C (\varDelta T)^{1-\alpha }\) for the three cases \(0< \alpha = 0.5 < 1\), \(\alpha = 1.5 > 1\), and \(\alpha = 1\) (the values of the other parameters are \(p=10\), \(T=10\) s, \(\varDelta T_{\mathrm{min}}=0.3\) s, \(\varDelta T_{\mathrm{max}}=0.7\) s, \(k_1=1\), and \(k_2=1\) s).

  11. Simulation results similar to the ones reported in this section have also been obtained for other choices of the parameters of the problem. Given the limited space, the choice \(p=10\) has been made for illustrative purposes, to achieve a good compromise between too small and too large choices for the dimension of the parameter vector.

  12. This choice of the covariance matrix has been obtained by setting \(\mathrm{Var}\left( \underline{x}\right) =A A'\), where the elements of \(A \in \mathbb {R}^{p \times p}\) have been randomly and independently generated according to a uniform probability density on the interval [0, 1].
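The covariance construction \(\mathrm{Var}\left( \underline{x}\right) =A A'\) and the convergence in probability of \(X'X/N\) to \(\mathbb {E} \left\{ \underline{x} \, \underline{x}'\right\} \) mentioned in the notes can be checked numerically with a minimal sketch. It assumes zero-mean Gaussian inputs (so that \(\mathbb {E} \left\{ \underline{x} \, \underline{x}'\right\} = A A'\)) and uses the Frobenius norm as one possible choice of matrix norm. The function name `second_moment_gap` and the sample sizes are hypothetical, chosen only for illustration.

```python
import numpy as np


def second_moment_gap(n, a, rng):
    """Frobenius norm of X'X / n - A A' for n zero-mean Gaussian inputs."""
    p = a.shape[0]
    # Rows of x have covariance A A' because x = z A' with z standard normal.
    x = rng.standard_normal((n, p)) @ a.T
    return float(np.linalg.norm(x.T @ x / n - a @ a.T, ord="fro"))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = 10
    # Entries of A drawn independently from a uniform density on [0, 1].
    a = rng.uniform(0.0, 1.0, size=(p, p))
    for n in (100, 10_000, 100_000):
        print(f"N={n}: gap={second_moment_gap(n, a, rng):.4f}")
```

The printed gap should shrink as N grows, roughly like \(1/\sqrt{N}\) under these assumptions, which is consistent with the convergence stated in the notes but does not replace the proof.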

References

  1. Athey, S., Imbens, G.: Recursive partitioning for heterogeneous causal effects. Proc. Natl. Acad. Sci. 113, 7353–7360 (2016)

  2. Bacigalupo, A., Gnecco, G.: Metamaterial filter design via surrogate optimization. J. Phys. Conf. Proc. 1092, 4 (2018)

  3. Bacigalupo, A., Gnecco, G., Lepidi, M., Gambarotta, L.: Optimal design of low-frequency band gaps in anti-tetrachiral lattice meta-materials. Compos. Part B Eng. 115, 341–359 (2017)

  4. Bacigalupo, A., Lepidi, M., Gnecco, G., Gambarotta, L.: Optimal design of auxetic hexachiral metamaterials with local resonators. Smart Mater. Struct. 25(5), 19 (2016)

  5. Bargagli Stoffi, F.J., Gnecco, G.: Estimating heterogeneous causal effects in the presence of irregular assignment mechanisms. In: Proceedings of the 5th IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA 2018), Turin, Italy, pp. 1–10 (2018)

  6. Barlow, R.J.: Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences, 1st edn. Wiley, London (1989)

  7. Gnecco, G., Bemporad, A., Gori, M., Sanguineti, M.: LQG online learning. Neural Computation 29, 2203–2291 (2017)

  8. Gnecco, G., Nutarelli, F.: On the trade-off between number of examples and precision of supervision in regression problems. In: Proceedings of the 4th International Conference of the International Neural Network Society on Big Data and Deep Learning (INNS BDDL 2019), Sestri Levante, Italy, pp. 1–6 (2019)

  9. Gnecco, G., Nutarelli, F.: On the trade-off between sample size and precision of supervision in the fixed effects panel data model. In: Proceedings of the 5th International Conference on machine Learning, Optimization & Data science (LOD 2019), Certosa di Pontignano (Siena), Italy, pp. 1–12 (2019)

  10. Greene, W.H.: Econometric Analysis, 5th edn. Pearson Education Inc., London (2003)

  11. Groves, R.M., Fowler, F.J.J., Couper, M.P., Lepkowski, J.M., Singer, E., Tourangeau, R.: Survey Methodology, 1st edn. Wiley-Interscience, London (2004)

  12. Hamming, R.: Numerical Methods for Scientists and Engineers, 2nd edn. McGraw-Hill, New York (1973)

  13. Korolev, V.Y., Shevtsova, I.G.: On the upper bound for the absolute constant in the Berry-Esseen inequality. Theory Probab. Appl. 54(4), 638 (2010)

  14. Maddala, G.S.: Introduction to Econometrics, 2nd edn. MacMillan Publishing Company, London (1992)

  15. Ruud, P.A.: An Introduction to Classical Econometric Theory, 1st edn. Oxford University Press, Oxford (2000)

  16. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis, 1st edn. Cambridge University Press, Cambridge (2004)

  17. Vapnik, V.N.: Statistical Learning Theory, 1st edn. Wiley, London (1998)

  18. Varian, H.R.: Big data: new tricks for econometrics. J. Econ. Perspect. 28, 3–28 (2014)

  19. Wilkinson, J.H.: The evaluation of the zeros of ill-conditioned polynomials. Part I. Numer. Math. 1, 150–166 (1959)

Author information

Correspondence to Giorgio Gnecco.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Gnecco, G., Nutarelli, F. On the trade-off between number of examples and precision of supervision in machine learning problems. Optim Lett (2019). https://doi.org/10.1007/s11590-019-01486-x

Keywords

  • Optimal supervision time
  • Linear regression
  • Variance control
  • Ordinary least squares
  • Large-sample approximation