Regression with small data sets: a case study using code surrogates in additive manufacturing

  • Regular Paper
Knowledge and Information Systems

Abstract

There has been an increasing interest in recent years in the mining of massive data sets whose sizes are measured in terabytes. However, there are some problems where collecting even a single data point is very expensive, resulting in data sets with only tens or hundreds of samples. One such problem is that of building code surrogates, where a computer simulation is run using many different values of the input parameters and a regression model is built to relate the outputs of the simulation to the inputs. A good surrogate can be very useful in sensitivity analysis, uncertainty analysis, and in designing experiments, but the cost of running expensive simulations at many sample points can be high. In this paper, we use a problem from the domain of additive manufacturing to show that even with small data sets we can build good quality surrogates by appropriately selecting the input samples and the regression algorithm. Our work is broadly applicable to simulations in other domains and the ideas proposed can be used in time-constrained machine learning tasks, such as hyper-parameter optimization.
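
The surrogate workflow described above (run the simulation at chosen input points, then fit a regression model to the results) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_simulation` is a hypothetical cheap stand-in for an expensive code, the inputs are drawn uniformly at random, and the regressor is a plain Gaussian-kernel weighted average. It is not the authors' code, sampling scheme, or regression method.

```python
import math
import random


def toy_simulation(x1, x2):
    """Hypothetical stand-in for an expensive simulation with two inputs in [0, 1]."""
    return math.sin(3.0 * x1) + 0.5 * x2 * x2


def sample_inputs(n, seed=0):
    """Draw n input points uniformly at random from the unit square."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]


def kernel_regression(train_x, train_y, query, bandwidth=0.1):
    """Predict at `query` as a Gaussian-weighted average of the training outputs."""
    weights = []
    for x in train_x:
        dist_sq = sum((a - b) ** 2 for a, b in zip(x, query))
        weights.append(math.exp(-dist_sq / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed to zero: fall back to the plain mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def build_and_score(n_train=100, n_test=200):
    """Fit the surrogate on n_train runs and score it on n_test held-out runs."""
    train_x = sample_inputs(n_train, seed=0)
    train_y = [toy_simulation(*x) for x in train_x]
    test_x = sample_inputs(n_test, seed=1)
    test_y = [toy_simulation(*x) for x in test_x]
    surrogate = [kernel_regression(train_x, train_y, q) for q in test_x]
    # Baseline: always predict the mean of the training outputs.
    mean_y = sum(train_y) / len(train_y)
    baseline = [mean_y] * len(test_y)
    return rmse(surrogate, test_y), rmse(baseline, test_y)


if __name__ == "__main__":
    surrogate_rmse, baseline_rmse = build_and_score()
    print(f"surrogate RMSE: {surrogate_rmse:.3f}, mean-only RMSE: {baseline_rmse:.3f}")
```

Comparing the surrogate against a mean-only baseline on held-out runs gives a quick check that the model has learned something from the small sample. In this sketch, `sample_inputs` and `kernel_regression` are the two pieces one would swap out to try a different sampling scheme or regression algorithm.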

References

  1. ACME (2016) Accelerated climate modeling for energy web page. https://climatemodeling.science.energy.gov/projects/accelerated-climate-modeling-energy

  2. Atkeson C, Schaal SA, Moore AW (1997) Locally weighted learning. Artif Intell Rev 11:75–133

  3. Austin PC, Steyerberg EW (2015) The number of subjects per variable required in linear regression analyses. J Clin Epidemiol 68:627–636

  4. Babyak MA (2004) What you see may not be what you get: a brief, non-technical introduction to overfitting in regression-type models. Psychosom Med 66:411–421

  5. Bennett KP, Mangasarian OL (1992) Robust linear programming discrimination of two linearly inseparable sets. Optim Methods Softw 1(1):23–34

  6. Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13:281–305

  7. Beuth J et al (2013) Process mapping for qualification across multiple direct metal additive manufacturing processes. In: Bourell D (ed) International solid freeform fabrication symposium, an additive manufacturing conference. University of Texas at Austin, Austin, Texas, pp 655–665

  8. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. CRC Press, Boca Raton

  9. Burl MC et al (2006) Automated knowledge discovery from simulators. In: Proceedings, Sixth SIAM international conference on data mining, pp 82–93

  10. Carreira-Perpiñán MA (1996) A review of dimension reduction techniques. Technical Report CS-96-09, Department of Computer Science, University of Sheffield, UK

  11. Chang C-C, Lin C-J (2011) LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

  12. Chapelle O, Vapnik V, Bengio Y (2002) Model selection for small sample regression. Mach Learn 48(1):9–23

  13. Committee on Mathematical Foundations of Verification, Validation, and Uncertainty Quantification; Board on Mathematical Sciences and Their Applications, Division on Engineering and Physical Sciences, National Research Council (2012) Assessing the reliability of complex models: mathematical and statistical foundations of verification, validation, and uncertainty quantification. The National Academies Press, Washington

  14. Eagar T, Tsai N (1983) Temperature-fields produced by traveling distributed heat-sources. Weld J 62:S346–S355

  15. Fang K-T, Li R, Sudjianto A (2005) Design and modeling for computer experiments. Chapman and Hall/CRC Press, Boca Raton

  16. Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19(1):1–67

  17. GPy (2012) GPy: A Gaussian process framework in python. http://github.com/SheffieldML/GPy

  18. Guo Y, Graber A, McBurney RN, Balasubramanian R (2010) Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms. BMC Bioinf 11:447

  19. Isaksson A, Wallman M, Goransson H, Gustafsson M (2008) Cross-validation and bootstrapping are unreliable in small sample classification. Pattern Recogn Lett 29:1960–1965

  20. Kamath C (2009) Scientific data mining: a practical perspective. Society for Industrial and Applied Mathematics (SIAM), Philadelphia

  21. Kamath C (2016) Data mining and statistical inference in selective laser melting. Int J Adv Manuf Technol 86:1659–1677

  22. Kamath C, Cantú-Paz E (2001) Creating ensembles of decision trees through sampling. In: Proceedings of the 33rd symposium on the interface: computing science and statistics

  23. Kamath C, El-dasher B, Gallegos GF, King WE, Sisto A (2014) Density of additively-manufactured, 316L SS parts using laser powder-bed fusion at powers up to 400 W. Int J Adv Manuf Technol 74:65–78

  24. Kleijnen JPC (2008) Design and analysis of simulation experiments. Springer, New York

  25. Mitchell DP (1991) Spectrally optimal sampling for distribution ray tracing. Comput Graph 25(4):157–164

  26. Oehlert GW (2000) A first course in design and analysis of experiments. W. H. Freeman. http://users.stat.umn.edu/~gary/Book.html

  27. Owen AB (2003) Quasi-Monte Carlo sampling. Course notes from Siggraph course. http://www-stat.stanford.edu/~owen/reports/

  28. Owen AB (1998) Latin supercube sampling for very high-dimensional simulations. ACM Trans Model Comput Simul 8(1):71–102

  29. Qian Y et al (2016) Uncertainty quantification in climate modeling and projection. Bull Am Meteorol Soc 97(5):821–824

  30. Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. MIT Press, Cambridge

  31. Rokach L (2010) Pattern classification using ensemble methods. World Scientific Publishing, Singapore

  32. Rokach L, Maimon O (2014) Data mining with decision trees: theory and applications. World Scientific Publishing, Singapore

  33. Rudy J (2013) Py-earth. https://contrib.scikit-learn.org/py-earth/

  34. Schölkopf B, Smola AJ, Williamson RC, Bartlett PL (2000) New support vector algorithms. Neural Comput 12(5):1207–1245

  35. Shiflet AB, Shiflet GW (2006) Introduction to computational science: modeling and simulation for the sciences. Princeton University Press, Princeton

  36. Vapnik VN (1995) The nature of statistical learning theory. Springer, New York

  37. Verhaeghe F, Craeghs T, Heulens J, Pandelaers L (2009) A pragmatic model for selective laser melting with evaporation. Acta Mater 57:6006–6012

  38. Yadroitsev I, Gusarov A, Yadroitsava I, Smurov I (2010) Single track formation in selective laser melting of metal powders. J Mater Process Technol 210:1624–1631

Acknowledgements

The results in this paper were generated using codes we developed for regression trees and locally weighted kernel regression (LWKR), as well as public domain codes for multivariate adaptive regression splines (MARS) [33], support vector regression (SVR) [11], and Gaussian processes (GP) [17]. The Eagar–Tsai data were generated using a code developed by David Macknelly. We thank the anonymous reviewers for their feedback. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Author information

Corresponding author

Correspondence to Chandrika Kamath.

About this article

Cite this article

Kamath, C., Fan, Y.J. Regression with small data sets: a case study using code surrogates in additive manufacturing. Knowl Inf Syst 57, 475–493 (2018). https://doi.org/10.1007/s10115-018-1174-1

  • DOI: https://doi.org/10.1007/s10115-018-1174-1
