Abstract
Interest in mining massive data sets, whose sizes are measured in terabytes, has grown in recent years. However, in some problems collecting even a single data point is very expensive, resulting in data sets with only tens or hundreds of samples. One such problem is building code surrogates: a computer simulation is run at many different values of the input parameters, and a regression model is built to relate the outputs of the simulation to the inputs. A good surrogate can be very useful in sensitivity analysis, uncertainty analysis, and the design of experiments, but the cost of running expensive simulations at many sample points can be high. In this paper, we use a problem from the domain of additive manufacturing to show that even with small data sets we can build good-quality surrogates by appropriately selecting the input samples and the regression algorithm. Our work is broadly applicable to simulations in other domains, and the ideas proposed can be used in time-constrained machine learning tasks, such as hyper-parameter optimization.
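The surrogate workflow summarized in the abstract (choose a small set of input samples, run the simulation at each, fit a regression model to the results) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the method or code used in the paper: `toy_simulation` is a hypothetical, cheap stand-in for an expensive simulation, the sampling is a basic Latin hypercube design, and the regression is a simple Nadaraya-Watson kernel-weighted average, used here in place of the regression algorithms compared in the paper. The sample size and bandwidth are arbitrary choices.

```python
import math
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims with one point per stratum in each dimension."""
    columns = []
    for _ in range(n_dims):
        # Draw one random value inside each of n_samples equal-width strata,
        # then shuffle so strata are paired at random across dimensions.
        strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(strata)
        columns.append(strata)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def toy_simulation(x):
    """Hypothetical stand-in for an expensive simulation with two inputs."""
    return math.sin(2.0 * math.pi * x[0]) + 0.5 * x[1] ** 2


def kernel_surrogate(train_x, train_y, bandwidth):
    """Build a predictor that returns a Gaussian-weighted average of the training outputs."""
    def predict(x):
        weights = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2.0 * bandwidth ** 2))
            for xi in train_x
        ]
        return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)
    return predict


rng = random.Random(0)

# Small training set: 50 simulation runs at Latin hypercube sample points.
train_x = latin_hypercube(50, 2, rng)
train_y = [toy_simulation(x) for x in train_x]
surrogate = kernel_surrogate(train_x, train_y, bandwidth=0.1)

# Compare the surrogate with the simulation at 200 independent random points.
test_x = [(rng.random(), rng.random()) for _ in range(200)]
rmse = math.sqrt(
    sum((surrogate(x) - toy_simulation(x)) ** 2 for x in test_x) / len(test_x)
)
print(f"RMSE on 200 held-out points: {rmse:.3f}")
```

In practice the held-out comparison would itself require additional simulation runs, so with an expensive code the evaluation budget has to be weighed against the training budget.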
References
ACME (2016) Accelerated climate modeling for energy web page. https://climatemodeling.science.energy.gov/projects/accelerated-climate-modeling-energy
Atkeson C, Schaal SA, Moore AW (1997) Locally weighted learning. AI Rev 11:75–133
Austin PC, Steyerberg EW (2015) The number of subjects per variable required in linear regression analyses. J Clin Epidemiol 68:627–636
Babyak MA (2004) What you see may not be what you get: a brief, non-technical introduction to overfitting in regression-type models. Psychosom Med 66:411–421
Bennett KP, Mangasarian OL (1992) Robust linear programming discrimination of two linearly inseparable sets. Optim Methods Softw 1(1):23–34
Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13:281–305
Beuth J et al (2013) Process mapping for qualification across multiple direct metal additive manufacturing processes. In: Bourell D (ed) International solid freeform fabrication symposium, an additive manufacturing conference. University of Texas at Austin, Austin, Texas, pp 655–665
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. CRC Press, Boca Raton
Burl MC et al (2006) Automated knowledge discovery from simulators. In: Proceedings, Sixth SIAM international conference on data mining, pp 82–93
Carreira-Perpiñán MA (1996) A review of dimension reduction techniques. Technical Report CS-96-09, Department of Computer Science, University of Sheffield, UK
Chang C-C, Lin C-J (2011) LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Chapelle O, Vapnik V, Bengio Y (2002) Model selection for small sample regression. Mach Learn 48(1):9–23
Committee on Mathematical Foundations of Verification, Validation, and Uncertainty Quantification; Board on Mathematical Sciences and Their Applications, Division on Engineering and Physical Sciences, National Research Council (2012) Assessing the reliability of complex models: mathematical and statistical foundations of verification, validation, and uncertainty quantification. The National Academies Press, Washington
Eagar T, Tsai N (1983) Temperature-fields produced by traveling distributed heat-sources. Weld J 62:S346–S355
Fang K-T, Li R, Sudjianto A (2005) Design and modeling for computer experiments. Chapman and Hall/CRC Press, Boca Raton
Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19(1):1–67
GPy (2012) GPy: A Gaussian process framework in python. http://github.com/SheffieldML/GPy
Guo Y, Graber A, McBurney RN, Balasubramanian R (2010) Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms. BMC Bioinf 11:447
Isaksson A, Wallman M, Goransson H, Gustafsson M (2008) Cross-validation and bootstrapping are unreliable in small sample classification. Pattern Recogn Lett 29:1960–1965
Kamath C (2009) Scientific data mining: a practical perspective. Society for Industrial and Applied Mathematics (SIAM), Philadelphia
Kamath C (2016) Data mining and statistical inference in selective laser melting. Int J Adv Manuf Technol 86:1659–1677
Kamath C, Cantú-Paz E (2001) Creating ensembles of decision trees through sampling. In: Proceedings of the 33rd symposium on the interface: computing science and statistics
Kamath C, El-dasher B, Gallegos GF, King WE, Sisto A (2014) Density of additively-manufactured, 316L SS parts using laser powder-bed fusion at powers up to 400 W. Int J Adv Manuf Technol 74:65–78
Kleijnen JPC (2008) Design and analysis of simulation experiments. Springer, New York
Mitchell DP (1991) Spectrally optimal sampling for distribution ray tracing. Comput Graph 25(4):157–164
Oehlert GW (2000) A first course in design and analysis of experiments. W. H. Freeman. http://users.stat.umn.edu/~gary/Book.html
Owen AB (2003) Quasi-Monte Carlo sampling. Course notes from Siggraph course. http://www-stat.stanford.edu/~owen/reports/
Owen AB (1998) Latin supercube sampling for very high-dimensional simulations. ACM Trans Model Comput Simul 8(1):71–102
Qian Y et al (2016) Uncertainty quantification in climate modeling and projection. Bull Am Meteorol Soc 97(5):821–824
Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. MIT Press, Cambridge
Rokach L (2010) Pattern classification using ensemble methods. World Scientific Publishing, Singapore
Rokach L, Maimon O (2014) Data mining with decision trees: theory and applications. World Scientific Publishing, Singapore
Rudy J (2013) Py-earth. https://contrib.scikit-learn.org/py-earth/
Schölkopf B, Smola AJ, Williamson RC, Bartlett PL (2000) New support vector algorithms. Neural Comput 12(5):1207–1245
Shiflet AB, Shiflet GW (2006) Introduction to computational science: modeling and simulation for the sciences. Princeton University Press, Princeton
Vapnik VN (1995) The nature of statistical learning theory. Springer, New York
Verhaeghe F, Craeghs T, Heulens J, Pandelaers L (2009) A pragmatic model for selective laser melting with evaporation. Acta Mater 57:6006–6012
Yadroitsev I, Gusarov A, Yadroitsava I, Smurov I (2010) Single track formation in selective laser melting of metal powders. J Mater Process Technol 210:1624–1631
Acknowledgements
The results in this paper were generated using codes we developed for regression trees and LWKR, as well as public domain codes for MARS [33], SVR [11], and GP [17]. The Eagar–Tsai data were generated using a code developed by David Macknelly. We thank the anonymous reviewers for their feedback. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Cite this article
Kamath, C., Fan, Y.J. Regression with small data sets: a case study using code surrogates in additive manufacturing. Knowl Inf Syst 57, 475–493 (2018). https://doi.org/10.1007/s10115-018-1174-1
DOI: https://doi.org/10.1007/s10115-018-1174-1