The Generation of Experimental Data for Computational Testing in Optimization

  • Nicholas G. Hall
  • Marc E. Posner

Abstract

This chapter discusses approaches to generating synthetic data for use in scientific experiments. In many diverse scientific fields, the lack of availability, high cost or inconvenience of the collection of real-world data motivates the generation of synthetic data. In many experiments, the method chosen to generate synthetic data can significantly affect the results. Unfortunately, the scientific literature does not contain general protocols for how synthetic data should be generated. The purpose of this chapter is to rectify that deficiency. The protocol we propose is based on several generation principles. These principles motivate and organize the data generation process. The principles are operationalized by generation properties. Then, together with information about the features of the application and of the experiment, the properties are used to construct a data generation scheme. Finally, we suggest procedures for validating the synthetic data generated. The usefulness of our protocol is illustrated by a discussion of numerous applications of data generation from the optimization literature. This discussion identifies examples of both good and bad data generation practice as it relates to our protocol.

Keywords

Synthetic Data; Knapsack Problem; Scenario Tree; Data Generation Process; Project Schedule Problem
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgements

This research is supported by the Summer Fellowship Program of the Fisher College of Business, The Ohio State University, to the first author. Helpful comments were provided by Tito Homem-de-Mello and two anonymous reviewers.

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  1. Department of Management Sciences, The Ohio State University, Columbus, USA
  2. Department of Integrated Systems Engineering, The Ohio State University, Columbus, USA