Data Mining and Knowledge Discovery

, Volume 1, Issue 2, pp 183–201 | Cite as

Mathematical Programming in Data Mining

  • O.L. Mangasarian


Mathematical programming approaches to three fundamental problems will be described: feature selection, clustering and robust representation. The feature selection problem considered is that of discriminating between two sets while recognizing irrelevant and redundant features and suppressing them. This creates a lean model that often generalizes better to new unseen data. Computational results on real data confirm improved generalization of leaner models. Clustering is exemplified by the unsupervised learning of patterns and clusters that may exist in a given database and is a useful tool for knowledge discovery in databases (KDD). A mathematical programming formulation of this problem is proposed that is theoretically justifiable and computationally implementable in a finite number of steps. A resulting k-Median Algorithm is utilized to discover very useful survival curves for breast cancer patients from a medical database. Robust representation is concerned with minimizing trained model degradation when applied to new problems. A novel approach is proposed that purposely tolerates a small error in the training process in order to avoid overfitting data that may contain errors. Examples of applications of these concepts are given.

Mathematical programming feature selection clustering robust representation 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. K. Al-Sultan. A Tabu search approach to the clustering problem. Pattern Recognition, 28(9):1443–1451, 1995.Google Scholar
  2. K. P. Bennett and E. J. Bredensteiner. Geometry in learning. In C. Gorini, E. Hart, W. Meyer, and T. Phillips, editors, Geometry at Work, Washington, D.C., 1997. Mathematical Association of America. To appear, bennek/ Scholar
  3. K. P. Bennett and O. L. Mangasarian. Neural network training via linear programming. In P. M. Pardalos, editor, Advances in Optimization and Parallel Computing, pages 56–67, Amsterdam, 1992. North Holland.Google Scholar
  4. K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23–34, 1992.Google Scholar
  5. K. P. Bennett and O. L. Mangasarian. Bilinear separation of two sets in n-space. Computational Optimization & Applications, 2:207–227, 1993.Google Scholar
  6. D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.Google Scholar
  7. D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation. Prentice-Hall, Inc, Englewood Cliffs, New Jersey, 1989.Google Scholar
  8. R. E. Bixby, J.W. Gregory, I. J. Lustig, R. E. Marsten, and D. F. Shanno. Very large-scale linear programming: a case study combining interior point and simplex methods. Operations Research, 40:885–897, 1992.Google Scholar
  9. A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam's razor. Information Processing Letters, 24:377–380, 1987.Google Scholar
  10. P. S. Bradley, O. L. Mangasarian, and W. N. Street. Feature selection via mathematical programming. Technical Report 95-21, Computer Sciences Department, University ofWisconsin, Madison,Wisconsin, December 1995. Submitted to INFORMS Journal on Computing. Available by Scholar
  11. P. S. Bradley, O. L. Mangasarian, and W. N. Street. Clustering via concave minimization. Technical Report 96-03, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, May 1996. Advances in Neural Information Processing Systems 9, MIT Press, Cambridge, MA 1997, 368–374, M.C. Mozer, M.I. Jordan and T. Petsche, editors. Available by Scholar
  12. E. J. Bredensteiner and K. P. Bennett. Feature minimization within decision trees. Department of Mathematical Sciences Math Report No. 218, Rensselaer Polytechnic Institute, Troy, NY 12180, 1995.Google Scholar
  13. F. Cordellier and J. Ch. Fiorot. On the Fermat-Weber problem with convex cost functionals. Mathematical Programming, 14:295–311, 1978.Google Scholar
  14. CPLEX Optimization Inc., Incline Village, Nevada. Using the CPLEX(TM) Linear Optimizer and CPLEX(TM) Mixed Integer Optimizer (Version 2.0), 1992.Google Scholar
  15. G. B. Dantzig. Linear Programming and Extensions. Princeton University Press, Princeton, New Jersey, 1963.Google Scholar
  16. R. De Leone and O. L. Mangasarian. Serial and parallel solution of large scale linear programs by augmented Lagrangian successive overrelaxation. In A. Kurzhanski, K. Neumann, and D. Pallaschke, editors, Optimization, Parallel Processing and Applications, pages 103–124, Berlin, 1988. Springer-Verlag. Lecture Notes in Economics and Mathematical Systems 304.Google Scholar
  17. R. De Leone and M. A. Tork Roth. Massively parallel solution of quadratic programs via successive overrelaxation. Concurrency: Practice and Experience, 5:623–634, 1993.Google Scholar
  18. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39:27–34, 1996.Google Scholar
  19. M. C. Ferris and O. L. Mangasarian. Parallel variable distribution. SIAM Journal on Optimization, 4(4): 815–832, 1994.Google Scholar
  20. D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139–172, 1987. Christodoulos A. Floudas. Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications (Topics in Chemical Engineering). Oxford Univ Press, New York, 1995.Google Scholar
  21. K. Fukunaga. Statistical Pattern Recognition. Academic Press, NY, 1990. A. A. Gaivoronski. Convergence properties of backpropagation for neural nets via theory of stochastic gradient methods. part i. Optimization Methods and Software, 4(2), 1994.Google Scholar
  22. R. S. Garfinkel and G. L. Nemhauser. Integer Programming. John Wiley & Sons, New York, 1972.Google Scholar
  23. F. Glover. Improved linear programming models for discriminant analysis. Decision Sciences, 21:771–785, 1990.Google Scholar
  24. R. C. Grinold. Mathematical methods for pattern classification. Management Science, 19:272–289, 1972.Google Scholar
  25. M. H. Hassoun. Fundamentals of Artificial Neural Networks. MIT Press, Cambridge, MA, 1995.Google Scholar
  26. J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, California, 1991.Google Scholar
  27. Frederick S. Hillier and Gerald J. Lieberman. Introduction to Operations Research. McGraw-Hill, New York, sixth edition, 1995.Google Scholar
  28. P. J. Huber. Robust estimation of location parameter. Annals of Mathematical Statistics, 35:73–101, 1964.Google Scholar
  29. P. J. Huber. Robust Statistics. John Wiley, New York, 1981.Google Scholar
  30. A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc, Englewood Cliffs, NJ, 1988.Google Scholar
  31. G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Proceedings of the 11th International Conference on Machine Learning, San Mateo, CA, 1994. Morgan Kaufmann.Google Scholar
  32. E. L. Kaplan and P. Meier. Nonparametric estimation from incomplete observations. J. Am. Stat. Assoc., 53:457–481, 1958.Google Scholar
  33. Samuel Karlin. Mathematical Methods and Theory in Games, Programming, and Economics, Volumes 1 and 2. Dover Publications, Mineola, New York, 1992.Google Scholar
  34. N. Karmarkar. A new polynomial time algorithm for linear programming. Combinatorica, 4:373–395, 1984.Google Scholar
  35. A. Klarbring. Mathematical programming in contact problems. In M. H. Aliabadi and C. A. Brebbia, editors, Computational Methods in Contact Mechanics, pages 233–263. Computational Mechanics Publications, Southampton, England, 1993.Google Scholar
  36. David G. Kleinbaum. Survival Analysis. Springer-Verlag, New York, 1996.Google Scholar
  37. Y. le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems II (Denver 1989), pages 598–605, San Mateo, California, 1990. Morgan Kaufmann.Google Scholar
  38. I. J. Lustig, R. E. Marsten, and D. F. Shanno. Interior-point methods for linear programming: Computational state of the art. ORSA Journal on Computing, 6(1):1–38, 1994.Google Scholar
  39. O. L. Mangasarian. Linear and nonlinear separation of patterns by linear programming. Operations Research, 13:444–452, 1965.Google Scholar
  40. O. L. Mangasarian. Nonlinear Programming. McGraw-Hill, New York, 1969. Reprint: SIAM Classic in Applied Mathematics 10, 1994, Philadelphia.Google Scholar
  41. O. L. Mangasarian. Mathematical programming in neural networks. ORSA Journal on Computing, 5(4):349–360, 1993.Google Scholar
  42. O. L. Mangasarian. Machine learning via polyhedral concave minimization. In H. Fischer, B. Riedmueller, and S. Schaeffler, editors, Applied Mathematics and Parallel Computing-Festschrift for Klaus Ritter, pages 175–188. Physica-Verlag A Springer-Verlag Company, Heidelberg, 1996. Available by tech-reports/ Scholar
  43. O. L. Mangasarian. Mathematical programming in machine learning. In G. Di Pillo and F. Giannessi, editors, Nonlinear Optimization and Applications, pages 283–295, New York, 1996. Plenum Publishing.Google Scholar
  44. O. L. Mangasarian and R. R. Meyer. Nonlinear perturbation of linear programs. SIAM Journal on Control and Optimization, 17(6):745–752, November 1979. 200 MANGASARIANGoogle Scholar
  45. O. L. Mangasarian and M. V. Solodov. Serial and parallel backpropagation convergence via nonmonotone perturbed minimization. Optimization Methods and Software, 4(2):103–116, 1994.Google Scholar
  46. O. L. Mangasarian, W. Nick Street, and W. H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4):570–577, July-August 1995.Google Scholar
  47. D. Mladenic. Combinatorial optimization in inductive concept learning. In Proceedings of the Tenth International Conference on Machine Learning, pages 205–211, San MAteo, CA, 1993. Morgan Kaufmann.Google Scholar
  48. P. M. Murphy and D. W. Aha. UCI repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine,, 1992.Google Scholar
  49. K. G. Murty. Linear Programming. John Wiley & Sons, New York, 1983.Google Scholar
  50. K. G. Murty. Network Programming. Prentice Hall, Englewood Cliffs, New Jersey, 1992.Google Scholar
  51. K. G. Murty. Operations Research. Prentice Hall, Englewood Cliffs, New Jersey, 1995.Google Scholar
  52. G. L. Nemhauser and L. A. Wolsey. Integer and Combinatorial Optimization. John Wiley, New York, 1988.Google Scholar
  53. M. L. Overton. A quadratically convergent method for minimizing a sum of Euclidean norms. Mathematical Programming, 27:34–63, 1983.Google Scholar
  54. P. D. Panagiotopoulos. Inequality Problems in Mechanics and Applications. Birkhäuser, Boston, 1985.Google Scholar
  55. M. R. Rao. Cluster analysis and mathematical programming. Journal of the American Statistical Association, 66:622–626, 1971.Google Scholar
  56. R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, New Jersey, 1970.Google Scholar
  57. R. T. Rockafellar. Network Flows and Monotropic Optimization. Wiley-Interscience, New York, 1984.Google Scholar
  58. A. Roy, L. S. Kim, and S. Mukhopadhyay. A polynomial time algorithm for the construction and training of a class of multilayer perceptrons. Neural Networks, 6:535–545, 1993.Google Scholar
  59. D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing. MIT Press, Cambridge, Massachusetts, 1986.Google Scholar
  60. C. Schaffer. Overfitting avoidance as bias. Machine Learning, 10:153–178, 1993.Google Scholar
  61. S. Z. Selim and M. A. Ismail. K-Means-Type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6:81–87, 1984.Google Scholar
  62. J. W. Shavlik and T. G. Dietterich (editors). Readings in Machine Learning. Morgan Kaufman, San Mateo, California, 1990.Google Scholar
  63. F. W. Smith. Pattern classifier design by linear programming. IEEE Transactions on Computers, C-17:367–372, 1968.Google Scholar
  64. W. Nick Street and O. L. Mangasarian. Improved generalization via tolerant training. Technical Report 95-11, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, July 1995. Journal of Optimization Theory and Applications, 96:2, February 1998, to appear. Available by tech-reports/ Scholar
  65. Y. Z. Tsypkin. Foundations of the Theory of Learning Systems. Academic Press, New York, 1973.Google Scholar
  66. Robert J. Vanderbei. Linear Programming: Foundations and Extensions. Kluwer Academic Publishers, Hingham, MA, 1997.Google Scholar
  67. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.Google Scholar
  68. J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, Princeton, New Jersey, 1944.Google Scholar
  69. W. H. Wolberg, W. N. Street, and O. L. Mangasarian. WPBC:Wisconsin Prognostic Breast Cancer Database. Computer Sciences Department, University ofWisconsin, Madison, WPBC/, 1995.Google Scholar
  70. D. H. Wolpert, editor. The Mathematics of Generalization, Reading, MA, 1995. Addison-Wesley.Google Scholar

Copyright information

© Kluwer Academic Publishers 1997

Authors and Affiliations

  • O.L. Mangasarian
    • 1
  1. 1.Computer Sciences DepartmentUniversity of WisconsinMadison

Personalised recommendations