# Mathematical Programming in Data Mining


## Abstract

Mathematical programming approaches to three fundamental problems are described: feature selection, clustering, and robust representation. The feature selection problem considered is that of discriminating between two sets while recognizing and suppressing irrelevant and redundant features. This yields a lean model that often generalizes better to new, unseen data; computational results on real data confirm the improved generalization of leaner models. Clustering is exemplified by the unsupervised learning of patterns and clusters that may exist in a given database, a useful tool for knowledge discovery in databases (KDD). A mathematical programming formulation of this problem is proposed that is theoretically justifiable and computationally implementable in a finite number of steps. The resulting k-Median Algorithm is used to discover useful survival curves for breast cancer patients from a medical database. Robust representation is concerned with minimizing the degradation of a trained model when it is applied to new problems. A novel approach is proposed that deliberately tolerates a small error in the training process in order to avoid overfitting data that may itself contain errors. Examples of applications of these concepts are given.
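The k-Median Algorithm mentioned above alternates two steps until no cluster median changes: assign each point to its nearest candidate median in the 1-norm, then recompute each median as the coordinatewise median of its assigned points. The following is a minimal sketch of that alternating scheme in Python with NumPy; the function name, the optional `init` parameter, and the random initialization are illustrative conveniences, not details taken from the paper's mathematical programming derivation.

```python
import numpy as np

def k_median(X, k, init=None, iters=50, seed=0):
    """Alternating 1-norm k-median clustering (illustrative sketch).

    X     : (n, d) array of points.
    k     : number of clusters.
    init  : optional (k, d) array of starting medians; if omitted,
            k points are drawn at random from X.
    Returns (medians, labels).
    """
    rng = np.random.default_rng(seed)
    if init is not None:
        medians = np.asarray(init, dtype=float)
    else:
        medians = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(iters):
        # Step 1: assign each point to its closest median in the 1-norm.
        dists = np.abs(X[:, None, :] - medians[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each median as the coordinatewise median of
        # its cluster; an empty cluster keeps its previous median.
        new = np.array([np.median(X[labels == j], axis=0)
                        if (labels == j).any() else medians[j]
                        for j in range(k)])
        if np.allclose(new, medians):  # converged in finitely many steps
            break
        medians = new
    return medians, labels
```

Because each iteration can only decrease the sum of 1-norm distances and there are finitely many possible assignments, the loop terminates in a finite number of steps, consistent with the finite-termination property claimed in the abstract.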

