Data Mining Methods and Applications

Abstract

In this chapter, we provide a review of the knowledge discovery process, including data handling, data mining methods and software, and current research activities. The introduction defines and provides a general background to data mining and knowledge discovery in databases (KDD). In particular, the potential for data mining to improve manufacturing processes in industry is discussed. This is followed by an outline of the entire knowledge discovery process in the second part of the chapter.

The third part presents data handling issues, including databases and preparation of the data for analysis. Although modelers often consider these issues uninteresting, the largest portion of the knowledge discovery process is spent handling data. This step is also critically important, since the resulting models can only be as good as the data on which they are based.

The fourth part is the core of the chapter and describes popular data mining methods, organized as supervised versus unsupervised learning. In supervised learning, the training data set includes observed output values (“correct answers”) for the given set of inputs. If the outputs are continuous/quantitative, we have a regression problem; if the outputs are categorical/qualitative, we have a classification problem. Supervised learning methods are described in the context of both regression and classification (as appropriate), beginning with the simplest case of linear models, then presenting more complex modeling with trees, neural networks, and support vector machines, and concluding with some methods, such as nearest neighbor, that apply only to classification. In unsupervised learning, the training data set does not contain output values. Unsupervised learning methods are described under two categories: association rules and clustering. Association rules are appropriate for business applications where precise numerical data may not be available, while clustering methods are technically closer to the supervised learning methods presented in this chapter. Finally, this part closes with a review of various software options.
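The regression/classification distinction above can be made concrete with a minimal sketch on hypothetical toy data: an ordinary least-squares line fit for a continuous output, and a 1-nearest-neighbor rule for a categorical output. All data values and labels here are illustrative assumptions, not from the chapter.

```python
# Supervised learning, regression case: fit y = a + b*x by least squares.
xs = [1.0, 2.0, 3.0, 4.0]            # toy inputs
ys = [2.1, 3.9, 6.2, 8.1]            # toy continuous outputs
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

def predict_regression(x):
    """Predict a continuous output from the fitted line."""
    return a + b * x

# Supervised learning, classification case: 1-nearest-neighbor on labeled points.
train = [((1.0, 1.0), "pass"), ((1.2, 0.9), "pass"),
         ((4.0, 4.2), "fail"), ((3.8, 4.1), "fail")]

def predict_class(point):
    """Predict a categorical output: the label of the closest training point."""
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return min(train, key=lambda t: dist2(t[0], point))[1]
```

In both cases the model is learned from input–output pairs; the only difference is the type of output, which is what separates regression from classification.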

The fifth part presents current research projects, involving both industrial and business applications. In the first project, data is collected from monitoring systems, and the objective is to detect unusual activity that may require action. For example, credit card companies monitor customersʼ credit card usage to detect possible fraud. While methods from statistical process control were developed for similar purposes, the difference lies in the quantity of data. The second project describes data mining tools developed by Genichi Taguchi, who is well known for his industrial work on robust design. The third project tackles quality and productivity improvement in manufacturing industries. Although some detail is given, considerable research is still needed to develop a practical tool for todayʼs complex manufacturing processes.
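The monitoring idea in the first project can be sketched in the spirit of a Shewhart control chart: estimate a baseline mean and standard deviation from historical activity, then flag observations outside a mu ± k·sigma band. The baseline counts and threshold below are illustrative assumptions, not the authors' method.

```python
import statistics

# Hypothetical historical activity counts (e.g., daily transactions).
baseline = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15]
mu = statistics.mean(baseline)
sigma = statistics.stdev(baseline)

def is_unusual(count, k=3.0):
    """Flag an observation outside the mu +/- k*sigma control limits."""
    return abs(count - mu) > k * sigma
```

With the toy baseline above, a typical count (e.g., 14) passes unflagged, while a sudden spike (e.g., 40) is flagged as unusual activity warranting action. The practical difficulty noted in the text is scale: applying such checks to millions of monitored entities rather than a single process stream.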

Finally, the last part provides a brief discussion on remaining problems and future trends.

Keywords

Data Mining · Linear Discriminant Analysis · Association Rule · Unsupervised Learning · Statistical Process Control

Abbreviations

ANN

artificial neural networks

CART

classification and regression tree

CHAID

chi-square automatic interaction detection

DBSCAN

density-based spatial clustering of applications with noise

DM

data mining

EBD

equivalent business days

GAM

generalized additive model

KDD

knowledge discovery in databases

LOF

lack-of-fit

MARS

multivariate adaptive regression splines

MART

multiple additive regression tree

MOLAP

multidimensional OLAP

MTS

Mahalanobis–Taguchi system

NN

nearest neighbor

OLAP

online analytical processing

PCB

printed circuit board

SOM

self-organizing (feature) map

SPC

statistical process control

SQL

structured query language

SVM

support vector machine


Copyright information

© Springer-Verlag 2006

Authors and Affiliations

  1. School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, USA
  2. Industrial and Manufacturing Systems Engineering, University of Texas at Arlington, Arlington, USA
  3. Department of Systems Engineering and Engineering Management, Stevens Institute of Technology, Hoboken, USA
  4. Computer Science and Engineering, The University of Texas at Arlington, Arlington, USA
