Springer Handbook of Engineering Statistics, pp. 651–669
Data Mining Methods and Applications
Abstract
In this chapter, we provide a review of the knowledge discovery process, including data handling, data mining methods and software, and current research activities. The introduction defines and provides a general background to data mining and knowledge discovery in databases. In particular, the potential for data mining to improve manufacturing processes in industry is discussed. The second part of the chapter then outlines the entire process of knowledge discovery in databases.
The third part presents data handling issues, including databases and preparation of the data for analysis. Although these issues are often considered uninteresting by modelers, the largest portion of the knowledge discovery process is spent handling data. This stage is also critically important, since the resulting models can only be as good as the data on which they are based.
The fourth part is the core of the chapter and describes popular data mining methods, separated into supervised versus unsupervised learning. In supervised learning, the training data set includes observed output values (“correct answers”) for the given set of inputs. If the outputs are continuous/quantitative, we have a regression problem; if the outputs are categorical/qualitative, we have a classification problem. Supervised learning methods are described in the context of both regression and classification (as appropriate), beginning with the simplest case of linear models, then presenting more complex modeling with trees, neural networks, and support vector machines, and concluding with some methods, such as nearest neighbor, that are used only for classification. In unsupervised learning, the training data set does not contain output values. Unsupervised learning methods are described under two categories: association rules and clustering. Association rules are appropriate for business applications where precise numerical data may not be available, while clustering methods are more technically similar to the supervised learning methods presented in this chapter. Finally, this part closes with a review of various software options.
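The supervised/unsupervised distinction above can be illustrated with a minimal sketch (the data and code here are invented for illustration and are not from the chapter): supervised learning fits a model to observed input–output pairs, while unsupervised learning groups inputs with no outputs available.

```python
import numpy as np

# --- Supervised learning: each input x has an observed output y. ---
# Fit a simple linear regression y = b0 + b1*x by least squares.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # observed "correct answers"
X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
b = np.linalg.lstsq(X, y, rcond=None)[0]         # b = [intercept, slope]

# --- Unsupervised learning: only inputs, no outputs to predict. ---
# One k-means-style assignment step: label each point by its nearest center.
data = np.array([0.9, 1.1, 1.0, 4.9, 5.2, 5.0])
centers = np.array([1.0, 5.0])
labels = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
```

Here the regression has a single predictor for brevity; the same least-squares machinery extends directly to the multi-predictor linear models that open the fourth part of the chapter.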
The fifth part presents current research projects, involving both industrial and business applications. In the first project, data is collected from monitoring systems, and the objective is to detect unusual activity that may require action. For example, credit card companies monitor customersʼ credit card usage to detect possible fraud. While methods from statistical process control were developed for similar purposes, the difference lies in the quantity of data. The second project describes data mining tools developed by Genichi Taguchi, who is well known for his industrial work on robust design. The third project tackles quality and productivity improvement in manufacturing industries. Although some detail is given, considerable research is still needed to develop a practical tool for todayʼs complex manufacturing processes.
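The activity-monitoring idea in the first project can be caricatured with a Shewhart-style control rule: estimate the mean and spread of normal behavior, then flag observations falling outside the control limits. This is a hedged sketch under invented data and thresholds, not the project's actual method.

```python
import statistics

# Baseline of "normal" daily transaction counts for one account (invented data).
baseline = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15]
mu = statistics.mean(baseline)
sigma = statistics.stdev(baseline)

def is_unusual(count, k=3.0):
    """Flag an observation outside mu +/- k*sigma (a Shewhart-style 3-sigma rule)."""
    return abs(count - mu) > k * sigma

# A sudden burst of activity trips the alarm; typical days do not.
alerts = [c for c in [14, 13, 45, 12] if is_unusual(c)]
```

The chapter's point about data quantity applies here: with millions of accounts monitored continuously, even a small false-alarm rate per account produces a large absolute number of alarms, which is what distinguishes this setting from classical statistical process control.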
Finally, the last part provides a brief discussion on remaining problems and future trends.
Keywords
Data mining · Linear discriminant analysis · Association rules · Unsupervised learning · Statistical process control

Abbreviations
- ANN
artificial neural networks
- CART
classification and regression tree
- CHAID
chi-square automatic interaction detection
- DBSCAN
density-based clustering
- DM
data mining
- EBD
equivalent business days
- GAM
generalized additive model
- KDD
knowledge discovery in databases
- LOF
lack-of-fit
- MARS
multivariate adaptive regression splines
- MART
multiple additive regression tree
- MOLAP
multidimensional OLAP
- MTS
Mahalanobis–Taguchi system
- NN
nearest neighbor
- OLAP
online analytical processing
- PCB
printed circuit board
- SOM
self-organizing (feature) map
- SPC
statistical process control
- SQL
structured query language
- SVM
support vector machine