Handbook of Partial Least Squares, pp. 327–357
PLS in Data Mining and Data Integration
Abstract
- (a) Clustering, i.e., finding and interpreting “natural” groups in the data
- (b) Classification and identification, e.g., biologically active vs. inactive compounds
- (c) Quantitative relationships between different sets of variables, e.g., finding variables related to the quality of a product, or to temporal, seasonal, and/or geographical change
- (1) Identification of outliers and their aberrant data profiles
- (2) Finding the dominating variables and their joint relationships
- (3) Making predictions for new samples
With many variables and few observations (samples), a common situation in data mining, the risk of obtaining spurious models is substantial. Spurious models look great for the training set data, but give miserable predictions for new samples. Hence, validation of the data-analytical results is essential, and approaches to it are discussed.
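The risk described above can be illustrated with a minimal sketch. It is not the chapter's own validation procedure: it assumes a one-component PLS1 model written in plain Python, pure-noise data (20 observations, 500 variables, no real relationship between X and y), and leave-one-out cross-validation as the validation step. All function names, the data sizes, and the random seed are hypothetical choices made for illustration.

```python
import random


def fit_pls1_one_component(X, y):
    """Fit a one-component PLS1 model on mean-centered copies of X and y."""
    n, k = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(k)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(k)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector w is proportional to X'y, scaled to unit length.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(k)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores t = Xw, then the inner regression coefficient q = t'y / t't.
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    q = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)
    return x_mean, y_mean, w, q


def predict(model, x):
    """Predict y for one observation x from a fitted one-component model."""
    x_mean, y_mean, w, q = model
    score = sum((xi - m) * wi for xi, m, wi in zip(x, x_mean, w))
    return y_mean + q * score


def explained_fraction(y, y_hat):
    """Return 1 - residual SS / total SS (R2 for fitted, Q2 for held-out values)."""
    y_mean = sum(y) / len(y)
    ss_tot = sum((v - y_mean) ** 2 for v in y)
    ss_res = sum((v - p) ** 2 for v, p in zip(y, y_hat))
    return 1.0 - ss_res / ss_tot


def r2_and_q2(X, y):
    """Compare the training fit (R2) with leave-one-out predictions (Q2)."""
    model = fit_pls1_one_component(X, y)
    r2 = explained_fraction(y, [predict(model, x) for x in X])
    held_out = []
    for i in range(len(X)):
        # Refit without observation i, then predict it as a "new sample".
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        held_out.append(predict(fit_pls1_one_component(X_train, y_train), X[i]))
    return r2, explained_fraction(y, held_out)


# Assumed illustration data: pure noise, so any apparent fit is spurious.
rng = random.Random(0)
n_obs, n_vars = 20, 500
X = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_obs)]
y = [rng.gauss(0, 1) for _ in range(n_obs)]
r2, q2 = r2_and_q2(X, y)
print(f"R2 (training fit) = {r2:.2f}, Q2 (leave-one-out) = {q2:.2f}")
```

In this setting the training R2 should come out high even though y is unrelated to X, while the cross-validated Q2 should stay near or below zero. That gap is the kind of warning sign a validation step is meant to expose.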
Keywords
Data Mining; Data Integration; Principal Component Analysis Model; Score Vector; Process Analytical Technology