PLS in Data Mining and Data Integration

Chapter
Part of the Springer Handbooks of Computational Statistics book series (SHCS)

Abstract

Data mining by means of projection methods such as PLS (projection to latent structures) and their extensions is discussed. The most common data analytical questions in data mining are covered and illustrated with examples:
  (a) Clustering, i.e., finding and interpreting “natural” groups in the data
  (b) Classification and identification, e.g., biologically active vs. inactive compounds
  (c) Quantitative relationships between different sets of variables, e.g., finding variables related to the quality of a product, or related to time, seasonal and/or geographical change
Sub-problems occurring in all of (a) to (c) are discussed:
  (1) Identification of outliers and their aberrant data profiles
  (2) Finding the dominating variables and their joint relationships
  (3) Making predictions for new samples
The use of graphics for the contextual interpretation of results is emphasized.
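
Sub-problems (1) and (2) above can be illustrated with a small numerical sketch. This is not the chapter's own procedure. It assumes a one-component principal component model fitted by NIPALS-style iteration, and it uses each sample's residual distance to that model as the outlier diagnostic. The simulated data, the helper `first_component`, and the planted outlier at index 5 are all hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def first_component(E, n_iter=100):
    """Scores t and unit-length loadings p of the first principal
    component of the centered data E (NIPALS-style iteration)."""
    n, k = len(E), len(E[0])
    # Start from the column with the largest sum of squares.
    start = max(range(k), key=lambda j: sum(row[j] ** 2 for row in E))
    t = [row[start] for row in E]
    for _ in range(n_iter):
        p = [dot([E[i][j] for i in range(n)], t) for j in range(k)]
        norm = dot(p, p) ** 0.5
        p = [v / norm for v in p]
        t = [dot(row, p) for row in E]
    return t, p

# Simulated data (assumption): 30 samples, 3 strongly correlated variables.
rng = random.Random(1)
X = []
for _ in range(30):
    s = rng.gauss(0, 2)
    X.append([s + rng.gauss(0, 0.1) for _ in range(3)])

# Plant one sample whose profile breaks the correlation pattern.
outlier_index = 5
X[outlier_index] = [3.0, -3.0, 0.0]

# Center each variable.
k = len(X[0])
col_mean = [sum(row[j] for row in X) / len(X) for j in range(k)]
E = [[row[j] - col_mean[j] for j in range(k)] for row in X]

t, p = first_component(E)

# Residual distance of each sample to the one-component model.
residual = [
    sum((row[j] - t[i] * p[j]) ** 2 for j in range(k)) ** 0.5
    for i, row in enumerate(E)
]
flagged = max(range(len(X)), key=lambda i: residual[i])

print("loadings:", [round(v, 3) for v in p])
print("sample with largest residual:", flagged)
```

The loadings indicate which variables dominate the component, and the residuals show which samples the model fails to describe. Real applications would use more components and a statistically motivated limit for the residual rather than simply taking the maximum.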

With many variables and few observations (samples), a common situation in data mining, the risk of obtaining spurious models is substantial. Spurious models look great for the training set data, but give miserable predictions for new samples. Hence, validation of the data analytical results is essential, and approaches for it are discussed.
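
The risk described above can be made concrete with a small simulation. The following is a minimal sketch under stated assumptions, not the validation procedure of the chapter. It fits a single-response PLS model by NIPALS-style deflation to pure random noise (20 samples, 200 variables), then compares the training fit with a leave-one-out cross-validated fit. The function names `fit_pls1` and `predict_pls1`, the data dimensions, and the choice of leave-one-out cross-validation are illustrative assumptions.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean(v):
    return sum(v) / len(v)

def fit_pls1(X, y, n_comp):
    """Fit a single-response PLS model; returns the centering values
    and, per component, the weights, loadings and y-coefficient."""
    n, p = len(X), len(X[0])
    x_mean = [mean([row[j] for row in X]) for j in range(p)]
    y_mean = mean(y)
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    f = [v - y_mean for v in y]
    comps = []
    for _ in range(n_comp):
        w = [dot([E[i][j] for i in range(n)], f) for j in range(p)]
        norm = dot(w, w) ** 0.5
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]
        tt = dot(t, t)
        load = [dot([E[i][j] for i in range(n)], t) / tt for j in range(p)]
        q = dot(f, t) / tt
        # Deflate X and y before extracting the next component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, load, q))
    return x_mean, y_mean, comps

def predict_pls1(model, x):
    """Predict y for one new sample by repeating the deflation steps."""
    x_mean, y_mean, comps = model
    e = [a - b for a, b in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in comps:
        t = dot(e, w)
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, load)]
    return y_hat

def r2(y, y_hat):
    m = mean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - m) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Pure noise: X and y are unrelated by construction.
rng = random.Random(0)
n, p, n_comp = 20, 200, 2
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]

# Fit on all samples and evaluate on the same samples.
model = fit_pls1(X, y, n_comp)
r2_train = r2(y, [predict_pls1(model, row) for row in X])

# Leave-one-out cross-validation: each sample is predicted by a
# model that never saw it.
loo_pred = []
for i in range(n):
    X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
    loo_pred.append(predict_pls1(fit_pls1(X_tr, y_tr, n_comp), X[i]))
q2_loo = r2(y, loo_pred)

print("R2 (training):", round(r2_train, 3))
print("Q2 (leave-one-out):", round(q2_loo, 3))
```

On noise of this shape the training R2 is typically close to 1, while the cross-validated Q2 stays near or below zero. A large gap between the two is the signature of a spurious model.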

Keywords

Data Mining; Data Integration; Principal Component Analysis Model; Score Vector; Process Analytical Technology
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  1. NNS Consulting, Hollis, USA
  2. Umetrics Inc, Kinnelon, USA
