Unsupervised Learning Methods and Similarity Analysis in Chemoinformatics

  • Reference work entry
Handbook of Computational Chemistry

Abstract

In this chapter, we present an overview of chemometric methods suitable for analyzing and interpreting data from social media, industry, academia, medicine, and other sources. We discuss unsupervised machine-learning techniques used for grouping data (hierarchical cluster analysis, k-means) and for exploring data (principal component analysis, self-organizing Kohonen maps), whether quantitative or qualitative. For each method, we explain the basic concepts, outline a rudimentary algorithm, and present practical applications. All examples are based on a set of molecular descriptors calculated for a selected group of persistent organic pollutants (POPs).

Corresponding author

Correspondence to Katarzyna Odziomek.

Copyright information

© 2017 Springer International Publishing Switzerland

About this entry

Cite this entry

Odziomek, K., Rybinska, A., & Puzyn, T. (2017). Unsupervised Learning Methods and Similarity Analysis in Chemoinformatics. In: Leszczynski, J., Kaczmarek-Kedziera, A., Puzyn, T., Papadopoulos, M. G., Reis, H., & Shukla, M. K. (eds) Handbook of Computational Chemistry. Springer, Cham. https://doi.org/10.1007/978-3-319-27282-5_53
