A Strategy for Selecting Data Mining Techniques in Metabolomics

  • Protocol
Plant Metabolomics

Part of the book series: Methods in Molecular Biology (MIMB, volume 860)

Abstract

There is general agreement that the development of metabolomics depends not only on advances in chemical analysis techniques but also on advances in computing and data analysis methods. Metabolomics data usually requires intensive pre-processing, analysis, and mining procedures. Selecting and applying such procedures requires attention to issues including justification, traceability, and reproducibility. We describe a strategy for selecting data mining techniques that takes into consideration the goals of data mining techniques on the one hand, and the goals of metabolomics investigations and the nature of the data on the other. The strategy aims to ensure the validity and soundness of results and to promote the achievement of the investigation goals.

Author information

Corresponding author

Correspondence to Nigel W. Hardy.

Copyright information

© 2011 Springer Science+Business Media, LLC

About this protocol

Cite this protocol

BaniMustafa, A.H., Hardy, N.W. (2011). A Strategy for Selecting Data Mining Techniques in Metabolomics. In: Hardy, N., Hall, R. (eds) Plant Metabolomics. Methods in Molecular Biology, vol 860. Humana Press. https://doi.org/10.1007/978-1-61779-594-7_18

  • DOI: https://doi.org/10.1007/978-1-61779-594-7_18

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-61779-593-0

  • Online ISBN: 978-1-61779-594-7

  • eBook Packages: Springer Protocols
