Abstract
There is general agreement that the development of metabolomics depends not only on advances in chemical analysis techniques but also on advances in computing and data analysis methods. Metabolomics data usually require intensive pre-processing, analysis, and mining procedures. Selecting and applying such procedures requires attention to justification, traceability, and reproducibility. We describe a strategy for selecting data mining techniques that considers the goals of the techniques themselves on the one hand, and the goals of the metabolomics investigation and the nature of the data on the other. The strategy aims to ensure the validity and soundness of results and to promote the achievement of the investigation's goals.
© 2011 Springer Science+Business Media, LLC
About this protocol
Cite this protocol
BaniMustafa, A.H., Hardy, N.W. (2011). A Strategy for Selecting Data Mining Techniques in Metabolomics. In: Hardy, N., Hall, R. (eds) Plant Metabolomics. Methods in Molecular Biology, vol 860. Humana Press. https://doi.org/10.1007/978-1-61779-594-7_18
DOI: https://doi.org/10.1007/978-1-61779-594-7_18
Publisher Name: Humana Press
Print ISBN: 978-1-61779-593-0
Online ISBN: 978-1-61779-594-7
eBook Packages: Springer Protocols