Introduction

Over the past decades, metabolomics has become a widely used omics approach, alongside genomics, transcriptomics, and proteomics, for studying the metabolites (compounds smaller than 1500 Da) in a living system (Fiehn et al. 2000; Fiehn 2001, 2002; Bino et al. 2004; Saito and Matsuda 2010). Metabolomics is a growing field and is increasingly applied to obtain in-depth information on plant cellular metabolism by profiling the small-molecular-mass compounds present in plant extracts (Oliver et al. 1998). Plants produce numerous metabolites for growth, development, and responses to the environment. Around 200,000 metabolites, including primary and secondary metabolites, are present in the plant kingdom (Wink 2010). Primary metabolites are responsible for growth and development and are structurally conserved, whereas secondary metabolites are specialized, are utilized during stress conditions, and vary across plant species (Scossa et al. 2016). The plant metabolome comprises a wide range of metabolites, including ionic inorganic compounds, hydrophilic carbohydrates, amino acids, organic compounds, and compounds associated with hydrophobic lipids. This chemical complexity poses challenges for the analytical tools used to separate and characterize the metabolites. At present, various analytical tools such as nuclear magnetic resonance (NMR), liquid chromatography–mass spectrometry (LC–MS), gas chromatography–mass spectrometry (GC–MS), capillary electrophoresis–mass spectrometry (CE–MS), and Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR–MS) are employed in metabolomic analysis (Okazaki and Saito 2012; Khakimov et al. 2014). Among these, LC–MS, GC–MS, and NMR are used routinely in plant metabolomics. However, each analytical technology has intrinsic limitations and none can cover the whole plant metabolome, so a combination of analytical techniques is used in plant metabolomics analysis.

Each analytical tool produces a large amount of plant metabolome data that requires distinct handling; thus, bioinformatics resources such as software for data processing and analysis, databases for annotation, and workflows are required. As analytical platforms advance and high-throughput data accumulate, newer tools are being developed for handling metabolomic datasets. Recently, Spicer et al. provided a list of open-source software tools for metabolomic analysis. Most of the software tools are platform-specific and written in different programming languages such as Python, R, Java, and MATLAB (Novikova et al. 2020). Automated software or workflows for plant metabolomics data analysis are not freely available. The metabolite identification process used to access the features of biological interest is manual or semi-automated (Dunn 2008). To fill this gap, several workflows have been scripted to perform integrated pre-processing, annotation, statistical analysis, and biological interpretation of metabolomic data. These are powerful, easy to use, and modular, and are a good choice for plant metabolomics data analysis. In the present review, we describe the typical plant metabolomics analysis workflow, including analytical tools for data acquisition and software tools and databases for processing, annotation, and analysis of GC–MS, LC–MS, NMR, GC–MS/MS, and LC–MS/MS data. This review also describes multifunctional tools such as workflows for plant metabolomics analysis, the limitations of current tools, and future perspectives on plant metabolomic analysis.

Workflow for plant metabolomic analysis

The typical workflow for plant metabolomic analysis comprises four steps: (1) sample preparation, (2) data acquisition, (3) data mining/data analysis, and (4) data integration, resulting in the interpretation of meaningful biochemical information (Fig. 1).

Fig. 1 Flowchart of a typical plant metabolomics study

Sample preparation

Sample preparation is often the critical step, and practical considerations should be addressed from the beginning to avoid introducing variability into the metabolomics analysis. Sample preparation involves three main steps: harvesting, drying, and extraction. Plant selection depends on the biological question to be explored (Kim and Verpoorte 2010; Álvarez-Sánchez et al. 2010). Careful harvesting of the plant material is crucial to avoid metabolite changes caused by enzymatic reactions and contamination (Fiehn et al. 2000; Verpoorte et al. 2008; Xu et al. 2010). After the plant material is harvested, drying can be performed using air, oven, freeze, and trap drying methods (Harbourne et al. 2009). Drying should be performed prior to extraction to maintain a uniform water level within the plant sample. The choice of extraction method depends on the physicochemical properties of the metabolites, the solvent properties, and the composition of the biochemical system. Extraction methods such as solvent extraction, ultra-sonication, microwave-assisted extraction, and supercritical fluid extraction are routinely used. However, no single extraction technique developed to date extracts all metabolites. Thus, a combination of extraction methods is imperative for extracting the various classes of metabolites (Kim et al. 2011; Verpoorte et al. 2008; Hall 2006; Dunn and Ellis 2005; Kim and Verpoorte 2010).

Data acquisition using analytical techniques

Once the sample is prepared, various analytical techniques are used to separate, quantify, and identify the various metabolite groups (Sumner et al. 2003; Allwood et al. 2008). Each analytical technique has its pros and cons, as listed in Table 1. The choice of analytical method depends on the type of approach (targeted or untargeted) and on the physical and chemical characteristics of the sample.

Table 1 Analytical techniques in plant metabolomics and their advantages and disadvantages

Nuclear magnetic resonance spectroscopy

NMR spectroscopy is a non-destructive and non-biased method that is readily quantitative and needs little to no sample preparation, no chromatographic separation of analytes, and no sample derivatization prior to analysis. The technique relies on an external magnetic field and radiofrequency (RF) radiation applied to the nuclei of molecules. When a sample dissolved in an NMR solvent is exposed to the external magnetic field and RF pulses, the atomic nuclei are excited from lower to higher energy states, and energy is emitted at an equivalent frequency when the spins return to the lower energy states. This generates signals that are reported as chemical shifts. NMR chemical shifts are usually reported relative to an internal reference. In the resulting spectrum, peaks appear at different positions and intensities, giving a unique pattern for each compound. Using specific atomic nuclei such as hydrogen (1H), carbon (13C), and phosphorus (31P), information on specific metabolite types is obtained (Reo 2002). Hence, NMR analysis provides a global view of the metabolites in plant samples (Kim et al. 2011). It can also reveal structural information that helps to identify unknown metabolites. Despite these advantages, NMR has inherently low sensitivity and resolution, which hampers the quality of analysis for complex samples (Weljie et al. 2006). Cryogenic probes can partly address these disadvantages (Kovacs et al. 2005). Nevertheless, there is a shift within the metabolomics community towards MS-based techniques.
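
As an illustration of how a chemical shift is expressed relative to a reference, the following minimal Python sketch converts a resonance frequency into parts per million. The function name, the spectrometer frequencies, and the example signal frequencies are assumptions chosen for illustration and are not taken from any particular instrument or software.

```python
def chemical_shift_ppm(signal_hz: float, reference_hz: float, spectrometer_mhz: float) -> float:
    """Chemical shift in ppm.

    The offset of the signal from the reference (in Hz) is divided by the
    spectrometer operating frequency (in MHz); Hz / MHz gives parts per million.
    """
    return (signal_hz - reference_hz) / spectrometer_mhz


# Hypothetical signal measured on two instruments of different field strength,
# with the reference signal set to 0 Hz in both cases.
shift_600 = chemical_shift_ppm(2100.0, 0.0, 600.0)
shift_400 = chemical_shift_ppm(1400.0, 0.0, 400.0)
print(f"{shift_600:.2f} ppm, {shift_400:.2f} ppm")
```

Because the frequency offset is divided by the spectrometer frequency, the same resonance yields the same ppm value on instruments of different field strength, which is what makes chemical shifts comparable across laboratories.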

Gas chromatography–mass spectrometry

GC–MS is an analytical tool often used in plant metabolomics. It is largely used to quantify volatile compounds and derivatized polar metabolites (Dunn 2008). Because separation in GC occurs at high temperature, non-volatile samples must be derivatized to make them volatile prior to analysis. Electron ionization aids compound identification by generating informative, characteristic mass spectra with a high degree of fragmentation. The spectral data are often compared with the NIST database to identify unknown metabolites. However, molecular ions are often undetected, which makes it difficult to determine the elemental composition; hence, this technique is often used for the selective study of known primary metabolites. Recently, alternative techniques such as positive chemical ionization and negative ion chemical ionization have been employed to enhance the separation and sensitivity of metabolites in complex mixtures (Raina and Hall 2008).

Liquid chromatography–mass spectrometry

LC–MS is another analytical technique utilized in plant metabolomics. Compared with GC–MS, it is more suitable for non-volatile and polar metabolites. In plant metabolomics, it is used for untargeted analysis of secondary metabolites and complex lipids using chromatographic separation on reversed-phase C18 columns (De Vos et al. 2007). These columns provide good separation of nonpolar and weakly polar compounds. To analyze polar compounds in plant extracts, hydrophilic interaction chromatography columns are often used (Tolstikov and Fiehn 2002). In LC–MS, the ionization techniques include electrospray ionization (ESI) and atmospheric pressure chemical ionization (Codrea et al. 2007; Dunn 2008; Grandori et al. 2009). LC–ESI–MS produces ions such as [M + H]+, [M + 2H]2+, and [M + NH4]+ in the positive mode of ionization and [M–H]− and [M–2H]2− in the negative mode, which are used to identify the proposed compounds. The advantage of this ionization source is that analytes may be either volatile or non-volatile and that derivatization is not necessary, unlike in GC–MS. The disadvantage is that metabolite identification is more time-intensive and lacks a comprehensive mass spectral library. However, with tandem mass spectrometry (MS/MS), metabolites can often be identified (Lenz et al. 2004). Even so, more effort is required to identify metabolites through the development of comprehensive databases (Kind and Fiehn 2006; Moco et al. 2006; Shinbo et al. 2006; Böcker and Rasche 2008; Horai et al. 2010).
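
To make the relationship between a neutral metabolite and its observed ions concrete, the sketch below computes the expected m/z of a few common ESI ions from a monoisotopic mass. The adduct table, the rounded mass constants, and the use of glucose as the example compound are assumptions made for illustration; real annotation tools use longer adduct lists and more precise constants.

```python
# Approximate masses in Da, rounded for illustration.
PROTON = 1.007276
AMMONIUM = 18.033826  # NH4+ (N + 4 H minus one electron)

# adduct label -> (mass shift in Da, absolute charge number)
ADDUCTS = {
    "[M+H]+": (PROTON, 1),
    "[M+2H]2+": (2 * PROTON, 2),
    "[M+NH4]+": (AMMONIUM, 1),
    "[M-H]-": (-PROTON, 1),
}


def adduct_mz(neutral_mass: float, adduct: str) -> float:
    """Expected m/z for a neutral monoisotopic mass and a named adduct."""
    shift, charge = ADDUCTS[adduct]
    return (neutral_mass + shift) / charge


# Monoisotopic mass of C6H12O6 (glucose), used only as an example compound.
glucose = 180.06339
for name in ADDUCTS:
    print(f"{name:10s} m/z = {adduct_mz(glucose, name):.4f}")
```

In practice the calculation is run in reverse: an observed m/z is tested against each candidate adduct to propose a neutral mass, which is then searched against a compound database.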

Capillary electrophoresis–mass spectrometry

This technique is employed to analyze polar and charged metabolites. The ionization method used for CE–MS is atmospheric pressure ionization (API), as in LC–MS. In capillary electrophoresis, ions migrate through the capillary under an applied electric field, which also produces electroosmotic flow (Okazaki and Saito 2012). Ionic compounds are separated in the capillary based on the charge and size of the ions. Samples in CE–MS are analyzed separately for cations and anions. In Soga’s method for cationic analysis, formic acid used as the electrolyte provides good reproducibility (Soga et al. 2006). In anionic analysis, the use of platinum electrospray needles has significantly improved the analysis of anionic metabolites (Soga et al. 2009). Unlike GC–MS, CE–MS does not require derivatization for the detection of compounds (Okazaki and Saito 2012). The metabolites identified using CE–MS are physiologically important and common across organisms. Hence, this technique is employed to profile the metabolites of all organisms (Urano et al. 2009; Ishikawa et al. 2010).

Fourier transform ion cyclotron resonance mass spectrometry

This technique determines the mass-to-charge ratio (m/z) of ions from their cyclotron frequency in a fixed magnetic field. It is important in metabolomics because compounds of similar molecular mass can be detected separately, and chromatographic separation is not required for the rapid detection of metabolites. Another advantage is that chemical structure information can be obtained from peak analysis, which helps to identify unknown compounds. Coupling tandem mass spectrometry with FT-ICR–MS allows fragmented metabolite ions to be detected with high precision and resolution. However, the technique produces a large amount of data, and the hardware is difficult to handle. Compared with other MS techniques, only a few reports are available with FT-ICR–MS (Kujawinski 2002).
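
The link between cyclotron frequency and m/z can be sketched with the ideal (unperturbed) cyclotron relation f = zeB / (2πm). The Python example below assumes this idealized relation, a hypothetical 7 T magnet, and illustrative function names; real instruments apply calibration equations that correct for trapping fields and space charge.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms per Da


def cyclotron_frequency_hz(mz: float, field_tesla: float) -> float:
    """Ideal cyclotron frequency (Hz) of an ion with the given m/z (Da per charge)."""
    return ELEMENTARY_CHARGE * field_tesla / (2 * math.pi * mz * DALTON)


def mz_from_frequency(frequency_hz: float, field_tesla: float) -> float:
    """Invert the ideal relation: m/z (Da per charge) from a measured frequency."""
    return ELEMENTARY_CHARGE * field_tesla / (2 * math.pi * frequency_hz * DALTON)


# Two ions that differ by only 0.01 in m/z in an assumed 7 T field.
field = 7.0
f_a = cyclotron_frequency_hz(500.00, field)
f_b = cyclotron_frequency_hz(500.01, field)
print(f"{f_a:.1f} Hz vs {f_b:.1f} Hz, difference {f_a - f_b:.2f} Hz")
```

Because frequency can be measured very precisely, even a few hertz of difference between two ions is resolvable, which is consistent with the ability of the technique to distinguish compounds of similar molecular mass.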

Data mining/analysis

The analytical instruments produce a large amount of data for metabolome analysis. Automated software is therefore required to detect peaks in the raw MS or NMR data and to identify and quantify the metabolites. Hence, informatics and statistics are essential for metabolomics data analysis (Weljie et al. 2006; Boccard et al. 2007; Liland 2011). Data analysis includes data pre-processing, annotation of metabolites, and statistical analysis.

Data pre-processing

During this step, peak detection and chromatogram alignment algorithms are applied to the raw data signals (chromatograms, NMR spectra) to obtain the metabolite signals. Different software tools or packages have been developed to aid compound identification, including XCMS (Smith et al. 2006), MZmine 2 (Katajamaa et al. 2006), Metabolomic Analysis and Visualization Engine, MAVEN (Clasquin et al. 2012), Mass Spectrometry-Data Independent AnaLysis software, MS-DIAL (Tsugawa et al. 2015), OpenMS (Sturm et al. 2008), automated data analysis pipeline, ADAP (Jiang et al. 2010), and adaptive processing of liquid chromatography mass spectrometry, apLCMS (Yu et al. 2009). All the major software programs are compatible with open formats such as mzML for feature extraction and quantification (Kessner et al. 2008; Martens et al. 2011). Annotation of metabolites is required to interpret the results from pre-processed data. Several databases and tools are available for metabolite identification, as listed in Table 2.
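
The idea behind peak detection can be illustrated with a deliberately simple local-maximum picker. This is a minimal sketch and not the algorithm used by XCMS, MZmine 2, or any other tool named above, which rely on more robust methods; the function name, the intensity threshold, and the toy chromatogram are assumptions.

```python
def pick_peaks(times, intensities, min_height):
    """Return (time, intensity) pairs for strict local maxima at or above min_height."""
    if len(times) != len(intensities):
        raise ValueError("times and intensities must have the same length")
    peaks = []
    # The first and last points cannot be local maxima, so they are skipped.
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        if y >= min_height and y > intensities[i - 1] and y > intensities[i + 1]:
            peaks.append((times[i], y))
    return peaks


# Toy chromatogram: retention time in minutes and detector intensity.
times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]
intensities = [0, 2, 10, 3, 1, 4, 25, 6, 0, 1, 2, 1]

print(pick_peaks(times, intensities, min_height=5))
```

With a threshold of 5, the two larger maxima are reported while the small bump near the end of the trace is treated as noise. Real data additionally require smoothing, baseline correction, and alignment of retention times across samples.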

Table 2 List of tools and databases for metabolomics interpretation

Databases and tools for annotation of metabolites

Annotation of metabolites connects raw analytical data with biological knowledge through targeted or untargeted profiling approaches. However, the identification or annotation of metabolites is usually a laborious process. When standard compounds are unavailable, metabolites are usually identified by searching the fragmentation data obtained from MS/MS against online tandem MS databases. For GC–MS data, the NIST spectral library is employed for automated deconvolution and identification of the mass spectra (Ausloos et al. 1999), and the Golm Metabolome Database, GMD (Kopka et al. 2005) is another openly available resource for spectral search. For LC–MS, metabolite annotation and spectral search are often performed using spectral databases such as METLIN (Smith et al. 2005), MoNA (http://mona.fiehnlab.ucdavis.edu/), mzCloud (https://www.mzcloud.org/), MassBank (Horai et al. 2010), and the Global Natural Products Social Molecular Networking database, GNPS (Wang et al. 2016). For mass spectral and NMR data, the Human Metabolome Database, HMDB (Wishart et al. 2009) and the Madison Metabolomics Consortium Database, MMCDB (Markley et al. 2007; Cui et al. 2008) are useful for annotating metabolites in both humans and plants.
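
Spectral search against a library is commonly based on a similarity score between the query spectrum and each reference spectrum. The sketch below uses a plain cosine similarity on binned peaks as one such score; it is an assumption for illustration and not the scoring scheme of any specific database mentioned above. The library entries (compound_A, compound_B), their peaks, and the bin width are hypothetical.

```python
import math


def cosine_similarity(spec_a, spec_b, bin_width=0.01):
    """Cosine similarity between two spectra given as lists of (m/z, intensity)."""

    def binned(spectrum):
        # Sum intensities of peaks that fall into the same m/z bin.
        bins = {}
        for mz, intensity in spectrum:
            key = round(mz / bin_width)
            bins[key] = bins.get(key, 0.0) + intensity
        return bins

    a, b = binned(spec_a), binned(spec_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical MS/MS query and a two-entry reference library.
query = [(85.03, 40.0), (127.04, 100.0), (145.05, 60.0)]
library = {
    "compound_A": [(85.03, 35.0), (127.04, 100.0), (145.05, 70.0)],
    "compound_B": [(91.05, 100.0), (119.05, 50.0)],
}

best = max(library, key=lambda name: cosine_similarity(query, library[name]))
print(best, round(cosine_similarity(query, library[best]), 3))
```

A score near 1 indicates that the fragment ions and their relative intensities agree closely, while a score of 0 means that no fragment bins are shared. Fixed-width binning can split peaks that lie near a bin edge, so production tools typically match peaks within a mass tolerance instead.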

Several software tools, including xMSannotator (Uppal et al. 2017), RAMclust (Broeckling et al. 2014), CAMERA (Kuhl et al. 2012), and MetAssign (Daly et al. 2014), are available for metabolite annotation. Numerous compound repositories are also accessible and helpful for metabolite annotation, such as PubChem (Wheeler et al. 2008), Chemical Entities of Biological Interest, ChEBI (Degtyarenko et al. 2008), KNApSAcK (Shinbo et al. 2006), LipidBank, LIPID MAPS (Fahy et al. 2007), Kyoto Encyclopedia of Genes and Genomes, KEGG (Kanehisa et al. 2008), and PlantCyc from the Plant Metabolic Network (Seaver et al. 2012). Many specialized plant species databases focus on mass spectra or metabolite annotation. These include the Metabolome Tomato Database, MotoDB (Moco et al. 2006), and the Kazusa Metabolomics Database, KOMICS (Iijima et al. 2008), which contain information on metabolites present in tomato, and the Plant Reactome database (Naithani et al. 2020), which provides curated pathway information related to 84 species of the Gramineae family.

Statistical analysis

Statistical analysis is applied to relate the metabolite measurements to the dependent variable. Depending on the experimental objective, appropriate statistical methods are applied to metabolomics data. Analytical methods introduce various biases into metabolomics data that make interpretation complicated and demanding; hence, multivariate analysis and chemometric methods are used. Multivariate analysis consists of supervised and unsupervised methods. One of the unsupervised methods used in metabolomics studies is principal component analysis (PCA). PCA describes the systematic variation in the metabolite data by projecting the samples onto a small number of components, along which similar samples cluster. Many studies have reported the use of PCA for statistical analysis of metabolomics data (Catchpole et al. 2005). Other statistical methods such as hierarchical clustering (Murtagh 1983), K-means clustering (Murtagh 1983), partial least squares discriminant analysis, PLS-DA (Barker and Rayens 2003), and orthogonal PLS, OPLS (Trygg and Wold 2002) are also used in metabolomics data analysis. Commercial tools such as SIMCA-P (http://umetrics.com/products/simca) and PLS-Toolbox (http://www.eigenvector.com/software/pls_toolbox.htm) are employed for statistical analysis. In addition, software packages written in R, such as Statistical Metabolomics Analysis-An R Tool, SMART (Liang et al. 2016), and specmine (Costa et al. 2016), are used.
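
The projection step in PCA can be sketched in a few lines. The example below computes only the first principal component using power iteration on the covariance matrix, written in plain Python so that it is self-contained; this is an illustrative simplification, not the implementation used by SIMCA-P, PLS-Toolbox, or the R packages named above. The data matrix (three control and three treated samples, three metabolites) is invented.

```python
import math


def first_principal_component(data, iterations=100):
    """Return (loadings, scores) of the first principal component.

    data is a list of samples, each a list of metabolite intensities.
    """
    n, p = len(data), len(data[0])
    # Mean-center each metabolite (column).
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    # Sample covariance matrix (p x p).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)] for i in range(p)]
    # Power iteration from an arbitrary, non-uniform starting vector.
    vec = [1.0 / (j + 1) for j in range(p)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    # Scores are the projections of the centered samples onto the loadings.
    scores = [sum(r[j] * vec[j] for j in range(p)) for r in centered]
    return vec, scores


# Invented data: rows 0-2 are control samples, rows 3-5 are treated samples.
samples = [
    [1.0, 2.0, 5.0], [1.2, 1.9, 5.1], [0.9, 2.1, 4.9],
    [3.0, 4.1, 5.0], [3.2, 3.9, 5.2], [2.9, 4.0, 4.8],
]
loadings, scores = first_principal_component(samples)
print([round(x, 2) for x in loadings])
print([round(s, 2) for s in scores])
```

In this toy data set the first two metabolites differ between groups and the third does not, so the first component is dominated by the first two loadings and the scores separate control from treated samples. Metabolomics data are usually also scaled (for example to unit variance) before PCA, a step omitted here for brevity.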

Data interpretation

After statistical analysis, the selected metabolites need to be linked with biochemical pathways. Several software tools, including metaP-server (Suhre et al. 2011), metabolic pathway analysis, metPA (Xia et al. 2011), MetExplore (Cottret et al. 2018), and Metabolite Set Enrichment Analysis, MSEA (Xia and Wishart 2010), perform enrichment and pathway analysis by mapping metabolites to the biochemical pathways in publicly available databases such as the Kyoto Encyclopedia of Genes and Genomes, KEGG (Ogata et al. 1998). In addition, plant-specific metabolic pathway databases such as PlantCyc (http://www.plantcyc.org/), KaPPA-View (Tokimatsu et al. 2005; Sakurai et al. 2011), MapMan (Thimm et al. 2004), Arabidopsis Reactome (Tsesmetzis et al. 2008), and Plant Reactome (Naithani et al. 2020) are available. Although these biochemical pathways provide meaningful information, most plant metabolic pathways are highly interconnected and redundant, which makes biological interpretation of metabolome data difficult. The analysis tools used to identify perturbations in pathways depend on prior metabolite identification. Hence, analysis using untargeted metabolomics workflows has gained prominence in recent years (Heuberger et al. 2014).
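
One common way to score pathway enrichment is over-representation analysis with a hypergeometric test, which asks whether a pathway contains more of the selected metabolites than expected by chance. The sketch below assumes this test; whether a given tool listed above uses exactly this statistic is not stated in the text, and the function name and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(total: int, in_pathway: int, selected: int, overlap: int) -> float:
    """Probability of seeing at least `overlap` pathway members among the selected metabolites.

    total      -- number of annotated metabolites in the background set
    in_pathway -- number of background metabolites that belong to the pathway
    selected   -- number of metabolites flagged by the statistical analysis
    overlap    -- number of flagged metabolites that belong to the pathway
    """
    upper = min(in_pathway, selected)
    tail = sum(
        comb(in_pathway, k) * comb(total - in_pathway, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(total, selected)


# Hypothetical counts: 200 annotated metabolites, 12 in the pathway,
# 20 selected as significant, 5 of which map to the pathway.
p = enrichment_p_value(total=200, in_pathway=12, selected=20, overlap=5)
print(f"p = {p:.4g}")
```

When many pathways are tested, the resulting p-values need correction for multiple testing, and the result depends strongly on which metabolites were identified in the first place, which reflects the dependence on prior identification noted above.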

Multifunctional tools as workflows for plant metabolomic analysis

Software tools should facilitate data processing, annotation, statistical analysis, and interpretation of metabolite data. Workflows implemented as web applications or in programming languages such as R and Python are easy to use and allow researchers to perform the analysis within a single tool. These workflows are primarily designed for the analysis of LC–MS data. The workflows for plant metabolomics are listed in Table 3.

Table 3 List of tools for metabolomics data analysis

XCMS Online (Tautenhahn et al. 2012), from the Scripps Center for Metabolomics and Mass Spectrometry, is an integrated, freely accessible cloud-based platform for raw MS spectra processing and visualization of untargeted LC–MS data. XCMS Online retains the features of the original XCMS software (Smith et al. 2006) with the added flexibility of a web interface and provides a direct link to the METLIN database (Smith et al. 2005) for metabolite identification. It also provides interactive metabolomic cloud plots, retention time correction plots, univariate statistical analysis, and pathway information.

MetaboAnalyst 5.0 (Pang et al. 2021) is an integrated, freely accessible web platform for metabolomic analysis, visualization, and interpretation of MS and NMR data. It adds several new modules, such as MS spectral processing, functional analysis, meta-analysis, joint pathway analysis, enrichment analysis, and network explorer, to the existing modules from MetaboAnalyst 4.0 (Chong et al. 2018).

Workflow4metabolomics (Giacomoni et al. 2015) was developed by the French Bioinformatics Institute for data processing, statistical analysis, and interpretation of LC–HRMS data. It is built on a Galaxy environment with computational modules such as data normalization, multivariate analysis, and annotation. Galaxy-M (Davidson et al. 2016) was developed for LC–MS and DIMS metabolomic data analysis. It covers the steps from raw data processing to statistical analysis and annotation on a Galaxy-based platform. Both Workflow4metabolomics and Galaxy-M rely on the XCMS (Smith et al. 2006) and CAMERA (Kuhl et al. 2012) packages for the analysis.

MZmine 2 (Pluskal et al. 2010) was developed for both targeted and untargeted LC–MS data. It is an open-source data processing toolbox released under the GNU GPL. Metabox (Wanichthanarak et al. 2017) is an R-based software framework for metabolomics data processing, statistical analysis, and functional analysis. WebSpecmine (Cardoso et al. 2019) is an R-based web framework for metabolomics and spectral data processing. MS-DIAL 4.0 (Tsugawa et al. 2020) was developed for tandem mass fragmentation data acquired in data-independent acquisition mode and supports comprehensive identification and quantification of small molecules, including lipids, by mass spectral deconvolution. Integrated mass spectrometry-based untargeted data mining, IP4M (Liang et al. 2020) comprises eight modules for untargeted LC–MS and GC–MS metabolomics data: data preprocessing, peak annotation, statistical analysis, pathway and enrichment analysis, classification and biomarker detection, correlation analysis, regression analysis, and sample size and power analysis.

Limitations of software tools and workflows

Metabolomic tools are written in different programming languages such as C, C++, R, Python, and Java, each with its pros and cons. Most tools require different prior knowledge and dependencies; as a result, not all tools are compatible with all operating systems. Inconsistent input–output behavior between tools, a lack of reliable metabolomics databases, sensitivity, cost, and the standardization of workflows are some of the issues in metabolomics research (Perez-Riverol et al. 2014). Despite the challenges the metabolomics community faces in metabolite identification, archiving, and biological interpretation, computational biologists and informaticians have made enormous efforts towards the development of software tools, common interfaces, integrated tools, and workflows. These need to be validated by analytical chemists before they are accepted by the metabolomics community and gain popularity.

Conclusion and future perspectives

Metabolomics is rapidly evolving in the field of plant biotechnology with the development of new analytical tools for metabolomic data acquisition that aim to cover the whole plant metabolome. This review provides an overview of the most widely used analytical tools in plant metabolomics experiments, along with the typical workflow of a plant metabolomics experiment, from sample preparation to data interpretation, with supporting software tools, databases, and workflows. These computational tools and databases are expanding rapidly for the functional characterization and biological interpretation of plant metabolomic profiles. The computational metabolomics community needs to keep developing workflows as the underlying modules advance. These workflows are integrated, multifunctional, freely available tools that make it easy for beginners to analyze metabolomic data. Finally, there is an urgent need to integrate metabolomics with other omics technologies and to combine various instrumentation capabilities to help understand complex biological systems in plants.