
1 Introduction

Understanding the metabolism of any given plant organ at any given developmental stage requires combining multiple data types. Global metabolite content (the metabolome) alone gives powerful phenotypic information but little to no insight into the mechanisms giving rise to such phenotypes, while global transcript content (the transcriptome) alone only provides information on gene expression that may or may not be the cause of these phenotypic differences [1]. Metabolic states in plant tissues are better described when transcriptomics and metabolomics are properly combined, as together they allow corroboration of the metabolome composition with the gene expression that (a) directly gives rise to the enzymes catalyzing metabolic reactions and (b) determines a large part of the regulatory framework controlling signaling, hormone responses, and responses to external stimuli, among other processes [2]. While this approach is broadly applicable when studying metabolism, it requires methodological rigor to ensure the procedures are coordinated from start to finish.

First, experiments must be designed to collect enough material for both metabolite and transcript quantification, ideally from relatively homogeneous individuals at the same developmental stage under the same environmental conditions. Beyond the experimental design, a precise and reliable extraction method must be chosen carefully. These methods differ for each type of “omics” experiment: while all RNA can typically be extracted with a single procedure for transcriptome analysis, for metabolomic analysis there is no single solvent or solvent mix that extracts and dissolves all metabolites, owing to their great chemical diversity [3]. An experimental design must therefore often incorporate additional replicates for metabolite quantification, as a separate extraction may be required for each class of metabolite (e.g., organic acids, phosphorylated compounds, sugars and sugar alcohols, amino acids, phenolics).

Second, appropriate statistics are required to properly integrate metabolomic and transcriptomic datasets. Transcriptomic datasets almost always contain orders of magnitude more features than targeted metabolomics datasets, so care must be taken to ensure that transcript trends do not overpower the metabolite data. This can be accomplished by different strategies for weighting each dataset, chosen in accordance with the desired analysis [4]. Results from these analyses can then be examined on a global scale (e.g., full coexpression networks, cluster analyses) or from a pathway perspective to address more specific questions.

The following protocol outlines a workflow for generating both a transcriptomic and a metabolomic dataset comparing two experimental groups—using plant embryos as an example—followed by a method for integrating the results from each dataset. The plant embryo example compares two varieties of a hypothetical oilseed species: one producing a high amount of oil (variety “A”), which serves as the experimental group displaying a desirable trait, and one producing little oil (variety “B”). The targeted metabolomics protocol allows the quantification of 100+ water-soluble metabolites: amino acids, sugars, sugar alcohols, organic acids, and phosphorylated compounds from freeze-dried plant embryos. However, it can be applied to any freeze-dried biological tissue/sample by simply adjusting the amount of starting material as necessary. The transcriptomics section of the protocol describes an efficient RNA extraction procedure. Like the metabolite extraction method, the RNA extraction is effective for any tissue (from plants, animals, or microorganisms); however, it strictly requires that tissues remain frozen following collection. The RNA extraction is followed by a walkthrough of differential gene expression analysis, which is valuable in and of itself. Finally, the transcriptomic data are integrated with the metabolomic dataset for a final multi-omics analysis, providing a powerful and global perspective on metabolism. To illustrate how to interpret the results and understand the metabolism behind a desirable trait in plants, an example associating metabolomic and transcriptomic data with seed/embryo oil content is provided.

2 Materials

2.1 Sample Collection

In transcriptomic and metabolomic studies, fast processing and cooling of the samples is key to obtaining accurate data.

  1. Dissecting microscope.

  2. Forceps.

  3. 2 mL Sarstedt screw cap tubes (Sarstedt AG & Co. KG, Germany).

  4. Ice-cold double distilled water (ddH2O).

  5. 2 mL Sarstedt tubes with perforated lids for freeze-drying (lids can be perforated by puncturing two to three times with a disposable syringe needle, etc.).

  6. Liquid nitrogen.

  7. Lyophilizer with a chamber that cools to −80 °C and maintains a vacuum of 0.070 mbar.

2.2 Metabolomics

2.2.1 Metabolite Extraction

  1. Refrigerated microcentrifuge at 4 °C.

  2. Refrigerated benchtop centrifuge at 4 °C.

  3. Retsch mill MM400 (Retsch, USA, Verder Scientific, Inc., Newtown, PA) or 2010 Geno/Grinder® (SPEX™ SamplePrep, Metuchen, NJ) (or another thorough tissue grinder).

  4. Water bath at 100 °C.

  5. Ice-cold ddH2O in a 50 mL falcon tube (in a bucket of ice).

  6. 100 °C ddH2O in a 50 mL falcon tube (in the water bath).

  7. A mixture of the compounds that will be used as internal standards (IS). It is recommended to use at least one IS for each family of metabolites to be quantified. The analytes used need to be absent from the samples and have a chemical structure similar to that of the compounds that will be analyzed, for example, [U-13C]-glycine for amino acids, [U-13C]-mannose for sugars and sugar alcohols, and [U-13C]-fumarate for organic acids and phosphorylated compounds. The final concentration of the IS needs to be in the same order of magnitude as the concentration of the metabolites that are going to be quantified. For example, 10 μL of an aqueous solution containing 4 mM [U-13C]-glycine, 10 mM [U-13C]-mannose, and 1 mM [U-13C]-fumarate could be used for extractions in which 5 mg of freeze-dried embryos is extracted and resuspended in a final volume of 500 μL.

  8. Ice-cold, labeled 5 mL syringes, one for each sample and one for every IS control.

  9. Ice-cold 15 mL falcon tubes, one for each sample and one for every IS control.

  10. 0.22 μm syringe filters (Cat. No. SLGVR33RS, MilliporeSigma, Burlington, MA), one for each sample and one for every IS control.

  11. 5 mm tungsten beads, one for each sample and one for every IS control.

  12. Perforated lids of 15 mL falcon tubes, one for each sample and one for every IS control.

  13. Ice-cold, labeled 3 kDa Amicon Ultra 0.5 mL column filtering devices, one for each sample and one for every IS control.

  14. Ice-cold, labeled 0.2 μm Nanosep filtering devices, one for each sample and one for every IS control.

  15. Liquid nitrogen.

  16. Lyophilizer with a chamber that cools to −80 °C and maintains a vacuum of ~0.070 mbar.

  17. Heat-resistant tube rack for incubation in water bath.
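As a sanity check on the example IS mix in item 7, the final concentration of each internal standard after the 10 μL aliquot ends up in the 500 μL resuspension volume follows from C1V1 = C2V2. A minimal sketch (all numbers are the example values from the text):

```python
# Final IS concentrations for the example mix from item 7: 10 uL of IS
# solution carried into a 500 uL resuspension volume (C1*V1 = C2*V2).
IS_VOLUME_UL = 10        # volume of IS mix added per extraction (uL)
FINAL_VOLUME_UL = 500    # resuspension volume after lyophilization (uL)

is_stock_mM = {          # IS stock concentrations from the text (mM)
    "[U-13C]-glycine": 4.0,
    "[U-13C]-mannose": 10.0,
    "[U-13C]-fumarate": 1.0,
}

# final (uM) = stock (mM) * 1000 * V_IS / V_final
final_uM = {name: c * 1000 * IS_VOLUME_UL / FINAL_VOLUME_UL
            for name, c in is_stock_mM.items()}

for name, c in final_uM.items():
    print(f"{name}: {c:.0f} uM")
```

The resulting concentrations (tens to hundreds of μM) sit in the same order of magnitude as typical metabolite levels in such extracts, which is the stated design goal for the IS mix.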

2.2.2 Metabolite Quantification

Prepare all solutions using ddH2O and LC-MS-grade solvents at room temperature (unless indicated otherwise). The column models and LC-MS/MS system are interchangeable with proper method development.

  1. Agilent amber screw vials and caps.

  2. 10 mM HCl.

  3. LC-MS-grade acetonitrile, water, and acetic acid.

  4. Acetonitrile/water (60:40, v/v).

  5. External standards (ES) mix. For each family of compounds, a specific number of standard mixtures are prepared, taking into account the impurities present in some “pure standards.” For each of the metabolites that will be quantified, a pure standard is needed. The final dilution of the ES is performed with the same solvents as the final dilution of the samples. The exact final concentration of each standard needs to be known and should preferably be in the same order of magnitude as the levels in the samples.

  6. 0.1% acetic acid in acetonitrile, volume dependent on the number of samples to run.

  7. 0.1% acetic acid in ddH2O, volume dependent on the number of samples to run.

  8. 0.5 and 75 mM potassium hydroxide (KOH), volume dependent on the number of samples to run. For accurate and reproducible preparation of these solutions, it is imperative to use degassed LC-MS-grade water and to measure the volume of the KOH solution (45% w/w) by weight (do not use KOH pellets), as previously described [3].

  9. LC columns (Table 1).

  10. Agilent 1290 Infinity II coupled to an AB SCIEX QTRAP 6500+ LC-MS/MS system.

  11. Analyst software—SCIEX (https://sciex.com/products/software/analyst-software).

  12. MetaboAnalyst v. 5.0 (browser-based) (https://www.metaboanalyst.ca/) [5].

Table 1 Liquid chromatograph parameters

2.3 Transcriptomics

Solutions must be made using RNase-free reagents. Furthermore, any tools and containers/bottles used must be autoclaved for 30–60 min at 121 °C. The measuring end of the probes used for pH measurement can be placed in 70% ethanol (EtOH) for 1 min and then in 1 M sodium hydroxide for 5 min. Rinse by dunking the probe’s end in RNase-free water repeatedly until a stable pH reading is obtained with the clean probe.

2.3.1 RNA Extraction

  1. 1.5 and 2 mL snap-top sample tubes, autoclaved.

  2. Deionized water or 18.2 MΩ-cm water.

  3. Diethyl dicarbonate (DEPC).

  4. Ethanol (EtOH).

  5. Sodium hydroxide (NaOH).

  6. 2,2′,2″,2‴-(Ethane-1,2-diyldinitrilo)tetraacetic acid (EDTA).

  7. 2-Amino-2-(hydroxymethyl)propane-1,3-diol (Tris).

  8. Hydrochloric acid (HCl).

  9. Chloroform.

  10. Isoamyl alcohol.

  11. Sodium acetate (NaOAc).

  12. Glacial acetic acid.

  13. Lithium chloride (LiCl).

  14. Glycogen (we used RNase-free, Thermo Fisher Scientific Cat. No. R0551).

  15. N,N,N-Trimethylhexadecan-1-aminium bromide (CTAB).

  16. Polyvinylpolypyrrolidone (PVPP).

  17. Sodium chloride (NaCl).

  18. Spermidine.

  19. 2-Mercaptoethanol.

  20. Aluminum foil.

  21. Calibrated pH meter.

  22. Magnetic stirrer.

  23. Autoclave.

  24. Liquid nitrogen.

  25. Access to a capillary gel electrophoresis instrument or service.

2.3.2 Linux

A Linux distribution such as Ubuntu (or Ubuntu Server) is required. Installing a distribution alongside your current operating system involves distribution-specific instructions, so be sure to follow them carefully. The programs used here will work on Ubuntu v. 22.04.1 and later. If using a high-performance computer remotely, WinSCP (https://winscp.net/eng/index.php) is an easy way to navigate your directories. The command line attached to WinSCP is powered by a local installation of PuTTY (https://www.putty.org).

  1. HISAT2 v. 2.2.1 (http://daehwankimlab.github.io/hisat2/) [8].

  2. samtools v. 1.16.1 (www.htslib.org) [9].

  3. StringTie v. 2.2.0 (https://ccb.jhu.edu/software/stringtie/) [10].

  4. gffread v. 0.12.7 (https://github.com/gpertea/gffread) [11].

  5. Trinity v. 2.14.0 (https://github.com/trinityrnaseq/trinityrnaseq) [12].

  6. Trans-ABySS v. 2.0.1 (https://github.com/bcgsc/transabyss) [13].

  7. AGAT v. 1.0.0 (https://github.com/NBISweden/AGAT).

  8. EvidentialGene (http://arthropods.eugenes.org/EvidentialGene/evigene/).

  9. Salmon v. 1.9.0 (https://combine-lab.github.io/salmon/) [14].

  10. bedtools v. 2.30.0 (https://github.com/arq5x/bedtools2/releases) [15].

  11. minimap2 v. 2.24 (https://github.com/lh3/minimap2) [16].

2.3.3 R

The R statistical package (www.r-project.org) needs to be installed (v. 4.1.3 or later). A convenient way to use R is via RStudio (v. 2022.07.2+576) (https://posit.co/download/rstudio-desktop/), which allows easy navigation of R terminals and directories and streamlines the maintenance of script files. RStudio accesses the local installation of R.

  1. DESeq2 v. 1.38.1 (https://bioconductor.org/packages/release/bioc/html/DESeq2.html) [17].

  2. tximport v. 1.26.0 (https://bioconductor.org/packages/release/bioc/html/tximport.html) [18].

  3. tximportData v. 3.16 (https://bioconductor.org/packages/release/data/experiment/html/tximportData.html).

2.3.4 Windows/PC

  1. FastQC v. 0.11.9 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) [19].

  2. Blast2GO v. 6.0 (https://www.biobam.com/blast2go-previous-versions/). This is currently available as a stand-alone and is free for academic use after registering at https://www.biobam.com/blast2go-basic/ [20]. It is also part of the proprietary OmicsBox suite (https://www.biobam.com/omicsbox/).

  3. PANTHER v. 17.0 (browser-based) (http://www.pantherdb.org) [21, 22].

  4. MetaboAnalyst v. 5.0, as used previously.

3 Methods

3.1 Targeted Metabolomics

3.1.1 Plant Embryo Collection and Preparation

  1. Peel immature seeds using forceps and the dissecting microscope; discard the seed coat and endosperm. Cool down the tissue immediately after dissection in 2 mL Sarstedt tubes placed on ice. The number of embryos collected for one biological replicate should be equivalent to at least 1.5 and 3 mg of dry weight (DW) for smaller and larger embryos, respectively. Once the required number of embryos is collected, freeze the 2 mL Sarstedt tube in liquid nitrogen. Keep the tubes at −80 °C until all the biological replicates have been collected.

  2. Dry the frozen embryos for 3 days in a lyophilizer, capping the tubes with perforated lids to allow the gaseous water to be released from the samples.

  3. Grind each replicate separately using a 5 mm tungsten bead in a Retsch mill/tissue grinder for 5 min at 30 Hz.

  4. For larger embryos, weigh 5 ± 0.1 mg of the powder of each replicate into a 2 mL Sarstedt tube. For smaller embryos, the same 2 mL Sarstedt tube containing the sample can be used for the extraction, as long as the DW of the material is between 1.5 and 10 mg.

3.1.2 Metabolite Extraction with Boiling Water

We recommend starting the processing of the samples 2 days before LC-MS/MS runs to avoid long periods of storage.

  1. Spin down the 2 mL Sarstedt tube containing the dried powder and the 5 mm tungsten bead using a microcentrifuge set to 4 °C at 17,000 × g for 1 min. This step pulls the dried material to the bottom of the tube.

  2. Remove tubes from the microcentrifuge and place them immediately on ice.

  3. At this point, label one empty 2 mL Sarstedt tube (IS-1, IS-2, …, IS-n) for every 12 samples. These tubes are processed exactly the same way as the sample tubes but will contain only IS and no biological sample.

  4. Start a timer and process one tube at a time.

  5. Add 10 μL of the IS mix (4 mM [U-13C]-glycine, 10 mM [U-13C]-mannose, 1 mM [U-13C]-fumarate) to the first tube.

  6. Immediately add 1 mL of boiling ddH2O to that first tube.

  7. Screw the lid on tightly, vortex, and transfer the tube immediately to the boiling water bath (a heat-resistant floating rack is helpful if the water level is high). Record the exact stopwatch time when the incubation started for that first tube.

  8. Proceed with the following tubes (one at a time) in the same way as the first tube, following steps 5–7. Try to space them about 30 s apart; this way the addition of boiling water to samples can be staggered and processed more efficiently.

  9. After 5 min of incubation, remove each tube from the water bath, vortex, and put it back for another 5 min.

  10. After a total of 10 min of incubation, quickly place the first tube on ice, and continue with subsequent tubes until the last sample is removed from the water bath. Once the last tube has been placed on ice, wait 3–5 min before continuing with the following step.

  11. Centrifuge the tubes at 17,000 × g and 4 °C for 5 min.

  12. Remove tubes carefully from the microcentrifuge to avoid disrupting the pellet.

  13. Transfer each supernatant to a pre-labeled 5 mL syringe with a 0.22 μm filter.

  14. Push the supernatant through the filter using the pre-labeled plunger that comes with the 5 mL syringe, and collect it in a 15 mL falcon tube previously placed on ice.

  15. Add 1 mL of ice-cold ddH2O to the 2 mL Sarstedt screw cap tubes containing the pellets, vortex until the pellet is resuspended, and centrifuge at 17,000 × g and 4 °C for 5 min.

  16. Filter each supernatant as explained before, using the same syringes and filters for each sample and collecting the eluent in the same 15 mL falcon tubes. After this step each falcon tube will contain 2 mL of ice-cold extract.

  17. Rinse the syringes with filters twice with 1 mL of ice-cold ddH2O.

  18. Discard syringes and filters.

  19. Cap the 15 mL falcon tubes, each containing approximately 4 mL of ice-cold extract, and centrifuge at 1000 × g and 4 °C for 60 s.

  20. Remove the caps of the 15 mL falcon tubes and replace them with perforated lids.

  21. Freeze the 15 mL falcon tubes in liquid nitrogen for 15–20 s.

  22. Lyophilize the frozen extracts for 24 h at −80 °C.

3.1.3 Resuspension of Lyophilized Samples

  1. Add 500 μL of ice-cold ddH2O to the lyophilized samples, change the lids of the tubes to new ones, and vortex until complete dissolution.

  2. Centrifuge the resuspended samples at 1000 × g at 4 °C for 60 s.

  3. Split samples as follows:

  4. 150 μL is loaded on a 0.22 μm Nanosep filter (use for sugars and sugar alcohols).

  5. The remaining volume is added to a 3 kDa Amicon Ultra 0.5 mL column filtering device.

  6. Centrifuge the filtering devices at 14,000 × g at 4 °C for 60 min. At the end of the centrifugation, a maximum of only 5 μL should be lost via retention in the 0.2 μm Nanosep filter and only 50–75 μL in the 3 kDa Amicon filters. Do not exceed 14,200 × g for the 3 kDa Amicon filtering device because the membrane will break and macromolecules (e.g., proteins) will be released into the eluate.

  7. Place the eluates on ice and discard the filters.

  8. The filtered samples can be stored in the refrigerator until LC-MS/MS processing. If storage time exceeds 2 days, it is highly recommended to freeze the extracts and keep them at −20 °C until further analysis.

3.1.4 Metabolite Quantification

The water-soluble metabolites are quantified using an Agilent 1290 Infinity II coupled to an AB SCIEX QTRAP 6500+ LC-MS/MS system through multiple reaction monitoring (MRM). Three specific procedures previously described in the literature are used for quantifying (1) amino acids and their derivatives, (2) soluble sugars and sugar alcohols, and (3) phosphorylated compounds and organic acids [3, 7, 23, 24]. These metabolites are of particular interest because they are key players in the central metabolism of many species. The preparation of the samples, chromatographic columns, mobile phases, gradients, flow, and remaining LC parameters are described in Table 1, while the MS specifications for each method and metabolite are shown in Table 2. The length of the runs is 10, 30, and 80 min for methods (1), (2), and (3), respectively. In each case, a calibration curve with each pure standard is required, as the area/mol ratio is specific to each compound and allows identification of the linearity range and the limits of detection and quantification. In general, the calibration curves show excellent linearity in the low fmol to high pmol range, and detection limits range from 0.1 to 122.5 fmol. General steps for each method are as follows:

  1. Create an acquisition method in the Analyst software (see Note 1) under the “Acquisition Mode” tab for each family of compounds using the information provided in Tables 1 and 2. For the LC, define Solvents A and B, the percentage of each solvent at each given time, flow, autosampler temperature, and column temperature. For the MS sections, set the polarity, Q1 and Q3 masses, dwell time, entrance potential (EP), declustering potential (DP), collision energy (CE), and collision cell exit potential (CXP), as well as the source parameters: curtain gas (CUR), collision gas (CAD), ion spray voltage (IS), temperature (TEM), and ion source gas 1 and 2 (GS1 and GS2). Note that each of these parameters was optimized for each of the metabolites and depends on the specific instrumentation used.

  2. Purge and prime the LC section with 50% of each mobile phase.

  3. Install the corresponding column, carefully considering the direction of the flow and making sure that the system does not show leaks, especially at the tubing connections entering and exiting the column.

  4. Equilibrate the column for 20 min using the percentages of Solvents A and B and the flow with which the chromatography run starts (Table 1). For metabolites from method (3), which are separated on an anion exchange column, a suppressor needs to be connected downstream of the column, using an isocratic pump, to remove the salts from the mobile phase before reaching the MS, as previously described [3].

  5. Create a sample submission batch in “Acquisition Mode”; select one of the acquisition methods created in step 1; add three samples; label them as blanks; select the path where the data from the runs will be saved, the rack type, the vial position, and the direction in which the vials will be used (columns or rows); and select an injection volume of 5 μL. We advise including, in the name of the run, the name of the sample, the dilution (for samples), and the injection volume. For example, the blanks will be Blank1_5ul_001, Blank2_5ul_002, and Blank3_5ul_003. Note that mu in μL is replaced with “u,” as special characters can cause problems with sample/batch submissions; spaces should also be avoided.

  6. Submit the blanks, check that they appear in the queue, and verify that all the information was added correctly.

  7. Prepare a blank vial and insert it in rack 1 position 1. The blank will consist of 1 mL of the same solvent that will be used to dilute the samples (Table 1).

  8. Equilibrate the MS for 60 s, connect the tubing coming out of the column to the MS, and check that the shape of the ionization spray is adequate (the stream should form a “V”); if not, adjust the protrusion of the capillary from the electrospray probe to no more than 1 mm.

  9. Start the run of the blanks, and check the chromatograms in real time (“Explore Mode” in the Analyst software), where flat total ion count (TIC) curves along the duration of the run are expected for each of the transitions. If peaks appear, run more blanks until they are no longer seen.

  10. Add the ES run to the sample submission batch (“Acquisition Mode” in the Analyst software). We suggest the following dilutions for the ES mix (six vials): 1000 μM, 100 μM, 10 μM, 1 μM, 100 nM, and 1 nM. Inject each ES concentration using different injection volumes: 2, 5, and 10 μL. When entering the information in the sample submission batch, add one row for each dilution and each injection volume, assigning a vial position.

  11. Insert the ES vials in the rack according to the positions assigned in the sample submission.

  12. Run the different dilutions of the ES mix to generate a calibration curve.

  13. Prepare a quantitative method in the Analyst software (“Quantitation Mode”) to integrate the area under each peak. First, select one sample to be used as an example for the automatic integration of the peaks; typically, this is done by running the ES so that each metabolite is obtained in one injection. For each metabolite, determine the correct transition (parent and daughter ion masses, Q1 and Q3), using one row per metabolite. In the case of isomers (i.e., several metabolites in the same transition but at different retention times), duplicate the transition as many times as the number of isomers. Then, for each analyte, set the correct RT, the smoothing factor (normally we use a factor between 1 and 2 to avoid automatic integration of only half of the peak, but it depends on the peak’s shape), and the zone of the chromatogram in which to measure the noise (i.e., a zone with no peaks, even in the sample chromatograms).

  14. Apply the quantitation method to the ES samples and check manually that each automatic integration was done correctly; if not, correct it manually.

  15. Graph the calibration curves (area vs pmol) for each metabolite and determine the linearity range and the limits of detection and quantification. According to the sensitivity obtained for each metabolite (slope of the graph), determine an ideal ES mix for which the concentration of each of the metabolites falls approximately in the middle of the calibration curve using only one injection volume. This dilution of ES will be run together with the samples. Table 3 shows the concentrations and injection volumes for the three different methods, establishing one ideal ES mix for amino acids and derivatives (named White, injection volume of 6 μL), two for soluble sugars and sugar alcohols (named Yellow and Red, injection volume of 10 μL), and four for organic acids and phosphorylated compounds (named Yellow, Red, Blue, and Orange, injection volume of 10 μL).

  16. Prepare one representative sample for each “type” of sample to determine the level of dilution needed and the injection volume required to have all the metabolites in the linear range of the curves. The solvents used for the preparation of the samples in each method can be found in Table 1. In general, an initial trial would be as follows:

    1. (1) For amino acids:

       • 50× dilution in the vial → 5 μL injection

       • 880 μL of ddH2O + 100 μL of 10 mM HCl + 20 μL of sample after Amicon filtration

    2. (2) For sugars:

       • 50× dilution in the vial → 5 μL injection

       • 980 μL of acetonitrile/water solution (60:40, v/v) + 20 μL of sample after Nanosep filtration

    3. (3) For organic acids and phosphorylated compounds:

       • 10× dilution in the vial → 5 μL injection

       • 900 μL of ultrapure water + 100 μL of sample after Amicon filtration

       (see Note 2)

  17. Run the trial and identify the level of dilution needed for each type of sample. Multiple runs with different dilutions and/or injection volumes may be required for samples in which the metabolites to be quantified span a large range of concentrations. Furthermore, adjusting the DP for some of the metabolites in a run may avoid the need to run the samples multiple times.

  18. Prepare the rest of the samples following the best dilution identified in the trial.

  19. Add the samples to the sample submission batch in “Acquisition Mode.” If the samples are run on a different day than the ES calibration curves, plan runs of the ideal ES mix every 12–15 samples. We recommend distributing samples, ES mix, IS, and blanks in the following order: blank (using the same injection volume as the samples), ES mix, IS (using the same injection volume as the samples), 12–15 samples; blank, ES mix, IS, 12–15 samples; etc. It is important to run the ideal ES mix, IS, and blanks before and after every 12–15 samples to account for run variations that may occur over long running times.

  20. After the runs, apply the same quantification method created in step 13 to the full batch of samples, ES mix, IS, and blanks. Check the automatic integrations; if considerable variations in retention time occurred between the runs of the ES calibration curves and the sample batch, the quantitative method will need to be adjusted for effective automatic integration. Small variations in the preparation of the mobile phase can lead to shifts in the RT. It is also possible, when dealing with a large number of samples and especially for the organic acids and phosphorylated compounds (80 min runs), that shifts in the RT and/or area of the standards are seen between the first samples and the last ones. In those cases, the normalization and the quantification are done considering only the flanking IS and ES mix runs for every 12–15 samples, respectively.

  21. For the calculation of each metabolite concentration in pmol/mg DW, first calculate a correction factor for each sample between the area of the IS in the sample and the area obtained in the IS tubes (where only the IS was present); this factor will be applied to the areas of each metabolite to account for sample loss throughout extraction and preparation or extract concentration during filtration:

    $$ \text{Correction factor}=\frac{\text{peak area of }{}^{13}\text{C-mannose in biological sample}}{\text{average peak area of }{}^{13}\text{C-mannose across all IS tubes}} $$
    (1)
    $$ \text{Corrected peak area}=\frac{\text{original peak area}}{\text{correction factor}} $$
    (2)

    Correction factors close to 1 indicate that neither significant losses nor concentration occurred throughout sample preparation. Then, we calculate the quantity (mol) of each metabolite by comparing the area of a given metabolite in the sample with the area and known concentration of this same metabolite in the ideal ES mix (or the calibration curves if processed simultaneously with the samples):

    $$ \text{mol}=\frac{\text{mol of ES injected}}{\text{average peak area of ES}}\times \text{corrected area}\times \text{dilution factor} $$
    (3)

    The dilution factor is calculated as follows:

    $$ \text{Dilution factor}=\frac{\text{sample extraction volume}\times \text{total volume in LC-MS vial}}{\text{volume injected}\times \text{volume of sample added to LC-MS vial}} $$
    (4)

    After considering the total dilution and the mg of DW used for each sample, we obtain the concentration in pmol/mg DW of each metabolite in each sample:

    $$ \text{pmol}/\text{mg DW}=\frac{\text{mol}\times {10}^{12}}{\text{mg DW of sample}} $$
    (5)
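The arithmetic in Eqs. (1)–(5) can be sketched in a few lines. The areas, volumes, and weights below are hypothetical, chosen only to illustrate the calculation chain from raw peak areas to pmol/mg DW:

```python
# Worked example of Eqs. (1)-(5). All numeric inputs are hypothetical.

def correction_factor(is_area_sample, is_area_controls):
    """Eq. (1): 13C-mannose area in a biological sample divided by the
    average 13C-mannose area across the IS-only control tubes."""
    return is_area_sample / (sum(is_area_controls) / len(is_area_controls))

def corrected_area(area, cf):
    """Eq. (2): original peak area divided by the correction factor."""
    return area / cf

def dilution_factor(extraction_vol, vial_total_vol, injected_vol, sample_in_vial_vol):
    """Eq. (4): (extraction volume x total vial volume) /
    (volume injected x sample volume added to the vial)."""
    return (extraction_vol * vial_total_vol) / (injected_vol * sample_in_vial_vol)

def metabolite_mol(es_mol_injected, es_avg_area, corr_area, dil):
    """Eq. (3): scale corrected area by the ES mol/area ratio and dilution."""
    return es_mol_injected / es_avg_area * corr_area * dil

def pmol_per_mg_dw(mol, mg_dw):
    """Eq. (5): convert mol to pmol and normalize by dry weight."""
    return mol * 1e12 / mg_dw

# Hypothetical run: some sample loss (cf < 1), a 50x vial dilution,
# 500 uL resuspended extract, 5 uL injected, 5 mg DW of tissue.
cf = correction_factor(9.0e5, [1.0e6, 1.05e6, 0.95e6])
area = corrected_area(4.5e5, cf)
dil = dilution_factor(500, 1000, 5, 20)
mol = metabolite_mol(1e-12, 2.0e5, area, dil)   # ES: 1 pmol injected gave area 2.0e5
conc = pmol_per_mg_dw(mol, 5.0)
print(f"{conc:.1f} pmol/mg DW")
```

Note that the dilution factor of 5000 decomposes as (500 μL extract / 5 μL injected) × (1000 μL vial / 20 μL sample in vial), i.e., the resuspension-to-injection ratio times the in-vial dilution.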
Table 2 Mass spectrometer parameters
Table 3 External standard mixes

3.1.5 Metabolome Visualization and Analysis

There are different free software packages that allow the visualization and analysis of metabolomic data. Here, we suggest the use of MetaboAnalyst, which not only allows the analysis of metabolomic data (targeted and untargeted) but also contains tools to integrate the results with transcriptomic data [5]. Some characteristics that made this software our first choice are its user-friendly design, frequent updates and new tools, and the availability of tutorials, manuals, examples, and even personal technical support to help the user navigate the site. For each of the tools offered in MetaboAnalyst, the underlying R commands can be exported, giving the user some freedom to redefine variables, which can be especially important for image processing. We describe a series of analyses to explore, understand, and visualize the data using the “Statistical Analysis [one factor]” option on the module overview page, which one can reach by clicking the “Click here to start” button on the home page (Fig. 1).

Fig. 1

Different analysis modules available in MetaboAnalyst 5.0: LC-MS spectra processing (raw spectra); functional and functional meta-analysis (MS peaks); enrichment, pathway, joint-pathway, and network analysis (annotated features); and statistical, biomarker, statistical meta-, power, and other analyses (generic formats)

3.1.5.1 Data Entry

There are different ways to enter the data into MetaboAnalyst. For targeted metabolomics, we recommend the use of .csv files with metabolite concentrations expressed in the same units for each biological replicate and sample type. Whether columns or rows hold metabolites or samples is inconsequential, as the software lets one specify how the data is organized. Different strategies to deal with missing values, data normalization, data transformation, and scaling are offered, but the best approach needs to be evaluated case by case. As many of the statistical analyses offered assume the data is normally distributed, the user is responsible for selecting the data transformation that best satisfies that requirement. While navigating through the tools, the options "Data editor" and "Data filter" are available to reduce or modify the samples and metabolites being analyzed. A common approach for this type of data is a log transformation followed by autoscaling.
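For orientation, the log transformation and autoscaling applied by the software can be sketched in the shell with awk; this is a minimal sketch on three made-up intensities, where autoscaling means (x − mean)/sd computed on the log-transformed values:

```shell
printf '1\n10\n100\n' |
awk '{ x[NR] = log($1) / log(10) }              # log10 transform each intensity
     END {
       for (i = 1; i <= NR; i++) m += x[i]
       m /= NR                                  # mean of the log values
       for (i = 1; i <= NR; i++) s += (x[i] - m)^2
       s = sqrt(s / (NR - 1))                   # sample standard deviation
       for (i = 1; i <= NR; i++) printf "%.2f\n", (x[i] - m) / s
     }'
```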

3.1.5.2 Statistical Analysis Module
  1. 1.

First, assess the quality and variation of the data to evaluate how the biological replicates group together; for this, use the "hierarchical clustering dendrogram." Replicates from the same sample are expected to group together, and the output also allows one to visualize the general arrangement of the samples/conditions. A complementary visualization is given through "principal component analysis (PCA)." This analysis reduces the large multivariate dataset into a few "components," constructed as linear combinations of the initial variables (metabolites), that still retain most of the information in the full set. Furthermore, by interpreting the individual coefficients of each variable within a component, one can discover which metabolites contribute most to the separation of the samples. Together, these two exploratory tools allow one to evaluate the replicates, detect outliers, and gauge the degree of variation in the data.

  2. 2.

Run a significance test (ANOVA or t-test, depending on your data) to identify the metabolites that differ significantly between the conditions being analyzed. Reducing the variables to only those significantly different metabolites can reduce noise and make it easier to grasp the general metabolic changes that most differentiate the conditions.

  3. 3.

When analyzing a large set of conditions, evaluate the relationships between changes in metabolite levels through a "Correlations" and/or "PatternHunter" analysis from the sidebar menu. This can be done for all metabolites against all others, or just for the associations with one metabolite or trait, respectively. This is especially powerful if one has multiple agronomic traits in the data, such as oil content, protein content, harvest index, etc. The direction and strength of each correlation can be used to identify processes that respond together under particular circumstances and motivates further research into the type of relationship linking them.

3.1.5.3 Pathway Analysis Module

Although this protocol will address joint-pathway analysis using both transcriptomics and metabolomics data later, it is helpful to look at the metabolomics data alone to see patterns and inspect results [4, 5]. Replace the metabolite names with their KEGG C numbers in the .csv file and perform a pathway analysis in MetaboAnalyst (Fig. 1). Since targeted metabolomics data is being used, make sure to upload a reference metabolome listing all the compounds detected by the LC-MS/MS method (this is important statistically, as using an entire database of compounds in this case would produce false significance). The results will indicate whether the significantly different metabolites detected above are common members of a pathway. Keep notes on these results, as it is useful to see whether trends in the metabolomics data match the integrated analysis later on.
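The name-to-C-number replacement can be scripted rather than done by hand; a minimal sketch, assuming a hypothetical two-column mapping file (kegg_map.tsv, name then C number) that you curate yourself and a concentrations file conc.csv with metabolites in the first column:

```shell
# Hypothetical mapping and data files, for illustration only:
printf 'citrate\tC00158\nsucrose\tC00089\n' > kegg_map.tsv
printf 'Sample,rep1\ncitrate,12.5\nsucrose,3.1\n' > conc.csv

# Load name -> C number, keep the header row, then swap names in column 1:
awk -F'[\t,]' 'NR == FNR { id[$1] = $2; next }
               FNR == 1  { print; next }
               $1 in id  { $1 = id[$1] } { print $1 "," $2 }' kegg_map.tsv conc.csv
```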

3.1.5.4 Biomarker Analysis Module

This tool is useful for identifying metabolite changes associated with a certain condition. The analysis allows the discovery of biomarkers based on receiver operating characteristic (ROC) curves, or biomarker models can be specified manually to predict new samples. This is a particularly powerful method when using very large metabolite sets and/or population-level sample data, for which it can show which metabolites are most important for explaining the overall trends in the data.

3.2 Transcriptomics

3.2.1 Embryo Collection and Preparation

  1. 1.

When collecting samples, be sure to use tissue containing enough RNA to yield a good library. In general, plant samples have higher concentrations of RNA in younger tissues, including leaves and seeds. Preliminary extractions may be needed to determine how much tissue is required to reach no less than ~150 ng/μL. A maximum of 1000 ng/μL is recommended, as RNA quantification instruments often give distorted readings above this concentration. Serial dilutions will generally be necessary to prepare for RNA-Seq, so an excess concentration of RNA gives more control for dilution.

  2. 2.

Vials or containers used for collecting tissue must be autoclaved for 30–60 min at 121 °C to remove RNases. In addition, any tool or surface that may come into contact with samples (forceps, spatulas, liquid nitrogen containers, etc.) must be RNase-free. If these items cannot be autoclaved, this is ideally done with a commercial solution such as RNaseZAP™ (Sigma-Aldrich, St. Louis, MO); however, a thorough scrub with 70% EtOH (prepared with RNase-free water) has also proven successful. Be sure to use RNaseZAP or 70% EtOH on any working surfaces, nitrogen buckets, ice trays, etc.

  3. 3.

Collected tissue should be immediately placed in liquid nitrogen in vials that are not fully sealed, so as to prevent implosion. The time between collection and freezing should be as short as possible; if thawing occurs at any point, the sample should be discarded and recollected using fresh tissue. If collection of samples requires significant time, sample tubes can be placed on dry ice or in a rack in liquid nitrogen. Furthermore, samples can be stored at −80 °C following collection if not ready for processing.

  4. 4.

Samples should be pulverized into as fine a powder as possible while in liquid nitrogen. This can be done by placing sample vials into a rack seated in liquid nitrogen and carefully grinding by hand with compatible RNase-free plastic pestles. When working with harder tissue, such as seeds, it is important to begin grinding slowly so that tissue is not ejected under pressure from the pestle. RNA yield increases substantially with greater surface area for extraction, so the finest powder obtainable under frozen conditions is highly desirable. Samples can be placed in a −80 °C freezer if not being used immediately, or kept in liquid nitrogen if extraction is to take place right away. If placing in a −80 °C freezer, wait several hours before closing the vial lids to allow the temperature to equilibrate and avoid implosion of the vial.

3.2.2 Reagent Preparation

  1. 1.

Prepare 0.1% (v/v) DEPC-treated water by adding 1 mL of diethyl dicarbonate (DEPC) per 1000 mL of total (untreated) water; mix well, cover the bottle with foil, and allow to incubate overnight at ~35 °C. Vent the bottle by loosely placing the cap on, and autoclave with a liquid cycle at 121 °C for 30 min.

  2. 2.

    Prepare 0.5 M 2,2′,2″,2‴-(ethane-1,2-diyldinitrilo)tetraacetic acid (EDTA) by adding 116.0 g EDTA to 500 mL of deionized water and place on a magnetic stirrer. Adjust pH to 8.0 using sodium hydroxide and wait until dissolved. Autoclave at 121 °C for 30 min.

  3. 3.

    Prepare 1 M 2-amino-2-(hydroxymethyl)propane-1,3-diol (Tris) by adding 60.6 g Tris to 500 mL RNase-free water. Place on magnetic stirrer and adjust pH to 8.0 with hydrochloric acid.

  4. 4.

    Prepare a 24:1 (v/v) chloroform/3-methylbutan-1-ol (isoamyl alcohol) by adding 30 mL of isoamyl alcohol to 720 mL of chloroform in a 1 L bottle. Cover in foil to keep out of light.

  5. 5.

    Prepare 500 mL of 70% EtOH (v/v) in RNase-free water, and another 500 mL of 100% EtOH.

  6. 6.

    Commercial RNA-grade glycogen is the best option for this procedure, but RNase-free glycogen of any sort will work. Dilute glycogen to aliquots of 500 μg in autoclaved 1.5 or 2 mL snap-top centrifuge tubes.

  7. 7.

    Prepare 200 mL of 3.2 M sodium acetate (NaOAc) by adding 52.5 g to 200 mL RNase-free water. Place on magnetic stirrer and adjust pH to 5.5 using glacial acetic acid.

  8. 8.

    Prepare 200 mL of 8 M lithium chloride (LiCl) by adding 67.8 g LiCl to 200 mL RNase-free water.

  9. 9.

    Create the N,N,N-trimethylhexadecan-1-aminium bromide (CTAB) extraction buffer stock solution by adding 8.0 g of CTAB, 46.8 g of sodium chloride (NaCl), and 12 g of 1-ethenylpyrrolidin-2-one (polyvinylpolypyrrolidone, PVPP) to a 500 mL glass bottle. Add 400 mL of RNase-free water to bring the concentration of CTAB to 2% w/v, NaCl to 2 M, and PVPP to 3% w/v.

  10. 10.

    Add 40 mL of 1 M Tris–HCl and 20 mL of 0.5 M EDTA to the CTAB buffer to obtain a final concentration of 100 mM Tris–HCl and 25 mM EDTA. Add 0.5 mL of 0.1% DEPC to the buffer and incubate overnight at ~35 °C before autoclaving at 121 °C for 30 min. This will serve as the CTAB stock solution.

  11. 11.

Prepare the working extraction buffer. Determine the volume necessary by multiplying (number of samples + 2 for overage) by 1050 μL. For 12 samples, 15 mL is enough and will be used in this example (all volumes can be scaled if more or less extraction buffer is needed). Add 14.5 mL of the CTAB stock solution to an RNase-free tube; then add 8.1 μL of N1-(3-aminopropyl)butane-1,4-diamine (spermidine) and 450 μL of 2-sulfanylethan-1-ol (2-mercaptoethanol).
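The overage arithmetic above can be checked quickly in the shell (n is the number of samples; 12 here, as in the worked example):

```shell
# (samples + 2 overage) x 1050 uL of working extraction buffer per sample
awk -v n=12 'BEGIN { v = (n + 2) * 1050; printf "%d uL (%.1f mL)\n", v, v / 1000 }'
```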

3.2.3 RNA Extraction (See Note 4)

RNA extraction was adapted from Horn et al. [25].

  1. 1.

For a 2 mL sample vial containing 1000 μg of tissue, add 600 μL of CTAB extraction buffer heated to 65 °C. Ensure that the PVPP is resuspended in the buffer prior to adding it to samples. This mass-to-volume ratio of sample to buffer is recommended; if using a larger or smaller amount, adjust the reagent volumes in the protocol accordingly. Density of the sample tissue matters; however, as long as the ground tissue does not take up more than ~1/4 of the extraction vial, there should be enough contact between the tissue's surface area and the extraction buffer for adequate yield.

  2. 2.

    Vortex samples at 500–1000 rpm for 10 min at 65 °C.

  3. 3.

    Add 600 μL of 24:1 chloroform/isoamyl alcohol (v/v) to each sample and vortex at 500–1000 rpm for 2 min at 65 °C.

  4. 4.

    Centrifuge samples at 13,000 × g for 10 min at room temperature, and transfer upper aqueous phase to a 2 mL autoclaved microcentrifuge tube seated on ice.

  5. 5.

    Add 450 μL of CTAB extraction buffer to organic phase remaining in sample tubes for a sequential extraction, and vortex at 500–1000 rpm for 2 min at 65 °C.

  6. 6.

    Transfer upper aqueous phase to the same tube on ice used for step 4. Briefly vortex to combine and return to ice.

  7. 7.

    Add 800 μL of 24:1 chloroform/isoamyl alcohol (v/v), briefly vortex to mix, and then centrifuge at 12,000 × g for 15 min at 4 °C.

  8. 8.

    Transfer 300 μL of upper aqueous phase into a fresh autoclaved 1.5 mL microcentrifuge tube on ice; then transfer another 300 μL of the remaining upper aqueous phase to a second autoclaved 1.5 mL microcentrifuge tube. The division of sample into two separate tubes allows for the following precipitation to be carried out at a lower volume, thus resulting in a faster reaction with the RNA extract. Ensure that tubes are labeled as both subsamples will be recombined into one sample in further steps following precipitation.

3.2.4 Precipitations and Resuspension

All discarded supernatants/solutions should be placed appropriately in chemical waste.

  1. 1.

    Add 30 μL of 3.2 M NaOAc (pH = 5.5) to each subsample.

  2. 2.

    Add 75 μL of RNA-grade glycogen (0.5 mg/mL) to each subsample.

  3. 3.

    Add 600 μL of 100% EtOH to each tube. Vortex well for 5–10 s; then incubate for 12 h at −20 °C.

  4. 4.

    Centrifuge subsamples for 1 h at 13,000 × g and 4 °C. A small white RNA pellet may be visible at this point, but its absence does not indicate an unsuccessful extraction thus far. Carefully decant supernatant into waste without disturbing RNA pellet by slowly pouring out. This and further decanting should be done instead of removing supernatant with a pipette to reduce degradation/contamination and damage to the pellet.

  5. 5.

    Rinse subsamples by adding 1 mL of 70% EtOH; then vigorously flick sample to detach pellet from sidewall of microcentrifuge tube (the pellet should be seen floating in the solution, if a pellet was observed). If no pellet is present, flick and agitate sample for 1 min. Vortex subsamples at 500–1000 rpm for 15 min at 4 °C.

  6. 6.

    Centrifuge at 21,100 × g and 4 °C for 15 min. Again, carefully decant supernatant into waste.

  7. 7.

Repeat steps 5 and 6 two more times, resulting in a total of three washes of the RNA pellet.

  8. 8.

Open tubes and place laterally in a sterile laminar flow hood for ~15 min, with the opening of the tube facing the flow to dry the pellet. This is recommended instead of drying on ice because it takes less time, thus reducing the chance of contaminants being introduced to the pellets.

  9. 9.

    Resuspend pellets in 103 μL of commercial-grade RNase-free water; then recombine the two corresponding, resuspended subsamples into a fresh, autoclaved 1.5 mL microcentrifuge tube. At this point there should once again be only one tube per sample.

  10. 10.

    Add 94 μL of 8 M LiCl to each sample. Vortex thoroughly for 5–10 s; then incubate for 12 h at −20 °C.

  11. 11.

    Centrifuge samples for 1 h at 13,000 × g and 4 °C. Carefully decant the small amount of LiCl supernatant into waste.

  12. 12.

    Rinse samples by adding 1 mL of 70% EtOH; then vigorously flick sample to detach pellet from sidewall of microcentrifuge tube (the pellet should be seen floating in the solution, if a pellet was observed). If no pellet is present, flick sample for 1 min. Vortex samples at 500–1000 rpm for 15 min at 4 °C.

  13. 13.

    Centrifuge at 21,100 × g and 4 °C for 15 min. Carefully decant supernatant into waste.

  14. 14.

Repeat steps 12 and 13 two more times, resulting in a total of three washes of the RNA pellet.

  15. 15.

Open tubes and place laterally in a sterile laminar flow hood for ~15 min, with the opening of the tube facing the flow to dry the pellet.

  16. 16.

    Dissolve pellet in commercial-grade RNase-free water. We recommend starting with 20–30 μL. Vigorously pipette the water up and down into the pellet until fully dissolved. If insoluble portions of the pellet are observed, remove liquid RNA solution, transferring to a new autoclaved 1.5 mL microcentrifuge tube. Discard insoluble pellet.

3.2.5 Quality Confirmation and Library Prep

  1. 1.

    An initial rough estimate of RNA quality can be determined via a spectrophotometer, with absorbance at 260/absorbance at 280 nm (A260/A280) in the range of 1.8–2.2. If the concentration exceeds 1000 ng/μL, dilute until it falls below this concentration, as A260/A280 can be distorted when analyzed at such a high concentration.

  2. 2.

Run capillary gel electrophoresis to determine RNA integrity according to the instrument manufacturer's settings. Integrity is reported as an RNA integrity number (RIN, on a scale of 1–10), usually directly by the instrument. In plant tissue, a RIN of 8–10 can be confidently used for RNA-Seq. However, since plants can contain significant amounts of other ribosomal RNAs, including those from the chloroplast, as well as other small RNAs, the original algorithm used for RIN calculation may produce distortions. Where at least one significant band is observed other than the 18S and 28S ribosomal RNA, a returned RIN value of 7 is considered high-quality RNA suitable for RNA-Seq. Ensure at this step that no genomic DNA is present; otherwise, a DNA cleanup kit may be needed for further purification.

This protocol will work with RNA-Seq data of any read length, paired- or single-end reads, and any strand specificity. However, it is very important to keep track of these parameters before proceeding, as all downstream usage of the RNA-Seq libraries will rely heavily on accounting for them. For instance, if using a typical Illumina TruSeq protocol, the reads are generally strand-specific, paired-end short reads, where the first strand is the reverse complement of the original cDNA template. The program FastQC is a good option for determining read quality and detecting adapter or contaminant presence, among other information concerning the usability of reads [19]. Instructions for interpreting FastQC output for different read types are included with the program. The remaining protocol assumes the reads are ready to be used.

3.2.6 Transcript Assembly: Reference Genome Available (Skip to Subheading 3.2.7 if There Is No Available Reference)

Open the Linux client; from here forward, all lines of computer code for Linux are prefaced by “$.” First, all experimental reads must be combined for this step, because we need to use all the evidence we can for accurate construction of transcripts. This can be done using the “cat” function in this way, as appropriate depending on your library:

$ cat Read_fileA.fastq.gz Read_fileB.fastq.gz > Reads_combined.fastq.gz
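This works because concatenated gzip files decompress as a single stream; the behavior can be verified on tiny stand-in files (the read names and sequences below are made up):

```shell
# Two miniature single-record FASTQ files, gzipped:
printf '@r1\nACGT\n+\nIIII\n' | gzip > Read_fileA.fastq.gz
printf '@r2\nTTGG\n+\nIIII\n' | gzip > Read_fileB.fastq.gz

# Concatenate and confirm both 4-line records survive (8 lines total):
cat Read_fileA.fastq.gz Read_fileB.fastq.gz > Reads_combined.fastq.gz
zcat Reads_combined.fastq.gz | wc -l
```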

  1. 1.

If a reference genome is available, download the full FASTA file along with any GFF/GTF (general feature format/gene transfer format) files if available. This can be done on NCBI (https://www.ncbi.nlm.nih.gov/), Phytozome (https://phytozome-next.jgi.doe.gov/), or any number of other databases. For this example, we will use a FASTA file called "example_genome.fasta" and an example GFF file called "example_genomeGFF.gff."

  2. 2.

Reads obtained from RNA-Seq first need to be aligned (i.e., mapped) to the genome using a short-read mapper such as HISAT2 or STAR (https://github.com/alexdobin/STAR). This orients the transcripts we are ultimately trying to quantify on the reference genome, which will assist the assembly of the transcripts in subsequent steps. We will perform this mapping using HISAT2. First, an index needs to be created from the reference genome to allow HISAT2 to map reads to it efficiently, using the following command:

    $ hisat2-build example_genome.fasta Ref_genome -p 8

    This will build the index files using a base name “Ref_genome” (see Note 5).

  3. 3.

    We will then map reads to the reference. For paired-end strand-specific short reads (common with current Illumina protocols), the script needs to be tailored to include right- and left-end reads and the orientation of strand specificity too:

    $ hisat2 -x Ref_genome -1 Reads_left.fastq.gz -2 Reads_right.fastq.gz -S Illumina_paired_alignment.sam -p 8 --dta --rna-strandness RF

    The --dta option tells the program to map in a way that allows downstream assembly to be performed. The --rna-strandness RF option tells HISAT2 that the reads are strand-specific and that the first strand corresponds to the reverse complement ("R") of the original cDNA template.

    If performing using single-end short reads:

    $ hisat2 -x Ref_genome -U Reads.fastq.gz -S Illumina_single_alignment.sam -p 8 --dta

    Finally, if performing using long reads, the parameters need to be relaxed a bit when mapping:

    $ hisat2 -x Ref_genome -U LongReads.fastq.gz -S Illumina_long_alignment.sam -p 8 --dta --score-min L,0.0,-2

  4. 4.

Once mapped, a SAM file that is generally very large (often tens to hundreds of GB) is generated. Compress this significantly by converting the alignment into a BAM file using the "view" function in samtools:

    $ samtools view -S -b -@ 8 Illumina_paired_alignment.sam > Illumina_paired_alignment.bam

  5. 5.

    The alignment then needs to be sorted using the “sort” function in samtools. This allows the entire alignment to be sorted by coordinates of the aligned reads in the reference genome:

    $ samtools sort -o Illumina_paired_alignment_sorted.bam -O bam -@ 8 Illumina_paired_alignment.bam

    Finally, if it was necessary to perform several alignments due to computing limitations, they can be merged like so:

    $ samtools merge -@ 8 -o alignments_merged.bam alignment1.bam alignment2.bam alignment3.bam

  6. 6.

    The program StringTie will be used to assemble transcripts from the alignment file since we have a reference genome (and potentially a corresponding GFF file). If a GFF file is available with the reference genome, it is strongly recommended to be used as StringTie will allow the features included (i.e., UTRs, exons, start and stop sites, genes) to guide the process. In this case one would perform the assembly like this for the paired-end strand-specific case (leave off --rf if paired-end reads are not strand-specific or if using single-end reads):

    $ stringtie -o StringTie_output.gtf -G example_genomeGFF.gff -p 8 Illumina_paired_alignment_sorted.bam --rf

    In the case of long reads:

    $ stringtie -L -o StringTie_output.gtf -G example_genomeGFF.gff -p 8 Illumina_long_alignment_sorted.bam

    The output of StringTie is a GTF file that contains the positions of each transcript within the reference.

  7. 7.

    The resulting GTF file can then be fed into gffread to extract the FASTA sequences for each transcript into one multi-FASTA file. First, index the reference genome FASTA to speed this step up using the "faidx" command in samtools:

    $ samtools faidx example_genome.fasta

    Then feed the GTF file into gffread:

    $ gffread -w stringtie_transcripts.fasta -g example_genome.fasta StringTie_output.gtf

    The resulting transcript FASTA will be used for calculating expression in later steps.

3.2.7 Transcript Assembly: No Reference Genome Available

Trinity requires a lot more computing power compared to StringTie and will potentially take a very long time if not using a high-performance computer. Be sure to utilize all available computing power in the script, as insufficient memory can cause a premature crash of the program (see Note 6).

  1. 1.

    If a reference is unavailable, a de novo assembly can be generated, where reads themselves are used to assemble transcript sequences via overlap based on a k-mer length. Trinity is an effective program for this and has been shown to surpass benchmarks desired for transcriptome assemblies. Trinity can be initiated like so for paired-end, strand-specific reads:

    $ Trinity --seqType fq --left Reads_left.fastq.gz --right Reads_right.fastq.gz --max_memory 16G --CPU 8 --full_cleanup --SS_lib_type RF --output ./trinity_out_paired

    For our single-end reads:

    $ Trinity --seqType fq --single Reads.fastq.gz --max_memory 16G --CPU 8 --full_cleanup --SS_lib_type R --output ./trinity_out_single

    Finally, if using long reads, no additional parameters are needed beyond those in the single-end example:

$ Trinity --seqType fq --single LongReads.fastq.gz --max_memory 16G --CPU 8 --full_cleanup --SS_lib_type R --output ./trinity_out_long

  2. 2.

The primary output from Trinity is a .fasta file, so gffread is not needed. If one needs to merge several assemblies, the "transabyss-merge" command from Trans-ABySS is an effective tool: it finds redundant transcripts, keeping the longer version and removing the shorter one. Since Trinity uses a constant k-mer size of 25, this needs to be accounted for in the merge command (if changed to any other setting, be sure to adjust --mink and --maxk accordingly):

    $ transabyss-merge assembly1.fasta assembly2.fasta assembly3.fasta --mink 25 --maxk 25

3.2.8 Gene to Transcript Matrix Preparation and Assembly Processing

It is important to note that at this point, all of the sequences in the FASTA file are at the transcript level, not the gene level. For a true differentially expressed gene (DEG) analysis, we need a tab-separated text file (.TSV) with the first column listing all transcripts and the second column giving each transcript's parent gene (it must be in this order). Trinity itself outputs a file with the suffix ".gene_trans_map," which maps transcripts to a parent "gene," which Trinity defines as a cluster of transcripts. The GTF output of the StringTie assembly also gives this information: the value of the "gene_id" field of each transcript feature is the gene, and the child transcript ID is in the "transcript_id" field. A quick way to format this is to use the AGAT package (see Note 7):

$ agat_sp_extract_attributes.pl --gff StringTie_output.gtf --att transcript_id,gene_id -t transcript -o gene_to_transcripts.output -m
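If AGAT is not installed, the same transcript_id/gene_id pairs can be pulled from a StringTie GTF with plain awk; a sketch on a single made-up transcript line (the gene and transcript IDs are placeholders):

```shell
# One fabricated StringTie transcript feature to demonstrate the parsing:
printf 'chr1\tStringTie\ttranscript\t100\t900\t.\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";\n' > StringTie_demo.gtf

# Extract transcript_id and gene_id from column 9, transcript first:
awk -F'\t' '$3 == "transcript" {
  match($9, /transcript_id "[^"]+"/); t = substr($9, RSTART + 15, RLENGTH - 16)
  match($9, /gene_id "[^"]+"/);       g = substr($9, RSTART + 9,  RLENGTH - 10)
  print t "\t" g
}' StringTie_demo.gtf
```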

  1. 1.

It is extremely important at this point to make sure that the assembly's .fasta file has correctly formatted headers corresponding to the "gene_to_transcripts.output" file. If the sequence IDs that link the .fasta sequences to the transcripts in the "gene_to_transcripts.output" or ".gene_trans_map" files appear after any whitespace (tabs or spaces), they may get erased by filtering, and the link between genes and the filtered transcripts will be lost. In particular, if StringTie was used, the "gene_id" and "transcript_id" values extracted from StringTie_output.gtf above may get cut out of the .fasta file after processing. If we have a "gene_to_transcripts.output" or ".gene_trans_map" formatted like the following:

    Transcript_1_name	Gene_1_name
    Transcript_2_name	Gene_1_name
    Transcript_3_name	Gene_2_name
    Transcript_4_name	Gene_3_name
    Transcript_5_name	Gene_4_name

    then headers in the corresponding fasta file should look like this:

    >Transcript_1_name

    and not this:

    >NonamEVm013594t1 Any; other; text; oid=Transcript_1_name;

If this latter example is the case, one option is to open the .fasta file in any text editor and delete everything before "Transcript_1_name" with the find-and-replace function; this is useful if it is formatted unusually. In the more common situation shown here, however, everything in the header before and including "oid=" needs to be removed, along with the trailing ";", which can be done using the sed command like this:

    $ sed -i 's/>.*oid=/>/' stringtie_transcripts.fasta

    $ sed -i '/^>/s/;$//' stringtie_transcripts.fasta

    then check the headers to confirm they are correct:

    $ head stringtie_transcripts.fasta
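The sed cleanup above can be rehearsed on a throwaway FASTA before touching the real assembly, so the effect of each expression can be inspected:

```shell
# One header in the problematic format, plus a dummy sequence line:
printf '>NonamEVm013594t1 Any; other; text; oid=Transcript_1_name;\nATGC\n' > demo.fasta

sed -i 's/>.*oid=/>/' demo.fasta       # drop everything up to and including "oid="
sed -i '/^>/s/;$//' demo.fasta         # strip the trailing ";" on header lines only
head -n 1 demo.fasta                   # -> >Transcript_1_name
```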

  2. 2.

    Assemblies generated via de novo programs such as Trinity can often contain many more isoforms than a guided assembly, and even after merging there may be significant amounts of fragments, duplicates, and chimeras. There are many programs available for filtering these assemblies. In this case we will use “tr2aacds” from EvidentialGene to filter our raw assembly (either “stringtie_transcripts.fasta” or the Trinity output .fasta file shown here as “assembly.fasta”):

    $ tr2aacds.pl -log -cdna assembly.fasta -tidy -NCPU 8 -MAXMEM 8000

    (see Note 8).

    Navigating the output folders, find the file named something like “assembly.okay.mrna,” usually in the “okayset” folder. This will be your filtered assembly. For the purpose of this protocol, we will rename “assembly.okay.mrna” to “assembly.fasta.” If a StringTie assembly is being used, it is generally less important to do this step as StringTie is more conservative in its approach to constructing transcripts (however it should be considered depending on the assembly).

3.2.9 Transcript Quantification

  1. 1.

With the final assembly in hand, the reads can be mapped directly to it using Salmon. Note that many quantification programs require that the reads be mapped to a reference genome, but Salmon is convenient because the reads are mapped directly to the transcriptome assembly. First, align the reads for each biological sample, again using HISAT2 but this time to the assembly (indexed first), to obtain a separate BAM file for each biological sample. Note that the resulting BAM alignments should not be coordinate-sorted; Salmon requires unsorted alignments for quantification:

    $ hisat2-build assembly.fasta assembly -p 8

    $ hisat2 -x assembly -1 Sample1_Reads_left.fastq.gz -2 Sample1_Reads_right.fastq.gz -S Sample1_alignment.sam -p 8 --rna-strandness RF

    $ samtools view -S -b -@ 8 Sample1_alignment.sam > Sample1_alignment.bam

    Then, perform the quantification on the resulting BAM file:

    $ salmon quant -t assembly.fasta -l A -a Sample1_alignment.bam -o ./sample1_algn_quant

    The resulting output will be a file with suffix “quant.sf,” which contains the transcripts per million (TPM) and the number of reads mapped for each transcript, as well as some other information (see Note 9).
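The TPM column can be pulled out of a "quant.sf" file directly in the shell; a sketch on a fabricated two-transcript file with the standard Salmon header (Name, Length, EffectiveLength, TPM, NumReads — the transcript names and values below are made up):

```shell
# Fabricated quant.sf with the standard tab-separated Salmon columns:
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\nSTRG.1.1\t1500\t1350\t12.5\t210\nSTRG.2.1\t900\t750\t0.0\t0\n' > quant.sf

# Print each transcript with its TPM (column 4), skipping the header:
awk -F'\t' 'NR > 1 { print $1 "\t" $4 }' quant.sf
```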

  2. 2.

With quantification done, it is time to switch to R; from here forward, all lines of computer code for R are prefaced by ">." The data has to be imported in a specific way for the DEG analysis to work. A convenient pipeline for this whole process is mediated by the "tximport" package, which allows the "quant.sf" files generated by Salmon to be imported directly into R for DEG analysis. First, convert the "gene_to_transcripts.output" file into .csv format to read into R. Then, in R, load tximport and read in the gene-to-transcript map:

    > library(tximport)

    > gene.and.trans <- read.csv("/path/to/gene_to_transcripts.output.csv")

    Then, store the path of the directory containing the "salmon" output folders:

    > workdir <- "/path/to/my/salmon/output"

    Next, import into R a list of the samples that will be included in the analysis. The list should be a .txt file (“sample_names.txt” here, with control and treated samples) formatted so that each sample listed corresponds to the folder name containing each sample’s “quant.sf” file in the working directory:

    sample_name
    Control1
    Control2
    Control3
    Control4
    Treated1
    Treated2
    Treated3
    Treated4

    taking care to have the header on line 1. You can then read this list into an R object along with each sample’s corresponding folder in the working directory:

    > sample_names <- read.table(file.path(workdir, "sample_names.txt"), header = TRUE)

    > samples <- file.path(workdir, sample_names$sample_name, "quant.sf")

    > names(samples) <- c("Control1", "Control2", "Control3", "Control4", "Treated1", "Treated2", "Treated3", "Treated4")

    Now, to put all expression data in one place using tximport, with the transcripts and corresponding genes also taken into account:

    > tx.imported.data <- tximport(samples, type = "salmon", tx2gene = gene.and.trans, dropInfReps = TRUE)
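The .csv conversion mentioned at the start of this step can be done in the shell before switching to R; a minimal sketch using placeholder transcript and gene names in a tab-separated file:

```shell
# Placeholder two-column gene-to-transcript map (tab-separated):
printf 'Transcript_1_name\tGene_1_name\nTranscript_2_name\tGene_1_name\n' > gene_to_transcripts.output

# Swap tabs for commas to produce the .csv read by read.csv() above:
tr '\t' ',' < gene_to_transcripts.output > gene_to_transcripts.output.csv
cat gene_to_transcripts.output.csv
```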

  3. 3.

    Next load “DESeq2” (> library(DESeq2)) and create a data frame associating each sample to its corresponding treatment factor, and assign the raw counts from each “quant.sf” to the samples:

    > DESeq2.dataframe <- data.frame(condition = factor(rep(c("Control", "Treated"), each = 4)))

    Note that the “each” argument indicates how many replicates there are per treatment factor.

    > rownames(DESeq2.dataframe) <- colnames(tx.imported.data$counts)
    > DESeq2.DEG.input <- DESeqDataSetFromTximport(tx.imported.data, DESeq2.dataframe, ~condition)

  4.

    It is a good idea to filter out lowly expressed transcripts, since these may either be misassemblies from StringTie/Trinity or transcripts that are so lowly expressed that any change would be considered biologically irrelevant to the question. For this example, we will consider that to be any transcript in which the sum of all samples’ counts is under 20:

    > filtered.data <- rowSums(counts(DESeq2.DEG.input)) >= 20
    > DESeq2.DEG.input <- DESeq2.DEG.input[filtered.data,]

  5.

    Finally, run the DESeq2 analysis and collect the results into a .csv file:

    > DESeq2.DEG.analysis <- DESeq(DESeq2.DEG.input)

    Then, make a results object:

    > DESeq2.output <- results(DESeq2.DEG.analysis, contrast = c("condition", "Treated", "Control"))

    Note that when setting the “contrast” argument, the treatment factor that is desired to be the reference is placed last. To get our results out at the significance level desired (padj = 0.01 here), sort and write to a .csv file:

    > DESeq2.output.significant <- subset(DESeq2.output, padj < 0.01)
    > write.csv(DESeq2.output.significant, "DEG_results.csv")

    Otherwise, to get the results for all genes regardless of significance, disregard the “subset” command.

3.2.10 Annotation

  1.

    After quantification, one may want to annotate the transcript sequences to get an idea of the function of each transcript (and thus the corresponding gene). This will be performed in the Linux console. A simple way to do this is to obtain the proteome fasta files (on NCBI these carry the extension “.faa”) of several related, annotated species and merge them into one file:

    $ cat species1_aas.fasta species2_aas.fasta species3_aas.fasta > database.fasta

  2.

    Blast2GO is a simple-to-use desktop program for annotating the assembly via homology to the protein sequences in the database just created. Once installed, import the final assembly by clicking the arrow in the “start” box and selecting “Load Sequences (fasta).” Then, click the arrow by the “blast” box; select “Make Blast Database” to convert the fasta database “database.fasta” to the formats needed by Blast2GO to perform a search. Once completed, click the “blast” box, select “Local Blast,” and click “Next >.” Use the “blastx-fast” option in “Blast Program” so that the assembled transcripts are searched based on their expected protein product(s). Leave all other options at default and click “Next >” until you can hit “Run.”

  3.

    Export the annotation as a fasta and/or as a table for later use as needed by going to “File > Export>” and selecting the appropriate option in the menu.

If interested in the identity of the gene (not transcript) sequences themselves and using the StringTie procedure, the locus coordinates of the transcript features associated with each gene must first be extracted. In essence, the genome coordinates from which the transcript(s) originate are isolated, the fasta sequences within those coordinates are obtained, and a homology search is run on those genome sequences to predict the protein product encoded by their child transcript(s). The following script will take the StringTie GTF file and format it as a tab-separated file (TSV):

$ agat_convert_sp_gff2tsv.pl --gff StringTie_output.gtf -o StringTie_output.tsv

Then extract only the information necessary for creation of a BED file. This can be done by rearranging the format with awk:

$ awk -v OFS='\t' '$3=="gene"' StringTie_output.tsv > stringtie_coords.1.tsv
$ awk -v OFS='\t' '{print $1,$4,$5,$10,$6,$7}' stringtie_coords.1.tsv > stringtie_coords.2.tsv
$ awk -v OFS='\t' '$4 != "N/A"' stringtie_coords.2.tsv > stringtie_coords.3.tsv
$ awk -v OFS='\t' '{sub("-1","-",$6);print}' stringtie_coords.3.tsv > stringtie_coords.4.tsv
$ awk -v OFS='\t' '{sub("1","+",$6);print}' stringtie_coords.4.tsv > stringtie_coords.bed
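If preferred, the same transformation can be collapsed into a single awk pass. This is a sketch equivalent to the five commands above, assuming the same AGAT TSV layout (column 3 = feature type, column 10 = gene ID, column 6 = score, column 7 = strand encoded as 1/-1):

```shell
# One-pass equivalent of the five-step filter/rearrange/strand-fix pipeline.
# Assumes AGAT's TSV layout as in the commands above.
awk -v OFS='\t' '$3 == "gene" && $10 != "N/A" {
    strand = ($7 == "-1" ? "-" : "+")
    print $1, $4, $5, $10, $6, strand
}' StringTie_output.tsv > stringtie_coords.bed
```

Either route produces the same “stringtie_coords.bed”; the step-by-step version makes each intermediate file inspectable, which can help when debugging column positions.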

This BED file will then be used by bedtools to obtain the genome sequences, using the original genome reference fasta as the sequence source:

$ bedtools getfasta -fi example_genome.fasta --bed stringtie_coords.bed -s -name -fo stringtie_genes.fasta

These fasta sequences can then be put into Blast2GO as above to predict identity (see Note 10).

The identity of Trinity gene sequences cannot be extracted this way, for two reasons: no reference genome sequence was available, and the definition of “gene” in Trinity is a “cluster” of transcripts with similar sequence content, which is thus not confirmed to originate from a single genomic locus. If a quality annotation was done and the right databases were used, one can infer the function of the gene itself based on what each transcript of the gene (i.e., cluster) was identified as, assuming consistency across all transcripts of that gene.
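That consistency check can be automated. The sketch below uses hypothetical annotation strings and assumes Trinity's standard `TRINITY_DNx_cY_gZ_iN` transcript naming; it keeps a cluster-level annotation only when all transcripts of the cluster agree:

```python
# Hypothetical sketch: infer a gene-level (Trinity "cluster") annotation from
# per-transcript annotations, keeping it only when the transcripts agree.
from collections import defaultdict

def infer_gene_annotations(transcript_hits):
    """Map transcript IDs like 'TRINITY_DN1000_c0_g1_i1' -> annotation string;
    return one annotation per gene cluster, or 'ambiguous' on disagreement."""
    by_gene = defaultdict(set)
    for tx_id, annotation in transcript_hits.items():
        gene_id = tx_id.rsplit("_i", 1)[0]  # strip the isoform suffix (_iN)
        by_gene[gene_id].add(annotation)
    return {g: (a.pop() if len(a) == 1 else "ambiguous")
            for g, a in by_gene.items()}

# Invented example hits (the annotation labels are placeholders):
hits = {
    "TRINITY_DN1000_c0_g1_i1": "sucrose synthase",
    "TRINITY_DN1000_c0_g1_i2": "sucrose synthase",
    "TRINITY_DN2000_c0_g1_i1": "PFK5",
    "TRINITY_DN2000_c0_g1_i2": "unknown protein",
}
print(infer_gene_annotations(hits))
# {'TRINITY_DN1000_c0_g1': 'sucrose synthase', 'TRINITY_DN2000_c0_g1': 'ambiguous'}
```

Clusters flagged “ambiguous” deserve manual inspection rather than automatic assignment.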

3.2.11 Overrepresentation Analysis

Since a typical RNA-Seq dataset contains potentially tens of thousands of transcripts, performing an overrepresentation analysis on the transcriptome data alone is recommended (there may be strong trends that can only be observed when using this dataset alone). There are numerous programs and methods for performing gene set overrepresentation analysis as well as characterization. This protocol uses the Gene Ontology Project database, which groups genes based on three categorization domains: biological process, molecular function, and the cellular component the gene/gene function is associated with. Each of these domains consists of a large hierarchy of further categories (or “terms”) that fall under its scope. For the purpose of this protocol, the assumption is that the list of DEGs is what is being analyzed. This analysis allows the user to determine whether specific pathways, reactions, functions, cell compartments, etc. are statistically overrepresented in the DEG list, which can identify what kinds of processes are altered between treatment factors.

  1.

    The PANTHER knowledgebase will be used for GO analysis. Go to the PANTHER website (pantherdb.org), click the “Tools” tab on the main page, and select “Gene List Analysis.” This tool takes the following input, according to the boxes pictured (Fig. 2): (1) a text box to enter a set of genes (DEGs or any genes of interest) and options for the input format, (2) an option to select the organism (the one most closely related to your species if your specific species isn’t listed), and (3) an option to select what kind of analysis to perform.

  2.

    To determine if the DEGs are associated with any GO term(s) in particular, only the list of DEG IDs is needed (no quantification data is necessary here). PANTHER allows multiple kinds of IDs to be used, but the most common are RefSeq, UniProt, and Ensembl accession numbers/IDs, and for plants TAIR IDs work as well. Simply use the Blast2GO output to assign these homolog IDs to the corresponding DEGs. For this example, multiple TAIR IDs (all DEGs in this example) were entered into box 1, Arabidopsis thaliana was selected in box 2, and “statistical over-representation test” in box 3. After selecting the overrepresentation test in box 3, there are numerous options for which databases to analyze. GO domains are often returned as massive lists of categories at every hierarchical level, making conclusions a bit obscure. The PANTHER knowledgebase includes a “slim” version of these databases containing only the most common subsets of GO terms of interest; these are shown as “PANTHER GO-Slim.” For this example, we are interested in biological processes associated with our DEG list, so we will select “PANTHER GO-Slim Biological Process.” Click “Submit.”

  3.

    The next form allows for submission of a reference list using many different options, which is useful if one wants to search DEGs against a custom database. For this purpose, we will simply select “Arabidopsis thaliana genes” under the “Default whole-genome lists” box. Finally, at the analysis summary screen (Fig. 3), parameters can be selected to be used for the test.

  4.

    Select the “Fisher’s Exact” test type and use the “calculate false discovery rate” option. The false discovery rate (FDR) allows greater power in estimating which GO terms are overrepresented but may increase the rate of falsely significant hits; if a more conservative approach is desired, select the Bonferroni correction. Finally, click “Launch analysis” and you will see a result similar to Fig. 4 if any significant hits were found.

Fig. 2
A screenshot of the input form for gene ontology analysis. It has 3 sections of options. 1. enter IDs and select file for batch upload, 2. select organism, and 3. select analysis.

Input form for gene ontology analysis on PANTHER website

Fig. 3
A screenshot of the summary of the analysis. It lists the analysis type, annotation version and release date, analyzed list, reference list, annotation data set, test type, and correction. A button for launch analysis is at the bottom left.

Summary of analysis and options using PANTHER-Go Slim overrepresentation test

Fig. 4
A screenshot has a table with 3 columns for PANTHER GO-Slim biological process, Arabidopsis thaliana, and client text box input. Client text box input has subcolumns titled #, expected, fold enrichment, plus or minus, raw P-value, and FDR.

GO overrepresentation test with TAIR IDs using PANTHER GO-Slim

In this example, the most specific GO terms that are significant are “response to osmotic stress,” “response to heat,” “response to gibberellin,” “response to wounding,” and “regulation of defense response.” The parent terms of these significant child GO terms are also listed if they are significant, which will give a more bird’s-eye view about which biological processes are significantly overrepresented by our DEG list. The correct interpretation of these data is that more DEGs have membership in each GO term with padj (FDR here) < 0.05 than would be expected by chance. For instance, if osmotic stress genes weren’t actually differentially expressed between the treatment factors (in this example, oilseed crop variety), the probability that 12 of the genes involved in osmotic stress are in the DEG list out of 19 total in the GO term is so low that the null hypothesis can be rejected, concluding that osmotic stress is altered between treatment factors.
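The arithmetic behind that rejection can be checked directly with the hypergeometric tail probability that underlies Fisher’s exact test. In the sketch below, the 12-of-19 counts come from the example above, but the DEG-list size (1,500) and reference-genome size (27,000) are invented purely for illustration:

```python
# Stdlib-only sketch of the overrepresentation p-value: the probability of
# seeing >= k term members in a DEG list of size n, drawn from a genome of
# size N containing K term members in total. List/genome sizes are assumed.
from math import comb

def hypergeom_pvalue(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric(N population, K successes, n draws)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

p = hypergeom_pvalue(k=12, K=19, n=1500, N=27000)  # expected count is only ~1.06
print(f"P(>= 12 of 19 osmotic-stress genes in the DEG list) = {p:.2e}")
```

Under these assumed sizes the p-value is vanishingly small, which is why the null hypothesis of no enrichment is rejected; PANTHER additionally corrects such p-values for multiple testing (FDR or Bonferroni).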

3.2.12 Coexpression Analysis

Experiments dealing with many treatment factors (i.e., multiple accessions of an organism, samples from different geographical locations, etc.) may require the investigator to examine trends of response variables with each other rather than each treatment factor’s specific response to individual variables. A coexpression analysis is a powerful tool for determining how genes, whether specific genes or large modules of genes, are correlated with one another based on expression. Certain coexpression approaches generally require the entire transcriptome to be analyzed; however, building a network to find central hub genes (those centrally tied to treatment factor effects) can be performed effectively on just the DEGs. Furthermore, biological processes can be correlated with the coexpression results via GO terms. An effective protocol for performing this with a DEG set of this type is available in Orlando Contreras-López et al. [26].

3.3 Multi-omics Integration

We now describe a routine integration of both transcriptomic and metabolomic data into a pathway analysis. This will assist in gaining biological context from our results, using a database that contains pathways whose genes and metabolites are represented in our data. This will be done by returning to MetaboAnalyst. Note: “gene” here refers to transcript, not locus, in a typical transcriptomic dataset.

  1.

    In the main module list (Fig. 1), select joint-pathway analysis. Before entering data, select the organism and metabolomics type for the left and right boxes, respectively. Since this protocol is using plants, we will select “Arabidopsis thaliana (thale cress)” for the organism gene names and “Targeted (compound list)” for metabolomics type.

  2.

    Enter data as tab-separated text for both options. The ID type must be specified for data to be properly connected to the databases. In the first column, add the feature ID followed by a tab-separated value for log fold-change. Click submit. Once submitted, MetaboAnalyst will automatically report which input features were successfully matched to the database. Be sure to check this for both genes and compounds before continuing.

  3.

    There are now a few important considerations to be made for an accurate analysis. For “Pathway Database,” select “Metabolic pathways (integrated).” This is appropriate when one is only interested in how both transcriptomic and metabolomic data work together; if also interested in gene-only pathways, one can select all pathways for this option.

  4.

    The algorithm parameters then need to be specified under “Algorithm Selection” prior to running the analysis. As above, this can change depending on what the user is looking for. In a typical integrative analysis that searches for hub features (genes or metabolites) connecting large branches of metabolic pathways, betweenness centrality is a good choice. If only interested in immediate relationships of pathway features, degree centrality is a quick choice that returns the hub features directly connected to the most other features as first neighbors.

  5.

    Finally, the integration method must be carefully considered. Targeted metabolomics datasets are generally many orders of magnitude smaller than transcriptomic datasets, as most RNA-Seq platforms simply collect all mRNA present while our metabolomics protocol contains pre-selected metabolites of interest. Combining queries directly would therefore be statistically inappropriate: there may be thousands of significant genes, but at most only as many significant metabolites as were measured, often only ~100 in targeted metabolomics. To get around this, the software can run the metabolite data alone, run the gene data alone, and then combine the results by selecting one of three different assumptions about feature weights [4, 5]:

    • Universally consider the relevance of metabolites and genes to be the same regardless of the pathway analyzed or database used; do this by selecting “combine p values (unweighted).” This may be helpful for a quick screen of affected pathways that ensures both datasets are always considered. In this situation, 4 significant metabolites would hold the same weight as 25 significant transcripts in any given pathway. Thus, the effect of the significant differences on overall pathway activity is considered equally regardless of how many significant features there were of each type; one can think of this as forcing half the effect on the pathway to come from metabolites and half from genes. One downside of this approach is that with databases or pathways containing many more genes than measurable metabolites (which they often do), one could artificially inflate the relevance of significant metabolites or artificially deflate the relevance of significant genes, since p-value weights are forced to be equal in both data types. This is common (among other reasons) because most pathways/organisms often have multiple genes encoding one reaction but generally only a reactant and a product metabolite for said reaction.

    • Take into account the number of metabolites and genes in the entire database when weighing results. If a database contains 500 metabolites but 6000 genes, the fact that metabolites are less well-represented is then accounted for in the calculations. If one is looking at global-scale effects on metabolism, where simply the number of affected pathways is of interest, this can be an effective option. The downside is that one does not have much conclusive power with regard to specific pathways: although there is a 1:13 metabolite-to-gene ratio in the database searched, the most significant pathways may be inaccurately represented, because an individual pathway could have a much lower metabolite-to-gene ratio, such as 1:100. In this situation, the user may erroneously conclude that the top pathway in the results was indeed the most affected; the feature weights would be completely misapplied if one tried to judge effects on that pathway versus others in the results.

    • Take into account separately the amount of metabolites and genes for each pathway contained in the database. If one wants to make biological interpretations by how certain pathways are ranked in comparison to others, this is the most appropriate option.
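The three weighting assumptions can be contrasted with a small numerical sketch. MetaboAnalyst’s internal implementation is not reproduced here; this uses a weighted-Z (Stouffer-style) combination, and all p-values and feature counts are made up purely to show how the weighting choice moves the combined p-value:

```python
# Stdlib-only sketch of weighted p-value combination (Stouffer's weighted Z).
# The p-values and gene/metabolite counts below are illustrative only.
from statistics import NormalDist

norm = NormalDist()

def combine_pvalues(p_gene, p_metab, w_gene, w_metab):
    """Combine one gene-level and one metabolite-level p-value with weights."""
    z = w_gene * norm.inv_cdf(1 - p_gene) + w_metab * norm.inv_cdf(1 - p_metab)
    z /= (w_gene**2 + w_metab**2) ** 0.5
    return 1 - norm.cdf(z)

p_gene, p_metab = 0.001, 0.20  # hypothetical per-omics pathway p-values

# (a) unweighted: genes and metabolites count equally
p_unw = combine_pvalues(p_gene, p_metab, 1, 1)
# (b) database-level weights: e.g., 6000 genes vs 500 metabolites overall
p_db = combine_pvalues(p_gene, p_metab, 6000 / 6500, 500 / 6500)
# (c) pathway-level weights: e.g., 100 genes vs 1 metabolite in this pathway
p_pw = combine_pvalues(p_gene, p_metab, 100 / 101, 1 / 101)

print(f"unweighted={p_unw:.4f}, database-weighted={p_db:.4f}, pathway-weighted={p_pw:.4f}")
```

With these invented numbers, both weighted schemes pull the combined p-value toward the (stronger) gene signal, while the unweighted scheme lets the weak metabolite p-value dilute it, mirroring the trade-offs described above.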

Following submission, proceed to observe the results.

3.4 Interpretation of Results

An overview of the pathway analysis is displayed, showing a graph depicting various metabolic pathways and how significantly they were affected by the experimental treatment. This is based on two metrics: (a) a log-transformed p-value on the x-axis indicating the overrepresentation (“enrichment” in MetaboAnalyst) of the differentially expressed features and (b) a pathway impact value on the y-axis, which indicates how perturbed a specific pathway was based on the location of significantly different features within that pathway. Hence, if using betweenness centrality as mentioned in Subheading 3.3, a significantly different gene/metabolite in a central location (e.g., a hub) within a metabolic pathway results in a high pathway impact value, approaching the maximum of 1.00. Staying consistent with the example stated in the introduction, consider that the results in Fig. 5 display the joint pathway analysis output when comparing embryos between the high-yielding oilseed crop variety “A” and the poorly yielding variety “B”.
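To make the hub intuition concrete, the toy sketch below contrasts degree and betweenness centrality on a small, invented pathway-like graph with one bridging node (stdlib only; MetaboAnalyst computes these centralities internally, and the node names are arbitrary):

```python
# Toy illustration: a bridging node can have a modest degree yet the highest
# betweenness, which is why a significant feature there drives pathway impact.
from collections import deque
from itertools import combinations

# Two triangles (A,B,C) and (E,F,G) joined by the bridge C-D-E.
edges = [("A","B"),("A","C"),("B","C"),("C","D"),
         ("D","E"),("E","F"),("E","G"),("F","G")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def shortest_paths(s, t):
    """All shortest simple paths from s to t (level-order BFS; tiny graphs)."""
    paths, best, queue = [], None, deque([[s]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break                      # all remaining paths are longer
        node = path[-1]
        if node == t:
            best = len(path)
            paths.append(path)
            continue
        for nxt in adj[node]:
            if nxt not in path:
                queue.append(path + [nxt])
    return paths

betweenness = {n: 0.0 for n in adj}
for s, t in combinations(adj, 2):
    sp = shortest_paths(s, t)
    for p in sp:
        for inner in p[1:-1]:          # interior nodes carry the pair's flow
            betweenness[inner] += 1 / len(sp)

degree = {n: len(adj[n]) for n in adj}
print("degree:     ", degree)
print("betweenness:", betweenness)
# D has only 2 neighbors but lies on every path between the two triangles,
# so it scores highest on betweenness despite its modest degree.
```

In pathway terms, a differentially expressed feature sitting at such a bridge (node D here) perturbs everything downstream of it, hence the large impact value.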

Fig. 5
A scatterplot of the results from joint pathway analysis plots −log10(p) versus pathway impact. Starch and sucrose metabolism is in (2.4, 7), pentose phosphate pathway in (1.7, 4.5), carbon fixation in photosynthetic organisms in (1.8, 4.5), and glycolysis in (2, 2.1). Values are approximated.

Example results from joint pathway analysis in MetaboAnalyst, with select pathways highlighted. (Adapted using open metabolomic and transcriptomic data from Johnston et al. [2])

Notable KEGG pathways overrepresented (“enriched”) by the significantly different features are the pentose phosphate pathway and carbon fixation in photosynthetic organisms, while the position of the significant features within the pathway was most influential in glycolysis and starch and sucrose metabolism [27, 28]. When clicking on an individual pathway, a detailed figure will be displayed (Fig. 6), showing the position of significant features within the pathway, and whether they were up- or downregulated. Now that the results have been obtained, multiple types of conclusions can be made.

  1.

    As shown in the pathway impact figure, most of what is generally considered central carbon metabolism (glycolysis, pentose phosphate pathway, tricarboxylic acid cycle, carbon fixation) is significantly impacted and overrepresented. In the example stated and shown in Fig. 5, variety “A”, with higher oil content in its seeds, has higher overall carbon flux toward oil. Previous research on the oilseed species pennycress found that all pathways involved in central carbon metabolism were significantly impacted when comparing a high- and a low-oil accession of the species [2]. Furthermore, upon inspecting which features were up- and downregulated in each pathway, evidence was found of increased glycolytic flow, malate precursor accumulation, increased pentose phosphate pathway activity, and increased anaplerotic pathway activity in the high-oil accession. Such comparisons between a high-yielding oilseed crop variety “A” and the poorly yielding variety “B” have the potential to identify targets for genetically engineering this crop to improve its oil content. This kind of inference may be illustrated using Fig. 6: PGL1 and PFK5 (green nodes) are highly downregulated in oilseed variety “A”, while metabolite content shows consistent upregulation. With this information, one could investigate the further hypothesis that carbon is redirected toward pentose phosphate pathway intermediates, which would result in higher production of NADPH to support oil synthesis [2, 29]. Furthermore, it appears that the rate of intermediate conversion is slowed, at least in part, by reduced expression of PFK5 and/or PGL1.

  2.

    Sometimes, more specific metabolic differences are observed within specialized parts of metabolism. Research on eggplant identified a significant contribution of glycine, serine, and threonine metabolism to nitrogen use efficiency, as serine was accumulated in nitrogen-use-inefficient genotypes of the species [30]. Furthermore, increased accumulation of starch was shown in the inefficient genotypes, which conceivably is related to carbon redirection toward storage.

  3.

    The application of this methodology is not limited to species producing conventional crop products. One group used joint pathway analysis to identify orchid responses to biotic stress, showing that flavonoid biosynthesis in the plant was significantly impacted by fungal interaction [31]. This research identified multiple secondary metabolites that accumulated in response to the interaction and proposed that this interaction could be exploited to use the Dactylorhiza sp. orchids as an agronomic source of such compounds.

  4.

    This methodology is equally applicable to animal systems provided proper considerations are made. Wooden breast myopathy, a trait negatively affecting chicken meat quality, was investigated by a research group using combined transcriptomics and metabolomics [32]. The pathway analysis directed researchers to specific metabolic associations with progression of the disease, one of which was the downregulation of carnosine synthase expression. This was corroborated by the inclusion of metabolomics, which showed that the products of carnosine synthase, carnosine and anserine, were also reduced in diseased samples.

Fig. 6
A network diagram for the pentose phosphate pathway. The network has bright circular nodes for C0025 and bright rectangular nodes for PFK5 and PGL1. The other nodes include AT2G16790, C01236, C01172, and PGI1. A 3-node network below connects C00673, AT1G17160, and C01801.

Detailed view of pentose phosphate pathway from joint pathway analysis, including example data. (Adapted using open metabolomic and transcriptomic data from Johnston et al. [2])

4 Notes

  1.

    The guides and tutorials for the Analyst software are installed automatically with the software and are available from the Start menu. For more information on how to use the Analyst software, please check the software manual: https://sciex.com/content/dam/SCIEX/pdf/software/an_ref_d1000064246_en.pdf.

  2.

    The use of the samples without dilution is discouraged, as we have observed considerable peak distortions in such cases.

  3.

    For amino acids, switching between positive and negative polarities was used. Negative ionization was selected specifically to differentiate homoserine from threonine, which has unique product ions and therefore transitions.

  4.

    For leaf tissue, RNA concentration is generally high enough to obtain a good yield and quality from kits. The following extraction protocol applies to seeds and other tissues that may contain oils, lignin, high protein content, and other insoluble biomass components.

  5.

    Commands using -p or -@ indicate how many cores are to be used. If performing on a supercomputer, you may want to increase this to save time and use all available computing power. These steps can take quite some time otherwise.

  6.

    The flag “--max memory” tells Trinity how much memory to use per core. This example thus assumes each core has 2 GB of memory.

  7.

    Many warnings and notices may appear while AGAT is running. This is usually just logging information; make sure the output is correct in these cases.

  8.

    The -maxmem flag in this case is in MB, not GB.

  9.

    This assumes the biological sample reads are paired-end, strand-specific Illumina reads. If not, use the appropriate algorithm in the HISAT2 instructions above. Since we are no longer doing assembly, the “--dta” option is not necessary.

  10.

    Generally, this predicted gene product will be the same as all of the gene’s corresponding child transcript(s), but it is good practice to confirm.

  11.

    Different instruments or sequencing data from different technologies always require adjustments when performing research. Even two groups using their own instruments of the same model may find big differences in signal intensity, reads per lane, sequencing errors, etc.

  12.

    Programs are regularly updated, so options and menus may differ if using different versions.

  13.

    What is considered statistically appropriate is always changing. Assumptions about one’s data and what kind of analysis is to be performed may be considered void with the advent of new methods. We encourage regular literature reviews to stay current and obtain the maximum research impact possible.