Background

The diversity of characters among members of a species is an inherent feature of biological complexity. Most studies of biological diversity in crops have focused on morphological characters and DNA markers, covering both ends of the path of gene expression from genome to phenotype. Genome analysis records and compares the genetic make-up of lineages or individuals based on DNA sequences or fragment patterns. Both sequence analysis and DNA fingerprinting sample genome diversity, which is independent of environmental conditions and the developmental stage of the organism [1]. AFLP markers [2] are anonymous and are generally thought to be selectively neutral, which probably holds true for many kinds of DNA markers [3]. Even whole-genome sequencing of populations, the ultimate genome diversity survey tool, reveals at most the potential of a population to express various phenotypic features. Approaches based on transcriptomics and proteomics can identify gene expression patterns that underlie the current phenotype and that are affected by environment and the developmental stage of the organism. The relationship between the abundance of mRNA and protein molecules on one side and of phenotypic features relevant for crop production on the other is obscure and cannot yet be exploited for breeding purposes even in major crops with extensive genomic resources, let alone in minor or orphan crops.

A third level of gene expression, represented by the metabolic constitution of the organism, is directly related to features that are important in plant production. We are interested in secondary metabolites, because these natural products provide most of the chemical diversity in plants, and are a key factor (i) affecting the resistance of crops to pathogens and pests, and (ii) controlling commercially relevant traits such as taste, color, aroma and antioxidative properties.

The metabolic phenotype of an organism is analyzed by metabolomics, whose final goal is to identify and quantify all of the metabolites present in a sample [4, 5]. Such a complete inventory is not attainable with current technology even for model organisms, so different types of metabolite analysis with more limited scopes serve as surrogates. Metabolic fingerprints are static sets of analytical signals originating from small molecules (e.g., HPLC peaks, TLC spots, or mass spectra), which can be used for diagnostic purposes or to confirm the origin of a sample. In metabolic profiling, which is analogous to transcription profiling, metabolic signals, either anonymous or assigned to structures, are generated and evaluated quantitatively for samples originating from different varieties, physiological states or treatments. The term profiling is also used for a comprehensive analysis of a class of substances defined by common structural features (e.g., oxylipin profiling). Alternative definitions of metabolic profiling and fingerprinting [6, 7] are likely to cause confusion whenever metabolic analysis and genome fingerprinting are treated jointly.

Sesame (Sesamum indicum L.) is one of the most ancient crops [8, 9]. Sesame seed is highly nutritive (50% oil and 25% protein) and may be consumed directly or pressed to give an oil of excellent quality. Most studies of secondary metabolites in sesame have focused on the lignans sesamin, sesamol, sesamolin and sesaminol [10–13] in seeds. These natural products have antioxidative properties and may confer health-promoting qualities on products containing sesame seeds or oil [14–17]. Sesame lignans also may play a role in the resistance of sesame to insect pests and microbial pathogens [18–23]. The metabolism of sesame lignans after ingestion is understood only to a limited extent [24]. Metabolic profiling has not been a part of diversity studies in sesame.

Our objective in this study was to compare metabolic and genomic diversity in sesame and to discern the relationship between the two sets of data. Based on the difference in the diversification of sesame at the genomic and metabolic levels we will assess the usefulness of metabolic profiles in the identification of parent lines for breeding programs and in the selection of accessions for biodiversity preservation in sesame.

Results

Sesame accessions for this study were selected based on previously published AFLP data and represent most of the genome diversity in sesame from India, Western Asia, Sudan and Venezuela. Eight of these accessions have Jaccard similarity coefficients from pairwise comparisons that range from 0.39 to 0.85. These accessions encompass nearly all of the genome diversity detected by AFLP in the two-dimensional space of principal coordinate analysis and represent the four previously described major clusters [25]. The two Venezuelan genotypes, an experimental line and a commercial cultivar, were included because they represent Venezuelan breeding products with a Jaccard similarity coefficient of 0.45 [26]. These genotypes represent the two major clusters comprising Venezuelan commercial cultivars and contain 80% of the total genetic diversity of sesame in Venezuela.

Three hundred and eighty one AFLP markers, ranging from 100 to 550 base-pairs, were recorded using 8 primer combinations. Ninety-five percent of the markers were polymorphic. Eighty-eight bands (23%) were unique, ranging from 5 (Turkey) to 21 (India 7) per accession (Table 1).

Table 1 AFLP: Primer combinations and polymorphism of DNA bands

The reproducibility of the metabolic analysis was very good: similarity and dissimilarity measures and principal component analysis showed negligible differences among three independent profiles generated from extracts of accession Sudan3 when each was compared with the other nine accessions. The average of the three replicates obtained for Sudan3 was used for all further analyses.

Eighty-eight dominant metabolic signals were selected based on the mass chromatogram quality index, 47 of them in negative mode ESI and 41 in positive mode ESI. More than 50% of the signals resulted from peaks eluting in a well-resolved area with retention times between 15 and 27 min. Thirty-four signals were common to all accessions, 16 in positive mode ESI and 18 in negative mode ESI. Eight signals were either accession-specific or present in all except one accession (Table 2). No association was found between the distribution of unique AFLP markers and accession-specific metabolic signals.

Table 2 Metabolic signals in sesame seed extracts used in the analysis

The coefficient of correlation between the similarity matrix based on correlation coefficients and the similarity matrix based on simple matching coefficients was 0.63 (P < 0.01). The correlation between matrices obtained from AFLP data and metabolic profiles was not significant. Comparisons of the matrices of metabolic data with the Jaccard's coefficient matrix of AFLP data resulted in a correlation coefficient of -0.09 (P < 0.33) for the simple matching coefficient matrix and -0.24 (P < 0.18) for the correlation matrix. There were consistencies in scatter plots for pairs of accessions that fell into the same category (high similarity, intermediate similarity, or low similarity) for genomic and metabolic data (Fig. 1). The accession pair Syria-Sudan3 had high similarity on both axes; the pairs Sudan2-43 × 32, India7-43 × 32 and India5-Sudan2 were dissimilar in both metabolic profiles and AFLP fingerprints; and the pairs India1-India8, India1-Syria, India1-Turkey, India1-43 × 32 and India8-Turkey had intermediate similarities.

Figure 1

Scatter plots comparing ordination based on AFLP (Jaccard's coefficient) with ordination based on metabolic profiles. Upper part: Metabolic profile comparisons based on quantitative variables (correlation coefficient). Lower part: Metabolic profile comparisons based on binary variables (simple matching coefficient). Accessions in pairwise comparisons which have a high, intermediate or low similarity for both approaches are labeled.

The biplot of the principal coordinate analysis based on AFLP data calculated from Jaccard's coefficient captured 64% of the total variation (Fig. 2). Accessions Sudan2 and India7 on one side, and commercial cultivars Inamar and 43 × 32 on the other, are the most distinctive. The biplots of principal component/coordinate analysis based on the correlation coefficient, which captured 62% of the variation, and on the simple matching coefficient, which captured 77% of the variation, had similar patterns in that accessions India5 and 43 × 32 formed one group and the remaining eight accessions formed a second group. Visual comparison of the biplots obtained with AFLP and metabolic profiles confirmed the classification of cultivar 43 × 32 as the most distinctive, which explains why Sudan2-43 × 32 was one of the most dissimilar pairs. Both biplots grouped Syria-Sudan3 as the most similar pair and India1-Syria, India1-Sudan3, India1-Turkey and India8-43 × 32 as pairs with intermediate similarities. The most important contradiction between the two biplots was the placement of India5, India7 and Inamar. Based on AFLPs, India7 and Inamar were the most distinctive accessions, whereas metabolic profiles grouped them together with six other accessions. The opposite situation was found for India5, which was classified as one of the most distinctive based on metabolic profiles but grouped together with five other accessions based on the AFLP data.

Figure 2

Biplot of principal coordinate analysis based on Jaccard's coefficient for AFLP.

Discussion

Seed metabolic profiles were unrelated to the geographic origin of the accessions studied, which is similar to results obtained previously for genome diversity as assessed by AFLPs [25]. The relationship patterns generated for the AFLP data and for the seed metabolic profiles were different. No relevant data from other plant species are available for comparison, but there are two studies of the relationship between genomic and metabolic diversity in microorganisms. In rhizobia (bacteria), metabolic and genomic data (AFLP) were unrelated [27], while there were strong similarities between genome variation and metabolite diversity between two endophytic fungi [28].

If the number of characters reflects the sampling depth, then the metabolic profiles and AFLP fingerprints cover only a small portion of the underlying character sets. The AFLP-based analysis appears more robust because it was based on 363 polymorphic bands while only 88 signals were evaluated in the metabolic profiles. However, the metabolic profiles may contain more information because they are based on continuous rather than binary variables. To test this hypothesis we transformed the metabolic data into a binary matrix and compared the binary and continuous results. The quantitative information (normalized amplitudes of mass signals) did not affect the similarity patterns and therefore can be neglected in diversity surveys.

Diversity in AFLP patterns and diversity in metabolic profiles reflect different facets of genomic polymorphism. AFLPs are insensitive to gene expression and may occur most frequently in noncoding portions of the genome. Seed metabolic profiles result from biosynthetic activities in embryo and endosperm based on the expression of a small fraction of the total genome. If the samples are representative, then differences between the diversity patterns are due to differences in the diversification of sesame at genomic and metabolic levels. Because the majority of plant genomes consist of noncoding sequences, most changes in AFLP patterns are expected to result from neutral mutations fixed by genetic drift rather than by selection. On the other hand, all metabolites synthesized by a plant affect its fitness: apart from the metabolic costs incurred, anabolic processes are subjected to different selection pressures, both positive (e.g., resistance to pathogens, protection against light, improved dissemination of seeds) and negative (e.g., reduced attractiveness of seeds for disseminating animals because of a bitter taste, volatiles attracting pests, triggering of the germination of microbial pathogens). The synthesis of many secondary metabolites is known to be limited to conditions under which they enhance the fitness of their producer, limiting the costs of biosynthesis [41]. Metabolic profiles of sesame recorded under different environmental conditions are therefore likely to differ. For example, exposure to biotic stress is likely to generate defence-related signals, which may not be present in metabolic profiles of plants grown in the absence of pathogens and pests. Regardless of the progress in analytical technologies, chemical diversity revealed by metabolic profiling under a single set of conditions therefore remains an underestimate of the total metabolic capacity of sesame.

The genetic basis of the variation in the metabolic composition of plants was proven by the association between metabolic peaks detected by HPLC-MS and specific genomic loci in segregating populations of A. thaliana [29]. The disparity between the diversity patterns represented by AFLPs and by metabolic profiles thus provides insights into the processes that led to the composition of the current sesame genome. With the growing availability of instrumentation and software tools for nontargeted metabolite analysis by HPLC-MS [30, 31], large-scale metabolic profiling is becoming a feasible task for diversity studies in cultivated plants. From a practical point of view, crop improvement programs [32] will benefit from complementing diversity assessment based on DNA markers with metabolic profiling, particularly for plants such as pepper [35, 40], mulberry [36, 37, 39] and fenugreek [38], the commercial value of which is largely affected by complex mixtures of secondary metabolites.

In addition to genuine differences in similarity patterns between genomic and metabolomic profiles caused by differences in diversification rates, non-representative sampling also may lead to inconsistencies. The involvement of one accession in many pairwise comparisons would amplify this distortion. For example, two accessions in our set affect 17/45 pairwise comparisons. Thus a small number of biased data sets may alter the global pattern of biplots in a principal components or coordinates analysis. In this situation, scatter plots can identify which data sets are correlated, which are not, and which are not independent. Consistencies in scatter plots corroborate the representativeness of sampling in a particular pairwise comparison. For example, the pairs Sudan2-43 × 32, India7-43 × 32 and India5-Sudan2 were consistently the most dissimilar pairs in both the AFLP and the metabolic analysis. Similarly consistent were the comparisons of the pairs Syria-Sudan3 (highly similar for both approaches), and India1-Syria, India1-India8, India1-Turkey, India1-43 × 32 and India8-Turkey (intermediate similarity). Thus, the consistency of pairwise comparisons is independent of the similarity level.
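The figure of 17 out of 45 comparisons quoted above follows from simple counting, which the following minimal sketch reproduces. The accession labels are taken from the text; which two accessions are meant is not stated here, so the choice of India5 and 43 × 32 is an assumption made only for illustration (the count is the same for any two accessions).

```python
from itertools import combinations

# The ten accessions named in the text ("43x32" stands for cultivar 43 × 32).
accessions = ["India1", "India5", "India7", "India8", "Sudan2",
              "Sudan3", "Syria", "Turkey", "Inamar", "43x32"]

# Assumption for illustration only: any two accessions give the same count.
suspect = {"India5", "43x32"}

# All unordered pairs of accessions.
pairs = list(combinations(accessions, 2))

# Pairs that involve at least one of the two suspect accessions.
affected = [pair for pair in pairs if suspect & set(pair)]

print(len(affected), "of", len(pairs), "pairwise comparisons")
```

With ten accessions there are 10 × 9 / 2 = 45 pairs; each of the two accessions takes part in 9 pairs, and the pair formed by the two of them is counted once, giving 9 + 9 - 1 = 17.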

Selection on the metabolome of a plant could distort the congruency in diversification between neutral DNA markers (AFLP) and metabolic profiles in a manner dependent on the intensity and duration of the selection pressure. Comparative analysis of intra- and interpopulation diversity at the genomic and metabolic levels will aid our understanding of the effect of selection on the evolution of metabolic capacity. Dedicated statistical tools that test the congruency in diversification of the metabolome and the genome are not available. However, tools for diversity estimation established in population genetics can be applied, offering at least qualitative insights.

Plants subjected to selection for metabolic traits should evolve faster at the metabolic level than neutral DNA markers do at the genomic level, as the rate of fixation of neutral mutations is controlled only by the mutation rate and population size. For example, the Turkey-Syria accession pair appears to demonstrate the effect of selection on metabolic profiles of sesame (compare Figs. 2 and 3).

Figure 3

Biplot of principal components analysis based on correlation coefficient for seed metabolic profiles

Common selection pressure exerted on different genotypes may result in different outcomes, i.e., convergent evolution or increased diversification. Increased diversification occurs when the biochemical basis of traits under selection differs among genotypes, e.g., when unrelated metabolic pathways enhance resistance to a common pathogen. The accessions of the pair India7-Syria are very different at the genomic level but have similar metabolic phenotypes, a pattern that could have resulted from convergent evolution driven by common selection on genotypes with the same metabolic potential. Alternatively, neutral markers may have diversified over a long period of time, during which the metabolic phenotype was maintained by constant selection pressure.

The third situation encountered in our comparison of genomic and metabolic diversity in sesame was that the relative amount of diversification between the members of a pair was qualitatively similar at both the genomic and the metabolic levels. Thus pairs that were highly different at the genomic level also were highly different at the metabolic level. We suggest that varying selection and complex evolutionary histories might explain this kind of data. The analysis of the inheritance of metabolic patterns and of the association between metabolic and genomic markers might provide deeper insights. We have begun to generate segregating populations to address these questions.

The purpose of untargeted metabolic profiling in our work was to sample metabolic diversity without bias for the biological activity or practical relevance of the underlying compounds. One might want to know, however, whether metabolites of particular interest were recovered in the ethanol extracts used for the analysis. The most prominent metabolites of sesame are phenylpropanoids with one or more methylenedioxybenzene (piperonyl-) moieties such as sesamin and sesaminol. These lignans occur in free form and as di- and triglucosides and possess antioxidative properties. Certain sesame lignans lower blood and liver cholesterol levels, qualifying as health-promoting agents. In traditional analytical protocols, crushed sesame seeds are defatted with hexane prior to extraction with ethanol or methanol. The defatting step is often used in lignan analysis in order to improve the recovery [42], but the lignans of sesame can be extracted from oil directly into methanol [43, 44], indicating that defatting is not necessary. Indeed, an HPLC method for the analysis of sesame lignans based on extraction with 80% ethanol without defatting has been described [45]. Similarly, extraction of sesame with methanol without defatting was used for sesamin determination [46]. In line with these results, we observed that the recovery of eight sesame lignans did not improve substantially when seeds were defatted prior to ethanol extraction (data not shown); we therefore used ethanol extraction without defatting in the comparison of lignan content among sesame accessions [47]. As long as the life span of reverse phase columns is not a matter of concern, defatting seeds prior to extraction can be omitted.

Conclusion

Diversity patterns in sesame (Sesamum indicum L.) at the genomic level (neutral DNA markers) and at the metabolic level (nontargeted HPLC-MS profiles) differed, often showing a higher diversification rate at the metabolic level. For sesame breeders this means that the distances among accessions determined by genome fingerprinting need not reflect differences in metabolic capacity. Genetic analysis based on neutral markers is not an accurate predictor of the potential of parental lines for breeding programs aiming to improve traits controlled by the metabolic phenotype, such as resistance to pests or taste. The complementation of AFLP fingerprints by metabolic profiles for breeding and conservation purposes in sesame is recommended.

Methods

Plant material

Seeds were obtained from the Centro Nacional de Investigaciones Agropecuarias (CENIAP) Germplasm Bank, Venezuela (Table 3). Seeds were germinated and plants were grown in the greenhouse with a photoperiod of 12 hours dark and 12 hours light at 30°C.

Table 3 Sesame accessions

AFLP analysis

DNA was extracted from leaves and AFLP analysis was performed based on the protocol by Vos et al. [2] with minor modifications as previously reported [25, 26], using eight primer combinations (Table 1). AFLP reactions were performed twice for each accession, using restriction enzymes EcoRI and Tru1I (an isoschizomer of MseI; MBI Fermentas, Germany) and compatible primers (see Table 1 and Table 7 in [25]). Primers for pre-amplification were extended by one selective nucleotide (C for MseI and A for EcoRI). For selective amplification, the EcoRI primer carried a fluorescent label (Cy5). DNA fragments were separated on an ALFexpress II DNA analyzer (Amersham Pharmacia Biotech, Uppsala, Sweden). Automatic band recognition and matching were done with GelCompar II software (Applied Maths, Belgium). A threshold value of 5% relative to the maximum value within each lane was applied and only fragments identified in both replicates (between 94 and 100% of all bands recorded) were used for band matching. The results of band matching were encoded as a binary matrix, which was used for all further analysis.
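The band-calling rules described above (a 5% threshold relative to the lane maximum, retention of fragments found in both replicates, and encoding as a binary matrix) can be sketched as follows. This is an illustration only and not the GelCompar II procedure: the data layout (a dictionary mapping fragment size in base pairs to signal height), the function names, the toy values, and the assumption that fragment sizes have already been matched to common bins are all hypothetical.

```python
def called_bands(lane, threshold=0.05):
    """Return the fragment sizes whose signal reaches the threshold.

    `lane` maps fragment size (bp) to signal height; the cutoff is a
    fraction of the strongest signal in that lane (5% in the text).
    Whether a band exactly at the cutoff counts is an assumption.
    """
    cutoff = threshold * max(lane.values())
    return {size for size, height in lane.items() if height >= cutoff}


def consensus_bands(replicate_1, replicate_2):
    """Keep only fragments detected in both replicate reactions."""
    return called_bands(replicate_1) & called_bands(replicate_2)


def binary_matrix(bands_by_accession):
    """Encode presence (1) or absence (0) of every band per accession."""
    sizes = sorted(set().union(*bands_by_accession.values()))
    rows = {name: [1 if size in bands else 0 for size in sizes]
            for name, bands in bands_by_accession.items()}
    return sizes, rows


# Toy data: two accessions, two replicate lanes each (hypothetical values).
lanes = {
    "A": ({100: 1000, 150: 40, 200: 500},
          {100: 900, 150: 60, 200: 450, 300: 800}),
    "B": ({100: 700, 250: 300},
          {100: 650, 250: 280}),
}
bands = {name: consensus_bands(r1, r2) for name, (r1, r2) in lanes.items()}
sizes, matrix = binary_matrix(bands)
print(sizes)
print(matrix)
```

In the toy data, the 150 bp fragment of accession A falls below the threshold in the first replicate and the 300 bp fragment is missing from it, so neither enters the consensus.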

Metabolic profiling

Seeds originating from five plants per accession were bulked and 1 g of tissue was frozen with liquid nitrogen, ground in a mortar with a pestle and extracted anaerobically with a mixture of 80% ethanol (gradient grade, Roth, Germany) and 20% water for 16 h with stirring (100 rpm). The liquid phase was filtered through 0.2 μm filters and kept at -20°C until HPLC analysis.

For HPLC analysis, 10 μl aliquots of extracts were loaded onto a polar-modified RP-18 phase column (C18-Pyramid, Macherey-Nagel, Düren, Germany, 3 μm, 2 × 125 mm) and separated at 40°C with a gradient of 10% to 98% methanol at a flow rate of 0.2 ml min⁻¹. The eluent was subjected to electrospray ionisation (ESI). Ions were analyzed in both positive and negative full scan mode between 50 and 1000 m/z with an ion trap.

Data processing and analysis

Raw data from the metabolic study were processed with the CODA algorithm (background reduction and spike elimination [33]). Extracted ion chromatograms with a mass chromatogram quality (MCQ) index of at least 0.85 (according to the technical manual of ACD/MS Manager v. 8.0, Advanced Chemistry Development, Toronto, Canada) were generated and compared. Based on these chromatograms, peak tables were generated. The ten peaks with the highest MCQ values for each accession were selected. For each peak, matching peaks in all accessions were identified, building a set of peaks for use in further analysis. Isotope peaks, recognized by a difference of one unit in the molecular weight and the same retention time, were combined to generate one value per metabolite per accession.
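The merging of isotope peaks described above can be illustrated with a short sketch. The text specifies only the recognition rule (a mass difference of one unit at the same retention time); how the values were combined is not stated, so summing the areas is an assumption here, as are the tolerances, the function name and the toy peak list.

```python
def combine_isotopes(peaks, rt_tol=0.05, mz_tol=0.3):
    """Merge isotope peaks into one value per metabolite.

    `peaks` is a list of (m/z, retention time in min, area) tuples.
    A peak joins an existing group when its m/z is one unit above the
    last member of that group and the retention times agree.
    Assumptions: areas are summed; tolerances are illustrative.
    """
    groups = []
    for mz, rt, area in sorted(peaks):  # ascending m/z
        for group in groups:
            one_unit_heavier = abs(mz - group["last_mz"] - 1.0) <= mz_tol
            same_rt = abs(rt - group["rt"]) <= rt_tol
            if one_unit_heavier and same_rt:
                group["last_mz"] = mz
                group["area"] += area
                break
        else:
            # No matching group: this peak starts a new metabolite signal.
            groups.append({"base_mz": mz, "last_mz": mz,
                           "rt": rt, "area": area})
    return [(g["base_mz"], g["rt"], g["area"]) for g in groups]


# Toy peak list (hypothetical values): three peaks form one isotope
# cluster at about 20.3 min; the fourth elutes much earlier.
peaks = [
    (355.1, 20.30, 1000.0),
    (356.1, 20.30, 200.0),
    (357.1, 20.31, 30.0),
    (356.1, 12.00, 500.0),
]
merged = combine_isotopes(peaks)
print(merged)
```

The peak at m/z 356.1 and 12.0 min is kept separate because, despite the one-unit mass difference, its retention time does not match that of the cluster.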

Peak areas were standardized twice, first within every accession by dividing the area by the total sum of areas of all peaks for each accession to compensate for loading differences, and second within every m/z value (across accessions) by dividing peak areas by the maximum area within the m/z value. The purpose of the second normalization was to weight major peaks in each extracted ion chromatogram equally for statistical evaluation, because the relationship between the amount of a substance that enters the ion source and the magnitude of the signal recorded by a mass detector varies among metabolites. Due to the lack of a suitable criterion, no data pretreatment was applied [34]. The resulting matrix was used to calculate correlation coefficients as a measure of similarity between pairs of accessions. To assess the effect of differences in signal intensities within extracted ion chromatograms, the matrix of doubly-normalized intensities was transformed into a binary matrix by replacing all nonzero values with 1. Using the binary matrix, a simple matching coefficient was calculated for each pair of accessions. The correlation between the correlation coefficient-based matrix and the simple matching coefficient-based matrix was calculated by a Mantel test (500 permutations). To visualize the relationship among accessions according to their metabolite content, principal component analysis was conducted with the correlation matrix. Principal coordinate analysis was used for the simple matching coefficient matrix. Calculations of similarity and dissimilarity coefficients and principal component and coordinate analyses were performed with NTSYSpc 2.11T (Applied Biostatistics, Setauket, NY, USA).
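The two normalization steps, the binary transformation and the two similarity measures described above can be written out as a minimal sketch. It is not the NTSYSpc implementation; the function names and the toy peak areas (three accessions, four signals) are hypothetical and serve only to make the arithmetic concrete.

```python
from math import sqrt


def normalize_twice(areas):
    """Rows are accessions, columns are signals (m/z values).

    Step 1: divide each area by the accession total (loading differences).
    Step 2: divide each signal by its maximum across accessions.
    Every signal is assumed to be present in at least one accession.
    """
    by_accession = [[a / sum(row) for a in row] for row in areas]
    n_signals = len(areas[0])
    col_max = [max(row[j] for row in by_accession) for j in range(n_signals)]
    return [[row[j] / col_max[j] for j in range(n_signals)]
            for row in by_accession]


def binarize(matrix):
    """Replace every nonzero value with 1."""
    return [[1 if value != 0 else 0 for value in row] for row in matrix]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simple_matching(x, y):
    """Share of signals with the same state (both 1 or both 0)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Toy peak areas (hypothetical values).
areas = [
    [100.0, 50.0, 0.0, 50.0],
    [10.0, 20.0, 10.0, 0.0],
    [30.0, 30.0, 30.0, 30.0],
]
norm = normalize_twice(areas)
binary = binarize(norm)
print(norm)
print(pearson(norm[0], norm[1]))
print(simple_matching(binary[0], binary[1]))
```

After the second step the largest value of every signal equals 1, so a metabolite that ionizes poorly carries the same weight as one that ionizes well. Note that the simple matching coefficient counts shared absences as matches, unlike the Jaccard coefficient used for the AFLP data.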

A binary matrix from the AFLP data was obtained and a Jaccard's coefficient similarity matrix was calculated. The relationship among accessions was visualised by principal coordinate analysis. Comparison of the ordinations obtained from AFLP and metabolite content was based on Pearson's correlation and a Mantel test between the matrices with 1000 permutations. The two approaches also were compared by scatter plots to visualize the correlation. The variability range in the scatter plots was split into three sections (high similarity, intermediate similarity and low similarity) on both the X axis and the Y axis. Pairwise comparisons falling into the same category in both approaches were identified, i.e., pairs of accessions that were highly similar in both AFLP and metabolic data, pairs that possessed an intermediate similarity in both data sets, and pairs that were dissimilar in both genome and metabolome. The results of principal coordinate analysis performed on AFLP data and principal component analysis performed on metabolic data were compared visually.
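The matrix comparison described above can be sketched as a Jaccard similarity matrix followed by a permutation-based Mantel test. This is a generic textbook formulation and not the NTSYSpc routine: the one-sided p-value, the random seed, the function names and the toy binary profiles for five accessions are all assumptions made for illustration.

```python
import random
from math import sqrt


def jaccard(x, y):
    """Shared bands divided by bands present in at least one profile.
    Assumes the two profiles are not both empty."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either


def simple_matching(x, y):
    return sum(a == b for a, b in zip(x, y)) / len(x)


def similarity_matrix(profiles, coefficient):
    n = len(profiles)
    return [[coefficient(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lower_triangle(matrix, order):
    """Below-diagonal values after reordering rows and columns by `order`."""
    return [matrix[order[i]][order[j]]
            for i in range(len(order)) for j in range(i)]


def mantel(a, b, permutations=1000, seed=0):
    """Matrix correlation and a one-sided permutation p-value (assumption)."""
    rng = random.Random(seed)
    identity = list(range(len(a)))
    b_values = lower_triangle(b, identity)
    observed = pearson(lower_triangle(a, identity), b_values)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute the accessions of one matrix only
        if pearson(lower_triangle(a, order), b_values) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy binary profiles for five accessions (hypothetical values).
aflp = [[1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0], [0, 1, 1, 0, 0, 1],
        [1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 0, 1]]
metab = [[1, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0],
         [0, 1, 1, 1], [1, 1, 1, 1]]

jac = similarity_matrix(aflp, jaccard)
sm = similarity_matrix(metab, simple_matching)
r, p = mantel(jac, sm, permutations=1000, seed=1)
print(r, p)
```

Rows and columns are permuted together because the pairwise similarities within a matrix are not independent of one another; shuffling individual cells would break that structure and give misleading p-values.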