A genetical metabolomics approach for bioprospecting plant biosynthetic gene clusters
Plants produce a plethora of specialized metabolites to defend themselves against pathogens and insects, to attract pollinators and to communicate with other organisms. Many of these are also applied in the clinic and in agriculture. Genes encoding the enzymes that drive the biosynthesis of these metabolites are sometimes physically grouped on the chromosome, in regions called biosynthetic gene clusters (BGCs). Several algorithms have been developed to identify plant BGCs, but a large percentage of predicted gene clusters upon further inspection do not show coexpression or do not encode a single functional biosynthetic pathway. Hence, further prioritization is needed.
Here, we introduce a strategy to systematically evaluate potential functions of predicted BGCs by superimposing their locations on metabolite quantitative trait loci (mQTLs). We show the feasibility of such an approach by integrating automated BGC prediction with mQTL datasets originating from a recombinant inbred line (RIL) population of Oryza sativa and a genome-wide association study (GWAS) of Arabidopsis thaliana. In these data, we identified several links for which the enzyme content of the BGCs matches well with the chemical features observed in the metabolite structure, suggesting that this method can effectively guide bioprospecting of plant BGCs.
Keywords: Bioinformatics, Specialized metabolism, Natural products, Gene cluster, Genetics, GWAS, QTL, Metabolomics, Mass spectrometry, Comparative genomics
Abbreviations
BGC: biosynthetic gene cluster
mQTL: metabolite quantitative trait locus
RIL: recombinant inbred line
GWAS: genome-wide association study
LOD: logarithm of the odds ratio
SNP: single nucleotide polymorphism
Plant specialized metabolism is the source of hundreds of thousands of natural products. These molecules play key roles in plant development and ecology as, e.g., defense agents and signals, and are broadly applied as medicines, dyes, flavorings and cosmetics. With the sequencing of hundreds of plant genomes, genome mining has become a new strategy to uncover the biosynthetic pathways towards known molecules of interest as well as to identify pathways towards novel compounds. The recent discovery that significant numbers of plant metabolic pathways are encoded by physically clustered genes further facilitates the genome mining process, as it enables rapid identification of candidate pathways from genome sequences alone. Multiple tools have become available that automate the identification of these biosynthetic gene clusters (BGCs) in plant genomes [3, 4, 5]. Moreover, synthetic biology platforms have been developed for (transient) heterologous expression of such gene clusters in, e.g., tobacco and yeast, which allows relatively fast experimental exploration of plants' biosynthetic potential [6, 7, 8].
However, heterologous expression of BGCs still entails a significant amount of work. Moreover, it appears that a substantial proportion of predicted gene clusters may not be bona fide BGCs; in such cases, multiple enzyme-coding genes (while located adjacently on the chromosome and therefore triggering BGC prediction) do not in fact encode subsequent catalytic steps in one and the same pathway, and their transcription is not co-regulated. Indeed, Wisecaver et al. reported limited overlap between BGCs and coexpression modules they obtained from large-scale transcriptomics data (although they predicted these BGCs with methods not specifically designed for plants). Similarly, Kautsar et al. found that strong coexpression within a BGC could be detected for only around 25% of the gene clusters predicted in Arabidopsis thaliana. They identified two cases in which enzyme-coding genes within predicted BGCs clearly encoded unrelated steps in glucosinolate biosynthesis.
Hence, to capitalize on plant BGCs for the discovery of natural products and their pathways, new methods are required to prioritize predicted gene clusters. Besides transcriptomic analysis, another promising avenue is the combined use of metabolomics and genetic data to systematically connect gene clusters with known and as yet unknown metabolites based on natural variation [10, 11]. Indeed, several recent genetic studies in different plant species use untargeted metabolomics of plant populations to associate metabolite abundance quantitative trait loci (mQTLs) with enzyme-coding genes [12, 13, 14, 15, 16].
Here we argue that such metabolomics-based systems genetics approaches can be extended to systematically study plant BGCs and prioritize them for heterologous expression. To illustrate this, we use datasets from a recombinant inbred line (RIL) population from Oryza sativa and a genome-wide association study (GWAS) from Arabidopsis to establish a proof of principle, showing that studying the overlap of mQTLs from such data with predicted BGCs generates interesting hypotheses regarding the functional significance of these putative gene clusters.
Another interesting overlap in this rice dataset was found between a predicted polyketide BGC on chromosome 11 (18,762,365–18,822,272 bp) and several flavonoid mQTLs, including an mQTL for putatively identified isogemichalcone B (LOD-score: 4.5). The flavonoid mQTLs match well with the presence of chalcone/stilbene synthases in the predicted BGC.
We also used an mQTL dataset from a GWAS with 349 A. thaliana accessions genotyped at 214,051 markers (see Additional file 1). Unbiased metabolomics was performed with accurate mass LC–MS on full rosette leaf tissue grown under normal conditions, and raw MS spectral data were processed with MetAlign-MSClust. Linear mixed models in EMMA and GAPIT were applied to the genotype and metabolite profiling matrix, resulting in 1897 significant mQTLs (see Additional file 1). Again, the mQTLs were overlaid with BGCs predicted by plantiSMASH.
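The overlay step can be pictured as a plain interval-overlap test per chromosome. The following is a minimal sketch under stated assumptions, not the pipeline used in this study: the `BGC` and `MQTL` record types, the function names and all mQTL positions and feature names are hypothetical, and only the two BGC coordinate ranges are taken from the text.

```python
from collections import defaultdict
from typing import NamedTuple


class BGC(NamedTuple):
    """A predicted gene cluster (hypothetical record type)."""
    name: str
    chrom: str
    start: int  # bp, inclusive
    end: int    # bp, inclusive


class MQTL(NamedTuple):
    """An mQTL interval; a single GWAS marker has start == end."""
    metabolite: str
    chrom: str
    start: int
    end: int


def overlaps(bgc: BGC, qtl: MQTL) -> bool:
    """True if both lie on the same chromosome and their intervals intersect."""
    return bgc.chrom == qtl.chrom and bgc.start <= qtl.end and qtl.start <= bgc.end


def overlay(bgcs, mqtls):
    """Map each BGC name to the metabolites whose mQTLs overlap it."""
    hits = defaultdict(list)
    for bgc in bgcs:
        for qtl in mqtls:
            if overlaps(bgc, qtl):
                hits[bgc.name].append(qtl.metabolite)
    return dict(hits)


# BGC coordinates as quoted in the text; the cluster labels are made up.
bgcs = [
    BGC("saccharide_chr2", "2", 9_744_720, 9_841_503),
    BGC("cyp_chr5", "5", 23_184_526, 23_213_996),
]
# Invented mQTL positions, for illustration only.
mqtls = [
    MQTL("feature_A", "2", 9_800_000, 9_800_000),
    MQTL("feature_B", "5", 23_100_000, 23_190_000),
    MQTL("feature_C", "2", 1_000_000, 1_000_000),
    MQTL("feature_D", "2", 9_700_000, 9_750_000),
]
hits = overlay(bgcs, mqtls)
```

The nested loop is quadratic, which is acceptable for a few dozen clusters and a few thousand mQTLs; an interval tree or a sorted sweep would be the usual choice for larger inputs.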
By examining the mQTL-BGC overlaps in the Arabidopsis dataset (see Additional file 1), we identified several cases in which predicted BGCs overlapped with mQTLs corresponding to molecules which are in fact known to be synthesized by enzymes in specialized metabolic pathways encoded by non-clustered genes. For example, a putative saccharide BGC located on chromosome 2 (9,744,720–9,841,503 bp) overlapped with multiple mQTLs connected to flavonoid saccharides, including kaempferitrin, a kaempferol species that is O-rhamnosylated at the 3- and 7-positions. The predicted BGC contained a UDP-glycosyltransferase (AT2G22930), which is similar in sequence to quercetin 3-O-glucosyltransferases. It is possible that this glycosyltransferase is able to 3-O-rhamnosylate kaempferol, since kaempferol and quercetin differ only in one hydroxy group on the B-ring. Alternatively, if the UDP-glycosyltransferase only has substrate specificity for glucose and not rhamnose, the mQTL could be caused by an indirect effect, due to glucosylation competing with rhamnosylation of the same flavonoid substrate. Intriguingly, the predicted gene cluster also encodes multiple Scl acyltransferases, two of which (AT2G22990 and AT2G23000) have previously been shown to act as anthocyanin sinapoyltransferases. We observed a strong degree of coexpression (Pearson correlation coefficients of > 0.79) for the glycosyltransferase AT2G22930 with three Scl acyltransferases (AT2G22920, AT2G22960 and AT2G23000) in a leaf time-course analysis of the response to barley powdery mildew fungus Bgh (NCBI GEO dataset GSE39463, Additional file 2: Figure S2). Altogether, this result suggests that multiple enzymes encoded in this predicted gene cluster are involved in different types of flavonoid modification.
We also found mQTLs for kaempferitrin in three other loci, encoding Scl acyltransferases, a cytochrome P450 and a beta-glucosidase that may be involved in further modifying or breaking down the molecule. While the predicted BGC on chromosome 2 thus does not seem to encode a complete biosynthetic pathway by itself, it is still likely to encode multiple enzymatic steps involved in the same pathway, and may represent a case of ‘partial’ pathway clustering similar to cases reported for monoterpene indole alkaloid biosynthesis in Catharanthus roseus.
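The coexpression check reported above for the chromosome 2 cluster amounts to computing pairwise Pearson correlation coefficients between expression profiles and keeping pairs above a cutoff. The sketch below illustrates that calculation only; the gene labels and expression values are invented placeholders, not values from GSE39463, and the function names are hypothetical. The default threshold of 0.79 mirrors the value quoted in the text.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sd_x * sd_y)


def coexpressed_pairs(profiles, threshold=0.79):
    """Return (gene1, gene2, r) for every gene pair with r above the threshold."""
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r > threshold:
            pairs.append((g1, g2, r))
    return pairs


# Invented expression values across five time points, for illustration only.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = coexpressed_pairs(profiles)
```

In this toy input only geneA and geneB track each other, so they form the single reported pair; geneC correlates negatively with both.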
Of the four experimentally characterized BGCs in Arabidopsis, three (the thalianol, marneral and tirucalladienol clusters) are specifically expressed in roots [23, 24, 25]; hence, we did not expect to find mQTLs for these molecules in this GWAS dataset. The fourth, the arabidiol/baruol BGC, contains some genes that are expressed in both root and leaf (such as the baruol synthase PEN2, according to the Arabidopsis eFP browser [http://bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi]). Indeed, six mQTLs mapped to different genes in this BGC, and two of these mQTLs mapped specifically to the PEN1 and PEN2 oxidosqualene cyclase-encoding genes. The masses of the metabolites connected to these mQTLs represent as yet unknown compounds, and further research (e.g. MS/MS fragmentation analysis) is needed to confirm whether these masses belong to arabidiol/baruol derivatives or to unrelated metabolites affected through, e.g., downstream effects.
We also observed cases in which predicted BGCs may be regarded as putative false positives, in the sense that they do not contain enzyme-coding genes that are likely to function within the same pathway. For example, a methoxyglucobrassicin mQTL was associated with the cytochrome P450 gene CYP81F2 (AT5G57220, which has been implicated in glucobrassicin biosynthesis) within a predicted BGC (chromosome 5: 23,184,526–23,213,996 bp) containing no other enzyme-coding genes known to be involved in glucosinolate biosynthesis. For two other methoxyglucobrassicin mQTLs we could not derive a functional link with glucosinolate biosynthesis.
Finally, and perhaps most interestingly, 23 predicted gene clusters overlapped only with mQTLs of metabolites that have not been annotated yet, showing a clear potential for discovery of novel enzymes that can be tested through synthetic biology approaches to identify novel chemistry. Among such predicted gene clusters, one may identify likely bona fide BGCs by finding cases in which multiple mQTLs overlap with the same predicted BGC (and in the case of GWAS, with multiple different genes within it) and are likely to represent biosynthetically connected molecules, based on, e.g., having defined mass differences between them.
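One way to screen for such defined mass differences is to compare all pairs of metabolite masses linked to the same predicted BGC against a short table of monoisotopic shifts for common modifications. The following is a minimal sketch under stated assumptions: the shift table is a small, non-exhaustive selection, the 0.005 Da tolerance and the function name are hypothetical choices, and the example masses are illustrative neutral masses rather than values from the datasets used here.

```python
from itertools import combinations

# Monoisotopic mass shifts (Da) of a few common modifications (non-exhaustive).
KNOWN_SHIFTS = {
    "methylation (CH2)": 14.01565,
    "hydroxylation (O)": 15.99491,
    "deoxyhexosylation (C6H10O4)": 146.05791,
    "hexosylation (C6H10O5)": 162.05282,
}


def linked_pairs(masses, tol=0.005):
    """Return (lighter, heavier, modification) for mass pairs matching a known shift.

    masses maps a feature name to its neutral monoisotopic mass in Da;
    tol is the allowed absolute deviation in Da (an assumed value).
    """
    ordered = sorted(masses.items(), key=lambda item: item[1])
    links = []
    for (name_a, mass_a), (name_b, mass_b) in combinations(ordered, 2):
        delta = mass_b - mass_a
        for label, shift in KNOWN_SHIFTS.items():
            if abs(delta - shift) <= tol:
                links.append((name_a, name_b, label))
    return links


# Illustrative masses: an aglycone-like mass, the same plus one and two
# deoxyhexose units, and one unrelated mass.
masses = {
    "unknown_1": 286.04774,
    "unknown_2": 432.10565,
    "unknown_3": 578.16356,
    "unknown_4": 500.00000,
}
links = linked_pairs(masses)
```

Here unknown_1, unknown_2 and unknown_3 form a chain of two deoxyhexosylation steps, the kind of pattern that would make a shared BGC worth a closer look, while unknown_4 remains unlinked. A matching mass difference is only suggestive and does not by itself establish a biosynthetic relationship.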
Of course, it is also possible to look for overlap of mQTLs with enzyme-coding genes in a non-BGC-centric fashion. By way of example, we scanned all Arabidopsis mQTLs for enzyme families known to be involved in biosynthetic pathways using all profile Hidden Markov Models from plantiSMASH that are related to scaffold biosynthesis. While the results (see Additional file 1) include some potentially interesting links (e.g., linking TPS04 with a cyclohexene-related terpene and linking TPS08 with a naphthalene-related terpene), these mQTLs may also be caused by indirect effects, e.g. through affecting precursor pools, especially given the fact that these terpene synthases have been linked to the production of different terpenes in other studies [27, 28].
Our results confirm that predicted BGCs (and genes within them) can be meaningfully linked, through GWAS and RIL-based studies, to mQTLs explaining variation in their metabolic products, allowing the prioritization of BGCs for further analysis and the generation of new hypotheses about the functions of these predicted BGCs. Altogether, we conclude that the prediction of BGCs in plant genomes for bioprospecting can become a more powerful tool for discovery when complemented not only with coexpression analysis but also with untargeted metabolomics linked to genetic variation. Most importantly, this allows identifying predicted BGCs that overlap with genomic loci associated with the abundance of unknown molecules, which would constitute candidates for further investigation. At the same time, this makes it possible to distinguish these from genomic loci involved in the biosynthesis of well-known molecules whose biosynthetic pathways are known not to be clustered.
Thus, the metabolomics-based systems genetics approach described here has the potential to become an important technology for the systematic genome-wide assessment of biosynthetic genes that can be prioritized for heterologous expression using the latest synthetic biology methodologies.
Although we were able to predict several links between BGCs and metabolites, we did not have the resources to experimentally validate these links through mutagenesis or heterologous expression. It is possible that some mQTLs in fact represent indirect effects. Also, the RIL data from rice resulted in relatively broad mQTL regions, in which other genes may be hidden that could be causative of the metabolic differences underlying the QTLs.
Additionally, the metabolomics datasets were limited to shoots and leaves, while many key metabolites may be specifically produced in roots. Using a larger number of relevant conditions in the future (including root metabolomics and samples from biotic or abiotic stress treatments) will make it possible to connect metabolites to gene clusters and gene cluster-like genomic loci on larger scales.
Finally, in the Arabidopsis metabolomics study applied here, we did not generate dedicated metabolite fragmentation data, such as MS/MS and MSn spectra, making it still difficult to predict the nature of the unknown metabolites linked to putative BGCs of unknown function. New technologies that generate and exploit metabolite fragmentation data based on molecular networking and substructure identification [30, 31, 32] will make it easier to obtain structural information for such unknowns, and thus facilitate assessing whether the combination of molecular and genetic (enzymatic) features observed in an mQTL-BGC pair shows high potential or not.
HN and MHM conceived the original research plans; HN and MHM supervised the computational analyses; LW performed the computational analyses; LW, JJJvdH, MHM and HN analyzed the data; RK, RCHdV and JJBK contributed the genetic and metabolomic analysis in Arabidopsis; LW wrote the article with contributions of all the authors; MHM and HN supervised and completed the writing. MHM and HN agree to serve as the authors responsible for contact and ensuing communication. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Availability of data and materials
The mQTL dataset of Oryza sativa, supporting the conclusions of this study, is included in the published study of Gong et al. The predicted BGCs for O. sativa are included in Additional file 1. The mQTL datasets and the BGC predictions of Arabidopsis thaliana, supporting the conclusions of this study, are included in Additional file 1.
Consent for publication
Ethics approval and consent to participate
JJJvdH is supported by an ASDI eScience Grant (ASDI.2017.030) from the Netherlands Organisation for Scientific Research (NWO). MHM is supported by VENI Grant 863.15.002 from NWO. RCHdV was partially funded by the Netherlands Metabolomics Centre (NMC) and the Centre of Biosystems Genomics, which were both part of the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research.
- 15. Matros A, Liu G, Hartmann A, Jiang Y, Zhao Y, Wang H, Ebmeyer E, Korzun V, Schachschneider R, Kazman E, Schacht J, Longin F, Reif JC, Mock H-P. Genome-metabolite associations revealed low heritability, high genetic complexity, and causal relations for leaf metabolites in winter wheat (Triticum aestivum). J Exp Bot. 2017;68:415–28.
- 27. Herde M, Gärtner K, Köllner TG, Fode B, Boland W, Gershenzon J, Gatz C, Tholl D. Identification and regulation of TPS04/GES, an Arabidopsis geranyllinalool synthase catalyzing the first step in the formation of the insect-induced volatile C16-homoterpene TMTT. Plant Cell. 2008;20:1152–68.
- 28. Vaughan MM, Wang Q, Webster FX, Kiemle D, Hong YJ, Tantillo DJ, Coates RM, Wray AT, Askew W, O’Donnell C, Tokuhisa JG, Tholl D. Formation of the unusual semivolatile diterpene rhizathalene by the Arabidopsis class I terpene synthase TPS08 in the root stele is involved in defense against belowground herbivory. Plant Cell. 2013;25:1108–25.
- 31. Wang M, Carver JJ, Phelan VV, Sanchez LM, Garg N, Peng Y, Nguyen DD, Watrous J, Kapono CA, Luzzatto-Knaan T, Porto C, Bouslimani A, Melnik AV, Meehan MJ, Liu W-T, Crüsemann M, Boudreau PD, Esquenazi E, Sandoval-Calderón M, Kersten RD, Pace LA, Quinn RA, Duncan KR, Hsu C-C, Floros DJ, Gavilan RG, Kleigrewe K, Northen T, Dutton RJ, Parrot D, et al. Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat Biotechnol. 2016;34:828–37.
- 32. Ernst M, Nothias-Scaglia L-F, van der Hooft J, Silva RR, Saslis-Lagoudakis CH, Grace OM, Martinez-Swatson K, Hassemer G, Funez L, Simonsen HT, Medema MH, Staerk D, Nilsson N, Lovato P, Dorrestein P, Ronsted N. Did a plant-herbivore arms race drive chemical diversity in Euphorbia? bioRxiv. 2018;67:87. https://doi.org/10.1101/323014.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.