Background

German Chamomile (Matricaria recutita L. syn. Chamomilla recutita (L.) Rauschert; from hereon chamomile) is described to be native to the Near East and South-Southeast Europe [1] and is grown in many temperate regions of the world. As already mentioned by Hippocrates (fifth century BC, [2]), it has a long history of use in herbal medicine, with applications in gastrointestinal diseases, treatment of infections and inflammatory diseases of the skin, and respiratory problems. The flower heads contain medicinal compounds, and are either used dried (e.g. tea) or the essential oils are extracted. Chamomile has numerous secondary metabolites which are medicinally active, including the terpenes alpha-bisabolol (syn. Levomenol) and its derivates (bisaboloids), matricine/chamazulene, further terpenes, and several flavonoids [2].

Chamomile is mainly outcrossing [3, 4], and as most chamomile varieties are rather landraces, they display a high degree of phenotypic variability reflecting that of natural populations [5], from which they were isolated. Whereas wild populations of chamomile are diploid, agronomically cultivated varieties are either diploid or artificially generated tetraploid, the latter of which have larger flower heads and higher 1000-seed weight [6]. Genetic diversity in chamomile was evaluated in a few studies using a variety of different genetic markers (summarized in [7]), including AFLP [8], RAPD [8, 9] and ISSR [10]. Next-generation sequencing (NGS) approaches have been successfully used for the analysis and exploitation of genetic and genomic resources in crop plant breeding, including rapeseed, soybean, barley, or maize, e.g. [11, 12]. One NGS approach, genotyping by sequencing (GBS), has been widely used in many crop plants (e.g. [13,14,15,16,17,18]), is highly reproducible [13] and reduces genome complexity with restriction enzymes, thus, lowering costs. In contrast, the use of NGS / GBS for the amelioration of medicinal and aromatic plant breeding is relatively underexploited. An NGS approach has not yet been described for the non-model plant species chamomile.

In our work, we applied GBS to a set of chamomile origins (varieties, populations, accessions) to identify genome-wide, high quality single nucleotide polymorphisms (SNPs), to estimate genetic diversity and to reveal genetic structure within our cultivated chamomile germplasm collection. Therefore this is the first study reporting the generation and use of SNP markers in chamomile. We furthermore phenotyped the plants for flowering time (FT) and the content of bisaboloids by gas chromatography–mass spectrometry (GC-MS). Using phenotyping and genotyping data, an association study was performed to identify SNPs associated with FT and bisaboloids content, and to search for possible candidate genes underlying FT and the production of the medicinally-important compound alpha-bisabolol. In comparison to other marker platforms like AFLP or RAPD, the high marker density of genome-wide SNPs generated by GBS increases the probability of identifying markers closely linked to genes of interest. The resulting data provide a basis to enhance breeding in chamomile.

Methods

Plant material

Ninety one genotypes, representing 2–4 genotypes each from 33 origins of M. recutita L., and an additional 4 genotypes from 2 M. discoidea DC accessions as outgroup, were included in the analysis (Table 1). The focus of the study on cultivated chamomile, mainly encompassing origins from all over Europe, but also including some locations outside of Europe.

Table 1 Origin name, genotype code, described geographical origin and source of the investigated plants

For the purpose of this study, we define an origin as either a registered variety, a gene bank accession, a cultivated or wild population of chamomile. The term population applies to a wild natural population as well as to a cultivated one selected for breeding or agriculture. Genotype refers to a genetically individual plant.

Plant cultivation and ploidy analysis

Seeds were sown in the greenhouse on the 16th July 2014, and seedlings were individually transplanted into single pots (14 cm diameter) after reaching the 20-leaf stage. Between 2 to 4 genotypes per origin were selected for the subsequent analysis, with a focus on elite varieties/populations (4 genotypes each). Plant ploidy was determined by flow-cytometry according to Otto et al. [19] at the 10- to 20-leaf stage.

Plants were randomized in the greenhouse, organized in a block of plants of 6 × 16 plants, and grown until the 31st of January 2015 under natural light conditions (short day conditions, with day / night temperatures of 17–21 °C (12 h) / 14–19 °C (12 h)) and from February 2015 until the end of April 2015 under additional light for 16 h (long day conditions, MASTER Agro 400 W E40 1SL Na-bulb lamp, 2000 K colour temperature, Royal Philips N.V. Netherlands) with day / night temperatures of 18–21 °C (12 h) / 15–17 °C (12 h). Flowering time (FT) was measured as days after sowing (DAS).

Analysis of chemical compounds by gas chromatography - mass spectrometry (GC-MS)

The content (peak areas) of bisaboloids (terpenoids), namely bisabolol oxide B, bisabolon oxide A, alpha-bisabolol, and bisabolol oxide A, were analysed by gas chromatography for 80 samples (76 from M. recutita and 4 from M. discoidea). The content of the chemical compounds was determined for each genotype in a sample of pooled flower heads, where flower heads were harvested after the first 2 circles of florets started flowering, scored as the flowering time until maximum half of the florets were flowering. All plants were grown and processed equally. Samples were immediately frozen in liquid nitrogen and stored at −80 °C until further analysis to prevent the oxidation of alpha-bisabolol.

The analysis by GC-MS was done with modifications according to Köllner et al. [20] and Irmisch et al. [21]. For each sample, the finely ground powder on liquid nitrogen of 80 to 100 mg material from 5 flower heads was used. For each pulverized sample 200–400 μl of n-hexane was added, followed by overnight incubation with shaking at 25 °C. A total of 100 μl was transferred to a clean microtube fitted for GC vials. Terpene analysis was carried out with a gas chromatograph (GC 2010, Shimadzu, Duisburg, Germany) equipped with a splitless injector (injector temperature, 220 °C; injection volume, 1 μl) and coupled to a quadrupole mass selective detector (Shimadzu). H2 was used as carrier gas at a flow rate of 1 ml min-1. Samples were analysed on a Supreme-5 ms column (30 m length × 0.25 mm inner diameter × 0.25 μm film thickness, Chromatographie Service GmbH, Germany). The coupled mass spectrometer was operated with a transfer line temperature of 230 °C, a source temperature of 230 °C, a quadrupole temperature of 150 °C, an ionization potential of 70 eV and a scan range of 50–350 amu.

A flame ionization detector (FID) was used for the quantification of the compounds (μg/g of fresh weight), and operated at 250 °C. Compounds were identified by comparison of retention times and mass spectra to those of authentic reference substances obtained from Sigma-Aldrich (Steinheim, Germany), Roth (Karlsruhe, Germany), or by reference spectra in the Wiley and Shimadzu libraries [22].

DNA extraction and genotyping-by-sequencing (GBS)

DNA was isolated from the 95 samples according to the manufacturer’s instruction using the Agencourt Chloropure kit (Agencourt Bioscience Corp., Beverly, Massachusetts, USA). After isolation the DNA was analysed on a 1.3% Agarose gel (Bio&SELL Universal-Agarose), and the DNA concentration was determined relative to a DNA standard. DNA dissolved in TE buffer was shipped to the Cornell University Biotechnology Resource Center (BRC, Cornell, USA) for GBS analysis.

Sample preparation and sequencing was performed at the BRC using a protocol modified from Elshire et al. [3] (http://www.biotech.cornell.edu/brc/genomics-facility) using the enzyme ApeKI for digestion. GBS libraries constructed were sequenced in the BRC Genomics Facility on the Illumina HiSeq 2000/2500 (100 bp, single-end reads).

Sequencing data analysis and SNP detection

In the absence of a reference genome, the assembly was done de novo. The 156 × 106 barcoded reads from the 95 samples (genotypes) were demultiplexed, trimmed for restriction site and barcode, filtered and clustered using the software pipeline pyRAD v. 3.0 [23]. If not described differently, the default settings were used. The 4 samples from M. discoidea were assigned as outgroup, with the 91 remaining samples also defined as threshold (n = 91) for the maximum number of individuals with a shared heterozygous site to account for the high proportion of polyploid samples and the outcrossing nature of M. recutita. The minimal number of samples per locus was set to 75, i.e. 75 from 91 samples having data at a particular locus, the parameter for the clustering threshold of reads within and between individuals was set to 0.88, and filtering for barcodes, adapters and cut sites was done strictly setting value “2”. The maximum ploidy level was appointed as tetraploid.

Using the above criteria, the pyRAD pipeline filtered and called 6495 SNPs from the 91 M. recutita genotypes.

Analysis of the genetic diversity and genetic structure

The genetic diversity and structure of cultivated chamomile was analysed with the filtered 6495 high-quality SNPs using the admixture model of STRUCTURE 2.3.4 [24,25,26]. The outgroup (4 genotypes from M. discoidea) was excluded from the analysis, since M. discoidea was clearly different from M. recutita. Ten independent analyses with K (number of clusters) ranging from 1 to 15 with 15,000 iterations as burn-in were performed followed by 50,000 supplementary Markov Chain Monte Carlo (MCMC) iterations. The optimal K was obtained according to Evanno et al. [27] using STRUCTURE Harvester [28]. For K = 3 and K = 7, an additional 10 independent analyses with 25,000 iterations as burn-in followed by 100,000 supplementary MCMC iterations were done and used for further analysis. The burn-in length was assessed by monitoring the parameters. CLUMPP v 1.1.2 [29] was used to align the results of individual runs (Greedy algorithm).

The allele frequencies were assumed to be correlated between the origins of chamomile rather than independent [25], as chamomile is a mainly outcrossing species whose seeds are exchanged for cultivation and breeding between regions, and thus the investigated populations / origins can share similar allele frequencies [25]. The admixture model was used, since it is more general and does not assume each individual to belong to a single cluster [26]. The value for lambda (allele frequencies parameter) was set to 0.3 after lambda was estimated in STRUCTURE. Otherwise, the default settings of STRUCTURE were used.

Additionally, the supermatrix of concatenated sequences, 682,699 bp long, was used to compute a phylogenetic tree. The neighbor joining method [30] with 100 bootstrap and keeping nodes above 75% was used to obtain a consensus phylogenetic tree with Geneious v.10.0.9 (https://www.geneious.com, [31]). To validate the results from the neighbor-joining and STRUCTURE analysis, principal coordinate analysis (PCoA) was done.

Analysis of phenotype-genotype associations

To obtain markers for GWAS analysis (genome wide association studies), the full sequences (contigs) identified by the pyRAD pipeline as harbouring all the 6495 SNPs of the STRUCTURE analysis were used to detect 44,468 polymorphic markers within a set incorporating all 95 genotypes. Due to the lack of genetic or physical map, the minor number of missing profiles were imputed by means of Random Forest algorithm [32] through R package “missForest” [33]. Quality control was applied such that markers with call rates less than 90% were excluded [34, 35]. In the end, out of the 44,468 polymorphic markers 16,972 markers, including 16,333 biallelic and 639 multiallelic, were available for subsequent GWAS. Since the single results in GWAS were subsequently evaluated (including consideration of the p-values and r2-values), different criteria for SNP filtering were applied than for the genetic structure analysis so that more SNPs could be used as basic data for the GWAS.

Considering the lack of phenotyping replicates here, estimating the heritability was not feasible. Nevertheless, genomic heritability [36, 37] is a well-suited surrogate for the heritability. We estimated genomic heritability via a replicated (20×) 5-fold cross validation, as the square of average correlation from the 20 replications comparing predicted genotypic values and phenotypic values of respective traits.

GWAS was performed with the R package “rrBLUP” [38] using phenotypic data and the 16,972 markers obtained. The kinship matrix was estimated with a G matrix [39] using 5768 markers (out of the total 16,972 markers) whose minor allele frequency was above 0.05. The final significance test for each marker/allele was accomplished using F-test in “rrBLUP”. The percentage of phenotypic variance explaining each significant marker was estimated via R2 by fitting a regression between phenotypes and marker profiles, as implemented in R using the function “lm”.

The DNA sequences containing highly significant markers as identified by GWAS were annotated using firstly BLAST (blastn and tblastx; [40, 41], http://blast.ncbi.nlm.nih.gov/Blast.cgi) against the NCBI plant genomes nucleotide collection of flowering plants data using the discontiguous megablast algorithm. Mapping and annotation of the BLAST-results were performed using Blast2GO 4.0.2 and default settings [42,43,44,45]. Additionally, blastn against the Helianthus annuus (sunflower) XRQ genome assembly was done at https://www.heliagene.org/HanXRQ-SUNRISE/ [46], since sunflower is the closest related species with a sequenced genome available, belonging together with chamomile in the Asteraceae. Local BLAST analysis was done with CLC Genomics Workbench 8.5 (QIAGEN Aarhus A/S).

Results

Analysis of genetic structure reveals less genetic diversity for tetraploid than for diploid chamomile origins

As expected, the 4 genotypes of M. discoidea (outgroup) were clearly genetically separated from M. recutita (see Fig. 1, Additional file 1: Fig. S1 and Additional file 2: Fig. S2), and were thus excluded from further analyses. Using STRUCTURE, the optimal numbers of clusters (k) that described the population structure of the investigated chamomile were determined as 3 and 7 (Additional file 3: Fig. S3) for the main (Additional file 4: Fig. S4) and more detailed substructures (Fig. 2, Additional file 5: Table S1), respectively. Although the allele frequencies were assumed to be correlated between the origins of chamomile, STRUCTURE was nonetheless additionally run with the setting of independent allele frequencies, which lead to far less population structure differentiation (as expected and described by Falush et al. [25]), but did not significantly differ from the outcome of the analysis with correlated allele frequencies (Additional file 6: Fig. S5). In general, the STRUCTURE analyses demonstrated that the single plants of an origin were genetically similar, with moderate exceptions occurring (Fig. 2, e.g. 005 = ‘Lutea’ or 010 = ‘Argenmilla’).

Fig. 1
figure 1

Phylogenetic tree of M recutita samples calculated with the neighbor-joining method. The tree derived from the supermatrix consisting of all concatenated sequences and was rooted with the outgroup M. discoidea (MD1 and 701). The long branch leading to the outgroup was arbitrarily shortened for clarity. Bootstrap values (bs) of the clades above 75 are indicated close to the nodes

Fig. 2
figure 2

STRUCTURE* analysis (k = 7), with data for geographic origin and ploidy (both upper rows). * The barplot is organized according to the ploidy. The genotypes are represented by the vertical bars, whereas the different colours indicate the seven genetic clusters (Additional file 5: Table S1). The top row indicates the geographical origin (upper) and ploidy (lower) for each sample according to Table 1. The genotypes presented in the order of Table 2 can be found in Additional file 8: Fig. S7

The plants characterized by 1 or 2 main STRUCTURE clusters, i.e. having one or two STRUCTURE colours > = 9.5% (Fig. 2) were mostly tetraploids (average ploidy of the whole group close to tetraploid: Additional file 7: Fig. S6), whereas all plants except one with 3 to 5 different main STRUCTURE clusters (> = 9.5%) were diploid (Fig. 2, Additional file 7: Fig. S6). Except for one genotype (‘Camoflora’ 004_04) all tetraploids had only 1 or 2 main genetic clusters, whereas diploids showed far more genetic diversity (Fig. 2 and Table 2), although several origins of wild growing chamomile populations were also relatively homogenous (020, 064, 721, 722).

Table 2 Ploidy level, flowering time (FT) and quantity of bisaboloids (μg/g) for the investigated plants

In particular, the 39 genotypes from 14 tetraploid origins (USA and throughout Europe) were genetically similar and homogenous, forming one large group (see Fig. 2, Additional file 8: Fig. S7: red coloured bars, Fig. 1: parenthesis “4×”, Additional file 1: Fig. S1: red coloured dots and Additional file 9: Fig. S8a). Furthermore, 9 tetraploid genotypes (‘Lutea’ = 005, ‘Zloty Lan’ = 006 and one genotype ‘Goral’ = 007_01) and all four plants ‘Manzana’ (both origins 002 and 717) formed two additional groups. ‘Manzana’, like several other origins (e.g. an accession from North Korea (516), ‘Camoflora’ (004) and a genebank accession from Bulgaria (067)), was identified as being genetically different from most of the origins (Fig. 2). No clear connection between the genetic structure (clusters) and the geographic origin was apparent, i. e. no geographic differentiation could be identified (Fig. 2 and Additional file 9: Fig. S8b), even though the origins from Argentina (010) and North Korea (516) indeed seem to be a little bit more distinct (Additional file 9: Fig. S8b).

Average flowering time for diploid chamomile is shorter than for tetraploids

Diploid chamomile plants flowered significantly earlier than tetraploid ones (Fig. 3, Additional file 10: Table S2: average flowering time 201 d vs. 244 d, t-test: P = 4.20 × 10−5), and most of the plants (70 out of 78) flowered under long-day conditions, that is 200 DAS. Tetraploid plants showed a narrower range of FT than diploid ones (Fig. 3). In total only 8 genotypes and the 4 M. discoidea plants as outgroup flowered under short-day conditions (Table 2: all three from Argenmilla (010), two out of three from “population Croatia” (029_03, 029_04), and one from each origin ‘Bona’ (715_04), ‘Camoflora’ (04_04), and North-Korea (516_07)). All genotypes flowering under short-day conditions were diploid except for one tetraploid genotype (“04_04”) from the variety ‘Camoflora’. A PCoA-analysis revealed for most of these plants (except 715_04 and 04_04) clear differences to the later flowering ones (Additional file 9: Fig. S8c).

Fig. 3
figure 3

Flowering time [d] for 27 diploid (2×) and 51 tetraploid (4×) M. recutita plants. Box-whisker-plots of days after sowing (DAS) when flowering started, for diploid and tetraploid plants. The quartiles 2 and 3 are drawn with the median in between them. The whiskers indicate the maximum and minimum values, whereas the dot in the middle represents the arithmetic mean (Additional file 10: Table S2)

The content of the bisaboloids was highly variable between the genotypes

Large differences in the content of single bisaboloids were measured between the genotypes and origins, with single genotypes tending to be rich in only one of the four bisaboloids, i.e. plants showing a high content of alpha-bisabolol often had a low content of bisabolol oxide A and bisabolon oxide A (Fig. 4, Table 2). The plants from the outgroup M. discoidea had low contents of bisaboloids, only bisabolon oxide A was present to some extent (Fig. 4).

Fig. 4
figure 4

Content of bisaboloids [μg/g] in the pooled flower heads, genotypes ordered by alpha-bisabolol content

In total, 52 plants out of 80 showed no detectable content of alpha-bisabolol. The plants with high levels of the medicinal important substance bisabolol Fig. 4) were the 4 genotypes from ‘Manzana’ (002_02, 002_04, 717_03, 717_04) and the genetically defined group (see above and Fig. 2) consisting of ‘Camoflora’, ‘Lutea’ and one genotype from ‘Goral’ (007_01). The alpha-bisabolol rich genotypes were characterized by two main STRUCTURE clusters each (> = 9.5% share of each genetic cluster, compare Table 2 and Fig. 2).

Genome-wide association study (GWAS) identified significantly associated SNP-markers for the traits flowering time and alpha-bisabolol

The estimated genomic heritabilities for the traits flowering time and content of alpha-bisabolol were above 0.4 (0.48 and 0.40 respectively, Fig. 5, Additional file 11: Table S3), while that for the content of bisabololoxide B, bisabolone oxide A and bisabololoxide A were clearly lower (0.01; 0.01; 0.14, Fig. 5, Additional file 11: Table S3). Thus, genome-wide association studies were performed for flowering time and alpha-bisabolol only, focusing on the two traits for which a high portion of the phenotypic variability can be explained genetically.

Fig. 5
figure 5

Genomic heritability* for flowering time and content of the four bisaboloids. *The genomic heritability is given in a range between 0 and 1 (y-axis), and estimates how much of the observed phenotypic variation can be explained genetically. In the box-whisker-plot the quartiles 2 and 3 are drawn, with the median in between them. The whiskers indicate the maximum and minimum values (Additional file 11: Table S3)

GWAS resulted in the identification of 9 SNPs located in 5 different sequences for flowering time (FT) and 72 SNPs located in 60 different sequences for alpha-bisabolol content, all of which were above the defined significance threshold 0.01 under Bonferroni correction (Additional file 12: Table S4). Three sequences (2445, 3029, 4585) harboured more than one SNP associated with FT, and ten sequences (256, 1379, 1998, 2208, 2380, 2489, 5681, 6623, 6440, 6670) more than one SNP associated with alpha-bisabolol content.

Flowering time (FT)

The BLAST alignment [40, 41] of the 5 sequences harbouring significantly associated SNPs (Additional file 12: Table S4) against the NCBI nucleotide collection of flowering plants identified 2 sequences with significant hits to potential candidate genes (BLAST-score of > = 80 with a sequence identity of > = 80% and an E-value <1 × 10−13) in multiple plant species. Both sequences harboured more than one significantly associated SNP: (1) “seq. 2445” with 2 SNPs: xyloglucan endotransglucosylase/hydrolase protein with significant alignments to 30 plant species; and (2) “seq. 4585” with 3 SNPs: uncharacterized protein with significant alignments to 9 plant species, and protein kinase / salt-induced ABC1 kinase below the BLAST aligment results threshold in 2 plant species (Additional file 13: Table S5). Xyloglucan endotransglucosylase/hydrolase is described to be involved in self-pollination in Arabidopsis thaliana [47].

BLAST aligment (megablast) of the FT associated sequences to the sunflower genome [46] yielded hits for 4 of the 5 sequences on different chromosomes (multiple hits on 6 chromosomes for “2445”, single hits all in different chromosomes for “441”, “3029” and “4585”).

Alpha-bisabolol content

Out of the 60 sequences (contigs) associated with alpha-bisabolol content in our study (Additional file 12: Table S4), 31 could be aligned by BLAST against the NCBI nucleotide collection of flowering plants, of which 13 sequences achieved matches above the selected threshold (listed in Additional file 13: Table S5 together with the suggested functional role of the putative gene: BLAST-score of > = 80 with a sequence identity of > = 75% and an E-value <1.33 × 10−13). For 12 of these 13 sequences specific gene products were described in multiple plant species (Additional file 13: Table S5).

Whereas 8 (sequences 3218, 3223, 3249, 4194, 6598, 6694) of the 12 sequences are involved in synthesis or regulation within universal pathways and associated ubiquitous proteins in many eukaryotes or plants (Additional file 13: Table S5), 4 sequences are described to play a role in plant biotic and abiotic stress response. Interestingly, one sequence (3223) is described to be involved both in basic function and in abiotic stress response ([48]: armadillo/beta-catenin repeat family proteins). Ascorbate peroxidase (722) responses are directly involved in the protection of plant cells against adverse environmental conditions [49], Glutamate receptor-like channels (1139) in plants are discussed to fulfill a function in response to an attack from pests and pathogens [50], and the chaperone protein ClpB1 (5721) is described to be a heat shock protein required for acclimation to high temperatures [51]. The last sequence (1998) is predicted to encode a nuclear fusion defective 4-like protein, which is described to be required for karyogamy during female gametophyte development and fertilization in Arabidopsis thaliana [52].

The BLAST aligment (megablast) of the 60 sequences associated with alpha-bisabolol content against the sunflower genome [46] yielded 35 matches above the threshold for the E-value of 10−3. 12 of the 13 sequences from the alignment to different plant genomes (Additional file 13: Table S5, except the microsatellite sequence “1379”) were above the threshold value for the alignment against sunflower genome.

The 35 sequences with E-value above the threshold, were aligned to all 17 chromosomes of the sunflower genome, with minimum of 1 (chromosome 17) and maximum of 7 to 8 (chromosomes 8, 10 and 13) sequences per chromosome (Additional file 14: Table S6). Among these 35 sequences, 14 were aligned to only 1 sunflower chromosome, 10 to 2 different sunflower chromosomes, and the remaining 11 to multiple chromosomes, with sequences “4194” and “5751” both being aligned to 6 chromosomes. Ten sequences were aligned to multiple loci on the same chromosome. Tblastx gave more hits against the sunflower genome, but with higher E-value and only consisting of more multiple hits against the same chromosomes.

The sequences for which specific gene products in multiple plant species were identified (see above) were not concentrated on any of the sunflower chromosomes (Additional file 14: Table S6: sequence name written bold). Based on these data, no clustering of these 35 highly alpha-bisabolol associated sequences to single sunflower chromosomes could be observed.

Mapping by Blast2GO of the BLAST results to retrieve gene ontology terms for FT and alpha-bisabolol content achieved no matches. Thus, no functional annotation could be done.

Discussion

Rare autopolyploidization events are a potential cause for low genetic diversity of the cultivated tetraploid chamomile

Tetraploid chamomile varieties were generated a very few times artificially by polyploidisation of diploid plants, as evidenced by the STRUCTURE analyses (Fig. 2). The large, genetically homogenous group of 14 tetraploid orgins, obtained from all over Europe, is likely to originate from the same polyploidised diploid genetic background, i.e. similar parental material, with minor genetic differences arising via selection and introgression during breeding. These data indicate that most of the high-performing tetraploid chamomile origins are genetically highly similar, and justify the demand to further broaden the diversity by including different germplasm in the future breeding process. Two smaller, genetically different groups of tetraploid origins with “extended” genetic background include plants with the highest alpha-bisabolol content (see below): (1) The tetraploid variety ‘Manzana’, which was generated from diploid ‘Degumille’ [53] and (2) the group containing ‘Lutea’, ‘Zloty Lan’ and partly ‘Goral’.

Two tetraploid genotypes (003_03 from ‘Bodegold’, 004_04 from ‘Camoflora’) were genetically diverse, each consisting of several STRUCTURE-clusters, although both were similar in “STRUCTURE fingerprints” to their corresponding diploid genotypes. Considering ‘Bodegold’, registered in 1962 as a tetraploid variety [54], which consisted of di- and tetraploid plants, we hypothesize that contamination with closely related wild growing diploid chamomile through seeds or (unreduced) pollen has led to variable ploidy [19]. Contrarily, ‘Camoflora’ (a diploid variety) might have experienced fertilization with both unreduced male and female gametes or unreduced egg cells crossed with cultivated tetraploid chamomile, which both could have led to the detection of tetraploid genotypes in our study. However, frequent gene flow between di- and tetraploid chamomile is not expected due to the different ploidy of the gametes, which would typically lead to highly sterile triploid F1-progeny [55, 19]. Consequently, an increase in the genetic diversity of tetraploid chamomile origins by crosses with diploids is not likely to happen spontaneously, and hence the inclusion of different new origins in breeding programmes would likely be advantageous.

Geographic origin showed no congruence with genetic structure

In our analysis, there was no clear congruence between the genetic structure (clusters) and geographic origin (Fig. 2 and Additional file 9: Fig. S8b), i.e. no clustering due to geographic origin could be found. Even though the origins from Argentina (010) and North Korea (516) indeed seem to be genetically distinct from the European ones (Additional file 9: Fig. S8b), the samples from Russia (23), Egypt (26) and USA (27) showed strong similarity to diverse European origins. An explanation could be that regular exchange of plant material between the regions took place for this mainly outcrossing species, and that in our panel of plants were mainly cultivated chamomile origins with good agricultural performance. Franke and Schilcher [9] discriminated the origin of chamomile flowers by their chemotype, but only from wild and not cultivated collections. For the crop plant Brassica rapa, Tanhuanpää et al. [56] analysed 61 accessions with SNP markers and found grouping corresponding to morphotype and flowering habit, but not to geographic origin. Their results are comparable to our data, indicating the limited value of the geographic origin alone (without correlation to other traits) to genetically structure cultivated outcrossing plants for which the exchange of genetic material is not inhibited. Consequently, only with numerous wild collected samples from the natural regions of origin (e.g. Iran) specific geographic footprints in the genetic patterns might be revealed.

Flowering started earlier for diploid than for tetraploid chamomile plants

Diploid plants flowered in average significantly earlier than tetraploid ones (Fig. 3, Additional file 10: Table S2), as has been shown for Chamerion angustifolium [57, 58]. It is hypothesized that (auto-) tetraploid plants develop slower and flower later due to the longer time necessary for replication of a doubled genome [59, 60], which consequently may lead to slowing of metabolism and growth rates in polyploids [61, 62]. Since tetraploid chamomile is autotetraploid, possible effects of allopolyploidy like allelic interactions or epigenetic modifications of homoeologous loci leading to nonadditive gene regulation [63, 64], are not considered.

The high genomic heritability observed for the flowering time in our study (Fig. 5) additionally indicates a strong genetic influence on the flowering time in chamomile, as has been described for Arabidopsis and Brassica. Mayfield et al. [64] hypothesized that late flowering in these genera is a dominant trait, as flowering time probably depends on the expression of a repressor gene. Thus, late flowering in tetraploids compared to diploids could be additionally influenced by higher chances for dominant alleles to be present. Bao et al. [65] identified factors regulating flowering and endopolyploidization in Arabidopsis and suggest an involvement of cell cycle control in the timing of reproductive transition.

Identification of the flowering time (FT) associated sequences in chamomile can enhance breeding

The 5 sequences containing the SNPs significantly associated with flowering time (Additional file 12: Table S4) may serve as a starting point to develop molecular markers to be used in marker assisted selection (MAS) for FT in chamomile. Early flowering (=early harvesting) could help circumvent negative effects on yield due to summer drought, whereas late flowering might be of advantage in cold regions with late frost, especially when sowing was already done in autumn. Additionally to the effect of the ploidy on FT, molecular markers could facilitate the breeding for early or late flowering varieties by selecting genotypes with genetic factors for these traits.

The BLAST alignment demonstrated that the orthologue of one of these 5 sequences (“2445”; Additional file 12: Table S4) codes for protein connected to flowering (i.e. self-pollination) in A. thaliana [47]. Further investigations would be required to unravel the function of the identified sequences in chamomile.

High alpha-bisabolol content detected in tetraploid plants from two genetically defined groups indicates its limited occurence in cultivated chamomile

Only 9 chamomile genotypes (origins ‘Manzana’, ‘Lutea’, ‘Goral’ and ‘Zloty Lan’) from 2 STRUCTURE groups (Fig. 2) showed a high content of alpha-bisabolol, as has been described for ‘Manzana’, ‘Lutea’, and ‘Goral’ [2]. Here, only one out of three analysed ‘Goral’ genotypes showed a high bisabolol level, while two of three ‘Zloty Lan’ genotypes were alpha-bisabolol-rich with the third genotype being rich in alpha-bisabolol oxide B, in contrast to the data of Franke and Schilcher [2] and Das [7]. Our data for ‘Lazur’ also differ from the data of Franke and Schilcher [2] as we detected high levels of bisabolol oxide A and B but no bisabolon. These differences could be explained by genotypic variability in the polymorphic varieties of this outcrossing species, potentially increased by introgression with other chamomile populations during cultivation. This potential variability should be taken into account when maintaining a variety characterized by high-levels of specific compounds serving as medicinal drug, as well as when breeding new varieties.

The pharmaceutically important sesquiterpene alpha-bisabolol was detected in 28 of the analysed plants. It was previously suggested that an oxidative biosynthetic pathway converts the alpha-bisabolol to bisabolol oxide B or to bisabolol oxide A, and from the latter further to bisabolone oxide A [66, 67].

The 9 alpha-bisabolol rich genotypes belonged to 4 varieties (see above) from 2 different tetraploid groups, both possessing 2 main genetic clusters as defined by STRUCTURE (compare Table 2 and Fig. 2). However, a general conclusion that heterotic effects lead to the higher content of alpha-bisabolol cannot be deduced from our data (Fig. 4 and Fig. 2), as many origins with several main genetic clusters each had no detectable alpha-bisabolol content. Since only 2 tetraploid groups showed plants with high alpha-bisabolol content in our study, their membership in two main genetic clusters could also be a random effect explained by a breeding process of choosing different genetic materials as parents, but not necessarily leading to significant heterotic effects. This view is also supported by the analysis of the average heterozygosity. Some alpha-bisabolol rich genotypes displayed an elevated level of heterozygosity over the whole DNA-sequences, but more genotypes with the same level of heterozygosity contained no alpha-bisabolol (Additional file 15: Fig. S9, Additional file 16: Table S7). Wagner et al. [16] describe the inheritance of the ability to produce alpha-bisabolol as monogenic, which does not support the involvement of a heterosis effect in its content. The gene most likely responsible for alpha-bisabolol formation is the alpha-bisabolol synthase McTPS2 identified in chamomile [68]. The expression of such a terpene synthase was often found to be dominant in many plants including maize [69, 70]. In a genome wide association study of maize, several SNPs with significance for terpene production were located in terpene synthase genes [71]. However, in chamomile, none of the SNPs associated with alpha- bisabolol content were located in the sequences of alpha-bisabolol synthase or other genes with an apparent involvement in terpene biosynthesis [21]. Some of the SNPs controlling alpha-bisabolol biosynthesis may affect regulatory factors that also cause a positive introgression/heterosis effect.

It cannot be concluded that tetraploidy leads to an enhanced potential for elevated levels of alpha-bisabolol, since only some tetraploids had high levels of it, and for ‘Manzana’ the diploid starting material (‘Degumille’: [53]) is already described to contain high levels of alpha-bisabolol.

Connection between high alpha-bisabolol content with genes for plant defense and stress response

Thirteen out of the 60 sequences harbouring the 72 SNPs significantly associated with alpha-bisabolol content could be annotated, 9 of which had orthologues in multiple plant species, of which 4 are described with roles in plant biotic and abiotic stress response (Additional file 13: Table S5).

The association of genes for plant defense and stress response with alpha-bisabolol content is consistent with the antifungal and antibacterial effects of alpha-bisabolol and its oxides [10, 72], which are part of their medical value. Secondary metabolites often function in plant defense and stress response, and in particular, terpenoids (a class of substances to which the bisaboloids belong) play vital roles in plant defense [73]. Furthermore, synthesis and accumulation of natural products (i.e. secondary metabolites) is enhanced in drought-stressed plants, or by application of signal transducers [74, 75]. Hence, we hypothesize that moderate stress might enhance the production and thus concentration of alpha-bisabolol in chamomile, and yield as long as the overall growth of the plant is not hampered.

Conclusion

Our analyses of high quality SNPs enabled us to identify genetic clustering in chamomile, with some genotypes belonging mainly to one genetic cluster, while others demonstrate evidence for introgression beween different genetic clusters. Tetraploid varieties tended to be less diverse than the diploid ones. Several origins were identified as being genetically different from most other origins, suggesting their possible use for exploiting heterosis. We identified several genotypes with high levels of the medicinal important substance alpha-bisabolol. Single genotypes tended to be enriched in only one of the four bisaboloids. GWAS identified highly significant (P < 0.01) SNPs for flowering time (9) and alpha-bisabolol content (71). The data from GWAS pave the way for future research towards the identification of marker trait associations, their underlying genetics and their use in marker assisted selection (MAS) for practical breeding in German chamomile.