Background

Sugarcane, the highest-tonnage crop among cultivated plants, plays a substantial role in the global economy. The crop has gained importance not only for its traditional use as food (80 % of the world’s sugar is produced from sugarcane) but also for ethanol and biomass production. The production of alternative energy sources and the establishment of the biorefinery concept have also rapidly increased world demand for sugarcane [1]. To meet this continuously increasing demand, the development of new varieties with high biomass and sugar yield is essential.

The modern sugarcane cultivars are interspecific hybrids derived essentially from early crosses between Saccharum officinarum (2n = 80, x = 10), a species with high sugar content stalks, and Saccharum spontaneum (2n = 40–128, x = 8), a wild and vigorous species resistant to several sugarcane diseases. The initial interspecific hybrids were repeatedly backcrossed to S. officinarum clones or to other hybrids in order to recover high sugar content, a process known as “nobilization”. These modern cultivars are highly polyploid and often aneuploid, with chromosome numbers ranging from 100 to 130 [2]. Due to this genetic complexity, the application of both conventional and molecular breeding is a challenge in sugarcane.

Most sugarcane-producing regions have their own breeding programs to develop and improve local varieties adapted to their specific environments and agricultural practices. Developing a new sugarcane variety takes on average 12 years [3]. Molecular markers associated with relevant agronomic traits could significantly reduce the time and cost involved in developing new varieties because they could aid in selecting the best parents and accelerate the rate of genetic gain in the breeding program. Association mapping has become widely used to identify molecular markers associated with relevant traits in several crops [4–9]. This method is based on the linkage disequilibrium (LD) between molecular markers and quantitative trait loci (QTL) [10]. The resolution and applicability of association mapping depend on the extent of LD within the population under consideration. The breeding history of sugarcane, consisting of a strong foundation bottleneck followed by a small number of cycles of intercrossing and vegetative propagation, suggests that LD should be extensive; thus, a high density of markers may not be needed to detect marker–trait associations [11]. The persistence of high LD in modern sugarcane cultivars was confirmed in 1999 [12] and again in 2008 [13].

The forces generating and/or conserving LD are those that produce allele frequency changes, i.e. population stratification, genetic relatedness, selection, mutation, genetic drift and linkage [10]. With the exception of linkage, all of these forces may cause false-positive correlations between markers and traits in population-based association mapping approaches. The effects of a structured population in association mapping studies have been well documented and identified as one of the main causes of spurious associations [14–16]. For that reason, and considering the often complex relationships among genotypes in breeding populations, it is extremely important to control for population structure in order to effectively decrease type I error rates (i.e. false positives) [17]. For this purpose, a range of statistical methodologies has been developed that include some form of population or relatedness control using mixed models [16–19].

In addition to controlling for population structure, the availability of both accurate phenotypic data and molecular markers distributed across the genome is a critical requirement for the success of association mapping. One advantage of this mapping method for plants, compared with classical QTL analysis based on balanced mapping populations, is that association mapping allows the use of historical phenotypic data sets collected by breeding programs [5]. Typically, these data come from multiple trials across different environments and years; therefore, statistical analyses such as mixed models are necessary to obtain phenotypic values that best represent the performance of each genotype. Malosetti et al. [19] extended the standard mixed-model phenotypic analysis of multiple trials to arrive at models suitable for association mapping by introducing marker genotype information as random covariates to model the correlation between genotypes.

The recently developed DArT technology for sugarcane [1] makes genome-wide scans of this genetically complex crop possible, capturing genomic profiles with many thousands of polymorphic markers of several kinds (INDELs, SNPs, methylation changes) [20]. Another recently developed molecular marker system that could also be convenient for detecting markers associated with desirable traits is Target Region Amplification Polymorphism (TRAP). These dominant markers enable the identification of polymorphisms in coding regions involved in specific pathways, such as sucrose metabolism or drought tolerance, among others [21, 22].

Sequence information for the DArT markers is available and could be anchored to the sugarcane genome once it is sequenced. Several efforts to sequence the sugarcane genome, which is highly complex because of its ploidy level, are still ongoing. However, considering that (i) the sugarcane monoploid genome, estimated at 930 Mb, is similar in size to the sorghum genome (2n = 2x = 20), estimated at 730 Mb [23]; (ii) sugarcane and sorghum both belong to the Poaceae family and to the same subtribe, Saccharinae; and (iii) the two genomes show a high degree of colinearity [24, 25], the available sorghum genome sequence becomes an important tool for the analysis of regions of interest in sugarcane.

The goal of this research was to establish an appropriate genome-wide association analysis (GWAS) tool for a sugarcane breeding population and to find molecular markers associated with high yield of both biomass and sugar that are stable through successive crop cycles. To this end, GWAS mapping within a mixed-model framework following Malosetti et al. [19] was used. Spurious associations were minimized while the power to detect true associations was maximized by accounting for possible population structure. A Principal Component Analysis (PCA) of a genotype data set was performed [26], and the values obtained on the significant axes for each genotype were used as covariates in the model. In contrast with earlier sugarcane GWAS studies of yield-related traits [27, 28], in which analyses were conducted at the plant-cane stage, the novel methodology used in the present study to analyze multiple QTLs through successive crop cycles allowed us to find several markers associated with relevant traits. The results highlight that this approach could be a valuable tool to assist the improvement of sugarcane and to better supply the sugar and biomass demand projected for the upcoming decades.

Methods

Plant material and phenotyping

The experimental population consisted of sugarcane clones from the selection panel (Infield Variety Trials, IVT) of the sugarcane breeding program of the “Estación Experimental Agroindustrial Obispo Colombres” (SCBP-EEAOC) (i.e. 88 clones, Table 1). The IVT are the fourth selection step of the SCBP-EEAOC; in 2008 a total of 100 clones were planted, and they were thoroughly evaluated in 2009 in order to select potential new varieties in the following steps. This breeding population consists of genotypes obtained from crosses between the best parents, i.e. those with highly productive offspring. To avoid over-representation of any family, 14 full-sibs were removed from the 100 clones to assemble a panel suitable for association mapping. Some full-sib clones were retained so as not to reduce the number of genotypes in the population further. The two most widely planted varieties in Tucumán (Argentina), LCP 85-384 and TUCCP 77-42 (first and second, respectively) [29], were also included in the association panel. The IVT were conducted at two locations in Tucumán, Argentina (Additional file 1) during three successive crop cycles. Within each trial, a randomized complete-block design with three replications was used. The individual plot size was 3 rows × 10 m, with an inter-row spacing of 1.6 m. Cane yield (CY) (kg plot⁻¹) was evaluated directly by weighing stalks from the full plot in the field during the 2009 (plant cane), 2010 (first ratoon), and 2011 (second ratoon) harvesting seasons. Although CY was measured in kg plot⁻¹ in the present GWAS study, final effects were converted to t ha⁻¹ for easier interpretation. In May of each year, sugar content (SC) was estimated from ten randomly chosen stalks from each plot by determining Brix° (percentage of soluble solids, mostly sugars, minerals, and organic acids) and Pol (level of sucrose in stalk juice determined by polarimetry) [30, 31].
Table 1 Sugarcane accessions and their parents used in the genome-wide association study of cane yield and sugar content

SC was determined at the millroom of an EEAOC laboratory from Brix° and Pol, according to the following equation [32]:

$$ \mathrm{SC}\,\% = 0.98 \times \mathrm{Pol}\,\% - 0.28 \times \mathrm{Brix}\,\% $$
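The sugar content equation above can be written as a one-line function. This is only an illustrative sketch: the function name and the example juice readings are hypothetical and are not taken from the study data.

```python
def sugar_content_pct(pol_pct: float, brix_pct: float) -> float:
    """Estimate sugar content (SC %) from Pol % and Brix % of stalk juice.

    Implements SC % = 0.98 x Pol % - 0.28 x Brix %, as given in the text.
    """
    return 0.98 * pol_pct - 0.28 * brix_pct


# Hypothetical juice readings, chosen only for illustration.
example_sc = sugar_content_pct(pol_pct=16.0, brix_pct=19.0)
print(round(example_sc, 2))
```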

Statistical analysis for the phenotypic data

Field trials were analyzed for each harvesting season independently using the following mixed model:

$$ y_{ijk} = \mu + G_i + S_j + B_{k(j)} + GS_{(ij)} + \varepsilon_{ijk} $$

where y_ijk is the yield of genotype i at location j in block k; μ is the overall mean; G_i is the fixed effect of the i-th genotype, with i = 1,…,g; S_j is the random effect of the j-th location, with j = 1,…,s and S_j ~ N(0, σ²_S); B_k(j) is the random effect of the k-th block within location j, with k = 1,…,n and B_k(j) ~ N(0, σ²_B); GS_(ij) is the random genotype i by location j interaction effect, with GS_(ij) ~ N(0, σ²_GS); and ε_ijk is the random error associated with observation y_ijk. Comparison across harvesting seasons is particularly interesting, since the dynamics and characteristics of bud sprouting and growth in plant cane differ from those of the ratoon crop [33]. Therefore, different genomic regions may be involved in cane and sugar yield at different crop ages. The estimated means (Best Linear Unbiased Estimators, BLUEs) obtained from this model for CY and SC of all genotypes were used for the association mapping analysis. The analysis was performed using PROC MIXED in SAS software 9.0 (SAS Institute 2004). Because a mixed model for association mapping was used later (described below), BLUEs instead of BLUPs were used as genetic values for the accessions to avoid double shrinking [34–38]. Pearson correlations of genotypic means between traits were estimated in R [39]. Broad-sense heritability (H²) at an experimental level was calculated on a genotype-mean basis for each trait and at each location as the ratio of genotypic to phenotypic variance, using the variance components obtained from a fitted model, as follows:

$$ {H}^2=\frac{\sigma_G^2}{\sigma_G^2+{\sigma}_{\varepsilon}^2/r} $$

where σ²_G is the genetic variance, σ²_ε is the residual variance and r is the number of replicates [40].
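As a worked illustration of the heritability formula, the short sketch below evaluates H² from two variance components and the number of replicates. The numerical values are hypothetical and serve only to show the arithmetic; in the study, the components came from the fitted mixed model.

```python
def broad_sense_heritability(var_g: float, var_e: float, n_reps: int) -> float:
    """Broad-sense heritability on a genotype-mean basis.

    H2 = var_g / (var_g + var_e / n_reps), where var_g is the genetic
    variance, var_e the residual variance and n_reps the number of replicates.
    """
    if n_reps <= 0:
        raise ValueError("n_reps must be a positive integer")
    return var_g / (var_g + var_e / n_reps)


# Hypothetical variance components with three replicates, as in the trials.
h2 = broad_sense_heritability(var_g=30.0, var_e=45.0, n_reps=3)
print(round(h2, 2))
```

With these illustrative values the residual variance contributes 45 / 3 = 15 to the denominator, so H² = 30 / 45, or about 0.67.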

Genotyping

DNA was extracted from frozen leaf tissue following the Diversity Arrays Technology (DArT) Pty Ltd (Yarralumla, Australia) protocol [41]. The quality and quantity of DNA were verified on a 0.8 % agarose gel. All clones were genotyped using DArT [1] and TRAP markers [21, 22]. DArT genotyping of the mapping population was carried out by DArT Pty Ltd with the Sugarcane High Density 1.0 array. This service involves two methods of complexity reduction (both based on PstI-based methyl filtration) against the array containing 7680 probes. TRAP genotyping was carried out according to [22] with minor modifications. All PCR reactions were carried out in our laboratory in a Bio-Rad MyCycler thermal cycler (Hercules, CA, USA) in a 5 μl reaction volume containing 50 ng of DNA sample, 10X reaction buffer (Fermentas, Spain, EU), 2.5 mM MgCl2 (Fermentas), 0.088 mM each of dATP, dTTP and dGTP, 0.072 mM dCTP, 0.16 μM of each primer (Table 2), and 0.5 U of Taq DNA polymerase (Fermentas). Different concentrations of Cy5.5-dCTP (GE Healthcare, Buckinghamshire, UK) were included in the reaction depending on the primer combination (Table 2). Amplifications were performed by initially denaturing the template DNA at 94 °C for 2 min, followed by five cycles at 94 °C for 45 s, 35 °C for 45 s, and 72 °C for 1 min; 35 cycles at 94 °C for 45 s, 50 °C for 45 s, and 72 °C for 1 min; and a final extension step at 72 °C for 7 min. Loading dye was added and 0.3 μl of PCR product was separated on a 25 cm polyacrylamide gel (Amersham Biosciences) (0.25 mm thick) in a LI-COR 4300 DNA Analyzer (LI-COR Biosciences, Lincoln, NE, USA) according to the manufacturer’s instructions. Images were captured with a slow-scan laser at 700 nm and analyzed with the SAGA™ software (LI-COR Biosciences). Product sizes were determined by comparison with the molecular weight marker LI-COR IRDye 50–700 bp Size Standard (LI-COR Biosciences).
TRAP markers, classified as 1 (presence) or 0 (absence), and the binary data from DArT were used for association analysis. All markers with a minor allele frequency (MAF) lower than 0.1 were excluded from the GWAS analysis.
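The MAF filter described above can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: it treats the frequency of the less common score (presence or absence) of a dominant marker as the "minor allele frequency", and the function name and toy matrix are hypothetical.

```python
import numpy as np


def filter_by_maf(scores: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Return a boolean mask of markers (columns) that pass the MAF filter.

    scores is a genotypes x markers matrix of 0/1 calls; NaN marks a missing
    call. Assumption: for dominant markers, MAF is taken as the frequency of
    the less common score. Markers with MAF lower than threshold are dropped.
    """
    presence_freq = np.nanmean(scores, axis=0)
    maf = np.minimum(presence_freq, 1.0 - presence_freq)
    return maf >= threshold


# Toy matrix: 20 genotypes x 3 markers (illustrative values only).
toy = np.zeros((20, 3))
toy[:10, 0] = 1   # present in 10/20 genotypes, MAF 0.50
toy[:1, 1] = 1    # present in 1/20 genotypes, MAF 0.05
toy[:17, 2] = 1   # present in 17/20 genotypes, MAF 0.15
keep = filter_by_maf(toy)
filtered = toy[:, keep]
```

In this toy case only the second marker falls below the 0.1 threshold and is excluded.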

Table 2 Conditions for sugarcane TRAP genotyping used in the GWA study of sugarcane breeding population

Genetic diversity and population structure

All polymorphic DArT and TRAP markers scored on the 88 sugarcane accessions were used to estimate the genetic relationships among clones. Genetic dissimilarities between all pairwise combinations of clones were calculated using the Dice index [42]. A neighbor-joining tree was then built from the matrix of pairwise dissimilarities using the DARwin software V.5.0.158 [43].
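For readers unfamiliar with the Dice index, a minimal sketch of the pairwise dissimilarity calculation on binary marker profiles is given below. It assumes the usual definition d = 1 − 2a / (2a + b + c), where a is the number of bands shared by two clones and b and c are the bands unique to each; missing data are not handled, and the study itself used the DARwin software rather than this code.

```python
import numpy as np


def dice_dissimilarity(profiles: np.ndarray) -> np.ndarray:
    """Pairwise Dice dissimilarity between rows of a 0/1 marker matrix."""
    x = np.asarray(profiles, dtype=float)
    shared = x @ x.T                              # a: bands present in both clones
    totals = x.sum(axis=1)                        # bands present in each clone
    denom = totals[:, None] + totals[None, :]     # 2a + b + c
    with np.errstate(divide="ignore", invalid="ignore"):
        # Convention (assumption): two clones with no bands at all are identical.
        similarity = np.where(denom > 0, 2.0 * shared / denom, 1.0)
    return 1.0 - similarity


# Two hypothetical clones scored for four markers.
toy_profiles = np.array([[1, 1, 0, 1],
                         [1, 0, 1, 1]])
d = dice_dissimilarity(toy_profiles)
```

Here the clones share two bands and carry three each, so the dissimilarity is 1 − 4/6, about 0.33.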

In order to detect and correct for population structure, a PCA was carried out using a subset of 107 DArT markers. Not all available markers were included in this analysis, mainly because using the same markers to estimate population structure and then to test for association could create a dependency among terms in the model, so that the structure terms absorb some of the QTL effects [44]. The markers used for the PCA were sampled according to their position on the different linkage groups of the homology groups of a recently published sugarcane map [45].
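A PCA of a binary marker matrix can be sketched with a singular value decomposition, as below. This is a simplified illustration under stated assumptions: columns are only mean-centred (no allele-frequency scaling), the number of axes is passed in by hand instead of being chosen with the Tracy-Widom statistic, and the 88 × 107 matrix is random toy data with the same dimensions as the study panel, not the real genotypes.

```python
import numpy as np


def marker_pca(scores, n_axes=3):
    """PCA of a genotypes x markers matrix via SVD of the centred columns.

    Returns the genotype scores on the first n_axes axes and the proportion
    of total variance explained by each of those axes.
    """
    x = np.asarray(scores, dtype=float)
    centred = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    pc_scores = u[:, :n_axes] * s[:n_axes]
    return pc_scores, explained[:n_axes]


# Random toy data with the panel's dimensions (88 clones, 107 markers).
rng = np.random.default_rng(0)
toy_scores = rng.integers(0, 2, size=(88, 107))
pc_scores, explained = marker_pca(toy_scores, n_axes=3)
```

The columns of `pc_scores` play the role of the per-genotype axis values that are carried into the association model as structure covariates.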

GWAS analysis

A mainstream mixed model GWAS analysis was conducted following [19] and [46]. Associations between molecular markers and quantitative traits were determined following the general linear mixed model for each year:

$$ Y = X\beta + \underline{Q\upsilon} + e $$

where Y is the vector of phenotypic means (i.e. BLUEs from the field analysis), X is the incidence matrix of molecular markers, β is the vector of parameters of the simple regression of the phenotypes on the markers, Q contains the eigenvectors of the significant axes of the PCA matrix, υ is a vector of predicted values of population structure, and e is the vector of random errors. The PCA scores were used in the model as random components following [19] and [46]. When population structure is modeled as random effects, the relatedness matrix not only captures population structure but also encodes a wider range of structures, including cryptic relatedness and family structure [36, 47, 48]. The significant PC axes included in the model were determined with the Tracy-Widom statistic [46]. The analyses were performed in R 3.0.0 using R code developed by the authors, with modifications from the emma [49] and GAPIT [50] packages, and recently published [40]. The code will be uploaded to the CRAN repository as the mmQTL package [51]. Briefly, a two-step approach was followed to arrive at a multi-QTL model. First, a marker-by-marker scan of the genome was conducted to identify significant marker-trait associations, using a false-discovery rate (FDR) (α = 0.05) to control for multiple testing. Since a large number of significant marker-trait associations were found, a second pruning of markers with a more stringent FDR threshold (0.01) was conducted in order to report the more relevant QTL. Second, all significant markers were fitted in a single final multi-QTL model, adding markers one at a time in a stepwise forward selection manner to control for residual QTL and to identify QTL following [52–54]. The Wald statistic with a liberal P-value < 0.01 following [19, 36] was used for this model.
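The two FDR thresholds used in the marker-by-marker scan can be illustrated with a small sketch. The text does not state which FDR procedure was applied; the Benjamini-Hochberg step-up procedure is assumed here purely for illustration, and the P-values are hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of tests declared significant at FDR level alpha.

    Assumption: Benjamini-Hochberg step-up procedure. The largest rank k with
    p_(k) <= alpha * k / m is found, and the k smallest P-values are rejected.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.nonzero(below)[0])
        significant[order[:cutoff + 1]] = True
    return significant


# Hypothetical P-values from a scan of five markers.
scan_p = [0.001, 0.008, 0.039, 0.041, 0.6]
first_pass = benjamini_hochberg(scan_p, alpha=0.05)   # initial scan
second_pass = benjamini_hochberg(scan_p, alpha=0.01)  # more stringent pruning
```

In this toy scan, two markers pass at α = 0.05 and only one survives the stricter α = 0.01 pruning, which mirrors how the second threshold narrows the list carried into the multi-QTL model.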

QQ-plots assuming a uniform distribution of P-values under the null hypothesis of no QTL (i.e., Schweder and Spjøtvoll plots [55]) were used to evaluate the models. Briefly, the observed P-values are plotted against the expected theoretical values (i.e. from the cumulative distribution function) of a uniform distribution. This is standard methodology for evaluating a model's ability to control for spurious associations [17, 36, 56]. These analyses were also performed in R.
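The coordinates of such a QQ-plot can be computed in a few lines. The sketch below is one possible construction, assuming expected quantiles of i / (n + 1) for the i-th smallest of n P-values; other plotting positions are also in common use, and the function name is hypothetical.

```python
import numpy as np


def qq_points(p_values):
    """Expected and observed P-values for a uniform QQ-plot.

    Observed P-values are sorted in ascending order and paired with the
    expected quantiles i / (n + 1) of a uniform(0, 1) distribution
    (an assumed plotting position).
    """
    observed = np.sort(np.asarray(p_values, dtype=float))
    n = observed.size
    expected = np.arange(1, n + 1) / (n + 1)
    return expected, observed


# Hypothetical P-values; a well-calibrated model gives points near the diagonal.
expected, observed = qq_points([0.5, 0.25, 0.75])
```

Points that rise well away from the diagonal across the whole range, rather than only at the smallest P-values, suggest that structure has not been adequately controlled.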

Analysis of sugarcane DArT marker sequences associated to important traits

Sequences of the sugarcane DArT markers significantly associated with CY or SC in at least 2 years of study, and of the DArT markers significantly associated with a trait in the multi-QTL model that had the highest allelic substitution effect (ASE), were used to determine their similarity to, and position on, the sorghum genome. This was done using BLASTN 2.2.22 [57] against non-redundant databases of sorghum sequences with different algorithms. First, “Megablast” was employed to identify the query sequences. Where no significant similarity was found, a second algorithm, “Discontiguous megablast”, was chosen, since it uses an initial seed that ignores some bases and is intended for cross-species comparisons. Finally, when no significant similarity was found with the second algorithm, BLAST was performed using “blastN”.
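The three-stage search strategy described above amounts to a simple fallback loop, sketched here. The `run_task` callable is hypothetical: it stands for whatever performs one BLAST search with a given algorithm and returns only the significant hits (for example, a wrapper around a local BLAST installation or a web query). The task labels follow the names used for the `-task` option of the BLAST+ `blastn` program; how the searches were actually launched in the study is not stated.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Algorithms in the order described in the text: strictest first.
TASKS: Sequence[str] = ("megablast", "dc-megablast", "blastn")


def cascade_search(
    query: str,
    run_task: Callable[[str, str], List],
    tasks: Sequence[str] = TASKS,
) -> Tuple[Optional[str], List]:
    """Try each algorithm in turn and stop at the first with significant hits.

    run_task(query, task) is a hypothetical callable that returns a list of
    significant hits (empty if none were found).
    """
    for task in tasks:
        hits = run_task(query, task)
        if hits:
            return task, hits
    return None, []


def stub_runner(query: str, task: str) -> List:
    """Stand-in for a real BLAST call; only 'dc-megablast' finds a hit."""
    return ["hypothetical_hit"] if task == "dc-megablast" else []


used_task, found = cascade_search("ACGTACGT", stub_runner)
```

Keeping the search itself behind a callable makes the fallback logic easy to check without running BLAST.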

Results

Phenotypic data, molecular markers, panel diversity and population structure

The 88 sugarcane clones used in this study were phenotyped by SCBP-EEAOC for CY and SC during 2009, 2010 and 2011 and genetically characterized with DArT and TRAP markers. The BLUE values obtained with the fitted model described above ranged from 48 to 85 t ha⁻¹ for CY and from 9.2 to 10.9 % for SC (Table 3 and Additional file 2). The genetic correlations observed between years for CY were 0.60 between 2009 and 2010, 0.78 between 2010 and 2011, and 0.50 between 2009 and 2011. The genetic correlations observed between years for SC were 0.40 between 2009 and 2010, 0.72 between 2010 and 2011, and 0.46 between 2009 and 2011. Correlations between CY and SC were low in all years (-0.06, -0.24 and -0.14 for 2009, 2010 and 2011, respectively); only the correlation between CY 2010 and SC 2010 was significant (P-value <0.05) (Additional file 3). Broad-sense heritability estimates for each trait and location are presented in Table 4. CY was under strong genetic control, since estimates of broad-sense heritability were high, ranging from 0.51 to 0.84. Estimates of H² for SC were also high (from 0.55 to 0.80), with the sole exception of SC 2010, which had a moderate H² of 0.30. These high heritability estimates indicate that the field trials produced good-quality data for the association study.

Table 3 Descriptive statistics of cane yield (CY) and sugar content (SC) from field trial of all genotypes evaluated in the GWA study
Table 4 Broad-sense heritability (H2) at each location and at each crop cycle for Cane Yield and Sugar Content

Out of the 7680 probes evaluated in the DArT array, 1642 markers were informative (i.e. polymorphic, with a MAF higher than 0.10). Out of the 177 TRAP markers evaluated, only 103 markers were included in the GWAS and 74 were excluded because the MAF was lower than 0.1. Among the 1642 informative DArT markers, 258 were mapped on the recently published sugarcane genetic map [45].

Diversity analysis using all the informative TRAP and DArT markers revealed no particular structure in the mapping population (Fig. 1 and Additional file 4; http://dx.doi.org/10.5061/dryad.mv88m). The most closely related clones (parent–descendant or full-sib) were grouped in the same area of the neighbor-joining tree; however, they did not form distinct branches. Surprisingly, there were two exceptions in which full-sib clones were located on different branches: TUC 02-38 and TUC 02-37, whose genealogical records indicate that they descend from the same parents; and TUC 03-32, which would be a full-sib of TUC 03-31, TUC 03-33, TUC 03-37 and TUC 04-4 but grouped separately from them. LCP 85-384 and most of the clones derived from this variety grouped on the most distant branch, located in the lower right portion of the tree. Clones derived from HOCP 85-845 grouped in the lower center of the tree, and TUCCP 77-42 and the clones derived from it were located in the lower left portion. The first three axes of the PCA using 107 DArT markers distributed across the sugarcane genome were significant according to the Tracy-Widom statistic. The PCA scores of each genotype on each axis were included as random covariates in the GWAS model to model the variance-covariance matrix among genotypes. The first two axes explained 7.47 and 4.99 % of the total variation, respectively (Fig. 2). The first axis could be associated with filial relationships: two groups appear, corresponding to LCP 85-384 offspring (right side of the PC1 axis) and non-LCP 85-384 offspring (left side of the PC1 axis). On PC2, the variety TUCCP 77-42 was distant from the rest of the genotypes. The results shown in Fig. 2 are congruent with those in Fig. 1, since clones descending from LCP 85-384 were separated from the rest of the genotypes.

Fig. 1 Neighbour-joining tree based on the Dice dissimilarity index calculated from 1745 polymorphic markers data (103 TRAP and 1642 DArT) assembling the 88 sugarcane genotypes

Fig. 2 The top two axes of variation of 88 sugarcane clones studied resulting of Principal Component Analysis by using 107 DArT markers distributed across the genome. The percentage of variation represented by each component is in parentheses. Accessions are colored according to their parentage with LCP 85-384. Progeny of LCP 85-384 are in black triangle (▲); the remaining genotypes are in empty circles (◯)

GWAS analysis

GWAS analysis was conducted using 1638 discrete markers (1535 DArT and 103 TRAP; the 1535 DArT markers correspond to the 1642 informative DArT markers less the 107 used for the PCA). QQ-plots of P-values showed that population structure was properly accounted for by using a stratified selection of markers to correct for population structure as a random effect (Additional file 5). In the present study, 43, 42 and 41 markers were significantly associated (FDR α = 0.01) with CY in 2009 (plant cane), 2010 (first ratoon) and 2011 (second ratoon), respectively. In addition, 38, 34 and 47 significant marker-trait associations were detected for SC in 2009 (plant cane), 2010 (first ratoon) and 2011 (second ratoon), respectively (Additional file 6). Some stability across crop cycles was observed: twenty markers were associated with CY in 2 years of study, with coincidence between 2010 and 2011 (first and second ratoon) being the most frequent. For SC, one marker-trait association was significant in all 3 years of study, while twelve markers showed association in 2 years. These associations were also more frequent when 2010 and 2011 were involved (Table 5). Most markers associated with one trait were not associated with the other; however, four markers were associated with both traits (M54 for CY-2010, CY-2011 and SC-2011; M58 for CY-2010, CY-2011 and SC-2011; M173 for CY-2010, SC-2010 and SC-2011; and M188 for CY-2010, SC-2010 and SC-2011).

Table 5 Summary of results found for markers associated with traits of interest at least in two years of study and comparison with sorghum genome

A multi-QTL model was constructed for each year with the markers significantly associated with each trait. Across the 3 years, 23 markers were significant in the multi-QTL model for CY, while 21 remained significant in the multi-QTL model for SC (Table 6). For CY, markers M100, M120, M140, M200 and M202 had an allelic substitution effect (ASE) larger than 8.33 t ha⁻¹. For SC, M28, M51 and M171 had an ASE larger than 0.70 %. Marker M64 was detected in more than 1 year in the multi-QTL model (SC 2010 and 2011). The effect of this marker was the same in both years, and 57 % of the genotypes analyzed carried the favorable allele for this marker.
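Because CY was weighed in kg per plot but effects are reported in t ha⁻¹, a conversion sketch may help in reading the ASE values. The conversion factor is not given in the text; the sketch assumes that the effective plot area is the number of rows times row length times inter-row spacing (3 × 10 m × 1.6 m = 48 m²), which is an assumption made here for illustration.

```python
def kg_per_plot_to_t_per_ha(
    kg_per_plot: float,
    n_rows: int = 3,
    row_length_m: float = 10.0,
    row_spacing_m: float = 1.6,
) -> float:
    """Convert a plot weight (kg per plot) to tonnes per hectare.

    Assumption: plot area = n_rows * row_length_m * row_spacing_m,
    i.e. 48 m2 with the plot dimensions described in the Methods.
    """
    plot_area_m2 = n_rows * row_length_m * row_spacing_m
    kg_per_ha = kg_per_plot * 10_000.0 / plot_area_m2
    return kg_per_ha / 1000.0


# Under this assumption, 40 kg per plot corresponds to about 8.33 t per ha.
effect_t_ha = kg_per_plot_to_t_per_ha(40.0)
print(round(effect_t_ha, 2))
```

Under the stated area assumption, each kilogram per plot corresponds to roughly 0.21 t ha⁻¹.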

Table 6 Significant markers associated to cane yield and sugar content and their allelic substitution effect (ASE) in the multi-QTL model for the sugarcane GWAS panel

Sugarcane DArT markers sequences on sorghum genome

The 27 available sequences of DArT markers significantly associated with a trait in at least 2 years of study were aligned with BLAST against the sorghum genome sequence database (Table 5). When the sequences of the sugarcane DArT markers were analyzed, three of them were found to have the same nucleotide sequence. This served as a useful internal control, because genotypes showed the same score (absence or presence) for markers with the same sequence. Most alignments involved sequences of hypothetical sorghum proteins, with high identity and low e-values. Notably, M120 showed high identity (94 %) with a sorghum sequence located on chromosome 6, with an e-value of 0, indicating that an alignment with an equivalent or better score would essentially never be expected by chance in a database search.

Similarly, sequences of the DArT markers significantly associated with a trait in the multi-QTL model that had the highest ASE values (M100, M120 and M140 for CY; M28, M51, M64 and M171 for SC) were aligned against the sorghum genome sequence database. Results are shown in Table 7; markers M120 and M64 are not included there since they are already shown in Table 5. The TRAP markers M200 and M202, which derive from the T15 and T17 amplifications, respectively (see Table 2), also had high ASE values but were not aligned because no sequence information is available for them. Interestingly, some DArT markers, mainly those associated with SC, showed high identity with an alpha kafirin protein, which is involved in the storage of nutritious substrates.

Table 7 Results of alignments of markers associated with cane yield and sugar content with higher value of allelic substitution effect (ASE) in the multi-QTL model for the sugarcane GWAS panel against Sorghum bicolor sequences

Discussion

In the last decade, several approaches tested in plant genetics have allowed the precise identification of desirable alleles at the molecular level. In recent years, the development of association mapping for this purpose has gained great importance. In this work, association mapping was used in sugarcane to identify molecular markers associated with both sugar and biomass yields. The quantitative nature of both traits and the polyploid genome of this crop make association mapping considerably more challenging than in crops with less complex genomes. Even so, in the present study we were able to detect QTL for both traits that are consistent across harvesting seasons.

Population mapping

Several studies suggest that elite germplasm could be useful for association mapping [58–60], although only a few studies have been conducted with this type of population in crop plants (see [60] for a review). In order to take advantage of the large amount of phenotypic data accumulated from replicated field experiments over locations and years by the SCBP-EEAOC, association mapping was conducted on accessions from its current elite breeding pool (genotypes of the advanced yield trials). All genotypes characterized in the present study were planted and evaluated at the same time, providing both balanced data across environments and extensive phenotyping. Therefore, although the population size is relatively small, the high-quality, extensive phenotyping provides a reasonable foundation for the GWAS. Small population sizes can result in decreased power to detect QTL [61, 62] and an increased false-positive rate [63, 64]. However, assembling large populations in sugarcane can be a challenge, mainly because of the phenotyping requirements and the relative size of the breeding programs. Furthermore, exploring diverse germplasm that is not adapted to local conditions and has strong population structure could hinder QTL detection because of the additional challenge of modeling such population structure [65]. Additionally, the mapping approach (i.e., candidate-gene or genome-wide), the relatedness of the individuals, the extent of LD, and the number of markers will determine the optimal population size in GWAS studies [60]. Finally, population sizes close to 100 have been used elsewhere as a first approach to QTL mapping in other species [66–70]. Recent sugarcane GWAS studies included 189 and 183 individuals [27, 71]. However, since the experimental population in the present study was a representative sample of the population to which inference is desired, it is expected that the information obtained from the association study will be useful and readily applicable to local crop improvement [6].

Controlling population structure

The presence of subpopulations in the mapping population creates a challenge for association studies. Several methods have been proposed for dealing with false positives related to population structure [16–18]. Many studies, especially those conducted with small datasets and diploid organisms, have implemented the method available in the free software Structure [16]. However, in the case of sugarcane, with its complex polyploid genome, several assumptions required by Structure are not fulfilled; therefore, the applicability of this algorithm in sugarcane may be limited [72]. For example, in a previous study in sugarcane [73], when population structure was taken into account using Structure, arbitrary subpopulations of the genotypes were observed; however, as there were no clear discontinuities in the population, the algorithm failed to conclusively group the population [28]. In the present study, GWAS mapping was applied within a mixed-model framework according to [19] and [46]. Spurious associations were controlled while the power to detect true associations was maximized by using a PCA as a random component to control for population structure [19, 36, 46–48]. When PCA is included in the analysis as a random component, large-scale population structure is captured by the first few axes, which account for most of the variation, while the more subtle relationships among individuals are captured by the remaining significant axes.

Population structure was inferred with an independent set of markers to avoid dependency among terms in the model and to prevent the structure terms from absorbing the QTL effects [44, 46]. A subset of the available markers has been used to infer population structure in other studies [44], including in sugarcane [27]. Gouy et al. [27] used a subset of the available markers to ensure genome coverage and avoid over-representation of genomic regions. The recently published sugarcane DArT-based map [45] was used to sample independent markers from each linkage group. Furthermore, QQ-plots of P-values showed that population structure was properly accounted for by using a stratified selection of markers to correct for population structure as random effects (Additional file 5). In contrast, a random selection of markers that did not account for marker position failed to properly account for population structure (data not shown). Additionally, the grouping observed on PC1 has a biological interpretation, reflecting genetic variation among progeny (Fig. 2). For instance, the right-hand side of the plot includes cv. LCP 85-384 and its progeny, while the left-hand side represents the remaining genotypes. This was also found in other studies, where LCP 85-384 was genetically more distant from modern varieties [74–76]. Cultivar LCP 85-384 is a BC4-derived line of S. spontaneum US 56-15-8 and therefore has a strong wild genetic component [74]. TUCCP 77-42, another variety with a strong wild genetic component (BC1 of S. spontaneum SES 147B), was distant from the rest of the genotypes on PC2 (Fig. 2). These results provide evidence that the few markers (107) used in the PCA were able to reveal the genetic background of the genotypes. Furthermore, most of the structure found in these genotypes seems to come from subtle kinship relationships rather than from large-scale population structure. Our method properly accounted for these relationships.

Sugarcane GWAS

Association studies are becoming a popular strategy for unraveling the genetic basis of complex traits. The first association mapping studies conducted in sugarcane focused on genome-wide approaches aimed at finding associations between disease resistance and molecular markers [27, 71, 73, 77, 78]. Few reports [27, 28] involve associations between molecular markers and traits related to cane and sucrose yield and/or yield components. Wei et al. [28] conducted a study in which field data for cane yield (t ha⁻¹) and commercially extractable sucrose content were obtained in plant-cane. However, a major concern when seeking markers that contribute to yield across crop ages is the repeatability of marker–trait associations across harvesting seasons (mainly for ratoons). Another study including sucrose yield and yield components, among other traits, was carried out by Gouy et al. [27], who obtained plant-cane phenotypic data from trials planted in different seasons or years and at different locations. However, only a few marker–trait associations were detected for the traits analyzed. In the present study, 20 markers were associated with CY in at least 2 years, with coincidence most frequent between the first and second ratoons. The sequences of four markers (M58, M54, M97 and M120) showed very high similarity, with low e-values, to coding sequences of Sorghum bicolor. The sequences of M58 and M54 were identical, and they showed high identity and a low e-value with a sequence located on chromosome 9 of Sorghum bicolor, where QTLs for plant height and tiller number were previously found [79]. The sequence of M97 was located on chromosome 2 of Sorghum bicolor, where QTLs for stem diameter and plant height were previously reported [80] and later validated [79]. The sequence of marker M120, whose effect was 8.70 t ha⁻¹ in the multi-QTL model, showed high similarity to a Sorghum bicolor sequence located on chromosome 6, where QTLs for stem biomass yield, plant height and tiller number were reported in different studies (see [79]).

For SC, one marker–trait association was significant in all 3 years of study. This marker, M64, showed high identity (99 %) and a low e-value (6.00E-143) with a sequence located on chromosome 3 of Sorghum bicolor, where several QTL related to sugar content were previously reported (glucose content [80]; sugar content [81]; °Brix, juice sugars (g L⁻¹) and juice sucrose (g L⁻¹) [82]) and validated [79]. Moreover, in our multi-QTL model, M64 showed the same marker effect (-0.48 %) in two consecutive years (2010 and 2011), indicating that selecting against this marker could increase SC.

The highly conserved sequences found in the sorghum genome confirm its usefulness as a reference for studying regions of interest in the sugarcane genome. The sequence of marker M120 presented 407 identical nucleotides with an e-value of zero, suggesting that this region is shared by sugarcane and sorghum. Other sugarcane marker sequences were also significantly similar to sorghum sequences, which is in agreement with previous studies [24, 25, 83] that reported high gene microlinearity between sorghum and sugarcane.

Even though this GWAS focused mostly on exploring the entire genome with DArT markers, TRAP markers targeting coding regions were also employed. This information proved useful in finding regions controlling traits of interest, since three of the 103 TRAP markers used for association analysis were significantly associated with CY in 2 years of study.

It is important to highlight the challenge of finding strong marker–trait associations in complex polyploid species using dominant markers. It is well known that this type of marker is less informative than co-dominant ones, especially in polyploids, because copies of homologous chromosomes "dilute" the polymorphisms. When markers are evaluated with a binary system, they are scored as 0 for the absence of the allele or 1 for the presence of at least one copy of the allele. This is an intrinsic limitation of the method, because it overlooks ploidy level [84]. Further research is therefore needed to investigate associations based on continuous data obtained from DArT markers and on allele dosage, instead of binary data. This would probably increase the number of markers associated with traits of interest. Moreover, considering that the study was carried out on a panel of sugarcane varieties and elite lines, most favorable alleles would probably be fixed; however, no single variety has all favorable alleles, which gives an opportunity to accumulate those alleles and thus achieve crop improvement.
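The loss of information caused by binary scoring can be illustrated with a short Python sketch. The function name `score_dominant` and the example dosage values are hypothetical and serve only to show how different copy numbers of an allele collapse into a single presence class.

```python
def score_dominant(dosage):
    """Collapse allele dosage (copy number) into a presence/absence score."""
    if dosage < 0:
        raise ValueError("dosage must be non-negative")
    # Any number of copies >= 1 is indistinguishable under binary scoring.
    return 1 if dosage > 0 else 0


# Hypothetical dosages of one allele in five polyploid genotypes.
dosages = [0, 1, 3, 6, 10]
binary = [score_dominant(d) for d in dosages]

print(binary)             # [0, 1, 1, 1, 1]
print(len(set(dosages)))  # 5 distinct dosage classes
print(len(set(binary)))   # 2 classes remain after binary scoring
```

Genotypes carrying one copy and ten copies of the allele receive the same score, so any dosage-dependent effect on the trait cannot be estimated from the binary data.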

Conclusions

This study indicates that association mapping in elite germplasm has clear potential for improving sugarcane, especially for complex traits such as CY and SC, for which measurements are costly and time consuming. Combining existing phenotypic trial data with genotypic DArT and TRAP marker data within an LD approach, using PCA as a random component to control for population structure, may prove highly successful in finding molecular markers significantly associated with the measured traits. Two aspects were key to obtaining the results shown here: the high quality of the phenotypic data from the EEAOC-SCBP, collected in successive crop cycles and under the same environmental conditions for all genotypes; and the adequate selection of markers for the analysis of population structure, since markers that do not adequately reflect such structure could hinder the detection of QTLs of interest. Additionally, the sequences of DArT markers associated with traits of interest aligned to chromosomal regions where sorghum QTLs have been previously reported. The role of these regions will need to be further investigated.

Because the small size of the population could reduce the power of the GWAS and increase the false-positive rate [85], the findings reported here should be considered early evidence of the genome regions and markers associated with the genetic control of yield-related traits in sugarcane, and they should be further validated.

Abbreviations

ASE, allelic substitution effect; CONICET, Consejo Nacional de Investigaciones Científicas y Técnicas; CY, cane yield; DArT, Diversity Arrays Technology; EEAOC, Estación Experimental Agroindustrial Obispo Colombres; GWAS, genome-wide association study; IVT, Infield Variety Trials; LD, linkage disequilibrium; MAF, minor allele frequency; MINCYT, Ministerio de Ciencia, Tecnología e Innovación Productiva; PCA, principal component analysis; QTL, quantitative trait loci; SC, sugar content; SCBP-EEAOC, sugarcane breeding program of Estación Experimental Agroindustrial Obispo Colombres; TRAP, target region amplification polymorphism.