Bayesian estimation of marker dosage in sugarcane and other autopolyploids

Abstract

In sugarcane and other autopolyploids, the first step in constructing molecular marker maps, once the data have been generated, is to determine marker dosage. Improved methods for correctly allocating marker dosage will result in more accurate maps and increased efficiency of QTL linkage detection. When employing dominant markers such as AFLPs, single-dose markers represent alleles present as one copy in one parent and null in the other parent, double-dose markers are those present as two copies in one parent and null in the other parent, and so on. Observed segregation ratios in the offspring are employed to infer marker dosage in the parent from which the marker was inherited. Commonly, a χ² test is used to assign dosage to each marker separately. Such an approach does not address important practical considerations such as multiple testing and departures from theoretical assumptions. In particular, extra-binomial variation, or overdispersion, has been observed in sugarcane studies, and standard methods may result in fewer correct dosage allocations than the data warrant. To address these shortcomings, a Bayesian mixture model is proposed in which all markers are considered simultaneously. Since analytic solutions are not available, Markov chain Monte Carlo methods are employed. Dosage allocation for each individual marker is based on the estimated posterior probability of each dosage. For a sugarcane study, these methods resulted in more markers being allocated a dosage than by standard approaches. Simulation studies demonstrated that, in general, not only are more markers classified but more markers are also correctly classified, particularly if overdispersion is present.

References

  1. Aitken K, Jackson P, McIntyre C (2005) A combination of AFLP and SSR markers provides extensive map coverage and identification of homo(eo)logous linkage groups in a sugarcane cultivar. Theor Appl Genet 110:789–801

  2. Aitken KS, Jackson PA, McIntyre CL (2007) Construction of a genetic linkage map for Saccharum officinarum incorporating both simplex and duplex markers to increase genome coverage. Genome 50:742–756

  3. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Automat Control AC-19:713–716

  4. Al-Janabi SM, Honeycutt RJ, McClelland M, Sobral BW (1993) A genetic linkage map of Saccharum spontaneum L. ‘SES 208’. Genetics 134:1249–1260

  5. Altman DG, Bland JM (1983) Measurement in medicine: the analysis of method comparison studies. Statistician 32:307–317

  6. Besag J, Green PJ, Higdon D, Mengersen K (1995) Bayesian computation and stochastic systems. Stat Sci 10:3–66 (with discussion)

  7. Best N, Cowles MK, Vines K (1995) CODA convergence diagnosis and output software for Gibbs sampling output Version 0.30. MRC Biostat Unit, Cambridge

  8. Bland J, Altman D (1995) Comparing methods of measurement: why plotting difference against standard method is misleading. Lancet 346:1085–1087

  9. Burner DM (1997) Chromosome transmission and meiotic behaviour in various sugarcane crosses. J Am Soc Sugar Cane Tech 17:38–50

  10. Celeux G, Forbes F, Robert C, Titterington D (2006) Deviance information criteria for missing data models. Bayesian Anal 4:651–674

  11. da Silva JAG (1993) A methodology for genome mapping of autopolyploids and its application to sugarcane Saccharum spontaneum. PhD thesis, Cornell University, Ithaca, NY

  12. da Silva JAG, Sorrells ME, Burnquist WL, Tanksley SD (1993) RFLP linkage map and genome analysis of Saccharum spontaneum. Genome 36:782–791

  13. da Silva J, Honeycutt RJ, Burnquist W, Al-Janabi SM, Sorrells ME, Tanksley SD, Sobral BWS (1995) Saccharum spontaneum L. SES 208 genetic linkage map combining RFLP- and PCR-based markers. Mol Breed 1:165–179

  14. De Winton D, Haldane JBS (1931) Linkage in the tetraploid. J Genet 24:121–144

  15. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc Ser B 39:1–38

  16. Gelman A, Carlin JB, Stern HS, Rubin DB (2004) Bayesian data analysis, 2nd edn. Chapman & Hall, London

  17. Geweke J (1992) Evaluating the accuracy of sampling based approaches to calculating posterior moments. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds) Bayesian statistics 4. Oxford University Press, Oxford, pp 169–194

  18. Gilks W, Richardson S, Spiegelhalter D (1996) Markov chain Monte Carlo in practice. Chapman & Hall, London

  19. Grivet L, Arruda P (2002) Sugarcane genomics: depicting the complex genome of an important tropical crop. Curr Opin Plant Biol 5:122–127

  20. Hackett CA (2001) A comment on Xie and Xu: ‘mapping quantitative trait loci in tetraploid species’. Genet Res 78:187–189

  21. Hackett CA, Luo ZW (2003) TetraploidMap: construction of a linkage map in autotetraploid species. J Hered 94:358–359

  22. Haldane JBS (1930) Theoretical genetics of autopolyploids. J Genet 22:359–372

  23. Haley C, Knott S (1992) A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69:315–324

  24. Heidelberger P, Welch P (1983) Simulation run length control in the presence of an initial transient. Oper Res 31:1109–1144

  25. Jannoo N, Grivet L, David J, D’Hont A, Glaszmann JC (2004) Differential chromosome pairing affinities at meiosis in polyploid sugarcane revealed by molecular markers. Heredity 93:460–467

  26. Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121:185–199

  27. Luo ZW, Hackett CA, Bradshaw JE, McNicol JW, Milbourne D (2001) Construction of a genetic linkage map in tetraploid species using molecular markers. Genetics 157:1369–1385

  28. Luo ZW, Zhang RM, Kearsey MJ (2004) Theoretical basis for genetic linkage analysis in autotetraploid species. PNAS 101:7040–7045

  29. Mao CX (2007) Estimating population sizes for capture–recapture sampling with binomial mixtures. Comput Stat Data Anal 51:5211–5219

  30. Mather K (1951) The measurement of linkage in heredity. Methuen, London

  31. Mengersen KL, Robert CP (1996) Testing for mixtures: a Bayesian entropic approach. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds) Bayesian statistics 5. Oxford University Press, Oxford, pp 225–276

  32. Meyer R, Milbourne D, Hackett C, Bradshaw J, McNicol J, Waugh R (1998) Linkage analysis in tetraploid potato and association of markers with quantitative resistance to late blight (Phytophthora infestans). Mol Gen Genet 259:150–160

  33. Ming R, Liu S, Lin Y, Braga D, da Silva J, van Deynze A, Wenslaff T, Wu K, Moore P, Burnquist W, Sorrells M, Irvine J, Paterson A (1998) Alignment of Sorghum and Saccharum chromosomes: comparative organization of closely-related diploid and polyploid genomes. Genetics 150:1663–1682

  34. Mood AM, Graybill FA, Boes DC (1974) Introduction to the theory of statistics, 3rd edn. McGraw–Hill, New York

  35. Plummer M (2005) JAGS Version 0.90 manual. International Agency for Research on Cancer, Lyon

  36. Qu L, Hancock JF (2001) Detecting and mapping repulsion-phase linkage in polyploids with polysomic inheritance. Theor Appl Genet 103:136–143

  37. Qu L, Hancock J (2002) Pitfalls of genetic analysis using a doubled-haploid backcrossed to its parent. Theor Appl Genet 105:392–396

  38. Raftery AE, Lewis S (1992) How many iterations in the Gibbs sampler? In: Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds) Bayesian statistics 4. Oxford University Press, Oxford, pp 763–774

  39. Ripol MI, Churchill A, da Silva JA, Sorrells M (1999) Statistical aspects of genetic mapping in autopolyploids. Gene 235:31–41

  40. Robert C (1996) Mixtures of distributions: inference and estimation. In: Gilks W, Richardson S, Spiegelhalter D (eds) Markov chain Monte Carlo in practice. Chapman & Hall, London

  41. Rufo MJ, Pérez CJ, Martín J (2007) Bayesian analysis of finite mixtures of multinomial and negative-multinomial distributions. Comput Stat Data Anal 51:5452–5466

  42. Skellam JG (1948) A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials. J R Stat Soc Ser B 10:257–261

  43. Smith AFM, Roberts GO (1993) Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J R Stat Soc Ser B 55:3–23

  44. Soltis PS, Soltis DE (2000) The role of genetic and genomic attributes in the success of polyploids. PNAS 97:7051–7057

  45. Spiegelhalter D, Thomas A, Best N, Gilks W (1995) BUGS. Bayesian inference Using Gibbs Sampling, Version 0.50. MRC Biostatistics Unit, Cambridge

  46. Spiegelhalter DJ, Best N, Carlin B, van der Linde A (2002) Bayesian measures of model complexity and fit. J R Stat Soc Ser B 64:583–639

  47. Stebbins GL (1950) Variation and evolution in plants. Columbia University Press, New York

  48. Stephens M (2000) Bayesian analysis of mixtures with an unknown number of components—an alternative to reversible jump methods. Ann Stat 28:40–74

  49. Sybenga J (1994) Preferential pairing estimates from multivalent frequencies in tetraploids. Genome 37:1045–1055

  50. Sybenga J (1995) Meiotic pairing in autohexaploid Lathyrus: a mathematical model. Heredity 75:343–350

  51. Sybenga J (1996) Chromosome pairing affinity and quadrivalent formation in polyploids: do segmental allopolyploids exist? Genome 39:1176–1184

  52. Tanner MA (1993) Tools for statistical inference, 2nd edn. Springer, New York

  53. Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation: with discussion. J Am Stat Assoc 82:528–550

  54. R Development Core Team (2007) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0

  55. Tweedie RL, Mengersen K (1996) Rates of convergence of the Hastings and Metropolis algorithms. Ann Stat 24:101–121

  56. Ukoskit K, Thompson PG (1997) Autopolyploidy versus allopolyploidy and low-density randomly amplified polymorphic DNA linkage maps of sweetpotato. J Am Soc Hortic Sci 122:822–828

  57. Wu KK, Burnquist W, Sorrells ME, Tew TL, Moore PH (1992) The detection and estimation of linkage in polyploids using single-dose restriction fragments. Theor Appl Genet 83:294–300

  58. Wu R, Gallo-Meagher M, Littell RC, Zeng ZB (2001a) A general polyploid model for analyzing gene segregation in outcrossing tetraploid species. Genetics 159:869–882

  59. Wu SS, Wu R, Ma CX, Zeng ZB, Yang MC, Casella G (2001b) A multivalent pairing model of linkage analysis in autotetraploids. Genetics 159:1339–1350

  60. Wu R, Ma CX, Casella G (2002) A bivalent model for linkage analysis in outcrossing tetraploids. Theor Popul Biol 62:129–151

  61. Xie CG, Xu SH (2000) Mapping quantitative trait loci in tetraploid populations. Genet Res 76:105–115

Acknowledgments

We thank Jingchuan Li for assistance with AFLP marker data production and Ross Darnell and Kerrie Mengersen for useful statistical discussions. This work was partially funded by the Cooperative Research Centre for Sugarcane Industry Innovation through Biotechnology.

Author information

Corresponding author

Correspondence to Peter Baker.

Additional information

Communicated by J. Bradshaw.

Electronic supplementary material

Supplementary material 1 (PDF, 792 kb) accompanies the online version of this article.

Appendices

Appendix 1: Informative prior specification for \( \tau_{k} \)

Conjugate prior distributions are employed for the means \( \mu_{k} \) and precisions \( \tau_{k},\; k = 1, \ldots, K \). A method to determine the hyperparameters, \( \text{logit}(P_{k}) \) and \( T_{k} \), for the prior distribution of the mean \( \mu_{k} \) is outlined in “Priors”. Prior distributions for the means \( \mu_{k} \) in autooctoploids are provided in Table 4.

Table 4 Expected segregation ratios for autooctoploids on the logit scale and hyperparameters set for strong priors assuming theoretical segregation ratios

Employing a similar approach for the mean of \( \tau_{k} \sim \text{Gamma}(A_{k}, B_{k}) \), we can calculate the logit-transformed 2.5 and 97.5 percentiles of the theoretical binomial distribution with parameters in Eq. 1 to obtain the expected value \( \hat{\tau}_{k} \) of \( \tau_{k} \). Denoting the percentiles as \( q_{0.025} \) and \( q_{0.975} \) on the untransformed scale, then for a 95% confidence region on the logit scale, which spans approximately four standard deviations,

$$ \tilde{s}_{k} \approx \left[ \text{logit}\left( q_{0.975} \right) - \text{logit}\left( q_{0.025} \right) \right] / 4 $$
(12)

and so the expected value is \( \tilde{\tau}_{k} = 1/\tilde{s}_{k}^{2} \), where \( \tilde{s}_{k} \) is defined in Eq. 12 and so may be obtained directly from the percentiles \( q_{0.025} \) and \( q_{0.975} \) of the binomial distribution with size equal to the number of individuals \( N_{k} \) and probability \( P_{k} \).
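
The calculation in Eq. 12 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the number of offspring (200) and the segregation proportion (0.5, the 1:1 ratio of a single-dose marker) are assumptions chosen for the example, and the binomial percentiles are converted from counts to proportions before the logit is taken.

```python
import math

def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def binomial_quantile(n, p, prob):
    """Smallest count c with P(X <= c) >= prob for X ~ Binomial(n, p).

    Direct summation of the pmf; adequate for a few hundred offspring.
    """
    cumulative = 0.0
    for c in range(n + 1):
        cumulative += math.comb(n, c) * p**c * (1.0 - p) ** (n - c)
        if cumulative >= prob:
            return c
    return n

def prior_sd_and_precision(n, p):
    """Return (s_tilde, tau_tilde): Eq. 12 followed by tau = 1 / s**2.

    Assumes both percentiles lie strictly between 0 and n so the logit exists.
    """
    q_lo = binomial_quantile(n, p, 0.025) / n  # 2.5 percentile as a proportion
    q_hi = binomial_quantile(n, p, 0.975) / n  # 97.5 percentile as a proportion
    s_tilde = (logit(q_hi) - logit(q_lo)) / 4.0
    return s_tilde, 1.0 / s_tilde**2

# Hypothetical example: 200 offspring, single-dose marker (expected proportion 0.5)
s_tilde, tau_tilde = prior_sd_and_precision(n=200, p=0.5)
print(f"s_tilde = {s_tilde:.4f}, tau_tilde = {tau_tilde:.1f}")
```

For larger progeny sizes a library routine for binomial quantiles would be preferable to direct summation.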

The conjugate prior distribution for the precision \( \tau_{k} \) is a Gamma(\( A_{k}, B_{k} \)) distribution, which has mean \( A_{k}/B_{k} \) and variance \( A_{k}/B_{k}^{2} \). Ideally, the mean of the prior distribution would be specified as \( \hat{\tau}_{k} \), but the variance may be specified by several methods. Those considered here involve setting an interval around either \( \tilde{\tau}_{k} \) or \( \tilde{s}_{k} \).

First, if the prior distribution is specified to have a 95% confidence region around \( \tau_{k} \) of \( \tau_{k}(1 \pm x) \), then the interval has length \( 2x\tau_{k} \), which is \( 4\,\text{SD}(\tau_{k}) \), so that \( \text{SD}(\tau_{k}) = x\tau_{k}/2 \). \( A_{k} \) and \( B_{k} \) are obtained by equating the mean \( A_{k}/B_{k} \) and variance \( A_{k}/B_{k}^{2} \) to their observed values \( \hat{\tau}_{k} \) and \( (x\hat{\tau}_{k}/2)^{2} \), respectively.

In a similar fashion, if the 95% confidence region around \( s_{k} \) is set to \( s_{k}(1 \pm x) \), then \( \text{SD}(\tau_{k}) \) is a quarter of the length of the corresponding interval on the \( \tau_{k} \) scale. This is simply

$$ \text{SD}\left( \tau_{k} \right) = \frac{1}{4}\left( \frac{1}{s_{k}^{2}(1 - x)^{2}} - \frac{1}{s_{k}^{2}(1 + x)^{2}} \right) = \frac{x}{s_{k}^{2}(1 + x)^{2}(1 - x)^{2}} $$

and once the observed and expected means and variances are equated, \( A_{k} = C\hat{\tau}_{k}^{4} \) and \( B_{k} = C\hat{\tau}_{k}^{3} \), where \( C = x^{2}/(1 + x)(1 - x) \). This produces a narrower prior distribution than specifying limits around \( \tau_{k} \).
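
The moment matching described in this appendix can also be carried out numerically. The sketch below is illustrative only: the function names and the example values of the prior mean and of x are assumptions. It uses the Gamma mean \( A_{k}/B_{k} \) and variance \( A_{k}/B_{k}^{2} \) given above, and takes the prior SD as a quarter of the 95% interval length in each case.

```python
def gamma_hyperparameters(mean, sd):
    """Shape A and rate B of a Gamma prior with the given mean and SD.

    From mean = A / B and variance = A / B**2: B = mean / sd**2, A = mean * B.
    """
    rate = mean / sd**2
    shape = mean * rate
    return shape, rate

def sd_from_interval_on_tau(tau, x):
    """SD when tau * (1 +/- x), of length 2 * x * tau, is taken as 4 SD."""
    return 2.0 * x * tau / 4.0

def sd_from_interval_on_s(s, x):
    """SD of tau when s * (1 +/- x) is the 95% region, mapped via tau = 1 / s**2."""
    upper = 1.0 / (s * (1.0 - x)) ** 2
    lower = 1.0 / (s * (1.0 + x)) ** 2
    return (upper - lower) / 4.0

# Hypothetical values: prior mean precision 50, relative half-width x = 0.1
tau_hat, x = 50.0, 0.1
A, B = gamma_hyperparameters(tau_hat, sd_from_interval_on_tau(tau_hat, x))
print(f"A = {A:.1f}, B = {B:.2f}, mean = {A / B:.1f}")
```

Working from a mean and an SD keeps the two ways of setting the interval interchangeable: only the function that supplies the SD changes.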

Appendix 2: Generating markers with overdispersion

For simulation studies, markers with specified dosage may be generated from a Binomial(n, p) distribution, where p is the appropriate segregation proportion in Eq. 1. When overdispersion is present, a beta-binomial distribution may be used with p ~ Beta(α, β), where α and β are the first and second shape parameters, respectively (Skellam 1948). If the theoretical segregation ratio \( P_{jk} \) in Eq. 1 is equated to the expected value E(p) = α/(α + β), then simply setting the first shape parameter α fixes the value of \( \beta = \alpha \left( 1 - P_{jk} \right)/P_{jk} \). Note that larger values of α correspond to smaller values of \( \text{Var}\left( p \right) = \alpha\beta/\left[ \left( \alpha + \beta \right)^{2} \left( \alpha + \beta + 1 \right) \right] \), which results in less overdispersion.
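
A minimal simulation along these lines, using only the Python standard library, might look as follows. The function names, the seed, the numbers of markers and offspring and the value α = 10 are assumptions made for illustration; they are not the settings used in the simulation studies.

```python
import random
import statistics

def second_shape(alpha, p_expected):
    """Beta second shape parameter giving E(p) = alpha / (alpha + beta) = p_expected."""
    return alpha * (1.0 - p_expected) / p_expected

def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) random variable."""
    total = alpha + beta
    return alpha * beta / (total**2 * (total + 1.0))

def simulate_marker_counts(n_markers, n_offspring, p_expected, alpha=None, rng=None):
    """Number of offspring carrying each simulated marker.

    alpha=None gives binomial counts; otherwise each marker draws its own
    p ~ Beta(alpha, beta), which gives beta-binomial (overdispersed) counts.
    """
    rng = rng or random.Random()
    counts = []
    for _ in range(n_markers):
        if alpha is None:
            p = p_expected
        else:
            p = rng.betavariate(alpha, second_shape(alpha, p_expected))
        # Sum of Bernoulli(p) trials is one Binomial(n_offspring, p) draw
        counts.append(sum(rng.random() < p for _ in range(n_offspring)))
    return counts

rng = random.Random(1)  # fixed seed so the sketch is reproducible
plain = simulate_marker_counts(500, 200, 0.5, rng=rng)
over = simulate_marker_counts(500, 200, 0.5, alpha=10.0, rng=rng)
print(statistics.pvariance(plain), statistics.pvariance(over))
```

Comparing the two printed variances shows the extra-binomial spread introduced by the beta mixing; increasing α shrinks it.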

Appendix 3: Comparison of mixture model options

Fig. 11 shows that the model with more components performs slightly better: on average, the median percentage of correctly allocated markers was higher and the misclassification rate was lower. In general, although results are better when more components are employed at higher ploidy levels, the range of results suggests that worse results may be obtained for particular data sets. Further investigation reveals that this occurs only for medium to severe overdispersion (see Fig. 12).

Fig. 11

Box plots of the (i) percentage of markers with dosage correctly allocated and (ii) misclassified by models with three or four components, where equal variances on the logit scale were assumed and strong prior information was incorporated. Three-component models may avoid computational problems but could result in more markers being misclassified

Fig. 12

Box plots of the percentage of misclassified markers for data ranging from no overdispersion to severe overdispersion, for models with three or four components. The model employed was that of equal variances on the logit scale with strong prior information incorporated. The range of results increases with increasing overdispersion, which could result in worse classification for some data sets

While the percentage of correctly allocated markers decreases with increasing threshold (see Fig. 13), the trend becomes more pronounced with increasing overdispersion and ploidy levels. On the other hand, misclassification rates increase with smaller thresholds and increasing ploidy or overdispersion levels (see Fig. 14). While there is no clear optimal threshold value, a value of around 0.8 appears to be a reasonable compromise. It corresponds in some ways to the value of 0.2 that is commonly used in false discovery rate studies and as an acceptable type II error rate (a power of 0.8) when designing experimental studies.
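
The trade-off described above can be made concrete with a small sketch of threshold-based allocation. The posterior probabilities and true dosages below are invented for illustration, and the function names are assumptions; a marker is allocated the dosage with the largest posterior probability only when that probability exceeds the threshold, and is otherwise left unallocated.

```python
def allocate_dosage(posterior, threshold=0.8):
    """Dosage (1-based) whose posterior probability exceeds `threshold`, else None."""
    best = max(range(len(posterior)), key=lambda k: posterior[k])
    return best + 1 if posterior[best] > threshold else None

def allocation_summary(posteriors, true_dosages, threshold=0.8):
    """Percentages of markers correctly allocated and misclassified.

    Unallocated markers count as neither correct nor misclassified.
    """
    allocated = [allocate_dosage(p, threshold) for p in posteriors]
    n = len(posteriors)
    correct = sum(a == t for a, t in zip(allocated, true_dosages))
    wrong = sum(a is not None and a != t for a, t in zip(allocated, true_dosages))
    return 100.0 * correct / n, 100.0 * wrong / n

# Hypothetical posterior probabilities for four markers over three dosage classes
posteriors = [
    [0.95, 0.04, 0.01],
    [0.55, 0.40, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.90, 0.05],
]
true_dosages = [1, 1, 2, 3]

for threshold in (0.5, 0.8, 0.92):
    print(threshold, allocation_summary(posteriors, true_dosages, threshold))
```

In this toy example, raising the threshold from 0.5 to 0.92 lowers both the percentage correctly allocated and the percentage misclassified, which is the pattern the box plots summarise across simulated data sets.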

Fig. 13

Box plots of the percentage of markers with dosage correctly allocated by mixture models for a range of thresholds by four levels of overdispersion (None, Slight, Medium, Severe) and ploidy (4, 6, 8, 10). Dosage is allocated when the posterior probability exceeds the threshold. The models fitted were chosen to be those with the maximum number of components, equal variances on the logit scale were assumed and strong prior information incorporated. The percentage of correctly allocated markers tails off for thresholds larger than 0.8

Fig. 14

Box plots of the percentage of misclassified markers for a range of thresholds with increasing overdispersion (None, Slight, Medium, Severe) and varying ploidy levels (4, 6, 8, 10). Dosage is allocated when the posterior probability exceeds the threshold. When there is little or no overdispersion, very few markers are misclassified. However, for moderate to severe overdispersion, the misclassification rate decreases as the threshold increases but is greater for higher ploidy levels

About this article

Cite this article

Baker, P., Jackson, P. & Aitken, K. Bayesian estimation of marker dosage in sugarcane and other autopolyploids. Theor Appl Genet 120, 1653–1672 (2010). https://doi.org/10.1007/s00122-010-1283-z

Keywords

  • Sugarcane
  • Mixture Model
  • Amplified Fragment Length Polymorphism
  • Markov Chain Monte Carlo
  • Segregation Ratio