Molecular Breeding, 36:69

Marker imputation efficiency for genotyping-by-sequencing data in rice (Oryza sativa) and alfalfa (Medicago sativa)

  • Nelson Nazzicari (corresponding author)
  • Filippo Biscarini
  • Paolo Cozzi
  • E. Charles Brummer
  • Paolo Annicchiarico


Genotyping-by-sequencing (GBS) is a rapid and cost-effective genome-wide genotyping technique that can be applied whether or not a reference genome is available. Because of the trade-off between cost and coverage, however, GBS typically produces a large proportion of missing marker genotypes, whose imputation is therefore both challenging and critical for later analyses. In this work, the performance of four general imputation methods (K-nearest neighbors, Random Forest, singular value decomposition, and mean value) and two genotype-specific methods (Beagle and FILLIN) was measured on GBS data from alfalfa (Medicago sativa L., autotetraploid, heterozygous, without reference genome) and rice (Oryza sativa L., diploid, 100 % homozygous, with reference genome). Alfalfa SNPs were aligned on the genome of the closely related species Medicago truncatula L. Benchmarks consisted of progressive data filtering for marker call rate (up to 70 %) and increasing proportions (up to 20 %) of known genotypes masked for imputation. Performance was measured as the proportion of correctly imputed genotypes, both globally and within each genotype class (two homozygotes in rice; two homozygotes and one heterozygote in alfalfa). Imputation accuracy was robust to increasing missing rates and consistently higher in rice than in alfalfa. Accuracy was as high as 90–100 % for the major (most frequent) homozygous genotype, but dropped to 80–90 % in rice and below 30 % in alfalfa for the minor homozygous genotype. In rice, Beagle was the best-performing method in terms of both accuracy and computing time. In alfalfa, K-nearest neighbors imputation (KNNI) and Random Forest imputation (RFI) gave the highest accuracies, but KNNI was much faster.
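The benchmark described above (filter markers by call rate, mask a proportion of known genotypes, impute, then score accuracy globally and per genotype class) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the missing-value code `MISSING`, the function names, the simulated rice-like data (homozygous calls coded 0 and 2), and the simple majority-vote nearest-neighbour imputer are all assumptions made for illustration only.

```python
import numpy as np

MISSING = -1  # hypothetical code for a missing genotype call


def filter_by_call_rate(geno, min_call_rate):
    """Keep markers (columns) whose share of non-missing calls is >= min_call_rate."""
    call_rate = (geno != MISSING).mean(axis=0)
    return geno[:, call_rate >= min_call_rate]


def mask_known(geno, proportion, rng):
    """Hide a random proportion of the known calls; return the masked copy and the mask."""
    known = np.argwhere(geno != MISSING)
    n_mask = int(round(proportion * len(known)))
    picked = known[rng.choice(len(known), size=n_mask, replace=False)]
    hidden = np.zeros(geno.shape, dtype=bool)
    hidden[picked[:, 0], picked[:, 1]] = True
    masked = geno.copy()
    masked[hidden] = MISSING
    return masked, hidden


def knn_impute(geno, k=5):
    """Fill each missing call with the most common call among the k nearest samples
    that have that marker observed (a simplified stand-in for KNN imputation)."""
    out = geno.copy()
    observed = geno != MISSING
    for i in range(geno.shape[0]):
        miss_cols = np.flatnonzero(~observed[i])
        if miss_cols.size == 0:
            continue
        # Distance: mismatch rate over markers observed in both samples.
        shared = observed & observed[i]
        n_shared = shared.sum(axis=1)
        mismatches = ((geno != geno[i]) & shared).sum(axis=1)
        dist = np.where(n_shared > 0, mismatches / np.maximum(n_shared, 1), np.inf)
        dist[i] = np.inf
        order = np.argsort(dist, kind="stable")
        for j in miss_cols:
            donors = [s for s in order if observed[s, j]][:k]
            if not donors:
                continue  # no sample has this marker; leave it missing
            values, counts = np.unique(geno[donors, j], return_counts=True)
            out[i, j] = values[np.argmax(counts)]
    return out


def accuracy_by_class(truth, imputed, hidden):
    """Proportion of hidden calls recovered, overall and per true genotype class."""
    t, p = truth[hidden], imputed[hidden]
    result = {"overall": float((t == p).mean())}
    for g in np.unique(t):
        result[int(g)] = float((p[t == g] == g).mean())
    return result


# Toy data: 60 samples from two groups, 200 markers, homozygous calls 0 or 2.
rng = np.random.default_rng(0)
founders = rng.choice([0, 2], size=(2, 200), p=[0.8, 0.2])
truth = founders[rng.integers(0, 2, size=60)].copy()
flip = rng.random(truth.shape) < 0.05
truth[flip] = 2 - truth[flip]

raw = truth.copy()
raw[rng.random(raw.shape) < 0.3] = MISSING  # GBS-like missing calls

filtered = filter_by_call_rate(raw, 0.5)
masked, hidden = mask_known(filtered, 0.2, rng)
imputed = knn_impute(masked, k=5)
scores = accuracy_by_class(filtered, imputed, hidden)
print(scores)
```

Reporting accuracy per genotype class matters because the classes are imbalanced: an imputer that always predicts the major homozygote can score well overall while recovering few minor-genotype calls, which is the pattern the abstract reports for the minor homozygous class.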


SNP; Genotyping by sequencing (GBS); K-nearest neighbors imputation (KNNI); Random Forest imputation (RFI); Singular value decomposition imputation (SVDI); Beagle; FILLIN; Alfalfa; Rice; Imputation; Reference genome



The rice data used in this paper were produced within the framework of the Italian national project “RISINNOVA” (Grant No. 2010–2369), financially supported by the AGER Foundation. The creation of the alfalfa data sets was funded by the project Genomic selection in alfalfa (GENALFA), funded by the Italian Ministry of Foreign Affairs and International Cooperation in the framework of the Italy-USA scientific cooperation program, and by the Italian share of the FP7-ArimNet project Resilient, water- and energy-efficient forage and feed crops for Mediterranean agricultural systems (REFORMA), funded by the Italian Ministry of Agricultural and Forestry Policies.

Supplementary material

Supplementary material 1: 11032_2016_490_MOESM1_ESM.pdf (PDF, 421 KB)



Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  • Nelson Nazzicari (1, corresponding author)
  • Filippo Biscarini (2)
  • Paolo Cozzi (2)
  • E. Charles Brummer (3)
  • Paolo Annicchiarico (2)

  1. Council for Agricultural Research and Economics (CREA), Research Centre for Fodder Crops and Dairy Productions, Lodi, Italy
  2. Dipartimento di Bioinformatica, Fondazione Parco Tecnologico Padano, Lodi, Italy
  3. Plant Sciences Department, University of California, Davis, USA
