Journal of Applied Genetics, Volume 60, Issue 2, pp 187–198

Comparing assignment-based approaches to breed identification within a large set of horses

  • Lenka Putnová
  • Radek Štohl
Animal Genetics • Original Paper


Abstract

With its extensive data sets and statistical techniques, animal breeding is a field in which machine learning has a steadily increasing impact. This study examines the potential of machine learning and data mining for breed identification within a large set of horses and breeds. Individual assignment methods, and the factors influencing their success rates, are compared at the scale of the Czech horse population. The per-locus fixation index (FST) values ranged from 0.057 (HMS1) to 0.144 (HTG6), and the overall genetic differentiation among the breeds amounted to 8.9%. The highest genetic divergence (FST = 0.378) was found between the Friesian and Equus przewalskii; the highest degree of gene migration was obtained between the Czech and Bavarian Warmblood (Nm = 14,302); and the overall heterozygote deficit across the populations was 10.4%. Eight standard methods (Bayesian, frequency-based, and distance-based) implemented in the GeneClass software and almost all mainstream classification algorithms (Bayes Net, Naive Bayes, IB1, IB5, KStar, JRip, J48, Random Forest, Random Tree, PART, MLP, and SVM) from the WEKA machine learning workbench were compared using 314,874 real allelic data sets. The Bayesian method (GeneClass, 89.9%) and the Bayesian network algorithm (WEKA, 84.8%) outperformed the other techniques. The accuracy of genomic breed prediction was highest in the cold-blooded horses. The overall proportion of individuals correctly assigned to a population depended mainly on the number of breeds and the genetic divergence among them. These statistical tools could be used to assess breed traceability systems, and they have the potential to assist managers in decision-making regarding breeding and registration.
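To make the idea of individual assignment concrete, the following is a minimal Python sketch of a frequency-based assignment in the spirit of Paetkau et al. (1995): the multilocus genotype of an individual is scored against the allele frequencies of each candidate breed, assuming Hardy-Weinberg proportions and independent loci, and the individual is assigned to the breed with the highest likelihood. This is an illustration only, not the GeneClass implementation or the authors' pipeline. The breed names, allele sizes, and the replacement frequency of 0.01 for alleles unseen in a breed are all assumptions made for the example.

```python
import math
from collections import Counter


def allele_frequencies(genotypes):
    """Per-locus allele frequencies for one reference population.

    `genotypes` is a list of individuals; each individual is a list of
    (allele_a, allele_b) tuples, one tuple per locus.
    """
    n_loci = len(genotypes[0])
    freqs = []
    for locus in range(n_loci):
        counts = Counter()
        for individual in genotypes:
            counts.update(individual[locus])
        total = sum(counts.values())
        freqs.append({allele: c / total for allele, c in counts.items()})
    return freqs


def log_likelihood(genotype, freqs, missing_freq=0.01):
    """Log-likelihood of a genotype under Hardy-Weinberg proportions.

    Alleles absent from the reference population get `missing_freq`
    (an assumed constant) so that the likelihood never drops to zero.
    """
    total = 0.0
    for (a, b), locus_freqs in zip(genotype, freqs):
        pa = locus_freqs.get(a, missing_freq)
        pb = locus_freqs.get(b, missing_freq)
        # Homozygote: p^2; heterozygote: 2pq.
        p = pa * pa if a == b else 2.0 * pa * pb
        total += math.log(p)
    return total


def assign(genotype, reference):
    """Return the best-scoring population and all log-likelihoods."""
    scores = {
        name: log_likelihood(genotype, allele_frequencies(individuals))
        for name, individuals in reference.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: two breeds, two microsatellite loci.
reference = {
    "breed_A": [
        [(100, 100), (200, 202)],
        [(100, 102), (200, 200)],
        [(100, 100), (200, 200)],
    ],
    "breed_B": [
        [(104, 104), (206, 206)],
        [(102, 104), (204, 206)],
        [(104, 104), (206, 204)],
    ],
}
query = [(100, 100), (200, 200)]

best, scores = assign(query, reference)
print(best, scores)
```

In practice, an individual being tested is usually left out of its own reference sample before the frequencies are estimated; the sketch omits that step for brevity.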


Keywords

Assignment success • Horse breeds • Genetic differentiation • Microsatellite variability • Machine learning



Acknowledgments

The authors would like to thank Professor Petr Hořín (Department of Animal Genetics, VFU Brno) for providing samples of the Camargue, Murgese, and Icelandic horses. The authors also gratefully acknowledge Irena Vrtková, PhD (Laboratory of Agrogenomics) for her unwavering support over the years.

Funding information

The research was funded by project NAZV QH92277 of the National Agency for Agricultural Research of the Ministry of Agriculture of the Czech Republic, with institutional support for the development of Mendel University in Brno. The research was also supported by the Ministry of Education, Youth and Sports under project No. LO1210, carried out at the Centre for Research and Utilization of Renewable Energy.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical statement

All procedures performed in studies involving animals were in accordance with the ethical standards of the institution or practice at which the studies were conducted.

Supplementary material

13353_2019_495_MOESM1_ESM.docx (49 kb)
Table S1 The multilocus Nm (below the diagonal) and FST values (above the diagonal) between pairs of 43 populations studied across all loci (n = 9261). (DOCX 48 kb)
13353_2019_495_MOESM2_ESM.docx (35 kb)
Table S2 The numbers of animals sampled per population and correctly assigned, and the individual assignment success rates for each population achieved using different assignment methods and numbers of microsatellite markers (GeneClass). (DOCX 35 kb)
13353_2019_495_MOESM3_ESM.docx (23 kb)
Table S3 The individual assignment success as calculated by GeneClass using the Bayesian method (Rannala & Mountain) for each horse breed (n = 2879). (DOCX 22 kb)
13353_2019_495_MOESM4_ESM.docx (36 kb)
Table S4 The numbers of animals sampled per population and correctly assigned, and the individual assignment success rates for each population achieved using different assignment methods and numbers of microsatellite markers (the WEKA software). (DOCX 36 kb)
13353_2019_495_MOESM5_ESM.docx (22 kb)
Table S5 The performance of the Bayes Net classification model tested for breed identification as the confusion matrix (the average accuracy of 84.8%). (DOCX 21 kb)
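Table S5 summarises classifier performance as a confusion matrix with an average accuracy. As a rough illustration of how such a figure can be derived from a confusion matrix, the Python sketch below computes both the overall accuracy and the mean per-class success rate. The matrix is a made-up toy example with three hypothetical breeds, not data from the study, and which of the two definitions underlies the reported 84.8% is not stated here, so both are shown.

```python
def overall_accuracy(matrix):
    """Correctly classified individuals divided by all individuals.

    Rows are true classes, columns are predicted classes.
    """
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_success(matrix):
    """Fraction of each true class that was assigned correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


# Hypothetical confusion matrix for three breeds (10 animals each).
toy_matrix = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 9],
]

accuracy = overall_accuracy(toy_matrix)
rates = per_class_success(toy_matrix)
mean_rate = sum(rates) / len(rates)
print(accuracy, rates, mean_rate)
```

The two summaries coincide here only because every toy class has the same number of animals; with unequal breed sizes they generally differ.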


References

  1. Baudouin L, Lebrun P (2000) An operational Bayesian approach for the identification of sexually reproduced cross-fertilized populations using molecular markers. Acta Hortic 546:81–93
  2. Bjørnstad G, Røed KH (2002) Evaluation of factors affecting individual assignment precision using microsatellite data from horse breeds and simulated breed crosses. Anim Genet 33:264–270
  3. Cavalli-Sforza LL, Edwards AWF (1967) Phylogenetic analysis: models and estimation procedures. Am J Hum Genet 19:233–257
  4. Cornuet JM, Piry S, Luikart G, Estoup A, Solignac M (1999) New methods employing multilocus genotypes to select or exclude populations as origins of individuals. Genetics 153:1989–2000
  5. Dalvit C, De Marchi M, Dal Zotto R, Gervaso M, Meuwissen T, Cassandro M (2008) Breed assignment test in four Italian beef cattle breeds. Meat Sci 80:389–395
  6. Fan B, Chen YZ, Moran C, Zhao SH, Liu B, Zhu MJ, Xiong TA, Li K (2005) Individual-breed assignment analysis in swine populations by using microsatellite markers. Asian Australas J Anim Sci 18:1529–1534
  7. Goldstein DB, Ruiz Linares A, Cavalli-Sforza LL, Feldman MW (1995) Genetic absolute dating based on microsatellites and the origin of modern humans. Proc Natl Acad Sci U S A 92:6723–6727
  8. Goodman SJ (1997) Rst Calc: a collection of computer programs for calculating estimates of genetic differentiation from microsatellite data and determining their significance. Mol Ecol 6:881–885
  9. Goudet J (2001) FSTAT, a program to estimate and test gene diversities and fixation indices (version 2.9.3). Available from Accessed 24 December 2017
  10. Hauser L, Seamons TR, Dauer M, Naish KA, Quinn TP (2006) An empirical verification of population assignment methods by marking and parentage data: hatchery and wild steelhead (Oncorhynchus mykiss) in Forks Creek, Washington, USA. Mol Ecol 15:3157–3173
  11. Iquebal MA, Sarika, Dhanda SK et al (2013) Development of a model webserver for breed identification using microsatellite DNA marker. BMC Genet 14:118
  12. Iquebal MA, Ansari MS, Sarika DSP, Verma NK, Aggarwal RA, Jayakumar S, Rai A, Kumar D (2014) Locus minimization in breed prediction using artificial neural network approach. Anim Genet 45:898–902
  13. Jaiswal S, Dhanda SK, Iquebal MA, Arora V, Shah TM, Angadi UB, Joshi CG, Raghava GPS, Rai A, Kumar D (2016) BIS-CATTLE: a web server for breed identification using microsatellite DNA markers. Curr Res Bioinforma 5:10–17
  14. Jamieson A, Taylor SCS (1997) Comparisons of three probability formulae for parentage exclusion. Anim Genet 28:397–400
  15. Kalinowski ST, Taper ML, Marshall TC (2007) Revising how the computer program CERVUS accommodates genotyping error increases success in paternity assignment. Mol Ecol 16:1099–1106
  16. Koskinen M (2003) Individual assignment using microsatellite DNA reveals unambiguous breed identification in the domestic dog. Anim Genet 34:297–301
  17. Liu K, Muse SV (2005) PowerMarker: integrated analysis environment for genetic marker data. Bioinformatics 21:2128–2129
  18. Nei M (1972) Genetic distance between populations. Am Nat 106:283–291
  19. Nei M (1973a) The theory and estimation of genetic distances. In: Morton NE (ed) Genetic Structure of Populations. University Press of Hawaii, Honolulu
  20. Nei M (1973b) Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci U S A 70:3321–3323
  21. Nei M, Tajima F, Tateno Y (1983) Accuracy of estimated phylogenetic trees from molecular data. J Mol Evol 19:153–170
  22. Paetkau D, Calvert W, Stirling I, Strobeck C (1995) Microsatellite analysis of population structure in Canadian polar bears. Mol Ecol 4:347–354
  23. Pérez-Enciso M (2017) Animal breeding learning from machine learning. J Anim Breed Genet 134:85–86
  24. Piry S, Alapetite A, Cornuet JM, Paetkau D, Baudouin L, Estoup A (2004) GeneClass2: a software for genetic assignment and first-generation migrant detection. J Hered 95:536–539
  25. Putnová L, Štohl R, Vrtková I (2018) Genetic monitoring of horses in the Czech Republic: a large-scale study with a focus on the Czech autochthonous breeds. J Anim Breed Genet 135:73–83
  26. Rannala B, Mountain JL (1997) Detecting immigration by using multilocus genotypes. Proc Natl Acad Sci U S A 94:9197–9201
  27. Rousset F (2008) Genepop'007: a complete reimplementation of the Genepop software for Windows and Linux. Mol Ecol Resour 8:103–106
  28. Talle SB, Fimland E, Syrstad O, Meuwissen T, Klungland H (2005) Comparison of individual assignment methods and factors affecting assignment success in cattle breeds using microsatellites. Acta Agric Scand Sect A-Anim Sci 55:74–79
  29. Van de Goor LH, van Haeringen WA, Lenstra JA (2011) Population studies of 17 equine STR for forensic and phylogenetic analysis. Anim Genet 42:627–633
  30. Van Oosterhout C, Hutchinson WF, Wills DPM, Shipley P (2004) MICRO-CHECKER: software for identifying and correcting genotyping errors in microsatellite data. Mol Ecol Notes 4:535–538
  31. Weir BS, Cockerham CC (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358–1370

Copyright information

© Institute of Plant Genetics, Polish Academy of Sciences, Poznan 2019

Authors and Affiliations

  1. Laboratory of Agrogenomics, Department of Morphology, Physiology and Animal Genetics, Faculty of Agronomy, Mendel University in Brno, Brno, Czech Republic
  2. Department of Control and Instrumentation, Faculty of Electrical Engineering and Communication, Brno University of Technology, Brno, Czech Republic
