Abstract
Given its extensive data sets and statistical techniques, animal breeding can be regarded as a branch of machine learning, a field with a steadily increasing impact on breeding. This study presents the potential of machine learning and data mining for breed identification within a large set of horses and breeds. Individual assignment methods, and the factors influencing their success rate, are compared at the scale of the Czech population. The fixation index values ranged from 0.057 (HMS1) to 0.144 (HTG6), and the overall genetic differentiation among the breeds amounted to 8.9%. The highest genetic divergence (FST = 0.378) was found between the Friesian and Equus przewalskii; the highest degree of gene migration was obtained between the Czech and Bavarian Warmblood (Nm = 14,302); and the overall global heterozygote deficit across the populations was 10.4%. Eight standard methods (Bayesian, frequency, and distance) implemented in the GeneClass software and almost all mainstream classification algorithms (Bayes Net, Naive Bayes, IB1, IB5, KStar, JRip, J48, Random Forest, Random Tree, PART, MLP, and SVM) from the WEKA machine learning workbench were compared using 314,874 real allelic data sets. The Bayesian method (GeneClass, 89.9%) and the Bayesian network algorithm (WEKA, 84.8%) outperformed the other techniques. The breed genomic prediction accuracy was highest in the cold-blooded horses. The overall proportion of individuals correctly assigned to a population depended mainly on the number of breeds and their genetic divergence. These statistical tools could be used to assess breed traceability systems, and they have the potential to assist managers in decisions regarding breeding and registration.
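To illustrate the frequency-based class of assignment methods named in the abstract, the following is a minimal Python sketch, not the GeneClass implementation. It assumes Hardy-Weinberg proportions within each population and independence between loci, scores a multilocus genotype in each candidate population, and assigns the animal to the population with the highest likelihood. The breed names, allele labels, allele frequencies, and the stand-in frequency for unseen alleles are all hypothetical; only the locus names HMS1 and HTG6 come from the text.

```python
import math

# Hypothetical allele frequencies: population -> locus -> allele -> frequency.
FREQS = {
    "breed_A": {"HMS1": {"I": 0.7, "J": 0.3}, "HTG6": {"G": 0.8, "O": 0.2}},
    "breed_B": {"HMS1": {"I": 0.2, "J": 0.8}, "HTG6": {"G": 0.1, "O": 0.9}},
}

# Assumed stand-in frequency for an allele never observed in a population,
# so that a single unseen allele does not force the likelihood to zero.
MISSING_FREQ = 0.01


def log_likelihood(genotype, pop_freqs):
    """Log-likelihood of a diploid multilocus genotype in one population."""
    total = 0.0
    for locus, (a1, a2) in genotype.items():
        p = pop_freqs[locus].get(a1, MISSING_FREQ)
        q = pop_freqs[locus].get(a2, MISSING_FREQ)
        # Hardy-Weinberg: p^2 for a homozygote, 2pq for a heterozygote.
        prob = p * p if a1 == a2 else 2 * p * q
        total += math.log(prob)
    return total


def assign(genotype, freqs=FREQS):
    """Return the most likely population and the per-population scores."""
    scores = {pop: log_likelihood(genotype, f) for pop, f in freqs.items()}
    return max(scores, key=scores.get), scores


# Hypothetical animal: homozygous at HMS1, heterozygous at HTG6.
horse = {"HMS1": ("I", "I"), "HTG6": ("G", "O")}
best, scores = assign(horse)
print(best, scores)
```

Working in log space avoids numerical underflow when many loci are multiplied together. In practice, reference frequencies would be estimated from genotyped animals of known breed, with the tested individual left out of its own reference sample.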
References
Baudouin L, Lebrun P (2000) An operational Bayesian approach for the identification of sexually reproduced cross-fertilized populations using molecular markers. Acta Hortic 546:81–93. https://doi.org/10.17660/ActaHortic.2001.546.5
Bjørnstad G, Røed KH (2002) Evaluation of factors affecting individual assignment precision using microsatellite data from horse breeds and simulated breed crosses. Anim Genet 33:264–270
Cavalli-Sforza LL, Edwards AWF (1967) Phylogenetic analysis: models and estimation procedures. Am J Hum Genet 19:233–257
Cornuet JM, Piry S, Luikart G, Estoup A, Solignac M (1999) New methods employing multilocus genotypes to select or exclude populations as origins of individuals. Genetics 153:1989–2000
Dalvit C, De Marchi M, Dal Zotto R, Gervaso M, Meuwissen T, Cassandro M (2008) Breed assignment test in four Italian beef cattle breeds. Meat Sci 80:389–395
Fan B, Chen YZ, Moran C, Zhao SH, Liu B, Zhu MJ, Xiong TA, Li K (2005) Individual-breed assignment analysis in swine populations by using microsatellite markers. Asian Australas J Anim Sci 18:1529–1534
Goldstein DB, Ruiz Linares A, Cavalli-Sforza LL, Feldman MW (1995) Genetic absolute dating based on microsatellites and the origin of modern humans. Proc Natl Acad Sci U S A 92:6723–6727
Goodman SJ (1997) Rst Calc: a collection of computer programs for calculating estimates of genetic differentiation from microsatellite data and determining their significance. Mol Ecol 6:881–885
Goudet J (2001) FSTAT, a program to estimate and test gene diversities and fixation indices (version 2.9.3). Available from http://www.unil.ch/izea/softwares/fstat.html. Accessed 24 December 2017
Hauser L, Seamons TR, Dauer M, Naish KA, Quinn TP (2006) An empirical verification of population assignment methods by marking and parentage data: hatchery and wild steelhead (Oncorhynchus mykiss) in Forks Creek, Washington, USA. Mol Ecol 15:3157–3173
Iquebal MA, Sarika, Dhanda SK et al (2013) Development of a model webserver for breed identification using microsatellite DNA marker. BMC Genet 14:118
Iquebal MA, Ansari MS, Sarika DSP, Verma NK, Aggarwal RA, Jayakumar S, Rai A, Kumar D (2014) Locus minimization in breed prediction using artificial neural network approach. Anim Genet 45:898–902
Jaiswal S, Dhanda SK, Iquebal MA, Arora V, Shah TM, Angadi UB, Joshi CG, Raghava GPS, Rai A, Kumar D (2016) BIS-CATTLE: a web server for breed identification using microsatellite DNA markers. Curr Res Bioinforma 5:10–17
Jamieson A, Taylor SCS (1997) Comparisons of three probability formulae for parentage exclusion. Anim Genet 28:397–400
Kalinowski ST, Taper ML, Marshall TC (2007) Revising how the computer program CERVUS accommodates genotyping error increases success in paternity assignment. Mol Ecol 16:1099–1106
Koskinen M (2003) Individual assignment using microsatellite DNA reveals unambiguous breed identification in the domestic dog. Anim Genet 34:297–301
Liu K, Muse SV (2005) PowerMarker: integrated analysis environment for genetic marker data. Bioinformatics 21:2128–2129
Nei M (1972) Genetic distance between populations. Am Nat 106:283–291
Nei M (1973a) The theory and estimation of genetic distances. In: Morton NE (ed) Genetic Structure of Populations. University Press of Hawaii, Honolulu
Nei M (1973b) Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci U S A 70:3321–3323
Nei M, Tajima F, Tateno Y (1983) Accuracy of estimated phylogenetic trees from molecular data. J Mol Evol 19:153–170
Paetkau D, Calvert W, Stirling I, Strobeck C (1995) Microsatellite analysis of population structure in Canadian polar bears. Mol Ecol 4:347–354
Pérez-Enciso M (2017) Animal breeding learning from machine learning. J Anim Breed Genet 134:85–86
Piry S, Alapetite A, Cornuet JM, Paetkau D, Baudouin L, Estoup A (2004) GeneClass2: a software for genetic assignment and first-generation migrant detection. J Hered 95:536–539
Putnová L, Štohl R, Vrtková I (2018) Genetic monitoring of horses in the Czech Republic: a large-scale study with a focus on the Czech autochthonous breeds. J Anim Breed Genet 135:73–83
Rannala B, Mountain JL (1997) Detecting immigration by using multilocus genotypes. Proc Natl Acad Sci U S A 94:9197–9201
Rousset F (2008) Genepop'007: a complete reimplementation of the Genepop software for Windows and Linux. Mol Ecol Resour 8:103–106
Talle SB, Fimland E, Syrstad O, Meuwissen T, Klungland H (2005) Comparison of individual assignment methods and factors affecting assignment success in cattle breeds using microsatellites. Acta Agric Scand Sect A-Anim Sci 55:74–79
Van de Goor LH, van Haeringen WA, Lenstra JA (2011) Population studies of 17 equine STR for forensic and phylogenetic analysis. Anim Genet 42:627–633
Van Oosterhout C, Hutchinson WF, Wills DPM, Shipley P (2004) MICRO-CHECKER: software for identifying and correcting genotyping errors in microsatellite data. Mol Ecol Notes 4:535–538
Weir BS, Cockerham CC (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358–1370
Acknowledgments
The authors would like to thank Professor Petr Hořín (Department of Animal Genetics, VFU Brno) for providing samples of the Camargue, Murgese, and Icelandic horses. The authors also thank Irena Vrtková, PhD (Laboratory of Agrogenomics), for her unwavering support over the years.
Funding
The research was funded by project NAZV QH92277 of the National Agency for Agricultural Research of the Ministry of Agriculture of the Czech Republic, with institutional support for the development of Mendel University in Brno. The research was also supported by the Ministry of Education, Youth and Sports under project No. LO1210, carried out at the Centre for Research and Utilization of Renewable Energy.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical statement
All procedures performed in studies involving animals were in accordance with the ethical standards of the institution or practice at which the studies were conducted.
Additional information
Communicated by: Maciej Szydlowski
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Table S1
The multilocus Nm (below the diagonal) and FST values (above the diagonal) between pairs of 43 populations studied across all loci (n = 9261). (DOCX 48 kb)
Table S2
The numbers of animals sampled per population and correctly assigned, and the individual assignment success rates for each population achieved using different assignment methods and numbers of microsatellite markers (GeneClass). (DOCX 35 kb)
Table S3
The individual assignment success as calculated by GeneClass using the Bayesian method (Rannala & Mountain) for each horse breed (n = 2879). (DOCX 22 kb)
Table S4
The numbers of animals sampled per population and correctly assigned, and the individual assignment success rates for each population achieved using different assignment methods and numbers of microsatellite markers (the WEKA software). (DOCX 36 kb)
Table S5
The performance of the Bayes Net classification model tested for breed identification as the confusion matrix (the average accuracy of 84.8%). (DOCX 21 kb)
About this article
Cite this article
Putnová, L., Štohl, R. Comparing assignment-based approaches to breed identification within a large set of horses. J Appl Genetics 60, 187–198 (2019). https://doi.org/10.1007/s13353-019-00495-x
DOI: https://doi.org/10.1007/s13353-019-00495-x