Abstract
This study presents two new clustering and visualisation techniques developed to find the most typical clusters among the 18-dimensional Y-chromosomal haplogroup frequency distributions of 90 Western Eurasian populations. The first technique, called the “self-organizing cloud” (SOC), is a vector-based self-learning method derived from the Self-Organizing Map and non-metric Multidimensional Scaling algorithms. The second technique is a new probabilistic method called the “maximal relation probability” (MRP) algorithm, based on a probability function whose local maxima lie exactly at the condensation centres of the input data. This function is calculated directly from the distance matrix of the data and can be interpreted as the probability that a given element of the database has a real genetic relation with at least one of the remaining elements. We tested the two methods by comparing their results both with each other and with those of the k-medoids algorithm. Using these algorithms, we determined 10 clusters of populations based on the similarity of their haplogroup composition. The resulting picture is genetically, geographically and historically well interpretable, mirroring the early spread of populations from the Fertile Crescent to the Caucasus, Central Asia, Arabia and Southeast Europe. The results show that parallel clustering of populations with the SOC and MRP methods can be an efficient tool for studying the demographic history of populations sharing common genetic footprints.
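The idea of a probability computed directly from a distance matrix can be illustrated with a small sketch. The abstract does not state the form of the MRP function, so everything specific here is an assumption made for illustration only: the Gaussian pairwise kernel, the `sigma` scale parameter, the function name `relation_probability` and the toy data are hypothetical and are not taken from the paper. The sketch only shows how a "related to at least one other element" probability can be built from pairwise terms, and that it tends to be highest where points are densest.

```python
import numpy as np

def relation_probability(dist, sigma):
    """Probability that element i is related to at least one other element.

    ASSUMPTION: the pairwise relation probability is modelled as
    q_ij = exp(-d_ij^2 / (2 * sigma^2)). This kernel is illustrative only;
    the abstract does not specify the actual MRP function.
    """
    dist = np.asarray(dist, dtype=float)
    q = np.exp(-dist**2 / (2.0 * sigma**2))
    np.fill_diagonal(q, 0.0)  # an element is not compared with itself
    # P(at least one relation) = 1 - P(no relation with any other element)
    return 1.0 - np.prod(1.0 - q, axis=1)

# Hypothetical toy data: a group of three points, a pair, and one isolated point
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0],
                   [10.0, 0.0]])
# Pairwise Euclidean distance matrix
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
prob = relation_probability(dist, sigma=0.2)
print(np.round(prob, 3))
```

Under these assumptions the value is largest for the point at the centre of the densest group and close to zero for the isolated point, which is the qualitative behaviour the abstract attributes to the MRP function at condensation centres.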
Acknowledgments
This work was supported by the Hungarian National Research Foundation (Grant No. K81954). We especially thank Dr. Eva Susa (General Director of the Network of Forensic Science Institutes) for her financial support. We are also grateful to Kinga Rudolf for the birdsong field recordings. We thank two anonymous reviewers for their constructive comments and suggestions, and Ati Rosselet and István Borsos for the English editing.
Communicated by S. Xu.
Juhász, Z., Fehér, T., Bárány, G. et al. New clustering methods for population comparison on paternal lineages. Mol Genet Genomics 290, 767–784 (2015). https://doi.org/10.1007/s00438-014-0949-7
DOI: https://doi.org/10.1007/s00438-014-0949-7