Introduction

Human genomics studies in the whole-genome sequencing era have updated our understanding of the patterns of genetic diversity among ethnolinguistically diverse worldwide populations, fine-scale population structure, variation spectrum of various kinds of genetic variations of single nucleotide variations (SNV), structural variations (SV) and complex mobile elements [1,2,3]. The properties of the non-recombining region of the human Y-chromosome (NRY), that is, male specificity, haploidy, and absence of crossing over, make its genetic variations a powerful tool in evolutionary studies and forensic investigations, especially in cases where standard autosomal short tandem repeat (STR) profiling is not informative [4, 5]. Haplotypes composed of Y-STRs or Y-SNPs have been applied to characterize the paternal lineages of unknown male contributors or infer paternal biogeographical ancestry [5,6,7]. Y-SNPs define stable haplotypes as haplogroups, which could be used to construct robust phylogeny. A large body of research has been dedicated to identifying novel Y-chromosomal genetic variations, detailing the phylogenetic tree, and characterizing the patterns of differentiated paternal lineages [7,8,9]. Nowadays, the advances and applications of next-generation sequencing (NGS) technology provide unbiased ascertainment of Y-SNPs, leading to the construction of detailed phylogenies in which branch lengths are proportional to divergence times and enabling the estimates of time to the most recent common ancestor (TMRCA) of branch nodes [7, 10, 11]. The Y-chromosomal molecular clock built a link between genetic diversity and human migration and admixture history, which could be used to estimate the time when a lineage originated or expanded or when an ancestral population split into two paternal populations and migrated into different areas.

Y-chromosome-based phylogeny was used to explore the population origin, admixture, and evolutionary history at the era of the first genome sequencing era. Zerjal et al. genotyped over 32 markers among 2123 males and identified the genetic legacy of the Mongol western expansion [12]. Similar studies focused on large-scale population cohorts in Mongolia and China revealed that regional population migration and admixture models contributed to the observed patterns of genetic diversity rather than simple cultural diffusion [13, 14]. The significant advance changes in the research patterns occurred when Wei et al. introduced the high-coverage complete Y-sequences to identify the novel phylogenetic variations and calibrate the Y-chromosomal phylogeny [15]. They reported 6662 high-confidence lineage informative SNPs (LISNPs) by analyzing 8.97 Mb of the NRY regions in 36 diverse genomes. Followingly, Karmin et al. sequenced 456 geographically diverse individuals and found that global cultural changes were associated with the identified bottleneck of Y-chromosome diversity based on the phylogenetic analysis of the high-coverage Y-chromosome sequences [16]. Poznik et al. focused on the 1244 whole Y-chromosome genomes from the 1000 Genomes Project and observed punctuated bursts in human male demography inferred from the identified ~ 6000 variants [7]. Large-scale Y-chromosome phylogeny reconstruction from the European, Oceanian and Siberian populations further reported complex population origin tracts, gene flow events, population standstill, rapid population divergence and expansion [17,18,19,20]. The single population-scale Y-chromosome investigations focused on Chinese Tibetan, Han, Mongolian and other populations have been conducted, which have identified the population-specific founding lineages and corresponding population origin, separation, and following admixture events [10, 21,22,23]. However, large-scale Y-chromosome surveys based on high-resolution systems or whole-genome sequencing remained to be conducted to present the entire landscape of genetic diversity and fine-scale paternal genetic history.

The deeper understating of Y-chromosome variations in human evolutionary research further promoted its wide applications in forensic science [4, 24]. The Y-chromosomal phylogeny has become an essential pillar of forensic pedigree searches and paternal ancestry inference. On this basis, several Y-SNP panels of different resolutions have been developed and validated in geographically and linguistically diverse populations [25,26,27,28,29,30,31]. Due to the popularity of the capillary electrophoresis (CE) platform, currently available dedicated Y-SNP genotyping tools developed based on this system have restrictions on the number of Y-SNPs analyzed simultaneously [26, 28, 29, 32, 33], hindering the dissection of paternal biogeographical ancestry at higher resolution. Here, targeted NGS technologies are up-and-coming, which have the capability to sequence multiple targets and samples simultaneously and can take advantage of a large number of Y-SNPs for forensic investigations on a detailed level. An early proof-of-principle study showed that 530 Y-SNPs could be genotyped simultaneously in a single sequencing run [25]. This Y-SNP NGS panel covered branches of the entire phylogenetic tree (Y-DNA Haplogroup Tree 2013) and could be applied to comprehensive paternal lineage classification. Subsequently, Ralf et al. presented a vastly improved Y-SNP NGS panel covering 859 Y-SNPs and 640 corresponding paternal lineages [34]. Although all major Y-chromosomal lineages were included in these two panels, finer-scale paternal lineages were lacking. Chinese populations with a large population size possessed a large number of terminal paternal lineages, which limited the application of the above two NGS panels in pedigree search, personal identification and biogeographical ancestry inference. To promote the application of Y-SNPs for forensic investigations in China, multiple Y-SNP NGS panels aiming at the paternal lineage classification of ethnolinguistically diverse populations were successively developed [27, 31, 35, 36]. However, Chinese populations possess complex patterns of cultural, geographical, ethnic and genetic diversity. These dedicated Y-SNP genotyping tools could not simultaneously cover the dominant paternal lineages of Chinese populations and meet the requirement of high-resolution paternal lineage classification. Whole-genome sequencing-based genetic studies have illuminated that Chinese population structures were strongly correlated with geographical regions or language families [37, 38]. Ancient DNA of East Eurasians further identified multiple ancestral sources and complex demographic events that contributed to the gene pool of modern Chinese people, including the westward migrations of the steppe pastoralists and herders, north-to-south di-directional population movements along the Yangtze River and Yellow River basins, peopling of the Tibetan Plateau, and extensive interaction with ancient Siberians [39,40,41,42,43]. These ancient population events further complicated the patterns of the Y-chromosomal lineages in China. To explore the fine-scale paternal genetic structure and illuminate the patterns of genetic diversity of Chinese populations, we developed one high-resolution revised Y-SNP panel with high coverage of geographic and ethnic specificity and the high resolution of terminal haplogroup dissection. We genotyped 639 Y-SNPs in 1033 unrelated individuals from 33 Sinitic-speaking Han and Hui, Mongolic-speaking Mongolian, Tai-Kadai-speaking Gelao and Li, and Tungusic-speaking Manchu populations (Fig. 1A). We aggregated our data with previously publicly available data and comprehensively characterized the genetic diversity and population genetic features based on the sequence variations. We constructed one comprehensive revised forensic phylogenetic tree to present the patterns of Chinese Y-chromosome diversity and illuminate the high performance of our developed panel for forensic applications. Our work has conducted one of the most extensive genetic studies to present one high-density Y-chromosomal phylogeny among linguistically representative Chinese populations. We identified extensive Chinese genetic diversity of ethnolinguistically diverse populations, which can be used as a repertoire of Chinese genomic variations for a better understanding of population admixture events in future studies focused on migration, ancestral source, evolution, demography, adaptation and human genomic resources in China.

Fig. 1
figure 1

The geographical positions of 33 studied ethnolinguistically diverse populations and genetic features inferred from the haplogroup frequency spectrum. A Geographical locations and the haplogroup composition of 32 predefined 4-level haplogroups. All used haplogroups were manually cut at the fourth level to achieve a statistical possibility. B The heatmap showed the clustering patterns of 32 cut-haplogroups from 33 populations. C, D Principal component analysis (PCA) showed the genetic similarities and differences among 33 populations based on the top three components extracted from the haplogroup frequency spectrum (HFS). E Multidimensional scaling plots showed the genetic clustering patterns based on the pairwise Fst matrix. F Heatmap showed the pairwise Fst calculated based on the HFS and the population clustered patterns G, H Neighboring-Joining phylogenetic tree reconstructed based on the genetic distances in the different levels of HFS

Results

Genotyping and genetic diversity among Chinese people

We generated genotype data of 639 Y-SNPs from 322 unrelated individuals from six Chinese populations belonging to three different language families using our developed 639-plex Y-SNP panel (Method), including three Mongolic-speaking Mongolian populations in North China and one Sinitic-speaking Hui in Northwest China, two Tai-Kadai-speaking Gelao in Southwest China and Li in southernmost China (Fig. 1A). To provide one comprehensive population reference data from geographically diverse Chinese populations, we genotyped approximately 30 K Y-SNPs in 711 Han, Manchu, Mongolian and Hui people from 27 populations using a high-density array, which included our genotyped 639 Y-SNPs. The average sequencing depth was high enough for our quality control (Method). Among the final dataset of 1033 individuals from 33 populations, we observed 256 definitive Y-chromosomal lineages with the haplogroup frequency ranging from 0.0010 to 0.0687. Eighty-seven haplotypes were observed only once among our dataset, mainly including lineages from Q, R, C, D and other rare O terminal lineages. If we defined a haplogroup frequency larger than 5% as the threshold of the common lineages, the threshold of 1% as low-frequency lineages and other non-singleton as the rare lineages, we observed one common lineage (O2a2b1a1a1a1a1a1a1-M6539, 6.87%, 71/1033), 18 low-frequency lineages (O2a1b1a1a1a1a1a1-F17, O2a2b1a1a1a1a1b1a1b-MF15397, O2a2b2a1b1-A16609, O1b1a1a1a1b2a1a1-F2517, O2a2b1a1a1a1a1a1-F155, O1a1a1b-Z23420, C2a1a2a2b-F12502, O2a1b1a1a1a1-F325, O2a1b1a2a1a1-F1894, O2a2b1a2a1a1a2a1-F273, O2a1b1a1a1a1a1-F110, C2a1a3a-F3796, C2a1a1b1a-F3830, O2a1b1a1a1a1e1a-Y16154, J2-PF4922, Q1a1a1-F1626, O1a1a2a1a-Z23266 and O2b1a1a1-Y173834, the observed number over 11) and 150 rare lineages. We also statistically estimated the distribution of upstream Y-chromosomal lineages. We randomly cut the fourth level of each included haplogroup and observed six common lineages, including C2b1-F2613, O1a1a-CTS4351, C2a1-F3914, O1b1-F2320, O2a1-CTS7638 and O2a2-P201, seven low-frequency lineages, 15 rare lineages and six singletons. Finally, we calculated the nucleotide diversity (π) among 322 individuals from six populations and obtained an estimate of 0.033. The estimated segregating sites of these included Y-SNPs were 256 (the number of sites). The number of sites of parsimony-informative sites was 193. Tajima's D statistic was -1.85776 with a p-value of 0.9856. Analysis of genetic diversity showed our developed panel was suitable for capturing the Chinese population's genetic variations.

Haplogroup frequency spectrum among 33 Chinese populations suggested their differentiated paternal genetic structure

We explored the similarities and differences based on the haplogroup frequency spectrum (HFS) among 33 investigated populations at level 4 (Fig. 1A, B). O1a only existed in Han Chinese, and the common lineage of O1a1 (0.3559) and low-frequency lineage of O1a2 (0.0169) were observed in Tai-Kadai-speaking Li with the highest proportion. Interestingly, Li-dominant lineage O1a1 was commonly identified in southern Han Chinese, Gelao and Hui people. O1b1, with the highest frequency of 0.5593, was mainly observed in southern Han Chinese populations and also identified in northern Han and Mongolian people. Two of four O2 lineages (O2a1 and O2a2) were frequently observed in Han Chinese. O2a1 had the highest frequency in Tianjin Han, but O2a2 had the highest frequency in southern Hans (Fujian, Guangdong and Guizhou). C2a1 widely existed in Mongolian and other northern Han Chinese, and C2b1 was frequently observed in Manchu, Hui and other Han people. Except for the lineages mentioned above, we also observed that Siberian lineages of Q1a1, and western Eurasian-dominant lineages of J1a2, R1a1 and R1b1 contributed to the paternal genetic diversity of Chinese populations (Additional file 1: Tables S1, S2 and Fig. 1A). Heatmap of the HFS showed aforementioned common lineages contributed to the significant components of our studied population's gene pool. We also found that geographically or ethnolinguistically close people shared similar patterns of HFS.

We additionally explored the genetic similarities and differences among 33 Chinese populations using principal component analysis (PCA) based on the HFS on the fourth level. PC1, extracting 51.45% variance from total variations, separated the northern and southern Chinese populations and PC2, extracting 10.15% variance, separated Tai-Kadai-speaking Li from other Chinese people (Fig. 1C). Clustering patterns based on the first and third components separated Mongolian populations from others. Interestingly, Han Chinese populations from Inner Mongolia and Liaoning clustered closely with Mongolian people, suggesting their admixture and extensive interaction status, consistent with the MDS-based (multidimensional scaling analysis) clustering patterns (Fig. 1D, E). Mongolian and Manchu people from the metropolitan populations clustered with Han Chinese, which suggested that southern Altaic people mixed with Han Chinese and other indigenous populations in the historical periods. These observed patterns were consistent with the admixture patterns inferred from our recent genomic studies of Mongolian and Manchu people from Guizhou province [44]. We further validated the identified patterns based on the pairwise Fst matrix, in which Li separated from others, and Mongolian and northern Han clustered together and separated from others. Indeed, the estimated Fst values showed that Li people had the most considerable genetic distances with other comparative populations (Additional file 1: Table S3) and separated from other populations, and formed one isolated clade in the heatmap clustering pattern (Fig. 1F). Other populations were distinguished into two groups, mainly from northern and southern China. Finally, to confirm the robustness of our reconstructed genetic affinity, we reconstructed two Neighboring-Joining trees based on the Fst matrixes at the fourth and second levels (Fig. 1G, H). We found that the phylogenetic relationship inferred from the upstream Y-chromosomal lineages was more consistent with the cluster patterns observed in the PCA, MDS and heatmap.

High-resolution Y-chromosomal lineages for ethnolinguistically diverse Chinese populations

Recent whole-genome sequencing studies of Chinese populations have identified the fine-scale paternal genetic structure of ethnolinguistically different Chinese people and unreported LISNPs [10, 22, 23, 45]. However, whole-genome sequencing for every forensic case sample is impossible. Thus, our developed high-resolution Y-SNP panel is the best choice for promoting forensic applications. We chose all essential lineages in Chinese populations with the divergence times before 500 years. To validate the lineage coverage of our panel, we first genotyped 322 unrelated samples from six populations (Mongolian, Hui, Gelao and Li, Fig. 2A). the reconstructed revised phylogeny among six ethnic groups showed that paternal lineages fell into O2a2, O2a1, O1b1, C2a1, O1a1, C2b1, Q1a1, R1a1 and D1a1, respectively, sampled from 61, 46, 43, 42, 32, 24, 11, 10 and 9 individuals (Fig. 2A). Dominant sublineages (O1b1a1a1a1b2a1a1, 23; O1a1a1b, 12 and C2a1a3a, 11) were observed and restricted to Li and Mongolian populations. Phylogeny results suggested that the C2a/2b can be identified as the founding lineage and ethnicity-specific lineage informative Y-SNP markers of Mongolians for further population genetics and forensic pedigree search as well as biogeographical ancestry inference. Similarly, the identified common sublineages of O1a/1b can be used as the Li-specific founding lineage for subsequent forensic application. We also found some rare lineages originated from Siberia or western Eurasia and participated in the formation of Mongolian and Hui people in North China, suggesting the extensive population admixture along the population migration between North China, Siberia and Central Asia along the silk road or ancient Trans-Eurasian cultural and population communication. Based on the phylogenetic topology, we found that these founding lineages experienced population expansion and the admixture-introduced lineages remained a limited population size in Chinese populations.

Fig. 2
figure 2

Phylogeny reconstruction among 322 Chinese individuals from Mongolian, Gelao, Hui and Li people. A Phylogenetic relationships were reconstructed based on the Y-LineageTracker. Different colors denoted the diverse populations. B The Network reconstructed via the popART showed the shared haplotypes and mutations among other terminal haplogroups. The different colors showed different populations. Different color backgrounds denoted the highlighted Y-chromosomal lineages among Chinese people

To directly explore the genetic connection among different Y-chromosomal lineages and ethnically different populations, we constructed the Network relationship among 322 Chinese males. Consistent with the identified common, rare, or low-frequency lineages in the phylogenetic topology, we observed eleven different lineages contributed to the genetic diversity of these studied populations. C2a1 lineage was unique in Mongolian, and C2b1 was dominant in Mongolian and presented some sublineages in Gelao. O1a/b was dominant in southern Chinese populations (Gelao and Li). O2a2a1/b1, O2a1a1, and O2a1a1/b1 contributed to Mongolian, Hui and Gelao. We could also identify some star-like expansion of some dominant lineages, such as C2a2b1a1a1 in Mongolian populations.

Extensive population expansion and admixture within and between Han Chinese populations and other minorities

Han Chinese is the largest ethnic group in the world and has a significant influence on the formation of the gene pool of Chinese populations [37, 38, 46,47,48]. To further explore phylogenetic topology among Han Chinese and illuminate the genetic relationship between Han Chinese and aforesaid minorities, we collected and genotyped 711 individuals via Affymetrix array, which included 690 Han Chinese from 25 geographically different populations. We reconstructed a more representative phylogeny using 1033 sequences and confirmed the Mongolian and Li-specific founding lineages (Fig. 3). We found that these two founding lineages participated in the formation of Han Chinese populations. Among Han Chinese populations, we also identified the western Eurasian-introduced R1a/1b and Siberian-mediated Q1. We identified Han Chinese dominant Y-chromosomal lineages of O2a2, O2a1, O1b1, O1a1, C2b1, C2a1, O2b1 and Q1a1, respectively, sampled from 303, 147, 38, 35, 33, 26, 26 and 18 individuals. Some sublineages were also underwent recent population expansion (O2a2b1a1a1a1a1a1a1, 69; O2a1b1a1a1a1a1a1, 38; O2a2b1a1a1a1a1b1a1b, 5; O2a2b1a1a1a1a1a1, 22; O2a2b2a1b1, 22). The observed mosaic pattern of the identified paternal lineages showed extensive gene flow among Han Chinese populations and minority groups, and the gene flow influence was bi-directional. Mongolian and Li dominant founding lineages were observed in Han Chinese individuals, suggesting that ancient Baiyue ancestors and Eurasian pastoralist people participated in the formation of Han Chinese. Han Chinese dominant lineages were also identified in Mongolian, Hui, Li and Gelao people, which supported population interaction from the paternal perspective.

Fig. 3
figure 3

Phylogenetic relationships based on the sequence diversity among 1033 Chinese males from 33 populations. Different color in a circle showed their different composition of population origins. The different color backgrounds showed different founding lineages. The haplogroup followed by the sample ID was classified using the HaploGrouper

The reconstructed merged Network confirmed the complex admixture and expansion events among Han Chinese and other non-Han Chinese populations (Fig. 4). The Mongolian and Li-specific lineages (C2a/b and O1a/b) were separated from the Han Chinese-related lineages. We could also find that Li and Mongolian dominant lineages influenced the Han Chinese gene pool as the population composition of one target haplogroup. The distribution of samples belonging to O2a1/a2 denoted that their primary ancestry was derived from Han Chinese populations and minor ancestry from Li. However, the obtained ancestral lineages from Han Chinese in Gelao, Mongolian, and Hui people were more evident than that in Li people. O2a2b1a1a1a1 and O2a1b1a1a1a1e1 were two important lineages that experienced population expansion.

Fig. 4
figure 4

The reconstructed phylogeny of Chinese populations. The map showed the primary distribution of our investigated Y-chromosomal lineages. The arrow showed the possible migration direction. Circles in the map were coded based on their language families. The line color and triangle denoted the posterior. The triangle length showed the population size and the height showed the relative divergence time. Population ID was categorized based on their ethnicity belongs

Paternal genetic connection and differentiation between East Asians and Southeast Asians

Ancient DNA from Island Southeast Asia (Nagsabaran Site) and Mainland Southeast Asia (Man Bac, Nui Nap and others) has demonstrated that both ancient Southeast Asian Hòabìnhian hunter-gatherers and Neolithic southern Chinese rice farmers contributed to the complexity of genetic diversity of Southeast Asia [49, 50]. Larena et al. recently reported the large-scale genetic diversity of geographically and ethnolinguistically diverse modern Philippine populations and illustrated more complex processes of the peopling history of Island Southeast Asia [51, 52]. Northern and southern Negritos, Manobo, Sama, Papuan, and Cordilleran-related ancestral populations contributed to the gene pool of the modern landscape of Philippines [51], and Ayta people possessed the highest level of Denisovan ancestry among Negritos and Papuan people [52]. Complex population admixture and rich genetic diversity are also reported in Mainland Southeast Asia [53, 54]. Focused on the uniparental history of Southeast Asians, Kutanan and his cooperators genotyped 2.3 MB Y-chromosome variations among ethnolinguistically diverse Southeast Asian and reported their similarities and differences between the paternal and maternal genetic history [11, 55, 56]. However, the genetic interaction between Southeast Asia and East Asia remains to be comprehensively evaluated.

To comprehensively investigate the population interaction between East and Southeast Asians, we aggregated our data with publicly available haplogroup information from 3094 people. We obtained one aggregation dataset that included 4728 individuals from 114 ethnolinguistically populations from China and Southeast Asia, including 5 Altaic-, 6 Austroasiatic-, 5 Austronesian-, 62 Sino-Tibeto-, 10 Hmong-Mien-, 26 Tai-Kadai-speaking populations (Additional file 1: Table S4). We first conducted PCA analysis based on the haplogroup frequency at the second level and found the top three components extracted 77.84% variances from the total variations. Generally, patterns inferred from Y-chromosomal haplogroup variations were more obscure than that inferred based on the whole-genome autosomal variations. Nevertheless, we can still identify that some ethnolinguistically specific people were separated from others (Fig. 5A, B). Apparent population affinity within similar language families can be identified in the pairwise Fst heatmap, where the lowest genetic distances were observed among linguistically close populations (Fig. 5C). We further explored the haplogroup composition of major lineages among 114 populations, and we found that the dominant lineages were enriched in some linguistically specific or geographically isolated populations. F lineages were observed in Lahu and Phula, and D lineages were dominant in Yi and Tibetan. O1 was dominant in Southeast Asians, and O2 was dominant in Chinese populations (Fig. 5D). We interestingly identified that Hmong people from Southeast Asia possessed the largest proportion of O2 lineage, which was consistent with the long-distance population migration and connection inferred from the genome-wide SNP data [57]. The reconstructed phylogenetic relationship based on the pairwise Fst matrix showed two major branches associated with the stratification of haplogroup composition of Southeast and East Asia (Fig. 5E). We also found that Hui people possessed complex haplogroup composition and clustered together with Han Chinese branch. Generally, our comprehensive population comparison showed the close genetic connection between Southeast Asian and southern Chinese populations and differentiation between Southeast Asian and northern Chinese groups, especially with Altaic-speaking populations.

Fig. 5
figure 5

Genetic connection and differentiation between Southeast Asians and East Asians inferred from Y-chromosome variations. A, B PCA inferred from the top three components showed the genetic connections among 114 populations. C Heatmap of pairwise Fst genetic distances among 114 populations. D Haplogroup composition of major Y-chromosomal lineage from 4727 individuals from 114 populations. E The phylogenetic relationship among 114 populations showed their genetic affinity. The branch length was not associated with the genetic drift. All populations from one family, ethnicity or geographically close regions were color-coded via one color

Discussion

The full landscape of Y-chromosomal diversity reveals complex population migration and admixture tracts

Non-crossover regions of the human Y-chromosome harbor the feature of male-specific inheritance and can record most male behavior, phenotype and human demographic details [4, 7]. To explore the patterns of Y-chromosomal diversity, we reported the genotypes of 1033 Y chromosomes randomly sampled from 33 Chinese populations belonging to five ethnic groups (Mongolian, Manchu, Hui, Gelao, Li and Han), which were genotyped using our newly-developed 639-plex Y-SNP panel and high-density Affymetrix array. We have conducted a comprehensive population evolutionary analysis and population comparison tests within and between Chinese populations belonging to different geographical regions or language families. Population genetic survey suggested that our panel captured the richest Y-chromosomal genetic diversity to date in all forensic Y-SNP genotyping tools focused on the Chinese populations [26, 27, 31, 32, 36, 58,59,60]. Phylogeny constructed among Non-Han Chinese (Mongolic, Hui and Tai-Kadai) and all included subjects consistently demonstrated that strong geography or ethnicity-related Y-chromosomal features indicated the underlying complex population evolutionary history and potential for forensic pedigree search and biogeographical ancestry inference.

HFS analysis among our studied populations or regional northern Mongolian and southern island Li people has revealed their common founding lineages of C2a/2b and O1a/1b (Fig. 1A, B). Geographical distribution further confirmed that these dominant lineages could be used as forensic markers for genetic localization. C2a1-F3914 can be used as a Mongolian predominant founding lineage, which was observed in 42 Mongolian individuals and 26 Han Chinese individuals, mainly from Shaanxi, Shanxi and Inner Mongolia. Another Mongolian predominant lineage C2b1-F2613 was observed in 16 Mongolians (16/159), two Manchu, one Li, two Hui, 32 Han (32/693) and six Gelao individuals. Upstream C2b-F1067 (previously classified as C2c) was reported first in northeastern Asia and associated with the origin and expansion of Mongolic-speaking populations. Subclades of C2b1a1a1a-M407 (C2c1a1a1 in the previous version) appeared in ten individuals (two Hans, one Hui, one Li and four Mongolians), and C2b1b1b-F5477 and C2b1a2b2-FGC45548 were also, respectively, observed seven and six times. Huang et al. presented one revised phylogenetic tree and distribution map focused on all available C2b1a1a1a-M407 samples and found that C2b1a1a1a-M407 has a frequency of over 50% in the northeastern Asian populations [61]. Thus, C2b1-F2613 and C2b1a1a1a-M407 can be used to trace the population origin and migration of Kalmyks, Mongolians, Buryats and other genetically close northeastern Asians.

Network and phylogeny constructions further showed that the lineage of O1a-M119 was the common paternal lineage in southern Chinese populations (Fig. 2). Sublineages of O1b1a1a1a1b2a1a1-F2517 (23), O1b1a2a1-F1759 (10), O1a1a1b-Z23420 (15) and O1a1a2a1a-Z23266 (11) had undergone population expansion recently. All F2517 lineages were observed in Li people, which was consistent with a recent whole-genome sequencing study [62]. Chen et al. found O1b1a1a was the dominant lineage in southern East Asians, which diverged from others 10,998 years ago, and the F2517 sublineages were further divided into O1b1a1a1a1b2a1a1a and O1b1a1a1a1b2a1a1b clades at 2828 years ago [62]. Sun et al. also found that most sublineages of M119, including our identified F1759, Z23420 and Z23266, contributed to the ancient gene pool of modern Tai-Kadai, Austronesian and southern Han Chinese [10]. We could also identify the shared paternal lineages among Li, southern Han and Gelao people in the distribution of M119 mutations. The unique paternal genetic structure of the Hainan Li people was also consistent with our previous findings of the fine-scale genetic structure [63].

Our survey has identified 195 samples of O2a1-CTS7638 and 372 samples of O2a2-P201 in Han Chinese populations. Sublineages of O2a2b1a1a1a1a1a1a1-M6539 (71), O2a1b1a1a1a1a1a1-F17 (43), O2a2b1a1a1a1a1b1a1b-MF15397 (26), O2a2b2a1b1-A16609 (24) and O2a2b1a1a1a1a1a1-F155 (23) have experienced population expansion events recently, which was consistent with population expansion among Han Chinese populations inferred from the whole-genome sequencing project. Admixture-introduced rare lineage Q1a1-F746/F790 was observed in 30 samples from Mongolian, Han, and Gelao people, and steppe pastoralist-related R1a1a1b2-Z93 was observed in 17 Mongolian and Hui people. Our results provided genetic evidence for extensive admixture between northern East Asians and surrounding populations. Similar patterns were also illuminated in the paternal genetic history reconstruction by Wang et al., who concluded that multiple ancestral sources contributed to the formation of the paternal gene pool of Mongolian people [32]. Besides, He et al. found that Mongolic-speaking populations have strong population stratifications, in which the northern one was influenced by Siberian ancestry, the western one was influenced by western Eurasian and the southern one was influenced by Han Chinese expansion [37]. Recent ancient DNA also found that western Eurasian steppe ancestry has influenced the genetic makeup of northern East Asians. For western Eurasian ancestry identified in Hui people, our paternal results were also consistent with the admixture patterns via the genome-wide SNP data. Complex demographical models suggested that geographically diverse Chinese Hui people harbored complex and different admixture processes and possessed approximately 10% ancestry related to the ancestor from Central Asians [64,65,66].

We also identified paternal genetic structure among Chinese populations in the clustering patterns via HFS, PCA and MDS. These population stratifications were in accordance with the language or geographical categories. Island Li people shared their specific paternal genetic structure and clustered far from other Chinese populations. Similarly, Mongolian people clustered together and have a relatively close genetic relationship with northern East Asians. We also should note that the Y-chromosome-based population structure in China is rougher than that inferred from the genome-wide SNPs. Our recent genetic studies have identified five population substructures correlated with languages and geography in China. Mongolic and Tungusic people in northeastern China harbored the highest Ulchi or ancient Boisman/DevilsCave-related ancestry [37, 41]. Tibeto-Burman groups from Tibetan Plateau had the highest proportion of ancestry related to core Tibet Tibetan and Nepal Chokhopani, Mebrak, and Samdzong people [67, 68]. The primary ancestral component of Hmong-Mien people from southwestern China was maximized in ancestry related to Miao and Yao people [69], and Austronesian-speaking people from Taiwan Island have more Ami/Atayal or ancient Hanben-related ancestry [41]. Han Chinese ancestry localized between the four ancestries mentioned above and showed a northern-to-southern genetic cline. The recent large-scale genetic structure also identified fine-scale population structure among geographically diverse Han Chinese populations [46, 47, 70]. Our population comparison among Southeast Asians and East Asians also identified the shared genetic material, such as O1 in ingenious southern people and O2 in Hmong-Mien people. These patterns were consistent with multiple waves of migrations from South China to Southeast Asia inferred from ancient DNA [49, 50] and previous modern DNA from Southeast Asia [11, 51, 53,54,55,56]. These fine-scale genetic backgrounds could promote better study design for large medical clinical cohorts and forensic genetic localization of crime cases.

639-plex Y-SNP panel can be used as a powerful forensic tool for Chinese forensic pedigree search and biogeographical ancestry inference

The forensic community has noticed that whole-genome sequencing in a forensic case needs to overcome specific infrastructure of the specialist, platform and genomic statistical methods, as well as the experiment method focused on the forensic case samples. Besides, the cost of one sample is another important obstacle to the wide application of whole-genome sequencing technology in forensics. Evolutionary genetic scientists have conducted many vital projects to explore the complete anthropologically-informed phylogeny [7, 19, 20]. Our work has identified most paternal founding lineages in Chinese populations and comprehensively characterized their geographical and ethnic distribution. Our panel harbored the high coverage of genetic variations of terminal lineages and complete coverage of reference data from main populations or ethnic groups in China.

Forensic pedigree search can help trace possible crime suspects based on the shared Y-chromosome mutations. Lineages informative Y-SNPs were usually used together with Y-STR markers. Many prior works provided relatively high-resolution forensic phylogenic trees and presented the corresponding scientific examination and analysis strategies. They promoted the advances of forensic Y-chromosome applications in pedigree search and biogeographical ancestry inference based on the customized SNaPshot and NGS technologies [26, 27, 31, 32, 36, 58,59,60]. Song et al. explored the paternal genetic structure of Hainan Li using their developed panel containing 141 Y-SNPs. They found that haplogroup O1b1a1a1a1a1b-CTS5854 can be used as one ethnicity-specific lineage in population and forensic genetics [59]. Song et al. further updated their panel, including 233 Y-SNPs used for Chinese Qiang people, and found that O2a2b1a1-M117, O2a2b1a1a1-F42 and O2a1b1a1a1a-F11 were the founding lineages in Qiang people [60]. Wang et al. also investigated the paternal genetic structure of Zhuang people using this panel and identified the O2-dominant lineages in Tai-Kadai people [58]. Xie et al. developed one panel focused on Hui people, which included 157 Y-SNP, and identified the population substructure of Hui people [26]. Wang et al. focused on the genetic diversity of Mongolian people (N1b-F2930, N1a1a1a1a3-B197, Q-M242 and O2a2b1a1a1a4a-CTS4658) and developed one Mongolian-specific panel included 215 Y-SNPs [32]. The panels mentioned above consisted of several customized SNaPshot systems, which limited the rapid use in forensic cases. Wang et al. developed one 165-plex Y-SNP panel based on an Ion S5 XL system. They comprehensively conducted the sequencing performance and concordance, reliability, sensitivity, and stability studies based on the ISFG guidelines [27]. Liu et al. recently updated this system by increasing the final Y-SNP number to 256 [36]. Significantly, Tao et al. developed a customized SifaMPS Y-SNP panel that included 381 Y-SNPs focused on Chinese populations and investigated the basic structure and sub-branches of Chinese major haplogroup branches [31]. Our panel included two significant features: the first one is that lineages specific to or common in most Chinese populations were included (O, D, C, R and Q et al.), and the other important one is that we retained a higher resolution of the terminal Y-chromosomal lineages, which can complete the shortcoming of previously developed panels limited to some common lineages or only focused on specific populations. The newly-developed panel overcame the limitation of the lineage's representatives, terminal lineage resolution and sequencing platforms, which can provide the best practice tool in forensic applications. The identified paternal population structure in China can provide more clues for biogeographical ancestry inference, which can be used as a complementary tool for forensic ancestry prediction based on the autosome-based ancestry informative SNP panel [64].

Conclusion

The complete landscape of human Y-chromosome variations and the gradually updated Y-chromosome phylogenetic tree with more population-specific LISNPs formed the fundamental for forensic application and evolutionary study. To overcome the shortness of the whole Y-chromosome sequencing in forensic science at the initial stage of the genome sequencing era, we developed one high-resolution 639-plex Y-SNP panel that included 639 LISNPs defined 573 terminal Y-chromosomal lineages. We generated the population data from 1033 individuals from 33 populations, including Han, Hui, Mongolian, Li and Gelao and investigated the forensic features, HFS and evolutionary processes via multiple statistical models. We identified 257 terminal Y-chromosomal lineages with several common founding lineages of O2a2b1a1a1a1a1a1a1-M6539, O2a1b1a1a1a1a1a1-F17, O2a2b1a1a1a1a1b1a1b-MF15397, O2a2b2a1b1-A16609, O1b1a1a1a1b2a1a1-F2517 and O2a2b1a1a1a1a1a1-F155. Patterns of HFS and corresponding geographical distribution illuminated that some Siberian or western Eurasian-originated paternal lineages contributed to the formation of the paternal gene pool of Mongolian, Hui and northern Han Chinese populations. Network and our reconstructed forensic phylogenic topology further illuminated the complex population divergence and expansion of different paternal lineages, which also found some ancestral lineages shared by geographically or linguistically different Chinese populations. PCA and MDS clustering patterns showed that the paternal genetic structure was correlated with the geographical and linguistic categories, which provided the basic genetic background for forensic paternal biogeographical ancestry inference. Our reconstructed revised phylogeny and comprehensive population genetic investigation based on this Y-SNP panel can provide the highest resolution of the terminal lineage and genetic diversity, which provides one panel with high marker coverage and lineage representation. Ethnolinguistically diverse Chinese populations had the highest genetic diversity. Thus, anthropologically-informed Y-chromosome whole-genome sequencing will promote the further development of higher-resolution Y-SNP NGS panels and corresponding population-specific dataset construction. We also emphasized that the large-scale population cohorts, such as 10,000 Chinese Person Genomic Diversity Project (10K_CPGDP) and 100 K-GSRDWCH (100 K genome sequencing of rare disease), can provide more unreported LISNPs for forensic application of Y-chromosome.

Materials and methods

Sample collection and DNA extraction

We collected peripheral blood samples from 1033 unrelated Han, Tibetan, Hui, Gelao, Manchu and Li individuals from 33 geographically different regions (Fig. 1A). Each donor provided written informed consent. The medical ethics boards at Sichuan University and North Sichuan Medical College have approved our study protocol. Our experiments followed the recommendations and regulations of our institute and national guidelines of standards of the Declaration of Helsinki [71]. PureLink Genomic DNA Mini Kit (Thermo Fisher Scientific, Waltham, USA) was used to extract the genomic DNA. Based on the official manufacturer's guidelines, Quantifiler Human DNA Quantification Kit (Thermo Fisher Scientific) and 7500 Real-time PCR System (Thermo Fisher Scientific) were used to quantify the DNA quantity and finally reserved in a low-temperature environment.

Marker composition and NGS-based panel development

We chose the final included markers based on the following five rules to present a full-scale Chinese Y-chromosome diversity and high resolution of each terminal lineage. First, all major clade lineages recorded in the International Society of Genetic Genealogy (ISOGG) Y-DNA Haplogroup Tree 2019–2020 version 15.73 (https://isogg.org/tree/index.html) and Yfull databases focused on Chinese populations were included. Second, key mutations included in our previously developed panel and validated in the population genetic studies were included [27, 36]. Third, we determined the terminal mutations based on the population-scale HFS according to the revised phylogenetic tree in the whole-genome sequencing projects. Based on the public data from the expanded 1000 Genomes Project cohort [3], Human Genetic Diversity Project (HGDP) [2], Simons Genome Diversity Project [72], Estonian Biocentre Human Genome Diversity Panel (EGDP) [73] and 10K_CPGDP and others, we have built one in-house Y-chromosome population database with detailed HFS distribution and estimated divergence times (will be published soon). We choose the terminal mutations with a frequency larger than 5%. Fourth, based on the representatives of the included individuals and haplogroup coverage, we estimated the divergence times of each branch based on the localization of the mutations in the revised phylogeny. We included the mutations with a divergence time older than 500 years. We included 1000 Y-SNPs for the final primer design. We finally developed the Y-SNP NGS panel based on the MGISEQ-2000RS (MGI Tech Co., Ltd., Shenzhen, Guangdong, China) sequencing platform, which has been formally validated based on the SWGDAM guidelines (paper in preparation).

Genotyping and quality control

We sequenced 639 Y-SNPs in 322 samples from Mongolian populations in three geographically different regions (154 males), Hui people from Wuzhong (50 males), Gelao people from Daozhen (59 males) and Li people (59 males) from Hainan, using our developed 639-plex Y-SNP panel on the MGISEQ-2000RS sequencing platform. In each sequencing run, positive control of 2800M (Promega, Madison, WI, USA) and negative ddH2O control were used. To provide one comprehensive comparative database, we also genotyped 639 Y-SNPs in 711 Han, Mongolian and Manchu individuals using an Affymetrix array. We used the PLINK v1.90b6.26 64-bit (2 Apr 2022) to conduct quality control on the merged dataset based on the missing SNP rate and missing genotyping rate with two parameters (–geno: 0.1 and –mind: 0.1) [74]. We finally kept a dataset of 639 Y-SNPs in 1033 unrelated individuals from 33 populations in the following forensic effectiveness evaluation and evolutionary history reconstruction.

Classifying NRY haplogroups

We manually classified the NRY haplogroups for sequencing data based on the predefined phylogenetic tree with the chosen mutation markers. And then, we merged the sequencing data and the chip-based data and used the python package of hGrpr2.py instrumented in HaploGrouper to classify the haplogroups [75]. All branches in the haplogroup tree (treeFileNEW_isogg2019.txt) and ISOGG SNP file (snpFile_b38_isogg2019.txt) were used in the HaploGrouper-based haplogroup classification. We additionally used the chip version (–chip) in the LineageTracker to classify the NRY haplogroups based on the GRCH38 reference genome [76].

Haplogroup frequency spectrum estimation and clustering analysis

We calculated the HFS at different levels of the terminal haplogroups. We estimated the geographical distribution via a pie chart in the map and a heatmap based on the HFS matrix. Followingly, we conducted PCA based on the HFS matrix. We used the top three components extracted from the total variations to cluster our studied populations. We also calculated the pairwise Fst genetic matrix based on the HFS and conducted the MDS to explore the genetic affinity between included populations.

Phylogeny analysis for NRY haplogroups

We used the LineageTracker to construct the phylogenetic topology of all individuals [76]. Tools for variant calling and manipulating VCFs and BCFs (Bcftools Version: 1.8) and PLINK v1.9 were used to convert the vcf files to fasta files. We used BEAUti to convert fasta files into XML files and ran the BEAST analysis using BEAST2.0 [77]. Tracer v1.7.2 was used to evaluate the power of statistical parameters and TreeAnnotator was used to choose the best trees in the BEAST results [77]. We finally used FigTree v1.4.4 to visualize and organize the phylogenetic tree [77].

Network analysis for Y-SNP haplotype data

Python package of fasta_to_nexus_Main.py (https://github.com/rubenAlbuquerque/fasta_nexus_converter/blob/master/fasta_to_nexus/Main.py) was used to generate the nexus files. We used popART to construct the Network relationship. Here, five Network models were used to build the phylogenetic relationship among different lineages, including Minimum spanning network, Median Joining Network, Integer NJ Net, Tight Span Walker and TCS Network [78]. AMOVA was conducted to explore the genetic similarities and differences between or within groups and populations. Nucleotide diversity was also estimated using the popART [78].