Background

Austronesian is one of the most important linguistic families, spread across most regions of Island Southeast Asia, the Pacific Ocean, and the Indian Ocean, and comprising more than one fifth of all the languages in the world [1]. This linguistic family was originally proposed by Murdock [2] by bringing two groups of speakers, i.e. Malayo-Polynesians (Island Southeast Asians (ISEA), Malagasy, Micronesians, and Polynesians) and Taiwan aborigines, together as a monophyletic unit based on their linguistic similarity [3, 4]. Later, Benedict found that another linguistic family in East Asia, Daic, has many resemblances with the so-called Austronesian, and therefore proposed a super-phylum of Austro-Tai [5]. Daic is a linguistic family located to the north of the ISEA groups, mainly in South China. Some Daic populations spread to Laos, Thailand, and as far as India [1]. Substantial resemblances among Taiwan aborigines, Malayo-Polynesians, and Daic speakers have been reported by ethnologists [6–10] and linguists [11–15], linking Taiwan aborigines and Malayo-Polynesians to coastal populations in Southeast China, primarily Daic speakers and their ancestors, the Baiyue.

The origin of Austronesian has always been a controversial subject in linguistics and other related fields. The Express Train Hypothesis, a well-accepted linguistic theory on the origin of Austronesian [3, 4, 16, 17], postulates that proto-Austronesians originated in Taiwan and began to expand southward about 5,000–6,000 years ago by way of the Philippines and Eastern Indonesia. They eventually navigated eastward to Micronesia and Polynesia, and westward to Western Indonesia and Madagascar. The 'express train' refers to a rapid dispersal across the present Austronesian range starting from Eastern Indonesia. The hypothesis of the Taiwan origin of all the Austronesians (Taiwan Homeland Hypothesis or THH hereafter) is primarily based on the observation that a much higher linguistic diversity exists among languages of Taiwan aborigines than among the Malayo-Polynesians [3, 4]. However, some linguists found evidence against the THH, and suggested that Kalimantan or Sulawesi may be the homeland of Austronesian [15, 18, 19]. The THH was further challenged by ethnologists [6–9], archaeologists [10], and geneticists [20–25].

Genetic evidence has been equally controversial. Some mitochondrial DNA (mtDNA) studies suggested a Taiwan origin of Polynesians [20–22]. A recent mtDNA study on Taiwan aborigines found a root of the "Polynesian Motif" in Taiwan, which suggests that the THH may be confirmed in maternal lineages [26]. On the other hand, this theory was challenged in paternal lineages by the Y-Chromosome studies that showed a lack of resemblance between the Polynesians and Taiwan aborigines [23]. It was also challenged by other mtDNA studies, which suggest an Indonesian origin of Polynesians [24, 25]. The conflicts in the genetic evidence can be attributed to the lack of evidence or populations from two crucial regions: (1) coastal populations in Southeast Asia ancestral to three Austronesian groups (Taiwan aborigines, ISEA, and Polynesians), and (2) ISEA populations including Indonesians from which Polynesians derived.

Another important factor in the genetic structure of Austronesians is that Eastern Austronesians are distinctly different from Western Austronesians (ISEA and Taiwan aborigines, Figure 1). Autosomal STR variation studies [27] revealed a pronounced genetic division between Polynesians and Western Austronesians. These studies suggest that the Polynesians might have undergone natural selection or have been admixed with Melanesians, a process that changed their genetic structure [16, 20, 28]. There is also the possibility of genetic drift and founder effects during the dispersal of Polynesians. The genetic structure of Western Austronesians, especially that of the ISEA, is more pivotal to the origin of Austronesians (Figure 1). The high Y chromosome diversity of Indonesian populations, Bali and Sumba islanders, suggests that these populations have existed since the Palaeolithic age [29, 30]. Because of this high genetic diversity, it appears that the ISEA, especially the Indonesians, are not just of Taiwanese origin.

Figure 1

Geographic distribution of sampled populations and migration routes suggested by Y chromosome analysis. The codes for the population samples are the same as those in Table 1. Green arrows indicate expansion of Daic; blue arrows, Taiwanese; orange arrows, ISEA. The origin of Polynesians, purple arrows, remains controversial in paternal lineages.

Here, we examined the THH of ISEA by studying the Y chromosome diversity of all relevant population groups such as that of the Daic, Indonesians, and Taiwan aborigines. We show that the paternal lineages of both ISEA and Taiwan aborigines derived from the Daic, although independently of each other. In addition, our findings indicate that it is unlikely that Taiwan is the homeland of the paternal lineages of the ISEA populations.

Results and Discussion

To determine the genetic affinity between the Daic populations and the Western Austronesians, we typed twenty single nucleotide polymorphisms (SNPs) and seven short tandem repeats (STRs) in the non-recombining region of 1,509 Y chromosomes sampled from 30 Daic populations, 23 ISEA populations, and 11 Taiwan aboriginal populations (see Figure 1 for locations of the populations and Table 1 for population information). Almost all of the Daic populations in China and all of the Taiwan aboriginal populations were sampled in this study.

In addition, principal component (PC) analysis of 134 East Asian populations encompassing all linguistic groups in East and Southeast Asia was performed using the frequencies of haplogroups defined by SNPs. The result showed that Daic populations are closer to the Western Austronesian groups than any other East and Southeast Asian populations are (Figure 2), indicating a strong genetic affinity between Daic speakers and Western Austronesians. The separation of the Daic-ISEA-Taiwan cluster from the other ethnic groups is attributable to PC2 rather than to PC1, and O1a* is the haplogroup that shows the strongest correlation with PC2 (r2 = -0.875, P < 10⁻⁴; see Additional file 1 for details). Furthermore, O1a-M119 is the dominant haplogroup in Taiwan aborigines (average 77%), ranging from 54% to 100% (Table 2, sum of O1a* and O1a2). This lineage is also highly prevalent in Daic speakers (20.5%) and in ISEA (21.2%), but not in the other East Asians (< 5%) [23, 31–34]. Therefore, O1a-M119 is expected to provide much information for delineating the relationship between the Daic and Western Austronesians.

The PC plot of Figure 2 indicates that some Daic populations are close to the Sino-Tibetan cluster. It is possible that Daic and Sino-Tibetan populations have a common ancestry, which might have resulted in their genetic resemblance. However, another explanation for this observation is that Daic populations in mainland East Asia may have been influenced by Han Chinese genetically as they coexisted as neighbors since around 2,500 years ago. Admixture analysis can estimate the proportions of assumed Daic or Han ancestry in the present Daic populations, and some Daic populations isolated from Han Chinese can be used as the parental population in this admixture analysis. Aboriginal populations on Hainan Island (Hlai, Jiamao, and Cun) and Taiwan Island are assumed to have been relatively isolated, as their cultures were little influenced by the exotic cultures on the mainland. Therefore, the genetic structures of these island aborigines might be the closest to that of ancestral Daic [35].

Table 1 Classification, population, and location information of the populations sampled in this study
Table 2 Y-SNP haplogroup frequencies of the newly studied samples (%)
Figure 2

Principal component plot of Y-SNP. (A) PC plot of all the population samples. DC (green stars) is closest to MP (purple crosses) and TA (blue crosses). All of the other groups including ST, HM, AA, and AT (red spots including triangles, squares and diamonds) are rather far removed from MP and TA, which indicates that DC is the only group that might be related to MP and TA. (B) PC plots of pooled samples. The ST, HM, AA, and AT samples were pooled according to the linguistic families. The DC samples were pooled according to the sub-families. MP and TA samples were pooled according to the geographic locations. Ethnic groups: AA, Austro-Asiatic speakers; AT, Altaic speakers; DC, Daic speakers; HM, Hmong-Mien speakers; MP, Malayo-Polynesian speakers; ST, Sino-Tibetan speakers; TA, Taiwan aborigines.

To estimate the assumed genetic influence of Han Chinese on the mainland Daic, we applied the Y SNP data of mainland Daic, Hainan aborigines, Taiwan aborigines, and Han Chinese [34] to our admixture analysis. For this analysis, we set the latter three pooled populations as the parental populations of mainland Daic. Our results show that the genetic contribution of the Hainan aborigines is very high (2.145 ± 0.927), while those of the Han Chinese (-0.314 ± 0.422) and Taiwan aborigines (-0.831 ± 0.662) are hardly detected. Here, the negative contributions estimated by the ADMIX program suggest that these two groups made no detectable contribution to the present Daic populations. This result indicates that the paternal lineages of Daic populations are relatively undisturbed, and the genetic affinity between Daic and Western Austronesian populations has hardly been influenced by population admixture.

The ISEA populations may also be admixed. In our study, we assumed that the ISEA were an admixture of three potential parental populations: Daic populations, Taiwan aborigines, and the indigenous populations of the Sunda Islands, who are similar to Papuans. We performed an admixture analysis on the Indonesians, and included data of the Papuans from the literature [36, 37] as one of the parental population structures in the analysis. Our analysis showed the following admixture proportions: Daic (0.713 ± 0.124), Taiwan (0.143 ± 0.125), and Papuans (0.144 ± 0.050), indicating that Daic ancestry makes the dominant contribution to the Indonesians. There is some uncertainty in these data, as our assumption that the ISEA population is an admixture cannot be tested.

As the haplogroup O1a* is the most characteristic haplogroup of the Daic and Western Austronesian populations, we estimated pairwise genetic divergence between Daic, Indonesians, and Taiwan aborigines using seven STRs carried by O1a* individuals (see Table 3 for genetic distances and Additional file 2 for STR raw data). Our study shows that the divergence between Taiwan aborigines and Indonesians is the largest, about three times that between the Daic group and Taiwan aborigines. The divergence between the Daic group and Indonesians is comparable to that between the Daic group and Taiwan aborigines. These findings indicate that the Indonesians and Taiwan aborigines are genetically closer to the Daic group than the two Western Austronesian groups are to each other. Furthermore, the diversity based on the seven STRs carried by O1a* individuals is higher in the Daic speakers than in Indonesians and Taiwan aborigines (Table 3). The highest diversity does not always mark the oldest population; it can also result from admixture with neighbouring populations. However, the high diversity of the O1a* haplogroup in the Daic speakers most likely reflects the greater age of the population, as this haplogroup is almost absent in the neighbouring populations, so admixture could not have added diversity. Taking the results of diversity and divergence together, the Daic population group is likely the ancestral group from which the Indonesians and Taiwan aborigines derived separately in paternal lineages. Other haplogroups of Y chromosomes (e.g. O3-M122, O2a-M95) displayed a pattern similar to that of O1a*, showing that the Daic group is genetically closer to Indonesians and Taiwan aborigines than these latter two groups are to each other (Table 3). Interestingly, O2a may be traced even further to Austro-Asiatic populations, as suggested by a recent study [38].

Table 3 Y-STR diversity of O1a, O2a, and O3 haplogroup

A median-joining network was constructed based on 7-STR haplotypes of O1a* individuals in the three ethnic groups (Figure 3). If THH of ISEA is true, i.e., ISEA primarily derived from Taiwan aborigines, one would expect sharing and/or connections of ISEA lineages and Taiwan aboriginal lineages in the network. In Figure 3, Daic lineages (green nodes) constitute the center of the network. All ISEA lineages (yellow nodes) and Taiwan aboriginal lineages (blue nodes) are either shared or connected to one of the Daic lineages, either directly or indirectly. In contrast, none of the Taiwan aboriginal lineages (except for one) are shared with or connected to the ISEA lineages. These observations suggest that ISEA did not directly derive from Taiwan aborigines but that the ISEA and Taiwan aborigines derived from the Daic independently of each other.

Figure 3

Haplotype network of Y-STRs of Haplogroup O1a* individuals. As the original network was too complicated to display, here we presented the shortest tree of the largest possibility reduced from the network (this function is available in the recent versions of NETWORK program). Each node represents an O1a* STR haplotype. The lengths of the lines are proportional to the mutation steps. The broken line stands for only one step. The sizes of the nodes are proportional to their frequencies. Almost none of the ISEA haplotypes is directly linked to Taiwan aborigines, and both ISEA and Taiwanese are linked directly or indirectly to the Daic haplotypes holding the centre of the network (big green node).

We further examined the Daic lineages that are connected to ISEA lineages in the network. Interestingly, most of the Daic haplotypes connecting to the ISEA are either from Hainan Island or from Guangxi, which lies to the northwest of Hainan (green nodes with dark green frames in Figure 3). These Hainan and Guangxi populations are located around the Gulf of Tonkin. In particular, Cham, a Malayo-Polynesian population in South Vietnam, as well as Tsat in Hainan, which is a subgroup of Cham [11, 39], were found to connect Daic and Indonesians in the network. Therefore, we hypothesized that the ISEA likely originated in the area around the Gulf of Tonkin, and migrated southward through the Indochina Peninsula to the Malay Peninsula before they spread to most of the islands of the Pacific Ocean and the Indian Ocean.

The age of the O1a* haplogroup was estimated in the network. The total age is 33765 ± 5221 years, which corresponds to the last Ice Age. The age of all the Daic samples in the network is 33193 ± 5577 years, close to the age of O1a*. It is not easy to estimate the real age of the Taiwan clusters as they overlap with the Daic haplotypes to a large extent. This kind of overlap also indicates multiple migrations from Daic populations to Taiwan aborigines. We estimated the age of the Taiwan cluster in the left side of the network to be 14659 ± 3110 years. The estimated age of all the Taiwan samples is 21268 ± 3148 years. Interestingly, this latter age is close to the age of the oldest human remains found in Taiwan, those of the Chochen Man [40]. Therefore, we conclude that the migration of O1a* individuals from the mainland to Taiwan Island occurred during the Palaeolithic Age.

Because two fairly specific clusters of ISEA haplotypes can be observed in the network, we performed time estimates in both clusters. The age of the left ISEA cluster in the network is 9895 ± 2393 years, whereas that of the right cluster is 25880 ± 7137 years. The linguistic estimate for the origin of the Malayo-Polynesian, around 5000–6000 years ago [16], is younger than our estimates. Moreover, little overlap between Daic haplotypes and ISEA haplotypes is observed in the network, which indicates that bottleneck effects might have formed the two ISEA clusters during the emigration of ISEA populations out of the ancestral Daic populations. Geographically, the bottleneck might be the narrow seashore of Vietnam. Therefore, the O1a* haplogroup was most probably introduced into ISEA populations during the origin of the Malayo-Polynesians more than 7500 years ago. However, the possibility of recent migrations of the O1a individuals into ISEA cannot be ignored, because the genetic time estimate is not precise enough to eliminate such a possibility.

It should be noted that, in the Express Train Hypothesis, there are two different aspects: 1) the origin of the migrations, i.e. the Taiwan Homeland Hypothesis, and 2) the mode of migrations, i.e., a rapid dispersal starting from Indonesia. In this study, we examined the THH in Western Austronesians by including the Daic speakers and ISEA, both of which are largely missing in previous studies. We show that Taiwan is not likely the homeland of Indonesian ISEA, at least not for the major paternal lineages. Although both Taiwan aborigines and Indonesian ISEA derived from the Daic, their departures occurred separately, suggesting that the major paternal lineages of Western Austronesian populations are not monophyletic.

Interestingly, the spread of the domestic pig in the Southeast Asia archipelago and the Pacific took place in almost the same way as that of Western Austronesian populations suggested by our study. The pigs in Taiwan and in regions as far as Micronesia came directly from the mainland of East Asia, while those in the Southeast Asian archipelago and Polynesia came from the Indochina Peninsula. It is assumed that the domestic pig was introduced by human populations during early migrations, which would imply that humans also entered the Southeast Asia archipelago and the Pacific by two different routes [41].

In fact, our observations are consistent with a monophyletic Austro-Tai super-phylum which contains Daic speakers, Malayo-Polynesians, and Taiwan aborigines [5]. The observations presented in this study demonstrate that it is absolutely necessary to include Daic populations and ISEA in the Austronesian origin studies. Without these groups, Polynesians and Taiwan aborigines would have appeared most similar to each other, leading to the conclusion that all the Austronesians originated in Taiwan.

Our results suggest that the Gulf of Tonkin is more likely the homeland of the paternal lineages of ISEA. Due to the complex nature of population migrations from Eastern Indonesia to the Pacific Islands [23, 42–47], and the pronounced genetic division between Eastern and Western Austronesians [27], we opted not to include Polynesian data in our analysis. Instead, we only analyzed Western Austronesians. The absence of O1a-M119 in Polynesian populations is intriguing, and it cannot be explained simply by invoking the bottleneck effect [21–25], given that a great deal of diversity of Y chromosome haplotypes has been observed in Polynesians [23, 42].

Consistent with our findings for paternal lineages, mitochondrial DNA studies on populations from Peninsular Malaysia also suggest an ancestry of aboriginal Malays in Indochina around the time of the Last Glacial Maximum [48]. This ancestry subsequently dispersed through the Malay Peninsula into island Southeast Asia [48]. The ISEA mtDNA studies also indicated that if an Austronesian migration from Taiwan did take place, it was demographically minor [49].

Most of our conclusions are based on the analysis of O1a*, which is only a fraction of the Y-chromosome lineages found in these populations. The frequency of this group of lineages is remarkable in Taiwanese populations, but it is not so dramatic in Malayo-Polynesians or Daic populations. It is possible that some population events could have involved other Y-chromosome lineages. It is also plausible that other minor parts of the paternal lineages have different origins, such as aboriginal populations of Indonesia prior to the formation of Austronesian, or that more recent migrations from South Asia took place [29]. The genetic relationships among East and Southeast Asians are much more complicated than expected.

Conclusion

Our results show that the Daic populations are closer to the Western Austronesian populations in paternal lineages than any other ethnic groups in East Asia are. The STR diversity of the Y chromosome haplogroup O1a-M119, the major haplogroup among the Daic and Western Austronesian populations, shows that Taiwan and ISEA, two groups of Western Austronesian, derived from the Daic independently of each other. Therefore, it is most likely that the ISEA populations mainly originated in the region around the Tonkin Gulf, the homeland of the Daic, and migrated to Indonesia through the Vietnam corridor. In contrast, the Taiwan aborigines migrated from mainland China directly. Our results indicate that a super-phylum, which includes Taiwan aborigines, Daic, and Malayo-Polynesians, can be deduced from genetic evidence.

Methods

Sampling

Blood samples from 30 Daic populations across South China were collected using FTA cards (Whatman® Inc), covering almost all of the Daic populations in China. Those from 11 Taiwan aborigine populations were collected from both the lowlands and the highlands of Taiwan. Samples from 23 Malayo-Polynesian populations were collected, among which 21 were collected across Indonesia, 1 from Binhdinh of Vietnam, and 1 from Hainan of China. The sample sizes from each population are given in Table 2. All of the 1,509 individuals studied from these populations are unrelated and gave their consent for this study. Individual samples were taken from diverse regions of each population's distribution area to make the sample more diverse. Reference data for 70 other groups in East and Southeast Asia were obtained from the literature (including some Daic speaking populations [23], Malayo-Polynesians [23], Taiwan aborigines [23], Tibeto-Burman speaking populations [31–33], Han Chinese [31, 34], and Altaic speaking populations [31]), for a total reference sample size of 1,348 individuals. In PC analysis, these samples refer to a total of 134 different population groups, including newly typed and previously published populations.

Although the sample sizes of some populations were relatively small, we do not think it is necessary to enlarge these sample sizes, as they were collected from very small populations with low Y chromosome diversity, such as the Ai-Cham and Geelvink Irians. The effective population size of the Y chromosome is usually less than one fourth of the size of that of autosomes. Therefore, Y chromosome diversity studies require much smaller sample sizes than studies of autosomal genetic markers. For a normal size population of some hundred thousand, a sample of around 30 individuals will be sufficient. Even fewer samples are required for small populations. Here we maintained a sample size of around 30 for most of the populations, and around 15 for small populations.

Genetic markers

Twenty bi-allelic Y-chromosome markers (SNP), YAP, M15, M130, M89, M9, M5, M122, M134, M7, M117, M121, M111, M17, M175, M119, M110, M95, M88, M45, and M120 were typed by PCR-based restriction-fragment length polymorphism methods [31]. Most of these markers are highly informative in East Asians and define 19 haplogroups following the Y Chromosome Consortium nomenclature [50].

Seven microsatellite markers (STR) on Y-chromosome, DYS19, DYS388, DYS389-1, DYS390, DYS391, DYS392, and DYS393 were typed using fluorescent-labelled primers [51]. The genotyping results are given in Additional file 2.

Data analysis

Population relationships were investigated with principal component analyses using Y-chromosome haplogroup frequencies and SPSS11.0 software (SPSS Inc.). Some of the SNPs, such as M175 and M117, were not typed for the previously published populations; therefore, our O*-M175 data were combined into haplogroup K, and O3a5a-M117 into O3a5*, in our PC analysis. Correlation analysis among haplogroups and PCs was also conducted using SPSS11.0.
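
The PC step described above can be illustrated with a minimal sketch. This is not the SPSS procedure used in the study: it assumes a covariance-based PCA computed by singular value decomposition, and the frequency table `toy`, the function names, and the four haplogroup columns are hypothetical values chosen only for illustration.

```python
import numpy as np


def pca_scores(freqs, n_components=2):
    """Return PC scores and explained-variance ratios for a
    populations x haplogroups frequency matrix (covariance-based PCA)."""
    X = np.asarray(freqs, dtype=float)
    Xc = X - X.mean(axis=0)                      # centre each haplogroup column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]


def pc_haplogroup_correlation(freqs, scores, hap_index, pc_index):
    """Pearson correlation between one haplogroup's frequencies and one PC."""
    X = np.asarray(freqs, dtype=float)
    return float(np.corrcoef(X[:, hap_index], scores[:, pc_index])[0, 1])


# Hypothetical frequencies (NOT data from this study).
# Rows: populations; columns: O1a*, O2a, O3, other.
toy = [
    [0.60, 0.10, 0.20, 0.10],
    [0.55, 0.15, 0.20, 0.10],
    [0.05, 0.10, 0.60, 0.25],
    [0.05, 0.05, 0.65, 0.25],
]
scores, explained = pca_scores(toy)
r_o1a_pc1 = pc_haplogroup_correlation(toy, scores, hap_index=0, pc_index=0)
```

The sign of each PC is arbitrary in this decomposition, so only the magnitude of a haplogroup-PC correlation is meaningful when comparing runs.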

The admixture analysis was performed using the ADMIX 2.0 program [52] in order to evaluate the genetic influence of Han Chinese on the Daic populations. We assumed the potential admixture started 2,500 years ago, when the Qin army entered the Daic area in Canton. The admixture proportions of the Indonesians were also estimated by ADMIX 2.0, and the admixture history was assumed to start 5,000 years ago.
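
To illustrate how an admixture estimator can return contributions outside the range 0 to 1, as reported for the mainland Daic in the Results, the following is a minimal sketch. It is not the ADMIX 2.0 algorithm: it assumes a plain least-squares fit of the hybrid haplogroup frequencies as a weighted sum of parental frequencies, with weights constrained only to sum to one. All frequencies and names are hypothetical.

```python
import numpy as np


def admixture_proportions(parent_freqs, hybrid_freqs):
    """Least-squares admixture weights that sum to 1 (no 0..1 bounds).

    parent_freqs: parents x haplogroups frequency table.
    hybrid_freqs: haplogroup frequencies of the admixed population.
    """
    P = np.asarray(parent_freqs, dtype=float)
    h = np.asarray(hybrid_freqs, dtype=float)
    ref = P[-1]                       # last parent absorbs the sum-to-one constraint
    A = (P[:-1] - ref).T              # haplogroups x (parents - 1)
    b = h - ref
    m_head, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.append(m_head, 1.0 - m_head.sum())


# Hypothetical parental frequencies (NOT data from this study).
parents = np.array([
    [0.6, 0.2, 0.2],
    [0.1, 0.7, 0.2],
    [0.1, 0.1, 0.8],
])

# A hybrid that is a true mixture of the parents is recovered exactly.
mixed = 0.7 * parents[0] + 0.2 * parents[1] + 0.1 * parents[2]
m_mixed = admixture_proportions(parents, mixed)

# A hybrid lying outside the range spanned by the parents yields
# a weight above 1 and negative weights for the other parents.
outside = np.array([0.8, 0.1, 0.1])
m_outside = admixture_proportions(parents, outside)
```

Under this simplified model, a negative weight means that the admixed population is more extreme than the first parent in the direction away from the other parents, which is consistent with reading such values as no detectable contribution.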

The genetic distances among Daic, Taiwan aborigines, and Malayo-Polynesians were estimated by RST and linearized RST [53] using ARLEQUIN software [54], and the diversities of three groups were evaluated by average gene diversity, haplotype diversity [55], and variance of the STR allele sizes [56].
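
Two of the diversity measures named above can be sketched briefly. This is an illustration under stated assumptions, not the ARLEQUIN implementation: haplotype diversity is assumed to follow Nei's unbiased estimator, and the STR variance is assumed to be the sample variance of repeat counts averaged over loci. The two-locus haplotypes in `toy` are hypothetical.

```python
from collections import Counter
import statistics


def haplotype_diversity(haplotypes):
    """Nei's unbiased haplotype diversity: n/(n-1) * (1 - sum of p_i squared)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    counts = Counter(tuple(h) for h in haplotypes)
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - homozygosity)


def mean_allele_size_variance(haplotypes):
    """Sample variance of repeat counts at each locus, averaged over loci."""
    loci = list(zip(*haplotypes))     # transpose: one tuple of alleles per locus
    return sum(statistics.variance(locus) for locus in loci) / len(loci)


# Hypothetical two-locus STR haplotypes (repeat counts), NOT study data.
toy = [(14, 12), (14, 12), (15, 12), (16, 13)]
h_div = haplotype_diversity(toy)
v_mean = mean_allele_size_variance(toy)
```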

A Median-Joining network of O1a* STR haplotypes was drawn by Network 4.1 software (Fluxus Technology Ltd). The age of O1a* was estimated in the network. The mutation rate used in the time estimate is 1.932 × 10⁻⁴ per year, the sum of the mutation rates [57] of all the STRs used in the network. We assumed 25 years for one generation.
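
The arithmetic of the time estimate can be sketched as follows, assuming that the age is obtained by dividing a rho-type statistic (the average number of single-repeat mutational steps separating the sampled haplotypes from an assumed root haplotype) by the summed mutation rate. The Network program computes rho along the branches of the network; the direct stepwise distance used here is a simplification, and the haplotypes and function names are hypothetical.

```python
MU_PER_YEAR = 1.932e-4        # summed STR mutation rate per year, from the text
YEARS_PER_GENERATION = 25     # generation time assumed in the text


def rho(root, haplotypes):
    """Mean stepwise distance (in repeat units) from the root haplotype."""
    total = sum(
        sum(abs(a - b) for a, b in zip(root, hap)) for hap in haplotypes
    )
    return total / len(haplotypes)


def age_years(rho_value, mu_per_year=MU_PER_YEAR):
    """Convert rho into an age in years under a constant mutation rate."""
    return rho_value / mu_per_year


# Hypothetical three-locus haplotypes (NOT data from this study).
root = (14, 12, 13)
sample = [(14, 12, 13), (15, 12, 13), (14, 13, 14)]
rho_toy = rho(root, sample)
age_toy = age_years(rho_toy)

# Equivalent summed rate per generation under the 25-year assumption.
mu_per_generation = MU_PER_YEAR * YEARS_PER_GENERATION
```

Under these assumptions, the total O1a* age of 33765 years quoted in the Results corresponds to a rho of about 6.5 mutational steps.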