Complete mitochondrial genomes of Thai and Lao populations indicate an ancient origin of Austroasiatic groups and demic diffusion in the spread of Tai-Kadai languages

The Tai-Kadai (TK) language family is thought to have originated in southern China and spread to Thailand and Laos, but it is not clear if TK languages spread by demic diffusion (i.e., a migration of people from southern China) or by cultural diffusion, with native Austroasiatic (AA) speakers switching to TK languages. To address this and other questions, we obtained 1,234 complete mtDNA genome sequences from 51 TK and AA groups from Thailand and Laos. We find high genetic heterogeneity, with 212 haplogroups. TK groups are more genetically homogeneous than AA groups, with the latter exhibiting more ancient/basal mtDNA lineages, and showing more drift effects. Modeling of demic diffusion, cultural diffusion, and admixture scenarios consistently supports the spread of TK languages by demic diffusion. Surprisingly, there is significant genetic differentiation within ethnolinguistic groups, calling into question the common assumption that there is genetic homogeneity within ethnolinguistic groups.

[...] generally considered to have arisen in southeast China prior to 2.5 kya and then spread to SEA between 1-2 kya 11-12. Although archaeological and linguistic evidence point to an expansion from southern China, physical anthropological studies indicate that the present-day Thai people resemble ancient people 13 as well as modern AA people in northern Thailand 14. Therefore, there are two competing hypotheses concerning the origin of the modern Thai/Lao TK people: (1) a demic expansion of people from southern China that brought their genes, culture, and language to Thailand/Laos; or (2) a cultural diffusion from southern China that resulted in native AA peoples adopting the TK [...]

For the 1,234 mtDNA genome sequences obtained, there are 761 distinct sequences (haplotypes) belonging to 212 haplogroups (Supplementary Table 1 [...]

The multidimensional scaling (MDS) analysis (Fig. 2a-b) revealed that in the third dimension AA and TK groups tended to be separated; this separation was more apparent when three outliers were excluded (Fig. 2c- [...]

[...] such classifications the among-population component of the variance is higher than the among-group component (Table 1). Moreover, the Mantel test for the correspondence between genetic and geographic distances between populations is not significant for any of the types of geographic distance tested (great circle distance: r = 0.03, P = 0.31; least cost path distance: r = 0.04, P = 0.30; and resistance distance: r = -0.65, P = 0.75). Thus, the genetic structure of the Thai/Laos populations is more complicated than would be predicted from either linguistics or geography.
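To make the logic of the Mantel test concrete, the following is a minimal sketch of a one-sided permutation version, assuming both inputs are symmetric square distance matrices over the same populations. It is not the software used in this study; the function names, the number of permutations, and the toy data are all illustrative assumptions.

```python
import math
import random


def upper_triangle(matrix):
    """Return the entries above the diagonal of a square matrix as a flat list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length lists (undefined if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mantel_test(genetic, geographic, permutations=999, seed=0):
    """One-sided permutation Mantel test (illustrative sketch).

    Population labels of the genetic matrix are shuffled (rows and columns
    together) and the correlation with the geographic matrix is recomputed.
    The P value is the share of permutations, plus the observed arrangement,
    whose correlation is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n = len(genetic)
    geo_flat = upper_triangle(geographic)
    observed = pearson(upper_triangle(genetic), geo_flat)
    at_least_as_large = 1  # the observed arrangement counts as one permutation
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        permuted = [[genetic[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(permuted), geo_flat) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / (permutations + 1)


# Toy data (hypothetical): five populations on a line, with genetic distance
# set to exactly twice the geographic distance.
positions = [0, 1, 3, 6, 10]
geo = [[abs(a - b) for b in positions] for a in positions]
gen = [[2 * d for d in row] for row in geo]
r, p = mantel_test(gen, geo)
print(r, p)
```

In this toy case the correlation is 1 by construction, so only a small fraction of permutations reach it. With values such as those reported above (r close to zero, P around 0.3), most permutations would produce a correlation at least as large as the observed one, which is why the test is not significant.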

Greater genetic homogeneity among the TK populations was also reflected in the haplotype sharing analysis (Supplementary Table 3), which showed that they shared more haplotypes than the AA populations. In particular, the various KM populations shared a number of haplotypes, as [...]

Significant genetic differentiation within ethnolinguistic groups
Surprisingly, we observed striking and significant genetic differences between populations classified as the same ethnolinguistically but sampled from different locations. This can be seen in the MDS analysis (Fig. 2a-b), in which two of the three most extreme outliers are from the same ethnolinguistic group, namely two of the three AA-speaking H'tin groups, TN1 and TN2 (the third outlier is the SK, a TK-speaking group from northeastern Thailand). In fact, the MDS analysis shows that in many cases populations from the same ethnolinguistic group are not genetically similar. This is further indicated by an AMOVA for each separate ethnolinguistic group that was sampled from multiple locations (Table 1); in all such instances, the among-populations variance [...]

[...] B4b1a2c and B4b1a2d 28) and Oceania (e.g., B4a1a1a 29) were not found in our study, in agreement with previous studies 26, 30. Overall, the lack of sharing of recent sublineages indicates a lack of recent contact between MSEA and ISEA (Supplementary Fig. 4).

Finally, the more extensive sampling of Thai/Laos mtDNA sequences in this study has resulted in much deeper ages for some haplogroups that were poorly sampled in previous studies. For example, we estimate that haplogroups R9b and R22 both coalesce at ~39 kya (Fig. 4) [...] Thailand is also a potential source for these haplogroups (Supplementary Fig. 5).

[...] Table 5). In sum, these results confirm the reliability of the posterior probabilities of the models.

[...] (Table 1). It appears that this heterogeneity arises from various sources. In the hill tribes, such as the Lawa and H'tin, isolation and drift due to geography and cultural constraints (e.g., [...]

[...] Table 4). NJ trees (based on Φst) were generated with MEGA 7 61.

An ABC procedure was employed to choose the best-supported hypothesis about the maternal origins of the Thai and Laotian populations. Owing to the different local histories specific to each region, three different mtDNA data sets from the TK and AA, as well as prior parameters (e.g., divergence times), were used in the simulation process (Fig. 7). As the origin time of [...] Thailand has been recorded 64. One potential scenario is that the IS (IS1-IS4) diverged from the LA (LA1-LA2) without any genetic contact with the KH (KH1-KH2); a second scenario is that the IS did admix with the KH after diverging from the LA. Although an origin of the IS from the KH is unlikely, we also investigated this scenario.
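As a rough illustration of how ABC model choice compares competing scenarios, the sketch below implements a simple rejection scheme: simulate summary statistics under each scenario, keep the simulations closest to the observed statistics, and read the posterior probability of each scenario off its share of the retained simulations. Everything specific here is a hypothetical assumption: the toy simulators, the uniform priors, the two summary statistics, and the observed values are stand-ins, not the demographic models, priors, or software used in this study. Real analyses would also standardize the summary statistics before computing distances, which is omitted here.

```python
import math
import random


def make_toy_simulator(low, high, noise=0.05):
    """Build a HYPOTHETICAL simulator: draw one parameter from a uniform
    prior on [low, high], then return two noisy summary statistics."""
    def simulate(rng):
        theta = rng.uniform(low, high)
        return (theta + rng.gauss(0, noise), theta + rng.gauss(0, noise))
    return simulate


def abc_rejection_model_choice(simulators, observed, n_sims=6000, n_keep=200, seed=1):
    """Estimate model posterior probabilities by rejection sampling.

    Each simulation picks a model with equal prior probability, simulates
    summary statistics, and records the Euclidean distance to the observed
    statistics. The n_keep closest simulations are retained, and each model's
    posterior probability is its proportion among them.
    """
    rng = random.Random(seed)
    names = sorted(simulators)
    scored = []
    for _ in range(n_sims):
        name = rng.choice(names)
        stats = simulators[name](rng)
        scored.append((math.dist(stats, observed), name))
    scored.sort()  # smallest distance first
    retained = [name for _, name in scored[:n_keep]]
    return {name: retained.count(name) / n_keep for name in names}


# Three toy scenarios (labels borrowed from the text, dynamics invented).
toy_models = {
    "demic": make_toy_simulator(0.0, 0.3),
    "admixture": make_toy_simulator(0.3, 0.7),
    "cultural": make_toy_simulator(0.7, 1.0),
}
posterior = abc_rejection_model_choice(toy_models, observed=(0.1, 0.1))
print(posterior)
```

Because the toy observed statistics sit well inside the range of the "demic" simulator, nearly all retained simulations come from that model. The number of retained simulations (`n_keep`) is a tuning choice: a smaller value gives a closer match to the observed data but a noisier estimate.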

The simulated datasets were generated by the software package ABCtoolbox 65. The posterior probabilities were calculated by employing two different approaches, AR 66 and LR 67. The former approach considers only a certain number of "best" simulations, and then simply counts the proportion of those retained simulations that were generated by each investigated