Introduction

Mitochondrial DNA (mtDNA) analysis has become a routine approach in forensic casework where STR markers cannot be used. Estimating the frequency of an observed mtDNA profile in the relevant population depends on the availability of suitable population datasets. MtDNA data from Thailand have already been published [1,2,3,4,5]. However, these datasets are either limited to SNPs in the mitochondrial hypervariable regions 1 and 2 [1] or were collected with a different priority, such as language [2,3,4]. Furthermore, the one forensic study on a population from Thailand was restricted to the northern province of Chiang Mai [5]. Given that the population of Thailand is composed of different ethnolinguistic groups [6], a regional population sample cannot be considered representative of the whole country. In this paper, we present a population dataset of 213 individuals living in all four major regions of Thailand.

Materials and methods

Samples

Hair samples were obtained from 213 unrelated individuals of both sexes living in southern, central, northern, and northeastern Thailand (Fig. 1, Supplementary Table S1). The samples were collected from volunteer donors and anonymized. Written informed consent was obtained from all participants. Ethical approval for mtDNA sequencing analysis was given by the Ethics Committee of the University of Freiburg, Germany (398/16).

Fig. 1 Schematic map of Thailand. The four major regions of Thailand and the number of samples collected from each region are shown

DNA extraction, amplification, and sequencing

Total DNA of 10–12 hairs per individual was extracted with the MagCore® Genomic DNA Tissue Kit (RBC Bioscience, New Taipei City, Taiwan) at the University of Khon Kaen (Khon Kaen, Thailand). PCR and sequencing of the entire control region were performed at the University Medical Center Freiburg – University of Freiburg (Freiburg, Germany) as described in [7] using primers given in Supplementary Tables S2 and S3.

Sequence analysis and haplogroup assignment

A total of 213 complete double-stranded control region sequences were successfully analyzed and aligned to the revised Cambridge Reference Sequence (rCRS) [8, 9] using Sequencher V5.2.2 (GeneCodes, Ann Arbor, MI, USA). Alignment was done according to the revised and extended guidelines of the ISFG [10]. Haplogroups were assigned with Haplogrep 2 (www.haplogrep.uibk.ac.at) [11] and EMPOP v4/R12 (www.empop.online) [12] based on PhyloTree, build 17 [13]. Assignments were revised manually and conservative estimates of most recent common ancestors (MRCA) were chosen.

MtDNA data quality was controlled using the EMPOP tool NETWORK [14]. Further quality control was done by the team of EMPOP at Medical University of Innsbruck. All 213 sequences are incorporated into the EMPOP database under the accession number EMP00699.

Statistical analysis

Intra- and inter-population statistical analysis was done using Arlequin v3.5.2.2 [15]. Numbers of different and unique haplotypes were counted and genetic diversity indices of the population (random match probability, haplotype diversity, number of polymorphic positions, mean number of pairwise differences, and nucleotide diversity) were calculated. Length variants at nucleotide positions 16193, 309, and 573 were ignored for statistical tests. Random match probability was calculated as the sum of squared haplotype frequencies.
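
The counting and diversity calculations described above can be illustrated with a minimal sketch. It is not the Arlequin implementation. It assumes that each haplotype is given as a list of differences to the rCRS, that length variants are written as insertions such as "309.1C", and that haplotype diversity is the common unbiased estimator n/(n-1) × (1 − Σp²); the function names and the toy data are hypothetical.

```python
from collections import Counter

# Positions whose length variants are ignored, as stated in the text.
# Assumption: length variants are written as insertions such as "309.1C".
LENGTH_VARIANT_POSITIONS = {"16193", "309", "573"}


def strip_length_variants(haplotype):
    """Drop insertions at the ignored positions; keep all other differences."""
    kept = []
    for variant in haplotype:
        position, dot, _ = variant.partition(".")
        if dot and position in LENGTH_VARIANT_POSITIONS:
            continue
        kept.append(variant)
    return frozenset(kept)


def diversity_summary(haplotypes):
    """Count haplotypes and compute RMP and haplotype diversity."""
    counts = Counter(strip_length_variants(h) for h in haplotypes)
    n = sum(counts.values())
    # Random match probability: sum of squared haplotype frequencies.
    rmp = sum((c / n) ** 2 for c in counts.values())
    return {
        "n": n,
        "different": len(counts),
        "unique": sum(1 for c in counts.values() if c == 1),
        "rmp": rmp,
        # Assumed estimator: n / (n - 1) * (1 - sum of squared frequencies).
        "haplotype_diversity": n / (n - 1) * (1 - rmp),
    }


# Hypothetical toy data: the first two profiles differ only by 309.1C
# and therefore collapse into one haplotype.
toy_sample = [
    ["16223T", "263G", "309.1C", "315.1C"],
    ["16223T", "263G", "315.1C"],
    ["16189C", "263G", "315.1C"],
    ["73G", "263G", "315.1C"],
]
summary = diversity_summary(toy_sample)
print(summary)
```

In this toy sample the four profiles reduce to three different haplotypes, two of which are unique.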

We compared our data with six other Southeast and East Asian populations from recent studies [5, 16,17,18,19,20], considering only control region data (np 16024–576). We performed an analysis of molecular variance (AMOVA) and calculated genetic diversity indices for the additionally included studies, pairwise differences between and within populations, and pairwise FST-values.
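
Because the mitochondrial genome is circular, the range np 16024–576 wraps past the origin of the rCRS numbering. The following sketch shows one way to restrict haplotypes to that range before comparison. It assumes the rCRS length of 16,569 bp and a notation in which each difference starts with its nucleotide position; the function names are hypothetical.

```python
from itertools import takewhile

RCRS_LENGTH = 16569  # length of the rCRS in base pairs


def in_range(position, start=16024, end=576, genome_length=RCRS_LENGTH):
    """True if a 1-based position lies in a range that may wrap past the origin."""
    if not 1 <= position <= genome_length:
        raise ValueError(f"position {position} is outside 1..{genome_length}")
    if start <= end:
        return start <= position <= end
    return position >= start or position <= end


def range_length(start=16024, end=576, genome_length=RCRS_LENGTH):
    """Number of nucleotide positions covered by the range."""
    if start <= end:
        return end - start + 1
    return (genome_length - start + 1) + end


def restrict_to_range(haplotype, start=16024, end=576):
    """Keep only differences whose leading position falls inside the range."""
    kept = []
    for variant in haplotype:
        digits = "".join(takewhile(str.isdigit, variant))
        if digits and in_range(int(digits), start, end):
            kept.append(variant)
    return kept


print(range_length())  # 1122 positions in the control region
# Hypothetical full-genome profile reduced to its control region part.
print(restrict_to_range(["16223T", "73G", "750G", "1438G", "309.1C"]))
```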

Results and discussion

We obtained 213 high-quality mtDNA control region sequences from Thailand to establish reference data (Supplementary Table S1). Summary statistics are presented in Table 1. Of the 170 different haplotypes, 146 were unique. The population sample had a random match probability of 0.87% and a haplotype diversity of 0.9960 ± 0.0013, indicating high heterogeneity in the population and making the dataset useful for forensic analyses.
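
As a plausibility check, the reported summary values can be related to each other with a few lines of arithmetic. The sketch assumes that "unique" means a haplotype observed exactly once in the sample and that haplotype diversity follows the common estimator n/(n-1) × (1 − RMP); because the reported random match probability is rounded, only approximate agreement can be expected.

```python
n_samples = 213
n_haplotypes = 170
n_unique = 146
rmp = 0.0087  # 0.87%, as reported (rounded)

# Assumed estimator linking haplotype diversity to the random match probability.
haplotype_diversity = n_samples / (n_samples - 1) * (1 - rmp)

# Haplotypes observed more than once, and the samples that carry them.
shared_haplotypes = n_haplotypes - n_unique
samples_in_shared = n_samples - n_unique

print(f"haplotype diversity from RMP: {haplotype_diversity:.4f}")
print(f"{shared_haplotypes} haplotypes occur more than once, "
      f"covering {samples_in_shared} samples")
```

Under these assumptions the derived haplotype diversity agrees with the reported value of 0.9960 to within rounding, and 24 haplotypes account for the 67 samples that do not carry a unique haplotype.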

Table 1 Molecular diversity indices for mtDNA control region data of a population sample from Thailand

We compared the detected haplotypes with those of six earlier studies of Southeast and East Asian populations including one with samples from Northern Thailand [5, 16,17,18,19,20]. A total of 44 of the 170 haplotypes (25.9%) were found in at least one other population (Supplementary Table S4). Accordingly, 126 haplotypes (74.1%) of our study were not observed in the other studies, including the most common haplotype of our study (9 samples) (cf. Supplementary Table S1).
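
A comparison of this kind can be sketched as a set intersection. The example assumes that all haplotypes have already been reduced to the same sequence range and notation, and that two haplotypes count as shared only if they match exactly; the function name and the toy data are hypothetical.

```python
def shared_haplotype_summary(study, references):
    """Count study haplotypes that also occur in at least one reference set.

    study: iterable of haplotypes, each an iterable of differences to the rCRS.
    references: dict mapping a population name to an iterable of haplotypes.
    """
    study_set = {frozenset(h) for h in study}
    seen_elsewhere = set()
    for haplotypes in references.values():
        seen_elsewhere |= study_set & {frozenset(h) for h in haplotypes}
    different = len(study_set)
    shared = len(seen_elsewhere)
    return {
        "different": different,
        "shared": shared,
        "private": different - shared,
        "shared_pct": 100 * shared / different,
    }


# Hypothetical toy data with four study haplotypes and two reference sets.
toy_study = [
    ["16223T", "263G"],
    ["16189C", "263G"],
    ["73G", "263G"],
    ["16304C", "263G"],
]
toy_references = {
    "Population A": [["16223T", "263G"], ["16129A", "263G"]],
    "Population B": [["263G", "16223T"], ["150T", "263G"]],
}
result = shared_haplotype_summary(toy_study, toy_references)
print(result)
```

Applied to the counts reported here, 44 of 170 haplotypes correspond to 25.9% shared and 126 of 170 to 74.1% not observed elsewhere.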

Haplogroup composition

The 213 samples from Thailand were assigned to 85 different haplogroups (Supplementary Table S1). Some sequences could not be assigned to a terminal branch of PhyloTree [13] and were instead assigned to their MRCA, such as macrohaplogroup M (8.9%). The most frequently assigned haplogroups were B5a (9.4%), F1a1a (8.9%), and M (8.9%). All samples belong to macrohaplogroups R (50.7%), M (39.4%), and N (9.4%), except for one sample, which could only be assigned to L3 as MRCA.

In general, the haplogroup composition is comparable with other Southeast Asian population samples with F1, B5, and M being the most frequent lineages [2, 4, 5, 16,17,18,19].

Genetic distances between Southeast and East Asian populations

We compared the genetic structure of our population sample from Thailand with the six other Southeast and East Asian population samples [5, 16,17,18,19,20]. The total number of samples was 1789. Analysis of molecular variance (AMOVA) revealed that 98.06% of the genetic variation is due to differences within populations. Thus, only 1.94% of the total genetic variance is caused by differences between populations (Table 2). Bodner et al. (2011) found an inter-population variance of only 0.84% in a very similar dataset, but considered only the hypervariable regions (HVS-I and HVS-II) and a regionally restricted population [16].

Table 2 Analysis of molecular variance (AMOVA) of 7 Southeast and East Asian populations (analyzed range: np 16024–576). AMOVA design and results

The mean number of pairwise differences (MPD) in the Thai population is 12.86, which is in the range observed in other Asian populations (Table 3). The lowest MPD value (11.75) was observed in a South Korean population, the highest (13.42) in Northern Thailand.
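
For illustration, the MPD of a sample can be computed by averaging the number of differing sites over all pairs of individuals. This is a minimal sketch, not the Arlequin procedure: it assumes haplotypes are given as differences to the rCRS, treats every insertion or deletion as a single site, and gives all differences equal weight; the function names and toy data are hypothetical.

```python
from itertools import combinations


def as_site_map(haplotype):
    """Map each site to its observed state, e.g. "16223T" -> {"16223": "T"}."""
    site_map = {}
    for variant in haplotype:
        # Split off the trailing non-digit characters as the state,
        # so "309.1C" becomes site "309.1" with state "C".
        i = len(variant)
        while i > 0 and not variant[i - 1].isdigit():
            i -= 1
        site_map[variant[:i]] = variant[i:]
    return site_map


def pairwise_differences(a, b):
    """Number of sites at which two haplotypes differ."""
    map_a, map_b = as_site_map(a), as_site_map(b)
    sites = set(map_a) | set(map_b)
    return sum(1 for site in sites if map_a.get(site) != map_b.get(site))


def mean_pairwise_differences(haplotypes):
    """Average number of differences over all pairs of individuals."""
    pairs = list(combinations(haplotypes, 2))
    if not pairs:
        raise ValueError("at least two haplotypes are required")
    return sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy sample of three individuals.
toy_sample = [
    ["16223T", "263G"],
    ["16223T", "73G", "263G"],
    ["263G", "309.1C"],
]
mpd = mean_pairwise_differences(toy_sample)
print(mpd)
```

In the toy sample the three pairs differ at 1, 2, and 3 sites, giving an MPD of 2.0.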

Table 3 Mean number of pairwise differences (MPD) between seven Southeast and East Asian populations (analyzed range: np 16024–576)

Pairwise FST-values between population samples were relatively low and similar, indicating a close relationship between the populations. Higher values were found between the Southeast Asian populations (Thailand, Northern Thailand, Laos, Northern Vietnam, and Myanmar) and South Korea (FST 0.037–0.051), whereas the genetic distance between Hong Kong and South Korea was low (FST 0.016). All FST-values between Thailand and the other population samples were significant (Table 4).

Table 4 Population pairwise FST-values of seven Southeast and East Asian populations (analyzed range: np 16024–576)

Heteroplasmy

We observed a total of 27 point heteroplasmies at 14 different positions in 26 hair samples (12.2%) (Supplementary Table S5). In an earlier study including 691 hair shaft samples, the frequency of point heteroplasmy was 11.4% [21]. Based on other studies of Southeast and East Asian populations using various types of tissue, the percentage of samples with point heteroplasmy was calculated as follows: Northern Thailand 2.6% (blood), Laos 3.7% (blood), Hong Kong 8.5% (blood), Myanmar 8.6% (blood), South Korea 10.3% (blood and buccal swabs), and Northern Vietnam 15.0% (buccal swabs) [5, 16,17,18,19,20]. These values are consistent with the observation that mtDNA heteroplasmy frequency depends on the analyzed tissue, as described in [7]. However, it has to be considered that low-level heteroplasmic positions are not detected by Sanger sequencing.
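
The heteroplasmy counts above can be tallied as in the following sketch. It assumes that point heteroplasmies are recorded with two-base IUPAC ambiguity codes at the end of a difference (for example "16093Y" for a C/T mixture); the function names, sample identifiers, and toy data are hypothetical.

```python
# Two-base IUPAC ambiguity codes: R (A/G), Y (C/T), K (G/T), M (A/C), S (C/G), W (A/T).
IUPAC_TWO_BASE_CODES = set("RYKMSW")


def point_heteroplasmies(haplotype):
    """Return the differences that end in a two-base ambiguity code."""
    return [v for v in haplotype if v and v[-1] in IUPAC_TWO_BASE_CODES]


def heteroplasmy_summary(samples):
    """Summarize point heteroplasmies over a dict of sample id -> haplotype."""
    total = 0
    positions = set()
    affected = 0
    for haplotype in samples.values():
        found = point_heteroplasmies(haplotype)
        total += len(found)
        positions.update(v[:-1] for v in found)
        if found:
            affected += 1
    return {
        "total": total,
        "positions": len(positions),
        "affected_samples": affected,
        "affected_pct": 100 * affected / len(samples),
    }


# Hypothetical toy data: one sample carries two point heteroplasmies.
toy_samples = {
    "S1": ["16093Y", "263G"],
    "S2": ["73G", "263G"],
    "S3": ["152Y", "16093Y", "263G"],
    "S4": ["263G"],
}
het = heteroplasmy_summary(toy_samples)
print(het)
```

With the counts reported here, 27 heteroplasmies in 26 samples imply that one sample carries two heteroplasmic positions, and 26 of 213 samples correspond to 12.2%.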

Conclusion

The sample of 213 mtDNA control region sequences will serve as a high-quality mtDNA reference for Thailand. Most of the detected haplotypes were not observed in the datasets used for comparison and will complement the available data from Northern Thailand [5] and other Southeast Asian populations.

Recently, 960 complete mtDNA genomes from Thailand, originally sequenced to investigate anthropological questions [4], were also incorporated into the EMPOP database [14]. The growing body of Southeast Asian mtDNA sequence data is leading to a reliable forensic reference for this region.