Introduction

For many years, the DNA markers for forensic practice have been short tandem repeat (STR) loci that are highly polymorphic with different numbers of repeat units at each locus (Budowle et al. 1998). Over the years, the numbers of standard STR loci have increased and the similarities of the different commercial panels and those in different countries have increased (Butler and Hill 2012; Schumm et al. 2013; Guo et al. 2014; Novroski et al. 2019). Other types of markers have been proposed starting with SNPs especially in the early 2000s (cf., Pakstis et al. 2007 for early studies). Early forensic studies of SNPs were focused on individual identification (Sanchez et al. 2006; Pakstis et al. 2007), on panels of SNPs for inferring population ancestry (e.g., Shriver et al. 2004; Tishkoff and Kidd 2004; Phillips et al. 2007), and on SNPs for phenotype (e.g., Lamason et al. 2005; Walsh et al. 2011; Walsh et al. 2013). Several commercial panels of SNPs have been introduced, some of which combine SNPs with STRs, for analysis using Massively Parallel Sequencing (MPS). MPS has also allowed the further development of a new type of genetic marker, microhaplotypes (Kidd et al. 2013,2014).

Microhaplotypes (microhaps, MHs) have been defined as small genomic regions of less than ~ 300 bp with two or more polymorphisms, usually SNPs, resulting in at least three common haplotypes in the population (cf., review in Oldoni et al. 2019). They were first proposed as potentially highly informative and useful genetic markers for forensics, anthropology, and biomedical research generally. Their desirable characteristics include multiple alleles with high heterozygosity and low mutation rates. Since then, MHs have been studied by many researchers with clear demonstration of their potential for forensic, medical, and anthropologic applications (Bulbul et al. 2018; Chen et al. 2018; Kidd et al. 2018a,2021; Cheung et al. 2019; Phillips et al. 2019; Puente et al. 2020a; Puente et al. 2020b), but they have not yet been incorporated into routine forensic casework.

Although conceived of for use with MPS, the original studies which were designed to evaluate the potential for microhaplotypes (Kidd et al. 2013,2014) used TaqMan to type individual SNPs and then PHASE (Stephens et al. 2001) to determine the genotypes and haplotype frequencies. The SNPs that were chosen to study were those of at least modest frequency (5–10%) in some populations, those that were not in complete LD with another, and those for which TaqMan assays were available. Other factors could be included in selection of SNPs if different ultimate objectives were favored (Kidd and Speed 2015; Kidd et al. 2018a). While MPS was not used in these exploratory studies, it was clear that the existence of MPS was what made study of microhaplotypes relevant. Now, there have been several studies that have used MPS successfully to study panels of microhaplotypes on multiple individuals and/or populations (Turchi et al. 2019; Chen et al. 2018; Oldoni et al. 2019; Bennett et al. 2019; Gandotra et al. 2020; Puente et al. 2020a; Puente et al. 2020b; Kureshi et al. 2020; Wu et al. 2021a; Wu et al. 2021b).

We previously presented a panel of 90 microhaplotypes evaluated using data for 26 populations extracted from the 1000 Genomes (1 KG) project data (1000 Genomes Consortium Project 2015) as well as data on 155 individuals from four other populations studied using multiplex microhaplotype sequencing (mMHseq) of all 90 loci (Gandotra et al. 2020). The 90 loci had a high overall effective number of alleles (Ae) in the 30 populations studied (average Ae > 5.08). Analyses of frequency variation among populations showed that some of the loci had significant variation among populations. To be of maximal value in forensics as well as in other areas of research, a panel of loci needs a broad set of reference population frequencies. To that end, we have now assembled and analyzed sequence-based data on 4009 individuals in 79 populations for these 90 microhaps. These results also demonstrate the value of microhaplotypes for biomedical and anthropologic studies of human populations.

Materials and methods

Population samples

The 4009 individuals in 79 populations studied (Table 1) include 524 individuals in 16 populations that we have typed by MPS (Table 2). The DNA for the individuals sequenced was purified using phenol–chloroform from lymphoblastic cell lines that are part of the Kidd Lab collection. Greater detail of the population samples can be found in ALFRED (alfred.med.yale.edu). Comparable data for 3485 other individuals in 63 populations were curated from public repositories: the Human Genome Diversity Project (HGDP), which includes individuals sequenced from the Kidd Lab collection of world population samples (see Table 2 and Bergstrom et al. 2020); the Genome Asia database (Genome Asia100 K Consortium 2019); and the 1000 Genomes Project (1000 Genomes Consortium Project et al. 2015).

Table 1 The 79 populations
Table 2 Summary of individuals from Kidd lab populations included in analyses after sequencing and passing QC

The 536 sequenced individuals included 155 individuals sequenced previously (Gandotra et al. 2020) and 381 individuals that were sequenced and passed quality control steps in this study (Table 2). Twelve individuals were also successfully sequenced from other groups including 4 samples from Southern Tunisia in this study and 6 Euro-Americans and 2 Chinese from Taiwan in Gandotra et al. (2020). These were excluded from statistical analyses, because the sample sizes were too small. They will be included in future studies as more sequenced samples accumulate. The data from all samples sequenced are available on the Scharfe lab mMHseq website (see Data Availability).

Data collection

The 90 microhaplotype loci and the primers for MPS are described in Gandotra et al. (2020) (cf. Table S2 in that paper), as are the detailed mMHseq methods. Table 3 provides an overview of key characteristics of the 90 microhaps. The mMHseq libraries of 48 individuals and two non-template water controls were sequenced in a single Illumina MiSeq run for all 90 microhaplotypes. This number of samples per run assures that each sample receives sufficient sequence read coverage based on the assay’s empirically established performance parameters. Data analysis included sample demultiplexing, primer trimming, read alignment to the human reference genome (hg19/GRCh37), data quality control (QC), DNA variant calling with GATK UnifiedGenotyper (GATK UG), and phasing of the SNPs along each microhaplotype using the read-backed phasing tools from GATK (McKenna et al. 2010). Following identification of variants at each of the 90 MH loci in the 536 individuals using mMHseq, base calls at the same variant sites were extracted for 3485 individuals from various whole-genome sequencing (WGS) repositories.

Table 3 Microhaplotype list sorted by chromosome and initial SNP in upstream nt-position (build GRCh37)

Data analyses

Effective number of alleles (Ae) is a measure that standardizes the information among diverse populations for their different frequencies among the multiple alleles (Kimura and Crow 1964; Kidd and Speed 2015). Ae for a locus is calculated as the inverse of homozygosity, $A_e = 1/\sum_i p_i^2$, where $p_i$ is the frequency of allele $i$ at the locus. As such, it is the number of equally frequent alleles that would yield the same heterozygosity as the observed set of alleles with diverse frequencies. This measure is good for evaluating multiallelic loci (such as microhaplotypes) for individualization and mixture analysis. Informativeness (In) for measuring allele frequency differences among populations was calculated according to Rosenberg et al. (2003). This measure is appropriate for evaluating loci for their ability to infer population ancestry of an individual and relationships among populations.
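The two statistics can be illustrated with a minimal Python sketch. The function names and the input layout (plain lists of allele frequencies) are assumptions made for this illustration and are not taken from the study's own software. The In function follows the published definition in Rosenberg et al. (2003), $I_n = \sum_j \left( -\bar{p}_j \ln \bar{p}_j + \frac{1}{K} \sum_i p_{ij} \ln p_{ij} \right)$, where $p_{ij}$ is the frequency of allele $j$ in population $i$ and $\bar{p}_j$ is its unweighted mean over the $K$ populations.

```python
import math


def effective_alleles(freqs):
    """Ae = 1 / sum(p_i^2) for one locus in one population."""
    homozygosity = sum(p * p for p in freqs)
    return 1.0 / homozygosity


def informativeness(pop_freqs):
    """Rosenberg's In for one locus.

    pop_freqs is a list with one entry per population; each entry is a
    list of allele frequencies given in the same allele order
    (hypothetical input layout chosen for this sketch).
    """
    k = len(pop_freqs)
    n_alleles = len(pop_freqs[0])
    total = 0.0
    for j in range(n_alleles):
        # Unweighted mean frequency of allele j over the K populations
        p_bar = sum(pop[j] for pop in pop_freqs) / k
        if p_bar > 0:
            total -= p_bar * math.log(p_bar)
        for pop in pop_freqs:
            if pop[j] > 0:  # 0 * ln(0) is taken as 0
                total += pop[j] * math.log(pop[j]) / k
    return total
```

Four equally frequent alleles give Ae = 4, and two populations fixed for different alleles give In = ln 2, the maximum possible for two populations; identical frequencies in every population give In = 0.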

For the extracted data that were not phased in the respective repositories, the haplotypes were inferred using PHASE version 2.1.1 (Stephens et al. 2001; Stephens and Scheet 2005). For all of the QC passed samples, the phasing was obtained directly from the reads for each of the MH loci.

Structure, PCA, and population trees

To help visualize clustering of individuals in populations, we used version 2.3.4 of the STRUCTURE software (Pritchard et al. 2000). The program was run under the standard admixture model assuming correlated allele frequencies. The input data consisted of the microhaplotype genotype profiles for all individuals in the 79 populations. The program was run 20 times at each K level from K = 2 to K = 16 with 10,000 burn-in and 10,000 Markov Chain Monte Carlo (MCMC) iterations. The result with the highest likelihood of the 20 runs was selected to illustrate the results for a given K value.
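The selection of one representative run per K can be sketched as follows. This is only an illustration: it assumes the log-likelihood of each run has already been read from the program output into tuples, and the tuple layout and function name are hypothetical.

```python
def best_run_per_k(runs):
    """Pick the highest-likelihood run for each K.

    runs is an iterable of (k, run_id, ln_likelihood) tuples
    (hypothetical layout; parsing of the program output is not shown).
    Returns a dict mapping k to (run_id, ln_likelihood).
    """
    best = {}
    for k, run_id, ln_likelihood in runs:
        if k not in best or ln_likelihood > best[k][1]:
            best[k] = (run_id, ln_likelihood)
    return best
```

With 20 runs at each of K = 2 to K = 16, the input would hold 300 tuples and the result 15 entries.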

To help visualize clustering of populations, we used Principal Component Analyses (PCA). We used the XLSTAT 2019 software (http://www.xlstat.com/en/about-us/company.html) on the matrix of haplotype allele frequencies for all 90 microhaplotype loci in the populations relevant to each analysis.

We also conducted tree analyses for the 79 populations using pairwise Tau genetic distances (Kidd and Cavalli-Sforza 1974) and methods and logic described in Kidd and Sgaramella-Zonta (1971) and Cherni et al. (2016). Analyses started with the Neighbor Joining tree (Saitou and Nei 1987), which gives an approximate Least Squares fit, and then explored similar tree structures by an exact Least Squares fit to the defining set of linear equations. The Neighbor Joining (NJ) program employed is part of the PHYLIP software package (Felsenstein 1989,2009). The Drawtree utility (version 3.69) in the PHYLIP package was used to plot the postscript images of the best population trees.

Results

mMHseq data analysis and quality control

Assay performance was assessed using our algorithms for monitoring sequence read coverage on three levels: samples, amplicons (loci), and sequence bases (Fig. S1, Table S2). Any sample that failed this QC was removed from further analysis. The first QC metric (sample coverage), defined as the number of reads per sample, was used for detecting samples that failed in the multiplex PCR. The average read depth across 384 samples was 705,536 reads per sample. Eight out of 384 samples had a read depth of less than 150,000 reads and were flagged for further analysis of amplicon and base coverage (Table S2 and Fig. S1). The second QC metric (amplicon coverage) was used to identify samples with partially failed amplification, such as individual amplicons that may have been insufficiently covered despite an overall normal read count for that sample. For each sample, we obtained the number of amplicons that had > 0.2-fold the mean amplicon coverage and used a threshold of 2 standard deviations below the mean to flag samples for review. This metric identified 4 samples with poor amplicon uniformity (Table S2 and Fig. S1). The third QC metric (base pairs) assessed base coverage for each sample, reasoning that if base coverage was sufficiently high, even samples with lower amplicon uniformity could be analyzed further. Five samples had a lower base coverage (< 75% of bases with 100 × reads per nucleotide per amplicon). Three samples failed QC at all three levels and were removed from the analysis, while the other samples flagged in one of the three QC steps yielded interpretable results in sequence analysis. Thus, final analyses are based on data for 381 individuals (Tables 2 and S2). Additionally, we investigated the data for MH genotypes that could have been due to allele dropouts. We found 4 MH alleles that were each present only as a homozygous MH genotype in a single individual (a different sequenced individual for each allele), and the two inferred copies were the only occurrences of those alleles in the whole dataset; these genotypes were therefore removed from the analyses.
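The three-level coverage check can be sketched in Python as below. This is a simplified illustration, not the algorithm used in the study. It makes three assumptions: the per-sample read count is taken as the sum of the per-amplicon reads, the "mean amplicon coverage" is the mean within each sample, and the 2-standard-deviation cutoff is applied to the per-sample counts of adequately covered amplicons. The data layout and all names are hypothetical.

```python
from statistics import mean, stdev

MIN_READS = 150_000        # sample-level threshold quoted in the text
MIN_FOLD = 0.2             # amplicon must exceed 0.2-fold the mean coverage
MIN_BASE_FRACTION = 0.75   # fraction of bases required at >= 100x


def covered_amplicon_count(amplicon_reads):
    """Number of amplicons above 0.2-fold the sample's own mean coverage."""
    threshold = MIN_FOLD * mean(amplicon_reads)
    return sum(1 for reads in amplicon_reads if reads > threshold)


def qc_flags(samples):
    """Return {sample_name: set of failed QC levels}.

    samples maps a name to {"amplicon_reads": [...], "frac_bases_100x": float}
    (hypothetical layout). At least two samples are needed for stdev().
    """
    counts = {name: covered_amplicon_count(s["amplicon_reads"])
              for name, s in samples.items()}
    values = list(counts.values())
    cutoff = mean(values) - 2 * stdev(values)
    flags = {}
    for name, s in samples.items():
        failed = set()
        if sum(s["amplicon_reads"]) < MIN_READS:
            failed.add("sample")
        if counts[name] < cutoff:
            failed.add("amplicon")
        if s["frac_bases_100x"] < MIN_BASE_FRACTION:
            failed.add("base")
        flags[name] = failed
    return flags


def samples_to_remove(flags):
    """Samples failing all three levels, as described for the removed samples."""
    return sorted(name for name, failed in flags.items() if len(failed) == 3)
```

In this sketch a sample flagged at only one or two levels is kept for review rather than removed, mirroring the text's statement that such samples still yielded interpretable results.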

We estimate that each genotype call was based, on average, on 7067 reads. That number is the average of the sequencing reads per locus (amplicon) in the last five sequencing runs, each of which involved sequencing of 48 individuals. Thus, sequencing of a total of 240 individuals contributed to this number. These are the right-most 62% of the reads in supplemental Fig. S1. Some variation in read numbers occurred among the five runs considered, but the variation in reads per locus was consistent; the distribution of the number of loci by the number of reads is given in Fig. S2. We note that except for 13 loci, there were more than 500 reads per allele per locus per individual. Only one locus, mh01KK-001, averaged fewer than 100 reads per allele with 75.3 reads per allele. In general, coverage per locus exceeds the clinical exome sequencing standard of 80×. It is unclear whether the differences in reads per locus per individual are inherent to the locus or are inherent just to the sequence or concentration of the specific primer pair used for the sequencing. A future effort will be made to better balance across loci to assure a higher minimum number of reads for all loci.

In summary, the mMHseq 90-plex data for the sequenced individuals from 16 populations are available at the Scharfe lab mMHseq website and have also been deposited in the Zenodo archive (see Data Availability). Our previous study (Gandotra et al. 2020) identified 717 SNPs in the 90 MHs for 30 populations, while this study of 79 populations recorded 905 SNPs in the 90 MHs (Table S1), which included 65 novel SNPs in 39 of the 90 MHs.

Characteristics of MH markers

As noted earlier, two statistics characterize the information in the markers with respect to variation within populations (Ae) and variation among populations (In). Figure 1 is a scatterplot of all 90 MHs according to In by average Ae for the total of 79 populations. Some of the markers rank very high by both criteria. The six MHs that are highest for Ae are shaded and included in Table 4. The clinal decrease in the average Ae across loci for populations that are farther from Africa is evident in Fig. 2. The markers have high heterozygosity with mean values of Ae ranging from 3.0 to more than 6.0 (Fig. 2) depending on the population. Among the 7110 individual population values (79 × 90) for Ae, it is noteworthy that 81.7% are ≥ 3.0 and 96.8% are ≥ 2.0. Supplemental Fig. S3 plots the average Ae value for each of the 90 microhaplotypes. The most common genotype frequency in each population is also plotted in Fig. 3. Note that the specific genotype will likely be different in each population, the point being that no genotype is common anywhere when all 90 loci are considered.
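The most common genotype frequency can be approximated from allele frequencies with a short Python sketch. It assumes Hardy-Weinberg proportions within each population and independence among loci; both assumptions, and the function names, are introduced here for illustration and may differ from how the plotted values were actually obtained. Under Hardy-Weinberg proportions the most frequent genotype at a locus is either the homozygote for the most common allele or the heterozygote for the two most common alleles.

```python
import math


def most_common_genotype_freq(freqs):
    """Largest genotype frequency at one locus, assuming Hardy-Weinberg."""
    ordered = sorted(freqs, reverse=True)
    p1 = ordered[0]
    p2 = ordered[1] if len(ordered) > 1 else 0.0
    # Either the top homozygote (p1^2) or the top heterozygote (2*p1*p2)
    return max(p1 * p1, 2.0 * p1 * p2)


def panel_log10_most_common(loci):
    """log10 of the product over loci, assuming independent loci."""
    return sum(math.log10(most_common_genotype_freq(f)) for f in loci)
```

Working on the log10 scale avoids floating-point underflow when the product is taken over all 90 loci.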

Fig. 1

Scatterplot of 90 microhaplotypes by In and average Ae for 79 populations (79p). The 6 MHs with the highest Ae values in all 6 biogeographic regions (cf. Table 4) are shown as patterned circles

Table 4 Region-specific average Ae for 6 highest ranking microhaplotypes worldwide
Fig. 2

Box plots of Ae values for 90 microhaplotypes in each population. Box boundaries are at the 25th and 75th percentiles; the light dot in the box marks average Ae; the “whiskers” line extends from the minimum to the maximum Ae

Fig. 3

Random match probability and most common genotype frequency

The high Ae for many loci individually and on average across all populations indicates considerable variation within populations. A forensic measure, Random Match Probability (RMP), at a single locus is the sum over all the possible genotypes in the population of the squares of the genotype frequencies. In other words, it is the probability that, having randomly selected one individual from the population, a second unrelated individual will have that same specific genotype. For multi-locus genotypes, RMP becomes the product of the individual locus probabilities. It is often used in criminal cases to note how unlikely it is that someone else has the same genotype as a defendant. The RMP values are quite small for these 90 MH loci. However, RMP is population-specific and differs by close to 50 orders of magnitude depending on the population (Fig. 3). The range goes from the very small RMP values for Africans up to the much larger, but still highly probative, values for the Pacific and Native American populations: $10^{-115}$ up to $10^{-66}$. Globally, the probability of two unrelated individuals having the same genotype for these markers is vanishingly small. Note, this RMP is not the same as the probability that a random person will have the same genotype as a specific evidentiary genotype profile.
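The RMP definition above can be written out as a short Python sketch. Genotype frequencies are derived here from allele frequencies under Hardy-Weinberg proportions, which is an assumption of this illustration (observed genotype frequencies could be used instead), and the function names are hypothetical. The multi-locus value is accumulated as a sum of log10 terms, which is equivalent to the product of per-locus probabilities.

```python
import math
from itertools import combinations


def locus_rmp(freqs):
    """Sum of squared genotype frequencies at one locus (Hardy-Weinberg assumed)."""
    # Homozygote i/i has frequency p_i^2
    homozygotes = sum((p * p) ** 2 for p in freqs)
    # Heterozygote i/j has frequency 2 * p_i * p_j
    heterozygotes = sum((2.0 * p * q) ** 2 for p, q in combinations(freqs, 2))
    return homozygotes + heterozygotes


def panel_log10_rmp(loci):
    """log10 of the multi-locus RMP, taken as the product over loci."""
    return sum(math.log10(locus_rmp(freqs)) for freqs in loci)
```

For a locus with two equally frequent alleles the single-locus RMP is 0.375. On the log10 scale, the reported population range corresponds to panel values between about -115 and -66.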

Informativeness (In) across the 79 population samples likewise shows considerable variation by locus (Fig. 4). The specific loci with the highest In values are clearly distinct in Fig. 1 as are those loci with the lowest In values.

Fig. 4

Rosenberg informativeness (In) across 79 populations for each of 90 microhaplotypes. The 6 dark triangles correspond to the 6 MH with the highest Ae values in Fig. 1

Inference of population relationships

Structure

STRUCTURE analyses were run on all 79 population samples from K = 2 through K = 16. The first K value at which all major biogeographic regions are distinct is K = 6 (Fig. 5, Fig. S4). Those six clusters are the ones that correspond to “continental” clusters when representatives of all “continents” are present: Sub-Sahara Africa; North Africa, Southwest Asia, and Europe; South Central Asia; East Asia; the Pacific; and the Americas. These six are the commonly seen clusters from many studies based on SNPs (Soundararajan et al. 2016; Li et al. 2016; Cherni et al. 2016; Santos et al. 2016; Fondevila et al. 2017; Pakstis et al. 2017; Pakstis et al. 2019; Xavier et al. 2020), on studies of microhaplotypes (Kidd et al. 2017,2018a; Bulbul et al. 2018; Gandotra et al. 2020; Puente et al. 2020b; Staadig and Tillmar 2021), and on studies combining single SNPs and MHs (Phillips et al. 2019; Kidd et al. 2021). K = 6 provides a convenient basis for summarizing aspects of the data such as the MHs with the highest regional Ae values. K = 6 is also the point at which the likelihood increases with increasing K values begin to be progressively smaller until the curve is nearly flat at K = 14 to K = 16 (Fig. S5). K = 7 shows that these loci can begin to distinguish among the sub-Saharan Africans. Yet, when all 79 populations were analyzed up to K = 16, the African clustering looks identical to the K = 7 pattern (Fig. S6). In contrast, the East Asia pattern became much more complex at K = 16. This panel of 90 loci is capable of more refined STRUCTURE clustering when subsets are analyzed separately. When the 21 African and Southwest Asia populations were analyzed as a group, K = 6 showed five clusters within sub-Saharan Africa (Fig. 6) distinct from the Southwest Asians. When the 21 Siberian, East Asian, and Pacific populations were analyzed as a group, K = 7 showed the clearest set of clusters (Fig. 7).

Fig. 5

STRUCTURE population average bar plot at K = 6 and 7 for all 79 populations

Fig. 6

STRUCTURE of 21 populations from sub-Saharan Africa to Southwest Asia

Fig. 7

STRUCTURE of 21 Populations from East Asia to the Pacific

PCA

The African populations are a distinct group and their distinctiveness is the primary driver of PC1 when all 79 populations are analyzed (Suppl. Fig. S7). All other populations are primarily distributed according to PC2. To separate those non-African populations better, a separate analysis was done omitting all of the sub-Saharan populations (Fig. 8). This analysis clusters the European and SW Asia populations close together at one end of PC#1 followed by the South Central Asian populations with an internal differentiation along a West-to-East axis. The Native Americans form a distinct cluster as do the East Asians. The Oceania populations form a loose cluster next to the tight East Asian cluster. The two North Asian populations (BUR and YAK) are very close together but far from the Western Siberian Khanty (KTY) which is not part of any cluster. Similarly, the Hazara (HZR) is a distinct population.

Fig. 8

PCA of the populations after eliminating the sub-Saharan populations

Tree analysis

The tree analysis of Tau genetic distances on all 79 populations involved evaluations of a total of 294 different additive tree structures of which 31 had no internal negative segments. The best of these 31 is shown in Suppl. Fig. S8. There are two small negative segments connecting the two mostly West African populations (ACB and ASW) to the African branch of the tree. This is an indication that these do not conform to the underlying assumption of an additive tree for which only random genetic drift has caused divergence of populations. Indeed, these two populations are admixed and do not meet the assumptions, but were included as part of the 1 KG set of populations.

In general, many of the clusters of populations are similar to those seen in the STRUCTURE and PCA analyses. The South Asians are divided into four different clusters in the tree. One is close to the European and SW Asia cluster; the others are more differentiated.

Discussion

The utility/value of a locus in forensics can relate to at least four different questions: individualization, ancestry inference, kinship analysis, and mixture resolution. Individualization is often noted as the random match probability (RMP) reflecting the low likelihood that a match between evidence and an accused individual would have occurred by chance alone. Ancestry inference can be pursued as the identification of the population for which the probability of the observed genotype is highest (Kidd et al. 2018b; Rajeevan et al. 2020). The value of a panel of loci in anthropology is related to what the genetic data can tell about population relationships and histories (Kidd et al. 2021). Kinship analysis compares DNA sequence or dense markers among individuals to determine the likely degrees of relationship. Paternity testing is one form of kinship analysis. Mixture deconvolution is a developing field with probabilistic genotyping available for STR analysis but not yet for microhaplotypes. As discussed in the following sections, microhaplotypes are useful in all of these areas.

Individualization

SNPs are overwhelmingly di-allelic and hence provide less information per locus than the polymorphic STRs when comparing a forensic sample with a reference sample. High levels of individualization measured by random match probability (RMP) are a consequence of the high Ae values of the loci. Figure 3 plots the RMP by population based on all 90 microhaplotypes. Although the scales are very different, Figs. 2 and 3 show otherwise similar variation among populations, because both are based on the heterozygosities of the 90 loci in the 79 populations. Both show high Ae values in African populations and significantly lower values in the Pacific and Native American populations. The range of population-specific RMP values is close to 50 orders of magnitude from a minimum of $10^{-115}$ to a maximum of $10^{-66}$. Even at the maximum value, the RMP based on all 90 loci is highly probative.

There is a significant range in the average Ae values (3.00–6.25) across all 79 populations among the 90 microhaplotypes (Fig. 2). While some of the loci are at the low end of the distribution overall, a relevant question is whether or not some of the better markers exist in different regions of the world. The STRUCTURE software can show reliable clusters of populations at higher K values (Fig. S6), but K = 6 provides a convenient basis for summarizing aspects of the data such as the MHs with the highest regional Ae values. Table S3 summarizes the top 20 MHs ranked by Ae value for each of the six biogeographic regions defined in Fig. 5. The averages of the average Ae values for the 20 highest loci are lower for the non-African regions with the smallest for the Pacific populations, but the decrease is not great compared to the overall decrease seen in Fig. 2. Overall, there are 38 different loci in this tabulation. Many of the loci have a high Ae in more than one broad region of the world. Only 6 of these 38 loci occur in all six biogeographic regions (cf. Fig. 5) and are listed in Table 4. These are the highlighted loci in Fig. 1. The averages for those loci that rank among the top 20 are above 4.0 (see Suppl. Table S3). Many markers have good Ae values for random match probabilities and for mixture deconvolution for nearly all populations.

The large number of MH alleles varying in the six biogeographic regions is illustrated in Fig. 9. There are 3018 total different MH alleles in the dataset analyzed with 1337 occurring at common frequencies ≥ 5% in specific populations, while a total of 1810 MH alleles occur at frequencies > 2%. The remaining 1208 alleles mostly occur at very low (usually rare) frequencies; for example, 910 of the 1208 very-low-frequency MH alleles occur only once or twice in the whole dataset. Supplemental Table S3 lists the 20 highest ranking MHs by average Ae in each of six world regions. The average MH allele frequencies in each of six major geographic regions are shown as bar plots for the microhaps mh01KK-212 (Fig. 10) and mh05KK-170 (Fig. 11), which have the highest In values (0.88 and 0.81) in 79 populations and the highest average Ae (9.708 and 9.750) in the 79 populations.

Fig. 9

Microhaplotype alleles present and at common frequencies in specific populations for each of 6 world regions. Most of the low-frequency alleles are very rare from a global perspective

Fig. 10

Average allele frequency bar plot for mh01KK-212 for each of 6 major biogeographic regions. This microhaplotype has the largest value for Rosenberg’s In in 79 populations (0.88; Fig. 4) and the second highest average Ae for 79 populations (9.708; Suppl. Fig. S3). The 34 alleles with frequencies ≥ 5% in specific populations are plotted separately with different colors/patterns; the 58 alleles with frequencies < 5% are pooled (bars shown with black diagonal lines and green background)

Fig. 11

Average allele frequency bar plot for mh05KK-170 for each of 6 major biogeographic regions. This microhaplotype has the second largest value for Rosenberg’s In in 79 populations (0.81; Fig. 4) and the highest average Ae for 79 populations (9.750; Suppl. Fig. S3). The 33 alleles with frequencies ≥ 5% in specific populations are plotted separately with different colors/patterns; the 24 alleles with frequencies < 5% are pooled (bars shown with black diagonal lines and yellow background)

Ancestry inference: population relationships

High In markers require a reference database to determine allele frequencies for calculating RMP values and for use in forensic attempts to identify the population ancestry of the donor of a DNA profile. This study provides reference data on 79 population samples. Several of those populations are smaller than ideal for forensic reference, but as seen in Fig. 5, the clusters at K = 6 and K = 7 define Mendelian populations of considerable size in some cases. It is clear that an amalgam of European population samples in one STRUCTURE cluster is as valid a reference population as a forensic reference population such as “U.S. White”.

The PCA and STRUCTURE results presented show that the extensive genetic variation in the 79 populations analyzed with the 90 MH panel can both differentiate clear population groupings for major geographical areas of the world and delineate distinct subgroupings of populations, especially when analyses are restricted to particular biogeographic regions.

There were no real surprises in the population relationships seen in STRUCTURE analyses and PCA. Indeed, as noted earlier, several other sets of markers on similar collections of populations have shown similar relationships (e.g., Bulbul et al. 2018) to those seen in Figs. 5 and 8. What these analyses do demonstrate is that this set of markers is highly informative for population similarities and differences at K values > 6. The new marker data do provide new information on some of the populations as discussed and also presented separately for African and East Asian populations below.

Comments on overall analyses of 79 populations using these 90 microhaplotypes

The six main clusters of populations seen in Fig. 5 and Fig. S4 remained distinct at higher K values. Figure S5 shows that likelihoods increased through K = 14 but at progressively lower increases as K increases until likelihoods remain almost constant after K = 14. What happened is that the six major regions have been subdivided at the higher K values and the “intermediate” populations (i.e., the magnified blocks in Fig. 5) with small sample sizes have differing patterns at the higher K values. In supplemental material, we present analyses at K = 16 (Fig. S6) which is a higher K value than the likelihood increases warrant, but illustrates the general pattern for subdivisions of the six major regions. For Africa, the change from K = 7 (Fig. 5) occurred at K = 13 in the 79-population analysis when the Biaka Pygmies became distinct from the East Africans. That pattern persisted through K = 16 (Fig. S6) but with the Ethiopian Jews showing differing patterns at higher K levels. The North African and Southwest Asian populations became a separate cluster from the Europeans at K = 9 and the cluster persisted through higher K values. The South-Central Asia cluster separates off the Pakistani populations with a distinct admixture component at K = 13 and that distinction remains through K = 16. Three of the South-Central Asia populations show inconsistent patterns of clustering after K = 13. In contrast to the small refinements of the African and European patterns, the East Asian patterns became more subdivided with increasing K value, as discussed below. The Oceania populations show several different patterns at the different K values.

Comments on African ancestry inference of these 90 microhaplotypes

Based on the overall analyses, we chose 21 populations for a more detailed analysis: the African and Southwest Asian samples. STRUCTURE analyses stabilized at K = 5 and K = 6 (Fig. 6). The Mozabites clustered with the SW Asian populations as a distinct group. The Ethiopian Jews were intermediate between the SW Asian and Sandawe from Tanzania. Other East African populations form a distinct cluster and the Central African Biaka population was distinct. The West African populations show some indication of two distinct groups with the Gambians and Mandenka distinct from both Yoruba samples and the Esan. This pattern of subdivision of the African cluster does not occur in the larger analyses of all 79 populations (Fig. S6). PCA of all 79 populations (Fig. S7) showed a distinct African cluster but no clear separation of Eastern vs. Western African populations. The Ethiopian Jews were distinct. PCA of the 21 populations showed that these populations generally are distributed along PC#1 (24.5%) as West Africa, East Africa, Ethiopian Jews, the Mozabites, and the SW Asian populations. PC#2 (9.1%) essentially separated the Biaka from all others (Suppl. Fig. S9a). PC#3 (8.2%) more clearly separated the East Africans and Ethiopian Jews from all the others (Suppl. Fig. S9b). PCA provided barely any evidence of clustering among the West African populations with only the Mandenka slightly different from the others. The two samples of admixed African-European origin cluster with the African populations by PCA but closer to the East Africans.

Comments on East Asia and the Pacific

The most striking result for the 79 population analysis is that at K = 11, the three samples of Han Chinese all show an “admixture” pattern with many individuals showing mixed membership in the Northeast Asia (Koreans and Japanese) cluster and the Southeast Asia (Dai, Vietnamese, and Cambodians) cluster. That pattern persisted through K = 16. If it has any meaning, it is probably that the Han Chinese are intermediate in a North-to-South cline in far East Asia and not that they are individually admixed of those flanking populations. At K = 9, the Atayal became distinct. At K = 10–16, the Khanty became distinct and usually (for K = 10 to 14) grouped with the Buryat and Yakut; in both cases, they remained distinct through K = 16 (Fig. S6). Oceania showed inconsistent clustering among the populations except for the consistent clustering of the two Melanesian populations together.

Similar population groupings are seen in the PCA results (Fig. S10). The Khanty from northwest Siberia are a clearly distinct population in this analysis. Note that in the full global context, they were intermediate between the Europeans and East Asians. We chose 21 population samples from Western Siberia to the Pacific, omitting the South Central Asian samples that were a clearly distinct cluster in Fig. 5. STRUCTURE analysis of these 21 populations showed clear clusters at K = 7 (Fig. 7). The Buryat and Yakut samples clustered together both in the STRUCTURE analysis of the 21 samples and in the PCA of all 79 populations (Fig. 5). The Koreans and the three samples of Japanese ancestry formed a clean cluster in STRUCTURE at all K levels, but were close to the Chinese in the PCA. The three Chinese samples appeared admixed between the Japanese and the three Southeast Asian populations, which formed a clean cluster. The STRUCTURE data constitute evidence for a North-to-South cline of genetic differentiation in Far East Asia. The Atayal sample defined its own isolated cluster in STRUCTURE at K = 9, 10, and 16 but grouped with the Southeast Asian populations from K = 11 to 15. The various Oceania populations formed a noisy cluster with evidence of admixture, except for the two Melanesian samples from Papua New Guinea, which were distinct at all K values in analyses of both the full (79) and restricted (21) sets of populations.

A general comment

Overall, these 90 microhaplotype markers are a powerful set for population relationships, but it was impossible from these analyses to determine when a subset of populations would provide an answer not inferable from the full set of populations. The Africans, in the separate 21-population analysis, clearly show clustering at K = 5 that is not seen in any of the results for all 79 populations. In contrast, the East Asians by themselves cluster in ways that are similar (but never identical) to the clustering of all populations at K levels up to K = 16. We do not fully understand what causes these different patterns. We know that different markers are most relevant to different regions; the magnitude of the allele frequency differences is undoubtedly relevant. How well this regional inconsistency in finer clustering generalizes to other datasets is unknown at present.

Kinship

Any multiallelic genetic system is useful for kinship analysis. Indeed, even a di-allelic locus provides evidence of relationship by allele sharing. In this respect, this set of MHs, with its high Ae values, should be especially informative, because the probability of allele sharing identical by state can be much less than sharing identical by descent for close relatives. However, no direct test has been done. Recent papers by Puente et al. (2020a), Staadig and Tillmar (2021), and Wu et al. (2021b) have assessed microhaplotypes in kinship analyses to varying degrees. Because Wu et al. (2021b) found that 54 high-In MHs were problematic for relationships beyond the second degree, we cannot expect the 90 MHs in our study to be good at distant relationships. How well the 90 MHs will perform is a question for future research.
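
The link between Ae and identity-by-state sharing can be illustrated with a minimal sketch. It assumes the standard definition of the effective number of alleles, Ae = 1/Σp_i², under which two alleles drawn at random from a population match by state with probability 1/Ae. The function names and the example frequencies are hypothetical and are not data from this study.

```python
def effective_alleles(freqs):
    """Effective number of alleles, Ae = 1 / sum(p_i^2), for one locus."""
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("haplotype frequencies must sum to 1")
    return 1.0 / sum(p * p for p in freqs)


def random_ibs_match(freqs):
    """Chance that two alleles drawn at random are identical by state (= 1 / Ae)."""
    return sum(p * p for p in freqs)


# Hypothetical loci: a di-allelic SNP and a microhaplotype with five haplotypes.
snp = [0.5, 0.5]
mh = [0.3, 0.25, 0.2, 0.15, 0.1]

for name, freqs in (("SNP", snp), ("MH", mh)):
    print(name, round(effective_alleles(freqs), 2), round(random_ibs_match(freqs), 3))
```

A parent and child always share one allele identical by descent, so the lower the chance match probability (1/Ae), the more strongly an observed shared allele points to relationship rather than coincidence.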

Mixture deconvolution

Three questions arise when considering the existence of mixtures in a forensic sample. First, is there a mixture? The essential proof that a mixture exists is the presence of at least three alleles at several of the loci. Note that this criterion cannot be met by a di-allelic SNP. The only way a di-allelic SNP can contribute to the inference of a mixture is if a quantitative method is used and the two alleles differ in their values, e.g., sequence read number, by more than heterozygote read imbalance would explain. Second, how many contributors are there to this mixture? At any one locus, the minimum number of contributors is the number of alleles seen divided by 2 and, if a fraction, rounded up to the next whole number: five alleles seen imply at least 3 contributors; six alleles also imply at least 3 contributors. The loci with the largest numbers of alleles seen provide an overall minimum estimate of contributors that applies to all loci. Note that sensitivity issues and diminishing concentrations with larger numbers of contributors prevent any realistic estimate of the maximum number of contributors. However, the global sum of all the alleles seen at all the loci can implicate more contributors than the maximum seen at individual loci (see Fig. 2 in Bennett et al. (2019) for an illustration). Quantitative variation in allele “intensity” may also provide hints at larger numbers of contributors, but some model of the relationship between the number of copies of an allele and its intensity is required.
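
The allele-counting rule for the minimum number of contributors can be sketched as follows. This is only an illustration of the arithmetic described in the text (alleles seen divided by 2, rounded up, then the maximum across loci); the function name and the locus labels and counts are hypothetical, and the sketch ignores allele dropout and read-count information.

```python
from math import ceil


def min_contributors(alleles_seen):
    """Minimum number of contributors implied by distinct allele counts.

    alleles_seen maps each locus name to the number of distinct alleles
    observed in the sample at that locus.
    """
    if not alleles_seen:
        raise ValueError("at least one locus is required")
    # Each diploid contributor carries at most two alleles per locus.
    per_locus = {locus: ceil(n / 2) for locus, n in alleles_seen.items()}
    # The locus with the most alleles sets the minimum for the whole sample.
    return max(per_locus.values()), per_locus


# Hypothetical observations at four loci.
observed = {"locus_A": 3, "locus_B": 5, "locus_C": 6, "locus_D": 2}
overall, per_locus = min_contributors(observed)
print(overall, per_locus)
```

The result is a lower bound only; as noted above, no comparable rule yields a realistic maximum.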

Finally, can the individual multi-locus genotypes of the contributors be determined? It may be possible to readily infer the contributing genotypes at a single locus using allele “intensity” (e.g., read count in MPS) as seen at locus mh05KK-170 in Bennett et al. (2019). However, the permutations of the individual locus results overwhelm such single-locus approaches. This becomes an issue for probabilistic genotyping of microhaplotypes analogous to the use of STRmix (Buckleton et al. 2019) for probabilistic genotyping of forensic STR data. In the forensic case, the question is usually whether a known sample can be part of a mixture. This is a different question from attempting to fully deconvolute a mixture. This is an area that needs development for microhaplotypes because of the many variables that are involved. Elements of such deconvolution methods include the number of contributors, the relative amounts of each contributor, and the allele frequencies in the relevant population(s). The 90 MHs provide a set of highly heterozygous loci that can help with some of these issues and have the advantage of low mutation rates and the absence of stutter.
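
The difference between asking whether a known sample can be part of a mixture and fully deconvoluting the mixture can be seen in a deliberately simple, qualitative sketch. It assumes that every allele of a true contributor is detected (no dropout) and uses no read counts or allele frequencies, so it is not probabilistic genotyping and is not the method of STRmix or of this study. The function name, locus labels, and haplotype strings are hypothetical.

```python
def inconsistent_loci(known_genotype, mixture_alleles):
    """List loci where the known sample carries an allele absent from the mixture.

    known_genotype maps locus -> pair of haplotype alleles for the known sample.
    mixture_alleles maps locus -> set of haplotype alleles seen in the mixture.
    Loci not typed in the mixture are skipped rather than counted as exclusions.
    """
    conflicts = []
    for locus, alleles in known_genotype.items():
        if locus not in mixture_alleles:
            continue
        if not set(alleles) <= set(mixture_alleles[locus]):
            conflicts.append(locus)
    return sorted(conflicts)


# Hypothetical three-SNP microhaplotype alleles written as strings.
mixture = {"locus_A": {"ACG", "ATG", "GCG"}, "locus_B": {"TTC", "TAC"}}
person_1 = {"locus_A": ("ACG", "GCG"), "locus_B": ("TTC", "TTC")}
person_2 = {"locus_A": ("ACG", "ACA"), "locus_B": ("TAC", "TTC")}

print(inconsistent_loci(person_1, mixture))  # no conflicting loci
print(inconsistent_loci(person_2, mixture))  # conflicts at locus_A
```

An empty list means only that the known sample is not excluded; weighing that inclusion requires the number of contributors, their relative amounts, and population allele frequencies, as described above.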

Optimizing the panel

This panel of 90 MH loci was designed to have high Ae and high In. This has resulted in loci that, on average, span a greater extent in order to encompass more SNPs. Eliminating the loci with the lowest Ae and/or In values globally should improve the efficiency of the panel. However, a careful analysis should be undertaken to ensure that a marker with the lowest In across all populations is not providing significant differentiation of some population(s). We generated exploratory STRUCTURE runs from K = 2 to K = 8 for 79 populations after excluding 19 MHs with In ≤ 0.25. The cluster patterns of the highest likelihood runs for the 71 MHs were all very similar to those obtained with all 90 MHs. The most noticeable difference occurred at K = 7, where the Biaka from central Africa clustered with the West African groups instead of the East African cluster. Some of the excluded MH markers undoubtedly have value in differentiating among the sub-Saharan groups. Given the panel's high level of informativeness with results at 90 loci, efficiency is not an issue. Rather, any pruning would allow space for adding marker loci with higher values, including some of the best of the loci identified by others; for example, Wu et al. (2021a) have identified many MHs with global average Ae values > 5.0. Those are issues for future research.
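
The pruning step can be sketched as a threshold on per-locus In. The sketch assumes that In is Rosenberg's informativeness for assignment, computed from haplotype frequencies in K populations as In = Σ_j [−p̄_j ln p̄_j + (1/K) Σ_i p_ij ln p_ij], where p̄_j is the mean frequency of haplotype j across populations. The 0.25 cutoff is the one used above; the function names and all frequencies are hypothetical.

```python
from math import log


def informativeness(freqs_by_pop):
    """Rosenberg's In for one locus.

    freqs_by_pop is a list of K populations, each a list of haplotype
    frequencies given in the same haplotype order.
    """
    k = len(freqs_by_pop)
    total = 0.0
    for j in range(len(freqs_by_pop[0])):
        column = [pop[j] for pop in freqs_by_pop]
        mean = sum(column) / k
        if mean > 0:
            total -= mean * log(mean)
        # Terms with p = 0 contribute nothing (0 * ln 0 is taken as 0).
        total += sum(p * log(p) for p in column if p > 0) / k
    return total


def prune(in_by_locus, threshold=0.25):
    """Return the loci kept after excluding those with In <= threshold."""
    return sorted(locus for locus, value in in_by_locus.items() if value > threshold)


# Hypothetical loci typed in two populations.
loci = {
    "locus_A": [[0.9, 0.1, 0.0], [0.1, 0.2, 0.7]],  # strongly differentiated
    "locus_B": [[0.5, 0.3, 0.2], [0.45, 0.35, 0.2]],  # nearly identical
}
in_values = {name: informativeness(f) for name, f in loci.items()}
print(in_values, prune(in_values))
```

Because a global In can mask a locus that separates only a few populations, the same calculation could be repeated on regional subsets before a locus is dropped.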

General utility of microhaplotypes

While the loci studied here are human specific and will not be relevant to other species, the general molecular approach and methods (Gandotra et al. 2020) are applicable tools in population genetic studies of other organisms. The fields of ecology and conservation are increasingly using molecular techniques and some researchers are already using microhaplotypes (Meek and Larson 2019). Microhaplotypes have been shown to be much more informative per locus than SNPs in studying the familial relationships among Kelp Rockfish (Baetscher et al. 2018). Microhaplotypes have also been used to study porpoises (Morin et al. 2021) and salmon (Larson et al. 2016; McKinney et al. 2017). Tessema et al. (2020) identified 93 microhaplotypes in Plasmodium falciparum. Those P. falciparum microhaplotypes had a median Ae of 3.33 and provided good discrimination between related and unrelated polyclonal infections.

Impact on forensic practice

In spite of their technical advantages over the forensic STR markers, SNPs have not been incorporated in routine forensic practice. Part of the reason has been the need for separate methodologies to type STR loci and SNPs. With the advent of MPS, it is now possible to use one technology and multiplex the standard STR markers with a reasonable panel of SNP-based markers in the same sequencing run. We show in this study that microhaplotypes with high Ae, rivaling the Ae values for STR markers, can be found and are far superior to individual SNPs. We believe that such microhaplotypes will supplant individual SNPs in future applications. As more laboratories acquire sequencing technology, it may be possible for microhaplotypes to become a tool in forensic practice while maintaining the standard STR markers and the national databases of convicted felons. However, the costs of new equipment and training of personnel and the absence of an agreed upon panel of highly informative microhaplotypes remain major obstacles.

Future studies

Refining and optimizing the microhaplotype markers that have already been identified for more localized geographic regions will likely be productive. Identifying additional useful microhaplotypes would be helpful. Some may emerge as more diverse human populations are studied routinely. While we have studied 79 populations from major geographical regions of the world, there is still a need to obtain better coverage of the diversity of human populations, especially in Africa, North Asia, Southeast Asia, and the Americas. Recent reviews and population genetic studies (e.g., Ramsay et al. 2021) continue to indicate that the diversity of African populations is greater than what has been routinely studied. Indigenous populations of the Americas (Moreno-Estrada et al. 2014; Homburger et al. 2015; Barbieri et al. 2019) also need better coverage.

Conclusions

Our results document this panel of microhaplotype markers as the best one so far, with the highest overall values of Ae and In in the largest number of populations studied. The combination of the multiplex mMHseq method and the expanded set of populations studied from around the world revealed a highly informative set of markers with characteristics that can serve a range of forensic, medical, and anthropological applications. Additional useful microhaplotypes will likely emerge from other and future studies (e.g., Wu et al. 2021a). New analyses can focus on tailoring the best subsets and supersets of MH markers for use in specific geographical regions as well as for major world regions. As more extensive sampling and analyses of world populations occur, it can be expected that the ability to distinguish more refined population relationships in multiple world regions will increase, especially in Africa.