Introduction

The Shompen, an isolated aboriginal population (population size: ~200 individuals), is confined to the southernmost Great Nicobar Island of the Andaman and Nicobar archipelago located in the Bay of Bengal. Following a hunting–gathering way of life, the Shompen have remained in seclusion, in contrast to the Nicobarese, the other indigenous population, which is dispersed over all 12 inhabited islands of the Nicobar archipelago. Although the Shompen and Nicobarese are ethnically described as “Mongoloids”, their evolutionary histories have been debated due to their evident differences in physical appearance (Man 1886; Temple 1901). The Shompen have a stark phenotypic resemblance to African populations with a trace of Mongoloid traits. In addition, the languages spoken by these two groups are distinct and mutually unintelligible. These facts have stimulated two primary lines of inquiry for population biologists. The first is to trace the temporal and geographic origins of the Shompen. The second, with respect to intertribal differentiation, is to determine whether the Shompen and Nicobarese share a common origin and have differentiated due to long-term geographic segregation, or whether the differences result from their separate origin from diverse populations that arrived in successive waves to inhabit the Andaman and Nicobar Islands. There are several theories regarding the origin of the Shompen, alluding to their descent from Malay, Burmese and Chinese (Rizvi 1990).

Languages are known to carry imprints of ancestries and migrational histories of populations (Cavalli-Sforza et al. 1988). The language isolates of the Shompen and Nicobarese have both been grouped into the Mon-Khmer branch of the Austro-Asiatic linguistic family (Khasi is the third Indian language that derives from the Mon-Khmer branch) (Ruhlen 1991). The Shompen speak ‘Shompen’, which has 25 consonants and 35 vowels. The language is polysyllabic and includes words of foreign origin from Portuguese and Malay. Linguistic and genetic evolution do not necessarily go hand in hand, although the two can be correlated in certain instances (Cavalli-Sforza et al. 1988; Sajantila and Pääbo 1995), leaving open the possibility that the two Austro-Asiatic-speaking populations, the Nicobarese and the Shompen, share common genetic origins.

Recent molecular studies on the Nicobarese population (Endicott et al. 2003; Thangaraj et al. 2003) have revealed their Southeast Asian origin (Prasad et al. 2001); however, the Shompen remain poorly understood because their isolation makes them difficult to access. Earlier studies on the Shompen were confined to conventional genetic markers (Agarwal 1966), which did not help in unravelling the position of the Shompen with respect to modern human diversity. To address the issues of origin and antiquity of the Shompen, we carried out a comprehensive analysis of mitochondrial control and coding region polymorphisms, 20 Y-chromosomal short tandem repeats (Y-STRs), 35 lineage-defining Y-chromosomal single nucleotide polymorphisms (Y-SNPs) and 15 autosomal short tandem repeats (STRs) in the Shompen, and compared the results with available data on related populations of India, East Asia, Southeast Asia and Oceania.

Materials and methods

DNA sample

Blood samples were collected from 33 unrelated Shompen individuals after ethical clearance from appropriate bodies. Morphometric indices and disease status were recorded at the time of sample collection. DNA was isolated by the organic extraction method (Sambrook et al. 1989).

Analysis of autosomal STRs

Fifteen STRs were typed for 33 unrelated individuals using an AmpFlSTR Identifiler kit (Applied Biosystems, Foster City, CA). The amplified products were electrophoresed in a 6% denaturing polyacrylamide gel on an ABI Prism 377 DNA Sequencer (PE Applied Biosystems, Foster City, CA). Sizing of DNA fragments was performed using GeneScan analysis software (version 3.7), and alleles were designated using Genotyper DNA fragment analysis software (version 3.7) (Applied Biosystems, Foster City, CA).

Mitochondrial analysis

Of the 33 DNA samples, 29 were first amplified for selected regions of the mitochondrial DNA (mtDNA). They were assayed for the presence of the intergenic 9 bp deletion along with mtDNA restriction fragment length polymorphisms using standard primers and protocols (Torroni et al. 1993, 1996). Sequencing of mitochondrial hypervariable segment I (HVS-I) (16024–16383) and hypervariable segment II (HVS-II) (57–372) (GenBank accession numbers DQ094084–DQ094112 for HVS-I sequences and DQ094113–DQ094141 for HVS-II sequences; http://www.ncbi.nlm.nih.gov/Genbank/) was carried out according to the technical booklet on mtDNA Sequencing (PE Applied Biosystems). Sequencing for informative coding region mutations was also carried out in a few selected samples from B5a and R9b lineages using primers described by Torroni et al. (2001).

Analysis of Y-chromosomal polymorphisms

Binary polymorphism typing

Thirty-five SNPs distributed throughout the Y-chromosome were typed hierarchically in 12 Shompen males. The M9 (C–G) mutation was detected by HinfI restriction digestion of a 164 bp PCR product obtained using standard primers and protocols (Underhill et al. 1997). Sequencing of the other polymorphisms was carried out using standard primer sets (Underhill et al. 1997).

Analysis of Y-STRs

Twenty Y-chromosomal STRs comprising three trinucleotide repeats (DYS426, DYS392 & DYS388), 12 tetranucleotide repeats (DYS389I, DYS389II, DYS439, DYS437, DYS391, DYS385a, DYS385b, DYS390, DYS393, DYS19, DYSH4, DYS460), two pentanucleotide repeat polymorphisms (DYS447, DYS438), two dinucleotide markers (YCAa, YCAb) and one hexanucleotide marker, DYS448, were analysed by a novel 20plex PCR reaction developed in-house, performed for all the microsatellite markers in a final reaction volume of 10 μl containing 10 ng genomic DNA. The chosen Y-STRs were amplified using primers from the Y-STR database (http://www.ystr.charite.de) and the reaction conditions were standardised in our laboratory (S. Sahoo and V.K. Kashyap, unpublished). Amplified products were run on an ABI Prism 3100 Genetic Analyzer using LIZ-500 (PE Applied Biosystems, Foster City, CA) as the internal lane size standard.

Statistical analysis

Allele frequencies, and expected and observed heterozygosities (Guo et al. 1992) for autosomal STRs were calculated using DNATYPE software (Windows 95/NT version 1998, University of Texas, Houston, TX). DA distances (Nei et al. 1983) were employed for constructing the neighbour-joining phylogeny (Saitou and Nei 1987). (δμ)² distances (Goldstein et al. 1995) were computed for estimating divergence time, T, using the formula T = (δμ)²/2ω, where ω is the effective mutation rate.
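The divergence-time calculation can be illustrated with a minimal sketch. It assumes that the per-locus (δμ)² is the squared difference between the mean repeat counts of two populations, averaged over shared loci (Goldstein et al. 1995), and that T = (δμ)²/(2ω) gives time in generations. The function names, toy allele data, mutation rate and generation time below are hypothetical and are not values from this study.

```python
from statistics import mean

def delta_mu_squared(pop_a, pop_b):
    """Average over shared loci of the squared difference in mean repeat count.

    pop_a, pop_b: dict mapping locus name -> list of allele repeat counts.
    """
    shared = sorted(set(pop_a) & set(pop_b))
    if not shared:
        raise ValueError("no shared loci")
    return mean((mean(pop_a[loc]) - mean(pop_b[loc])) ** 2 for loc in shared)

def divergence_time(dmu2, omega, years_per_generation=25.0):
    """T = (delta mu)^2 / (2 * omega), converted from generations to years.

    omega and years_per_generation are assumptions supplied by the caller.
    """
    generations = dmu2 / (2.0 * omega)
    return generations * years_per_generation

# Toy allele data (hypothetical, for illustration only)
pop_a = {"D3S1358": [15, 16, 16, 17], "TH01": [7, 9, 9, 9]}
pop_b = {"D3S1358": [15, 15, 16, 16], "TH01": [6, 7, 9, 9]}

dmu2 = delta_mu_squared(pop_a, pop_b)
years = divergence_time(dmu2, omega=1e-3)
print(dmu2, years)
```

Because the estimate scales inversely with ω, the choice of effective mutation rate dominates the resulting time; the sketch therefore takes it as an explicit argument rather than fixing a value.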

Mitochondrial sequences were edited between positions 57–372 for HVS-II and positions 16024–16383 for HVS-I, and sequence traces of the Y-chromosome were compared with the respective control consensus sequences using BIOEDIT Software (North Carolina State University, NC). Diversity indices and mean number of pairwise differences between sequences/haplotypes for mitochondrial and Y-haplotypes were calculated using Arlequin software (Schneider et al. 2000). Median-joining networks for mitochondrial and Y-STR haplotypes were constructed with NETWORK v 4.1 software (Bandelt et al. 1999). The ρ statistic (Forster et al. 1996) was applied to calculate coalescence estimates, assuming a mutation rate of one mutation every 20,180 years. The mutational age was then converted into years by multiplying by this rate. The respective standard errors were computed as described by Saillard et al. (2000). FST distances for mitochondrial haplotypes and Y-STRs were calculated using Arlequin software and the corresponding MDS plots were constructed with SPSS v7.1 software. Y-STR variance, v = Tμη² (Kittles et al. 1998), was calculated to estimate the coalescence age using the mutation rate described by Zhivotovsky et al. (2004).
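As an illustration of the ρ-based dating step, the following sketch computes ρ as the mean number of mutations separating each sampled sequence from the inferred root haplotype and converts it to years at one mutation per 20,180 years. The standard error uses the star-like simplification sqrt(ρ/n), which is an assumption made here for brevity; the network-based calculation of Saillard et al. (2000) is needed for genealogies that are not star-like. The function name and the example clade are hypothetical.

```python
import math

YEARS_PER_MUTATION = 20180  # rate quoted in the text

def rho_age(steps_from_root, years_per_mutation=YEARS_PER_MUTATION):
    """Return (rho, age in years, standard error in years) for one clade.

    steps_from_root: mutations separating each sampled sequence from the
    inferred root haplotype.
    The standard error assumes a star-like genealogy: sqrt(rho / n).
    """
    n = len(steps_from_root)
    if n == 0:
        raise ValueError("empty clade")
    rho = sum(steps_from_root) / n
    se = math.sqrt(rho / n)
    return rho, rho * years_per_mutation, se * years_per_mutation

# Hypothetical clade of ten sequences (not data from this study)
steps = [0, 0, 1, 1, 1, 2, 1, 0, 2, 2]
rho, age, se = rho_age(steps)
print(rho, age, round(se))
```

With few sequences and a small ρ, the standard error is a large fraction of the age, which is why coalescence estimates for small clades carry wide uncertainty.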

Results

Mitochondrial DNA diversity

mtDNA polymorphisms were analysed in 29 individuals. Three haplotypes were identified, with a total of 14 polymorphic sites. The haplotype diversity was found to be low, at 0.51 (Table 1). Analyses of coding region polymorphisms established the absence of haplogroup M, the predominant lineage found in Indian and in other Asian populations. On further analysis, 19 individuals were found to harbour the intergenic COII/tRNALys 9 bp deletion, which, along with the control region polymorphisms, placed them under the B5a lineage. A new R clade, which we propose to designate as R12, defined by transitions at nucleotide positions 16249, 16288 and 16304, was identified in ten Shompen individuals; the ancestral sequence of this clade occurs among the Nicobarese (Prasad et al. 2001). The major haplogroups prevalent among Shompen were similar to those identified earlier among Nicobarese (authors’ unpublished data). In particular, the mutations in HVS-I between Shompen and Nicobarese B5a lineages were identical, with differences present only in HVS-II. The R12 lineage, however, was found to be more divergent between Nicobarese and Shompen, with three mutational differences in HVS-I alone (Fig. 1). Although the mutation at position 16304 indicates that the sequences belong to the R9b lineage, the absence of R9-defining mutations in the coding region at positions 3970 and 13928 in both populations implied a new clade, designated as R12 in the present study. The occurrence of only two lineages, coupled with their low diversity in the Shompen, in addition to the high diversity of these two lineages in the Nicobarese, indicates a founder effect in the Shompen. An MDS plot (Fig. 2) constructed using other related B5a lineages present among Nicobarese (authors’ unpublished data), Han Chinese (Yao et al. 2002), Indonesians (Redd et al. 1995), Taiwanese aborigines (Melton et al. 1998) and R12 lineages prevalent among the Nicobarese (authors’ unpublished data) and Indonesians (Lum et al. 1994) reveals that the Shompen are closer to Indonesians than to Nicobarese.
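Haplotype diversity of the kind reported here is commonly computed as Nei's unbiased gene diversity, h = n/(n − 1) × (1 − Σp²), where p is the frequency of each haplotype in a sample of n individuals. The sketch below assumes that definition; the function name and the haplotype counts are hypothetical and do not reproduce the study's data.

```python
def haplotype_diversity(counts):
    """Nei's unbiased haplotype diversity from a list of haplotype counts."""
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least two individuals")
    sum_p_squared = sum((c / n) ** 2 for c in counts)
    return n / (n - 1) * (1.0 - sum_p_squared)

# Hypothetical sample: three haplotypes observed in 29 individuals
h = haplotype_diversity([15, 10, 4])
print(round(h, 3))
```

A sample dominated by one or two haplotypes gives a value well below 1, whereas a sample in which every individual carries a different haplotype gives exactly 1.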

Table 1 Mitochondrial DNA (mtDNA) and Y-chromosomal diversity in Shompen. n Number of individuals, SD standard deviation, MPD mean pairwise difference

Fig. 1 Median-joining network depicting the identified mtDNA lineages in Shompen. Substitutions are shown relative to the CRS. Node areas are proportional to haplotype frequencies. Character change is specified only for transversions

Fig. 2 Multidimensional scaling (MDS) plot of FST distances between similar mitochondrial haplotypes present among Shompen, Nicobarese and East/Southeast Asian populations

Y-chromosomal diversity

A single major Y-chromosomal haplogroup, O2a, defined by the M95 marker, was identified in 12 Shompen males. The O2a lineage was also found in Nicobarese individuals. This lineage is also reported among Central Asians, as well as East and Southeast Asians (Su et al. 1999; Kayser et al. 2000, 2001, 2003; Su et al. 2000; Underhill et al. 2000; Wells et al. 2001). Given the high frequency of the O2a lineage in the Austro-Asiatic linguistic gene pool, it has been suggested that this haplogroup might be related to the spread of Austro-Asiatic speakers (Kivisild et al. 2003) in the Asian continent.

In order to further investigate the correlation of the O2a lineage with Austro-Asiatic language speakers, 20 Y-STRs were analysed in the Shompen; 11 loci were found to be monomorphic. Of the ten Y-STR haplotypes identified in Shompen, eight were unique. The haplotype diversity and mean pairwise differences were low, at 0.879 and 1.333, respectively (Table 1). Haplotype data for 20 Y-STRs are presented in Table 2. A multidimensional scaling (MDS) plot of RST distances of Y-chromosomal haplotype frequencies for Shompen and related populations (Kayser et al. 2001, 2003; authors’ unpublished data) (Fig. 3) demonstrates the affinity of the Shompen to Austro-Asiatic speakers, and to Nicobarese and Vietnamese, rather than to mainland Indian Austro-Asiatic populations that share comparable O2a lineage frequencies. The reduced median-joining network based on Y-STR haplotypes (Fig. 4) produced distinct groups of mainland Indian Austro-Asiatic speakers (with two individuals sharing a Shompen haplotype), Shompen, Nicobarese and East and Southeast Asian populations.

Table 2 Haplotype data for 20 Y-chromosomal microsatellites in the Shompen

Fig. 3 MDS plot of RST distances of Y-chromosomal haplotype frequencies for Shompen and related populations

Fig. 4 Reduced median-joining network of haplogroup O2a individuals, based on their Y-short tandem repeat (STR) haplotypes. Different populations of origin are differentially shaded as indicated

Autosomal microsatellite diversity

Distribution of alleles across the 15 microsatellite loci in 33 Shompen individuals revealed low polymorphism with an average heterozygosity of 61.9% (Table 3). Phylogenetic analysis employing microsatellite loci shared with related populations (Linacre et al. 2001; Bagdonavicius et al. 2002; Alves et al. 2004) demonstrated the Shompen clustering away from mainland Indian populations with greater proximity to Southeast Asian populations (Fig. 5).
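An average heterozygosity figure of this kind can be obtained by computing, for each locus, the fraction of individuals carrying two different alleles and then averaging across loci. The sketch below assumes the reported value refers to observed heterozygosity, as the legend of Table 3 suggests; the function names and toy genotypes are hypothetical.

```python
def observed_heterozygosity(genotypes):
    """Fraction of individuals with two different alleles at one locus.

    genotypes: list of (allele_1, allele_2) tuples, one per individual.
    """
    if not genotypes:
        raise ValueError("no genotypes supplied")
    heterozygotes = sum(1 for a, b in genotypes if a != b)
    return heterozygotes / len(genotypes)

def mean_heterozygosity(genotypes_by_locus):
    """Unweighted mean of per-locus observed heterozygosity."""
    values = [observed_heterozygosity(g) for g in genotypes_by_locus.values()]
    return sum(values) / len(values)

# Toy genotypes at two loci (hypothetical, for illustration only)
toy = {
    "TH01": [(7, 9), (9, 9), (6, 9), (9, 9)],
    "TPOX": [(8, 8), (8, 11), (8, 8), (11, 11)],
}
avg_het = mean_heterozygosity(toy)
print(avg_het)
```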

Table 3 Distribution of allele frequency at 15 autosomal short tandem repeats (STRs) (indicated in bold) in the Shompen population. Obs Het Observed heterozygosity

Fig. 5 Neighbour-joining phylogeny constructed using DA distances based on autosomal microsatellite analysis, demonstrating the relationship of the Shompen to mainland Indian, East Asian and Southeast Asian populations

Discussion

Although both the Shompen and the Nicobarese probably derive from the same founding population, considerable genetic differentiation has contributed to their current recognition as two distinct populations, as evident from the extensive analyses performed on mitochondrial, Y-chromosomal and autosomal markers. Occupying the most remote island of the Nicobar archipelago, the Shompen have existed in physical isolation for a considerable period of time. This isolation is evident from the marked linguistic differences between the Shompen and Nicobarese languages. On the other hand, the Nicobarese, spread across 12 inhabited islands of the Nicobar archipelago as well as in Little Andaman, have adapted well to existing conditions and constitute an expanding population, maintaining contacts with settlers and traders from mainland India and other Asian countries.

Mitochondrial sequence imprints

MtDNA analysis of the Shompen reveals the occurrence of only two haplogroups, B5a and R12, which are also the predominant lineages amongst Nicobarese, suggesting common ancestry of the two populations. The presence of both lineages in the two populations suggests either that the founding population of the Nicobar archipelago harboured these two different lineages or that they were brought in by two distinct founders. Mutational differences in the haplotypes of the Shompen and Nicobarese could have arisen following subsequent genetic differentiation of these two groups. The HVS-I sequences of the B5a haplotype were identical in both Shompen and Nicobarese, differing only at two additional HVS-II mutations (positions 309.1 and 309.2) in the Nicobarese. An identical match to the Shompen haplotype profile was found in a Han Chinese sample (Yao et al. 2002), while one with a single-mutational step difference was reported in an Indonesian sample (Redd et al. 1995) and another with a two-mutational step difference was found in a Taiwanese aborigine (Melton et al. 1998). The B5a haplogroup is reported predominantly in populations of China, Thailand and Taiwan. The R12 lineage identified by the HVS-I sequence motif 16249, 16288 and 16304 further differs by additional mutations in both Shompen and Nicobarese (Fig. 1). This lineage is also more diverse in the Nicobarese (Prasad et al. 2001). It is probable that populations harbouring a high frequency of this lineage are yet to be sampled, especially across Southeast Asia. The estimated coalescence time for the R12 lineage is 21,000±13,000 years. The MDS plot (Fig. 2) reveals that the Shompen maternal lineage is genetically closer to that of Indonesians than that of other compared populations.

Y-chromosomal signature

The Y-chromosomal haplogroup O2a was the only lineage present among the Shompen. The frequency of the O2a lineage is reported to be highest among the Mon-Khmer-speaking Malayan aborigines (Orang-Asli; Su et al. 2000), the Bulang of southern China (Su et al. 1999), Vietnamese (Kayser et al. 2003), and among Austronesian-speaking Taiwanese aborigines (Yami; Su et al. 2000) and Indonesians (Kayser et al. 2003). MDS analysis with Y-STRs (Fig. 3) has revealed that the Shompen are closer to Nicobarese and Vietnamese than to mainland Indian Austro-Asiatic populations. Additional data from other Asian populations would be required to trace the spread of this lineage in Asia. The Shompen and Nicobarese clusters were found to be distinct from the mainland Indian and East Asian/Southeast Asian clusters (Fig. 4), illustrating the lack of gene flow in contemporary populations following their early splits. The single haplotype shared between the Shompen and mainland Indians could be due to low resolution resulting from the low number of Y-STR markers available for comparison. The age of the Shompen O2a cluster as estimated from Y-STR variance was ~3,000 years.
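A variance-based age estimate of this kind can be sketched as follows. The sketch assumes single-step mutations, so that the expected repeat-count variance at a locus is the product of elapsed generations and the per-generation mutation rate, and it averages the sample variance over loci. The function name, toy haplotypes, mutation rate and generation time are hypothetical; they are not the values or the exact procedure used in this study.

```python
from statistics import mean, variance

def ystr_variance_age(haplotypes, mu_per_generation, years_per_generation=25.0):
    """Age in years from the mean per-locus variance of Y-STR repeat counts.

    haplotypes: list of equal-length tuples of repeat counts, one per male.
    Assumes single-step mutations, so variance = generations * mutation rate.
    """
    if len(haplotypes) < 2:
        raise ValueError("need at least two haplotypes")
    loci = list(zip(*haplotypes))  # regroup repeat counts by locus
    mean_variance = mean(variance(locus) for locus in loci)
    generations = mean_variance / mu_per_generation
    return generations * years_per_generation

# Toy haplotypes at two loci (hypothetical, for illustration only)
toy_haplotypes = [(14, 10), (14, 11), (15, 10), (15, 11)]
age_years = ystr_variance_age(toy_haplotypes, mu_per_generation=0.002)
print(round(age_years))
```

Monomorphic loci contribute zero variance and therefore pull the estimate down, so in a small founder population the result is best read as the age of the surviving cluster rather than of the lineage itself.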

Synthesis of mitochondrial, Y-chromosomal and autosomal evidence

Based on their physical appearance, the Shompen have been considered to be of mixed Indo-Chinese, Malay, Negrito and Dravidian origin (Rizvi 1990). Our data, based on mitochondrial, Y-chromosomal, and autosomal analyses, however, support the Southeast Asian origin of this population. Although the current dataset has not revealed the occurrence of any haplogroups specific to Dravidian populations, this does not exclude such an occurrence in a larger sample size. With only a single Y-chromosomal lineage identified in the Shompen, it is plausible that the other lineages have been lost as a result of the founder effect in this small population. The time of divergence of the Shompen from the Nicobarese deduced from the (δμ)² distance of the autosomal data was ~14,000 years ago. It is likely that the mitochondrial B5a lineage was introduced later, independent of the older R12 lineage, probably along with bearers of the Y-chromosomal O2a lineage. The R12 haplogroup has also been observed among Great Andamanese (authors’ unpublished data). However, its low frequency in the Great Andamanese does not support its origin in the Andaman Islands, ruling out the genetic contribution of Andaman Negritos to the current set of Shompen analysed.

With the Shompen demonstrating proximity to Southeast Asian populations with all genetic systems employed in this study, we infer that they trace their origins to mainland or island Southeast Asia with no discernible admixture in their contemporary gene pool.