Background

Malaysia, a multi-ethnic, multi-lingual, multi-cultural and multi-religious country, is located at the crossroads of Southeast Asia. It is separated by the South China Sea into two land masses, namely Peninsular Malaysia and East Malaysia (on the island of Borneo). Malaysia has a total population of about 30 million people, of whom approximately 26 million live in Peninsular Malaysia. Among the major populations in Peninsular Malaysia, the Malays are the largest ethnic group and make up 63% of the total population, followed by Chinese, Indians and other minority ethnic groups.

Many Malays are of Malayo-Polynesian (Austronesian) origin and are culturally and historically heterogeneous [1]. The Malays from the west coast of Peninsular Malaysia are historically linked to Sumatera across the Straits of Malacca, while those from the south are thought to have migrated from Jawa, Sulawesi and other parts of Indonesia [2]. The Malays from the north of the Peninsula have a closer affinity to the Malay Muslims of southern Thailand owing to geographical proximity. Historically, Singapore was not separated from Peninsular Malaysia from the first century until 1965, when Singapore became an independent republic. Therefore, it is very likely that the Malays of Singapore share a similar history of origin with those from the southern part of Peninsular Malaysia [3].

Earlier studies indicated potential genetic sub-structure among the different groups of Malays from Peninsular Malaysia [2, 4, 5], which could possibly be attributed to the migration history of the respective sub-groups. However, the fine-scale sub-structure of the Malay population remains poorly described, particularly because previous studies were based on very limited sample sizes. Such unrecognized sub-structure is a potential confounding factor in genetic association studies, in particular genome-wide association studies (GWAS), and can lead to spurious association signals [6]. As the Malays are one of the major populations in Southeast Asia, characterizing their population substructure is crucial for designing, analyzing and interpreting any genetic association study in this region.

In this study, we show that the genetic diversity and population sub-structure of the Malays from Peninsular Malaysia are correlated with geographical latitude. Notably, the main differentiation was observed between populations from the north and the south of Peninsular Malaysia. In addition, simulation analyses revealed that genetic association is greatly affected by population sub-structure. This suggests that population stratification of samples should be considered at the study design stage, and that association signals should be interpreted carefully when mapping complex diseases in Malay populations.

Results

Population substructure

We first compared the genetic diversity of the Peninsular Malays on a global scale with 6 populations from HapMap3: YRI, CHB, JPT, CEU, MEX and GIH. The PC plot indicated that the Malays clustered closely with the East Asian populations, as expected, and showed rather small genetic diversity. Several Malay individuals from northern Peninsular Malaysia (PMM) showed closer affinity to the South Asian population (GIH) (Fig. 1a). We then performed PCA for the Peninsular and Singapore Malays only, which revealed a seemingly homogeneous cluster (Additional file 1: Figure S1). However, some differentiation was observed corresponding to three geographical regions (north, center and south), although samples from the center region were scattered across the plot (Fig. 1b). We subsequently excluded the samples from the center region (Pahang and Selangor) and re-ran smartPCA. Two clusters were observed, representing the north and south regions, respectively (Fig. 1c).

Fig. 1

Principal Component Analysis (PCA). (a) Global PCA including populations from HapMap3. GIH, Gujarati India Houston; CEU, Northern and Western European from CEPH collection; YRI, Yoruba Ibadan from Nigeria; CHB, Chinese Beijing; JPT, Japanese Tokyo; MEX, Mexican ancestry from Los Angeles; MAS, Metropolitan Malays from Singapore; PMM, Malays from Peninsular Malaysia. The Malay populations are of East Asian descent. (b) PCA plot including samples categorized into North vs Centre vs South; (c) PCA plot including only North vs South. Symbols in red represent the northern region; symbols in blue represent the southern region. Several outliers were excluded from the PCA plot

In the ADMIXTURE analysis, the proportion of the major component differed significantly between the Malays from the north and the south, at 57% and 65%, respectively (P < 0.0001; Fig. 2). At K=3, the newly appearing component (denoted in green) was slightly higher in the central Malays than in the south Malays (6.8% vs 3%; P = 0.0415).

Fig. 2

ADMIXTURE analysis of the Malay populations classified according to region. The bottom plots, expressed as percentages (Y-axis), indicate the average ADMIXTURE values for each region

Correlation of genetic and geographic coordinates

Given that both PC1 and the ADMIXTURE analysis showed significant differences between northern and southern Malay samples, we then investigated whether the genetic diversity among these sub-structures of Malays in Peninsular Malaysia could be attributed to geographical coordinates. Average PC1 values of southern Malay samples (corresponding to Fig. 1b) were generally less than 0 (except for Johor), whilst all geographically defined northern regions had PC1 > 0 (Fig. 3). When we compared PC1 with the geographical latitude of the sample locations, a significant correlation was observed (R² = 0.3925; P = 0.029; Fig. 4). Geographically, Peninsular Malaysia is divided into a west coast and an east coast by the Titiwangsa Range. We therefore asked whether the genetic diversity could be attributed to geographical longitude as well. The analysis of PC1 against geographical longitude, however, showed no significant correlation (R² = 0.0066; P = 0.7924; Additional file 1: Figure S2). We also evaluated whether the genetic diversity was related to the geographical distance between populations, but found no significant correlation between pairwise FST and the geographical distances between populations (R² = 0.01918; P = 0.1385; Additional file 1: Figure S3).

Fig. 3

Average PC1 values of the Malay sub-populations from Peninsular Malaysia and Singapore. Standard error of each population is indicated. The PC1 values correlated well to the geographical locations of each population except for Johor

Fig. 4

Correlation between PC1 and latitude coordinate (P = 0.029)

Genetic differentiation between northern and southern Malays

The regional FST values indicated that the highest regional diversity was between the north and the south after 1,000 bootstrap replicates (FST = 0.001; CI = 2.07E-04) (Table 1; Additional file 2: Table S1). To further identify the genomic regions that are highly differentiated between northern and southern Peninsular Malays, we computed the FST values of the 41,400 SNPs between northern and southern samples, and identified 428 SNPs in the top 1% of FST values (Additional file 2: Table S2), of which 80 (0.1%) had an FST value >0.05 (Table 2). The SNP with the highest FST value was rs4149264, residing in the candidate gene ABCA1, a major gene responsible for high-density lipoprotein cholesterol (HDL-c) synthesis. Another highly differentiated SNP, rs4148475, is located in the candidate gene ABCC4. This gene is a member of the superfamily of ATP-binding cassette (ABC) transporters, which may play a role in cellular detoxification [7]. A missense variant, rs1056836, appears to be one of the four highly differentiated SNPs; it leads to a change of valine to leucine in the candidate gene CYP1B1 and had a minor allele frequency of 0.48 and 0.19 in northern and southern Peninsular Malays, respectively (FST = 0.2037). This candidate gene is involved in drug metabolism and in the synthesis of cholesterols, steroids and lipids. It has been found to play a role in susceptibility to glaucoma [8, 9]. We performed an enrichment analysis with DAVID (http://david.abcc.ncifcrf.gov/) using the top 1% highly differentiated variants, but identified no significant enrichment after Benjamini correction (Additional file 2: Table S3).

Table 1 Pairwise FST bootstrap values of the Malay between the 3 regions of Peninsular Malaysia
Table 2 Top 0.1% SNPs that are highly differentiated between the Malays from the northern and southern regions of Peninsular Malaysia (total SNPs = 42,633)

We observed that 1,666 SNPs had different minor alleles in the north and south Malays, and compared their allele frequencies in Malays with those in South Asians (GIH) and East Asians (CHB) (Additional file 2). Although not substantial, differences in allele frequencies were observed between South and East Asians, as well as between the Malays and both South and East Asians. Notably, rs1126965, located in the candidate gene SLC39A11, had an alternative allele frequency of 0.8178 in the northern Malays and 0.4965 in the southern Malays. This gene has been reported to play a role in liver enzyme levels and smoking initiation [10, 11]. Whether or not this gene is under positive selection in the Malays, however, remains to be investigated. We subsequently assessed whether these SNPs play a role in phenotypic association, and found that 19 of them have been reported in the GWAS catalogue (Additional file 4).

To evaluate the potential effect of population sub-structure on a disease association study, a series of computer simulations was carried out with PLINK following a case–control GWAS design (Additional file 2: Table S4). The GWAS simulations revealed that the effects on false positive rate and statistical power were greater than expected [12].

Discussion

We demonstrated in this study that the Malays from Peninsular Malaysia and Singapore are sub-structured. Although genetic correlation with geographical latitude has previously been reported in Chinese populations [12, 13], it is surprising to find such differentiation among the Malay populations within a region as small as Peninsular Malaysia and Singapore (~800 km from north to south). In addition, the FST between the north and south Malays was similar to that reported earlier between northern and southern Han Chinese (FST = 0.0011) [12], but lower than that within Europeans (FST = 0.0033) [14]. However, we observed higher diversity within the substructures of the Malays. For instance, the FST between two northern Peninsular Malay groups from Kedah and Kelantan was 0.017 (Table S4), which is in line with the finding of a recent study [4]. This suggests higher heterogeneity among Malays than previously expected, possibly due to recent migration and gene flow from the surrounding populations in this region.

The Pahang Malays were found to have a closer affinity to the north, although they were classified as belonging to the central region in this study. This is likely because the samples were collected from Federal Land Development Authority (FELDA) settlers in Pahang state, the majority of whom originated from Kelantan. On a separate note, Selangor, the most developed and most populated state of Malaysia, is where the metropolitan city of Kuala Lumpur is located. PCA revealed that samples from this population were scattered across both the north and south regions (Fig. 1b & c). We believe that urbanization has likely blurred the boundaries. Similar findings were reported by Xu et al. (2009) [12], where populations from metropolitan areas showed a more complicated composition with multiple ancestral origins compared with those from the rest of the area.

Identifying a panel of ancestry informative markers (AIMs) would be an ideal strategy to correct for population stratification in future genetic association studies [15]. However, the SNP coverage and the sample size in the current study are insufficient for this purpose. The SNPs that are highly differentiated between the north and south Malays could be due to genetic drift or, to a lesser extent, natural selection. These SNPs may nevertheless be considered a putative set of AIMs for the Malay populations. The candidate gene ABCA1 is a major gene that plays an important role in high-density lipoprotein cholesterol (HDL-c) synthesis and cholesterol transport [16]. Whilst we suspect that genetic drift is the likely cause, the reason this gene is highly differentiated between northern and southern Malays remains to be investigated. Caution should nevertheless be taken when positive signals for HDL-c and ABCA1 are identified in genetic association studies of Malays.

We acknowledge several limitations of this study. The sample sizes from several locations were small, which might have produced outliers that confounded the correlation between genetic differentiation and geographical coordinates. In addition, self-reported ancestry might have confounded the analysis when samples were assigned to their respective states of origin. However, the number of samples, covering all states in Peninsular Malaysia (and Singapore), and the number of markers used in our study are larger than in previous reports, and thus provide further insights into the genetic structure of the Malays in Peninsular Malaysia. Notably, we revealed a close relationship between genetic and geographical coordinates in the Malay populations. In addition, our results indicate the extent to which admixture in Southeast Asia could affect population stratification and thus genetic association studies. We therefore call for attention to alternative strategies for disease mapping in genetically complex populations, particularly those from Southeast Asia.

Conclusion

In summary, we revealed that the population substructure of the Malays is correlated with latitude. The genetic diversity of the Malays is more heterogeneous than previously expected, and we showed that such population sub-structure, even within a small geographical region, may lead to spurious signals in disease-based genetic association studies. Caution should therefore be taken when designing and carrying out such studies.

Methods

Population and samples

A total of 431 Malay samples were included in this study. These samples were from self-identified Malays from Peninsular Malaysia, 116 of which were genotyped with the Affymetrix Genome-Wide Human SNP Array 6.0, whilst the remaining samples were genotyped with Illumina 660W (sample size, N = 90) and Illumina Omni Express (N = 119). An additional 17 Malay samples from Kelantan genotyped with the Affymetrix Genome-Wide Human SNP Array 6.0 [17], and 89 samples of metropolitan Malays from Singapore (SGVP) [3], were also included in this analysis. The studies were approved by the research and ethics committees of Universiti Teknologi MARA and Universiti Sains Malaysia, and the design of this study followed the Helsinki Declaration of 1975, as revised in 2000. The collected samples covered all 11 states of Peninsular Malaysia (Fig. 5), which were divided into 3 geographical regions for the purpose of this study, namely the North, South and Centre regions, according to their respective latitude coordinates (Table 3). The number of samples and their geographical locations are listed in Table 3. Six selected populations comprising 805 samples from the International HapMap Project 3 (HapMap 3) [18] were included in the analysis to characterize the genetic variation of the Malays on a global scale: YRI, GIH, CEU, CHB, JPT and MEX.

Fig. 5

The geographical map of Peninsular Malaysia. The sampling locations are shown in red dots

Table 3 Regional categorization of the Peninsular Malaysia states according to geographical location, and final number of samples included after QC

Data assemblage

Data QC and assemblage were carried out with PLINK 1.07. Datasets from each platform were first filtered to remove individuals with >10% missing genotypes and SNPs with a missing rate >10%, a minor allele frequency (MAF) < 0.05, or a Hardy-Weinberg Equilibrium (HWE) P < 0.002. The filtered datasets were then merged, yielding 42,633 SNPs shared among all 402 Malay samples. The dataset was further pruned by removing SNPs with r² > 0.8, leaving a total of 41,400 SNPs for further analyses.
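
The per-SNP thresholds above can be illustrated with a minimal Python sketch. This is not the PLINK implementation: the function names are hypothetical, genotypes are assumed to be coded as counts of one allele (0, 1, 2) with `None` for a missing call, and the HWE test uses a 1-df chi-square approximation, which is an assumption made here for brevity and may differ from the test PLINK applies.

```python
import math


def hwe_chi2_p(n_aa, n_ab, n_bb):
    """Chi-square (1 df) P value for deviation from Hardy-Weinberg proportions.

    Assumes the SNP is polymorphic, so every expected count is positive.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_bb + n_ab) / (2 * n)  # frequency of the counted allele
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def snp_passes_qc(genotypes, max_missing=0.10, min_maf=0.05, min_hwe_p=0.002):
    """Apply the missingness, MAF and HWE thresholds to a single SNP."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    if 1 - len(called) / len(genotypes) > max_missing:
        return False
    n_aa, n_ab, n_bb = called.count(0), called.count(1), called.count(2)
    p = (2 * n_bb + n_ab) / (2 * len(called))
    if min(p, 1 - p) < min_maf:
        return False
    return hwe_chi2_p(n_aa, n_ab, n_bb) >= min_hwe_p
```

The MAF check runs before the HWE check so that monomorphic SNPs never reach the chi-square step, where an expected count of zero would cause a division by zero.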

Analysis of population structure

Principal Component Analysis (PCA) was first carried out using smartPCA in the EIGENSOFT package (ver 4.0). The genetic components of the Malay populations were inferred with ADMIXTURE ver 1.22 (Alexander et al., 2009) [19], using the 41,400 SNPs shared across all samples.

Latitude-PC correlation

Pearson’s correlation coefficient was calculated to evaluate the relationship between the genetic coordinates (PC values) and the geographic latitudes.
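
As a minimal illustration of this step, the sketch below computes Pearson's r (and hence R²) for two equal-length lists, for example one latitude and one average PC1 value per population. The function name is hypothetical and no values from this study are used; the software actually used for the calculation is not stated in the text.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def r_squared(x, y):
    """Coefficient of determination of the simple linear fit of y on x."""
    return pearson_r(x, y) ** 2
```

For a simple linear regression with one predictor, R² equals the square of Pearson's r, which is why the two quantities can be reported interchangeably for the PC1 versus latitude comparison.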

Pairwise FST

Unbiased estimates of FST were calculated according to Weir and Hill (2002) [20], with confidence intervals estimated from 1,000 bootstrap replicates.
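
The following sketch shows the general shape of a pairwise FST calculation with a bootstrap interval. It deliberately swaps in the simpler Hudson estimator (a ratio of averages across SNPs) rather than the Weir and Hill (2002) estimator used in this study, and it assumes that the bootstrap resamples SNPs with replacement; the text does not state the resampling unit. All function and parameter names are hypothetical.

```python
import random


def hudson_fst(p1, p2, n1, n2):
    """Hudson FST between two populations as a ratio of averages over SNPs.

    p1, p2: per-SNP allele frequencies in populations 1 and 2.
    n1, n2: number of sampled chromosomes in each population (must be > 1).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        numerator += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        denominator += a * (1 - b) + b * (1 - a)
    if denominator == 0:
        raise ValueError("FST is undefined when no SNP is polymorphic")
    return numerator / denominator


def bootstrap_fst_interval(p1, p2, n1, n2, replicates=1000, seed=0):
    """Percentile 95% interval from resampling SNPs with replacement."""
    rng = random.Random(seed)
    m = len(p1)
    estimates = []
    for _ in range(replicates):
        idx = [rng.randrange(m) for _ in range(m)]
        estimates.append(
            hudson_fst([p1[i] for i in idx], [p2[i] for i in idx], n1, n2)
        )
    estimates.sort()
    lower = estimates[int(0.025 * replicates)]
    upper = estimates[int(0.975 * replicates) - 1]
    return lower, upper
```

Summing numerators and denominators before dividing avoids giving undue weight to SNPs with very small denominators. With genome-wide data, resampling blocks of neighbouring SNPs rather than single SNPs would better respect linkage disequilibrium.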

GWAS simulation

Simulations on genome-wide association study (GWAS) were performed using PLINK 1.07, following the procedure of Xu et al. (2009) [12].
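
To show why sub-structure can inflate association signals, the sketch below is a deterministic toy calculation, not the PLINK simulation procedure used here. It assumes a SNP with no true disease effect whose allele frequency differs between two sub-populations, and cases and controls drawn from the sub-populations in different proportions. The function names, sampling proportions and sample sizes are hypothetical.

```python
import math


def mixed_freq(prop_north, freq_north, freq_south):
    """Expected allele frequency in a sample mixing two sub-populations."""
    return prop_north * freq_north + (1 - prop_north) * freq_south


def allelic_chi2(freq_cases, freq_controls, n_cases, n_controls):
    """1-df allelic chi-square and P value from expected allele counts."""
    a = 2 * n_cases * freq_cases        # counted allele in cases
    b = 2 * n_cases - a                 # other allele in cases
    c = 2 * n_controls * freq_controls  # counted allele in controls
    d = 2 * n_controls - c              # other allele in controls
    total = a + b + c + d
    chi2 = total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Allele frequencies of 0.48 (north) and 0.19 (south) mirror those reported
# for rs1056836; the 70%/30% sampling split and 500/500 design are hypothetical.
case_freq = mixed_freq(0.7, 0.48, 0.19)
control_freq = mixed_freq(0.3, 0.48, 0.19)
chi2, p_value = allelic_chi2(case_freq, control_freq, 500, 500)
print(f"chi2 = {chi2:.2f}, P = {p_value:.2e}")
```

When cases and controls are drawn from the two sub-populations in the same proportions, the expected allele frequencies coincide and the statistic is zero, which is why matching by region at the design stage removes this particular source of spurious signal.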