Ancestry-informative marker (AIM) SNP panel for the Malay population

  • Padillah Yahya
  • Sarina Sulong
  • Azian Harun
  • Pongsakorn Wangkumhang
  • Alisa Wilantho
  • Chumpol Ngamphiw
  • Sissades Tongsima
  • Bin Alwi Zilfalil (email author)
Original Article


Ancestry-informative markers (AIMs) can be used to infer an individual's ancestry and thereby reduce the inaccuracy of self-reported ethnicity in biomedical research. In this study, we describe three methods for selecting AIM SNPs for the Malay population (Malay AIM panels), based on pairwise FST, informativeness for assignment (In), and PCA-correlated SNPs (PCAIMs). These Malay AIM panels were extracted from SNP array genotype data hosted by the Malaysian node of the Human Variome Project (MyHVP) and the Singapore Genome Variation Project (SGVP). In total, genotype data from 165 Malay individuals were analyzed: 117 individuals genotyped on the Affymetrix SNP-6 array platform and 48 individuals genotyped on the Illumina OMNI 2.5 array platform. The HapMap phase 3 database (1397 individuals from 11 populations) was used as a reference for comparison with the Malay genotype data. The accuracy of each resulting Malay AIM panel was evaluated using a machine learning “ancestry-predictive model” constructed with WEKA, a comprehensive machine learning platform written in Java. With the pairwise FST method, a panel of 1250 SNPs distinguished Malay individuals from other world populations with an accuracy of 90%, and the accuracy decreased to 80% with 157 SNPs. Panels of 200 SNPs selected using In and PCAIMs identified Malay individuals with an accuracy of approximately 80%.
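For illustration, the two frequency-based scores named above can be sketched for a single biallelic SNP. This is a minimal sketch, not the authors' pipeline: the function names are hypothetical, the FST shown is a simple heterozygosity-based (Wright-style) estimator that ignores sample-size corrections, and In follows the standard definition of informativeness for assignment with equal weight given to each population.

```python
import math


def pairwise_fst(p1, p2):
    """Wright-style FST for one biallelic SNP.

    p1, p2: frequency of the same allele in two populations.
    Assumption: equal population weights, no sample-size correction.
    """
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity of the pooled population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within the two populations.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        return 0.0  # SNP is monomorphic in both populations
    return (h_total - h_within) / h_total


def informativeness(freqs):
    """Informativeness for assignment (In) for one biallelic SNP.

    freqs: frequency of one allele in each of K populations.
    Terms with zero frequency contribute 0 (limit of p * log p).
    """
    k = len(freqs)
    total = 0.0
    # Loop over both alleles: the given one and its complement.
    for allele_freqs in (freqs, [1.0 - p for p in freqs]):
        p_mean = sum(allele_freqs) / k
        if p_mean > 0.0:
            total -= p_mean * math.log(p_mean)
        total += sum(p * math.log(p) for p in allele_freqs if p > 0.0) / k
    return total
```

Under these assumptions, a SNP fixed for different alleles in two populations gives FST = 1 and In = ln 2, while identical allele frequencies give 0 for both scores.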


Ancestry; Ancestry-informative markers; Admixture; Malay; Population; SNP



We would like to thank all participants who contributed samples for this study. All authors contributed to aspects of the conception or design of the experimental work, the analysis of data, and the review and approval of the final content of the manuscript.

Funding information

This work was supported by a Universiti Sains Malaysia Apex Grant (1002/PPSP/910343) and an NTU Grant (Muhammed Ariff Research Grant (MAS): 304.PPSP.6150148.N119).

Compliance with ethical standards

The humane and ethical research standards recommended by Universiti Sains Malaysia were followed in this study. All participants provided written informed consent before sample collection. This study was approved by the Universiti Sains Malaysia ethics committee.

Conflict of interest

The authors declare that they have no competing interests.

Supplementary material

Supplement Fig. 1

Method overview: We propose an AIM selection workflow (see the Methods section for details) comprising quality control (QC), population clustering using ipPCA, and three approaches for selecting AIMs. In particular, we apply standard QC steps to the Malaysian population data, to the 11 world populations from HapMap, and to population data from three Singaporean ethnic groups. We then perform population clustering using ipPCA, which yields 11 subpopulations (SP1-SP11). We apply the AIM selection methods across the 11 subpopulations: pairwise Fst (selecting the top 5 and top 50 SNPs from each SP), informativeness for assignment (200 SNPs), and PCAIMs (200 SNPs). Finally, we assess each model’s performance via ROC analysis using WEKA.

Supplement Fig. 2

The performance of the SNPs selected based on In in classifying Malay individuals into their correct group.

Supplement Fig. 3

The performance of the 100 SNPs selected based on In as shown by ADMIXTURE analysis.

Supplement Fig. 4

The performance of the 100 SNPs selected based on PCAIMs (k=3) as shown by ADMIXTURE analysis.

Supplement Fig. 5

Comparison of the performance of the PCAIMs and In methods.

Supplement data 6

Comparison of AIM models based on WEKA analysis.

Supplement Fig. 7

Genetic structure of the Malay population using a set of 555 AIMs.

Supplement Fig. 8

Classification assessment via ROC analysis for 200 SNPs selected using PCAIMs (k=3).

Supplement Fig. 9

Classification assessment via ROC analysis for 1250 SNPs selected using Fst.

Supplement Fig. 10

Classification assessment via ROC analysis for 200 SNPs selected using In.

Supplement Fig. 11

Classification assessment via ROC analysis for 157 SNPs selected using Fst.
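The per-subpopulation selection step mentioned in the method overview for Supplement Fig. 1 (top 5 or top 50 SNPs from each SP) could look roughly like the sketch below. The input layout, the function name, and the example scores are assumptions made for illustration. How each subpopulation's Fst scores were actually computed is not specified here.

```python
def select_top_k_per_group(scores_by_group, k):
    """Return the union of the k highest-scoring SNPs from each group.

    scores_by_group: {group_label: {snp_id: score}} (hypothetical layout).
    Ties are broken by SNP identifier so the result is deterministic.
    """
    panel = set()
    for scores in scores_by_group.values():
        ranked = sorted(scores, key=lambda snp: (-scores[snp], snp))
        panel.update(ranked[:k])
    return sorted(panel)


# Made-up scores for two subpopulations and three SNPs.
example_scores = {
    "SP1": {"rs1": 0.30, "rs2": 0.10, "rs3": 0.25},
    "SP2": {"rs1": 0.05, "rs2": 0.40, "rs3": 0.35},
}
top1_panel = select_top_k_per_group(example_scores, 1)
top2_panel = select_top_k_per_group(example_scores, 2)
```

Because the same SNP can rank highly in several subpopulations, the resulting panel can contain fewer than k times the number of subpopulations.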



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  • Padillah Yahya (1)
  • Sarina Sulong (2)
  • Azian Harun (3)
  • Pongsakorn Wangkumhang (4)
  • Alisa Wilantho (4)
  • Chumpol Ngamphiw (4)
  • Sissades Tongsima (4)
  • Bin Alwi Zilfalil (1, email author)

  1. Department of Paediatrics, School of Medical Sciences, Universiti Sains Malaysia, Kubang Kerian, Malaysia
  2. Human Genome Centre, School of Medical Sciences, Universiti Sains Malaysia, Kubang Kerian, Malaysia
  3. Department of Medical Microbiology and Parasitology, School of Medical Sciences, Universiti Sains Malaysia, Kubang Kerian, Malaysia
  4. National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand Science Park, Pathum Thani, Thailand
