PopAffiliator: online calculator for individual affiliation to a major population group based on 17 autosomal short tandem repeat genotype profile
- Cite this article as:
- Pereira, L., Alshamali, F., Andreassen, R. et al. Int J Legal Med (2011) 125: 629. doi:10.1007/s00414-010-0472-2
Because of their sensitivity and high level of discrimination, short tandem repeat (STR) marker systems are currently the method of choice in routine forensic casework and data banking, usually in multiplexes of up to 15–17 loci. Constraints related to sample amount and quality, frequently encountered in forensic casework, mean that this picture is unlikely to change in the near future, notwithstanding technological developments. In this study, we present a free online calculator named PopAffiliator (http://cracs.fc.up.pt/popaffiliator) for affiliating an individual to one of the three main population groups, Eurasian, East Asian and sub-Saharan African, based on genotype profiles for the common set of STRs used in forensics. This calculator performs affiliation based on a model constructed using machine learning techniques. The model was constructed using a data set of approximately fifteen thousand individuals collected for this work. The accuracy of individual population affiliation is approximately 86%, showing that the common set of STRs routinely used in forensics provides a considerable amount of information for population assignment, in addition to being excellent for individual identification.
Keywords: Online calculator; Genotype profile; Autosomal STRs; Individual affiliation
Because of their high discriminating power, microsatellites or short tandem repeats (STRs) are the preferred genetic markers in forensic genetics. These markers are characterized by size variation of short (2–8 bp) tandemly repeated motifs, with a mutation rate of about 10⁻³ per locus per generation. Improvements in high-throughput technologies and the need for stringent quality assurance in forensic investigation led to the development of reliable commercial multiplex kits. These kits have high detection sensitivity, allowing results to be obtained from residual and degraded samples. Additionally, as several STRs are screened in the same reaction, sample amount is conserved and the opportunity for laboratory errors and contamination is reduced. Two commercial kits are very successful in the forensic community: the AmpFℓSTR® Identifiler® PCR Amplification Kit from AB Applied Biosystems (Foster City, CA, USA) with 15 STR loci (CSF1PO, D2S1338, D3S1358, D5S818, D7S820, D8S1179, D13S317, D16S539, D18S51, D19S433, D21S11, FGA, TH01, TPOX, vWA) and the gender marker Amelogenin; and the PowerPlex® 16 System from Promega (Madison, WI, USA) with 15 loci (CSF1PO, D3S1358, D5S818, D7S820, D8S1179, D13S317, D16S539, D18S51, D21S11, FGA, Penta D, Penta E, TH01, TPOX, vWA) and the gender marker Amelogenin. When used in tandem, the two kits generate information for 17 STRs and provide quality control, because 13 of the STRs are amplified by both kits.
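The marker counts quoted for the two kits can be checked with simple set arithmetic. The following is only an illustrative sketch: the variable names are ours, and the locus lists follow the kit descriptions in the text (with the sex marker Amelogenin left out).

```python
# STR loci listed in the text for each kit (Amelogenin excluded).
IDENTIFILER = {
    "CSF1PO", "D2S1338", "D3S1358", "D5S818", "D7S820", "D8S1179", "D13S317",
    "D16S539", "D18S51", "D19S433", "D21S11", "FGA", "TH01", "TPOX", "vWA",
}
POWERPLEX_16 = {
    "CSF1PO", "D3S1358", "D5S818", "D7S820", "D8S1179", "D13S317", "D16S539",
    "D18S51", "D21S11", "FGA", "Penta D", "Penta E", "TH01", "TPOX", "vWA",
}

combined = IDENTIFILER | POWERPLEX_16          # loci covered when both kits are used
shared = IDENTIFILER & POWERPLEX_16            # loci typed twice (quality control)
only_identifiler = IDENTIFILER - POWERPLEX_16  # loci unique to Identifiler
only_powerplex = POWERPLEX_16 - IDENTIFILER    # loci unique to PowerPlex 16

print(len(combined), len(shared))
print(sorted(only_identifiler), sorted(only_powerplex))
```

The four kit-specific loci found this way (D2S1338, D19S433, Penta D and Penta E) are the ones that cannot be cross-checked between kits.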
During the last decade, a large amount of allele frequency data has accumulated for these STRs at a worldwide level. In an online database reporting data on these STR markers published in the main forensic science journals, the last update totalled an average of 842,826 individuals sampled for each of the 17 STRs, from 92 countries (2 in Australasia; 1 in North America; 14 in Central and South America; 27 in Europe; 11 in Near East; 6 in North Africa; 11 in sub-Saharan Africa; 7 in South Asia; 5 in East Asia; 8 in Southeast Asia). Unfortunately, most of these publications report only allele frequencies, which are not as informative as genotype profiles. For instance, many machine learning classifiers, such as the ones applied in this work, can take into account which two alleles are present in the individual at each marker. Recently, authors have been advised to publish the genotype profiles along with the allele frequencies, but many forensic laboratories have ethical concerns about publishing them because of the high power of individual identification attained by typing these markers (for instance, for the AmpFℓSTR® Identifiler® PCR Amplification Kit, the probability that two individuals selected at random will have an identical profile is 5.01 × 10⁻¹⁸ for US Caucasians; company’s information). Moreover, these publications usually do not provide information on ethnic group affiliation or on the sampling strategy, which would be very useful for population genetic studies. Nonetheless, this is not a major concern for clearly European, African and East Asian populations.
Affiliating an individual to a population group has obvious advantages in forensic genetics, namely in the identification of a missing person or as an investigative tool. Non-recombining and uniparentally transmitted markers, such as those in mitochondrial DNA and on the Y-chromosome, can be informative for ascertaining the affiliation of maternal and paternal lineages, respectively, owing to the high level of population structure for these markers [3, 4]. They do not, however, allow individual affiliation. Some authors have investigated the use of biallelic markers that have extreme differences in allelic frequencies between population groups for the purpose of population affiliation (the so-called ancestry-informative-marker single nucleotide polymorphisms (SNPs) [5–7]). These ancestry-informative SNPs were recently shown to be evenly distributed across the genome. However, these markers have very low informative power for individual identification, being almost fixed within a population, so that they may only be used in conjunction with the common forensic systems. On the other hand, biallelic markers selected as highly polymorphic in order to be informative for individual identification (although always less informative than STRs) are not so informative for population affiliation.
As most human genetic variation is observed within populations (93–95% as estimated from autosomal STRs), a large set of markers, both STRs and SNPs, is traditionally considered necessary to be informative for population affiliation. For instance, a data set of 377 autosomal STRs in 1056 individuals from 52 populations was used to identify six main genetic clusters, five of which correspond to major geographical regions (Africa; Europe together with the part of Asia south and west of the Himalayas; East Asia; Oceania; and the Americas). Clusteredness was noticeably lower for 10 and 20 loci and for database sample sizes of 100, but comparatively higher for 50 or more loci and database sample sizes of 250 and 500. In another work, an average accuracy of at least 90% required a minimum of ∼60 markers, and the assignment accuracy for a historically admixed southern India sample was only 87% even with 160 markers, while an average accuracy of 95% in predicting the ancestral continent of origin was attained with 50 SNPs selected from the large HapMap SNP data set for their informativeness for population affiliation.
A few tests of ethnicity inference have been conducted for the common forensic STR package, using 6 [14, 15], 13, 15 and up to 19 STRs. These studies applied very different methods, from empirical evaluations (recalculating allelic frequencies by removing one individual at a time and using these to estimate the percentage of correct affiliation) to the application of Bayesian classifiers to a simulated genotype profile database (constructed from allelic frequencies), and generally reported correct classification rates of around 90% for 16–18 STRs (or slightly higher when comparing pairs of very distinct populations). None of these works, however, provided researchers with a tool for evaluating the population assignment of an individual in their daily casework investigations.
Data set 3R: encompassing three regions (Asia, Eurasia and sub-Saharan Africa) and including data from 14,714 individuals;
Data set 5R: encompassing five regions (Asia, Eurasia, sub-Saharan Africa, North Africa and Near East) with 16,090 individuals;
Data set 7R: encompassing all seven regions and including data from 54,267 individuals.
It is expected that, as the number of regions increases from 3R to 5R to 7R, the difficulty of predicting the geographical origin of an individual also increases. This is because the 5R and 7R data sets include regions long known to lie on the path of many past human migrations, such as North Africa and the Near East, and because the affiliation of individuals to regions like North and Central-South America is artificial, since their ancestors' recent origins lie elsewhere, namely in Eurasia, East Asia and sub-Saharan Africa.
Furthermore, for each data set, the machine learning analyses were conducted on two subsets: a ‘balanced data set’ composed of an even number of individuals for each population class considered, and an ‘unbalanced test data set’ composed of the remaining data.
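The split into a balanced set and an unbalanced remainder can be sketched as follows. This is a minimal illustration under our own assumptions, not the procedure actually used: the function name `balanced_split`, the random sampling with a fixed seed, and the toy records are all hypothetical.

```python
import random
from collections import defaultdict

def balanced_split(profiles, per_class, seed=0):
    """Split (label, genotype) records into a balanced set and a remainder.

    The balanced set holds exactly `per_class` randomly chosen records per
    label; every other record goes to the (unbalanced) remainder.
    """
    by_label = defaultdict(list)
    for record in profiles:
        by_label[record[0]].append(record)

    rng = random.Random(seed)
    balanced, remainder = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        if len(group) < per_class:
            raise ValueError(f"class {label!r} has only {len(group)} profiles")
        rng.shuffle(group)
        balanced.extend(group[:per_class])
        remainder.extend(group[per_class:])
    return balanced, remainder

# Toy records: the integers stand in for genotype profiles.
toy = (
    [("Eurasia", i) for i in range(10)]
    + [("East Asia", i) for i in range(6)]
    + [("sub-Saharan Africa", i) for i in range(4)]
)
balanced, remainder = balanced_split(toy, per_class=4)
print(len(balanced), len(remainder))
```

Because the remainder keeps whatever is left of each class, its class proportions reflect the uneven sampling of the original collection.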
Not all of the 17 STR markers were typed in all populations. The highest percentages of missing values were observed for markers present in only one of the kits: Penta D and Penta E, present only in the PowerPlex® 16 kit, were missing in 90% of the profiles, and D2S1338 and D19S433, present only in the AmpFℓSTR® Identifiler® PCR Amplification Kit, were missing in 15% of the profiles. The other 13 common markers had between 2% and 8% missing values.
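Per-marker missing-value percentages of this kind can be tallied as in the sketch below. The data layout (one dictionary per profile, with `None` or an absent key for an untyped marker) and the function name are assumptions made for illustration only.

```python
def missing_percentages(profiles, markers):
    """Return, for each marker, the percentage of profiles with no genotype."""
    total = len(profiles)
    if total == 0:
        raise ValueError("no profiles given")
    return {
        marker: 100.0 * sum(1 for p in profiles if p.get(marker) is None) / total
        for marker in markers
    }

# Toy profiles: each genotype is a pair of allele values.
toy_profiles = [
    {"TH01": (6, 9), "Penta D": None},
    {"TH01": (7, 9), "Penta D": (9, 12)},
    {"TH01": None, "Penta D": None},
    {"TH01": (9, 9)},                     # Penta D key absent: also missing
]
rates = missing_percentages(toy_profiles, ["TH01", "Penta D"])
print(rates)
```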
Machine learning methods
Machine learning methods aim to extract information (knowledge) from data by applying algorithms that allow computers to construct models of the data automatically. The Weka (Waikato Environment for Knowledge Analysis) software package was used in our study to discover relationships between the alleles for each marker and the geographical region. Weka contains a wide collection of data pre-processing and modeling techniques, which makes it a good choice for exploring different modeling techniques on the data. The method for the construction of the model will be published elsewhere, but basically it consists of exploiting the features of several learning and meta-learning methods available in Weka (0R; 1R; DTNB; SMO; NaiveBayes; J48; PART; DecisionStump; MultilayerPerceptron; NBTree; RandomForest; BayesNet). These algorithms were applied to each of the data sets 3R, 5R and 7R. To handle the missing values in the data, we relied on each machine learning algorithm's built-in capability to handle such values.
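To give a feel for how a classifier can cope with missing genotypes, the sketch below implements a small categorical naive Bayes classifier that simply skips missing features. It is a teaching example in Python, not Weka's NaiveBayes or DTNB and not the model behind PopAffiliator; the class name, the Laplace smoothing and the skip-missing convention are our own assumptions, and the group labels "X" and "Y" are placeholders.

```python
import math
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    """Naive Bayes over categorical features; None marks a missing value."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.vocab = defaultdict(set)             # feature -> values seen in training
        for row, label in zip(rows, labels):
            for feature, value in row.items():
                if value is None:
                    continue                      # a missing value adds no evidence
                self.value_counts[(label, feature)][value] += 1
                self.vocab[feature].add(value)
        return self

    def log_scores(self, row):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)           # log prior of the class
            for feature, value in row.items():
                if value is None or feature not in self.vocab:
                    continue                      # skip missing or unknown features
                counts = self.value_counts[(label, feature)]
                # Laplace smoothing, with one extra slot for unseen allele values.
                denom = sum(counts.values()) + len(self.vocab[feature]) + 1
                score += math.log((counts[value] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)

# Toy training data: two placeholder groups, one marker, one missing allele.
rows = [
    {"TH01-1": 6, "TH01-2": 9},
    {"TH01-1": 6, "TH01-2": None},
    {"TH01-1": 7, "TH01-2": 9},
    {"TH01-1": 7, "TH01-2": 9.3},
]
labels = ["X", "X", "Y", "Y"]
model = CategoricalNaiveBayes().fit(rows, labels)
print(model.predict({"TH01-1": 6, "TH01-2": None}))
```

Skipping a missing feature amounts to scoring the profile on the markers that were actually typed, which is one common convention among several.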
A direct approach to analyzing the data is to use the STR markers as features. Since humans are diploid, the values of the two alleles for a given STR were ordered and designated as the first (lowest) and second (highest) values of a feature. For instance, the marker CSF1PO is associated with two features: CSF1PO-1 and CSF1PO-2. A total of 34 features were thus considered for each individual.
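The encoding just described can be sketched in a few lines. The function name `encode_profile` and the use of `None` for untyped markers are illustrative assumptions; only the ordering rule and the `-1`/`-2` feature naming come from the text.

```python
MARKERS = [
    "CSF1PO", "D2S1338", "D3S1358", "D5S818", "D7S820", "D8S1179", "D13S317",
    "D16S539", "D18S51", "D19S433", "D21S11", "FGA", "Penta D", "Penta E",
    "TH01", "TPOX", "vWA",
]

def encode_profile(genotypes, markers=MARKERS):
    """Turn {marker: (allele, allele)} into ordered per-allele features.

    For each marker, '<marker>-1' holds the lower allele value and
    '<marker>-2' the higher one; untyped markers yield None for both.
    """
    features = {}
    for marker in markers:
        alleles = genotypes.get(marker)
        if alleles is None:
            low = high = None
        else:
            low, high = sorted(alleles)
        features[f"{marker}-1"] = low
        features[f"{marker}-2"] = high
    return features

example = encode_profile({"CSF1PO": (12, 10), "TH01": (9.3, 6)})
print(example["CSF1PO-1"], example["CSF1PO-2"], len(example))
```

Ordering the alleles removes the arbitrary distinction between the genotypes (10, 12) and (12, 10), so that identical genotypes always map to identical feature values.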
The best model for inferring population affiliation was selected by calculating the predictive accuracy, also known as generalization accuracy. The (predictive) accuracy is the proportion of correct predictions over the whole set of instances. To estimate the accuracy of the classifiers, a tenfold cross-validation procedure was used on the balanced data set, and the unbalanced test data set was used as a test set. The evaluation procedure was applied to each of the three data sets: 3R, 5R and 7R. Additionally, sensitivity tests on two variables were undertaken: (i) the size of the training data set (subsets of 50, 100, 150, 200, 250, 300, 350, 400, 450 and 500 individuals of each class were considered) and (ii) the number of markers (6, 9, 13, 15 and 17 markers were considered).
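The accuracy measure and the tenfold cross-validation loop can be sketched as follows. This is a plain, non-stratified version written for illustration; it is not Weka's evaluation code, and the helper names as well as the majority-class stand-in classifier are hypothetical.

```python
import random
from collections import Counter

def accuracy(predicted, actual):
    """Proportion of correct predictions over all instances."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(labels, train_and_predict, k=10, seed=0):
    """Predict every instance once, using a model trained on the other folds."""
    predicted = [None] * len(labels)
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        for i, guess in zip(held_out, train_and_predict(train, held_out)):
            predicted[i] = guess
    return accuracy(predicted, labels)

def majority_baseline(labels):
    """Stand-in classifier: always predict the most frequent training label."""
    def train_and_predict(train_idx, test_idx):
        most_common = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
        return [most_common] * len(test_idx)
    return train_and_predict

toy_labels = ["A"] * 60 + ["B"] * 40
score = cross_validate(toy_labels, majority_baseline(toy_labels), k=10)
print(round(score, 2))
```

A real classifier would replace `majority_baseline`; the baseline is useful mainly as the floor that any learned model should exceed.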
The best model was obtained with Weka's DTNB method with boosting. DTNB combines decision tables with naive Bayes and was applied to the data set 3R, using a balanced data set of 1200 individuals and 17 STRs. This model achieved an accuracy of 86.77%. This is the model implemented in PopAffiliator, the online calculator. The analysis of the effect of data set size and of the number of STRs showed that accuracy increases markedly when the data grow from 150 to 450 genotype profiles per population group, after which the gain levels off. We conjecture that better models can still be obtained by reducing the percentage of missing values for the two Penta markers (currently 90% missing).
The online calculator
Lowe et al. draw attention to the fact that “[…] as long as it is made clear that the information provided from the DNA profile is probabilistic—not a simple categorical classification—then we believe that it can provide useful strategic guidance when set into the context of the other information available to the investigator. An indication that the offender was of Caucasian origin may be of little use in an area where the majority of the inhabitants are Caucasians but may be far more valuable in a locality where they form a minority of the population.” We agree with these authors.
Our confirmation of an 86% accuracy of individual population affiliation for the common 17 STR genotype profiles shows that this well-known forensic set of STRs also carries a considerable amount of information for population assignment, besides being excellent for individual identification. We believe that our online calculator will be a valuable tool in helping forensic researchers to predict population affiliation in specific forensic casework. However, researchers should always be aware that this information is only a first indication, which should be confirmed by other genetic and nongenetic evidence if the population affiliation is essential to resolving a case. This is especially true for populations that result from extensive admixture between population groups, such as populations from the Near East or America, in which most individuals will in any case have genuinely mixed ancestry.
IPATIMUP is an Associate Laboratory of the Portuguese Ministry of Science, Technology and Higher Education and is partially supported by FCT, the Portuguese Foundation for Science and Technology. CRACS-INESC Porto is supported by Programa Operacional Ciência, Tecnologia e Inovação (POCTI) and Quadro Comunitário de Apoio III. NJ and DH were supported by grant 196-1962766-2751. LZh received grants from the Russian Academy of Science for Mol & Cell Biol and FSM.