Introduction

Y-chromosomal lineages are established by single nucleotide polymorphisms (SNPs) and short tandem repeats (STRs), which define the haplogroup and the haplotype, respectively. In human population genetics, haplogroup determination is of great interest because it reveals phylogenetic relationships by descent.

Bosch et al. [1] and Behar et al. [2] found that STR variability is partitioned by haplogroups to a greater extent than by populations. Since then, there has been increasing interest in unifying these sources and in finding ways of predicting the haplogroup of a given haplotype when SNP data are unavailable. One such tool is Whit Athey’s haplogroup predictor [3, 4] (https://home.comcast.net/~hapest5/index.html), which has been employed in previous studies [5–7] to estimate the ethnic composition of different populations and the diagnostic STR values of a given haplogroup. The other is the haplogroup classifier [8] (http://bcf.arl.arizona.edu/haplo), which consists of machine-learning algorithms that require previous models and haplotypes with a known haplogroup for training the software.

The purpose of this study was to establish the accuracy of both software systems: the haplogroup predictor and the haplogroup classifier.

Materials and methods

We analyzed a sample of 119 males from four provinces of Northwest Argentina (Jujuy, Salta, Catamarca, and Tucumán), all of whom gave informed consent.

Haplogroups were determined in a previous report [9]; the nomenclature followed the YCC [10] recommendations, and haplotypes were defined by the amplification of DYS19, DYS389 I and II, DYS390, DYS391, DYS392, and DYS393, according to previously published methods [11–15].

These haplotypes were submitted to the haplogroup predictor, with equal priors, to obtain probabilities for the inferred haplogroups. In the case of the haplogroup classifier, we used the models, tree files, and public data provided by the authors in the downloadable version of the software; this data set consisted of 1,527 Y-chromosome profiles with haplogroup and haplotype gathered from published data [16, 17]. SNP-determined haplogroups were compared with those provided by the software. It was not possible to evaluate the sample analyzed by Schlecht et al. [8] because their data were not available.

Sensitivity (s), specificity (e), positive likelihood ratio (LR+ = s/(1 − e)), and negative likelihood ratio (LR− = (1 − s)/e) of the haplogroup predictor and the haplogroup classifier were calculated per haplogroup and in total [18]. Here, “s” is the probability that the software predicts a haplogroup when SNP analysis assigned the haplotype to that haplogroup, while “e” is the probability that the software assigns a different haplogroup when SNP typing placed the haplotype in a different haplogroup. A test is considered adequate if it has both high sensitivity and high specificity [19] and an LR+ value of at least 10; we therefore adopted these criteria.
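
As an illustration of these definitions, the following minimal sketch computes s, e, LR+, and LR− for one haplogroup in a one-versus-rest fashion. The function name, the label lists, and the haplogroup names are hypothetical and are not taken from our data set; the sketch only shows how the four quantities relate to the counts of agreements and disagreements between SNP typing and software output.

```python
import math


def one_vs_rest_metrics(snp_labels, software_labels, haplogroup):
    """Return (s, e, LR+, LR-) for one haplogroup against all the others.

    snp_labels: haplogroups determined by SNP typing (reference).
    software_labels: haplogroups assigned by the software, in the same order.
    """
    pairs = list(zip(snp_labels, software_labels))
    tp = sum(t == haplogroup and p == haplogroup for t, p in pairs)
    fn = sum(t == haplogroup and p != haplogroup for t, p in pairs)
    fp = sum(t != haplogroup and p == haplogroup for t, p in pairs)
    tn = sum(t != haplogroup and p != haplogroup for t, p in pairs)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one sample inside and outside the haplogroup")

    s = tp / (tp + fn)                            # sensitivity
    e = tn / (tn + fp)                            # specificity
    lr_pos = s / (1 - e) if e < 1 else math.inf   # LR+ = s / (1 - e)
    lr_neg = (1 - s) / e if e > 0 else math.inf   # LR- = (1 - s) / e
    return s, e, lr_pos, lr_neg


# Hypothetical toy data: ten samples, two haplogroups.
snp = ["R", "R", "R", "R", "Q", "Q", "Q", "Q", "Q", "Q"]
software = ["R", "R", "R", "Q", "R", "R", "Q", "Q", "Q", "Q"]
s, e, lr_pos, lr_neg = one_vs_rest_metrics(snp, software, "R")
print(f"s={s:.2f} e={e:.2f} LR+={lr_pos:.2f} LR-={lr_neg:.2f}")
```

In this toy example, LR+ for haplogroup R falls well below 10, so the prediction would not meet the adequacy criterion described above.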

In the case of the predictor, we considered different precision-of-assignment categories: the haplogroup ranked first by the software, and probability cut-off points from 50% to 95% in steps of 5%. For the classifier, we only counted those cases showing an agreed haplogroup, which reduced the number of cases considered in the calculations (N = 62; 52.1% of the original sample).

To obtain quantitative estimates of the predictive quality of the software, we computed the uncertainty coefficient of y, U(y|x), with the subroutine cntab2 of Press et al. [20], who also provide a detailed explanation of the meaning of that coefficient and of how it is computed. Its meaning, and how it helps to estimate predictive quality, can be explained as follows. Suppose that we have a certain sample and that we want to know the result of performing SNP typing on it, i.e., that result is all the information we want. Suppose further that, before performing the SNP typing, we obtain the software prediction for the same sample. Knowing the software result beforehand reduces the information that we can later gain from the SNP typing, and the better the software, the larger that reduction. If the software were perfect and could accurately predict the result of the SNP typing, then knowing the software result would leave nothing to be learned from the typing, which would therefore be unnecessary. On the other hand, if the software were useless and had no predictive value, knowing its result beforehand would remove no information at all, and all the information would have to come from the SNP typing. The uncertainty coefficient U(y|x) quantifies what we have just explained qualitatively. If we take the results from the software systems as the x variable and those from SNP typing as the y variable, then U(y|x) gives the fraction of the SNP typing information that is lost, in the sense of being already supplied, once the software result is known. The two extreme cases described above correspond to (1) U(y|x) = 1.00 (or 100%), which implies that the software gives perfect answers (i.e., all the information that would be provided by subsequent SNP typing has already been provided by the software), and (2) U(y|x) = 0.00 (or 0%), which implies that the software provides no information (i.e., all the information must be obtained by subsequent SNP typing).
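
The computation can be sketched as follows. This is not the cntab2 subroutine itself but a minimal re-implementation of the entropy-based definition U(y|x) = [H(y) − H(y|x)]/H(y); the function names and the layout of the contingency table (rows for software results x, columns for SNP typing results y) are assumptions made for the illustration.

```python
import math


def entropy(probabilities):
    """Shannon entropy (natural log) of a sequence of probabilities."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def uncertainty_coefficient(table):
    """U(y|x) for a contingency table of counts.

    table[i][j] = number of samples with software result x_i (row)
    and SNP typing result y_j (column).
    """
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]          # marginal counts of x
    col_totals = [sum(col) for col in zip(*table)]    # marginal counts of y

    h_x = entropy(r / total for r in row_totals)
    h_y = entropy(c / total for c in col_totals)
    h_xy = entropy(n / total for row in table for n in row)
    h_y_given_x = h_xy - h_x                          # H(y|x) = H(x,y) - H(x)

    if h_y == 0:
        raise ValueError("y takes a single value; U(y|x) is undefined")
    return (h_y - h_y_given_x) / h_y


# Hypothetical tables for the two extreme cases.
perfect = [[5, 0], [0, 5]]   # software always agrees with SNP typing
useless = [[5, 5], [5, 5]]   # software result is independent of SNP typing
print(uncertainty_coefficient(perfect), uncertainty_coefficient(useless))
```

A diagonal table, in which the software always agrees with SNP typing, yields a coefficient of 1, whereas a table in which the software result is independent of the SNP typing result yields a coefficient of 0 (up to floating-point rounding).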

Results

The s, e, LR+, and LR− values per cut-off point for the haplogroup predictor are summarized in Table 1. The classifier software showed the following values: s = 0.45, e = 0.92, LR+ = 5.99, and LR− = 0.59. It is important to highlight that about half of the haplotypes (57 out of a total of 119) could not be assigned by the classifier to an agreed haplogroup (“Unclassified” category in Table 2). The profiles of the Argentinean population sample are presented as electronic supplementary material in Table 3 to allow the validation of these results.

Table 1 Total s, e, LR+, and LR− values at each cut-off point for the haplogroup predictor
Table 2 Haplogroup frequencies by SNP typing, haplogroup predictor and haplogroup classifier, and s, e, LR+, and LR− values

When LR+ is considered per haplogroup, values are higher than 10 for Q1a3a in both the classifier and the predictor and for DE* in the classifier.

The R* haplogroup showed the highest false positive proportion, significantly higher than the false positive proportions of the remaining haplogroups.

Our results for U(y|x) are 0.244 (or 24.4%) for the haplogroup predictor and 0.207 (or 20.7%) for the haplogroup classifier. In other words, only about 25% and 20%, respectively, of the information from the SNP typing is lost if the software results are already known.

Discussion

These results indicate a high probability of error and a bias towards the R* haplogroup, so conclusions based on the haplogroup predictions of these software systems are most likely weakened. For cases in which sex bias in multiethnic populations is estimated by this method, an overestimation of the European component is expected. Haplogroup determination by SNP analysis remains the best approach, considering the low reliability of the prediction software available.

The adequate LR+ for the Q1a3a and DE haplogroups could be explained by a lower diversity within each group. Especially in the Q1a3a case, which is a relatively recent haplogroup, the homogeneity is the result of its young evolutionary age, given that the time lapse in which the haplotypes spread away from the haplogroup founder is rather short [14].

Considering that the samples from which the calibration frequencies are estimated belong to European (or of European descent) populations, the high false positive proportion in the R* haplogroup could be a reflection of this sampling error, as this haplogroup is the most common among those populations.

To get a better idea of what the U(y|x) values mean, let us assume that we are interested in knowing the result of throwing a die (i.e., the information we want to know is whether we will get 1, 2, …, or 6) and take it as the variable y. Let us further assume that we had some way to know beforehand whether the result would be an odd (1, 3, 5) or an even (2, 4, 6) number and take it as the variable x. In this case, U(y|x) = 0.387 (38.7%); that is, knowing beforehand whether the throw of the die will result in an odd or even number produces a loss of 38.7% of the information that the actual throw will yield. Comparing this result with the values obtained above (24.4% and 20.7%), we see that knowing in advance the result of either software system tells us less about the outcome of the SNP typing than knowing whether a throw will be odd or even tells us about the outcome of the throw.
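
The value for the die can be checked by hand from the entropy-based definition of the coefficient, U(y|x) = [H(y) − H(y|x)]/H(y), assuming a fair die (an assumption of this worked example). Before the parity is known there are six equally likely outcomes; once it is known, three remain:

```latex
\begin{align}
H(y) &= -\sum_{j=1}^{6} \tfrac{1}{6}\ln\tfrac{1}{6} = \ln 6 \approx 1.792, \\
H(y \mid x) &= -\sum_{j=1}^{3} \tfrac{1}{3}\ln\tfrac{1}{3} = \ln 3 \approx 1.099, \\
U(y \mid x) &= \frac{H(y) - H(y \mid x)}{H(y)}
             = \frac{\ln 6 - \ln 3}{\ln 6}
             = \frac{\ln 2}{\ln 6} \approx 0.387 .
\end{align}
```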

For the classifier, in which the user chooses the data to train the software, a greater number of Y-STR profiles with associated haplogroups might improve its accuracy. Even so, this software reports haplotypes without an agreed haplogroup, given their recurrence across haplogroups, which is a better approach than suggesting a haplogroup when there is evidence that the haplotype could belong to different haplogroups. Moreover, judging by the LR+ values, this software shows higher accuracy than the predictor.

Why do these software systems show such low accuracy levels? We propose two explanations: (1) there are not enough Y-STR profiles with associated haplogroups to calibrate the software properly and (2) given the mutation rates of the STRs (available at www.yhrd.org) and the time depth of the haplogroup ramifications [21], it is possible to find the same haplotype in samples from different haplogroups (cases of convergence). For instance, the most recent haplogroup, R1*, has a time to the most recent common ancestor (TMRCA) of 18,500 (12,500–25,700) years, and the most ancient clade is estimated at 70,000 years [20], whereas the Y-chromosome STR mutation rates vary from 6.35 (4.19–9.22, 95% CI) × 10⁻³ for DYS 439 to 0.45 (0.12–1.17, 95% CI) × 10⁻³ for DYS 392. The repetition of allele combinations among evolutionarily divergent clades is therefore not uncommon, given the meiosis events accumulated over such a vast time depth.
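
A rough back-of-the-envelope calculation illustrates the scale of this effect. The sketch below multiplies the per-generation mutation rates quoted above by the number of generations spanned by each time depth to obtain the expected number of mutation events per locus along a single lineage. The generation time of 25 years is an assumption introduced only for this illustration and is not a value used elsewhere in this study; the function name is likewise hypothetical.

```python
# ASSUMPTION: 25 years per generation (illustrative only, not from this study).
GENERATION_YEARS = 25.0


def expected_mutations(rate_per_generation, years, generation_years=GENERATION_YEARS):
    """Expected number of mutation events at one locus along one lineage."""
    generations = years / generation_years
    return rate_per_generation * generations


# Point estimates of the mutation rates quoted in the text (per generation).
rates = {"DYS 439": 6.35e-3, "DYS 392": 0.45e-3}

for locus, rate in rates.items():
    for years in (18_500, 70_000):
        n = expected_mutations(rate, years)
        print(f"{locus}, {years} years: about {n:.2f} expected mutations")
```

Under this assumed generation time, even the youngest time depth considered corresponds to several expected mutation events at the fastest locus, which is consistent with the recurrence of identical allele combinations in divergent clades.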

A simple way to confirm this is to check the allele frequencies for each STR locus among the metapopulations of the YHRD database, which show that the same alleles are present even though their frequencies can differ. This database does not provide haplogroup information; however, given the geographic association of the haplogroups evidenced by other authors [22], if the association between STR alleles and haplogroups were strong enough to allow the latter to be predicted from haplotypes, the alleles should show a geographic distribution similar to that of the haplogroups.

An increase in the number of STRs employed to predict the haplogroup would not enhance accuracy, considering the few reference samples available with the standard seven STRs and an associated haplogroup; these reference samples become even scarcer as the number of STRs required increases. Also, there is no uniformity in the choice of additional STRs: different authors chose different markers, which drastically reduces the number of usable reference haplotypes. For example, Zalloua et al. [23] analyzed the seven standard STRs plus DYS 388, DYS 437, DYS 438, and DYS 439, while Di Gaetano et al. [24] studied the seven standard STRs plus DYS 385 A/B in all samples (other STRs were included in only a fraction of their sample); these data sets are comparable with each other only on the standard STRs, not on the rest.

At present, the available haplogroup prediction software does not show adequate accuracy. Thus, typing a set of SNPs that precisely define a phylogenetic branch is the only reliable method to establish to which haplogroup a given sample belongs.