Population samples
Blood or buccal samples were collected from a total of 3,674 individuals according to procedures approved by the Human Subjects Committees at the University of Arizona, Rambam Medical Center and the National Laboratory for the Genetics of Israeli Populations. Volunteers reported the birthplace of their father, grandfather, and in many cases, great-grandfather. Jewish volunteers were also asked to report their affiliation to one of the three Jewish castes: Cohen, Levite, or Israelite. Those who did not know their caste status were classified as “unknown”. Table S1 lists the Jewish population samples genotyped, which comprised unrelated Jewish males representing the major Jewish communities across the Jewish Diaspora (n = 1,575), including 215 Cohanim, 738 Israelites, 154 Levites, and 468 of unknown caste status. Table S2 lists our surveyed samples of unrelated non-Jewish men from 30 populations representing Europe, India, the Near East, North Africa, and Central Asia (n = 2,099). We note that many additional markers were genotyped in samples that were previously reported, and that the Cohanim samples reported here do not overlap with the original collections of Skorecki et al. (1997) and Thomas et al. (1998).
NRY marker analysis
We chose the following set of 75 binary markers to be typed hierarchically in this set of 3,674 chromosomes: Hg BT: SRY10831.1; Hg B: M60 or (its equivalent) M181; Hg D: M174, P99, P47; Hg E: M96, P2, P1 or M2, M35, M78, M81, M123; Hg C: M216; Hg FT: P14 or M89; Hg G: M201, M285 or M342, P287, P15, M287, M377; Hg H: M69; Hg IJ: P123; Hg I: P19 or M170 or M258, M253, P37.2, M223; Hg J: 12f2a or M304, M267, M62, M365, M390, P58, M367, M368, M369, M172, M410, M47 or M322, M67, M68, M318, M319, M12; Hg KT: M9; Hg L: M20; Hg M: M256, P35 or M106; Hg NO: M214; Hg N: M231; Hg O: M175, M119, P31, M122; Hg PQR: M45; Hg Q: M242, M378; Hg R: M207, M173, SRY10831.2, M17 or M198, M343, P25, M269, M124; Hg S: M230; Hg T: M70, M184. This set of binary markers represents a total of 64 different bifurcations on the NRY phylogeny, as 10 are phylogenetically equivalent (Fig. 1). The genotypes for these sites were determined by several techniques, including allele-specific PCR, TaqMan, Kaspar, and direct sequencing. The technical information for detecting these binary polymorphisms has been previously reported by Karafet et al. (2008).
For the microsatellite analysis, 12 short tandem repeats (Y-STRs: DYS19, DYS385a, DYS385b, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS426, and DYS439) were genotyped in two multiplex reactions following the protocol of Redd et al. (2002). To allow more accurate coalescence estimates and better comparability with published databases, the Cohanim samples were genotyped for an additional set of ten STRs: DYS437, DYS438, DYS447, DYS448, DYS449, DYS454, DYS455, DYS458, DYS459a and DYS459b. For the duplicated microsatellites DYS385a,b and DYS459a,b, the short and long scores are reported according to allele size (i.e., without confirming that identical scores observed in two different samples represent the same locus). PCR products were electrophoresed on a 3730xl Genetic Analyzer (Applied Biosystems), and fragment lengths were converted to repeat numbers by the use of allelic ladders. Table S3 lists all 215 Cohanim surveyed here and their allele scores for 22 Y-STRs.
Terminology
We follow the terminological conventions recommended by the Y Chromosome Consortium (Karafet et al. 2008) for naming NRY chromosomes. Capital letters A-T identify the 20 major NRY haplogroups and are followed by the names of the binary markers used to assign samples to their positions on the NRY phylogeny. When no further downstream markers in the Karafet et al. (2008) NRY phylogeny were typed, we considered the most derived marker to define a haplogroup (Fig. 1). Haplogroups not defined on the basis of a final derived character state represent interior nodes of the tree and are potentially paraphyletic. In these cases, all binary markers that were excluded by our genotyping strategy are noted within parentheses after an initial “x” symbol. We note that all Cohanim J-P58* chromosomes were found to have the ancestral state at the three downstream markers shown in Fig. 1 (i.e., M367, M368, and M369). Therefore, these chromosomes belong to paragroup J-P58(xM367, M368, M369) (or J1e*); however, we refer to this lineage as J-P58* for simplicity. The term haplotype is used to describe any combination of STRs for a given sample.
Network analysis
Compilation and organization of the data were performed using standard Excel files. Median-joining networks (Bandelt et al. 1995) were constructed with the software NETWORK 4.5.1.0 from the following 12 STRs: DYS19, DYS385a, DYS385b, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS426, and DYS439. We weighted the STR loci according to their observed variation in our collection of J-P58* Y chromosomes, giving less weight to STRs with higher variance in repeat numbers. To allow maximal resolution, we included the duplicated microsatellite DYS385a,b in our network analyses (Niederstatter et al. 2005). While it is possible that identical scores in different samples represent different loci, we assume that this potential error is less likely in the case of closely related Cohanim Y chromosomes. We applied the reduced-median algorithm followed by the median-joining algorithm as described at the Fluxus Engineering Web site.
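The weighting step can be illustrated with a minimal Python sketch. The text states only that loci with higher variance in repeat number received less weight; the linear mapping onto integer weights from 1 to 10, the function name locus_weights, and the toy repeat scores below are all assumptions made for illustration and are not the weights used in this study.

```python
from statistics import pvariance


def locus_weights(haplotypes, max_weight=10):
    """Assign each STR locus an integer weight that falls as its variance rises.

    haplotypes: list of dicts mapping locus name -> repeat number.
    The linear scaling used here is a hypothetical choice, not the
    published scheme.
    """
    loci = list(haplotypes[0].keys())
    variances = {
        locus: pvariance([h[locus] for h in haplotypes]) for locus in loci
    }
    v_max = max(variances.values())
    weights = {}
    for locus, v in variances.items():
        if v_max == 0:
            # No variation anywhere: every locus gets the top weight.
            weights[locus] = max_weight
        else:
            # Invariant loci get max_weight; the most variable locus gets 1.
            weights[locus] = max(1, round(max_weight * (1 - v / v_max)))
    return weights


# Toy data (hypothetical repeat scores, three loci, four chromosomes).
toy = [
    {"DYS19": 14, "DYS390": 23, "DYS385b": 15},
    {"DYS19": 14, "DYS390": 23, "DYS385b": 16},
    {"DYS19": 14, "DYS390": 24, "DYS385b": 17},
    {"DYS19": 14, "DYS390": 23, "DYS385b": 15},
]
weights = locus_weights(toy)
```

In this toy set the invariant locus receives the highest weight and the most variable locus the lowest, which is the qualitative behavior described above.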
Coalescence analysis
The coalescence times of closely related clusters of haplotypes were estimated using several approaches. First, the ages of the Cohanim lineages were calculated as previously reported by Zhivotovsky et al. (2004). The age of STR variation was computed by averaging across loci the single-locus variances in repeat scores (i.e., with respect to the median value at each locus), and then dividing by an average mutation rate of 0.00069 per locus per 25 years. Standard errors were computed across loci as described by Zhivotovsky et al. (2004). The duplicated loci, DYS385ab and DYS459ab, were omitted from the analysis. In addition, DYS449 was excluded because it was previously shown to be characterized by multi-repeat variation that substantially differs from all other genotyped STRs (Kayser et al. 2004). We also performed calculations based on subsets of the 17 Y-STRs used in the above analysis. To control for the use of different STRs in our study and that of Zhivotovsky et al. (2004), we calculated divergence times using the same nine Y-STRs as in Zhivotovsky et al. (2004): DYS19, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, and DYS439. In addition, we used the same five Y-STRs as in the original CMH paper by Thomas et al. (1998): DYS19, DYS390, DYS391, DYS392, and DYS393.
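The variance-based age calculation can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the code used for the published estimates: the function name asd_age_years and the toy haplotypes are hypothetical, and the standard-error calculation of Zhivotovsky et al. (2004) is omitted.

```python
from statistics import median

MUTATION_RATE = 0.00069      # per locus per 25 years, as stated in the text
YEARS_PER_GENERATION = 25


def asd_age_years(haplotypes, loci):
    """Estimate the age of STR variation in years.

    For each locus, take the mean squared difference between each repeat
    score and the median score at that locus; average these single-locus
    variances across loci; divide by the mutation rate.
    """
    per_locus = []
    for locus in loci:
        scores = [h[locus] for h in haplotypes]
        m = median(scores)  # may be a half-integer if the sample size is even
        per_locus.append(sum((s - m) ** 2 for s in scores) / len(scores))
    mean_variance = sum(per_locus) / len(per_locus)
    return mean_variance / MUTATION_RATE * YEARS_PER_GENERATION


# Toy example (hypothetical scores): ten chromosomes, two loci,
# one chromosome carrying a single-step mutation at DYS19.
toy = [{"DYS19": 14, "DYS390": 23} for _ in range(9)]
toy.append({"DYS19": 15, "DYS390": 23})
age = asd_age_years(toy, ["DYS19", "DYS390"])
```

Here the single-locus variances are 0.1 and 0, so the mean variance is 0.05 and the estimated age is 0.05 / 0.00069 × 25, roughly 1,800 years.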
We also estimated coalescence times for key Cohanim lineages by employing the Bayesian Analysis of Trees With Internal Node Generation (BATWING) program of Wilson et al. (2003). To do this within a population genetics framework, we constructed an Ashkenazi population by including our sample of Ashkenazi Israelites and a sub-sample of our Ashkenazi Cohanim to equal 5% of the total population. We used both Y-STR and SNP data to constrain the coalescence of lineages and estimate the TMRCA of individual haplogroups. To estimate the age of a particular Cohanim haplogroup, we excluded non-Cohanim samples carrying this haplogroup (i.e., so that only Cohanim samples carried the particular haplogroup under study). For example, in the case of J-P58*, we included all 317 Ashkenazi Israelite samples that did not carry J-P58*, and a random sample of 17 Cohanim (i.e., which could carry any Y chromosome lineage including J-P58*). We repeated this analysis four times with a different random sample of Cohanim. A phylogenetic tree of binary markers (UEPs) was considered (under option 2) using SNPs that were variable in the Ashkenazi population: M96, M123, P14, M201, M69, P123, M304, M67, P58, M172, M410, M12, M9, M45, M207, M173, M269, M17, and M70. The priors used were: gamma(1.46, 2124) for STR mutation rate, corresponding to a mean of 0.00069 and standard error of 0.00057 as in Zhivotovsky et al. (2004); gamma(1.2, 0.00016) for N, corresponding to a mean ancestral population size of 7,500 and a standard deviation of 6,847; gamma(1.5, 75) for alpha, corresponding to a mean exponential growth rate of 0.02 and a standard deviation of 0.016; and gamma(1.2, 4) for beta, the time of start of expansion (in units of N generations). These priors returned posteriors of the mutation rate that were similar to the prior for mutation rate. A total of 5 × 10⁴ samples of the program’s output were taken after discarding the first 2 × 10⁴ samples as “burn-in”. Convergence was confirmed by finding that results of longer runs (i.e., 10⁵ MCMC cycles) were similar to those of the shorter runs.
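The means and standard deviations quoted for the gamma priors can be checked with a short Python sketch. It assumes the priors are written as gamma(shape, rate), so that the mean is shape/rate and the standard deviation is sqrt(shape)/rate; this parameterization is inferred from the values reported above, and the names gamma_moments and PRIORS are hypothetical.

```python
import math


def gamma_moments(shape, rate):
    """Return (mean, standard deviation) of a gamma(shape, rate) distribution."""
    return shape / rate, math.sqrt(shape) / rate


# Prior parameters as listed in the text; the shape/rate reading is an assumption.
PRIORS = {
    "mutation_rate": (1.46, 2124),
    "N": (1.2, 0.00016),
    "alpha": (1.5, 75),
    "beta": (1.2, 4),
}

moments = {name: gamma_moments(*params) for name, params in PRIORS.items()}
```

Under this reading, the mutation-rate prior has a mean of about 0.00069 and a standard deviation of about 0.00057, the prior on N has a mean of 7,500 and a standard deviation of about 6,847, and the prior on alpha has a mean of 0.02 and a standard deviation of about 0.016, in agreement with the figures given above.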
Simulations of the decay of paternal lineages
Simulations of the decay of paternal lineages were performed in Matlab using code written for this purpose. The standard constant population size Wright-Fisher model without mutation was used, since only the loss of lineages due to drift was of interest. Each simulation began by assigning each of the N individuals a number indicating haplogroup membership. In the first set of simulations, each individual was given a unique number, representing N distinct founding lineages, while in the second, each individual was randomly assigned to a haplogroup (with the number of haplogroups varying from 2 to 10) (see Supplementary Material). For all initial conditions, 10,000 simulations were performed and the results were averaged. Simulations were run for 160 generations, corresponding to between 4,000 and 5,000 years. We note that a bottleneck or founder event would reduce the number of haplogroups faster on average than we see in a population with constant size, whereas the decay of lineages would be slower on average in an expanding population.
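The simulation design can be sketched in Python as follows. This is not the Matlab code used in the study; it is a minimal re-implementation of the constant-size Wright-Fisher model without mutation as described above, and the function names, the seed, and the small parameter values in the usage lines are assumptions chosen so that the example runs quickly.

```python
import random


def surviving_lineages(n, generations, n_haplogroups=None, rng=random):
    """Track the number of distinct lineages over time in one simulated population.

    If n_haplogroups is None, each of the n founders carries a unique lineage;
    otherwise each founder is randomly assigned to one of n_haplogroups.
    """
    if n_haplogroups is None:
        population = list(range(n))
    else:
        population = [rng.randrange(n_haplogroups) for _ in range(n)]
    counts = []
    for _ in range(generations):
        # Wright-Fisher step: each child picks its father uniformly at random.
        population = rng.choices(population, k=n)
        counts.append(len(set(population)))
    return counts


def mean_decay(n, generations, replicates, n_haplogroups=None, seed=0):
    """Average the per-generation lineage counts over independent replicates."""
    rng = random.Random(seed)
    totals = [0] * generations
    for _ in range(replicates):
        run = surviving_lineages(n, generations, n_haplogroups, rng)
        totals = [t + c for t, c in zip(totals, run)]
    return [t / replicates for t in totals]


# Small illustrative run (the study used 10,000 replicates per condition).
curve = mean_decay(n=50, generations=160, replicates=20, seed=1)
```

Because there is no mutation, lineages can only be lost, so the averaged curve never increases from one generation to the next.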
Comparative data
We conducted an extended literature search for the Cohanim haplotypes identified here by comparing allelic scores at as many Y-STRs as possible (i.e., those typed both in our study and in the published literature). Allele scores at the 12 STRs that we use to define the extended CMH (DYS19, DYS385a, DYS385b, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS426, and DYS439) are 14-13-15-16-13-30-23-10-11-12-11-12, respectively (Table S4). The addition of the four STRs DYS437, DYS438, DYS459a, and DYS455 showed no further variation, while the addition of DYS459b and DYS454 revealed two additional haplotypes that each comprised two samples one mutation step away from the extended CMH. The four remaining STRs, DYS447, DYS448, DYS449 and DYS458, contained most of the observed variation, with 26, 21, 26, and 17.2, respectively, being the most frequent scores observed for these sites. A similar pattern was found for two relatively frequent haplotypes that are closely related to the extended CMH at 12 Y-STRs (see “Results and Discussion”). A total of 14 out of the possible 17 STRs were compared with the YHRD database: DYS19, DYS385a, DYS385b, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448 and DYS458. The search yielded zero out of 10,243 matching haplotypes in 66 populations for the extended CMH and its two closely related haplotypes. The same STRs yielded no matches when compared with Cadenas et al. (2008). No matches were found when the following 12 STRs were used to compare with the dataset reported by Arredi et al. (2004): DYS19, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS426, DYS437, DYS438 and DYS439. Similarly, no matches were found when the following 14 STRs were used to compare with the Robino et al. (2008) dataset: DYS19, DYS385, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448, DYS458. Two matches were found when the following nine STRs were screened in Cinnioglu et al. (2004): DYS19, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393 and DYS439; one was J1-M369 and the other was J1-M267(xM369). Three Lebanese samples from Zalloua et al. (2008), defined as J-M304(xM172), matched the extended CMH or its two closely related haplotypes when the following 11 STRs were compared: DYS19, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439. In a recent survey of Hg J-M267, only five of the 282 J-M267 samples studied fell within J-P58; all of these were M367 positive and did not fit within the J-P58* lineage found among our Cohanim sample (Tofanelli et al. 2009). We compared the following 15 STRs overlapping between our paper and Tofanelli et al. (2009) and found no matches: DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS385a, DYS385b, DYS388, DYS437, DYS438, DYS439, DYS448 and DYS458. Available datasets that did not allow comparison of more than seven STRs were not included in our comparative database because they did not allow a sufficiently high level of resolution.
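The database comparison amounts to matching haplotypes at whichever STRs two datasets share. The Python sketch below illustrates that logic using the extended CMH scores listed above; the function name compare_haplotypes, the return convention, and the example samples are hypothetical, and the sketch covers only the 12 loci that define the extended CMH.

```python
EXTENDED_CMH_LOCI = [
    "DYS19", "DYS385a", "DYS385b", "DYS388", "DYS389I", "DYS389II",
    "DYS390", "DYS391", "DYS392", "DYS393", "DYS426", "DYS439",
]
EXTENDED_CMH_SCORES = [14, 13, 15, 16, 13, 30, 23, 10, 11, 12, 11, 12]
EXTENDED_CMH = dict(zip(EXTENDED_CMH_LOCI, EXTENDED_CMH_SCORES))

# Datasets sharing seven or fewer STRs were not compared (see text).
MIN_SHARED_LOCI = 8


def compare_haplotypes(reference, sample, min_shared=MIN_SHARED_LOCI):
    """Compare two haplotypes at the loci typed in both.

    Returns True for an exact match at every shared locus, False for any
    mismatch, and None when too few loci overlap for a meaningful comparison.
    """
    shared = sorted(set(reference) & set(sample))
    if len(shared) < min_shared:
        return None
    return all(reference[locus] == sample[locus] for locus in shared)


# Hypothetical published sample typed at the nine STRs of Cinnioglu et al. (2004).
nine_loci = ["DYS19", "DYS388", "DYS389I", "DYS389II", "DYS390",
             "DYS391", "DYS392", "DYS393", "DYS439"]
matching_sample = {locus: EXTENDED_CMH[locus] for locus in nine_loci}
one_step_sample = dict(matching_sample, DYS390=24)
low_resolution_sample = {locus: EXTENDED_CMH[locus] for locus in nine_loci[:5]}

results = (
    compare_haplotypes(EXTENDED_CMH, matching_sample),
    compare_haplotypes(EXTENDED_CMH, one_step_sample),
    compare_haplotypes(EXTENDED_CMH, low_resolution_sample),
)
```

Returning None rather than False for low-overlap comparisons keeps "no match" distinct from "not comparable", which mirrors the exclusion of datasets with seven or fewer shared STRs.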