Introduction

Over the last decade [1], the Y chromosome has become firmly established as a powerful system in forensic analysis, showing particular utility in male–female DNA mixtures. Y-chromosomal short tandem repeats (Y-STRs) have proven to be informative DNA markers, featuring in a large database (the Y Reference Haplotype Database; http://www.yhrd.org [2]) that allows rapid interrogation of population-specific frequencies of haplotypes and provides some information about the likely geographical origins of individuals.

The first forensically useful Y-STR and one that has since become incorporated in every commercial forensic Y-chromosomal haplotyping kit and in the minimal haplotype listed in the Y Reference Haplotype Database (YHRD) is DYS19 [3]. This tetranucleotide repeat marker was pioneered as an exclusion tool in a rape case [4] and as a marker in deficiency paternity testing [5], and was also the first Y-STR to be used in dating of anthropologically important events [6].

Despite its ubiquity in the fields of Y-chromosomal forensic analysis and evolutionary studies, DYS19 is surprisingly poorly understood at the molecular level compared to other Y-STRs. It has a number of paralogues elsewhere on the Y chromosome [7] and is deleted (‘null’ alleles [8]), duplicated [9, 10], or even triplicated [8, 9] on some chromosomes. The YHRD (release 22) contains 104 duplications (∼0.2%), one triplication, and four null alleles [two Bhutan, one Hrodna (Belarus), one Tver (Russia)] among its ∼53,000 haplotypes.

Deletions and duplications of Y-STRs are of forensic relevance because they can be misinterpreted as allele drop-outs and DNA contamination, respectively, which may affect the evidential value of a DNA profile. It is therefore important to understand the molecular mechanisms and rates of the processes underlying these deletions and duplications and how they are distributed in human populations.

Variability involving DYS19 can be seen against a background of the generally high degree of structural variability of the Y chromosome. Cytogenetic and molecular studies have demonstrated that many large-scale structural variants exist, including deletions [11–13], duplications [11, 12, 14, 15], and inversions [12]. Underlying this structural polymorphism is a high rate of mutation through non-allelic homologous recombination (NAHR) between very similar paralogous sequences, which are particularly frequent on the Y [16]. These paralogues also act as substrates for the frequent non-reciprocal transfer of sequence information via gene conversion events [17, 18].

The availability of the near-complete sequence of the euchromatic region of a single Y chromosome [16] offers an opportunity to analyse the genomic context of DYS19 in more detail, to investigate the homologies involving sequences around the marker and to address candidate mechanisms for duplications and deletions. The availability of a robust and well-resolved phylogeny based on slow-mutating binary markers such as single-nucleotide polymorphisms (SNPs) [19, 20] allows us to interpret these molecular events in a phylogeographic context.

Here, we analyse Y chromosomes carrying null and duplicated alleles of DYS19 and use deletion mapping to show that its position on the chromosome itself is polymorphic. We use haplotyping with binary markers and multiple Y-STRs, together with deletion mapping and bioinformatic prediction, to address the molecular basis of the underlying rearrangements and to identify examples that descend from a common ancestor. This analysis shows that DYS19 duplications apparently do not arise through repeat-mediated recombination events and identifies two founder lineages carrying DYS19 duplications that have reached high frequencies in particular haplogroups and populations.

Materials and methods

DNA samples

DNA samples from a total of 55 unrelated men were from collections of the authors and were obtained with appropriate informed consent. Some samples form part of sets described previously [21–24], and four were from the Centre d’Etude du Polymorphisme Humain-Human Genome Diversity Project (CEPH-HGDP) panel [25]. With the exception of these four lymphoblastoid DNAs, all samples were derived from either blood or buccal scrapes.

Deletion mapping

Y-specific sequence-tagged sites (STSs; primer sequences available from the literature [16, 26]) were amplified by polymerase chain reaction (PCR) and analysed by agarose gel electrophoresis. An STS was considered to be deleted when absent in the presence of a larger Y-specific control amplicon coamplified in the same PCR. The PCR system was as described [13], and cycling conditions were: 94°C 30 s, 60°C 30 s and 70°C 30 s for 33 cycles.
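The scoring rule for an STS can be written as a small decision function. This is only an illustrative sketch: the function name, the boolean inputs and the "uninformative" label for a failed control are assumptions introduced here, not part of the published protocol.

```python
def score_sts(sts_band_present: bool, control_band_present: bool) -> str:
    """Score one STS from a gel lane with a co-amplified control amplicon.

    Hypothetical helper: an STS is only called deleted when the larger
    Y-specific control amplicon amplified in the same reaction.
    """
    if not control_band_present:
        # Without the control band the reaction itself may have failed,
        # so absence of the STS band carries no information.
        return "uninformative"
    return "present" if sts_band_present else "deleted"


if __name__ == "__main__":
    print(score_sts(sts_band_present=False, control_band_present=True))
```

Treating a missing control band as uninformative, rather than as a deletion, is what distinguishes a genuine deletion call from a failed amplification.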

Y chromosome haplotyping and identification of DYS19 deletions and duplications

Twenty-six Y-specific STRs (DYS19, DYS385a/b, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS425, DYS426, DYS434, DYS435, DYS436, DYS437, DYS438, DYS439, DYS447, DYS448, DYS460, DYS461, DYS462, YCAII-a/b and Y-GATA-H4.1) were typed in a 20-plex [27] and an additional 14-plex [22]. PCR products were resolved on an ABI3100 capillary electrophoresis apparatus (Applied Biosystems) and analysed using GeneMapper software (Applied Biosystems). Allele nomenclature was as described [22] and in accordance with International Society of Forensic Genetics recommendations [28]. To allow us to combine our data with published datasets, we consider here only 15 of the 26 loci (DYS19, DYS388, DYS389I, DYS389II-I, DYS390, DYS391, DYS392, DYS393, DYS426, DYS437, DYS439, DYS434, DYS435, DYS436 and DYS438).
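As an illustration of how a typed haplotype might be reduced to the 15 loci used for comparison, the sketch below assumes the usual convention that DYS389II-I is the DYS389II repeat count minus the DYS389I repeat count. The function name and the dictionary representation are hypothetical, and duplicated DYS19 alleles would need a richer representation than a single integer.

```python
# The 15 loci retained for comparison with published datasets.
COMBINED_LOCI = [
    "DYS19", "DYS388", "DYS389I", "DYS389II-I", "DYS390", "DYS391",
    "DYS392", "DYS393", "DYS426", "DYS437", "DYS439", "DYS434",
    "DYS435", "DYS436", "DYS438",
]


def to_combined_haplotype(raw):
    """Reduce a typed haplotype (locus -> repeat count) to the 15 loci.

    Assumption: DYS389II-I = DYS389II - DYS389I, because the DYS389II
    amplicon includes the DYS389I repeat block.
    """
    derived = dict(raw)
    derived["DYS389II-I"] = raw["DYS389II"] - raw["DYS389I"]
    return {locus: derived[locus] for locus in COMBINED_LOCI}
```

Loci typed but not shared with the published datasets are simply dropped by the final dictionary comprehension.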

DYS19 deletions and duplications were initially ascertained using a number of published multiplexes [22, 27, 29] and commercial kits, namely the AmpFlSTR® Yfiler® PCR Amplification Kit (Applied Biosystems), PowerPlex® Y System (Promega) and Mentype® Argus Y-MH (Biotype), and were confirmed in repeated amplifications. Deletions (as opposed to small-scale primer site mutations) were verified by use of the two non-overlapping primer pairs 3F/3R and 2F/2R [7]. DYS19 was considered to be duplicated when its two peaks in an electropherogram were of approximately equal height and area; see “Results” for details of peak area and height ratios.

Binary markers were typed in a hierarchical fashion, using either the SNaPshot minisequencing protocol (Applied Biosystems) on an ABI3100 capillary electrophoresis apparatus (Applied Biosystems) or primer extension on the Sequenom mass spectrometry system (Sequenom, San Diego, CA, USA). Amplification and extension primers were based on ones published previously [30, 31], with additional primers based on published sequences [19].

Identifying and testing candidate rearrangement-sponsoring repeats

Perfect direct repeats as candidates for sponsoring deletions and duplications were identified using the REPuter program [32] at http://bibiserv.techfak.uni-bielefeld.de/reputer/.
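To make the idea of a perfect direct repeat concrete, the sketch below finds maximal exact same-strand repeats with a simple seed-and-extend approach. It is not REPuter, which uses far more efficient index structures; the function name and the tuple output are assumptions, and the approach is suitable only for short toy sequences.

```python
from collections import defaultdict


def perfect_direct_repeats(seq, min_len):
    """Return sorted (start1, start2, length) tuples for maximal exact
    direct repeats of at least min_len bases (0-based starts).

    Naive illustration: index every min_len-mer, then extend each pair
    of seed positions to the right. The two copies may overlap.
    """
    seq = seq.upper()
    seeds = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        seeds[seq[i:i + min_len]].append(i)

    repeats = set()
    for positions in seeds.values():
        for idx, i in enumerate(positions):
            for j in positions[idx + 1:]:
                # Skip pairs that can be extended to the left: they are
                # reported from the seed one position upstream.
                if i > 0 and seq[i - 1] == seq[j - 1]:
                    continue
                length = min_len
                while j + length < len(seq) and seq[i + length] == seq[j + length]:
                    length += 1
                repeats.add((i, j, length))
    return sorted(repeats)
```

For a ∼300-kb element, a dedicated tool is the practical choice; this version is meant only to show what is being searched for.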

Having identified a candidate pair of partial L1 long interspersed nuclear element (LINE) sequences (see “Results”, final section), we sought junction products by PCR in deletion and duplication chromosomes. Deletion sponsored by the repeats would allow the generation of a junction PCR product with the primer pair pL1f (5′-aac tga aag aga gag gaa ctt tgg-3′) and dL1r (5′-cta gtg tcg gaa tta ttt caa tg-3′) (Fig. 1d), while duplication would allow us to detect a junction product with the primer pair pL1r (5′-tga act ccc att cac aat tgc-3′) and dL1f (5′-ggt act atc aat aac act ggc-3′).

Fig. 1

Complex genomic environment of DYS19 and putative mechanisms for inversions, deletions and duplications. a Reference sequence organisation around DYS19, showing position of the Y-STR on an idiogram of the Y chromosome (with genome position of start of the marker given according to build 36.1 of the reference assembly), and below, a schematic view of the region around DYS19 and its paralogues (pDYS19) on Yq. Arrows indicate inverted repeats (IR3 elements) or large repeat units in the region proximal to AZFc [42]. STSs and other markers used in mapping are shown below. b Structure of Yp following IR3-mediated inversion, with breakpoints indicated by dotted lines and mapping of presence (+) or absence (−) of markers in XX male WA48. The horizontal dotted line indicates the region of uncertainty of the breakpoint, resulting from the wide marker spacing. c Alignment of the IR3 inverted repeats; in the reference sequence organisation, DYS19 lies within the proximal IR3, corresponding to a 3-kb gap in the distal IR3. d Putative mechanism of DYS19 deletion or duplication mediated by unequal exchange (curved grey arrow) between flanking direct partial L1 repeats within the proximal IR3. Small arrows indicate PCR primers used to seek junction PCR products in deletion and duplication chromosomes

Y-STR network construction and dating

Weighted median-joining networks [33] were constructed from 14-locus (DYS388, DYS389I, DYS389II-I, DYS390, DYS391, DYS392, DYS393, DYS426, DYS437, DYS439, DYS434, DYS435, DYS436 and DYS438) or 6-locus (DYS389I, DYS389II-I, DYS390, DYS391, DYS392 and DYS393) Y-STR haplotypes using Network 4.0 (http://www.fluxus-engineering.com/sharenet.htm). Weighting [34] was used to remove some reticulations (closed structures) within the network by taking into account the range of different mutation rates of the markers, reflected indirectly by their allele length variances among all chromosomes included in each network. Note that our conclusions are not affected by the choice of weighting scheme.
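One way to picture variance-based weighting is to give loci with low allele length variance (presumed slow-mutating) higher weights than highly variable loci. The sketch below is an assumption-laden illustration: the inverse-variance scaling, the 1 to 10 weight range and the function name are all hypothetical and do not reproduce the exact scheme of [34].

```python
from statistics import pvariance


def variance_based_weights(haplotypes, max_weight=10):
    """Assign each locus an integer weight inversely related to its
    allele length variance across the supplied haplotypes.

    haplotypes: list of dicts mapping locus name -> repeat count.
    Hypothetical scheme: the least variable polymorphic locus gets
    max_weight; invariant loci also get max_weight; minimum weight is 1.
    """
    loci = list(haplotypes[0].keys())
    variances = {
        locus: pvariance([h[locus] for h in haplotypes]) for locus in loci
    }
    positive = [v for v in variances.values() if v > 0]
    if not positive:
        return {locus: max_weight for locus in loci}
    smallest = min(positive)
    weights = {}
    for locus, var in variances.items():
        if var == 0:
            weights[locus] = max_weight
        else:
            weights[locus] = max(1, round(max_weight * smallest / var))
    return weights
```

Higher weights on stable loci make mutational steps at those loci more costly in the network, which is how weighting can resolve some reticulations.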

Time-to-most-recent-common-ancestor (TMRCA) of the clusters within the network of Fig. 2 was estimated using the rho statistic within Network, taking the per-STR per-generation mutation rate to be 2.0 × 10⁻³ [35] and the generation time 31 years [36].
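The conversion from rho to years can be sketched as follows. This is a simplified illustration rather than Network's implementation: it counts repeat-unit differences directly between each haplotype and the root instead of along network links, and it omits the standard error. The function name and tuple inputs are assumptions.

```python
def rho_tmrca_years(root, haplotypes, mu_per_locus=2.0e-3, generation_years=31.0):
    """Estimate TMRCA from the rho statistic.

    root: tuple of repeat counts for the assumed ancestral haplotype.
    haplotypes: one tuple per sampled chromosome (repeat to reflect frequency).
    rho is the mean number of single-repeat steps from the root; dividing
    by the per-haplotype mutation rate (rate per locus x number of loci)
    gives generations, and multiplying by generation time gives years.
    """
    n_loci = len(root)
    steps = [sum(abs(a - b) for a, b in zip(h, root)) for h in haplotypes]
    rho = sum(steps) / len(haplotypes)
    generations = rho / (mu_per_locus * n_loci)
    return rho, generations * generation_years
```

With 14 loci and these parameter values, one mutational step per chromosome on average (rho = 1) corresponds to roughly 1,100 years.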

Fig. 2

Haplogroups and Y-STR haplotypes of DYS19 deletion and duplication chromosomes. a Binary marker phylogeny of the Y chromosome, with haplogroups (hg) [19] containing DYS19 deletion and duplication chromosomes indicated by coloured circles. b Weighted median-joining network [33] containing the 14-locus (DYS388, DYS389I, DYS389II-I, DYS390, DYS391, DYS392, DYS393, DYS426, DYS437, DYS439, DYS434, DYS435, DYS436 and DYS438) Y-STR haplotypes of 3 deletion and 51 duplication chromosomes. Circles represent haplotypes, with area proportional to frequency and coloured according to haplogroup as in a. Nodes used as roots in TMRCA estimations are indicated by asterisks

Results

Inversion encompassing DYS19

According to the reference Y chromosome sequence [16], DYS19 lies on Yp, within the proximal member of a pair of ∼300-kb inverted repeat sequences (IR3; Fig. 1a), separated by ∼3.6 Mb. However, these IR3 elements are known to sponsor recurrent paracentric inversions [12, 16], which could in principle transpose DYS19 into the distal IR3 region.

During analysis of the partial Y chromosome carried by a translocation XX man, we found evidence to demonstrate that this does indeed occur. Male WA48 was identified in a screen for men lacking the Amelogenin Y sex-test locus on Yp. As well as AMELY, he lacks a Y-chromosomal long arm, shown by the absence of 13 Y-STRs mapping to Yq. He carries distal Yp markers (translocated onto the short arm of one of his X chromosomes), including the sex-determining SRY gene and the distal Y-STRs DYS393 and DYS456. Based on the reference sequence organisation, we would therefore expect him to lack the Yp STRs DYS458 and DYS19 (Fig. 1a), since they are proximal to the absent AMELY. However, both are actually present (Fig. 1b). While we cannot rule out a more complex rearrangement, the most parsimonious explanation for this discrepant pattern of markers is a paracentric inversion, transferring DYS19 to the distal IR3 element, followed by translocation of a terminal segment of Y-chromosomal material (between ∼6.5 and ∼8.6 Mb in size) onto the X chromosome.

A maximum-likelihood estimate of the per-generation rate of IR3-sponsored inversion is 9.2 × 10⁻⁴ [12], with chromosomes carrying the two different orientations scattered among different branches of the Y phylogeny and among different populations. This suggests that, in any given Y haplogroup or population, both the position and orientation of DYS19 on Yp are in fact uncertain.

DYS19 deletions

To investigate deletion of DYS19, we identified three Y chromosomes each carrying a DYS19 null allele on an otherwise complete Y-STR haplotype (Table S1 of the Electronic Supplementary Material). These null alleles were ascertained using the DYS19 primers employed in the 20-plex PCR [27]. To exclude the possibility of small-scale mutations affecting only the primer binding sites, we confirmed all three deletions using an independent non-overlapping primer pair [7].

Analysis of deletions is, in principle, more straightforward than that of duplications, since the absence of a portion of the chromosome is easier to score than its presence in two copies, which normally requires quantitative analysis [e.g. quantitative PCR (Q-PCR)] or direct physical mapping (e.g. high-resolution fluorescent in situ hybridization). However, in this case, the fact that DYS19 lies within one of a pair of closely related (99.75% similar [16]) IR3 elements makes deletion mapping around this STR particularly difficult. We analysed the three deletion chromosomes using the four unique STSs (sY1241–1244) that mark the boundaries of the two IR3 elements (Fig. 1a); in each case, all four boundaries are present, showing that the deletion breakpoints must be contained within an IR3 element (either proximal or distal). The deletions therefore cover an extent of less than ∼300 kb, but further definition of their extents through STS analysis is not straightforward because the presence of one intact IR3 masks any internal deletion within the other. Not only are the two IR3 elements almost identical in sequence, but their roles in mediating inversions suggest that designing internal proximal- or distal-specific STSs would not be profitable.

DYS19 duplications

To investigate duplication of DYS19, we identified 51 chromosomes each showing two alleles of different lengths but approximately equal peak heights and areas in capillary electrophoresis (Table S1 of the Electronic Supplementary Material). Note that duplicated alleles are under-ascertained, since cases where both copies carry identical repeat copy numbers cannot be identified without quantitative analysis.

The amount and quality of available DNA precluded Q-PCR-based approaches to understanding the molecular basis of the DYS19 duplications. If a duplicated region contains more than one Y-STR, then duplications of multiple STRs within a physical interval can help to delimit its length; this was the case for the AZFa interval [14], which contains nine Y-STRs including DYS388, DYS389I, DYS389II, DYS438 and DYS439. However, the only other STR lying on Yp included in our Y-STR multiplexes is DYS393, some 7 Mb distal to DYS19 in the reference sequence. Unsurprisingly, no examples of co-duplication of this locus are observed in our set. DYS458, present in the Yfiler (Applied Biosystems) kit, lies ∼2.4 Mb distal to DYS19 in the reference sequence. DYS458 data available for four chromosomes (19dup23, 24, 50 and 51) show no evidence of duplication (not shown), so the extents of the duplicated regions remain undetermined.

Haplotyping of deletions and duplications

Typing of binary markers shows that the three deletion chromosomes belong to three different haplogroups (D*, J2 and R1a; Fig. 2a), and therefore must represent independent events.

A similar analysis places the 51 duplication chromosomes in four different haplogroups (Fig. 2a), indicating that there are at least four independent duplications of DYS19. Within a 14-locus Y-STR network (Fig. 2b), the six hgG and 43 hgC3c duplication chromosomes each form clusters suggesting single duplication origins and identity-by-descent within each of these haplogroups; time to most recent common ancestor estimates are 2,330 ± 840 years and 1,780 ± 630 years, respectively, for these clusters. The issue arises of whether or not all chromosomes within these two haplogroups might carry DYS19 duplications: chromosomes presenting only a single peak (allele) in an electropherogram can nonetheless be duplicated for the STR, so ascertainment is not straightforward. While it is neither more nor less likely that a duplication occurred before or after the haplogroup-defining SNP, the knowledge would be interesting because then typing the SNP would, in effect, be ascertaining the duplication. For haplogroup G, it is clearly not the case that all chromosomes carry the duplication: for example, a published dataset of 56 hgG haplotypes [37] contains no DYS19 two-allele cases, despite displaying DYS19 repeat numbers between 14 and 17 and therefore providing ample opportunity for ascertainment of duplications. A recent study [10] has shown that ten Italian chromosomes carrying DYS19 duplications all belong to hgG2*(xG2a,G2b); it may be that duplications are confined to this sublineage and indeed that all hgG2*(xG2a,G2b) chromosomes carry duplications. However, the published evidence [10] does not make clear how many hgG chromosomes not belonging to this sublineage were tested for DYS19 duplication, so it is difficult to assess this possibility.
For haplogroup C3c, duplication chromosomes are more predominant, with 46 of 126 chromosomes (37%) showing two distinct alleles (unpublished; [38, 39]); however, the fact that no examples of chromosomes belonging to one particular haplotype cluster within hgC3c (the ‘Manchu haplotype’ [40]) show two DYS19 alleles, again despite high diversity of DYS19 (alleles 14 to 18), suggests that DYS19 duplication chromosomes actually form a subset of hgC3c.

Singletons carrying duplications are found in hgs M (19dup50) and Q(xQ3a) (19dup51), and interpretation of these cases is more problematic. While they may represent genuine germ-line mutations, in the absence of other examples within these haplogroups (indicating identity-by-descent), it remains possible that these chromosomes bear somatic DYS19 mutations with approximately balanced cell populations carrying the two different alleles. Both cases are from the CEPH-HGDP panel and therefore derived from lymphoblastoid cell lines, in which somatic STR mutations have been noted in the past [41]. Comparisons of peak area ratios for the two DYS19 alleles support this idea. For all other chromosomes, the average value for the shorter allele divided by the longer is 1.12 (range 1.04 to 1.24); however, for 19dup50 and 19dup51, the ratios are, respectively, 1.39 and 1.93.
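The peak area comparison can be expressed as a simple check. In this sketch the reference interval is just the range observed above for the other duplication chromosomes (1.04 to 1.24), used here as an illustrative assumption and not as a validated threshold; the function names are hypothetical.

```python
def peak_area_ratio(short_allele_area, long_allele_area):
    """Ratio of the shorter allele's peak area to the longer allele's."""
    return short_allele_area / long_allele_area


def is_balanced(ratio, low=1.04, high=1.24):
    """True if the ratio lies within the range seen for the other
    duplication chromosomes (illustrative bounds, not a validated cut-off)."""
    return low <= ratio <= high


if __name__ == "__main__":
    for sample, ratio in [("19dup50", 1.39), ("19dup51", 1.93)]:
        print(sample, "balanced" if is_balanced(ratio) else "imbalanced")
```

Both singleton ratios fall outside this range, which is the basis for treating them as possible somatic mosaics.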

Possible mechanisms of deletions and duplications

Are these deletions and duplications caused by NAHR, mediated by directly repeated sequences? It is difficult to address this question for the duplications because we know little about their extents. However, for the deletions, we know that any sponsoring repeats must lie within an IR3 element. We sought candidate repeats by carrying out a sliding self-alignment of the ∼300-kb proximal IR3 copy using the program REPuter [32]. This revealed, as the only plausible candidates for sponsoring such local rearrangements, duplicated direct partial L1 LINE repeats ∼69 kb apart, flanking DYS19 (Fig. 1d). These sequences show 96% sequence similarity over 1.75 kb and contain the largest block of sequence identity of 182 bp. NAHR between these sequences could lead to deletion or duplication of DYS19. To test this idea, we designed primers flanking each L1 segment and sought junction products by PCR in both deletion and duplication chromosomes. No junction products were obtained (data not shown), so the L1-mediated mechanism is not supported.

An alternative mechanism for duplication/deletion would be non-reciprocal transfer through gene conversion mediated by homology—the genomic context of DYS19 provides potential opportunities for this. As described above, it lies within a repeated IR3 element; furthermore (and as noted by others [7]), two paralogues, lacking variable tetranucleotide repeats, lie on the long arm within the b1 and b2 repeats [42] (part of palindrome 3 [16]) proximal to AZFc (Fig. 1a). In principle, gene conversion between IR3 elements could be responsible for DYS19 deletion or duplication. A sequence alignment of the two IR3 repeats (Fig. 1c) suggests that this possible mechanism is unlikely: DYS19 lies in a region of the proximal element corresponding to a ∼3-kb gap in the distal element, and gene conversion is very unlikely to operate over such a large region of heterology. Gene conversion might also operate between a DYS19-containing IR3 element and a b1 or b2 repeat on Yq, each of which contains a DYS19 paralogue (Fig. 1a). However, the paralogues lack the DYS19 (TAGA)n repeat array, and show a mean of only 92% sequence similarity to the DYS19 region itself in regions 200 bp either side of the array; the largest block of sequence identity is only 53 bp in length. Again, this degree of heterology makes gene conversion seem an improbable mechanism [43].

In summary, our investigations do not provide any evidence of an NAHR-based mechanism for rearrangements involving DYS19, whether by unequal exchange or gene conversion. It seems likely that the rearrangements are sporadic events occurring through diverse processes that are probably not mediated by homologous recombination, as has been shown for some Amelogenin Y deletions [13].

Discussion

Recently, with the convenience of the availability of the human genome sequence, the systematic identification of new Y-STRs has become relatively straightforward [44]. However, most of the markers so well established in forensic practice today were developed more laboriously, before this resource was available, and represent a heterogeneous set of loci. With our current knowledge of the genomic sequence context and behaviour of DYS19, it would not have been a strong candidate for selection as a useful new marker from genomic sequence information: it is observed to be deleted, duplicated or triplicated and lies in one of a pair of dynamic short-arm inverted repeats with >99% sequence identity, with very similar long-arm paralogues, making the design of truly specific primers difficult [7]. Despite these apparent disadvantages, DYS19 has become one of the most widely typed markers on the Y chromosome.

Statements about the locations of Y-STRs need to be made carefully, given the fluidity of the organisation of the Y chromosome, and it is clear that the position of DYS19 is uncertain in any given lineage (although this finding does not materially affect its forensic utility). Any study of Y-chromosomal structural variation needs to bear in mind that, although it is an invaluable resource, the reference sequence [16] is not necessarily a relevant starting point for considering mechanisms of rearrangement in other lineages.

The Y chromosome bears a rich complement of ampliconic repeats [16], and these are known to sponsor many recurrent rearrangements, including examples at AZFa [14, 45], AZFb [46], AZFc [26, 42, 47–49] and Amelogenin Y [13, 15]. However, this does not mean that every rearrangement is driven by these NAHR processes. In the case of DYS19 duplications and deletions, we have found no evidence to suggest that they are caused by recurrent NAHR or conversion between paralogous repeats. It seems more likely that they result from sporadic, and perhaps non-homology-mediated, processes with very low mutation rates.

The very large number of chromosomes surveyed for DYS19 copy number variation means that these rare events are nonetheless detected, and this is facilitated by the expansion of particular lineages through the strong genetic drift and possible social selection that operate on the Y. At least two duplication lineages have propagated in East Asia (within hgC3c) and West Asia/Central Europe (within hgG), so that duplications in populations from these regions may be relatively frequent and of importance to forensic practice. In terms of absolute numbers of chromosomes, duplications are more common than deletions, but considering the number of independent events, the different rearrangements exist at similar frequencies. There is no evidence for a deletion lineage that has spread in the manner of the duplication lineages. Note, however, that DYS19 deletions might be under-reported to databases such as the YHRD if they are interpreted by contributors as ‘incomplete’ profiles, signifying technical failure; this effect may be stronger for deletions than for duplications, which are more widely known and better understood.

The YHRD (release 22) contains 96 DYS19 duplications in addition to the examples analysed by us; ten are known to belong to hgG [10]. We used Network analysis (Figure 1 of the Electronic Supplementary Material) to ask if there is evidence that any of the remaining 86 chromosomes belong to either of the predominant clusters in hgG and C3c, combining the YHRD cases with ours. Our hgM and hgQ*(xQ3a) chromosomes are peripheral in the network, with no closely related haplotypes, suggesting that the YHRD contains no further cases belonging to these two lineages and supporting the idea that the hgM and Q*(xQ3a) cases represent somatic mutations. Many of the known hgG and hgC3c chromosomes, however, share haplotypes with YHRD examples or are their single-step mutational neighbours, strongly suggesting that they also share haplogroups. On this conservative basis, at least 18 of the 86 YHRD chromosomes of unknown haplogroup probably belong to the hgG lineage (a total proportion, including chromosomes identified by us, of 23% of duplications), and 43 of 86 chromosomes (total sample proportion 59%) belong to the hgC3c lineage.
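The quoted proportions can be reproduced with the short calculation below. It reflects one reading of the counts given in the text (51 duplication chromosomes from this study plus 96 from the YHRD as the denominator), which is an assumption made here because the denominator is not stated explicitly; the function name is hypothetical.

```python
OURS_TOTAL = 51   # duplication chromosomes analysed in this study
YHRD_TOTAL = 96   # additional DYS19 duplications in the YHRD (release 22)


def lineage_percentage(*counts):
    """Percentage of all known duplications assigned to one lineage."""
    return 100.0 * sum(counts) / (OURS_TOTAL + YHRD_TOTAL)


# hgG: 6 from this study, 10 known from [10], 18 inferred from the network.
hg_g = lineage_percentage(6, 10, 18)
# hgC3c: 43 from this study, 43 inferred from the network.
hg_c3c = lineage_percentage(43, 43)

if __name__ == "__main__":
    print(f"hgG: {hg_g:.1f}%  hgC3c: {hg_c3c:.1f}%")
```

Under this reading the two founder lineages together account for about four fifths of all known DYS19 duplications.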

The existence of DYS19 deletions and duplications can have a number of practical consequences. Mutation at a single-copy Y-STR is easily recognised and interpreted in a deficiency paternity test [50], but if a duplication is present, results could be confusing: for example, a patrilineal relative apparently carrying allele 16 (but in reality the duplication 16–16) might be compared to a son carrying 16–17. Consideration of peak–height ratios and of the population of origin and background haplotypes of the tested individuals would aid interpretation. In forensic casework, deletions and duplications might be interpreted as allele drop-outs or evidence of DNA contamination [9], but provided that they can be reliably recognised, they could elevate the significance of a match between a suspect and a stain, rather in the way that heteroplasmy of mitochondrial DNA can increase the strength of evidence [51]. Finally, given the high frequencies of duplications in two haplogroups and their associated populations, DYS19 duplications may be useful in the deduction of population of origin of a DNA sample.