Introduction

Copy number variation (CNV, which also refers to copy number variants) is well-documented in the human genome, affecting more nucleotides than are affected by SNP variation and contributing abundantly to phenotypic diversity (Lupski 2015; Sudmant et al. 2015 a, b). CNVs arise by several mechanisms, including non-allelic homologous recombination, non-homologous end joining, microhomology-mediated break-induced replication, and retroelement insertions (Hastings et al. 2009; Carvalho and Lupski 2016). Gene conversion, the non-reciprocal transfer of genetic information from one locus or allele to another, is also well-documented in the human genome as a possible consequence of double-strand breaks and subsequent repair involving homologous regions, including the formation and resolution of a double Holliday junction during meiosis (Szostak et al. 1983). It generally involves short stretches of DNA converting between alleles or nearby paralogs, requires high sequence similarity, and occurs less frequently between different chromosomes. Allelic conversion tract lengths of up to 22 kb are known (Wang et al. 2012), while lengths for non-allelic events are shorter, although > 9 kb has been reported (Chen et al. 2007; Hallast et al. 2013). These two processes of CNV and gene conversion are generally considered quite distinct, but here, we describe a CNV, whose origin is best explained by gene conversion events, linking the two processes.

The male-specific region of the Y chromosome offers unique opportunities for investigating both CNV and gene conversion, because (1) it is particularly tolerant of genetic variation (Poznik et al. 2016), so a wide variety of variants persist in the population and (2) the lack of recombination between different Y lineages allows the history of variants to be identified from the phylogeny (Jobling 2008; Jobling and Tyler-Smith 2017; Massaia and Xue 2017; Trombetta and Cruciani 2017). We have previously described the genetic variation in a set of 1244 diverse worldwide Y chromosomes sequenced in phase 3 of the 1000 Genomes Project, including the identification and validation of 110 CNVs (Poznik et al. 2016). The current study presents a detailed analysis of one of these, dbVar ID esv3818053 [chrY: 9,640,466–9,653,590 GRCh37 (hg19); chrY: 9,802,857–9,815,981, GRCh38 (hg38)], here designated by the more descriptive name TTTY22-CNV, because it overlaps with ~ 85% of this lincRNA gene and changes the number of its functional copies.

Results and discussion

Validation of TTTY22-CNV

TTTY22-CNV lies within the ~ 300 kb long inverted repeat 3 (IR3), which has two copies on the short arm of the Y chromosome, at approximately 6.1–6.4 and 9.4–9.7 Mb (GRCh37) (Skaletsky et al. 2003). In the reference sequence (Skaletsky et al. 2003), derived mostly from haplogroup R1b (Wei et al. 2013), the proximal copy contains an additional ~ 10 kb segment which carries most of TTTY22 (Fig. 1a). Variation in copy number of this additional segment was initially identified using Genome STRiP and validated by array-CGH (Poznik et al. 2016), revealing 0–3 copies, compared with the single copy in the reference sequence. TTTY22-CNV was further validated here by establishing a PCR assay using one pair of flanking primers which generated a 568 bp product in the absence of the TTTY22-CNV, and one pair within TTTY22 which generated a 249 bp product in its presence (Table S1). Structures matching the reference sequence with one copy of TTTY22-CNV generate both products, those with 0 copies generate only the 568 bp band, and those with 2 or 3 copies only the 249 bp band (Fig. 1c). In addition, fibre-FISH experiments using probes generated from two partially overlapping BAC clones (P1 and P2) spanning TTTY22-CNV, together with two 5 kb custom PCR-generated probes (P3 and P4, combined; Table S1) lying mainly within TTTY22-CNV, are expected to show differential patterns: (1) hybridization of only the BAC clones plus a small fragment of P4 when the TTTY22-CNV is absent and (2) hybridization of all four probes when the TTTY22-CNV is present. As expected, both patterns were detected in similar proportions (7:9, Fig. 1b, Fig. S1) in samples with structures matching the reference sequence. Conversely, only the former pattern (18:0, Fig. S1) was detected in samples with 0 copies of TTTY22-CNV and only the latter pattern (0:12, Fig. S1) was observed in samples with 2 copies of TTTY22-CNV (Fig. 1b). 
Samples with three copies showed the latter pattern plus a separate, additional fibre-FISH signal, and are described separately below. Finally, 10x Genomics Chromium linked-read data (Zheng et al. 2016) were available for samples with one and two copies; the sample with one copy shows uniform distributions of barcode sharing and read depth across both TTTY22-CNV locations, while the sample with two copies shows increased barcode sharing and read depth in IR3proximal and decreased sharing and depth in IR3distal (Fig. S2). Thus, this combination of validation approaches confirms both the predicted details of the breakpoints (PCR) and the broader context of the TTTY22-CNV location within IR3 (fibre-FISH and linked-read sequencing), so we conclude that individuals with 0, 1, or 2 copies of TTTY22-CNV can be explained by variation within the IR3 repeats.
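
The PCR read-out described above amounts to a simple lookup from the two products to a copy-number class. The following is a minimal illustrative sketch, not part of the original analysis; the function name and return labels are hypothetical.

```python
def interpret_pcr(band_568: bool, band_249: bool) -> str:
    """Map the two PCR products to a TTTY22-CNV copy-number class.

    band_568: product of the flanking primers (F1/R1), seen when an IR3 copy
              lacks the TTTY22-CNV insert.
    band_249: product of the internal primers (F2/R2), seen when the insert
              is present in at least one IR3 copy.
    """
    if band_568 and band_249:
        return "1 copy"          # one IR3 copy with the insert, one without
    if band_568:
        return "0 copies"        # neither IR3 copy carries the insert
    if band_249:
        return "2 or 3 copies"   # PCR alone cannot separate 2 from 3 copies
    return "no product"          # e.g. female or non-human control


if __name__ == "__main__":
    print(interpret_pcr(True, True))
    print(interpret_pcr(False, True))
```

As in the text, the assay cannot distinguish two from three copies, which is why fibre-FISH and read depth are needed for the three-copy samples.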

Fig. 1

Location and validation of TTTY22-CNV. a Top: the Y chromosome, with the red box indicating the location of the ~ 3.7 Mb region carrying the two copies of IR3 (black) and intervening sequence (grey). Middle: the IR3 structure in the reference sequence (GRCh37) showing IR3distal (left) divided into two blocks (B and A), and IR3proximal (right) similarly divided into two blocks (A and B) separated by the TTTY22-CNV (small white block). F1, R1 and F2, R2 are pairs of PCR primers used in validation (see part C). The lighter purple block shows the location of the previously-reported IR3 inversion breakpoint interval (Turner et al. 2006). Coordinates are given for GRCh37 in this study to be consistent with previous work on this CNV in the same samples (Poznik et al. 2016). Lower: location of probes P1, P2, P3, and P4 used in fibre-FISH validation (see part B) and schematic representation of gene conversion events which would change the single copy of TTTY22-CNV in the reference sequence to 0 copies (left) or 2 copies (right). b Fibre-FISH validation: HG00096, with 1 copy of TTTY22-CNV matching the reference sequence, shows both fibre-FISH patterns that include P3 (entirely within TTTY22-CNV, upper) and patterns that lack P3 (lower). NA19146, with 0 copies of TTTY22-CNV, shows only the pattern that lacks P3. NA18953, with 2 copies of TTTY22-CNV, shows only the pattern that includes TTTY22-CNV. c PCR validation: primers F1 and R1 amplify across TTTY22-CNV (see part A) and produce a 568 bp product in the absence of TTTY22-CNV, while primers F2 and R2 amplify a fragment within TTTY22-CNV and produce a 249 bp product in the presence of TTTY22-CNV. 
HG00096, with 1 copy of TTTY22-CNV matching the reference sequence, shows both products; NA19146, with 0 copies of TTTY22-CNV, shows only the 568 bp product; NA18953, with 2 copies, shows only the 249 bp product; NA19661, with 3 copies, shows only the 249 bp product: the same pattern as the 2-copy sample; neither human female, male chimpanzee nor male gorilla produce any product

Population variation and phylogeny of TTTY22-CNV

In the 1234 Y chromosomes examined for copy number (ten samples were excluded because of Y-chromosomal mosaicism), 266 had 0 copies, 943 had 1, 23 had 2 copies, and two had 3 copies (Table S2). Examination of the Y-chromosomal phylogeny established for these samples using Y-SNP variation (Poznik et al. 2016) showed that, as expected, the different copy numbers of TTTY22-CNV were often clustered in branches of the phylogeny, so that the number of mutational events inferred was just 20 (Fig. 2, and see below for further details), nevertheless indicating a relatively high mutation rate compared with SNPs. IR3 (including TTTY22-CNV) is not detectable in the chimpanzee or gorilla genome sequences (the longest BLAST hit is 7.4 kb, in block B, and shares only 87.97% sequence identity with chimpanzee Y:15,046,773-15,054,094) and the TTTY22-CNV PCR primers do not detect any product in male chimpanzee or gorilla (Fig. 1c), so examination of an outgroup is not informative about the ancestral state of TTTY22-CNV. Within the human Y-chromosomal phylogeny sampled, all major haplogroups include structures matching the reference with one copy of TTTY22-CNV (Fig. 2), so we infer that the most recent common ancestor of the Y chromosomes examined probably also carried one copy.

Fig. 2

Phylogenetic distribution of TTTY22-CNV copy number variation. Left: simplified Y-chromosomal phylogeny showing the branching pattern of the lineages considered in this study; branch lengths do not correspond to the number of SNPs. Subsequent columns show the numbers of samples in our data set of 1234 chromosomes with 1, 0, 2, or 3 copies of TTTY22-CNV, the number of deletion (Dels) and duplication (Dups) events inferred from the full phylogeny (total of 20) (Poznik et al. 2016), and for comparison, the IR3 orientation reported for these haplogroups, where available (Repping et al. 2006). The simplified haplogroups common between the two data sets were named according to the International Society of Genetic Genealogy (2013): Y-DNA Haplogroup Tree 2013, Version: (8.89), Date: (31 December 2013), http://isogg.org/tree/2013/index13.html

A gene conversion mechanism for most TTTY22-CNV variation

Since the copy number of the IR3 repeat as a whole does not vary in the samples examined (see below for a partial exception), the most likely mechanisms to generate 0, 1, or 2 copies of TTTY22-CNV variation would be gene conversion or double crossover in the IR3 sequences flanking TTTY22-CNV, either replacing the proximal copy with the distal to generate 0 copies of TTTY22-CNV, or replacing the distal copy with the proximal to generate 2 copies (Fig. 1a, lower section). These flanking sequences are 99.45% identical (99.65% for Block A; 99.13% for Block B), but show occasional variants in the reference sequence differentiating the proximal and distal IR3 copies, known as Paralogous Sequence Variants (PSVs) or Sequence Family Variants (SFVs), which are expected to be present in the population as well. The low-coverage sequences available from the 1000 Genomes Project prevented reliable de novo detection of PSVs, because first, an erroneous PSV allele call might be made in a single read and misinterpreted as a true variant, and second, PSV state would not be called at all if there was zero coverage of a particular position. In contrast, the genotypes of PSVs known in the reference sequence can be extracted reliably, since the probability of a single-nucleotide variant that matches a known PSV allele being genuine is much higher than for a novel variant, even for low-coverage sequences. By aligning the reference sequences of the distal and proximal IR3 copies, we identified 165 PSVs and counted the number of reads carrying each PSV allele in each sample, usually detecting both alleles (Table S3). In this way, sizes of mutations including TTTY22-CNV and all such TTTY22-CNV mutations throughout IR3 were estimated (Figs. 3, 4, respectively; Table S4). Missing data increase the uncertainty of these estimates and thus decrease the resolution of the mapping; low sample depth increases the probability of data being missing and thus decreases the resolution. 
The resolution of the size estimates was entirely determined by the location of the PSVs: we can only detect a mutation when a PSV is affected. Therefore, for every mutation event, we give two numbers: the maximum and minimum lengths given the information from the PSVs.
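
The minimum and maximum lengths can be illustrated with a small calculation. This is a sketch under simplifying assumptions that are ours rather than the original method's: PSV positions are given as sorted coordinates along one IR3 copy, the converted PSVs form a single contiguous run, and the ~ 10 kb insert itself is ignored. The function name and example coordinates are hypothetical.

```python
from typing import List, Optional, Tuple


def event_length_bounds(
    positions: List[int], converted: List[bool]
) -> Tuple[int, Optional[int]]:
    """Return (min_len, max_len) in bp for one event.

    positions: sorted PSV coordinates (hypothetical values in the example).
    converted: True where the PSV shows the same allele in both IR3 copies.
    max_len is None when no unconverted PSV bounds the event on one side.
    """
    hits = [i for i, c in enumerate(converted) if c]
    if not hits:
        raise ValueError("no converted PSV: the event is undetectable")
    first, last = hits[0], hits[-1]
    # Minimum: the span between the outermost converted PSVs.
    min_len = positions[last] - positions[first] + 1
    if first == 0 or last == len(positions) - 1:
        return min_len, None
    # Maximum: up to, but not including, the nearest unconverted PSVs.
    max_len = positions[last + 1] - positions[first - 1] - 1
    return min_len, max_len


if __name__ == "__main__":
    pos = [100, 500, 900, 1500, 4000]
    conv = [False, True, True, False, False]
    print(event_length_bounds(pos, conv))
```

Widely spaced PSVs give a large gap between the two bounds, which is the resolution limit described above.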

Fig. 3

Changes to PSVs flanking TTTY22-CNV. Left, Y-chromosomal phylogeny of the haplogroups carrying non-reference copy numbers of TTTY22-CNV. In haplogroups R2a and O2b, there are two independent copy number mutations, and we added (1) and (2) after the haplogroup names to distinguish them. The next columns show the sample sizes, and minimum (Min) and maximum (Max) lengths of gene conversion events surrounding each CNV deduced from changes to the PSV patterns. Right, PSV patterns, colored according to the reference sequence (top) as indicated in Fig. 1a; proximal and distal positions of PSVs (A6, A5 … B8, B9) are shown in Table S3. The black box indicates the minimum length of gene conversion; the maximum length could extend to the next PSV on either side. The relative paucity of events affecting left-hand-side PSVs reflects the larger physical distances of these PSVs from TTTY22-CNV. The PSV patterns of 934 individuals with one copy of TTTY22-CNV are quite diverse and are summarised in Table S6

Fig. 4

Minimum length distributions of gene conversion events inferred from changes to PSV patterns in IR3 as a whole (distal to proximal, blue; proximal to distal, red) and for events involving TTTY22-CNV (green), which are on average larger. The events here are minimum numbers inferred from PSV patterns, and the same pattern will sometimes have arisen independently on different Y-chromosomal lineages

Examination of the PSV profiles adjacent to the TTTY22-CNV thus provided the minimum and maximum size estimates for the genomic region accompanying the CNV change, as shown in Fig. 3. We then mapped the CNV changes onto the phylogeny constructed using the SNPs in the same sample set (Poznik et al. 2016). We identified 12 different mutation sizes, which could be resolved into 20 mutation events using the phylogeny, since events with indistinguishable sizes sometimes occurred on independent branches of the phylogeny. The largest mutation had a minimum size of > 32 kb.

Since gene conversion and double crossover produce indistinguishable structures (Chen et al. 2007), additional factors have to be considered to distinguish between the two possibilities. It is generally reasoned that crossovers are rare, so the chance of two occurring in close proximity is low, and therefore, structures resulting from exchange of information over lengths of 10 kb or less, which have been abundantly documented on the Y chromosome (Rozen et al. 2003; Bosch et al. 2004; Hurles et al. 2004; Hallast et al. 2013), have been interpreted as resulting from gene conversion. It is notable that gene conversion events of up to 9 kb have been reported on the Y chromosome (Hallast et al. 2013), exceeding the maximum size of ~ 4 kb on other chromosomes (Dumont and Eichler 2013; Trombetta and Cruciani 2017). Nevertheless, the large size of some of the events identified here suggests that reconsideration of the possible involvement of double crossovers is merited. We know of no reported measurement of the frequencies of double crossovers on the human Y chromosome. However, some information about the frequencies of single crossovers is available. Single crossovers between sister chromatids would result in isodicentric or acentric chromosomes, which are both very rare (and evolutionarily lethal), but intra-chromatid single crossovers result in inversions, which are unlikely to affect the phenotype. An inversion between IR3proximal and IR3distal has been reported (Williams 1998; Repping et al. 2006) and has been inferred to be recurrent within the Y-chromosomal phylogeny (Repping et al. 2006). Repping et al. counted 12 inversions in the phylogeny they sampled and inferred a mutation rate of ≥ 2.3 × 10^−4 per generation. 
This phylogeny consisted of a slightly different set of haplogroups from the current study, so we constructed and examined a consensus phylogeny based on the subset of haplogroups in common between the two studies. On this consensus phylogeny, there were eight inversion events and 13 TTTY22-CNV copy number changes (Fig. 2). Thus, TTTY22-CNV mutations are more common than single intra-chromatid crossovers. Double crossover events close together should be considerably rarer than single crossovers. The known breakpoints of single crossovers are also located in a different region of IR3 (Turner et al. 2006, 2008), suggesting that the known crossovers in IR3 are both structurally independent from, and less frequent than, changes to TTTY22-CNV copy number. We, therefore, conclude that double crossovers do not provide a plausible mechanism for TTTY22-CNV copy number change, while gene conversion does, as noted for other Y loci (Hallast et al. 2013).

A distinct and more standard mechanism generating three copies of TTTY22-CNV

The two individuals with three copies of TTTY22-CNV are clustered in the phylogeny and share a common origin (Fig. 5a, b) which, in contrast to all the other TTTY22-CNV variants, cannot be accounted for by a simple gene conversion event between IR3 copies. Fibre-FISH shows the presence of a third copy of TTTY22-CNV and > 100 kb of flanking sequence (the maximum tested by this method) located ~ 230 kb away from one of the reference copies (Fig. 5d). Measurement of read depth in the two individuals carrying three copies identified ~ 420 kb extending from within IR3proximal into the proximal flanking region with increased read depth, and analysis of the PSVs within this segment showed over-representation of proximal PSVs. For the part of IR3 shown by the fibre-FISH and read depth to be present in three copies, the proximal:distal PSV read depth ratio was 1.7 or 1.9 in the two individuals carrying three copies of TTTY22-CNV, while for the part present in two copies, the ratios were 1.3 and 1.0 (Table S5). Together, these data suggest that copy 3 originated by tandem duplication of a 420 kb region including TTTY22-CNV in IR3proximal, from a chromosome with two copies (Fig. 5c).
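
The PSV ratios can be compared with a simple expectation. The sketch below assumes, as a simplification of ours, that read depth at each PSV allele is proportional to the number of IR3 segments carrying that allele; the function name is hypothetical and the observed values are those quoted above from Table S5.

```python
def expected_psv_ratio(proximal_copies: int, distal_copies: int) -> float:
    """Expected proximal:distal PSV read-depth ratio for a given copy structure."""
    if distal_copies <= 0:
        raise ValueError("at least one distal copy is required")
    return proximal_copies / distal_copies


if __name__ == "__main__":
    # Segment present in three copies after tandem duplication within
    # IR3proximal: two proximal-type copies and one distal-type copy.
    print(expected_psv_ratio(2, 1), "expected; observed 1.7 and 1.9")
    # Segment present in two copies: one proximal and one distal.
    print(expected_psv_ratio(1, 1), "expected; observed 1.3 and 1.0")
```

The observed ratios lie close to these expectations, in line with a duplication of proximal rather than distal IR3 sequence.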

Fig. 5

Characterization of chromosomes with three copies of TTTY22-CNV. a Read depth in 1 kb bins between IR3proximal and the Y chromosome centromere for the two samples with three copies of TTTY22-CNV (NA19685 and NA19661, central panels) and two phylogenetically-nearby samples with one copy (NA07357 and NA12144, top and bottom panels). The pink shading highlights a 420 kb region with increased read depth, resulting from increased copy number. The phylogenetic relationships of these four chromosomes are indicated to the left; the number of SNPs on each branch is shown (Poznik et al. 2016). b Validation of the copy number differences using published array-CGH data (Poznik et al. 2016), where the same shaded region shows increased log2 ratio. c Schematic representation of the origin of the chromosome with three copies of TTTY22-CNV, starting with a gene conversion event to generate two copies followed by tandem duplication of a 420 kb region to generate the third copy. d Fibre-FISH using the same probes as described in Fig. 1 revealing the location of the 420 kb insertion. An expanded view of the fibre-FISH is included in Fig. S1

Implications of TTTY22-CNV variation

Men carrying non-reference copy numbers of TTTY22-CNV are likely to have the corresponding numbers of copies of the functional TTTY22 gene, because the ~ 15% of this gene that lies outside the CNV is provided by the flanking sequence. The high frequency of these variants, and the large numbers of men who sometimes share the same mutational event, most marked in an E1b sub-lineage which carries 0 copies and is very common in sub-Saharan Africa, represented 260 times in the current data set (Poznik et al. 2016), demonstrate that copy numbers 0–3 allow male lineage expansion, and are unlikely to be detrimental. Given the long-term persistence of lineages with 0–3 copies of TTTY22-CNV in the population, and the extreme drift experienced by the Y chromosome, this variation is compatible with evolutionary neutrality. Nevertheless, the location of TTTY22 in the CNV suggests that the abundance of this transcript may vary between men. The gene is transcribed primarily in testis, with a lower level in brain (The GTEx Consortium 2017), and further work is needed to investigate whether or not subtle phenotypic consequences of variation in transcript levels can be detected. Men with three copies of TTTY22-CNV carry an additional large duplication of proximal Yp sequences and have in addition three copies of TTTY23, compared with two copies (one in each IR3 repeat) in other men. This duplication has also spread in the population, so also seems unlikely to be detrimental.

Overall, our findings highlight gene conversion as an additional mechanism for generating CNV in the human genome, especially on the Y chromosome, that has previously received little attention. The link between these two processes should be further explored and considered when either is investigated.

Methods

Data sets

We used the existing Y-chromosomal sequence data and CNV calls from phase 3 of the 1000 Genomes Project, where an initial CNV call set had been made using Genome STRiP and calls validated using array-CGH (Poznik et al. 2016). We also generated new 10× Genomics Chromium libraries from two of these samples (HG01097 and NA18953) following the manufacturer’s instructions (https://www.10xgenomics.com/genome/) followed by sequencing on the Illumina HiSeq X platform with 150 bp paired-end reads to a depth of ~ 30×. The sequence data were processed using the LongRanger 1.0 software using the reference sequence GRCh37 and viewed using the Loupe 2.1.0 software from 10x Genomics. All coordinates in this paper are based on GRCh37 to make them compatible with the initial CNV calls (Poznik et al. 2016).

TTTY22-CNV validation

We designed two sets of primers to validate the presence or absence of the 10 kb insert of TTTY22-CNV (dbVar description: https://www.ncbi.nlm.nih.gov/dbvar/variants/esv3818053/#VariantGenome), one set spanning the breakpoint of the empty site and the other lying within the 10 kb insert region. Primers, PCR conditions, and predicted product sizes are described in Table S1.

Molecular combing fibre-FISH experiments were carried out as described previously (Poznik et al. 2016). Probes consisted of two BAC clones (RP11-117N22 and RP11-453C1, obtained from the clone archive resource of Wellcome Trust Sanger Institute) used to identify the genomic region of interest, as well as two custom PCR probes of ~ 5 kb each lying mainly within the insert of TTTY22-CNV (details also in Table S1), which were combined to produce a single probe to distinguish the presence of the insert from its absence. We validated the CNV calls using both PCR and fibre-FISH for the same four samples: NA19146 with 0, HG00096 with 1, NA18953 with 2, and NA19661 with 3 copies. In addition, we applied 10× Genomics Chromium to two samples: HG01097 with 1 and NA18953 with 2 copies of TTTY22-CNV.

Sequence and phylogenetic analyses

We defined blocks A and B within IR3 (https://genome.ucsc.edu/) as being separated by TTTY22-CNV, and measured their similarity between the two copies of IR3 as 99.65% for the ~ 170 kb of block A and 99.13% for the ~ 101 kb of block B.

We placed the TTTY22-CNV copy number of each sample onto the full Y-chromosomal phylogeny based on SNPs (Poznik et al. 2016) to infer the number of mutational events (deletion or duplication). To understand the relation between the TTTY22-CNV and the IR3 inversion events (Repping et al. 2006), we placed both TTTY22-CNV copy number changes and IR3 inversion events onto a simplified phylogeny as described in the main text and compared their phylogenetic locations.

To investigate gene conversion events, we first identified PSVs (Table S3) between the two copies of IR3 from the reference sequence after aligning them using Basic Local Alignment Search Tool (https://blast.ncbi.nlm.nih.gov/Blast.cgi), and then counted the number of reads covering each allele of each PSV from the BAM files for each individual in the 1000 Genomes Project phase 3 data using a custom Perl script. To allow for the low sequence coverage, we established the following protocol. When six or more reads covering the same PSV were found in a sample, the chance of both IR3 copies being represented was ~ 97%, and so a ≥ 6:0 ratio of PSV alleles was taken to indicate that both IR3 copies carried the same allele, implying that a gene conversion (or double crossover) event had occurred. The length of such mutational events was inferred by extending the analysis to PSVs adjacent to the ≥ 6:0 seed, accepting ≥ 1:0 counts of the same allele as resulting from the same event.
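
The ~ 97% figure is consistent with a simple model, which we assume here because the derivation is not given in the text: each read comes from either IR3 copy with equal probability, so the chance that n reads represent both copies is 1 − 2 × (1/2)^n, or about 0.969 for n = 6. The sketch below illustrates that calculation and the seed-and-extend rule. It is not the custom Perl script; the function names and example counts are hypothetical.

```python
from typing import List, Tuple


def prob_both_copies_seen(n_reads: int) -> float:
    """P(both IR3 copies are sampled) if each read picks a copy with p = 0.5."""
    if n_reads < 1:
        return 0.0
    return 1.0 - 2 * 0.5 ** n_reads


def _supports(count: Tuple[int, int], allele: str) -> bool:
    """True if a PSV shows >= 1 read of `allele` and none of the other allele."""
    prox, dist = count
    same, other = (prox, dist) if allele == "proximal" else (dist, prox)
    return same >= 1 and other == 0


def find_events(
    counts: List[Tuple[int, int]], seed_min: int = 6
) -> List[Tuple[int, int, str]]:
    """Seed on PSVs with >= seed_min:0 reads, then extend over adjacent PSVs.

    counts: (proximal_allele_reads, distal_allele_reads) per PSV, in order.
    Returns (first_index, last_index, allele_present_in_both_copies).
    """
    events = []
    covered_until = -1
    for i, (prox, dist) in enumerate(counts):
        if i <= covered_until:
            continue
        if prox >= seed_min and dist == 0:
            allele = "proximal"
        elif dist >= seed_min and prox == 0:
            allele = "distal"
        else:
            continue
        start, end = i, i
        while start > 0 and _supports(counts[start - 1], allele):
            start -= 1
        while end < len(counts) - 1 and _supports(counts[end + 1], allele):
            end += 1
        events.append((start, end, allele))
        covered_until = end
    return events


if __name__ == "__main__":
    print(round(prob_both_copies_seen(6), 3))
    print(find_events([(3, 2), (2, 0), (7, 0), (1, 0), (0, 4), (2, 2)]))
```

In this sketch a PSV with no reads stops the extension, which mirrors the loss of resolution caused by missing data.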

Data access

All data on Y-chromosomal variation from the 1000 Genomes Project phase 3 are freely available: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/. PSV read coverage is reported in Table S3.