Copy number variation arising from gene conversion on the human Y chromosome
We describe the variation in copy number of a ~ 10 kb region overlapping the long intergenic noncoding RNA (lincRNA) gene, TTTY22, within the IR3 inverted repeat on the short arm of the human Y chromosome, leading to individuals with 0–3 copies of this region in the general population. Variation of this CNV is common, with 266 individuals having 0 copies, 943 (including the reference sequence) having 1, 23 having 2 copies, and two having 3 copies, and was validated by breakpoint PCR, fibre-FISH, and 10× Genomics Chromium linked-read sequencing in subsets of 1234 individuals from the 1000 Genomes Project. Mapping the changes in copy number to the phylogeny of these Y chromosomes previously established by the Project identified at least 20 mutational events, and investigation of flanking paralogous sequence variants showed that the mutations involved flanking sequences in 18 of these, and could extend over > 30 kb of DNA. While either gene conversion or double crossover between misaligned sister chromatids could formally explain the 0–2 copy events, gene conversion is the more likely mechanism, and these events include the longest non-allelic gene conversion reported thus far. Chromosomes with three copies of this CNV have arisen just once in our data set via another mechanism: duplication of 420 kb that places the third copy 230 kb proximal to the existing proximal copy. Our results establish gene conversion as a previously under-appreciated mechanism of generating copy number changes in humans and reveal the exceptionally large size of the conversion events that can occur.
Copy number variation (CNV, which also refers to copy number variants) is well-documented in the human genome, affecting more nucleotides than are affected by SNP variation and contributing abundantly to phenotypic diversity (Lupski 2015; Sudmant et al. 2015a, b). CNVs arise by several mechanisms, including non-allelic homologous recombination, non-homologous end joining, microhomology-mediated break-induced replication, and retroelement insertions (Hastings et al. 2009; Carvalho and Lupski 2016). Gene conversion, the non-reciprocal transfer of genetic information from one locus or allele to another, is also well-documented in the human genome as a possible consequence of double-strand breaks and subsequent repair involving homologous regions, including the formation and resolution of a double Holliday junction during meiosis (Szostak et al. 1983). It generally involves short stretches of DNA converting between alleles or nearby paralogs, requires high sequence similarity, and occurs less frequently between different chromosomes. Allelic conversion tract lengths of up to 22 kb are known (Wang et al. 2012), while lengths for non-allelic events are shorter, although > 9 kb has been reported (Chen et al. 2007; Hallast et al. 2013). The two processes of CNV and gene conversion are generally considered quite distinct, but here we describe a CNV whose origin is best explained by gene conversion events, linking the two processes.
The male-specific region of the Y chromosome offers unique opportunities for investigating both CNV and gene conversion, because (1) it is particularly tolerant of genetic variation (Poznik et al. 2016), so a wide variety of variants persist in the population and (2) the lack of recombination between different Y lineages allows the history of variants to be identified from the phylogeny (Jobling 2008; Jobling and Tyler-Smith 2017; Massaia and Xue 2017; Trombetta and Cruciani 2017). We have previously described the genetic variation in a set of 1244 diverse worldwide Y chromosomes sequenced in phase 3 of the 1000 Genomes Project, including the identification and validation of 110 CNVs (Poznik et al. 2016). The current study presents a detailed analysis of one of these, dbVar ID esv3818053 [chrY: 9,640,466–9,653,590 GRCh37 (hg19); chrY: 9,802,857–9,815,981, GRCh38 (hg38)], here designated by the more descriptive name TTTY22-CNV, because it overlaps with ~ 85% of this lincRNA gene and changes the number of its functional copies.
Results and discussion
Validation of TTTY22-CNV
Population variation and phylogeny of TTTY22-CNV
A gene conversion mechanism for most TTTY22-CNV variation
Examination of the PSV profiles adjacent to the TTTY22-CNV thus provided the minimum and maximum size estimates for the genomic region accompanying the CNV change, as shown in Fig. 3. We then mapped the CNV changes onto the phylogeny constructed using the SNPs in the same sample set (Poznik et al. 2016). We identified 12 different mutation sizes, which could be resolved into 20 mutation events using the phylogeny, since events with indistinguishable sizes sometimes occurred on independent branches of the phylogeny. The largest mutation had a minimum size of > 32 kb.
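The minimum and maximum size estimates mentioned above can be illustrated with a minimal sketch. It assumes that the converted PSVs form one contiguous run, that the minimum size is the span between the outermost converted PSVs, and that the maximum size is the span lying strictly between the nearest unconverted PSVs on either side. The function name and the coordinates in the usage note are hypothetical, and the sketch does not account for the CNV insert itself.

```python
from typing import List, Optional, Tuple


def tract_size_bounds(
    positions: List[int], converted: List[bool]
) -> Tuple[int, Optional[int]]:
    """Return (min_size, max_size) in bp for one event.

    positions: PSV coordinates sorted in ascending order (hypothetical input).
    converted: True where the PSV shows the converted state.
    max_size is None when the converted run reaches the first or last PSV,
    because no flanking unconverted PSV bounds it on that side.
    """
    if len(positions) != len(converted):
        raise ValueError("positions and converted must have the same length")
    idx = [i for i, flag in enumerate(converted) if flag]
    if not idx:
        raise ValueError("no converted PSV found")
    first, last = idx[0], idx[-1]
    # Minimum: outermost converted PSVs, inclusive of both ends.
    min_size = positions[last] - positions[first] + 1
    if first == 0 or last == len(positions) - 1:
        return min_size, None
    # Maximum: everything strictly between the flanking unconverted PSVs.
    max_size = positions[last + 1] - positions[first - 1] - 1
    return min_size, max_size
```

With PSVs at the made-up positions 100, 250, 400, 900, and 1500, of which the middle three are converted, the call `tract_size_bounds([100, 250, 400, 900, 1500], [False, True, True, True, False])` gives a minimum of 651 bp and a maximum of 1399 bp.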
Since gene conversion and double crossover produce indistinguishable structures (Chen et al. 2007), additional factors have to be considered to distinguish between the two possibilities. It is generally reasoned that crossovers are rare, so the chance of two occurring in close proximity is low; structures resulting from exchange of information over lengths of 10 kb or less, which have been abundantly documented on the Y chromosome (Rozen et al. 2003; Bosch et al. 2004; Hurles et al. 2004; Hallast et al. 2013), have therefore been interpreted as resulting from gene conversion. It is notable that gene conversion events of up to 9 kb have been reported on the Y chromosome (Hallast et al. 2013), exceeding the maximum size of ~ 4 kb on other chromosomes (Dumont and Eichler 2013; Trombetta and Cruciani 2017). Nevertheless, the large size of some of the events identified here suggests that reconsideration of the possible involvement of double crossovers is merited. We know of no reported measurement of the frequencies of double crossovers on the human Y chromosome. However, some information about the frequencies of single crossovers is available. Single crossovers between sister chromatids would result in isodicentric or acentric chromosomes, which are both very rare (and evolutionarily lethal), but intra-chromatid single crossovers result in inversions, which are unlikely to affect the phenotype. Such an inversion between IR3proximal and IR3distal has been reported (Williams 1998; Repping et al. 2006) and its occurrences have been inferred within the Y-chromosomal phylogeny (Repping et al. 2006). Although the inversion is recurrent, only 12 changes in orientation have so far been identified: Repping et al. counted 12 inversions in the phylogeny they sampled and inferred a mutation rate of ≥ 2.3 × 10⁻⁴ per generation.
This phylogeny consisted of a slightly different set of haplogroups from the current study, so we constructed and examined a consensus phylogeny based on the subset of haplogroups in common between the two studies. On this consensus phylogeny, there were eight inversion events and 13 TTTY22-CNV copy number changes (Fig. 2). Thus, TTTY22-CNV mutations are more common than single intra-chromatid crossovers. Double crossover events close together should be considerably rarer than single crossovers. The known breakpoints of single crossovers are also located in a different region of IR3 (Turner et al. 2006, 2008), suggesting that the known crossovers in IR3 are both structurally independent from, and less frequent than, changes to TTTY22-CNV copy number. We, therefore, conclude that double crossovers do not provide a plausible mechanism for TTTY22-CNV copy number change, while gene conversion does, as noted for other Y loci (Hallast et al. 2013).
A distinct and more standard mechanism generating three copies of TTTY22-CNV
Implications of TTTY22-CNV variation
Men carrying non-reference copy numbers of TTTY22-CNV are likely to have the corresponding numbers of copies of the functional TTTY22 gene, because the ~ 15% of this gene that lies outside the CNV is provided by the flanking sequence. The high frequency of these variants, and the large numbers of men who sometimes share the same mutational event, demonstrate that copy numbers 0–3 allow male lineage expansion and are unlikely to be detrimental. This sharing is most marked in an E1b sub-lineage that carries 0 copies, is very common in sub-Saharan Africa, and is represented 260 times in the current data set (Poznik et al. 2016). Given the long-term persistence of lineages with 0–3 copies of TTTY22-CNV in the population, and the extreme drift experienced by the Y chromosome, this variation is compatible with evolutionary neutrality. Nevertheless, the location of TTTY22 in the CNV suggests that the abundance of this transcript may vary between men. The gene is transcribed primarily in testis, with a lower level in brain (The GTEx Consortium 2017), and further work is needed to investigate whether or not subtle phenotypic consequences of variation in transcript levels can be detected. Men with three copies of TTTY22-CNV carry an additional large duplication of proximal Yp sequences and have, in addition, three copies of TTTY23, compared with two copies (one in each IR3 repeat) in other men. This duplication has also spread in the population, so also seems unlikely to be detrimental.
Overall, our findings highlight gene conversion as an additional mechanism for generating CNV in the human genome, especially on the Y chromosome, that has previously received little attention. The link between these two processes should be further explored and considered when either is investigated.
We used the existing Y-chromosomal sequence data and CNV calls from phase 3 of the 1000 Genomes Project, where an initial CNV call set had been made using Genome STRiP and calls validated using array-CGH (Poznik et al. 2016). We also generated new 10× Genomics Chromium libraries from two of these samples (HG01097 and NA18953) following the manufacturer’s instructions (https://www.10xgenomics.com/genome/) followed by sequencing on the Illumina HiSeq X platform with 150 bp paired-end reads to a depth of ~ 30×. The sequence data were processed using the LongRanger 1.0 software using the reference sequence GRCh37 and viewed using the Loupe 2.1.0 software from 10x Genomics. All coordinates in this paper are based on GRCh37 to make them compatible with the initial CNV calls (Poznik et al. 2016).
We designed two sets of primers to validate the presence or absence of the 10 kb insert of TTTY22-CNV (dbVar description: https://www.ncbi.nlm.nih.gov/dbvar/variants/esv3818053/#VariantGenome), one set spanning the breakpoint of the empty site and the other lying within the 10 kb insert region. Primers, PCR conditions, and predicted product sizes are described in Table S1.
Molecular combing fibre-FISH experiments were carried out as described previously (Poznik et al. 2016). Probes consisted of two BAC clones (RP11-117N22 and RP11-453C1, obtained from the clone archive resource of Wellcome Trust Sanger Institute) used to identify the genomic region of interest, as well as two custom PCR probes of ~ 5 kb each lying mainly within the insert of TTTY22-CNV (details also in Table S1), which were combined to produce a single probe to distinguish the presence of the insert from its absence. We validated the CNV calls using both PCR and fibre-FISH for the same four samples: NA19146 with 0, HG00096 with 1, NA18953 with 2, and NA19661 with 3 copies. In addition, we applied 10× Genomics Chromium to two samples: HG01097 with 1 and NA18953 with 2 copies of TTTY22-CNV.
Sequence and phylogenetic analyses
We defined blocks A and B within IR3 (https://genome.ucsc.edu/) as being separated by TTTY22-CNV, and measured their similarity between the two copies of IR3 as 99.65% for the ~ 170 kb of block A and 99.13% for the ~ 101 kb of block B.
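As an illustration of how such a similarity figure can be obtained from a pairwise alignment, the following minimal sketch computes percent identity over the aligned columns. The treatment of gaps is an assumption: columns containing a gap in either sequence are excluded, which may differ from the procedure actually used here. The function name and the example sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two aligned sequences of equal length.

    Gap columns ('-' in either sequence) are skipped; this is an assumed
    convention, not one stated in the text.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # ignore columns with a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared
```

For the toy alignment `ACGTACGT-A` against `ACGTTCGTAA`, nine ungapped columns are compared and eight match, giving about 88.9%. The mismatching columns in such an alignment of the two IR3 copies correspond to candidate PSVs.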
We placed the TTTY22-CNV copy number of each sample onto the full Y-chromosomal phylogeny based on SNPs (Poznik et al. 2016) to infer the number of mutational events (deletion or duplication). To understand the relation between the TTTY22-CNV and the IR3 inversion events (Repping et al. 2006), we placed both TTTY22-CNV copy number changes and IR3 inversion events onto a simplified phylogeny as described in the main text and compared their phylogenetic locations.
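One standard way to count the minimum number of state changes needed to explain tip states on a fixed tree is Fitch parsimony. The text does not state which procedure was used, so the following is only a sketch of that general idea, assuming a strictly bifurcating tree written as nested two-element tuples. The function name, tree, and sample labels are hypothetical.

```python
from typing import Dict, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_changes(tree: Tree, states: Dict[str, int]) -> int:
    """Minimum number of copy-number changes on a bifurcating tree (Fitch).

    tree: a leaf name (str) or a 2-tuple of subtrees.
    states: observed copy number for every leaf name.
    """
    changes = 0

    def visit(node: Tree) -> Set[int]:
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common  # children agree on at least one state
        changes += 1       # disagreement implies one change on this subtree
        return left | right

    visit(tree)
    return changes
```

For the made-up tree `((("s1", "s2"), "s3"), ("s4", "s5"))` with copy numbers 1, 0, 1, 1, and 2 for s1 to s5, the function returns 2, corresponding to one loss and one gain on separate branches.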
To investigate gene conversion events, we first identified PSVs (Table S3) between the two copies of IR3 from the reference sequence after aligning them using Basic Local Alignment Search Tool (https://blast.ncbi.nlm.nih.gov/Blast.cgi), and then counted the number of reads covering each allele of each PSV from the BAM files for each individual in the 1000 Genomes Project phase 3 data using a custom Perl script. To allow for the low sequence coverage, we established the following protocol. When six or more reads covering the same PSV were found in a sample, the chance of both IR3 copies being represented was ~ 97%, and so a ≥ 6:0 ratio of PSV alleles was taken to indicate that both IR3 copies carried the same allele, implying that a gene conversion (or double crossover) event had occurred. The length of such mutational events was inferred by extending the analysis to PSVs adjacent to the ≥ 6:0 seed, accepting ≥ 1:0 counts of the same allele as resulting from the same event.
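The ~ 97% figure is consistent with a simple sampling argument: if each read is assumed to come independently and with equal probability from either IR3 copy, the chance that n reads all derive from one copy is 2 × (1/2)^n, which is 1/32 (about 3%) for n = 6. The sketch below reproduces this arithmetic and the seed-and-extend rule described above. It is not the custom Perl script; the function names, the labelling of alleles by IR3 copy of origin in the reference, and the example counts are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def prob_both_copies_sampled(n_reads: int) -> float:
    """Chance that n reads include at least one read from each of two copies,
    assuming independent, equally likely sampling of the two copies."""
    if n_reads < 1:
        return 0.0
    return 1.0 - 2.0 * 0.5 ** n_reads


def fixed_allele(prox: int, dist: int) -> Optional[str]:
    """Return 'prox' or 'dist' if only that allele has reads, else None."""
    if prox > 0 and dist == 0:
        return "prox"
    if dist > 0 and prox == 0:
        return "dist"
    return None


def converted_run(counts: List[Tuple[int, int]], seed_min: int = 6) -> List[int]:
    """Indices of PSVs assigned to one event, or [] if no seed is found.

    counts: (reads with proximal-copy allele, reads with distal-copy allele)
    for each PSV, ordered by position. A seed needs >= seed_min reads, all of
    one allele; neighbours need >= 1 read, all of that same allele.
    """
    for i, (prox, dist) in enumerate(counts):
        allele = fixed_allele(prox, dist)
        if allele is None or prox + dist < seed_min:
            continue
        lo = hi = i
        while lo > 0 and fixed_allele(*counts[lo - 1]) == allele:
            lo -= 1
        while hi < len(counts) - 1 and fixed_allele(*counts[hi + 1]) == allele:
            hi += 1
        return list(range(lo, hi + 1))
    return []
```

Here `prob_both_copies_sampled(6)` evaluates to 0.96875. For the invented counts `[(3, 2), (2, 0), (7, 0), (1, 0), (0, 0), (4, 3)]`, the seed is the third PSV (7:0) and the run extends to its neighbours with 2:0 and 1:0, stopping at the PSV with no reads.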
All data on Y-chromosomal variation from the 1000 Genomes Project phase 3 are freely available: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/. PSV read coverage is reported in Table S3.
We thank Ed Hollox for comments. Our work was supported by The Wellcome Trust (098051); W.S. was also supported by the State Scholarship fund (No. 201606940004) of the China Scholarship Council, and a National Natural Science Foundation of China grant (No. 31201029); and P.H. was supported by Estonian Research Council Grant PUT1036.
Compliance with ethical standards
Cell lines and DNA samples were obtained from the Coriell Institute for Medical Research (Camden, New Jersey, USA) following their ethical procedures (https://catalog.coriell.org/1/Support/FirstOrder).
Conflict of interest
The authors declare that they have no conflict of interest.
- Hallast P, Balaresque P, Bowden GR, Ballereau S, Jobling MA (2013) Recombination dynamics of a human Y-chromosomal palindrome: rapid GC-biased gene conversion, multi-kilobase conversion tracts, and rare inversions. PLoS Genet 9:e1003666. https://doi.org/10.1371/journal.pgen.1003666
- Poznik GD, Xue Y, Mendez FL, Willems TF, Massaia A, Wilson Sayres MA, Ayub Q, McCarthy SA, Narechania A, Kashin S, Chen Y, Banerjee R, Rodriguez-Flores JL, Cerezo M, Shao H, Gymrek M, Malhotra A, Louzada S, Desalle R, Ritchie GR, Cerveira E, Fitzgerald TW, Garrison E, Marcketta A, Mittelman D, Romanovitch M, Zhang C, Zheng-Bradley X, Abecasis GR, McCarroll SA, Flicek P, Underhill PA, Coin L, Zerbino DR, Yang F, Lee C, Clarke L, Auton A, Erlich Y, Handsaker RE, Genomes Project C, Bustamante CD, Tyler-Smith C (2016) Punctuated bursts in human male demography inferred from 1244 worldwide Y-chromosome sequences. Nat Genet 48:593–599. https://doi.org/10.1038/ng.3559
- Repping S, van Daalen SK, Brown LG, Korver CM, Lange J, Marszalek JD, Pyntikova T, van der Veen F, Skaletsky H, Page DC, Rozen S (2006) High mutation rates have driven extensive structural polymorphism among human Y chromosomes. Nat Genet 38:463–467. https://doi.org/10.1038/ng1754
- Skaletsky H, Kuroda-Kawaguchi T, Minx PJ, Cordum HS, Hillier L, Brown LG, Repping S, Pyntikova T, Ali J, Bieri T, Chinwalla A, Delehaunty A, Delehaunty K, Du H, Fewell G, Fulton L, Fulton R, Graves T, Hou SF, Latrielle P, Leonard S, Mardis E, Maupin R, McPherson J, Miner T, Nash W, Nguyen C, Ozersky P, Pepin K, Rock S, Rohlfing T, Scott K, Schultz B, Strong C, Tin-Wollam A, Yang SP, Waterston RH, Wilson RK, Rozen S, Page DC (2003) The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423:825–837. https://doi.org/10.1038/nature01722
- Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, Jun G, Hsi-Yang Fritz M, Konkel MK, Malhotra A, Stutz AM, Shi X, Paolo Casale F, Chen J, Hormozdiari F, Dayama G, Chen K, Malig M, Chaisson MJ, Walter K, Meiers S, Kashin S, Garrison E, Auton A, Lam HY, Jasmine Mu X, Alkan C, Antaki D, Bae T, Cerveira E, Chines P, Chong Z, Clarke L, Dal E, Ding L, Emery S, Fan X, Gujral M, Kahveci F, Kidd JM, Kong Y, Lameijer EW, McCarthy S, Flicek P, Gibbs RA, Marth G, Mason CE, Menelaou A, Muzny DM, Nelson BJ, Noor A, Parrish NF, Pendleton M, Quitadamo A, Raeder B, Schadt EE, Romanovitch M, Schlattl A, Sebra R, Shabalin AA, Untergasser A, Walker JA, Wang M, Yu F, Zhang C, Zhang J, Zheng-Bradley X, Zhou W, Zichner T, Sebat J, Batzer MA, McCarroll SA, Genomes Project C; Mills RE, Gerstein MB, Bashir A, Stegle O, Devine SE, Lee C, Eichler EE, Korbel JO (2015) An integrated map of structural variation in 2504 human genomes. Nature 526:75–81. https://doi.org/10.1038/nature15394
- Sudmant PH, Mallick S, Nelson BJ, Hormozdiari F, Krumm N, Huddleston J, Coe BP, Baker C, Nordenfelt S, Bamshad M, Jorde LB, Posukh OL, Sahakyan H, Watkins WS, Yepiskoposyan L, Abdullah MS, Bravi CM, Capelli C, Hervig T, Wee JT, Tyler-Smith C, van Driem G, Romero IG, Jha AR, Karachanak-Yankova S, Toncheva D, Comas D, Henn B, Kivisild T, Ruiz-Linares A, Sajantila A, Metspalu E, Parik J, Villems R, Starikovskaya EB, Ayodo G, Beall CM, Di Rienzo A, Hammer MF, Khusainova R, Khusnutdinova E, Klitz W, Winkler C, Labuda D, Metspalu M, Tishkoff SA, Dryomov S, Sukernik R, Patterson N, Reich D, Eichler EE (2015) Global diversity, population stratification, and selection of human copy-number variation. Science 349:aab3761. https://doi.org/10.1126/science.aab3761
- Williams G (1998) Mapping studies of the centromeric region of the human Y chromosome. D.Phil. Thesis, Oxford University
- Zheng GX, Lau BT, Schnall-Levin M, Jarosz M, Bell JM, Hindson CM, Kyriazopoulou-Panagiotopoulou S, Masquelier DA, Merrill L, Terry JM, Mudivarti PA, Wyatt PW, Bharadwaj R, Makarewicz AJ, Li Y, Belgrader P, Price AD, Lowe AJ, Marks P, Vurens GM, Hardenbol P, Montesclaros L, Luo M, Greenfield L, Wong A, Birch DE, Short SW, Bjornson KP, Patel P, Hopmans ES, Wood C, Kaur S, Lockwood GK, Stafford D, Delaney JP, Wu I, Ordonez HS, Grimes SM, Greer S, Lee JY, Belhocine K, Giorda KM, Heaton WH, McDermott GP, Bent ZW, Meschi F, Kondov NO, Wilson R, Bernate JA, Gauby S, Kindwall A, Bermejo C, Fehr AN, Chan A, Saxonov S, Ness KD, Hindson BJ, Ji HP (2016) Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol 34:303–311. https://doi.org/10.1038/nbt.3432
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.