DNA sequence features underlying large-scale duplications and deletions in human

Copy number variants (CNVs) may cover up to 12% of the whole genome and have substantial impact on phenotypes. We used 5867 duplications and 33,181 deletions available from the 1000 Genomes Project to characterise genomic regions vulnerable to CNV formation and to identify sequence features characteristic for those regions. The GC content for deletions was lower and for duplications was higher than for randomly selected regions. In regions flanking deletions and downstream of duplications, content was higher than in the random sequences, but upstream of duplication content was lower. In duplications and downstream of deletion regions, the percentage of low-complexity sequences was not different from the randomised data. In deletions and upstream of CNVs, it was higher, while for downstream of duplications, it was lower as compared to random sequences. The majority of CNVs intersected with genic regions — mainly with introns. GC content may be associated with CNV formation and CNVs, especially duplications are initiated in low-complexity regions. Moreover, CNVs located or overlapped with introns indicate their role in shaping intron variability. Genic CNV regions were enriched in many essential biological processes such as cell adhesion, synaptic transmission, transport, cytoskeleton organization, immune response and metabolic mechanisms, which indicates that these large-scaled variants play important biological roles. Supplementary Information The online version contains supplementary material available at 10.1007/s13353-022-00704-0.


Introduction
The 1000 Genomes Project, finished in 2015, resulted in 2504 sequenced genomes of individuals representing 26 populations as well as in the identification of over 88 million of polymorphisms (1000Genomes Project Consortium et al. 2015. The study found out that an individual human genome differs from the reference genome at 4-5 million sites. The most common type of polymorphisms is single nucleotide polymorphisms (SNPs) -about 84.7 million. Copy number variations (CNVs) defined as deletions and duplications longer than 50 bp are less common that SNPs, but because of their length, they constitute up to 12% of the human genome (Redon et al. 2006). It is known that CNVs are not randomly distributed in eukaryotic genomes, but the biological mechanism of their genomic distribution is not fully understood (Nguyen et al. 2006;Makino et al. 2013). Certainly, there exist a considerable variation in CNV breakpoint location among individuals from the same species, as demonstrated, e.g. by Nicholas et al. (2009) for individuals representing several domestic dog (Canis familiaris) breeds. DNA sequence composition is one of the factors triggering the formation of CNV. Repeats of A/T nucleotides and sequences promoting the formation of hairpin structures were observed to mark CNV breakpoints in Plasmodium falciparum (Huckaby et al. 2019). Conversely, in mammals (domestic dogs), CNV breakpoints were enriched in G and C nucleotides (Berglund et al. 2012).
The aim of this study was to characterise DNA structure in regions of human genome that are susceptible to structural duplications or deletions. We searched for DNA sequence features promoting the formation of CNVs and the patterns of functional annotations of such deleted and duplicated regions.

Dataset
The human reference genome GRCh38 was downloaded from the National Center for Biotechnology Information database (NCBI Resource Coordinators 2018). Polymorphisms, including CNVs, were obtained within the frame of the 3rd phase of 1000 Genomes Project and are available from the European Bioinformatics Institute (https:// www. ebi. ac. uk) under the ID: estd214. Primary data resulted from oligonucleotide genotyping, whole genome and exome sequencing. Nine software packages were used to identify large-scale genomic variants including Breakdancer (Chen et al. 2009), Delly (Rausch et al. 2012), Variation Hunter (Hormozdiari et al. 2010), CNVnator (Abyzov et al. 2011), ReadDepth (Miller et al. 2011, Genome STRIP (Mills et al. 2011), Pindel (Ye et al. 2009), MELT (Gardner et al. 2017) and Dinumt (Dayama et al. 2014) and their call sets were merged. Selected variants were then validated using various methods, including microarrays, PCR-free whole genome sequencing and PacBio sequencing, as well as PCR. The estimated false discovery rate for CNVs was below 5% (1000Genomes Project Consortium et al. 2015. Since a combination of filtering, calling and validation methods is a recommended approach to obtain reliable large-scale variants (Butty et al. 2020;Gabrielaite et al. 2021), we considered the 1000 Genomes Project Consortium calls as a high confidence dataset. In our study, from all available high confidence variants (copy number variants, indels, insertions, inversions and mobile elements), only CNVs defined as duplications or deletions were extracted. Overlapping CNVs were considered independently, resulting in 5867 tandem duplications and 33,181 deletions. Length of duplications ranged between 3006 and 988,090 bp, with median of 37,036 bp and mean of 66,527 ± 91,091 bp. Length of deletions ranged between 204 bp and 2,258,238 bp, with median of 3774 bp and mean of 12,143 ± 34,749 bp ( Fig. 1).

Reference genome sequence features
The Samtools software (Li and Durbin 2009) was used to extract regions covered by CNVs from the GRCh38 reference genome. Moreover, coordinates of reference sequences flanking CNVs (100 nucleotides upstream and downstream of each deletion and duplication) were extracted. These regions were considered in the context of unknown nucleotides (denoted as "N"), Guanine-Cytosine pairs, sequence complexity and functional annotation. In order to compare regions covered by CNVs with random genomic sequences, we selected random region coordinates and extracted the regions from the reference genome, using the Samtools software. The process was repeated to match the actual numbers of analysed CNVs, so that four sets of random sequences were selected: (i) set 1 contained 5859 sequences of length equal to the median length of duplications (37,036 bp); (ii) set 2 contained 33,175 sequences of length equal to the median length of deletions (3774 bp); (iii) set 3 contained 5867 sequences of 100-bp length and was used for comparisons with sequences upstream and downstream of duplications; (iv) set 4 contained 33,181 sequences of 100-bp length and was used for comparison with sequences upstream and downstream of deletions. All sequences containing unknown nucleotides were excluded. The distributions of GC content were tested for normality using the Kolmogorov test. The H 0 stating that the distributions of GC content follow the normal distribution with mean and variance given by the considered data sets. The test statistics, which is defined as the supremum of difference between theoretical and empirical distribution, has the same distribution as the classical Kolmogorov statistics. Furthermore, the distributions of GC pair content of high confidence CNVrelated sequences were compared with the corresponding randomised sets, i.e. high confidence duplications and set 1, high confidence deletions and set 2, flanking regions of high confidence duplications and set 3, and flanking regions of high confidence deletions and set 4. It was done using the Wilcoxon-Mann-Whitney test, with H 0 stating that the distributions of GC content are equal. The normalised Wilcoxon-Mann-Whitney test statistic is given by: , S j denotes ranks corresponding to the GC pair percentage classes in the random sequences, n is a count of deletion/duplication/flanking CNV

Sequence complexity
Sequence complexity of the entire reference genome was estimated using the sDust software (Morgulis et al. 2006). The overlap between low-complexity regions defined by sDust and CNV-related regions was determined by using the bedtools software (Quinlan and Hall 2010) for high confidence CNVs and flanking regions, as well as for the random sets. The distributions of low-complexity sequence contents in CNV and flanking regions as well as in randomised data were compared the same way as GC pair content by using the Wilcoxon-Mann-Whitney test.

Functional annotation
The

Results
Reference genome sequence features.

Unknown nucleotide (N) content
Among all of the regions of the human reference genome GRCh38 (Schneider et al. 2017)

GC content
All sequences containing unknown nucleotides were excluded from the GC content analysis. The average content S5a, S5b, S6a and S6b), indicating no link between CNV breakpoint formation and the GC content. The distributions of GC pair contents of duplications (P = 0.004) and deletions (P = 7.955·10 −12 ) significantly differed from the contents of corresponding randomised sequences. In particular, high confidence deletions contained less GC pairs than random regions (P = 3.977·10 −12 ), while duplications were enriched in GC pairs as compared to a randomised set of sequences (P = 0.0024). High confidence deletion flanking regions contained more GC pairs than the corresponding randomised sequences, i.e. P = 1.5·10 −10 for upstream and P = 1.259·10 −21 for downstream regions. The same situation was observed for downstream region of duplications (P = 1.74·10 −9 ), but for upstream, it was lower than in random case (P = 0.014). graphical representation of Randomised duplications (Fig. S3b), deletions (Fig. S4b) and their flanking regions (Figs. S5c and S6c) GC content distributions are provided in the supplementary material.

Sequence complexity
A total of 4,798,406 low-complexity regions (LCRs) were identified within the whole GRCh38 reference genome. Lengths of those regions varied between 7 and 25,072 bp with mean of 29 bp (± 56). All duplications and 93.93% of deletions contained within LCRs. Median number of LCRs overlapped with a single duplication was 57 and with a single deletion was six (Fig. 2, Table 2). On average, LCRs made up 4.59% of a duplication length and 4.66% of a deletion length (Fig. 3, Table 2). On the other hand, CNV breakpoint regions contained much less LCRs. Only, 20.83% of sequences upstream of duplications and 16.52% of sequences downstream of duplications contained a low-complexity region(s). Similarly among deletion breakpoints, we identified 20.37% of upstream sequences and 19.25% of downstream sequences with LCR. Among them, on average, 4.73% of the length of regions upstream of duplications, 3.47% of the length of regions downstream of duplications, 4.44% of the length of regions located upstream of deletions and 4.16% of the length of regions downstream of deletions. In the random sequence set 1, 99.21% of sequences contained low-complexity regions, in set 2 -97.62%, in set 3 -18.12% and in set 4 -18.61%. None of the empirically constructed frequency distributions in the considered regions deviated from the normal distribution. The distributions of low-complexity sequence content in randomised duplications and in high confidence duplications (P = 0.106) as well as between randomised downstream deletions and regions downstream of deletions (P = 0.078) did not differ. The percentage of low-complexity sequences was significantly higher upstream of deletions (P = 1.907·10 −8 ), upstream of duplications (P = 8.982·10 −5 ) and within deletions (P = 2.963·10 −19 ) than in corresponding randomised upstream and downstream regions. Conversely, the distribution of low-complexity sequence content downstream of duplications was significantly lower than in set 3 (P = 0.007).

Discussion
Our study revealed a non-random distribution of GC pairs within CNVs and in CNV flanking regions. This could have been expected, following the hypothesis that the GC content serves as a tool for differentiation between intron (lower GC content) and exons (higher GC content) during splicing (Amit et al. 2012). In our study, we observed that the GC content of deletions was lower and of duplications -higher than in random genomic regions, what indicates that intronic regions are more prone to deletions, whereas exonic regions are more prone to duplications. However, a contradictory result was obtained for humans by Rigau et al. (2019) who observed that deleted regions had significantly higher GC content. In our study, the majority of deletions was annotated to introns what further supports the GC content imbalance (Aïssani and Bernardi 1991). Deletions in genic regions, containing more GC, are functionally more severe than duplications. Moreover, duplications which associate with GC-rich regions (i.e. exons) have some evolutionary advantage (Levasseur and Pontarotti 2011). It is also worth to notice that according to Dittwald et al. (2013), GC content is positively correlated with the frequency of nonallelic homologous recombination (NAHR) which is a common cause of CNV formation. According to Romiguier et al. (2010), GC-rich sequences are prone to deletions because base composition imbalance triggers replication slippage. On the other hand, Chen et al.
(2011) did not report a difference in GC content between CNV regions and autosomal average. Our study also demonstrated a non-random GC content in CNV flanking regions, albeit without a consistent trend, i.e. enriched GC content in deletion breakpoints, but only downstream of duplications. Similarly, Bose et al. (2014) investigated SNV breakpoints and concluded that all SV types had a higher GC percentage than the genome average. Also in terms of sequence complexity, a non-random pattern was revealed, with deletions being enriched with LCR, but without a consistent pattern in breakpoint regions. Barski et al. (2019) investigated sequence complexity in regions flanking CNV in Bos taurus. The study concluded that duplications and deletions preferentially form in regions of low complexity. CNVs also appear to be enriched in regions of low mappability, as well as within satellites and Short Tandem Repeats (Nguyen et al. 2006;Monlong et al. 2018), all of those characterised by low complexity. Chen et al. (2014) postulated that low-copy and high-copy repeats can induce DNA instability, resulting in errors in replication and repair mechanisms and consequently leading to the formation of CNVs.
Functional annotation revealed that majority of CNVs were located in introns. Similar observation was made by Chen et al. (2011) for population-specific CNVs. Higher  gene density in regions covered by CNVs than in random genome regions was also highlighted by Johansson and Feuk (2011). Moreover, Nguyen et al. (2006) reasoned that large-scale DNA changes, if beneficial, they should be enriched in genes, especially those involved in fighting infection and sensing our environment. According to Rigau et al. (2019), intronic deletions are the most frequent CNVs in protein-coding genes in humans, while deletions overlapping exons are less frequent than expected by chance. Therefore, it was also suggested that intronic CNVs contribute to the variability of gene expression and splicing in human populations. The homophilic cell adhesion identified as an ontology over-represented in deletions in our study was also reported for genes with somatic duplications in placenta by Kasak et al. (2015). Moreover, Morello et al. (2019) observed that synaptic transmission, an ontology over-represented in deletions in our study, was the most highly enriched term in CNVdriven differentially expressed genes in a sporadic form of amyotrophic lateral sclerosis. Involvement of CNVs in immune response mechanisms has already been reported by Perry et al. (2008) and, the same as in our study, genes with immune response functions were overrepresented in human CNV regions (Redon et al. 2006). Deleted genes were significantly overrepresented in GO terms related to transport, cellular component organization and regulation of cellular process, which indicates that deletions significantly affect essential cellular mechanisms (Alloza et al. 2011). Duplicated genes were enriched in the Cadherin signalling pathway, which is involved in multiple biological processes, such as development, neurogenesis, cell adhesion and inflammation. Its enrichment has been reported in the context of many diseases including cancer (Mi et al. 2019). In conclusions, genomic regions containing large-scale duplications and deletions, called copy number variants (CNVs), constitute a common source of genetic variation. In this study, we analysed duplications and deletions identified within the frame of the 1000 Genomes Project, in the context of identification of the unique DNA sequence features in CNV regions and of annotation of CNVs to functional segments of the human genome. We discovered that (i) guanine-cytosine content is associated with the formation of CNVs; (ii) duplications are initiated in low-complexity regions and (iii) CNVs are preferentially located within introns. Our findings provide a step towards more complete understanding of the human genomic landscape in the context of copy number variants.