Background

Mosaicism is non-integer CNV resulting from a mixture of deleted and diploid or duplicated and diploid cells. Mosaic CNV creation mechanisms include: chromosome nondisjunction, anaphase lag, and endoreplication. Mosaicism was first studied in fruit flies by Alfred Sturtevant and Curt Stern demonstrating mitotic recombination. “Somatic mosaicism” terminology was used by C. W. Cotterman in his seminal paper about antigenic variation [1]. Mosaic CNV detection is important in clinical settings for accurate assessment and estimate of disease recurrence risk [2,3,4,5].

Almost all CNV detection algorithms interpret a splitting of genotypes into allelic imbalance to mean duplication. However, splitting of genotypes into allelic imbalance clusters means duplication only when paired with a gain in intensity, otherwise the event is actually a mosaic deletion. In this way, mosaicism is often incorrectly classified or missed entirely by conventional CNV detection algorithms.

Mosaicism calls from publically accessible programs, such as R-GADA-MAD, BAFSegmentation, Mocha, and triPOD have provided insight into the prevalence and impact of mosaic CNVs [6,7,8]. However, we were motivated to create a computationally efficient and easy to perform algorithm, particularly at the scale needed for large projects. The goal of the algorithm was to deliver results in a high-throughput way with few spurious calls and to successfully detect mosaic events multiple megabases in size. Previous studies have been limited by small sample size to achieve accurate frequency estimates ([9, 10], Bonnefond et al. 2013, Schick et al. 2013, [11, 12], and Rodriguez-Santiago et al. 2010).

Mosaic Alteration Detection (MAD) is among the most commonly used computational tool to identify mosaic events using both B-allele frequency (BAF) and log R ratio (LRR) values from SNP array data. The MAD method first performs a segmentation procedure using the GADA algorithm (Pique-Regi R, Caceres A, Gonzalez JR. BMC Bioinformatics, 2010), and then searches for aberrant segments (mosaic regions) with B-deviations (Bdev) different from zero. Bdev occurs in mosaic regions when the locus has a mixture of genotypes from normal and mosaic tissue. The detected mosaic regions are further classified as copy-loss, copy-gain, or copy-neutral events based on the alteration of the LRR from baseline. The MAD method is implemented as a package in R. and the program is available at GitHub (https://github.com/isglobal-brge/MAD). The state of mosaic alterations is determined based on log2-ratio segment values together with the percentage of normal heterozygous (BAF ∼ 0.5) and homozygous (BAF ∼ 0 or 1) probes. While MAD has been successfully applied in several mosaicism projects (Forsberg et al. 2012, 2014 [11];), the program installation remains difficult requiring many undocumented dependencies including the R package devtools rendering it difficult to operate.

It’s important to be aware that inherited alleles at some loci may appear to affect the probability of somatic mutations, and at other loci they may constitute objects of positive or negative clonal selection. Several specific mosaic CNVs are strongly associated with future risk of hematological malignancies [8]. Loh et al. evaluated blood-derived DNA from 151,202 UK Biobank participants genotyped with Affymetrix arrays, finding 8342 mosaic CNVs ranging 50 kb–249 Mb by using phase-based programming (false discovery rate ≈7.5%). Mosaic deletions were observed more frequently in males while mosaic duplications were observed more frequently in elderly and male samples and copy number neutral loss of heterozygosity (CNN-LOH) affected the sexes equally.

In addition to the large number of samples from the Center for Applied Genomics at Children’s Hospital of Philadelphia biobank presented here (Supplementary Table 1), we also explored the pediatric and adolescent age range (Supplementary Fig. 1), providing further insights into early mosaic detection possibilities. Taken together, our results reveal clonal expansions with a wide range of effects on human health.

Implementation

Due to current limitations and difficulties in available mosaic genomic events callers, we have developed a mosaic calling tool programmed in Perl and freely available through our GitHub webpage https://github.com/CAG-CNV/MONTAGE. We wrote the Perl code to be flexible in terms of column order and column inclusion in sample based input files. The minimum column requirement for input files is SNP Name, BAF, and LRR. SNP Name with associated chromosome and base pair position can be specified separately or combined in the input. Sorting by chromosome and position (if not done already) is the first step. This is the most run time consuming step at 25 s. If sorting is detected to be done already, only the mosaic CNV detection portion of the code runs taking 10 s per sample (Table 1).

Table 1 Performance Comparison of Mosaic CNV Detection Tools

Since Windows computers are needed to run GenomeStudio to load idats and export BAF/LRR signal files, we remove any Windows carriage returns. We remove SNPs that failed to UCSC Liftover or low call rate SNPs marked with Position REMOVE. If not sorted, sort by chromosome and position (for linear runtime). Remove position 0 SNPs. A bash awk statement embedded in the Perl code efficiently performs a sliding window of 1 MB with 1 MB increments to roughly assess potential regions of mosaicism. The algorithm monitors position modulus window, if diff< 0 then report window: CHIP REGION AB ABLow ABHigh AAorBB AvgLRR BAF_SD(0.1–0.9). To define deletion, we take q3 + 1.5 interquartile range (iqr) as BAF standard deviation (SD) threshold and q1–1.5 iqr as LRR average threshold. To define duplication, we take q3 + 1.5 iqr as BAF SD threshold and q3 + 1.5 iqr as LRR average threshold. PennCNV script clean_cnv merges fragmented mosaic CNVs in neighboring genomic intervals. We record the first and last base pair position of mosaic evidence in these intervals to provide specific breakpoints (Fig. 2).

Methods used include programming in Bash, Perl, and R.

Efficient and minimal dependency coding allows for rapid ease of deployment of the software.

The BAF ranges used are tallied in the following intervals: (0–0.1) (0.1–0.4) (0.4–0.6) (0.6–0.9) (0.9–1). We use average LRR for each 1 Mb window in comparison with genome-wide average LRR. This presents the key command in the script. We note that our current version of MONTAGE, runs at 35 s per sample in a sample independent manner vs. 10 s if the input BAF/LRR signal file is sorted by chromosome and position.

Data preprocessing

We dynamically assess the column header to determine the presence and order of SNP Name, Chromosome, Position, B Allele Freq, and Log R Ratio column data in the user provided input files. These inputs may be generated by exporting text files from graphical user interface on Windows: Illumina GenomeStudio or Affymetrix Genotyping Workbench. Alternatively, these files can be generated by exporting text files from command line on Linux: Illumina iaap or Affymetrix apt. The flexibility to include only SNP Name, B Allele Freq, and Log R Ratio column data is allowed to minimize the disk space footprint of input data files, provided a separate map file linking SNP Name to chromosome and position. SNP microarray, whole exome sequencing and whole genome sequencing data are all supported input data types based on normalization of read depth to generate LRR signal and clustering of allele depth to generate BAF.

Mosaicism detection

Using an ultra-efficient awk bash command, we are able to run an optimal non-overlapping sliding window algorithm to determine BAF in the mosaic deletion indicative ranges of (0.1–0.4) and (0.6–0.9) as well as the standard deviation of these observations to determine clarity (lack of noise) in the signal observed in a given sample. We implemented a sliding window approach to assess these BAF intervals for allelic imbalances and strong deviations from expected values. Average LRR values across the sliding window interval classify the mosaic CNV as a deletion or duplication relative to normal diploid copy number. High standard deviation of BAF (0.1,0.9) regions were prioritized (for those samples passing quality control with acceptably low genome-wide standard deviation of BAF).

Mosaicism breakpoint refinement

PennCNV (version 1.0.4) component script clean_cnv was used to combine segments in close proximity into one merged mosaic CNV call. Record the first and last base pair position of mosaic evidence in these intervals to provide specific breakpoints (Fig. 2).

Mosaicism algorithm differences

We do not perform phasing as done in MoCha to save on computational time. We do not require or use family information as done in triPOD. We focus our code in the extensible Perl and Bash programming languages as opposed to RGADA-MAD which is written in R. We use standard modern GitHub code tracking as opposed to BAFSegmentation which is on an institutional website.

Results

We assessed the performance of existing mosaic CNV detection algorithms (Table 1). We constructed a model reference of various levels of mosaicism (Fig. 1). We measured the allelic imbalance between proper heterozygous (AB) genotypes centered on 0.5 BAF versus those outside of this region (Fig. 2 and Supplementary Fig. 2). We evaluated mosaic events in 367,785 samples and found 3716 putative mosaic events in 2269 individuals with 2/3 of the raw mosaic calls being visually validated. In total, 187,096 mosaic CNV candidates were suggested by the first pass screening of our algorithm applied to approximately 350,000 SNP-array data sets. Next, we filtered out mosaic CNV candidates with overlapping homozygous deletion calls as detected by PennCNV, since the random noise in BAF for real homozygous deletions can give a false indication of aberrant BAF banding, leaving us with 126,020 mosaic CNV candidates in 43,781 samples. The MONTAGE algorithm no longer requires PennCNV homozygous deletion calls in order to minimize runtime using an approximation subroutine. There were 51,326 mosaic CNV candidates with at least one mosaic CNV candidate > = 3 Mb in genomic span. Finally, 19,090 mosaic CNV candidates had strictly one mosaic CNV candidate > = 3 Mb, suggesting high specificity of mosaic CNV detection in these samples (Fig. 3). Therefore, we set forth visualizing the underlying BAF and LRR profiles corresponding to these mosaic CNV candidates.

Fig. 1
figure 1

Example BAF and LRR Profiles Sampled from Various Levels of Mosaicism. Mosaic deletions have clearing of the 0.5 AB range and relatively equal banding of AAB and ABB range genotypes reflected by BAF. Less relative gain in LRR is observed in mosaic duplications, making them more difficult to detect with certainty. Data displayed is simulated to ensure consistent underlying data quality profiles for fair comparison of different mosaic copy number states

Fig. 2
figure 2

MONTAGE Algorithm Conceptual Flow Diagram. Code overview with key inputs and variables. First data is normalized to BAF and LRR values genome-wide for each sample. Then the number of deviating BAF values in a first pass sliding window are accounted to screen for potential mosaic events as an initial algorithmic step. The first quartile minus 1.5 interquartile range of the LRR paired with strong BAF deviation defines the calling threshold for mosaic deletion events. Fragmented windows meeting threshold are then merged together to form larger calls. The breakpoints are then refined based on the first and last specific base pair evidence of strongly deviating BAF within the merged window

Fig. 3
figure 3

MONTAGE Filtering to Arrive at Curated Callset. Filtering of the putative mosaic CNV calls and respective size of the curated callset at each step. PennCNV calls for homozygous deletions (copy number 0) on the same samples analyzed by MONTAGE were intersected with MONTAGE initial mosaic calls using bedtools software. Further visualization of BAF/LRR underlying potential mosaic CNV calls was conducted manually by a human expert reviewer (in the case only 1 mosaic call in the sample) or by DeepCNV algorithm (in the case 2 or more mosaic calls in the sample)

We verified ability to detect both high (Fig. 4) and low (Fig. 5) level mosaic CNV events with high sensitivity and specificity. We identified 273 putative mosaic CNV deletions in 76 out of 228 samples analyzed. Of those, 202 visually validated as true positive, confirming mosaic CNV deletions in 50 samples out of 228 samples. Median length of these CNVs was 8.5 Mb with average length of 24 Mb. The visually validated mosaic calls had at least 2 AB clearing ratio (equating to 15 AB(0.4–0.6 BAF), 16 low AB (0.1–0.4 BAF), and 16 high AB (0.6–0.9 BAF) observations in a 1 Mb genomic window. Approximately half of the mosaic calls had non-zero AB(0.4–0.6 BAF) signal indicating noise and or lower levels of mosaicism.

Fig. 4
figure 4

Higher Levels of Mosaicism Genome-wide LRR BAF plot representative for one Individual. Full deletion of chromosome 1p contrasted by duplication of 1q is shown along with mosaic deletions of high proportion of cells in the person’s sample on 2q, 8q, and 11p. We use alternating colors similar to a Manhattan plot for GWAS to represent the switch between chromosomes on a linear x axis

Fig. 5
figure 5

Lower Levels of Mosaicism Genome-wide LRR BAF plot representative for one Individual. Mosaic deletions of low proportion of cells in the person’s sample on 3p, 10q, 11q, and 14q

In order to validate the rest of the mosaic CNV candidates, we used a machine learning approach we developed called DeepCNV (in review). DeepCNV is based on a trained model of positive and negative mosaic CNV examples based on a human expert’s labeling. Using this model and images of LRR/BAF plots from PennCNV visualize_cnv which are standard and popularly used, probabilistic predictions of the mosaic CNV candidate being a true positive are output. This makes the prospect of visual validation much more tractable and reproducible with minimum generated bias.

We compared our observed mosaic CNV counts and frequencies to previous studies of mosaicism and found high concordance in genomic regions and their corresponding frequencies observed in populations (Fig. 6). In addition, when examined in the context of multiple disease phenotypes that these individuals harbored, several disease categories were associated with mosaic CNVs based on results generated using the ParseCNV software (Table 2). ParseCNV is a CNV GWAS tool [13]. Association p-values as low as 1E-39 were observed across phenotypes including: adhd, autism, autoimmune, cancer, congenital, healthy, and neurodegenerative [14,15,16]. Thirty three genomic loci were observed p < 5E-4 associated between mosaic events and human phenotypes. Interestingly, we observed potential protective association in healthy subjects and also may consider mosaic CNVs related to healthy subjects as negative controls [17, 18].

Fig. 6
figure 6

Genome-wide Count Mosaic CNVs Detected in Curated Callset in Comparison to Previous Studies. Blue line represents the current work from Montage mosaic CNV callset. Red line represents the Loh et al. mosaic CNV callset. Black line represents the other previously published mosaic CNV callsets. Previous includes: [9, 10], Bonnefond et al. 2013, Schick et al. 2013, [11, 12], and Rodriguez-Santiago et al. 2010. See Supplementary Dataset (glessner2020_hg19.txt, loh2018_hg19.txt, and prev_calls_hg19.txt)

Table 2 Top 5 Significant Results p < 5E-4 Per Disease Category

We compared calling on the same sample data for 755 individuals using MONTAGE, MoCha, and RGADA-MAD (Fig. 7). Using a majority voting scheme, many more mosaic CNV calls overlap MONTAGE than MoCha or RGADA-MAD. If we assume samples selected by two or more callers as true positives, then we have 1748 + 194 + 61 + 418 = 2421. Apparently MONTAGE has the highest sensitivity: 1–61/2421 = 0.975, followed by MoCha 1–194/2421 = 0.920 and RGADA-MAD 1–1748/2488 = 0.280. To compute specificity using this similar majority-vote approach, we need to know the size of the background in the background, namely, the number of samples that are considered as negatives by all three callers, which is not well defined. Alternatively, we can look at accuracy (true positive rate), which is generally a trade-off for sensitivity. Based on the Venn diagram, MoCha has the largest number of samples called by itself (3316). If we assumed the samples called by only one caller are false positives, then the accuracy (true positive rate) of MONTAGE is (1748 + 194 + 418)/ (1748 + 194 + 418 + 1236) = 0.656 vs. MoCha (1748 + 61 + 418)/(1748 + 61 + 418 + 3316) = 0.402 vs. RGADA-MAD (194 + 61 + 418)/ (194 + 61 + 418 + 1130) = 0.373. This is a better tradeoff for MONTAGE. Namely, sensitivity of 0.975 at false positive rate 0.344 (MONTAGE) vs. sensitivity of 0.920 at false positive rate 0.598 (Mocha). We note that this estimate is contingent on assumptions. While we acknowledge that these assumptions are imperfect, this analysis gives good evidence that our FDR is well-controlled. (We also note that while we cannot completely rule out the possibility that our FDR is higher than we estimated, the key results of our paper are robust to higher FDRs than estimated; e.g., we would only expect a higher-than-estimated FDR to weaken GWAS associations and decrease effect sizes.)

Fig. 7
figure 7

Venn Diagram Callsets Mosaic Positive Samples. We compared calling on the same sample data for 755 individuals using MONTAGE, MoCha, and RGADA-MAD. Sensitivity of 0.975 at false positive rate 0.344 (MONTAGE) vs. sensitivity of 0.920 at false positive rate 0.598 (MoCha) vs. sensitivity of 0.280 at false positive rate 0.627 (RGADA-MAD) based on 755 Illumina Human610-Quadv1 samples analyzed by all algorithms. The MONTAGE algorithm provides a competitive tool to be used for mosaic CNV detection

63% of the mosaic events identified were found in males compared to the 50% male percentage in the input dataset Fisher’s exact test (2-Tail) p = 1.554e-11 (Supplementary Table 2). Mosaic duplications were observed more in males and older individuals. Mosaic CNN-LOH were found to affect the sexes equally. The sample race as determined by principal components analysis is provided in Supplementary Table 3.

Discussion

Mosaic CNVs of intermediate states between integer copy number variation are important genetic/genomic events in both clinical and research settings. However, detection of these mosaic events has been limited to incidental findings from CNV algorithms designed for integer discrete copy numbers and not the continuous nature of mosaic CNVs. For example, the tripod algorithm requires parents for calling mosaic events therefore is limited. Moreover, in a recent study using Mocha [8], 8342 mosaic chromosomal alterations (mCAs) were reported in 7585 individuals ranging in length from 50 kb–249 Mb. These mCAs were obtained from blood-derived DNA samples from 151,202 UK Biobank participants aged 40–70 years using new phase-based computational techniques (estimated false discovery rate, 6–9%). However, as 5522 of these mosaics have negative LRR values, they should be considered to be deletions. MAD was notably used in The Cancer Genome Atlas (TCGA) mosaic CNV analysis [19]. Exome sequencing (~ 8000 samples) was used to compare 22 different cancer phenotypes with more than 6000 controls using a case–control study design and demonstrate that mosaic protein truncating variants in these genes are also associated with solid-tumor cancers.

In light of shortage of high performance tools, we designed a new mosaic CNV detection tool aimed at providing high sensitivity and specificity mosaic CNV detection and fast runtime. In comparison, we show that in certain circumstances other algorithms miss critical mosaic events while overcalling other false events.

Others have shown that mosaic CNVs are enriched in males [8]. Our analysis concurs with this, showing that 63% of the mosaic events identified were found in males.

This mosaic CNV detection work has implications in cancer, cell free fetal DNA, and aging [20, 21]. Tumor-normal heterogeneity can appear similarly to germline mosaic CNV. Therefore, cancer phenotyping records are important in conditioning the assessment of supposed mosaic CNV callsets. Cell free fetal DNA is another application that such mosaic CNV detection and association presented here could be of utility. Prenatal testing could be enhanced by deconvolution of the maternal and child CNV genotype profile. Aging accumulating CNVs has been investigated previously [11, 12]. In older age cohorts which have had more exposure to potential environmental hazards inducing CNVs in subsets of cells is another important longitudinal consideration [22, 23].

Conclusions

Mosaic CNVs represent an important class of variation in clinical genetic diagnosis that are often missed. To successfully diagnose mosaic CNVs, it’s important to develop targeted detection tools and systematically apply them to large cohorts to truly understand its relevance and frequency of mosaic CNVs in the general population. Here we demonstrate the utility of our fast scalable tool, MONTAGE, specifically designed for mosaic CNV detection. We envision MONTAGE being an integral part to include for future mosaic CNV detection and analysis.

Availability and requirements

- Project name: MONTAGE.

- Project home page: https://github.com/CAG-CNV/MONTAGE

- Operating system(s): Platform independent.

- Programming language: bash, perl.

- Other requirements: NA.

- License: GNU GPL.

- Any restrictions to use by non-academics: license needed.