Background

Chronic Lymphocytic Leukemia (CLL) is the most frequent lymphoproliferative disease in western countries, and it shows remarkable clinical heterogeneity [1]. Recently, some studies demonstrated a wealth of genomic and epigenomic differences that determine part of its clinical aggressivity [2], such as point mutations in NOTCH1, SF3B1, ATM, TP53 and POT1, and the absence of somatic hypermutation in the IGHV locus.

Inherited predisposition to the development of CLL has been addressed by various genome wide association studies (GWAS) during the last years. In this regard, dozens of common variants at genes such as BCL2, EOMES, CASP10 and POT1 have been associated with significant risk of CLL development [3,4,5]. Similarly, GWAS studies in other lymphoproliferative disorders such as follicular lymphoma and diffuse large B cell lymphoma have found evidence for the association of germline variants with overall survival (OS) and progression-free survival [6, 7]. Despite this evidence, analysis of CLL clinical evolution have been limited almost exclusively to acquired somatic events.

In this paper, we addressed for the first time to our knowledge the association of common genomic variants with time to treatment (TTT) and OS of CLL patients participating in the Spanish ICGC cohort. Our results suggest the existence of polymorphisms at some genes (e.g., PPP4R2 and MAP3K4) significantly associated with TTT. Moreover, we found significant associations with TTT and OS both at the gene and pathway-level, which could shed new light about CLL biology and its mechanisms of progression.

Methods

Data source

We applied for access to the International Cancer Genome Consortium (ICGC) CLL sequencing data [8] deposited in the European Genome-Phenome Database (EGA). The Data Access Committee approved access to this data under DACO-1040945. We downloaded exome-seq data from control non-tumoral samples from patients with CLL under the accession code EGAD00001001464.

Data preprocessing

Exome-seq data were previously aligned to the reference genome (GRCh37.75) using bwa [9] as described in Puente et al [10]. Briefly, 3 μg of genomic DNA were used for paired-end sequencing library construction, followed by enrichment in exomic sequences using the SureSelect Human All Exon 50 Mb v4 kit or the SureSelect Human All Exon 50 Mb + UTR kits (Agilent Technologies). Next, DNA was pulled down using magnetic beads with streptavidin, followed by 18 cycles of amplification. Sequencing was performed on an Illumina GAIIx or on a HiSeq2000 sequencer (2x76bp). Duplicate read removal, sorting and indexing was done using samtools [11]. Base quality score recalibration was made with BamUtil [12] using a logistic regression model.

Variant detection and filtering

Platypus2 [13] was run on genotyping mode. All dbSNP variants [14] were used as input for genotyping. We used the following specifications: “minVarFreq = 0.02”, “minReads = 2”, “maxReads = 8000”, “assemble = 1”, “minBaseQual = 20”, “trimSoftClipped = 1”, “minPosterior = 20”, “sbThreshold = 0.01”, “badReadsWindow = 15” and “badReadsThreshold = 15”, at least 10 reads covering a position and 2 reads covering a variant, a minimum genotype quality (GQ) of 20 Phred, genotype likelihood (GL) below − 3, maximum homopolymer run (HP) below 11, minimum variant quality adjusted per read depth (QD) above 2 and minimum median minimum base quality for bases around variant (MMLQ) above 10. Variants labeled by platypus as “HapScore”, “SC”, “strandBias” and “MQ” were discarded. Heterozygous loci with variant allele frequency (VAF) < 35% or > 70% were also discarded.

Sample filtering

We used principal component analysis (PCA) to detect outliers in our study cohort. Similarly, identity-by-descent (IBD) was used to discard all individuals with a degree of relatedness equivalent to third degree or higher. PCAs and IBD data were computed on a linkage disequilibrium (LD) pruned dataset (LD upper threshold of 0.2) using the Bioconductor [15] package SNPRelate [16]. Our final filtered dataset contained 426 cases. Among these, 253 were males and 173 were females. By IGHV status, there were 146 unmutated cases and 273 mutated cases; and by clinical staging, the data contained 47 monoclonal B-cell lymphocytosis (MBL), 332 Binet A, 37 Binet B and 8 Binet C cases. Information about clinical staging and IGHV mutation status was not available for 2 and 7 cases, respectively.

Regression analysis

Cox regression and assumption of proportional hazards was performed with the survival R package [17, 18]. Variables with p-value < 0.2 in a univariate model were selected as covariates for the GWAS. In the cases of TTT these were “donor sex”, “IGHV mutation status” and “Binet stage”; whilst in the case of overall survival we used “IGHV mutation status”, “Binet stage” and “donor age at diagnosis” as covariates. Three association models were computed: an additive model, a dominant model and a recessive model. P-values were adjusted using the Benjamini-Hochberg (BH) method.

Due to the heterogeneity of exome-seq coverage and quality metrics, many variables had incomplete data. We included in the analysis variables with at least 25% call rate, a minimum of 10 events (progression or death). A minimum allele frequency of 1% was selected as the lowest threshold. Furthermore, we only analyzed polymorphisms where Platypus called at least 10 minor alleles (additive model) or genotypes (dominant and recessive models).

Inflation values were estimated with the R package QQperm [19]. Briefly, a random distribution of p-values was created by randomly permuting phenotype variables. Then, the association p-values are compared with the null. This method doesn’t consider the null distribution to be distributed uniformly.

Gene-level analysis

VEGAS2 [20] was used to calculate LD-adjusted association p-values for TTT and OS. Briefly, VEGAS2 takes GWAS p-values, and then uses a simulation-based approach using information from population variant reference panels to adjust for LD effects. We used the 1000 Genomes phase 3 data from the iberian population in Spain as our reference population, since all patients of this cohort were of Spanish origin [21]. Only variants falling within the 5′ and 3′ coordinates of RefSeq genes were included. P-values were adjusted for multiple testing using the BH method.

Pathways and gene ontology (GO) analysis

GSEA4GWAS version 1.1 [22] was used for testing significant associations in pathways and biological process annotations. Our input pathways were “Canonical Pathways” and “Gene Ontology Biological Process”. Maximum distance was set to 20 Kb, and the major histocompatibility complex region was masked from the analysis. P-values were adjusted with the BH method.

Results

Genomic polymorphisms associated with treatment-free survival

We created three models to analyze variant association with time to first treatment: an additive, a dominant and a regressive model. PCA plots (Additional file 1: Figure S1) and lambda inflation values (lambda values of 1.06, 1.03 and 1.04, Fig. 1) revealed no significant inflation or population stratification. In the additive model we observed 6 polymorphisms associated with TTT (BH adjusted p-value < 0.05), and other 6 showed association with BH adjusted p-value in the range of 0.05–0.1 (Fig. 2, Table 1, Additional file 9 Table S1). These variants were located in MAP3K4, PEX26 (4 variants), PPP4R2 (2 variants), TTLL12/TSPO, TXNRD2, ZCCHC7, MKI67IP and MARCH10. Notably, rs537453728 at MAP3K4 broke the genome-wide association p-value (4.53 × 10− 8). Other 391 variants were suggestively associated with TTT (BH adjusted p-value 0.1–0.5).

Fig. 1
figure 1

QQplots for the TTT and OS additive, dominant and recessive models

Fig. 2
figure 2

Manhattan plot of the additive TTT model results

Table 1 Results of the additive, dominant and recessive TTT models. Polymorphisms with a BH-adjusted P-value < 0.1 are show

The dominant and the recessive models evidenced very significant enrichment of variants at PPP4R2 (lowest p-value at rs7620924: 8.07 × 10− 11). Variants at GPR98, MAP3K4 and TTLL12 were also significantly associated with TTT (BH adjusted p-value < 0.05) (Figs. 3 and 4, Table 1, Additional file 9: Tables S2 and S3).

Fig. 3
figure 3

Manhattan plot of the dominant TTT model results

Fig. 4
figure 4

Manhattan plot of the recessive TTT model result

Some of this polymorphisms are associated with functional changes in their corresponding genes. According to dbSNP [14], rs2247870 induces a missense change in the GPR98 gene. In a similar fashion, a search in the HaploReg database [23] points toward functional implications of some of these variants. For example, rs7620924 is located in a lymphocyte-specific enhancer region that is strongly associated with PPP4R2 expression in whole blood (p-value 2.18 × 10− 34) and in lymphoblastoid cells (p-value 2.36 × 10− 8). Similarly, the polymorphisms rs361807, rs361946, rs5992169 and rs1043278 in PEX36 are associated with PEX36 expression in lymphoblastoid cells (minimum p-value 6.66 × 10− 10). rs9463 in TTLL12 strongly correlates with the expression of the adjacent gene TSPO (p-value 9.81 × 10− 198), and to a lower extent, with TTLL12 (p-value 9.57 × 10− 4) in blood cells. Other polymorphisms suggestively associated with TTT such as those in TXNRD2 and ZCCHC7 are also significantly associated with the expression of their respective genes in blood cells. On the contrary, no functional information exists about the intronic variant rs537453728 within the MAP3K4 gene.

Genomic polymorphisms associated with overall survival

We created an additive, a dominant and a recessive model to investigate variant association with OS. No significant inflation was observed (lambda values of 1.02, 1.01 and 1.01 for additive, dominant and recessive models, respectively). No variant achieved BH-adjusted p-values below 0.05 in any model (Additional file 2: Figure S2, Additional file 3: Figure S3 Additional file 4: Figure S4, Table 2, Additional file 9: Tables S4-S6). Nevertheless, 20 variants were below 0.06 in the additive models, with the top variants falling in TTC32, WDR35 and CLIP1. Notably, rs2304588 at TTC32 achieved the lowest p-value (7.6 × 10− 7, Bonferroni-adjusted p-value 0.067, BH-adjusted p-value 0.056). Overall, the number of variants with BH adjusted p-values below 0.1 was 25 in the additive model, 4 in the dominant model and 18 in the recessive model.

Table 2 Results of the additive, dominant and recessive OS models. Polymorphisms with a BH-adjusted P-value < 0.1 are show

Gene-based association with TTT and OS

Gene-level integration of p-values for the TTT model using VEGAS2 evidenced the association of 5 genes with BH adjusted p-values < 0.05 and 10 genes with BH adjusted p-values in the range of 0.05–0.1 (Table 3, Additional file 9: Table S7). The most significant genes were AURKAIP1, NIFK, RIPK3, SIK1 and ZCCHC7.

Table 3 VEGAS2 gene-level analysis of the additive TTT and OS models. Genes with a BH-adjusted P-value < 0.1 are shown

Similarly, we observed 12 genes associated with OS at a BH-adjusted p-value < 0.05 (Table 2, Additional file 9: Table S8), namely CLUAP1, TTC32, DRD3, RAD51AP2, RIPK3, EPYC, TENM4, GTPBP1, GAMT, DCP1B, MIR548D1 and MIR548AA1. Other 5 genes were associated with OS with BH-adjusted p-values in the range of 0.05–0.1. These genes were WDR35, BNIPL, KARS, ACO1 and PARP6.

Pathway-level variation significantly associated with TTT and OS

We used GSEA4GWAS to analyze associations with TTT and OS at the pathway level. No significant enrichment neither in “Canonical pathways” nor in “Biological Process” terms was observed for the TTT variable. Nevertheless, a different scenario was found for the OS variable. Significant “Canonical pathways” annotations (BH-adjusted p-values < 0.05) were “Pentose Phosphate pathway”, “Gluconeogenesis”, “Glycolysis”, “GNα13 pathway”, “Nitric Oxide pathway”, “Apoptosis”, “Glycolysis and Gluconeogenesis”, “Tumor Necrosis Factor pathway”, “Ovarian Infertility genes”, “Bile Acid Biosynthesis”, “Keratinocyte pathway” and “Glycine Serine and Threonine pathway” (Table 4, Additional file 5: Figure S5 and Additional file 6: Figure S6). Other pathways had evidence of suggestive association (BH-adjusted p-value < 0.25), such as “Notch Signalling pathway”, “EGF pathway”, “JNK MAPK pathway” and “PDGF pathway” among others. Among “GO Biological Processes” terms, the following were associated with OS with a BH-adjusted p-value < 0.05: “Induction of Apoptosis by Extracellular Signals”, “Mitosis”, “M phase”, “M phase of mitotic cell cycle” and “Negative regulation of Developmental Process”. Terms with BH-adjusted p-value < 0.25 were “Anti Apoptosis”, “Chromosome Segregation”, “Vasculature Development” and “Negative Regulation of Apoptosis”, among others (Table 4, Additional file 7: Figure S7 and Additional file 8: Figure S8).

Table 4 GSEA4GWAS analysis of the additive OS model. Gene Ontology Biological Process and Canonical Pathways terms with a BH-adjusted P-value < 0.25 are shown

Discussion

In this work we present evidence that suggest the existence of germline variation modulating CLL’s clinical aggressivity. The most remarkable finding was the strong and recessive association of rs7620924 near PPP4R2 with short time to treatment. The implication of PPP4R2 in the regulation of cell survival and DNA repair in hematopoietic and leukemia cells has been recently reported [24]. Indeed, different studies have identified PPP4R2 as a modulator of protein phosphatase 4 (PPP4), which regulates DNA repair through non-homologous end joining [25]. Concordantly, the ablation of PPP4 activity in mice increases genomic instability and aborgates class switch recombination in B cells, leading to an abnormal immune response [26]; and its function also seems to be essential in V(D) J recombination during normal B cell maturation [27]. Other polymorphisms associated with time to first treatment were located in MAP3K4, PEX26 and TTLL12. MAP3K4 participates in the TRAIL/MAP3K4/p38/HSP27/Akt pathway, thereby modulating processes such as autophagy and cell migration. Indeed, MAP3K4 is affected by recurrent loss-of-function mutations in different types of cancers [28,29,30,31,32]. Conversely, less is known about the peroxisome-related gene PEX26 [33]; the G-protein GPR98; and TTLL12, which participates in chromosome stability and mitosis-related processes [34]. On the contrary, although we did not find any variant significantly associated with overall survival, we devised some variants with suggestive associations. The two most significant ones were in the TTC32/WDR35 and CLIP1 loci, the last of which is overexpressed in Reed-Sternberg cells of Hodgkin lymphoma [35, 36].

In a similar fashion, we detected variation on different genes associated with CLL evolution. The most relevant was the association of RIP3K with both time to treatment and overall survival. RIP3K encodes a protein that regulates necroptosis, a form of regulated cell death characterized by cell membrane permeabilization [37]. Other relevant genes associated with rapid progression were the pro-proliferative gene NIFK [38], the tumor suppressor SIK1 [39, 40] and ZCCHC7, which is encoded near the B-cell specific PAX5 super-enhancer locus [41, 42]. On the other hand, various genes were associated with overall survival, such as CLUAP1, which participates in tumor growth and cytoskeleton regulation [43,44,45]; and the enzyme GAMT, which converts S-adenosylmethionine to creatine in order to foster high energy demands [46]. Moreover, the BCL-2 interacting gene BNIPL [47, 48] was suggestively associated with survival and deserves further characterization. In the same direction, the most remarkable pathway-level association with survival was that of the pentose phosphate metabolic pathway, which fuels cells with metabolites for nucleotide and lipid biosynthesis, and provides reducing power to promote cell survival under stressful conditions [49]. Other pathways such as GNα13 and Nitric Oxide were also significant. Concordantly, recurrent inactivating mutations in the G-protein superfamily gene GNA13 have been described in B cell lymphomas [50,51,52,53], and the contribution of nitric oxide to apoptosis resistance in CLL cells has been addressed by various studies [54, 55].

The main limitation of this study is the lack of an independent cohort for validation of these findings. Furthermore, although inflation values were low, we assume that treatment heterogeneity could have an impact on overall survival associations. Nevertheless, the global results are not only statistical significant but also biologically plausible. Thus, we believe that this report will motivate further studies in order to confirm the effect of these variants and to determine their mechanisms of action in lymphoproliferative disorders.

Conclusions

Our results point towards the existence of germline variability as a determinant of CLL clinical aggressivity. Future studies to validate and characterize the activity of these variants in CLL are needed.