De novo variants in exomes of congenital heart disease patients identify risk genes and pathways
Congenital heart disease (CHD) affects ~ 1% of live births and is the most common birth defect. Although the genetic contribution to the CHD has been long suspected, it has only been well established recently. De novo variants are estimated to contribute to approximately 8% of sporadic CHD.
CHD is genetically heterogeneous, making pathway enrichment analysis an effective approach to explore and statistically validate CHD-associated genes. In this study, we performed novel gene and pathway enrichment analyses of high-impact de novo variants in the recently published whole-exome sequencing (WES) data generated from a cohort of CHD 2645 parent-offspring trios to identify new CHD-causing candidate genes and mutations. We performed rigorous variant- and gene-level filtrations to identify potentially damaging variants, followed by enrichment analyses and gene prioritization.
Our analyses revealed 23 novel genes that are likely to cause CHD, including HSP90AA1, ROCK2, IQGAP1, and CHD4, and sharing biological functions, pathways, molecular interactions, and properties with known CHD-causing genes.
Ultimately, these findings suggest novel genes that are likely to be contributing to CHD pathogenesis.
KeywordsCongenital heart disease De novo variants Trios Pathway Enrichment analysis
Combined Annotation Dependent Depletion
Congenital heart disease
De novo variant
Exonic splicing enhancer
Exome Sequencing Project
Exome Aggregation Consortium
False discovery rate
Gene damage index
The Genome Aggregation Database
The Human Gene Connectome
High heart expression
Human Protein Atlas
Ingenuity Pathway Analysis
Minor allele frequency
Mouse Genome Informatics
Mammalian Phenotype Ontology
Mutation significance cut-off
Pediatric Cardiac Genetics Consortium
Pediatric Heart Network
Protein kinase A signaling
Congenital heart disease (CHD) is the most common type of birth defect affecting ~ 1% of births. There have been increasing efforts to elaborate genetic variation underlying CHD using the advances in high-throughput genomic technologies. De novo variants (DNVs) have been shown to play a major role in severe, early-onset genetic disorders such as neurodevelopmental disorders and CHD, and their contribution in sporadic CHD has been estimated as nearly 8%, increasing to 28% for individuals with CHD plus extra-cardiac anomalies and/or neurodevelopmental delays [1, 2, 3, 4]. The genetic causes of sporadic CHD, the most common form of CHD, remain largely unknown [5, 6].
Exome sequencing studies of parent-offspring trios have been successful in providing insights into DNVs and identifying causal genes, therefore extending our understanding of mechanisms underlying human diseases [4, 7]. In recent studies of CHD trios enrolled in the Pediatric Cardiac Genetics Consortium (PCGC) , significant enrichment for genes related to histone modification, chromatin modification, transcriptional regulation, neural tube development, and cardiac development and enrichment in pathways including Wnt, Notch, Igf, HDAC, ErbB, and NF-κB signaling have been reported [1, 2, 3]. A comprehensive analysis of WES data of a single large CHD cohort (2871 probands including 1204 previously reported trios) was recently performed, where rare inherited recessive and dominant variants were analyzed by comparing observed and expected numbers estimated from the de novo probabilities .
De novo variants in patients of CHD and controls were obtained from the recent study of the Pediatric Cardiac Genomics Consortium (PCGC) on a large CHD cohort . We studied 2675 CHD parent-offspring trios recruited to the PCGC and the Pediatric Heart Network (PHN) programs and 1789 control trios comprising parent and unaffected siblings of autism. Each participating subject or their parent/guardian provided informed consent.
PCGC subjects were selected for structural CHD (excluding PDA associated with prematurity, and pulmonic stenosis associated with twin-twin transfusion) and were recruited to the Congenital Heart Disease Genetic Network Study (CHD GENES) . PHN subjects were chosen from the DNA biorepository of the Single Ventricle Reconstruction trial . Controls included 1789 previously analyzed families that include one offspring with autism, one unaffected sibling, and unaffected parents . The permission to access to the genomic data in the Simons Simplex Collection (SSC) on the National Institute of Mental Health Data Repository was obtained. Written informed consent for all participants was provided by the Simons Foundation Autism Research Initiative . Only the unaffected sibling and parents were analyzed in this study. Controls were designated as unaffected by the SSC .
Our validation cohort consisted of 559 CHD parent-offspring trios recruited to the PCGC’s CHD GENES whose DNAs had been subjected to WES similar to the discovery case cohort.
The ethnicity and sex distributions of cases and controls are given in Additional file 1: Table S1. Samples with known trisomies or CNVs that are known to be associated with CHD were excluded. Cases include phenotypes with and without extracardiac manifestations or neurodevelopmental deficiency. CHDs were divided into five categories (Additional file 1: Table S2): (i) conotruncal defects (CTD), (ii) d-transposition of the great arteries (d-TGA), (iii) heterotaxy (HTX), (iv) left ventricular outflow tract obstruction (LVO), and (v) other .
Identification of de novo variants
All samples were sequenced at the Yale Center for Genome Analysis following the same protocol as previously described . Genomic DNA from venous blood or saliva was captured using the Nimblegen v.2 exome capture reagent (Roche) or Nimblegen SeqxCap EZ MedExome Target Enrichment Kit (Roche) followed by Illumina DNA sequencing. WES data were processed using two independent analysis pipelines at Yale University School of Medicine and Harvard Medical School (HMS). At each site, sequence reads were independently mapped to the reference genome (hg19) with BWA-MEM (Yale) and Novoalign (HMS) and further processed using the GATK Best Practices workflows [13, 14, 15]. Single nucleotide variants and small indels were called with GATK HaplotypeCaller and annotated using ANNOVAR, dbSNP (v138), 1000 Genomes (August 2015), NHLBI Exome Variant Server (EVS), and ExAC (v3) [16, 17]. The MetaSVM algorithm, annotated using dbNSFP (version 2.9), was used to predict deleteriousness of missense variants using software defaults [18, 19]. Variant calls were reconciled between Yale and HMS before downstream statistical analyses.
Relationship between proband and parents was estimated using the pairwise identity-by-descent (IBD) calculation in PLINK . The IBD sharing between the proband and parents in all trios was between 45 and 55%, as expected.
DNVs were called by Yale using the TrioDenovo program  and filtered yielding a specificity of 96.3% as previously described . These hard filters include (i) an in-cohort minor allele frequency (MAF) ≤4 × 10−4; (ii) a minimum 10 total reads, 5 alternate allele reads, and a minimum 20% alternate allele ratio in the proband if alternate allele reads ≥ 10, or if alternate allele reads is < 10, a minimum 28% alternate ratio; (iii) a minimum depth of 10 reference reads and alternate allele ratio < 3.5% in parents; and (iv) exonic or canonical splice-site variants.
The observed and expected rates for presumably benign synonymous DNVs showed no enrichment in cases or controls . The rate of synonymous DNVs in cases was not different from that in controls.
The gene sets
The genes in which coding mutations cause isolated or syndromic CHD used in this study are referred to as known CHD-causing genes and include both human and mouse CHD genes. The human CHD gene set was manually curated by members of the Pediatric Cardiac Genomics Consortium [1, 2]. To generate the mouse CHD gene set, mammalian phenotype ontology (MPO) terms potentially relevant to CHD were identified. These were reviewed to remove cardiovascular terms not specific to CHD, such as cardiac dilation/hypertrophy, arrhythmias, and coronary artery disease . Data on the mouse strains associated with these MPO terms (n = 1020) were obtained from MouseMine dataset (http://www.mousemine.org/mousemine/). Only single-gene transgenic mutant mouse strains were kept (n = 730), and these mouse genes were converted to their human orthologs (n = 728) based on data downloaded from the Mouse Genome Informatics (MGI) (ftp://ftp.informatics.jax.org/pub/reports/HOM_MouseHumanSequence.rpt). Mouse CHD genes were not split based into recessive/dominant because there was no concordance between autosomal dominant human CHD genes and mouse zygosity (of the 50 monoallelic human CHD genes with mouse models, only 20 have CHD observed on a heterozygous background).
Another set of genes used in this study is the top quarter of expressed genes during heart development (high heart expression, HHE genes), which was identified by RNA sequencing of mouse hearts at embryonic day E14.5 [1, 2].
To identify potentially damaging mutations, we applied several filtering steps based on molecular class, allele frequency, intolerance to mutations, functional impact, and the number of variants in cases and controls. Here, it is important to note that the aim of this filtering strategy was to identify a set of variants that were highly likely to be pathogenic and the filtered-out variants were not necessarily benign.
The synonymous variants were filtered out from our analyses by giving priority to frameshift, nonsense, canonical splice site, start loss, missense, and non-frameshift insertion–deletion variants.
Functional variants with MAF < 0.001 across all samples in the Exome Aggregation Consortium (ExAC), the NHLBI Exome Sequencing Project (ESP), the Genome Aggregation Database (gnomAD), and the 1000 Genomes Project were examined by ANNOVAR [15, 16, 17, 23]. Variants whose frequency data were not available in any of the databases were also taken into consideration.
We evaluated intolerance of genes to mutations using the gene damage index (GDI) that provides an estimate for the accumulated mutational damage of each gene in the general population and helps to filter out highly damaged genes as those unlikely to be disease causing . The genes with high GDI were filtered out from our dataset.
To improve the use of common variant-level methods that use a standard cut-off values across all genes, such as the Combined Annotation Dependent Depletion (CADD) score , we used the mutation significance cut-off (MSC) method with 95% confidence interval (CI) which provides gene-level and gene-specific low/high phenotypic impact cut-off values . Since the variants with CADD≥MSC predicted scores suggest high functional effect, we filtered out the variants with CADD score below the MSC.
As a last step of filtration, the variants that were specific to the cases were determined by comparing the number of variants in cases to the number of variants in controls in each gene. Here, we tried several different approaches to decide how stringent a filter was appropriate for our data: (a) applying Fisher’s exact test on all genes, (b) applying Fisher’s exact test on only cases genes, (c) allowing all variants that are absent from controls, and (d) considering the genes in which ncases − ncontrols ≥ 2, where n is the number of variants. All approaches except for (d) did not show statistical significance in pathway analysis due to the small number of genes in cases that account for the likely genetic heterogeneity of CHD. Thus, we used (d) for the analyses described in this study.
Similar filtration steps, (i) removing synonymous, (ii) MAF < 0.001, (iii) low or medium GDI, (iv) CADD>MSC, and (v) ncontrols − ncases ≥ 2, were applied to the controls’ data.
DNVs occurring on X chromosome with X-linked recessive inheritance pattern were excluded from the analysis.
Function, pathway, and network analysis
We investigated enrichment of variants in Gene Ontology (GO) terms and biological pathways using InnateDB, version 5.4 data analysis tool . InnateDB performs a hypergeometric distribution test to find over-represented GO terms and pathways (imported from KEGG, NetPath, PID NCI, Reactome, INOH, and PID BioCarta) that are represented more than would be expected by random chance [28, 29, 30, 31, 32, 33]. The NetworkAnalyst tool on String Interactome was applied with high-confidence (score > 0.9) to determine the interconnected subnetworks of protein-protein interactions (PPIs) [34, 35]. Additionally, Ingenuity Pathway Analysis (IPA) software, version 49309495 (http://www.qiagen.com/ingenuity) was used for identifying statistical significance of canonical pathways, diseases, biological functions, and networks that were most relevant to the input genes. To adjust the false discovery rate, the Benjamini-Hochberg (B-H) correction method was applied to the p values in all analyses. IPA analysis included the following parameters: (i) Ingenuity Knowledge Base (genes only) was used as the reference set, both direct and indirect relationships are considered; (ii) endogenous chemicals were included in networks interaction, the number of molecules per network was selected as 140, and the number of networks was selected as 25; (iii) all node types and all data sources were used; (iv) only experimentally observed information was considered; (v) molecules and interactions were limited to human only; (vi) molecules and relationships were selected from all tissues and cell lines; and (vii) all mutation findings were used.
Biological distance calculations
The human gene connectome (HGC) is tailored to prioritize a given list of genes by their biological proximity to genes that are known to be associated with a phenotype of interest . The biological proximity is defined by in silico predicted biologically plausible routes, distances, and degrees of separation between all pairs of human genes and calculated by a shortest distance algorithm on the full network of human protein-protein interactions. Since the causal genes of a specific phenotype are generally closely related via core genes or pathways, we determined the genes within the top 1% of each candidate gene’s connectome.
Candidate gene prioritization
A priority score was defined to rank the genes based on their proximity to the known CHD-causing genes. For a given candidate gene, the score was the total number of known disease-causing genes in (i) the significantly enriched pathways (IPA canonical pathways, InnateDB pathways, GO terms); (ii) the networks (IPA network of cardiovascular diseases and PPI network); and (iii) the top 1% of genes connectome (significant proximity to the gene with p < 0.01) based on HGC. After ranking the candidate genes based on their priority scores, their expression levels during heart development were also taken into consideration.
To assess whether the known CHD-causing genes have higher priority scores as expected, we performed an independent two sample t test. We randomly selected 100 known CHD-causing genes and 100 genes from our filtered control set among the genes having more variants in controls than cases (ncontrols > ncases), and compared the scores of two samples.
To test our gene candidates, we performed ToppGene suite and ranked the genes based on functional similarity to known CHD genes . ToppGene first generated a representative profile from the training genes (known to be CHD-associated genes) based on functional properties such as gene expression, protein domains, protein interactions, gene ontologies, pathways, drug-disease associations, transcription factor-binding sites, and microRNAs, and then compared the candidate gene set to this profile. All available features were used with default test parameters. The genes were ranked based on their similarity to the known CHD-causing genes by calculating p values.
Prediction of functional effects on proteins
Functional effects of amino acid substitutions were predicted using PROVEAN v1.1 that uses sequence alignment-based scoring and SNAP2 that is based on a variety of sequence and variant features [38, 39]. Both methods evaluate the effect of an amino acid substitution on protein function.
The PROVEAN score measures the change in sequence similarity of a given protein sequence to a protein sequence homolog before and after the variant occurs where the sequence similarity is computed by an amino acid substitution matrix. A score equal to or below a predefined threshold (default threshold = − 2.5) is considered to indicate a “deleterious” effect, and a score above the threshold is considered to indicate a “neutral” effect.
SNAP2 is a machine learning classifier based on a variety of sequence and variant features including the evolutionary information taken from multiple sequence alignment, secondary structure, and solvent accessibility. The predicted score ranges from −100 (strong neutral prediction) to +100 (strong effect prediction) and indicates the likelihood of variant to alter the protein function.
The intolerance of protein domains to functional variants was calculated using subRVIS . SubRVIS calculates a rank for sub-regions of gene by their intolerance to functional variation. The sub-regions can be either defined as protein domains based on conserved domain sequences or exons. While a lower score indicates a more intolerant sub-region, a higher score indicates a more tolerant sub-region.
Prediction of exonic splicing enhancers
We applied our in-house software to identify if the genetic variants were located in exonic splicing enhancers (ESEs) close to the canonical splice sites. There were in total 2341 ESE motifs collected from RESCUE-ESE, PESX, and SpliceAid [41, 42, 43]. By removing 16 duplicated ESEs from different resources, a collection of 2325 ESE motifs was retained for further analysis of our variants.
Optimizing case-control ratio
Since the numbers of cases and controls were not equal (127 genes with 320 variants in cases and 36 genes with 73 variants in controls), we also tested our analysis on an extended control set. We randomly selected 91 genes from the 769 genes in controls where ncontrols − ncases = 1 and increased the size of the control set to 127 genes with 164 variants.
Selection of de novo variants for analyses
We applied variant-level and gene-level filtrations on DNVs observed in 2645 CHD trios and 1789 controls. For the variant-level analysis, we filtered DNVs based on (i) functional effect, (ii) allele frequency, and (iii) phenotypic impact. For the gene-level, we filtered genes based on (i) accumulated mutational damage and (ii) the difference in the mutational burden between cases and controls (described in the “Methods” section). The results included 127 genes (320 variants) in cases and 36 genes (73 variants) in controls that we further explored in our analyses (Fig. 1a, b, Additional file 1: Tables S2 and S3). Notably, 232/320 variants were missense mutations (37 nonsense, 36 frameshift, 14 splicing mutations, and 1 start-loss) (Additional file 2: Figure S1). Among cases, 282 had only one predicted damaging DNV and 19 had two predicted damaging DNVs. In controls, 65 samples had only one predicted damaging DNV and four samples had two predicted damaging DNVs.
Gene enrichment and pathway analyses
CHD DNVs are enriched in signaling pathways
In enrichment analyses, genes sets are tested for over-representation of shared biological or functional properties as defined by the reference databases; hence, the results depend on the database used in the analysis [44, 45]. As no single database covers all known pathway genes, a comprehensive interpretation of the results requires analyses be performed on several complementary databases. For example, while Ingenuity Pathway Analysis (IPA) software (QIAGEN Inc., https://www.qiagenbioinformatics.com/products/ingenuity- pathway-analysis) uses its own curated database, InnateDB uses major public databases (e.g., KEGG, Reactome) as resources [27, 28, 31]. Hence, to achieve a deeper understanding of the 127 genes in cases, we performed pathway analyses using both tools.
The pathway analysis using InnateDB returned 211 over-represented pathways (with a large proportion of biological overlap) (FDR < 0.05), including VEGF, GPCR metabotropic glutamate receptor, PDGFR-beta, ERK, Notch, Igf, and NGF, affirming enrichment in signaling pathways (Additional file 3: Table S5). The most significant pathway was identified as focal adhesion (FDR = 1.72 × 10−4), which was found enriched by IPA as well and is known to have an important role in cellular differentiation and migration during cardiac development [56, 58, 59]. Another significantly enriched pathway was axon guidance (FDR = 0.0026). Slit-Robo signaling is known to have roles in axon guidance and has been suggested to be involved in heart development. Netrins, a class of axon guidance molecules, have also been suggested to have roles in cardiovascular biology and disease including angiogenesis [60, 61, 62, 63].
Over-represented Gene Ontology (GO) terms included heart development (FDR = 8.96 × 10−4), axon guidance (FDR = 0.0011), pulmonary valve morphogenesis (FDR = 0.0018), chromatin binding (FDR = 0.0017), Notch signaling involved in heart development (FDR = 0.0035), histone-lysine-N-methyltransferase activity (FDR = 0.0035), and in utero embryonic development (FDR = 0.0053) (Additional file 3: Table S6). Histone-modifying genes and chromatin binding have been previously implicated to have a role in heart diseases [1, 64, 65, 66]. Interestingly, among the ten genes associated with the GO term heart development, only CAD had not been related to CHD previously.
No enrichment was detected in the extended control set
We did not identify any significant GO term or signaling pathway enriched in the control genes using IPA. By InnateDB, only five pathways had FDR < 0.05 (Additional file 3: Table S7). To check if the lack of enrichment in controls data could be attributable to smaller number of variants, we repeated all pathway enrichment analyses on an extended control set of the same size as for the cases, 127 genes with 164 DNVs (see the “Methods” section). Filtered DNVs in the extended control set did not show any significantly enriched canonical pathway by IPA. There were only one statistically significant Reactome pathway (FDR = 0.0027), transport of inorganic cations/anions and amino acids/oligopeptides, and no significant GO terms found by InnateDB in the extended control set. The lack of pathway enrichments in the controls group suggests the specificity of our results to CHD.
Enrichment in cardiovascular disease categories
To investigate the causal relatedness between the identified genes and biological functions/diseases, we analyzed the IPA-predicted top enriched diseases/functions categories (FDR < 0.05) and observed cardiovascular disease as a highly significant disease category in CHD cases (FDR = 5.36 × 10−13) (Additional file 3: Table S8). Among the disease subcategories under “cardiovascular disease” category, familial cardiovascular disease was the most enriched. As the biological function/disease categories have a hierarchical nature, the following enriched cardiovascular disease subcategories give more specific information on candidate genes. For example, while CDK13, CHD4, KDM5A, and SCN10A are related to familial heart disease, CFH, DGUOK, and POLE are related to familial vascular disease. In contrast, the only statistically significant cardiovascular disease in controls was the branching morphogenesis of vascular endothelial cells with FDR = 0.013, and involved only the gene PTPRJ. Taken together, these results suggest that the candidate CHD genes are enriched in phenotypes that are closely associated with CHD.
A high-confidence subnetwork associated with cardiovascular disease
Validation of the enrichment results in cases
To assess our findings in the cases, we repeated our analysis on an independent CHD cohort comprising 559 parent-offspring trios with a total of 977 de novo variants. After following the same variant filtering method that we applied on cases and controls (described in the “Methods” section), we identified 30 genes (with 54 DNVs) to further analyze (Additional file 4: Table S10). Despite the smaller sample size, we again observed enrichment in signaling pathways including opioid, netrin, protein kinase A, and axonal guidance, as well as enrichment in GO terms including blood vessel development and embryonic heart tube development (Additional file 4: Tables S11-S13). The most significant network identified by IPA (p = 10−54) included 26 genes and was associated with cardiac dysfunction, cardiovascular disease, and organismal injury and abnormalities (Additional file 4: Table S14a). We further explored our findings by randomly selecting 30 genes from the unfiltered dataset of 559 samples and repeating the enrichment analyses. In the random set of genes, we did not identify any significantly enriched pathway, or a network related to cardiovascular disease. There were only some GO terms with FDR > 0.04 including a single gene, which were not significantly enriched in the cases (Additional file 4: Table S15). These results validated that our approach is effective in identifying CHD-related gene pathways and networks.
Candidate novel CHD-causing genes
Our gene enrichment analysis results revealed that some genes that were not among currently known CHD-causing genes (see the “Methods” section) were involved in multiple significantly enriched pathways and in a network of cardiovascular disease together with known CHD-causing genes. Since we have applied relaxed criteria to allow analyses of additional genes, these genes had a low number of hits (2 or 3), while the genes with higher number of hits (> 5) were all known genes (KMT2D: 16, CHD7: 15, PTPN11: 10, and NOTCH1: 6) (Additional file 5: Table S16). To identify the most plausible novel CHD-causing gene candidates, we performed systematic analyses by considering involvement in enriched pathways, connections in the biological networks, and expression levels during heart development.
To assess novel candidate CHD-causing genes suggested by the enrichment analyses in the previous section, we defined a priority score (see the “Methods” section), where a higher score indicates the gene’s connectivity to a high number of known CHD-causing genes through (i) multiple significant pathways (FDR < 0.05) [27, 28, 29, 30, 31, 32, 33, 67], (ii) multiple significant networks [34, 67, 68], and (iii) the Human Gene Connectome (HGC) . We also checked if the candidate gene was highly expressed during heart development (Additional file 5: Table S16) [1, 2]. Pathway and network analysis have been effectively integrated in candidate gene prioritization by different methods based on the rationale that disease-associated genes/proteins interact with each other [69, 70, 71]. Similarly, the biological distance between candidate genes and known disease-causing genes is shown to be an efficient measure for gene prioritization . Altogether, these analyses that are based on different heterogenous data types and data sets provided partially overlapping and complementary information, resulting in prioritizing the plausible candidate genes based on the combined evidence of their biological relatedness to the known CHD-causing genes.
To investigate if considering mouse CHD genes as known CHD-causing genes had an impact on our results, we repeated our analysis with only human CHD genes as the known genes. All novel candidate genes were again ranked at top of the list along with nine mouse CHD genes (see Additional file 5: Table S17). We further calculated the average biological distance of candidate genes with respect to human CHD genes only (mean = 13.36, sd = 4.27) and mouse CHD genes only (mean = 13.04, sd = 4.17). The average distances showed no significant difference (independent t test, t = 0.57, p = 0.56) when using human or mouse CHD genes (Additional file 5: Table S18), supporting the notion that mouse CHD genes were plausible to use in this study.
Tissue enrichment in candidate genes
We examined the expression of 23 novel candidate genes using the Human Protein Atlas (HPA) RNA-seq data and observed that 20/23 of the genes were expressed in all tissues or mixed, and 3/23 were tissue enhanced (LAMB1: placenta, LAMC1: placenta, and RACGAP1: testis). We also observed that the majority of the known CHD-causing genes (67.5%) are expressed in all or mixed and the rest (32.5%) have elevated expression (tissue enhanced/enriched or group enriched), while approximately 54% of the protein coding genes in human body is expressed in all/mixed [74, 75] (https://www.proteinatlas.org/). While the tissue expression profiles of the candidate genes are significantly different from expression levels of all genes (chi-square with Yates correction, two-tailed p value = 0.0077), there is no significant difference from the expression profiles of the known CHD-causing genes (chi-square with Yates correction, two-tailed p value = 0.08).
Association of candidate genes with known CHD-causing genes
The closest known CHD-causing gene to the 23 candidate genes calculated by HGC
Degrees of separation
Assessing candidate genes with ToppGene
To further validate our findings, we also prioritized genes based on their functional similarity to the known genes by using ToppGene suite . Ten of the 23 novel candidate genes were also ranked at the top by ToppGene with p < 10−3 (Additional file 5: Table S16). The ranked gene list was in good agreement with our list of candidate genes.
Candidate genes in isolated and syndromic CHD
Among 301 CHD cases carrying possibly damaging DNVs, 73 were isolated CHD patients (CHD without extracardiac manifestation or neurodevelopmental deficiency) and 180 were syndromic CHD patients (with EM and/or NDD) (Additional file 1: Table S2). To investigate the pathways and genes altered in these two different types of CHD, we performed pathway enrichment analyses and gene prioritization in the two subgroups separately. We identified 64 candidate genes involved in isolated CHD and 105 candidate genes involved in syndromic CHD (45 involved in both). In isolated CHD, the pathways including nitric oxide signaling in the cardiovascular system, PKA signaling, Igf receptor activity, positive regulation of cardioblast differentiation, Notch signaling involved in heart development, and pulmonary valve morphogenesis were found to be highly enriched (Additional file 6: Tables S19–21). Some of these pathways (e.g., Notch1, Igf-1 signaling) were reported in a recent study of Sifrim et al. on a predominantly nonsyndromic CHD cohort . In syndromic CHD, the pathways such as PKA signaling, opioid signaling, heart development, chromatin binding, and focal adhesion were found to be significantly enriched (Additional file 6: Tables S24–26). Despite the smaller sample sizes, following our gene prioritization approach, we identified 11 and 22 candidate genes for isolated and syndromic CHD, respectively (Additional file 6: Tables S23 and S28). Top candidate genes in isolated CHD include HSP90AA1, IQGAP1, and TJP2, and top candidate genes in syndromic CHD include ROCK2, APBB1, KDM5A, and CHD4.
Candidate genes in patients with conotruncal defects and left ventricular obstruction
Cardiac phenotypes of the CHD proband were defined as (i) conotruncal defects (CTD, 30%), (ii) d-transposition of the great arteries (d-TGA, 9%), (iii) heterotaxy (HTX, 9%), (iv) left ventricular outflow tract obstruction (LVO, 28%), and (v) other (24%) in the previously reported study  (see Additional file 2: Figure S3 for details). Among 301 patients carrying possibly damaging DNVs, 84 had CTD (27.5%), 21 had d-TGA (7%), 23 had HTX (7.5%), 99 had LVO (33%), and 74 had other (25%) types of CHD (Additional file 1: Table S2). We identified 59 candidate genes in CTD and 68 candidate genes in LVO and, therefore, were able to perform a subgroup analysis for these two subtypes of CHD. Pathway analyses in CTD genes showed that VEGF signaling, PKA signaling, axon guidance, distal tube development, and Igf-1 signaling pathways were highly enriched (Additional file 7: Tables S29–31). After prioritizing the genes, ROCK2 was on top of the list (Additional file 7: Table S33). LVO genes showed significant enrichment in CDK5 signaling, Notch signaling, pulmonary valve morphogenesis, and Beta3 integrin cell surface interactions pathways (Additional file 7: Tables S34–36). Gene prioritization revealed that the top genes include KDM5A and PHIP (Additional file 7: Table S38).
Function-affecting genetic variants in candidate CHD-causing genes
To verify that the 23 novel candidate genes were unlikely to be false positives, we checked if the variants in those genes existed in the non-pathogenic genetic variants list, the “blacklist” . This recently curated list includes variants absent or rare in public databases but too common in patients suffering from severe genetic diseases and, therefore, are unlikely to cause disease. None of our damaging DNVs was included in the blacklist.
Next, to evaluate whether the 41 missense variants in the 23 strong candidate genes are likely to have functional effects, we analyzed them with PROVEAN and SNAP2 [38, 39] (Additional file 8: Table S39). We did not use the functional impact prediction tools in the filtering step as we considered all non-synonymous mutations, and they provide a score for missense mutations only. Among 41 missense variants, 24 were predicted to be damaging by both tools and 6 were predicted to be damaging by one of the tools. We also estimated the intolerance of protein domains to functional variation using the subRVIS  tool to further analyze the effects of the DNVs in candidate CHD-causing genes. Among 41 variants, 31 were found to affect regions intolerant to mutations and, therefore, more likely to cause disease. We then checked if the candidate CHD-causing genetic variants were already included in the HGMD database . Four DNVs (one in CDK13, one in KDM5A, and 2 in NAA15) were classified as CHD-causing variants, and 23 DNVs were classified as likely to be CHD-causing mutations in the HGMD Professional 2019.2 database (Additional file 8: Table S39).
To check the population genetics-level functional impact of the variants occurring in the top four candidate genes (HSP90AA1, ROCK2, IQGAP1, and CHD4), we visualized the minor allele frequencies with respect to damage prediction scores (CADD) using PopViz . Additional file 2: Figure S4 displays all missense variants in European population with CADD>MSC score (95% confidence interval) in gnomAD database . These plots suggest that the rare variants in the top candidate genes likely have a strong functional impact.
Twenty-three plausible CHD candidate genes
Amyloid beta precursor protein binding family B member 1
Biorientation of chromosomes in cell division 1 like 1
Bromodomain containing 4
Calcineurin binding protein 1
Carbamoyl-phosphate synthetase 2, aspartate transcarbamylase, and dihydroorotase
Cyclin dependent kinase 13
Chromodomain helicase DNA binding protein 4
CTR9 homolog, Paf1/RNA polymerase II complex component
Glucosidase II alpha subunit
Heat shock protein 90 alpha family class A member 1
IQ motif containing GTPase activating protein 1
Lysine demethylase 5A
Laminin subunit beta 1
Laminin subunit gamma 1
Misshapen like kinase 1
N (alpha)-Acetyltransferase 15, NatA auxiliary subunit
Pleckstrin homology domain interacting protein
Pogo transposable element derived with ZNF domain
Glycogen phosphorylase L
Rac GTPase activating protein 1
Rho-associated coiled-coil containing protein kinase 2
Tight junction protein 2
Ubiquitin specific peptidase 4
Synonymous DNVs in exonic splicing enhancers
To check if synonymous DNVs in cases contribute to CHD, we analyzed them by first applying the same filtering steps as described for the other variant types, and next performing enrichment analyses. We identified nine genes having two synonymous variants in cases and none in controls. Four of these genes (HSP90B1, GIT1, ARID1B, and CASZ1) were highly expressed during heart development. Interestingly, one of these genes, HSP90B1, was previously associated with CHD. We applied the state-of-the-art pathogenicity prediction tool, S-CAP, and calculated scores of eight synonymous variants . Except for the two synonymous variants in CASZ1, all six variants were predicted to be pathogenic by S-CAP. We further applied our in-house software to identify if these variants are located in the exonic splicing enhancers (ESE) near the canonical splice sites (see the “Methods” section). We observed the variant (chr12-104336346-C-T), which locates + 41 bp of the splice acceptor site of exon 12 of gene HSP90B1, was shown to overlap with 7 aligned ESE motifs (GATCAA, ATCAAG, CAAGAA, TCAAGA, CAAGAAGA, TCAAGAAG, ATCAAGAA). The underscored nucleotide in each motif sequence is where the variation occurs. These seven ESE motifs are aligned to the same genomic region close to the splice acceptor site, suggesting the importance of this region to bind with SR proteins to promote the exon splicing. The variant changes the highly conserved C to T in these ESE motifs, which may result in reduced or inhibited affinity for splicing factors. Subsequently, the altered ESEs by this variant may in turn lead to the aberrant splicing events.
Here, we performed a comprehensive analysis of DNVs in a large set of CHD patient and control trio data. Our goal was to identify novel CHD-associated candidate genes through pathway/network analyses and by using the controls and a validation set to assess the significance of our findings. Our approach included variant filtering to identify potentially damaging DNVs followed by enrichment analysis and knowledge-driven prioritization based on biological pathways, annotations, molecular interactions, functional similarities, and expression profiles. While filtering and prioritization depend on the specific study at hand, we demonstrate that our procedure yielded plausible candidate genes with statistically significant enrichment by supporting evidence from multiple aspects.
Unlike previous CHD studies where gene-level case-control studies were performed, in this study, we applied a pathway-level approach to identify risk genes. Another major novel component of our analysis was comparing the number of variants in cases and controls instead of applying a strict gene burden filter such as Fisher’s exact test. To account for the very low number of hits in individual genes, we followed a relaxed approach, thereby obtaining sufficient numbers of potentially disease-causing mutations to enable statistical power for case-control enrichment analyses.
Pathway analysis showed significant enrichment in heart development and signaling pathways (i.e., PKA, EMT, nitric oxide signaling, focal adhesion) in filtered cases genes that have been previously associated with heart disease, and conversely, no enrichment was found in filtered controls genes [3, 9]. In addition to previously known CHD-associated genes, we also observed novel genes involved in these pathways. Since we have applied a relaxed approach to include more candidate genes into pathway analyses, we evaluated the plausibility of each candidate gene.
To prioritize the candidate genes, we defined a priority score based on the number of known CHD-causing genes in a candidate gene’s pathway, network, and HGC distance to known CHD-causing genes. The higher scores and high expression levels during heart development provided supporting evidence for candidate genes, since a majority (54%) of human CHD genes are highly expressed in the developing heart. It is also important to note that the genes with lower scores or lower expression levels should be considered as candidates with less evidence. The genes HSP90AA1, ROCK2, IQGAP1, and CHD4 were at the top of the list with highest scores and as being highly expressed during heart development. For example, HSP90AA1 is associated with pathways including nitric oxide signaling in the cardiovascular system, VEGF signaling that has been shown to be linked to CHD [86, 87, 88], and axon guidance; ROCK2 is associated with pathways including PAK signaling, VEGF signaling, focal adhesion, and axon guidance; IQGAP1 is associated with IL-8 signaling, epithelial adherens junction signaling, and EGFR1; and CHD4 is associated with Th2 pathway, transcription factor binding, and zinc ion binding.
Notably, DNVs in HSP90AA1 and IQGAP1 were found in isolated CHD patients, whereas DNVs in ROCK2 and CHD4 were found in syndromic CHD patients. Two DNVs in CHD4 (p.Y1345D and p.M202I), p.R1330W in IQGAP1, and p.S39F in ROCK2 were previously associated with CHD and p.M954I in CHD4 was associated with developmental disorder [2, 3, 9] (Additional file 8: Table S39). Overall, our findings suggested 23 novel plausible genes contributing to CHD.
To ensure that our results were robust and not biased as a result of lower number of filtered control variants compared to cases (320 variants in cases and 73 variants in controls), we repeated our analyses on an extended control set. We still did not identify any significant enrichment in the extended control gene set.
To test our filtering strategy, we also performed enrichment analysis on rare DNVs after removing the synonymous variants (2278 variants in 1951 genes) without further filtering. Significant enrichment persisted in signaling pathways and cardiovascular diseases among 1951 genes supporting our findings for potentially damaging DNVs.
Due to the extreme heterogeneity of CHD, gene-level approaches have statistical power limitations for suggesting novel risk genes. This study represents a pathway-level approach that enables discovery of novel plausible CHD risk genes. We considered all genes having at least two more DNVs in cases than controls to be able to reach pathway-level statistical significance. However, it is important to note that this criterion depends on the size of the cohort and characteristic of the disease. While this approach has been efficient for identifying novel risk genes in this large cohort, we anticipate that it can be applied for studying rare variants in other genetically heterogeneous diseases.
Previous approaches that use DNVs to estimate variant rates or perform gene-level case-control analysis have limitation on identifying novel CHD genes due to extreme genetic heterogeneity of the disease. A recent study comparing the observed and expected rates of DNVs on the same data suggested 66 genes having more than one damaging de novo variants as risk genes . Among those, only five genes (CHD7, KMT2D, PTPN11, GATA6, and RBFOX2) reached genome-wide significance and all were already known CHD-causing genes. In this study, we aimed to discover new plausible candidate genes and applied a pathway-level approach that enabled us to discover 23 novel genes. Our approach explored whether genes having a low number of hits altered common molecular pathways in CHD patients and prioritized genes based on their biological proximity to the known CHD-causing genes. This large-scale study indicates that using pathway-level approaches is effective to analyze the effects of rare de novo variants in heterogenic diseases.
We are grateful to the patients and families who participated in this research.
BG and YI conceived the project and supervised the study. CSB and YI designed the computational methods. CSB implemented the methods and performed the calculations. PZ developed and performed the exonic splicing enhancer analysis. YI, BG, and MTF advised the analysis. CSB wrote the manuscript with PZ, MTF, BG, and YI. All authors have read and approved the final manuscript.
The project described was supported by Award Number(s) UM1HL098147, UM1HL098123, UM1HL128761, UM1HL128711, UM1HL098162, U01HL098163, U01HL098153, U01HL098188, and U01HL131003 from the National Heart, Lung, and Blood Institute, and The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the National Institutes of Health.
Ethics approval and consent to participate
The institutional review boards of Boston’s Children’s Hospital, Brigham and Women’s Hospital, Great Ormond Street Hospital, Children’s Hospital of Los Angeles, Children’s Hospital of Philadelphia, Columbia University Medical Center, Icahn School of Medicine at Mount Sinai, Rochester School of Medicine and Dentistry, Steven and Alexandra Cohen Children’s Medical Center of New York, and Yale School of Medicine approved the protocols of this study. All individual participants or their parent/guardian provided written informed consent. This research study conformed to the principles of the Helsinki Declaration.
Consent for publication
The authors declare that they have no competing interests.
- 23.Karczewski KJ, et al. 2019. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. https://doi.org/10.1101/531210.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.