Genome-wide association of SNPs and indels
Genome-wide association analyses using the allelic and genotypic transmission disequilibrium tests (TDT) were run separately in 315 European and 265 Colombian trios, and then in the Combined set of all 580 trios, on bi-allelic single nucleotide polymorphism (SNP) markers and indels with minor allele frequency (MAF) greater than 10% (see “Methods” for discussion of the MAF cutoff). A comparison of the p values from the allelic TDT (aTDT) and the genotypic TDT (gTDT) showed high concordance (see “Comparison between aTDT and gTDT” in the Supplementary material and Supplementary Figure S1, Fig. 1); therefore, only the aTDT results are discussed in the following sections. The p values reported for the aTDT were calculated using the exact binomial distribution from McNemar’s test.
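As an illustration of the exact binomial form of McNemar’s test mentioned above, the following minimal sketch computes a two-sided p value from the number of times heterozygous parents transmitted, or did not transmit, a given allele. The function name and the example counts are hypothetical and are not taken from this study; the doubling of the smaller tail is one common convention for a two-sided exact test and is assumed here.

```python
from math import comb


def atdt_exact_p(transmitted: int, untransmitted: int) -> float:
    """Two-sided exact binomial (McNemar) p value for an allelic TDT.

    Under the null hypothesis, each heterozygous parent transmits the
    allele with probability 0.5, so the transmitted count follows a
    Binomial(n, 0.5) distribution with n = transmitted + untransmitted.
    """
    n = transmitted + untransmitted
    if n == 0:
        return 1.0  # no informative transmissions
    k = min(transmitted, untransmitted)
    # One-sided probability of a split at least this uneven.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Hypothetical counts, for illustration only.
print(atdt_exact_p(30, 10))
```

Because the calculation is exact, it stays valid when the number of informative transmissions is small, which is where the chi-square approximation to McNemar’s test is least reliable.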
Tables 2 and 3 show the most significant results in the European (Table 2) and Colombian trios (Table 3). Several SNPs reached genome-wide significance in the stratified aTDT analysis of European (Table 2 and Fig. 1 top panel) and Colombian trios (Table 3 and Fig. 1 middle panel), and a single SNP achieved genome-wide significance in the Combined sample (Fig. 1 bottom panel). In the European sample, 17 significant associations were observed across multiple chromosomes (Table 2). In the Colombian sample, four significant associations were observed for markers on chromosomes 6, 8, 19 and 21. After close examination of the genome-wide significant associations in the European and Colombian trios, the one strongly supported new result was a region on chromosome 21q22.3, discussed below. In the Combined aTDT, a single genome-wide significant association (p = 9.35E−14, OR = 2.13, 95% CI = [1.74–2.62], SNP rs72728755) was observed in the 8q24.21 chromosomal region. Many of the other associations showed properties that reduced our confidence in their reliability: (1) no additional variants close to the lead SNP yielded either significant or suggestive p values, (2) the lead SNP was located in a highly repetitive region, or (3) the lead SNP showed substantial differences in MAF across European or Latino samples in gnomAD (Karczewski et al. 2019). Therefore, we concluded that these might not be reliable signals. Note that the first criterion alone was not sufficient for us to deem a result unreliable, as the 10% MAF cutoff may have been responsible for single-SNP association peaks.
Table 2 Significant associations in European (315 trios) compared with Colombian (265 trios) and Combined (580 trios)
Table 3 Significant associations in Colombian (265 trios) compared with European (315 trios) and Combined (580 trios)
Comparison between allelic TDTs of European and Colombian trios
A qualitative comparison of the European and Colombian aTDT results showed few commonalities between the two analyses of common SNPs. Except for the peaks at the 8q24.21 chromosomal region, all other genome-wide significant regions in the European trios were neither significant nor suggestive in the Colombian trios, and vice versa. The lack of new signals from the Combined trios supports this observation. For the purposes of comparison, Table 2 lists all European peaks together with the smallest p values, and their corresponding estimated odds ratios (OR), observed in the Colombian and Combined aTDTs within 500 kb on either side of each European peak SNP (Table 2 columns 4–7). Since allele frequencies for specific SNPs may differ between the two samples, this provides a region-level view of replication across the samples. Similarly, Table 3 lists the Colombian peaks, along with the smallest p values and corresponding odds ratios observed in the European and Combined aTDTs within 500 kb on either side of each Colombian peak. As seen in Tables 2 and 3, European and Colombian trios differ considerably with respect to the genomic regions that show significant association with CL/P.
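The region-level comparison described above, in which the strongest result within 500 kb on either side of a peak SNP is looked up in another sample, can be sketched as follows. This is a minimal illustration only: the record layout (`chrom`, `pos`, `p`, `or`), the function name, and all positions and values are hypothetical and do not come from the study data.

```python
def best_in_window(peak_chrom, peak_pos, results, flank=500_000):
    """Return the record with the smallest p value within +/- flank bp
    of a peak position on the same chromosome, or None if there is none."""
    nearby = [
        r for r in results
        if r["chrom"] == peak_chrom and abs(r["pos"] - peak_pos) <= flank
    ]
    return min(nearby, key=lambda r: r["p"]) if nearby else None


# Hypothetical aTDT results from a second sample (made-up values).
other_sample = [
    {"chrom": "8", "pos": 129_900_000, "p": 1e-4, "or": 1.5},
    {"chrom": "8", "pos": 130_200_000, "p": 1e-6, "or": 1.9},
    {"chrom": "8", "pos": 131_000_000, "p": 1e-9, "or": 2.5},    # outside window
    {"chrom": "21", "pos": 130_000_000, "p": 1e-12, "or": 3.0},  # other chromosome
]

hit = best_in_window("8", 130_000_000, other_sample)
print(hit)
```

Looking up the best result in a window rather than at the peak SNP itself allows for the possibility that a different variant tags the same signal in each sample.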
Previously reported OFC risk loci
Two of the genome-wide significant associations observed in this study, 1p36.13 and 8q24.21, have been previously reported as associated with risk of OFCs by our group and others (Beaty et al. 2010; Birnbaum et al. 2009; Ludwig et al. 2012). The 1p36.13 peak is located 23 kb upstream of the transcription start site of the PAX7 gene. This association was significant only in our European trios, consistent with previous studies suggesting a stronger association in participants of European ancestry compared to other racial/ethnic groups (Leslie et al. 2015).
The 8q24.21 region has been consistently implicated in nearly all previous OFC studies, especially among samples of European ancestry. The lead SNP among Europeans (rs55658222) is in strong linkage disequilibrium (LD) with another SNP, rs987525, in the HapMap European sample. The rs987525 SNP was found to be the lead SNP in this region in several previous GWASs and also showed modest evidence of association and linkage in the Colombian trios (p = 8.609e−06, odds ratio = 1.984, CI = [1.46–2.69]). In the European trios, a suggestive association was observed for an indel located at 9,295,770 bp on chromosome 17, approximately 52 kb centromeric to the NTN1 gene (p = 2.77e−07, odds ratio = 0.29, CI = [0.18–0.48]). No other previously reported OFC variant reached even a suggestive level of significance (suggestive threshold p < 1.0e−05) in our WGS study, which is not unexpected given the smaller sample size of this WGS study compared to published GWASs. Supplementary Table S2 shows the most significant aTDT p values within 500 kb of all previously reported OFC risk variants.
Chromosome 21q22.3 association in the Colombian trios
We observed genome-wide significant associations in the Colombian trios within a 30 kb interval on chromosome 21q22.3 (Fig. 2, top panel). In this sample, the common variants had relatively large estimated odds ratios ranging from 2.33 to 2.48, i.e. the risk alleles were transmitted from parents to the proband offspring more than twice as often as they were not transmitted. The smallest p value was observed at rs2839575 (p = 9.75e−09, odds ratio = 2.48, 95% CI = [1.81–3.45]).
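To make the link between transmission counts and the reported odds ratios concrete, the sketch below estimates a TDT odds ratio as the ratio of transmitted to untransmitted counts and attaches a Wald-type 95% confidence interval on the log scale. The interval method is an assumption made for illustration, since this section does not state how the reported intervals were computed; the function name and counts are likewise hypothetical.

```python
from math import exp, log, sqrt


def tdt_odds_ratio(transmitted: int, untransmitted: int, z: float = 1.96):
    """Return (OR, lower, upper) for an allelic TDT.

    OR is transmitted / untransmitted among heterozygous parents.
    The interval is a Wald interval on log(OR) with standard error
    sqrt(1/transmitted + 1/untransmitted) (assumed method, see lead-in).
    """
    if transmitted <= 0 or untransmitted <= 0:
        raise ValueError("both counts must be positive")
    odds_ratio = transmitted / untransmitted
    se = sqrt(1 / transmitted + 1 / untransmitted)
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
print(tdt_odds_ratio(100, 50))
```

An odds ratio of 2 under this definition means the allele was transmitted twice as often as it was not, which is the sense in which the estimates above are read as increases in transmission.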
A GWAS of a Latino sample from a previous study, the POFC Multiethnic study, reported a suggestive association in this genomic region [see Fig. 1 in Leslie et al. (2016)]. That Latino sample included diverse Hispanic groups from the US, Guatemala, Argentina, and Colombia, including all of the current WGS Colombian trios as well as 129 additional Colombian trios. In that study, the GWASs of Asian and European samples did not show association in this region, nor did the combined GWAS of all the POFC Multiethnic study samples. The fact that the current WGS case–parent trio study yielded a genome-wide significant association with a smaller sample size suggests this association might be unique to Colombians. We explored the validity and implications of this observation through a number of analyses, as described below.
We first examined the aTDT p values for our Colombian WGS trios using their SNP array data from the POFC Multiethnic study. The p values in this region were nearly identical to those observed in our WGS association, confirming the association we observed here was not an artifact of sequencing.
We next investigated whether population substructure within the Colombian parents could have caused the observed association in the WGS data by examining the ancestry principal components (PCs) as well as the results of a quantitative association analysis between the PCA eigenvalues and variants within the peak region (see “Methods” for details). PCA showed no evidence of population substructure (Supplementary Figure S1, Fig. 2a), and no association was observed between the eigenvalues and variants in the chromosome 21q22.3 region (Supplementary Figure S1, Fig. 2b). A positive association between eigenvalues and variants would have indicated that the observed association with CL/P was in reality due to population substructure; because none was found, this association did not appear to be an artifact of population admixture.
We then verified that this region does not show evidence of association in other Latin Americans by reanalyzing imputed genotype data of independent Latino trios from the previously published POFC Multiethnic GWAS study (Leslie et al. 2016). The aTDT p value and corresponding odds ratio at rs2839575 observed in the Colombian subjects were considerably different from those in the Latino sample, and the non-Colombian Latino trios showed no significant association at rs2839575 (Fig. 3a, forest plot). Moreover, the combined set of non-Colombian Latinos resulted in much weaker associations across a 1 Mb region flanking SNP rs2839575 as well as for this SNP itself. The odds ratios at the rs2839575 variant showed an opposite (although non-significant) effect in the non-Colombian Hispanics as compared to Colombians (Fig. 3b, regional p value plot and Supplementary Table S3). We concluded from the stratified aTDT results that this SNP appears to influence OFC risk only in Colombians.
We, therefore, investigated the possibility of ancestry differences between our Colombian sample and the other Latino populations. Ancestry principal components calculated from the POFC Multiethnic SNP genotype data (unrelated individuals only) showed Colombians to be ancestrally distinct from the other Latino populations (Supplementary Figure S1, Fig. 3).
Given that the 21q22.3 association was observed only in the Colombian sample and that the ancestry of Colombians is different from that of the other Latin American samples, we checked whether the absence of an association signal in the other Latin American samples merely reflects differences in MAF rather than differences in the true effects of risk alleles. That is, it is possible that a causal variant exists in all populations but has a considerably higher frequency (or is in LD with a variant of higher frequency) in the Colombians. Given the population history of Colombians, causal OFC variants may have arisen from one particular ancestral group, and such variants may be more frequent (and therefore more informative) among Colombians. The origin of the African ancestry of Colombians is different from that of the other Latino populations (Gouveia et al. 2019). We, therefore, looked at the frequencies of the Colombian risk alleles across different populations. For this analysis, we again turned to genotyped and imputed SNP genotypes from the POFC Multiethnic study. The MAFs of the 30 most significantly associated SNPs within the 21q22.3 peak region in Colombian trios were compared to those in 15 populations defined by country of recruitment from the POFC Multiethnic study. None of these 30 SNPs had a higher MAF among Colombians compared to other Latino populations (Supplementary Figure S1, Fig. 4). Moreover, the 15 most significant SNPs in this peak region had higher allele frequencies in all other population groups (European, African, and Asian) compared to Colombians or other Latinos. Thus, there was no conclusive evidence that population-specific variants contributed to the association signal seen in this study. However, several of these variants had estimated odds ratios between 1.1 and 1.5 in Asians, Europeans, or Africans, suggesting that variants in this region may also increase risk for OFCs in other populations, but at a reduced level.
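The frequency comparison described above amounts to asking, for each associated SNP, whether its MAF in the target population exceeds its MAF in every comparison population. A minimal sketch of that check follows; the SNP identifiers, population labels, and frequencies are invented for illustration and are not values from the POFC Multiethnic study.

```python
def snps_enriched_in_target(maf_by_snp, target, comparison):
    """Return SNPs whose MAF in `target` is higher than in every
    population listed in `comparison`."""
    enriched = []
    for snp, mafs in maf_by_snp.items():
        if all(mafs[target] > mafs[pop] for pop in comparison):
            enriched.append(snp)
    return enriched


# Hypothetical MAFs keyed by SNP and population (made-up values).
mafs = {
    "snp_a": {"Colombia": 0.18, "Guatemala": 0.22, "Argentina": 0.25},
    "snp_b": {"Colombia": 0.30, "Guatemala": 0.21, "Argentina": 0.35},
    "snp_c": {"Colombia": 0.40, "Guatemala": 0.31, "Argentina": 0.33},
}

print(snps_enriched_in_target(mafs, "Colombia", ["Guatemala", "Argentina"]))
```

An empty result from such a check would correspond to the finding reported above, namely that none of the top SNPs were more frequent in Colombians than in the other Latino populations.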
Finally, we tested for effects of rare variants within the Colombian trios using burden and collapsing tests because we observed a number of low-frequency and rare variants with large odds ratios in this region (see “Methods” for the rare variant testing procedure). Common variants with the strongest associations were all intronic variants within the PDE9A gene; however, all had moderate odds ratios around 2.0. In this region, there were 37 SNPs with minor allele frequencies near or below 1% in the Colombian trios and estimated OR > 5 (Supplementary Table S4), consisting mainly of intronic and a few intergenic SNVs (28 intronic, 8 intergenic). The exception was one non-synonymous SNV, rs138007679 in the RSPH1 gene (aTDT odds ratio = 8, 95% CI = [1.001–63.96]), which produces an amino acid change (A > C, leucine to tryptophan according to ClinVar). Alone, this variant does not clearly implicate RSPH1 over other genes in the region, so we performed a rare variant TDT on all non-synonymous variants within the 13 genes falling in this region. None of the individual genes achieved nominal significance (Supplementary Table S5), so this result remained inconclusive. We also carried out rare variant TDTs of intronic and intergenic variants with similar results, finding only nominally significant associations attributable to intergenic, low-frequency variants (MAFs ranging between 0.5 and 1%).
In the absence of any clearly pathogenic variant or gene based on combined effects of rare variants, we examined regulatory elements and protein–protein interaction pathways in this region with respect to craniofacial development. All variants with p values below the suggestive threshold (p < 1.0e−05) were located within the PDE9A gene, which does not have any known role in controlling risk of OFCs. However, the PDE9A gene overlaps a super-enhancer region for craniofacial development identified from histone profiling in early human craniofacial development (Wilderman et al. 2018). Multiple genes in the region, including PDE9A, appear to be actively transcribed during human craniofacial development (Fig. 2). Another gene of interest is UBASH3A, located ~ 220 kb centromeric to this peak signal. The UBASH3A protein was previously shown to physically associate with SPRY2 via a yeast two-hybrid assay (33). SPRY2 has been reported by GWASs of OFC and shown to be required for palatogenesis in mice (Welsh et al. 2007); whether UBASH3A is also expressed in craniofacial structures has not yet been determined.