Introduction

Nonsyndromic orofacial clefts (NSOFCs) constitute the largest proportion of orofacial clefts (OFCs) and are estimated to affect 1.25 in every 1000 live births worldwide1. Due to distinct embryological origins and epidemiological patterns, NSOFCs can be broadly classified into nonsyndromic cleft lip with or without palate (NSCL/P) and nonsyndromic cleft palate only (NCSPO)2. The primary treatment for NSOFCs is surgical to correct structural defects. However, restoring optimal function in affected children requires a multidisciplinary team of orthodontists, maxillofacial surgeons, prosthodontists, otolaryngologists, geneticists, and pediatricians, among others3. In the United States, the annual cost of hospital stays for children with OFCs was over 400 million in 20134. Moreover, the team of experts required for OFC care is often unavailable in resource-limited settings, leading to significant inequalities in cleft management5.

Developing preventive or improved therapeutic strategies for NSOFCs requires a thorough understanding of their etiology. As with many complex traits, the etiology of NSOFCs is multifactorial, with genetic factors playing a considerable role6. According to a recent review, over 60 (> 40 associated with NSCL/P) risk loci have been implicated mainly through common variant association studies7. However, all identified loci/genes are estimated to explain only a small fraction (~ 25% for NSCL/P and even less for NSCPO) of the estimated heritability8. Low frequency/rare variants, gene–gene interactions, and gene-environmental interaction effects may likely explain the missing heritability9.

The role of rare variants in complex traits is well documented10, and a recent study found rare variants to be responsible for a larger proportion of the missing heritability in complex traits etiology11. However, studies evaluating the role of rare variants for NSOFCs have been restricted largely to the resequencing of known cleft candidate genes, limiting the discovery of novel genes. Although resequencing has provided insights into the burden of rare variants in these candidate genes, the small effect sizes often estimated for complex traits, together with the low MAF of rare variants (minor allele frequency [MAF] ≤ 0.01) makes the conventional single variants association analysis underpowered12,13,14. A more efficient approach to rare variant analysis is to select a fixed MAF threshold and conduct an aggregate test on all variants with a MAF below this threshold within a specified region (e.g. gene), either by assigning equal weight to all variants identified or by varying weights based on the estimated variance of each variant under the null hypothesis of no association13. This approach has been applied successfully in discovering genes associated with complex traits like blood pressure, myocardial infarction, and schizophrenia15,16,17.

Attempts at leveraging rare variant aggregation to identify novel genes for NSOFCs are relatively new and have been limited to samples of individuals of European ancestry8,18,19. Additionally, these studies included all the identified rare variants within a gene, an approach that has been shown to be less sensitive due to the inclusion of synonymous variants, which often do not contribute to disease etiology20. To address these limitations in previous studies, we conducted gene-based rare variant aggregation tests, including only the protein-altering rare variants using our African genome-wide association study (GWAS) data. We hypothesized that genes enriched for rare protein-altering variants associated with NSOFCs would contribute to the etiology of NSOFCs. Further, we utilized transcriptomics data to provide additional evidence for the associated genes. The African population, the ancestral origin of modern humans, harbors the greatest number of genetic variations and, thus, provides a considerable opportunity for genetic discoveries21.

Results

Gene-based rare variant results

Our analysis included 21, 829 rare variants (21, 333 missense variants, and 70 loss of function variant) (Supplementary Table 2). We identified thirteen genes with suggestive associations [E−04 for the PA, Bonferroni corrected p-value = 0.05/5784 (9E−06), E−04 for the MO, Bonferroni corrected p-value = 0.05/5676 (9E−06), and 0.05 for the LOFO model, Bonferroni corrected p-value = 0.05/26 (2E−03)] were identified in the African GWAS data. Among the genes showing suggestive associations in the protein-altering (PA) model, ABCB1 and TTC28 were associated with NSCL/P, while TTC12, PDZD8, FCRL4, CENPF, and SLC16A9 were associated with NSCPO. In the NSCL/P sub-group analyses into NSCLO and NSCLP, MASP2 and OR5K1 genes were associated with NSCLO, while EXPH5, CSAD, ALKBH8, and RGL4 were associated with NSCLP. The results were similar for the missense-only (MO) model, with the addition of ABCB1 identified with NSCL (Table 1). None of the genes showed association in the loss-of-function-only (LOFO) models. Furthermore, among the associated genes, three genes (ABCB1, TTC28 and PDZD8 genes) showed significant mutation constraint to missense mutations using the GnomeAD database.

Table 1 Gene-based results for the protein-altering and missense only models.

Gene prioritization using transcriptomics data.

10 genes (ABCB1, TTC28, TTC12, CSAD, EXPH5, SLC16A9, MASP2, ALKBH8, CENPF and PDZD8) of the 13 genes showed expression during the formation of the human face. Most of the genes were biased towards some mesenchymal cells subtype except for the PDZD8 and ABCB1 genes (Fig. 1 and Supplementary Fig. 1). Mouse orthologs for ten of the 13 human candidate genes (ABCB1, ALKBH8, CENPF, CSAD, EXPH5, MASP2, PDZD8, SLC16A9, TTC12, and TTC28) were analyzed using the SysFACE gene expression analysis tool. Eight of the ten genes (ABCB1, ALKBH8, CENPF, CSAD, EXPH5, PDZD8, SLC16A9, and TTC28) showed consistently high expression and enrichment in relevant mouse craniofacial tissues—maxillary, medial and lateral eminence, and palate) (Fig. 2 and Supplementary Fig. 2). Interestingly, these eight genes were also among the 10 genes that showed expression during human face development (Fig. 1). Thus, we prioritized 8 genes: (ABCB1, ALKBH8, CENPF, CSAD, EXPH5, PDZD8, SLC16A9, and TTC28) with relevant human and mouse craniofacial expression. Further, we summarized the previous reports linking the prioritized genes to NSOFCs (Supplementary Table 1).

Figure 1
figure 1

(a) Bubble plots of gene expression for associated genes in each of the major cell types from human craniofacial tissues of CS17 embryos. Size correlates to the percent of cells per cluster which express the gene. Color corresponds to the average expression of that gene within cells of that cluster (Low = blue, high = red). (b) Expression analysis in craniofacial structures during human face development. Heatmap of expressed genes across all samples in the developmental series. Z-scores for normalized gene counts ranging from high (yellow) to low (blue) expression. The dendrogram on the left is the hierarchical clustering of all samples.

Figure 2
figure 2

SysFACE-based expression analysis of candidate genes in mouse facial development. Mouse orthologs for 10 of the 13 human candidate genes were examined using the SysFACE tool that is based on microarray gene expression data from isolated facial tissue in mouse development. Heat-map denotes the relative expression of individual genes at various stages of mouse embryonic (E) or postnatal (P) development in specific facial tissues, namely (A) Mandibular and maxillary columnar epithelium, (B) nasal eminence/prominence, and mandibular and maxillary arch, (C) frontonasal and mandible, (D) maxilla and palate. The intensity of the color in the heat-map (row-wise) is representative of the extent of candidate gene expression based on the average fluorescence signal intensity in the specific tissue. Note that for palate, there were independent datasets for E13.5 and E14.5 and these are denoted as E13.5a, E13.5b, etc. FaceBase datasets generated on Affymetrix Mouse Gene 1.0 ST Array microarray were meta-analyzed for (A,B) and Affymetrix Mouse Genome 430 2.0 Array microarray datasets were meta-analyzed for (C,D).

Discussion

We conducted gene-based rare variant aggregation tests to identify novel candidate genes that could explain the missing heritability for NSOFCs. In total, we identified 13 genes with suggestive associations primarily driven by rare missense variations. Eight genes (ABCB1, ALKBH8, CENPF, CSAD, EXPH5, PDZD8, SLC16A9, and TTC28) showed consistent expression in relevant mouse and human craniofacial tissues during the formation of the face. Further, three genes (ABCB1, TTC28, and PDZD8) were predicted to be intolerant to missense variations.

Using biological plausibility to prioritize loci/genes with suggestive association in GWAS has been previously shown as a valid approach to bypassing the large sample size requirement needed to achieve significant association22. While gene mutation constraint is a good metric for identifying pathogenic genes, it is more commonly seen with a dominant disease-causing gene and may not be informative in other disease models (e.g., recessive). Thus, we prioritized the 8 associated genes (ABCB1, ALKBH8, CENPF, CSAD, EXPH5, PDZD8, SLC16A9, and TTC28) with consistent expression in relevant craniofacial tissues during human and mouse face development. Majority of these genes (TTC28, CSAD, EXPH5, SLC16A9, ALKBH8, and CENPF) showed biased expression towards mesenchymal cells subtypes from human craniofacial tissues. This could point to the importance of mesenchymal cells in palate formation since these genes were associated with either NSCLP or NSCPO and not NSCLO. Moreover, this could also be due to the stage (CS17) of embryonic development at which the facial prominences were harvested. The CS17 stage (~ 7 weeks post fertilization), a period that coincides with the later stages of lip formation and the beginning of palate formation. Furthermore, the lower prevalence of NSCPO compared to NSCL/P suggests a higher burden of variants threshold in NSCPO compared to NSCL/P. This may explain why we identified more genes enriched for rare variants associated with NSCPO/NSCLP compared to NSCLO.

Five genes (ABCB1, TTC28, PDZD8, CENPF, and ALKBH8) have been previously implicated in NSOFCs or diseases presenting with cleft phenotypes. The ABCB1 and TTC28 were associated with NSCL/P while the PDZD8, CENPF, ALKBH8 genes are associated with NSCPO. These genes showed consistently high expression and enrichment in mouse craniofacial tissues especially the palate. The ABCB1 gene is an ATB binding cassette gene that functions to regulate fetal exposure to xenobiotics through the placenta23. Single nucleotide variations in the ABCB1 gene have been reported to increase the risk of NSCL/P23. Additionally, knockout of the ABCB1 gene in mice increases susceptibility to cleft palate24. The TTC28 gene is in the 22q12.2 region and previous case report on patients with microdeletion of this region implicate this gene as potential candidate for pierre robin sequence—a condition that presents with cleft palate25. Furthermore, copy number variations overlapping this gene have been reported in cleft palate patients26. The PDZD8 gene assists with lipid transfer from the endoplasmic reticulum to the endosomes and lysosomes27,28. Burden of variations though not statistically significant have been previously reported in this gene among patients with cleft lip with or without palate29. CENPF is a kinetochore associated protein that colocalizes with the IFT88 (a ciliopathy gene) and compound heterozygous mutations in the CEPNF gene were reported in a human fetus with ciliopathic malformations, including cleft palate30. Mutations in ALKBH8 cause intellectual developmental disorder, autosomal recessive 71, MRT71 (OMIM #618504) in humans. This condition presents with craniofacial dysmorphic features, which include long lips with V-shaped upper lip, macrostomia, and retruded mandible31. Macrostomia and retruded mandible may cause cleft palate by impeding the elevation of palatal shelves during palate formation32.

We identified three potential novel genes (CSAD, EXPH5, SLC16A9) for NSOFCs. The burden of variants in CSAD and EXPH5 were associated with NSCLP while those in the SLC16A9 gene were associated with NSCPO. These genes, except for the EXPH5 gene which showed a suggestive association with NSOFCs in gene-by-sex interaction analysis using the same GWAS data33, lack previous reports of direct association with NSOFCs phenotypes. However, they have been implicated in processes crucial for craniofacial development. The EXPH5 gene, for instance, has been shown to play a role in cell–cell adhesion34; a critical process in craniofacial morphogenesis. The SLC16A9 gene is linked to lipid metabolic traits35; a functionally relevant downstream target of the non-canonical transforming growth factor beta (TGFbeta) signaling—a signaling mechanism involved in face formation. The CSAD gene functions in the biosynthesis of taurine and the role of taurine in organogenesis has been demonstrated in mice36.

In the current study, we did not find associations with previously reported genes from rare variant association studies on NSOFCs in other populations. Similarly, a previous multiethnic gene-based GWAS on NSCL/P showed that different genes were enriched for low-frequency/rare variants associated with NSCL/P across genetically defined ancestral groups (Asians, Europeans, and Latin Americans) without overlap8. Additionally, none of the genes showing suggestive or significant associations in this study, as well as the previous study, were known NSOFCs risk genes. We replicated the ABCB1 gene association which was previously reported for NSCL/P in a common variants’ association study. While some reports have demonstrated the accumulation of rare variants in genes previously identified through common variants association analyses37, suggesting a convergence of common and rare variants in loci/genes associated with NSOFCs phenotypes, other reports suggest that common and rare variants may act through separate loci/genes29. For instance, targeted sequencing of regions surrounding genome-wide significant loci for NSOFCs showed no evidence of rare variants burden in genes/regulatory regions proximal to these loci29. Additionally, rare variants are population-dependent, which could explain why we did not replicate previously reported genes/loci from rare variant association studies in other populations. Furthermore, our tests of association require that we exclude any gene with only one rare variant, even if the same gene harbored common variants. This might have resulted in the omission of genes with contributions from both rare and common variants if participants in our cohort only harbor one rare variant in the gene. Therefore, future studies should consider a model that allows for the incorporation of both common and rare variants.

Our study has some limitations. First, we used array-based genotype data for discovery. This approach means some rare variants were not examined. Second, we controlled for population stratification by adding the top 10 genotype PCs as covariates in our gene-based association regression models. Adjusting for top PCs has been shown to prevent p-value inflation and reduce false positive rates in common variant analysis, but its performance for rare variant analysis remains controversial38. The reason is that rare variants, being newer mutations, reflect a more granular population substructure compared to common variants39. Finally, we restricted our analysis to only the coding (missense and Lof) and splice-altering variants. Rare variants including insertions/deletions in non-coding regulatory regions also contribute to the etiology of NSOFCs; however, defining these regions' analytical units and their functional characterization remains a challenge. Hence, future studies should use WGS data for the gene-based analysis to capture more rare variants and leverage annotated craniofacial enhancers regions to analyze rare variants in non-coding regions.

In summary, we identified 13 genes with suggestive associations in our GWAS data. Among the 13 genes, 3 genes (ABCB1, TTC28 and PDZD8) were predicted to be intolerant to missense variations. Human and mouse transcriptomics data further supported the association of 8 genes. Of the 8 genes, five genes (ABCB1, TTC28, PDZD8, CENPF, ALKBH8) were previously associated with NSOFCs or diseases presenting with cleft phenotypes. The remaining three genes (CSAD, EXPH5, SLC16A9) are potentially novel candidate genes for NSOFCs.

Methods

GWAS study participants

The details of the GWAS participants have been previously published40. Briefly, NSOFCs case children were recruited during surgical repair at cleft clinics and free surgical missions sponsored by Smile Train in Nigeria, Ghana, and Ethiopia. The cleft surgeons at each participating site used a standardized phenotyping protocol (physical examination and clinical photographs) to confirm NSOFCs status. Additionally, echocardiography was used to rule out the presence of congenital heart defects to ensure nonsyndromic status. Control children were those without a birth defect diagnosis attending immunization/welfare clinics at the same center where the case children were recruited. To be eligible to participate in the study, case and control children must have biological parents of African ancestry who reside in Africa. Our sample included 1019 NSOFCs case children and 2150 unrelated control children. Among the cases, 810 had non syndromic cleft lip with or without palate (NSCL/P)—394 non-syndromic cleft lip only (NSCLO), 420 non-syndromic cleft lip and palate (NSCLP), and 205 had non-syndromic cleft palate only (NSCPO). The distribution of the GWAS study participants by cleft status and country of origin is shown in Table 2.

Table 2 Distribution of the GWAS study participants by country of origin and cleft status.

Data collection, DNA extraction, and genotyping

Demographic information (age, sex, and residential location) and limited exposure information were obtained. Saliva specimens were collected using the Oragene saliva kit, de-identified and shipped to the Butali laboratory at the University of Iowa. DNA was extracted from the saliva using the standard Oragene saliva DNA extraction protocol and quantified using Qubit (http://www.invitrogen.com/site/us/en/home/brands/Product-Brand/Qubit.html;Thermo Fisher Scientific, Grand Island, NY). As part of internal QC, Taqman XY genotyping was used for sex confirmation. Subsequently, aliquoted DNA was sent to the Center for Inherited Disease Research (Baltimore, Maryland, USA) for genotyping. The Illumina Multi-Ethnic Genotyping Array MEGA v2 15070954 A2 (genome build 37), which has over 2 million variants including over 60 000 rare variants selected from populations of African origin, was used for the genotyping. Details on genotyping and QC measures have been previously published40.

Data analysis

Gene-based analyses

Variant predication was performed using annotate variation (ANNOVAR) to identify the functional consequences (synonymous, nonsynonymous, and splice-altering) of each variant. Splice altering and non-synonymous variants (variations resulting in either amino acid change or premature termination of the protein) were filtered against the 1000 genome (1 kg) population databases (http://www.1000genomes.org/) to identify rare variants (found in ≤ 1% of Africans included in the database). Additionally, we filtered for only the variants with a MAF ≤ 1% in the controls included in our sample (Supplementary Table 2). Subsequently, genes with two or more rare non-synonymous variants in the data were filtered to satisfy the aggregation requirement for the proposed analyses (Fig. 3). For the analysis, both non-synonymous and splice-altering variants were included and referred to as "protein-altering". Three gene-based collapsing models were used: (1) protein-altering (PA), (2) missense only (MO), and (3) loss of function only (LOFO). The different phenotypes (NSCL/P and NSCPO) were analyzed independently and separately. NSCL/P was further subdivided into nonsyndromic cleft lip only (NSCLO) and nonsyndromic cleft lip and palate (NSCLP). Gene-based rare variant aggregate tests were used to identify genes enriched in rare variants associated with NSOFCs. Three rare variant aggregation tests were conducted: the combined multivariate and collapsing (CMC) test, the sequence kernel association test (SKAT), and the SKAT-O test. The first two tests are complementary, operating under different assumptions41. The CMC test, being a burden test, assumes the same direction of effects for all the variants within a gene, whereas SKAT is a variance component test that allows for an opposing direction of effects41. The SKAT-O test is an omnibus test that is more robust and efficient across different scenarios41. In all three tests, population stratification and sex were controlled for by adding the top 18 principal components (PC) and child sex into the regression-based models. The lack of prior knowledge about the underlying biology of these variants precluded the ability to select one optimal test. Hence, the decision to conduct the three tests where the CMC and SKAT will show rigor and SKAT-O will confirm reproducibility. The Bonferroni correction was used to adjust for multiple testing and set the cut-off for statistical significance at a 5% error rate to 0.05 divided by the number of genes tested and a 102 higher threshold for suggestive significance. A gene was considered significant if it showed a statistically significant or suggestive significant association in either CMC and SKATO or SKAT and SKATO. Rare variant aggregation tests were conducted on our GWAS data using the SKAT R package (https://cran.r-project.org/web/packages/SKAT/index.html) for rare variant association tests implemented under the case–control study design. Further, we used the genomeAD gene mutation constraint prediction tool (https://gnomad.broadinstitute.org/help/constraint) to identify associated genes level of intolerance to mutational changes.

Figure 3
figure 3

(a) Summarized flow chart for the analytical pipeline. (b) The process of sorting for variants within a gene of interest. The left side of the image shows all variants identified within the gene after sequencing. The variants of interest on the right were predicted to result in a change in amino acid or affect splicing by in-silico predictive tools and were included in our analysis. Color codes; Red: Non-synonymous variants in protein-coding regions; Yellow: Synonymous variants (not selected for analysis); Blue: Splice-site variants.

Gene expression analyses

To provide biological insight, expression of the associated genes during organogenesis of the human and mouse faces was examined. The human craniofacial gene expression dataset was generated from extensive bulk mRNA sequencing of human craniofacial tissue between 4 and 8 weeks post-conception. Since gene expressions are time and tissue-specific, the dataset included bulk mRNA sequencing at five time points corresponding to Carnegie stages of the embryonic period (CS13, CS14, CS15, CS17, and CS22), along with a culture model of cranial neural crest cells. This resulted in transcriptomic data in a time series covering nearly one-half of the embryonic period, during which critical events in craniofacial development occur. Additionally, the dataset included single-nuclei RNA-seq of craniofacial prominences of human CS17 embryos, corresponding to approximately 7 weeks post-fertilization, a period that coincides with the later stage of lip formation and the early stage of palate formation. Additional details on the RNA seq data quality control and analysis was reported by Yankee et al.28.

SysFACE analysis was performed as previously described33 using GSE7759, GSE22989, GSE31004, and GSE11400 microarray data (Affymetrix Mouse Genome 430 2.0 Array) and GSE55965 microarray data (Affymetrix Mouse Gene 1.0 ST Array). The datasets were analyzed using affy package in R. Multiple probesets representing individual genes were normalized and the probeset with highest median expression was considered representative of gene expression. WB data generated on Affymetrix Mouse Genome 430 2.0 Array platform as previously described was used for enriched expression analysis.