Background

Childhood asthma is a heterogeneous chronic airway disease with various clinical phenotypes [1, 2]. Its phenotypic and biologic heterogeneity contributes to the challenges clinicians face in its diagnosis and effective management [3]. It is therefore crucial to clearly define subphenotypes of asthma with homogeneous clinical characteristics in order to search for better asthma management and to develop novel therapeutic strategies. Although a large number of clinical phenotypes are often collected in childhood asthma studies, asthma genetic study has been mostly focused on case-control disease status. Such an endpoint-based analysis ignores the complexity of asthma phenotype [4,5,6]. In addition, although there is ample evidence for an intrinsic genetic variation among racial/ethnic populations [7, 8] suggesting possible genetic ancestry variation in childhood asthma, most genetic analyses rely on self-reported race thus do not account for the potential contribution of genetic ancestry to disease variation in diverse populations.

An approach to overcome the phenotypic heterogeneity of childhood asthma is to identify homogeneous subgroups by establishing either classical “endotype”, based on experts’ criteria, or statistical phenotype clustering on asthma clinical phenotypes. The latter has been successfully applied to identify clinically relevant subgroups of asthmatics and other airway diseases [9,10,11,12,13,14,15,16,17]. However, these studies differ in some key elements: variation in phenotyping, analytical approaches used and the patient population under study. These differences limit the comparability of the identified subphenotypes and pose difficulty in applying clustering results to individual patients. Furthermore, little is understood regarding the genetic ancestry of the identified subphenotypes.

The objective of the study is to investigate childhood asthma phenotypic heterogeneity and genetic ancestry variations and their relationships. Specifically, we used childhood asthma data from the NIH controlled database of Genotype and Phenotype (dbGaP) to identify homogeneous subphenotypes, determine clinical phenotypes, estimate subphenotype-specific genetic ancestry, and analyze the relationship between ancestry and subphenotypes using a stepwise approach incorporating cluster analysis, classification tree analysis, and genetic ancestry analyses [9,10,11,12,13,14,15,16, 18, 19]. Our goal is to combine both cluster and genetic ancestry to identify biologically-relevant subphenotypes in childhood asthma.

Methods

Data

The database of Genotypes and Phenotypes (dbGaP) is the repository for both genotype and phenotype data from most NIH-funded GWAS and other whole-genome or exome sequence data. We used baseline data from the SNP Health Association Resource (SHARe) Asthma Resource Project (SHARP) (phs000166.v2.p1), the National Heart, Lung, and Blood Institute’s clinical research trials on asthma, specifically, the Childhood Asthma Management Program (CAMP) and the Childhood Asthma Research and Education (CARE) network. The CAMP is a multicenter, randomized, double-masked clinical trial designed to determine the long-term effects of three inhaled treatments for mild to moderate childhood asthma [20]. The CARE data evaluates current and novel therapies and management strategies for children with asthma. Individual level data with asthma diagnosis is available for 1211 subjects through Authorized Access, including 813 in CAMP and 398 in CARE.

An array of phenotypic variables have been harmonized across the CAMP and CARE datasets, including demographics and participant characteristics; intermediate asthma phenotypes such as lung function, skin prick test (SPT), serum total immunoglobulin (IgE), and atopic dermatitis (AD), as well as environmental exposure. See Table 1 for a complete list of variables.

Table 1 Demographic, clinical phenotypes and environmental exposures of CAMP and CARE study participants

We downloaded CAMP and CARE genotype data which were performed using 1 million single nucleotide polymorphisms (SNPs) in the Affymetrix 6.0 chip and stored in the database of dbGaP (accession number phs000166.v2.p1). Quality control criteria included minor allele frequency ≥ 0.05, Hardy-Weinberg equilibrium (p ≥ 10− 5), ≤ 5% missing rate per person, ≤ 5% missing rate per SNP, families with less than 5% Mendel errors and SNPs with less than 10% Mendel error rate [21].

Hierarchical cluster analysis (HCA)

HCA is a hypothesis free statistical method to group subjects into relatively homogeneous sub-clusters according to similarity quantification based on a set of critical characteristic variables. The grouping is constructed such that the similarity is strong between members of the same cluster and weak between members of different clusters. The baseline phenotypic measures listed in Table 1 were included in the cluster analysis. To reduce collinearity, we examined the variables for absolute correlation (> 0.80). We also assessed missing pattern of the phenotypes and planned to exclude measures with ≥10% missingness from the analysis. Blood eosinophils (EOS) and IgE were log transformed.

Since we have mixed types of variables, i.e., continuous and categorical, Gower’s distance [22] was used as a similarity index. To avoid inconsistent cluster solutions due to changes in scale of the variables and heavy impact of variables with larger standard deviations, Gower’s standardization, based on the range, was applied. HCA was then carried out with Ward’s minimum-variance method [23]. Consensus between a pseudo F and a pseudo t2 statistics [24, 25] was used to select the number of clusters. The number of clusters was also guided by clinical characteristics in addition to statistical considerations.

Descriptive statistics of all variables were obtained and compared across clusters using analysis of variance, Kruskal-Wallis, or Chi-square tests as appropriate. Conditional inference trees [26], a non-parametric class of regression trees that embeds tree structured regression models into a well-defined theory of conditional inference procedures, was used to identify intermediate phenotypes that distinguish the subphenotypes. The cluster analysis was first carried out on the CAMP data and repeated on the CARE data. Replication of the clustering results was examined between the two studies as well as with previously published studies.

Additional analyses were run to investigate if the subphenotypes were associated with clinical outcomes. Two clinical outcomes were examined, number of prednisone bursts (an anti-inflammatory oral steroid medication) since last visit, and number of ER visit or hospitalizations since last visit. Number of prednisone bursts since last visit was modeled as a count variable using Poisson regression with a random subject effect. Number of ER visit or hospitalizations since last visit was dichotomized (given over 95% of the subjects did not had an ER visit or hospitalization), and modeled using a logistic regression with a random subject effect. Potential covariates included age, sex, race, visit month, time since last visit, treatment, and subphenotypes that were significantly associated with the outcome (adjusted p-value < 0.05). All analyses were run for CAMP and CARE data separately. All the above analyses were conducted in SAS version 9.3 (SAS Institute Inc., Cary, NC, USA) and R [27].

Genetic ancestry analysis

Genetic ancestry was estimated using both genome-wide SNPs and asthma-specific GWAS SNPs for African-American asthmatic individuals in CAMP and CARE. Supervised approach in the ADMIXTURE software program [28] was use to estimated global genetic ancestry, where SNP data of 108 YRI (Yoruba in Ibadan, Nigeria) and 99 CEU (Utah Residents (CEPH) with Northern and Western Ancestry) individuals from the 1000 Genomes Project were included as surrogates for European and African ancestry. The reference populations and the CAMP/CARE subjects shared 857,127 genetic markers across all autosomes, which reduced to 225,374 SNPs after linkage disequilibrium (LD) pruning with window of 50 (kb), 10 kb window shift and a r2 value of 0.2.

Asthma GWAS SNPs, 157 in total, were retrieved from the GWAS catalog [29] and STRUCTURE software [30] was used to estimate African ancestry proportion at asthma GWAS SNPs. CEU and YRI individuals from the 1000 Genomes Project were used as parental populations.

Correlations between genetic ancestry and the subphenotypes derived by clustering and the discriminate factors of the subphenotypes were examined using the Kruskal-Wallis test, Wilcoxon rank-sum test, Spearman correlation coefficient, or linear regression as appropriate.

Results

Participants from CAMP and CARE were different except in sex, exposure to in utero smoking, PC20, and FEV1/FVC ratio (Table 1). All pairwise Spearman correlation coefficients were less than 0.60, except between FEV1 percent predicted and FVC percent predicted (0.71) and between FEV1/FVC and maximum bronchodilator percent change (− 0.65). No variables had more than 10% of missing values.

HCA identified distinct subphenotypes

Clustering on CAMP cohort identified distinct subphenotypes

Three clusters were identified from CAMP data (Table 2). Members of cluster 1 had a moderate AD rate (15.3%) and all but one had negative SPT (99%). This group also had the lowest age at baseline, age at onset of asthma, bronchodilator percent change, EOS, IgE level, AM peak flow, and AM symptoms, and highest body mass index z-sore (BMIZ), PC20, FEV1 percent predicted, and FEV1/FVC ratio. All these characteristics, but BMIZ and AM symptoms, were statistically different across the clusters at a significant level of 0.05. This is the moderate AD group with negative SPT and preserved lung function.

Table 2 CAMP hierarchical clustering results

Members of cluster 2 had a high rate of AD (97.7%) and all had one or more positive SPT. This group also had the highest EOS and IgE level, and lowest bronchodilator percent change among the 3 clusters. This is the high AD group with positive SPT and airway hyperresponsiveness.

Members of cluster 3 had the highest age at baseline and age onset of asthma and lowest BMIZ. This group had also the lowest FEV1 percent predicted and FEV1/FVC ratio, and highest AM symptoms. Furthermore, members of cluster 3 were mostly AD free and all had one or more positive SPT, moderate EOS and IgE levels, but lower lung function measures and higher AM symptoms compared to the other clusters. This is the low AD group with positive SPT and lower lung function.

Clustering on CARE cohort replicated the subphenotypes identified in CAMP

Three clusters were identified in CARE (Table 3). Members of cluster 1 had a moderate rate of AD (35%) and none of them had a positive SPT. This group also had the lowest bronchodilator percent change, EOS, IgE, AM peak flow, and lowest AM symptoms. All these characteristics, but the last, were statistically different across the clusters at a significant level of 0.05. This is the moderate AD group with negative SPT and preserved lung function similarity identified in CAMP.

Members of cluster 2 had a high rate of AD (98.4%) and one or more positive SPT (95.3%). This group also had the highest EOS and IgE level among the 3 clusters. This is the high AD asthma group with positive SPT and airway hyperresponsiveness similarly identified in CAMP.

Members of cluster 3 had the highest age at baseline and age onset of asthma, were mostly AD free (3.3%) and all had one or more positive SPT (92.2%), had moderate EOS and IgE levels, but higher AM symptoms compared to the other clusters. This is the low AD group with positive SPT and lower lung function similarly identified in CAMP.

Table 3 CARE hierarchical clustering results

Atopic dermatitis status and SPT distinguished the subphenotypes

Conditional inference trees analysis revealed that, in both CAMP and CARE data, AD and one or more positive SPT were the top two variables that best discriminated the individuals into the subphenotypes (Fig. 1, prediction accuracy 95.8%). Given the consistent findings across CAMP and CARE data, we combined the two datasets and grouped the three clusters individually identified in CAMP and CARE into three subphenotypes. One subphenotype was the moderate AD group with negative SPT and preserved lung function (subphenotype 1, n = 158), one was the high AD group with positive SPT and airway hyperresponsiveness (subphenotype 2, n = 300), and one was the low AD group with positive SPT and lower lung function (subphenotype 3, n = 753).

Fig. 1
figure 1

Conditional inference tree analysis of the three subphenotypes. SPT and atopic dermatitis are the top two factors distinguishing the subphenotypes. The prediction accuracy is 95.8%

Subphenotypes were associated clinical outcomes

Table 4 shows the association between the subphenotypes and clinical outcomes. In CAMP data, the incident rate of prednisone bursts since last visit for subphenotype 2 is 2.63 (1.45, 2.70) times the incident rate for subphenotype 1, and the incident rate of prednisone bursts since last visit for subphenotype 3 is 2.04 (1.56, 2.70) times the incident rate for subphenotype 1. Also in CAMP data, the odds of any ER visit or hospitalizations since last visit for subphenotype 3 is 1.54 (1.01, 2.23) times the odds for subphenotype 1. For CARE data, the odds of any ER visit or hospitalizations since last visit for subphenotype 2 is 0.32 (0.13, 0.98) times the odds for subphenotype 1, and the odds of any ER visit or hospitalizations since last visit for subphenotype 3 is 3.45 (1.47, 7.69) times the odds for subphenotype 2.

Table 4 Association between subphenotypes and number of prednisone bursts and any ER visit or hospitalizations since last visit

Genetic ancestry proportion varied at asthma GWAS SNPs among asthma subphenotypes

The three subphenotypes had 15, 49, and 103 African American individuals, respectively. Global African ancestry proportion varies from 71.2 to 100% with mean 96.6% and standard deviation (SD) 7.2%. Higher global African ancestry was associated with AD (mean ± SD of African origin is 0.96 ± 0.08 for AD free vs. 0.98 ± 0.06 for AD subjects, p-value = 0.0294), but not with other clinical phenotypes. Proportion of African ancestry at asthma GWAS SNPs was correlated with the subphenotypes (mean 64.9, 89.4 and 94.4% for subphenotypes 1, 2, and 3, respectively, p-value < 0.0001, Figs. 2 and 3(a)). The subphenotypes were associated with lung function: FEV1 percent predicted is 96.8 ± 14.1, 95.3 ± 13.9, and 93.9 ± 13.7 (p-value = 0.0083); and FEV1/FVC ratio is 81.9 ± 7.6, 80.5 ± 8.1, and 79.0 ± 8.2 (p-value < 0.0001) for subphenotypes 1, 2, and 3, respectively. Furthermore, African ancestry at asthma GWAS SNPs was inversely associated with AD (median 0.95 with IQR (0.93, 0.95) for AD free vs. 0.92 (0.89, 0.94) for AD subjects, p-value < 0.0001, Fig. 3(b)). Additionally, genetic ancestry at asthma GWAS SNPs was associated with positive SPT with median and interquartile range (IQR) 0.94 (0.92, 0.95) for positive SPT individuals vs. 0.74 with IQR (0.59, 0.78) for negative SPT individuals (p-value < 0.0001, Fig. 3(c)). The odds of one or more positive SPT was 6.3 higher (95% confidence interval: (3.4, 13.8), p-value < 0.0001) with each additional 10% of African origin at asthma GWAS SNPs. African origin at asthma GWAS SNPs was also associated with IgE levels (Spearman correlation coefficient = 0.27, p-value = 0.0004) and IgE was 1.5 fold higher with each additional 10% of African origin (Fig. 3(d)).

Fig. 2
figure 2

Population ancestry estimates of African American asthmatic individuals in CAMP and CARE at asthma GWAS SNPs by subphenotypes. Dashed lines indicate average proportions of African ancestry proportion at the asthma GWAS SNPs. Ibadan, Nigeria (YRI) and northern and western European (CEU) from the 1000 Genomes project were used as parental populations

Fig. 3
figure 3

Boxplots and scatterplot of proportions of African ancestry at the asthma GWAS SNPs by: a subphenotypes, b Atopic dermatitis status, c SPT, and d IgE levels

Discussion

Current clinical practice in childhood asthma treatment tends to use average patient care strategies. Such a “one size fits all” treatment approach faces major challenges when it is becoming clearer that childhood asthma is heterogeneous in pathogenesis. Our unbiased cluster and genetic ancestry analyses pointed toward three distinct phenotypic clusters with differences in clinical characteristics, genetic ancestry, and clinical outcomes, underscoring the clinical and genetic heterogeneity of asthma [10, 13, 17, 31]. Previous studies have also identified clusters with atopic or non-atopic asthma, clusters with preserved or lower lung function, and clusters with mild asthma [13, 14, 32]. It is reassuring that the two independent studies replicated the clustering results and there are similarities with previous clustering-based childhood asthma subphenotypes.

We determined genetic ancestry [33] using genome-wide SNPs and asthma GWAS SNPs for African-American asthmatic individuals in CAMP and CARE data. Our estimate of African global ancestry in asthmatic children is higher than what has been reported in different general populations confirming the higher prevalence of asthma in individuals with higher African ancestry than others. Our results showed that genetic ancestry at asthma GWAS SNPs differed between the childhood asthma subphenotypes and was associated with lung function, SPT, IgE levels, and AD. Previous studies have also showed association between genetic ancestry and asthma prevalence and related clinical phenotypes [34,35,36,37,38,39,40,41,42]. To our best knowledge, our study is the first to show the association between genetic ancestry at asthma GWAS SNPs and cluster-based subphenotypes in childhood asthma. Leveraging ancestry and cluster analyses to derive genetic and phenotypic homogeneity subgroups in childhood asthma demonstrates the utility of these approaches to characterize and understand the complexity of asthma towards individual based precision medicine strategies.

This study demonstrates that genetic ancestry at asthma GWAS SNPs is more strongly associated with asthma subgroups sharing similar clinical characteristics compared to broadly defined asthma. The results suggest that validation of genetic studies are more likely to be successful for replication studies carried-out in more homogeneous asthma cohorts (sharing similar clinical characteristics) compared to the multifactorial case-control status. In addition, the results indicate that ancestry-specific genetic loci of asthma are likely to be found by focusing on better defined asthma patients. Furthermore, genetic ancestry analysis in homogeneous asthma subgroups is suitable to refine the biological role of asthma susceptibility variants from GWAS studies in a given phenotype. For example, SNPs at STARD3/PGAP3 are strongly associated with the high atopic dermatitis subgroup suggesting that STARD3/PGAP3 may act on the allergic component of asthma [43]. Another example is that ORMDL3/17q locus is associated with asthma in multiple studies in the European ancestry but not in African ancestry asthmatic individuals [44]. We also investigated associations between asthma GWAS SNPs with the identified subphenotypes in CAMP and CARE data (methods and results in Additional file 1: Table S1). Several significant associations were identified at p = 0.05, but none after multiplicity adjustment, possibly due to small sample size and limited statistical power.

Our study had several limitations. First, participants in CAMP and CARE represent studies of childhood asthma, thus the results herein may not be applicable to adulthood asthma. Second, although we identified clinically relevant subphenotypes of childhood asthma using clinical phenotypes [45], the integration of this result with molecular and physiologic phenotyping may help to better understand childhood asthma pathogenesis for possibly more personalized therapeutic strategies. Furthermore, subgroup analyses of asthma may limit sample sizes and impair statistical power. However, given asthma is a highly heterogeneous phenotype, studying homogeneous subgroups of asthma patients not only recovers power limitation, but achieves more statistically significant results. Classifying asthma patients in more homogenous groups may help us to identify new susceptibility or modifying subphenotype-specific genes. Our ability to better define subtypes might help to predict who may respond to treatment vs subjects who may not. Future studies need to elucidate the mechanisms that distinguish each ancestral and clinical clusters to facilitate the development of targeted therapies and providing personalized treatments.

The present study has notable strengths. First, we were able to dissect childhood asthma heterogeneity into subphenotypes using cluster analysis of clinical phenotypes in one study and replicate the findings in an independent study. Second, we were able to show associations between the identified subphenotypes with asthma clinical outcomes. Third, analysis of genetic ancestry at asthma GWAS SNPs in childhood asthma clinical phenotypes provide biologically relevant subphenotype-specific results. Lastly, our study used a more accurate and direct assessment of genetic ancestry instead of self-reported race to determine the relationship between ancestry and childhood asthma subphenotypes and relevant clinical phenotypes. Studies have shown that people with the same self-reported race could have drastically different levels of genetic ancestry, and self-reported race may not be as accurate as direct assessment of genetic ancestry in predicting treatment outcomes [33]. Future studies to identify genetic ancestry-specific variants associated with a specific subphenotype are important as we move towards applying precision medicine paradigm. The finding indicates that African genetic ancestry at asthma GWAS SNPs are differentially associated with the asthma clinical subphenotypes. Unraveling the reasons why individuals with higher African origin at asthma GWAS SNPs had higher IgE level or rate of positive SPT is necessary to determine the potential clinical applications of our findings. In addition, genetic analysis based on more refined phenotypes may increase the statistical power and allow for the detection of population structure-specific phenotype-genotype associations that are undetectable otherwise.

Conclusions

In conclusion, through our systematic clinical phenotype analysis, we identified distinct subphenotypes for childhood asthma using cluster analysis. Further genetic ancestry analysis showed correlations between African ancestry at asthma GWAS SNPs and childhood asthma subphenotypes and related clinical outcomes. Our results demonstrated that cluster analyses on clinical phenotypes followed by ancestry analysis can enhance the understanding of the phenotypic and genetic heterogeneity of childhood asthma. Our approach is distinct from previous efforts in that we developed cluster-based subphenotype and applied genetic ancestry analysis to define subphenotype-ancestry relationships which can be subsequently used as the basis of genetic ancestry based clinical risk prediction. Our findings suggest that defining asthma heterogeneous subgroups on the basis of clinical phenotypes and genetic ancestry proportion is an essential step to understand and refine patient subsets and enable more targeted therapy.