Comparison of hypothesis- and data-driven asthma phenotypes in NHANES 2007–2012: the importance of comprehensive data availability
Half of the adults with current asthma among the US National Health and Nutrition Examination Survey (NHANES) participants could be classified in more than one hypothesis-driven phenotype. A data-driven approach applied to the same subjects may allow a more useful classification compared to the hypothesis-driven one.
To compare previously defined hypothesis-driven with newly derived data-driven asthma phenotypes, identified by latent class analysis (LCA), in adults with current asthma from NHANES 2007–2012.
Adults (≥ 18 years) with current asthma from the NHANES were included (n = 1059). LCA included variables commonly used to subdivide asthma. LCA models were derived independently according to age groups: < 40 and ≥ 40 years old.
Two data-driven phenotypes were identified among adults with current asthma in both age groups. The proportions of the hypothesis-driven phenotypes were similar between the two data-driven phenotypes (p > 0.05). In both age groups, Class A (< 40 years: n = 285, 75%; ≥ 40 years: n = 462, 73%) was characterized by a predominance of highly symptomatic asthma subjects with poor lung function, compared with Class B (< 40 years: n = 94, 25%; ≥ 40 years: n = 170, 27%). Inflammatory biomarkers, smoking status, presence of obesity and hay fever did not markedly differ between the phenotypes.
Both data- and hypothesis-driven approaches using clinical and physiological variables commonly used to characterize asthma are suboptimal to identify asthma phenotypes among adults from the general population. Further studies based on more comprehensive disease features are required to identify asthma phenotypes in population-based studies.
Keywords: Asthma; Phenotypes; Population-based study; Unsupervised analysis
ATS/ERS: American Thoracic Society/European Respiratory Society
AwCOPD: asthma with concurrent COPD
AwObesity: asthma with obesity
BMI: body mass index
COPD: chronic obstructive pulmonary disease
FeNO: fraction of exhaled nitric oxide
FEV1: forced expiratory volume during the first second
FVC: forced vital capacity
LCA: latent class analysis
LLN: lower limit of normal
NHANES: National Health and Nutrition Examination Survey
Airways diseases, such as asthma and chronic obstructive pulmonary disease (COPD), comprise a heterogeneous set of subtypes with different underlying pathophysiological mechanisms [1, 2, 3]. Both hypothesis-driven and data-driven methods can be used to classify patients into sub-groups of airways diseases [4, 5, 6].
The hypothesis-driven approach classifies airways diseases based on pre-defined criteria following immunopathology concepts and asthma literature, while in data-driven methods no prior disease classification is required [7, 8]. Data-driven approaches have provided insights into “novel” phenotypes of complex disease pathogenesis, suggesting disease stratification depending on the individual pathophysiologic characteristics [8, 9, 10, 11].
Most studies on asthma phenotyping using data-driven methods emphasize patients with moderate to severe asthma and/or clinically-based settings [12, 13, 14, 15]. Therefore, the generalization to the general asthma population may be limited.
Different types of data-driven methods have been widely used in airway diseases, such as hierarchical clustering, partitioning methods, and latent class analysis (LCA). Notably, LCA appeared to account better for the heterogeneity of airways symptoms than other commonly used data-driven approaches (e.g. partitioning around medoids). Moreover, latent class assignments developed from a national data source have previously demonstrated higher degrees of generalizability.
Recently, we reported a significant overlap between five distinct hypothesis-driven asthma phenotypes in adults from the general population included in the US National Health and Nutrition Examination Survey (NHANES). We emphasized that a combination of clinical information and biomarkers, using a more comprehensive data analysis approach, such as data-driven methods, could provide a better taxonomy of non-severe asthma.
In this study, we aimed to compare previously defined hypothesis-driven asthma phenotypes with data-driven asthma phenotypes derived by applying LCA to a sample of adults representative of the US general population.
Study setting and participants
We included subjects who participated in the NHANES study, a nationally representative survey of the civilian, non-institutionalized US population performed with the aim of gathering data regarding health and nutritional status. Protocols were approved by the National Center for Health Statistics Research Ethics Review Board and all participants gave written informed consent. Detailed information can be found in the NHANES documentation (www.cdc.gov/nchs/nhanes.htm).
Data from three NHANES surveys were used (n = 30,442). We included adults (≥ 18 years old) with current asthma (n = 1059), defined by a positive answer to the questions “Has a doctor ever told you that you have asthma?” together with “Do you still have asthma?”, and either “wheezing/whistling in the chest in the past 12 months” or “asthma attack in the past 12 months.”
Anthropometric and demographic characteristics, such as age, gender, body mass index (BMI), and smoking status, were analysed, as well as blood eosinophil (B-Eos) count, fraction of exhaled nitric oxide (FeNO) and spirometric parameters. FeNO measurement and spirometry were performed following ATS/ERS recommendations [19, 20]. Basal predicted values of forced expiratory volume during the first second (FEV1) and forced vital capacity (FVC) were calculated [21, 22] and abnormal values were defined as being below the lower limit of normal (LLN).
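The inclusion rule above is a conjunction of questionnaire answers, which can be sketched as a small predicate. This is only an illustration of the stated definition: the class and field names are hypothetical and do not correspond to NHANES variable codes, and missing or refused answers are assumed to have been handled beforehand.

```python
from dataclasses import dataclass


@dataclass
class AsthmaAnswers:
    """Hypothetical container for the questionnaire items used in the definition."""
    age_years: int
    ever_told_asthma: bool    # "Has a doctor ever told you that you have asthma?"
    still_have_asthma: bool   # "Do you still have asthma?"
    wheeze_past_12m: bool     # wheezing/whistling in the chest in the past 12 months
    attack_past_12m: bool     # asthma attack in the past 12 months


def has_current_asthma(a: AsthmaAnswers) -> bool:
    """Adult with doctor-diagnosed, persisting asthma plus wheeze or an attack in the past year."""
    return (
        a.age_years >= 18
        and a.ever_told_asthma
        and a.still_have_asthma
        and (a.wheeze_past_12m or a.attack_past_12m)
    )
```

For example, `has_current_asthma(AsthmaAnswers(45, True, True, False, True))` evaluates to `True`, whereas a subject who reports a past diagnosis but no longer has asthma is excluded.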
Hypothesis-driven asthma phenotypes
Reported smoking status, presence of obesity and inflammatory markers were used to define five asthma phenotypes: B-Eos-high asthma, if B-Eos ≥ 300/mm3; FeNO-high asthma, if FeNO ≥ 35 ppb; B-Eos&FeNO-low asthma, if B-Eos < 150/mm3 and FeNO < 20 ppb; asthma with obesity (AwObesity), if BMI ≥ 30 kg/m2; and asthma with concurrent COPD (AwCOPD), if subjects had self-reported chronic bronchitis/emphysema with age of diagnosis ≥ 40 years and were either current or ex-smokers (ever smoked). Subjects were considered as “non-classified” if they did not meet the criteria for any of the defined asthma phenotypes. Additionally, to account for individuals with probable co-existence of asthma and COPD and to minimize age as a confounding variable, we conducted the analysis considering two age groups: < 40 and ≥ 40 years old.
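Because the five definitions are not mutually exclusive, a subject can receive several labels at once. The rules can be sketched as follows; the function and argument names are hypothetical, the cut-offs are those given above, and complete data for every input is assumed.

```python
from typing import List, Optional


def hypothesis_driven_phenotypes(
    b_eos_per_mm3: float,
    feno_ppb: float,
    bmi_kg_m2: float,
    ever_smoker: bool,
    bronchitis_emphysema_dx_age: Optional[float],
) -> List[str]:
    """Return every hypothesis-driven phenotype whose criteria the subject meets.

    bronchitis_emphysema_dx_age is None when no chronic bronchitis/emphysema
    was self-reported.
    """
    labels: List[str] = []
    if b_eos_per_mm3 >= 300:
        labels.append("B-Eos-high")
    if feno_ppb >= 35:
        labels.append("FeNO-high")
    if b_eos_per_mm3 < 150 and feno_ppb < 20:
        labels.append("B-Eos&FeNO-low")
    if bmi_kg_m2 >= 30:
        labels.append("AwObesity")
    if (
        bronchitis_emphysema_dx_age is not None
        and bronchitis_emphysema_dx_age >= 40
        and ever_smoker
    ):
        labels.append("AwCOPD")
    # Subjects meeting none of the criteria fall into the residual group.
    return labels or ["non-classified"]
```

Returning a list rather than a single label makes the overlap explicit: a subject with B-Eos of 350/mm3, FeNO of 40 ppb and BMI of 32 kg/m2 is assigned three phenotypes simultaneously.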
Data-driven asthma phenotypes
LCA was used to identify asthma phenotypes in an unsupervised manner (data-driven approach). Two models for “current asthma” were developed (Additional file 1: Table S1): Model 1 was based on the 4 variables previously used to define the hypothesis-driven asthma phenotypes (BMI ≥ 30 kg/m2, ever-smoking status, FeNO ≥ 35 ppb, B-Eos ≥ 300/mm3); in Model 2, we added to these 4 variables sex, early asthma onset (< 16 years old), wheezing-related questions (presence/absence of at least one wheezing attack, wheezing with exercise, sleep disturbance by wheezing, activity limitation by wheezing, absenteeism due to wheezing), asthma-related emergency department (ED) visit in the previous 12 months, FEV1/FVC < LLN, FEV1 < LLN, and self-reported hay fever.
Additionally, to explore the results in different “asthma populations”, we developed two other models using similar variables. For the “ever asthma” subgroup (Model 3) we included subjects with a positive answer to “Has a doctor ever told you that you have asthma?” (n = 2611); and for the “difficult asthma” subgroup (Model 4) we included subjects with poor asthma-related outcomes, defined as current asthma plus at least one of the following: asthma-related ED visit, FEV1 < LLN, or oral corticosteroid use in the past 30 days (n = 673) (Additional file 1: Table S1).
Latent class models were derived independently for each age group, using the same variables, and a secondary analysis without stratifying by age was done on the three asthma subgroups. The most appropriate number of classes was determined by examining commonly used criteria. Further methodological details are found in Additional file 1.
All analyses considered the complex multistage sampling and 6-year sampling weights provided by the NHANES documentation. LCA was performed with Mplus (version 6.12), which accounts for the complex survey design of NHANES when fitting LCA models. All other analyses were performed in Stata/IC 15.1 (StataCorp, College Station, TX, USA). A p-value < 0.05 was considered statistically significant.
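For reference, the textbook latent class model for dichotomous indicators can be written as below. This is the standard formulation, not a reproduction of the Mplus specification used here; the notation is ours, and we assume every indicator enters as a binary variable and that indicators are conditionally independent given class membership.

```latex
% y = (y_1, ..., y_J): binary indicators for one subject
% C: number of latent classes; \pi_c: class prevalence, \sum_c \pi_c = 1
% \rho_{jc}: probability that indicator j is present in class c
\[
P(\mathbf{y}) \;=\; \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J}
  \rho_{jc}^{\,y_j}\,\bigl(1-\rho_{jc}\bigr)^{1-y_j}
\]
% Posterior probability of class c, used to assign each subject to a class
\[
P(c \mid \mathbf{y}) \;=\;
  \frac{\pi_c \prod_{j=1}^{J} \rho_{jc}^{\,y_j}\,\bigl(1-\rho_{jc}\bigr)^{1-y_j}}
       {P(\mathbf{y})}
\]
```

Subjects are typically assigned to the class with the highest posterior probability, and the estimated item probabilities per class are what characterize each data-driven phenotype.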
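The text does not name the criteria used to select the number of classes; the Akaike and Bayesian information criteria are two commonly used examples and are sketched here purely for illustration. The function names are hypothetical, the parameter count assumes an unconstrained model with only binary indicators, and the log-likelihood would come from the fitted model.

```python
import math


def lca_free_parameters(n_classes: int, n_binary_items: int) -> int:
    """Free parameters of an unconstrained LCA with binary indicators:
    (C - 1) class prevalences plus C * J item-response probabilities."""
    return (n_classes - 1) + n_classes * n_binary_items


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion; lower values indicate a better trade-off."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood: float, n_params: int, n_subjects: int) -> float:
    """Bayesian information criterion; penalizes parameters more as n grows."""
    return -2.0 * log_likelihood + n_params * math.log(n_subjects)
```

In practice, models with an increasing number of classes are fitted and the solution with the lowest criterion value is usually preferred, alongside interpretability of the classes. If the Model 2 variable list is read as 15 dichotomous indicators, a two-class solution has `lca_free_parameters(2, 15)` = 31 free parameters.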
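As a minimal sketch of how pooled weights enter a point estimate: NHANES analytic guidance for combining three 2-year cycles is to divide each 2-year weight by the number of cycles, and a weighted proportion is then a ratio of weight sums. The function names are hypothetical, and the choice between interview and examination weights is not specified here. Standard errors additionally require the strata and primary sampling units and are not covered.

```python
from typing import Sequence


def six_year_weight(two_year_weight: float, n_cycles: int = 3) -> float:
    """Rescale a 2-year NHANES weight for an analysis pooling n_cycles cycles."""
    return two_year_weight / n_cycles


def weighted_proportion(flags: Sequence[bool], weights: Sequence[float]) -> float:
    """Weighted share of subjects whose flag is True (point estimate only)."""
    if len(flags) != len(weights):
        raise ValueError("flags and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("sum of weights must be positive")
    return sum(w for f, w in zip(flags, weights) if f) / total


# Toy illustration with made-up weights, not NHANES data.
pooled = [six_year_weight(w) for w in (3000.0, 6000.0, 3000.0)]
share_obese = weighted_proportion([True, False, True], pooled)
```

With these toy weights, two of three subjects carry the flag but they represent only half of the weighted population, which is why weighted and unweighted proportions can differ.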
We included 1059 adults with current asthma. The weighted proportions of the previously defined hypothesis-driven asthma phenotypes, according to age group (< 40 and ≥ 40 years old), were, respectively: 42% and 53% with AwObesity; 34% and 37% with B-Eos-high asthma; 26% and 21% with B-Eos&FeNO-low asthma; 18% and 19% with FeNO-high asthma; and 19% with AwCOPD in the older group. In addition, 17% and 12% of the individuals in the < 40 and ≥ 40 years old groups, respectively, were categorized as “non-classified”.
Proportions of each variable according to the LCA-classes identified in Model 2 (subjects with current asthma, n = 1059)
Additionally, LCA identified 2 classes in the models for “ever asthma” and “current asthma” without stratifying by age, but no subgroups were identified in the “difficult asthma” sub-analysis (Additional file 1: Table S1).
This was the first study comparing previously defined hypothesis-driven asthma phenotypes with data-driven ones in a sample representative of the US general population. The proportions of the hypothesis-driven phenotypes were similar between the two data-driven phenotypes obtained by LCA using clinical and physiological variables commonly used to characterize asthma.
Previous studies using data-driven approaches contributed to the definition of clusters/phenotypes based on similarities in clinical and inflammatory biomarkers [9, 12, 13, 14]. However, these approaches have been scarcely applied to adults with asthma from population-based studies. The studies from Siroux et al. and Mäkikyrö et al. provided further evidence for identifying subgroups of asthma based on clinical markers and questionnaire data commonly available in primary health care or large epidemiological studies and found a larger range of asthma phenotypes.
Our study showed that LCA based only on the variables used to define some of the most common hypothesis-driven asthma phenotypes could not identify subgroups within adults with current asthma from the general population. By including additional clinical and physiological variables commonly used to classify asthma, LCA identified two data-driven phenotypes in the same subjects. Overall, these phenotypes differed only in symptom frequency and lung function parameters. Inflammatory biomarkers, presence of obesity, smoking status, age of asthma onset and self-reported hay fever were not different between classes.
Moreover, using a less stringent asthma definition (ever asthma) and in subjects with poor clinical outcomes (difficult asthma), these variables were also suboptimal to differentiate asthma subgroups.
In contrast to studies with severe asthma patients, our results suggest that, for the general asthma population, the clinical and physiological variables available to classify asthma and commonly used predefined cut-offs seem to be insufficient to identify specific phenotypes. The inclusion in data-driven models of additional easily measurable biomarkers that have already been shown to be helpful in discriminating asthma phenotypes in this population (e.g. serum IgE and/or periostin) [28, 29], combined with comprehensive clinical, physiologic, and/or disease features, might result in the identification of more precise phenotypes. The identification of new, more accurate biomarkers could also improve phenotyping. Furthermore, the use of fixed cut-off values, although common and more intuitive for daily clinical practice, may miss more complex and as yet unidentified phenotypes. The use of absolute values (as seen in other studies [13, 31, 32]) or of appropriate reference equations for predicted values [33, 34] could be more adequate.
Similarly, research efforts are being made to integrate clinical characteristics with available biomarkers to identify data-driven asthma phenotypes in children [35, 36]. However, the obtained phenotypes vary on key features that are more pronounced during childhood, including the natural history of wheeze over time, suggesting that further work is required to compare data- and hypothesis-driven approaches to identify asthma phenotypes in children.
Limitations inherent to a survey study design must be acknowledged, and the self-reported variables may lead to misclassification and information bias; to account for these biases, we used previously validated definitions [38, 39]. Also, despite including the most commonly used variables for respiratory disease assessment available in the NHANES study, when using the less stringent asthma definition, the differentiation of asthma subgroups was not improved in this population. However, to reduce the risk of poor LCA-class differentiation, we did not include any of the variables used in the asthma group definitions in the LCA models. Finally, LCA modelling should encompass all the domains relevant to the understanding of the disease to classify observations into discrete and mutually exclusive classes, suggesting that the use of predefined cut-offs and the lack of data regarding, for example, objective assessment of atopy and nasal and ocular symptoms (which have proved to be useful in the stratification of allergic respiratory diseases [10, 41]) may have limited the ability to differentiate specific asthma phenotypes using unsupervised analysis.
In conclusion, this brief communication extends our previous work on the need for a broader data analysis combining different asthma-related domains for differentiating phenotypes in the general asthma population. The clinical and physiological variables commonly used to subdivide asthma seem to be insufficient to differentiate specific asthma phenotypes among adults from the general population, irrespective of using data-driven or hypothesis-driven approaches. Further studies based on more comprehensive disease features are required to identify asthma phenotypes with the potential to be useful for clinicians and for population-based research.
RA, AMP and JAF contributed to study conception and design, analysis and interpretation of data, and writing and revising the article. TJ, AM, CJ and KA contributed to data interpretation and to writing and revising the article. All authors read and approved the final manuscript.
This article was supported by FEDER through the operation POCI-01-0145-FEDER-007746 funded by the Programa Operacional Competitividade e Internacionalização – COMPETE2020 and by National Funds through FCT - Fundação para a Ciência e a Tecnologia within CINTESIS, R&D Unit (reference UID/IC/4255/2013).
The authors declare that they have no competing interests.
Availability of data
Data and respective datasets are displayed at the NHANES website: https://www.cdc.gov/nchs/nhanes/Index.htm.
Consent of publication
Ethics approval and consent to participate
The NHANES survey operates under the approval of the National Center for Health Statistics Research Ethics Review Board (Protocols #2005-06 and #2011-17), available at www.cdc.gov/nchs/nhanes/irba98.htm. All the NHANES data meet the conditions described in Research Using Publicly Available Datasets (Secondary Analysis) - Policy #39 - for use without application to an Institutional Review Board. All study participants provided written informed consent.
RA is supported by a Ph.D. grant (grant no. PD/BD/113659/2015), financed by the Fundação para a Ciência e Tecnologia, I.P., PhD program (reference no. PD/0003/2013: Doctoral Program in Clinical and Health Services Research).
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 7. Han J, Kamber M, Pei J. Data mining: concepts and techniques. 3rd ed. Waltham: Morgan Kaufmann Publishers; 2012.
- 16. Amaral R, Jacinto T, Pereira A, Almeida R, Fonseca J. A comparison of unsupervised methods based on dichotomous data to identify clusters of airways symptoms: latent class analysis and partitioning around medoids methods. Eur Respir J. 2018. https://doi.org/10.1183/13993003.congress-2018.PA4429
- 24. Muthén LK, Muthén BO. Mplus user’s guide. 7th ed. Los Angeles: Muthén & Muthén; 2012.
- 25. Specifying weighting parameters. https://www.cdc.gov/nchs/tutorials/nhanes/SurveyDesign/Weighting/intro.htm. Accessed 9 Dec 2018.
- 36. Collins SA, Pike KC, Inskip HM, Godfrey KM, Roberts G, Holloway JW, et al. Validation of novel wheeze phenotypes using longitudinal airway function and atopic sensitization data in the first 6 years of life: evidence from the Southampton Women’s survey. Pediatr Pulmonol. 2013;48(7):683–92.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.