Distinguishing Asthma Phenotypes Using Machine Learning Approaches

Asthma is not a single disease, but an umbrella term for a number of distinct diseases, each of which is caused by a distinct underlying pathophysiological mechanism. These discrete disease entities are often labelled ‘asthma endotypes’. The discovery of different asthma subtypes has moved from subjective approaches, in which putative phenotypes are assigned by experts, to data-driven ones which incorporate machine learning. This review focuses on the methodological developments of one such machine learning technique, latent class analysis, and how it has contributed to distinguishing asthma and wheezing subtypes in childhood. It also gives a clinical perspective, presenting the findings of studies from the past 5 years that used this approach. The identification of true asthma endotypes may be a crucial step towards understanding their distinct pathophysiological mechanisms, which could ultimately lead to more precise prevention strategies, identification of novel therapeutic targets and the development of effective personalized therapies.


Introduction
Asthma is increasingly recognized as a heterogeneous condition [1], an umbrella diagnosis for several diseases which present with common symptoms such as wheeze and cough, but differ in their aetiology, pathogenesis and responses to treatment (Fig. 1) [2-9]. These variants of asthma have been described as 'asthma endotypes' [10,11]. Unlike phenotypes, which are defined by shared observable characteristics, endotypes may be defined as subtypes of a condition with overlapping clinical symptoms, each caused by a distinct underlying pathophysiological mechanism [10] (Fig. 1).
At present, 'asthma endotype' is predominantly a hypothetical construct which has a potential value in helping us to uncover the mechanisms underlying different diseases in the 'asthma syndrome' [12]. Unravelling unique mechanisms for each asthma endotype may improve our understanding of the natural history of these diseases, could ultimately lead to more precise (possibly mechanism-specific) prevention strategies and may be crucial for the development of more effective personalized therapies and stratified health care [7, 8, 12, 13•]. For this to be feasible, amongst patients with the diagnostic label of 'asthma', it will be necessary to distinguish between different endotypes more precisely and in an unbiased way, as opposed to the currently prevailing classifications based on simple clinical phenotypic characteristics which usually focus on a single dimension of the disease (such as eosinophilic or neutrophilic inflammation). One of the obstacles to this approach is that the information domains from which endotypes should be identified are not well defined. This, in conjunction with the persistence of different definitions of asthma in the medical literature, affects the performance of prediction models [12,14].

Investigator-Imposed vs. Data-Driven Approaches to Subtyping Asthma
The current approaches used to classify patients into asthma sub-groups can be split into two main types: subjective (i.e. investigator-imposed) and data-driven. The former is often referred to as being hypothesis-driven, and the latter as hypothesis-generating. In most subjective approaches, an investigator (usually an expert in the field) reviews the patterns of change in an individual's symptoms, triggers, pathology or airway obstruction, and then classifies patients into different 'phenotypes'. An example of this approach which has withstood the test of time is a seminal publication from the Tucson Children's Respiratory Study, which described phenotypes of wheezing illness in pre-school children based on clinical assessment of whether a child had wheezed in the previous 12 months at ages 3 and 6 years [15]. The children with wheezing were assigned to three phenotypes: transient early wheezers, late-onset wheezers and persistent wheezers. However, by using this approach, one cannot estimate the uncertainty of phenotype classification. The research in this area has therefore moved on from using classifications predetermined by experts towards data-driven methodologies. Such approaches incorporate statistical learning techniques, facilitating the exploration of high-dimensional clinical data sets. Here, we review the last 5 years of developments of latent variable modelling techniques, specifically latent class analysis (LCA) [16], through which the dimensionality of the data sets can be reduced and variables can be grouped into patterns. We will describe the main cohorts used in cited studies and discuss issues related to the analytical approaches used in different studies (such as model definition and parameter estimation, model selection and class assignment) and the advantages and disadvantages of cross-sectional vs. longitudinal LCA approaches.

Background on Latent Class Analysis
Model Definition and Parameter Estimation: Which Classes?
Latent class analysis aims to fit a probabilistic model to data containing observable variables such as asthma symptoms (e.g. wheeze and/or cough) or atopic sensitization (e.g. serum specific IgE levels and skin prick tests): the observed variables are considered to be imperfect indicators of a set of unobserved latent variables. The assumption underpinning LCA is that all associations amongst the observed variables are due to the unobserved latent classes. The probabilistic model used for LCA is often referred to as a mixture model, because the probability of the observed data is a weighted sum, or mixture, of the data probability for each latent class. The weights of the mixture are the prior probabilities of each latent class, i.e. the chance of observing that class in the overall population. Latent classes are derived entirely from the observed data in an unsupervised manner [17,18]. The subjects assigned to each class will be similar to each other according to the descriptor variables used, and the latent classes should correspond to clusters of similar subjects. LCA has been shown to be appropriate for modelling data on the occurrence or absence of symptoms in heterogeneous diseases such as asthma [3, 19-21], with the assumption that the co-occurrence of symptoms within the resulting latent classes is the consequence of unique disease-specific mechanisms, and that therefore the latent classes may be regarded as bona fide asthma endotypes.
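The mixture model described above can be restated compactly. The notation below is introduced here for illustration only and is not taken from the cited studies: K is the number of latent classes, J the number of observed variables, \pi_k the prior probability (mixture weight) of class k, \mathbf{y}_i the observed responses of subject i and c_i that subject's unobserved class. The second line assumes that the observed variables are independent within each class, and the third follows from Bayes' rule.

```latex
\begin{align}
P(\mathbf{y}_i) &= \sum_{k=1}^{K} \pi_k \, P(\mathbf{y}_i \mid c_i = k),
  \qquad \sum_{k=1}^{K} \pi_k = 1, \\
P(\mathbf{y}_i \mid c_i = k) &= \prod_{j=1}^{J} P(y_{ij} \mid c_i = k), \\
P(c_i = k \mid \mathbf{y}_i) &= \frac{\pi_k \, P(\mathbf{y}_i \mid c_i = k)}{P(\mathbf{y}_i)}.
\end{align}
```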
Figure 1. Asthma: an umbrella diagnosis which comprises multiple diseases with distinct mechanisms.
In order to produce the latent classes, the model estimates two important quantities: (1) the conditional probability of each variable's response within each class and (2) the posterior probability of class membership for each subject given their response history. Parameter estimation in LCA (e.g. of the mixture weights or the average variable values in a class) can be done via different methods, one of the most popular being the expectation-maximization (EM) algorithm. EM alternates between two distinct steps (the E-step and the M-step) to find the best parameter set iteratively: the E-step uses the current (best guess) parameter estimates to infer the unobserved class memberships, and the M-step improves the parameter estimates by maximizing the resulting likelihood (i.e. how well the model fits or explains the observed data points). The two steps are repeated until successive estimates converge to the same value. It is typically assumed that each class has a characteristic distribution of the key variables modelled by some parameterized density function [22]. Often, the choice of class density function is restricted by the need for the EM optimization scheme to be tractable, e.g. class distributions from the exponential family are often chosen since the M-step of the EM algorithm is exactly tractable in this case. An alternative estimation procedure that has been used is the approximate Bayesian approach of variational message passing [23, 24, 25•].
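The E- and M-steps can be illustrated with a minimal sketch in Python. It is not the implementation used in any of the cited studies and rests on several simplifying assumptions: all indicators are binary (0/1), there are no missing values, responses are independent within each class, and a single random start is used. The function names `e_step` and `fit_lca` and their parameters are hypothetical.

```python
import math
import random


def e_step(data, weights, probs):
    """E-step: posterior class probabilities per subject, plus the log-likelihood."""
    posteriors, log_lik = [], 0.0
    for row in data:
        joint = []
        for w, item_probs in zip(weights, probs):
            p = w  # prior probability of the class
            for y, q in zip(row, item_probs):
                p *= q if y else 1.0 - q  # Bernoulli response probability
            joint.append(p)
        total = sum(joint)  # mixture probability of this subject's responses
        log_lik += math.log(total)
        posteriors.append([p / total for p in joint])
    return posteriors, log_lik


def fit_lca(data, n_classes, max_iter=500, tol=1e-8, seed=0):
    """Fit a latent class model to binary data with EM (illustrative sketch).

    data: list of equal-length lists of 0/1 responses.
    Returns (weights, probs, posteriors, log_likelihood).
    """
    rng = random.Random(seed)
    n, n_items = len(data), len(data[0])
    # Start from equal class weights and random response probabilities.
    weights = [1.0 / n_classes] * n_classes
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_items)]
             for _ in range(n_classes)]
    posteriors, log_lik = e_step(data, weights, probs)
    for _ in range(max_iter):
        # M-step: update weights and response probabilities from the posteriors.
        for k in range(n_classes):
            resp = sum(post[k] for post in posteriors)
            weights[k] = resp / n
            for j in range(n_items):
                num = sum(post[k] * row[j] for post, row in zip(posteriors, data))
                # Clip so that probabilities never reach exactly 0 or 1.
                probs[k][j] = min(max(num / resp, 1e-6), 1.0 - 1e-6)
        posteriors, new_log_lik = e_step(data, weights, probs)
        converged = new_log_lik - log_lik < tol
        log_lik = new_log_lik
        if converged:
            break
    return weights, probs, posteriors, log_lik
```

Calling `fit_lca(data, n_classes=2)` on a list of 0/1 response vectors returns the estimated class weights, the per-class response probabilities, each subject's posterior class probabilities and the final log-likelihood. Because EM can stop at a local maximum, a practical analysis would repeat the fit from several seeds and keep the solution with the highest log-likelihood.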
The models typically used for LCA assume conditional independence of observed variables within each latent class, which is a strong assumption. Original data may be reduced into independent components (thus reducing the original number of variables) using techniques such as principal component analysis (PCA), exploratory factor analysis or multiple correspondence analysis [26], as variables representing the same dimensions are thought to be likely to be dependent within all identified latent classes. These variables are also typically required to be categorical, not continuous. Additionally, covariates such as sex effects can be (and have been) estimated [27, 28•].

Model Selection: How Many Classes?
The EM algorithm can be used to learn the model parameters. However, this leaves the problem of choosing the optimal number of latent classes. This model selection problem is much more challenging than parameter estimation. It cannot be solved by simply maximizing the data likelihood: it is always true that a more complex model (i.e. more parameters) can achieve the same or higher likelihood than a simpler model (i.e. fewer parameters). Therefore, it is necessary to account for such model complexity (the number of free parameters) when selecting the optimal model. There is not yet a single agreed method that determines the optimal number of latent classes for a model [29], although some methods are better suited for specific cases (or more popular) than others. Often, a variety of different model selection procedures are used, with their results compared and interpreted to help determine the model with the best number of classes. The most popular method used in the literature we reviewed was the Bayesian Information Criterion (BIC) [20,21,26,27,28•, 30••, 31, 32••, 33•, 34-37] along with some variations on its original formulation [28•, 38]. The criterion penalizes a model's log-likelihood by the number of parameters [39]: a quantity proportional to the number of parameters times the logarithm of the sample size is subtracted from the model log-likelihood, and this penalty can be interpreted in a Bayesian fashion. Complex models are penalized to avoid over-fitting, and parsimony is rewarded [40]. On the conventional scale, in which the penalized log-likelihood is multiplied by -2, the model with the lowest BIC value is considered the best-fitting model and is usually selected [32••]. The selected solution can depend on the sample size and on the starting values. To account for the former, the algorithm can be adjusted [37]. To negate the latter, analyses are often rerun with many different starting values to confirm stable solutions and to avoid local maxima [37,41].
The BIC is currently regarded as one of the most efficient in-sample estimators [31,38] that does not require an external test set.
Another popular method is the Akaike Information Criterion (AIC) [26, 27, 28•, 42••], which is very similar to the BIC. The AIC uses a smaller penalty term for the number of parameters in a model [43], i.e. one based only on the number of parameters, without accounting for the sample size. An adjusted version was also used [28•, 38]. As with the BIC, a lower value indicates a superior model [16].
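A minimal sketch of how the two criteria can be computed and compared is given below, using their standard definitions on the -2 log-likelihood scale (so that lower values are better). The parameter count assumes an LCA with binary indicators, and all function names are hypothetical.

```python
import math


def n_free_parameters(n_classes, n_items):
    """Free parameters of an LCA with binary indicators (an assumption):
    (n_classes - 1) class weights plus one response probability per
    item and class."""
    return (n_classes - 1) + n_classes * n_items


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: the penalty ignores the sample size."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_subjects):
    """Bayesian Information Criterion: the penalty grows with log(sample size)."""
    return -2.0 * log_likelihood + n_params * math.log(n_subjects)


def select_n_classes(log_lik_by_k, n_items, n_subjects):
    """Return the number of classes with the lowest BIC, and all BIC values.

    log_lik_by_k maps a candidate number of classes to the maximized
    log-likelihood of the corresponding fitted model.
    """
    scores = {
        k: bic(ll, n_free_parameters(k, n_items), n_subjects)
        for k, ll in log_lik_by_k.items()
    }
    return min(scores, key=scores.get), scores
```

For instance, with made-up log-likelihoods of -1200, -1100 and -1095 for one, two and three classes, five binary indicators and 500 subjects, the small gain in likelihood from the third class does not offset its penalty, so the two-class model has the lowest BIC.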
The next most popular methods are likelihood ratio tests [36, 44], all of which are parametric [22,38]. These test the improvement in fit between models with n and n+1 classes, resulting in a fit index [41]. For the tests that assume a chi-squared distribution, the results may be affected by sample size, as the test statistic follows the distribution only asymptotically. The bootstrap likelihood ratio test instead constructs a distribution using a parametric bootstrap approach and as such should be less affected by sample size [22,31].
Other methods used for model selection include entropy [21,27,28•, 42••, 45•], the Bezdek partition coefficient [26], confusion matrices [23] and a dissimilarity index [36]. More recently, Bayesian latent variable methods closely related to LCA have also been introduced which allow models to be selected using the Bayesian evidence (also referred to as the marginal likelihood) [47].
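The comparison of an n-class with an (n+1)-class model can be written down as follows. This is a generic sketch rather than the exact procedure of any cited study: \hat{L}_n denotes the maximized likelihood of the n-class model, B the number of data sets simulated from the fitted n-class model, and \mathrm{LRT}^{(b)} the statistic obtained after refitting both models to simulated data set b. The bootstrap p-value shown is one common convention.

```latex
\begin{align}
\mathrm{LRT} &= -2\left[\ln \hat{L}_{n} - \ln \hat{L}_{n+1}\right], \\
p_{\text{boot}} &= \frac{1 + \#\{\, b : \mathrm{LRT}^{(b)} \ge \mathrm{LRT}_{\text{obs}} \,\}}{B + 1}.
\end{align}
```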
The selected model's resulting classes are interpreted as distinct subtypes, characterized by the model variables that apply best to each class. Because of this, classes that share very similar model variable characteristics are interpreted as representing very similar phenotypes and so are often merged in favour of fewer classes [31].
Class Assignment: How Are (New) Subjects Assigned to the Latent Class?
For each subject, the probability of belonging to each of the latent classes is calculated. In a process called modal allocation, each subject is assigned to the latent class with the largest a posteriori probability of membership [19]. A classification is supported by high membership probabilities, indicating good separation between clusters. These classifications are validated by testing the reproducibility of these classes [26, 32••], sensitivity analyses [26] and the analysis of their association with objective measurements considered to be relevant to asthma such as lung function (e.g. forced expiratory volume in 1 s [FEV1]) and bronchial hyperresponsiveness (BHR) [31] using regression analyses [28•, 32••, 36, 45•, 48] and true- and false-positive rates from receiver operating characteristic curves [42••, 49].
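A minimal sketch of class assignment follows, assuming a model with binary indicators whose class weights and response probabilities have already been estimated. The function names are hypothetical. The relative entropy shown is one commonly used summary of class separation and may differ from the entropy measure used in a given study.

```python
import math


def class_posterior(responses, weights, probs):
    """Posterior class probabilities for one (possibly new) subject.

    responses: list of 0/1 answers; weights: class prior probabilities;
    probs[k][j]: probability of a positive response to item j in class k.
    """
    joint = []
    for w, item_probs in zip(weights, probs):
        p = w
        for y, q in zip(responses, item_probs):
            p *= q if y else 1.0 - q
        joint.append(p)
    total = sum(joint)
    return [p / total for p in joint]


def modal_allocation(posteriors):
    """Assign each subject to the class with the largest posterior probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in posteriors]


def relative_entropy(posteriors):
    """Normalized entropy in [0, 1]; values near 1 indicate well-separated classes.

    Requires at least two classes.
    """
    n, k = len(posteriors), len(posteriors[0])
    total = -sum(p * math.log(p) for row in posteriors for p in row if p > 0.0)
    return 1.0 - total / (n * math.log(k))
```

A subject whose posterior probabilities are close to uniform is still allocated to a class by `modal_allocation`, which is why the membership probabilities themselves should be inspected alongside the assigned labels.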

Strengths and Limitations of Different LCA Approaches
A major advantage of LCA is that subjects are not absolutely assigned to a single class, but instead have probabilities of membership of the various classes. Also, as the model selection criteria previously described assist in determining the optimal number of classes, LCA can be regarded as fairly objective in identifying sub-groups [19,50]. However, a degree of subjectivity is introduced into LCA as some a priori decisions need to be made, such as which variables to include in the model and what to assume about the distributions of the data values and the shapes of the classes [19].
LCA is particularly suitable for categorical input variables and can accommodate missing values when they are assumed to be missing at random [51], thus allowing the analysis of a whole sample. However, caution is required when using records containing missing values, as their use can pose a high risk of bias where missing values are correlated with clinical attributes. For example, in some studies, allergy tests are rarely performed in non-asthmatic children, and a lack of information about family history, environmental exposures and clinical features can also contribute to bias. There is also the risk of unrecorded episodes (e.g. of wheeze), which can lead to an underestimation of incidence rates, possibly resulting in imprecise class definitions [34]. Manifestation of biases such as these could result in misclassification of subjects.
Another LCA approach, longitudinal latent class analysis (LLCA), can temper these biases as it accounts for correlation between reports at different time points [30••, 51, 52]. This is possible because LLCA clusters individuals into classes alongside others that share similar longitudinal response patterns across discrete time points. Subjects with sporadic or incomplete reports are assigned to classes with less certainty than those with consistent reports across time or those who report patterns consistent with other subjects. However, LLCA sometimes requires responses to be collected at the same discrete time point in each subject. This has major implications for data collection. Ideally, every subject should be exactly the same age when each measurement is performed. However, in most epidemiological longitudinal studies, most measurements are not collected in this fashion, so the rounding of age is required, introducing measurement errors [33•]. In addition, LLCA does not allow the modelling of the effect of time-varying causative factors (such as environmental exposures, e.g. seasons) on the prevalence of a response such as wheeze.
These limitations can be overcome by using another, more flexible form of LCA, latent class growth analysis (LCGA) [41,53]. Here, the variables need not be categorical, but can be continuous, removing the need to round ages or other numerical measurements. This allows trajectories of development (e.g. of wheeze or atopic sensitization) to be estimated as a continuous function of age. The effect of time-varying causative factors can now be included as well (for example the effect of common cold/flu season on the prevalence of wheezing) [30••]. This relatively novel method enables the investigation of associations of time-varying and time-invariant factors with response patterns and makes it easier to compare classes across different populations and to account for a variable number of repeated assessments. All of this can be particularly advantageous when studying the effects of repeated environmental exposures and outcomes that fluctuate over time through childhood.

Cohort Details and Phenotype Associations
Cohorts that have utilized data-driven approaches to analysis in the field of asthma and allergy within the last 5 years are listed in Table 1. They are a mixture of longitudinal and cross-sectional studies, with 19 of the 25 being birth cohorts. Sample sizes used in different analyses range from 201 to 11,632 participants. Variants of LCA performed upon these study populations include LLCA, LCGA and latent growth mixture modelling (LGMM) [54]. The number of resulting classes for each of these studies ranged from 3 to 8, with the patients allocated to each class based on characteristics such as wheezing [ [48], who based theirs upon cytokine production, denoting a burden of infection; Bochenek et al. [27] who based theirs upon sub-phenotypes of a recognized sub-type of asthma, aspirin-exacerbated respiratory disease; Rzehak et al. [54] who based theirs upon body mass index trajectories; and Belgrave et al. [66••] who included eczema, wheeze and rhinitis. The resulting sub-groups (classes) were then often associated with clinical features such as atopy, physician-diagnosed asthma and fractional exhaled nitric oxide (FeNO) [20, 23, 25•, 26, 28•, 32••, 36, 67, 68]. For example, FeNO, a surrogate biomarker of the degree of eosinophilic airway inflammation [69,70], measured at age 8 years in the Prevention and Incidence of Asthma and Mite Allergy (PIAMA) cohort, was found to differ amongst wheezing phenotypes, but only in atopic children, fuelling speculation that the pathophysiology of wheezing phenotypes differs between atopic and non-atopic children, and that they are the result of differing endotypes [67].
Wheezing is the feature that the analyses reviewed here most commonly used to derive classes (which are often referred to as 'wheeze phenotypes'), some (but not all) of which are similar across different studies and analyses. Types of wheeze phenotypes that have been identified across the studies include early life wheeze (transient and prolonged), late-onset wheeze and persistent wheeze (controlled and troublesome) [7]. For the early life wheeze phenotypes, a prolonged early wheeze (PEW) was identified in the Avon Longitudinal Study of Parents and Children (ALSPAC) [21], but it was not found in the PIAMA cohort in the same analysis. However, the latter cohort's transient early wheeze (TEW), also identified in the ALSPAC cohort, appeared to be a combination of ALSPAC's PEW and TEW classes, both in terms of size and of the prevalence of wheeze over time. In both cohorts, children in these classes were found to have diminished lung function at age 6-8 years (i.e. after the wheeze had resolved) compared to those who had never wheezed [21]. These phenotypes of early childhood wheezing were not associated with allergic sensitization, eczema or rhinitis, and it has since been confirmed that the developmental profiles of eczema, wheeze and rhinitis are indeed heterogeneous [66••].
The late-onset wheeze phenotype is generally characterized as wheeze which starts after the age of 3 years and then persists into later childhood. Studies report mixed findings with respect to the association of late-onset wheeze with secondary asthma phenotypes such as lung function and bronchial hyperresponsiveness [7]. For example, ALSPAC, PIAMA and the Manchester Asthma and Allergy Study (MAAS) all found that children in this subgroup are significantly more likely to have bronchial hyperresponsiveness [3, 21, 32••], but only MAAS and ALSPAC found significant associations of late-onset wheeze with lung function impairment at the age of 6 years [3,71].
Lastly, the persistent wheeze phenotypes have been characterized by diminished lung function by school age in all cohorts which assessed their association [7]. In contrast to other studies which described a single persistent wheeze class, in the MAAS birth cohort, the children with persistent wheeze fell into two distinct classes: persistent controlled wheeze (PCW) and persistent troublesome wheeze (PTW) [32••]. The PTW group had worse lung function and more reactive airways than all other groups, including PCW [32••]. However, it is worth noting that this analysis utilized information on wheezing derived from two different sources (parentally reported and physician-confirmed), likely enabling a more precise allocation into sub-groups.
Most studies reported a strong association between persistent wheezing and atopy, with more than 50 % of persistent wheezers having been found to be atopic [72]. However, atopy is not a feature that is unique to this class, and it is also present in other wheeze classes, and amongst children who have never wheezed. Thus, on its own, atopic status is not a good discriminator of wheeze class. Several studies hypothesized that, similar to wheezing illness, atopy may also comprise several distinct subtypes. For example, Herr et al. identified three distinct atopic phenotypes within the first 18 months of life amongst participants in the PARIS cohort [26]. In an analysis spanning the first 8 years of life, Simpson et al. identified a different structure within the MAAS data, with the optimal model containing five classes (early sensitization to multiple allergen sources, late sensitization to multiple allergen sources, mite, non-dust mite and no latent vulnerability) [23]. The atopic class of early sensitization to multiple allergen sources was associated with persistent wheeze phenotypes and was very strongly associated with physician-diagnosed asthma at school age (odds ratio ∼30) [23]. Lazic et al. extended this work by including newly available skin test and IgE data from age 11 years from the MAAS cohort and that of the Isle of Wight cohort [25•, 73-76]. Very similar five-class models emerged across the two cohorts, suggesting that these atopy classes were stable across time and different populations. In both cohorts, children in the class with sensitivity to a wide variety of allergens were considerably more likely to have asthma compared to all other classes [25•]. The children in this class (comprising approximately one quarter of children defined as atopic using the standard definition) across both cohorts had significantly poorer lung function, the most reactive airways, the highest exhaled nitric oxide (eNO) and the most hospital admissions for asthma.
Of note, the associations between asthma presence and severity and conventionally defined atopy were much weaker. These results indicate that there is a latent heterogeneity in atopy, similar to that found in asthma/wheezing illness [23, 25•]. Because of this, attempting to define atopy as a dichotomous trait could well be an oversimplification, much as it would be to define childhood wheezing in such a fashion.
Further research will be necessary in order to replicate different asthma, wheeze and atopy subtypes across independent cohorts, to assess their stability over time and to confirm the existence of distinct pathophysiological mechanisms underpinning each sub-type. Since unique pathophysiological mechanisms for the subtypes of wheezing illness and atopy identified so far using machine learning approaches have not as yet been elucidated, these cannot be considered as 'true endotypes' but are mostly hypothetical constructs to facilitate further research in this area. The identification of underlying biology and real endotypes may have major implications for effective and precise asthma prevention, treatment and management strategies, as it is anticipated that the different groups may respond differently to the treatments currently offered.

Conclusions
A distinct set of heterogeneous diseases with the diagnostic label of asthma may potentially be identified using data-driven, computational techniques such as latent class analysis. Such techniques disambiguate the complex patterns of symptoms shared by these different diseases. This may be a first step towards elucidation and better understanding of their distinct underlying pathophysiological mechanisms, which could facilitate the development of personalized mechanism-specific prevention strategies and more effective stratified therapies.

Compliance with Ethics Guidelines
Conflicts of Interest
Adnan Custovic reports grants from Medical Research Council, grants from The JP Moulton Charitable Foundation, grants from North West Lung Research Centre Charity, grants from European Union 7th Framework Programme, grants from National Institute of Health Research, personal fees from Novartis, personal fees from Thermo Fisher, personal fees from AstraZeneca, personal fees from ALK and personal fees from GlaxoSmithKline.
Rebecca Howard, Magnus Rattray and Mattia Prosperi declare that they have no conflicts of interest.
Human and Animal Rights and Informed Consent
This article does not contain any studies with human or animal subjects performed by any of the authors.