Analytic Complexity and Challenges in Identifying Mixtures of Exposures Associated with Phenotypes in the Exposome Era

Purpose of Review Mixtures, or combinations and interactions between multiple environmental exposures, are hypothesized to be causally linked with disease and health-related phenotypes. Established and emerging molecular measurement technologies to assay the exposome, the comprehensive battery of exposures encountered from birth to death, promise a new way of identifying mixtures in disease in the epidemiological setting. In this opinion, we describe the analytic complexity and challenges in identifying mixtures associated with phenotype and disease. Recent Findings Existing and emerging machine-learning methods and data analytic approaches (e.g., “environment-wide association studies” [EWASs]), as well as large cohorts may enhance possibilities to identify mixtures of correlated exposures associated with phenotypes; however, the analytic complexity of identifying mixtures is immense. Summary If the exposome concept is realized, new analytical methods and large sample sizes will be required to ascertain how mixtures are associated with disease. The author recommends documenting prevalent correlated exposures and replicated main effects prior to identifying mixtures.


Introduction
We are exposed to many different factors simultaneously throughout our lifespan, ranging from drugs, infectious agents, environmental pollutants, and macro-and micronutrients. The exposome is an emerging attempt to conceptualize and characterize the breadth of exposures humans encounter from birth to death [1•, 2•]. The exposome concept promises to capitalize on our accelerating ability to measure indicators of environmental exposure in a high-throughput manner leveraging new tools, such as metabolomics [3]. In fact, investigations are now underway to measure the exposome in children to discover new associations in development-related traits, such as the National Institute of Environmental Health Sciences Children's Health Exposure Analysis Resource (CHEAR, see https://www.niehs.nih. gov/research/supported/exposure/chear/). In contrast, most epidemiological investigations to date consider one or a few exposures at a time and we currently lack data-driven methods to associate, and discover, numerous environmental exposures, including mixtures, in etiological studies of disease.
New high-throughput measurements provide data to potentially identify mixtures, defined as a combinations of cooccurring exposures, associated with disease. Identification of mixtures is important for public health and epidemiology because it is hypothesized that exposures induce changes in phenotype (disease and health indicators) not as a single agent, but as a collection of agents acting in concert [4,5]. In other words, mixtures can be further thought of as groups of exposures that together induce a change in phenotype that is different than the effect of each exposure component separately. We note that the search for mixtures may contrast with the well-known Bradford Hill criteria to assess causality [6]. Specifically, Hill's criterion 3 is one of specificity or, in other words, the simpler the association between exposure and disease, the more likely the association is causal [6]. In the era of high-throughput measurement, Bradford Hill may need to be revisited if evidence for combinations of exposures associated with diseases are realized [7••, 8]).
First, we present some definitions. We define phenotype as a manifestation of a trait, such as disease and quantitative characteristics (e.g., height, body mass index). In environmental epidemiological investigations, we attempt to model the relationship between phenotypes (e.g., a disease or quantitative characteristic) and exposures. We define exposure combinations as a set of two or more exposures that occur together. Exposure co-occurrences are a pair of two exposures that co-occur together. Finally, mixtures are potentially interacting combinations of exposures that are associated with a phenotypic change. In other words, the potential effect on the phenotype is influenced not by the individual constituents of the mixture, but the combination. In the following sections, we aim to further define analytically both co-occurrences and mixtures of exposures and their associations with phenotype.
Analytic identification of mixtures of exposures in phenotype is fraught with challenges [9•, 10•]. In the following, we review existing and emerging analytic methods to understand the complex phenomena of co-occurring exposures and mixtures in the epidemiological setting and open challenges, extending the insights from Braun and others [11••].

Potential Complexity of Identifying Mixtures: Expansive Number of Possible Combinations of Exposures and Mixtures
The total number of sets of exposures that can exist is immense. Suppose we have the capability of measuring M number of exposures in an epidemiological study. A combination of exposures can have any size: two (e.g., lead and cadmium), three (e.g., lead, cadmium, and bisphenol A), or more. In a setting with binary exposure variables (e.g., median split) and without any prior hypotheses, the total number of potential combination of high exposures that can occur in a dataset of size 2 and up to N is defined as "M choose N", equal to m!/(nm)!*m! where M is defined as the number of high exposures. For example, if our study had measured three (e.g., serum lead, cadmium, and arsenic) exposures in total, the number of possible combinations of size two (M = 2) is 3, including (lead, cadmium), (lead, arsenic), and (arsenic, cadmium). Suppose we measure 100 exposures (N = 100): the total number of combinations of size two that could co-occur is 4950. Of size 5, the total combinations are on the order of millions (exactly 75,287,520). One can imagine the scale of the number of combinations when scaling up to 100s of exposures of the exposome. Therefore, a primary challenge of identification of mixtures in phenotypes is the expansive number of exposure combinations possible.

Dense Correlations and Examples of Co-occurring Exposures in NHANES
While there are a vast number of possible combinations of exposures, investigators may choose to focus on the most prevalent co-exposures or those that co-occur more frequently in the population. One way to identify co-occurring exposures includes estimating the correlation between each pair of exposures (e.g., through a Spearman or Pearson correlation measure) and "grouping" the highly correlated exposures with an "unsupervised machine learning" technique, such as clustering.
Others and we have described how to identify prevalent coexposures by displaying the correlation between multiple exposures and finding co-occurring exposure "networks" through unsupervised learning [12•] based on "relevance networks" [13]. Briefly, a network is composed of a combination of exposures, or nodes, whose connections are estimated by the correlation between exposures (e.g., Spearman or Pearson correlation). We assess the strength of the connection between exposures in a network by estimating the p value of significance through permutation-based tests. Specifically, each exposure is randomly permuted (sampled without replacement), and the correlations are re-computed to create a set of correlations to reflect a null distribution of no correlation. Then, we choose the most significant pairwise correlations to construct the network. Networks are then visualized in a "correlation globe" (Fig. 1), whereby each exposure variable is arranged in a circle ("nodes") and lines between nodes indicate nonrandom correlation between exposure variables. There are many ways to create such visualization; for example, we created Fig. 1 with the Circos visualization package [14].
We implemented this method using epidemiological survey data from the National Health and Nutrition Examination Survey (NHANES), systematically computing pairwise correlations between 289 exposure variables, or 81,937 total correlations possible. Of these, the exposome network consisted of 2656 significant and replicated pairwise correlations. We identified several co-occurring exposures that are well known (see Table 1 for examples of exposures positively correlated with serum PCB170, cadmium, and β-carotene). The correlation structure is potentially "dense" (Fig. 1), and many exposures co-occur with many others [7••, 12•, 15, 16••]. Therefore, while the total number of combinations of exposures can be impossibly expansive as described in the previous section, the total number of prevalent co-occurring combinations of exposures can be reduced in magnitude but is still is a challenge to dissect.
We offer an additional cautionary note: by only focusing on combinations that co-occur frequently, investigators will lose opportunities to identify mixtures that involve dependencies between rare exposures. For example, it may be possible that hypothetical exposures A and B do not frequently co-occur; however, a phenotypic response may be induced in the rare chance that individuals encounter A and B. Or, put even more simply, there could be exposures whose influence on a phenotype is "triggered" by another exposure B. Correlation, or co-occurrence, does not equate to interaction.
GWASs are a way to associate millions of genetic factors with disease or phenotype along the entirety of the genome. Due to wide accessibility of genome-scale GWAS assays that can ascertain millions of genetic variants simultaneously, human geneticists have now moved from studying a handful of genetic variants at a time to a more data-driven, comprehensive, systematic, and agnostic reporting of genetic associations and their replication in independent populations. At its simplest, GWASs are epidemiological studies that examine the association between the prevalence of genetic variants in cases (diseased) versus controls (healthy), not unlike most environmental epidemiological studies in design. Specific genetic factors are prioritized by examining what associations have a p value lower than a family-wise error rate (e.g., Bonferroni correction) and are replicated by testing significant associations in an independent cohort.
EWASs borrow the methods of GWAS to search for and analytically validate environmental factors associated with continuous phenotypes or cases versus controls. This type of question is different from a hypothesis-driven approach in which a single candidate or a handful of environmental factors are chosen a priori and tested individually for their association to a phenotype. Briefly, the strength of EWAS is in comprehensively testing for linear associations between each exposure and a phenotype, similar to that of GWAS studies. Instead of testing a few environmental associations at a time, EWAS evaluates multiple environmental factors. EWAS is comprehensive in that each factor measured is assessed for possible association with the target phenotype. Next, associations are systematically adjusted for multiplicity of comparisons. Finally, EWAS calls for validation of significant associations in an independent population.
In an EWAS study, M total exposure variables are associated with a phenotype or outcome using a linear model, iteratively or one at a time. Thus, for an environmental factor X i in the list of measured factors X i … X m , the disease state (Y) is modeled as a linear function of environmental factors and adjustment variables (represented by Z) X i corresponds to the environmental factor, and b i corresponds to the effect size of that factor (e.g., beta coefficient or odds ratio), adjusted by Z. The strength of association is computed by the two-sided p value for b i , which tests the "null hypothesis" that b i is equal to zero. Next, to mitigate the chance for false positive discovery (type I error), a familywise error rate, such as Bonferroni correction, is applied. The Bonferroni adjustment simply adjusts the nominal p value threshold by the total number of tests conducted. This adjustment guarantees the "family-wise error rate"-the probability of having one or more false positive(s). However, the threshold is conservative, and therefore, statistical power for detection is lost. An alternative to the family-wise error rate includes the false discovery rate (FDR) [18••]. The FDR is less conservative and therefore statistically more powerful than the Bonferroni correction [19]. The FDR is the estimated proportion of false discoveries made versus the number of real discoveries made for a given significance level to control for multiple hypothesis testing [18••]. The usual method of estimating the FDR is the Benjamini-Hochberg "step-down" approach; however, the approach assumes independence between tests. One way to address correlation between tests includes estimation of the FDR empirically through permutation based approaches (e.g., [20]). Further, replication in an independent dataset is sought for significant associations. We have applied EWAS to prioritize individual exposures in diabetes [21•], preterm birth [22], all-cause mortality [23], and serum lipid levels [24].
Despite comprehensively searching M exposures in a database with a phenotype Y of interest, EWAS does not explicitly In an EWAS in telomere length [25], multiple correlated PCB congeners were found in telomere length. These associations indicate that these exposures statistically co-occur in association with their respective phenotype. One can then visualize these co-occurring exposures in a focused correlation globe. For example, Fig. 2 shows a correlation globe that only Fig. 2 Visualization of a correlation globe focusing on exposures identified in EWAS in telomere length. Exposures identified in the EWAS procedure are seen in the outer circle in the orange (e.g., PCBs, lower left) or blue (cadmium, bottom center) colored font. The orange color font indicates a positive direction of the association in telomere length (for example, higher PCB levels were associated with longer telomere length) and a blue color font a negative direction in telomere length (e.g., higher cadmium levels were associated with shorter telomeres). Edges are drawn between exposures that co-occur with EWAS-identified exposures; for example, cadmium and lead are both correlated with volatile compound levels, seen in the bottom of the figure shows exposures that are correlated, or co-occur, with those found in a EWAS in telomere length.
One can utilize exposure co-occurrences to increase the power of detection of exposures associated with phenotype. How? Simply put, suppose two (of the M) exposures measured in a cohort study co-occur 100% of the time (e.g., their correlation equals 1). Therefore, these two exposures provide redundant information, and the actual analytic number of exposures will not be M, but M − 1 (subtracting 1 of the redundant exposures that co-occur 100% of the time). Conceptually, one can estimate how redundant variables are in a dataset by computing their correlations. The difference between the number of variables measured and the number of redundant variables is known as the "effective number of exposure variables" (in the toy example above, M − 1 is the effective number of exposure variables). Knowing the "effective number of exposure variables," or the number of exposures in the dataset after taking into account their co-occurrence/correlation, can increase the power of detection of exposures and mixtures by reducing the space of possible exposures to explore. As a proof-of-principle, using the NHANES, we re-estimated the effective number of variables for different categories of exposure [26••]. For example, we found many of the 38 total polychlorinated biphenyls (PCBs) measured in NHANES 1999-2004 to be redundant and estimated an effective number of PCBs to be 24, significantly less than the 38 measured. For details on how to estimate the effective number of variables to reduce the complexity of testing for exposure-phenotype associations, we direct readers to [26••].

Emerging Analytical Methods to Identify Interactions Between Exposures Associated to Phenotype
Emerging and existing machine-learning methods will enhance detection of mixtures associated with phenotypes and disease outcomes. Typically, machine-learning approaches apply algorithms to find variables (exposures) that are predictive of an outcome (phenotype) in two steps. In the first step, an algorithm "learns" the variables that are associated with the outcome. The algorithm is then tested in an independent dataset to estimate the predictive capability or generalizability of the algorithm.
The EWAS screening method considers each environmental factor in a separate linear model one at a time or iteratively. An issue that remains includes how to generalize beyond correlated factors associated with the outcome. One solution might be to test exposures that are identified in EWAS for interaction (e.g., [27]). For example, given L number of exposures identified in an EWAS, an investigator may choose to test all pairwise exposures of the L for interaction.
To discover interactions associated with the outcome, one needs to test or model them simultaneously. However, as the reader may know, modeling all possible interactions in a linear model will lead to overfitting and nongeneralizable associations. Therefore, to analytically select exposures in phenotype, one can leverage a "feature selection" algorithm, such as stepwise regression, whereby different combinations of variables are input in a model and the most predictive model is chosen while preserving model parsimony (limited the number of variables in the linear model). Classical stepwise regression is a challenge to apply due to their high variability in variable selection, ultimately reducing their prediction accuracy [28]. One alternative includes extensions to the linear model, such as regularized regression.
One class of well-known methods includes extensions to the linear model, known as "regularized" regression, such as the "least angle selection and shrinkage operator" (LASSO, [29]) or "ElasticNet" [30]. Both the LASSO and the ElasticNet avoid overfitting the model by constraining the size of coefficients ("shrinking"). One practical way of identifying interactions between pairs of exposure variables is to enter all possible pairwise exposure interactions into an algorithm such as LASSO, with a "multiplicative term" (e.g., interaction between exposure X 1 and exposure X 2 is entered as X 1 × X 2 ), expanding the scope of Eq. 1 to the following, assuming all exposure variables are continuous (and z-standardized or mean-subtracted and divided by the standard deviation) where all exposure variables X 1 through X m are included in the model as "additive terms" along with all pairs of multiplicative terms for interactions between pairs of variables (e.g.,X 1 × X 2 ), resulting in a model of size M plus M choose 2. Non-zero coefficients (e.g., a 1 ) on multiplicative terms indicate interaction. Therefore, the output of such a procedure may include a list of interacting pairs of exposuresand their additive termsthat are predictive of the phenotype or disease outcome after shrinkage.
Along with ElasticNet, Agier et al. demonstrate feasibility of other established and emerging methods for conducting EWAS-like analyses, including use of sparse partial least squares, "deletion/subtraction/addition" (DSA), and Graphical Unit Evolutionary Stochastic Search (GUESS) [31] in simulation studies. While promising, these methods still underperform in identifying single exposures in phenotypes as the number of correlated exposures increases, a problem that will undoubtedly influence identification of mixtures.
Another "off-the-shelf" method that may consider dependencies between exposure variables includes treebased methods and their variants, including random forests and boosted machines. The first tree-based method, "Classification and Regression Tree" (CART; [32]) is an algorithm that selects combinations of variables, or exposures, that are predictive of the phenotype (e.g., disease) that resemble a "rule" or decision tree. A decision tree for the prediction of diabetes (versus non-diabetics in a casecontrol study) could resemble: "if bisphenol A > 10 mg/dL and polychlorinated biphenyl 170 > 0.01 mg/dL then predict diabetes with probability 0.8." Rules or decision trees can thereby represent dependencies between multiple (greater than two) exposure variables associated with a phenotype. However, classical CART is known to have high variance, or their predictive capability is variable in the test step after learning in the training dataset. This occurs mainly because the algorithm is too specific or is said to overfit the training data. Newer tree-based methods, such as "random forests" aim to minimize this variance. Random forests lower variance by executing a CART-like algorithm on many randomized versions (or "bootstraps") of the training dataset and then selecting a prediction decision tree that produces an average prediction that results from the trees built on the multiple randomized versions of the data. Usually, these trees are deep and represent decision rules that are complex.
Lampa and colleagues applied another variant of a treebased approach to identify multiple exposures of a mixture associated with a phenotype called "boosting" [33•]. Boosted trees, unlike random forests, grow many small (e.g., two to three exposures) trees that each aim to reduce the prediction error. However, because random forests and boosted trees harness predictive capability of multiple trees (often hundreds and thousands of trees), additional analytical steps must be taken to ascertain how pairs of exposure variables are dependent or interact. This is done through estimation of Friedman and Popescu's H statistic [34]. In essence, the statistic estimates how much variance can be explained by interaction between pairs of exposure versus their additive contributions by the boosted tree algorithm and can be modified to ascertain interactions between multiple exposures. Lampa and colleagues tested this method to identify mixtures in 27 correlated exposures (e.g., PCBs, bisphenol A, etc.) in 1000 participants that were predictive of serum bilirubin. The investigators tentatively found an interaction between PCB-127 and bisphenol A in predicting serum bilirubin concentrations. Specifically, they found that an increase in the biomarker levels of both of these analytes results in a greater than additive increase in serum bilirubin levels.

Power, Replication, and Interpretation of Identified Mixtures
Practical considerations that can also be acknowledged to identify mixtures in phenotypes include (1) adequate statistical power or, equivalently, sample size, (2) interpretation of mixture-phenotype associations, and (3) replication of mixture-phenotype associations. First, it is well known that large sample sizes are required to find pairwise and higher order interactions. Conceptually, detecting mixtures is a type of "stratified" analysis. For example, suppose an investigator is attempting to identify an effect of a mixture consisting of two binary-valued exposures (E 1 and E 2 ) in a phenotype Y. Since E 1 and E 2 are binary valued, there are four possible configurations of the two exposures of the mixture. Inferring a mixture or interaction effect between E 1 and E 2 on Y requires data on all four strata of E 1 and E 2 , increasing the sample size requirement to detect stratum-specific associations in Y. As described above, considering more than two-way interactions between exposures in a putative mixture will consist of immense number of combinations, increasing the power burden. We claim that with current cohort sample sizes (i.e., hundreds to low thousands), it may be impossible to reproducibly find interactions between more than a few exposures in a phenotype.
Second, how does one interpret a mixture? Interpretation of pairwise variables is often conveyed through stratified analyses, whereby risk or correlations in a phenotype (Y) of one exposure (e.g., X 1 ) are displayed for different levels of another exposure (e.g., X 2 ) [35,36]. In a simple regression setting, this can be written as the following However, testing a three-way interaction between a mixture of three exposures (X 1 , X 2 , and X 3 ) is analytically more complex. For example, in a simple regression setting, the three-way interaction model is written as follows This model is much more difficult to interpret than one that contains only pairwise interaction terms (e.g., do all interaction coefficients, β 5 , β 6, β 7 , β 8 , have to be non-zero to infer a mixture?-the answer to this question is not an easy one to answer). Third, and relatedly, replicating a single exposurephenotype association can be reduced to a simple heuristic: the effect sizes must be concordant in independent cohorts or datasets. But how does one replicate a set of exposures associated with a phenotype? There are numerous coefficients to verify consistency when attempting to replicating associations between mixtures in phenotype, increasing the possibility of a false negative or positive finding. Model choice, or what mixture configurations (e.g., two-way, three-way, or N-way give example) are input into the model, will influence associations and inference [37].

Conclusions
It is hypothesized that environmental factors act in concert, or as mixtures, to induce changes in phenotype and risk for disease. The sheer number of possible combinations of exposures makes the identification of mixture-phenotype associations an analytic challenge. Specifically, identifying mixtures (i.e., interaction between exposures in a phenotype) is resource intensive and requires large sample sizes and power. Second, interpretation and replication of synergistic relationships between exposures associated to a phenotype are not as straightforward as interpreting single exposure-phenotype associations.
However, going forward in the coming high-throughput exposome era, we claim that methods such as EWAS, may help prioritize single agents which can then be tested systematically to identify mixtures. Secondly, narrowing down the potential mixtures by searching for naturally co-occurring exposures in databases such as NHANES may also enhance the search for prevalent mixtures associated with phenotypes (e.g., examples in Fig. 1 and Table 1); however, this will come at the cost of not identifying rare mixtures that contain rare exposures. Third, there is promise in extending statistical machine learning methods, such as regularized regression and tree-based methods, to identify dependent exposures. We suggest that more research effort be devoted in developing new analytic methods. In fact, as of this writing, the National Institute of Environmental Health Sciences (NIEHS) has issued a request for application (i.e., https://grants.nih. gov/grants/guide/rfa-files/RFA-ES-17-001.html) and sponsored a workshop [9•] to promote the development of new methods to identify mixtures. Ultimately, highthroughput measurement of exposure indicators (e.g., the exposome) will enable environmental health researchers to dissect comprehensive environmental burden of disease. However, new methods and larger datasets must be built to address the vast complexity of a large number of potential mixtures that could exist in phenotypic variability.