Interpretable machine learning for genomics

High-throughput technologies such as next-generation sequencing allow biologists to observe cell function with unprecedented resolution, but the resulting datasets are too large and complicated for humans to understand without the aid of advanced statistical methods. Machine learning (ML) algorithms, which are designed to automatically find patterns in data, are well suited to this task. Yet these models are often so complex as to be opaque, leaving researchers with few clues about underlying mechanisms. Interpretable machine learning (iML) is a burgeoning subdiscipline of computational statistics devoted to making the predictions of ML models more intelligible to end users. This article is a gentle and critical introduction to iML, with an emphasis on genomic applications. I define relevant concepts, motivate leading methodologies, and provide a simple typology of existing approaches. I survey recent examples of iML in genomics, demonstrating how such techniques are increasingly integrated into research workflows. I argue that iML solutions are required to realize the promise of precision medicine. However, several open challenges remain. I examine the limitations of current state-of-the-art tools and propose a number of directions for future research. While the horizon for iML in genomics is wide and bright, continued progress requires close collaboration across disciplines.


§1 Introduction
Technological innovations have made it relatively cheap and easy to observe biological organisms at molecular resolutions.High-throughput methods such as next generation sequencing and the full suite of "omic" platforms -e.g., genomic, proteomic, metabolomic, and related technologies -have inaugurated a new era of systems biology, providing data so abundant and detailed that researchers are not always sure just what to do with the newfound embarrassment of riches.One of the most salient traits of these datasets is their sheer size.
Sequencing technologies can record anywhere from a few thousand to a few billion features per sample.Another important factor, related but distinct, is that genomic data is not immediately intelligible to humans.Whereas a small child can accurately classify pictures of animals, experts cannot generally survey a genetic sequence and predict health outcomes -at least not without the aid of advanced statistical models.ologies.I examine important opportunities and challenges for iML in genomics, arguing that more intelligible algorithms will ultimately advance our understanding of systems biology and play a crucial role in realizing the promise of precision medicine.I show how iML algorithms augment the traditional bioinformatics workflow with explanations that can be used to guide data collection, refine training procedures, and monitor models throughout deployment (see Fig. 1).To do so, however, the field must overcome several conceptual and technical obstacles.
I outline these issues and suggest a number of future research directions, all of which require close interdisciplinary collaboration. to be independently and identically distributed instances of some fixed but unknown joint distribution (, ).The goal is to infer a function :  →  that maps -vectors to outcomes.For example, we may want to group cancer patients into subtypes with different prognostic trajectories based on gene expression.If outcomes are continuous, we say that  is a regressor; if they are categorical, we say that  is a classifier.In either case, performance is evaluated via some loss function that quantifies model error.While the expected loss -also known as the risk -cannot be directly calculated without knowledge of the underlying distribution, it can be estimated either on the training data or, preferably, on an independent test set sampled from the same distribution.Empirical risk minimization is the learning strategy whereby we select the top performing model from some predefined function class (Vapnik, 1995).Popular algorithms of this type include (potentially regularized) linear models, neural networks, and tree-based ensembles such as random forests and gradient boosting machines.See (Hastie, Tibshirani, & Friedman, 2009) for a good introduction.
It is not always obvious what would constitute a successful explanation of a given model prediction.Indeed, explanation itself is an epistemologically contested concept, the subject of ancient and modern philosophical debates (Woodward, 2019).It should perhaps come as no surprise then to learn that algorithmic explanations come in many flavors.The first point to acknowledge is that iML tools are used for different analytic purposes.For instance, they may help to estimate or understand a true functional relationship presumed to hold in nature.Alternatively, they may be used to analyze the behavior of a fitted model -to illuminate the black box, as it were.Finally, they may be involved in the design of so-called "glass box" algorithms, i.e. some novel function class specifically built for transparency.These goals may overlap at the edges, and methods originally intended for one may be repurposed for another.However, each represents a distinct task with its own challenges.Not all are equally prevalent in genomics, though this review will discuss examples of each.Keeping these aims separate is crucial to avoid the conceptual pitfalls addressed in Sect. 5.
The following typology is adapted from Molnar's (2021) (Rudin, 2019).Unfortunately, many interesting real-world problems cannot be adequately modelled with intrinsically explainable algorithms.Watson et al. (2019) argue that in clinical medicine, doctors are obligated to use whatever available technology leads to the best health outcomes on average, even if that involves opaque ML algorithms.Of course, flexible ML models are also prone to overfit the training data, especially in high-dimensional settings.The choice of which approach to use invariably depends on contextual factors -the task at hand, the prior knowledge available, and what assumptions the analyst deems reasonable.Should researchers choose to use some black box method, interpreting predictions will require post-hoc tools, which take a target model f as input and attempt to explain its predictions, at least near some region of interest.
Model-specific vs. model-agnostic.Model-specific iML solutions take advantage of the assumptions and architectures upon which particular algorithms are built to generate fast and accurate explanations.For example, much work in iML has been specifically devoted to deep neural networks (Bach et al., 2015;Montavon et al., 2017;Shrikumar, Greenside, & Kundaje, 2017;Sundararajan, Taly, & Yan, 2017), an especially rich class of functions with unique explanatory affordances and constraints.Model-agnostic tools, on the other hand, strive for more general applicability.Treating the fitted function f as a black box, they attempt to explain its predictions with few or no assumptions about the data generating process.Model-agnostic approaches are especially useful in cases where f is inaccessible (for example if an algorithm is protected by intellectual property laws), while model-specific methods are generally more efficient and reliable when f's structure is known.
Global vs. local.A global explanation helps the user understand the behavior of the target model f across all regions of the feature space.This is difficult to achieve when f is complex and/or high-dimensional.A local explanation, by contrast, is only meant to apply to the area near some particular point of interest.For instance, a properly specified linear regression is globally explainable in the sense that the model formula holds with equal probability for any randomly selected datapoint.However, a local linear approximation to some nonlinear f will fit best near the target point, and does not in general tell us anything about how the model behaves in remote regions of the feature space.In biological contexts, we can think of global and local explanations applying at population-and individual-levels, respectively.These are poles of a spectrum that also admits of intermediate alternatives, e.g.
subpopulation-or group-level explanations, which are possible as well.
A final axis of variation for iML tools is their output class.Typically, these methods explain predictions through some combination of images, statistics, and/or examples.Visual explanations are especially well suited to image classifiers (see Fig. 2).Other common visual approaches include plots that illustrate the partial dependence (Friedman, 2001) or individual conditional expectation (Casalicchio, Molnar, & Bischl, 2019) of variables, which can inform users about feature interactions and/or causal effects (Zhao & Hastie, 2019).Statistical outputs, by contrast, may include rule lists, tables, or numbers quantifying explanatory value in some predefined way.Finally, exemplary methods report informative datapoints, either in the form of prototypes, which typify a given class (Chen et al., 2019), or counterfactuals, which represent the most similar sample on the opposite side of a decision boundary (Wachter et al., 2018).These latter methods are less common in genomics, although conceptually similar matching algorithms are used in clinical medicine (Bica et al., 2021).Healthcare often magnifies social inequalities.For recent evidence, look no further than the COVID-19 pandemic, which disproportionately affects minority populations in the US and UK (Egede & Walker, 2020).ML threatens to automate these injustices.Obermeyer et al. (2019) found evidence of significant racial bias in a healthcare screening algorithm used by millions of Americans.Simulations suggest that rectifying the disparity would nearly triple the number of Black patients receiving medical care.Similar problems are evident in genomic research.
Individuals of white European ancestry make up some 16% of the global population, but constitute nearly 80% of all genome-wide association study (GWAS) subjects, raising legitimate concerns that polygenic risk scores and other tools of precision medicine may increase health disparities (Martin et al., 2019).As genomic screening becomes more prevalent, there will be substantial and justified pressure to ensure that new technologies do not reinforce existing inequalities.Algorithmic fairness and explainability may even be legally required under the European Union's 2018 General Data Protection Regulation (GDPR), depending on one's interpretation of the relevant articles (Selbst & Powles, 2017;Wachter et al., 2017).By making a model's reliance on potentially sensitive attributes more transparent, iML methods can help quantify and mitigate potential biases.

§3.2 To Validate
The second motivation concerns the generalizability of ML models.Supervised learning algorithms are prone to overfitting, which occurs when associations in the training data do not generalize to test environments.In a famous example, a neural network trained on data from a large New York hospital classified asthmatics with pneumonia as low risk, a result that came as a surprise to the doctors and data scientists working on the project (Caruana et al., 2015).
The algorithm had not uncovered some subtle pulmonological secret.On the contrary, the apparent association was spurious.Asthmatics with pneumonia are at such great risk that emergency room doctors immediately send them to the intensive care unit, where their chances of survival are relatively high.It was the extra medical attention, not the underlying condition, that improved outcomes for these patients.Naively applying this algorithm in a new environment -e.g., some hospital where patient triage is performed by a neural network with little or no input from doctors -could have grave consequences.
Overfitting has been observed in GWAS models (Nicholls et al., 2020), where associations that appear informative in one population do not transfer to another.Such failures can be difficult to detect given the complexity of the underlying signals, which may depend on environmental factors or subtle multi-omic interactions.The problem of external validity or transportability is well known in the natural and social sciences, if not always well understood (Bareinboim & Pearl, 2016;Pearl & Bareinboim, 2014).Because causal dependencies are robust to perturbations of upstream variables, environmental heterogeneity -e.g., different patient subpopulations or data collection protocols -can help isolate and quantify causal effects (Heinze-Deml et al., 2018;Peters et al., 2016).These methods have motivated novel GWAS methodologies that search for persistent associations across varying patterns of linkage disequilibrium (Li et al., 2021).iML algorithms, together with causal inference tools (Imbens & Rubin, 2015;Pearl, 2000;Peters et al., 2017), can help researchers identify and remove spurious signals, ensuring better generalizability to new environments.

§3.3 To Discover
A final goal of iML, less widely discussed than the previous two but arguably of greater interest in genomics, is to reveal unknown properties and mechanisms of the data generating process.
In this case, the guiding assumption is not that the target model f is biased or overfit; on the contrary, we assume that it has found some true signal and investigate its internal logic to learn more.This may mean examining weights from a support vector machine, approximating a decision boundary with some local linear model, or extracting Boolean rules to describe the geometry of some complex regression surface.Examples of all these approaches and more will be examined below, demonstrating how iML can be -and to some extent already has beenintegrated into genomic research workflows.By unpacking the reasoning that underlies highperformance statistical models, iML algorithms can mine for insights and suggest novel hypotheses in a flexible, data-driven manner.Rightly or wrongly, it is this capacity -not its potential utility for auditing or validation -that is likely to inspire more widespread adoption in bioinformatics.

§4 Methodologies and Applications
In this section, I introduce a number of prominent approaches to iML.As a running example, I will consider a hypothetical algorithm that classifies breast cancer patients into different subtypes on the basis of a diagnostic biomarker panel.Such tools have been in use for decades, although recent advances in ML have vastly increased their accuracy and sophistication (Cascianelli et al., 2020;Sarkar et al., 2021).I examine three particular iML methods at length -variable importance, local linear approximators, and rule lists -describing their basic forms and reviewing some recent applications.There is considerable variety within each subclass, and the choice of which to use for a given task is inevitably context dependent.While all three could be fruitfully applied for auditing, validation, or discovery, the latter has tended to dominate in genomic research to date.§4.1 Variable Importance Variable importance (VI) measures are hardly new.Laplace and Gauss, writing independently in the early 19 th century, each described how standardized linear coefficients can be interpreted as the average change in response per unit increase of a feature, with all remaining covariates held fixed.Coefficients with larger absolute values therefore suggest greater VI.If our cancer subtyping algorithm were linear, then we might expect large weights on genes such as BRCA1, which is strongly associated with basal-like breast cancer (BLBC) (Turner & Reis-Filho, 2006), and ESR1, a known marker of the luminal A subtype (Sørlie et al., 2003).
The ease of computing VI for linear models has been exploited to search for causal variants in single nucleotide polymorphism (SNP) arrays via penalized regression techniques like the lasso (Tibshirani, 1996) and elastic net (Zou & Hastie, 2005), both popular in GWAS (Waldmann et al., 2013).Random forests (Breiman, 2001a), one of the most common supervised learning methods in genomics (Chen & Ishwaran, 2012), can provide a range of marginal or conditional VI scores, typically based either on permutations or impurity metrics (Altmann et al., 2010;Nembrini et al., 2018;Strobl et al., 2008).Support vector machines (SVMs) may give intelligible feature weights depending on the underlying kernel (Schölkopf et al., 2004).For example, Sonnenburg et al. (2008) use a string kernel to predict splice sites in C. elegans and extract relevant biological motifs from the resulting positional oligomer importance matrices.The method has since been extended to longer sequence motifs and more general learning procedures (Vidovic et al., 2015;Vidovic et al., 2017).More recently, Kavvas et al. (2020) used an intrinsically interpretable SVM to identify genetic determinants of antimicrobial resistance (AMR) from whole genome sequencing data.Whether these methods tell us about the importance of features in nature or just in some fitted model f depends on whether we assume that f accurately captures the functional form of the relationship between predictors and outcomes.
To go back to the typology above, VI measures are global parameters that may be intrinsic (as in linear models) or post-hoc (as in random forest permutation importance).
Model-specific versions are popular, although several model-agnostic variants have emerged in recent years.These include targeted maximum likelihood measures (Hubbard, Kennedy, & van der Laan, 2018;Williamson et al., 2020), nested model tests using conformal inference (Lei et al., 2018;Rinaldo et al., 2019), and permutation-based reliance statistics (Fisher et al., 2019).Such methods typically involve computationally intensive procedures such as bootstrapping, permutations, and/or model refitting, which pose both computational and statistical challenges when applied in settings with large sample sizes and/or highdimensional feature spaces, such as those commonly found in genomics.
A notable exception specifically designed for high-dimensional problems is the knockoff test for variable selection (Barber & Candès, 2015;Candès et al., 2018).The basic idea behind this approach is to generate a set of "control" variables -the eponymous knockoffs -against which to test the importance of the original input features.For a given  ×  design matrix , we say that  9 is the corresponding knockoff matrix if it meets the following two criteria: It should be noted that VI measures can be applied either to raw or processed features.
The latter can be especially valuable when the original data is difficult to interpret.In a recent paper, Chia et al. (2020)  Alice wants to be certain that the classification is correct.A local explanation reveals that, though BRCA1 mutations account for many BLBC predictions, in her case, the feature is relatively unimportant.Instead, her local explanation turns largely on CXCR6 -a gene associated with the Basal I subtype, which has a better prognosis on average than Basal II, and is therefore less likely to require high doses of chemotherapy (Milioli et al., 2017).Local linear approximators have become the de facto standard method for computing local explanations of this sort (Bhatt et al., 2020).These algorithms assign weights to each input feature that sum to the model output.The idea derives from the insight that even though target functions may be highly complex and nonlinear, any point on a continuous curve will have a linear tangent (see Fig. 3).By estimating the formula for this line, we can approximate the regression surface or decision boundary at a particular point.Popular examples of local linear approximators include LIME (Ribeiro et al., 2016) and SHAP (Lundberg & Lee, 2017), both of which are implemented in user-friendly Python libraries.The latter has additionally been incorporated into iML toolkits distributed by major tech firms such as Microsoft,2 Google,3 and IBM.4I will briefly explicate the theory behind this method, which unifies a number of similar approaches, including LIME.
SHAP is founded on principles from cooperative game theory, where Shapley values were originally proposed as a way to fairly distribute surplus across a coalition of players (Shapley, 1953).In iML settings, players are replaced by input features and Shapley values measure their contribution to a given prediction.Let  !∈ ℝ % denote an input datapoint and ( ! ) ∈ ℝ the corresponding output of function .Shapley values decompose this number into a sum of feature attributions: , where  0 denotes a baseline expectation (e.g., the mean response) and  & ( ≥ 1) the weight assigned to feature  & at point  ! .Let : 2 % → ℝ be a value function such that () is the payoff associated with feature subset  ⊆ [] and ({∅}) = 0.The Shapley value  & is given by j's average marginal contribution to all subsets that exclude it: It can be shown that this is the unique value satisfying a number of desirable properties, including efficiency, linearity, sensitivity, and symmetry.(2020).Each reference distribution offers certain advantages and disadvantages, but the choice of which to use is ultimately dependent upon one's analytical goals (see Sect. 5).
SHAP has been used to identify biomarkers in a number of genomic studies.A modelspecific variant known as DeepSHAP -with close ties to related methods DeepLIFT (Shrikumar et al., 2017) and integrated gradients (Sundararajan et al., 2017), all techniques for explaining the predictions of deep neural networks -was recently used in conjunction with a model for predicting differential expression based on genome-wide binding sites on RNAs and promoters (Tasaki et al., 2020).The same tool was used to identify CpG loci in a DNA methylation experiment that best predicted a range of biological and clinical variables, including cell type, age, and smoking status (Levy et al., 2020).Yap et al. (2021) trained a convolutional neural network to classify tissue types using transcriptomic data, prioritizing 5 For formal statements of the Shapley axioms, see Lundberg & Lee (2017).
top genes as selected by DeepSHAP.Another SHAP variant -TreeExplainer (Lundberg et al., 2020), which is optimized for tree-based ensembles such as random forests -was used to identify taxa in the skin microbiome most closely associated with various phenotypic traits (Carrieri et al., 2021).SHAP is also gaining popularity in mass spectrometry, where data heterogeneity can complicate more classical inference procedures (Tideman et al., 2020;Xie et al., 2020).In each of these cases, further investigation was required to confirm the involvement of selected features in the target functions.The outputs of SHAP, or any other iML algorithm for that matter, are by no means decisive or infallible.However, they offer a principled and novel approach for feature ranking and selection, as well as exploring interactions that can reveal unexpected mechanisms and guide future experiments.For instance, Lundberg et al.
(2020) use Shapley interaction values to demonstrate that white blood cells are positively associated with risk of end stage renal disease (ESRD) in patients with high blood urea nitrogen, but negatively associated with ESRD in patients with low blood urea nitrogen (see Fig. 4).The multivariate nature of these attributions makes them more informative than the probe-level analyses common in differential expression testing, even when shrinkage estimators are used to "pool information" across genes (Law et al., 2014;Love, Huber, & Anders, 2014;Smyth, 2004).They are potentially more meaningful to individuals than global estimates such as those provided by knockoffs, since Shapley values are localized to explain particular predictions rather than average behavior throughout an entire feature space.With their model-agnostic flexibility and axiomatic underpinnings, local linear approximators are an attractive tool for genomic research.

§4.3 Rule Lists
Rule lists are sequences of if-then statements, often visualized as a decision tree.Psychological studies have shown that humans can quickly comprehend explanations with such structures, at least when there are relatively few conditions, which is why rule lists are widely promoted as "intrinsically interpretable" (Lage et al., 2018).This accords with the privileged position of material implication in propositional logic, where → is typically regarded as a primitive relation, along with conjunction (∧), disjunction (∨), and negation (¬).These logical connectives form a functionally complete class, capable of expressing all possible Boolean operations.This flexibility, which allows for nonmonotonic and discontinuous decision boundaries, affords greater expressive power than linear models.Thus, if the true reason for Alice's unexpected diagnosis lies neither in her BRCA1 allele nor her CXCR6 expression but rather in some nonlinear interaction between the two, then she may be better off with a rule list that can concisely explain the (local or global) behavior of that function.
In statistical contexts, rule lists are generally learned through some process of recursive partitioning.For instance, the pioneering classification and regression tree (CART) algorithm (Breiman et al., 1984) predicts outcomes by dividing the feature space into hyperrectangles that minimize predictive error.Computing optimal decision trees is NPcomplete (Hyafil & Rivest, 1976), but CART uses greedy heuristics that generally work well in practice.Because individual decision trees can be unstable predictors, they are often combined through ensemble methods such as bagging (Breiman, 2001a), in which predictions are averaged across trees trained on random bootstrap samples, and boosting (Friedman, 2001), in which predictions are summed over a series of trees, each sequentially optimized to improve upon the last.While combining basis functions tends to improve predictions, it unfortunately makes it difficult if not impossible to extract individual rules for better model interpretation.
However, some regularization schemes have been developed to post-process complex learning forests for precisely this purpose.For instance, Friedman & Popescu (2008) propose the RuleFit algorithm, which mines a collection of Boolean variables by extracting splits from a gradient boosted forest.These engineered features are then combined with the original predictors in a lasso regression, producing a sparse linear combination of splits and inputs.
Nalenz & Villani (2018) develop a similar procedure using a Bayesian horseshoe prior (Carvalho et al., 2010) instead of an  # penalty to induce shrinkage.They also add splits extracted from a random forest with those learned via gradient boosting to promote greater diversity.
Another strand of research in this area has focused on falling rule lists, which create monotonically ordered decision trees such that the probability of the binary outcome  = 1 strictly decreases as one moves down the list.These models were originally designed for medical contexts, where doctors must evaluate patients quickly and accurately.For instance, Letham et al. (2015) design a Bayesian rule list to predict stroke risk, resulting in a model that outperforms leading clinical diagnostic methods while being small enough to fit on an index card.Falling rule lists can be challenging to compute -see the note above about NPcompleteness -and subsequent work has largely focused on efficient optimization strategies.
Specifically, researchers have developed fast branch-and-bound techniques to prune the search space and reduce training time (Chen & Rudin, 2018;Yang et al., 2017), culminating in several tree-learning methods that are provably optimal under some restrictions on the input data (Angelino et al., 2018;Hu et al., 2019).
Less work has been done on localized rule lists, but there have been some recent advances in this direction.Ribeiro et al. (2018) followed up on their 2016 LIME paper with a new method, Anchors, which combines graph search with a multi-armed bandit procedure to find a minimal set of sufficient conditions for a given model prediction.Guidotti et al. (2018) introduce LORE, which simulates a balanced dataset of cases using a genetic algorithm designed to sample heavily from points near the decision boundary.A decision tree is then fit to the synthetic dataset.Lakkaraju et al. ( 2019)'s MUSE algorithm allows users to specify particular features of interest.Explanations are computed as compact decision sets within this subspace.
Figure 5. Example rule lists for AMR prediction from genotype data.Each rule detects the presence/absence of a k-mer and is colored according to the genomic locus at which it was found.From (Drouin et al., 2019, p. 4).
To date, rule lists have not been as widely used in genomics as feature attributions or local linear approximations.This likely has more to do with computational obstacles than any preference for particular model assumptions, per se.Still, some recent counterexamples buck the trend.Drouin et al. (2019) combine sample compression theory with recursive partitioning to learn interpretable genotype-to-phenotype classifiers with performance guarantees.As depicted in Fig. 5, these lists -visualized as trees and formulated as logical propositions below -can predict AMR in M. tuberculosis and K. pneumoniae with high accuracy using just a small handful of indicator functions over the space of all k-mers.Though their experiments focus on AMR, the method can be applied more generally.Anguita-Ruiz et al. ( 2020) use a sequential rule mining procedure to uncover gene expression patterns in obese subjects from longitudinal DNA microarray data.Garvin et al. (2020) combined iterative random forests (Cliff et al., 2019), a method for gene regulatory network inference, with random intersection trees (Shah & Meinshausen, 2014), which detect stable interactions in tree-based ensembles, to discover potentially adaptive SARS-CoV-2 mutations.§5 Open Challenges Despite all the recent progress in iML, the field is still struggling with several challenges that are especially important in genomics.I highlight three in particular: ambiguous targets, error rate control, and variable granularity.§5.1 Ambiguous Targets I have done my best above to be clear about the distinction between two tasks for which iML is often used: to better explain or understand (a) some fitted model f, or (b) some natural system that f models.It is not always obvious which goal researchers have in mind, yet modeland system-level analyses require entirely different tools and assumptions.Whereas a supervised learning algorithm does not generally distinguish between correlation and causation, the difference is crucial in nature.Clouds predict rain and rain predicts clouds, but the causal arrow runs in only one direction.Genomic researchers face a fundamental ambiguity when seeking to explain, say, why Alice received her unexpected algorithmic diagnosis.Is the goal to explain why the model made the prediction it did, independent of the ground truth?Or, alternatively, is the goal to understand what biological conditions led to the diagnosis?The former, which I will call a model-level explanation, may be preferable in cases of auditing or validation, where the analyst seeks merely to understand what the algorithm has learned, without any further restrictions.In this case, we do not necessarily assume that the model is correct.The latter, which I will call a system-level explanation, is more useful in cases of discovery and/or planning, where real-world mechanisms cannot be ignored.In such instances, we (tentatively) presume that the prediction in question is accurate, at least to a first approximation.
Model-level explanations are generally easier to compute, since features can be independently perturbed one at a time.This is the default setting for popular iML tools such as LIME and SHAP.System-level explanations, by contrast, require some structural assumptions about dependencies between variables.Such assumptions may be difficult or even impossible to test, raising legitimate questions about identifiability and underdetermination.Yet, as Pearl (2000) has long argued, there is value in articulating one's assumptions clearly, opening them up for scrutiny and debate instead of burying them behind defaults.The last year has seen a burst of new papers on causally-aware iML tools (Heskes et al., 2020;Karimi et al., 2020;Wang et al., 2021), indicating that researchers in computational statistics are increasingly sensitive to the distinction between model-and system-level analyses.Genomic practitioners should avail themselves of both explanatory modes, but always make sure the selected tool matches the stated aim.Addressing this challenge is difficult at both a conceptual level, because the distinction between model-and system-level analyses may not be immediately obvious to practitioners, and at a technical level, because causal approaches can require careful covariate adjustments and data reweighting.The sooner these issues are addressed head on, the more fruitful the results will be.

§5.2 Error Rate Control
Another open challenge in iML concerns bounding error and quantifying uncertainty.
Bioinformaticians are no strangers to p-values, which are typically fixed at low levels to control false positive rates in GWAS (Panagiotou & Ioannidis, 2012), or else adjusted to control familywise error rates (Holm, 1979) or FDR (Benjamini & Hochberg, 1995) in other omic settings.Bayesians have their own set of inferential procedures for multiple testing scenarios (Gelman et al., 2012;Scott & Berger, 2010), although there is some notable convergence with frequentism on the subject of q-values (Storey, 2003).In any event, the error-statistical logic that guides testing in computational biology is largely absent from contemporary iML.This may be partially a result of cultural factors.As Breiman (2001b) observed some 20 years ago, there are two main cultures of statistical modeling -one focused on predicting outcomes, the other on inferring parameter values.Authors in contemporary iML, which grew almost exclusively out of the former camp, are generally less worried about error rates than their colleagues in the latter camp.
Several critics have pointed out that post-hoc methods do not generally provide standard errors for their estimates or goodness of fit measures for their approximations (Ribeiro et al., 2018;Wachter et al., 2018).Indeed, it would be difficult to do so without some nonparametric resampling procedure such as the bootstrap (Davison & Hinkley, 1997), which would add considerable computational burden as the number of samples and/or features grows.It is not clear that such methods are even applicable in these settings, however, given the instability of bootstrap estimators in high dimensions (Karoui & Purdom, 2018).This makes it impossible to reliably rank biomarkers or evaluate the reliability of experimental results.Knockoffs are a notable exception, given their focus on FDR control.However, the Candès et al. (2018) algorithm is not, strictly speaking, a post-hoc iML method, since it requires training an algorithm on an expanded  × 2 design matrix that includes both the original features and their knockoffs.This is just a preliminary step toward a final model, which contains only those variables that pass some predetermined FDR threshold.The modified method of Watson & Wright (2021) adapts knockoffs for post-hoc importance, but in so doing loses the finite sample error rate guarantees of the original adaptive thresholding procedure.
A handful of other iML methods make at least a nominal effort to quantify uncertainty (Gimenez & Zou, 2019;Ribeiro et al., 2018;Schwab & Karlen, 2019).Yet these examples are perhaps most notable for their scarcity.To gain more widespread acceptance in genomicsand the sciences more generally -iML algorithms will need to elevate rigorous testing procedures from an occasional novelty to a core requirement.Generic methods for doing so in high dimensions raise complex statistical challenges that remain unresolved at present.

§5.3 Variable Granularity
A final challenge I will highlight concerns variable granularity.This is not a major issue in the low-or moderate-dimensional settings for which most iML tools are designed.But it quickly becomes important as covariates increase, especially when natural feature groupings are either known a priori or directly estimable from the data.For instance, it is well-established that genes do not operate in isolation, but rather work together in co-regulated pathways.
Thus, even when a classifier uses gene-level RNA-seq data as input features, researchers may want to investigate the prognostic value of pathways to test or develop new hypotheses.In multi-omic models, where features typically represent a range of biological processes, each measured using different platforms, analysts may want to know not just which variables are most predictive overall, but which biomarkers are strongest within a given class.Interactions across subsystems may also be of particular interest.
Few methods in use today allow users to query a target model at varying degrees of resolution like this, but such flexibility would be a major asset in systems biology.Once again, some exceptions are worth noting.Sesia et al. ( 2020) introduce a knockoff method for localizing causal variants at different resolutions using well-established models of linkage disequilibrium.This amounts to a global post-hoc feature attribution method for whole genome sequencing data.The leave-out-covariates (LOCO) statistic (Lei et al., 2018;Rinaldo et al., 2019) can be used to quantify the global or local importance of arbitrary feature subsets, but only at the cost of extensive model refitting.Groupwise Shapley values have been formally described (Conitzer & Sandholm, 2004) but are not widely used in practice.Resolving the granularity problem will help iML tools scale better in high-dimensional settings, with major implications for genomics.The problem is complicated, however, by the fact that hierarchical information regarding biomolecular function is not always available.Automated methods for discovering such hierarchies are prone to error, while data-driven dimensionality reduction techniques -e.g., the latent embeddings learned by a deep neural network -can be difficult or impossible to interpret.Promising directions of research in this area include causal coarsening techniques (Beckers et al., 2019;Chalupka et al., 2017) and disentangled representation learning (Locatello et al., 2019;Schölkopf et al., 2021).§6 Conclusion The pace of advances in genomics and ML can make it easy to forget that both disciplines are relatively young.The subfield of iML is even younger, with the vast majority of work published early career academics in particular are being drawn to this highly interdisciplinary area of research.Existing work has been promising, though not without its challenges.As the field continues to gather more data, resources, and brainpower, there is every reason to believe the best is yet to come.

Figure 1 .
Figure 1.The classic bioinformatics workflow spans data collection, model training, and deployment.iML augments this pipeline with an extra interpretation step, which can be used during training and throughout deployment (incoming solid edges).Algorithmic explanations (outgoing dashed edges) can be used to guide new data collection, refine training, and monitor models during deployment.The remainder of this article is structured as follows.I review relevant background material and provide a typology of iML in Sect. 2. I examine three motivations for iML in Sect.3, each of which has a role to play in computational biology.In Sect.4, I introduce a number of popular iML methodologies that have been used in recent genomic research.A discussion follows in Sect.5, where I consider three open challenges for iML that are especially urgent in bioinformatics.Sect.6 concludes.

Figure 2 .
Figure 2. A saliency map visually explains a cancer diagnosis based on whole-slide pathology data.The highlighted regions on the right pick out the elements of the image that the algorithm deemed most strongly associated with malignancy.From (Zhang et al., 2019, p. 237).

( a )
Pairwise exchangeability.For any proper subset  ⊂ [] = (1, … , ): (,  9 ) '()%(+) = -',  9 *, where = -represents equality in distribution and the swapping operation is defined below.(b) Conditional independence. 9 ⊥  | .A swap is obtained by switching the entries  & and  B & for each  ∈ .For example, with  = 3 and  = {1, 3}: ( # ,  .,  / ,  B # ,  B .,  B / ) '()%(+) = -' B # ,  .,  B / ,  # ,  B .,  / *.A supervised learning algorithm is then trained on the expanded  × 2 feature matrix, including original and knockoff variables.If  & does not significantly outperform its knockoff  B & by some model-specific importance measure -Candès et al. (2018) describe methods for lasso and random forests -then the original feature may be safely removed from the final model.The authors outline an adaptive thresholding procedure that provably provides finite sample false discovery rate (FDR) control for variable selection under minimal assumptions.They also describe a conditional randomization test (CRT) for asymptotic type I error rate control, in which observed values are compared to null statistics repeatedly sampled from the knockoff distribution.Experiments suggest the CRT is more powerful than the adaptive procedure, and therefore could be preferable when signals are sparse.However,Candès et al.    caution that the CRT may be infeasible for large datasets.The most challenging aspect of this pipeline is computing the knockoff variables themselves.However, when good generative models are available, the technique can be quite powerful.Watson & Wright (2021)  use Candès et al.'s original semidefinite programming formulation to approximate knockoffs for a DNA microarray experiment.Sesia et al. (2019) extend the method to genotype data using a hidden Markov model to sample knockoffs.In a recent study, Bates et al. (2020) used knockoffs with GWAS data from parents and offspring to create a "digital twin test" based on the CRT, in which causal variants are identified by exploiting the natural randomness induced by meiosis.The method has greater power and localization than the popular transmission disequilibrium test.

Figure 3 .
Figure3.A complex decision boundary (the pink blob/blue background) separates red crosses from blue circles.This function cannot be well-approximated by a linear model, but the boundary near the large red cross is roughly linear, as indicated by the dashed line.From(Ribeiro et al., 2016(Ribeiro et al.,  , p. 1138)).

Figure 4 .
Figure 4. Shapley values show that high white blood cell counts increase the negative risk conferred by high blood urea nitrogen for progression to end stage renal disease (ESRD).From (Lundberg et al., 2020, p. 61).
in just the last three to five years.The achievements to date at the intersection of these research programs are numerous and varied.Feature attributions and rule lists have already revealed novel insights in several genomic studies.Exemplary methods have not yet seen similar uptake, but that will likely change with better generative models.As datasets grow larger, computers become faster, and theoretical refinements continue to accumulate in statistics and biology, iML will become an increasingly integral part of the genomics toolkit.I have argued that better algorithmic explanations can serve researchers in at least three distinct ways: by auditing models for potential bias, validating performance before and throughout deployment, and revealing novel mechanisms for further exploration.I provided a simple typology for iML and reviewed several popular methodologies with a focus on genomic applications.Despite considerable progressand rapidly expanding research interest, iML still faces a number of important conceptual and technical challenges.I highlighted three with particular significance for genomics: ambiguous targets, limited error rate control, and inflexible feature resolution.These obstacles can be addressed by iML solutions in a post-hoc or intrinsic manner, with model-agnostic or model-specific approaches, via global or local explanations.All types of iML require further development, especially as research in supervised learning and genomics continues to evolve.Ideally, iML would become integrated into standard research practice, part of hypothesis generation and testing as well as model training and deployment.As the examples from Sect. 4 illustrate, this vision is already on its way to becoming a reality.The future of iML for genomics is bright.The last few years alone have seen a rapid proliferation of doctoral dissertations on the topic -e.g., Greenside (2018); Danaee (2019); Nikumbh (2019); Kavvas (2020); Ploenzke (2020); and Shrikumar (2020) -suggesting that

2 Local Linear Approximators
use direct wavelet transform to capture the location and frequency of bacterial Raman spectra as a preprocessing step toward building a more interpretable classifier.They use knockoffs for (processed) feature selection and train a multinomial logistic regression, reporting performance on par with the best black box models.Of course, not all genomic settings have comparably interpretable low-dimensional representations.I revisit the issue of variable abstraction and granularity as an open challenge in Sect.5.3.§4.Perhaps Alice shows few obvious signs of BLBC and deviates markedly from the classic patient profile, yet our algorithm assigns her to this class with high probability.Given the aggressive treatment regime likely to follow such a diagnosis, All methods described above are global in scope.The goal in many iML settings, however, is to provide explanations for individual predictions.This could be useful if, for instance, a subject receives an unexpected diagnosis.