1 Introduction

In many fields, statistical testing is classically used as an exploratory tool to look for associations between a variable of interest and many possible explanatory variables. For example, in transcriptomics, the link between a phenotype and the expression of tens of thousands of genes is tested (McLachlan et al. 2005); in Genome Wide Association Studies (GWAS), the association between millions of markers and a phenotype is tested (Bush and Moore 2012); and in functional Magnetic Resonance Imaging (fMRI), the goal is to identify voxels that are differentially activated between two conditions (Cremers et al. 2017).

This problem of multiple comparisons dates back to the work of Tukey (Tukey 1953). It has since been the subject of an abundant literature that aims at controlling a probability of error of some sort. Most of the literature focuses on the control of the Family-Wise Error Rate (FWER) (Bland and Altman 1995), defined as the probability of at least one false discovery among the detections, or of the False Discovery Rate (FDR) (Benjamini and Hochberg 1995), defined as the expected proportion of false positives among the detections.

Most of the correction procedures for controlling the FWER or the FDR, such as the popular Benjamini-Hochberg (BH) procedure, rely on independence, or some form of weak dependence, among the hypotheses, which is rarely observed in practice. Multiple testing under dependence is a difficult problem occurring in many fields. In transcriptomics, differential analysis has to deal with gene expressions that are often highly correlated. When performing GWAS, linkage disequilibrium imposes a strong spatial dependence between markers, and in fMRI, two spatially close voxels often have comparable activation.

The control of the FDR remains valid under arbitrary dependency structures by replacing the BH procedure with the more conservative BY procedure of Benjamini and Yekutieli (2001). However, results obtained on simulated datasets show a substantial loss of power when the true dependency structure is not taken into account, as discussed in depth in Blanchard et al. (2020).

An alternative approach for dealing with multiple testing is to reduce the number of tests by aggregating certain hypotheses. Aggregation strategies vary and can be based on a priori knowledge (e.g. metabolic pathways, functional modules of genes) or on clustering algorithms (Sankaran and Holmes 2014; Renaux et al. 2020).

This article aims to take into account the dependencies between variables in order to offer a powerful statistical procedure for multiple testing. A hierarchical dependency structure between variables is assumed to be known up to certain constants. This assumption is common in our motivating example of microbiome studies (Sankaran and Holmes 2014; Xiao et al. 2017; Huang et al. 2021; Matsen IV and Evans 2013; Silverman et al. 2017), where the phylogeny is a natural hierarchical structure encoding similarities between variables (namely species in that context). The hypotheses tested can then be organized in a tree structure which captures correlations at different scales of observation. This type of hierarchical structure is also observable in transcriptomics differential analysis, where gene expressions can easily be represented by a hierarchy based on expression correlation. In GWAS and fMRI, spatial dependence also proves to be very suitable for hierarchical modeling (Ambroise et al. 2019; Eickhoff et al. 2015; Sesia et al. 2020).

We propose to model the hierarchical structure of the multiple tests through an Ornstein-Uhlenbeck process on a tree. The process correlation structure is used for smoothing the p-values, after conversion to z-scores, similarly to the algorithm proposed in Xiao et al. (2017) but with an explicit underlying model.

We then consider a three-stage approach for our differential analysis procedure. The first stage reframes the initial problem as a linear regression problem that preserves the hierarchical structure. This linear problem is ill-posed (\(p \sim 2n\)) and we therefore resort to an \(\ell _1\)-penalized estimation of the mean of the Ornstein-Uhlenbeck process. The second stage produces asymptotically valid p-values. The estimates produced by \(\ell _1\)-penalized estimation are indeed biased and offer no theoretical guarantees about their asymptotic distribution; we therefore correct them using a debiasing procedure (Javanmard and Montanari 2013, 2014; Zhang and Zhang 2014) to compute valid p-values. The third and final stage controls the FDR of the overall procedure, using the tuning strategy of Javanmard et al. (2019).

The selection strength of the Ornstein-Uhlenbeck process and the penalty parameter are hyperparameters of our model, whose selection is achieved via a Bayesian Information Criterion (BIC). We provide some background on hierarchical procedures in Sect. 2, introduce the model and statistical procedure in Sect. 3 and detail the computational steps in Sect. 4. The performance of the algorithm is assessed via simulations in Sect. 5. The use of the proposed model is illustrated in Sect. 6, where we demonstrate its ability to discover novel associations in a metagenomic dataset.

2 Background

2.1 Examples of multiple testing strategies

A classic example in genomics consists in grouping the markers according to whether they belong to the same gene (aggregation based on a priori knowledge). The genes can then be grouped according to their similarity, computed for example from expression profiles. Kim et al. (2010), for example, proposed a hierarchical testing strategy that controls the FWER by first testing clusters of genes and then individual genes associated with a phenotype, with the goal of finding genomic regions associated with a specific type of cancer. This type of top-down approach relies on the sequential rejection principle (Goeman and Finos 2012; Meinshausen 2008; Renaux et al. 2020).

fMRI is another domain where tests are aggregated: neighboring voxels that are highly correlated are aggregated into a single voxel cluster. Benjamini and Heller (2007) propose an adaptation of the False Discovery Rate (FDR) to allow for cluster-level multiple testing for fMRI data.

Ad hoc aggregation methods for multiple testing also exist in metagenomics. LEfSe (Segata et al. 2011) follows a bottom-up approach where a factorial Kruskal-Wallis rank sum test is applied to each feature with respect to a class factor, followed by a pairwise Wilcoxon test and a linear discriminant analysis. MiLineage (Tang et al. 2017) performs multivariate tests on multiple taxa within a lineage to test the association of lineages with a phenotypic outcome.

2.2 Independence assumption

The assumption of independence of the tests is convenient as it enables both exact analyses and simple error bounds for classical procedures (e.g. Benjamini and Hochberg 1995). It is however unrealistic in practice. In many fields, including all the previous examples, measurements typically exhibit strong correlations. Some correction procedures, like the one proposed by Benjamini and Yekutieli (2001), make few assumptions while guaranteeing control of the FDR. Those general guarantees come with a high cost in terms of statistical power: the nominal FDR is typically much smaller than the target, resulting in many false negatives. Permutation procedures are an appealing alternative that can automatically adapt to the dependence structure of the p-values (Tusher et al. 2001) but may fail when confronted with unbalanced designs or correlated data. Knowledge of the correlation structure can be leveraged to increase the power while still controlling the FDR below a given target. Several approaches have been developed along those lines when the tests are organized along a hierarchical structure, typically encoded in a tree.

2.3 Hierarchical testing

The Hierarchical FDR (hFDR) introduced by Yekutieli (2008), and implemented in the R package structSSI (Sankaran and Holmes 2014), relies on a top-down algorithm to sequentially reject hypotheses organized in a tree. The same approach is used in Renaux et al. (2020) to select groups of variables arranged in a clustering tree. However, this approach suffers from some limitations, as shown in Bichat et al. (2020) and Huang et al. (2021). First, the algorithm in its vanilla formulation commonly fails to move down the tree because of a failure to reject the topmost node. Second, it only controls an a posteriori FDR level, which is a complex function of the (user-defined) a priori FDR level and the structure of rejected nodes. This makes it difficult to calibrate the a priori FDR that would achieve a target a posteriori FDR and thus to compare it to other correction methods. Finally, it does not produce a corrected p-value, or q-value, per leaf, but only a reject / no reject decision, and was shown in Bichat et al. (2020) to perform no better than BH in many instances. Given all these drawbacks, we did not include the hFDR in our benchmark and use BH as a baseline instead.

StructFDR (Xiao et al. 2017) was developed for metagenomics Differential Abundance Testing (DAT) and relies on z-score / p-value smoothing followed by permutation correction. Given any taxa-wise DAT procedure, p-values \({\mathfrak {p}}\) are first computed for all m taxa (i.e. leaves of the tree) and then transformed to z-scores \({\mathfrak {z}}\). The tree is used to compute a distance matrix \(\left( {\mathbf {D}}_{i,j}\right) \), which is then turned into a correlation matrix \({\mathbf {C}}_{\rho } = \left( \exp \left( -2\rho {\mathbf {D}}_{i,j}\right) \right) \) between taxa using a Gaussian kernel. The z-scores are then smoothed using the following hierarchical model:

$$\begin{aligned}&\left. {\mathfrak {z}}\mid \mu \right. \sim \mathcal {N}_m\left( \mu ,\sigma ^2{\mathbf {I}}_m\right) , \\&\quad \mu \sim \mathcal {N}_m\left( \gamma {\mathbf {1}}_m,\tau ^2{\mathbf {C}}_{\rho }\right) , \end{aligned}$$

where \(\mu \) captures the effect size of each taxon and \({\mathfrak {z}}\) is a noisy observation of \(\mu \). The maximum a posteriori estimator \(\mu ^*\) of \(\mu \) is given by

$$\begin{aligned} \mu ^* = \left( {\mathbf {I}}_m+ k{\mathbf {C}}_{\rho }^{-1}\right) ^{-1}\left( k{\mathbf {C}}_{\rho }^{-1}\gamma {\mathbf {1}}_m+{\mathfrak {z}}\right) \quad \text {where} \quad k = \sigma ^2 / \tau ^2 . \end{aligned}$$

The FDR is controlled by means of a resampling procedure to estimate the distribution of \(\mu ^*\) under \(H_0\) and estimate adjusted p-values \({\mathfrak {q}}^{\text {sf}}\). This method is implemented in the StructFDR package (Chen 2018).
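To make the smoothing step concrete, here is a minimal numpy sketch of the MAP estimator \(\mu ^*\) above. The patristic distance matrix D, the z-scores and the values of \(\rho \), k and \(\gamma \) are illustrative placeholders; in StructFDR these quantities are estimated from the data.

```python
import numpy as np

def smooth_zscores(z, D, rho=1.0, k=1.0, gamma=0.0):
    """MAP smoothing of z-scores on a tree:
    C_rho = exp(-2 rho D) and mu* = (I + k C^-1)^-1 (k C^-1 gamma 1 + z)."""
    m = len(z)
    C = np.exp(-2.0 * rho * D)          # tree-induced correlation matrix
    C_inv = np.linalg.inv(C)
    A = np.eye(m) + k * C_inv
    b = k * C_inv @ (gamma * np.ones(m)) + z
    return np.linalg.solve(A, b)        # mu*, the smoothed z-scores

# toy example: 3 taxa, the first two are close on the tree
D = np.array([[0.0, 0.1, 1.0],
              [0.1, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
print(smooth_zscores(np.array([-2.5, -0.5, 0.3]), D))
```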

TreeclimbR (Huang et al. 2021) is a bottom-up approach also developed for metagenomics DAT but with a broader scope. It relies on aggregating abundances at each node of the tree (understood as a cluster of taxa) and performing a test to compute one p-value per node (compared to one test per leaf for StructFDR). The main idea is then to use those p-values to compute a score for node i

$$\begin{aligned} U_i(t) = \left| \frac{\sum _{k\in B(i)} {\mathfrak {s}}_k \mathbb {1}_{\{{\mathfrak {p}}_k \le t\}}}{\#B(i)}\right| \end{aligned}$$

where B(i) is the set of descendants of node i, \({\mathfrak {p}}_k\) and \({\mathfrak {s}}_k \in \{-1, +1\}\) are the p-value of node k and the sign of the associated effect, and t is a tuning parameter. A node i is considered a candidate if \(U_i(t) \simeq 1\) and \({\mathfrak {p}}_i < \alpha \). This ensures that all its descendants are (i) significant at level t and (ii) have effects of consistent sign. Finally, multiplicity correction is only applied to nodes (including leaves) that do not descend from another candidate.
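For illustration, a minimal Python sketch of the node score \(U_i(t)\); it assumes the descendant p-values and effect signs of a node are already available (the choice of t and the candidate selection logic of treeclimbR are not reproduced here).

```python
import numpy as np

def node_score(p_desc, s_desc, t):
    """U_i(t): absolute mean of sign * 1{p <= t} over the descendants B(i) of node i."""
    p_desc = np.asarray(p_desc, dtype=float)
    s_desc = np.asarray(s_desc, dtype=float)  # signs in {-1, +1}
    return abs(np.mean(s_desc * (p_desc <= t)))

# all descendants significant with the same sign: U close to 1
print(node_score([0.001, 0.01, 0.02], [-1, -1, -1], t=0.05))  # 1.0
# one non-significant descendant with an opposite sign: U drops to 2/3
print(node_score([0.001, 0.60, 0.02], [-1, +1, -1], t=0.05))  # 0.67
```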

3 Models and algorithms

Our correction method assumes that p-values, or rather z-scores, evolve according to an Ornstein-Uhlenbeck process on a tree. We thus use the corresponding correlation structure to decorrelate the z-scores and, in turn, the p-values. This is similar in spirit to the smoothing algorithm of Xiao et al. (2017) but we derive our procedure from first principles and explicit assumptions. We first recall a few properties of Ornstein-Uhlenbeck processes before proceeding to our model and procedure.

3.1 Ornstein-Uhlenbeck process on a tree

An Ornstein-Uhlenbeck (OU) process \((W_t)\) with optimal value (also called drift) \({\beta }_{\text {ou}}\), selection strength (also called mean reversion parameter) \({\alpha }_{\text {ou}}\) and white noise variance \({\sigma }_{\text {ou}}^2\), is a Gaussian process that satisfies the stochastic differential equation:

$$\begin{aligned} \mathrm {d}W_t = - {\alpha }_{\text {ou}} (W_t - {\beta }_{\text {ou}}) \, \mathrm {d}t + {\sigma }_{\text {ou}} \, \mathrm {d}B_t. \end{aligned}$$

The important properties of OU processes are bounded variance and convergence to a stationary distribution centered on the optimal value \({\beta }_{\text {ou}}\), namely \(W_t\xrightarrow []{(d)} \mathcal {N} \left( {\beta }_{\text {ou}}, {\sigma }_{\text {ou}}^2/ 2{\alpha }_{\text {ou}}\right) \) when \(t \rightarrow \infty \). Thanks to those properties, OU processes have become a popular model in various subfields of biology, ranging from the evolution of continuous traits, such as body mass (Freckleton et al. 2003), fitness (Lande 1976) or CpG enrichment in viral sequences (MacLean et al. 2021), to animal movement (Dunn and Gipson 1977) and epidemiology (Nåsell 1999). They naturally emerge as the continuous limit of a broad range of discrete-time evolution models (Lande 1976). Ornstein-Uhlenbeck processes can be readily adapted to tree-like structures as illustrated in Fig. 1.
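For intuition, the sketch below simulates a single OU path using its exact Gaussian transition over a time step (the same conditional distribution that appears in Eq. (1) below for the tree version); the parameter values are arbitrary.

```python
import numpy as np

def simulate_ou(alpha, beta, sigma, w0=0.0, T=10.0, n_steps=1000, seed=0):
    """Simulate an OU path with the exact transition
    W_{t+dt} | W_t ~ N(lam W_t + (1 - lam) beta, sigma^2/(2 alpha) (1 - lam^2)),
    where lam = exp(-alpha dt)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    lam = np.exp(-alpha * dt)
    sd = np.sqrt(sigma**2 / (2 * alpha) * (1 - lam**2))
    w = np.empty(n_steps + 1)
    w[0] = w0
    for i in range(n_steps):
        w[i + 1] = lam * w[i] + (1 - lam) * beta + sd * rng.normal()
    return w

path = simulate_ou(alpha=1.0, beta=2.0, sigma=1.0)
# the path fluctuates around beta = 2 with stationary variance sigma^2 / (2 alpha)
print(path[-5:])
```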

Formally, we consider a rooted ultrametric tree \(\mathcal {T}\) with m leaves and n branches (\(n = 2m - 1\) for binary trees). The internal nodes are labeled \(N_1\) (the root) to \(N_{n-m}\) and the leaves \(T_1\) to \(T_m\). Let i be a node, \(W_i\) the value of the trait at that node and denote pa(i) its unique parent. By convention, we set \(t_{N_1} = 0\) and assume \(W_{N_1} = 0\). The branch leading to i from pa(i) is denoted \(b_i\) and has length \(l_i = t_i -t_{pa(i)}\) where \(t_i\) is the time elapsed between the root and node i. Since the tree is ultrametric, \(t_i = h\) for all \(i \in \{T_1, \dots , T_{m}\}\). For any pair of nodes (i, j), let \(t_{ij}\) be the time elapsed between the root and the most recent common ancestor of i and j and denote \(d_{ij} = t_i + t_j - 2t_{ij}\) the distance in the tree between nodes i and j. The distribution of the trait at node i is given by:

$$\begin{aligned} W_{i} | W_{pa(i)} \sim \mathcal {N} \left( \lambda _i W_{pa(i)} + (1 - \lambda _i) {\beta }_{\text {ou},i}, \frac{{\sigma }_{\text {ou}}^2}{2{\alpha }_{\text {ou}}} (1 - \lambda _i^2) \right) \end{aligned}$$
(1)

where \(\lambda _i = \exp (-{\alpha }_{\text {ou}} l_i)\) and \({\beta }_{\text {ou},i}\) is the optimal value on branch i. Remark that the process mean value does not immediately shift to \({\beta }_{\text {ou},i}\) but lags behind it with a shrinkage parameter controlled by \(1 - \lambda _i\). If \({\beta }_{\text {ou},i} = 0\) for all i, straightforward computations show that \(W = (W_{T_1}, \dots , W_{T_m})\) is a gaussian vector with distribution

$$\begin{aligned} W \sim \mathcal {N}(0, \varSigma ) \quad \text {where} \quad \varSigma _{ij} = \frac{{\sigma }_{\text {ou}}^2}{2{\alpha }_{\text {ou}}} e^{-2{\alpha }_{\text {ou}}d_{ij}} (1 - e^{-2{\alpha }_{\text {ou}}t_{ij}}). \end{aligned}$$

When the optimal value can shift on a branch (e.g. the branch \(b_{N_4}\) leading to \(N_4\) in Fig. 1), the mean vector of W is slightly more complex and depends on both the tree topology and the location and magnitude of the shifts. Denote U the \(m \times n\) incidence matrix of \(\mathcal {T}\) with rows labeled by leaves (\(i \in \{T_1, \dots , T_{m}\}\)) and columns labeled by internal nodes and leaves (\(j \in \{N_1, \dots , N_{n-m}, T_1, \dots , T_{m}\}\)), with entries defined as \(U_{ij} = 1\) if and only if leaf i is in the subtree rooted at node j. Intuitively, column \(U_{.j}\) encodes all leaves descending from node j and row \(U_{i.}\) encodes all ancestors of leaf i. Denote \(\varDelta \) the dimension n column vector with entries defined as \(\varDelta _{i} = {\beta }_{\text {ou},i} - \beta _{\text {ou},pa(i)}\) where \(i \in \{N_1, \dots , N_{n-m}, T_1, \dots , T_{m}\}\). Non-zero entries of \(\varDelta \) correspond to shift locations, i.e. nodes for which the optimal value \({\beta }_{\text {ou},i}\) differs from that of their parent, and their values give the shift magnitudes (see Fig. 2 for an example). Finally let \(\varLambda \) be the \(n \times n\) diagonal matrix with diagonal entries \(\varLambda _{i} = 1 - \exp (-{\alpha }_{\text {ou}}(h - t_{pa(i)}))\) where \(i \in \{N_1, \dots , N_{n-m}, T_1, \dots , T_{m}\}\). Straightforward computations (see Bastide et al. (2017) for detailed derivations) show that W is a gaussian vector with joint distribution:

$$\begin{aligned} W \sim \mathcal {N}(\mu , \varSigma ) \quad \text {where} \quad \mu = U \varLambda \varDelta \quad \text {and} \quad \varSigma _{ij} = \frac{{\sigma }_{\text {ou}}^2}{2{\alpha }_{\text {ou}}} e^{-2{\alpha }_{\text {ou}}d_{ij}} (1 - e^{-2{\alpha }_{\text {ou}}t_{ij}}).\nonumber \\ \end{aligned}$$
(2)
Fig. 1

(A) Phylogenetic tree with 5 leaves and 4 internal nodes (root \(N_1\) included). A shift occurs on the branch leading to \(N_4\). (B) Ornstein-Uhlenbeck process with shifts on the tree defined in the left panel. At each node, the process spawns two independent processes with the same initial value. The shift in the optimal value on the branch leading to \(N_4\) results in a different mean value for \(N_4\) and all its offspring (\(T_1\) and \(T_2\))

Fig. 2

Incidence matrix U, shift vector \(\varDelta \) and mean vector \(\mu \) associated with Fig. 1. \(\varLambda _{N_4} = 1 - e^{-{\alpha }_{\text {ou}}(h - t_{N_3})}\) is the shrinkage parameter from equation (1)

When \(\mathcal {T}\) is known, the matrix \(T = U \varLambda \) is completely specified up to the parameter \({\alpha }_{\text {ou}}\). The shifted Ornstein-Uhlenbeck model, with parameters \({\alpha }_{\text {ou}}\), \({\sigma }_{\text {ou}}^2\) and shift vector \(\varDelta \), has been used (Bastide et al. 2017; Khabbazian et al. 2016) to find adaptive events, modeled as non-zero values in \(\varDelta \), in the evolution of continuous traits of interest (turtle shell size, monkey brain shape, etc.). In this work, we apply the same mathematical framework to the joint distribution of p-values transformed to z-scores.
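To fix ideas, the following numpy sketch assembles \(\mu = U \varLambda \varDelta \) and \(\varSigma \) on a small 3-leaf ultrametric tree. The tree, the parameter values and the root convention (\(t_{pa(N_1)} = 0\), allowing a shift on the root branch) are illustrative choices of ours, not the tree of Fig. 1; the covariance follows Eq. (2) as written.

```python
import numpy as np

alpha, sigma2 = 1.0, 1.0
h = 1.0  # height of the ultrametric tree

# node order: [N1 (root), N2, T1, T2, T3]; N2 (at time 0.5) is the parent of T1 and T2,
# T3 branches directly off the root. t_pa[i] is the start time of the branch leading to i.
t_pa = np.array([0.0, 0.0, 0.5, 0.5, 0.0])

# incidence matrix U (leaves x nodes): U[i, j] = 1 iff leaf i lies in the subtree rooted at j
U = np.array([[1, 1, 1, 0, 0],   # T1
              [1, 1, 0, 1, 0],   # T2
              [1, 0, 0, 0, 1]])  # T3

# shrinkage factors Lambda_i = 1 - exp(-alpha (h - t_pa(i)))
Lam = np.diag(1 - np.exp(-alpha * (h - t_pa)))

# a single shift of -3 on the branch leading to N2 (so T1 and T2 are shifted, T3 is not)
Delta = np.array([0.0, -3.0, 0.0, 0.0, 0.0])
mu = U @ Lam @ Delta
print(mu)

# leaf covariance following Eq. (2), with t_ij the time of the most recent common ancestor
t_mrca = np.array([[1.0, 0.5, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
d = 2 * h - 2 * t_mrca  # patristic distances d_ij = t_i + t_j - 2 t_ij between leaves
Sigma = sigma2 / (2 * alpha) * np.exp(-2 * alpha * d) * (1 - np.exp(-2 * alpha * t_mrca))
print(Sigma)
```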

3.2 Procedure

We show here how to use the previously described Ornstein-Uhlenbeck process to incorporate the tree structure \(\mathcal {T}\) in the correction of the p-values vector \({\mathfrak {p}}\).

Framework. Denoting by \(m_{i}^1\) (resp. \(m_{i}^2\)) the median count (or relative abundance) of taxon i under condition 1 (resp. condition 2), we want to test \(\mathcal {H}_{i0}: m_{i}^1 = m_{i}^2\) against \(\mathcal {H}_{i1}: m_{i}^1 \ne m_{i}^2\) and assume that we have a testing procedure that outputs p-values, e.g. the Wilcoxon-Mann-Whitney test (Mann and Whitney 1947; Wilcoxon 1992). We first convert the p-values to z-scores using the quantile function \(\varPhi ^{-1}\) of the standard gaussian:

$$\begin{aligned} {\mathfrak {z}}= \varPhi ^{-1}({\mathfrak {p}}). \end{aligned}$$

Provided the use of a correct statistical test, we know that \({\mathfrak {p}}_i \sim \mathcal {U}([0, 1])\) under \(\mathcal {H}_{i0}\), so that \({\mathfrak {z}}_i \sim \mathcal {N}(0, 1)\). We also know that \({\mathfrak {p}}_i \preccurlyeq \mathcal {U}([0, 1])\) and thus \({\mathfrak {z}}_i \preccurlyeq \mathcal {N}(0, 1)\) under \(\mathcal {H}_{i1}\). We could also test \(\mathcal {H}_{i0}: m_{i}^1 = m_{i}^2\) against \(\mathcal {H}_{i1}: m_{i}^1 < m_{i}^2\) or \(\mathcal {H}_{i1}: m_{i}^1 > m_{i}^2\); we only require the procedure to output p-values that satisfy the previous distributional assumptions for these \(\mathcal {H}_{i0}\) and \(\mathcal {H}_{i1}\). Note that, even if the test statistic is itself a z-score before being transformed to a p-value, the z-score \({\mathfrak {z}}_i\) may differ from the raw test statistic \(z_i\) because of the intermediate p-value \({\mathfrak {p}}_i\). Indeed, when considering the simple case of testing equality of means in two samples of size n, with gaussian distributions and known variance \(\sigma \), the relation between \({\mathfrak {z}}_i\) and \(z_i = \sqrt{n}({\hat{m}}_i^1 - {\hat{m}}_i^2)/2\sigma \) is given by:

$$\begin{aligned} {\mathfrak {z}}_i = \varPhi ^{-1}({\mathfrak {p}}_i) = {\left\{ \begin{array}{ll} \varPhi ^{-1}(\varPhi (z_i)) = z_i &{} \text {if } \mathcal {H}_{i1}: m_{i}^1 < m_{i}^2 \\ \varPhi ^{-1}(1 - \varPhi (z_i)) = -z_i &{} \text {if } \mathcal {H}_{i1}: m_{i}^1 > m_{i}^2 \\ \varPhi ^{-1}(2\varPhi (- |z_i|)) &{} \text {if } \mathcal {H}_{i1}: m_{i}^1 \ne m_{i}^2 \\ \end{array}\right. } \end{aligned}$$
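A minimal sketch of the p-value to z-score conversion, using scipy's standard normal quantile function for \(\varPhi ^{-1}\):

```python
import numpy as np
from scipy.stats import norm

def p_to_z(p):
    """z = Phi^{-1}(p): under H0, p ~ U(0, 1) so z ~ N(0, 1);
    small p-values map to strongly negative z-scores."""
    return norm.ppf(np.asarray(p, dtype=float))

print(p_to_z([0.5, 0.05, 1e-4]))  # approximately 0.00, -1.64, -3.72
```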

After transformation, the test can thus always be reframed as one-sided on \({\mathfrak {z}}_i\): \(\mathcal {H}_{i0}: E[{\mathfrak {z}}_i] = 0\) against \(\mathcal {H}_{i1}: E[{\mathfrak {z}}_i] < 0\). We make two assumptions regarding the distribution of \({\mathfrak {z}}\).

  1. (A1) Under \(\mathcal {H}_{i1}\), \({\mathfrak {z}}_i \sim \mathcal {N}(\mu _i, 1)\) where \(\mu _i \le 0\);

  2. (A2) \({\mathfrak {z}}\) arises from a shifted Ornstein-Uhlenbeck process on an ultrametric tree \(\mathcal {T}\) with parameters \({\alpha }_{\text {ou}}\), \({\sigma }_{\text {ou}}^2\) and \(\varDelta \).

Assumption (A1) is classical when working with z-scores (McLachlan and Peel 2000): finding the alternative hypotheses is equivalent to finding the negative entries of \(\mu \). Assumption (A2) allows us to specify the joint distribution of \({\mathfrak {z}}\) as:

$$\begin{aligned} {\mathfrak {z}}\sim \mathcal {N}_{{m}} \left( {\mu }, {\varSigma }\right) \end{aligned}$$
(3)

where \(\varSigma \) is fully specified by the parameters \({\sigma }_{\text {ou}}\) and \({\alpha }_{\text {ou}}\). Note that the diagonal coefficients of \(\varSigma \) are all equal to \({\sigma }_{\text {ou}}^2 / 2{\alpha }_{\text {ou}} \, (1 - e^{-2{\alpha }_{\text {ou}}h})\). As they correspond to marginal variances, which equal 1 under (A1), this forces the equality \({\sigma }_{\text {ou}}^2 = 2{\alpha }_{\text {ou}} / (1 - e^{-2{\alpha }_{\text {ou}}h})\) so that \(\varSigma \) depends only on \({\alpha }_{\text {ou}}\), i.e. \(\varSigma = \varSigma ({\alpha }_{\text {ou}})\). Finally, the decomposition \(\mu = T \varDelta \), where T acts as a phylogenetic design matrix, favors configurations where the alternative hypotheses form clades, i.e. groups of leaves obtained by cutting a single branch in the tree.

This framework allows us to use \(\mathcal {T}\) as a prior structure in the mean vector \(\mu \) and variance matrix \(\varSigma \) and to recast the hypothesis testing problem as a regression problem.

3.2.1 Parameter estimation

Estimation of \({\hat{\mu }}\). Assume first that \(\varSigma \), or equivalently \({\alpha }_{\text {ou}}\), is known. Our main goal is to estimate the negative components of \(\mu \).

To leverage the known tree structure, we use the decomposition \(\mu = T\varDelta \) and estimate \(\mu \) by means of \(\varDelta \). Since \(\varDelta \) has dimension n compared to dimension m for \(\mu \), we force \({\hat{\varDelta }}\) to be sparse using a constrained lasso penalty (Tibshirani 1996):

$$\begin{aligned} {\hat{\varDelta }} = \mathop {\mathrm{argmin}}\limits _{\varDelta \in {\mathbb {R}}^{n} \; \text {s.t.} \; T\varDelta \in {\mathbb {R}}^m_-} \frac{1}{2} \left\| {\mathfrak {z}}- T\varDelta \right\| _{\varSigma ^{-1},2}^2 + \lambda \Vert \varDelta \Vert _1. \end{aligned}$$
(4)

where \({\mathbb {R}}_- = \{x \in {\mathbb {R}}\; \text {s.t.} \; x \le 0\}\).

Intuitively, the decomposition together with the \(\ell _1\) penalty works as a nested group lasso penalty for the components of \(\mu \), where the groups correspond to clades of \(\mathcal {T}\), while the constraint \(T\varDelta \in {\mathbb {R}}^m_-\) forces the components of \(\mu \) to be non-positive. For compactness, we define the feasible set \(\mathcal {D}= \{ \varDelta \in {\mathbb {R}}^n \; \text {s.t.} \; T\varDelta \in {\mathbb {R}}_-^m\}\). Finally, we use the Cholesky decomposition \(\varSigma ^{-1} = R^TR\) to simplify the problem into the very well studied optimisation problem:

$$\begin{aligned} {\hat{\varDelta }} = \mathop {\mathrm{argmin}}\limits _{\varDelta \in \mathcal {D}} \frac{1}{2} \left\| y - X\varDelta \right\| _2^2 + \lambda \Vert \varDelta \Vert _1 \end{aligned}$$
(5)

with \(y = R{\mathfrak {z}}\in {\mathbb {R}}^{m}\) and \(X = RT \in {\mathbb {R}}^{m \times n}\). Note that y is a whitened version of \({\mathfrak {z}}\), with independent components and spherical covariance matrix. This is a lasso problem with a convex feasibility constraint on \(\varDelta \). The optimisation algorithm used to solve this problem is detailed in Sect. 4.
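As an illustration, a minimal numpy sketch of the whitening step that turns problem (4) into problem (5); the random \(\varSigma \), T, \({\mathfrak {z}}\) and \(\varDelta \) below are placeholders, used only to check that the two objectives coincide.

```python
import numpy as np

def whiten(z, T, Sigma):
    """With Sigma^{-1} = R^T R (Cholesky), set y = R z and X = R T so that
    ||z - T Delta||^2_{Sigma^{-1}} = ||y - X Delta||^2_2."""
    L = np.linalg.cholesky(np.linalg.inv(Sigma))  # Sigma^{-1} = L L^T
    R = L.T                                       # hence Sigma^{-1} = R^T R
    return R @ z, R @ T

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + np.eye(4)                       # some positive definite matrix
T = rng.normal(size=(4, 7))
z, Delta = rng.normal(size=4), rng.normal(size=7)

y, X = whiten(z, T, Sigma)
lhs = (z - T @ Delta) @ np.linalg.inv(Sigma) @ (z - T @ Delta)
rhs = np.sum((y - X @ Delta) ** 2)
print(np.isclose(lhs, rhs))  # True
```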

Estimation of \({\hat{\varSigma }}\) and tuning of \(\lambda \).

Remember first that \(\varSigma \) is completely determined by \({\alpha }_{\text {ou}}\) because of the link between \({\alpha }_{\text {ou}}\) and \({\sigma }_{\text {ou}}^2\). There is no closed-form expression for the maximum likelihood estimator of \({\alpha }_{\text {ou}}\). We therefore resort to numerical optimisation. To tune the parameter \(\lambda \), we test several values to estimate models with different sparsity levels and select the best one using a modified BIC criterion:

$$\begin{aligned} ({{\hat{\alpha }}}_{\text {ou}}, {\hat{\lambda }}) = \mathop {\mathrm{argmin}}\limits _{\alpha > 0, \lambda \ge 0} \left\| {\mathfrak {z}}- T\varDelta _{\alpha , \lambda }\right\| _{\varSigma ^{-1}(\alpha ),2}^2 + \log |\varSigma (\alpha )| + \Vert \varDelta _{\alpha , \lambda }\Vert _0 \log (\log {m})\log {m}\nonumber \\ \end{aligned}$$
(6)

where \(\varDelta _{\alpha , \lambda }\) is the solution of problem (4) for \(\varSigma (\alpha )\) and \(\lambda \). In practice, \(\alpha \) and \(\lambda \) vary over a two-dimensional grid and we select the values that minimize the objective. We use a modified BIC, where \(\log (\log {m})\log {m}\) replaces \(\log {m}\), to account for the fact that m scales like n as suggested in Fan and Tang (2013).
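For concreteness, a small sketch of the criterion in Eq. (6) for one candidate pair \((\alpha , \lambda )\). It assumes that the constrained-lasso solution \(\varDelta _{\alpha , \lambda }\) and the matrix \(\varSigma (\alpha )\) have already been computed (by the solver of Sect. 4 and Eq. (2) respectively); in practice the criterion is evaluated over the whole \((\alpha , \lambda )\) grid and the minimizer is kept.

```python
import numpy as np

def modified_bic(z, T, Delta_hat, Sigma):
    """Modified BIC of Eq. (6) for one candidate (alpha, lambda):
    Mahalanobis fit + log-determinant of Sigma(alpha)
    + ||Delta_hat||_0 * log(log m) * log m."""
    m = len(z)
    resid = z - T @ Delta_hat
    fit = resid @ np.linalg.solve(Sigma, resid)   # ||z - T Delta||^2_{Sigma^{-1}}
    _, logdet = np.linalg.slogdet(Sigma)
    df = np.count_nonzero(Delta_hat)
    return fit + logdet + df * np.log(np.log(m)) * np.log(m)

# typical use inside the grid search (constrained-lasso solver and Sigma(alpha) assumed available):
#   bic[(a, l)] = modified_bic(z, T, solve_constrained_lasso(z, T, Sigma(a), l), Sigma(a))
```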

3.2.2 Confidence intervals

Lasso procedures are known to produce biased estimators and do not return confidence intervals for the point estimate \({\hat{\mu }}_i\). Instead of simply returning all negative components of \({\hat{\mu }} = T{\hat{\varDelta }}\), we first debias the estimates and construct confidence intervals for the components of \(\varDelta \), and in turn of \({\hat{\mu }}\), using the debiasing procedures of Javanmard and Montanari (2013, 2014) and Zhang and Zhang (2014).

Debiasing. All debiasing procedures assume a model \(Y \sim \mathcal {N}_{{m}} \left( {X\varDelta }, {\sigma ^2 I_m}\right) \) and require both an initial estimator \({\hat{\varDelta }}^{\text {(init)}}\) of \(\varDelta \) and an estimator \({\hat{\sigma }}\) of \(\sigma \). We use the scaled lasso (Sun and Zhang 2012) with the same negativity constraint as in (4):

$$\begin{aligned} \left( {\hat{\varDelta }}^{\text {(init)}}, {\hat{\sigma }}\right) = \mathop {\mathrm{argmin}}\limits _{\varDelta \in \mathcal {D}, \sigma > 0} \frac{\Vert y - X \varDelta \Vert _2^2}{2\sigma m} + \frac{\sigma }{2} + \lambda _{scaled} \Vert \varDelta \Vert _1 . \end{aligned}$$
(7)

Problem (7) can be solved efficiently by iterating between updates of (i) \({\hat{\sigma }}\), using the closed-form expression \({\hat{\sigma }} = \Vert y - X {\hat{\varDelta }}\Vert _2 / \sqrt{m}\), and (ii) \({\hat{\varDelta }}\), by solving the constrained lasso problem (5) with tuning parameter \(\lambda = m {\hat{\sigma }} \lambda _{scaled}\). Debiasing is achieved by the corrected update:

$$\begin{aligned} {\hat{\varDelta }}_j = {\hat{\varDelta }}_j^{\text {(init)}} + \frac{\langle s_j,y-X{\hat{\varDelta }}^{(\text {init})}\rangle }{\langle s_j,x_j\rangle }. \end{aligned}$$
(8)

where the \(s_j\) form a score-system (SS). Intuitively, \(s_j\) should be a relaxed orthogonalization of \(x_j\) against the other column-vectors of X. The \(s_j\) are used to decorrelate the estimators. We use the strategy of Zhang and Zhang (2014) and take the residuals of a lasso regression of \(x_j\) against \(X_{-j}\). We also considered the alternative debiasing strategy of Javanmard and Montanari (2013, 2014), which is based on a pseudo-inverse of \({\hat{\varSigma }} = \frac{X^TX}{m}\). Their debiased estimate is again a simple update of the initial scaled lasso estimator:

$$\begin{aligned} {\hat{\varDelta }} = {\hat{\varDelta }}^{(\text {init})} + \frac{1}{m}SX^T \left( y-X{\hat{\varDelta }}^{(\text {init})}\right) \end{aligned}$$

but the decorrelation matrix S is computed in a so-called colwise inverse approach (CI), by inverting \({\hat{\varSigma }}\) in a columnwise fashion. Column \(s_j\) is the solution of the optimization problem:

$$\begin{aligned} \left\{ \begin{aligned}&s_j = \text {argmin}_{s \in {\mathbb {R}}^{n}} \ s^T{\hat{\varSigma }}s \\&\text {s.t.}\ \Vert {\hat{\varSigma }}s - e_j\Vert _{\infty } \le \gamma . \end{aligned} \right. \end{aligned}$$
(9)

where \(e_j\) is the \(j^\text {th}\) canonical vector and \(\gamma \ge 0\) is a slack hyperparameter. If \(\gamma \) is too small, the problem is not feasible (unless \({\hat{\varSigma }}\) is non-singular). If \(\gamma \) is too large, the unique solution is \(s_j = 0\).

Confidence Interval. Zhang and Zhang (2014) showed that asymptotically \({\hat{\varDelta }} \sim \mathcal {N}\left( \varDelta , V\right) \) with the covariance matrix V defined by

$$\begin{aligned} v_{ij} = {\hat{\sigma }}^2 \frac{\langle s_i,s_j\rangle }{\langle s_i,x_i\rangle \langle s_j,x_j\rangle } . \end{aligned}$$
(10)

Similarly, the columnwise-inverse estimator of Javanmard and Montanari (2013) has asymptotic distribution \(\mathcal {N}\left( \varDelta , V\right) \) with variance matrix \(V = {\hat{\sigma }}^2 S {\hat{\varSigma }} S^T / m\). For both procedures, the bilateral confidence interval at level \(\alpha \) for \({\hat{\varDelta }}_j\) is

$$\begin{aligned} IC_{\alpha }({\hat{\varDelta }}_j) = \left[ {\hat{\varDelta }}_j \pm \varPhi ^{-1}\left( 1-\frac{\alpha }{2}\right) \sqrt{v_{jj}} \right] . \end{aligned}$$

Note that the estimator of the \(i^{\text {th}}\) component of \(\mu \) can be written \({\hat{\mu }}_i = t_{i.}^T{\hat{\varDelta }}\) with \(t_{i.}^T\) the \(i^{\text {th}}\) row of T. Its unilateral confidence interval at level \(\alpha \) is thus given by \(\left[ -\infty , {\hat{\mu }}_i + \sqrt{t_{i.}^T V t_{i.}} \, \varPhi ^{-1}\left( 1-\alpha \right) \right] \). We can thus simply check whether 0 falls in the interval to test \(\mathcal {H}_{i0} : \{\mu _i = 0\}\) versus \(\mathcal {H}_{i1}: \{\mu _i < 0\}\) at level \(\alpha \) or compute the p-value of the one-sided test as:

$$\begin{aligned} {\mathfrak {p}}^\text {ss}_i = \varPhi \left( \frac{t_{i.}^T{\hat{\varDelta }}}{\left( t_{i.}^TVt_{i.}\right) ^{1/2}}\right) . \end{aligned}$$
(11)
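The sketch below illustrates the score-system variant (Eqs. (8), (10) and (11)). The score vectors are taken as lasso residuals of \(x_j\) on the remaining columns, here computed with scikit-learn and an arbitrary penalty; the initial estimate and \({\hat{\sigma }}\) are assumed to come from the constrained scaled lasso (7), and the sign constraint itself is not handled here.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

def score_system(X, lam=0.1):
    """Score vectors s_j: residuals of a lasso regression of x_j on the other columns of X."""
    m, n = X.shape
    S = np.empty((m, n))
    for j in range(n):
        X_minus_j = np.delete(X, j, axis=1)
        fit = Lasso(alpha=lam, fit_intercept=False).fit(X_minus_j, X[:, j])
        S[:, j] = X[:, j] - fit.predict(X_minus_j)
    return S

def debias(X, y, delta_init, sigma_hat, S):
    """Debiased estimate (Eq. 8) and its asymptotic covariance (Eq. 10).
    Requires <s_j, x_j> != 0 for every column j."""
    resid = y - X @ delta_init
    num = S.T @ resid                    # <s_j, y - X delta_init>
    den = np.sum(S * X, axis=0)          # <s_j, x_j>
    delta_deb = delta_init + num / den
    V = sigma_hat**2 * (S.T @ S) / np.outer(den, den)
    return delta_deb, V

def one_sided_pvalues(T_mat, delta_deb, V):
    """p_i = Phi( t_i' delta / sqrt(t_i' V t_i) ) for mu = T delta (Eq. 11)."""
    mu_hat = T_mat @ delta_deb
    se = np.sqrt(np.einsum("ij,jk,ik->i", T_mat, V, T_mat))
    return norm.cdf(mu_hat / se)
```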

3.2.3 FDR control

The debiasing procedure achieves marginally consistent interval estimation of the shifts \(\varDelta \) but additional care is required to control the FDR when testing all components of \(\mu \) simultaneously. We use the procedure proposed in Javanmard et al. (2019), which is specific to debiased lasso estimators, and relies on the t-scores \({\mathfrak {t}}_i = \frac{t_{i.}^T{\hat{\varDelta }}}{\left( t_{i.}^TVt_{i.}\right) ^{1/2}}\). Briefly, for FDR control at a given level \(\alpha \), let \(t_{\text {max}} = \sqrt{2 \log m - 2 \log \log m}\) and set:

$$\begin{aligned} t^{\star } = \inf \left\{ 0 \le t \le t_{\max } : \frac{2m(1 - \varPhi (t))}{R(t) \vee 1} \le \alpha \right\} \end{aligned}$$

where \(R(t) = \sum _{i = 1}^m 1_{\{{\mathfrak {t}}_i \le -t\}}\) is the total number of rejections at threshold t, or \(t^{\star } = \sqrt{2 \log m}\) if the previous set is empty. Applying the procedure of Javanmard et al. (2019) strictly would replace 2m with m in the numerator, as we consider one-sided tests instead of two-sided ones for \(\mu _i\). However, numerical analysis showed that the extra factor of 2 led to better control of the FDR and we thus kept it. Hypothesis \(\mathcal {H}_{i0}\) is rejected if \({\mathfrak {t}}_i \le -t^{\star }\) or, in terms of q-values, if

$$\begin{aligned} {\mathfrak {q}}^{\text {ss}}_i := \frac{{\mathfrak {p}}^{\text {ss}}_i \alpha }{\varPhi (-t^{\star })} \le \alpha . \end{aligned}$$
(12)

Since \(t^{\star }\) itself depends on \(\alpha \), the corrected p-values depend on \(\alpha \), unlike in the standard BH procedure, where they only depend on the order statistics.
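A minimal sketch of this thresholding rule, given the vector of t-scores; a grid search over \([0, t_{\max }]\) stands in for the infimum, and the simulated t-scores are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

def fdr_threshold(t_scores, alpha=0.05):
    """Smallest t in [0, t_max] with 2 m (1 - Phi(t)) / max(R(t), 1) <= alpha,
    where R(t) = #{i : t_i <= -t}; falls back to sqrt(2 log m) otherwise."""
    t_scores = np.asarray(t_scores)
    m = len(t_scores)
    t_max = np.sqrt(2 * np.log(m) - 2 * np.log(np.log(m)))
    for t in np.linspace(0.0, t_max, 10_000):
        R = np.sum(t_scores <= -t)
        if 2 * m * (1 - norm.cdf(t)) / max(R, 1) <= alpha:
            return t
    return np.sqrt(2 * np.log(m))

rng = np.random.default_rng(0)
t_scores = np.concatenate([rng.normal(-5, 1, 30), rng.normal(0, 1, 70)])  # 30 signals, 70 nulls
t_star = fdr_threshold(t_scores)
print(t_star, np.sum(t_scores <= -t_star))  # threshold and number of rejections
```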

3.2.4 Algorithm

Algorithm 1 summarises our procedure. We call it zazou for "z-scores az Ornstein-Uhlenbeck".


4 Sign-constrained lasso

Our inference procedure is based on very standard estimates but requires solving the following constrained lasso problem:

$$\begin{aligned} {\hat{\varDelta }} = \mathop {\mathrm{argmin}}\limits _{\varDelta \; \text {s.t.} \; T\varDelta \in {\mathbb {R}}_-^m} \frac{1}{2} \left\| y - X\varDelta \right\| _2^2 + \lambda \Vert \varDelta \Vert _1 . \end{aligned}$$

Here y is an arbitrary vector and X and T are arbitrary matrices. This is a convex problem as both the objective function and the feasibility set are convex. We therefore adapt the shooting algorithm (Fu 1998), an iterative algorithm used to solve the standard lasso by looping over coordinates and solving simpler one-dimensional problems, to our constrained problem.

Let \(X_{-j}\) (resp. \(\varDelta _{-j}\)) be the matrix X (resp. vector \(\varDelta \)) deprived of its \(j^\text {th}\) column (resp. \(j^\text {th}\) coordinate). We can isolate \(\varDelta _j\) in (5) and decompose the objective as \(\Vert y - X\varDelta \Vert ^2_2 + \lambda \Vert \varDelta \Vert _1 = \Vert y - z_j - x_j \varDelta _j \Vert ^2_2 + \lambda |\varDelta _j| + \lambda \Vert \varDelta _{-j}\Vert _1\) where \(z_j = X_{-j}\varDelta _{-j} \in {\mathbb {R}}^{m}\). We can likewise decompose \(T\varDelta = u_j + v_j\varDelta _j\) where \(u_j = T_{-j}\varDelta _{-j}\in {\mathbb {R}}^{m}\) and \(v_j = t_j\). When updating \(\varDelta _j\), we can thus consider the simpler univariate problem in \(\theta \):

$$\begin{aligned} {\left\{ \begin{array}{ll} &{} \mathop {\mathrm{argmin}}\limits _{\theta \in {\mathbb {R}}} h(\theta ) = \frac{1}{2} \Vert y - z - x\theta \Vert ^2_2 + \lambda |\theta | \\ &{}\text {s.t.}\ u + v\theta \le 0. \end{array}\right. } \end{aligned}$$
(13)

Let \(I_+ = \{i: v_i > 0\}\) and \(I_- = \{i: v_i < 0\}\) and denote \(\theta _{\max } = \min _{I_{+}} \{ {-u_i}/{v_i} \}\) and \(\theta _{\min } = \max _{I_{-}} \{ {-u_i}/{v_i} \}\) with the usual conventions that \(\max (\emptyset ) = -\infty \) and \(\min (\emptyset ) = +\infty \). Problem (13) is feasible only if (i) \(\theta _{\min } \le \theta _{\max }\) and (ii) for all i, \(v_i = 0 \Rightarrow u_i \le 0\), in which case the feasible region is \([\theta _{\min }, \theta _{\max } ]\). Computing the subgradient \(\partial h(\theta )\) of h and looking for values \(\theta \) such that \(0 \in \partial h(\theta )\) leads to the usual shrunken estimates:

$$\begin{aligned} {\left\{ \begin{array}{ll} \frac{(y-z)^Tx+\lambda }{x^Tx} &{} \text {if } (y-z)^Tx< -\lambda , \\ \frac{(y-z)^Tx-\lambda }{x^Tx} &{} \text {if } (y-z)^Tx > \lambda , \\ 0 &{} \text {if } | (y-z)^Tx | < \lambda . \end{array}\right. } \end{aligned}$$

By convexity of h, the solution of (13) can be found by projecting the previous unconstrained minimum to the feasibility set. If problem (13) is feasible, its solution is thus given by

$$\begin{aligned} \theta ^{\star } = {\left\{ \begin{array}{ll} P_\mathcal {I} \left( \frac{(y-z)^Tx+\lambda }{x^Tx}\right) &{} \text {if } (y-z)^Tx< -\lambda , \\ P_\mathcal {I}\left( \frac{(y-z)^Tx-\lambda }{x^Tx}\right) &{} \text {if } (y-z)^Tx > \lambda , \\ P_\mathcal {I} (0) &{} \text {if } | (y-z)^Tx | < \lambda , \end{array}\right. } \end{aligned}$$

where \(P_\mathcal {I} : u \mapsto \max (\theta _{\min }, \min (u, \theta _{\max }))\) is the projection of u on the segment \(\mathcal {I} = [\theta _{\min }, \theta _{\max } ]\).
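Putting the pieces together, a minimal Python sketch of this coordinate update: soft-threshold the unconstrained minimizer and project it onto \([\theta _{\min }, \theta _{\max }]\). This is the inner step that the constrained shooting algorithm loops over; the small vectors in the usage example are arbitrary.

```python
import numpy as np

def update_coordinate(y, z, x, u, v, lam):
    """Solve the univariate problem (13): soft-thresholding followed by
    projection onto the feasible interval [theta_min, theta_max]."""
    pos, neg = v > 0, v < 0
    theta_max = np.min(-u[pos] / v[pos]) if pos.any() else np.inf
    theta_min = np.max(-u[neg] / v[neg]) if neg.any() else -np.inf
    if theta_min > theta_max or np.any((v == 0) & (u > 0)):
        raise ValueError("problem (13) is infeasible")

    # unconstrained minimizer: soft-thresholding of (y - z)'x
    r, xx = (y - z) @ x, x @ x
    if r < -lam:
        theta = (r + lam) / xx
    elif r > lam:
        theta = (r - lam) / xx
    else:
        theta = 0.0

    # projection onto [theta_min, theta_max]
    return min(max(theta, theta_min), theta_max)

y, z, x = np.array([1.0, -2.0]), np.zeros(2), np.array([1.0, 1.0])
u, v = np.array([-1.0, -1.0]), np.array([1.0, 0.0])
print(update_coordinate(y, z, x, u, v, lam=0.5))  # -0.25
```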

5 Synthetic data

5.1 Metagenomics

Metagenomics data are made up of three components. The first component is the count or abundance matrix \(X = (x_{ij})\), with \(1 \le i \le m\) and \(1 \le j \le p\), which represents the quantity of taxon i in sample j. The second component is a set of sample covariates, such as disease status, environmental conditions, group, etc. The final component is a phylogenetic tree which captures the shared evolutionary history of all taxa. When performing DAT, we are interested in taxa whose abundance is significantly associated with a covariate.

Most DAT procedures proceed with univariate tests (one test per species) followed by a correction procedure. In the synthetic datasets, we consider discrete covariates only. Dozens of full-fledged testing pipelines are published each year, including some designed with omics data in mind. Since our goal in this study is to compare correction procedures rather than full testing procedures, we use Wilcoxon or Kruskal-Wallis tests, which are classical and widespread non-parametric tests in metagenomics.

5.2 Simulations

Simulation scheme. We use the following simulation scheme:

  1. start with a homogeneous dataset,

  2. assign each sample to group A or B at random,

  3. select differentially abundant taxa in a phylogenetically consistent manner,

  4. apply a fold-change to the observed abundance of differentially abundant taxa in group B.

This non-parametric simulation scheme was previously used in Bichat et al. (2020). We considered two variants for step 3, respectively called positive and negative. In the negative variant, differentially abundant taxa were selected randomly across the tree, so that the phylogeny is not informative. In the positive variant, taxa were instead selected in a phylogenetically consistent manner. Formally, the phylogeny was first used to compute the cophenetic (Sneath et al. 1973) distance matrix between taxa. A partitioning around medoids algorithm was then used to create clusters of related species. One or more clusters were then picked at random and all species in those clusters were selected as differentially abundant.

For each fold-change (\(\text {fc} \in \{3, 5, 10\}\)), 500 simulated datasets were created, with a proportion of differentially abundant species ranging from 3 % to 35 %. For each simulation, we corrected p-values using no correction (Raw), the BH procedure (BH), the BY procedure (BY), StructFDR (TF) or our procedure with either score system (SS) or colwise inverse debiasing (CI), targeting in all instances a 5% FDR level. We compared the 6 procedures in terms of True Positive Rate (TPR), nominal FDR and AUC (Area Under the Curve).

Positive simulations.

The results of positive simulations (i.e. where the phylogeny is informative) are shown in Fig. 3. All correction methods controlled the FDR at the target rate or below when the fold change was larger than 5. For smaller fold changes, both the SS and CI variations of zazou exhibited nominal FDR slightly above the target level (up to 9% in the worst case). In all settings, BY had the lowest TPR, whereas TF was comparable to vanilla BH, in line with the results of Bichat et al. (2020). Finally, zazou (both SS and CI variations) had the best overall TPR, with the largest gains observed in the lowest fold-change setting.

Fig. 3

Boxplots and average (red point) TPR and FDR across positive simulation settings. Each facet corresponds to a different fold-change (fc) and each boxplot is computed over 500 simulation replicates. All corrections control the FDR at the target level or slightly above but zazou (SS and CI) achieve higher TPR, especially for small fold changes

The higher than intended FDR of the zazou methods suggests that the problem of finding an adequate threshold for \({\mathfrak {p}}_i^{ss}\) is not completely solved by the procedure of Javanmard et al. (2019). To assess the performance of zazou in a threshold-independent manner, we also compared the AUC of all procedures. Fig. 4 shows that zazou (both variants) has higher AUC than all other methods. As reported previously, TF and BH are at the same level and BY has the lowest ROC curve. A focus on the left-hand side of the curves shows that zazou is more efficient starting from the first discoveries.

Fig. 4

AUC boxplots (top) and average ROC curves (bottom) across positive simulations settings. Facets correspond to fold-changes (fc). ROC curves are computed for each simulation and linearly interpolated over a fixed grid before being averaged. Each boxplot and each curve are computed over 500 replicates. In all settings, SS/CI have the highest AUC / ROC curve, followed by BH/TF while BY has the lowest values

Negative simulations. The negative simulations are designed to assess the robustness of our algorithm with respect to uninformative phylogenies, or equivalently misspecified hierarchies. Fig. 5 shows that, as expected, standard BH outperforms competing methods (in terms of AUC) when the tree is misspecified. Forcing an inadequate tree structure results in AUC losses ranging from 15 to 20 percentage points compared to no structure. The puzzling lack of AUC loss for the TF procedure is explained by an implementation trick: TF always performs BH correction in parallel to its hierarchical procedure and falls back to BH when the hierarchical procedure detects far fewer species than BH (Bichat et al. 2020; Xiao et al. 2017).

Fig. 5

AUC boxplots (computed over 500 replicates) in negative simulations. BH outperforms SS and CI, highlighting the cost of imposing a misspecified hierarchical structure

6 Application

We use our zazou procedure on a gut microbiota dataset from the Fiji Islands (Brito et al. 2016; Pasolli et al. 2017) to identify species that are differentially abundant between adults and children. The dataset consists of the abundances of \(p=387\) species among \(n = 146\) islanders, split into 112 adults and 34 children.

To mimic the simulation study, we used Wilcoxon tests for the univariate tests. Without correction, 21 species were detected as differentially abundant at the 5% level. None of them remained significant after correction by BH, BY, TreeFDR or treeclimbR. By contrast, zazou detected differentially abundant species with both debiasing methods: 17 for SS and 6 for CI.

Fig. 6 shows that they are not a strict subset of the 21 detected with no correction. Smoothing salvages some species that are closely related to one of the 21 without being significant on their own (red box in the figure). It also illustrates some numerical problems associated with colwise-inverse debiasing, which is highly sensitive to the choice of the slack hyperparameter \(\gamma \). The window of relevant values for \(\gamma \) is narrow, and too large or too small values of \(\gamma \) respectively lead to no correction or to a faulty p-value correction.

Fig. 6

Phylogeny of the 387 species from the Fiji dataset with associated z-scores (inner circle), evidence (middle circle) and detection status (outer circle) under different correction procedures. Species detected by zazou are generally close-by on the tree and often, but not always, detected by raw p-values. The red strip highlights the smoothing property of the procedure in a subtree where individual species are not detected when using independent univariate tests but are detected when accounting for the hierarchical structure

7 Conclusion

In this work, we introduced zazou, a new method for correcting p-values in a hierarchical context. zazou is based on recasting the testing problem as a regression problem, under the framework of stochastic processes on an ultrametric tree, and using the tree topology for regularization.

It outperforms competing methods, hierarchical (TreeFDR, treeclimbR) or not (BH, BY), in terms of AUC, but this does not translate immediately to superior results in terms of FDR and TPR. The threshold for rejecting hypotheses turned out to be quite difficult to calibrate while controlling the FDR and warrants further work.

There are several other parts of the procedure that are not as powerful as expected. First, the BIC step used to select \(\lambda \), and in turn the number of shifts, tends to choose models with very few shifts, and sometimes even none. In such instances, the relevance of the debiasing step is limited. Second, the correction procedure proposed by Javanmard et al. (2019) is too conservative for our purpose. It was indeed developed to control both the FDR and the directional FDR (i.e. the proportion of Type S errors, where the effect sizes have the wrong sign, among the discoveries) whereas we only need to control the former. For both these steps, specific developments taking into account the sign constraint on \({\hat{\mu }}\) and the structure of the topology matrix of tree \(\mathcal {T}\) could lead to better performance for zazou.