False discovery rate envelopes

False discovery rate (FDR) is a common way to control the number of false discoveries in multiple testing. There are a number of approaches available for controlling FDR. However, for functional test statistics, which are discretized into m highly correlated hypotheses, the methods must account for changes in distribution across the functional domain and correlation structure. Further, it is of great practical importance to visualize the test statistic together with its rejection or acceptance region. Therefore, the aim of this paper is to find, based on resampling principles, a graphical envelope that controls FDR and detects the outcomes of all individual hypotheses by a simple rule: the hypothesis is rejected if and only if the empirical test statistic is outside of the envelope. Such an envelope offers a straightforward interpretation of the test results, similarly as the recently developed global envelope testing which controls the family-wise error rate. Two different adaptive single threshold procedures are developed to fulfill this aim. Their performance is studied in an extensive simulation study. The new methods are illustrated by three real data examples.


Motivation and overview
Nowadays, functional test statistics appear in many applications, for example, in spatial statistics, functional regression, and neuroimaging.A wealth of literature can be found for multiple comparison testing, in particular for micro-arrays applications.However, new challenges arise with functional applications: (i) the functional test statistics are highly correlated over the functional domain, (ii) the distribution of test statistic can change across the domain and even (iii) the correlation of the test statistic can change across the functional domain.Further, the distribution of the functional test statistic is rarely known; therefore, nonparametric methods are often required.With the functional domain, also the visualization of the test statistic, together with its rejection or acceptance region, is of great practical importance.
Recently, we developed a global envelope test (Myllymäki et al, 2017) which provides a global envelope, i.e., the acceptance region for testing with a functional test statistic, under the control of family-wise error rate (FWER).This method relies on resampling to obtain the distribution of the test statistic under the null hypothesis in the presence of the high correlation of the test statistics across the functional domain.Using rank based ordering of functional statistics as the basis, the method handles the changes of the distribution and correlation across the functional domain.Further, the acceptance region is visualized in the space of the functional test statistics: the parts of the functional domain where the observed test statistics lies outside of the envelope show the reasons for the rejection of the tested global null hypothesis.This direct visualization helps the user to interpret the results in more details, especially because resampling allows to use of any test statistics and not just the conventional standardized test statistics.Consequently, this method allows for a very general application while controlling FWER.Our aim in this paper is to define a procedure that does the same job while controlling the false discovery rate (FDR).To the best of our knowledge, none of the available FDR methods provide such direct visualization with the test result.
Indeed, in functional tests, it is often essential not only to know if the global test is significant but also to estimate the whole domain which is responsible for the rejection, i.e., the local inference.A very popular and powerful control for local inference is the FDR.An attempt to introduce the FDR control to the infinite-dimensional case was made by Olsen et al (2021) as an extension of the FDR procedure of Benjamini and Hochberg (1995).However, this procedure is a step-up procedure, whereby it has no direct visualization.Further, Benjamini and Hochberg (1995) method was overcome by newer FDR procedures.
For the purpose of estimating all alternative hypotheses, the FDR control was defined first in Benjamini and Hochberg (1995).More specifically, the difference between FDR and FWER can be seen from Table 1 of outcomes when testing m hypotheses.The FWER is defined to be P(V ≥ 1), whereas FDR is defined to be E(Q), where Q = V ρ whenever ρ > 0, and Q = 0 if ρ = 0. Clearly, FWER provides more strict control than FDR.Only in the case of all hypothesis being true null (i.e.m 0 = m), the FDR control implies FWER control (Benjamini and Hochberg, 1995).
Since Benjamini and Hochberg (1995) many theoretical achievements have been made for FDR.Benjamini and Hochberg (1995) proved that their linear step-up procedure controls the FDR for independent test statistics.The subsequent article of Benjamini and Yekutieli (2001) established the control of FDR for test statistics with more general dependence structures, such as positive regression dependence.Benjamini and Hochberg (2000) and Benjamini et al (2006) defined adaptive step-up procedures, which include the estimation of the number of true null hypotheses.Storey (2002) defined another adaptive one-step procedure for controlling FDR.Since our aim is to define a very general FDR procedure, we stick to the resampling-based methods.The first resampling based procedure was defined in Yekutieli and Benjamini (1999), where the FDR conditional on S = s (see Table 1) was estimated.This was criticized by Storey and Tibshirani (2001), and they proposed a resampling method for estimating FDR based on the estimation of the number of true null hypotheses for a single threshold rejection rule.Ge et al (2003) then rephrased this procedure.Dudoit et al (2008) defined a Bayes posterior resampling FDR control under the assumption of a mixture model, which assumes that the distributions of all true null statistics are equal and the distributions of all false null statistics are equal.Romano et al (2008) constructed a bootstrap procedure to estimate critical values in a step-down FDR procedure; this method requires bootstrapping the whole data set, which makes its applicability limited.Finally, Ge et al (2008) proposed a resampling procedure based on the step-down procedure of Westfall and Young (1993), which requires the subset pivotality and relies on the assumption that the joint distribution of statistics from the true nulls is independent of the joint distribution of statistics from the false nulls.Since our goal is direct visualization and step-down procedures cannot be used for this purpose, the methods of Yekutieli and Benjamini (1999) and Storey and Tibshirani (2001) can be adapted to our aim.
The following software can be used for the computation of FDR: The R/Bioconductor package multtest (Gilbert et al, 2009) contains the procedures of Benjamini et al (2006) and Dudoit et al (2008).The package fdrtool (Strimmer, 2008) offers the unified approach for the computation of FDR by a semiparametric method allowing for parametric modeling of tests interdependencies.The FDR has further been modeled by many parametric approaches, see Strimmer (2008) and the references therein.The R/Bioconductor package qvalue (Storey et al, 2021) determines the q-value of Storey (2002) for given pointwise p-values.The q-value gives the scientist a multiple hypothesis testing measure for each observed statistic with respect to FDR.This corresponds to the multiple testing adjusted p-value in the FWER sense.Implementation of the FDR envelopes proposed in this work is provided in the development version of the R library GET (Myllymäki and Mrkvička, 2020) available at https://github.com/myllym/GET.

A comparison of FWER and FDR envelopes: a motivational example
To illustrate the merits of the FDR global envelopes, we explore linear trends in annual water temperature curves sampled at the water level of Rimov reservoir in the Czech republic every day from 1979 to 2014 (Figure 1, top).Viewing the observations as daily samples of functional data, we fit the model where y i (k) is the water temperature at day k, k = 1, . . ., 365, in year i, i = 1979, . . ., 2014, β 0 (k) and β 1 (k) are model parameters, and e(k) denotes the error.To test the null hypothesis we use the estimator of the regression coefficient, i.e. β1 (k), as our test statistic following the methodology proposed in Mrkvička et al (2021).Then 7000 resamples of the test statistic were obtained under the null hypothesis using simple permutation of raw functional data.We utilized the R packages GET (Myllymäki and Mrkvička, 2020) and pppvalue (Xu and Reiss, 2020b,a) to compute the three different tests explained below.
Figure 1 (middle) shows the output of the global extreme rank length (ERL) envelope test (Myllymäki et al, 2017;Narisetty and Nair, 2016;Mrkvička et al, 2020) on testing the global null hypothesis (2) under the FWER control.The global ERL test shows the behavior of β1 (k) estimated from the data in comparison to the 95% global envelope constructed from the permutations and detects a significant increase in temperature for 24 days around the 120th day of the year (21 days).So it directly identifies the reason for the rejection (increase), and further, it expresses the evolution of the variability of the test statistic under the null hypothesis via the global envelope.It also specifies the amount of the increase over the threshold given by the global envelope.However, the global envelope is not designed for local inference.Therefore, the step-down method of Westfall and Young (1993) can be more powerful, but only if many permutations are used.The step-down p-values (Xu and Reiss, 2020a) are shown in Figure 1 (bottom).Using 7000 permutations, we could obtain significant increases (or decreases) for 17 days.Repeating the experiment for 2000 permutations, the results of the FWER (ERL) and FDR envelopes produced rather the same results, but none of the FWER adjusted step-down p-values were below 0.05.Thus, the step-down procedure requires plenty of permutations, and it provides only visualization of p-values.
Both the global ERL test as well as the stepdown method control FWER.The aim of this paper is to employ the FDR control instead, retain the direct visualization (single-step methods) and not require plenty of permutations.For example, in applications in spatial statistics or neuroimaging, the simulations of the null model can be rather complex computationally or in terms of memory.Figure 2 shows an FDR envelope proposed in this work (IATSE algorithm presented in Section 4.1).The test controlling FDR rejected null hypotheses in springtime, around the 120th day of the year as the FWER envelope test, but additionally also in summertime around the 180th day of the year.The null hypothesis was rejected for 35 days in total under FDR control.

Outline of the work
Because the data are assumed to be highly dependent, and the aim is to use the envelope for any functional test statistic, nonparametric resampling methods will be used.We also assume a general model for alternative hypotheses, since, e.g., in functional permutation tests, the differences between groups of functions will range from zero (null hypothesis) to small (alternative hypothesis) and large ones (alternative hypothesis).Since our aim is to develop an FDR envelope that allows for direct visualization, we will concentrate on a single threshold multiple testing procedure, such as developed by Storey (2002) or Yekutieli and Benjamini (1999).
We will propose two new two-stage procedures, with the basis in the algorithm of Storey (2002) to estimate FDR for a given rejection region.Essential new features of our algorithms are the following: 1) an estimate of the probability that a hypothesis is a true null which works well for highly dependent hypotheses, and 2) defining the rejection regions, for which the FDR is estimated, using a new minmax envelope.As such we will obtain an envelope which (a) allows for direct visualization, (b) is directly able to identify all the false hypotheses under the control of FDR, (c) is not sensitive to the change of the distribution of the test statistic across the functional domain, and (d) is not sensitive to the change of the correlation structure across the functional domain.
The setup for multiple hypothesis testing and the proposed rejection (and acceptance) regions are described in Section 2 and 3. Section 4 proposes two algorithms for the FDR envelopes together with q-value definition for them.An extension of this method for the case where nuisance parameters have to be estimated is introduced in Appendix A. Section 5 contains a thorough simulation study comparing the performance of proposed methods with several FDR alternatives.The advantage of direct visualization is emphasized by a data example in Section 6.An example from spatial statistics with a two-dimensional test statistic studying local spatial correlation is presented in Section 7. Conclusions are given in Section 8.

Multiple hypothesis testing
Assume the functional hypothesis testing problem, with hypotheses H x , x ∈ R d , which is discretized into m hypotheses.Assume now that all H x are simple hypotheses.The discretization can be arbitrary; equidistances are not required.Thus we assume that we are testing m hypotheses H k , k = 1, . . ., m, simultaneously.Denote H k = 0 if the k-th hypothesis is a true null hypothesis and H k = 1 if the k-th hypothesis is a false null hypothesis.The sets M 0 = {k : H k = 0} and M 1 = {k : H k = 1} are unknown and the aim is to estimate them.Let us denote by m 0 =| M 0 | and m 1 =| M 1 | the numbers of true and false hypotheses.The number of rejected hypotheses ρ is observed together with the number of not rejected hypotheses W .On the other hand, the numbers of rejected and not rejected true null hypotheses V and U and the numbers of rejected and not rejected false null hypotheses S and T are not observed (see Table 1).
Every single test k is tested by the statistic T k .The vector of the statistics, T = (T 1 , . . ., T m ), can have an arbitrary distribution.The marginal distributions of T k , k = 1, . . ., m, can be interdependent and the distribution of T k can change across k.Also the interdependence can vary across k.A usual approach is to consider the parametric statistics as t, F, . .., but since such a parametric statistic can change its marginal distribution across k, we will switch to rank statistics of these parametric statistics.Obviously, when the statistic changes its marginal distribution, the simultaneous tests will be blind to alternatives that occur for k with smaller variability of test statistic T; the rank statistics overcome this problem.
When R denotes the rejection region in a single test, P(T ∈ R | H = 0) is called the type I error rate and is controlled by a predetermined level α.On the other hand, P(T / ∈ R | H = 1) is called the type II error rate and is not controlled.
In multiple testing, the problem of controlling the type I error is more complex, and it is generalized in several ways.The most common is FWER, where the probability of at least one type I error is controlled, i.e., FWER= P(V ≥ 1).But also, per comparison error rate can be defined as PCER= E(V )/m or per family error rate PFER= E(V ).In functional setting, Pini and Vantini (2017) recently introduced the interval-wise error rate in order to control the type I error for each subinterval of the functional domain.All the above-mentioned controls are designated to control the existence of rejection of a true null hypothesis, whereas the famous FDR was introduced in order to determine M 0 and M 1 .The natural way to control FDR would be by E(V /ρ), but the case of division by zero has to be treated.Therefore, the first definition of FDR (Benjamini and Hochberg, 1995) was FDR= E( V ρ | ρ > 0)P(ρ > 0).Then Storey (2002) defined positive false discovery rate by pFDR= E( V ρ | ρ > 0), which controls error rate when positive findings occurred.
The rejection region for the multiple hypothesis test can be given by a step-up procedure, subsequently rejecting the hypothesis from the smallest p-values, as proposed by Benjamini and Hochberg (1995).Such rejection region is estimated to fulfill the FDR control.If an opposite approach is taken, it is possible to estimate FDR or pFDR for a general fixed rejection region Γ = (R 1 , . . ., R m ).Storey (2002) proposed to use a single value threshold γ as a natural rejection region for the continuous statistics T k .While we would like to use rank statistics, this approach brings many ties and has to be refined.For example, assume that the null hypothesis is resampled s times.Then the single test statistic can achieve only ranks from 1 to s.Consequently, the single threshold rejection region can also obtain only these values, and the corresponding estimated FDR values can be rather sparse, leading to only approximate control.For this purpose, we will define new acceptance regions based on the global rank envelope (Myllymäki et al, 2017).
We note that the problem of ties in p-values for non-resampling algorithms was also discussed recently in the literature (Chen et al, 2018;Chen, 2020).They assumed that the test statistic comes from a given discrete distribution.

Continuous min-max envelope
In our earlier works, summarized in Myllymäki and Mrkvička (2020), we have defined several global envelopes, which are appropriate in the case of the FWER control, where it is desired to construct the envelopes together with the data and simulated functions in order to get the exact Monte Carlo test in the Barnard sense (Barnard, 1963).Here we introduce a new min-max envelope which is a continuous version of the rank envelope proposed in Myllymäki et al ( 2017), and use it as the basis of our FDR envelopes.We will estimate the FDR for a given rejection region Γ which corresponds to the acceptance region determined by the envelope E. We introduce a new, continuously enlarging envelope that is controlled by a single parameter, similarly as it is in the classical constant rejection rule (Storey, 2002).However, our continuous min-max envelope reflects the shape of the distribution of the test function under the complete null hypothesis.
Assume now, that we have s resamples of the complete null hypothesis, i.e. m 0 = m, which gives us s test vectors T i = (T i1 , . . ., T im ), i = 1, . . ., s.Let us define the lower and upper bounds of the envelope E γ for the two-sided alternative hypotheses.For γ being an integer, γ = 1, 2, 3, . . ., s/2 , the bounds are defined by , where min γ and max γ denote the γ-th smallest and largest values, respectively.For a non-integer γ > 1, we define the bounds of E γ as the average of neighboring integer γs: The definition for the one-sided alternatives can be obtained accordingly by omitting the lower or upper part.

FDR estimation
This section describes algorithms for FDR estimation from s resamples of the complete null hypothesis.We adopt the Storey (2002) approach of estimation FDR, rather than the Benjamini and Hochberg (1995) approach of estimation of rejection region.This allows us to estimate FDR for a general global envelope E, which represents the acceptance region corresponding to the rejection region Γ.The advantage of our proposed algorithms is that they make no assumptions on the correlation structure of the test statistics and also not on the false null hypotheses.The only assumption here is that the true null hypotheses are simple.
For a fixed Γ, we can observe only ρ(Γ), the number of positives, together with the corresponding numbers of acceptances, W (Γ). The number of false positives V (Γ) is not observable, but it can be estimated.Storey and Tibshirani (2001) showed that , (3) where ρ 0 (Γ) is the number of positives under complete null hypothesis.In order to estimate the probability that a hypothesis is true null, π 0 , Storey and Tibshirani (2001) proposed to choose a smaller rejection region Γ and compute When working with highly correlated data, this estimator of π 0 has a significant drawback: the histogram of p-values observed under the null hypothesis is not uniform, which is assumed by the method (Storey and Tibshirani, 2003).This feature causes the estimator to fail for any choice of Γ , even for the smoothing method defined in Storey and Tibshirani (2003).Also, Xie et al (2005) claimed that estimation of π 0 is a challenging problem.In non-resampling approaches, the choice of Γ attracted many researchers in order to define so-called adaptive FDR procedures based on the Storey estimator (4) (e.g.Liang and Nettleton, 2012;Heesen and Janssen, 2016).However, since these procedures are still based on the basic Storey estimator, we searched for another estimator of π 0 , which would not require the uniformity of pvalues.Hwang (2011) compared several estimators from which the lowest slope method developed by Benjamini and Hochberg (2000) and the mean of differences method (Hsueh et al, 2003) seem to be the best in the dependent case, but Hwang (2011) did not include in the comparison the twostage approach of Benjamini et al (2006).The two-stage method was compared to the method of Benjamini and Hochberg (2000) in Benjamini et al (2006), and it was found to be significantly better.Also, this two-stage approach satisfies the conservative estimation of π 0 under the independence of all hypotheses.For these reasons, we utilize the π 0 estimation of this two-stage approach of Benjamini et al (2006), which is defined in the following algorithm, in one of our proposed algorithms, and we also use the algorithm as the reference method in the simulation study.In our second algorithm, we use another estimator for π 0 , and we prove its strong conservativeness without further assumptions.
Adaptive two-stage procedure of Benjamini et al (2006)  In the first step of the algorithm, the numbers of rejections are estimated assuming that π 0 = 1.In the second step, an estimate for π 0 is found, and the third step finds the rejections utilizing the estimate of π 0 .

Two-stage resampling procedures for FDR envelopes
We can now, in the two-stage procedure, replace the linear step-up procedure with the procedure of Storey and Tibshirani (2001) applied to our rejection regions of Section 3. The resulting adaptive two-stage resampling-based procedure (ATSE) defined in the following algorithm has a global envelope representation corresponding to the rejection region Γ γ .
As the estimate of Eρ 0 (Γ γ ) in the steps 1. and 3. of the algorithm we use where Γ γ is the rejection region corresponding to E γ as defined in Section 3 and γ is the pointwise Adaptive two-stage envelope (ATSE): 1. Find the largest rejection region Γ γ * for which Let r be the number of rejected hypotheses for Γ γ * .If r = 0, do not reject any hypothesis, take Γ γ = Γ γ * and stop; if r = m reject all m hypotheses, take Γ γ = Γ γ * and stop; otherwise continue.2. Let π0 = (m − r)/m.3. Find the largest rejection region Γ γ for which Note that there is equality in Equation ( 5) for an integer γ.A formula similar to (5) was used in Storey (2002) and Storey and Tibshirani (2003) for a given threshold p.
Instead of using the idea of ATS algorithm to estimate π 0 , it is also possible to use the following Iterative adaptive two-stage envelope (IATSE) algorithm.
The expectations in steps 1. and 3. of the IATSE algorithm are calculated according to Equation (5) similarly as in the ATSE algorithm.Note here that the ATSE method controls the estimate of π 0 by assuming a greater significance level in the first step, in the same way as the ATS algorithm of Benjamini et al (2006).On the other hand, the IATSE method adds the expected number of false null hypotheses to the estimate of π 0 .
The IATSE algorithm is motivated by the following conservative, iterative estimate of m0 : m1 0 = m − r + mγ * and ml 0 = m − r + ml−1 0 γ * .The limit of this iterative procedure is according to the sum of geometrical sequence equal to (m − r)/(1 − γ * ).
Theorem 1 The estimate (7) of the IATSE algorithm is a conservative estimate of π 0 .
Iterative adaptive two-stage envelope (IATSE): 1. Find the largest rejection region Γ γ * for which Let r be the number of rejected hypotheses for Γ γ * .If r = 0, do not reject any hypothesis, take Γ γ = Γ γ * and stop; otherwise continue.2. Let γ * be the pointwise p-value corresponding to the Γ γ * defined according to the Formula (6) and 3. Find the largest rejection region Γ γ for which ≥ π 0 .Denote now β k the probability of the error of the second kind for any k-th hypothesis which is false.
Since the strong control of our whole procedures can be studied only under independence assumptions of test statistics (similarly as in Liang and Nettleton, 2012), which is not realistic in functional setting, we resort on checking the control by an extensive simulation study.Note here that the strong control of the ATS algorithm was also proven for independent hypotheses only (Benjamini et al, 2006).
Appendix A contains an algorithm for FDR estimation in case the null hypothesis is not simple, but some nuisance parameters are estimated, which is an extension of the proposed algorithms.Appendix B contains further algorithms and their modifications that incorporate the dependence of the test statistics in the FDR estimation and which can be used for envelope construction.These algorithms were used in the simulation study for comparison with the proposed algorithm, but they did not perform as well as the algorithms proposed in this section.

q-values
The q-value related to the hypothesis H k , k = 1, . . ., m, is the size of the test still rejecting H k while controlling FDR.In the case of FDR envelopes, this size corresponds to the γ value that defines the largest possible envelope (i.e., smallest possible rejection region) such that the hypothesis H k is rejected.Formally, Yekutieli and Benjamini (1999) noted that it is advantageous to use adjusted p-values in multiple hypothesis testing because then the level of the test does not have to be determined in advance.The same can be said about q-values in the context of FDR control.

Simulation study
A simulation study was performed to investigate the FDR control and power of different FDR methods listed in Table 2.We include some methods with parametric and nonparametric pointwise pvalues to stress differences between them.We show some methods with known π 0 in order to check the problem of π 0 estimation by comparing it to the method with unknown π 0 .The ATS method is included as the reference method, the best one we have found from the literature.ST, YB, YBu, IATSE-YB, IATSE-YBu, ATSE, and IATSE are included in the comparison as candidates for the best FDR envelope.The methods YB, YBu, IATSE-YB, IATSE-YBu, which incorporate the dependence of the test statistics, are described in Appendix B.
The FDR methods were investigated for detecting the difference between two groups of functions, where each group had ten observations.The functions in the groups were generated from Gaussian, Student's t 1 or bimodal random processes on [0, 1] with means µ 1 (k) and µ 2 (k) for the two groups, respectively.We considered 11 different mean function combinations for the two groups.As a null case, we considered the 'no differences' case where , where i = 1, 2 and e(k), k ∈ [0, 1], are the Gaussian errors with the standard deviation σ Z and the exponential covariance function with the scale parameter φ Z = 0 (independent case), 0.1 or 0.5.In the case of Student's t 1 distribution, the error was obtained from two independent standard Gaussian errors with the same covariance as e(k).In the bimodal case, the error was obtained by transforming e(k) as e b (k) = sign(e(k)) | 3 4 e(k) | 1/5 .Different values were considered for σ Z as specified in Table 3.The choices were different for different mean functions in order to reach the powers of the methods in a reasonable range.
We first compared all the methods of Table 2 with the resolution m = 200, such that k obtained m equidistant values on [0, 1].We did the comparison for different numbers of resamples s = 100, 500, 1000, 2000, 5000, for all 11 models explained in Table 3 with the Gaussian, t 1 and bimodal errors and with the three strengths of correlation determined by φ Z = 0, 0.1, 0.5, in total for 3 • 3 • 11 • 5 = 495 cases.To test the hypothesis of no difference in the means of the two groups, we used the functional F -statistic (the vector of F -values calculated for each k).The chosen F -statistics allows to compute also parametric p-values and compare the performance of parametric and nonparametric FDR methods (see Table 2).Note that the nonparametric methods are not limited to the F -statistic, but other statistics, which can be more easily interpreted, can be used (see Section 6).
Each case was repeated 2000 times, and for each repetition, we calculated the FDR, i.e., the number of falsely rejected hypotheses divided by the total number of rejections.Also, the proportion of rejected hypotheses from the known true positives as a measure of power was calculated.The mean FDR and power as well as their standard errors, σ FDR and σ power , were then calculated from the 2000 repetitions.In the following, we summarize the relevant findings from the simulation study.The rest of the results are given in Appendix C.
Figure 3 shows the mean FDR (±2 • σ FDR ) for Models 1 (complete null model), 4, 5 and 11 with Gaussian errors.First, the results show that ST and STp methods are highly liberal for correlated cases.This liberality is caused by the π 0 estimation: the ST methods with known π 0 are not liberal.Some repetition for ST and STp even fell down due to the non-uniform distribution of pointwise p-values.Namely, tens of repetitions fell down for correlation 0.1, and hundreds fell down for correlation 0.5; these cases were excluded from the results.The results also show the liberality of all YB methods for the complete null model (Model 1) with correlated errors.The liberality of YB and YBu methods is then suppressed for Model 5 because the methods do not consider the estimation of for k < 0.66, and 0 otherwise.Different values were considered for the parameters of the mean functions (Parameters) leading to different proportions of the true nulls (m 0 /m).The final row gives the standard deviation of the Gaussian error, σ Z , used for the Gaussian and bimodal errors.The power of non-liberal methods is shown for Models 2, 4, and 11 in Figure 4.Here we can observe, especially for Model 2, that the parametric ATS achieves the highest power, which is only reached by nonparametric methods by applying at least 5000 permutations.The IATSE method seems to be the most powerful from nonparametric methods, which is more apparent for small numbers of permutations s.The number of resamples is not a negligible feature of the test since, for example, in spatial statistics, the resampling of a test statistic can be costly in terms of time.
The advantage of the parametric method is clearly lost when the errors are not Gaussian since the parametric p-values assume the Gaussianity.Figure 5 shows that if the assumptions of parametric p-values are not fulfilled, the parametric method can achieve either strong liberality or strong conservativeness.
The above experiment suggests that the IATSE method is a universal method for FDR control.Namely, it does not require knowing the test statistic distribution, it was conservative in all studied cases, conservativeness of its π 0 estimation was proven, and it allows for graphical interpretation using the envelopes.Further, it tends to have a little bit tighter control of FDR than ATS and ATSE, as it can be seen from Figure 6 where the FDR control is studied with respect to increasing resolution m for the Gaussian error case.This figure also shows that the increased resolution has no influence on the FDR control of these three methods when the number of resamples is large enough to distinguish the extreme pointwise p-values.

Population growth example
Figure 7 (left) shows population growth, defined here as the population at the end of the year divided by the population at the beginning of the year, in years from 1960 to 2014 in Africa, Asia, Europe, and North America, and Latin America, total in 112 countries with more than one million inhabitants in 1950 (Nagy et al, 2017;Dai et al, 2020).Figure 7 (right) shows the GDP of every country in the study discounted to the 1960 USD.The discounting was performed according to USD inflation.The data was obtained from the World bank.The missing values of the GDP of a country were extrapolated using the closest known ratio of the GDP of the country and the median GDP in that year.The missing values of GDP were interpolated using linear interpolation of the two closest ratios.
We explore the functional general linear model (GLM) for the population growth across years 1960-2014 with categorical covariate (continent), continuous covariate (GDP), and interactions (GDP with respect to the continent).For this purpose, we applied three tests, and on each of them, we applied the proposed FDR control (IATSE) to obtain all years which are significant for the studied covariate.We follow the graphical functional GLM testing procedure introduced for FWER control in Mrkvička et al (2021).First, to test the effect of the continent, we assume the main effects model population growth ∼ Continent + GDP, and we used the test statistic = 0.The advantage of this test statistic with respect to the F statistic is that we directly test which continents differ from the mean.The permutations were obtained by the standard Freedman and Lane (1983) procedure, i.e. permuting residuals under the null model (population growth ∼ GDP).The graphical output of the IATSE algorithm is shown in Figure 8.Here, for better interpretation, instead of showing β Cont only, we show the fitted curve for every continent, i.e. the intercept β 0 i and mean effect of GDP, β GDP i •mean(GDP i ), are added to every β Cont j,i .The output also shows the 95% FDR envelope together with the red points that identify the years and continents, which are significantly different from the fitted mean curve under FDR control.The output shows higher population growth in Africa from 1965 onwards, in Asia for years up to 1974, and in Latin America for years up to 1965.Further, it shows lower population growth for Europe and North America.One could assume that this would be caused by different GDPs in the continents, but  2 for Model 1 and 2 with Gaussian error, heavy tailed error (t) and bimodal error for s = 5000 and m = 200.
GDP is treated here as a nuisance factor.Also, the second test shows only a little importance of the main effect of GDP.Second, to test the effect of the GDP, we assumed the same main effects model as above and we used simply the test statistic The graphical output of the IATSE algorithm is shown in Figure 9.The FDR envelope shows, in this case, rather decreasing variability of the test statistics, which makes no problem to the procedure itself, but it is worth noting for the interpretation.
The main effect of GDP is significant only between the years 2004 and 2009.Third, to test the effect of the interactions, we assume the factorial model population growth ∼ Continent * GDP, ), (10) where β I 1,i , . . .β I 4,i are interaction parameters of the univariate GLM model for year i, computed with condition 4 j=1 β I j,i = 0.The permutations of residuals from the null model (population growth ∼ Continent + GDP) was used.The graphical output of the IATSE algorithm is shown in Figure 10.It provides rich interpretability of the interaction component.The population growth of countries in Africa is significantly positively influenced by GDP up to 1974, whereas it is significantly less influenced than the whole world from 1998 onwards.The population growth of Asian countries is more positively influenced by GDP than the whole world from 1996 onwards.The same holds for Europe and North America.On the other hand, the population growth in Latin American countries is negatively influenced by GDP up to 1974.
We also applied ATS to the test vectors ( 8), ( 9) and ( 10).ATS rejected one year less than the FDR envelope in total, and it gave no information about the reasons (direction) of rejection and variability of the test statistic under the null.Definitely, the results of FDR envelope presented in Figures 8, 9 and 10 are much richer and allow for interpreting the results in much deeper detail than the results provided by ATS or other non-graphical FDR algorithm which gives only q-values or the rejections as an output.For comparison, if the F -statistic is used instead of the test vector ( 8) and ( 10), we obtain even less information from the test.This F -test of the main effect of the continent just rejected the hypotheses for all years, and this test of the interaction effect rejected the hypotheses for the years 1960-1982, 1988, and 1995-2014.
The envelopes of Figures 9 and 10 are rather fluctuating due to fluctuations in the GDB values 7. To reduce the fluctuation of these envelopes the standardised beta coefficients can be used instead of beta coefficients in Formulas 9 and 10.Since the standardisation has no effect on the significance of these envelopes, we prefer the simpler versions.

Local spatial correlation
In spatial statistics, it is a relevant question to ask in what regions two random fields are correlated.This question was studied nonparametrically in Viladomat et al (2014), where the pointwise local p-values were obtained.Viladomat et al (2014) did not perform any adjustment for multiple testing.The aim of this section is to show the possibility of FDR control in a 2D example with the use of our proposed FDR envelopes.
Assume that we observe two random fields Φ and Ψ in the observation window W ⊂ R 2 .Let X be the set of sampling locations where the values of Φ are observed.X can be either random or non-random, depending on the design of the experiment.We denote by Φ(X) the vector of values of Φ observed at sampling points X, and similarly for Ψ(X).The functional test statistic describing the local correlation at a location u ∈ W can be calculated as where w b (s) is a kernel with bandwidth b (we have used a Gaussian kernel with bandwidth equal to its standard deviation), and .
The asymptotic behavior of rb is unknown; therefore, resampling under the null model of independence has to be performed.Viladomat et al (2014) proposed to obtain resamples Ψ * by smoothing the permuted values Ψ(x) in order to obtain a smoothed empirical variogram that matches its counterpart for Ψ.Mrkvička et al ( 2021) proposed the random shift method with a variance correction for the test statistic instead.They showed that random shifts are more powerful in assessing the global correlation.However, for assessing the local correlation, the number of points used to calculate rb (u) would have to be used in the mentioned variance correction.Since for local correlation, this issue would have to be investigated, we use the Viladomat et al (2014) proposal instead.
For the illustration we use the covariates observed in the tropical forest plot of area 1000 m × 500 m in Barro Colorado Island (Hubbell et al, 2005;Condit, 1998;Hubbell et al, 1999).This data set is accompanied with many covariates but, for the purpose of illustration, we study only the dependence of the first two covariates, i.e., Aluminium (Al) and Boronium (B) observed on a 50 × 25 grid with equally spaced observations (see Figure 11, top).The global correlation (a single Pearson's correlation coefficient) between the two covariates is approximately −0.41, which is significant according to the Viladomat et al (2014) global test of independence (p ≈ 0.031).We then applied the described local independence test (with b = 100) to find all regions with significant local correlations.The results in Figure 11 (bottom) show that there are significant positive local correlations in the middle of the image, whereas on the left and right side of the image, there are significant negative local correlations.

Discussion
The aim of this work was to find a good method for determining the FDR envelope.For this purpose, we defined two alternatives, namely ATSE and IATSE.These algorithms are based on the idea of Storey and Tibshirani (2001), but the estimation methods of π 0 introduced in Storey (2002) and Storey and Tibshirani (2003) were found to be unsatisfactory for highly dependent test statistics which arise in functional data analysis.Therefore, the proposed algorithms are supplied with different π 0 estimates.Based on our study, these two methods perform as well as the ATS algorithm, which we have found to be the best from the literature, both in terms of being close to the nominal FDR level and the power.The simulation study also suggests that the IATSE algorithm is a little bit more powerful than the ATS algorithm, especially for lower numbers of replicates.It can be seen as an advantage when the simulation of the replicates is time-consuming as it can be, e.g., in  spatial statistics.We also investigated resampling methods capturing the dependence between tests; unfortunately, the only method applicable in the required wide generality turned out to be rather liberal if the method was made to be adaptive, i.e., when the estimate of π 0 was considered in the algorithm.
Since our proposed methods construct the rejection region from resamples and their pointwise ranks, these methods are robust with respect to changes in the distribution of the test statistic along its domain.This is not the case for the methods using the parametric p-values, which require the homogeneity of the test statistic across the domain.The same can be said for the homogeneity of the correlation structure of the test statistic along its domain.
The population growth example (Section 6) shows the interpretation advantages of FDR envelopes with respect to pure q-values, which can be obtained from other FDR methods without graphical interpretation.The local independence test (Section 7) shows the possibility of applying the FDR envelope in the 2D case in spatial statistics.The greatest advantages of our proposed methods are the graphical interpretation, its generality, and its easy application via the R package GET (Myllymäki and Mrkvička, 2020).The generality of our methods includes that the methods can be used with any test statistic without knowing its distribution, the distribution of the test statistic can vary along its domain, and the autocorrelation of the test statistic can also vary.
The method of Storey and Tibshirani (2003) was promoted in the context of permutation tests in genetics studies, where the dependence between the test statistics is not as strong as in functional data analysis.For this method, Jiao and Zhang (2008) and Xie et al (2005) proposed to estimate Eρ 0 (Γ) only from those hypotheses which are true null for the resampling approach.Obviously, it is impossible to know the true null hypotheses, and therefore they used only hypotheses which are not rejected for the data.The reason to avoid false null hypotheses was the increase of variance for false null with respect to true null.Since our rejection region, Γ γ , is not a constant number, as it is usually assumed, but it is constructed from resampling of the whole test vector under the complete null hypothesis, Γ γ is adapted to this increased variability (see, e.g., Figure 9).This means that the whole problem of the Storey and Tibshirani (2001) procedure discussed in Jiao and Zhang ( 2008) and Xie et al ( 2005) is overcome by our construction of Γ γ .
Our proposed methods are defined under the assumption of simple null hypothesis testing.Additionally, we introduced an algorithm based on these methods and the quadratic Monte Carlo scheme, which makes the FDR envelopes valid also for composite null hypotheses (see Appendix A).The performance of this algorithm with respect to neglecting the compositeness of the null hypothesis will be studied in the future.
and Center for Tropical Forest Science.The authors thank Mikko Kuronen (Luke) for his remarks on an earlier version of the manuscript.

Appendix A Composite null hypothesis
When the null hypothesis is not simple, but some nuisance parameters are estimated, the estimate (5) of the main document will be biased, and thus the estimate of FDR will also be biased.In such case, an adjusted estimate of Eρ 0 (Γ γ ) can be performed in a similar manner as for Monte Carlo methods controlling FWER (Dao and Genton, 2014;Baddeley et al, 2017).
Assume that the null hypothesis has a parameter θ, which needs to be estimated in order to obtain simulations from the null hypothesis, e.g., for a goodness-of-fit test with a functional test statistic.To estimate Eρ 0 (Γ γ ) a theoretical density of θ estimator can be used: For p( θ), an asymptotic distribution of θ or, more preferably, a two-stage Monte Carlo scheme (Baddeley et al, 2017) can be used.
Thus, this algorithm requires s simulations for constructing Γ γ and s 1 • s 2 simulations for computing the expectation Eρ 0 (Γ γ ).Since the quadratic Monte Carlo method is used to approximate the expectation, the numbers s 1 and s 2 can be small, and the whole algorithm will not be too computational.

Appendix B Incorporating
dependence of the test statistics in the FDR estimation Yekutieli and Benjamini (1999) provided a resampling method under the assumption of subset pivotality and independence of null and alternative hypotheses in order to handle the dependence between test statistics.Schwartzman and Lin (2011) showed that correlation may increase the bias and variance of the FDR estimators substantially in comparison to the independent case.In order to solve this issue, some authors modeled the correlation, e.g., Sun and Cai (2009) where Here ρ(p) is the number of rejections with the threshold p, i.e. | {k : p k ≤ p} |, ρ 0 (p) is the number of rejections with the threshold p of resampled data, i.e. | {k : p * k ≤ p} |.The resampled data are obtained here under the complete null hypothesis, similarly as our envelopes E. Finally, V (p) is the number of false rejections with the threshold p.
Since we do not assume independence of null and alternative hypotheses, these estimators are theoretically not valid.Indeed, resampling approximates the marginal distribution of V (p), but not the conditional distribution of V (p) given ρ(p) = r.We anyway modified these algorithms to our setting with rejection region Γ γ and replaced steps 1. and 3. of the ATSE and IATSE algorithms with it.This defines four new adaptive FDR estimators, which should handle the dependence between the test statistics.We do not provide a rigorous definition of these algorithms since they are found to be liberal in the simulation study.Note here only that we performed two independent sets of resamples here, first for obtaining Γ γ and second for estimating Q * (Γ γ ) and Q * u (Γ γ ).

Appendix C Results of the simulation study
Here we complete the results of the simulation study of Section 5 of the main document.For the details of the simulation study setup, we refer to Section 5 of the main document.In particular, Table 2 there explains the methods that were compared in the simulation study (shown in all plots on the y-axis), and Table 3 explains the considered models.Figure C1 visualises the mean functions of Table 3.In all figures below, s denotes the number of resamples, and φ Z is the correlation parameter of the error term.
Figures C2, C3 and C4 present the mean FDR for the case of Gaussian error for the models ((2, 3), (6, 7) and (8, 9, 10), respectively) that were not shown in the main document.Figures C5, C6 and C7 present the mean power for the case of Gaussian error for the models ((3, 11), (6, 7), and (8, 9, 10), respectively) that were not shown in the main document.In these and the following figures, the results are only shown for the four methods (ATS, ATSp, ATSE, and IATSE) that had reasonable FDR levels.Finally, Figures C8, C9 and C10 show the comparison of the performance of the methods with respect to the different error structures for the rest of the models not shown in the main document.Hwang

Figure 1 Figure 2
Figure 1 Annual water temperature curves sampled at the water level of Rimov reservoir in Czech republic every day from 1979 to 2014 (top), the output of the global ERL envelope test (p < 0.001) for testing the effect of the year on the temperatures (middle), and the step-down p-values of Westfall and Young (1993) (bottom).In the middle figure, the grey area represents the 95% global ERL envelope.The red dots show the days where the data function (solid line) exceeds the envelope, and the dashed line is the mean of simulated functions.At the bottom, dots are p-values (red if ≤ 0.05) and the dashed line represents the level α = 0.05.
(ATS): 1. Use the linear step-up procedure at level α * = α/(1 + α).Let r be the number of rejected hypotheses.If r = 0 do not reject any hypothesis and stop; if r = m reject all m hypotheses and stop; otherwise continue.2. Let π0 = (m − r)/m.3. Use the linear step-up procedure with α = α/π 0 .The linear step-up procedure uses m p-values {p 1 , . . ., p m } corresponding to the m hypotheses and the ordered p-values p [1] ≤ p [2] ≤ • • • ≤ p [m] .It rejects k hypotheses corresponding to the first k smallest p-values, where k = max{i : p [i] ≤ iα/m}.If such k does not exists, it rejects no hypothesis.

Figure 3
Figure3The mean FDR (±2 • σ FDR given by bars around dots, often invisible due to being so small) of methods of Table2for Model 1, 4, 5 and 11 with Gaussian error and m = 200 for different numbers of resamples s and correlation parameter φ Z .

Figure 4 Figure 5
Figure4The mean power (±2 • σ Power given by bars around dots) of four methods of Table2for Model 2, 4 and 11 with Gaussian error and m = 200.

Figure 6 Figure 7
Figure6The mean FDR (±2 • σ FDR given by bars around dots) of four methods of Table2for Model 1 with Gaussian error, s = 5000 and increasing resolution m.

Figure 9
Figure9Test for the main effect of GDP (in 1960 USD), given the continent as a nuisance factor.The IATSE algorithm was used for FDR control with 5000 permutations.
for the interaction coeff.

Figure 10
Figure10Test for the interaction effect of continent and GDP (in 1960 USD) using the IATSE algorithm for FDR control with 5000 permutations.The null model includes the main effects.
Figure 11 Top: Values of Aluminium and Boronium in the tropical forest plot of area 1000 m × 500 m.Bottom: The test of local correlations between Aluminium and Boronium under FDR control with the IATSE algorithm: observed local correlations, the lower and upper boundary of the FDR envelope, and regions (red) where the local correlation is significantly below or above the FDR envelope.

Figure C2
Figure C2 The mean FDR (±2 • σ FDR given by bars around dots) of the different methods for Model 2 and 3 with Gaussian error and m = 200.

Figure C3
Figure C3 The mean FDR (±2 • σ FDR given by bars around dots) of the different methods for Model 6 and 7 with Gaussian error and m = 200.

Figure C4 Figure C5 Figure C6 Figure C7 Figure C8 Figure C9
Figure C4 The mean FDR (±2 • σ FDR given by bars around dots) of the different methods for Model 8, 9 and 10 with Gaussian error and m = 200.

Figure C10
Figure C10The mean FDR (±2 • σ FDR given by bars around dots) of the four methods for Model 8, 9 and 10 with Gaussian error, heavy tailed error (t) and bimodal error for s = 5000 and m = 200.

Table 1
Possible outcomes from m hypothesis tests

Table 2
Multiple testing methods with FDR control investigated in the simulation study , where the first group had zero mean, while the mean in the second group deviated from zero in different ways and on different sub-intervals of [0, 1].Finally, a case of 'Bumps' was considered where both mean functions are positive on [0, 0.66], and 0 otherwise.The mean functions slightly differ on [0, 0.66].More precisely, the mean functions are specified in Table3and visualized in FigureC1of Appendix C. The Gaussian process was constructed as Z i