# Approximations for weighted Kolmogorov–Smirnov distributions via boundary crossing probabilities

## Abstract

A statistical application to Gene Set Enrichment Analysis requires calculating the distribution of the maximum of a certain Gaussian process, which is a modification of the standard Brownian bridge. Using the transformation into a boundary crossing problem for the Brownian motion and a piecewise linear boundary, it is proved that the desired distribution can be approximated by an *n*-dimensional Gaussian integral. Fast approximations are defined and validated by Monte Carlo simulation. The performance of the method for the genomics application is discussed.

## Keywords

Boundary crossing · *P* value approximation · Gene set enrichment analysis

## Mathematics Subject Classification

62G10 60J65

## 1 Introduction

*g* being continuous on the interval [0, 1] and such that \(g(0)=g(1)=0\).

*B*. A centered Gaussian process *X* with covariance function (2) can be represented on \((\varOmega ,\mathcal {F},\{\mathcal {F}_{t}\},\mathbb {P})\) as follows:

*g* is of special relevance to the genomics application:

1. Reducing the problem to a nonlinear boundary crossing problem (BCP) for the Brownian motion *W*. This is the classical approach to extrema of modified Brownian bridges: see Durbin (1971), Krumbholz (1976), and Bischoff et al. (2003); but analytic results for nonlinear boundaries are scarce (Kahale 2008).
2. Replacing the nonlinear boundary by a piecewise linear approximation. This has been used in many papers, including Novikov et al. (1999), Pötzelberger and Wang (2001), Hashorva (2005), and Borovkov and Novikov (2005).

*p*-values are both requested.

The paper is organized as follows. Section 2 contains the theoretical results. The reduction to a nonlinear BCP and the bounding inequalities are stated as Lemmas 1 and 2. Our main result, Theorem 1, gives an explicit bound on the approximation error. An exact computing algorithm for a piecewise linear boundary is described by Proposition 1. Explicit expressions are given for the one-node case (Propositions 2 and 3). Section 3 addresses practical issues. Two fast approximation schemes are proposed and compared to Monte Carlo simulations. Section 4 describes the statistical application which motivated the present study. Propositions 4 and 5 show that computing *p*-values for Gene Set Enrichment Analysis amounts to computing \(p_g(x)\) for some function *g* depending on the genes to be tested. An example of application to real genomic data is given.

## 2 Theoretical results

## Lemma 1

*G* the function defined on \((0,+\infty )\) by:

*S*(*x*, *y*, *G*) the kernel:

## Proof

*W*. \(\square \)

*g* and *G* is one-to-one. For \(0\leqslant t<1\):

*S*(*x*, *y*, *G*) is monotonic in *G*: for given *x* and *y*, raising the boundary can only decrease the crossing probability. This translates into the following inequalities.

## Lemma 2

## Proof

Once a lower bound and an upper bound are available, the natural question is how to control the approximation error in terms of a suitable distance between boundaries. This is the object of the following theorem.

## Theorem 1

## Proof

Piecewise linear boundaries will now be considered.

## Definition 1

*n-node piecewise linear boundary* the function \(G_{n,\mathbf {s},\mathbf {b}}\), defined on \([0,+\infty )\) by:

*G* is to define \(b_i=G(s_i)\), for \(i=0,\ldots ,n\). If *G* is concave, then \(G_{n,\mathbf {s},\mathbf {b}}(s)\leqslant G(s)\), for all *s*; this is the case for \(G_a\) defined by (11). Assuming moreover that *G* has a continuous second derivative such that

*G* is square-integrable, it follows from Theorem 1 that the approximation is numerically consistent. Indeed, taking for instance \(s_i-s_{i-1}=\log (n)/n\), one gets that \(\varDelta (G,G_{n,\mathbf {s},\mathbf {b}})\) tends to 0 as *n* tends to infinity. The interest of piecewise linear boundaries for our problem lies in the following result.

## Proposition 1

## Proof

1. Given the values of \((W_{s_i})_{i=1,\ldots ,n}\), the processes \(\{W_s-W_{s_{i-1}}\,,\;s_{i-1}\leqslant s \leqslant s_i\}\) for \(i=1,\ldots ,n\), and \(\{W_s-W_{s_{n}}\,,\;s_{n}\leqslant s\}\), are conditionally independent; the conditional distribution of \(\{W_s-W_{s_{i-1}}\,,\;s_{i-1}\leqslant s \leqslant s_i\}\) is that of a Brownian bridge; and the conditional distribution of \(\{W_s-W_{s_{n}}\,,\;s_{n}\leqslant s\}\) is that of a Brownian motion.
2. For \(\alpha ,\beta \in \mathbb {R}\),

$$\begin{aligned} \mathbb {P}\left[ \sup _{0\leqslant s<\infty }\{W_{s}-\alpha -\beta s\}>0\right] = \mathrm {e}^{-2\alpha ^+\beta ^+}\;. \end{aligned}\tag{19}$$

Proposition 1 expresses \(S(x,y,G_{n,\mathbf {s},\mathbf {b}})\) as an expectation with respect to the joint distribution of the Gaussian vector \((W_{s_i})_{i=1,\ldots ,n}\). Using the independent increment property, it is easy to rewrite it as an integral with respect to the *n*-dimensional standard Gaussian density. Denote by \(g_n\) the transform of \(G_{n,\mathbf {s},\mathbf {b}}\) through (10). From Lemma 1, \(p_{g_n}(x)\) has an explicit expression in terms of the \((n+1)\)-dimensional standard Gaussian density. In view of Theorem 1, it can be considered that the problem is solved, at least in theory: an arbitrarily close approximation of \(p_g(x)\) by an *n*-dimensional Gaussian integral can be computed. This is not quite so in practice, because of the computational cost of Gaussian integrals: see Gentz and Bretz (2009) as a general reference. It is therefore of interest to obtain expressions as explicit as possible, in order to reduce computing costs. Two results deduced from Proposition 1 for one-node piecewise linear boundaries follow.
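To illustrate the idea of an expectation over the Gaussian vector of node values, the following sketch estimates the probability that a Brownian motion crosses a piecewise linear boundary on a finite interval. It is not the statement of Proposition 1: the finite horizon, the function names and the test boundary are assumptions made for this example. It relies on the classical fact that, given the values at both ends of a segment, a Brownian bridge crosses a straight line with probability \(\exp (-2(b_{i-1}-W_{s_{i-1}})^+(b_i-W_{s_i})^+/(s_i-s_{i-1}))\), so only the node values need to be sampled.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Standard Gaussian cdf, written with math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def crossing_prob_line(a, slope, horizon):
    """Closed form for P[W_t >= a + slope * t for some t in [0, horizon]], a > 0."""
    root = math.sqrt(horizon)
    return (1.0 - std_normal_cdf((a + slope * horizon) / root)
            + math.exp(-2.0 * a * slope) * std_normal_cdf((slope * horizon - a) / root))

def crossing_prob_piecewise_linear(s, b, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the crossing probability on [0, s[-1]].

    s: nodes with s[0] = 0; b: boundary values at the nodes (hypothetical inputs).
    """
    s = np.asarray(s, dtype=float)
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    ds = np.diff(s)
    # Independent Gaussian increments give the values of W at the nodes.
    incr = rng.standard_normal((n_samples, ds.size)) * np.sqrt(ds)
    w = np.concatenate([np.zeros((n_samples, 1)), np.cumsum(incr, axis=1)], axis=1)
    gap = np.maximum(b - w, 0.0)  # (b_i - W_{s_i})^+
    # Conditional crossing probability of each bridge over its linear segment.
    cross = np.exp(-2.0 * gap[:, :-1] * gap[:, 1:] / ds)
    # Segments are conditionally independent given the node values.
    return 1.0 - np.prod(1.0 - cross, axis=1).mean()

exact = crossing_prob_line(1.0, 0.5, 2.0)
one_segment = crossing_prob_piecewise_linear([0.0, 2.0], [1.0, 2.0])
two_segments = crossing_prob_piecewise_linear([0.0, 1.0, 2.0], [1.0, 1.5, 2.0])
print(exact, one_segment, two_segments)
```

Both boundaries in the example describe the same straight line, so the two estimates should agree with the closed form up to Monte Carlo error.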

## Proposition 2

## Proof

*z*. Assume now \(b_0y+x>0\);

Integrating \(S(x,y,G_{1,\mathbf {s},\mathbf {b}})\) with respect to *y* against the standard Gaussian distribution can be done with reasonable precision and computing time using the Gauss-Hermite quadrature. However, the calculation for many different values of *x* can hardly be vectorized, which makes the whole algorithm relatively slow. It turns out that in the particular case \(b_0=0\), the integral has an explicit expression in terms of \(\varPhi \). Thus it can be computed with high accuracy in negligible computing time, for a whole range of different values of *x*.
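A minimal sketch of Gauss-Hermite integration against the standard Gaussian distribution is given here for orientation. The integrand `f` is a placeholder, not the kernel *S* of the paper, and the function name is an assumption of this example. NumPy's `hermegauss` returns nodes and weights for the weight function \(\mathrm {e}^{-y^2/2}\), so the weights only need to be divided by \(\sqrt{2\pi }\).

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_expectation(f, n_nodes=64):
    """Approximate E[f(Y)] for Y ~ N(0, 1) with an n_nodes-point rule.

    f must accept a NumPy array of quadrature nodes (assumption of this sketch).
    """
    nodes, weights = hermegauss(n_nodes)
    weights = weights / math.sqrt(2.0 * math.pi)  # weights now sum to 1
    return float(np.sum(weights * f(nodes)))

second_moment = gaussian_expectation(lambda y: y ** 2)   # exact value: 1
cosine_mean = gaussian_expectation(np.cos)               # exact value: exp(-1/2)
print(second_moment, cosine_mean)
```

One call handles a single integrand; evaluating it for many values of *x* means one such sum per value, which is where the computing time goes.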

## Proposition 3

## Proof

## 3 Fast approximation schemes

Numerical experiments were made in R (R Development Core Team 2008). At first, a simulation procedure for the trajectories of *X* was implemented. A regular mesh of \(10^4\) discretization points in [0, 1] was fixed. Brownian trajectories were simulated by iteratively adding Gaussian random values along the mesh. A Brownian bridge correction for the discretization bias was applied: see Sect. 6.4 of Glasserman (2004), in particular formula (6.50) p. 367. Borovkov and Novikov (2005) give a precise evaluation of the error in Monte Carlo computation of boundary crossing probabilities.
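The simulation step can be sketched as follows, in Python rather than R and for a plain Brownian motion rather than the process *X*. The mesh size, sample size and function name are assumptions chosen to keep the example small. The bridge correction used here samples the maximum of the Brownian bridge on each mesh interval, given its two end values, by inverting its known distribution function.

```python
import math
import numpy as np

def simulate_bm_maxima(n_paths=20_000, n_steps=50, seed=0):
    """Return (naive, corrected) maxima over [0, 1] of simulated Brownian paths."""
    rng = np.random.default_rng(seed)
    h = 1.0 / n_steps
    # Brownian skeleton: cumulative sums of Gaussian increments along the mesh.
    incr = rng.standard_normal((n_paths, n_steps)) * math.sqrt(h)
    w = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(incr, axis=1)], axis=1)
    naive = w.max(axis=1)  # ignores excursions between mesh points
    left, right = w[:, :-1], w[:, 1:]
    u = 1.0 - rng.random((n_paths, n_steps))  # uniform on (0, 1], avoids log(0)
    # Maximum of a Brownian bridge from `left` to `right` over a step of length h.
    seg_max = 0.5 * (left + right + np.sqrt((right - left) ** 2 - 2.0 * h * np.log(u)))
    corrected = seg_max.max(axis=1)
    return naive, corrected

naive, corrected = simulate_bm_maxima()
level = 1.0
# Reflection principle: P[max over [0,1] > 1] = 2 * (1 - Phi(1)).
exact = 1.0 - math.erf(level / math.sqrt(2.0))
print(np.mean(naive > level), np.mean(corrected > level), exact)
```

The naive maximum underestimates the crossing frequency on a coarse mesh, whereas the corrected one should match the reflection-principle value up to sampling error.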

*g*, we denote by \(\widehat{p}_g(x)\) the empirical *p*-value at *x* calculated from the sample. For a sample size of \(2\times 10^6\), the maximal absolute difference between the empirical and the theoretical cdf’s should remain below \(10^{-3}\) to be accepted by the Kolmogorov–Smirnov test at threshold \(5\,\%\). Therefore, the target precision is

*p* values computed from Propositions 2 and 3 were compared to the empirical *p* values from the sample. The absolute difference remained below \(10^{-3}\) in all experiments, which validated both the simulation procedure, and the implementation of Propositions 2 and 3.
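The \(10^{-3}\) figure for a sample of size \(2\times 10^6\) can be checked with a few lines of standard-library Python. The sketch below computes the \(95\,\%\) quantile of the limiting Kolmogorov distribution from its alternating series and divides it by the square root of the sample size; the function names and the bisection bracket are assumptions of this example.

```python
import math

def kolmogorov_sf(x, terms=100):
    """P[K > x] for the limiting Kolmogorov distribution (alternating series)."""
    return 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
                     for k in range(1, terms + 1))

def kolmogorov_quantile(level, lo=0.3, hi=3.0):
    """Solve P[K > x] = 1 - level by bisection; the survival function decreases in x."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if kolmogorov_sf(mid) > 1.0 - level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

sample_size = 2 * 10 ** 6
critical = kolmogorov_quantile(0.95)
threshold = critical / math.sqrt(sample_size)
print(critical, threshold)
```

The critical value is close to 1.358, which gives a threshold slightly below \(10^{-3}\), consistent with the target stated above.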

*g* and \(g_{s,b}\). Five distances were tried, among which:

*p* value at *x* calculated from Proposition 3, with \((s_1,b_1)\) defined by (20).

*G* is assumed to be increasing, concave, and bounded. Let \(b_1=G(s_1)\), \(\mathbf {s}=(0,s_1)\), \(\mathbf {b}=(0,b_1)\). Since *G* is concave, \(G_l = G_{1,\mathbf {s},\mathbf {b}}\) is such that for all \(s>0\),

*G*. Let \(\mathbf {s}=(0,s_1)\) and \(\mathbf {b}=(b_0,b_1)\). Then \(G_u = G_{1,\mathbf {s},\mathbf {b}}\) is such that for all \(s>0\),

*p* value at *x* calculated as that midpoint, from Propositions 2 and 3.

*a* ranged from 0.55 to 0.95 by steps of 0.05. The corresponding boundaries \(G_a\) defined by (11) are increasing and concave, with \(1-a\) as a limit at \(+\infty \).

*p* values, for different values of *a*.

| \(a\) | \(\Vert p_{1,g_a}-\widehat{p}_{g_a}\Vert _\infty \) | \(\Vert p_{2,g_a}-\widehat{p}_{g_a}\Vert _\infty \) |
|---|---|---|
| 0.55 | 0.00665 | 0.00488 |
| 0.60 | 0.00532 | 0.00362 |
| 0.65 | 0.00449 | 0.00321 |
| 0.70 | 0.00280 | 0.00161 |
| 0.75 | 0.00222 | 0.00138 |
| 0.80 | 0.00148 | 0.00096 |
| 0.85 | 0.00097 | 0.00070 |
| 0.90 | 0.00063 | 0.00067 |
| 0.95 | 0.00046 | 0.00043 |

Several remarks must be made. That the errors decrease as *a* increases to 1 was expected, since \(g_a\) becomes closer to 0. The errors are above the target \(10^{-3}\) for \(a<0.85\): the approximations are not perfect. However, the errors consistently remain below \(10^{-2}\). This may be considered acceptable, especially as the largest errors concern *p* values which are not statistically significant. The midpoint approximation \(p_{2,g}\) is definitely better than the one-node approximation \(p_{1,g}\), but not by much. A trade-off with computing time must be considered. The calculation of \(p_{2,g}\) was done from Proposition 2 with a Gauss-Hermite quadrature over 64 nodes. The running time for \(10^5\) values of *x* was 18.5 s, whereas the running time for the calculation of \(p_{1,g}\) is negligible (0.07 s for \(10^5\) values of *x*).

*x*. On the contrary, \(p_{1,g}(x)\), which is a linear combination of values of \(\varPhi \), is accurate even for very large values of *x*. Another calculation can be done for large *x*: Durbin’s approximation (see Durbin (1985) and Parker (2013) for a useful review). Let *v*(*t*) denote the variance function of *X*: \(v(t)=R_X(t,t)\), where \(R_X\) is defined by (2). Assume *v*(*t*) has a unique maximum over [0, 1] and denote by \(t_0\) the point at which that maximum is reached. Assume *v* has a continuous second derivative \(v''\). From formula (33) of Parker (2013), Durbin’s approximation is

*a* as above, Durbin’s approximation \(p_{d,g_a}(x)\) was compared to the one-node approximation \(p_{1,g_a}(x)\) and to the empirical *p* values \(\widehat{p}_{g_a}(x)\), for values of *x* such that all three *p* values are below \(5\,\%\). It turned out that Durbin’s approximation \(p_{d,g_a}\) performed slightly better than \(p_{1,g_a}\). For each of the two approximations, the relative error, calculated as the absolute difference with \(\widehat{p}_g(x)\) divided by \(\widehat{p}_g(x)\), remained smaller than \(5\,\%\), over the range of values \(10^{-4}<\widehat{p}_g(x)<10^{-2}\), where \(\widehat{p}_g(x)\) could be used as an estimate of \(p_g(x)\).

## 4 Gene set enrichment analysis

This section describes the statistical application to genomics that motivated the present work. It generalizes the description of the Weighted Kolmogorov–Smirnov test that was given in Charmpi and Ycart (2015).

Gene Set Enrichment Analysis (GSEA) was introduced in Subramanian et al. (2005). It is now generally considered as a basic tool of genomic data treatment: see Huang et al. (2009) for a review. GSEA aims at comparing a vector of numeric data indexed by the set of all genes, to the genes contained in a given smaller gene set. The numeric data are typically obtained from a microarray experiment. They may consist of expression levels, *p* values, correlations, fold-changes, t-statistics, signal-to-noise ratios, etc. The number associated with any given gene will be referred to as its *weight*. Many examples of such data can be downloaded from the Gene Expression Omnibus (GEO) repository (Edgar et al. 2002). The gene set may contain genes known to be associated with a given biological process, a cellular component, a type of cancer, etc. Thematic lists of such gene sets are given in the Molecular Signature (MSig) database (Subramanian et al. 2005). The word *enrichment* refers to the question: are the weights inside the gene set significantly larger than the weights in a random gene set of the same size?

Denote by *N* the total number of genes (\(N\simeq 20{,}000\) for the human genome). It is convenient to identify the genes with *N* points on the interval [0, 1], and their weights with the values of some function *h* defined on [0, 1]: gene number *i* corresponds to point *i* / *N* and its weight \(w_i\) to *h*(*i* / *N*). Traditionally, the numbering of the genes is chosen so that weights are ranked in decreasing order. Thus, the weights usually appear to vary smoothly between consecutive genes, and the function *h* can be assumed to be continuously decreasing.

*n* be its size. In practice, *n* ranges from a few tens to a few hundreds: *n* is much smaller than *N*. With the identification above, it is considered as a subset of size *n* of the interval [0, 1], say \(\{U_1,\ldots ,U_n\}\). If there is no particular relation between the weights and the gene set (null hypothesis), then the gene set must be considered as a random sample without replacement from the set of all genes. The fact that the gene set size *n* is much smaller than *N* justifies identifying the distribution of an *n*-sample without replacement of \(\{1/N,\ldots ,N/N\}\) with that of an *n*-sample of i.i.d. points on [0, 1]. The commonly accepted null hypothesis is that the gene set is uniformly distributed over all subsets of the same size, which amounts to assuming that \((U_1,\ldots ,U_n)\) are i.i.d. with uniform distribution on [0, 1]. This was the setting of Charmpi and Ycart (2015). We extend it here to the following null hypothesis.

H\(_0\): The gene set is a tuple \((U_1,\ldots ,U_n)\) of i.i.d. random variables on [0, 1], with common cdf *F*.

The interest of this generalization is the following. It is a commonplace observation that genes in databases have quite different frequencies. A typical gene set contains several of those ubiquitous genes that are detected as overexpressed in most situations, and thus are likely to be found also at the top of the weight vector. Because of those genes, stating as a null hypothesis that the gene set is a uniformly distributed sample leads to an excessive False Discovery Rate, as explained in Ycart et al. (2014). Taking into account differential gene frequencies through the distribution *F* solves the problem.

*t* between 0 and 1 by:

*Weighted Kolmogorov–Smirnov test* (WKS) in Charmpi and Ycart (2015). Observe that the meaning of “Weighted” is different from that of Csörgő et al. (1986), although some techniques used here are similar.

Except in the case where *h* is constant, the exact distribution of \(D_n\) for finite *n* cannot be expressed simply. Its numerical computation is beyond the scope of this article: see Simard and L’Ecuyer (2011) for the classical Kolmogorov-Smirnov test. However, an asymptotic approximation can be obtained for large *n*. The proof of the following convergence result is a simple application of well-known techniques of empirical processes: see Kosorok (2008) as a general reference. It can be easily reduced to the uniformly distributed case \(F(t)=t\) detailed in Sect. 2 of Charmpi and Ycart (2015).

## Proposition 4

*n* tends to infinity, the stochastic process \(\{Z_n(t)\,,\;0\leqslant t\leqslant 1\}\) converges weakly in \(\ell ^{\infty }([0{,}1])\) to the process \(\{Z_t\,,\;0\leqslant t\leqslant 1\}\), where:

*D* implies an approximation error which could be minimized by a small sample correction (Stephens 1970); this has not been attempted yet.

It will now be shown that computing asymptotic *p* values for the WKS test reduces to computing \(p_g(x)\) for some function *g* related to *h* and *F*.

## Proposition 5

*h* does not vanish on any interval, hence \(H_{2}\) is strictly increasing and its inverse \(H_{2}^{-1}\) is uniquely defined. Let:

## Proof

*D*; To justify the first identity, it suffices to observe that \(Z_{t}\) and \(W_{H_{2}(t)}-H_{1}(t)W_{ \gamma _{2}}\) are two centered Gaussian processes, with the same covariance function. The second identity is obtained through the change of time \( H_{2}(t)\mapsto s\), which does not modify ordinates of trajectories. The third identity is the invariance of Brownian motion through scaling. The last identity follows again by comparing covariance functions. \(\square \)

*h* encountered in practice often lead to functions *g* resembling \(g_a\) for \(0.6<a<0.8\).

In Charmpi and Ycart (2015), it had been proposed to evaluate the distribution of *D* by Monte Carlo simulation. Although it is a commonly used method in many statistical applications including classical GSEA, Monte Carlo simulation is not acceptable, for both precision and computing cost reasons. In real applications, the test must often be applied to several hundred vectors, each tested against several thousand gene sets. The number of values of \(p_g(x)\) to be computed can be of order \(10^7\). Thus a running time of more than \(10^{-3}\) s per test cannot be accepted. Moreover, the most significant gene sets, which are of greatest biological relevance, often have very small *p* values (\(<10^{-10}\)), which must be accurately calculated. The Monte Carlo method proposed in Charmpi and Ycart (2015) takes about \(10^{-2}\) s per test, for only \(10^4\) simulated trajectories of *Z*. With so few trajectories, the smallest *p* values that can be returned are of order \(10^{-3}\). The conclusion is that neither the computing cost nor the precision of the results matches the needs of the real application. On the contrary, the approximation schemes described in Sect. 3 are both computationally efficient and precise enough for the application.

The remarks above will be illustrated on a typical example of application. We have considered the Cancer Cell Line Encyclopedia of Barretina et al. (2012) (GEO data set GSE36133, Edgar et al. 2002). It contains RNA expression data for 917 tumor cell lines. The data was reduced to 16775 protein coding genes; thus 917 vectors of length 16775 were considered. The rank statistics of each vector were tested for enrichment in the gene sets of the MSig C2 database (version 5.1, Subramanian et al. 2005). The database was reduced to the same protein coding genes and comprised 3751 gene sets. Thus \(917\,\times \, 3751=3.44\,\times \,10^6\) *p* values were computed. The calculation was made using the one-node approximation \(p_{1,g}\) and frequency correction; it took 3412 s, i.e., \(10^{-3}\) s per *p* value. Denote by *P* the \(3751\times 917\) matrix of *p* values so obtained. The test being repeated for each vector over 3751 gene sets, a multiple testing adjustment has to be applied on the columns of *P*. Dependencies in the data suggest choosing the method of Benjamini and Yekutieli (2001). After multiple testing adjustment, the number of *p* values smaller than \(5\,\%\) among the 3751 was counted for each of the 917 columns of *P*: these numbers ranged from 297 to 450, with a mean of 394. The numbers of *p* values smaller than \(10^{-10}\) (still after multiple testing adjustment) ranged from 32 to 128 with a mean of 76. Interestingly enough, there were 17 gene sets whose *p* value was smaller than \(10^{-10}\) for all 917 vectors. All 17 gene sets had biological connections with cancer.
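For readers unfamiliar with the adjustment, here is a minimal sketch of Benjamini–Yekutieli adjusted *p* values for one column. It follows the usual step-up definition (multiply the *k*-th smallest *p* value by \(m\,c(m)/k\) with \(c(m)=\sum _{j=1}^m 1/j\), then enforce monotonicity and cap at 1); the function name and the toy input are assumptions, and this is not the code used for the study.

```python
import numpy as np

def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted p values, in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    ranks = np.arange(1, m + 1)
    c_m = np.sum(1.0 / ranks)            # harmonic correction for dependence
    order = np.argsort(p)
    scaled = p[order] * m * c_m / ranks  # k-th smallest p value times m * c(m) / k
    # Running minimum from the largest p value down keeps the result monotone.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

toy = benjamini_yekutieli([0.03, 0.01, 0.02])
print(toy)
```

For a matrix of *p* values with one column per vector, the same function could be applied column by column, for instance with `np.apply_along_axis(benjamini_yekutieli, 0, P)`.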

In order to evaluate the effect of multiple testing adjustment on Monte Carlo estimated *p* values, all columns of *P* were Winsorized, replacing any *p* value smaller than \(10^{-3}\) by \(10^{-3}\). After applying multiple testing adjustment to each Winsorized column, no *p* value smaller than 0.05 remained. This implies that the Monte Carlo method would have missed all significant gene sets. Of course, one could consider improving Monte Carlo accuracy by speeding it up, for instance using parallelization. However, a 100-fold gain in speed is equivalent to a 10-fold gain in accuracy for a given computing time: speeding up the Monte Carlo method will not allow it to accurately estimate *p* values smaller than \(10^{-10}\), precisely those detecting relevant biological information.
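The effect of the \(10^{-3}\) floor can be made concrete with a short calculation, given here as a sketch under the usual step-up form of the Benjamini–Yekutieli rule. A rank *k* can only be declared significant at level \(\alpha \) if the *k*-th smallest *p* value is at most \(\alpha k/(m\,c(m))\); with every *p* value at least \(10^{-3}\), this requires \(k\geqslant 10^{-3}\,m\,c(m)/\alpha \). The variable names are assumptions of this example.

```python
import math

n_gene_sets = 3751   # number of tests per column
floor = 1e-3         # smallest p value after Winsorizing
level = 0.05         # significance level after adjustment

# Harmonic factor c(m) of the Benjamini-Yekutieli procedure.
c_m = sum(1.0 / k for k in range(1, n_gene_sets + 1))

# Smallest rank k at which a floored p value could still pass the threshold.
min_rank = math.ceil(floor * n_gene_sets * c_m / level)

# Monte Carlo error scales as 1 / sqrt(number of trajectories).
speedup = 100
accuracy_gain = math.sqrt(speedup)
print(c_m, min_rank, accuracy_gain)
```

Under these assumptions, at least 661 of the 3751 floored *p* values in a column would have to pass the threshold before any could be declared significant, and a 100-fold speed-up buys only one extra decimal digit of accuracy.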

## Notes

### Acknowledgments

We are grateful to Albert Shiryaev, Marina Kleptsyna, and Alain Le Breton for helpful and pleasant discussions. The editor and reviewers made very useful remarks. Research supported by Laboratoire d’Excellence TOUCAN (Toulouse Cancer). A. Novikov was supported by the Russian Science Foundation under Grant 14-21-00162. N. Kordzakhia was supported by the Australian Research Council Grant DP150102758.

## References

- Barretina, J., Caponigro, G., Stransky, N., et al.: The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature **483**(7391), 603–607 (2012)
- Benjamini, Y., Yekutieli, D.: The control of the false discovery rate in multiple testing under dependency. Ann. Statist. **29**(4), 1165–1188 (2001)
- Bischoff, W., Hashorva, E., Hüsler, J., Miller, F.: Exact asymptotics for boundary crossings of the Brownian bridge with applications to the Kolmogorov test. Ann. Inst. Statist. Math. **55**(4), 849–864 (2003)
- Borovkov, K., Novikov, A.: Explicit bounds for approximation rates of boundary crossing probabilities for the Wiener process. J. Appl. Probab. **42**(1), 85–92 (2005)
- Charmpi, K., Ycart, B.: Weighted Kolmogorov-Smirnov testing: an alternative for gene set enrichment analysis. Statist. Appl. Genet. Mol. Biol. **14**(3), 279–295 (2015)
- Csörgő, M., Csörgő, S., Horváth, L., Mason, D.M.: Weighted empirical and quantile processes. Ann. Probab. **14**(1), 31–85 (1986)
- del Barrio, E.: Empirical and quantile processes in the asymptotic theory of goodness-of-fit tests. In: del Barrio, E., Deheuvels, P., van de Geer, S. (eds.) Lectures on empirical processes: theory and statistical applications, EMS series of lectures in Mathematics, pp. 1–92. European Mathematical Society, Zürich (2007)
- Doob, J.L.: Heuristic approach to the Kolmogorov-Smirnov theorems. Ann. Math. Statist. **20**(3), 393–403 (1949)
- Durbin, J.: Boundary-crossing probabilities for the Brownian motion and Poisson processes and techniques for computing the power of the Kolmogorov-Smirnov test. J. Appl. Probab. **8**(3), 431–453 (1971)
- Durbin, J.: Distribution theory for tests based on the sample distribution function, SIAM CBMS-NSF Regional conference series in applied mathematics, vol. 9. SIAM, Philadelphia (1973)
- Durbin, J.: The first-passage density of a continuous Gaussian process to a general boundary. J. Appl. Probab. **22**(1), 99–122 (1985)
- Edgar, R., Domrachev, M., Lash, A.E.: Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucl. Acids Res. **30**(1), 207–210 (2002)
- Gentz, A., Bretz, F.: Computation of multivariate normal and \(t\) probabilities. In: Chan, H.P. (ed.) Lecture notes in statistics. Springer, New York (2009)
- Glasserman, P.: Monte Carlo methods in financial engineering. Springer, New York (2004)
- Hashorva, E.: Exact asymptotics for boundary crossing probabilities of Brownian motion with piecewise linear trend. Elect. Comm. Probab. **10**, 207–217 (2005)
- Huang, D.W., Sherman, B.T., Lempicki, R.A.: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucl. Acids Res. **37**(1), 1–13 (2009)
- Kahale, N.: Analytic crossing probabilities for certain barriers by Brownian motion. Ann. Appl. Probab. **18**(4), 1424–1440 (2008)
- Khmaladze, E.: Martingale approach in the theory of goodness-of-fit tests. Theory Probab. Appl. **26**(2), 240–257 (1981)
- Kosorok, M.R.: Introduction to empirical processes and semiparametric inference. Springer, New York (2008)
- Krumbholz, W.: On large deviations of Kolmogorov-Smirnov-Renyi type statistics. J. Multivar. Anal. **6**(4), 644–652 (1976)
- Novikov, A., Frishling, V., Kordzakhia, N.: Approximations of boundary crossing probabilities for a Brownian motion. J. Appl. Probab. **36**(4), 1019–1030 (1999)
- Parker, T.: A comparison of alternative approaches to supremum-norm goodness of fit tests with estimated parameters. Econom. Theory **29**(5), 969–1008 (2013)
- Pötzelberger, K., Wang, L.: Boundary crossing probability for Brownian motion. J. Appl. Probab. **38**(1), 152–164 (2001)
- R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna (2008). http://www.R-project.org, ISBN 3-900051-07-0
- Shiryaev, A.N.: Kolmogorov, Volume 2: Selecta from the correspondence between A. N. Kolmogorov and P. S. Aleksandrov. Moscow (2003)
- Simard, R., L’Ecuyer, P.L.: Computing the two-sided Kolmogorov-Smirnov distribution. J. Statist. Softw. **39**(11), 1–18 (2011)
- Smirnov, N.V.: On deviations of the empirical distribution curves (Russian). Mat. Sb. **6**(48), 3–26 (1939)
- Stephens, M.A.: Use of the Kolmogorov-Smirnov, Cramer-von Mises and related statistics without extensive tables. J. R. Statist. Soc. B **32**(1), 115–122 (1970)
- Stephens, M.A.: Introduction to Kolmogorov (1933) on the empirical determination of a distribution. In: Kotz, S., Johnson, N.L. (eds.) Breakthroughs in statistics, Springer series in statistics, vol. II, pp. 93–105. Springer, New York (1992)
- Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., Mesirov, J.P.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA **102**(43), 15545–15550 (2005)
- Ycart, B., Pont, F., Fournié, J.J.: Curbing false discovery rates in interpretation of genome-wide expression profiles. J. Biomed. Inform. **47**, 58–61 (2014)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.