A Geometry-Based Multiple Testing Correction for Contingency Tables by Truncated Normal Distribution

The inference procedure is a critical step in experimental research for drawing scientific conclusions, especially under multiple testing. Unless the unadjusted marginal p-values are corrected, the false positive rate increases. Therefore, a multiple testing correction that adjusts the p-values based on the number of tests is necessary to control the type I error. We propose a multiple testing correction of the MAX-test for a contingency table, where multiple χ²-tests are applied, based on a truncated normal distribution (TND) estimation method by Botev. The table and tests are defined geometrically by contour hyperplanes in a space whose dimension equals the degrees of freedom (df). A linear algebraic method called spherization transforms the shape of the space defined by the contour hyperplanes of the distribution of tables sharing the same marginal counts, so that the stochastic distributions of these tables become a standard multivariate normal distribution in df-dimensional space. Geometrically, the p-value is defined by a convex polytope composed of the truncating hyperplanes of the tests' contour lines in df-dimensional space. The TND approach of the Botev method was used to estimate the corrected p-value. Finally, the features of our approach were extracted using real GWAS data.


Introduction
The multiple testing problem occurs when a set of simultaneous statistical tests is considered. In many situations, more than one, or even a large number of, hypotheses are tested simultaneously, which is referred to as multiple comparisons [12]; for example, with high-dimensional data from genetics, medicine, molecular biology, bioinformatics, agricultural science, etc. [13]. In statistical hypothesis testing, a pre-defined significance level generally applies only to each individual test, whereas multiple testing considers the same level for the whole family of simultaneous tests. The probability of committing false statistical inferences increases considerably when more than one hypothesis is tested simultaneously at a given significance level [4,18]. In general, when all the null hypotheses (H0) are independent and supposed to be true, at least one false rejection becomes almost inevitable even if only 100 hypotheses are individually tested at significance level 0.05. Estimation of this error rate is more complex if the hypotheses are correlated and not all of them are true [4]. Hence, the unadjusted or marginal p-values are no longer a useful quantity for this inference process when we are testing many features at the same time.
Therefore, a correction of the p-values is necessary in multiple testing to control the type I error [1]. Many correction methods have been proposed. However, these corrections are not always easy or straightforward in terms of selecting a method suitable for the experimental properties and study purposes at hand [4]. For example, genome-wide association studies (GWAS) consider simultaneous testing of the individual single-nucleotide polymorphisms (SNPs) of a gene and pick up positive genes when any one of the SNPs in the gene is associated with a trait [3]. Contingency table tests are generally used to analyze dichotomous case-control traits, where the deviation from independence is measured under the H0 of no association between the phenotype and SNP genotype classes. One of the most celebrated forms of this test is the popular chi-square (χ²) test [3]. Selecting the largest test statistic (MAX-test) among different genetic models is a powerful approach, as it provides a safeguard against model uncertainty [7].
Due to rapid advances in genotyping technology and standardized guidelines for reporting statistical evidence, the multitude of comparisons made in a GWAS may result in false positives (type I errors). As testing millions of SNPs in a single gene association study becomes standard, consideration of multiple comparisons is an essential part of determining statistical significance [13]. Moreover, correction of multiple testing p-values is also required for SNPs in linkage disequilibrium (LD) with each other [19]. Closely spaced SNPs frequently show high correlation because of extensive LD. Therefore, when association studies are conducted with many SNPs, the tests performed on each SNP are usually not independent, depending on the correlation structure among the SNPs. This violation of the independence assumption limits the ability of the Šidák and Bonferroni corrections to control the type I error effectively [6]. The permutation test can be applied [11,20], but an enormous number of permutations is required to accurately estimate small p-values, constituting a computational burden [14].

Statistics in Biosciences (2020) 12:63-77

We combined the geometry of the multi-way contingency table [21] and the χ²-test [9,22] to develop a linear algebraic transformation called spherization. Geometrically, the table and tests are defined by their contour hyperplanes in a df-dimensional space. The spherization converts the shape of the space defined by the contour hyperplanes of the distribution of tables sharing the same marginal counts. Hence, the stochastic distributions of tables are transformed into a standard multivariate normal distribution in df-dimensional space, which can address tables with smaller p-values because of its symmetry with respect to distance and direction. If the MAX-test is performed for a set of H0, the tables with the same test statistic values are located in the df-dimensional space in the shape of a convex polytope, and the corrected p-values correspond to the probability mass outside the convex polytope of the truncated normal distribution (TND). The TND approach by Botev [2] was applied to estimate the multiple testing corrected p-values.
The method was applied to real GWAS data, and the additive model tests of individual SNPs were repeated for a SNP set. Therefore, the main aim of this paper is to propose a multiple testing correction method for contingency table tests and to study its features using a real data set.

Spherization
This method combines the idea of the algebraic geometry of a contingency table [21] and the χ²-statistic [9,22] to develop the MAX-test from spherical geometry.

Geometry of Multi-way Contingency Tables
Let Y be an m-way contingency table with non-negative elements Y_i, where i ≡ (i_1, i_2, …, i_m), 1 ≤ i_j ≤ I_j, and I_j is the number of categories of the j-th dimension. To consider the geometry of contingency tables, we define the vectorized form of Y as y by defining an index for each cell: in row-major order, the index of the i-th element is

∑_{j=1}^{m} ( (i_j − 1) ∏_{k=j+1}^{m} I_k ) + 1.

Geometrically, contingency tables can be deployed in an M-dimensional space with M ≡ ∏_j I_j. Tables can be located in the same space if their observation structures are the same.
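As a concrete illustration of this vectorization, the following sketch (Python with NumPy; the 0-based indices and the specific 3-way table are illustrative choices, not from the paper) maps a multi-index to its position in y using row-major order:

```python
import numpy as np

# Hypothetical 3-way table with I = (2, 3, 4) categories per dimension
I = (2, 3, 4)
Y = np.arange(np.prod(I)).reshape(I)   # M = 2*3*4 = 24 cells
y = Y.ravel()                          # row-major vectorized form of Y

def vec_index(idx, I):
    # 0-based multi-index (i_1, ..., i_m) -> position in y (row-major order)
    pos = 0
    for i_j, I_j in zip(idx, I):
        pos = pos * I_j + i_j
    return pos

# The vectorization preserves every cell: Y[i_1, i_2, i_3] == y[vec_index(...)]
```

With 1-based indices as in the text, the position is larger by one; the Horner-style accumulation above is equivalent to the sum-of-products formula.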

Dimension Reduction of Contingency Tables
For the χ²-test, the tables sharing the same marginal counts can be located in the same df-dimensional space with df ≡ ∏_j (I_j − 1), which is smaller than the M-dimensional space because the tables are restricted by the marginal counts. The dimension reduction is performed by shifting the tables in parallel and rotating them. First, the center of the distribution of tables is moved to the expected table, and the differential vector is defined as

d ≡ o − e, (2)

where o and e are the observed and expected frequency vectors for arbitrary tables in M-dimensional space, respectively. Next, a rotation matrix R is defined depending on the I_j:

R ≡ R_{I_1} ⊗ R_{I_2} ⊗ ⋯ ⊗ R_{I_m}. (3)

Equation (3) is the Kronecker product of m rotation matrices for the m-way contingency table. Each of the m matrices is a rotation matrix in an n-dimensional space, where n = I_j is the number of categories of the j-th dimension. Rotating Eq. (2) by Eq. (3) rotates the simplex in each I_j dimension so that its vertices are placed in a hyperplane, where Rd is the rotated differential vector. This rotation produces many zero elements of Rd that are common to all the tables; specifically, ‖Rd‖₀ ≤ df, where ‖·‖₀ is the number of non-zero elements. The positions of the non-zero elements are consistent among tables sharing the same marginal counts and can be known from the I_j. Then, an index vector v is defined to specify the positions of the non-zero and zero elements:

v ≡ u_{I_1} ⊗ u_{I_2} ⊗ ⋯ ⊗ u_{I_m},

where u_n is an n-dimensional vector whose last element is 0 and whose other elements are 1. A matrix V is defined from v, which removes the zero elements from Rd by deleting the columns for which [v]_i = 0. For example, for a 2 × 3 table, v = (1, 1, 0, 0, 0, 0) and

V = ( 1 0 0 0 0 0
      0 1 0 0 0 0 ).

Multiplying Rd by V reduces the dimension of the table vectors without loss of information.
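A minimal numerical sketch of the shift-and-rotate reduction. It assumes Helmert-type rotation matrices whose last row is the constant direction 1/√n, which is one concrete choice satisfying the zeroing property; the 2 × 3 table is hypothetical:

```python
import numpy as np

def helmert_rotation(n):
    # Orthogonal matrix whose last row is the constant direction 1/sqrt(n);
    # rotating a zero-sum vector therefore zeroes its last coordinate.
    H = np.zeros((n, n))
    for k in range(n - 1):
        H[k, :k + 1] = 1.0
        H[k, k + 1] = -(k + 1)
        H[k] /= np.linalg.norm(H[k])
    H[n - 1] = 1.0 / np.sqrt(n)
    return H

# Hypothetical 2 x 3 table: observed counts and expectation under independence
obs = np.array([[10., 20., 30.],
                [20., 20., 20.]])
e = np.outer(obs.sum(1), obs.sum(0)) / obs.sum()
d = (obs - e).ravel()                  # differential vector, length M = 6

R = np.kron(helmert_rotation(2), helmert_rotation(3))   # Eq. (3)
rd = R @ d
# Margins are fixed, so at most df = (2-1)*(3-1) = 2 coordinates survive
nonzero = np.flatnonzero(np.abs(rd) > 1e-9)
```

Because the row and column sums of the differential table are zero, the rotated vector is zero in every coordinate touched by a "constant" row of either factor, so only the first df positions remain.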

Spherization of the χ²-Statistic
The χ²-statistic can be written as a quadratic form of the differential vector:

χ² = dᵀ E⁻¹ d, (6)

where E is a diagonal matrix having e in the main diagonal. With the dimension reduction, Eq. (6) can be written as

χ² = (VRd)ᵀ (V R E⁻¹ Rᵀ Vᵀ) (VRd),

where VRd is a df-dimensional vector, and

χ² = gᵀ g = |g|², (10)

where X is an LU-decomposed matrix of V R E⁻¹ Rᵀ Vᵀ (so that Xᵀ X = V R E⁻¹ Rᵀ Vᵀ) and g ≡ XVRd. Hence, Eq. (10) is the squared Euclidean distance from the center to the rotated hyperplane in df-dimensional space, and g is regarded as a sample from a df-dimensional multivariate normal distribution.
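The identity χ² = |g|² can be checked numerically. The sketch below uses a Cholesky factor in place of the LU decomposition (any factor X with XᵀX = VRE⁻¹RᵀVᵀ works) and a hypothetical 2 × 3 table; the Helmert-type rotation is an illustrative choice:

```python
import numpy as np

def helmert(n):
    # orthogonal rotation; the last row is the constant direction 1/sqrt(n)
    H = np.zeros((n, n))
    for k in range(n - 1):
        H[k, :k + 1] = 1.0
        H[k, k + 1] = -(k + 1)
        H[k] /= np.linalg.norm(H[k])
    H[-1] = 1.0 / np.sqrt(n)
    return H

obs = np.array([[10., 20., 30.], [20., 20., 20.]])       # hypothetical table
e = np.outer(obs.sum(1), obs.sum(0)) / obs.sum()
d = (obs - e).ravel()
R = np.kron(helmert(2), helmert(3))
V = np.eye(6)[[0, 1]]              # selects the df = 2 surviving coordinates

M = V @ R @ np.diag(1.0 / e.ravel()) @ R.T @ V.T         # V R E^-1 R^T V^T
X = np.linalg.cholesky(M).T        # X^T X = M (Cholesky stand-in for LU)
g = X @ V @ R @ d                  # spherized coordinates, g = X V R d

chi2_pearson = np.sum((obs - e) ** 2 / e)                # Eq. (6)
chi2_spherized = float(g @ g)      # |g|^2 reproduces the Pearson chi-square
```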

Geometry of the Proportion Trend Test
The geometry of the χ²-statistic can be applied to deploy the proportion trend test in df-dimensional space. Let us consider a 2 × 3 contingency table for the description. To define the proportion trend test, a weight vector has to be defined; for example, (2 1 0), where one weight is defined for each column. For generalization, the weights are defined for each cell of the contingency table, giving the vectorized weight w = (2 1 0 0 0 0)ᵀ. For a single weight vector w, i.e., one test, the test statistic λ_w is defined in Eq. (11) as a function of the weighted sum wᵀd of the differential table.

To deploy Eq. (11) in df-dimensional space, we use the deformation of the definition of g in Eq. (10). Writing g ≡ Pd with P ≡ XVR, the differential vector is recovered as

d = P†g,

where P† = Rᵀ Vᵀ X⁻¹ is a sparse version of the pseudo-inverse of P. The deformation from Eq. (15) to (16) holds only for cases in which the dimension reduction to the df-dimensional space of spherization is applicable. Then, from Eq. (11), λ_w is proportional to wᵀd = (P†ᵀw)ᵀ g, and the unit test vector defined by the weight vector and rotation is

t ≡ P†ᵀw / |P†ᵀw|.

Therefore, the proportion trend test is projected into df-dimensional space with a direction, and the inner relation of the test is

λ_w = tᵀ g = |g| cos θ. (21)

Equation (21) defines the inner product, which is the relation between observation and weight, where θ is the angle between them. If the observation is in the direction θ = 0, so that (tᵀg)² = |g|² = χ²_df, then the χ²-statistic for a vectorized weight is

χ²_w = λ_w² = (tᵀ g)², (22)

where χ²_w follows a χ²₁-distribution, and the p-value is the proportion of tables having χ²-values greater than or equal to the observation's, relative to the number of all possible tables.
For one observation and one weight, one test vector is deployed into df-dimensional space. For one observation and multiple weights, multiple test vectors are deployed into the same df-dimensional space, as in the case of the MAX-test. Pairs of truncating hyperplanes can be drawn in df-dimensional space depending on the set of test vectors and the values λ_w. Each hyperplane is perpendicular to its test vector, and the distance from the origin to the hyperplane is λ_w. Because the distribution of tables is transformed into a standard multivariate normal distribution in the spherized space, the p-value is the cumulative standard normal probability outside the convex polytope truncated by the pairs of hyperplanes.
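A sketch of the MAX-test projection for three genetic-model weight vectors. The unit test vectors are computed here as the normalized images X⁻ᵀVRw of the weights in spherized space, which is one concrete reading of the construction above; the table, the weights on the first row, and the Helmert rotation are all illustrative assumptions:

```python
import numpy as np

def helmert(n):
    # orthogonal rotation; the last row is the constant direction 1/sqrt(n)
    H = np.zeros((n, n))
    for k in range(n - 1):
        H[k, :k + 1] = 1.0
        H[k, k + 1] = -(k + 1)
        H[k] /= np.linalg.norm(H[k])
    H[-1] = 1.0 / np.sqrt(n)
    return H

# hypothetical 2 x 3 case-control table (rows: cases/controls, cols: genotypes)
obs = np.array([[10., 20., 30.], [20., 20., 20.]])
e = np.outer(obs.sum(1), obs.sum(0)) / obs.sum()
d = (obs - e).ravel()
R = np.kron(helmert(2), helmert(3))
V = np.eye(6)[[0, 1]]                       # keeps the df = 2 coordinates
X = np.linalg.cholesky(V @ R @ np.diag(1 / e.ravel()) @ R.T @ V.T).T
g = X @ V @ R @ d                           # spherized observation

# vectorized cell weights for three genetic models (weights on the first row)
models = {"additive":  np.array([2., 1., 0., 0., 0., 0.]),
          "dominant":  np.array([1., 1., 0., 0., 0., 0.]),
          "recessive": np.array([1., 0., 0., 0., 0., 0.])}

lams = {}
for name, w in models.items():
    t = np.linalg.inv(X).T @ V @ R @ w      # image of w in spherized space
    t /= np.linalg.norm(t)                  # unit test vector
    lams[name] = abs(t @ g)                 # |lambda_w|, with chi2_w = lambda_w**2
lam_max = max(lams.values())                # MAX-test statistic
```

By Cauchy–Schwarz, each λ_w² ≤ |g|² = χ²_df, with equality only when the observation lies exactly in the test direction (θ = 0).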

Geometric Features of Convex Polytopes in Spherized Space
The geometric configuration of the convex polytope defines the multiple testing p-values. Figure 1a shows a df = 2 space with a single df = 1 test. Tables with the same χ²-values are drawn as a high-dimensional ellipsoid contour for the df = 2 test and as a pair of parallel truncating hyperplanes for the df = 1 test. The ellipsoid contour is not easy to handle from the distributional standpoint, since the tables on the ellipsoid are not all equidistant from the origin (Fig. 1a). By spherization, these parallel hyperplanes are transformed into tangent hyperplanes, and the high-dimensional ellipse into an n-sphere (the generalization of the ordinary sphere to df dimensions), on which all the tables are placed. The χ²-value is calculated as the squared Euclidean distance from the expected table to the hyperplane at distance λ_w, and the p-value is the proportion of tables outside the convex polytope (Fig. 1b). The hyperplane pair for a single test is defined as

−λ_w ≤ tᵀZ ≤ λ_w, (23)

where Z is the axis of the df-dimensional space.
In multiple testing, multiple pairs of parallel hyperplanes form a convex polytope (Fig. 1c, light-yellow-colored area), in which each pair corresponds to the tables with the same χ²-value from an individual df = 1 test. The definition of the convex polytope in Eq. (23) is generalized for n_test tests as

−λ_max 1_{n_test} ≤ AZ ≤ λ_max 1_{n_test}, (24)

where λ_max ≡ max_k λ_{w_k}, k = 1, 2, …, n_test, is the largest value of the test statistics among the n_test tests; A is an (n_test × df) matrix whose rows are the test vectors; and 1_{n_test} is an (n_test × 1) vector, all of whose elements are 1.
If the maximum among a set of test statistic values is taken as representative in multiple testing, the tables on the surface of the convex polytope share the same representative χ²-value. Hence, the probability of observing tables with representative χ²-values greater than or equal to the observed table's χ²-value is the integral of the probability density of a multivariate normal distribution over the space outside the convex polytope. Therefore, the p-value for n_test tests is

p = ∫ δ(Z ∉ C) φ(Z; 0, I_df) dZ, (25)

where δ is the indicator (delta) function, C is the convex polytope of Eq. (24), and φ(Z; 0, I_df) is the density of N(0, I_df). Tables in spherized space are considered to be observed from a standard normal distribution whose center corresponds to the expected table in the context of testing H0. Because of the relationship between n-dimensional normal distributions and χ²-distributions, we define the square root of the χ²-value as the distance from the center to the point corresponding to the observed table, and we obtain the spherized space (Fig. 1b, circle) with a multivariate normal distribution demarcated by a convex polytope, producing the TND (Fig. 1c).
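The integral in Eq. (25) can be approximated by naive Monte Carlo before resorting to the TND estimator; the sketch below does so for three hypothetical unit test vectors in df = 2 and compares the result with a single df = 1 test at the same cutoff:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
df, lam_max = 2, 2.0
# three hypothetical unit test vectors (rows) in a df = 2 spherized space
A = np.array([[1.0, 0.0],
              [math.cos(math.pi / 6), math.sin(math.pi / 6)],
              [math.cos(math.pi / 3), math.sin(math.pi / 3)]])

Z = rng.standard_normal((200_000, df))          # tables under H0, spherized
inside = np.all(np.abs(Z @ A.T) <= lam_max, axis=1)
p_corrected = 1.0 - inside.mean()               # mass outside the polytope

# single df = 1 test at the same cutoff: P(|N(0,1)| > lam_max)
p_single = math.erfc(lam_max / math.sqrt(2.0))
```

The corrected p lies between the single-test p and the Bonferroni bound (three times the single-test p), reflecting the correlation among the three tests.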

The Spherization and Botev's (Sph-Btv) Approach
The probability that a df-dimensional vector Z falls outside a convex polytope can be defined from the linear restrictions of the d-dimensional multivariate normal law [2] as

p = 1 − ∫_{l ≤ AZ ≤ u} φ(Z; 0, I_d) dZ, (26)

where l and u are the lower and upper truncation limits, respectively; A is a full rank matrix; Z ∼ N(0, I_d) has a d-dimensional multivariate normal distribution with mean vector 0 and variance-covariance matrix I_d; and φ(Z; 0, I_d) is the probability density function of N(0, I_d).
Since A is a full rank matrix and Z ∼ N(0, Σ_Z = I_d), we can simply estimate Eq. (26) as [2]

p = 1 − P(l ≤ ℤ ≤ u), (28)

where ℤ ∼ N(0, AΣ_ZAᵀ) and Δ_{n_test × n_test} ≡ AΣ_ZAᵀ = AAᵀ is an inner product matrix. Here, the p-values were estimated by Botev's approach [2] after the transformation of the space by spherization. The convex polytope for n_test tests is defined from Eq. (28), where A_{n_test × df} is the matrix whose rows are the test vectors defined by the test models (W_{n_test × k}) and the rotation matrix (R_{k × df}) of the table; k is the number of column categories of the table; ℤ ∼ N(0, Δ) is an n_test-dimensional multivariate normal vector; and l and u are two (n_test × 1) vectors of the intercepts of the tangent contour hyperplanes of the tests.
Since our method's test vectors are unit vectors, we define the distance from the origin to each hyperplane as λ_max, the maximum value of the test statistics among the n_test proportion trend tests. Therefore, the elements of the vectors l and u are −λ_max and λ_max, respectively. The p-value of the Sph-Btv approach is the probability that the vector ℤ falls outside the convex polytope defined by the linear inequalities in Eq. (30):

p = 1 − P(l ≤ ℤ ≤ u), (31)

where l = −λ_max 1_{n_test} and u = λ_max 1_{n_test} are the lower and upper truncation limits, respectively.
The Δ in Eq. (31) is a positive definite variance-covariance matrix. The upper or lower triangle of Δ consists of q = n_test(n_test − 1)/2 elements, each of which is the pairwise inner product of two test vectors. These q elements determine the size of the convex polytope, and they carry the full information of the probability density function of the MAX-test.
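A sketch of the ℤ-space formulation: Δ = AAᵀ is assembled from hypothetical unit test vectors, and P(l ≤ ℤ ≤ u) is approximated here by crude Monte Carlo as a stand-in for Botev's minimax-tilting estimator (available, e.g., in the R package TruncatedNormal), which is needed when λ_max is large and p is tiny:

```python
import numpy as np

# hypothetical unit test vectors as rows of A (n_test = 3, df = 2)
A = np.array([[1.0, 0.0],
              [np.cos(np.pi / 6), np.sin(np.pi / 6)],
              [np.cos(np.pi / 3), np.sin(np.pi / 3)]])
Delta = A @ A.T                   # inner product matrix A Sigma_Z A^T, Sigma_Z = I
lam_max = 2.0
l = -lam_max * np.ones(3)         # lower truncation limits
u = lam_max * np.ones(3)          # upper truncation limits

# ZZ = A Z with Z ~ N(0, I_df), so ZZ ~ N(0, Delta); Monte Carlo estimate of
# p = 1 - P(l <= ZZ <= u), Eq. (31)
rng = np.random.default_rng(1)
ZZ = rng.standard_normal((200_000, 2)) @ A.T
p = 1.0 - np.mean(np.all((ZZ >= l) & (ZZ <= u), axis=1))

# q = n_test(n_test - 1)/2 off-diagonal inner products characterize the polytope
q = 3 * (3 - 1) // 2
```

Because the test vectors are unit vectors, the diagonal of Δ is all ones and the q off-diagonal entries are the pairwise cosines between test directions.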

Genotype Data
The genotype data from the Nagahama Study [10,23], comprising 1813 samples with 996,339 SNPs, were considered. The phenotypes were randomly generated using

equal probabilities for cases (= 907) and controls (= 906). The real phenotype data were not used in this study because we were interested in extracting the features of our method from GWAS data in terms of the LD structure per gene, with the case-control null hypotheses holding throughout. The SNP subsets (N subsets) located at each of 14,941 gene loci were considered to calculate the gene-based multiple testing corrected p-values.
Multiple cutoff χ²-values were used by selecting cutoff χ-values from (0.01, 8.5) in increments of 0.1, along with the GWAS cutoff of 5.45 (χ² = 5.45², p = 5 × 10⁻⁸). For each subset, the calculation was repeated 1,000 times, and 1,000 Monte Carlo samples were used for each repetition. A more detailed data description is provided as an Online Resource in the Supplementary File, ESM_1.pdf.

Results
The estimated p-values for two genes with their individual LD structures are presented in Fig. 2. For example, the gene FES in Fig. 2a has a relatively small number of SNPs with quite strong pairwise LD structures, compared to the gene NCS1 in Fig. 2b. These dissimilar physical features of the two genes lead them to produce quite heterogeneous p-values, which move either toward or away from those of a df = 1 test. The p-values deviate strongly from the df = 1 test for the gene NCS1, whereas those of the gene FES are closer to the df = 1 values (Fig. 2c). This variation in p-values is an important indicator of the variation in type I error rate per gene at a uniform confidence level, which may affect the false detection of a gene. For example, if we set the GWAS cutoff of p = 5 × 10⁻⁸ (χ² = 5.45²) for all the χ²-values along the horizontal axis of Fig. 2c, the type I error rate for a gene becomes considerably greater as we move from the left-hand corner to the right-hand corner along the horizontal line. Moreover, the number of SNPs per gene and the genome-wide LD structures of genes are not uniform. Genes with more SNPs and weak LD structures tend to produce higher p-values than genes with relatively fewer SNPs and stronger LD (Fig. 2). So, if we set a uniform cutoff to calculate the type I error rate for all the genes over the genome, it is very likely that the type I error rate of the gene in Fig. 2a will be higher than that of the gene in Fig. 2b. Figure 3a presents the results from the evaluation of all the gene-based SNP subsets (N subsets) for multiple cutoff χ²-values, including the GWAS cutoff. The eight colored subpanels in this figure depict the distributions of log₁₀(p)-values for eight different cutoffs. It is observable that there is a shift of p-values for every change in the cutoff χ².
The shape of the distributions and their modality at the left tails change along with the cutoffs. For example, the shape of the distribution for χ² = 5.45² (magenta) is very dissimilar from that for χ² = 8.31² (purple). Evenly spaced distributions were observed along the horizontal log scale, indicating that the distributions shift with respect to p while keeping the width of the spacing almost the same. Figure 3a demonstrates how the shapes of the distribution functions vary with multiple cutoff χ²-values, and the changes in shape are not simple. Therefore, it is advisable to apply our method to individual gene sets by considering the corresponding LD structures, and to target χ²-values individually.
To visualize the variations in p-values more clearly, only the two cutoff χ²-values 2.51² and 3.91² were considered. The two distributions of log₁₀(p)-values of the SNP subsets are presented in Fig. 3b. The shapes of the distributions are almost identical, but a modality change was observed in the left tail. The two evenly spaced distributions along the horizontal axis, with an almost total shift in p-values, indicate the fold change in p for a change in the cutoff. For a typical investigation of heterogeneity in p-values, we considered three cutoff χ²-values: 2.51², 3.91², and 7.51². The scatter plots of log₁₀(p)-values for χ² = 2.51² against the other two cutoffs are presented in Fig. 3c and d, respectively. In both figures, the p-values are positively correlated, with quite different strengths and patterns for a change in the cutoff. The values in Fig. 3c are linearly related with a quite strong positive correlation (r² = 0.98), whereas the relation is quite weak (r² = 0.07) in Fig. 3d. These results show that the p-values are heterogeneous for the gene-based SNP subsets over the genome for multiple cutoff χ²-values. They are also not uniform for a fixed cutoff; for example, the shape of each colored subpanel of Fig. 3a is far from that of a uniform distribution, indicating that some genes tend to produce type I errors more frequently than others based on their individual characteristics, such as LD pattern and number of SNPs per gene.
The p-value correction by our method for gene subsets is reasonably simple and straightforward. However, the biological interpretation is not so simple, because the number of truly associated genetic variants in one particular gene is not always one, and the association structure between multiple genetic variants and phenotypes is believed to be complicated [15]. In this study, the features of the proposed method were extracted only for the gene-based SNP subsets of a real GWAS data set. A limitation of our study is that these gene-specific features may not carry over to other genetic studies, as gene-coding regions comprise only a small proportion of the human genome.

Discussion
This is a geometric approach to multiple testing for contingency tables, such as the MAX-test over multiple genetic models (additive, recessive, dominant). It appears to be a realistic approach for assessing the variation of the type I error among subsets, each of which comprises multiple tests whose dependency is heterogeneous. Our results suggest that the type I error may vary based on the individual structure of genes. Therefore, our proposed method offers estimates of the probability density of MAX-test p-values. Because there can be functional relations among the variants in each gene, it is advisable not to apply a simple correction based on our method's estimate without considering other potential factors. Nevertheless, our method provides meaningful information for the interpretation of GWAS.
We illustrated an example from genetics using real GWAS data for the SNP subsets per gene. However, in the big data era, positive signal detection by multiple testing arises in various fields. This method is applicable to other arbitrary fields as long as the positive signal detection is similar; for example, to large categorical data sets, as well as non-categorical data sets, where multiple df = 1 tests are repeated according to the procedure of the MAX-based test.
Our spherization transformation is based on the algebraic geometry of correspondence analysis (CA) [5,8], with the squared Euclidean distance as a measure of χ² in df-dimensional space. This χ²-distance is a standardized form of the Mahalanobis distance [17]. A detailed illustration of the relationship among CA, the Mahalanobis distance, and the Euclidean distance is provided as an Online Resource in the Supplementary File, ESM_2.pdf.
The sum-of-chi-squares (S-O-C-S) from multiple df = 1 tests can also be used as the gene-wise test statistic for calculating the multiple testing p-value (Supplementary File, ESM_3.pdf) [15,16], instead of the maximum-of-chi-squares (M-O-C-S). The geometric features of multiple testing can also be evaluated by this S-O-C-S approach. However, it is not suitable for our Sph-Btv approach, because the Botev method [2] addresses computational problems based on the d-dimensional multivariate normal law under linear restrictions, and it is applicable when the number of linear inequalities is less than the dimension of the space. The S-O-C-S approach, in contrast, generates non-linear contour hyperplanes for the tests (Supplementary File, ESM_4.eps). In ESM_4.eps, Fig. (a) presents the contour hyperplanes of the tables for a 2 × 3 table test under the S-O-C-S approach, where all possible tables sharing the same marginal counts with the observed table are evaluated using three genetic models (additive, dominant, and recessive). It can be observed that each contour hyperplane consists of a linear summation of χ²-values, which means that the contour hyperplane contains many straight segments; because there are many segments, they look curved overall. Also in ESM_4.eps, Fig. (b) shows the non-linearity of the relations between the two coordinate values of the contour hyperplanes and the S-O-C-S values. Such non-linear and overly complicated contour hyperplanes are not feasible for Botev's approach.
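The relation between the two statistics can be seen numerically: since the maximum never exceeds the sum, {M-O-C-S ≥ c} ⊆ {S-O-C-S ≥ c}, so the S-O-C-S tail probability is the larger one while its acceptance region is bounded by a quadratic (curved) contour rather than a polytope. A sketch with two hypothetical unit test vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
# two hypothetical unit test vectors in a df = 2 spherized space
A = np.array([[1.0, 0.0],
              [np.cos(np.pi / 3), np.sin(np.pi / 3)]])
Z = rng.standard_normal((100_000, 2))      # tables under H0, spherized
proj_sq = (Z @ A.T) ** 2                   # per-test chi^2_1 values

socs = proj_sq.sum(axis=1)                 # sum-of-chi-squares (S-O-C-S)
mocs = proj_sq.max(axis=1)                 # maximum-of-chi-squares (M-O-C-S)

cut = 4.0
p_socs = float(np.mean(socs >= cut))       # S-O-C-S contour: quadratic, curved
p_mocs = float(np.mean(mocs >= cut))       # M-O-C-S contour: linear polytope faces
```

Only the M-O-C-S region is an intersection of linear inequalities, which is why Botev's linear-restriction estimator applies to the MAX-test but not to the S-O-C-S statistic.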
There are multiple potential directions in which to extend our approach. One example is handling combinations of variables. In this paper, we demonstrated multiple tests consisting of one particular test for multiple items (SNPs). However, the combinatorial effect of multiple SNPs is also an active research target, and our method can handle combinations along with single SNPs, as long as the combinatorial effect is expressed linearly, corresponding to a hyperplane in our geometric approach.
Another example is applying our method to estimate type II errors rather than the type I errors demonstrated in this paper. To estimate type II errors in the context of the MAX-test, only one modification of our method is needed: a parallel shift of the normal distribution, without moving the convex polytope of the MAX-test, to the location in the space indicated by an alternative hypothesis.
In summary, we demonstrated a geometric testing procedure for contingency tables in the context of the MAX-test that enables estimation of the multiple-testing-corrected p-value and admits multiple extensions.