Association Rule Mining for Genome-Wide Association Studies through Gibbs Sampling

Finding associations between genetic markers and a phenotypic trait such as coronary artery disease (CAD) is of primary interest in genome-wide association studies (GWAS). A major challenge in GWAS is that the genomic data involved often contain a large number of genetic markers and the underlying genotype-phenotype relationships are mostly complex. Current statistical and machine learning methods lack the power to tackle this challenge with effectiveness and efficiency. In this paper we develop a stochastic search method to mine the genotype-phenotype associations from GWAS data. The new method generalizes the well-established association rule mining (ARM) framework for searching for the most important genotype-phenotype association rules: we develop a multinomial Gibbs sampling algorithm and use it together with the Apriori algorithm to overcome the overwhelming computational complexity of ARM in GWAS. Three simulation studies based on synthetic data are used to assess the performance of the developed method, delivering the anticipated results. Finally, we illustrate the use of the developed method through a case study of a CAD GWAS. Statements and Declarations: The authors have no relevant financial or non-financial interests to disclose. The authors have no competing interests to declare that are relevant to the content of this article.


Introduction
Genome-wide association study (GWAS) is an observational study for discovering associations between genetic markers and a certain phenotypic trait such as coronary artery disease (CAD). Findings from a GWAS can help in reliably predicting an individual's risk of having the disease and in developing effective ways for prevention or treatment [24]. Different forms of genetic markers may be considered in GWAS. A simple and widely used one, called single nucleotide polymorphism (SNP), refers to a specific genetic variant (i.e. two alleles) occurring at two corresponding base positions (loci) in a pair of chromosomes in at least 1% of the population. Numerically, a SNP is characterized as the number of minor alleles at the loci, and thus takes 3 possible values, 0, 1 and 2, representing the three states of the allele pair: homozygous recessive, heterozygous, and homozygous dominant. Typically, observations are available on hundreds of thousands or even millions of SNPs in a GWAS with a sample of hundreds of cases and controls, thanks to the use of modern high-throughput DNA sequencing techniques. Moreover, the SNPs are likely to be highly correlated with one another, given that they dwell in the tiny chromosome space. These complications present significant computational challenges for analyzing the intrinsic SNPs-versus-phenotype associations with current statistical and machine learning methods.
Many statistical methods for GWAS formulate the SNPs-versus-phenotype associations by various regression models for categorical data, and then assess the significance of each individual SNP-phenotype association by single-locus tests [25]. The regression models that have been used in GWAS include ANOVA [13], generalized linear models (GLMs) [16], and (generalized) linear mixed models (LMMs) [27]. Principal components of the SNPs are sometimes used in GLMs to reduce the effects of false positives attributed to the population substructure. Approaches based on LMMs, e.g. CMLM [28] and ECMLM [15], have been shown to be successful in dealing with both population substructure and relatedness in genomic data by treating them as fixed effects and random effects respectively.
Yet it is well recognized that some phenotypes or diseases have complex genetic etiologies. In such cases, each individual SNP may have a weak marginal effect or no effect on the disease, but a combination of some SNPs can synergistically contribute to the risk of the disease. This has brought forth the need for including epistatic (a.k.a. gene-gene) interactions into the statistical regression models. However, in the presence of hundreds of thousands of SNPs in GWAS, the number of interaction terms grows exponentially as the interaction order increases, leading to "large p, small n" scenarios with computational difficulties [12]. In these situations, exhaustive testing methods cannot scale to interactions of order higher than pairwise.
To overcome this exponential-explosion challenge, two-stage approaches have been developed to search for epistasis through dimension reduction and variable selection. Several penalized regression methods such as LASSO regression [22] and Elastic-Net [29] have been used in GWA studies, e.g. by [26] and [6] respectively, in a two-stage manner: screen the whole genome to get a small set of SNPs potentially having significant main effects on the disease, then apply variable selection to identify all significant SNPs' main and interaction effects from the small SNP set obtained from screening. Some other two-stage approaches focus more on the screening stage. For example, Fan and Lv [7] developed a dimension reduction method via correlation learning called Iterative Sure Independence Screening (ISIS). This method has been extended to GWAS, resulting in several enhanced versions of ISIS, such as GWASelect [10], EPISIS [23], and TS-SIS [14]. Nevertheless, most of these methods are only capable of analyzing effects involving a small number of selected SNPs that have strong marginal effects but weak interaction ones.
In recent decades, machine learning has been widely used to tackle high-dimensional data analysis problems, which inspires its application to GWAS. Among machine learning methods, random forest and association rule mining seem to be the two typical ones for identifying important SNP-phenotype associations in GWAS. Random forest, which is a supervised learning method, can rank the importance of each SNP in terms of its association with the phenotype in an ensemble of classification trees. An importance score for each SNP, e.g. based on the Gini index or cross-entropy, is defined through a loss function expressing the error in predicting the phenotype, and can be easily computed while growing the trees [4]. Since this importance score is calculated in the presence of the other variables in the model, it is particularly suitable for the filtering stage (stage 1) of the two-stage GWAS approaches.
As for association rule mining (ARM), it is an unsupervised machine learning method designed to search for important association rules made up of an arbitrary number of items [3]. Unlike the SNP-phenotype association terms identified by a statistical regression or supervised learning model, the SNP-phenotype association rules identified by ARM are formulated using concepts from set theory. Thus, the association rules and the association terms explain different types of SNP-phenotype associations, which need not coincide. This is welcome since they provide complementary information to GWAS.
Early ARM methods are computationally intensive even for mining a moderate-sized basket of items. Recently, several parallel and distributed computing techniques such as GARMS [1] and BPARES [2] have been applied to boost ARM, but it could still be hard to mine large-scale GWAS data due to the underlying combinatorial explosion in constructing rules. Qian et al. [17] proposed a Gibbs-sampling-induced stochastic search approach to mine the rules efficiently. However, this approach treats each item as a binary variable, which does not apply to ARM in GWAS because the SNPs there are multinomial variables. In this paper, we develop a new multinomial Gibbs-sampling-induced stochastic search method for ARM in GWAS, which scales well to large-scale GWAS data.
The rest of the paper is structured as follows. In Section 2 we introduce the format of GWAS data and the framework of ARM before developing the random (stochastic) search algorithm MultinomRSA. The core of this algorithm is a Gibbs sampler having multinomial marginal conditional distributions. The MultinomRSA algorithm is particularly suitable for mining SNPs-versus-phenotype association rules because each SNP item is better modelled by a multinomial random variable. In Section 3, we assess the performance of our method using simulation studies and a real-world CAD dataset.
Finally, in Section 4, we provide conclusions and an overview of potential areas for further research.

GWAS data
According to its genetic definition, a SNP can be numerically coded as the number of minor alleles at the corresponding pair of loci. The process of obtaining coded values for all observed SNPs is called genotyping. Normally there are two types of alleles in the human genome: the major allele and the minor allele. Also, there are two alleles dwelling at each locus. Therefore, each genotyped SNP variable has 3 possible values, 0, 1, and 2, representing the three states of the allele pair: homozygous recessive, heterozygous, and homozygous dominant. For example, if the minor allele is denoted T and the major A, then SNP = 0 if AA is observed, SNP = 1 if AT or TA is observed, and SNP = 2 if TT is observed at the loci.
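As a minimal illustration of this coding (assuming, as in the example above, a minor allele T and a major allele A; the helper name is ours, not from any genotyping library):

```python
def code_snp(genotype, minor="T"):
    """Return the SNP code: the number of minor alleles in a
    two-letter genotype string such as 'AT'."""
    return sum(1 for allele in genotype if allele == minor)

print(code_snp("AA"))  # homozygous major -> 0
print(code_snp("AT"))  # heterozygous    -> 1
print(code_snp("TT"))  # homozygous minor -> 2
```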
A real-world GWAS example to be presented in Section 3 uses a large portion of the PennCATH dataset, collected from a case-control observational study of CAD [20]. The PennCATH data consist of the genotyped values of 861,473 SNPs across 3850 individuals whose coronary artery disease (CAD) status and cardiovascular risk factor information are also available. If the phenotype CAD and the genotyped SNPs are regarded as a set of items, relations between CAD and the SNPs can be mined by ARM. To see this, we provide a brief review of ARM in the following.

Association rule mining framework
Association rule mining (ARM) was developed in [3] for mining interesting relations between items from transactions data recorded by point-of-sale systems in supermarkets. For example, the rule {ham, cheese} → {bread} found in a supermarket's sales data would suggest that a customer buying ham and cheese is likely to buy bread as well. Information on how frequently this rule is observed in the sales data can be used as the basis for certain marketing decisions made by the supermarket.
A common formulation of ARM is given in [9] and summarized here. Define I = {I_1, I_2, ..., I_m} as a set of m items called the item space, and D = {t_1, t_2, ..., t_L} as a list of transactions, where each transaction in D is a subset of items in I, i.e. t_l ⊂ I, l = 1, ..., L. An association rule is defined as an implication of the form X → Y, where X, Y ⊂ I and X ∩ Y = ∅. The sets of items (for short, itemsets) X and Y are called the antecedent and consequent of the rule, respectively. The support of an itemset X, supp(X), is defined as the proportion of transactions in D that contain X. The confidence of an association rule is defined as

conf(X → Y) = supp(X&Y) / supp(X),

where X&Y is the itemset obtained by amalgamating X and Y. The support of an itemset measures its commonness, and the confidence of an association rule measures its association strength. By the essential meaning of support, we can also define the support of a rule X → Y, which is just supp(X → Y) = supp(X&Y). In ARM we aim to find the rules that have high support, high confidence, or high values of some other properly defined metrics. Note that, for transactions data generated from the item space I, any itemset X containing k items of I can be equivalently expressed as a binary indicator vector J = (J_1, ..., J_m), where J_j = 1 if I_j ∈ X and J_j = 0 otherwise. It is easy to see that the number of distinct transactions that can be generated from I is 2^m − 1.
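These definitions translate directly into code. The following sketch computes support and confidence on a hypothetical toy basket list echoing the supermarket example above:

```python
def supp(itemset, transactions):
    """Proportion of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(antecedent, consequent, transactions):
    """conf(X -> Y) = supp(X & Y) / supp(X)."""
    joint = set(antecedent) | set(consequent)
    return supp(joint, transactions) / supp(antecedent, transactions)

baskets = [{"ham", "cheese", "bread"}, {"ham", "cheese"},
           {"bread", "milk"}, {"ham", "cheese", "bread", "milk"}]
print(supp({"ham", "cheese"}, baskets))             # 0.75
print(conf({"ham", "cheese"}, {"bread"}, baskets))  # 2/3: half the baskets
                                                    # contain all three items
```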
From an initial look, GWAS data do not resemble transactions data, so it seems ARM is not applicable to GWAS data mining. However, a SNP variable having 3 levels can be represented by 3 indicator variables J^(SNP)_0, J^(SNP)_1, J^(SNP)_2, where J^(SNP)_ℓ = 1 if the SNP is observed at level ℓ and J^(SNP)_ℓ = 0 otherwise, ℓ = 0, 1, 2. This implies that a SNP variable can be regarded as a set of 3 items I^(SNP)_0, I^(SNP)_1, I^(SNP)_2. Also, the phenotype variable naturally specifies 2 items, I_D and I_ND, corresponding to disease and no disease respectively. Hence, the observations of each individual in a GWAS dataset containing m SNP variables can be converted into a specific transaction consisting of m items from the 3m SNP items and one item from {I_D, I_ND}, such that each SNP variable contributes one and only one item to the transaction. Such a transaction can be equivalently represented by the binary indicators determined by all items in the transaction.
In order to represent not only a transaction but also any itemset of the m SNPs, we introduce an additional indicator J^(SNP)_no which equals 1 if the SNP is not inside the itemset and 0 otherwise. Now write J^(SNP) = (J^(SNP)_0, J^(SNP)_1, J^(SNP)_2, J^(SNP)_no) as a set of 4 binary indicators for a given SNP. Then an itemset I(SNPs1:k) containing k observations from k SNPs, denoted {SNP1, ..., SNPk}, can be represented as

I(SNPs1:k) = {I^(SNPj)_ℓ(j) : j = 1, ..., k},

where ℓ(j) is the observed value of SNPj, j = 1, ..., k. The corresponding indicator vector for I(SNPs1:k) is (J^(SNP1), ..., J^(SNPm)), in which there are m 1's and 3m 0's, with m − k of the 1's coming from indicators of the form J^(SNP)_no. It is easy to see that, for an item space of m SNPs and a phenotype, there are up to 2 × 3^m distinct transactions that can be observed in a GWAS dataset, whereas 4^m − 1 non-empty distinct itemsets can be generated from an item space of m SNPs. In this paper, we are interested in mining the SNPs-phenotype-induced association rules of the following forms:

(J^(SNP1), ..., J^(SNPm)) → I_D,    (1)
(J^(SNP1), ..., J^(SNPm)) → I_ND.    (2)
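The conversion of one individual's genotypes into a transaction can be sketched as follows, using the I_vu item labels adopted later for the simulations (I_vu denoting the item of SNP_v at level u − 1); the function name and label strings are illustrative assumptions:

```python
def gwas_transaction(snps, cad):
    """Convert one individual's genotypes (a list of 0/1/2 values) and
    CAD status into a transaction: one item per SNP plus one phenotype item."""
    items = {f"I{v + 1}{snps[v] + 1}" for v in range(len(snps))}  # I_vu, u = level + 1
    items.add("I_D" if cad else "I_ND")
    return items

# An individual with genotypes (0, 2, 1) who has the disease:
print(sorted(gwas_transaction([0, 2, 1], cad=True)))  # ['I11', 'I23', 'I32', 'I_D']
```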

ARM Processing and Metrics
Support and confidence are the key metrics for evaluating how "interesting" or "informative" an association rule X → Y is. But it is computationally infeasible to search for the most interesting rules based on supp(X → Y) and/or conf(X → Y) by a brute-force approach, even if the associated item space has a moderate number of items, because the search space has a cardinality of exponential order. Current approaches for tackling this difficulty use constrained search.
A typical such method is the Apriori algorithm [3], in which one sets thresholds t_supp, t_conf and t_len, respectively, for the support, confidence and length of the rules to be searched. Namely, one either searches for the rules of the highest support(s) subject to length(X → Y) ≤ t_len and conf(X → Y) ≥ t_conf, or searches for the rules of the highest confidence(s) subject to length(X → Y) ≤ t_len and supp(X → Y) ≥ t_supp.
Effectiveness and efficiency of such a constrained search method critically depend on the selection of t_supp, t_conf and t_len. Moreover, it is very difficult to scale such a search to ARM on a large item space.
In this paper, we propose instead a stochastic search approach for ARM to find the most "interesting" or "informative" rules. For this we need a different metric to measure the interestingness of an association rule. In Qian et al. [17] an importance measure is proposed for each association rule X → Y, defined as

imp(X → Y) = f(supp(X → Y), conf(X → Y)),

where f(·, ·) can be any positive function increasing in both supp(X → Y) and conf(X → Y). Here we choose

imp(X → Y) = supp(X → Y) × conf(X → Y)

as the importance of X → Y. This keeps the same virtue of favouring rules of high support and confidence in ARM. With the stochastic search method to be developed, we aim to find the association rules that have the highest importance values.
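With the choice f(s, c) = s · c, the importance of a rule can be computed directly from the transactions. A minimal sketch, reusing the hypothetical toy baskets from the supermarket example (redefined here so the snippet is self-contained):

```python
def importance(antecedent, consequent, transactions):
    """imp(X -> Y) = supp(X -> Y) * conf(X -> Y), with
    supp(X -> Y) = supp(X & Y) and conf(X -> Y) = supp(X & Y) / supp(X)."""
    n = len(transactions)
    joint = set(antecedent) | set(consequent)
    s_rule = sum(joint <= t for t in transactions) / n
    s_ante = sum(set(antecedent) <= t for t in transactions) / n
    return s_rule * (s_rule / s_ante)

baskets = [{"ham", "cheese", "bread"}, {"ham", "cheese"},
           {"bread", "milk"}, {"ham", "cheese", "bread", "milk"}]
print(importance({"ham", "cheese"}, {"bread"}, baskets))  # 0.5 * (0.5/0.75) = 1/3
```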

Stochastic Bernoulli Gibbs sampling ARM
For an item space (I_1, ..., I_m, I_{m+1}) in which I_{m+1} is designated as the consequent, let R denote the collection of all rules J_m → I_{m+1}, where J_m = (J_1, ..., J_m) is the indicator vector of a non-empty antecedent itemset formed from I_1, ..., I_m. All rules in R can be ranked according to their importance values. Denote by J_{m(k)} the kth-order antecedent, such that imp(J_{m(k)} → I_{m+1}) is the kth highest. When m is moderate or large, it is computationally infeasible to find the rules of the highest importance values in R by a brute-force search method because 2^m − 1, the cardinality of R, is of exponential order. On the other hand, consider a probability distribution on R defined by a softmax function, i.e.
P_ξ(J_m) = exp{ξ · imp(J_m → I_{m+1})} / Σ_{J'_m ∈ R} exp{ξ · imp(J'_m → I_{m+1})},    (4)

where ξ > 0 is a tuning parameter for adjusting the probability ratio between every two rules in R. It is easy to see that the rules in R having the kth highest importance value also have the kth largest probability defined in (4), for any ξ value. This implies that, if one can generate a random sample of rules from P_ξ(J_m), the rules having the highest importance values are more likely to appear in the sample, and to appear earlier in it, than the rules of lower importance. Moreover, when ξ is larger, the frequency of a high-importance rule appearing in the sample is even higher relative to that of a low-importance rule. Therefore, the problem of finding high-importance rules in R, which is computationally not scalable, can be solved by finding high-importance rules from the random samples generated from (4), which we will see to be computationally feasible. By the law of large numbers for binomial and multinomial distributions, the highest-importance rules in the generated samples converge to the highest-importance rules in R with probability 1, with the relevant approximation error being smaller than 1/√(sample size) with at least 95% probability. This suggests that a sample of polynomial-order size is sufficient to ensure the convergence, and hence that the computational complexity is of the same polynomial order, whereas that of the brute-force search is of exponential order.

Now the question is how to generate a random sample of rules from the probability distribution P_ξ(J_m) given by (4), which is a multivariate discrete distribution. This question is not trivial because the denominator in (4) involves 2^m − 1 terms and is intractable to compute even when m is in the upper tens. Such difficulty can be bypassed by applying Gibbs sampling, a Markov chain Monte Carlo (MCMC) algorithm that generates a Markov chain from a feasible transition probability matrix such that the stationary distribution of this Markov chain equals the target multivariate discrete distribution.
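On a small, fully enumerable rule space, the effect of ξ on P_ξ can be seen directly. The following sketch, with assumed toy importance values, shows how larger ξ concentrates the softmax mass on the top-importance rule:

```python
import math

def rule_probabilities(importances, xi):
    """Softmax distribution P_xi over a small, enumerable rule space."""
    weights = [math.exp(xi * imp) for imp in importances]
    total = sum(weights)
    return [w / total for w in weights]

# Assumed importance values for three rules, ranked 0.30 > 0.20 > 0.05.
imps = [0.30, 0.20, 0.05]
for xi in (1, 10, 50):
    print(xi, [round(p, 3) for p in rule_probabilities(imps, xi)])
```

The printed probabilities keep the same ranking as the importances for every ξ, but the mass on the top rule grows from near-uniform at ξ = 1 towards 1 at ξ = 50.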
The transition probability matrix involved in Gibbs sampling is determined by the product of all univariate conditional probability functions of P_ξ(J_m). It is easy to see that the conditional probability function of each J_j, given the remaining indicators J_{−j} = (J_1, ..., J_{j−1}, J_{j+1}, ..., J_m), is

P_ξ(J_j = 1 | J_{−j}) = exp{ξ · imp((J_{−j}, J_j = 1) → I_{m+1})} / [exp{ξ · imp((J_{−j}, J_j = 1) → I_{m+1})} + exp{ξ · imp((J_{−j}, J_j = 0) → I_{m+1})}],    (5)

which is a Bernoulli probability distribution not involving the intractable denominator of (4). To implement Gibbs sampling for generating rules from (4), we start from a transaction (J_m, I_{m+1}) randomly selected from the transactions dataset, use it to construct the initial rule, and then repeatedly draw from (5) to update each component J_j in turn until N rules have been generated. We name the method just described the stochastic Bernoulli Gibbs ARM algorithm.
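A compact sketch of the sampler just described, with imp = supp × conf and a tiny assumed transactions list; this is an illustration of the scheme, not the authors' implementation (initialization here uses a single random item rather than a random transaction):

```python
import math
import random

def imp(antecedent, consequent, transactions):
    """imp = supp(rule) * conf(rule); 0 if the antecedent is empty or unseen."""
    n = len(transactions)
    s_ante = sum(antecedent <= t for t in transactions) / n
    if not antecedent or s_ante == 0:
        return 0.0
    s_rule = sum((antecedent | {consequent}) <= t for t in transactions) / n
    return s_rule * s_rule / s_ante

def bernoulli_gibbs_arm(items, consequent, transactions, xi, n_rules, seed=0):
    """Bernoulli Gibbs-ARM sketch: one full sweep over the item
    indicators J_1..J_m per generated rule, each drawn from (5)."""
    rng = random.Random(seed)
    current = {rng.choice(items)}          # simple initial antecedent
    rules = []
    for _ in range(n_rules):
        for item in items:                 # update each indicator in turn
            w1 = math.exp(xi * imp(current | {item}, consequent, transactions))
            w0 = math.exp(xi * imp(current - {item}, consequent, transactions))
            if rng.random() < w1 / (w1 + w0):
                current.add(item)
            else:
                current.discard(item)
        rules.append(frozenset(current))
    return rules

# Toy data: item "A" strongly predicts "D"; item "B" never co-occurs with "D".
deals = [{"A", "D"}] * 8 + [{"B"}] * 2
rules = bernoulli_gibbs_arm(["A", "B"], "D", deals, xi=20, n_rules=50)
print(max(set(rules), key=rules.count))    # the dominant antecedent is {"A"}
```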
By the properties of Gibbs sampling, the corresponding N itemsets J_m constitute a Markov chain having the probability distribution (4) as its unique stationary distribution. Therefore, the most important rules in R can be determined (with probability 1) from the generated Markov chain, and there are at least three ways to identify them.
Firstly, the most important rules can be determined as the most frequent itemsets among the N generated J_m's. However, this could be ineffective if the frequency of each distinct itemset among them is very small, e.g. 1 or 2, which is very likely the case if N and ξ are not large enough.
The second method is to identify the most important rules among the N generated rules and use them to estimate the most important rules in R. By the ergodicity theorem for Markov chains, the most important sampled rules among the N generated ones converge to the most important population rules in R with probability 1, provided the underlying Markov chain has a finite state space and satisfies the detailed balance condition. The most important sampled rules can be easily identified by computing the importance values of all N generated rules. The second method is most effective if m is not very large and the generated Markov chain is sufficiently ergodic.
By the third method, we apply a two-step approach: first identify the most frequent items among the generated J_m's and use these items to constitute a new item space; then apply Gibbs sampling again to determine the most important rules from the collection of association rules given by the new item space. Since the cardinality of the new item space will mostly be smaller than that of the original item space, its most important rules can be identified more efficiently by the second-step Gibbs sampling or the Apriori algorithm. These most important rules will converge to the most important rules in R with probability 1, by the law of large numbers and the ergodicity theorem for Markov chains.
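The first step of this two-step approach amounts to counting item frequencies across the generated antecedents. A minimal sketch, with an assumed list of sampled rules:

```python
from collections import Counter

def top_items(rules, k):
    """Step 1 of the two-step approach: the k most frequent items across
    the generated antecedents, used to form a reduced item space."""
    counts = Counter(item for rule in rules for item in rule)
    return [item for item, _ in counts.most_common(k)]

# Hypothetical antecedents produced by a Gibbs-ARM run:
sampled = [frozenset({"I32"}), frozenset({"I32", "I22"}),
           frozenset({"I32"}), frozenset({"I22", "I11"})]
print(top_items(sampled, 2))  # ['I32', 'I22']
```

The reduced item space {I32, I22} would then be fed to a second-step Gibbs sampling or Apriori run.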
Rationales behind the aforementioned three ways of mining important rules are explained in detail in Qian and Zhao [18]. Finally, note that P_ξ(J_m) defined by (4) equals 0 if an itemset (J_m, I_{m+1}) is not observed in the transactions dataset D. Thus, P_ξ(J_m) is in effect defined on the collection of all itemsets of (I_1, ..., I_m) observed in D. Moreover, the Markov chain generated by Gibbs sampling via the conditional probability distribution (5) is still ergodic in the association rule space induced from D. In Section 3 we will use simulated data and a real-data case study to demonstrate the effectiveness and efficiency of Gibbs sampling for ARM.

Stochastic multinomial Gibbs ARM for GWAS
Now we focus on using Gibbs sampling to mine the SNPs-phenotype association rules of the form (J^(SNP1), ..., J^(SNPm)) → I_D given in (1). ARM for rules of the form (J^(SNP1), ..., J^(SNPm)) → I_ND given in (2) can be performed in the same way and is thus skipped.
Corresponding to (4), the target probability distribution for GWAS-ARM using Gibbs sampling is

P_{D,ξ}(J^(SNP1), ..., J^(SNPm)) = exp{ξ · imp((J^(SNP1), ..., J^(SNPm)) → I_D)} / Σ exp{ξ · imp((J'^(SNP1), ..., J'^(SNPm)) → I_D)},    (6)

where the sum in the denominator runs over all (J'^(SNP1), ..., J'^(SNPm)) with each J'^(SNPv) ∈ I_(4), and I_(4) is the collection of the 4 row vectors of the 4 × 4 identity matrix, giving the four possible row-vector values that each itemset block J'^(SNPv) can take.
Since imp((J^(SNP1), ..., J^(SNPm)) → I_D) equals 0 whenever the antecedent itemset is not observed in the underlying GWAS dataset, it follows that the domain D_SNP of the 4m-variate distribution function P_{D,ξ} of (6) is the collection of all itemsets of the antecedents of the rules (J^(SNP1), ..., J^(SNPm)) → I_D observed in the GWAS dataset. In other words, D_SNP ⊆ {I_(4)}^m.
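Under (6), each SNP's block (J0, J1, J2, Jno) is updated as a whole, so the single-block conditional is a 4-state multinomial rather than a Bernoulli. A sketch of one sweep, with a toy importance surrogate standing in for imp (all names and the surrogate are illustrative assumptions, not the paper's implementation):

```python
import math
import random

STATES = (0, 1, 2, "no")   # the 4 possible values of one SNP's indicator block

def multinomial_gibbs_sweep(current, snps, imp_fn, xi, rng):
    """One sweep of the multinomial Gibbs sampler: each SNP's block is
    resampled from its 4-state conditional, which never requires the
    intractable normalizing constant of the full distribution."""
    for v in snps:
        weights = []
        for s in STATES:
            proposal = dict(current)
            proposal[v] = s
            weights.append(math.exp(xi * imp_fn(proposal)))
        r, acc = rng.random() * sum(weights), 0.0
        for s, w in zip(STATES, weights):
            acc += w
            if r <= acc:
                current[v] = s
                break
    return current

# Toy importance surrogate: favour SNP3 observed at level 2
# with SNP1 and SNP2 absent from the antecedent.
toy_imp = lambda p: (0.4 * (p["SNP3"] == 2) + 0.2 * (p["SNP1"] == "no")
                     + 0.2 * (p["SNP2"] == "no"))

rng = random.Random(0)
state = {"SNP1": 0, "SNP2": 0, "SNP3": 0}
for _ in range(5):
    state = multinomial_gibbs_sweep(state, ["SNP1", "SNP2", "SNP3"],
                                    toy_imp, xi=60, rng=rng)
print(state)  # concentrates on the highest-importance antecedent
```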
It is natural that the SNPs used in calculating the phenotype response values will appear inside the antecedents (LHS) of the most important association rules of the corresponding transactions data. However, the SNPs-phenotype association specified by the logistic regression model is not the only important one that can be identified by association rule mining. We expect the most important association rules may contain other SNP items not used in calculating the phenotype response values.
In the following we will use I_vu (v = 1, ..., m and u = 1, 2, 3) to represent the item of SNP_v at level u − 1, and use I_D and I_ND for the positive and negative phenotype, respectively.

Simulation 1 In the first dataset, we start with a small number of SNPs to show that the algorithm can find the most important association rules, comparing them with the rules obtained from the Apriori algorithm.
Consider a dataset having 3 SNP variables with a binary phenotype response. The MAF of each SNP is set to 0.4 and the SNPs are assumed to be uncorrelated. In the logistic regression under an additive model, the intercept is set to −3, the coefficient of SNP3 is set to 5, and the coefficients of SNP1 and SNP2 are set to 0. Namely,

logit P(I_D) = −3 + 0 · SNP1 + 0 · SNP2 + 5 · SNP3.

Then, we use this genotype configuration to generate a sample of 50 cases and 50 controls, amounting to 100 transactions. In association rule mining, each of the SNPs is represented by 3 items corresponding to its 3 levels, giving 9 SNP items in total and therefore 4^3 − 1 = 63 possible rules with I_D (and respectively I_ND) as the consequent (RHS). Ignoring the multinomial constraint within the SNP items, there would be 2^9 − 1 = 511 possible rules instead.
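The data-generating step can be sketched as follows. Note this draws a cohort sample rather than the paper's exact 50-case/50-control design, so it illustrates the model rather than reproducing the simulation:

```python
import math
import random

def simulate_gwas(n, maf=0.4, seed=1):
    """Sketch of the Simulation-1 setup: 3 independent SNPs with MAF 0.4
    and a phenotype from the additive model logit P(D=1) = -3 + 5*SNP3."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # each SNP = number of minor alleles over two independent draws
        snps = [sum(rng.random() < maf for _ in range(2)) for _ in range(3)]
        p = 1 / (1 + math.exp(-(-3 + 5 * snps[2])))
        data.append((snps, int(rng.random() < p)))
    return data

sample = simulate_gwas(2000)
rate = lambda g: (sum(d for s, d in sample if s[2] == g)
                  / sum(1 for s, _ in sample if s[2] == g))
print(round(rate(0), 3), round(rate(2), 3))  # disease rate rises sharply with SNP3
```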
We first use the Apriori algorithm to find the association rules with supp ≥ 0.01 and conf ≥ 0.01. In the experiment, we define importance as the product of confidence and support. The Apriori algorithm found 12 rules, and the top 10 important ones are listed in column 4 of Table 1. Note that I_ij denotes the item of SNP_i at level j − 1. As expected, the top rules in the table are related to SNP3 (items I_3•). Specifically, they all contain the item I_32 of SNP3, and I_32 alone as the antecedent gives the most important rule. We then generated N = 1000 rules from the simulated dataset using either the stochastic multinomial Gibbs sampling ARM method or the Bernoulli Gibbs-ARM method, with the tuning parameter ξ = 10 or ξ = 20. Frequencies of the generated rules in each case are shown in Table 1, from which we see that both the multinomial and Bernoulli Gibbs-ARM methods identified the same top 10 important rules as the Apriori algorithm, with only very small differences in the order of the frequencies in each case.
As mentioned in Section 2.3, the tuning parameter ξ controls the ratio of sampling probabilities between each pair of rules generated by the Gibbs sampler. A low ξ means this ratio is close to 1, leading to more distinct rules being generated that are less distinguishable from each other in terms of importance. On the other hand, a high ξ leads to fewer distinct rules being generated and a tendency to get stuck at local maxima of relatively low importance. Therefore, the tuning parameter should be chosen carefully. Compared with ξ = 10, the algorithms with ξ = 20 tend to put higher frequencies on the more important rules, as expected from the sampling properties of our methods.
Simulation 2 The SNPs used in Simulation 1 are uncorrelated with each other, which might make it easy for the algorithms to find the SNPs associated with the phenotype. In the real world, SNPs are often correlated. So, in the second simulation, we use the same setup as Simulation 1 except that the SNPs have an autocorrelation structure with coefficient ρ = 0.7 (Figure 2). Namely, the correlation between SNP_i and SNP_j is

R(SNP_i, SNP_j) = 0.7^|i−j|, i, j ∈ {1, 2, 3}.

Table 1: The 10 most important rules found by Apriori (supp ≥ 0.01, conf ≥ 0.01), with their frequencies of appearance among the N = 1000 rules generated by the multinomial and Bernoulli Gibbs-ARM methods (Simulation 1).
For this example, the correlation matrix is

R = [ 1.00 0.70 0.49 ; 0.70 1.00 0.70 ; 0.49 0.70 1.00 ].

The results are shown in Table 2. Similar to Simulation 1, the algorithms successfully found all the important rules, even though the SNPs are correlated.
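Genotypes with this autocorrelation structure can be produced in several ways. One plausible sketch (an assumption, not necessarily the generator used in the paper) thresholds a latent AR(1) Gaussian per chromosome copy at Φ⁻¹(MAF), which induces roughly the intended decay of genotype correlation with |i − j|:

```python
import random
from statistics import NormalDist

def correlated_snps(n, m=3, rho=0.7, maf=0.4, seed=2):
    """Generate n individuals' genotypes for m SNPs with approximately
    AR(1)-decaying correlation: each chromosome copy carries a latent
    AR(1) Gaussian chain, and an allele is 'minor' when the latent value
    falls below Phi^{-1}(maf)."""
    rng = random.Random(seed)
    cut = NormalDist().inv_cdf(maf)
    sample = []
    for _ in range(n):
        z = [rng.gauss(0, 1), rng.gauss(0, 1)]      # one latent per chromosome copy
        snps = []
        for i in range(m):
            if i > 0:
                z = [rho * zj + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for zj in z]
            snps.append(sum(zj < cut for zj in z))  # number of minor alleles
        sample.append(snps)
    return sample
```

Because the latent correlation is ρ^|i−j| but the genotypes are thresholded counts, the genotype correlations come out somewhat below 0.7^|i−j| while preserving the decay pattern.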
Simulation 3 In the third simulation, we generate a more complex dataset to show the computational advantages of using our algorithm.
This dataset contains 100 SNPs with a correlation structure similar to that of Simulation 2 (Figure 3):

R(SNP_i, SNP_j) = 0.7^|i−j|, i, j ∈ {1, 2, ..., 100},    β = (−3, 1, 2, 3, 0, ..., 0).

In this example, the MAFs are still set to 0.4; in the logistic model, the intercept is −3 and the coefficients of the first 3 SNPs are 1, 2, and 3 respectively, with 0 for the other SNPs. So we expect the rules containing items from SNP1 to SNP3 to be important. Then, a sample of L = 500 transactions consisting of 250 cases and 250 controls is generated from this configuration.
When the Apriori algorithm is applied to the 250-case dataset, we find 216,084 rules satisfying supp ≥ 0.05 and conf ≥ 0.5. The top 10 most important ones among them are displayed in Table 3. From the table, we can see that the items I_12, I_22, I_32 appear frequently in the top rules. This is because SNP1, SNP2, and SNP3 have non-zero coefficients in the underlying data-generating logistic model and should have strong associations with the phenotype. Items of SNP4 and SNP5 also appear in the top 10 rules, as they are highly correlated with SNP2 and SNP3.
Then, both Gibbs-ARM algorithms are applied to the 250-case dataset, and the associated frequencies among the N = 100 generated rules are shown in Table 3. For a large item space, the tuning parameter ξ needs to be large in order to find the important rules efficiently. We therefore gradually increased ξ and found that, by choosing ξ = 70 and ξ = 250 for the multinomial and Bernoulli Gibbs-ARM methods respectively, they can find some important association rules even when only N = 100 rules are generated. Specifically, the multinomial Gibbs-ARM method found 3 of the top 10 rules among the N = 100 generated rules, where the top 1 rule (i.e. I_32 → I_D) has the highest frequency, 0.79. It is expected that more of the top 10 rules could be found among the generated rules, with higher frequencies, if ξ were set smaller; but the involved computation would be more intensive.
The Bernoulli Gibbs-ARM method did not perform as well here, in that only the top 1 rule could be found, with frequency 0.98, among the N = 100 generated rules. This is expected because the items of each SNP obey a multinomial constraint which the Bernoulli Gibbs-ARM method ignores.
Control Class For transactions having the control phenotype (I_ND), the same methods can be applied to find the most important association rules. For example, in Simulation 3 the phenotype was simulated from a logistic regression with coefficients 1, 2, and 3 for SNP1, SNP2, and SNP3 respectively, so we also expect strong associations between certain levels of these SNPs and I_ND. The top 10 rules found by Apriori and the Gibbs-ARM results are shown in Table 4. The results in Table 4 are similar to those in Table 3: the multinomial Gibbs-ARM algorithm found 7 of the top 10 important rules while the Bernoulli one found only 1.
Marginal frequencies From the rules explored by each of the three algorithms, the marginal item frequencies can also indicate the association strength between each SNP item and I_D or I_ND. Tables 5 and 6 list the top 10 frequent items for the case and control classes, respectively. For the case transactions, I_32 and I_22 are among the top 10 frequent items, while for the control transactions, I_31, I_21, I_11, and I_41 seem to have strong associations with I_ND.
From the 3 simulations, we have demonstrated that the multinomial Gibbs-ARM algorithm is able to find the most important rules in all the scenarios considered. In complex scenarios, e.g. Simulation 3, the multinomial Gibbs-ARM method explores a larger number of important rules than the Bernoulli Gibbs-ARM method.

Real-world Dataset
In a GWAS tutorial paper, Reed et al. [19] preprocessed the PennCATH dataset and performed GWAS by fitting an additive model for each SNP variable (with sex, age, and principal components included to correct for the effect of population substructure). In this section, we use the same dataset and follow the preprocessing procedure of that tutorial paper, but undertake the association rule mining with the stochastic multinomial Gibbs sampling based method, which allows us to discover the association relationships between the SNPs and the target disease in an efficient way.

Dataset Description
The PennCATH dataset [20] consists of the genotype information of 861,473 SNPs across 3850 individuals for GWAS of coronary artery disease (CAD) and cardiovascular risk factors.A set of 1401 individuals in the dataset were de-identified and selected to be used in the aforementioned tutorial paper: 933 are positive and 468 are negative for CAD.The dataset contains the clinical data, genotypes, and CAD status for each sample.

Preprocessing
In the tutorial paper, 6 preprocessing steps are applied to the PennCATH dataset before the association analysis, the first being to read the PLINK data into R. The population-substructure and SNP-imputation steps are irrelevant for Gibbs-sampling-based ARM; therefore, we only follow the first 4 steps, which filter the SNP variables and samples with R.
The program first converts the PLINK data (.bed, .bim, and .fam files) into an R object. Then it filters the SNPs based on their call rate and minor allele frequency (MAF), and filters the samples based on their call rate, heterozygosity, relatedness, and ancestry. Finally, it filters the SNPs based on Hardy-Weinberg equilibrium (HWE). This preprocessing results in 1401 samples (no individuals filtered out) with 656,890 SNP variables.
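The SNP-level part of this filtering is straightforward to sketch. The thresholds below are illustrative defaults, not the tutorial's exact values:

```python
def filter_snps(genotypes, call_rate_min=0.95, maf_min=0.01):
    """Keep SNP columns passing a call-rate filter (few missing calls,
    coded None) and a minor-allele-frequency filter. `genotypes` is an
    individuals-by-SNPs matrix with entries 0/1/2 or None."""
    keep = []
    n = len(genotypes)
    for j in range(len(genotypes[0])):
        calls = [row[j] for row in genotypes if row[j] is not None]
        if not calls or len(calls) / n < call_rate_min:
            continue
        maf = sum(calls) / (2 * len(calls))   # minor-allele count / allele count
        maf = min(maf, 1 - maf)               # frequency of the rarer allele
        if maf >= maf_min:
            keep.append(j)
    return keep

# SNP 0 passes; SNP 1 has too many missing calls; SNP 2 is monomorphic.
geno = [[0, None, 0], [1, None, 0], [2, 1, 0], [1, 0, 0]]
print(filter_snps(geno))  # [0]
```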

Reducing item space
To speed up computation, further preprocessing can be done to reduce the item space. This involves two steps: selecting a range of SNPs potentially related to CAD based on historical analysis results, and filtering out SNPs with low support.

Chromosome 9p21
Research on GWAS of CAD has achieved substantial progress, indicating significant associations between SNPs on Chromosome 9p21 and CAD [11,21,5]. In this paper, we use the SNPs on Chromosome 9p21 and expect to find strong association rules there. According to Genome Reference Consortium GRCh37 (hg19), SNPs on Chromosome 9p21 have positions from 19,900,000 to 33,200,000 on Chromosome 9. This information can be used to extract the relevant SNPs with R, and it gives us 3758 SNP variables.
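The position filter itself amounts to a range query over the SNP map; a minimal sketch in Python, where the rsIDs and coordinates are invented for illustration:

```python
# Illustrative SNP records: (chromosome, rsID, base-pair position);
# the rsIDs and positions below are made up for this sketch.
snps = [
    (9, "rsA", 19_950_000),
    (9, "rsB", 33_100_000),
    (9, "rsC", 40_000_000),   # chromosome 9 but outside 9p21
    (10, "rsD", 25_000_000),  # position in range, wrong chromosome
]

# Chromosome 9p21 spans positions 19,900,000-33,200,000 on chromosome 9 (GRCh37).
def in_9p21(chrom, pos):
    return chrom == 9 and 19_900_000 <= pos <= 33_200_000

region = [rsid for chrom, rsid, pos in snps if in_9p21(chrom, pos)]
```
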
Filtering SNPs
As the support of a rule can never exceed the minimum support of its contained items, if we want to generate rules with support ≥ t_supp, we can first filter out the items with support lower than t_supp. Since each rule from the GWAS transactions data is of the form given in (1) and (2), we only need to filter out the single-SNP itemsets whose support is below t_supp. Table 7 lists the number of SNPs remaining after filtering out the SNP itemsets with support less than various t_supp values. It suggests t_supp = 0.4 and 0.3 are good choices for the consequent cases CAD = 1 and CAD = 0 respectively, which leaves 1948 SNPs for CAD = 1 and 1950 SNPs for CAD = 0, a roughly 50% reduction from 3758 SNPs.
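This filter follows directly from the anti-monotonicity of support; a minimal Python sketch, with an assumed (SNP, level) encoding of the items:

```python
from collections import Counter

def filter_items(transactions, t_supp):
    """Keep only items whose support reaches t_supp: by anti-monotonicity,
    no rule containing a dropped item can have support >= t_supp."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    return {item for item, c in counts.items() if c / n >= t_supp}

# Items here are (SNP, level) pairs, e.g. ("rs1", 2) -- an assumed encoding.
transactions = [
    {("rs1", 2), ("rs2", 1)},
    {("rs1", 2)},
    {("rs1", 2), ("rs3", 0)},
    {("rs2", 1)},
]
kept = filter_items(transactions, t_supp=0.5)
```
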

Tuning parameter
In our experiments, the importance of an association rule is defined as the product of its support and confidence. To mine the most important association rules from the two aforementioned transaction datasets (for the CAD = 1 and CAD = 0 cases), we used the multinomial Gibbs-ARM algorithm to generate 100 rules for each case under various values of the tuning parameter ξ. Results on the number of distinct rules generated in the CAD = 1 case are summarized in Table 8, from which we can see the effect of ξ on rule generation.
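Concretely, for a rule X → CAD = 1, the support is the fraction of transactions containing both X and CAD = 1, the confidence is that fraction divided by the support of X, and the importance is their product. A small sketch with toy transactions:

```python
def rule_importance(transactions, antecedent, consequent):
    """Importance of the rule antecedent -> consequent, defined as
    support * confidence, as in our experiments."""
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if antecedent <= t and consequent in t)
    support = n_both / n
    confidence = n_both / n_ante if n_ante else 0.0
    return support * confidence

# Toy data: 4 individuals; items are strings for brevity.
transactions = [
    {"rs1=2", "CAD=1"},
    {"rs1=2"},
    {"rs1=2", "CAD=1"},
    {"rs2=1"},
]
score = rule_importance(transactions, {"rs1=2"}, "CAD=1")  # 0.5 * (2/3)
```
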
Take the results of the positive cases with CAD = 1 as an example. The ξ values fall into three groups: possibly too low, possibly too high, and seemingly appropriate. For ξ = 300 and ξ = 500, the tuning parameter is probably too small, so the algorithm almost always generates different rules at different times. On the other hand, with ξ = 700 and ξ = 1000 it only generates 11 and 2 distinct rules respectively, suggesting that the tuning parameter value is probably too high, so the rule generation process tends to be trapped in a neighbourhood of locally high-importance rules. If we are interested in finding the top k rules instead of only the top 1, we need the multinomial Gibbs-ARM algorithm to generate more distinct rules. For the current datasets, ξ = 550 or ξ = 600 might be the best options, although the final choice depends on the needs. They give 40 and 18 distinct rules respectively, implying the algorithm can find some high-importance rules from the generated samples without getting stuck in the neighbourhood of a local maximum. For the rest of the experiments, ξ = 550 is chosen to generate rules for the CAD = 1 cases.
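The role of ξ can be illustrated with a toy sampler that draws a rule with probability proportional to exp(ξ · importance). This exponential weighting is an assumption made for illustration, not the paper's exact full conditional, but it reproduces the qualitative behaviour: small ξ makes the draws nearly uniform, while large ξ concentrates them on the few top rules.

```python
import math
import random

def sample_index(importances, xi, rng):
    """Draw an index with probability proportional to exp(xi * importance);
    an assumed weighting used only to illustrate the effect of xi."""
    weights = [math.exp(xi * v) for v in importances]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(weights) - 1

rng = random.Random(0)
imps = [0.30, 0.29, 0.05, 0.01]                             # toy importances
low = [sample_index(imps, 1, rng) for _ in range(1000)]     # nearly uniform
high = [sample_index(imps, 100, rng) for _ in range(1000)]  # top rules only
```

With ξ = 1 all four toy rules keep appearing; with ξ = 100 essentially only the two best do, mirroring the drop from many distinct rules to very few as ξ grows.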

Rules associated with negative class
For the negative class CAD = 0, we tried ξ = 550 to generate 100 rules using the multinomial Gibbs-ARM algorithm and found that they were all distinct from each other and had very low importance values. Thus, ξ = 550 did not help us find important rules. By running more tests and increasing ξ up to 2500, the algorithm was finally able to distinguish some interesting rules from the rest, ending up with 30 distinct rules. The top 10 important rules are listed in Table 10. From the table, we can see that the importance values of the negative-class rules are much lower than those of the positive-class rules. This is because of the imbalanced labels in the dataset: there are only 468 negative (CAD = 0) transactions/individuals, so the support of any negative rule cannot exceed 468/1401 = 0.3340. The confidence of the negative rules is also lower than that of the positive ones, which means the SNPs generally have weak associations with CAD = 0. Nevertheless, our algorithm is still capable of finding rules with a relatively high support, 0.3291, which is only slightly lower than the largest possible value, 0.3340. Also, 46% of the generated rules are among the top 10 important rules in the dataset, and the most important rule has the highest frequency. This demonstrates the good performance of the algorithm for the negative class as well.

Item frequencies
The top 10 marginal frequencies of the items generated by the multinomial Gibbs-ARM sampling algorithm for both the CAD = 1 and CAD = 0 cases are listed in Table 11. As in the simulation examples, the marginal frequencies of the items can be used to assess the associations between the items and the disease. In addition, the items with high frequencies in the generated samples provide a reduced item space in which to find important rules with the Apriori algorithm. For example, for the CAD = 1 class, there are 20 items appearing more than once in the generated samples; these items, constituting a new item space, can be used to find association rules with the Apriori algorithm. The Gibbs-ARM sampling method is used here largely to reduce the itemset space, so that the computational cost of applying Apriori to the new itemset space is also reduced.
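The reduction can be sketched as counting marginal item frequencies over the sampled rules and keeping only the recurrent items; the representation of a rule by its antecedent itemset below is an assumption for the sketch.

```python
from collections import Counter

def reduced_item_space(antecedents, min_count=2):
    """Count how often each item occurs across the sampled rule antecedents
    and keep those seen at least min_count times; the survivors form the
    smaller item space subsequently handed to the Apriori algorithm."""
    counts = Counter(item for ante in antecedents for item in ante)
    return {item for item, c in counts.items() if c >= min_count}

# Toy antecedents of sampled rules (all with consequent CAD = 1).
antecedents = [{"rs1=2", "rs2=2"}, {"rs1=2"}, {"rs3=2"}, {"rs1=2", "rs3=2"}]
items = reduced_item_space(antecedents)
```
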

Conclusion
In this paper, we proposed a stochastic multinomial Gibbs sampling based association rule mining algorithm to analyze transactions data obtained from GWAS. In the experiments, we tested the algorithm on 3 simulated datasets and a real-world dataset, and it showed good performance in all our experiments and the case study. In the real-world example, we considered 1948 SNPs (5844 items) for the positive class and 1950 SNPs (5850 items) for the negative class, significantly more than the 366 SNPs in the dataset used by Qian et al. [17]. This demonstrates the capacity of the algorithm in big data ARM scenarios. Also, the multinomial Gibbs-ARM sampling algorithm substantially extends the capacity of the Bernoulli Gibbs-ARM algorithm, in that the former is capable of incorporating the constraints in the itemsets to more accurately and efficiently generate a Markov chain of the rules. The application of our method is not limited to finding the important SNPs related to a certain disease in GWAS; it can be used to find associations between various combinations of categorical variables with different numbers of levels and a certain level of a response variable in any field of study.
J_s^{(0)}, with s = 2, …, m. This ends up with the rule J_m^{(1)} → I_{m+1}, an update of J_m^{(0)} → I_{m+1}. This procedure is repeated sequentially for N times to get the N rules {J

Fig. 1
Sample correlations of the 3 SNPs in Simulation 1

Fig. 2
Sample correlations of the SNPs in Simulation 2

Fig. 3
Sample correlations of the SNPs in Simulation 3

Table 2
The 10 most important rules found by Apriori (t_supp ≥ 0.05, t_conf ≥ 0.5) with their frequencies of appearing in the N = 1000 rules generated by the multinomial and Bernoulli Gibbs-ARM methods (Simulation 2).

Table 3
The 10 most important rules found by Apriori (t_supp ≥ 0.05, t_conf ≥ 0.5) with their frequencies of appearing in the N = 100 rules generated by the multinomial and Bernoulli Gibbs-ARM methods (Simulation 3 cases).

Table 4
The 10 most important rules found by Apriori (t_supp ≥ 0.05, t_conf ≥ 0.5) with their frequencies of appearing in the N = 100 rules generated by the multinomial and Bernoulli Gibbs-ARM methods (Simulation 3 controls).

Table 5
Marginal frequencies of the 10 most frequent items identified by each algorithm (Simulation 3 cases)

Table 6
Marginal frequencies of the 10 most frequent items identified by each algorithm (Simulation 3 controls)

Table 7
Numbers of remaining SNPs for each consequent case after filtering with different t_supp values

Table 8
Number of distinct rules generated with different ξ for CAD = 1

Rule generation
Table 9 displays the top 10 important rules generated with ξ = 550 for the CAD = 1 cases. From the table, we can see that the frequencies of these rules are ordered similarly to their importance values, and the rule with the highest importance coincides with the most frequent rule in the sample: (rs41474551 = 2) → I_{CAD=1}. Moreover, all the SNPs we found in the rules are at level 2, which means that individuals having 2 minor alleles at those important SNPs may have a higher risk of the disease. This aligns with what we expected from the genetic point of view, since minor alleles are deemed more likely to be the risk alleles in GWAS. Therefore, we can conclude that the multinomial Gibbs-ARM algorithm worked well by confirming this biological understanding.

Table 11
Top 10 marginal frequencies of the SNP items generated by the multinomial Gibbs-ARM sampling method for both cases.