Testing Joint Conditional Independence of Categorical Random Variables with a Standard Log-Likelihood Ratio Test

While tests for pairwise conditional independence of random variables have been devised, testing joint conditional independence of several random variables remains a challenge in general. Restriction to categorical random variables implies in particular that their common distribution may initially be thought of as a contingency table, and then in terms of a log-linear model. Thus, the Hammersley-Clifford theorem applies, and provides insight into the factorization of the log-linear model corresponding to assumptions of independence or conditional independence. Such assumptions simplify the full joint log-linear model, and in turn any conditional distribution. If the log-linear model corresponding to the assumption of joint conditional independence given the conditioning variable is not sufficiently large to explain some data according to a standard log-likelihood ratio test, its null-hypothesis of joint conditional independence may be rejected with respect to some significance level. Enlarging the log-linear model by product terms of variables and running the log-likelihood ratio test on different models may provide insight into which variables lack conditional independence. Since the joint distribution determines any conditional distribution, the series of tests eventually indicates which variables and product terms a proper logistic regression model should comprise.

• Conditionally independent random variables are conditionally uncorrelated.
• Conditionally independent random variables may be significantly correlated or not.
• Independence does not imply conditional independence, and vice versa.
• Pairwise conditional independence does not imply joint conditional independence.
Weak conditional independence of random variables was introduced in Wong and Butz (1999), and elaborated on in Butz and Sanscartier (2002). Extended conditional independence has recently been introduced in Constantinou and Dawid (2015). The definition of weak conditional independence given in Cheng (2015) refers to conditionally independent random events, and rephrases conditional independence in terms of ratios of conditional probabilities rather than conditional probabilities themselves, to avoid the distinction between conditional independence given a conditioning event and given its complement. This definition becomes irrelevant when proceeding from the elementary probability of events to the probability of random variables, and to the general definition of conditionally independent random variables.
Conditional independence is an issue in a Bayesian approach to estimating posterior (conditional) probabilities of a dichotomous random target variable in terms of weights of evidence (Good 1950, 1960, 1985). In turn, conditional independence is the major mathematical assumption of potential modeling with weights of evidence, cf. (Bonham-Carter et al. 1989; Agterberg and Cheng 2002; Schaeben 2014b), e.g., applied to prospectivity modeling of mineral deposits. The method requires a training dataset laid out in regular cells (pixels, voxels) of equal physical size representing the support of the probabilities. Under conditional independence, the sum of posterior probabilities over all cells equals the sum of the target variable over all cells. Deviations indicate a violation of the assumption of conditional independence, and are used as the statistic of a test (Agterberg and Cheng 2002) which involves a normality assumption. Oddly enough, ArcSDM calculates so-called normalized probabilities, i.e., posterior probabilities rescaled so that this overall measure of conditional independence is satisfied (ESRI 2018); of course, the trick does not fix any problem. Violation of the assumption of conditional independence does not only corrupt the posterior (conditional) probabilities estimated with weights of evidence, but also their ranks, cf. (Schaeben 2014b), which is worse. Thus, the method of weights of evidence requires the mathematical modeling assumption of conditional independence to yield reasonable predictions. However, conditional independence is an issue with respect to logistic regression, too.

From Contingency Tables to Log-Linear Models
A comprehensive exposure of log-linear models is Christensen (1997). Let Z be a random vector of categorical random variables Z_ℓ, ℓ = 0, …, m, i.e., Z = (Z_0, Z_1, …, Z_m). It is completely characterized by its distribution p_k = P(Z_0 = s_{0 k_0}, …, Z_m = s_{m k_m}) with the multi-index k = (k_0, …, k_m), where s_{ℓ k_ℓ} with k_ℓ = 1, …, K_ℓ denotes all possible categories of the categorical random variable Z_ℓ. The distribution of a categorical random vector may initially be thought of as being provided by contingency tables. More conveniently, the distribution of a categorical random vector Z can generally be written in terms of a log-linear model as

log p_{k_0 k_1 … k_m} = u_∅ + ∑_{ℓ=0}^m u_ℓ(k_ℓ) + ∑_{ℓ<ℓ′} u_{ℓℓ′}(k_ℓ, k_{ℓ′}) + … + u_{01…m}(k_0, k_1, …, k_m),   (3.3)

comprising an overall term, single-variable terms, and product terms of all orders.
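As a numerical illustration, the log-linear representation can be verified directly for a small contingency table: an ANOVA-like decomposition of the log-probabilities into an overall term, main-effect terms, and an interaction term reproduces the table exactly. The 2 × 2 table below is made up for illustration; this is a sketch of the form of the decomposition, not part of the original example.

```python
import math

# Hypothetical 2x2 joint pmf of two dichotomous variables (positive entries).
p = [[0.30, 0.20],
     [0.15, 0.35]]

lp = [[math.log(p[i][j]) for j in range(2)] for i in range(2)]

# ANOVA-like decomposition of the log-probabilities into u-terms;
# the u-terms satisfy the usual sum-to-zero constraints by construction.
u = sum(lp[i][j] for i in range(2) for j in range(2)) / 4.0             # overall term
u1 = [sum(lp[i]) / 2.0 - u for i in range(2)]                           # main effect, variable 1
u2 = [sum(lp[i][j] for i in range(2)) / 2.0 - u for j in range(2)]      # main effect, variable 2
u12 = [[lp[i][j] - u - u1[i] - u2[j] for j in range(2)]                 # interaction term
       for i in range(2)]

# The full log-linear model reproduces the table exactly.
for i in range(2):
    for j in range(2):
        assert abs(lp[i][j] - (u + u1[i] + u2[j] + u12[i][j])) < 1e-12
```

For larger tables the same decomposition applies category-wise; the full model is always saturated, i.e., it reproduces any positive table.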

Independence, Conditional Independence of Random Variables
If the random variables Z_ℓ, ℓ = 1, …, m, are independent, then the joint probability of any subset of random variables can be factorized into the product of the individual probabilities, i.e.,

P(⋂_{ℓ∈M} {Z_ℓ = z_ℓ}) = ∏_{ℓ∈M} P(Z_ℓ = z_ℓ),

where M denotes any non-empty subset of the set {1, …, m}. In particular,

P(Z_1 = z_1, …, Z_m = z_m) = ∏_{ℓ=1}^m P(Z_ℓ = z_ℓ).

If the random variables Z_ℓ, ℓ = 1, …, m, are conditionally independent given Z_0, then the joint conditional probability of any subset of random variables given Z_0 can be factorized into the product of the individual conditional probabilities, i.e.,

P(⋂_{ℓ∈M} {Z_ℓ = z_ℓ} | Z_0) = ∏_{ℓ∈M} P(Z_ℓ = z_ℓ | Z_0),   (3.1)

and in particular

P(Z_1 = z_1, …, Z_m = z_m | Z_0) = ∏_{ℓ=1}^m P(Z_ℓ = z_ℓ | Z_0).
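The factorization of Eq. (3.1) can be checked numerically: constructing a joint distribution that is conditionally independent given Z_0 by design, the joint conditional probabilities factor into the products of the individual conditional probabilities. All probabilities below are made up for illustration.

```python
import itertools

# Construct a joint pmf of (Z0, Z1, Z2) that is conditionally independent
# given Z0 by design: p(z0, z1, z2) = P(Z0=z0) P(Z1=z1|z0) P(Z2=z2|z0).
p0 = {0: 0.6, 1: 0.4}
c1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # P(Z1 | Z0)
c2 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}   # P(Z2 | Z0)

joint = {(z0, z1, z2): p0[z0] * c1[z0][z1] * c2[z0][z2]
         for z0, z1, z2 in itertools.product((0, 1), repeat=3)}

def cond(event, given_z0):
    """P(event(Z1, Z2) | Z0 = given_z0) computed from the joint pmf."""
    num = sum(p for (z0, z1, z2), p in joint.items()
              if z0 == given_z0 and event(z1, z2))
    return num / p0[given_z0]

# Check Eq. (3.1): the joint conditional probability factorizes.
for z0 in (0, 1):
    lhs = cond(lambda z1, z2: z1 == 1 and z2 == 1, z0)
    rhs = cond(lambda z1, z2: z1 == 1, z0) * cond(lambda z1, z2: z2 == 1, z0)
    assert abs(lhs - rhs) < 1e-12
```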

Logistic Regression, and Its Special Case of Weights-of-Evidence
The conditional expectation of a dichotomous random target variable Z_0 given an m-variate random predictor vector Z = (Z_1, …, Z_m) is equal to a conditional probability, i.e., E(Z_0 | Z) = P(Z_0 = 1 | Z).
Then the ordinary logistic regression model (without interaction terms), neglecting the error term, yields

logit P(Z_0 = 1 | Z) = β_0 + ∑_{ℓ=1}^m β_ℓ Z_ℓ.

It can be rewritten in terms of a probability as

P(Z_0 = 1 | Z) = Λ(β_0 + ∑_{ℓ=1}^m β_ℓ Z_ℓ),

where Λ(x) = 1/(1 + exp(−x)) denotes the logistic function. The logistic regression model with interaction terms reads in terms of a logit-transformed probability

logit P(Z_0 = 1 | Z) = β_0 + ∑_{ℓ=1}^m β_ℓ Z_ℓ + ∑_{1≤ℓ<ℓ′≤m} β_{ℓℓ′} Z_ℓ Z_{ℓ′} + …,

and in terms of a probability

P(Z_0 = 1 | Z) = Λ(β_0 + ∑_{ℓ=1}^m β_ℓ Z_ℓ + ∑_{1≤ℓ<ℓ′≤m} β_{ℓℓ′} Z_ℓ Z_{ℓ′} + …).

If all predictor variables are dichotomous variables and conditionally independent given the target variable, then the parameters of the ordinary logistic regression model simplify to

β_ℓ = W_ℓ^(1) − W_ℓ^(0), ℓ = 1, …, m,

defined as differences of the weights of evidence

W_ℓ^(1) = log ( P(Z_ℓ = 1 | Z_0 = 1) / P(Z_ℓ = 1 | Z_0 = 0) ), W_ℓ^(0) = log ( P(Z_ℓ = 0 | Z_0 = 1) / P(Z_ℓ = 0 | Z_0 = 0) ),

and

β_0 = logit P(Z_0 = 1) + W^(0) with W^(0) = ∑_{ℓ=1}^m W_ℓ^(0),

provided all conditional probabilities are different from 0 (Schaeben 2014b). Obviously the model parameters become independent of one another, and can be estimated by mere counting. This special case of a logistic regression model is usually referred to as the method of "weights-of-evidence". In turn, the canonical generalization of Bayesian weights-of-evidence is logistic regression.
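Under these assumptions the weights, and hence the logistic regression parameters, can indeed be estimated by mere counting. A minimal sketch in Python; the training data below are made up for illustration, not taken from any case study.

```python
import math

# Hypothetical training data: rows are (z0, z1, z2) realizations of a
# dichotomous target Z0 and two dichotomous predictors Z1, Z2.
data = [(1,1,1)]*8 + [(1,1,0)]*4 + [(1,0,1)]*2 + [(1,0,0)]*1 \
     + [(0,1,1)]*3 + [(0,1,0)]*6 + [(0,0,1)]*4 + [(0,0,0)]*12

def weights(col):
    """Positive and negative weight of evidence of predictor `col` by counting."""
    n1 = sum(1 for r in data if r[0] == 1)
    n0 = len(data) - n1
    p1 = sum(1 for r in data if r[0] == 1 and r[col] == 1) / n1  # P(Z=1 | Z0=1)
    p0 = sum(1 for r in data if r[0] == 0 and r[col] == 1) / n0  # P(Z=1 | Z0=0)
    w_pos = math.log(p1 / p0)                  # W^(1)
    w_neg = math.log((1 - p1) / (1 - p0))      # W^(0)
    return w_pos, w_neg

w1_pos, w1_neg = weights(1)
beta1 = w1_pos - w1_neg   # logistic-regression slope under conditional independence
```

The difference W^(1) − W^(0) obtained this way equals the log odds ratio of the 2 × 2 table of the predictor against the target.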
That the weights of evidence W_ℓ agree with the logistic regression parameters in case of joint conditional independence becomes obvious when recalling

W_ℓ^(1) − W_ℓ^(0) = log ( P(Z_ℓ = 1 | Z_0 = 1) P(Z_ℓ = 0 | Z_0 = 0) / ( P(Z_ℓ = 1 | Z_0 = 0) P(Z_ℓ = 0 | Z_0 = 1) ) ),

which is the log odds ratio, the usual interpretation of β_ℓ (Hosmer and Lemeshow 2000).
If Z comprises m dichotomous predictor variables Z_ℓ, ℓ = 1, …, m, there are 2^m possible different realizations z_k, k = 1, …, 2^m, of Z. Then

P(Z_0 = 1) = ∑_{k=1}^{2^m} P(Z_0 = 1 | Z = z_k) P(Z = z_k),

where the last equation is an application of the formula of total probability. It is a constitutive equation for estimating the parameters of a logistic regression model and always holds for fitted logistic regression models. With respect to weights-of-evidence, the test statistic of the so-called "new omnibus test" of conditional independence (Agterberg and Cheng 2002) compares the sum of the predicted posterior probabilities over all cells of the training dataset with the observed number of cells with Z_0 = 1; their standardized difference should not be too large for conditional independence to be reasonably assumed.
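The total-probability identity holds for any joint distribution; a quick numerical check on a randomly generated, hence positive, joint pmf of a target and two dichotomous predictors (the distribution is purely illustrative):

```python
import itertools
import random

# Random positive joint pmf of (Z0, Z1, Z2), normalized to sum to one.
random.seed(0)
raw = {c: random.random() for c in itertools.product((0, 1), repeat=3)}
s = sum(raw.values())
p = {c: v / s for c, v in raw.items()}

# Marginal probability P(Z0 = 1).
pz0 = sum(v for c, v in p.items() if c[0] == 1)

# Total probability: sum over the realizations z_k of the predictor vector
# of P(Z0 = 1 | Z = z_k) P(Z = z_k).
total = 0.0
for z in itertools.product((0, 1), repeat=2):
    pz = p[(0,) + z] + p[(1,) + z]          # P(Z = z)
    total += (p[(1,) + z] / pz) * pz        # P(Z0=1 | Z=z) P(Z=z)

assert abs(total - pz0) < 1e-12
```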

Hammersley-Clifford Theorem
Rephrasing the proper statement (Lauritzen 1996) casually, the Hammersley-Clifford theorem states that a probability distribution with a positive density satisfies one of the Markov properties with respect to an undirected graph G if and only if its density can be factorized over the cliques of the graph. Since the distribution of a categorical random vector can be represented in terms of a log-linear model, the Hammersley-Clifford theorem applies. Given (m + 1) random variables Z_0, …, Z_m, there is a total of (m + 1 choose n) product terms of n variables each, n = 1, …, m + 1. Thus there is a total of (m + 1) single-variable terms, and a total of 2^{m+1} − (m + 2) multi-variable terms.
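The term counts follow from elementary combinatorics and can be checked directly:

```python
from math import comb

# Count the u-terms of the full log-linear model for (m + 1) variables:
# comb(m + 1, n) product terms of n variables each, n = 1, ..., m + 1.
m = 4
total = sum(comb(m + 1, n) for n in range(1, m + 2))   # all non-empty subsets
single = comb(m + 1, 1)                                # single-variable terms
multi = total - single                                 # multi-variable terms

assert total == 2 ** (m + 1) - 1
assert single == m + 1
assert multi == 2 ** (m + 1) - (m + 2)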
Assumptions of independence or conditional independence simplify the distribution of Z, i.e., its full log-linear model, considerably. Assuming independence of all its components Z_ℓ, ℓ = 0, …, m, the log-linear model simplifies according to Eq. (3.1) to

log p_{k_0 k_1 … k_m} = ∑_{ℓ=0}^m u_ℓ(k_ℓ),   (3.4)

where u_ℓ(k_ℓ) = log p_{ℓ k_ℓ}. Assuming joint conditional independence of all components Z_ℓ, ℓ = 1, …, m, given Z_0, the log-linear model, Eq. (3.3), simplifies according to Eq. (3.1) to

log p_{k_0 k_1 … k_m} = u_0(k_0) + ∑_{ℓ=1}^m ( u_ℓ(k_ℓ) + u_{0ℓ}(k_0, k_ℓ) ).   (3.5)

Thus the latter model, Eq. (3.5), assuming conditional independence differs from the model for independence, Eq. (3.4), by the additional product terms Z_0 ⊗ Z_ℓ, ℓ = 1, …, m. Any violation of joint conditional independence given Z_0 results in additional cliques of the graph and in additional product terms. Assuming that conditional independence given Z_0 does not hold for a particular subset Z_1, …, Z_k of variables results in an enlargement of the log-linear model of Eq. (3.5) by additional terms referring to products of these variables.

Testing Joint Conditional Independence of Categorical Random Variables
The null-hypothesis is that a given log-linear model is sufficiently large to represent the joint distribution. If the random variables are categorical, the full log-linear model is always sufficiently large, as was explicitly shown above. More interesting are tests whether a smaller log-linear model is sufficiently large. Testing the null-hypothesis whether a log-linear model encompassing one-variable and two-variable terms, all of the latter involving Z_0, is sufficiently large provides a test of conditional independence of all Z_ℓ, ℓ = 1, …, m, given Z_0, because this log-linear model is sufficiently large in case of conditional independence given Z_0. Thus, a reasonable rejection of the initial null-hypothesis implies a reasonable rejection of the assumption of conditional independence given Z_0.
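For m = 2 dichotomous predictors this log-likelihood ratio test can be sketched in a few lines, because the maximum-likelihood expected cell counts of the conditional-independence model have a closed form; with more variables or other models, iterative proportional fitting would be needed. The counts below are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical training data as (z0, z1, z2) triples.
data = [(1,1,1)]*8 + [(1,1,0)]*4 + [(1,0,1)]*2 + [(1,0,0)]*1 \
     + [(0,1,1)]*3 + [(0,1,0)]*6 + [(0,0,1)]*4 + [(0,0,0)]*12

obs = Counter(data)
n_z0 = Counter(r[0] for r in data)
n_01 = Counter((r[0], r[1]) for r in data)
n_02 = Counter((r[0], r[2]) for r in data)

# Expected counts under conditional independence of Z1, Z2 given Z0:
# E(z0, z1, z2) = n(z0, z1, .) * n(z0, ., z2) / n(z0, ., .)
g2 = 0.0
for z0 in (0, 1):
    for z1 in (0, 1):
        for z2 in (0, 1):
            o = obs[(z0, z1, z2)]
            e = n_01[(z0, z1)] * n_02[(z0, z2)] / n_z0[z0]
            if o > 0:
                g2 += 2.0 * o * math.log(o / e)

# For a 2x2x2 table this model has 2 degrees of freedom, and the
# chi-square survival function with df = 2 is exp(-x/2).
p_value = math.exp(-g2 / 2.0)
```

A small p-value would suggest rejecting the null-hypothesis of joint conditional independence given Z_0.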

Conditional Distribution, Logistic Regression
Since the joint distribution implies all marginal and conditional distributions, respectively, the conditional distribution is explicitly given here by

P(Z_0 = k_0 | Z_1 = k_1, …, Z_m = k_m) = p_{k_0 k_1 … k_m} / ∑_{κ_0} p_{κ_0 k_1 … k_m}.   (3.6)

Assuming independence, Eq. (3.6) immediately reveals

P(Z_0 = k_0 | Z_1 = k_1, …, Z_m = k_m) = P(Z_0 = k_0).

Assuming conditional independence of all Z_ℓ, ℓ = 1, …, m, given Z_0, and further that Z_0 is dichotomous, then

P(Z_0 = 1 | Z_1 = k_1, …, Z_m = k_m) = P(Z_0 = 1) ∏_{ℓ=1}^m P(Z_ℓ = k_ℓ | Z_0 = 1) / ( P(Z_0 = 1) ∏_{ℓ=1}^m P(Z_ℓ = k_ℓ | Z_0 = 1) + P(Z_0 = 0) ∏_{ℓ=1}^m P(Z_ℓ = k_ℓ | Z_0 = 0) ).

Thus,

logit P(Z_0 = 1 | Z_1 = k_1, …, Z_m = k_m) = logit P(Z_0 = 1) + ∑_{ℓ=1}^m log ( P(Z_ℓ = k_ℓ | Z_0 = 1) / P(Z_ℓ = k_ℓ | Z_0 = 0) ).

Finally,

P(Z_0 = 1 | Z_1 = k_1, …, Z_m = k_m) = Λ( logit P(Z_0 = 1) + ∑_{ℓ=1}^m log ( P(Z_ℓ = k_ℓ | Z_0 = 1) / P(Z_ℓ = k_ℓ | Z_0 = 0) ) ),

which is obviously logistic regression. Ordinary logistic regression is optimum if the joint probability of the (dichotomous) target variable and the predictor variables is of log-linear form and all predictor variables are jointly conditionally independent given the target variable; in particular, it is optimum if the predictor variables are categorical and jointly conditionally independent given the target variable (Schaeben 2014a). Logistic regression with interaction terms is optimum if the joint probability of the (dichotomous) target variable and the predictor variables is of log-linear form and the interaction terms correspond to lacking conditional independence given the target variable; for categorical predictor variables, interaction terms can compensate exactly for any lack of conditional independence. Thus, logistic regression with interaction terms is optimum in case of lacking conditional independence (Schaeben 2014a).
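The identity between the Bayes form of the posterior probability and its logistic form can be verified numerically; the marginal and conditional probabilities below are made up for illustration (m = 2 dichotomous predictors).

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up probabilities: P(Z0 = 1) and P(Z_l = 1 | Z0), l = 1, 2.
p1 = 0.4
c = {1: [0.8, 0.7],    # P(Z_l = 1 | Z0 = 1)
     0: [0.3, 0.5]}    # P(Z_l = 1 | Z0 = 0)

def lik(z, z0):
    """P(Z1 = z[0], Z2 = z[1] | Z0 = z0) under conditional independence."""
    out = 1.0
    for l, zl in enumerate(z):
        q = c[z0][l]
        out *= q if zl == 1 else 1.0 - q
    return out

for z in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    # Bayes' formula for the posterior probability.
    num = p1 * lik(z, 1)
    bayes = num / (num + (1.0 - p1) * lik(z, 0))
    # Logistic form: logit P(Z0 = 1) plus the sum of log-ratios.
    s = math.log(p1 / (1.0 - p1))
    for l, zl in enumerate(z):
        a = c[1][l] if zl == 1 else 1.0 - c[1][l]
        b = c[0][l] if zl == 1 else 1.0 - c[0][l]
        s += math.log(a / b)
    assert abs(bayes - logistic(s)) < 1e-12
```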

Practical Applications
The practical application of the log-likelihood ratio test of joint conditional independence generally includes the following steps:
• test the null-hypothesis that the full log-linear model is sufficiently large to represent the joint probability of all predictor variables and the target variable;
• if the first null-hypothesis is not reasonably rejected, test the null-hypotheses that smaller log-linear models are sufficiently large; in particular,
• test the null-hypothesis that the log-linear model without any interaction term is sufficiently large;
• if this final null-hypothesis is rejected, then the predictor variables must not be assumed to be jointly conditionally independent given the target variable.
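The steps above can be sketched for m = 2 dichotomous predictors, where the nested models involved have closed-form maximum-likelihood expected counts; the data are made up, and for general log-linear models iterative proportional fitting would replace the closed forms.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical training data as (z0, z1, z2) triples.
data = [(1,1,1)]*8 + [(1,1,0)]*4 + [(1,0,1)]*2 + [(1,0,0)]*1 \
     + [(0,1,1)]*3 + [(0,1,0)]*6 + [(0,0,1)]*4 + [(0,0,0)]*12
N = len(data)
cells = list(product((0, 1), repeat=3))
obs = Counter(data)

def g2(expected):
    """Log-likelihood ratio statistic against the saturated (full) model."""
    return sum(2.0 * obs[c] * math.log(obs[c] / expected[c])
               for c in cells if obs[c] > 0)

# Smallest model: no product terms at all (mutual independence).
marg = [Counter(r[i] for r in data) for i in range(3)]
e_ind = {c: marg[0][c[0]] * marg[1][c[1]] * marg[2][c[2]] / N**2 for c in cells}

# Larger model: two-variable terms involving Z0 only,
# i.e., joint conditional independence of Z1, Z2 given Z0.
n0  = Counter(r[0] for r in data)
n01 = Counter((r[0], r[1]) for r in data)
n02 = Counter((r[0], r[2]) for r in data)
e_ci = {c: n01[(c[0], c[1])] * n02[(c[0], c[2])] / n0[c[0]] for c in cells}

# The full (saturated) model always fits exactly; the smaller the
# nested model, the larger its log-likelihood ratio statistic.
assert g2(e_ci) <= g2(e_ind)
```

Comparing the statistics of the nested models, each against its chi-square reference distribution, indicates which product terms are needed.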

The Data Set BRY
The data set BRY is derived from the example in the Wikipedia article https://en.wikipedia.org/wiki/Conditional_independence. Initially it comprises three random events B, R, Y, denoting the subsets of the set of all 49 pixels which are blue, red, or yellow, with given probabilities P(B), P(R), P(Y) computed as relative frequencies of the corresponding indicator variables 1_B, 1_R, 1_Y, where 1_I denotes the indicator variable of an event I. They are assigned to pixels of a 7 × 7 digital map image, Fig. 3.1.
It should be noted that in this example any spatial references are solely owed to the purpose of visualization as map images, and that the test itself does not take any spatial references or spatially induced dependences into account.
Checking independence according to its definition in reference to random events, the figures P(B ∩ R) = 0.122 and P(B) P(R) = 0.119 indicate that the random events B and R are not independent. However, the deviation is small. Next, conditional independence is checked in terms of its definition referring to random events. Since conditional independence of the random events B and R given Y does not imply conditional independence of the random events B and R given the complement ∁Y, two checks are required. The results indicate that the random events B and R are conditionally independent given the random event Y, but that they are not conditionally independent given the complement ∁Y. It should be noted that the deviation of the joint conditional probability from the product of the two individual conditional probabilities, in terms of their ratio, is only 1.027. In fact, the events B and R are conditionally independent given either Y or ∁Y if one white pixel, e.g. pixel (1,7) with 1_B = 1_R = 1_Y = 0, is omitted.
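The event-level checks carried out here are simple relative-frequency computations over the pixel grid; a sketch with small made-up indicator grids, not the 7 × 7 map of Fig. 3.1:

```python
# Made-up 3x3 indicator grids for three events B, R, Y (illustration only).
B = [[1, 1, 0], [0, 1, 0], [0, 0, 0]]
R = [[1, 0, 0], [0, 1, 1], [0, 0, 0]]
Y = [[0, 1, 0], [0, 1, 0], [0, 0, 1]]

cells = [(i, j) for i in range(3) for j in range(3)]

def prob(*events):
    """Relative frequency of the intersection of the given events."""
    return sum(1 for i, j in cells if all(e[i][j] for e in events)) / len(cells)

# Independence check: compare P(B n R) with P(B) P(R).
indep_gap = prob(B, R) - prob(B) * prob(R)

# Conditional independence checks given Y and given its complement.
notY = [[1 - Y[i][j] for j in range(3)] for i in range(3)]
gaps = []
for C in (Y, notY):
    pc = prob(C)
    gaps.append(prob(B, R, C) / pc - (prob(B, C) / pc) * (prob(R, C) / pc))
```

A gap of zero for one conditioning event but not the other would illustrate that the two checks are indeed both required.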
Generalizing the view to the random variables 1_B, 1_R, 1_Y and their unique joint realization as shown in Fig. 3.1, Pearson's χ²-test with Yates' continuity correction of the null-hypothesis of independence of the random variables 1_B and 1_R given the data returns a p-value of 1, indicating that the null-hypothesis cannot reasonably be rejected.
The likelihood ratio test is applied with respect to the log-linear distribution corresponding to the null-hypothesis of conditional independence, and results in a p-value of 0.996, indicating that the null-hypothesis cannot reasonably be rejected.
Thus, given the data the tests suggest to infer that the random variables 1_B and 1_R are independent and conditionally independent given the random variable 1_Y.

The Data Set SCCI
The next data set SCCI comprises three random events B_1, B_2, T with given probabilities P(B_1) = P(B_2) = P(T) = 7/49 ≈ 0.143. They are assigned to pixels of a 7 × 7 digital map image, Fig. 3.2.
Checking independence according to its definition for random events, the figures P(B_1 ∩ B_2) = 0.102 and P(B_1) P(B_2) = 0.020 indicate that the random events B_1 and B_2 are not independent. Next, conditional independence is checked in terms of its definition referring to random events. Since conditional independence of the random events B_1 and B_2 given T does not imply conditional independence of the random events B_1 and B_2 given ∁T, two checks are required. The results indicate that the random events B_1 and B_2 are neither conditionally independent given the random event T nor given the complement ∁T.
Testing the null-hypothesis of independence of the random variables 1_{B_1} and 1_{B_2} with Pearson's χ²-test with Yates' continuity correction given the data returns a p-value practically equal to 0, indicating that the null-hypothesis should be rejected. The likelihood ratio test is applied with respect to the log-linear distribution corresponding to the null-hypothesis of conditional independence, and results in a p-value of 0.825, indicating that the null-hypothesis cannot reasonably be rejected.
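Pearson's χ²-test with Yates' continuity correction for a 2 × 2 table of indicator counts amounts to a few lines; the table below is made up for illustration and is not the SCCI cross-tabulation.

```python
import math

# Made-up 2x2 table of counts: rows 1_{B1} = 1, 0; columns 1_{B2} = 1, 0.
table = [[5, 2],
         [2, 40]]

n = sum(sum(row) for row in table)
row = [sum(r) for r in table]
col = [sum(table[i][j] for i in range(2)) for j in range(2)]

# Yates-corrected Pearson statistic: sum of (|O - E| - 0.5)^2 / E.
chi2 = 0.0
for i in range(2):
    for j in range(2):
        e = row[i] * col[j] / n
        chi2 += (abs(table[i][j] - e) - 0.5) ** 2 / e

# Survival function of chi-square with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2.0))
```

A small p-value suggests rejecting the null-hypothesis of independence of the two indicator variables.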
Thus, given the data the tests imply that the random variables 1_{B_1} and 1_{B_2} are not independent, but conditionally independent given the random variable 1_T.

Discussion and Conclusions
Since pairwise conditional independence does not imply joint conditional independence, the χ²-test (Bonham-Carter 1994) of independence given Z_0 = 1 does not apply to checking the modeling assumption of weights-of-evidence. The disadvantage of both the "omnibus" test (Bonham-Carter 1994) and the "new omnibus" test (Agterberg and Cheng 2002) is twofold. First, they involve an assumption of normal distribution which itself should be subject to a test. Second, weights-of-evidence has to be applied to calculate the test statistic, which is the sum of all predicted conditional probabilities within the training dataset. If the test actually suggests rejection of the null-hypothesis of conditional independence, the user learns that the application of weights-of-evidence was not mathematically authorized to predict the conditional probabilities. The standard log-likelihood ratio test suggested here resolves both shortcomings.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.