Data-mining methods such as classification tree analysis, conditional independence tests, and causal graphs can be used to discover possible causal relations in data sets, even if the relations are unknown a priori and involve nonlinearities and high-order interactions. Chapter 6 showed that information theory provided one possible common framework and set of principles for applying these methods to support causal inferences. This chapter examines how to apply these methods and related statistical techniques (such as Bayesian model averaging) to empirically test preexisting causal hypotheses, either supporting them by showing that they are consistent with data, or refuting them by showing that they are not. In the latter case, data-mining and modeling methods can also suggest improved causal hypotheses.
Notes
- 1. Except for AGE, which has no 0 values.
Appendix A: Computing Adjusted Ratios of Medians and Their Confidence Limits
Given the vector of posterior mean regression coefficients, \(\hat{\beta}\) (from the BMA analysis), we calculate the vector of mean response values (for induced resistance REP_V_S_P) as

$$\hat{y} = X\hat{\beta} \quad (7.1)$$
The \(X_{ij}\) are the values in the data matrix, after the adjustments described in the text, for each collection of n observations and p variables that were analyzed. As in Kieke et al. (2006), we have taken the natural logarithmic transform \(y = \ln(\mathrm{REP\_V\_S\_P})\). For a given variable j, j = 1, 2, 3, . . . , p, we can partition the response vector \(\hat{y}\) into \(\{\hat{y}_{pj}, \hat{y}_{0j}\}\), where \(\hat{y}_{pj}\) is the vector of responses \(\hat{y}_i\), i = 1, 2, 3, . . . , n, for which \(X_{ij}\) is positive (>0), and \(\hat{y}_{0j}\) is the vector of responses \(\hat{y}_i\) for which \(X_{ij}\) equals 0. This generalizes the criterion in which \(X_{ij}\) is either “Yes” (1) or “No” (0), allowing the methodology to be applied to continuous variables as well. The “adjusted ratio,” as defined in Kieke et al. and generalized here to all variables j = 1, 2, …, p, is then

$$\hat{R}_j = \frac{\mathrm{median}\!\left(e^{\hat{y}_{pj}}\right)}{\mathrm{median}\!\left(e^{\hat{y}_{0j}}\right)} \quad (7.2)$$
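As a concrete illustration, the adjusted ratio for one variable can be sketched as below. The exact form used by Kieke et al. is not reproduced in this appendix, so the sketch assumes the ratio of medians of the back-transformed (exponentiated) fitted responses; the function name and inputs are hypothetical.

```python
import numpy as np

def adjusted_ratio(y_hat, x_j):
    """Adjusted ratio for variable j: the median predicted response on the
    original (unlogged) scale where X_ij > 0, divided by the median where
    X_ij == 0.  y_hat holds fitted values of y = ln(REP_V_S_P).
    (Assumed form; the source does not reproduce Equation (7.2).)"""
    median_pos = np.median(np.exp(y_hat[x_j > 0]))    # observations with X_ij > 0
    median_zero = np.median(np.exp(y_hat[x_j == 0]))  # observations with X_ij == 0
    return median_pos / median_zero
```

Because the exponential is monotone, taking medians of the back-transformed responses is equivalent to exponentiating the difference of medians on the log scale.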
To estimate variability of the mean responses \(\hat{y}_i\), we note that the standard formula for confidence intervals on the mean response is

$$\hat{y}_i \pm t_{\alpha/2,\,n-k-1}\,\hat{\sigma}\sqrt{X_{i\cdot}(X^{T}X)^{-1}X_{i\cdot}^{T}}$$

where

- \(X_{i\cdot}\) is row i of the data matrix, X,
- \(t_{\alpha/2,\,n-k-1}\) is the t-statistic at confidence level (1 − α) with n − k − 1 degrees of freedom, where n is the number of observations in the data set and k is the number of variables with nonzero posterior mean coefficients,
- \(\hat{\sigma} = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^{2}}{n-k-1}}\) is the standard error of the regression estimate, where \(y_i = \ln(\mathrm{REP\_V\_S\_P})\),
- \((X^{T}X)^{-1}\), scaled by \(\hat{\sigma}^{2}\), is the variance-covariance matrix of the estimated coefficients,
- \(\hat{\sigma}\sqrt{X_{i\cdot}(X^{T}X)^{-1}X_{i\cdot}^{T}}\) is the standard deviation of the mean response for observation i.
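These quantities can be computed directly. A minimal sketch, assuming X contains exactly the k variables with nonzero posterior mean coefficients (no intercept column) and using ordinary least squares as a stand-in for the BMA posterior mean coefficients:

```python
import numpy as np

def mean_response_sd(X, y):
    """Fitted values and the per-observation standard deviation of the
    mean response, sigma_hat * sqrt(X_i (X^T X)^{-1} X_i^T).
    Assumes X holds only the k retained variables, so df = n - k - 1."""
    n, k = X.shape
    # Least-squares coefficients stand in for BMA posterior means here.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    # Residual standard error with n - k - 1 degrees of freedom.
    sigma_hat = np.sqrt(np.sum((y - y_hat) ** 2) / (n - k - 1))
    XtX_inv = np.linalg.inv(X.T @ X)
    # Quadratic form X_i (X^T X)^{-1} X_i^T for every row i at once.
    quad = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    return y_hat, sigma_hat * np.sqrt(quad)
```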
To compute confidence limits on the adjusted ratios, we used simulation. Each iteration of the simulation generates a response vector whose element i is drawn from a t-distribution (\(t_{n-k-1}\)) with mean equal to the mean response, \(\hat{y}_i\), and standard deviation equal to the standard deviation of the mean response for \(\hat{y}_i\), as given above. From each simulated response vector, we compute an adjusted ratio for each variable (see Note 1) as in Equation (7.2) above. We ran the simulation for 10,000 iterations to generate a large sample distribution of each variable’s ratio. The lower confidence limit we report is the 0.025 quantile of the sample distribution; the upper confidence limit is the 0.975 quantile. The distributions appear to be approximately lognormal, but we use the sample quantiles rather than quantiles of a fitted lognormal distribution, as this requires fewer assumptions and we have a large sample to work with.
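The simulation loop described above can be sketched as follows. The function is hypothetical, and it again assumes the adjusted ratio is the ratio of medians of the back-transformed responses:

```python
import numpy as np

def ratio_confidence_limits(y_hat, sd, x_j, df, n_iter=10_000, seed=0):
    """Simulate the distribution of the adjusted ratio for variable j.
    Each iteration draws response i from a t-distribution (df = n-k-1)
    centred at y_hat[i] with scale sd[i]; the 0.025 and 0.975 sample
    quantiles of the resulting ratios are the reported confidence limits.
    (Assumes the ratio-of-medians form of the adjusted ratio.)"""
    rng = np.random.default_rng(seed)
    ratios = np.empty(n_iter)
    for it in range(n_iter):
        # Scaled/shifted t draw for every observation at once.
        y_sim = y_hat + sd * rng.standard_t(df, size=y_hat.size)
        ratios[it] = (np.median(np.exp(y_sim[x_j > 0]))
                      / np.median(np.exp(y_sim[x_j == 0])))
    return np.quantile(ratios, [0.025, 0.975])
```

Using the sample quantiles directly, as the text notes, avoids committing to a fitted lognormal form for the simulated distribution.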
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Cox, L.A. (2009). Overcoming Preconceptions and Confirmation Biases Using Data Mining. In: Risk Analysis of Complex and Uncertain Systems. International Series in Operations Research & Management Science, vol 129. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-89014-2_7
DOI: https://doi.org/10.1007/978-0-387-89014-2_7
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-89013-5
Online ISBN: 978-0-387-89014-2