
Overcoming Preconceptions and Confirmation Biases Using Data Mining

Risk Analysis of Complex and Uncertain Systems

Part of the book series: International Series in Operations Research & Management Science ((ISOR,volume 129))


Data-mining methods such as classification tree analysis, conditional independence tests, and causal graphs can be used to discover possible causal relations in data sets, even if the relations are unknown a priori and involve nonlinearities and high-order interactions. Chapter 6 showed that information theory provided one possible common framework and set of principles for applying these methods to support causal inferences. This chapter examines how to apply these methods and related statistical techniques (such as Bayesian model averaging) to empirically test preexisting causal hypotheses, either supporting them by showing that they are consistent with data, or refuting them by showing that they are not. In the latter case, data-mining and modeling methods can also suggest improved causal hypotheses.


Notes

  1. Except for AGE, which has no 0 values.


Correspondence to Louis Anthony Cox Jr.

Appendix A: Computing Adjusted Ratios of Medians and Their Confidence Limits

Given the vector of posterior mean regression coefficients, \(\hat{\beta}\) (from the BMA analysis), we calculate the vector of mean response values (for induced resistance REP_V_S_P) as

$$\hbox{estimated mean response:}\quad \hat{y}=X\hat{\beta},\quad \hbox{where}\quad X=\left(\begin{array}{cccc} 1 &X_{11} &\ldots &X_{1p}\\ 1 &X_{21} &\ldots &X_{2p}\\ \vdots &\vdots & &\vdots\\ 1 &X_{n1} &\ldots &X_{np}\end{array}\right).$$
(7.1)

The \(X_{ij}\) are the values in the data matrix, after the adjustments described in the text, for the n observations and p variables that were analyzed. As in Kieke et al. (2006), we have taken the natural logarithmic transform of the response, y = ln(REP_V_S_P). For a given variable j, j = 1, 2, ..., p, we can partition the response vector \(\hat{y}\) into \(\{\hat{y}_{pj}, \hat{y}_{0j}\}\), where \(\hat{y}_{pj}\) is the vector of responses \(\hat{y}_i\), i = 1, 2, ..., n, for which \(X_{ij}\) is positive (>0) and \(\hat{y}_{0j}\) is the vector of responses for which \(X_{ij}\) equals 0. This generalizes the binary criterion in which \(X_{ij}\) is either "Yes" (1) or "No" (0), allowing the methodology to be applied to continuous variables as well. The "adjusted ratio" as defined in Kieke et al. (2006), generalized to all variables j = 1, 2, ..., p, is then

$$\hbox{adjusted ratio}_j={\rm median}(\exp(\hat{y}_{pj}))/{\rm median}(\exp(\hat{y}_{0j})).$$
(7.2)
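As a concrete sketch of Equations (7.1) and (7.2), the following Python fragment (using numpy; the function name, data, and coefficients are hypothetical illustrations, not the chapter's actual data or code) partitions the fitted responses by whether \(X_{ij}\) is positive and takes the ratio of medians on the back-transformed scale:

```python
import numpy as np

def adjusted_ratio(y_hat, x_j):
    """Adjusted ratio for one variable j (Equation 7.2): the median of
    exp(y_hat) over observations with X_ij > 0, divided by the median
    over observations with X_ij == 0."""
    y_hat = np.asarray(y_hat, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    pos = np.exp(y_hat[x_j > 0])    # back-transformed responses where X_ij is positive
    zero = np.exp(y_hat[x_j == 0])  # back-transformed responses where X_ij equals 0
    return np.median(pos) / np.median(zero)

# Tiny synthetic illustration (hypothetical data matrix with intercept column):
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0]])
beta_hat = np.array([0.5, 1.0, -0.3])  # hypothetical posterior mean coefficients
y_hat = X @ beta_hat                   # Equation (7.1): estimated mean responses
ratio_1 = adjusted_ratio(y_hat, X[:, 1])  # adjusted ratio for variable j = 1
```

Because the response is modeled on the log scale, exponentiating before taking medians returns the ratio to the original REP_V_S_P scale.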

To estimate the variability of the mean responses \(\hat{y}_{i}\), we note that the standard formula for confidence intervals on the mean response is

$${\rm CI}(\hat{y}_{i}) = X_{i\cdot}\hat{\beta} \pm t_{\alpha/2,\,n-k-1}\,\hat{\sigma}\,\sqrt{X_{i\cdot}(X^{T} X)^{-1}X_{i\cdot}^{T}},$$
(7.3)

where

  • \(X_{i\cdot}\) is row i of the data matrix X,

  • \(t_{\alpha/2,\,n-k-1}\) is the t-statistic at confidence level (1 − α) with n − k − 1 degrees of freedom, where n is the number of observations in the data set and k is the number of variables with nonzero posterior mean coefficients,

  • \(\hat{\sigma}\) is the standard error of the regression estimate, \(\hat{\sigma}=\sqrt{\sum_i(y_i-\hat{y}_i)^{2}/(n-k-1)}\), where \(y_i\) = ln(REP_V_S_P),

  • \(\hat{\sigma}^{2}(X^{T}X)^{-1}\) is the variance-covariance matrix of the coefficient estimates, and

  • \(\hat{\sigma}\sqrt{X_{i\cdot}(X^{T} X)^{-1}X_{i\cdot}^{T}}\) is the standard deviation of the mean response for observation i.
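The quantities above can be assembled directly from the data matrix and fitted coefficients. The sketch below (numpy and scipy assumed; the function name and the simplifying choice k = number of non-intercept columns are illustrative assumptions, not the chapter's code) implements Equation (7.3):

```python
import numpy as np
from scipy import stats

def mean_response_ci(X, y, beta_hat, alpha=0.05):
    """Confidence limits for the mean responses (Equation 7.3).
    Assumes every column of X after the intercept has a nonzero
    posterior mean coefficient, so k = X.shape[1] - 1."""
    n, cols = X.shape
    k = cols - 1
    y_hat = X @ beta_hat
    # regression standard error: sqrt(sum of squared residuals / (n - k - 1))
    sigma_hat = np.sqrt(np.sum((y - y_hat) ** 2) / (n - k - 1))
    XtX_inv = np.linalg.inv(X.T @ X)
    # per-observation std. dev. of the mean response:
    # sigma_hat * sqrt(X_i. (X'X)^{-1} X_i.')
    se = sigma_hat * np.sqrt(np.einsum("ij,jk,ik->i", X, XtX_inv, X))
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, n - k - 1)
    return y_hat - t_crit * se, y_hat + t_crit * se, se

# Synthetic illustration (hypothetical data, not the chapter's):
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(10), rng.normal(size=(10, 2))])
beta_hat = np.array([0.5, 1.0, -0.3])
y = X @ beta_hat + rng.normal(scale=0.1, size=10)
lower, upper, se = mean_response_ci(X, y, beta_hat)
```

The `einsum` call evaluates the quadratic form \(X_{i\cdot}(X^{T}X)^{-1}X_{i\cdot}^{T}\) for every row at once, avoiding an explicit loop over observations.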

To compute confidence limits on the adjusted ratios, we used simulation. Each iteration of the simulation generates a response vector, with each vector element i drawn from a t-distribution (\(t_{n-k-1}\)) with mean equal to the mean response, \(\hat{y}_i\), and standard deviation equal to the standard deviation of the mean response for \(\hat{y}_i\), as given above. From each simulated response vector, we compute an adjusted ratio for each variable (see Note 1), as in Equation (7.2) above. We ran the simulation for 10,000 iterations to generate a large sample distribution for each variable's ratio. The lower confidence limit we report corresponds to the 0.025 quantile of this sample distribution, and the upper confidence limit to its 0.975 quantile. The distributions appear to be approximately lognormal, but we use the sample quantiles rather than quantiles of a fitted lognormal distribution, as this requires fewer assumptions and we have a large sample to work with.
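The simulation just described can be sketched as follows (numpy assumed; function name, seed handling, and the illustrative inputs are assumptions, not the chapter's code):

```python
import numpy as np

def ratio_conf_limits(y_hat, se, x_j, df, n_iter=10_000, seed=1):
    """Simulated 95% confidence limits for one variable's adjusted ratio:
    each iteration draws element i from a t-distribution with df = n - k - 1
    degrees of freedom, centered at y_hat[i] and scaled by se[i], then
    applies Equation (7.2) to the simulated response vector."""
    rng = np.random.default_rng(seed)
    y_hat = np.asarray(y_hat, dtype=float)
    se = np.asarray(se, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    pos, zero = x_j > 0, x_j == 0
    ratios = np.empty(n_iter)
    for it in range(n_iter):
        y_sim = y_hat + se * rng.standard_t(df, size=len(y_hat))
        ratios[it] = np.median(np.exp(y_sim[pos])) / np.median(np.exp(y_sim[zero]))
    # report the 0.025 and 0.975 sample quantiles of the simulated ratios
    return np.quantile(ratios, 0.025), np.quantile(ratios, 0.975)

# Illustrative inputs (hypothetical, not the chapter's data):
y_hat = np.array([0.2, 1.2, 0.5, 1.5])
se = np.full(4, 0.1)
lo, hi = ratio_conf_limits(y_hat, se, [0, 1, 0, 1], df=10, n_iter=2000)
```

Using sample quantiles of the simulated ratios, rather than fitting a lognormal and reading off its quantiles, mirrors the assumption-light choice described in the text.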


Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Cox, L.A. (2009). Overcoming Preconceptions and Confirmation Biases Using Data Mining. In: Risk Analysis of Complex and Uncertain Systems. International Series in Operations Research & Management Science, vol 129. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-89014-2_7

