
Overcoming Preconceptions and Confirmation Biases Using Data Mining

Risk Analysis of Complex and Uncertain Systems

Part of the book series: International Series in Operations Research & Management Science ((ISOR,volume 129))


Data-mining methods such as classification tree analysis, conditional independence tests, and causal graphs can be used to discover possible causal relations in data sets, even if the relations are unknown a priori and involve nonlinearities and high-order interactions. Chapter 6 showed that information theory provided one possible common framework and set of principles for applying these methods to support causal inferences. This chapter examines how to apply these methods and related statistical techniques (such as Bayesian model averaging) to empirically test preexisting causal hypotheses, either supporting them by showing that they are consistent with data, or refuting them by showing that they are not. In the latter case, data-mining and modeling methods can also suggest improved causal hypotheses.


Notes

  1. Except for AGE, which has no 0 values.


Correspondence to Louis Anthony Cox Jr.

Appendix A: Computing Adjusted Ratios of Medians and Their Confidence Limits

Given the vector of posterior mean regression coefficients, \(\hat{\beta}\) (from the BMA analysis), we calculate the vector of mean response values (for induced resistance REP_V_S_P) as

$$\hbox{estimated mean response:}\quad \hat{y}=X\hat{\beta},\quad \hbox{where}\quad X=\left(\begin{array}{cccc} 1 &X_{11} &\ldots &X_{1p}\\ 1 &X_{21} &\ldots &X_{2p}\\ \vdots &\vdots & &\vdots\\ 1 &X_{n1} &\ldots &X_{np}\end{array}\right).$$
(7.1)

The \(X_{ij}\) are the values in the data matrix, after the adjustments described in the text, for the n observations and p variables that were analyzed. As in Kieke et al. (2006), we have taken the natural logarithmic transform of the response, y = ln(REP_V_S_P). For a given variable j, j = 1, 2, ..., p, we can partition the response vector \(\hat{y}\) into \(\{\hat{y}_{pj}, \hat{y}_{0j}\}\), where \(\hat{y}_{pj}\) is the vector of responses \(\hat{y}_i\), i = 1, 2, ..., n, for which \(X_{ij}\) is positive (>0) and \(\hat{y}_{0j}\) is the vector of responses for which \(X_{ij}\) equals 0. This generalizes the binary criterion in which \(X_{ij}\) is either "Yes" (1) or "No" (0), allowing the methodology to be applied to continuous variables as well. The "adjusted ratio" as defined in Kieke et al. (2006), generalized to all variables j = 1, 2, ..., p, is then

$$\hbox{adjusted ratio}_j={\rm median}(\exp(\hat{y}_{pj}))/{\rm median}(\exp(\hat{y}_{0j})).$$
(7.2)
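As a concrete sketch of Equations (7.1) and (7.2), the following Python fragment (using numpy; the function name, data, and coefficients are hypothetical illustrations, not the chapter's actual data or code) partitions the fitted responses by whether \(X_{ij}\) is positive and takes the ratio of medians on the back-transformed scale:

```python
import numpy as np

def adjusted_ratio(y_hat, x_j):
    """Adjusted ratio for one variable j (Equation 7.2): the median of
    exp(y_hat) over observations with X_ij > 0, divided by the median
    over observations with X_ij == 0."""
    y_hat = np.asarray(y_hat, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    pos = np.exp(y_hat[x_j > 0])    # back-transformed responses where X_ij is positive
    zero = np.exp(y_hat[x_j == 0])  # back-transformed responses where X_ij equals 0
    return np.median(pos) / np.median(zero)

# Tiny synthetic illustration (hypothetical data matrix with intercept column):
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0]])
beta_hat = np.array([0.5, 1.0, -0.3])  # hypothetical posterior mean coefficients
y_hat = X @ beta_hat                   # Equation (7.1): estimated mean responses
ratio_1 = adjusted_ratio(y_hat, X[:, 1])  # adjusted ratio for variable j = 1
```

Because the response is modeled on the log scale, exponentiating before taking medians returns the ratio to the original REP_V_S_P scale.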

To estimate the variability of the mean responses \(\hat{y}_{i}\), we note that the standard formula for confidence intervals on the mean response is

$${\rm CI}(\hat{y}_{i}) = X_{i\cdot}\hat{\beta} \pm t_{\alpha/2,\,n-k-1}\,\hat{\sigma}\,\sqrt{X_{i\cdot}(X^{T} X)^{-1}X_{i\cdot}^{T}},$$
(7.3)

where

  • \(X_{i\cdot}\) is row i of the data matrix X,

  • \(t_{\alpha/2,\,n-k-1}\) is the t-statistic at confidence level (1 − α) with n − k − 1 degrees of freedom, where n is the number of observations in the data set and k is the number of variables with nonzero posterior mean coefficients,

  • \(\hat{\sigma}\) is the standard error of the regression estimate, \(\hat{\sigma}=\sqrt{\sum_i(y_i-\hat{y}_i)^{2}/(n-k-1)}\), where \(y_i\) = ln(REP_V_S_P),

  • \(\hat{\sigma}^{2}(X^{T}X)^{-1}\) is the variance-covariance matrix of the coefficient estimates, and

  • \(\hat{\sigma}\sqrt{X_{i\cdot}(X^{T} X)^{-1}X_{i\cdot}^{T}}\) is the standard deviation of the mean response for observation i.
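The quantities above can be assembled directly from the data matrix and fitted coefficients. The sketch below (numpy and scipy assumed; the function name and the simplifying choice k = number of non-intercept columns are illustrative assumptions, not the chapter's code) implements Equation (7.3):

```python
import numpy as np
from scipy import stats

def mean_response_ci(X, y, beta_hat, alpha=0.05):
    """Confidence limits for the mean responses (Equation 7.3).
    Assumes every column of X after the intercept has a nonzero
    posterior mean coefficient, so k = X.shape[1] - 1."""
    n, cols = X.shape
    k = cols - 1
    y_hat = X @ beta_hat
    # regression standard error: sqrt(sum of squared residuals / (n - k - 1))
    sigma_hat = np.sqrt(np.sum((y - y_hat) ** 2) / (n - k - 1))
    XtX_inv = np.linalg.inv(X.T @ X)
    # per-observation std. dev. of the mean response:
    # sigma_hat * sqrt(X_i. (X'X)^{-1} X_i.')
    se = sigma_hat * np.sqrt(np.einsum("ij,jk,ik->i", X, XtX_inv, X))
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, n - k - 1)
    return y_hat - t_crit * se, y_hat + t_crit * se, se

# Synthetic illustration (hypothetical data, not the chapter's):
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(10), rng.normal(size=(10, 2))])
beta_hat = np.array([0.5, 1.0, -0.3])
y = X @ beta_hat + rng.normal(scale=0.1, size=10)
lower, upper, se = mean_response_ci(X, y, beta_hat)
```

The `einsum` call evaluates the quadratic form \(X_{i\cdot}(X^{T}X)^{-1}X_{i\cdot}^{T}\) for every row at once, avoiding an explicit loop over observations.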

To compute confidence limits on the adjusted ratios, we used simulation. Each iteration of the simulation generates a response vector, with each vector element i drawn from a t-distribution (\(t_{n-k-1}\)) with mean equal to the mean response, \(\hat{y}_i\), and standard deviation equal to the standard deviation of the mean response for \(\hat{y}_i\), as given above. From each simulated response vector, we compute an adjusted ratio for each variable (see Note 1), as in Equation (7.2) above. We ran the simulation for 10,000 iterations to generate a large sample distribution for each variable's ratio. The lower confidence limit we report corresponds to the 0.025 quantile of this sample distribution, and the upper confidence limit to its 0.975 quantile. The distributions appear to be approximately lognormal, but we use the sample quantiles rather than quantiles of a fitted lognormal distribution, as this requires fewer assumptions and we have a large sample to work with.
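The simulation just described can be sketched as follows (numpy assumed; function name, seed handling, and the illustrative inputs are assumptions, not the chapter's code):

```python
import numpy as np

def ratio_conf_limits(y_hat, se, x_j, df, n_iter=10_000, seed=1):
    """Simulated 95% confidence limits for one variable's adjusted ratio:
    each iteration draws element i from a t-distribution with df = n - k - 1
    degrees of freedom, centered at y_hat[i] and scaled by se[i], then
    applies Equation (7.2) to the simulated response vector."""
    rng = np.random.default_rng(seed)
    y_hat = np.asarray(y_hat, dtype=float)
    se = np.asarray(se, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    pos, zero = x_j > 0, x_j == 0
    ratios = np.empty(n_iter)
    for it in range(n_iter):
        y_sim = y_hat + se * rng.standard_t(df, size=len(y_hat))
        ratios[it] = np.median(np.exp(y_sim[pos])) / np.median(np.exp(y_sim[zero]))
    # report the 0.025 and 0.975 sample quantiles of the simulated ratios
    return np.quantile(ratios, 0.025), np.quantile(ratios, 0.975)

# Illustrative inputs (hypothetical, not the chapter's data):
y_hat = np.array([0.2, 1.2, 0.5, 1.5])
se = np.full(4, 0.1)
lo, hi = ratio_conf_limits(y_hat, se, [0, 1, 0, 1], df=10, n_iter=2000)
```

Using sample quantiles of the simulated ratios, rather than fitting a lognormal and reading off its quantiles, mirrors the assumption-light choice described in the text.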


Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Cox, L.A. (2009). Overcoming Preconceptions and Confirmation Biases Using Data Mining. In: Risk Analysis of Complex and Uncertain Systems. International Series in Operations Research & Management Science, vol 129. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-89014-2_7

